Forest Software - websites for the smaller business and charity.

Crawling and Spidering the Web

If you imagine the internet as a series of pages, each of which is linked to from other pages and in turn links on to further pages, you can see the problem that search engines face in creating their databases. Just imagine the number of routes through a single website, then add in all the links to other sites as well. Search engines build their sets of results by crawling the web. Crawling is the method of following links on the web, both from one site to another and from one page to another on the same website, and then gathering the contents of these pages for storage in the search engine's database.

Crawling the internet can start from a single point (a popular website containing lots of links, such as DMOZ) or from an existing, older index of websites. The crawler (also known as a web robot, bot, or web spider) is a software program that downloads web content (mainly web pages but also, in some cases, images, documents and other files) and then follows the links within these web pages to download the linked contents. The linked contents can be on the same site or on a different website. The crawling continues until it reaches a logical stop, such as a dead end with no links or a pre-set number of levels inside the website's link structure. It follows that if a website is not linked from any other page on the internet, the bot will be unable to locate it and include it in the search engine database. Conversely, a page with just a single link into it has a chance of being found. A new website with no links from other sites therefore has to be submitted to each of the search engines for crawling, although it is better to have links from other sites.
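The crawling process described above can be sketched as a breadth-first walk over links. This is only an illustration, not how any particular search engine works: the LINKS dictionary, the page names and the crawl function are all made up for the example, and a real crawler would download pages over HTTP rather than read them from a dictionary.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["widget"],
    "widget": [],             # dead end: no outgoing links
    "orphan": ["home"],       # links out, but nothing links in to it
}


def crawl(start, links, max_depth):
    """Follow links breadth-first from start, going at most max_depth clicks."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # pre-set level limit reached, follow no further links
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order


found = crawl("home", LINKS, max_depth=3)
print(found)
print("orphan" in found)
```

The page called "orphan" is never reached because no other page links to it, which is the situation in which a new site has to be submitted to the search engines by hand.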

A webbot is efficient enough to crawl many websites at the same time, so it can collect billions of pages (at the time of writing, Google claims that it searches 8,168,684,336 web pages) and revisit them as frequently as it can. Some sites, such as news and media sites, are crawled more frequently (possibly every hour or so) by advanced search engines like Google in order to deliver updated news and content in their search results, whilst other "less important" sites may be spidered on a daily, weekly or even monthly basis.

Although a webbot visits your site to fetch as many pages as possible, a well-written one should not flood a single website with a high volume of requests at the same time. Instead it spreads the crawling over a period of time so that the website does not crash from trying to serve too many pages at once.
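One way a crawler might spread its requests is to enforce a minimum gap between visits to the same host. The sketch below only plans the request times and does not fetch anything; the schedule function, the example URLs and the two second delay are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit


def schedule(urls, delay_seconds):
    """Give each URL a start time so that requests to one host are spaced out."""
    next_free = {}  # host -> earliest time the next request may start
    plan = []
    for url in urls:
        host = urlsplit(url).netloc
        start = next_free.get(host, 0.0)
        plan.append((start, url))
        next_free[host] = start + delay_seconds
    return sorted(plan)


# Hypothetical crawl queue covering two hosts.
queue = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://a.example/3",
]
for start, url in schedule(queue, delay_seconds=2.0):
    print(start, url)
```

Requests to different hosts can start at the same moment, while repeated requests to one host are held back, so no single site receives a burst.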

It is believed that search engines usually crawl only a few (three or four) levels deep from the homepage of a website. Note that this is not the number of directory levels that the page sits in but the 'number of clicks' needed to get to the page from the home page. One way of ensuring that all of the pages on a large site are crawled is to build a site map that contains a link to every page on your website and to link to this site map from your home page. The crawlers will find the site map on their next visit, and every page then falls within the generally accepted 3 click rule: Home Page, click, Site Map, click, any page on the site = 2 clicks.
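The effect of a site map on click depth can be checked with a small calculation. The site below is a made-up chain of pages, and click_depths is a hypothetical helper that counts the fewest clicks needed to reach each page from the home page.

```python
from collections import deque


def click_depths(home, links):
    """Return the fewest clicks needed to reach each page from home."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical site where each page links only to the next one.
site = {"home": ["p1"], "p1": ["p2"], "p2": ["p3"],
        "p3": ["p4"], "p4": ["p5"], "p5": []}
before = click_depths("home", site)

# Same site with a site map linked from the home page.
with_map = dict(site)
with_map["home"] = site["home"] + ["sitemap"]
with_map["sitemap"] = [page for page in site if page != "home"]
after = click_depths("home", with_map)

print(before["p5"], after["p5"])
```

Without the site map the last page is five clicks from the home page, beyond the three or four levels a crawler is believed to follow; with it, no page is more than two clicks away.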

You may see the term 'deep crawl' used in SEO (search engine optimisation) forums and websites. The term denotes a crawler or spider that can index pages many levels deep in a site's link structure; it does not mean that the spider is better at reading pages in deep sub-directories. Google and MSN are examples of deep crawlers.

Controlling Crawlers

Well-behaved crawlers or web robots follow the guidelines that the website owner sets for them using the robots exclusion protocol (robots.txt). The robots.txt file can specify the files or folders that the owner does not want the crawler to index in its database. Details of the format of the robots.txt file are available on many websites, including the Search Engine World site.
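As a rough illustration of how a crawler might apply these rules, the sketch below uses Python's standard urllib.robotparser module. The robots.txt contents, the "ExampleBot" name and the www.example.com URLs are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all robots to stay out of two folders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler asks before fetching each URL.
home_ok = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
private_ok = parser.can_fetch("ExampleBot", "http://www.example.com/private/report.html")
print(home_ok, private_ok)
```

The protocol is advisory: it tells a crawler what the owner wants, but it does not physically stop a badly behaved robot from requesting the excluded pages.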

Spotting Crawlers

If you have access to your website's log files you can easily spot the majority of crawlers as they access your website. For example, the extracts below come from a recent website log (the crawler is identified by the user-agent string, the final quoted field on each line) :-

66.249.64.25 - - [06/Mar/2005:05:20:19 +0000] "GET /hosting.html HTTP/1.0" 200 14898 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
212.27.41.27 - - [06/Mar/2005:05:20:35 +0000] "GET /dwodp/index.php/Regional/Oceania/French_Polynesia/Moorea/ HTTP/1.0" 200 77 "-" "Pompos/1.3 http://dir.com/pompos.html"
66.249.64.13 - - [06/Mar/2005:05:21:04 +0000] "GET /search-engine-article1.html HTTP/1.0" 200 16056 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"

and later :-

66.196.90.83 - - [11/Mar/2005:03:56:30 +0000] "GET /dwodp/index.php/Recreation/Guns/Competition_Shooting/Cowboy/Cowboy_Reenactors/ HTTP/1.0" 200 23700 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
207.46.98.68 - - [11/Mar/2005:03:56:43 +0000] "GET /dwodp/index.php/World/Deutsch/Regional/Afrika/%c3%84gypten/ HTTP/1.0" 200 27308 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
207.46.98.68 - - [11/Mar/2005:03:56:55 +0000] "GET /dwodp/index.php/Kids_and_Teens/Directories/ HTTP/1.0" 200 67104 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
207.46.98.68 - - [11/Mar/2005:03:57:27 +0000] "GET /dwodp/index.php/World/Norsk/Regionalt/Europa/Danmark/ HTTP/1.0" 200 18125 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
65.214.44.157 - - [11/Mar/2005:03:57:34 +0000] "GET /dwodp/index.php/Business/Arts_and_Entertainment/Tools_and_Equipment/Manufacturers/Audio/DJ/ HTTP/1.0" 200 19920 "-" "Mozilla/2.0 (compatible; Ask Jeeves/Teoma; +http://sp.ask.com/docs/about/tech_crawling.html)"
207.68.146.55 - - [11/Mar/2005:03:58:14 +0000] "GET /index.html HTTP/1.0" 200 19423 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
65.214.44.157 - - [11/Mar/2005:03:59:49 +0000] "GET /dwodp/index.php/Business/Construction_and_Maintenance/Environmental/Environmentally_Safe_Materials/ HTTP/1.0" 200 26511 "-" "Mozilla/2.0 (compatible; Ask Jeeves/Teoma; +http://sp.ask.com/docs/about/tech_crawling.html)"
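Picking the crawler out of log lines like these can be automated. The sketch below assumes the log layout shown in the extracts above and looks for the crawler names that appear in them; LOG_PATTERN, CRAWLER_MARKERS and identify_crawler are names made up for this example, and the list of markers is not a complete list of crawlers.

```python
import re

# Matches the log layout used in the extracts above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Crawler names taken from the user-agent strings in the extracts.
CRAWLER_MARKERS = ["Googlebot", "Yahoo! Slurp", "msnbot",
                   "Ask Jeeves/Teoma", "Pompos"]


def identify_crawler(line):
    """Return the crawler name found in the user-agent field, or None."""
    match = LOG_PATTERN.match(line)
    if not match:
        return None
    agent = match.group("agent")
    for marker in CRAWLER_MARKERS:
        if marker in agent:
            return marker
    return None


sample = ('66.249.64.25 - - [06/Mar/2005:05:20:19 +0000] '
          '"GET /hosting.html HTTP/1.0" 200 14898 "-" '
          '"Googlebot/2.1 (+http://www.google.com/bot.html)"')
print(identify_crawler(sample))
```

Bear in mind that the user-agent string is supplied by the visitor and can be forged, so a match is a strong hint rather than proof of which crawler called.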


John Mitchell T/A Forest Software
9 Pembroke Grove, Glinton, Peterborough, Cambridgeshire, PE6 7LG
Tel : +44 (0)870 747 4943      
Skype Handle : john-glinton