What is a search engine spider?
A search engine spider, also called a crawler or bot, is a program designed to browse the Internet in a systematic, automated manner and retrieve information about websites.
Search engine spiders extract information from the pages they visit and store it in a way that allows search engines to process and index the data, and to quickly retrieve the relevant parts of it in response to search queries.
How do spiders work?
Search engine spiders use the hyperlinks contained on web pages to move from one website to the next – or crawl from one web page to another, if you prefer. If a spider is given a list of URLs to visit, it begins visiting the URLs on the list, identifies the hyperlinks on those pages and adds the hyperlinked pages to the list of URLs to visit, thus expanding the so-called “crawl frontier”. The hyperlinked pages can belong to the same website or be links to external pages.
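The frontier-expanding process described above is essentially a breadth-first traversal of the link graph. Here is a minimal sketch in Python, using a tiny in-memory link graph with hypothetical `example.com`/`example.org` URLs in place of real page fetches:

```python
from collections import deque

# A toy link graph standing in for the web; every URL here is hypothetical.
LINK_GRAPH = {
    "http://example.com/": ["http://example.com/a", "http://example.org/"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.org/": [],
}

def crawl(seed_urls):
    """Breadth-first crawl: visit each URL once, adding newly
    discovered links to the crawl frontier."""
    frontier = deque(seed_urls)
    visited = set()
    order = []
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        # A real spider would fetch the page here (politely, with a
        # delay between requests) and extract its hyperlinks.
        for link in LINK_GRAPH.get(url, []):
            if link not in visited:
                frontier.append(link)
    return order
```

Starting from the single seed `http://example.com/`, the spider discovers and visits all four pages, including the external `example.org` link.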
With more and more web pages being added to the World Wide Web, and existing web pages being changed and updated frequently, one of the main challenges for search engine spiders is to efficiently crawl as many new and updated web pages as possible. To meet this challenge, search engine spiders use a set of rules that help them determine which pages to crawl, how often to crawl them, and how to distribute the activities of multiple spiders that are crawling the web at the same time.
Search engines generally have multiple copies of their spiders crawling the web at the same time, and do not reveal much about exactly which rules their spiders follow, to prevent search engine spammers from using this information to manipulate crawls (and ultimately search engine rankings). In general, search engine spiders are guided by some measure of website importance, as expressed in the quality and popularity of a site. They can also “learn” how often pages are updated and when is a good time to spider a page again for new content. This benefits both the search engine and the webmaster, as it uses less bandwidth.
What do spiders read?
- HTML code, including text and meta tags
- Sitemaps make the spiders’ job easier because they contain links to each page – every site should have one!
- Most spiders should be able to handle Flash – but ensure your web page also contains HTML text
What do spiders ignore?
- Image maps may confuse spiders
- Temporary redirects
- Any pages excluded from spider access in your robots.txt file
- The use of frames means search engine spiders may be unable to find indexable HTML content
- Spiders may still have trouble crawling database-driven content where the query string includes three or more variables
- Spiders will not be able to make sense of images without any attached text descriptions (ALT tags).
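On the robots.txt point above: Python’s standard library ships a parser for the robots exclusion protocol, so you can check for yourself which of your pages a well-behaved spider is allowed to fetch. A small sketch, using a hypothetical robots.txt that blocks everything under `/private/`:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt blocking all spiders from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite spider checks can_fetch() before requesting a page.
print(parser.can_fetch("Googlebot", "http://example.com/index.html"))
print(parser.can_fetch("Googlebot", "http://example.com/private/data.html"))
```

The first check returns `True` (the page is crawlable), the second `False` (the page is excluded from spider access).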
Can spiders do any damage?
- Spiders can chew up large amounts of bandwidth, which can be quite costly for webmasters
- A spider can put too much load on a website by requesting a lot of pages at once; however, this should be prevented by the politeness policies that spiders normally follow
- Poorly written spiders may crash web servers
- There have been reports that Googlebot managed to delete the entire content of a site running a poorly written content management system (we’ve reported on this story). While skeptics think this may be an urban myth, the story is certainly a good reminder to check your database structure and access permissions.
How to make your site spider friendly
Try to provide search engine spiders with an easy way to navigate through your site (e.g. through a sitemap or through HTML links) and provide them with plenty of HTML copy to index.
And, of course, make sure search engine spiders find your site in the first place – that means you need incoming links! Directory listings are a good source of incoming links, and you should also request links to your site from relevant, related websites (e.g. supplier, industry association or customer websites).
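If you’d rather generate your sitemap than write it by hand, a few lines of Python are enough to build a minimal XML sitemap in the sitemaps.org format. This is an illustrative sketch; the page URLs are hypothetical and you would substitute your own:

```python
import xml.etree.ElementTree as ET

# Hypothetical list of pages to expose to spiders.
PAGES = [
    "http://example.com/",
    "http://example.com/about",
    "http://example.com/contact",
]

def build_sitemap(urls):
    """Build a minimal XML sitemap (sitemaps.org protocol), with one
    <url><loc>...</loc></url> entry per page."""
    urlset = ET.Element(
        "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )
    for page in urls:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        loc.text = page
    return ET.tostring(urlset, encoding="unicode")

print(build_sitemap(PAGES))
```

Save the output as sitemap.xml in your site’s root and spiders that support sitemaps will have a link to every page you list.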
Important search engine spiders
Spiders have names, just like browsers do. All good web statistics programs should give you a report on spiders that have crawled your site. Alternatively, you can check your log files. Here are the names of the most important search engine spiders (thanks to http://www.jafsoft.com/searchengines/webbots.html – check out the full list on their site):
Google’s Spider: Googlebot
Yahoo!’s Spider: Slurp (the Inktomi spider)
MSN’s Spider: MSNBOT
Ask’s Spider: teoma_agent1
Abacho’s Spider: AbachoBOT
Aesop’s Spider: AESOP_com_SpiderMan
Alexa’s Spider: ia_archiver
AltaVista’s Spider: Scooter or Mercator
AllTheWeb’s Spider: FAST-WebCrawler
Baidu’s Spider: Baiduspider
Entireweb’s Spider: Speedy Spider
Excite’s Spider: ArchitextSpider
Infoseek’s Spider: UltraSeek or InfoSeek Sidewinder
Looksmart’s Spider: MantraAgent
Lycos’ Spider: Lycos_Spider_(T-Rex)
Mirago’s Spider: HenryTheMiragoRobot
ScrubTheWeb’s Spider: Scrubby
Singingfish’s Spider: asterias
WiseNut’s Spider: ZyBorg
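If your statistics program doesn’t report spiders, checking your log files for these names is straightforward. Here is a sketch that scans Apache-style access log lines for some of the user-agent names listed above; the log lines and IP addresses are made up for illustration:

```python
# Hypothetical Apache-style access log lines; the user-agent is the
# last double-quoted field on each line.
LOG_LINES = [
    '66.249.66.1 - - [01/Jan/2007:10:00:00 +0000] "GET / HTTP/1.1" '
    '200 1234 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '10.0.0.5 - - [01/Jan/2007:10:00:05 +0000] "GET /about HTTP/1.1" '
    '200 5678 "-" "Mozilla/5.0 (Windows NT 5.1)"',
    '72.30.0.9 - - [01/Jan/2007:10:00:09 +0000] "GET / HTTP/1.1" '
    '200 1234 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp)"',
]

# A few of the spider names from the list above.
SPIDER_NAMES = ["Googlebot", "Slurp", "MSNBOT", "ia_archiver", "Baiduspider"]

def spiders_in_log(lines):
    """Return the spider names found in each log line's user-agent."""
    hits = []
    for line in lines:
        user_agent = line.rsplit('"', 2)[-2]  # last quoted field
        for name in SPIDER_NAMES:
            if name in user_agent:
                hits.append(name)
    return hits

print(spiders_in_log(LOG_LINES))
```

Run against the sample lines, this reports one Googlebot visit and one Yahoo! Slurp visit; the ordinary browser request is ignored.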
If you have any specific questions about search engine spiders, let us know and we’ll do our best to provide you with answers!