Welcome to ineedhits Blog

Welcome to the ineedhits Search Engine Marketing blog, where we share the latest search engine and online marketing news, releases, industry trends and great DIY tips and advice.



Monday, July 3, 2006

FAQ: Search Engine Spiders

Posted at 8:00 pm

What is a search engine spider?

A search engine spider, also called a crawler or bot, is a program designed to browse the Internet in a systematic, automated manner and retrieve information about websites.

Search engine spiders extract information from the pages they visit and store it in a way that allows search engines to process and index the data and quickly retrieve the relevant parts in response to search queries.

How do spiders work?

Search engine spiders use the hyperlinks contained on a web page to move from one website to the next, or to crawl from one web page to another if you prefer. Given a list of URLs to visit, a spider visits each URL on the list, identifies the hyperlinks on those pages and adds the hyperlinked pages to the list of URLs to visit, thus expanding the so-called “crawl frontier”. The hyperlinked pages could belong to the same website or to external websites.
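To make the idea of a crawl frontier concrete, here is a minimal sketch in Python. It does not fetch anything from the Internet: the LINK_GRAPH dictionary, the example.com-style URLs and the crawl and fetch_links helpers are all made up for illustration, and real spiders are far more elaborate.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
LINK_GRAPH = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/"],
    "http://b.example/": ["http://b.example/news", "http://a.example/about"],
    "http://b.example/news": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return LINK_GRAPH.get(url, [])


def crawl(seed_urls, fetch_links, max_pages=100):
    """Visit pages in the order they were discovered, growing the frontier."""
    frontier = deque(seed_urls)  # URLs waiting to be visited
    seen = set(seed_urls)        # URLs already queued, so none is queued twice
    visited = []                 # order in which pages were actually crawled
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # a newly discovered page expands the frontier
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["http://a.example/"], fetch_links)
print(order)
```

Starting from a single seed URL, the sketch reaches pages on both the same site and the external site, which is exactly how links lead a spider from one website to the next.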

With more and more web pages being added to the World Wide Web, and existing web pages being changed and updated frequently, one of the main challenges for search engine spiders is to efficiently crawl as many new and updated web pages as possible. To meet this challenge, search engine spiders use a set of rules that help them determine which pages to crawl, how often to crawl them, and how to distribute the work among multiple spiders that are crawling the web at the same time.

Search engines generally have multiple copies of their spiders crawling the web at the same time. They do not provide much information about exactly which rules their spiders follow, so that search engine spammers cannot use this information to manipulate crawls (and ultimately search engine rankings). In general, search engine spiders are guided by some measure of website importance, as expressed in the quality and popularity of a site. They can also “learn” how often pages are updated and when it is a good time to spider a page again for new content. This is good for both the search engine and the webmaster, as it uses less bandwidth.
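Since the search engines keep their actual rules to themselves, the following is only a toy illustration of how a spider could “learn” a revisit schedule. The next_interval function, the halve-or-double rule and the 1 to 30 day limits are assumptions made up for this sketch, not the behaviour of any real search engine.

```python
def next_interval(current_days, changed, min_days=1.0, max_days=30.0):
    """Return the next revisit interval in days.

    Assumed rule for illustration only: come back twice as soon if the page
    changed since the last visit, wait twice as long if it did not.
    """
    proposed = current_days / 2 if changed else current_days * 2
    # Keep the interval within sensible bounds.
    return max(min_days, min(max_days, proposed))


interval = 4.0
for changed in [False, False, True]:  # two unchanged visits, then one update
    interval = next_interval(interval, changed)
print(interval)
```

A page that keeps changing is visited more and more often, while a static page is visited less and less, which is where the bandwidth saving comes from.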

What do spiders read?

  • HTML code, including text and meta tags
  • Links
  • Sitemaps make the spiders’ job easier because they contain links to each page – every site should have one!
  • Most spiders should be able to handle Flash – but ensure your web page also contains HTML text
  • JavaScript. Spiders can read elements of JavaScript, but it may still cause some problems.
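As a rough idea of what “reading” a page means, the sketch below uses Python's standard html.parser module to pull out the links, the meta tags and the visible text of a small sample page. The PageReader class and the SAMPLE page are invented for this example, and a real spider does a great deal more than this.

```python
from html.parser import HTMLParser


class PageReader(HTMLParser):
    """Collects links, named meta tags and visible text from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.meta = {}
        self.text = []
        self._skip = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep text only when it is not script or style content.
        if not self._skip and data.strip():
            self.text.append(data.strip())


# Hypothetical page used only for this example.
SAMPLE = """<html><head><title>Widgets</title>
<meta name="description" content="Hand-made widgets">
<script>var x = 1;</script></head>
<body><h1>Our widgets</h1><p>See the <a href="/catalog">catalog</a>.</p></body></html>"""

reader = PageReader()
reader.feed(SAMPLE)
reader.close()
page_text = " ".join(reader.text)
print(reader.links, reader.meta, page_text)
```

Note that this simple reader skips the script content entirely, which hints at why text that only appears through JavaScript can cause problems.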

What do spiders ignore?

  • Image maps may confuse spiders
  • Temporary redirects
  • Any pages excluded from spider access in your robots.txt file
  • The use of frames means search engine spiders may be unable to find indexable HTML content
  • Spiders may still have trouble crawling database-driven content where the URL query string includes 3 or more variables
  • Spiders will not be able to make sense of images without any attached text descriptions (ALT text).
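Two of the points above can be checked with a few lines of Python. The sketch below uses the standard urllib.robotparser module to test URLs against a robots.txt file, and urllib.parse to count the variables in a URL. The robots.txt content, the “ExampleBot” name, the example.com URLs and the may_crawl and count_query_variables helpers are all assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, user_agent="ExampleBot"):
    """True if the robots.txt rules above allow this spider to fetch the URL."""
    return rules.can_fetch(user_agent, url)


def count_query_variables(url):
    """Number of name=value pairs in the query string of the URL."""
    return len(parse_qsl(urlsplit(url).query, keep_blank_values=True))


private_ok = may_crawl("http://www.example.com/private/report.html")
index_ok = may_crawl("http://www.example.com/index.html")
variables = count_query_variables(
    "http://www.example.com/list.aspx?cat=3&page=2&sort=price"
)
print(private_ok, index_ok, variables)
```

A well-behaved spider would skip the first URL, and the last URL carries three variables, so it falls into the range the list above warns about.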

Can spiders do any damage?

  • Spiders can chew up large amounts of bandwidth, which can be quite costly for webmasters
  • A spider can put too much load on a website by requesting a lot of pages at once; however, the politeness policies that spiders normally follow should prevent this
  • Poorly written spiders may crash web servers
  • There have been reports that the Googlebot managed to delete the entire content of a site with a poorly written content management system (we’ve reported on this story, which was also posted here). While skeptics think this may be an urban myth, I think the story is certainly a good reminder to check your database structure and access permissions.
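A politeness policy can be as simple as leaving a minimum gap between two requests to the same host. The sketch below shows one possible way to do that. The PolitenessGate class, its method names and the 2 second gap are made up for this example and are not taken from any particular spider.

```python
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks the last request per host and says how long to wait."""

    def __init__(self, min_delay_seconds=2.0):
        self.min_delay = min_delay_seconds
        self.last_request = {}  # host -> time of the most recent request

    def wait_needed(self, url, now):
        """Seconds to wait before requesting url at time now (in seconds)."""
        last = self.last_request.get(urlsplit(url).netloc)
        if last is None:
            return 0.0  # first request to this host, no need to wait
        return max(0.0, self.min_delay - (now - last))

    def record(self, url, now):
        """Remember that url was requested at time now."""
        self.last_request[urlsplit(url).netloc] = now


gate = PolitenessGate(min_delay_seconds=2.0)
gate.record("http://a.example/page1", now=0.0)
same_host_wait = gate.wait_needed("http://a.example/page2", now=0.5)
other_host_wait = gate.wait_needed("http://b.example/", now=0.5)
print(same_host_wait, other_host_wait)
```

Requests to a different host are not held back, so a spider can stay busy across many sites while being gentle with each one.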

How to make your site spider friendly

Try to provide search engine spiders with an easy way to navigate through your site (e.g. through a sitemap or through HTML links) and provide them with plenty of HTML copy to index.

And, of course, make sure search engine spiders find your site in the first place – that means you need incoming links! Directory listings are a good source of incoming links, and you should also request links to your site from relevant, related websites (e.g. supplier, industry association or customer websites).

Important search engine spiders

Spiders have names, just like browsers do. All good web statistics programs should give you a report on spiders that have crawled your site. Alternatively, you can check your log files. Here are the names of the most important search engine spiders (thanks to http://www.jafsoft.com/searchengines/webbots.html – check out the full list on their site):

Google’s Spider: Googlebot
Yahoo!’s Spider: Slurp (the Inktomi spider)
MSN’s Spider: MSNBOT
Ask’s Spider: teoma_agent1
Abacho’s Spider: AbachoBOT
Aesop’s Spider: AESOP_com_SpiderMan
Alexa’s Spider: ia_archiver
AltaVista’s Spider: Scooter or Mercator
AllTheWeb’s Spider: FAST-WebCrawler
Baidu’s Spider: Baiduspider
Entireweb’s Spider: Speedy Spider
Excite’s Spider: ArchitextSpider
Infoseek’s Spider: UltraSeek or InfoSeek Sidewinder
Looksmart’s Spider: MantraAgent
Lycos’ Spider: Lycos_Spider_(T-Rex)
Mirago’s Spider: HenryTheMiragoRobot
ScrubTheWeb’s Spider: Scrubby/
Singingfish’s Spider: asterias
WiseNut’s Spider: ZyBorg
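If you are checking your log files by hand, a short script can count spider visits for you. The sketch below looks for a few of the names from the list above in user agent strings. The identify_spider helper and the sample user agent strings are for illustration only, and since anyone can send a fake user agent, treat the result as a hint rather than proof.

```python
from collections import Counter

# Lower-case tokens to look for, mapped to the spider names listed above.
SPIDER_TOKENS = {
    "googlebot": "Googlebot",
    "slurp": "Slurp",
    "msnbot": "MSNBOT",
    "ia_archiver": "ia_archiver",
    "baiduspider": "Baiduspider",
}


def identify_spider(user_agent):
    """Return the spider name found in the user agent string, or None."""
    ua = user_agent.lower()
    for token, name in SPIDER_TOKENS.items():
        if token in ua:
            return name
    return None


# Sample user agent strings as they might appear in a log file (illustrative).
SAMPLE_USER_AGENTS = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
    "msnbot/1.0 (+http://search.msn.com/msnbot.htm)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]

visits = Counter(
    name for name in map(identify_spider, SAMPLE_USER_AGENTS) if name
)
print(visits)
```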

If you have any specific questions about search engine spiders, let us know and we’ll do our best to provide you with answers!


Discussion (5 comments)

I am building a new site in .net. Should I rebuild in HTML?

By Anonymous - July 10, 2006



Hi there,

.net is fine (we use it ourselves – see the .aspx endings of our URLs!). To be on the safe side, the important thing is that you create static pages with HTML content (look at our homepage for example – it’s written in .net but the output is a static page, it’s not database-driven). And as long as there aren’t too many database parameters, search engines should also be able to crawl dynamic pages.

By Nancy Hackett - July 11, 2006



Do spiders read html comments?

By Gorka - October 2, 2006



Is this statement true: “If a new web site shows up, the spiders will appear every three days looking for more. If they find nothing new, they come back every 10 days. If the site is static, they stop coming.”

By Anonymous - October 10, 2008



[...] Website Submission Page Entire Web I Need Hits Good FAQ about SEO terms Wordtracker This entry was written by admin, posted on June 9, 2009 at 10:58 am, filed under [...]

By Seo Reaserch Tools - June 9, 2009



