If you’re serious about your SEO, then a robots.txt file should definitely be part of your strategy.
Surprisingly, many website owners never create, let alone maintain, a robots.txt file for their websites, so here’s a guide on what a robots.txt file is and how best to use it for SEO purposes.
What is a robots.txt file?
A robots.txt file provides restrictions to search engine robots (known as “bots”) that crawl the web. Robots are used to find content to index in the search engine’s database.
These bots are automated, and before they access any section of a site, they check whether a robots.txt file exists that tells them which pages they may not crawl.
The robots.txt file is a simple text file (no HTML) that must be placed in your site’s root directory, for example:
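For a site at www.example.com (a placeholder domain), crawlers would look for the file at:

```
http://www.example.com/robots.txt
```

If the file lives anywhere other than the root directory, robots will not find it and will crawl the site as if no restrictions exist.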
Why use a robots.txt file?
There are 3 primary reasons for using a robots.txt file on your website:
- Information you don’t want made public through search
In situations where you have content on your website which you don’t want accessed via searches, the robots.txt will prevent search engines from including it in their index.
- Duplicate Content
Often similar content is presented on a website under different URLs (e.g. the same blog post might appear under several categories). Search engines can penalise duplicate content, which is bad from an SEO point of view. The robots.txt file helps you control which version of the content the search engines include in their index.
- Manage bandwidth usage
Some websites have limited bandwidth allowances (depending on the hosting package). Since robots use up bandwidth when crawling your site, in some instances you might want to stop certain user agents from accessing parts of your site to conserve bandwidth.
How to create a robots.txt file?
The robots.txt file is just a simple text file. To create your own robots.txt file, open a new document in a simple text editor (e.g. notepad).
The content of a robots.txt file consists of “records” which tell specific search engine robots what they may and may not access.
Each of these records consists of two fields – a User-agent line (which specifies the robot to control) and one or more Disallow lines. Here’s an example:
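A minimal record of this form, assuming a hypothetical “admin” directory you want kept out of the index:

```
User-agent: Googlebot
Disallow: /admin/
```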
This example record allows “Googlebot”, Google’s spider, to access every page on the site except files in the “admin” directory. All files in the “admin” directory will be ignored.
If you want only specific pages not indexed, then you need to specify the exact file. For example:
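To block a single page, list its full path in the Disallow line (the file name here is hypothetical):

```
User-agent: Googlebot
Disallow: /admin/login.html
```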
Should you want your entire site and all its content to be indexed, simply leave the Disallow line blank.
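An empty Disallow line permits access to everything:

```
User-agent: *
Disallow:
```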
If you want all the search engine robots to have access to the same content, you can use a generic user agent record – which will control all of them in the same way.
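The asterisk wildcard matches every robot, so a single record like this (again assuming a hypothetical “admin” directory) applies the same rule to all of them:

```
User-agent: *
Disallow: /admin/
```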
How to find which User Agents to control?
The first place to look for a list of the robots currently indexing your website is in your log files.
For SEO purposes, you’ll generally want all search engines indexing the same content, so using “User-agent: *” is the best strategy.
If you want to get specific with your user agents, you can find a comprehensive list at http://www.user-agents.org/
Common Robots.txt Mistakes to Avoid
If you don’t format your robots.txt file properly, some or all files on your website might not get indexed by search engines. To avoid this, do the following:
- The original robots.txt standard has no “Allow” command – everything is allowed by default (although some crawlers, including Googlebot, now support an “Allow” directive).
- Don’t put the Disallow line above the User-agent line. The user agent must always sit above its Disallow commands.
- Don’t add more than one file or directory to a single Disallow line.
- Match the case of file and directory names exactly in your robots.txt file. Names on your server are case sensitive, so getting the case wrong will cause issues.
- Avoid listing every file in a directory. If you want the whole directory ignored, just use a single directory entry.
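Putting these rules together, a correctly formatted file puts the user agent first and lists one path per Disallow line (the directory names here are hypothetical):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /admin/
```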
If you want to check that your robots.txt file is implemented correctly, visit your Google Webmaster Center, which lets you test your robots.txt file; Google retrieves it from your website automatically and in real time.
So if you haven’t created a robots.txt file for your website yet – get going.
And it’s always a good idea to check your robots.txt file regularly: websites are constantly changing, so make sure your robots.txt file stays up to date.