What is it? What does it do?
Robots.txt is a plain-text file that resides in your webserver’s root directory. It contains instructions for crawlers (robots) to follow: you can declare which paths and directories crawlers do not have permission to visit.
How does it benefit me?
By default, a crawler will search an entire website for content unless it has been instructed otherwise. By defining paths and directories that are off-limits to crawling, you can “hide” content from crawlers. This helps a crawler work more efficiently on your website. It also has the added benefit (however small) of letting a search engine know that your website is standards compliant.
How do I make it? How do I use it?
Paste one of the following examples into a “robots.txt” file – then upload it to your webserver’s root.
The following will block all User Agents visiting the entire website:
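```
User-agent: *
Disallow: /
```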
The following will block the Internet Archive from visiting the entire website:
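```
User-agent: ia_archiver
Disallow: /
```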
The following will block MSN from visiting the images and temp directories:
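```
User-agent: msnbot
Disallow: /images/
Disallow: /temp/
```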
The following will allow Google to visit the entire website:
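An empty Disallow line means nothing is blocked:

```
User-agent: Googlebot
Disallow:
```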
Does it always work?
No. It’s important to remember that crawlers can ignore your robots file. This is especially true of rogue scripts that scan for security vulnerabilities and email addresses, not to mention general troublemakers and curious eyes. Also, your robots.txt file is a public file – so don’t use it to hide important data or information.
Some crawlers now support “Allow”. This comes in handy when you want to block everything with a few exceptions.
The following will block all crawlers except Google:
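A crawler obeys the most specific group that matches its user-agent, so Googlebot follows its own group here while everything else falls under the wildcard:

```
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
```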
A more advanced directive of robots.txt is to declare the speed of a robot crawling through your website. This has the benefit of controlling the bandwidth and limiting the number of requests being sent on your webserver. The Crawl-Delay value is in seconds.
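For example, the following asks crawlers to wait ten seconds between requests (note that Crawl-Delay is an unofficial extension, and not every crawler honors it):

```
User-agent: *
Crawl-delay: 10
```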
You can gain even finer control by declaring crawler instructions in Meta Tags inside your pages. These instructions are not supported in the robots.txt file. For the most part, the robots file handles global instructions, while robots meta tags handle page-specific instructions.
The following, when put in the HEAD section of your HTML page, will block DMOZ (Open Directory Project) and Yahoo Directory:
<meta name="robots" content="noodp,noydir">
The following will block crawlers from indexing the content, or the links, on a page:
<meta name="robots" content="noindex, nofollow">
While the robots.txt file is, by origin and legacy, an exclusion protocol – a sitemap can be considered an inclusion protocol. A sitemap file can be deployed in a couple of ways:
- Create a sitemap.xml file and upload it to the appropriate directory
- Declare the (multiple) sitemap paths inside the robots file
Here is an example of how to declare a sitemap path from within your robots.txt file:
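The Sitemap directive takes a full URL (the hostname and path below are placeholders):

```
Sitemap: http://www.example.com/sitemap.xml
```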