robots.txt

December 11, 2011

What is it? What does it do?

Robots.txt is a file that resides in your webserver’s root directory. It contains instructions for crawlers (robots) to follow, letting you tell them which parts of your site they do not have permission to crawl.
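
For example, assuming the placeholder domain example.com, a site served at http://www.example.com would expose the file at:
http://www.example.com/robots.txt
Well-behaved crawlers request this URL before crawling anything else on the site.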

How does it benefit me?

By default, a crawler will crawl an entire website for content unless it has been instructed otherwise. By defining paths and directories that are off-limits to crawling, you can “hide” content from crawlers and help them crawl your website more efficiently. It also has the added (however small) benefit of signalling to a search engine that your website follows established conventions.

How do I make it? How do I use it?

Paste one of the following examples into a “robots.txt” file – then upload it to your webserver’s root.

The following will block all user agents from visiting the entire website:
User-agent: *
Disallow: /

The following will block the Internet Archive from visiting the entire website:
User-agent: ia_archiver
Disallow: /

The following will block MSN from visiting the images and temp directories:
User-agent: MSNBot
Disallow: /images/
Disallow: /temp/

The following will allow Google to visit the entire website:
User-agent: Googlebot
Disallow:
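
A single robots.txt file can combine several of these rules, one block per user agent. A rough sketch (the directory names here are only placeholders):
User-agent: ia_archiver
Disallow: /
User-agent: MSNBot
Disallow: /images/
Disallow: /temp/
User-agent: *
Disallow: /private/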

Does it always work?

No. It’s important to remember that crawlers can ignore your robots file. This is especially true of rogue scripts that scan for security vulnerabilities and harvest email addresses, not to mention general troublemakers and curious eyes. Also, your robots.txt file is public – so don’t use it to hide important data or information.

Allow?

Some crawlers now support “Allow”. This comes in handy when you want to block everything with a few exceptions.

The following will block all crawlers except Google:
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
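
For crawlers that understand Allow, you can also open up a single subdirectory inside an otherwise blocked path. A sketch, assuming placeholder directory names:
User-agent: *
Disallow: /images/
Allow: /images/public/
Keep in mind that Allow is not part of the original exclusion standard, so support (and how conflicts between Allow and Disallow are resolved) varies between crawlers.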

Crawl-Delay

A more advanced robots.txt directive lets you declare how quickly a robot may crawl your website. This helps control bandwidth use and limits the number of requests hitting your webserver. The Crawl-Delay value is in seconds.

User-agent: MSNBot
Crawl-Delay: 20
Allow: /

META Tag

You can gain even more control by declaring crawler instructions in meta tags inside your pages. These instructions are not supported in the robots.txt file. For the most part, the robots file takes care of site-wide instructions, while robots meta tags handle page-specific instructions.

The following, when put in the HEAD section of your HTML page, will block DMOZ (Open Directory Project) and Yahoo Directory:
<meta name="robots" content="noodp,noydir">

The following will block crawlers from indexing the content, or following the links, on a page:
<meta name="robots" content="noindex, nofollow">
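
Meta tags can also target a single crawler by name rather than all robots. For example, the following should tell only Googlebot not to index the page, while leaving other crawlers unaffected:
<meta name="googlebot" content="noindex">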

Sitemap

While the robots.txt file was originally designed as an exclusion protocol, a sitemap can be considered an inclusion protocol. A sitemap can be deployed in a couple of ways:

  1. Create a sitemap.xml file and upload it to the appropriate directory (a minimal sitemap.xml sketch appears after the robots.txt example below)
  2. Declare one or more sitemap paths inside the robots file

Here is an example of how to declare a sitemap path from within your robots.txt file:
Sitemap: http://www.example.com/sitemaps/sitemap_1.xml
Sitemap: http://www.example.com/sitemaps/sitemap_2.xml
User-agent: *
Disallow: /temp/
Disallow: /cgi-bin/
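
For reference, here is a minimal sketch of what the sitemap.xml file itself might contain, following the sitemaps.org protocol (the URLs and date are placeholders):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2011-12-11</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/about/</loc>
  </url>
</urlset>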
