Robots File – Usage and Best Practices

A robots.txt file restricts or allows the search engine robots (known as "bots") that crawl the web. These bots are automated, and before they access the pages of a site they check whether a robots.txt file exists that prevents or allows them from accessing certain pages. Crawlers request the file whether or not it exists, so if you do not put a robots.txt file in your website's root directory (public_html or www), your log files will record a failed request for it every time a crawler visits your pages. It is therefore worth creating a robots.txt file and using it correctly.

How to create a robots.txt file

This example allows all robots to visit all files, because the wildcard "*" matches every robot and the empty Disallow rule blocks nothing:

User-agent: *
Disallow:

This example keeps all robots out. No compliant robot will crawl your site, and your pages will not be indexed by search engines that honor the file:

User-agent: *
Disallow: /
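You can check how a crawler interprets these rules with Python's standard-library robots.txt parser, urllib.robotparser (the bot name "MyBot" below is just a placeholder):

```python
import urllib.robotparser

# Parse the "keep all robots out" rules from a list of lines,
# just as a crawler would after downloading robots.txt.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# With "Disallow: /", no path is fetchable for any user agent.
print(rp.can_fetch("MyBot", "https://example.com/"))           # False
print(rp.can_fetch("MyBot", "https://example.com/page.html"))  # False
```

The same parser can also download a live file with set_url() and read() instead of parse().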

The next example tells all crawlers to stay out of four directories of a website:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/

This example tells all crawlers not to crawl one specific file:

User-agent: *
Disallow: /directory/file.html
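Both Disallow forms, a whole directory and a single file, can be verified with Python's standard-library urllib.robotparser; the paths and bot name here are illustrative:

```python
import urllib.robotparser

# Rules combining the directory and single-file examples above.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /tmp/",
    "Disallow: /directory/file.html",
])

print(rp.can_fetch("MyBot", "/tmp/cache.dat"))         # False: inside a blocked directory
print(rp.can_fetch("MyBot", "/directory/file.html"))   # False: the blocked file itself
print(rp.can_fetch("MyBot", "/directory/other.html"))  # True: only one file is blocked
```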

Crawl-delay directive

Several major crawlers support a Crawl-delay parameter, set to the number of seconds to wait between successive requests to the same server:

User-agent: *
Crawl-delay: 10
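A crawler that honors the directive reads the value and sleeps between requests; Python's urllib.robotparser (3.6+) exposes it directly. A minimal sketch, with a placeholder bot name:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
])

# A polite crawler would wait this many seconds between requests.
print(rp.crawl_delay("MyBot"))  # 10
```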

Allow directive

To allow one file while disallowing the folder that contains it, place the Allow rule before the Disallow rule (the rules must also sit under a User-agent group):

User-agent: *
Allow: /folder1/myfile.html
Disallow: /folder1/
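Rule order matters for parsers that apply the first matching rule, as Python's urllib.robotparser does; this sketch adds a User-agent line, since rules must belong to a group, and uses a placeholder bot name:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /folder1/myfile.html",  # more specific rule listed first
    "Disallow: /folder1/",
])

print(rp.can_fetch("MyBot", "/folder1/myfile.html"))  # True: explicitly allowed
print(rp.can_fetch("MyBot", "/folder1/index.html"))   # False: rest of the folder is blocked
```

Note that some crawlers (Google among them) resolve Allow/Disallow conflicts by longest matching path rather than by order, so listing the more specific rule first keeps the file accessible under both interpretations.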

Extended Standard

An Extended Standard for Robot Exclusion has been proposed, which adds several new directives, such as Visit-time and Request-rate. For example:

User-agent: *
Disallow: /downloads/
Request-rate: 1/5         # maximum rate is one page every 5 seconds
Visit-time: 0600-0845     # only visit between 06:00 and 08:45 UTC (GMT)
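Python's urllib.robotparser (3.6+) understands Request-rate and silently ignores directives it does not know, such as Visit-time. A small check, again with a placeholder bot name:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /downloads/",
    "Request-rate: 1/5",
    "Visit-time: 0600-0845",  # unknown to this parser, silently ignored
])

rate = rp.request_rate("MyBot")
print(rate.requests, rate.seconds)                   # 1 5
print(rp.can_fetch("MyBot", "/downloads/file.zip"))  # False
```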


More Information
Google Guidelines about robots.txt