The Web Robots Pages

Robots Standard

Sometimes site owners find that their site has not been indexed by an indexing robot, or that a resource discovery robot has visited only part of a site rather than all of the pages that robots should visit.

In recognition of this problem, many Web Robots offer facilities for Web site administrators and content providers to tell the robot what it should do. This is achieved through the following mechanism:

The Robots Protocol
A Web site administrator can indicate which parts of the site should be visited by a robot by providing a specially formatted file on their site, at http://.../robots.txt.

The Robots.txt example file
This is a zipped file that webmasters can use on their own sites. To use it, download the archive, unzip it, and upload the robots.txt file to the root directory of your site on your server.

The rest of this page provides full details on these facilities.

Note that these methods rely on cooperation from the Robot, and are by no means guaranteed to work for every Robot. The main search engines, however, do respect the robots protocol, and will follow your directions on how to index your web site.


The Robots Protocol

The Robots Protocol is a method that allows Web site administrators to indicate to visiting robots which parts of their site should be visited and indexed by the robot.

In a nutshell, when a Robot visits a Web site, say http://www.foobar.com/, it first checks for http://www.foobar.com/robots.txt. If it can find this document, it will analyse its contents for records like:

User-agent: *
Disallow: /
to see if it is allowed to retrieve the document. The record shown here tells every robot to stay out of the entire site. If the robot is allowed to crawl and index the site, it will do so, and the pages it has crawled will then be added to a search engine's index. The precise details on how these rules can be specified, and what they mean, can be found in:



The Web Robots Page
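
As a rough illustration of the check described above, the following Python sketch shows how a cooperating robot might work out where robots.txt lives and decide whether a page may be retrieved. It uses the standard library's urllib.robotparser. The helper names robots_url and is_allowed, the robot name ExampleBot, and the /private/ record are assumptions made up for this example and do not come from the protocol itself. The rules are passed in as text so that the sketch runs without network access; a real robot would download the file first.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site hosting page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # robots.txt always sits at the top level of the site, whatever the page path is.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def is_allowed(robots_txt: str, user_agent: str, page_url: str) -> bool:
    """Check page_url against the records of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


# The record from the text: every robot is barred from the whole site.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Hypothetical record for comparison: only /private/ is off limits.
BLOCK_PRIVATE = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    print(robots_url("http://www.foobar.com/docs/index.html"))
    print(is_allowed(BLOCK_ALL, "ExampleBot", "http://www.foobar.com/index.html"))
    print(is_allowed(BLOCK_PRIVATE, "ExampleBot", "http://www.foobar.com/index.html"))
    print(is_allowed(BLOCK_PRIVATE, "ExampleBot", "http://www.foobar.com/private/a.html"))
```

A robot that honours the protocol would call something like is_allowed before each request and skip any URL for which it returns False. Nothing forces a robot to do this, which is why the rules depend on the robot's cooperation.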