Server Automation Tools

"Software Tools to Maximize Up-time and Automate Functions in Microsoft's Internet Information Server"


How to Write a Robots.txt File

This article was previously published under Q217103

SUMMARY

Web Spiders, often called Robots, are programs that WWW search engines use to "crawl" across the Internet and index pages on Web servers. A Web Spider catalogs that information and makes it available for searching, so users can find specific information through "one-stop shopping" at the Spider's WWW site. Most Robots also prioritize the documents they index, so that search results can be "scored" or arranged in order of most likely matches.

A Robots.txt file is a special text file that must be located in your Web server's root directory. This file contains restrictions for Web Spiders, telling them where they have permission to search. Web Robots are not required to respect Robots.txt files, but most well-written Web Spiders follow the rules you define.
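
Because the file lives in the root directory, a Robot can work out where to find it from any page address on the site. The following Python sketch illustrates this; the function name robots_url and the host www.example.com are placeholders chosen for this example, not part of IIS or any particular Spider.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); replace the path with
    # /robots.txt and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Hypothetical page address used only for illustration.
    print(robots_url("http://www.example.com/sales/index.htm?page=2"))
```

A file placed anywhere else, such as in a subdirectory, would not be found by a Robot that derives the address this way.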

MORE INFORMATION

A Robot identifies itself when it browses your site. This identifier is known as the "User-agent" and appears in the IIS logs. Generally, the flow of events when a Web Spider crawls your site is similar to the following:

  1. The Robot asks for your /robots.txt file and looks for a "User-agent:" line that refers to it specifically.
    • If it finds an entry for itself, such as "User-agent: WebRobot", then it follows the rules that pertain to it.
    • If it does not find an entry for itself, it looks for a global set of rules, such as "User-agent: *", and obeys those rules.
    • If the Robot has an entry for itself and a global set of rules is also present, the Robot's personal rules supersede the global rules.
  2. Rules for a user-agent are set up as "Disallow:" statements that tell a robot where it cannot search. A disallow statement applies to any address the Robot requests that begins with the stated path. For example:
    • "Disallow: /test" causes a Web Spider to ignore /test/index.htm, and so on.
    • "Disallow: /" causes a Web Spider to ignore the whole site; sometimes this is desirable.
    • "Disallow: " allows a Web Spider to crawl the whole site.
  3. Lines that begin with the pound symbol (#) denote comments, which can be useful when creating long sets of rules.

Examples

  • This example disallows all Web Spiders for the entire site:
# Make changes for all web spiders
User-agent: *
Disallow: /

  • The following example disallows a Robot named "WebSpider" from the virtual paths "/marketing" and "/sales":
# Tell "WebSpider" where it can't go
User-agent: WebSpider
Disallow: /marketing
Disallow: /sales

# Allow all other robots to browse everywhere
User-agent: *
Disallow:

  • This example allows only a Web Spider named "SpiderOne" into a site, while denying all other Spiders:
# Allow "SpiderOne" in the site
User-agent: SpiderOne
Disallow:

# Deny all other spiders
User-agent: *
Disallow: /

  • This last example disallows FrontPage-related paths in the root of your Web site:
# Ignore FrontPage files
User-agent: *
Disallow: /_borders
Disallow: /_derived
Disallow: /_fpclass
Disallow: /_overlay
Disallow: /_private
Disallow: /_themes
Disallow: /_vti_bin
Disallow: /_vti_cnf
Disallow: /_vti_log
Disallow: /_vti_map
Disallow: /_vti_pvt
Disallow: /_vti_txt
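
Before publishing a Robots.txt file, it can help to check that the rules behave as intended. The sketch below is one possible way to do that, using the RobotFileParser class from Python's standard urllib.robotparser module against the "WebSpider" example above. The helper name is_allowed, the Robot name "OtherBot", and the page paths are hypothetical and chosen only for illustration; real Spiders may interpret rules slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the "WebSpider" example in this article.
ROBOTS_TXT = """\
# Tell "WebSpider" where it can't go
User-agent: WebSpider
Disallow: /marketing
Disallow: /sales

# Allow all other robots to browse everywhere
User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may request path under the given rules."""
    parser = RobotFileParser()
    # parse() takes the file as a sequence of lines; nothing is downloaded.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Hypothetical Robot names and paths, used only to exercise the rules.
    checks = [
        ("WebSpider", "/marketing/plan.htm"),
        ("WebSpider", "/index.htm"),
        ("OtherBot", "/marketing/plan.htm"),
    ]
    for agent, path in checks:
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, path) else "disallowed"
        print(f"{agent} -> {path}: {verdict}")
```

In this sketch, "WebSpider" matches its own entry, so the global "User-agent: *" rules are not applied to it, while "OtherBot" falls through to the global rules and may browse everywhere.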

                                                     

For more information on writing Robots.txt files, see the following Web sites:

http://www.robotstxt.org/wc/norobots.html

http://www.robotstxt.org/wc/faq.html

http://www.robotstxt.org/

The information in this article applies to:

  • Microsoft Internet Information Services version 6.0
  • Microsoft Internet Information Server 1.0
  • Microsoft Internet Information Server 2.0
  • Microsoft Internet Information Server 3.0
  • Microsoft Internet Information Server 4.0
  • Microsoft Internet Information Server 5.0
  • Microsoft Internet Information Server 6.0





John Cesta is a contract programmer. John's current project is designer and lead developer of the automated hosting software at bestcfhosting.com, a ColdFusion MX hosting company. John is currently working on commercializing his programs and offering them to the IIS community at serverautomationtools.com