robots.txt

One thing I've wondered about is the syntax of the robots.txt file, where it's placed, and how it's used. I've known that it is used to block spiders from accessing your site, but that's about it. I've had to look into it recently because we're offering free memberships at work, and we don't want them indexed by search engines. I've also wondered how we can exclude certain areas, such as where we collate our site statistics, from these engines.

As it turns out, it's really dead simple. Simply create a robots.txt file in your htmlroot, and the syntax is as follows:

User-agent: *
Disallow: /path/
Disallow: /path/to/file

The User-agent can specify specific agents or the wildcard; there are so many spiders out there, it's probably safest to simply disallow all of them. The Disallow line should have only one path or name, but you can have multiple Disallow lines, so you can exclude any number of paths or files.