robots.txt
One thing I've wondered about is the syntax of the robots.txt file, where it's
placed, and how it's used. I've known that it is used to block spiders from
accessing your site, but that's about it. I've had to look into it recently
because we're offering free memberships at work, and we don't want them indexed
by search engines. I've also wondered how we can exclude certain areas, such as
where we collate our site statistics, from these engines.
As it turns out, it's really dead simple. Simply create a robots.txt file in
your htmlroot, and the syntax is as follows:
User-agent: *
Disallow: /path/
Disallow: /path/to/file
The User-agent can specify specific agents or the wildcard; there are so many
spiders out there, it's probably safest to simply disallow all of them. The
Disallow line should have only one path or name, but you can have multiple
Disallow lines, so you can exclude any number of paths or files.