robots.txt
One thing I've wondered about is the syntax of the robots.txt file, where it's
placed, and how it's used. I've known that it is used to block spiders from
accessing your site, but that's about it. I've had to look into it recently
because we're offering free memberships at work, and we don't want them indexed
by search engines. I've also wondered how we can exclude certain areas, such as
where we collate our site statistics, from these engines.
As it turns out, it's dead simple. Create a robots.txt file in your htmlroot
(it has to sit at the top level of the site, so that it's served as
/robots.txt), and the syntax is as follows:
User-agent: *
Disallow: /path/
Disallow: /path/to/file
The User-agent line can name a specific agent or use the wildcard; there are so
many spiders out there, it's probably safest to simply disallow all of them.
Each Disallow line should have only one path or name, but you can have multiple
Disallow lines, so you can exclude any number of paths or files.
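
If you want to sanity-check a set of rules before deploying them, here is a
minimal sketch using Python's standard urllib.robotparser module. The rules are
the placeholder ones from above, and the is_allowed helper and the "ExampleBot"
agent name are made up for illustration. Note that the Disallow values are
matched as path prefixes, and that this only shows how a well-behaved spider
would read the file; nothing forces a spider to obey it.

```python
from urllib.robotparser import RobotFileParser

# Placeholder rules copied from the example above; swap in your own paths.
ROBOTS_TXT = """\
User-agent: *
Disallow: /path/
Disallow: /path/to/file
"""


def is_allowed(url_path: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `url_path` under ROBOTS_TXT.

    `agent` is a hypothetical spider name; with a wildcard User-agent
    line, any name gets the same answer.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url_path)


if __name__ == "__main__":
    for path in ("/path/index.html", "/path/to/file", "/other/page.html"):
        verdict = "allowed" if is_allowed(path) else "blocked"
        print(f"{path}: {verdict}")
```

With these rules, the two paths under /path/ should come back as blocked and
/other/page.html as allowed.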