[thelist] Stopping robots.txt being read
Daniel J. Cody
djc at members.evolt.org
Mon Nov 26 14:11:15 CST 2001
Hey Andy -
Without some sort of server-side checking on the USER-AGENT string sent
from the browser, I can't think of any ways off the top of my head to
let 'real' robots read the robots.txt file while keeping people from
reading it.
If you're using Apache, a decent (but not foolproof, since a user-agent
string can be spoofed) way of keeping regular people - and thereby
regular browsers - out of the robots.txt file would be to use something
like this in your httpd.conf file:
BrowserMatchNoCase ^Mozilla exclude
BrowserMatchNoCase MSIE exclude
<Files ~ "robots\.txt">
Order allow,deny
Deny from env=exclude
</Files>
This would stop any browser with Mozilla or MSIE in its user-agent
string from reading any robots.txt file on the server, while letting any
user-agent that *doesn't* contain Mozilla or MSIE read it.
Obviously this would need to be tested and tweaked a bit, but it should
give you a lead on which direction to go.. :)
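You could also flip it around and whitelist known crawlers instead of
blacklisting browsers - deny everyone by default and only let through
user-agents you recognize. A rough sketch (the bot names here are just
examples, not a complete list; same spoofing caveat applies):

```apache
# mod_setenvif: flag requests whose User-Agent matches a known crawler.
# The bot names below are illustrative -- add the ones you care about.
SetEnvIfNoCase User-Agent "Googlebot" known_bot
SetEnvIfNoCase User-Agent "Slurp" known_bot

<Files ~ "robots\.txt">
    Order deny,allow
    Deny from all
    Allow from env=known_bot
</Files>
```

The tradeoff: a blacklist lets unknown crawlers in by default, while a
whitelist silently locks out any robot you forgot to list, so new search
engines won't see your robots.txt until you add them.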
Shout if you have any other questions :)
.djc.
Andy Warwick wrote:
> Just been reading the rather scary article on CNET, about how Google can be used
> to find passwords etc.
>
> http://news.cnet.com/news/0-1005-200-7946411.html?tag=tp_pr
>
> While I'd already thought about - and covered - the issues raised, and wouldn't
> dream of putting sensitive files in a public area, it did bring back to mind an
> important question that has been bugging me for a while.
>
> How does one go about stopping a robots.txt file being read in a browser. Given
> the file has to be accessible to a search engine, how do you protect it so that a
> human can't simply type in the robots.txt URL manually, read the file, and make
> some educated guesses about where stuff is on the server.
>
> For instance, type in www.<mysite>.co.uk/robots.txt and it reveals that a
> directory called /licences is disallowed. Seems like a good place to start
> reverse-engineering a site's structure for backdoors. (there's actually no such
> directory, so don't bother...)
>
> Any good way of stopping humans reading robots.txt, while still allowing robots
> to use it?