[thelist] Stopping robots.txt being read

Daniel J. Cody djc at members.evolt.org
Mon Nov 26 14:11:15 CST 2001


Hey Andy -

Without some sort of server-side check on the User-Agent string sent 
by the browser, I can't think of any way off the top of my head to 
let 'real' robots read the robots.txt file while keeping people from 
reading it.

If you're using Apache, a decent (but not foolproof, since a user-agent 
string can be spoofed) way of keeping regular people - and thereby 
regular browsers - out of the robots.txt file would be to use something 
like this in your httpd.conf file:

BrowserMatchNoCase ^Mozilla exclude
BrowserMatchNoCase MSIE exclude

<Files ~ "robots\.txt">
     Order allow,deny
     Allow from all
     Deny from env=exclude
</Files>

This would deny any browser whose user-agent string starts with 
'Mozilla' or contains 'MSIE' - which covers virtually every graphical 
browser - from reading any robots.txt file on the server, while letting 
any user-agent that *doesn't* match read robots.txt. Obviously this 
would need to be tested and tweaked a bit, but it should give you a 
lead on which direction to go.. :)
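Along the same lines - just a sketch, assuming mod_rewrite is compiled 
in, and again trivially defeated by a faked user-agent - you could send 
a 403 Forbidden for robots.txt whenever the user-agent looks like a 
browser:

RewriteEngine On
# [NC] = case-insensitive match; [F] = respond 403 Forbidden
RewriteCond %{HTTP_USER_AGENT} (Mozilla|MSIE) [NC]
RewriteRule ^/robots\.txt$ - [F]

Same caveats apply: real robots that don't send 'Mozilla' or 'MSIE' 
get through, everything else gets a 403.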

Shout if you have any other questions :)

.djc.

Andy Warwick wrote:

> Just been reading the rather scary article on CNET, about how Google can be used
> to find passwords etc.
> 
> http://news.cnet.com/news/0-1005-200-7946411.html?tag=tp_pr
> 
> While I'd already thought about - and covered - the issues raised, and wouldn't
> dream of putting sensitive files in a public area, it did bring back to mind an
> important question that has been bugging me for a while.
> 
> How does one go about stopping a robots.txt file being read in a browser? Given
> the file has to be accessible to a search engine, how do you protect it so that a
> human can't simply type in the robots.txt URL manually, read the file, and make
> some educated guesses about where stuff is on the server?
> 
> For instance, type in www.<mysite>.co.uk/robots.txt and it reveals that a
> directory called /licences is disallowed. Seems like a good place to start
> reverse-engineering a site's structure for backdoors. (there's actually no such
> directory, so don't bother...)
> 
> Any good way of stopping humans reading robots.txt, while still allowing robots
> to use it?





