[thelist] Another way of trapping bots (was: Congrats to evolt.org....)

Paola Kathuria paola at limitless.co.uk
Fri Sep 21 18:36:45 CDT 2001

I like the idea of setting a trap for robots in robots.txt.

Here's another way of trapping bots if you want to deny them
specific content on a generated site.

I have found that the agent string isn't a reliable way to detect many or
even most robots, because 1) some have been rotating agent strings
within a session for years (I started saving examples of these
recently; one such visit log is at:
http://www.limitless.co.uk/~paola/tmp/log-rotate-agent.txt).  And
2) many spiders and bots can be set to supply any agent string
(this is also true of Wget) - many pretend to be IE 5+ to get past
browser detection on IE 5-specific sites.
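One way to spot the agent-rotating visits described above is to group log
entries by client IP and flag any that supplied more than one agent string.
A minimal sketch, assuming a combined-style log where the client IP is the
first field and the User-Agent is the last quoted field (the field layout
and function names here are hypothetical):

```python
import re
from collections import defaultdict

# Matches the last quoted field on a log line, i.e. the User-Agent
# in common/combined log format.
AGENT_RE = re.compile(r'"([^"]*)"\s*$')

def rotating_agents(log_lines):
    """Return {ip: agents} for IPs that sent more than one agent string."""
    agents = defaultdict(set)
    for line in log_lines:
        ip = line.split(" ", 1)[0]
        m = AGENT_RE.search(line)
        if m:
            agents[ip].add(m.group(1))
    return {ip: seen for ip, seen in agents.items() if len(seen) > 1}
```

This only catches rotation within what the log shows for one IP; a robot
that keeps a single fake agent string sails straight past it, which is why
the trap link below is needed at all.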

In my resource, the Colour Selector (http://www.limitless.co.uk/colour/),
there are 4,500 internal links in just 16 pages because of the clickable
colour palettes, which can be up to 60k each.  The web server already sets
the session id to "null" if a visitor is recognised as a spider/robot (by
means of a file look-up) - this was done because some search engine
spiders go berserk when the URL changes for an indexed page and they
try to re-index everything.
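The file look-up just described might look like the following sketch: a
text file of known robot agent substrings, and a check that hands spiders
the literal id "null" so indexed URLs stay stable.  The file name and
function names are my own assumptions, not the actual implementation:

```python
def load_spider_list(path="spiders.txt"):
    """Load known robot agent substrings, one per line, lowercased."""
    with open(path) as f:
        return [line.strip().lower() for line in f if line.strip()]

def session_id_for(user_agent, spiders, new_id):
    """Recognised spiders get the literal id "null"; humans get a fresh id."""
    ua = (user_agent or "").lower()
    if any(s in ua for s in spiders):
        return "null"
    return new_id
```

A substring match (rather than exact match) is the usual choice here,
since robot agent strings carry version numbers that change constantly.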

However, new robots appear daily, so I created a link that only
robots would follow - the link is a 1-pixel transparent gif (with
no alt text) pointing to an empty page that just sets a session
variable.  If the session id is "null" or if the "caught" variable has
been set, the clickable palettes with their 4,500 links won't appear
in the Colour Selector.  The catch link is at the very top and very
bottom of each page.
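The trap and the check it feeds can be sketched as a pair of handlers.
The session store, variable names, and markup below are hypothetical
stand-ins for whatever the generated site actually uses; the logic is
just what the paragraph above describes:

```python
# The invisible link humans never see: a 1-pixel transparent gif with
# no alt text, pointing at the trap URL.
TRAP_HTML = ('<a href="/colour/trap">'
             '<img src="/img/1px.gif" width="1" height="1" border="0"></a>')

def handle_trap(session):
    """The empty trap page: its only job is to mark the session caught."""
    session["caught"] = True
    return ""  # empty response body

def show_palettes(session):
    """Suppress the 4,500-link palettes for null or caught sessions."""
    return session.get("id") != "null" and not session.get("caught")
```

Putting the catch link at both the very top and very bottom of the page
covers robots that crawl links in document order as well as those that
crawl them in reverse.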
