[thelist] Trigger for Spambots IP Address Filtering

J.J.SOLARI jjsolari at pobox.com
Fri Apr 1 08:40:58 CST 2005


Hello all,

My hosting fees depend mainly on traffic, and robots currently
consume about 10% of it, which is plenty. To reduce useless traffic
as much as possible, I intend to filter out all robots that do not
respect the Robots Exclusion Protocol, i.e. the rules in robots.txt
that disallow access to directories where spidering is not desired,
such as images and compressed archives, at least in my case.

The strategy is as follows (Apache server here).

In /index.html, there is a small transparent image linking to the
directory /abadbot, to which access is specifically denied for all
robots in robots.txt (via User-agent: * and Disallow: /abadbot): the
whole thing is meant as a trigger for a bad-robots trap.
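Concretely, the setup described above might look like this (the
/abadbot directory name is from the post; the image filename is
hypothetical):

```
# robots.txt -- well-behaved robots will never request /abadbot
User-agent: *
Disallow: /abadbot
```

```html
<!-- in /index.html: an invisible link that only a crawler
     ignoring robots.txt should ever follow -->
<a href="/abadbot/"><img src="/images/blank.gif" alt=""
   width="1" height="1" border="0"></a>
```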

Any such intrusion will be processed by a PHP script, which will
store the client's IP address (REMOTE_ADDR) in a database. Then, in
case some visitor got there by mistake (?), it will build a page
with some text explaining to the supposedly human visitor what is
going on: the IP address has been captured and any subsequent access
to the site is now indefinitely denied, unless the visitor checks
the checkbox below and clicks OK to release the IP address in
question.
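The trap handler's logic, sketched here in Python for brevity (the
post plans a PHP script; all names are hypothetical, and a real
version would use a database rather than an in-memory set):

```python
# Minimal sketch of the /abadbot trap handler.
# BANNED stands in for the database table of captured IP addresses.
BANNED = set()

def trap(remote_addr):
    """Record the client's IP and return the explanatory page text."""
    BANNED.add(remote_addr)
    return ("Your IP address %s has been recorded and further access "
            "to this site is denied. If you are a human visitor who "
            "followed the link by mistake, check the box and click OK "
            "to release your IP address." % remote_addr)

def release(remote_addr):
    """Called when the visitor checks the checkbox and clicks OK."""
    BANNED.discard(remote_addr)
```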

Since the IP address is registered, it is possible to reject any
subsequent connection from it. The next step is to check the IP
address of every incoming request against the bad IP addresses in
the database: if there is no match, serve the requested URI
transparently; otherwise, issue an HTTP status 410 Gone.
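The per-request check reduces to a lookup, sketched here under the
same assumptions (hypothetical names; in practice the database, or a
cached copy of it, would be consulted):

```python
# Sketch of the per-request IP check.
BANNED = {"192.0.2.7"}  # example banned address (TEST-NET range)

def check_request(remote_addr):
    """Return the HTTP status code to use for this client."""
    if remote_addr in BANNED:
        return 410  # Gone: the trapped robot gets a terse refusal
    return 200      # no match: serve the requested URI transparently
```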

However, this is complex and it raises several issues, including
overall file delivery performance and search engine ranking.

So I have not yet determined the best compromise, and any advice
and/or caveats are very welcome.

Thanks,

JJS.

