[thelist] Identify a Web Crawler's request

Tue Jul 6 08:05:30 CDT 2004

On 6 Jul 2004, at 11:56, David Travis wrote:
> I am working on a site, which requires IE6. In order to prevent users 
> who
> work with other browsers from accessing the site I wrote some kind of 
> filter
> to check the user agent string,

This is unreliable. User agent headers vary a lot and are often forged, 
because some sites insist on IE and the user doesn't want to use it. 
(I can't say I blame them; I wouldn't want to use IE either.)

>  and redirect the user to an upgrade-your-browser page.

Oh wonderful! "Use up many megabytes of your bandwidth upgrading your 
browser to one with fewer features, a rubbish security record, and 
which isn't available for your platform". Some upgrade.

> This redirection also causes requests from
> web-crawlers (search engines) to be redirected to this page.

If your site needs IE, then this isn't much of an issue, is it? Since 
the crawler isn't IE, it won't be able to read the site anyway.

> Now to the question: How do I identify a request from a web-crawler?

Typically through the user agent string, although looking for requests 
to robots.txt may help too. You won't catch all robots, though. (Then 
again, I doubt you want a spammer's email address harvester to access 
the site, and it will probably pose as IE anyway.)
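
As a rough sketch of those two checks in Python (the hint list, the 
function name and the RobotsTracker class are all made up for 
illustration, and the substring list is nowhere near complete):

```python
# Hypothetical, incomplete list of substrings that often appear in
# crawler user agent strings. A real list would need maintaining.
CRAWLER_HINTS = ("bot", "crawler", "spider", "slurp")


def is_probable_crawler(user_agent, requested_robots_txt=False):
    """Guess whether a request comes from a crawler.

    This is only a heuristic: user agent strings can be forged, and
    a browser user may also fetch /robots.txt by hand.
    """
    ua = (user_agent or "").lower()
    if any(hint in ua for hint in CRAWLER_HINTS):
        return True
    return requested_robots_txt


class RobotsTracker:
    """Remembers which client addresses have requested /robots.txt."""

    def __init__(self):
        self._clients = set()

    def record(self, client_ip, path):
        # Ignore any query string when comparing the path.
        if path.split("?", 1)[0] == "/robots.txt":
            self._clients.add(client_ip)

    def has_requested(self, client_ip):
        return client_ip in self._clients
```

You would call tracker.record() for every incoming request and then 
pass tracker.has_requested(ip) as the second argument. A robot that 
forges an IE user agent and never asks for robots.txt still slips 
through, which is the limitation described above.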

--
David Dorward
<http://dorward.me.uk/>
<http://blog.dorward.me.uk/>