[thelist] Identify a Web Crawler's request
David Dorward
evolt at david.us-lot.org
Tue Jul 6 08:05:30 CDT 2004
On 6 Jul 2004, at 11:56, David Travis wrote:
> I am working on a site, which requires IE6. In order to prevent users who
> work with other browsers from accessing the site I wrote some kind of
> filter to check the user agent string,
This is unreliable. User agent headers vary a lot and are often forged,
because some sites insist on IE and the user doesn't want to use it.
(Can't say I blame them; I wouldn't want to use IE either.)
> and redirect the user to an upgrade-your-browser page.
Oh wonderful! "Use up many megabytes of your bandwidth upgrading your
browser to one with fewer features, a rubbish security record, and
which isn't available for your platform". Some upgrade.
> This redirection also causes requests from
> web-crawlers (search engines) to be redirected to this page.
If your site needs IE, then this isn't much of an issue, is it? As the
crawler isn't IE, it won't be able to read the site.
> Now to the question: How do I identify a request from a web-crawler?
Typically through the user agent string, although watching for requests
to robots.txt may help too. You won't catch every robot, though. (Then
again, I doubt you want a spammer's email address harvester to access
the site, and it will probably pose as IE anyway.)
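
A rough sketch of both checks in Python, to show the idea rather than a
finished filter. Everything named here is made up for illustration: the
token list in CRAWLER_TOKENS is a small, incomplete guess at what crawler
user agents contain, and looks_like_crawler() and robots_txt_clients are
hypothetical names. It assumes your server code can see the request
path, the User-Agent header and the client address.

```python
import re

# Hypothetical, incomplete list of substrings found in crawler user agents.
# Real crawlers use many more names, and forged agents will slip through.
CRAWLER_TOKENS = re.compile(
    r"googlebot|slurp|msnbot|crawler|spider|bot\b", re.IGNORECASE
)

# Client addresses that have requested /robots.txt since the process started.
robots_txt_clients = set()


def looks_like_crawler(user_agent, path, client_addr):
    """Guess whether a request comes from a crawler. Heuristic only."""
    # Browsers have little reason to fetch robots.txt, so remember who does.
    if path == "/robots.txt":
        robots_txt_clients.add(client_addr)
        return True
    # A client that fetched robots.txt earlier is probably still a robot.
    if client_addr in robots_txt_clients:
        return True
    # Otherwise fall back to the (easily forged) user agent string.
    return bool(CRAWLER_TOKENS.search(user_agent or ""))
```

You would call this before the browser check and skip the redirect when
it returns True. A robot that claims to be IE and never asks for
robots.txt will still look like a browser to it.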
--
David Dorward
<http://dorward.me.uk/>
<http://blog.dorward.me.uk/>