[thelist] Identify a Web Crawler's request
Simon Perry
thelist at si-designs.co.uk
Tue Jul 6 08:56:23 CDT 2004
David Travis wrote:
>Hi All,
>
>Interesting question.
>
>I am working on a site, which requires IE6. In order to prevent users who
>work with other browsers from accessing the site I wrote some kind of filter
>to check the user agent string, and redirect the user to an
>upgrade-your-browser page. This redirection also causes requests from
>web-crawlers (search engines) to be redirected to this page.
>
>
Why would any informed user[0] want to use IE6? Are you using ActiveX
or other M$ proprietary components?
>The site contains a lot of content, which I want to be added to the search
>engines' indexes.
>
>
If this content was written semantically and to published web
standards in the first place, every user agent available, including
search engine robots, would have full access to your content. If you are
using proprietary components then Google isn't likely to understand it
even if you do allow it access.
>Now to the question: How do I identify a request from a web-crawler? Is
>there a standard header in the HTTP Request to check? I am particularly
>interested in Google's headers since it is most popular.
>
>
That way lies madness: browser sniffing is unreliable at the best of
times. Many browsers allow the user to spoof the identity string. The
bots I have written have used various browser strings, but mostly IE6's.
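That said, if you must special-case crawlers, the usual approach is to look for known bot tokens in the User-Agent header and exempt those requests from the redirect. A minimal sketch in Python (the function and token list here are illustrative, not from the original post, and as noted above the header is trivially spoofed, so treat a match as a hint rather than proof):

```python
# Hedged sketch: flag a likely crawler by User-Agent substring.
# The token list is a small, illustrative sample of common bots;
# a spoofed header will defeat this check.
CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot", "crawler", "spider")

def is_probable_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent contains a known bot token."""
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)
```

A server-side filter would then skip the upgrade-your-browser redirect whenever is_probable_crawler() returns True for the incoming request's User-Agent.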
Simon
[0] http://www.theregister.co.uk/2004/06/28/cert_ditch_explorer/