[thelist] Identify a Web Crawler's request

Simon Perry thelist at si-designs.co.uk
Tue Jul 6 08:56:23 CDT 2004


David Travis wrote:

>Hi All,
>
>Interesting question.
>
>I am working on a site, which requires IE6. In order to prevent users who
>work with other browsers from accessing the site I wrote some kind of filter
>to check the user agent string, and redirect the user to an
>upgrade-your-browser page. This redirection also causes requests from
>web-crawlers (search engines) to be redirected to this page.
>  
>
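For context, the sort of filter described above presumably boils down to
something like the sketch below. The User-Agent request header is real;
the redirect target, port and the rest are my own assumptions for
illustration, not David's actual code.

# Minimal sketch of a user-agent filter that bounces anything other
# than IE6 to an "upgrade your browser" page. Crawlers identify
# themselves with their own User-Agent strings, so they get bounced too.
from http.server import BaseHTTPRequestHandler, HTTPServer

UPGRADE_URL = "/upgrade-your-browser.html"  # hypothetical page

class IE6OnlyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if "MSIE 6" not in ua:
            # Not IE6 (or a crawler): redirect away from the content.
            self.send_response(302)
            self.send_header("Location", UPGRADE_URL)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>IE6-only content</body></html>")

if __name__ == "__main__":
    HTTPServer(("", 8080), IE6OnlyHandler).serve_forever()
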
Why would any informed user[0] want to use IE6? Are you using ActiveX 
or other M$ proprietary components?

>The site contains a lot of content, which I want to be added to the search
>engines' indexes.
>  
>
If this content were written semantically and to published web 
standards in the first place, every user agent available, including 
search engine robots, would have full access to it. If you are 
using proprietary components then Google isn't likely to understand 
the content even if you do allow it access.

>Now to the question: How do I identify a request from a web-crawler? Is
>there a standard header in the HTTP Request to check? I am particularly
>interested in Google's headers since it is most popular.
>  
>
That way lies madness: browser sniffing is unreliable at the best of 
times. Many browsers allow the user to spoof the identity string, and 
the bots I have written have used various browser strings, mostly IE6's.
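If you go down that route anyway, the only hook you have is the same
User-Agent header: Googlebot announces itself with a string containing
"Googlebot". A minimal sketch follows; the header and the Googlebot
token are real, but the wider list of bot tokens is an assumed,
incomplete example, and as I said, any client can send these strings.

# Naive crawler detection by User-Agent substring. Trivially spoofed.
BOT_TOKENS = ("Googlebot", "Slurp", "msnbot")  # partial, assumed list

def looks_like_crawler(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(token.lower() in ua for token in BOT_TOKENS)

# Example:
# looks_like_crawler("Googlebot/2.1 (+http://www.googlebot.com/bot.html)")
# returns True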

Simon

[0] http://www.theregister.co.uk/2004/06/28/cert_ditch_explorer/
