[thelist] Identify a Web Crawler's request

Liam Delahunty liam at megaproducts.co.uk
Tue Jul 6 08:26:34 CDT 2004


on 06/07/2004 11:56 David Travis wrote:

> Now to the question: How do I identify a request from a web-crawler? Is
> there a standard header in the HTTP Request to check? I am particularly
> interested in Google's headers since it is most popular.

You could look at the HTTP User-Agent request header.
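
As a rough, language-neutral sketch of that check (written in Python here):
the function name and the token list are assumptions for illustration only,
and real crawler strings should be taken from the agent lists referenced in
this message. Keep in mind that the header is supplied by the client, so it
can be forged.

```python
# Hypothetical token list for illustration; it is not exhaustive and the
# actual strings should come from a maintained list of crawler user agents.
CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot")


def looks_like_crawler(user_agent):
    """Return True if the User-Agent value contains a known crawler token."""
    if not user_agent:
        # A missing or empty header tells us nothing; treat as "not a crawler".
        return False
    ua = user_agent.lower()  # compare case-insensitively
    return any(token in ua for token in CRAWLER_TOKENS)
```

For example, looks_like_crawler("Googlebot/2.1 (+http://www.google.com/bot.html)")
returns True, while a typical browser string returns False.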

If you are using PHP, see http://uk2.php.net/reserved.variables for
$_SERVER['HTTP_USER_AGENT'], and
http://uk2.php.net/manual/en/function.get-browser.php for get_browser().

For lists of user agents, robots, and crawler IP addresses:
The Web Robots Database
http://www.robotstxt.org/wc/active.html

Googlebot
http://www.robotstxt.org/wc/active/html/googlebot.html
http://www.google.com/bot.html

Search Engine Robots
http://www.jafsoft.com/searchengines/webbots.html

Search Engine IP Addresses
http://www.iplists.com

List of User-Agents (Spiders, Robots, Crawler, Browser)
http://www.psychedelix.com/agents.html
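
Since the User-Agent can be faked, an IP address list gives a second check.
A minimal sketch in Python, assuming you have copied the networks you trust
into your own list: the function name is hypothetical and the networks below
are placeholder documentation ranges, not real crawler addresses.

```python
import ipaddress

# Placeholder networks (reserved documentation ranges), NOT real crawler
# addresses; replace them with entries from a list you trust.
KNOWN_CRAWLER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def ip_in_crawler_list(remote_addr):
    """Return True if the client address falls inside a listed network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a parseable IP address, so it cannot match any network.
        return False
    return any(addr in network for network in KNOWN_CRAWLER_NETWORKS)
```

Combining both checks (User-Agent and address) is stricter than either one
alone, at the cost of having to keep the address list up to date.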

-- 
Kind regards, Liam Delahunty, Mega Products Ltd
http://www.megaproducts.co.uk/ Internet Design & Development

