[thelist] Identify a Web Crawler's request

Robert Gormley robert at pennyonthesidewalk.com
Wed Jul 7 01:29:44 CDT 2004


> Why?
> 
> Mozilla Firefox 0.9 is not good enough? Konqueror 3.2.3 is not good
> enough? Opera 7.0 is not good enough? Lynx 2.8.5 is not good enough?

Take it from someone who develops using Firefox and then augments
presentation-layer CSS to suit various platforms:

No. Firefox isn't good enough if you are doing specialised development for
things like certain client-side scripting, ActiveX, or any other technology
which has a specialised purpose. You can bleat on and on about accessibility
of content, content being king, and so forth, but the reality is that there
/are/ rich media sites which use different technology, there /are/ extranets
which would like to provide a richer client-side experience, and no, you
don't /have/ to visit them, but there /are/ situations and circumstances
where there is a tangible benefit to being able to utilise this.

> > This redirection also causes requests from web-crawlers (search
> > engines) to be redirected to this page.
> 
> As well it should, because you've resorted to blatantly clueless
> behavior.
> 
> > The site contains a lot of content, which I want to be added to the
> > search engines' indexes.
> 
> They are World Wide Web search engines, not Microsoft IE-only narrow web
> search engines. So no, your content does not belong in them until you
> have a World Wide Web site.

Wow, you've graduated to being an arbiter of search engine indexing policy,
too?

By your rationale, what the hell is Google doing, allowing indexing of Word
documents, PDF documents, and VRML? After all, none of these are visible to
an HTTP user agent. Or have you stamped the "clueless" boot on them, too,
since you apparently know better?

> > Now to the question: How do I identify a request from a web-crawler?
> > Is there a standard header in the HTTP Request to check? I am
> > particularly interested in Google's headers since it is most popular.
> 
> Make a site for the World Wide Web, not just one browser that only works
> on PCs running Windows. Browser detect garbage is the hallmark of those
> patently devoid of clue.

Browser detection is a necessary evil in a world where all browsers are
/NOT/ created equal, and functionality in one is not available in others.
Yes, the "world wide web" was designed to be content agnostic, but content
exists in more forms than just text and images. You throw the word
"clueless" about, fully aware that you are being entirely selective in your
argument, ignoring certain qualifiers that the person used in their
original and subsequent follow-up posts.
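
For what it's worth, a practical answer to the original question: there is
no dedicated "I am a crawler" request header in HTTP. Well-behaved crawlers
identify themselves in the standard User-Agent header (Google's crawler
sends a User-Agent containing "Googlebot"). A minimal sketch, in Python
only because the original poster never named a platform; the token list and
the CGI-style environment lookup are my assumptions, not anything from the
thread:

    import os

    # Illustrative, not exhaustive; compare case-insensitively.
    KNOWN_CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot")

    def is_crawler(user_agent):
        # True if the User-Agent string names a known crawler.
        ua = (user_agent or "").lower()
        return any(token in ua for token in KNOWN_CRAWLER_TOKENS)

    # In a CGI script the request header arrives via the environment.
    if is_crawler(os.environ.get("HTTP_USER_AGENT", "")):
        pass  # serve the plain, indexable version rather than redirecting

Two caveats: the User-Agent is trivially spoofed, so never treat it as
proof of anything, and serving crawlers substantially different content
from what users see (cloaking) is exactly what search engines penalise.
Use a check like this to keep the same content indexable, not to hide it.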


