[thelist] Robots.txt

Michael Buffington mike at price.com
Tue Dec 5 11:30:43 CST 2000


We actually wrote a Perl script that saves the last 20 minutes of log file
entries in a "buffer" of sorts.

It actively keeps watch over those last 20 minutes and flags any high
frequency hits (high frequency basically means, in our book, any number of
hits that just doesn't seem humanly possible, like 30 page views per minute
from any single IP aside from an AOL proxy).
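
For anyone curious, here's a rough sketch of that idea - not our actual
script, and the log format, window size, and threshold are just assumptions:

    #!/usr/bin/perl -w
    # Rough sketch of a sliding-window hit counter (illustrative only).
    # Assumes the client IP is the first field of each log line and that
    # lines arrive on STDIN, e.g. piped from "tail -f access_log".
    use strict;

    my $WINDOW    = 20 * 60;   # keep the last 20 minutes of hits per IP
    my $THRESHOLD = 30;        # more than this many hits/minute looks non-human

    my %hits;                  # IP => list of hit timestamps

    while (my $line = <STDIN>) {
        my ($ip) = split /\s+/, $line;
        my $now  = time;

        push @{ $hits{$ip} }, $now;

        # throw away anything older than the 20 minute window
        shift @{ $hits{$ip} }
            while @{ $hits{$ip} } && $hits{$ip}[0] < $now - $WINDOW;

        # count hits from this IP in the last 60 seconds
        my $recent = grep { $_ >= $now - 60 } @{ $hits{$ip} };
        print "possible spider: $ip ($recent hits in the last minute)\n"
            if $recent > $THRESHOLD;
    }

In practice you'd also want to prune IPs that have gone quiet so the hash
doesn't grow all day, and whitelist known proxies like AOL's.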

These IPs, along with their user agents, get logged and put into a possible
spiders table.  Whenever these guys come in, we copy their entries to
separate logs and watch their every move, segregated from the bulk of the
logfiles. (Our daily log files can easily reach 300-400 megs.) It's much
easier to quickly assess what kind of "impact" a spider made this way.
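
The segregation step is nothing fancy either. Something along these lines
would do it (the file names here are made up for illustration):

    #!/usr/bin/perl -w
    # Hypothetical version of the segregation step.  "suspects.txt" and
    # "spider_log" are made-up names; read the access log on STDIN and
    # copy (not move) any entry from a suspect IP into its own file.
    use strict;

    open my $susp, '<', 'suspects.txt' or die "suspects.txt: $!";
    my %suspect = map { chomp; $_ => 1 } <$susp>;
    close $susp;

    open my $out, '>>', 'spider_log' or die "spider_log: $!";
    while (my $line = <STDIN>) {
        my ($ip) = split /\s+/, $line;
        print {$out} $line if $suspect{$ip};
    }
    close $out;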

If we feel we've missed anything, we also have a Perl script that runs
through the entire day's logs looking for high frequency hits, but this
takes about 12 hours to run, making it good as a Plan B approach only.
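
The Plan B pass is conceptually the same thing, just done in one batch over
the whole log. A hypothetical version, assuming common log format timestamps:

    #!/usr/bin/perl -w
    # Hypothetical batch version: bucket hits per IP per minute for a whole
    # day's log and report anything over the threshold.  Assumes common log
    # format with a [dd/Mon/yyyy:hh:mm:ss zone] timestamp.
    use strict;

    my $THRESHOLD = 30;    # hits per minute that look non-human
    my %per_minute;        # "ip minute" => hit count

    while (my $line = <>) {
        next unless $line =~ m{^(\S+) .*? \[([^\]:]+:\d\d:\d\d):\d\d};
        $per_minute{"$1 $2"}++;   # key on IP plus timestamp truncated to the minute
    }

    for my $key (sort keys %per_minute) {
        next unless $per_minute{$key} > $THRESHOLD;
        my ($ip, $minute) = split / /, $key, 2;
        printf "%s: %d hits during %s\n", $ip, $per_minute{$key}, $minute;
    }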

If you don't have access to fancy Perl scripts, there's a very easy way to
tell if you've been spidered - simply watch your traffic.  If you see a
"huge" spike in a very short time period, chances are you got hit by a
spider.

In my book, for most sites, being hit by a spider isn't a bad thing.  It
means your site is getting "known".  For a site like Price.com, we have to
be very careful.  We want non-competitive search engines to look at us, but a
competitor who's stepping through our site, one product at a time, is not
good news. We try to discourage that.

One thing I forgot to mention earlier is that we've seen spiders that never
bother to look at our robots.txt.  Because of this, we don't treat robots.txt
requests showing up in our logfiles as an indication of a visiting spider.
That's why we look for high frequency instead.

Maybe I should write an article on this.  It would be cool to include some
of the Perl we use, but I'd have to see if the I/P folks would be happy with
that.  I'll see what I can do.

Michael Buffington
mike at price.com
(714) 556-3890 x222
http://www.michaelbuffington.com
http://www.price.com 

-----Original Message-----
From: A. Erickson [mailto:amanda at gawow.com]
Sent: Tuesday, December 05, 2000 9:10 AM
To: thelist at lists.evolt.org
Subject: RE: [thelist] Robots.txt


That is really interesting! I know that you can tell specific bots not to
index certain things, but it would be difficult for the average user to
figure out which bots were indexing and how.

You should write an article on thesite!

- amanda

> -----Original Message-----
> From: thelist-admin at lists.evolt.org
> [mailto:thelist-admin at lists.evolt.org]On Behalf Of Michael Buffington
> Sent: Tuesday, December 05, 2000 8:57 AM
> To: 'thelist at lists.evolt.org'
> Subject: RE: [thelist] Robots.txt
>
>
> We've hired a few folks over time who have worked directly or indirectly
> with some of Price.com's competitors.
>
> It's pretty well known that a handful of these competitors DO look in the
> "don't spider" directories.
>
> We've actually watched spiders as they've come in and watched
> them read our
> robots.txt, and either immediately step into the directory or come back to
> the directory at a later time.
>
> While I doubt anyone here is using robots.txt as a security system, it
> doesn't hurt to reiterate that it is in no way a security system.
>
> It should also be said again that most of the better-known and reputable
> organizations do follow the rules.  It seems that only a handful of small
> shops ignore them, are unaware of them, or choose to break them.
>
> Michael Buffington
> mike at price.com
> (714) 556-3890 x222
> http://www.michaelbuffington.com
> http://www.price.com
>
> -----Original Message-----
> From: A. Erickson [mailto:amanda at gawow.com]
> Sent: Monday, December 04, 2000 6:35 PM
> To: thelist at lists.evolt.org
> Subject: RE: [thelist] Robots.txt
>
>
> I can't imagine it would make a difference -- neither help nor harm.
>
> I have a side question, too. Robots can only crawl that which is linked,
> correct? So, if I have stuff that isn't linked anywhere on my site, is
> there any point in including that directory as a "do not search" item?
>
> Any robots that look at the "do not search" and purposefully search it?
>
> - amanda (the paranoid one)
>
> > -----Original Message-----
> > From: thelist-admin at lists.evolt.org
> > [mailto:thelist-admin at lists.evolt.org]On Behalf Of Jay Fitzgerald
> > Sent: Monday, December 04, 2000 3:17 PM
> > To: thelist at lists.evolt.org
> > Subject: Re: [thelist] Robots.txt
> >
> >
> > What do you think of just having an empty robots.txt file? That is what
> > I usually use, a 0-byte file with nothing in it at all. It eliminates
> > the 404 error, but are there any downsides?
> >
> > --
> > Jay Fitzgerald - N at ta$ - Internet Director
> > ===================================
> > Digital Athlete Gamers League   http://www.dagl.net
> > ===================================
> > ICQ: 38823829
> >
> >
> >
>
>


---------------------------------------
For unsubscribe and other options, including
the Tip Harvester and archive of TheList go to:
http://lists.evolt.org Workers of the Web, evolt ! 



