[thelist] Spider Attack

Stuart Young syoung at unitec.ac.nz
Wed Mar 23 20:02:33 CST 2005


Re: bandwidth usage by search spiders.

..............................................................
> Mark Mandel wrote 16/03/2005 15:36:06 >

[Search] Spiders don't tend to take up too much bandwidth anyway - all they
grab are html files, which shouldn't be huge in the 1st place.
..............................................................

Er, no, search spiders grab Word docs and PDFs too; otherwise how come
Word and PDF files show up in Google results? Presumably, though, these
don't change, so they only have to be downloaded once.

<tip author="Stuart Young">
The main bandwidth issue from search spiders arises when you have a
dynamic site with lots of options for people, e.g. rate this link,
printable version, or multiple possible menu pages depending on how the
user filters results.

This means that the spider not only downloads the page you want it to
get, it also downloads all the alternate versions. A well-thought-out
robots.txt file is essential for blocking access to all those
alternate-version pages that are pointless to have indexed, while still
allowing the spider to get to the content.
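
As a rough sketch of the robots.txt idea above: the rules below assume a
hypothetical site that keeps its alternate versions under /print/, /rate/
and /filter/ (your own paths will differ), and the Python helper
is_crawlable() is just an illustrative way to check, with the standard
library's robots.txt parser, that the rules block what you expect while
leaving the real content reachable.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the three directories are assumptions made
# for illustration, standing in for "alternate version" pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /rate/
Disallow: /filter/
"""


def is_crawlable(path, agent="ExampleBot"):
    """Return True if the rules above let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    # www.example.com is a placeholder host, not a real site.
    return parser.can_fetch(agent, "http://www.example.com" + path)


if __name__ == "__main__":
    for path in ("/articles/2345", "/print/articles/2345", "/rate/2345"):
        print(path, "allowed" if is_crawlable(path) else "blocked")
```

Plain path prefixes are used here because they are the form every
spider should understand; pattern-style rules are not handled the same
way by all crawlers.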

It is also important to make sure your web architecture is consistent.
For example, the same dynamic page can be called up by putting the
query-string parameters in a different order, so

displaypage.php?cat=main&date=2005&id=2345
displaypage.php?id=2345&cat=main&date=2005
displaypage.php?id=2345&date=2005&cat=main

are all the same page. However, they have different URLs, so Google
treats them as different pages and will download all of them. So when
creating the links in your web application, you should always make sure
that the query-string parameters are in the same order each time you use
them. If you monitor your server logs and watch spiders visiting your
site, you can quickly spot any occasions where there must be a link
with the parameters in a different order.
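
To make the parameter-order point concrete, here is a minimal Python
sketch. It assumes alphabetical order as the one fixed order (any
consistent order would do), and the function names canonical_url() and
find_order_variants() are made up for this example. The first builds a
link with its parameters in that fixed order; the second takes a list of
requested URLs, such as you might pull out of your logs, and reports any
page that was requested under more than one ordering.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url):
    """Rewrite `url` with its query parameters sorted by name."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), parts.fragment)
    )


def find_order_variants(requested_urls):
    """Group URLs by canonical form; keep groups with several spellings."""
    groups = defaultdict(set)
    for url in requested_urls:
        groups[canonical_url(url)].add(url)
    return {
        canon: sorted(variants)
        for canon, variants in groups.items()
        if len(variants) > 1
    }


if __name__ == "__main__":
    seen = [
        "displaypage.php?cat=main&date=2005&id=2345",
        "displaypage.php?id=2345&cat=main&date=2005",
        "displaypage.php?id=2345&date=2005&cat=main",
    ]
    for canon, variants in find_order_variants(seen).items():
        print(canon, "was requested as", len(variants), "different URLs")
```

Any page that turns up in the result is one where some link in the
application is emitting its parameters in a different order.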
</tip>
cheers

--
Dr Stuart Young,       	+64 (0)9-815 4321 x 8656
<syoung at unitec.ac.nz> 	+64 021 183 2846 (mob)
Lecturer, School of Computing and Information Technology,
Unitec New Zealand, Auckland, New Zealand
http://tinyurl.com/4956o
(the official URL for my staffpage is too long and complex)
http://www.pixelandgrain.co.nz/
Web development, graphic design and photography

