[thelist] site auditing

sam at sam-i-am.com
Fri Feb 7 21:43:01 CST 2003


On Fri, 7 Feb 2003, j.d. welch wrote:

> does anyone have a recommendation of a programmatic approach to auditing
> a site for
>
> a) unreferenced documents (pages with nothing linking to them)
> b) unreferenced images (the image isn't called by any of the pages)
>
I get faced with this problem every now and then and have tried all the
methods suggested. It depends on how big the site is. Is manually checking
the result feasible? Are we talking 10s, 100s, 1000s or more pages?

I don't have an out-of-the-box solution, but here's what I'd do:
Get a full file listing from the webroot (on win32: dir /b/s *.* >
filelist.txt).
Point Xenu at the site and save off the result (it will export to a CSV
file).
The delta between the two lists (once you've done a little massaging to
get the paths to match, etc.) is a crude "orphans list". If you drop the
two lists into Excel you can sort and filter to your heart's content.
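
If you'd rather script the delta step than eyeball it in Excel, a few
lines of Perl will do it. This is just a sketch: it assumes you've already
boiled the Xenu export down to one path per line and trimmed the webroot
prefix off the dir output (filelist.txt and xenu.txt are made-up names
here):

    #!/usr/bin/perl -w
    use strict;

    # Load a list of paths into a hash, normalised to lowercase
    # forward-slash form so the two lists can be compared.
    sub load {
        my ($file) = @_;
        my %seen;
        open(my $fh, '<', $file) or die "$file: $!";
        while (<$fh>) {
            chomp;
            s{\\}{/}g;           # win32 backslashes -> forward slashes
            $seen{lc $_} = 1;
        }
        return \%seen;
    }

    my $on_disk = load('filelist.txt');  # the dir /b/s output
    my $crawled = load('xenu.txt');      # paths from the Xenu export

    # Anything on disk that the crawl never touched is a candidate orphan.
    for my $path (sort keys %$on_disk) {
        print "$path\n" unless $crawled->{$path};
    }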
You'll almost certainly hit trouble if you start deleting those orphan
files though. Most link checkers don't look in background="" attributes
and don't check JavaScript (and those that do will likely miss filenames
that get assembled in the script). Add CSS and any @import or url()
dependencies (a crude extractor is sketched below). Oh, and Flash, and
any files a given movie might load. Etc. etc.
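
For the CSS side, a regex pass gets you most of the way. A rough sketch
(styles.css is a placeholder; this won't resolve relative paths or strip
comments, so treat the output as a starting point):

    #!/usr/bin/perl -w
    use strict;

    open(my $fh, '<', 'styles.css') or die $!;
    my $css = do { local $/; <$fh> };    # slurp the whole stylesheet

    my %refs;
    # url(foo.gif), url('foo.gif'), url("foo.gif")
    $refs{$1} = 1 while $css =~ /url\(\s*['"]?([^'")\s]+)['"]?\s*\)/gi;
    # @import "foo.css"; (@import url(...) is caught above)
    $refs{$1} = 1 while $css =~ /\@import\s+['"]([^'"]+)['"]/gi;

    print "$_\n" for sort keys %refs;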
The only reliable way to know which of these files are used is to use a
real person and a real browser. Roll over all the rollovers, play all the
Flash games, etc.
I wrote (hacked together) a proxy server for this purpose that just logs
each file requested. That way I can hit key pages on the site with
different browsers and OSs, and catch browser forking and content
negotiation (a minimal version is sketched below). It's still not a
perfect list, but it should narrow things down enough to go after the
remainder by hand, searching through the access logs and the source code.
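
If you want to roll your own logging proxy, the pieces ship with
libwww-perl. A minimal sketch using HTTP::Daemon and LWP::UserAgent
(single-threaded, no HTTPS, and it assumes the browser is pointed at the
proxy so requests arrive with absolute URIs; requested.log is my name):

    #!/usr/bin/perl -w
    use strict;
    use HTTP::Daemon;
    use LWP::UserAgent;
    use IO::Handle;

    my $d = HTTP::Daemon->new(LocalAddr => '127.0.0.1', LocalPort => 8080)
        or die "can't listen: $!";
    my $ua = LWP::UserAgent->new;

    open(my $log, '>>', 'requested.log') or die $!;
    $log->autoflush(1);

    while (my $c = $d->accept) {
        while (my $req = $c->get_request) {
            print $log $req->uri, "\n";  # log every file the browser asks for
            $c->send_response($ua->request($req));  # pass it upstream
        }
        $c->close;
    }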

Perl is your friend for this kind of task. People have been using it for
this kind of thing for years and you'll find lots of helpful code and
modules out there. (In an earlier effort I made an HTML parser that knew
about most of the above-mentioned places dependencies can lurk.)
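
For what it's worth, HTML::Parser makes that kind of parser a short job.
A sketch that pulls URLs out of the usual suspect attributes, including
background= (the attribute list is mine, not exhaustive):

    #!/usr/bin/perl -w
    use strict;
    use HTML::Parser;

    # Attributes where file dependencies tend to lurk.
    my %want = map { $_ => 1 } qw(src href background lowsrc usemap data);

    my %found;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                $found{ $attr->{$_} } = 1
                    for grep { $want{lc $_} } keys %$attr;
            },
            'tagname, attr',
        ],
    );

    my $file = shift @ARGV or die "usage: extract.pl page.html\n";
    $p->parse_file($file);
    print "$_\n" for sort keys %found;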

And then you can always just delete the maybe-orphans and check the
error logs after a week or four to see which ones you need to put back :)
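
Pulling the resulting 404s back out of a common-format access log is a
one-liner, something like (the log path is yours to fill in):

    perl -ne 'print "$1\n" if m{"GET (\S+)[^"]*" 404 }' access.log | sort -u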

hth
Sam



