[thelist] site auditing

j.d. welch so.there at showtunepink.com
Fri Feb 7 22:58:01 CST 2003


On Friday, February 7, 2003, at 10:44 PM, <sam at sam-i-am.com> wrote:

> I get faced with this problem every now and then and have tried all the
> methods suggested. It depends how big the site is. Is manually checking
> the result feasible? Are we talking 10s, 100s, 1000s or more pages?

find . -name '*.html' | wc -l says 591.  ugh.  since i'm volunteering
for them now, doing anything manual and tedious is bad.

> I don't have an out of the box solution, but here's what I'd do:
> Get a full file listing from the webroot. (on win32 [...]
> Point Xenu at the site and save off the result (it will export to a csv
> file).[...]

sorry, no windows.  but, yes, a thorough site map/spider result should
provide a decent list of 'good' pages.  <wild ass idea>the site
(sageweb.sage.org, fwiw) is superbly well indexed by google; i don't
suppose there's some wild way to leverage that as an index of good
pages?</> i'm certain to find a good perl spider/link checker module,
at least.
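
for the spider bit, a rough, untested sketch with LWP::UserAgent and
HTML::LinkExtor (both on CPAN) -- the start url and the same-host,
http-only checks are my own assumptions:

    #!/usr/bin/perl -w
    # crawl the site; print every url the spider could actually reach
    use strict;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $start = URI->new('http://sageweb.sage.org/');
    my $ua    = LWP::UserAgent->new;
    my (%seen, %live, @queue);
    push @queue, $start;

    while (my $url = shift @queue) {
        next if $seen{$url}++;            # never fetch twice
        my $res = $ua->get($url);
        next unless $res->is_success;
        $live{$url} = 1;                  # reachable == 'good'
        next unless $res->content_type eq 'text/html';
        my $p = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;        # sees href, src, background...
            for my $link (values %attr) {
                my $abs = URI->new_abs($link, $url);
                $abs->fragment(undef);    # drop #anchors
                push @queue, $abs
                    if $abs->scheme eq 'http'
                    && $abs->host eq $start->host;
            }
        });
        $p->parse($res->content);
        $p->eof;
    }

    print "$_\n" for sort keys %live;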

> You'll almost certainly hit trouble if you start deleting those orphan
> files though. Most link checkers don't look in background=""
> attributes, don't check javascript (and if they do they'll likely miss
> filenames that get assembled in the script). Add CSS and any @import,
> url() dependencies. Oh and Flash and any files any given movie might
> load. Etc. etc. The only reliable way to know which of these files are
> used is to use a real person and a real browser. Roll over all the
> rollovers, play all the Flash games, etc.

it actually shouldn't be that bad.  the site is for a computing
organization, so javascript, flash, pages with weird extensions and
other crap don't exist, fortunately.  css is in one central file.  html
files all end in .html; includes & logical components have no extension
(not my idea, but whatever).  templating is with HTML::Mason, so much
of the navigation/layout logic is not in the individual .html files.
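
with that layout, separating real pages from mason components is one
File::Find pass.  untested, but something like:

    #!/usr/bin/perl -w
    # inventory the webroot: .html pages vs. extensionless components
    use strict;
    use File::Find;

    my $root = shift || '.';
    my (@pages, @components);

    find(sub {
        return unless -f;
        if    (/\.html$/) { push @pages,      $File::Find::name }
        elsif (!/\./)     { push @components, $File::Find::name }
    }, $root);

    print scalar(@pages), " pages, ",
          scalar(@components), " components\n";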

> Perl is your friend for this kind of task. People have been using it
> for this kind of thing for years and years, and you'll find lots of
> helpful code and modules out there. (In an earlier effort I made an
> html parser that knew about most of the above-mentioned places
> dependencies can lurk.)

yeah, i agree.  i think it's the only way to deal with the images bit:
regex searches for things that look like references to images, compare
against a listing of the images (which are fortunately all in one
directory), and go.
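
something along these lines (untested; assumes the script runs from the
webroot and the images live in images/):

    #!/usr/bin/perl -w
    # flag image files that nothing appears to reference
    use strict;
    use File::Find;

    my %referenced;
    find(sub {
        return unless -f;
        return unless /\.html$/ || !/\./;   # pages + mason components
        open my $fh, '<', $_ or return;
        local $/;                           # slurp the file
        my $html = <$fh>;
        # anything that looks like an image filename, anywhere
        $referenced{lc $1}++
            while $html =~ /([\w.-]+\.(?:gif|jpe?g|png))/gi;
    }, '.');

    opendir my $dh, 'images' or die "no images dir: $!";
    for my $img (grep /\.(?:gif|jpe?g|png)$/i, readdir $dh) {
        print "orphan? images/$img\n" unless $referenced{lc $img};
    }
    closedir $dh;

false negatives are possible if a filename ever gets built up in code,
but as noted above there's none of that here.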

> And then you can always just delete the maybe-orphans, and just check
> the error logs after a week or 4 to see which you need to put back :)

good idea.
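
for the log-watching step, pulling the 404'd paths out of an apache
access log (common/combined format assumed; the log path is made up) is
only a few lines:

    #!/usr/bin/perl -w
    # count 404s per path so the popular 'orphans' surface first
    use strict;

    my %missed;
    open my $log, '<', '/var/log/httpd/access_log' or die "log: $!";
    while (<$log>) {
        $missed{$1}++ if m{"(?:GET|HEAD|POST) (\S+)[^"]*" 404 };
    }
    close $log;

    printf "%5d  %s\n", $missed{$_}, $_
        for sort { $missed{$b} <=> $missed{$a} } keys %missed;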

> hth

yep, more to think about.  thanks to everyone for your input.

-jd

------------------------------------------------------------------
    J.D. Welch			|    so.there at showtunepink.com
    graphic designer    	|    http://www.showtunepink.com
    web developer       	|    http://kitschparade.ath.cx
------------------------------------------------------------------



