[thelist] strip html etc

Tobyn Baugher toby at rsux.com
Fri Aug 22 07:50:04 CDT 2003


On Thu, Aug 21, 2003 at 09:18:00PM -0500, george donnelly wrote:
> My current task is to separate the content from the presentation in about
> 5000+ html pages so that it can be dumped into the new site.

ACK! Boy am I glad I'm not the one having to do this! (sorry :P)

> Does anyone have any suggestions or experiences with this kind of task? Can
> anyone point me to any good practices for this?

There's always the possibility that all the docs were written by the
same tool with the exact same layout. If that's the case then pattern
matching is your friend and a good Perl/Python/Awk programmer could
likely do the work you need in under an hour.
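
To give a rough idea of the pattern-matching case, here is a minimal Python sketch. It assumes every page wraps its real content between the same two marker comments. The markers and the extract_content name are made up for illustration; substitute whatever your generating tool actually emitted.

```python
import re

# Hypothetical markers: this assumes every page wraps its body copy in
# these two comments. Swap in whatever your pages really contain.
CONTENT_RE = re.compile(
    r"<!-- content start -->(.*?)<!-- content end -->",
    re.DOTALL | re.IGNORECASE,
)


def extract_content(html):
    """Return the markup between the markers, or None if the page doesn't match."""
    match = CONTENT_RE.search(html)
    return match.group(1).strip() if match else None
```

Any page that returns None doesn't follow the layout and needs a look by hand, which is a quick way to find out how uniform the 5000 pages really are.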

Most likely, however, you've got two options:

1) Pipe your HTML through "lynx -dump" on a UNIX (or Cygwin) machine and
see if the output is easier to parse than the HTML itself.
HTML::Parser in Perl might also be helpful.

or

2) Go through the documents by hand and remember for the future that
this is why XHTML and CSS2 need to become the standard NOW.
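
For option 1, if Perl isn't handy, the same parser-based idea can be sketched with Python's standard html.parser module. This is only a rough illustration, not a replacement for "lynx -dump": the TextExtractor class and html_to_text function are names I made up, and it simply keeps the text nodes and throws away tags, scripts, and styles.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring tags and script/style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering a script/style element: stop collecting text.
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        # Leaving a script/style element: resume collecting text.
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text nodes of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

All the presentation is lost this way, including any structure you might have wanted to keep (headings, links), so treat it as a starting point for seeing what is left once the markup is gone.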

Wish I could be of more assistance. This is one of those things that
seems deceptively simple until you actually have to do it.

-Tobyn

-- 
Tobyn Baugher <toby at rsux.com>
AIM: dieplzkthxbye  ICQ: 14281524  IRC: toby at EFnet

