[thelist] strip html etc

george donnelly list at zettai.net
Fri Aug 22 10:47:29 CDT 2003


[Tobyn Baugher wrote (toby at rsux.com) on 8/22/03 7:50 AM]

> On Thu, Aug 21, 2003 at 09:18:00PM -0500, george donnelly wrote:
>> My current task is to separate the content from the presentation in
>> 5000+ HTML pages so that it can be dumped into the new site.
> 
> ACK! Boy am I glad I'm not the one having to do this! (sorry :P)

yeah, i'm sorry too :(

>> Does anyone have any suggestions or experiences with this kind of task? Can
>> anyone point me to any good practices for this?
> 
> There's always the possibility that all the docs were written by the
> same tool with the exact same layout. If that's the case then pattern
> matching is your friend and a good Perl/Python/Awk programmer could
> likely do the work you need in under an hour.

yes, there is a lot of pattern matching that can be done. I was hoping
against hope someone might know of a tool out there for this so I wouldn't
have to write it myself.
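
if nothing turns up, the core would probably be a short Python script
along these lines. everything in it is guesswork -- the BEGIN/END comment
markers are placeholders until we eyeball what boilerplate the pages
actually share:

import os
import re

# placeholder markers -- substitute whatever boilerplate the pages
# actually have in common once a few have been inspected by hand
CONTENT_RE = re.compile(
    r'<!-- BEGIN CONTENT -->(.*?)<!-- END CONTENT -->',
    re.DOTALL | re.IGNORECASE)

for dirpath, dirnames, filenames in os.walk('pages'):
    for name in filenames:
        if not name.endswith('.html'):
            continue
        path = os.path.join(dirpath, name)
        html = open(path).read()  # may need an explicit encoding for old pages
        m = CONTENT_RE.search(html)
        if m:
            # dump the bare content next to the original file
            open(path + '.content', 'w').write(m.group(1).strip())
        else:
            # flag pages that don't fit the pattern for hand work
            print('no match:', path)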

> Most likely, however, you've got two options:
> 
> 1) Pipe your HTML through "lynx -dump" on a UNIX(/CYGWIN) machine and
> see if the output is more easily parseable than the HTML itself.
> HTML::Parser in Perl might also be helpful.

good idea. i'll try that.
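
for the archives: the lynx route is just looping "lynx -dump page.html >
page.txt" over the tree. and if the dump loses too much structure, Python
ships an HTMLParser class that's roughly the counterpart of Perl's
HTML::Parser -- here's a minimal tag-stripping sketch, untested against
these particular pages:

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # collect text content, skipping script/style blocks
    SKIP = ('script', 'style')

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)

extractor = TextExtractor()
extractor.feed(open('page.html').read())
print(''.join(extractor.parts))

the nice part of the parser route over a plain text dump is that it leaves
the door open for grabbing attributes (titles, meta tags) later, which
lynx -dump throws away.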

> or
> 
> 2) Go through the documents by hand and remember for the future that
> this is why XHTML and CSS2 need to become the standard NOW.

heh, i've known that for years. the Zope-based system I'm
building/implementing for them will not *allow* them to screw themselves up
this badly again. ;)

> Wish I could be of more assistance. This is one of those things that
> seems deceptively simple until you actually have to do it.

right, thanks for your suggestions tho.

<-->
george donnelly ~ http://www.zettai.net/ ~ "Quality Zope Hosting"
Shared and Dedicated Zope Hosting ~ Zope Servers ~ Zope Websites
Yahoo, AIM: zettainet ~ MSN: zettainet at hotmail.com ~ ICQ: 51907738


