[thelist] strip html etc
george donnelly
list at zettai.net
Fri Aug 22 10:47:29 CDT 2003
[Tobyn Baugher wrote (toby at rsux.com) on 8/22/03 7:50 AM]
> On Thu, Aug 21, 2003 at 09:18:00PM -0500, george donnelly wrote:
>> My current task is to separate the content from the presentation in about
>> 5000+ html pages so that it can be dumped into the new site.
>
> ACK! Boy am I glad I'm not the one having to do this! (sorry :P)
Yeah, I'm sorry too :(
>> Does anyone have any suggestions or experiences with this kind of task? Can
>> anyone point me to any good practices for this?
>
> There's always the possibility that all the docs were written by the
> same tool with the exact same layout. If that's the case then pattern
> matching is your friend and a good Perl/Python/Awk programmer could
> likely do the work you need in under an hour.
Yes, there is a lot of pattern matching that can be done. I was hoping
against hope someone might know of a tool out there for this so I wouldn't
have to write it myself.
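For what it's worth, if the pages really were generated from one template, the pattern-matching approach can be very short. A minimal Python sketch along these lines might work; the marker comments and sample page below are made up for illustration, and the real landmarks would have to come from the actual pages:

```python
import re

# Hypothetical sample page: if every page came from the same template,
# the real content may sit between fixed landmarks in the markup.
# The "begin/end content" markers here are assumptions, not from the thread.
PAGE = """<html><body>
<table><tr><td>nav junk</td></tr></table>
<!-- begin content -->
<p>The real article text.</p>
<!-- end content -->
</body></html>"""

# Non-greedy match across newlines between the two landmark comments.
pattern = re.compile(r"<!-- begin content -->(.*?)<!-- end content -->", re.S)

def extract_content(html):
    """Return the markup between the landmarks, or None if absent."""
    m = pattern.search(html)
    return m.group(1).strip() if m else None

print(extract_content(PAGE))
```

Run over 5000 files in a loop, something like this would at least isolate the content region before any further cleanup.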
> Most likely, however, you've got two options:
>
> 1) Pipe your HTML through "lynx -dump" on a UNIX(/CYGWIN) machine and
> see if the output is more easily parseable than the HTML itself.
> HTML::Parser in Perl might also be more helpful.
Good idea, I'll try that.
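Along the same lines as the lynx -dump / HTML::Parser suggestion, Python's standard-library html.parser can do the tag-walking without shelling out. This is only a rough sketch of the idea, not a drop-in tool:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []   # collected text fragments
        self.skip = 0     # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

def strip_html(html):
    """Return the visible text of an HTML string, whitespace-joined."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

sample = ("<html><head><style>p{}</style></head>"
          "<body><h1>Title</h1><p>Body text.</p></body></html>")
print(strip_html(sample))
```

Like lynx -dump, this flattens everything to text, so it is most useful as a first pass before deciding which pieces are content and which are navigation.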
> or
>
> 2) Go through the documents by hand and remember for the future that
> this is why XHTML and CSS2 need to become the standard NOW.
Heh, I've known that for years. The Zope-based system I'm
building/implementing for them will not *allow* them to screw themselves up
this badly again. ;)
> Wish I could be of more assistance. This is one of those things that
> seems deceptively simple until you actually have to do it.
Right, thanks for your suggestions though.
<-->
george donnelly ~ http://www.zettai.net/ ~ "Quality Zope Hosting"
Shared and Dedicated Zope Hosting ~ Zope Servers ~ Zope Websites
Yahoo, AIM: zettainet ~ MSN: zettainet at hotmail.com ~ ICQ: 51907738