[thelist] strip html etc

Jono Young Jono at brookgroup.com
Fri Aug 22 09:41:46 CDT 2003


I am not sure if this will help, but here are a couple of possible solutions:
1.  Text Edit Plus can run a script that removes HTML from a document
(Scripts>Convert>HTML->Mac). I am running Text Edit Plus on a Mac, and I am
not sure whether there is a PC equivalent or version.
2.  There's an application called WebWhacker, which grabs websites and
their content. It may have an option to do what you are looking for. It is
made by Blue Squirrel.
<http://www.bluesquirrel.com>
I think WebWhacker may have been replaced by Grab-a-Site, but it does the
same thing.
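
If neither tool is available, a script along the same lines as the Text Edit
Plus one is not hard to sketch. This is only a rough idea in Python using the
standard library's html.parser module. The names TextExtractor and strip_html
are my own, and I am assuming you only want the visible text, with script and
style blocks dropped:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain text.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(markup):
    """Return the text of an HTML string with tags removed (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse the runs of whitespace left behind by the removed tags.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(strip_html("<p>Hello <b>world</b></p>"))
```

This throws away all structure (headings, lists, links), so it is probably
only a first pass before sorting content from presentation.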

On 8/22/03 8:50 AM, "Tobyn Baugher" <toby at rsux.com> wrote:

> On Thu, Aug 21, 2003 at 09:18:00PM -0500, george donnelly wrote:
>> My current task is to separate the content from the presentation in about
>> 5000+ html pages so that it can be dumped into the new site.
> 
> ACK! Boy am I glad I'm not the one having to do this! (sorry :P)
> 
>> Does anyone have any suggestions or experiences with this kind of task? Can
>> anyone point me to any good practices for this?
> 
> There's always the possibility that all the docs were written by the
> same tool with the exact same layout. If that's the case then pattern
> matching is your friend and a good Perl/Python/Awk programmer could
> likely do the work you need in under an hour.
> 
> Most likely, however, you've got two options:
> 
> 1) Pipe your HTML through "lynx -dump" on a UNIX(/CYGWIN) machine and
> see if the output is more easily parseable than the HTML itself.
> HTML::Parser in Perl might also be more helpful.
> 
> or
> 
> 2) Go through the documents by hand and remember for the future that
> this is why XHTML and CSS2 need to become the standard NOW.
> 
> Wish I could be of more assistance. This is one of those things that
> seems deceptively simple until you actually have to do it.
> 
> -Tobyn
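
If the pages really were all written by the same tool, the pattern matching
Tobyn mentions might look something like the sketch below. The "content
start" and "content end" comment markers are purely hypothetical, as is the
extract_page name. You would need to swap in whatever actually surrounds the
body copy in the real template:

```python
import re

# Hypothetical markers: assumes every page wraps its body copy in these comments.
CONTENT_RE = re.compile(
    r"<!--\s*content start\s*-->(.*?)<!--\s*content end\s*-->",
    re.DOTALL | re.IGNORECASE,
)
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL | re.IGNORECASE)


def extract_page(markup):
    """Return (title, content), or None if the page does not fit the template."""
    content = CONTENT_RE.search(markup)
    if content is None:
        return None
    title = TITLE_RE.search(markup)
    return (title.group(1).strip() if title else "", content.group(1).strip())
```

Pages that come back as None do not fit the pattern and would have to be
handled some other way, which at least tells you how many of the 5000 need
attention by hand.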

-- 
Jono Young
Designer/Illustrator
Brook Group, LTD
8231 Main Street
Ellicott City, MD 21043
T: 410-465-7805 xt: 16
<http://www.brookgroup.com/>


