[thelist] strip html etc

Sam sam at sam-i-am.com
Fri Aug 22 12:51:15 CDT 2003


I'm working through a similar problem on my site. My approach was to:

Inventory the pages: I wrote a Perl script to crawl the directory tree 
and output filename, extension and title to a tab-delimited file.
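
A minimal sketch of that inventory step, in Python rather than the Perl 
used for the original script (which isn't shown here). The function 
names, the column headers and the choice to read titles only from 
.html/.htm files are all assumptions for illustration.

```python
import csv
import os
import re

# Naive <title> matcher; good enough for an inventory, not a real HTML parser.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def read_title(path):
    """Return the page's <title> text with whitespace collapsed, or ''."""
    # latin-1 maps every byte to a character, so old pages never fail to decode.
    with open(path, encoding="latin-1") as fh:
        match = TITLE_RE.search(fh.read())
    return " ".join(match.group(1).split()) if match else ""

def inventory(root):
    """Yield (relative path, extension, title) for every file under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            # Assumption: only .html/.htm files are worth scanning for a title.
            title = read_title(path) if ext in ("html", "htm") else ""
            yield os.path.relpath(path, root), ext, title

def write_inventory(root, out):
    """Write the inventory to the open text file `out` as tab-delimited rows."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["filename", "extension", "title"])
    writer.writerows(inventory(root))
```

Calling write_inventory(".", sys.stdout) after an "import sys" and 
redirecting the output to a file should give something a spreadsheet 
can open directly.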

I imported that into Excel and went through removing obvious 
duplicates, orphans, backups, tmp files etc. The inventory is key: you 
can track progress, filter, sort and so on.

All my subsequent scripts accepted a filelist, which I got by filtering 
in Excel and copying the filename column to a text file.

I then refactored that original script to match each file against one 
of the few templates I'd used over the years. This gave me a group of 
files that I was fairly confident I could extract content from cleanly.
I then went through the remainder to see what was up. Of the 500 or so 
"html" files I started with, I whittled it down to about 50 that would 
need manual content extraction.
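
The template-matching pass could be sketched roughly as follows, again 
in Python rather than Perl. The two signatures are hypothetical 
placeholders; in practice each would be some markup that only one of 
your old templates contains. The filelist is assumed to be the text 
file of filenames copied out of the spreadsheet, one per line.

```python
import re

# Hypothetical signatures: one distinctive pattern per old template.
TEMPLATE_SIGNATURES = {
    "tables-1999": re.compile(r'<table[^>]+id="layout"', re.IGNORECASE),
    "divs-2001": re.compile(r'<div[^>]+id="content"', re.IGNORECASE),
}

def classify(html):
    """Return the name of the first template whose signature matches, else None."""
    for name, pattern in TEMPLATE_SIGNATURES.items():
        if pattern.search(html):
            return name
    return None

def group_by_template(filelist_path):
    """Read one filename per line and group the files by matched template."""
    groups = {}
    with open(filelist_path, encoding="utf-8") as listing:
        for line in listing:
            filename = line.strip()
            if not filename:
                continue  # skip blank lines in the filelist
            with open(filename, encoding="latin-1") as page:
                template = classify(page.read())
            groups.setdefault(template, []).append(filename)
    return groups
```

Whatever ends up under the None key is the remainder that needs a 
closer look by hand.
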
When I've done similar work for a client, this is the point where you 
can prioritise. You might shoot for a full, flawless extraction/migration 
for the top-priority pages, and maybe just put a new header and footer 
on the low-priority pages.

My extraction scripts were fairly blunt tools: quick throwaway Perl 
scripts with pattern matches to extract content from a known template. 
In my case I moved all content directly into an interim template that 
was a load of id'd divs for the different content areas.
Once I had the final HTML template(s), migration from the interim 
was trivial.
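
As a rough illustration of that kind of blunt extraction (a Python 
sketch, not the original Perl): the comment markers, the area names and 
the interim layout below are all made up for the example. The idea is 
just one pattern per content area and one id'd div per area in the 
output.

```python
import re

# Hypothetical markers for one known old template.
PATTERNS = {
    "title": re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S),
    "body": re.compile(r"<!-- begin content -->(.*?)<!-- end content -->", re.I | re.S),
}

# Interim template: one id'd div per content area (layout is an assumption).
INTERIM = """<html>
<head><title>{title}</title></head>
<body>
<div id="title">{title}</div>
<div id="body">{body}</div>
</body>
</html>
"""

def extract(html):
    """Return {area: text} for each pattern; raise if an area is missing."""
    areas = {}
    for area, pattern in PATTERNS.items():
        match = pattern.search(html)
        if match is None:
            raise ValueError("no %s found: page does not fit this template" % area)
        areas[area] = match.group(1).strip()
    return areas

def to_interim(html):
    """Pour the extracted areas into the interim template."""
    return INTERIM.format(**extract(html))
```

Raising on a missing area, rather than writing an empty div, means a 
page that doesn't really fit the template gets noticed instead of 
silently losing its content.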

This kind of approach might work for a one-off, one-person effort. If 
you need to distribute the work or hand over your tools afterwards I 
guess you'd have to be a little less informal and actually write nice 
code :)
I think my main advice would be not to try to write one script to do it 
all in one go. Allow for manual intervention at each step of the way. 
Even with 5000 files, some tasks are better handled by hand than 
programmatically.
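
One way to leave room for manual intervention, sketched in Python under 
the assumption that each step is a function taking a filename: run the 
step over the filelist, and collect the files it chokes on instead of 
stopping. The run_step name and the (done, needs_manual) return shape 
are illustrative only.

```python
def run_step(filenames, step):
    """Apply step(filename) to each file; return (done, needs_manual)."""
    done, needs_manual = [], []
    for filename in filenames:
        try:
            step(filename)
        except Exception as err:
            # Keep going; a person looks at these files later.
            needs_manual.append((filename, str(err)))
        else:
            done.append(filename)
    return done, needs_manual
```

The done list can feed the next script as its filelist, and the 
needs_manual list goes back into the spreadsheet.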

Sam

george donnelly wrote:

> My current task is to separate the content from the presentation in about
> 5000+ html pages so that it can be dumped into the new site.


