[thelist] Text mining

David Bindel dbindel at austin.rr.com
Wed Mar 19 22:15:42 CST 2003


> -----Original Message-----
> From: thelist-bounces at lists.css-discuss.org 
> [mailto:thelist-bounces at lists.css-discuss.org] On Behalf Of Steve Hasz
> Sent: Wednesday, March 19, 2003 4:05 PM
> To: thelist at lists.evolt.org.uk
> Subject: RE: [thelist] Text mining
> 
> I've been thinking about 
> doing it with a modified Print this Page PHP script and then 
> parsing what comes from that.

Here is my idea for tackling the original poster's text mining question:

1) Use PHP to retrieve the email records from the database.

2) From each record, take the HTML fragment (i.e. the article), and
dynamically insert it into a template (i.e. "the rest" of the HTML page
- to make it near-valid HTML), and save the resulting complete HTML page
as a temporary file.

3) Pass the temporary page's filename (along with any other command
line parameters) to HTML Tidy through PHP's exec() function.

4) After HTML Tidy has tidied the temp page, use a regular expression to
extract the non-template HTML (i.e. the article) into a PHP variable.  I
have no experience with regular expressions myself, but from what I have
seen and heard about extracting specific pieces of HTML, it shouldn't be
too hard.

5) Finally, UPDATE the database record with the new valid HTML chunk.

6) Loop through all of the records and perform steps 2-5 on each.  I'm
not sure how this would perform in terms of time and resource usage, but
the approach seems logical enough to me, and I think it would work if
implemented correctly.
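
Steps 2-4 could be sketched roughly like this.  It is only a sketch
under a few assumptions that are mine, not from the original question:
the tidy binary is on the PATH, the template is a bare-bones page, and
the article is delimited by two marker comments (ARTICLE-START and
ARTICLE-END are made-up names).  It relies on Tidy leaving comments in
place, which I believe is its default behaviour.  The function names are
hypothetical too.

```php
<?php
// Hypothetical marker comments that delimit the article inside the template.
const ARTICLE_START = '<!-- ARTICLE-START -->';
const ARTICLE_END   = '<!-- ARTICLE-END -->';

// Step 2: wrap a fragment in a minimal page so Tidy sees a complete document.
function wrapInTemplate(string $fragment): string
{
    return "<html><head><title>temp</title></head><body>\n"
         . ARTICLE_START . "\n" . $fragment . "\n" . ARTICLE_END
         . "\n</body></html>\n";
}

// Step 4: pull out whatever sits between the two markers (null if not found).
function extractArticle(string $page): ?string
{
    $pattern = '/' . preg_quote(ARTICLE_START, '/') . '(.*?)'
             . preg_quote(ARTICLE_END, '/') . '/s';
    return preg_match($pattern, $page, $m) === 1 ? trim($m[1]) : null;
}

// Steps 2-4 for one fragment; returns null if Tidy fails or the markers are lost.
function tidyFragment(string $fragment): ?string
{
    $tmp = tempnam(sys_get_temp_dir(), 'tidy');
    if ($tmp === false) {
        return null;
    }
    file_put_contents($tmp, wrapInTemplate($fragment));

    // Step 3: -q suppresses the summary, -m rewrites the file in place.
    exec('tidy -q -m ' . escapeshellarg($tmp) . ' 2>&1', $output, $status);

    // Tidy exits with 0 (clean) or 1 (warnings) on success, 2 on errors.
    $page = $status <= 1 ? file_get_contents($tmp) : false;
    unlink($tmp);

    return $page === false ? null : extractArticle($page);
}
```

In the loop from step 6, a null result would be the signal to skip the
UPDATE for that record and leave the stored HTML as it was.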

HTH,
David Bindel

-- 
    David I. Bindel
  Website Development
 dbindel at austin.rr.com
  www.davidbindel.com


