[thelist] Text mining

Mark Kennedy mark at eurogamer.net
Wed Mar 19 12:02:43 CST 2003


Hey there,

Thanks for your reply.  You replied directly to me rather than to the list; however,
your message made me think that maybe I hadn't explained my problem fully.  I
hope you don't mind me reproducing your email here.

>I have a guy I use for that work.  Most recently he parsed 11,000 emails for
>a mailing list archive and created a system that takes care of that and
>builds the archive automatically as it goes now.  The guy's name is...<snip>

Crikey, that's a lot of emails.  It's not something we want to farm out though.

>If you want to do the work yourself, you can use PHP to parse them out and
>then get them loaded into a database in a more future friendly format.  Then
>just draw them out of the database by doing calls to the db.  That way
>you're building one good page instead of 3,000.

It's a bit more complicated than taking some text files with a couple of entries
and plopping them into a database, I'm afraid.

We have a simple CMS that was written a long time ago, when our requirements were
somewhat less than they are now, and a lot, lot less than we think they're going
to become.  Articles are stored in a database, with some of the data we index
the articles by in their own fields (publishing date, article title, etc.).
Then there's a big text record that contains an HTML fragment representing the
article itself.  Now originally this was in the form of:

<b>My News Item</b>
<p>
This happened.
<p>
And this happened.
<p>
Oh and this happened.
<p>
<p>
<b>by Mr Journalist Man

This article then goes through an export step that wraps a few other bits of
information around it, and lists it on the front page and archive pages.  It
wouldn't be that difficult to reformulate this sort of thing into valid XML
for the new CMS.
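For the simple case above, I'd guess something like this rough PHP sketch would
do it (the <article>/<title>/<para>/<byline> element names are made up here;
the real schema will come from the new CMS):

<?php
// Rough sketch only: assumes the legacy blob always follows the shape
// above -- a bold title, bare <p> separators, and a trailing bold
// byline.  The element names are invented for illustration.
function legacyToXml(string $blob): string
{
    // Split the blob on the bare <p> separators.
    $chunks = array_filter(array_map('trim', preg_split('/<p>/i', $blob)));

    $title = $byline = '';
    $paras = [];
    foreach ($chunks as $chunk) {
        if (preg_match('/^<b>by\s+(.*?)(<\/b>)?$/is', $chunk, $m)) {
            $byline = trim($m[1]);          // trailing "<b>by ..." byline
        } elseif (preg_match('/^<b>(.*?)<\/b>$/is', $chunk, $m)) {
            $title = $m[1];                 // leading bold title
        } else {
            $paras[] = strip_tags($chunk);  // body text, inline markup dropped
        }
    }

    $xml = new SimpleXMLElement('<article/>');
    $xml->addChild('title', htmlspecialchars($title));
    foreach ($paras as $p) {
        $xml->addChild('para', htmlspecialchars($p));
    }
    $xml->addChild('byline', htmlspecialchars($byline));
    return $xml->asXML();
}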

But over the years they (the journalists) have had to fit much more information
and meta-information (that really should go in databases) into this blob of
HTML.  They even hard-link in images after uploading them to a folder on the
server.  We now have a fairly comprehensive in-house XML-based CMS that we need
to port to.  It's going to use XSLT and will have links to other databases
containing assets required by the articles.  Hopefully this will be, as you
said, pretty future-proof.

I don't really want to have to write a custom text processor (in PHP or in any
other language) to handle this, and I was hoping that something might already
exist that could infer structure from formatting.  For instance, a piece of
software that could figure out that:


<p>
<b>Item 1</b> - Value 1
<br />
<b>Item 2</b> - Value 2
<br />
<b>Item 3</b> - Value 3
</p>

was a crude way of representing tabulated data.  That's just the first level,
though.  We also need to spot types of data in the text and mark them up in the
correct way.  For instance, Item 1 might be a company name and Item 2 might be a
price or a product specification.
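To make that concrete, the kind of inference I'm imagining looks roughly like
this in PHP (the <table-data>/<row> output and the price heuristic are my own
inventions, and I'm assuming the fragment has already been made well-formed):

<?php
// Sketch of the inference step, assuming the fragment is already
// well-formed (i.e. post-Tidy).  <table-data>, <row>, <label> and
// <value> are invented names, as is the price heuristic.
function inferTable(string $fragment): ?string
{
    // Collect every "<b>Label</b> - Value" pair in the paragraph.
    $n = preg_match_all('/<b>(.*?)<\/b>\s*-\s*([^<]+)/s',
                        $fragment, $pairs, PREG_SET_ORDER);
    if ($n < 2) {
        return null;  // one pair isn't evidence of tabulated data
    }

    $out = "<table-data>\n";
    foreach ($pairs as [, $label, $value]) {
        $value = trim($value);
        // Crude type-spotting: anything that looks like money gets
        // flagged as a price; everything else stays plain text.
        $type = preg_match('/^[£$€]\s*\d/u', $value) ? 'price' : 'text';
        $out .= sprintf("  <row><label>%s</label><value type=\"%s\">%s</value></row>\n",
                        htmlspecialchars(trim($label)), $type,
                        htmlspecialchars($value));
    }
    return $out . "</table-data>\n";
}

Run over the example above it would spot three rows; the trouble, of course, is
that real articles are nowhere near this regular.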

What I think I'm going to have to do is combine HTMLTidy (to generate valid
XML), XSLT (to attempt to spot structure in the markup) and some sort of text
stream processor like 'sed' (to spot patterns in the data that XSLT isn't well
suited to), and then have someone manually direct the different styles of
article to the relevant processor.
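Glued together, that might look roughly like this (assuming PHP's tidy and xsl
extensions; the stylesheet name and the final regex pass are just stand-ins for
the real XSLT and sed steps):

<?php
// Sketch of the whole pipeline.  The stylesheet name is hypothetical,
// and a preg_replace() pass stands in for sed; choosing which
// stylesheet each style of article gets would still be manual.
function convertArticle(string $legacyHtml, string $stylesheet): string
{
    // Step 1: HTMLTidy turns the sloppy fragment into well-formed XHTML.
    $clean = tidy_repair_string($legacyHtml, [
        'output-xhtml'     => true,
        'show-body-only'   => true,
        'numeric-entities' => true,
    ], 'utf8');

    // Step 2: XSLT tries to spot structure in the markup.
    $doc = new DOMDocument();
    $doc->loadXML('<fragment>' . $clean . '</fragment>');

    $xsl = new DOMDocument();
    $xsl->load($stylesheet);
    $proc = new XSLTProcessor();
    $proc->importStylesheet($xsl);
    $xml = $proc->transformToXml($doc);

    // Step 3: a regex pass plays the sed role, catching patterns in
    // the text that XSLT isn't suited to -- e.g. wrapping anything
    // price-shaped in a (made-up) <price> element.
    return preg_replace('/£\s*[\d,]+(\.\d\d)?/u', '<price>$0</price>', $xml);
}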

...That is, unless anyone here has any better ideas (I hope somebody does).  I
appreciate that some manual labour is unavoidable; nevertheless, I'd like to
minimise the amount involved.

I hope that explains my request a little more comprehensively.

Thanks again

Mark