[thelist] Text mining

Mark Kennedy mark at eurogamer.net
Wed Mar 19 05:33:50 CST 2003


Hi all,

We're moving our internal content management system over to XML instead of the
sloppy HTML our staff writers have been using up till now.  We have about 3000
articles and news posts written in about a dozen styles of 'tag soup' HTML, and
I don't really want to have to coordinate manual conversion of every single one
into our new document markup language.

Does anybody have any experience with converting this sort of content
automatically?  Or at least semi-automatically?  It would probably be acceptable
to get the data into an intermediate form which can be quickly marked up by hand
with a load of editor macros.  Still, 3000 is a lot of articles :)
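
To make the "intermediate form" idea concrete, here is a rough sketch of the
kind of pass I have in mind, written in Python with only the standard library.
Everything specific in it is an assumption for illustration: the TAG_MAP
table, the target element names (para, strong, em) and the normalize() helper
are made up, not our actual markup language.  It maps a few known tags, drops
unknown ones but keeps their text, and closes anything left open so the output
is at least well-formed.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Hypothetical mapping from sloppy HTML tags to a made-up target vocabulary.
TAG_MAP = {"p": "para", "b": "strong", "strong": "strong", "i": "em", "em": "em"}


class SoupNormalizer(HTMLParser):
    """Collects well-nested output from badly nested or unclosed input."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []     # output fragments, joined at the end
        self.stack = []   # target elements that are currently open

    def handle_starttag(self, tag, attrs):
        name = TAG_MAP.get(tag)  # the parser lowercases tag names for us
        if name is None:
            return  # unknown tag: drop the markup, keep the text inside
        if name == "para":
            self._close_all()  # a new paragraph implicitly ends the last one
        self.stack.append(name)
        self.out.append(f"<{name}>")

    def handle_endtag(self, tag):
        name = TAG_MAP.get(tag)
        if name is None or name not in self.stack:
            return  # unknown or stray closing tag
        # Close everything opened after the matching element, then the match.
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == name:
                break

    def handle_data(self, data):
        self.out.append(escape(data))  # re-escape &, < and > for XML

    def _close_all(self):
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")

    def result(self):
        self.close()       # flush any text the parser is still buffering
        self._close_all()  # close whatever the writer never closed
        return "".join(self.out)


def normalize(html):
    parser = SoupNormalizer()
    parser.feed(html)
    return parser.result()


print(normalize("<P>Hello <B>world<p>Second & <i>last"))
```

The output of something like this would still need a hand pass with editor
macros for anything the table doesn't cover, but it should at least leave
every article in one consistent, parseable shape.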

Thanks in advance

Mark



