[thelist] Converting MS Word to text, preserving entities

Michael Mell mike at nthwave.net
Mon Apr 8 11:36:01 CDT 2002


Francois Jordaan wrote:

> A week ago, Michael Mell wrote,
> > I've already written the basics of a simple tool in Python to convert
> > rtf
>
> To get to the point, I'm looking for a simple conversion tool that'll take
> Word docs or RTF and convert them to text with all extended characters
> correctly converted to numeric entities. Does such a tool already exist?
> Mike, does your Python tool do that?

Yes. http://www.nthwave.net/rtf2HTML/
I have not yet read the rtf spec or incorporated the plethora of rtf codes
into the script. However, what is there works for me and is easily extended.
To include codes that your authors use, simply edit the two dictionaries at
the top of the script. The script contains further documentation.
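
For illustration only, here is a rough sketch of what a dictionary-driven
conversion can look like. It is not the script at the URL above: the table
names, their contents and the convert() function are invented for the example,
it is written for current Python 3, and it assumes \'hh escapes are
Windows-1252 bytes.

```python
import re

# Hypothetical lookup tables (not taken from the real script).
# Extend these with whatever codes your authors actually use.
CHARACTER_CODES = {
    r"\emdash": "&#8212;",
    r"\endash": "&#8211;",
    r"\lquote": "&#8216;",
    r"\rquote": "&#8217;",
    r"\ldblquote": "&#8220;",
    r"\rdblquote": "&#8221;",
    r"\bullet": "&#8226;",
}
LAYOUT_CODES = {
    r"\par": "\n",
    r"\tab": "\t",
}

# A control word is a backslash plus letters; one following space, if
# present, is a delimiter that belongs to the control word.
_CONTROL_WORD = re.compile(r"(\\[a-z]+) ?")
# \'hh is a single byte written as two hex digits.
_HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def convert(rtf_text):
    """Replace known rtf codes and hex escapes with text and numeric entities."""

    def replace_word(match):
        name = match.group(1)
        if name in CHARACTER_CODES:
            return CHARACTER_CODES[name]
        if name in LAYOUT_CODES:
            return LAYOUT_CODES[name]
        return match.group(0)  # unknown code: leave it untouched

    def replace_hex(match):
        # Assumption: the document code page is Windows-1252.
        byte = bytes([int(match.group(1), 16)])
        char = byte.decode("cp1252", errors="replace")
        return "&#%d;" % ord(char)

    text = _CONTROL_WORD.sub(replace_word, rtf_text)
    return _HEX_ESCAPE.sub(replace_hex, text)


if __name__ == "__main__":
    print(convert(r"caf\'e9 \emdash ok\par"))
```

Codes that are not in either table pass through unchanged, so whatever is
still missing stays visible in the output and can be added to the tables.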

The script will create a new file with a .txt extension. At the top of this
new file, there will be about a page full of rtf junk that you can delete. The
rest of the file will be your converted document.
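
As a sketch of the output step only (again not the actual script;
write_converted() is a hypothetical name and this assumes current Python 3),
the converted text can be written next to the source file with a .txt
extension, with any character outside ASCII turned into a numeric entity on
the way out:

```python
import os


def write_converted(source_path, converted_text):
    """Write converted_text beside source_path, using a .txt extension.

    Any non-ASCII character still in the text is written as a numeric
    character reference such as &#233; by the xmlcharrefreplace handler.
    """
    base, _old_extension = os.path.splitext(source_path)
    output_path = base + ".txt"
    with open(output_path, "w", encoding="ascii",
              errors="xmlcharrefreplace") as handle:
        handle.write(converted_text)
    return output_path
```

The result is plain ASCII, so it should survive being pasted into a page
regardless of that page's character encoding.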

Please let me know how you would like this to be further improved (aside from
the obvious one of including all the codes). I can't always read all of
[thelist], so a private message will be more certain to get my attention.

Mike

--
mike[at]nthwave.net
llemekim         YahooIM
415.455.8812     voice
419.735.1167     fax




