[thelist] Converting MS Word to text, preserving entities

Francois Jordaan francois.jordaan at wheel.co.uk
Mon Apr 8 03:54:00 CDT 2002


A week ago, Michael Mell wrote,

> Re: [thelist] MS Doc File Format Specs?

> I've already written the basics of a simple tool in Python to convert rtf
> (esp. MSWord) files to a user-defined format. The format could be html or
> some template-ready syntax. The tool lets me easily move Word files into
> my templating system. If any one is interested in such a tool, send me a
> quick hello and let me know how you would use it.

I recently did a bit of research to gauge the degree of browser support
for extended characters, based on aardvark's character entity chart
article:
http://www.fjordaan.uklinux.net/entities/entities_support.html
and have created some add-ons for Dreamweaver and Textpad to more easily
insert the correct numeric entities for frequently-used extended characters:
http://www.fjordaan.uklinux.net/moveabletype/fblog/archives/000054.html

My next rant was going to be about the fact that this is still not very
useful in practice, because I'm just addressing the means of entering
correct entities when originating a document in HTML. The fact is that most
content on the web is not originated in an HTML editor, but in MSWord or
other text editors, and correct character entities usually fall victim to
content population methods such as copying and pasting from Word into
Dreamweaver or into a CMS. Time (or expertise) is rarely budgeted for
fixing these entities.

To get to the point, I'm looking for a simple conversion tool that'll take
Word docs or RTF and convert them to text with all extended characters
correctly converted to numeric entities. Does such a tool already exist?
Mike, does your Python tool do that?
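
To make the request concrete, here is a minimal sketch of just the
entity-conversion step, in Python. It is not Mike's tool and it does not
parse Word or RTF files: it assumes the document has already been saved
from Word as plain text in the Windows-1252 encoding. The function name
and the default encoding are illustrative assumptions.

```python
def word_text_to_entities(raw_bytes, source_encoding="cp1252"):
    """Convert plain text exported from Word into ASCII with numeric entities.

    Assumption: raw_bytes is text saved from Word as "Text Only", which on
    Windows is normally Windows-1252 (cp1252). Pass a different
    source_encoding if the file was saved another way.
    """
    # Decode the bytes so curly quotes, dashes, accents etc. become
    # real Unicode characters.
    text = raw_bytes.decode(source_encoding)
    # Re-encode as ASCII; every character outside ASCII is replaced by a
    # decimal numeric character reference such as &#8220;.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    # Curly quotes, an en dash and an e-acute as Word would save them.
    sample = b"\x93Hello\x94 \x96 caf\xe9"
    print(word_text_to_entities(sample))
```

Note that this only touches characters outside ASCII. It leaves &, < and >
alone, so those would still need escaping separately if the text is going
straight into HTML.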

Converting Word docs to HTML is an old chestnut, and I haven't found
anything that does a perfect job yet. FWIW, my bookmarks:
http://www.w3.org/People/Raggett/tidy/
http://www.wvware.com/
http://www.textism.com/resources/cleanwordhtml/index.html
http://philip.greenspun.com/wtr/word.html

However, I've decided for now to just try and crack the problem of
preserving correct typography.

Can anyone help? How do other people (especially those with CMSes) handle
content population and typographical standards?

francois



