[thelist] Converting MS Word to text, preserving entities

Michael Mell mike at nthwave.net
Mon Apr 8 12:38:00 CDT 2002


I don't know Zope, but I understand the concept. I think you could do this
seamlessly with just a little coding: the rtf files are found in a known
location, and rtf2HTML.py does its conversion and outputs to another known
location.

As part of the conversion, you'll need to strip out the junk rtf I mentioned.
Until rtf2HTML handles it better, I would suggest your authors put a unique
string at the top of the document, e.g. "<content start>". Then you would have
a script delete the file contents from the beginning up to the end of the
unique string.
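
A minimal sketch of that stripping step in Python. The function name and the
fallback behaviour are my own assumptions for illustration, not part of
rtf2HTML; the marker is the example string from above. If the converter
escapes angle brackets, the marker in the output would look different, so a
plain-letters marker may be the safer choice.

```python
MARKER = "<content start>"  # example marker from this message; any unique string works


def strip_preamble(text, marker=MARKER):
    """Return everything after the first occurrence of marker.

    Assumption: if the marker is missing, the text is returned unchanged.
    Raising an error instead would be just as reasonable.
    """
    _head, found, tail = text.partition(marker)
    return tail if found else text
```

Called as strip_preamble(converted_text), it drops the rtf junk together with
the marker itself and keeps the rest of the document.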

I would try not to edit rtf2HTML, since it's likely I'll make a better one and
then you'd be stuck with the current inadequate version. Just build a wrapper
around it or call a second process.
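
A rough sketch of such a wrapper, calling the converter as a second process
and then cleaning up its output. I am assuming here that the script takes the
input path as its only argument and writes a .txt file next to it; check the
script's own documentation for the real invocation. The names convert_one,
script and out_dir are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

MARKER = b"<content start>"  # example marker; must match what the authors type


def convert_one(rtf_path, out_dir, script="rtf2HTML.py", marker=MARKER):
    """Convert one rtf file with an unmodified converter, then strip the junk."""
    rtf_path = Path(rtf_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # ASSUMPTION: the converter is invoked as "python <script> <input.rtf>".
    subprocess.run([sys.executable, str(script), str(rtf_path)], check=True)

    # ASSUMPTION: the converter writes <input>.txt beside the input file.
    converted = rtf_path.with_suffix(".txt")
    data = converted.read_bytes()  # bytes, so no encoding has to be guessed

    # Keep only what follows the marker; leave the data alone if it is absent.
    _head, found, tail = data.partition(marker)
    target = out_dir / converted.name
    target.write_bytes(tail if found else data)
    return target
```

Because the converter is only ever called from the outside, a newer version of
it can be dropped in later without touching this wrapper.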

m


martin.p.burns at uk.pwcglobal.com wrote:

> Memo from Martin P Burns of PricewaterhouseCoopers
>
> -------------------- Start of message text --------------------
>
> Hi Michael
>
> How easy do you think this would be to integrate into Zope?
>
> What I'm after is a setup where content editors can upload an
> RTF file to Zope and have it nicely drop into the standard template.
>
> Cheers
> Martin
>
> Francois Jordaan wrote:
>
> > A week ago, Michael Mell wrote,
> > > I've already written the basics of a simple tool in Python to convert rtf
> >
> > To get to the point, I'm looking for a simple conversion tool that'll take
> > Word docs or RTF and convert them to text with all extended characters
> > correctly converted to numeric entities. Does such a tool already exist?
> > Mike, does your Python tool do that?
>
> Yes. http://www.nthwave.net/rtf2HTML/
> I have not yet read the rtf spec or incorporated the plethora of rtf codes
> into the script. However, what is there works for me and is easily
> extendable.
> To include codes that your authors use, simply edit the two dictionaries at
> the top of the script. The script contains further documentation.
>
> The script will create a new file with a .txt extension. At the top of this
> new file, there will be about a page full of rtf junk that you can delete.
> The rest of the file will be your converted document.
>
> Please let me know how you would like this to be further improved (aside
> from the obvious one of including all the codes). I can't always read all of
> [thelist], so a private message will be more certain to get my attention.
>
> --------------------- End of message text --------------------
>
> This e-mail is sent by the above named in their
> individual, non-business capacity and is not on
> behalf of PricewaterhouseCoopers.
>
> PricewaterhouseCoopers may monitor outgoing and incoming
> e-mails and other telecommunications on its e-mail and
> telecommunications systems.
> ----------------------------------------------------------------
> The information transmitted is intended only for the person or entity to
> which it is addressed and may contain confidential and/or privileged
> material.  Any review, retransmission, dissemination or other use of, or
> taking of any action in reliance upon, this information by persons or
> entities other than the intended recipient is prohibited.   If you received
> this in error, please contact the sender and delete the material from any
> computer.

--
mike[at]nthwave.net
llemekim         YahooIM
415.455.8812     voice
419.735.1167     fax




