[thelist] Re: Do I need a CMS

Techwatcher techwatcher at accesswriters.com
Fri Jun 14 08:56:01 CDT 2002


> I am building a site now with an Articles section.  This Articles section
> will contain the approx 1000 articles the site owner has written in MS Word.
> I have a utility to convert them (en masse) to HTML and I am writing a Perl
> script to strip off everything from these HTML files but the actual article
> text.  The articles are divided into categories and some of the categories
> are divided into subcategories.
[snip]
> I am wondering if I am missing any functionality here that might be useful.

Hi, Hershel --

Many years ago, I read my first book about HTML and realized instantly
that because I used a WP package (XyWrite) that was pure ASCII, and its
control codes were all optionally hidden inside << and >> (i.e., the
double European quotation marks), AND because this package came with
its own proper programming language, I could easily write a program to
take my existing text files and convert them to HTML.

So I have a few suggestions for you, since I've essentially already
done what you're about to do, but in a more convenient form. The
strategy to follow is this:
Convert every formatting code you'd like to preserve. Then strip out
any remaining formatting codes.
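
A minimal sketch of that convert-then-strip order, written in Python
rather than Perl (the regexes should carry over). It assumes the input
is HTML from a Word converter; the KEEP mapping and the function name
are hypothetical, so adjust them to whatever your utility emits.

```python
import re

# Hypothetical mapping: source tags worth preserving -> tags to emit.
KEEP = {"b": "strong", "i": "em"}

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9:]*)[^>]*>")

def convert_then_strip(text: str) -> str:
    """Convert the tags listed in KEEP, then strip every other tag."""
    def convert(match):
        closing, name = match.group(1), match.group(2).lower()
        if name in KEEP:
            # Use placeholder delimiters so the strip pass leaves these alone.
            return "\x00" + closing + KEEP[name] + "\x01"
        return match.group(0)

    text = TAG.sub(convert, text)            # step 1: convert what we keep
    text = re.sub(r"<[^>]*>", "", text)      # step 2: strip what remains
    # Step 3: turn the placeholders back into real angle brackets.
    return text.replace("\x00", "<").replace("\x01", ">")

sample = '<span class="x"><b>Hi</b> <font size=2>there</font></span>'
print(convert_then_strip(sample))
```

Doing the conversion first matters: if you strip first, the codes you
wanted to preserve are already gone.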

First, if a particular text file has a header (text to appear at the top
of all pages), you might want to convert that to a TITLE tag. (Of course,
leave the auto-page numbering stuff to strip out later.) Or, since you
plan to use a CMS, you might want to have all your titles blank
initially... up to you.

Second, create boilerplate for your program to insert at the top of every
file (you know, the DTD declaration if any, the title if any, the head
if any, and the body tag), and at the bottom (close the body tag, close
the html tag).
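
Here is a rough Python sketch of that boilerplate step. The HTML 4.01
Transitional doctype is only my assumption about what you would want,
and wrap_page is a made-up name; pass an empty title if you would rather
fill titles in later.

```python
from html import escape

# Boilerplate inserted above the article text (doctype choice is an assumption).
HEAD = """<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>{title}</title>
</head>
<body>
"""

# Boilerplate inserted below the article text.
FOOT = """</body>
</html>
"""

def wrap_page(body: str, title: str = "") -> str:
    """Surround already-converted article text with the page boilerplate."""
    # Escape the title so a stray & or < in it cannot break the markup.
    return HEAD.format(title=escape(title)) + body.rstrip("\n") + "\n" + FOOT

print(wrap_page("<p>Article text here.</p>", "Cats & Dogs"))
```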

Next, headings and subheadings: did he set up styles? If he did,
convert the appropriate heading style to the correct numbered H# tag.
If he didn't, can you discern what levels he used for what combination
of centering, bold or italic, or font changes? If so, convert those.
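
A hedged Python sketch of both cases. It assumes (and this is purely an
assumption about your converter's output) that styled headings arrive as
<p class="HeadingN"> and that manual headings are centered, wholly bold
paragraphs; check a few real files and change the patterns to match.

```python
import re

# Case 1: the author used heading styles (hypothetical converter output).
STYLE_HEADING = re.compile(
    r'<p\s+class="Heading([1-6])"\s*>(.*?)</p>', re.IGNORECASE | re.DOTALL)

# Case 2: no styles, but headings were centered and bold (also hypothetical).
MANUAL_HEADING = re.compile(
    r'<p\s+align="center"\s*>\s*<b>(.*?)</b>\s*</p>', re.IGNORECASE | re.DOTALL)

def convert_headings(html: str, manual_level: int = 2) -> str:
    """Map styled headings to H1..H6 and manual ones to a chosen level."""
    def styled(match):
        level, text = match.group(1), match.group(2).strip()
        return f"<h{level}>{text}</h{level}>"

    def manual(match):
        return f"<h{manual_level}>{match.group(1).strip()}</h{manual_level}>"

    html = STYLE_HEADING.sub(styled, html)
    return MANUAL_HEADING.sub(manual, html)

sample = ('<p class="Heading1">Intro</p>\n'
          '<p align="center"><b>Scope</b></p>\n'
          '<p>Text</p>')
print(convert_headings(sample))
```

Run this before the bold/italic conversion, otherwise the <b> that marks
a manual heading will already have been turned into something else.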

Fourth, convert any italics or bold the author used. I converted to EM
and STRONG tags, but you might want to check his text. If it's academic
writing and his italics normally are used for citations, use the CITE
tag instead. Also, if he used underlining at all, check why and convert
appropriately. (Many academic authors use underlining for citations.)
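
One way to keep that decision adjustable is to make the target tags
parameters. This Python sketch assumes the converter emits plain <b>,
<i> and <u> tags; the function name and its defaults are mine, not
anything standard.

```python
import re

# Matches <b>, <i>, <u> and their closing forms, but not <body> or <br>.
EMPHASIS = re.compile(r"<(/?)([biu])\b[^>]*>", re.IGNORECASE)

def convert_emphasis(html: str, italic_tag: str = "em",
                     underline_tag: str = "cite") -> str:
    """Replace presentational b/i/u tags with the chosen semantic tags."""
    mapping = {"b": "strong", "i": italic_tag, "u": underline_tag}

    def swap(match):
        # Any attributes on the old tag are dropped along the way.
        return "<" + match.group(1) + mapping[match.group(2).lower()] + ">"

    return EMPHASIS.sub(swap, html)

sample = "<b>Note</b>: see <i>Nature</i> and <u>Science</u><br>"
print(convert_emphasis(sample))                     # italics become EM
print(convert_emphasis(sample, italic_tag="cite"))  # italics become CITE
```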

The LAST thing you want to do is strip out all other control codes. The
trickiest thing you probably want to do is replace all CR-LF character
combinations with something like </p>CR-LF<p>.
Why backwards? Because it's easiest to then remove the first </p> from
your file, and the last <p> from your file, within your program. There
will still be some cleanup to do, but it should be minimal. For
example, P tags wrapped around H tags can be found and deleted
automatically with a regex.
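
The backwards trick, sketched in Python. It assumes one paragraph per
CR-LF-terminated line, and it pads the text with a leading and trailing
CR-LF so that the first and last paragraphs get both of their tags; that
padding and the function name are my own additions.

```python
import re

def mark_paragraphs(text: str) -> str:
    """Turn CR-LF line ends into paragraph boundaries, then clean up."""
    # Make sure the text is bracketed by CR-LF (an assumption of this sketch).
    if not text.startswith("\r\n"):
        text = "\r\n" + text
    if not text.endswith("\r\n"):
        text += "\r\n"

    text = text.replace("\r\n", "</p>\r\n<p>")
    text = text.replace("</p>", "", 1)        # remove the first </p>
    head, _, tail = text.rpartition("<p>")    # remove the last <p>
    text = head + tail

    # Cleanup: blank lines became empty paragraphs, so drop them.
    text = re.sub(r"<p>\s*</p>\r\n", "", text)
    # Cleanup: delete P tags wrapped around a heading.
    text = re.sub(r"<p>(<h([1-6])>.*?</h\2>)</p>", r"\1", text)
    return text.lstrip("\r\n")

sample = "<h1>Title</h1>\r\nFirst.\r\n\r\nSecond.\r\n"
print(mark_paragraphs(sample))
```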

Hope this helps.

Cheers --
Carol (techwatcher)


