[thelist] how to avoid character encoding problems

Sarah Adams mr.sanders at geekjock.ca
Fri Apr 13 08:44:12 CDT 2007


Hopefully, you won't all tell me that it's impossible to do! Basically,
I'm trying to come up with a plan to minimize my future character
encoding woes for a site that's in beta now. I'm willing to put in some
extra work now to avoid such work later on.

Right now there are a few issues with data that has already been entered
into the site: accented characters were somehow saved incorrectly in the
database, so all we have left are question marks. I'm guessing there was
an encoding mismatch somewhere along the way, and characters entered as
ASCII were saved as UTF... or something like that! To be honest, despite
a lot of reading, I'm not quite getting it. All the articles I find on
the subject seem just a little bit too technical for me to really *get*.
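
To make the "question marks" symptom concrete, here is a minimal Python
sketch of two common kinds of mismatch. It is only an illustration: the
sample word "café" and the variable names are made up, and nothing here is
taken from the actual site or database.

```python
# Assumption: the word "café" stands in for any accented form input.
original = "café"

# Case 1: the text is forced into a character set that has no "é".
# The encoder substitutes "?" and the original character is lost for good.
lossy = original.encode("ascii", errors="replace")

# Case 2: the text is written as UTF-8 but read back as ISO-8859-1.
# Each of the two UTF-8 bytes of "é" shows up as its own character.
utf8_bytes = original.encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")

# Case 2 is reversible as long as the bytes themselves were not altered.
repaired = misread.encode("iso-8859-1").decode("utf-8")

print(lossy)       # b'caf?'
print(utf8_bytes)  # b'caf\xc3\xa9'
```

If the stored data really contains literal "?" characters (case 1), the
accents probably cannot be recovered from the database alone; garbled
two-character sequences (case 2) usually can.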

Also, it's my understanding that utf-8 is recommended over iso-8859-1
because it's more forward compatible and supports more
languages/characters, but whenever I've tried to switch sites from
iso-8859-1 to utf-8 I've run into all kinds of trouble with accented
characters (usually French and German) and some punctuation (em dashes,
"smart" quotes, etc.) and decided it wasn't worth it. But since I'm in
the early stages with this site, I'd like to do things right from the
start, even if I have to fix some of the existing data (stored both in
the HTML/PHP files and in the MySQL database).
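
One possible reason the punctuation in particular causes trouble (an
assumption, since the original files aren't shown): "smart" quotes and em
dashes do not exist in ISO-8859-1 at all. Text typed on Windows usually
stores them as Windows-1252 bytes in the 0x80-0x9F range, which
ISO-8859-1 treats as control characters. A rough Python sketch of a
conversion that takes this into account, with a made-up sample string and
a hypothetical helper name `to_utf8`:

```python
# Hypothetical sample: bytes as a Windows editor might have saved them.
# 0x93/0x94 are curly quotes and 0x97 is an em dash in Windows-1252.
legacy_bytes = b"\x93Voil\xe0\x94 \x97 caf\xe9"


def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode with the encoding the bytes were really written in, then re-encode."""
    return raw.decode(source_encoding).encode("utf-8")


utf8_bytes = to_utf8(legacy_bytes)
text = utf8_bytes.decode("utf-8")

# Decoding the same bytes as strict ISO-8859-1 yields invisible control
# characters (U+0093, U+0094, U+0097) where the punctuation should be.
as_latin1 = legacy_bytes.decode("iso-8859-1")
```

The point of the sketch is only that the source encoding has to be
identified correctly before converting; guessing iso-8859-1 when the data
is really Windows-1252 keeps the accents but loses the punctuation.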

I know that in ColdFusion there are a lot of little things you can do to
ensure you don't have character encoding mismatches along the way:

- setting the encoding in ColdFusion Administrator (i.e. the value that
ColdFusion will send in all page headers)

- including this in Application.cfm:
  <cfscript>
    SetEncoding('form', 'iso-8859-1');
    SetEncoding('url', 'iso-8859-1');
  </cfscript>
  <cfcontent type="text/html; charset=iso-8859-1">

- including this at the top of *all* scripts:
  <cfprocessingdirective pageencoding="iso-8859-1">

- including this in the HTML:
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

- and setting the datasource to use the proper encoding

So here are my questions:

1) What are similar techniques to ensure proper character encoding and
no mismatches for a site hosted on Apache and written in PHP?

2) Which character encoding should I use, iso-8859-1 or utf-8?

3) Any suggestions for making the switch from one encoding to the other
less painful?

4) Depending on which encoding I use, does it matter how accented
characters are entered in the site? If I understand correctly, typing
Alt + 130 (on the num pad) will give you an "é" (that's a lowercase "e"
with an acute accent), but so will Alt + 0233 (é) - the former being in
ASCII, the latter in UTF. Is this correct? Do you have to enter the
characters using one method or the other depending on the encoding of
the site, or are they interchangeable?
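
For anyone who wants to check the Alt-code part of this question, here is
a small Python sketch. It assumes a Windows setup where Alt codes without
a leading zero go through the OEM code page (CP437 here) and codes with a
leading zero go through Windows-1252; other locales may use different
code pages. The variable names are just for illustration.

```python
# Assumption: OEM code page is CP437 and the "ANSI" code page is
# Windows-1252 (typical for US English Windows setups).
from_alt_130 = bytes([130]).decode("cp437")    # Alt + 130
from_alt_0233 = bytes([233]).decode("cp1252")  # Alt + 0233

# Both keystrokes end up as the same character, U+00E9.
same_character = from_alt_130 == from_alt_0233
code_point = hex(ord(from_alt_130))

# What differs is how that one character is written out as bytes,
# which depends on the encoding of the page or file, not on the keystroke.
as_latin1 = from_alt_130.encode("iso-8859-1")  # one byte
as_utf8 = from_alt_130.encode("utf-8")         # two bytes

print(same_character, code_point)  # True 0xe9
print(as_latin1, as_utf8)          # b'\xe9' b'\xc3\xa9'
```

Under those assumptions the two input methods look interchangeable; the
bytes only diverge once the character is saved in a particular encoding.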

-- 
sarah adams
web developer & programmer
portfolio: http://sarah.designshift.com
blog: http://hardedge.ca


