[thelist] how to avoid character encoding problems
kasimir-k
kasimir.k.lists at gmail.com
Fri Apr 13 17:48:49 CDT 2007
Sarah Adams wrote on 13/04/2007 13:44:
> all the articles I read on the subject are just a little bit too
> technical for me to really *get*.
This might be way too elementary for you, but when I have to explain
character encoding issues to a non-technical person, I usually say
something along these lines:
Computers deal only with numbers. So anything that is not a number must
first be converted to a number - for letters, this is called character
encoding, and there are many standards for it. To get those numbers back
to letters, one must know which standard was used to create the numbers
from the letters in the first place.
Sorry if that sounded too naïve, but I think it often helps to get down
to the very basics.
> - including this in the HTML:
> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
It's actually quite important to make that available already in the HTTP
header - the HTML meta tag is a nice and often useful extra, but it's
not the real thing. In PHP you'd just (before any output is sent):
header("Content-Type: text/html; charset=iso-8859-1");
> 1) What are similar techniques to ensure proper character encoding and
> no mismatches for a site hosted on Apache and written in PHP?
There's quite a lot to this, but to get started I'll throw in a few things.
With Apache there's not much you can do - sometimes I put in .htaccess:
AddDefaultCharset utf-8
With PHP there's more... The fundamental thing to bear in mind is that
(until we get PHP 6) for PHP one character equals one byte - which
doesn't really play nice with multibyte character encodings like UTF-8
(many UTF-8 characters do take only one byte, but others take two,
three or even four bytes). So if you want to do things like compare
strings or check their length, you should use the multibyte string
functions <http://php.net/mbstring>. The unfortunate thing is that many
hosts don't provide PHP with them, but on the other hand, you'll often
have handled your strings in MySQL already, which is much more UTF-8
friendly.
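To make the one-byte-per-character pitfall concrete, here's a small
sketch (assuming the mbstring extension is available and the script file
itself is saved as UTF-8):

```php
<?php
// "Résumé" is 6 characters, but in UTF-8 each "é" takes two bytes,
// so the byte-oriented strlen() reports 8.
$s = "Résumé";

echo strlen($s), "\n";              // 8 - counts bytes
echo mb_strlen($s, "UTF-8"), "\n";  // 6 - counts characters

// Byte-oriented substr() can slice a multibyte character in half;
// the mb_ version respects character boundaries:
echo mb_substr($s, 0, 2, "UTF-8"), "\n"; // "Ré"
```

That's why mixing strlen/substr with UTF-8 data silently corrupts text,
while the mb_* functions keep it intact.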
Speaking of MySQL, there are a couple of gotchas there too. If you start
with UTF-8 from the very beginning there are a lot fewer of them, but I
still want to remind you to check that the MySQL server, the connection
and the client all use utf8.
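A sketch of what that check might look like (assuming MySQL 4.1 or
later; the table and column names are just examples):

```sql
-- These variables should all report utf8 (server, database,
-- connection, client, results):
SHOW VARIABLES LIKE 'character_set%';

-- Declare the charset explicitly when creating tables:
CREATE TABLE articles (
  id INT PRIMARY KEY,
  body TEXT
) CHARACTER SET utf8;

-- And make the connection itself use UTF-8 - run this right
-- after connecting (e.g. with mysql_query() from PHP):
SET NAMES utf8;
```

If the connection charset doesn't match what PHP sends, MySQL will
happily transcode or mangle your data on the way in - and that kind of
corruption is much harder to fix afterwards than to prevent.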
> 2) Which character encoding should I use, iso-8859-1 or utf-8?
UTF-8 - no question about it. ISO-8859-1 doesn't even have the euro sign
(€), so it's no good in a European context.
> 3) Any suggestions for making the switch from one encoding to the other
> less painful?
Plenty of painkillers :-)
I like EditPad Pro <http://www.editpadpro.com/> - I can easily try
different encodings for a text until I find the right one, and just as
easily change the encoding and resave - the easiest way to get from,
say, ISO-8859-1 to UTF-8.
> 4) Depending on which encoding I use, does it matter how accented
> characters are entered in the site? If I understand correctly, typing
> Alt + 130 (on the num pad) will give you an "é" (that's a lowercase "e"
> with an acute accent), but so will Alt + 0233 (é) - the former being in
> ASCII, the latter in UTF. Is this correct? Do you have to enter the
> characters using one method or the other depending on the encoding of
> the site, or are they interchangeable?
This is a good - if a bit confused ;-) - question. First, I assume that
when you say "entered in the site" you mean forms that are submitted to
the server. Sorry if that seems like nitpicking, but it's actually
crucial here - what matters is how the entered text is stored in the DB.
Remember, there's no letters, only numbers. The client (browser) sends a
sequence of bytes (numbers) to the server, and the server must make
sense of them - and in order to do that, it must know what encoding the
client used.
Unfortunately the ugly truth is that the specs are, well, less than
perfect here... On form submission using the default content type, they
say nothing about encoding non-ASCII characters[0]!
But fortunately, practice works better than the specs. Usually browsers
encode the content of form inputs using the encoding specified for the
document containing the form. So if your form page was served as UTF-8,
then your server will receive UTF-8. To be on the safe side, you can
specify your form's accept-charset attribute, and set the enctype
attribute to multipart/form-data[1].
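Put together, such a form might look like this (save.php is just a
made-up handler name; the page itself would be served as UTF-8):

```html
<!-- accept-charset states explicitly which encoding the server
     expects; multipart/form-data lets the browser label the
     submission's charset instead of leaving it implicit. -->
<form action="save.php" method="post"
      accept-charset="utf-8" enctype="multipart/form-data">
  <input type="text" name="title" />
  <input type="submit" value="Save" />
</form>
```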
And how the text is input in the form field does not matter, as the
encodings cannot be mixed. Actually, you are a bit misguided with
those e-acutes - with the ways you described you get neither ASCII nor
UTF-8. (Assuming Windows) with alt-xxx you get a character from the
ancient code page 437[2], and with alt-0xxx you get a Windows-1252[3]
character. But both ways will enter an e-acute in the form, and on
submission that e-acute is then encoded using the form's/document's
encoding.
One last thing: you'll often see the phrases "character set" and
"character encoding" used as if they were the same thing. Often that is
the case in practice, but it's important to bear in mind that they are
two distinct things. With, for example, ASCII they are interchangeable:
'A' is always encoded as 65. But with Unicode it's not that simple. The
Unicode character set basically includes (or has room for) 1,114,112
different characters (not all are currently used, though) - if we wanted
to use the same number of bytes for each character, we'd need four of
them. But if most of the characters we use are in basic Latin, which can
be expressed with just one byte, four bytes would be a terrible waste of
space. So the Unicode character set has various character encodings,
like UTF-32, which always uses four bytes per character, and UTF-8,
which uses just one byte where possible.
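You can see the set/encoding distinction directly in PHP (again
assuming mbstring is available and the file is saved as UTF-8): one and
the same character - one code point in the Unicode character set -
takes a different number of bytes depending on the encoding chosen.

```php
<?php
// The euro sign is the single Unicode code point U+20AC,
// but its byte length varies by encoding:
$euro = "€"; // this source file is saved as UTF-8

echo strlen($euro), "\n";                                         // 3 bytes in UTF-8
echo strlen(mb_convert_encoding($euro, "UTF-16", "UTF-8")), "\n"; // 2 bytes in UTF-16
echo strlen(mb_convert_encoding($euro, "UTF-32", "UTF-8")), "\n"; // 4 bytes in UTF-32
```

Same character set, same code point - three different encodings, three
different byte counts.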
.k
[0] <http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1>
[1] <http://www.w3.org/TR/html401/interact/forms.html#h-17.3>
[2] <http://en.wikipedia.org/wiki/Code_page_437>
[3] <http://en.wikipedia.org/wiki/Windows_1252>