[thelist] how to avoid character encoding problems

kasimir-k kasimir.k.lists at gmail.com
Fri Apr 13 17:48:49 CDT 2007


Sarah Adams wrote on 13/04/2007 13:44:
> all the articles I read on the subject are just a little bit too
> technical for me to really *get*.

This might be way too elementary for you, but when I have to explain
character encoding issues to a non-technical person, I usually say
something along these lines:

Computers deal only with numbers. So anything that is not a number must
first be converted to a number - for letters, this is called character
encoding, and there are many standards for it. To get those numbers back
to letters, one must know which standard was used to create those
numbers from the letters.
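
To make that concrete, here is a minimal sketch in Python (picked only
because it is easy to run; the variable names are made up for the
illustration). It turns the same word into numbers under two standards,
then reads the numbers back with the right and with the wrong standard:

```python
# The same letters become different numbers under different standards.
text = "naïve"

as_utf8 = text.encode("utf-8")          # letters -> numbers, UTF-8 rules
as_latin1 = text.encode("iso-8859-1")   # letters -> numbers, ISO-8859-1 rules

# Reading with the standard that was used for writing gives the text back.
round_trip = as_utf8.decode("utf-8")

# Reading UTF-8 numbers as if they were ISO-8859-1 gives garbage instead.
garbled = as_utf8.decode("iso-8859-1")

print(list(as_utf8))     # six numbers: the i-diaeresis takes two of them
print(list(as_latin1))   # five numbers: the i-diaeresis takes one
print(round_trip)
print(garbled)
```

The garbled result is the familiar "two odd characters where one accented
letter should be" effect you see on pages with a mismatched charset.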

Sorry if that sounded too naïve, but I think it often helps to get down
to the very basics.

> - including this in the HTML:
>   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

It's actually quite important to make that available already in the HTTP
header - the HTML meta tag is a nice and often useful extra, but it's
not the real thing: if both are present, the HTTP header wins. In PHP
you'd just call this before sending any output:
header("Content-Type: text/html; charset=iso-8859-1");

> 1) What are similar techniques to ensure proper character encoding and
> no mismatches for a site hosted on Apache and written in PHP?

There's quite a lot to this, but to get started I'll throw in a few things.

With Apache there's not much you can do - sometimes I put in .htaccess:
AddDefaultCharset utf-8

With PHP there's more... The fundamental thing to bear in mind is that
(until we get PHP6) for PHP one character equals one byte - which
doesn't really play nice with multibyte character encodings like UTF-8
(many UTF-8 characters actually do take only one byte, but others take
two, three or even four bytes). So if you want to do things like compare
strings or check their length, you should use the multibyte string
functions <http://php.net/mbstring>. The unfortunate thing is that many
hosts don't provide PHP with them, but on the other hand, you'll often
have filtered your strings already in MySQL, which is much more UTF-8
friendly.
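
The byte-versus-character difference is easy to see in a small sketch.
This one is in Python rather than PHP, purely for illustration, and the
variable names are made up. Counting the encoded bytes corresponds to
what PHP's strlen() reports, counting the characters to what mb_strlen()
reports when it is told the string is UTF-8:

```python
word = "café"
encoded = word.encode("utf-8")

char_count = len(word)       # counts characters
byte_count = len(encoded)    # counts bytes; the e-acute takes two

# Cutting at a byte position can split a character in half. Here the
# fourth byte is only the first half of the e-acute, so decoding the
# slice leaves a replacement character at the end.
cut_by_bytes = encoded[:4].decode("utf-8", errors="replace")

# Cutting at a character position keeps the e-acute whole.
cut_by_chars = word[:4]

print(char_count, byte_count)
print(cut_by_bytes)
print(cut_by_chars)
```

That is the kind of breakage you risk when byte-oriented string
functions are used for length checks or truncation on UTF-8 text.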

Speaking of MySQL, there are a couple of gotchas there too. If you start
with UTF-8 from the very beginning, there are far fewer of them, but
still I want to remind you to check that the MySQL server, connection
and client all use utf8.

> 2) Which character encoding should I use, iso-8859-1 or utf-8?

UTF-8 - no question about it. ISO-8859-1 doesn't even have the euro sign
(€), so it's no good in a European context.
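
If you want to check the euro sign claim yourself, a tiny sketch like
this will do (Python, used only for illustration; the variable names are
my own):

```python
euro = "€"

# ISO-8859-1 has no slot for the euro sign, so encoding fails.
try:
    euro.encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

# UTF-8 can represent it, using three bytes.
utf8_bytes = euro.encode("utf-8")

print(fits_in_latin1)
print(list(utf8_bytes))
```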

> 3) Any suggestions for making the switch from one encoding to the other
> less painful?

Plenty of painkillers :-)

I like EditPad Pro <http://www.editpadpro.com/> - I can easily try
different character encodings for a text until I find the right one, and
I can just as easily change the encoding and resave - the easiest way to
get from, say, ISO-8859-1 to UTF-8.
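
If you have many files, the same "open as one encoding, save as another"
step can be scripted. The following is only a sketch in Python under my
own assumptions: convert_file and its parameter names are hypothetical,
and you have to know (or guess correctly) the source encoding yourself.

```python
from pathlib import Path


def convert_file(src, dst, src_encoding="iso-8859-1", dst_encoding="utf-8"):
    """Re-encode the text in src and write the result to dst."""
    # Work on raw bytes so that line endings are left exactly as they were.
    raw = Path(src).read_bytes()
    text = raw.decode(src_encoding)        # numbers -> letters, old standard
    Path(dst).write_bytes(text.encode(dst_encoding))  # letters -> numbers, new


# Example call (file names are placeholders):
# convert_file("page-latin1.html", "page-utf8.html")
```

One thing to watch: decoding as ISO-8859-1 never fails, because every
byte value means something in it. So if the source was actually in some
other encoding, you get no error, just wrong letters. Check a few
accented characters in the result before trusting it.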

> 4) Depending on which encoding I use, does it matter how accented
> characters are entered in the site? If I understand correctly, typing
> Alt + 130 (on the num pad) will give you an "é" (that's a lowercase "e"
> with an acute accent), but so will Alt + 0233 (é) - the former being in
> ASCII, the latter in UTF. Is this correct? Do you have to enter the
> characters using one method or the other depending on the encoding of
> the site, or are they interchangeable?

This is a good - if a bit confused ;-) question. First, I assume when 
you say "entered in the site" you mean forms that are submitted to the 
server. Sorry if that seems nitpicking, but it's actually crucial here - 
what matters is how the entered text is stored in the DB.

Remember, there are no letters, only numbers. The client (browser) sends a
sequence of bytes (numbers) to the server, and the server must make 
sense of them - and in order to do that, it must know what encoding the 
client used.

Unfortunately the ugly truth is that the specs are, well, less than 
perfect here... On form submission using the default content type, they 
say nothing about encoding non-ASCII characters[0]!

But fortunately the practice works better than the specs. Usually the
browsers encode the content of form inputs using the encoding specified
for the document containing the form. So if your form page was served as
UTF-8, then your server will receive UTF-8. To be on the safe side, you
can specify your form's accept-charset attribute, and set the enctype
attribute to multipart/form-data[1].
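
To see why the server has to know the encoding the browser used, here is
a rough simulation in Python using its standard urllib.parse module. The
field name "name" and the variable names are made up, and this only
imitates what a browser does with the default form content type:

```python
from urllib.parse import parse_qs, urlencode

fields = {"name": "é"}

# What gets sent depends on the encoding the browser used.
sent_as_utf8 = urlencode(fields, encoding="utf-8")
sent_as_latin1 = urlencode(fields, encoding="iso-8859-1")

# The server gets the right text back only if it assumes that same encoding.
read_correctly = parse_qs(sent_as_utf8, encoding="utf-8")
read_wrongly = parse_qs(sent_as_utf8, encoding="iso-8859-1")

print(sent_as_utf8)      # the e-acute travels as two escaped bytes
print(sent_as_latin1)    # the e-acute travels as one escaped byte
print(read_correctly)
print(read_wrongly)
```

Nothing in the submitted data itself says which of the two it is, which
is exactly the gap in the specs mentioned above.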

And how the text is input in the form field does not matter, as the
encodings cannot be mixed. And actually, you are a bit misguided with
those e-acutes - with the ways you described you get neither ASCII nor
UTF-8. (Assuming Windows) with alt-xxx you get a character from the
ancient code page 437[2], and with alt-0xxx you get a Win-1252[3]
character. But both ways will enter e-acute in the form, and on
submission that e-acute is then encoded using the form's/document's
encoding.
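
You can check the two Alt codes with a few lines of Python (used here
only for illustration; the variable names are my own). Both numbers lead
to the same character, and what is sent afterwards depends only on the
page's encoding:

```python
# Alt + 130 looks up number 130 in code page 437.
from_alt_130 = bytes([130]).decode("cp437")

# Alt + 0233 looks up number 233 in Windows-1252.
from_alt_0233 = bytes([233]).decode("cp1252")

# Both are the same character, e-acute (U+00E9).
same_character = from_alt_130 == from_alt_0233

# On submission it is encoded with the document's encoding, whichever
# way it was typed.
sent_from_utf8_page = from_alt_130.encode("utf-8")
sent_from_latin1_page = from_alt_130.encode("iso-8859-1")

print(same_character)
print(list(sent_from_utf8_page))
print(list(sent_from_latin1_page))
```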

One last thing: often you'll see the phrases "character set" and
"character encoding" used as if they were the same thing. Often that is
the case in practice, but it's important to bear in mind that they are
two very distinct things. With ASCII, for example, they are
interchangeable: 'A' is always encoded as 65. But with Unicode it's not
that simple. Basically the Unicode character set includes (or has place
for) 1,114,112 different characters (not all are currently used though)
- if we wanted to use the same number of bytes for each character, we'd
need at least three of them, and in practice four are used. But if most
characters we use are in Basic Latin, which can be expressed with just
one byte, four bytes would be a terrible waste of space. So the
character set of Unicode has various character encodings, like UTF-32,
which always uses four bytes for one character, and UTF-8, which uses
just one byte where possible.
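
A short Python sketch (for illustration only; the sample characters and
names are my own picks) shows one character set with two encodings of
very different sizes:

```python
# One character set (Unicode), two of its encodings.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro, an emoji

sizes = {}
for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    # "utf-32-be" is used so that no byte order mark is added to the count.
    utf32_len = len(ch.encode("utf-32-be"))
    sizes[ch] = (utf8_len, utf32_len)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-32 {utf32_len} bytes")

# The highest Unicode code point is U+10FFFF, which is where the figure
# of 1,114,112 possible characters comes from.
total_code_points = 0x10FFFF + 1
print(total_code_points)
```

The character number (the code point) is the same in both cases; only
the way it is written out as bytes differs.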

.k


[0] <http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1>
[1] <http://www.w3.org/TR/html401/interact/forms.html#h-17.3>
[2] <http://en.wikipedia.org/wiki/Code_page_437>
[3] <http://en.wikipedia.org/wiki/Windows_1252>


