[thelist] UTF-8/FormMail/PHP headaches

Andrew Clover and at doxdesk.com
Thu May 23 09:38:07 CDT 2002


Peter Johansson <peter at johansson.org> wrote:

> I can understand that the characters are garbled, using different charsets
> and all, but how can I make my FormMail-script to cope with both variants
> of encoding?

You can't.

Browsers *should*, according to the form-data RFC (RFC 2388), include a Content-Type
header with charset in every subpart of the submission (presumably unless
only us-ascii characters are used).
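
For illustration, here is roughly what reading such a labelled subpart would
look like if browsers did send it. This is only a sketch in Python: the part
below is a made-up example (the field name "comment" and its content are
hypothetical), and it uses the standard library's email parser rather than any
particular FormMail code.

```python
from email import message_from_bytes

# A hypothetical form-data subpart, labelled the way the RFC asks for.
# Real browsers generally omit the Content-Type line shown here.
raw_part = (
    b'Content-Disposition: form-data; name="comment"\r\n'
    b'Content-Type: text/plain; charset=iso-8859-1\r\n'
    b'\r\n'
    b'caf\xe9'
)

part = message_from_bytes(raw_part)

# The charset parameter of the subpart, or None if the browser left it out.
charset = part.get_content_charset()

# Raw bytes of the field value, decoded with the declared charset.
value = part.get_payload(decode=True).decode(charset)
```

With the header present the charset is known and the value decodes cleanly;
without it, get_content_charset() returns None and you are back to guessing.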

However, no browser does this. This means there is no way to detect what
charset was submitted, and the only thing you can do is assume the charset
of the document containing the form. In modern browsers (even N4, usually)
that assumption holds. For older browsers, you could detect non-UTF-8
content by looking for invalid UTF-8 sequences.

UTF-8 encodes every non-ASCII character as a lead byte between 0xC0 and
0xFF followed by one or more continuation bytes in the range 0x80 to 0xBF
(the number depends on the lead byte). So a simple check is to see if there
are any 0xC0-0xFF bytes not followed by 0x80-0xBF, or any 0x80-0xBF bytes
that do not belong to such a sequence. In either case you know you don't
have UTF-8 and you can try a different encoding.
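
The check described above could be sketched like this in Python. The function
names and the ISO-8859-1 fallback are assumptions made for the example, not
part of any FormMail script, and lead bytes 0xF8-0xFF are simply treated as
invalid here to keep the sketch short.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Structural check: each lead byte has the right number of
    continuation bytes, and no continuation byte stands alone."""
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:            # plain ASCII, nothing to verify
            i += 1
            continue
        if 0xC0 <= b <= 0xDF:   # lead byte of a 2-byte sequence
            need = 1
        elif 0xE0 <= b <= 0xEF:  # lead byte of a 3-byte sequence
            need = 2
        elif 0xF0 <= b <= 0xF7:  # lead byte of a 4-byte sequence
            need = 3
        else:
            # Stray continuation byte (0x80-0xBF), or 0xF8-0xFF,
            # which this sketch does not accept as a lead byte.
            return False
        trail = data[i + 1:i + 1 + need]
        if len(trail) != need or any(not 0x80 <= t <= 0xBF for t in trail):
            return False
        i += 1 + need
    return True


def decode_form_value(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode as UTF-8 when the bytes pass the check, else use the
    fallback charset (an assumption; pick whatever your users send)."""
    if looks_like_utf8(data):
        # errors="replace" because this structural check is looser than a
        # strict decoder (it does not reject overlong forms, for instance).
        return data.decode("utf-8", errors="replace")
    return data.decode(fallback)
```

Note that this is a heuristic: short ISO-8859-1 strings can occasionally
happen to be well-formed UTF-8, so a passing check does not prove the
submission really was UTF-8.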

> the page, but no success. I've also tried the "accept-charset" but I
> couldn't get that to work either.

accept-charset is the correct attribute, but no browser supports it.
Instead they use the charset of the document containing the form, as set
by the Content-Type header (either in HTTP or with a <meta> tag).
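
In practice that means serving the form page with a known charset and decoding
the submission with the same one. A minimal sketch in Python, assuming a
URL-encoded (not multipart) submission; PAGE_CHARSET and parse_form are names
invented for this example.

```python
from urllib.parse import parse_qs

# Assumption: the page containing the form is served with this charset.
PAGE_CHARSET = "utf-8"

# Value for the HTTP Content-Type header (or the equivalent <meta> tag).
CONTENT_TYPE = f"text/html; charset={PAGE_CHARSET}"


def parse_form(body: str, charset: str = PAGE_CHARSET) -> dict:
    """Decode a URL-encoded form body using the page's charset.

    errors="strict" makes a mismatch raise UnicodeDecodeError rather
    than silently producing garbled characters.
    """
    return parse_qs(body, keep_blank_values=True,
                    encoding=charset, errors="strict")
```

For example, parse_form("name=caf%C3%A9") decodes the field as UTF-8, while a
browser that submitted "name=caf%E9" instead would trigger the exception, at
which point you can retry with another charset.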

--
Andrew Clover
mailto:and at doxdesk.com
http://and.doxdesk.com/


