[thelist] charset, multipart/form-data and multipart/x-www-form-urlencoded

Sun Jan 24 08:29:02 CST 2010

Bill Moseley wrote on 2010-01-23 19:40:

> I'm curious why the browser is not telling me what character encoding it
> used.  Do I just have to assume that the character encoding is what I
> specified in the accept-charset in the <form> element?  Obviously, clients
> don't have to read my form before posting.  I do decode all content as utf-8
> (and thus an error will be generated if invalid utf8 is detected).
> 
> Just seems odd.  When sending a series of octets that represent text to some
> remote server sure seems like the client would need to specify the character
> encoding used to encode those octets.
> 
> Am I missing some fundamental part of http?
> 

I've been wondering about this too. I've now done some more reading, and
it still seems pretty messy :-}

HTTP spec says:[0]
"The "charset" parameter is used with some media types to define the 
character set (section 3.4) of the data. When no explicit charset 
parameter is provided by the sender, media subtypes of the "text" type 
are defined to have a default charset value of "ISO-8859-1" when 
received via HTTP. Data in character sets other than "ISO-8859-1" or its 
subsets MUST be labeled with an appropriate charset value. See section 
3.4.1 for compatibility problems."

So any HTTP message with a body (e.g. a POST request) should have a 
Content-Type header, and if the character encoding is not ISO-8859-1, 
that header should carry a "charset" parameter.
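
To make that rule concrete, here is a rough Python sketch of how a
receiver might work out the charset of a body from its Content-Type
header. The function name effective_charset is made up for
illustration, and applying the ISO-8859-1 default only to text/* types
is one reading of the quoted paragraph, not a description of what any
particular server does.

```python
from email.message import Message


def effective_charset(content_type):
    """Return the charset implied by a Content-Type header value, or None.

    Hypothetical helper: an explicit charset parameter wins; otherwise
    text/* falls back to ISO-8859-1 as in the quoted HTTP/1.1 text;
    anything else is reported as unknown.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    explicit = msg.get_content_charset()  # lowercased, or None if absent
    if explicit is not None:
        return explicit
    if msg.get_content_maintype() == "text":
        return "iso-8859-1"
    return None
```

A form POST whose header is just "application/x-www-form-urlencoded",
with no charset parameter, gives None here, which matches what Bill is
seeing.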

HTML spec says:[1]
"Note. The "get" method restricts form data set values to ASCII characters"
That is because GET only allows application/x-www-form-urlencoded, and 
for that content type (be it GET or POST) ...
"Non-alphanumeric characters are replaced by `%HH', a percent sign and 
two hexadecimal digits representing the ASCII code of the character. "
... only ASCII is allowed.

On the other hand, for multipart/form-data:
"As with all multipart MIME types, each part has an optional 
"Content-Type" header that defaults to "text/plain". User agents should 
supply the "Content-Type" header, accompanied by a "charset" parameter." 
(BTW, your example had only the HTTP message's own Content-Type, but 
the charset should go with each part's Content-Type.)
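
As a sketch of what "charset with each part" would look like on the
receiving side, the following Python reads a hand-written
multipart/form-data body and decodes every part with its own charset.
The boundary "XYZ", the field name "comment", the helper name
decode_parts and the US-ASCII fallback are all assumptions made for the
example; it uses the standard library's MIME parser rather than any
real web framework.

```python
from email.parser import BytesParser

# Hand-written request body for illustration only.
BODY = (
    b"--XYZ\r\n"
    b'Content-Disposition: form-data; name="comment"\r\n'
    b"Content-Type: text/plain; charset=UTF-8\r\n"
    b"\r\n"
    b"caf\xc3\xa9\r\n"
    b"--XYZ--\r\n"
)


def decode_parts(content_type, body):
    """Map each field name to its text, decoded with the part's own charset."""
    # Prepend the HTTP Content-Type so the MIME parser knows the boundary.
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    msg = BytesParser().parsebytes(raw)
    fields = {}
    for part in msg.get_payload():
        name = part.get_param("name", header="content-disposition")
        # Assumption: fall back to US-ASCII when a part declares no charset.
        charset = part.get_content_charset() or "us-ascii"
        fields[name] = part.get_payload(decode=True).decode(charset)
    return fields


fields = decode_parts("multipart/form-data; boundary=XYZ", BODY)
```

If a part carries no Content-Type of its own, this sketch falls back to
US-ASCII and will raise on the first non-ASCII byte, so a real receiver
would need some other default.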

So if things worked as the specs say, we should be using 
multipart/form-data if our application works in UTF-8, and GET forms 
could not use UTF-8, only ASCII.

But that's not how it goes in the real world: 
application/x-www-form-urlencoded is used for non-ASCII forms too. Here 
are some interesting discussions on this [2][3][4][5]. It seems that if 
you declare the page containing the form to be UTF-8 encoded, you can be 
fairly sure that the browser sends you UTF-8 back.
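
A small Python illustration of why the page encoding matters: the same
field value percent-encodes to different bytes depending on the
encoding used, and nothing in the body itself says which one it was.
The field name and value are made up, and urlencode here merely stands
in for whatever a browser does internally.

```python
from urllib.parse import parse_qs, urlencode

form = {"name": "caf\u00e9"}  # made-up field and value

as_utf8 = urlencode(form, encoding="utf-8")      # encoded as UTF-8 bytes
as_latin1 = urlencode(form, encoding="latin-1")  # encoded as ISO-8859-1 bytes

# Strict UTF-8 decoding round-trips the first body...
decoded = parse_qs(as_utf8, encoding="utf-8", errors="strict")

# ...and rejects the second, because a lone %E9 is not valid UTF-8.
try:
    parse_qs(as_latin1, encoding="utf-8", errors="strict")
    latin1_rejected = False
except UnicodeDecodeError:
    latin1_rejected = True
```

Here as_utf8 is "name=caf%C3%A9" and as_latin1 is "name=caf%E9"; the
server only sees the percent-escapes and has to know or guess the rest.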

Obviously - as you pointed out - the POST request does not necessarily 
come from a client that has processed the page containing the form. If 
this is a valid use case in your application, then I guess you have to 
try to test the input's character encoding.
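
One way to "test" the encoding, sketched in Python under some loud
assumptions: try strict UTF-8 first, and only if that fails fall back
to a legacy single-byte encoding. The helper name decode_submitted and
the choice of windows-1252 as fallback are mine, not something the
specs prescribe. Rejecting the request outright, as Bill already does,
is an equally reasonable policy.

```python
def decode_submitted(raw, fallback="windows-1252"):
    """Return (text, encoding_used) for raw, already percent-decoded form bytes."""
    try:
        # Valid UTF-8 (including plain ASCII) is accepted as-is.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything else is treated as the legacy fallback;
        # bytes undefined in that encoding become U+FFFD instead of raising.
        return raw.decode(fallback, errors="replace"), fallback
```

This is a heuristic, not proof: legacy-encoded text with non-ASCII 
bytes usually fails strict UTF-8 decoding, but a short input can pass 
by accident.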

But if we limit the question to web browsers processing HTML forms, then 
the HTML5 spec can be helpful[6]. (One of its aims is to document how 
HTML is actually implemented, since the current specs are quite vague in 
some places.)

So overall it's quite a mess, but fortunately it's still easy to get 
things working right with UTF-8.

.k

[0]http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1
[1]http://www.w3.org/TR/html401/interact/forms.html#h-17.13
[2]https://bugzilla.mozilla.org/show_bug.cgi?id=7533
[3]https://bugzilla.mozilla.org/show_bug.cgi?id=18643
[4]http://web.archive.org/web/20060427015200/ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html
[5]http://intertwingly.net/blog/2004/04/15/Character-Encoding-and-HTML-Forms
[6]http://dev.w3.org/html5/spec/Overview.html#url-encoded-form-data