[thelist] charset, multipart/form-data and multipart/x-www-form-urlencoded

Bill Moseley moseley at hank.org
Sun Jan 24 10:34:16 CST 2010


any HTTP message having a body (e.g. a POST request) should have a
Content-Type header, and if the character encoding is not ISO-8859-1 the
header should have a "charset" parameter.


That sure makes sense to me.  So, I find it very interesting that the
browsers I tested never sent a content type charset, even for form-data
(i.e. a content type in each part's headers).  The exception was an
upload field, where the content type was specified (but not the charset).

Since the RFCs seem to say the browser should send a charset and I'm not
seeing one, I would assume *I'm* doing something wrong.  It seems unlikely that
Firefox, Chrome (and IE6, all I had available in a VM) would all be broken.
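A minimal Python sketch of the check being described: pull the "charset"
parameter out of a Content-Type header value and see whether one is present.
The helper name charset_of is hypothetical; the two header values are the ones
quoted further down in this message.

```python
from email.message import Message


def charset_of(content_type_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset()


# The response header of the page that contains the form.
page_charset = charset_of("text/html; charset=utf-8")

# The request header Chrome sent for the multipart/form-data POST.
post_charset = charset_of(
    "multipart/form-data; boundary=----WebKitFormBoundaryb1ZJGZj1uxi40wSR"
)
```

With these inputs the page header yields a charset and the POST header does
not, which matches what the capture shows.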



You can look for yourself with Wireshark, but here's an example from Chrome
with a multipart/form-data POST:

Again, my page with the form includes these headers:

HTTP/1.0 200 OK
Cache-Control: no-store, no-cache, must-revalidate
Connection: close
Date: Sun, 24 Jan 2010 16:02:27 GMT
Pragma: no-cache
Content-Length: 33197
Content-Type: text/html; charset=utf-8
Expires: -1
Status: 200

<form accept-charset="utf-8" enctype="multipart/form-data" action="/upload"
id="form0" method="post">


And the POST request for the above form:


POST /upload HTTP/1.1
Host: localhost
Connection: keep-alive
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/532.5
(KHTML, like Gecko) Chrome/4.0.249.43 Safari/532.5
Referer: http://localhost/upload
Content-Length: 65249
Cache-Control: max-age=0
Origin: http://localhost
Content-Type: multipart/form-data;
boundary=----WebKitFormBoundaryb1ZJGZj1uxi40wSR
Accept:
application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: gzip,deflate
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3

------WebKitFormBoundaryb1ZJGZj1uxi40wSR
Content-Disposition: form-data; name="first_name"

Bill
------WebKitFormBoundaryb1ZJGZj1uxi40wSR
Content-Disposition: form-data; name="last_name"

...............
------WebKitFormBoundaryb1ZJGZj1uxi40wSR
Content-Disposition: form-data; name="file";
filename="Emergency_phone_numbers.pdf"
Content-Type: application/pdf

%PDF-1.3
%...........
2 0 obj



The dots in last_name are there because that field contained non-ASCII
characters (I copied some Russian text from a recent spam email) and that's
how they came through when copied and pasted from Wireshark.
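To make the observation concrete, here is a rough Python sketch that parses a
body shaped like the capture above and reports the charset declared on each
part. It leans on the standard library's email parser, which is an assumption
of convenience and not what any of the browsers or my server code use. The
function name parse_form_data is hypothetical, and the last_name value is a
placeholder, since the real text did not survive the copy from Wireshark.

```python
from email import message_from_bytes

BOUNDARY = "----WebKitFormBoundaryb1ZJGZj1uxi40wSR"


def parse_form_data(body, boundary):
    """Return (field name, declared charset, raw bytes) for each part."""
    # Prepend a minimal header so the email parser treats the body as MIME.
    head = 'Content-Type: multipart/form-data; boundary="%s"\r\n\r\n' % boundary
    msg = message_from_bytes(head.encode("ascii") + body)
    fields = []
    for part in msg.get_payload():
        name = part.get_param("name", header="content-disposition")
        # get_content_charset() is None when the part declares no charset.
        fields.append(
            (name, part.get_content_charset(), part.get_payload(decode=True))
        )
    return fields


# Placeholder value: NOT the text from the original capture.
placeholder_last_name = "Иванов".encode("utf-8")

body = (
    b"------WebKitFormBoundaryb1ZJGZj1uxi40wSR\r\n"
    b'Content-Disposition: form-data; name="first_name"\r\n'
    b"\r\n"
    b"Bill\r\n"
    b"------WebKitFormBoundaryb1ZJGZj1uxi40wSR\r\n"
    b'Content-Disposition: form-data; name="last_name"\r\n'
    b"\r\n"
    + placeholder_last_name
    + b"\r\n------WebKitFormBoundaryb1ZJGZj1uxi40wSR--\r\n"
)

fields = parse_form_data(body, BOUNDARY)
```

Neither text part carries a Content-Type header of its own, so the declared
charset comes back as None for both and the receiver has to guess.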


I'm decoding the POST parameters as utf8 and that works fine -- i.e. the
form is submitting parameters encoded as utf8, otherwise the decode would
generate an error.
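The reasoning here is that a strict utf8 decode doubles as a check. A small
Python sketch, with a hypothetical helper name and made-up sample text:
Cyrillic bytes in windows-1251 are rejected by a strict decode, while the same
text sent as utf8 passes. Note this is only evidence, not proof; pure ASCII
input is valid in both encodings.

```python
def is_valid_utf8(raw):
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        raw.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


sample = "Привет"  # made-up sample text, not from the capture

sent_as_utf8 = is_valid_utf8(sample.encode("utf-8"))
sent_as_cp1251 = is_valid_utf8(sample.encode("cp1251"))
```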

I'm blindly decoding all input parameters as utf8 because no charset is
provided, because my page is served with a charset of utf8, and because of
the accept-charset on the <form>.  I'd rather decode based on a charset
provided by the browser, but I can't do that if none is provided.
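What I'd like to do looks roughly like the following Python sketch: honor a
declared charset when one arrives, and fall back to utf8 otherwise. The
function and parameter names are hypothetical, and falling back on an unknown
charset name is my own choice here, not something any spec requires.

```python
def decode_field(raw, declared_charset=None, fallback="utf-8"):
    """Decode a form value, preferring the charset the client declared."""
    charset = declared_charset or fallback
    try:
        # Strict decoding: bad bytes raise UnicodeDecodeError to the caller.
        return raw.decode(charset)
    except LookupError:
        # The client named a charset Python does not know; use the fallback.
        return raw.decode(fallback)
```

With the browsers tested above, declared_charset would always be None, so this
collapses to the blind utf8 decode.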




>
> HTML spec says:[1]
> "Note. The "get" method restricts form data set values to ASCII characters"
> It is because GET only allows application/x-www-form-urlencoded, and for
> that (be it GET or POST) ...
> "Non-alphanumeric characters are replaced by `%HH', a percent sign and two
> hexadecimal digits representing the ASCII code of the character. "
> ... only ASCII is allowed.
>

And in the real world GET requests often include utf8 encoded parameters.  I
do this all the time and just assume it's utf8 (although I could just as
easily have a ?charset=<encoding> parameter to be explicit).
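A sketch of that idea in Python, using urllib.parse from the standard library.
The "charset" query parameter is the hypothetical convention mentioned above,
not anything a browser sends on its own, and parse_query is a made-up name.

```python
from urllib.parse import parse_qs, quote


def parse_query(query):
    """Percent-decode a query string, honoring an optional charset=... hint."""
    # First pass as Latin-1 (which never fails) just to read the ASCII hint.
    probe = parse_qs(query, encoding="latin-1")
    charset = probe.get("charset", ["utf-8"])[0]
    # Second pass with the real encoding; invalid bytes raise an error.
    return parse_qs(query, encoding=charset, errors="strict")


# Made-up sample value, percent-encoded two different ways.
utf8_query = "name=" + quote("Привет")
cp1251_query = "charset=cp1251&name=" + quote("Привет", encoding="cp1251")

utf8_params = parse_query(utf8_query)
cp1251_params = parse_query(cp1251_query)
```

Both queries decode to the same name value; without the hint, the second one
would fail the strict utf8 decode.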


> So overall it's quite a mess, but fortunately it's still easy to get things
> working right with UTF-8.
>

True, it works.  It's just kind of a mystery why it's not working like we
expect -- why the browsers are omitting the charset.  Is it because they think
they are sending Latin-1, or because my form is not correct, or is it just a
big common bug in all the browsers?

Thanks for all the feedback and the links.  Good resource for the archives.



> [0] http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1
> [1] http://www.w3.org/TR/html401/interact/forms.html#h-17.13
> [2] https://bugzilla.mozilla.org/show_bug.cgi?id=7533
> [3] https://bugzilla.mozilla.org/show_bug.cgi?id=18643
> [4] http://web.archive.org/web/20060427015200/ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html
> [5] http://intertwingly.net/blog/2004/04/15/Character-Encoding-and-HTML-Forms
> [6] http://dev.w3.org/html5/spec/Overview.html#url-encoded-form-data
>


-- 
Bill Moseley
moseley at hank.org

