[thelist] How to Code Multiple Languages?

Dejan Kozina dejan at kozina.com
Fri Dec 8 10:46:30 CST 2006


Hi Craig.
You were probably searching for a way to 'encode multiple languages', 
while what you are actually looking for is how to encode multiple 
character sets. UTF-8 is the definitive answer to this.

To start with, you need an editor capable of saving your text as utf-8. 
My own choice is Notepad++ (http://notepad-plus.sourceforge.net), but 
there are many other open source apps out there, like SciTE or jEdit. I 
suppose there should be WYSIWYG editors too, but I can't advise on those 
as I never use them. Just input or copy/paste your content and choose 
the appropriate charset when saving. You can safely save plain ASCII as 
utf-8, too.
One thing you should avoid is having the editor save the file with a 
Byte Order Mark (BOM), as this three-byte sequence (EF BB BF) is useless 
on the web and some browsers display it at the start of the page. If 
you're unsure whether your editor is doing this, just search your 
work-in-progress folder for files containing "ï»¿" (this is how the BOM 
looks when read as Latin-1). If any is found, open it in a plain (i.e. 
not unicode capable) text editor, manually remove those characters from 
the beginning of the file and save the result. Plain text editors can be 
used with utf-8 documents as long as you change the ASCII characters 
only and leave everything else alone.
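
If you'd rather not hunt for the BOM by hand, a small script can do the 
search and the clean-up for you. This is only a minimal Python sketch: 
the function names and the list of file patterns are my own assumptions, 
so adapt them to your folder and back up your files first.

```python
import codecs
from pathlib import Path


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM, if there is one."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_folder(folder, patterns=("*.html", "*.css", "*.js")):
    """Rewrite every matching file under `folder` that starts with a BOM.

    The patterns are an assumption; list whatever file types you serve.
    Returns the paths of the files that were changed.
    """
    cleaned = []
    for pattern in patterns:
        for path in Path(folder).rglob(pattern):
            data = path.read_bytes()
            if has_utf8_bom(data):
                path.write_bytes(strip_utf8_bom(data))
                cleaned.append(path)
    return cleaned
```

Calling strip_bom_in_folder("work-in-progress") (a made-up folder name) 
returns the list of files it had to fix. Files without a BOM are not 
touched.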

Next thing you should do is tell the browser about the encoding. The 
proper way to do this is to have the server send the correct Content-Type 
HTTP response header. If your website runs on Apache and you can set up 
per-directory configuration with a .htaccess file, add 
"AddDefaultCharset utf-8" to it. If you can't, the second best is to add 
a meta element to the head section of your documents: "<meta 
http-equiv="content-type" content="text/html; charset=utf-8">". You 
should do this even if you use HTTP headers, so your pages will be 
displayed correctly even when saved on disk.
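
To check what your server is actually telling the browser, you can read 
the charset out of the Content-Type header. The following is just a 
Python sketch using the standard library; the function names are mine 
and any URL you pass in is up to you.

```python
from email.message import Message
from urllib.request import urlopen


def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset in lower case, or None if none is declared.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def declared_charset(url):
    """Fetch `url` and return the charset announced in its HTTP header.

    Needs network access; returns None if the server sends no charset.
    """
    with urlopen(url) as response:
        return response.headers.get_content_charset()
```

For example, charset_from_content_type("text/html; charset=UTF-8") gives 
"utf-8", while a bare "text/html" gives None, which means the browser 
has to rely on the meta element or guess.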

The last hitch is hoping the browser has a font capable of displaying 
the unicode characters you're sending to it. You should avoid fancy 
font-family declarations and stay on the generic side, letting the 
browser choose, as this isn't something you can control. Since W2K and 
OSX, browsers have been capable of displaying practically everything you 
can throw at them, unless you force them to a specific font which may or 
may not contain the glyphs (shapes) required for non-ASCII content.

CSS files also accept a charset declaration: just put '@charset 
"utf-8";' as the first line of your css file. This makes sense only if 
you use css-generated content, but won't break anything if you use it 
anyway.
The script element also accepts a charset attribute, as in <script 
charset="utf-8" ... ></script>. Your mileage may vary.

That done, you can set up the proper mark-up related to the document 
language (mind the difference between a language and a character set: a 
language can use more than one character set and many languages can 
share the same character set). If your document has a primary (main) 
language, define the lang attribute of html (e.g. <html lang="en">) and 
declare the lang attribute separately for every block of content in a 
different language (set lang="whatever" on divs, spans and other 
containers; it will be inherited down the DOM tree). If you cannot or 
do not want to set a primary language, leave lang out of the html 
element and mark up all the content with it.
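
To see how the inheritance works out in practice, here is a rough Python 
sketch that walks a page and reports which language each piece of text 
ends up with. The class and function names are my own, and it is only 
meant for checking reasonably well-formed markup, not as a full HTML 
parser.

```python
from html.parser import HTMLParser

# Elements without content or end tag; they never hold text.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link",
        "meta", "param"}


class LangTracker(HTMLParser):
    """Record (language, text) pairs, inheriting lang from ancestors."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open elements as (tag, effective language)
        self.runs = []   # collected (language, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        inherited = self.stack[-1][1] if self.stack else None
        # An element's own lang attribute overrides the inherited one.
        lang = dict(attrs).get("lang") or inherited
        self.stack.append((tag, lang))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating omitted end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            lang = self.stack[-1][1] if self.stack else None
            self.runs.append((lang, text))


def text_by_language(html):
    """Return a list of (language, text) pairs for the given markup."""
    parser = LangTracker()
    parser.feed(html)
    parser.close()
    return parser.runs
```

Text that comes back with None as its language has no lang set on it or 
on any of its containers, so that is the content you still have to mark 
up.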

There is a 'Content-Language' HTTP header too. Use it to list the 
languages used in the document. There is no .htaccess shortcut for this; 
you should use mod_headers syntax (if your server has it installed) and 
add it to the head too ('<meta http-equiv="content-language" 
content="en,ja">'). Multiple comma separated languages are OK.


As per the accessibility specs, you should further use a hreflang 
attribute with links to a page in a different language ( '<a href="..." 
hreflang="fr">'). I'm not at all sure if screen readers do something 
meaningful with it, anyway...
If said links are used with a client-side image map take note that the 
area element does not accept a hreflang attribute. Use plain anchor 
links ('<a>') instead, as they can have shape and coords attributes.

Well, that's more or less it, unless you have to use server-side 
scripting, where you have to set up your database tables correctly and 
such...

Hope this helps.

djn


Craig Givens wrote:
> I've been researching encoding and charsets like Unicode, UTF-8 and
> big5 and there just doesn't seem to be a way to do this.

-- 
Dejan Kozina
Dolina 346 (TS) - I-34018 Italy
tel./fax: +39 040 228 436 - cell.: +39 348 7355 225
http://www.kozina.com/  - e-mail: dejan at kozina.com


