[thelist] Building a site in Japanese

Andrew Clover and-evolt at doxdesk.com
Mon May 24 16:51:40 CDT 2004


bread_man <breadilicious at hotmail.com> wrote:

> I've looked at other websites in Japanese and figure that everything is pretty
> much the same except for the character set.  The copy is encoded somehow
> and when the page is rendered, the Japanese letters and words appear.

Yep. This is just another character encoding. Good old Latin-1 puts 
accented letters in the top 128 byte values; Japanese encodings use 
that range for all those funky kana and kanji instead.

The most popular encoding for Japanese web pages is probably Shift-JIS. 
This is a 'double-byte character set' where a top-bit-set lead byte 
(most of them, anyway; a few stand alone as half-width katakana) means 
that this and the next byte are taken together as a single character. 
(This is necessary because there are rather more than 128 Japanese 
characters.)
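
To see the lead-byte business concretely, here is a minimal Python 
sketch. The language and the sample characters are my own choice for 
illustration; it assumes nothing beyond the 'shift_jis' codec that ships 
in Python's standard library.

```python
# Sketch: how Shift-JIS spends bytes on ASCII versus kanji.
# The sample characters are arbitrary illustrations.

def sjis_bytes(ch: str) -> bytes:
    """Encode one character with Python's built-in Shift-JIS codec."""
    return ch.encode("shift_jis")

ascii_form = sjis_bytes("A")        # plain ASCII stays a single byte
kanji_form = sjis_bytes("\u65e5")   # U+65E5, the kanji for 'sun/day'

print(len(ascii_form), len(kanji_form))
# The first byte of the kanji has its top bit set, which is what tells
# a decoder to read the following byte as part of the same character.
print(hex(kanji_form[0]))
```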

However, do *not* use Shift-JIS for anything(*) (or ISO-2022 or EUC for 
that matter, two other equally horrible encodings). Use Unicode, saved 
in a sensible encoding such as UTF-8, and you'll be able to use all 
possible characters including Japanese ones, Latin characters with 
accents, Greek and so on, all at the same time.
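
As a rough illustration of the 'all at the same time' point, the Python 
sketch below encodes one mixed-script string as UTF-8 and then checks 
whether the same string fits into Shift-JIS. The sample string and the 
fits_in() helper are made up for this example.

```python
# Sketch: one UTF-8 string can mix scripts; Shift-JIS cannot hold them all.

# Accented Latin, Greek and Japanese in a single string (written with
# escapes so this listing itself stays plain ASCII).
mixed = "caf\u00e9 \u03b1\u03b2\u03b3 \u65e5\u672c\u8a9e"

utf8_bytes = mixed.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

def fits_in(text: str, encoding: str) -> bool:
    """Hypothetical helper: True if every character exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(round_trip == mixed)
print(fits_in(mixed, "utf-8"), fits_in(mixed, "shift_jis"))
```

The Shift-JIS check fails on the accented 'e', which that character set 
simply has no slot for.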

(* - on the web, anyway. There is still reason to use Shift-JIS in 
e-mail, unfortunately, due to some incredibly crap webmail providers.)

> Forgive me if this sounds really stupid, but what will I need to get the
> encoded copy into my pages?

You will need at least one Japanese font installed. It is a good idea to 
install Japanese encodings too, so that your browser doesn't get stuck 
when it hits a page with a peculiarly Japanese encoding like Shift-JIS. 
For WinXP you can do both from Control Panel -> Regional and Language 
Options -> Languages -> Supplemental Language Support.

Then cut and paste into a text editor with full Unicode support. Notepad 
on Windows NT/2000/XP/2003 supports Unicode fine, but it is still 
Notepad, ugh.

My favorite Unicode-capable editor for Windows is from www.emeditor.com, 
but there are surely lots of others to choose from. On Linux, the KDE 
text editors are fine also. Don't know about Macs.

Take note of the encoding you save under (normally UTF-8) and make sure 
you specify this encoding in the Content-Type charset parameter so that 
the browser can tell it's UTF-8 without having to guess and maybe get it 
wrong. If you don't have access to the server config to set the default 
charset, use a meta tag:

   <head>
     ...
     <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
   </head>
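
If the pages come out of a script rather than static files, the same 
rule applies: the charset you declare has to match the bytes you 
actually send. A minimal Python sketch of that idea follows; 
build_response() and its return shape are hypothetical, not part of any 
particular server or framework.

```python
# Sketch: keep the declared charset and the actual bytes in agreement.

ENCODING = "utf-8"  # assumed: the encoding the copy was saved in

def build_response(body_text: str):
    """Hypothetical helper: return (headers, body_bytes) for one page."""
    html = (
        "<html><head>"
        f'<meta http-equiv="Content-Type" content="text/html;charset={ENCODING}" />'
        "</head><body>" + body_text + "</body></html>"
    )
    body = html.encode(ENCODING)  # encode with the same name we declare
    headers = {
        "Content-Type": f"text/html; charset={ENCODING}",
        "Content-Length": str(len(body)),  # counted in bytes, not characters
    }
    return headers, body

# Sample text: 'konnichiwa' in hiragana, written as escapes.
headers, body = build_response("\u3053\u3093\u306b\u3061\u306f")
print(headers["Content-Type"])
```

Using one ENCODING constant for both the header and the encode() call 
is the point of the sketch: the two cannot drift apart.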

> Any gotchas or tips would be appreciated.

Get whoever supplies the copy to indicate what is actually *supposed* 
to be a line break in it, and what's just text wrapping. When you don't 
grok the language and there aren't spaces between words it can be hard 
to tell what the structure is supposed to be if you've only been sent 
plain text.

Avoid relying on the standard 'serif' and 'sans-serif' generic CSS fonts 
alone; on some machines they will, for no apparent reason, choose a font 
without any Japanese characters in, resulting in a page where all the 
characters are rendered as empty squares. Put one of the common Japanese 
font names before any generics. But beware! Fonts can change names 
depending on the native character set: what you see as 'MS PMincho' will 
be available to a Japanese IE user only under the name 'MS P[1][2]', 
where [1] is the kanji represented by Unicode code point 26126 (U+660E) 
and [2] is the kanji represented by Unicode code point 26397 (U+671D). 
So in any font-family CSS declarations, include both names.
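
A small Python sketch of how such a declaration could be put together 
is below. The font_family() helper is hypothetical. The two code points 
are the ones quoted above; together they read 'Mincho', so the example 
pairs them with 'MS PMincho', and that pairing is this example's 
assumption. The exact localized spelling (including whether the Latin 
letters are full-width) is worth checking on a real Japanese machine.

```python
# Sketch: build a font-family value listing a font under both of its names.
# Pairing these kanji with 'MS PMincho' is this example's assumption.

KANJI = chr(26126) + chr(26397)   # the two code points quoted above

def font_family(latin_name: str, local_name: str, generic: str) -> str:
    """Hypothetical helper: named fonts first, the generic fallback last."""
    named = ", ".join(f'"{name}"' for name in (latin_name, local_name))
    return f"font-family: {named}, {generic};"

rule = font_family("MS PMincho", "MS P" + KANJI, "serif")

# Print with escapes so the output survives an ASCII-only terminal.
print(rule.encode("ascii", "backslashreplace").decode("ascii"))
```

Whatever file the resulting rule ends up in needs to be saved and served 
as UTF-8 too, or the kanji in the font name will not match anything.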

Oh, and don't include backslashes in your page text. For reasons too 
arcane and tedious to go into, they'll come out as yen symbols instead 
on native Japanese machines.

-- 
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/

