[thelist] Building a site in Japanese
Andrew Clover
and-evolt at doxdesk.com
Mon May 24 16:51:40 CDT 2004
bread_man <breadilicious at hotmail.com> wrote:
> I've looked at other websites in Japanese and figure that everything is pretty
> much the same except for the character set. The copy is in encoded somehow
> and when the page is rendered, the japanese letters and words appear.
Yep. This is just another character encoding. Good old Latin-1 puts
letters with accents on in the top 128 byte values; Japanese encodings
use those values to get at all those funky kanji instead.
The most popular encoding for Japanese web pages is probably Shift-JIS.
This is a 'double-byte character set' where a top-bit-set lead byte
means that this and the next byte are taken together as a single
character. (This is necessary because there are rather more than 128
Japanese characters. A small range of top-bit-set bytes is kept back
for single-byte half-width katakana.)
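To make the byte-pairing concrete, here is a minimal Python sketch. It
assumes Python's built-in "shift_jis" codec and a sample string of three
kanji chosen for illustration; the byte walk follows the simple rule
described above and ignores the single-byte half-width katakana range.

```python
# Encode a short Japanese string as Shift-JIS and inspect the bytes.
text = "日本語"  # sample text: three kanji (an illustrative choice)
data = text.encode("shift_jis")

# Each kanji takes two bytes, so three characters become six bytes.
print(len(text), len(data))

# Walk the bytes: a byte with its top bit (0x80) set is treated as a
# lead byte that claims the following byte as part of one character.
# Simplification: half-width katakana (single top-bit-set bytes) are
# not handled here.
pairs = []
i = 0
while i < len(data):
    if data[i] & 0x80:
        pairs.append(data[i:i + 2])
        i += 2
    else:
        pairs.append(data[i:i + 1])
        i += 1

print([p.hex() for p in pairs])
```

Note that the second byte of a pair need not have its top bit set,
which is one of the things that makes Shift-JIS awkward to process.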
However, do *not* use Shift-JIS for anything(*) (or ISO-2022-JP or
EUC-JP for that matter, two other equally horrible encodings). Use
Unicode, saved
in a sensible encoding such as UTF-8, and you'll be able to use all
possible characters including Japanese ones, Latin characters with
accents, Greek and so on, all at the same time.
(* - on the web, anyway. There is still reason to use Shift-JIS in
e-mail, unfortunately, due to some incredibly crap webmail providers.)
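As a rough illustration of the "all at the same time" point, the Python
sketch below encodes one mixed string both ways. The sample string is
made up for the example, and the result depends on Python's built-in
"utf-8" and "shift_jis" codecs.

```python
# One string mixing Japanese, accented Latin and Greek characters.
mixed = "日本語 café αβγ"

# UTF-8 can represent every character and round-trips the string.
utf8_bytes = mixed.encode("utf-8")
assert utf8_bytes.decode("utf-8") == mixed

# Shift-JIS has no code for the accented "é", so encoding fails.
try:
    mixed.encode("shift_jis")
    sjis_ok = True
except UnicodeEncodeError:
    sjis_ok = False

print(sjis_ok)
```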
> Forgive me if this sounds really stupid, but what will I need to get the
> encoded copy into my pages?
You will need at least one Japanese font installed. It is a good idea to
install Japanese encodings too, so that your browser doesn't get stuck
when it hits a page with a peculiarly Japanese encoding like Shift-JIS.
For WinXP you can do both from Control Panel -> Regional and Language
Options -> Languages -> Supplemental Language Support.
Then cut and paste into a text editor with full Unicode support. Notepad
on Windows NT/2000/XP/2003 supports Unicode fine, but it is still
Notepad, ugh.
My favorite Unicode-capable editor for Windows is from www.emeditor.com,
but there are surely lots of others to choose from. On Linux, the KDE
text editors are fine also. Don't know about Macs.
Take note of the encoding you save under (normally UTF-8) and make sure
you specify this encoding in the charset parameter of the HTTP
Content-Type header so that the browser can tell it's UTF-8 without
having to guess and maybe get it
wrong. If you don't have access to the server config to set the default
charset, use a meta tag:
<head>
...
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
</head>
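To see why the declared charset matters, here is a small Python sketch
of what happens when the bytes are read with the right encoding versus
a wrong guess. The page content and the header string are illustrative
assumptions, and the "wrong guess" is simulated by decoding as
Shift-JIS; a real browser's guessing is more involved.

```python
# A tiny page carrying the meta tag, saved as UTF-8.
page = (
    "<html><head>"
    '<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />'
    "</head><body>日本語</body></html>"
)
body = page.encode("utf-8")

# The matching HTTP header a server would send (illustrative string).
header = "Content-Type: text/html; charset=utf-8"

# A browser that is told the encoding recovers the text exactly.
right = body.decode("utf-8")

# A browser that guesses Shift-JIS gets garbage; errors="replace"
# stands in for the browser carrying on past invalid byte sequences.
wrong = body.decode("shift_jis", errors="replace")

print(right == page, wrong == page)
```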
> Any gotchas or tips would be appreciated.
Get them to indicate what is actually *supposed* to be a line break in
the copy they send you, and what's just text wrapping. When you don't
grok the language and there aren't spaces between words it can be hard
to tell what the structure is supposed to be if you've only been sent
plain text.
Avoid using the standard 'serif' and 'sans-serif' generic CSS fonts;
they will on some machines for no apparent reason choose a font without
any Japanese characters in, resulting in a page where all the characters
are rendered as empty squares. Put one of the common Japanese font names
before any generics. But beware! Fonts can change names depending on the
native character set: what you see as 'MS PMincho' will be available to
a Japanese IE user only under the name 'MS P[1][2]', where [1] is the
kanji represented by Unicode code point 26126 (U+660E) and [2] is the
kanji represented by Unicode code point 26397 (U+671D); together they
read 'Mincho'. So in any font-family CSS declarations, include both
names.
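A sketch of building such a declaration in Python, using the two code
points given above. They are U+660E and U+671D, the kanji pair read
'Mincho', so the Latin name paired with them here is 'MS PMincho'. The
helper font_family() is hypothetical, and the exact spelling of the
native name (for instance whether the 'MS P' prefix uses fullwidth
letters) is an assumption that should be checked on a Japanese system.

```python
def font_family(latin_name, native_name, generic):
    """Build a font-family declaration listing both names of one font."""
    return f'font-family: "{latin_name}", "{native_name}", {generic};'


# The two kanji from their decimal code points.
kanji = chr(26126) + chr(26397)

# Assumption: the native name is spelled as in the text, "MS P" + kanji.
native = "MS P" + kanji

css = font_family("MS PMincho", native, "serif")
print(css)
```

Putting the generic family last keeps it as a fallback only.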
Oh, and don't include backslashes in your page text. For reasons too
arcane and tedious to go into, they'll come out as yen symbols instead
on native Japanese machines.
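If the copy is long, a quick check for stray backslashes can be
scripted. A minimal Python sketch; the function name find_backslashes()
and the sample text are made up for the example.

```python
def find_backslashes(text):
    """Return 1-based (line, column) pairs for each backslash in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\\":
                hits.append((line_no, col))
    return hits


# Sample page text: one backslash on the first line, none on the second.
sample = "Save to C:\\temp\nno problem here"
print(find_backslashes(sample))
```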
--
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/