Charset (Was: RE: [Javascript] JS Marquee - Advanced!)

Troy III Ajnej trojani2000 at hotmail.com
Mon Aug 14 11:29:04 CDT 2006


> I see nothing there that says IE scans documents twice.
Well, you shouldn't!
 
Why do you always have to read me wrong?!
The topic was charset encoding, not IE scanning.
That was a slight digression, led by his incorrect remark,
but it was still expanding on the same topic.
 
But now I'll do something smarter...
***
5.2 Character encodings
/... character encoding .../
The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes 
into a sequence of characters. /.../The conversion method can range from simple one-to-one correspondence to 
complex switching schemes or algorithms.
 
A simple one-byte-per-character encoding technique is not sufficient for text strings over a character repertoire as 
large as [ISO10646]. There are several different encodings of parts of [ISO10646] in addition to encodings of the 
entire character set (such as UCS-4).
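
(Illustration, not part of the quoted spec.) A minimal Python sketch of what "converting a sequence of bytes into a sequence of characters" means in practice: the same two bytes give different characters depending on which charset label is applied. The byte values are picked only for the demonstration.

```python
# The same byte sequence, interpreted under two different charset labels.
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")          # one character: U+00E9
as_latin1 = raw.decode("iso-8859-1")   # two characters: U+00C3 and U+00A9

print(len(as_utf8), len(as_latin1))
```

This is why a wrong label is worse than it sounds: nothing fails, the text just comes out as different characters.
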
//Attention here:
5.2.1 Choosing an encoding
Authoring tools (e.g., text editors) may encode HTML documents in the character encoding of their choice, and the 
choice largely depends on the conventions used by the system software. These tools may employ any convenient 
encoding that covers most of the characters contained in the document, provided the encoding is correctly labeled. 
Occasional characters that fall outside this encoding may still be represented by character references. These always 
refer to the document character set, not the character encoding.
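
(Illustration, not part of the quoted spec.) A small Python sketch of the character-reference point, assuming a document saved as ISO-8859-1 that contains one Cyrillic letter. The letter cannot be encoded, so it is written as a numeric character reference, and that number is the ISO 10646 code point, whatever the document's encoding is.

```python
import html

text = "caf\u00e9 \u0438"   # U+0438 (Cyrillic) is outside ISO-8859-1

# Characters the encoding cannot represent become numeric character references.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# The reference resolves against the document character set, so the original
# text comes back once the bytes are decoded and the reference is expanded.
decoded = html.unescape(encoded.decode("iso-8859-1"))
```
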
//This one is tricky:
Servers and proxies may change a character encoding (called transcoding) on the fly to meet the requests of user 
agents (see section 14.2 of [RFC2616], the "Accept-Charset" HTTP request header). Servers and proxies do not have 
to serve a document in a character encoding that covers the entire document character set.
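
(Illustration, not part of the quoted spec.) Transcoding can be sketched in Python as decode-then-encode. The function name `transcode` is hypothetical, and a real server or proxy would also have to change the "charset" label it sends, otherwise the body and the label disagree.

```python
def transcode(body: bytes, source: str, target: str) -> bytes:
    """Re-encode body from the source charset to the target charset."""
    return body.decode(source).encode(target)

# Three Japanese characters, first as Shift_JIS bytes, then as EUC-JP bytes.
original = "\u65e5\u672c\u8a9e".encode("shift_jis")
converted = transcode(original, "shift_jis", "euc-jp")
```

The characters are the same before and after; only the bytes on the wire differ.
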
Commonly used character encodings on the Web include ISO-8859-1 (also referred to as "Latin-1"; usable for most 
Western European languages), ISO-8859-5 (which supports Cyrillic), SHIFT_JIS (a Japanese encoding), EUC-JP (another 
Japanese encoding), and UTF-8 (an encoding of ISO 10646 using a different number of bytes for different characters). 
Names for character encodings are case-insensitive, so that for example "SHIFT_JIS", "Shift_JIS", and "shift_jis" are equivalent.
This specification does not mandate which character encodings a user agent must support.
Conforming user agents must correctly map to ISO 10646 all characters in any character encodings that they recognize 
(or they must behave as if they did).
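
(Illustration, not part of the quoted spec.) Case-insensitive names can be checked against Python's codec registry, used here only as one example of an implementation that normalizes charset names. It is not what any browser does internally.

```python
import codecs

# The three spellings mentioned in the spec text.
names = ["SHIFT_JIS", "Shift_JIS", "shift_jis"]

# codecs.lookup() normalizes the name before searching the registry,
# so every spelling resolves to the same codec.
canonical = {codecs.lookup(name).name for name in names}

# "Latin-1" is registered as an alias of ISO-8859-1.
same_codec = codecs.lookup("ISO-8859-1").name == codecs.lookup("Latin-1").name
```
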
 
//More attention here:
Notes on specific encodings 
When HTML text is transmitted in UTF-16 (charset=UTF-16), text data should be transmitted in network byte order ("big-endian", 
high-order byte first) in accordance with [ISO10646], Section 6.3 and [UNICODE], clause C3, page 3-1.
Furthermore, to maximize chances of proper interpretation, it is recommended that documents transmitted as UTF-16 always begin 
with a ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called Byte Order Mark (BOM)) which, when byte-reversed, 
becomes hexadecimal FFFE, a character guaranteed never to be assigned. Thus, a user-agent receiving a hexadecimal FFFE as the first 
bytes of a text would know that bytes have to be reversed for the remainder of the text.
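
(Illustration, not part of the quoted spec.) The BOM check described above, sketched in Python. The function name `utf16_byte_order` is hypothetical; it only looks at the first two bytes and says nothing when no BOM is present.

```python
def utf16_byte_order(data: bytes):
    """Return 'big', 'little', or None based on the leading BOM."""
    if data[:2] == b"\xfe\xff":
        return "big"       # FEFF read in order: network byte order
    if data[:2] == b"\xff\xfe":
        return "little"    # FFFE seen first: bytes are reversed
    return None            # no BOM, so nothing can be concluded from it

big = "\ufeffhi".encode("utf-16-be")
little = "\ufeffhi".encode("utf-16-le")
```
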
The UTF-1 transformation format of [ISO10646] (registered by IANA as ISO-10646-UTF-1), should not be used. For information about 
ISO 8859-8 and the bidirectional algorithm, please consult the section on bidirectionality and character encoding.
//This one goes for the founder of PHP:
5.2.2 Specifying the character encoding
How does a server determine which character encoding applies for a document it serves? Some servers examine the first few bytes of 
the document, or check against a database of known files and encodings. Many modern servers give Web masters more control over 
charset configuration than old servers do. Web masters should use these mechanisms to send out a "charset" parameter whenever possible, 
but should take care not to identify a document with the wrong "charset" parameter value.
How does a user agent know which character encoding has been used? The server should provide this information. The most straightforward 
way for a server to inform the user agent about the character encoding of the document is to use the "charset" parameter of the "Content-Type" 
header field of the HTTP protocol ([RFC2616], sections 3.4 and 14.17). For example, the following HTTP header announces that the character
encoding is EUC-JP:

    Content-Type: text/html; charset=EUC-JP
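
(Illustration, not part of the quoted spec.) On the receiving side, the "charset" parameter can be pulled out of a Content-Type value with Python's standard email package. The helper name `charset_from_content_type` is hypothetical. Note that it returns None when the parameter is absent, rather than guessing a default.

```python
from email.message import Message

def charset_from_content_type(value: str):
    """Return the lowercased charset parameter, or None if it is absent."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()

labelled = charset_from_content_type("text/html; charset=EUC-JP")
unlabelled = charset_from_content_type("text/html")
```
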

Please consult the section on conformance for the definition of text/html.
The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from 
the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to 
be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter.
To address server or configuration limitations, HTML documents may include explicit information about the document's character encoding; the META 
element can be used to provide user agents with this information.
For example, to specify that the character encoding of the current document is "EUC-JP", a document should include the following META declaration:

    <META http-equiv="Content-Type" content="text/html; charset=EUC-JP">

The META declaration must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters (at least 
until the META element is parsed). META declarations should appear as early as possible in the HEAD element.
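
(Illustration, not part of the quoted spec.) A rough Python sketch of why the META declaration only works for ASCII-compatible encodings: the reader has to find the declaration in the raw bytes before it knows the encoding. The regular expression, the 1024-byte limit, and the name `sniff_meta_charset` are all assumptions for the demonstration, not how any particular browser does it.

```python
import re

# Look for charset=... inside a META tag, treating the raw bytes as ASCII.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def sniff_meta_charset(raw: bytes, limit: int = 1024):
    """Return the charset named in an early META tag, or None."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None

tag = '<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">'
found = sniff_meta_charset(tag.encode("ascii"))

# In UTF-16 every ASCII letter is padded with a zero byte, so the scan fails.
missed = sniff_meta_charset(tag.encode("utf-16-le"))
```
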
For cases where neither the HTTP protocol nor the META element provides information about the character encoding of a document, HTML also provides 
the charset attribute on several elements. By combining these mechanisms, an author can greatly improve the chances that, when the user retrieves a 
resource, the user agent will recognize the character encoding.
To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
3. The charset attribute set on an element that designates an external resource.

In addition to this list of priorities, the user agent may use heuristics and user settings. For example, many user agents use a heuristic to distinguish the 
various encodings used for Japanese text. Also, user agents typically have a user-definable, local default character encoding which they apply in the 
absence of other indicators.
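
(Illustration, not part of the quoted spec.) The priority order can be sketched in Python as a simple fall-through. The function and parameter names are hypothetical, and the heuristics the spec mentions are left out; the local default is only consulted when nothing else is available.

```python
def pick_charset(http_charset=None, meta_charset=None,
                 attr_charset=None, local_default=None):
    """Return the first available charset, from highest priority to lowest."""
    for candidate in (http_charset, meta_charset, attr_charset):
        if candidate:
            return candidate
    # No explicit indicator: fall back to the user's local default, if any.
    return local_default

print(pick_charset(http_charset="euc-jp", meta_charset="utf-8"))
```

So a META declaration never wins over an HTTP header that carries a "charset" parameter, even if the header is the one that is wrong.
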
User agents may provide a mechanism that allows users to override incorrect "charset" information. However, if a user agent offers such a mechanism, 
it should only offer it for browsing and not for editing, to avoid the creation of Web pages marked with an incorrect "charset" parameter.

Note. If, for a specific application, it becomes necessary to refer to characters outside [ISO10646], characters should be assigned to a private zone to 
avoid conflicts with present or future versions of the standard. This is highly discouraged, however, for reasons of portability.
 
ETC
 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Troy III
progressive art enterprise
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~