[thelist] Weird Data from Text File: 

liorean liorean at gmail.com
Tue Mar 11 04:21:51 CDT 2008


> Casey Crookston wrote:
> > ï»¿Advertiser
> >
> > I have NO idea where the ï»¿ is coming from!  It most certainly is not
> > in the text file!!!
> >
> > Any ideas?

It's a typical encoding problem, as follows:
- UTF-8 is an 8-bit unit variable width encoding that is a superset of US-ASCII.
- UTF-16 has a magic cookie (the Byte Order Mark, BOM, U+FEFF) that tells
implementations which byte order the document is encoded in.
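- Windows traditionally uses an ANSI encoding (which one depends on
locale) that is also built on 8-bit units and is a superset of
US-ASCII. It is single-byte in Western locales (e.g. Windows-1252) and
variable width in some East Asian ones.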

In order to differentiate UTF-8 from ANSI, Microsoft tools such as
Notepad insert the BOM character (U+FEFF, borrowed from UTF-16) at the
start of the file, encoded as the three UTF-8 code units EF BB BF, to
tell that the encoding is UTF-8.
If that magic cookie is interpreted as ANSI (Windows-1252 here), it
will be interpreted as the three-character sequence "ï»¿".
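
To make the byte-level picture concrete, here is a minimal sketch in
Python (my choice of language, not necessarily what you are using). It
assumes the ANSI code page is Windows-1252; the function name and the
sample text are made up for illustration.

```python
import codecs

# UTF-8 encodes the BOM character U+FEFF as the three bytes EF BB BF.
BOM_AS_UTF8 = "\ufeff".encode("utf-8")


def misread_as_ansi(raw: bytes) -> str:
    """Decode bytes as Windows-1252, the way an ANSI-only reader would."""
    return raw.decode("cp1252")


# A file saved as UTF-8 with a BOM in front (hypothetical sample content).
sample = codecs.BOM_UTF8 + "Advertiser".encode("utf-8")

# In Windows-1252, EF is "ï", BB is "»" and BF is "¿",
# so the BOM shows up as three junk characters in front of the text.
garbled = misread_as_ansi(sample)
```

Here `garbled` holds the same three characters followed by "Advertiser",
even though nothing visible in the file corresponds to them.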

So, it's there because you're treating a UTF-8 encoded file as ANSI.
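
One possible way out, sketched in Python under the assumption that the
file really is UTF-8: decode it as UTF-8 and let the decoder drop the
BOM. Python's "utf-8-sig" codec does that; other platforms have their
own equivalents. The function name below is hypothetical.

```python
import codecs


def decode_utf8_text(raw: bytes) -> str:
    """Decode UTF-8 bytes, discarding a leading BOM if one is present."""
    # "utf-8-sig" behaves like "utf-8" but skips an initial EF BB BF.
    return raw.decode("utf-8-sig")


# Works the same with or without the BOM (sample content is made up).
with_bom = decode_utf8_text(codecs.BOM_UTF8 + b"Advertiser")
without_bom = decode_utf8_text(b"Advertiser")
```

When reading straight from disk, passing encoding="utf-8-sig" to
Python's built-in open() should have the same effect.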
-- 
David "liorean" Andersson

