[thelist] FW: XML PHP Special Characters
Andrew Clover
and-evolt at doxdesk.com
Tue Mar 29 06:31:37 CST 2005
Mark Joslyn <Mark.Joslyn at SolimarSystems.com> wrote:
> I have a special character inside the XML document (trademark symbol ™) that
> is being parsed and what is returned is a question mark.
Disclaimer: I don't really do PHP and don't know what I'm talking about.
But I seem to be replying to PHP questions anyway of late. Ho hum.
> Is there a way I can have this special character go through the parsing
> process but still return a valid trademark symbol?
I believe the problem is that the XML parser has to pass back normal PHP
byte-strings to your character data handler for all the text content.
Character references (&#n;) must always be converted to characters by
XML parsers (they can't be kept as separate reference objects).
The 'target encoding' the parser uses for returned strings defaults
to the same encoding as the document itself - in this case Latin-1.
However, Latin-1 does not include a trademark symbol (Windows codepage
1252 - which is based on Latin-1 - does, but actual standard ISO-8859-1
does not), so the character data handler can't be sent the character,
and gets a question mark instead.
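
The Latin-1 versus codepage 1252 difference shows up outside the XML
parser too. A minimal sketch, assuming the mbstring extension is
available (nothing in the original question says it is) and that its
substitute character is left at the default:

```php
<?php
// U+2122 TRADE MARK SIGN, written out as its UTF-8 bytes.
$tm = "\xE2\x84\xA2";

// ISO-8859-1 has no trademark sign, so mbstring substitutes '?'.
$latin1 = mb_convert_encoding($tm, 'ISO-8859-1', 'UTF-8');

// Windows-1252 maps the trademark sign to the single byte 0x99.
$cp1252 = mb_convert_encoding($tm, 'Windows-1252', 'UTF-8');

echo bin2hex($latin1), "\n";  // 3f
echo bin2hex($cp1252), "\n";  // 99
```

The same kind of substitution is what the character data handler ends
up seeing when the target encoding is Latin-1.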
To get trademarks returned unmolested, use a target encoding that
includes that character - most likely UTF-8. Either set the target
encoding manually:
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'utf-8');
Or, simpler, just use UTF-8 for all your documents.
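
Put together, a rough and only lightly hedged sketch of the first
option might look like this. The sample document, the $text variable
and the closure-based handler are made up for illustration (closures
need a much newer PHP than was current when this was written); only
the xml_* calls come from the PHP XML extension:

```php
<?php
// Hypothetical sample: declared as Latin-1, trademark sign given as a
// character reference because Latin-1 cannot encode it directly.
$xml = '<?xml version="1.0" encoding="ISO-8859-1"?>'
     . '<product>Widget&#8482;</product>';

$text = '';

$parser = xml_parser_create();

// Ask for character data to be handed back as UTF-8 byte strings.
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');

// The handler may be called several times for one text node, so
// accumulate the pieces rather than overwrite them.
xml_set_character_data_handler($parser, function ($parser, $data) use (&$text) {
    $text .= $data;
});

if (!xml_parse($parser, $xml, true)) {
    die(xml_error_string(xml_get_error_code($parser)) . "\n");
}

// With a UTF-8 target the trademark sign should arrive as the three
// bytes E2 84 A2 instead of a question mark.
echo bin2hex($text), "\n";
```

Whatever prints or stores $text afterwards has to treat it as UTF-8 as
well, otherwise the symbol gets mangled at that later stage instead.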
PHP's lack of a native Unicode string type is a real hassle for XML
processing tasks. Dealing with XML as byte strings is the Wrong Thing and
rather sad really.
--
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/