[thelist] FW: XML PHP Special Characters

Andrew Clover and-evolt at doxdesk.com
Tue Mar 29 06:31:37 CST 2005


Mark Joslyn <Mark.Joslyn at SolimarSystems.com> wrote:

> I have a special character inside the XML document (trademark symbol ™) that
> is being parsed and what is returned is a question mark.

Disclaimer: I don't really do PHP and don't know what I'm talking about. 
But seem to be replying to PHP questions anyway of late. Ho hum.

> Is there a way I can have this special character go through the parsing
> process but still return a valid trademark symbol?

I believe the problem is that the XML parser has to pass back normal PHP 
byte-strings to your character data handler for all the text content. 
Character references (&#n;) must always be converted to characters by 
XML parsers (they can't be kept as separate reference objects).

The 'target encoding' it chooses to use for returning strings defaults 
to the parser's source encoding (the one given to xml_parser_create(), 
or its default) - in this case Latin-1. 
However, Latin-1 does not include a trademark symbol (Windows codepage 
1252 - which is based on Latin-1 - does, but actual standard ISO-8859-1 
does not), so the character data handler can't be sent the character, 
and gets a question mark instead.
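
To make the Latin-1 versus codepage 1252 difference concrete, here is a 
small sketch. It assumes PHP's iconv extension is available and that the 
underlying iconv library knows the name 'CP1252'; neither assumption 
comes from the original question.

```php
<?php
// The same byte, 0x99, read under two different single-byte encodings
// and converted to UTF-8 so the resulting code point is visible.
$fromCp1252 = iconv('CP1252', 'UTF-8', "\x99");
$fromLatin1 = iconv('ISO-8859-1', 'UTF-8', "\x99");

// Windows-1252 maps 0x99 to U+2122 TRADE MARK SIGN.
echo bin2hex($fromCp1252), "\n"; // e284a2

// ISO-8859-1 maps 0x99 to U+0099, a C1 control character.
echo bin2hex($fromLatin1), "\n"; // c299
```

So a byte that looks like a trademark sign on a Windows machine is not 
one as far as real ISO-8859-1 is concerned, and U+2122 has no Latin-1 
byte to be squeezed into.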

To get trademarks returned unmolested, use a target encoding that 
includes that character - most likely UTF-8. Either set the target 
encoding manually:

   xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'utf-8');

Or, simpler, just use UTF-8 for all your documents.
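
For what it's worth, a minimal sketch of the manual route might look 
like the following. The helper name collect_text() and the sample 
document are made up for illustration, and the syntax assumes a newer 
PHP (7.1 or later) than most people are running today; only the 
xml_* calls are the real API.

```php
<?php
// Hypothetical helper: parse $xml and return all of its character data,
// delivered to the handler in the requested target encoding.
function collect_text(string $xml, string $targetEncoding): string
{
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, $targetEncoding);

    $text = '';
    // The handler can be called several times for one text node,
    // so append each chunk rather than overwrite.
    xml_set_character_data_handler(
        $parser,
        function ($parser, string $data) use (&$text): void {
            $text .= $data;
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        $error = xml_error_string(xml_get_error_code($parser));
        throw new RuntimeException("XML parse error: $error");
    }

    return $text;
}

// Sample document (an assumption): Latin-1, with the trademark sign
// written as a character reference.
$xml = '<?xml version="1.0" encoding="ISO-8859-1"?>'
     . '<name>Widget&#8482;</name>';

$latin1 = collect_text($xml, 'ISO-8859-1');
$utf8   = collect_text($xml, 'UTF-8');

echo bin2hex($latin1), "\n"; // text as Latin-1 bytes
echo bin2hex($utf8), "\n";   // text as UTF-8 bytes
```

If the explanation above is right, the Latin-1 result should end in a 
question mark (byte 3f) and the UTF-8 result in the three bytes 
e2 84 a2, which is the trademark sign in UTF-8. Whatever prints the text 
afterwards then has to be told it is UTF-8 too.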

PHP's lack of a native Unicode string type is a real hassle for XML 
processing tasks. Dealing with XML as byte strings is the Wrong Thing 
and rather sad really.

-- 
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/

