[thelist] symmetrical CDATA, with accented or multibyte characters

Mike Migurski mike-evolt at teczno.com
Fri Aug 15 19:53:21 CDT 2003


Would anyone mind pointing me in the right direction for information on
correct ways to encode multibyte information in XML CDATA? I've done a few
Google searches and am familiar with the basic ideas, but in attempting to
write symmetrical functions (PHP) for this stuff (ones that parse data
and spit it back out identically), I'm tripping over common accented
characters.

For example, my test input in one spot is "éeek!", which gets parsed
and re-output as "<![CDATA[éeek!]]>" (the first e is accented, as
intended). If I then re-input that to the same parser and check the
subsequent output, it ends up looking like "<![CDATA[?k!]]>". Obviously,
PHP's parser is having difficulty reading the multibyte accented-e
character. What is the correct way to indicate such characters to the
parser?

In reading through http://www.w3.org/TR/REC-xml, it seems that I need to
be correctly specifying my encoding in the xml declaration, but doing so
causes PHP to choke on the XML information altogether. Specifying the
encoding in xml_parser_create() seems to help somewhat, but now parsing is
stopping as certain multibyte characters are encountered in CDATA sections.
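
To make the goal concrete, here is a minimal sketch of the kind of
symmetrical round trip described above. It assumes a modern PHP (7 or
later), input that is already valid UTF-8, and the expat-based xml_*
functions. The helper names cdata_wrap() and cdata_read() are hypothetical,
made up for illustration, and this is not necessarily how the original
functions were written.

```php
<?php
// Hypothetical helpers for illustration only; assumes all strings are UTF-8.

// Wrap text in a CDATA section. A literal "]]>" cannot appear inside CDATA,
// so it is split across two adjacent sections.
function cdata_wrap(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

// Parse a small XML document and return all of its character data.
function cdata_read(string $xml): string
{
    $buffer = '';

    // Declare UTF-8 as the input encoding and as the encoding in which
    // character data is handed to the handler, so bytes pass through unchanged.
    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');

    // The handler may fire several times for one text node, so append.
    xml_set_character_data_handler($parser, function ($p, string $data) use (&$buffer) {
        $buffer .= $data;
    });

    if (xml_parse($parser, $xml, true) !== 1) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }

    return $buffer;
}

$declaration = '<?xml version="1.0" encoding="UTF-8"?>';
$input = "\u{00E9}eek!"; // "éeek!" as UTF-8 bytes

// First pass: wrap and parse. Second pass: feed the output back in.
$once  = cdata_read($declaration . '<item>' . cdata_wrap($input) . '</item>');
$twice = cdata_read($declaration . '<item>' . cdata_wrap($once) . '</item>');

var_dump($once === $input && $twice === $input);
```

The point of the sketch is that the declared encoding, the encoding passed
to xml_parser_create(), and the target encoding all agree. If the source
data were actually ISO-8859-1 rather than UTF-8, it would need converting
(or the declarations changing to match) before parsing.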

michal migurski- contact info and pgp key:
sf/ca            http://mike.teczno.com/contact.html
