[thelist] Removing Microsoft Word special characters

rudy rudy937 at rogers.com
Thu Sep 18 10:25:41 CDT 2003


>  $body = ereg_replace(149, "&#149;", $body); // bullet
>  ... and others

kris, you may want to reconsider mapping away from 
perfectly good code values to microsoftisms...

  "MS Windows introduced a group of codings in 
   which these code positions [128-159] were used 
   for printable characters, some of which are much 
   in demand with certain authors: the trademark 
   glyph, matched quotes and so forth. These are 
   the encodings such as "code page" 1252. It would 
   appear to be protocol-correct to offer documents 
   in these encodings, with 8-bit characters in that 
   range, as long as they are sent with an appropriate 
   charset value and the recipient accepts this charset 
   encoding. THAT IS NOT AT ALL THE SAME THING AS 
   ATTEMPTING TO REPRESENT THOSE CHARACTERS BY NUMERIC 
   CHARACTER REFERENCES SUCH AS &#153; AS ONE SO OFTEN 
   SEES. The meaning of the latter construct is undefined 
   (N.B: not "illegal", but "undefined") in standard 
   HTML: the protocol-correct representation of a trademark 
   as a numeric character reference is in fact &#8482; 
   as can be seen in the W3C reference already cited; and 
   correspondingly for the matched quotes and such."  
     -- http://ppewww.ph.gla.ac.uk/~flavell/charset/internat.html
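
to make the distinction concrete, here is a rough sketch (in python rather
than kris's php, and not anybody's production code) of one way to do the
mapping the quote calls protocol-correct: treat the pasted Word text as
windows-1252 bytes, decode it, and write every non-ASCII character as a
numeric character reference to its Unicode code point. so the trademark
byte 153 comes out as &#8482; rather than &#153;. the function name
cp1252_to_ncr is made up for illustration, and the sketch assumes the input
really is windows-1252.

```python
def cp1252_to_ncr(raw: bytes) -> str:
    """Decode Windows-1252 bytes and return ASCII-only text in which every
    non-ASCII character is a Unicode numeric character reference.

    Assumption: the input is Windows-1252. The five byte values that
    code page leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) raise
    UnicodeDecodeError here rather than being guessed at.
    """
    text = raw.decode("cp1252")
    # "xmlcharrefreplace" writes each unencodable character as &#NNNN;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    sample = b"\x93Widget\x99\x94 \x95 item"  # curly quotes, TM sign, bullet
    print(cp1252_to_ncr(sample))
```

note this also rewrites characters in the 160-255 range as references,
which is harmless. the other route the quote mentions, sending the bytes
untouched with an appropriate windows-1252 charset label, needs no
conversion at all.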

see adrian's seminal article  http://evolt.org/entities

rudy


