[thelist] Javascript regular expression question

liorean liorean at gmail.com
Thu Apr 19 21:06:39 CDT 2007


On 19/04/07, Judah McAuley <judah at wiredotter.com> wrote:
> This particular part of the application is a front end that produces an
> xml document, so the character set is actually UTF-8 and I'm allowing
> non-ascii characters, so the "strip out the rest" won't quite work,
> alas. A good idea though. Basically, I'm trying to strip out things that
> will cause problems when inserted into a CDATA block. As it turns out,
> control characters will wreak havoc with xml parsers even inside a CDATA
> block.

Control characters yes. Horizontal tab, no. I suggest you take the XML
allowed characters set and invert it. As per the XML 1.0 spec, that
is:

    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

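Inverted, the JS regex pattern for matching those XML-illegal chars should be:
    [\u0000-\u0008\u000b\u000c\u000e-\u001f\ud800-\udfff\ufffe\uffff]

One caveat: JS strings are sequences of UTF-16 code units, so the
\ud800-\udfff range also matches both halves of every surrogate pair,
which is how the legal #x10000-#x10FFFF range is represented. If you
expect characters from that range, match only unpaired surrogates.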
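
A minimal sketch of how that character class could be used to strip the
offending characters before building the CDATA block. The helper names
stripXmlIllegal and stripKeepingPairs are made up for illustration. The
second pattern is a variant, not the pattern given above: it assumes an
engine with lookbehind support (ES2018 or later) and only removes
surrogates that are not part of a valid pair.

```javascript
// Character class from the message: code units outside the XML 1.0 Char set.
// The g flag makes replace() remove every match, not just the first one.
var xmlIllegal = /[\u0000-\u0008\u000b\u000c\u000e-\u001f\ud800-\udfff\ufffe\uffff]/g;

// Hypothetical helper: drop everything the class above matches.
function stripXmlIllegal(text) {
    return text.replace(xmlIllegal, "");
}

// Variant (assumption: lookbehind is available). JS strings are UTF-16,
// so characters above U+FFFF are stored as a high + low surrogate pair.
// This pattern leaves valid pairs alone and matches only lone surrogates.
var xmlIllegalKeepPairs = new RegExp(
    "[\\u0000-\\u0008\\u000b\\u000c\\u000e-\\u001f\\ufffe\\uffff]" +
    "|[\\ud800-\\udbff](?![\\udc00-\\udfff])" +   // high surrogate with no low after it
    "|(?<![\\ud800-\\udbff])[\\udc00-\\udfff]",   // low surrogate with no high before it
    "g"
);

// Hypothetical helper: same idea, but keeps valid surrogate pairs.
function stripKeepingPairs(text) {
    return text.replace(xmlIllegalKeepPairs, "");
}

var dirty = "caf\u00e9\u0003 tab:\t end\u000b";
var clean = stripXmlIllegal(dirty);   // "caf\u00e9 tab:\t end"
```

Tab, LF and CR pass through untouched, as do non-ASCII characters such
as the e-acute in the sample string.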

> Hmmm...might be a problem with pasting the control character then. I was
> trying to paste in x03. This character actually ended up in an error on
> this app which is why I'm testing it. I still don't know how the end
> user got it in there, but there it is.

Then, instead of pasting, try using string escapes and see if the regex
catches it. Remember that the browser's form control input code, the
browser's cut and paste code, the OS clipboard and the program that put
the text on the clipboard may each do some kind of filtering or
normalisation.
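
Something along these lines would take the clipboard out of the
equation. It is only a sketch: xmlIllegalProbe and dumpCodeUnits are
names invented here, and the sample string is arbitrary.

```javascript
// Same character class as above, without the g flag so that test()
// does not carry lastIndex state from one call to the next.
var xmlIllegalProbe = /[\u0000-\u0008\u000b\u000c\u000e-\u001f\ud800-\udfff\ufffe\uffff]/;

// Build the control character with a string escape instead of pasting it.
// "\x03" and "\u0003" denote the same character.
var sample = "before\x03after";
var caught = xmlIllegalProbe.test(sample);   // true if the regex sees it

// Hypothetical helper: list the UTF-16 code units of a string in hex,
// handy for checking what actually arrived in a form field after a paste.
function dumpCodeUnits(text) {
    var out = [];
    for (var i = 0; i < text.length; i++) {
        out.push(text.charCodeAt(i).toString(16));
    }
    return out;
}

var units = dumpCodeUnits("a\x03b");   // ["61", "3", "62"]
```

If the escaped string is caught but the pasted one is not, dumping the
code units of the field's value should show whether the character was
dropped or replaced somewhere along the way.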

> I'll try putting the regexes in place and see how it goes.

Probably works superbly.
-- 
David "liorean" Andersson


