[thelist] RegExp for name validation...

Andrew Seguin asegu at borg.darktech.org
Mon Dec 15 19:42:54 CST 2003


Ok, due to your comment below...

>>>...<<<
> And, because of security protocols, we are to define *allowed*
> characters, not define *disallowed* characters. The first is really a
> small sub-set, and easier to define and secure.
<<<...>>>

...I go 'Ahhhh'. Ok, that clears it up a bit. Yup, limited character sub-sets
are safer for future security (you never know when the next character you
allow through could be combined with a buffer overflow vulnerability and up
goes your software...).

hmm. In that case I think...

[alpha][alphanum or punct or space][alphanum] could be a generic starting
point for representing the data... after all, I take it you don't really
mind the order of the names then... hmm.
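
A minimal sketch of that generic form as a PHP regular expression, for basic
Latin only. The quantifier on the middle class (zero or more) is my assumption,
since the shorthand above doesn't spell one out, and looksLikeName() is a
hypothetical helper name:

```php
<?php
// Allowed sets, basic Latin only for now.
$alpha = "a-zA-Z";
$num   = "0-9";
$punct = ",'\\-";   // hyphen escaped so it stays literal inside [...]

// [alpha][alphanum or punct or space]*[alphanum]
// Assumption: the middle class may repeat zero or more times.
// \z anchors at the very end, so a trailing newline is rejected too.
$genericForm = '/^[' . $alpha . '][' . $alpha . $num . $punct . ' ]*[' . $alpha . $num . ']\z/';

// Hypothetical helper: true when the input fits the generic form.
function looksLikeName(string $input, string $pattern): bool
{
    return preg_match($pattern, $input) === 1;
}
```

Note that, as written, this form needs at least two characters, so a
one-letter name would be rejected.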

Then, I'd sit down with a character table based on your input encoding
(ANSI? ASCII? ISO 8859-1? => ex: éâô   ISO 8859-2? => ex: šđčćž .. Greek
characters, etc.) and, for each major set (Latin, Latin extended - western
Europe, Cyrillic, Greek, Hebrew, etc.) that you can recognize, work out what
is alpha and what is not. I'd then define variables that define the ranges
(this better demonstrates to people reading the code what is meant to be
allowed):

  //Basic latin character set.
  $latinAlpha = "a-zA-Z";
  $latinNum   = "0-9";
  $latinPunctuationAllowed = ",'-";

  //Croatian as sample beyond-latin character set.
  $croatianAlpha = "abcćčdđ...zž";
  $croatianNum = $latinNum;
  $croatianPunctuationAllowed = $latinPunctuationAllowed;
  $cyrilicAlpha = "...";
  $cyrilicNum...
  ...

  $alpha = $latinAlpha . $croatianAlpha . $cyrilicAlpha ...;
  $num   = $latinNum   . $croatianNum   . $cyrilicNum   ...;
  ...

From that, I'd then validate the input against the very generic form (as above).
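
Roughly, combining the sets into one pattern could look like the sketch
below. It assumes the source file and the input are both UTF-8, and the
Croatian list here is only an illustrative subset (the letters outside basic
Latin), not a full table. matchesGenericForm() is a made-up name:

```php
<?php
// Illustrative subsets only; a real table would list every letter you accept.
$latinAlpha    = 'a-zA-Z';
$croatianAlpha = 'čćđšžČĆĐŠŽ';   // Croatian letters outside basic Latin
$num           = '0-9';
$punct         = ',\'\\-';        // comma, apostrophe, literal hyphen

$alpha = $latinAlpha . $croatianAlpha;

// The "u" modifier makes PCRE treat pattern and subject as UTF-8, so each
// accented letter counts as one character instead of a pair of bytes.
$pattern = '/^[' . $alpha . '][' . $alpha . $num . $punct . ' ]*[' . $alpha . $num . ']\z/u';

// Hypothetical helper: true when the input fits the generic form.
// Invalid UTF-8 makes preg_match() return false, which is also rejected.
function matchesGenericForm(string $input, string $pattern): bool
{
    return preg_match($pattern, $input) === 1;
}
```

Anything not listed in one of the sets (for example "ü" here) is refused,
which is the whole point of defining allowed characters.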

Note: an idea that might be of use to you is a character map tool, to get
descriptions quickly (Windows 'charmap' comes to mind) for associating what
a character looks like, what language [set] it belongs to, and whether or
not it can be considered alpha (ex: 'ώ', 'U+03CE', 'Greek Small Letter
Omega with Tonos').
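
If you'd rather ask the regex engine than look each character up by hand,
PCRE's Unicode properties can answer the same two questions (is it a letter,
and which script). This is a different technique from the charmap lookup,
just a sketch; describeChar() and the short script list are my own
placeholders:

```php
<?php
// Hypothetical helper: reports whether a single UTF-8 character is a letter
// and which of a few scripts it belongs to.
function describeChar(string $char): array
{
    // Only a handful of scripts are checked; extend as needed.
    $scripts = ['Latin', 'Greek', 'Cyrillic', 'Hebrew'];
    $script = 'other';
    foreach ($scripts as $name) {
        if (preg_match('/^\p{' . $name . '}\z/u', $char) === 1) {
            $script = $name;
            break;
        }
    }
    return [
        'isAlpha' => preg_match('/^\p{L}\z/u', $char) === 1,  // \p{L} = any letter
        'script'  => $script,
    ];
}
```

For example, describeChar('ώ') reports a letter in the Greek script, while a
digit such as '5' is not a letter and falls under 'other'.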


Then, once that is ok (it can be, to my knowledge, a quicker test than
full validation through all rules immediately), I'd go through some of the
more precise tests (as defined by your needs/interests/client's
interests/other sources). I don't see this character-list idea as being
highly performant though (the list of characters could get long
quickly enough, I guess)... but maybe it could lead to something that
performs better?
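
The two-stage idea might be sketched like this. The generic form is kept to
basic Latin for brevity, and the two "precise" rules are hypothetical
placeholders; substitute whatever your project actually requires:

```php
<?php
// Stage 1: cheap generic-form check (basic Latin only, to keep this short).
// Stage 2: hypothetical stricter rules, run only if stage 1 passes.
function validateName(string $input): bool
{
    $generic = '/^[a-zA-Z][a-zA-Z0-9,\'\\- ]*[a-zA-Z0-9]\z/';
    if (preg_match($generic, $input) !== 1) {
        return false;               // rejected early, later rules never run
    }

    // Placeholder rules: each pattern describes something NOT allowed.
    $forbiddenPatterns = [
        '/[,\'\\-]{2,}/',           // two punctuation marks in a row
        '/ {2,}/',                  // repeated spaces
    ];
    foreach ($forbiddenPatterns as $forbidden) {
        if (preg_match($forbidden, $input) === 1) {
            return false;
        }
    }
    return true;
}
```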


I will say that I usually avoid the characters ', ", &, ; and \ like the
plague in my higher 'security' projects (not that most of my projects get
security considerations from management)... all depending on the
protocol.
If you do allow them, be careful how you deal with them... although I'm
guessing you know that already and I'm just wasting list space.. *stops
with a sigh thinking of past projects*
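
As one example of "being careful", here is a small sketch for the case where
the name ends up in an HTML page. That destination and the UTF-8 encoding are
assumptions on my part, and escapeForHtml() is a made-up wrapper around PHP's
htmlspecialchars():

```php
<?php
// Assumption: the name is written into an HTML page encoded as UTF-8.
function escapeForHtml(string $name): string
{
    // ENT_QUOTES converts both ' and " in addition to &, < and >.
    return htmlspecialchars($name, ENT_QUOTES, 'UTF-8');
}
```

This only covers HTML output. The ; and \ characters pass through untouched
here, so anything headed for a database or a shell needs that destination's
own escaping or parameter mechanism instead.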

Again, good luck. Sounds like a very interesting project.
Hope that wasn't too much for you..
Andrew



