[thelist] RegEx problem again

Kasimir K evolt at kasimir-k.fi
Mon Aug 8 12:11:42 CDT 2005


Tom Dell'Aringa wrote on 2005-08-08 15:14:
> - Minimum 3 characters
> - Only alpha characters
> - Can include a wildcard character of '*' (in which case you would
> require 4 characters)
> 
> To that end I had this:
> 
> var namematch = /^[a-zA-Z]{3,}$/.test(nametext.value);
> namematch = (namematch  || /^[a-zA-Z\*]{4,}$/.test(nametext.value));
> 
> Which worked just fine, till we realized we need to allow one or more
> SPACES as well! How do you add a space to the pattern? (and allow for
> as many spaces as they like?)

Where will the spaces be allowed? And how many consecutive spaces?

Kowalkowski, Lee (ASPIRE) wrote on 2005-08-08 15:34:
 > Just add a space to the character sets, so [a-zA-Z] becomes [a-zA-Z ].
 > Note the space before the closing square bracket.

This would allow names like '  a' and even '   '.

You could use: /^[a-zA-Z][a-zA-Z ]{2,}$/
First require one a-zA-Z, then allow spaces too. This would still match 
'a  ' though (a letter followed by two spaces).
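
To make the difference between the two patterns concrete, here is a 
small sketch that runs both against a few sample strings. The variable 
names and the sample list are only illustrative.

```javascript
// Lee's suggestion: letters and spaces allowed in any position.
var spacesAnywhere = /^[a-zA-Z ]{3,}$/;
// Variant from this message: a letter first, then letters or spaces.
var letterFirst = /^[a-zA-Z][a-zA-Z ]{2,}$/;

// Illustrative inputs: normal names, leading spaces, only spaces,
// and a single letter followed by two spaces.
var samples = ['Tom', 'Tom D', '  a', '   ', 'a  '];

for (var i = 0; i < samples.length; i++) {
  var s = samples[i];
  console.log(JSON.stringify(s) +
    ' anywhere=' + spacesAnywhere.test(s) +
    ' letterFirst=' + letterFirst.test(s));
}
```

The letter-first variant rejects '  a' and '   ', but 'a  ' still gets 
through.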

How about requiring that a space always follows a letter:
/^[a-zA-Z]([a-zA-Z]+[ ]?){2,}$/
First one a-zA-Z, then at least two repetitions of one or more a-zA-Z 
followed by an optional space.
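
A minimal sketch of that "space only after a letter" idea, written with 
+ so that every repetition has to consume at least one letter. The 
function name isValidName is made up for this example.

```javascript
// One letter, then two or more groups of "one or more letters plus an
// optional single space". A space can therefore only follow a letter.
var letterThenSpace = /^[a-zA-Z]([a-zA-Z]+[ ]?){2,}$/;

// Hypothetical helper, not part of Tom's original code.
function isValidName(text) {
  return letterThenSpace.test(text);
}

var samples = ['Tom', 'Tom Dell', 'ab', 'a  ', 'Tom  Dell', 'Tom '];
for (var i = 0; i < samples.length; i++) {
  console.log(JSON.stringify(samples[i]) + ' -> ' + isValidName(samples[i]));
}
```

Two things to keep in mind: a single trailing space ('Tom ') is still 
accepted, and a quantified group that itself contains a + can backtrack 
heavily on long input that does not match, so it may be worth limiting 
the input length before testing.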

Even that would leave something to be desired... it would not match 
Dell'Aringa, because of the ' character...

How about names like Rättö or Piña? 'ä', 'ö' and 'ñ' are not in between 
'a' and 'z'...

You could use \w to match any alphanumeric character, but that has some 
problems: it would include numbers and underscores, and it works 
differently in different browsers... (put the following in your 
browser's address bar: javascript:alert(/[\w]{2,}/.test('ññ')); - in 
MSIE you get false, in FF true...)
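
The part about digits and underscores is easy to check. A small sketch 
(the variable names are only illustrative, and it deliberately leaves 
out the non-ASCII case, since that is the part that varies):

```javascript
// \w accepts ASCII letters, digits and the underscore.
var wordChars = /^\w{3,}$/;

var results = {
  letters: wordChars.test('Tom'),     // plain letters
  digits: wordChars.test('T0m'),      // contains a digit
  underscore: wordChars.test('a_b'),  // contains an underscore
  space: wordChars.test('a b')        // a space is not a word character
};

console.log(results);
```

So \w lets 'T0m' and 'a_b' through, which is not what you want for a 
names-only field.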

A bit of a tricky situation, isn't it?

If you need to match names written with characters other than A-Z and 
a-z, you could use Unicode notation: [\u0041-\u005A] matches characters 
with a code point between hex 41 and 5A (Latin capital letters A 
through Z). Note the lowercase u: in JavaScript \U is not an escape.

At http://www.unicode.org/Public/UNIDATA/UnicodeData.txt you'll find the 
codepoints (among other data :-)
At http://www.unicode.org/Public/UNIDATA/Blocks.txt you'll find the 
blocks (Hebrew is between 0590..05FF and Arabic 0600..06FF)
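
As a rough sketch of how such ranges could be put to work, the pattern 
below adds the accented letters of the Latin-1 Supplement block and the 
apostrophe to the "letter first" pattern. The choice of ranges and the 
names LETTER and isPlausibleName are assumptions made for this example; 
other scripts would need their own ranges from Blocks.txt.

```javascript
// Assumption: ASCII letters plus the Latin-1 Supplement letters
// U+00C0..U+00FF are enough. U+00D7 (multiplication sign) and
// U+00F7 (division sign) sit in that range but are not letters,
// so the range is split around them.
var LETTER = 'a-zA-Z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u00FF';

// A letter first, then at least two more letters, apostrophes or spaces.
var namePattern = new RegExp("^[" + LETTER + "][" + LETTER + "' ]{2,}$");

// Hypothetical helper, not from the original thread.
function isPlausibleName(text) {
  return namePattern.test(text);
}

var samples = ["Dell'Aringa", 'Rättö', 'Piña', 'R2D2'];
for (var i = 0; i < samples.length; i++) {
  console.log(samples[i] + ' -> ' + isPlausibleName(samples[i]));
}
```

This only covers precomposed characters: an 'ä' entered as 'a' followed 
by a combining diaeresis (U+0308) would not match.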


Sorry Tom if this confuses more than it helps... but once I started to 
think about Dell'Aringa, Rättö and Piña I just couldn't stop... And I 
too would love to hear the RegEx gurus' views on this.

.k

