[thelist] RegEx problem again

Kasimir K evolt at kasimir-k.fi
Mon Aug 8 15:00:32 CDT 2005



> Tom Dell'Aringa wrote on 2005-08-08 15:14:
>>- Minimum 3 characters
>>- Only alpha characters
>>- Can include a wildcard character of '*' (in which case you would
>>require 4 characters)

>>we need to allow one or more SPACES as well!

Kasimir K wrote on 2005-08-08 17:11:
> Even that would leave a bit to be desired... that would not match 
> Dell'Aringa, because of the ' character...
> 
> How about names like Rättö or Piña? 'ä', 'ö' and 'ñ' are not in between 
> 'a' and 'z'...

I gave this another thought, along these lines:
- first character must be a letter
- following characters may be
   - a letter
   - a space not followed by a space nor an apostrophe
   - an apostrophe, not followed by an apostrophe
- optionally ending with an asterisk

Let's first forget Rättö and Piña to keep things simple :-)
- start of the string: ^
- a letter: [A-Za-z]

easy so far, let's forget the middle part and look at the end
- optional asterisk: \*?
- end of the string: $

now the middle part
- at least two characters, be they letters, spaces or apostrophes: (){2,}
- inside those parentheses three options: ()|()|()
- first option, a letter (we don't actually need the parentheses around 
this): [A-Za-z]
- second option, a space not followed by a space nor an apostrophe, wow, 
  this calls for a negative lookahead (we wouldn't need parentheses here 
either, but I keep them for clarity): ( (?![ ']))
- the third option is similar: ('(?!'))
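
To see what the two lookaheads do on their own, here is a minimal sketch 
in Python (the language is my assumption for illustration, and the names 
SPACE_OK, APOSTROPHE_OK and positions are made up). It lists the 
positions where each option can match:

```python
import re

# A space that is not followed by another space or an apostrophe.
SPACE_OK = re.compile(r" (?![ '])")
# An apostrophe that is not followed by another apostrophe.
APOSTROPHE_OK = re.compile(r"'(?!')")

def positions(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]

print(positions(SPACE_OK, "a b"))        # [1]: the single space is fine
print(positions(SPACE_OK, "a  b"))       # [2]: only the second space qualifies
print(positions(SPACE_OK, "a 'b"))       # []: space followed by an apostrophe
print(positions(APOSTROPHE_OK, "d'a"))   # [1]
print(positions(APOSTROPHE_OK, "d''a"))  # [2]: the first apostrophe is rejected
```

Note that a lookahead only rejects the first character of a doubled 
pair. Once the expression is anchored with ^ and $, that is enough to 
reject the whole string, because the first space or apostrophe of the 
pair cannot be consumed by any of the three options.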

now we just put all these together:


^[A-Za-z]([A-Za-z]|( (?![ ']))|('(?!'))){2,}\*?$
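
As a quick sanity check, the expression can be run against a few search 
terms. This is only a sketch in Python (my choice for illustration, the 
thread does not name a language); NAME and is_valid are hypothetical 
names:

```python
import re

# The ASCII-only expression from above, unchanged.
NAME = re.compile(r"^[A-Za-z]([A-Za-z]|( (?![ ']))|('(?!'))){2,}\*?$")

def is_valid(term):
    """True if term satisfies the rules listed above."""
    return NAME.match(term) is not None

samples = [
    "Tom",          # True: three letters
    "Dell'Aringa",  # True: single apostrophe between letters
    "De la Cruz",   # True: single spaces between words
    "Tom*",         # True: three letters plus the wildcard
    "To",           # False: too short
    "To*",          # False: the asterisk does not count toward the minimum
    "Tom  Smith",   # False: two spaces in a row
    "O''Neil",      # False: two apostrophes in a row
    "Rättö",        # False: non-ASCII letters are not covered yet
]
for term in samples:
    print(repr(term), is_valid(term))
```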


Let's now get back to Rättö and Piña (and their more exotic friends :-). 
We do that by including letters from the following Unicode blocks:

0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions

The lower end we can handle with [A-Za-z]. The Latin-1 Supplement block 
starts with control characters and (mostly) symbols in 0080..00BF, so 
the range of additional letters becomes [\u00C0-\u02AF]. Our regular 
expression is then (for clarity, and to avoid random line breaks, I've 
added line breaks of my own; remove them before use):

^[A-Za-z\u00C0-\u02AF]
(
[A-Za-z\u00C0-\u02AF]
|
( (?![ ']))
|
('(?!'))
){2,}\*?$
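
If your regex engine has a "verbose" or "extended" mode, the line breaks 
can stay in. Here is a sketch using Python's re.VERBOSE flag (Python is 
my assumption, and NAME and is_valid are made-up names). In this mode 
unescaped whitespace outside a character class is ignored, so the 
literal space has to be written as [ ]:

```python
import re

NAME = re.compile(
    r"""
    ^[A-Za-z\u00C0-\u02AF]      # first character: a letter
    (
      [A-Za-z\u00C0-\u02AF]     # a letter
      |
      ([ ](?![ ']))             # a space, not followed by a space or apostrophe
      |
      ('(?!'))                  # an apostrophe, not followed by an apostrophe
    ){2,}
    \*?$                        # optional asterisk, then end of string
    """,
    re.VERBOSE,
)

def is_valid(term):
    """True if term satisfies the rules, including accented Latin letters."""
    return NAME.match(term) is not None

print(is_valid("Rättö"))       # True
print(is_valid("Piña"))        # True
print(is_valid("Tom  Smith"))  # False: two spaces in a row
print(is_valid("a×b"))         # True: the multiplication sign is inside the range
```

This assumes the input uses precomposed characters (a single code point 
for 'ä' or 'ñ'); a base letter followed by a combining accent would fall 
outside the range.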

In our range we still have two non-letter characters:
00F7;division sign
00D7;multiplication sign

If you want to get rid of them, you have to change
[A-Za-z\u00C0-\u02AF]
to
[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02AF]
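
A small sketch of the full expression with the two signs excluded, again 
in Python as an assumed language (LETTER, NAME and is_valid are 
hypothetical names). Each sub-range is written in ascending code point 
order, which regex engines require inside a character class:

```python
import re

# Letters: ASCII plus U+00C0..U+02AF, skipping U+00D7 (multiplication sign)
# and U+00F7 (division sign).
LETTER = r"A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02AF"

NAME = re.compile(
    "^[" + LETTER + "]"
    "([" + LETTER + r"]|( (?![ ']))|('(?!'))){2,}\*?$"
)

def is_valid(term):
    """True if term satisfies the rules and contains no math signs."""
    return NAME.match(term) is not None

print(is_valid("Rättö"))  # True
print(is_valid("Piña"))   # True
print(is_valid("a×b"))    # False: multiplication sign excluded
print(is_valid("a÷b"))    # False: division sign excluded
```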

For now I won't dwell on scripts other than Latin, as I'm not that good 
with Chinese punctuation :-)

It wasn't that tricky after all: with a little help from lookaheads and 
Unicode we ended up with a fairly concise RegExp.

.k

