[thelist] RegEx problem again
Kasimir K
evolt at kasimir-k.fi
Mon Aug 8 15:00:32 CDT 2005
> Tom Dell'Aringa wrote on 2005-08-08 15:14:
>>- Minimum 3 characters
>>- Only alpha characters
>>- Can include a wildcard character of '*' (in which case you would
>>require 4 characters)
>>we need to allow one or more SPACES as well!
Kasimir K wrote on 2005-08-08 17:11:
> Even that would leave a bit to hope for... that would not match
> Dell'Aringa, because of the ' character...
>
> How about names like Rättö or Piña? 'ä', 'ö' and 'ñ' are not in between
> 'a' and 'z'...
I gave this another thought, along these lines:
- first character must be a letter
- following characters may be
  - a letter
  - a space, not followed by a space or an apostrophe
  - an apostrophe, not followed by an apostrophe
- optionally ending with an asterisk
Let's first forget Rättö and Piña to keep things simple :-)
- start of the string: ^
- a letter: [A-Za-z]
easy so far, let's forget the middle part and look at the end
- optional asterisk: \*?
- end of the string: $
now the middle part
- at least two characters, be it letters, spaces or apostrophes: (){2,}
- inside those parentheses three options: ()|()|()
- first option, a letter (we don't actually need the parentheses around
this): [A-Za-z]
- second option, a space not followed by a space or an apostrophe, wow,
this calls for a negative lookahead (we wouldn't need parentheses here
either, but I keep them for clarity): ( (?![ ']))
- the third option is similar: ('(?!'))
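A small sketch of those two lookahead alternatives in isolation, assuming Python's re module (the thread names no particular engine, and allowed_at is a hypothetical helper made up for illustration). It checks whether the space or apostrophe at a given position passes its negative lookahead:

```python
import re

# The two lookahead alternatives from the middle part, compiled on their own.
SPACE_OK = re.compile(r" (?![ '])")    # a space not followed by a space or an apostrophe
APOSTROPHE_OK = re.compile(r"'(?!')")  # an apostrophe not followed by an apostrophe


def allowed_at(pattern, text, pos):
    """Return True if `pattern` matches `text` starting exactly at index `pos`.

    Hypothetical helper, not part of the original post.
    """
    return pattern.match(text, pos) is not None


if __name__ == "__main__":
    print(allowed_at(SPACE_OK, "Van Der", 3))           # single space
    print(allowed_at(SPACE_OK, "Van  Der", 3))          # first of two spaces
    print(allowed_at(APOSTROPHE_OK, "Dell'Aringa", 4))  # lone apostrophe
    print(allowed_at(APOSTROPHE_OK, "O''Neil", 1))      # first of two apostrophes
```

The lookahead consumes nothing, so each alternative still matches exactly one character; it only peeks at what follows.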
now we just put all these together:
^[A-Za-z]([A-Za-z]|( (?![ ']))|('(?!'))){2,}\*?$
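To try the assembled expression against a few inputs, here is a minimal sketch, again assuming Python's re module; is_valid is a hypothetical name, and the sample strings are my own, not from the original requirements:

```python
import re

# The ASCII-only pattern exactly as assembled above.
NAME_RE = re.compile(r"^[A-Za-z]([A-Za-z]|( (?![ ']))|('(?!'))){2,}\*?$")


def is_valid(term):
    """Return True if `term` satisfies the pattern.

    fullmatch is used so that Python's `$`, which also matches just
    before a trailing newline, cannot let a stray newline through.
    """
    return NAME_RE.fullmatch(term) is not None


if __name__ == "__main__":
    samples = ["Tom", "To", "To*", "Tom*", "Dell'Aringa",
               "Van Der Berg", "De  ll", "O''Neil"]
    for term in samples:
        print(repr(term), is_valid(term))
```

Because the asterisk is not one of the middle alternatives, a term with a wildcard needs one leading letter plus two middle characters before it, which gives the required four characters.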
Let's now get back to Rättö and Piña (and their more exotic friends :-).
We do that by including letters from the following Unicode blocks:
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions
The lower end we can handle with [A-Za-z]. Code points 0080..00BF are
mostly control characters and symbols, so the added range starts at 00C0
and becomes [\u00C0-\u02AF]. Our regular expression is then (for clarity
and to avoid random line breaks, I've added some; remove those before
use):
^[A-Za-z\u00C0-\u02AF]
(
[A-Za-z\u00C0-\u02AF]
|
( (?![ ']))
|
('(?!'))
){2,}\*?$
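One way to keep that multi-line layout without editing line breaks out by hand is to build the pattern from string pieces. This is only a sketch, assuming Python's re module (which understands \uXXXX escapes in text patterns); LETTER, PATTERN and is_valid are hypothetical names:

```python
import re

# Letter class: ASCII letters plus U+00C0..U+02AF, as in the post.
LETTER = r"[A-Za-z\u00C0-\u02AF]"

# Same structure as the multi-line layout, joined into one string
# so no line breaks or indentation end up inside the pattern.
PATTERN = (
    "^" + LETTER
    + "("
    + LETTER
    + "|"
    + r"( (?![ ']))"
    + "|"
    + r"('(?!'))"
    + r"){2,}\*?$"
)

NAME_RE = re.compile(PATTERN)


def is_valid(term):
    """Return True if the whole of `term` satisfies the pattern."""
    return NAME_RE.fullmatch(term) is not None


if __name__ == "__main__":
    for term in ["Rättö", "Piña", "Piña*", "Ñu", "Dell'Aringa"]:
        print(repr(term), is_valid(term))
```

Note that this assumes precomposed characters (for example 'ä' as the single code point U+00E4); a base letter followed by a combining mark falls outside the range.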
In our range we still have two non-letter characters:
00F7;division sign
00D7;multiplication sign
If you want to get rid of them, you have to change
[A-Za-z\u00C0-\u02AF]
to
[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02AF]
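A quick check of the narrowed class, sketched with Python's re module as an assumption (is_letter is a hypothetical helper). The ranges have to run in ascending code point order, so the multiplication sign (00D7) is cut out first and the division sign (00F7) second; a range written high-to-low is rejected by Python's re as a bad character range:

```python
import re

# ASCII letters plus U+00C0..U+02AF, skipping U+00D7 and U+00F7.
LETTER = r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02AF]"
LETTER_RE = re.compile(LETTER)


def is_letter(ch):
    """Return True if the single character `ch` is in the narrowed class."""
    return LETTER_RE.fullmatch(ch) is not None


if __name__ == "__main__":
    # The two signs that should now be excluded, and their neighbours.
    for ch in ["\u00d6", "\u00d7", "\u00d8", "\u00f6", "\u00f7", "\u00f8"]:
        print("U+%04X" % ord(ch), ch, is_letter(ch))
```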
For now I don't dwell on other than Latin scripts, as I'm not that good
with Chinese punctuation :-)
It wasn't that tricky after all: with a little help from lookaheads and
Unicode we ended up with a fairly concise RegExp.
.k