[thelist] Regular expression to check if other regular expressions are valid
VOLKAN ÖZÇELİK
volkan.ozcelik at gmail.com
Wed Jun 22 14:42:11 CDT 2005
Some not perfectly-related but useful info:
HTH,
Volkan.
Why computers cannot auto-generate regular expressions
======================================================
A lot of people seem to be looking for a program that can automatically
generate regular expressions for them. The program would only take
examples of valid matches as input. Unfortunately, no
computer program will ever be able to generate a meaningful regular
expression based purely on a list of valid matches. Let me show you
why.
Suppose you provide the examples 111111 and 999999, should the computer
generate:
1. A regex matching exactly those two examples: (?:111111|999999)
2. A regex matching 6 identical digits (\d)\1{5}
3. A regex matching 6 ones and nines [19]{6}
4. A regex matching any 6 digits \d{6}
5. Any of the above three, with word boundaries, e.g. \b\d{6}\b
6. Any of the first three, not preceded or followed by a digit, e.g.
(?<!\d)\d{6}(?!\d)
As you can see, there are many ways in which examples can be
generalized into a regular expression. The only way for the computer
to build a predictable regular expression is to require you to list
*all* possible matches. Then it could generate a search pattern that
matches exactly those matches.
If you don't want to list all possible matches, you need a higher-level
description. That's exactly what regular expressions are designed to
provide. Instead of providing a long list of 6-digit numbers, you
simply tell the program to match "any six digits". In regular
expression syntax, this becomes \d{6}.
Any method of providing a higher-level description that is as flexible
as regular expressions will also be as complex as regular expressions.
When you do have an exhaustive list of all possible matches, an
optimized plain text search processing the whole list at once will be
as fast as or faster than a regex search. The plain text search can be
optimized to scan the text only once, without backtracking like regular
expressions do.
If you need to search for multiple search terms through a string in an
application you're developing, check if your programming library
includes a function to search for an array of strings. Such a
function, if it uses a proper algorithm, will search as fast as or
faster than any regex engine. On top of that, there's no overhead in
compiling the regular expression.
(From regexbuddy newsletter 2005-06-22)
On 6/22/05, VOLKAN ÖZÇELİK <volkan.ozcelik at gmail.com> wrote:
> I've recently written a regular expression which matches a regular expression.
>
> /[^/\*](?<![/\S]/.)([^/\\\r\n]|\\.)*/(?=[ig]{0,2}[^\S])
>
> However it can match invalid reg exs as well. To verify the validity
> of a regular expression cannot be done simply with a regular
> expression imho.
>
> Take for example
>
> /1(?=(?=(?=1 ... )2)3)/
>
> which is a valid regular expression with infinitely many positive
> lookaheads. I've forgotten the stuff with theory of large numbers etc,
> but hope that an interested can prove that it is impossible to match
> such a regex with a regex.
>
> Cheers,
> Volkan.
>
> On 6/22/05, lustig at brandeis.edu <lustig at brandeis.edu> wrote:
> > Has anyone written a regular expression to check if other regular expressions
> > are valid? Because regular expressions can't implement full turing machines,
> > rice's theorum shouldn't apply to them. Has it been done?
> >
> > Jason
> > --
> >
> > * * Please support the community that supports you. * *
> > http://evolt.org/help_support_evolt/
> >
> > For unsubscribe and other options, including the Tip Harvester
> > and archives of thelist go to: http://lists.evolt.org
> > Workers of the Web, evolt !
> >
>
More information about the thelist
mailing list