[thelist] Regular expression to check if other regular expressions are valid

VOLKAN ÖZÇELİK volkan.ozcelik at gmail.com
Wed Jun 22 14:42:11 CDT 2005


Some not perfectly-related but useful info:

HTH,
Volkan.

Why computers cannot auto-generate regular expressions
======================================================

A lot of people seem to be  looking for a program that can automatically
generate  regular expressions for  them.   The program  would  only take
examples of  valid matches as  input.  Unfortunately, no
computer  program will ever  be  able to generate  a  meaningful regular
expression based  purely on a  list of valid matches.   Let me  show you
why.

Suppose you provide the examples  111111 and 999999, should the computer
generate:

1. A regex matching exactly those two examples: (?:111111|999999)
2. A regex matching 6 identical digits (\d)\1{5}
3. A regex matching 6 ones and nines [19]{6}
4. A regex matching any 6 digits \d{6}
5. Any of the above three, with word boundaries, e.g. \b\d{6}\b
6. Any of the first three, not preceded or followed by a digit, e.g.
  (?<!\d)\d{6}(?!\d)

As  you  can  see,  there  are  many  ways  in  which  examples  can  be
generalized into  a regular expression.   The only way for  the computer
to build  a predictable  regular expression  is to  require you  to list
*all* possible  matches.  Then it  could generate a search  pattern that
matches exactly those matches.

If you don't want to list  all possible matches, you need a higher-level
description.   That's exactly what  regular expressions are  designed to
provide.   Instead  of providing  a long  list of  6-digit numbers,  you
simply  tell  the  program  to  match "any  six  digits".    In  regular
expression syntax, this becomes \d{6}.

Any method of  providing a higher-level description that  is as flexible
as regular expressions  will also be as complex  as regular expressions.

When  you  do have  an  exhaustive  list  of  all possible  matches,  an
optimized plain  text search processing the  whole list at once  will be
as fast as or faster than a regex  search.  The plain text search can be
optimized to scan the text  only once, without backtracking like regular
expressions do.  

If you need to  search for multiple search terms through  a string in an
application  you're  developing,  check   if  your  programming  library
includes  a  function to  search  for  an  array  of  strings.   Such  a
function, if  it  uses a proper  algorithm,  will search as  fast  as or
faster than any  regex engine.  On  top of that, there's  no overhead in
compiling the regular expression.

(From regexbuddy newsletter 2005-06-22)

On 6/22/05, VOLKAN ÖZÇELİK <volkan.ozcelik at gmail.com> wrote:
> I've recently written a regular expression which matches a regular expression.
> 
> /[^/\*](?<![/\S]/.)([^/\\\r\n]|\\.)*/(?=[ig]{0,2}[^\S])
> 
> However it can match invalid reg exs as well. To verify the validity
> of a regular expression cannot be done simply with a regular
> expression imho.
> 
> Take for example
> 
> /1(?=(?=(?=1 ... )2)3)/
> 
> which is a valid regular expression with infinitely many positive
> lookaheads. I've forgotten the stuff with theory of large numbers etc,
> but hope that an interested can prove that it is impossible to match
> such a regex with a regex.
> 
> Cheers,
> Volkan.
> 
> On 6/22/05, lustig at brandeis.edu <lustig at brandeis.edu> wrote:
> > Has anyone written a regular expression to check if other regular expressions
> > are valid? Because regular expressions can't implement full turing machines,
> > rice's theorum shouldn't apply to them. Has it been done?
> >
> > Jason
> > --
> >
> > * * Please support the community that supports you.  * *
> > http://evolt.org/help_support_evolt/
> >
> > For unsubscribe and other options, including the Tip Harvester
> > and archives of thelist go to: http://lists.evolt.org
> > Workers of the Web, evolt !
> >
>


More information about the thelist mailing list