[Javascript] regexp - how to exclude a substring?

Paul Novitski paul at novitskisoftware.com
Mon May 23 17:41:20 CDT 2005


Shawn,

Consider the interesting problem of selecting a chunk of HTML based on a 
complex CSS-style selector:

         div p.intro span

Ideally, I'd be able to convert this into a single regular expression, 
something like:

         /<div[ >].*<p [^>]*class=\"[^\"]*intro[ \"].*<span/si

This will locate a <div followed by a <p class="intro" followed by a span 
-- but won't guarantee any parent-child-grandchild relationship between 
them.  It will match both of these:

         <div>
            <p class="intro">
               <span>
and:
         <div></div>
         <p class="intro"></p>
         <span>

That's why I've wanted to exclude a string in the regexp, not just a 
character.  However, it appears that I have hit the ceiling of what regular 
expressions can do in this area so I'll let go of that.
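
To make the false positive concrete, here is a small sketch that runs the pattern above against both fragments. The variable names are made up for illustration, and it assumes an engine that supports the "s" (dotAll) flag; where it doesn't, [\s\S] in place of the dot does the same job.

```javascript
// The pattern from the message: <div ... <p class="...intro..." ... <span
const selectorPattern = /<div[ >].*<p [^>]*class="[^"]*intro[ "].*<span/si;

// Fragment 1: span really is a grandchild of the div.
const nestedMarkup = '<div>\n   <p class="intro">\n      <span>';

// Fragment 2: three siblings, no parent-child relationship at all.
const siblingMarkup = '<div></div>\n<p class="intro"></p>\n<span>';

const matchesNested = selectorPattern.test(nestedMarkup);
const matchesSiblings = selectorPattern.test(siblingMarkup);

// Both report true, which is exactly the problem described above.
console.log(matchesNested, matchesSiblings);
```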


My current strategy is to initialize the template engine by walking the 
document and recording the lineage of each element:

<html>                          0:html
   <body>                        1:html body
     <div id="content">          2:html body div#content
        <h2>                     3:html body div#content h2
        </h2>
        <p class="intro">        4:html body div#content p.intro
          <span>                 5:html body div#content p.intro span
          </span>
        </p>
     </div>
     <ul id="nav" class="menu">  6:html body ul#nav.menu
...

(I wonder if this is how some rendering engines work internally, so they 
don't have to re-parse the tree repeatedly?)
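
A rough sketch of that walk, not the actual template engine: buildLineages is a hypothetical helper that scans tags with a regex and keeps a stack of open elements. It assumes well-formed markup in which every opening tag has a matching closing tag (no void elements such as <br>, no comments or scripts) and double-quoted id and class attributes.

```javascript
// Hypothetical helper: returns one "n:ancestor ancestor element" string
// per opening tag, in document order.
function buildLineages(html) {
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>/g;
  const openElements = []; // stack of labels for currently open elements
  const lineages = [];
  let m;
  while ((m = tagPattern.exec(html)) !== null) {
    const [, closing, name, attrs] = m;
    if (closing) {
      openElements.pop(); // assumes tags are properly nested
      continue;
    }
    const id = /\bid="([^"]*)"/i.exec(attrs);
    const cls = /\bclass="([^"]*)"/i.exec(attrs);
    let label = name.toLowerCase();
    if (id) label += "#" + id[1]; // #id comes before .classes
    if (cls) label += cls[1].trim().split(/\s+/).map(c => "." + c).join("");
    openElements.push(label);
    lineages.push(lineages.length + ":" + openElements.join(" "));
  }
  return lineages;
}

const sampleDocument =
  '<html><body><div id="content"><h2></h2>' +
  '<p class="intro"><span></span></p></div>' +
  '<ul id="nav" class="menu"></ul></body></html>';

const documentLineages = buildLineages(sampleDocument);
console.log(documentLineages.join("\n"));
```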

Then I can search those lineage strings for the matches I want.  Assuming 
that every tag name is preceded by a space, and that #ids come before 
.classes, then this regex should work to pinpoint the desired element:

         /(\d+):.* div.* p[^ ]*\.intro.* span/
will match:
         div p.intro span
in:
         5:html body div#content p.intro span


Then my parenthetical expression (\d+) will yield the key number, n'est-ce pas?
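
Here is a quick sketch of that lookup, using hypothetical names (sampleLineages, matchingKeys). Note that it writes the group as (\d+) with the plus sign inside the parentheses: with the plus outside, a repeated group keeps only its last repetition, so a key like 15 would come back as 5.

```javascript
// Lineage strings as they would come out of the document walk.
const sampleLineages = [
  "0:html",
  "1:html body",
  "2:html body div#content",
  "3:html body div#content h2",
  "4:html body div#content p.intro",
  "5:html body div#content p.intro span",
  "6:html body ul#nav.menu",
];

// "div p.intro span" expressed against the lineage format.
const introSpan = /(\d+):.* div.* p[^ ]*\.intro.* span/;

// Hypothetical helper: collect the numeric key of every matching lineage.
function matchingKeys(lineages, pattern) {
  const keys = [];
  for (const line of lineages) {
    const m = pattern.exec(line);
    if (m) keys.push(Number(m[1]));
  }
  return keys;
}

const introSpanKeys = matchingKeys(sampleLineages, introSpan);
console.log(introSpanKeys);

// Why the plus sign belongs inside the group:
const lastDigitOnly = /(\d)+:/.exec("15:html body")[1];
const wholeKey = /(\d+):/.exec("15:html body")[1];
console.log(lastDigitOnly, wholeKey);
```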

Paul


At 11:24 AM 5/23/2005, Shawn Milo wrote:
 > Maybe you can answer a more general question I have about regular
 > expressions: why, when you search for
 >          <div.*<\/div
 > does regexp return a string that stretches all the way to the last </div
 > found and not simply to the first one it encounters?

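Easy: It's called "greedy matching," and it's the default behavior of
quantifiers like * in most regex engines: they grab as much as they can
and give characters back only when the rest of the pattern fails to match.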

That's where I thought maybe something like a lookahead or lookbehind
might come in handy, because you can say something like:

<div, then a </div, without another </div before it
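
As a small illustration (a sketch on a one-line sample string, not a general HTML parser), here are three ways to write that match: the greedy original, a lazy quantifier, and a negative lookahead that refuses to step over another </div. JavaScript supports lookahead, so the third form works there. None of these handles nested divs.

```javascript
const sampleHtml = "<div>a</div><p>x</p><div>b</div>";

// Greedy: .* runs to the end, then backs up to the LAST "</div".
const greedyMatch = /<div.*<\/div/.exec(sampleHtml)[0];

// Lazy: .*? expands one character at a time and stops at the FIRST "</div".
const lazyMatch = /<div.*?<\/div/.exec(sampleHtml)[0];

// Lookahead: consume a character only if "</div" does not start there,
// i.e. "<div, then a </div, without another </div before it".
const lookaheadMatch = /<div(?:(?!<\/div).)*<\/div/.exec(sampleHtml)[0];

console.log(greedyMatch);
console.log(lazyMatch);
console.log(lookaheadMatch);
```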

As for your other comments, I'll re-read and see if a thought is born.

Shawn




