[Javascript] regexp - how to exclude a substring?
Paul Novitski
paul at novitskisoftware.com
Mon May 23 17:41:20 CDT 2005
Shawn,
Consider the interesting problem of selecting a chunk of HTML based on a
complex CSS-style selector:
div p.intro span
Ideally, I'd be able to convert this into a single regular expression,
something like:
/<div[ >].*<p [^>]*class=\"[^\"]*intro[ \"].*<span/si
This will locate a <div followed by a <p class="intro" followed by a span
-- but won't guarantee any parent-child-grandchild relationship between
them. It will match both of these:
<div>
<p class="intro">
<span>
and:
<div></div>
<p class="intro"></p>
<span>
That's why I've wanted to exclude a string in the regexp, not just a
character. However, it appears that I have hit the ceiling of what regular
expressions can do in this area, so I'll let go of that.
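To make the false positive concrete, here is a minimal sketch that runs the pattern against both fragments. It assumes [\s\S] as a stand-in for "any character, newlines included" (equivalent to what the s flag is meant to do), and the variable names are made up for illustration:

```javascript
// Same pattern as above, with [\s\S] used in place of "." plus the s flag.
const looseRe = /<div[ >][\s\S]*<p [^>]*class="[^"]*intro[ "][\s\S]*<span/i;

// Fragment 1: the span really is a grandchild of the div.
const nestedHtml = '<div>\n<p class="intro">\n<span>';

// Fragment 2: three siblings, no nesting at all.
const siblingHtml = '<div></div>\n<p class="intro"></p>\n<span>';

console.log(looseRe.test(nestedHtml));  // true
console.log(looseRe.test(siblingHtml)); // true, although nothing is nested
```

Both calls report a match because the pattern only checks the order in which the three opening tags appear, not whether the earlier elements are still open.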
My current strategy is to initialize the template engine by walking the
document recording the lineage of each element:
<html> 0:html
<body> 1:html body
<div id="content"> 2:html body div#content
<h2> 3:html body div#content h2
</h2>
<p class="intro"> 4:html body div#content p.intro
<span> 5:html body div#content p.intro span
</span>
</p>
</div>
<ul id="nav" class="menu"> 6:html body ul#nav.menu
...
(I wonder if this is how some rendering engines work internally, so they
don't have to re-parse the tree over and over?)
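A rough sketch of that walk, not the actual template engine: buildLineages is a hypothetical helper, and it assumes well-formed markup with double-quoted attributes, every element explicitly closed, and no comments, scripts, or void elements such as <br>.

```javascript
// Hypothetical helper: returns an array whose index is the key number and
// whose value is the lineage string, e.g. "html body div#content p.intro".
function buildLineages(html) {
  const tagRe = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>/g;
  const stack = [];     // descriptors of the elements that are currently open
  const lineages = [];
  let m;
  while ((m = tagRe.exec(html)) !== null) {
    const [, closing, name, attrs] = m;
    if (closing) {
      stack.pop();      // assumes tags are properly nested
      continue;
    }
    const id = /(?:^|\s)id="([^"]*)"/.exec(attrs);
    const cls = /(?:^|\s)class="([^"]*)"/.exec(attrs);
    let descriptor = name.toLowerCase();
    if (id) descriptor += '#' + id[1];           // #id comes first
    if (cls) {
      descriptor += cls[1].split(/\s+/).filter(Boolean)
        .map(c => '.' + c).join('');             // then .classes
    }
    stack.push(descriptor);
    lineages.push(stack.join(' '));
  }
  return lineages;
}

const sampleDoc = [
  '<html>', '<body>', '<div id="content">', '<h2>', '</h2>',
  '<p class="intro">', '<span>', '</span>', '</p>', '</div>',
  '<ul id="nav" class="menu">', '</ul>', '</body>', '</html>'
].join('\n');

const lineages = buildLineages(sampleDoc);
lineages.forEach((line, key) => console.log(key + ':' + line));
// last two lines printed:
// 5:html body div#content p.intro span
// 6:html body ul#nav.menu
```

A regex tokenizer like this is only good enough for a controlled template; arbitrary HTML would need a real parser.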
Then I can search those lineage strings for the matches I want. Assuming
that every tag name is preceded by a space, and that #ids come before
.classes, then this regex should work to pinpoint the desired element:
/(\d+):.* div.* p[^ ]*\.intro.* span/
will match:
div p.intro span
in:
5:html body div#content p.intro span
Then my parenthetical expression (\d+) will yield the key number, n'est-ce pas?
(The + goes inside the parentheses so that a multi-digit key is captured
whole; with (\d)+ the group would hold only the last digit.)
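A small sketch of the lookup, under a few assumptions: the capture group is written as (\d+) so that a multi-digit key comes back whole, the sample strings other than the first and last are made up, and matchingKeys is a hypothetical helper. The variant ending in $ is one possible way to reject lineages that continue past the span.

```javascript
// Lineage strings in the "key:lineage" format; entries 15 and 16 are invented.
const sampleLineages = [
  '5:html body div#content p.intro span',
  '15:html body div#main p.intro span',
  '16:html body div#main p.intro span em',
  '6:html body ul#nav.menu'
];

const lineageRe = /(\d+):.* div.* p[^ ]*\.intro.* span/;
const strictRe = /(\d+):.* div.* p[^ ]*\.intro.* span$/; // must end at the span

// Hypothetical helper: collect the captured key of every matching line.
function matchingKeys(re, lines) {
  return lines
    .map(line => re.exec(line))
    .filter(m => m !== null)
    .map(m => Number(m[1]));
}

console.log(matchingKeys(lineageRe, sampleLineages)); // [5, 15, 16]
console.log(matchingKeys(strictRe, sampleLineages));  // [5, 15]

// Why the quantifier belongs inside the group:
console.log(/(\d)+:/.exec('15:html')[1]); // '5'  (last repetition only)
console.log(/(\d+):/.exec('15:html')[1]); // '15'
```

Without the trailing $, the pattern also accepts any descendant of the span, since nothing stops the match from ending in the middle of the string.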
Paul
At 11:24 AM 5/23/2005, Shawn Milo wrote:
> Maybe you can answer a more general question I have about regular
> expressions: why, when you search for
> <div.*<\/div
> does regexp return a string that stretches all the way to the last </div
> found and not simply to the first one it encounters?
Easy: It's called "greedy matching," and it is the default behavior of
quantifiers like * in practically every regex engine.
That's where I thought maybe something like a lookahead or lookbehind
might come in handy, because you can say something like:
<div, then a </div, without another </div before it
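For illustration, a short sketch comparing the greedy pattern with two ways of stopping at the first </div: a lazy quantifier (.*?) and a negative lookahead checked before each character. The sample string and variable names are made up, and neither variant copes with nested divs.

```javascript
const twoDivs = '<div>one</div><div>two</div>';

// Greedy: .* grabs as much as it can, then backs up to the LAST </div.
const greedy = /<div.*<\/div/.exec(twoDivs)[0];

// Lazy: .*? grabs as little as it can, so it stops at the FIRST </div.
const lazy = /<div.*?<\/div/.exec(twoDivs)[0];

// Lookahead: consume a character only if a </div does not start there,
// i.e. "<div, then a </div, without another </div before it".
const guarded = /<div(?:(?!<\/div).)*<\/div/.exec(twoDivs)[0];

console.log(greedy);  // <div>one</div><div>two</div
console.log(lazy);    // <div>one</div
console.log(guarded); // <div>one</div
```

The lookahead form is the closest a regex gets to "exclude a substring": it tests for the unwanted string at every position instead of excluding a single character.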
As for your other comments, I'll re-read and see if a thought is born.
Shawn