[Javascript] Advanced regexp problem

Erik Beijnoff erik at addsystems.com
Mon May 21 11:24:50 CDT 2001


I have a rather tricky problem. I need to be able to take any given
HTML-string, parse it and split it up into smaller sections, depending on
the tag nesting.

Example tag:
Beforetags<div first><div second></div><div third><hr></div></div>Aftertags

Note how the tags are nested: one outer div with two consecutive divs inside
of it. Inside the second inner div, there's a <hr>. I want to match all the
innermost tags and ignore the outer one for the moment.

The point is this: if I can match all the innermost tags I can remove them
and then run the remaining string through the regexp and get out the next
set of tags until the string is completely broken up into matching tags.
Therefore I only need to match the innermost tags.
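
A minimal sketch of that peeling loop, assuming some global regexp that
matches innermost tag pairs is already available. The helper name peel and
its parameter names are hypothetical, and the pattern in the usage lines is
the simple one quoted further down in this post:

```javascript
// Hypothetical helper (not from the post): peel an HTML string layer by layer.
// `innermostPattern` must be a global (/g) regexp matching innermost elements.
function peel(html, innermostPattern) {
  const layers = [];
  let rest = html;
  for (;;) {
    const matches = rest.match(innermostPattern); // all matches, or null
    if (!matches) break;
    const stripped = rest.replace(innermostPattern, "");
    if (stripped === rest) break; // guard against patterns that match ""
    layers.push(matches);
    rest = stripped;
  }
  return { layers: layers, rest: rest };
}

const result = peel("a<b><i>x</i></b>c", /<(\w+)[^>]*>[^<]*<\/\1>/ig);
console.log(result.layers); // [ [ '<i>x</i>' ], [ '<b></b>' ] ]
console.log(result.rest);   // "ac"
```

Each pass collects one nesting level, so the innermost tags end up first in
the list and the outermost last.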

Solution so far:
var regexp = /<(\w+)[^>]*>[^<]*<\/\1>/ig;

First part, <(\w+)[^>]*>: matches the beginning of any tag. It starts with
"<", then one or more word characters, followed by zero or more characters
which aren't a ">", then a ">".
Middle part, [^<]*: any run of characters which aren't a "<".
Finishing part, <\/\1>: the matching closing tag. It starts with "</", then
the character combination which made up the starting tag name, then a ">".

This works fine as long as the string is made up of matching tag pairs, but
it breaks down whenever a non-closed tag is used; in the example this is the
<hr>. The expression can't match the outer div tag, which is correct. It
matches the first inner div, which is also correct, but it runs into
problems when it tries to match the second inner div. Since there's a <hr>
inside that div, the div can't be matched and the expression just skips
over it.
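
The behaviour described above can be reproduced with a few lines. This is
only an illustration; the variable names are made up for the example, while
the pattern and the sample string are the ones from this post:

```javascript
// The regexp from the post, unchanged.
const simplePattern = /<(\w+)[^>]*>[^<]*<\/\1>/ig;

// The example string from the post, on one line.
const sample =
  "Beforetags<div first><div second></div><div third><hr></div></div>Aftertags";

// With the g flag, match() returns every match as a plain array of strings.
const found = sample.match(simplePattern);

// Only the "second" div is found; the "third" div is skipped because the
// <hr> inside it starts with "<", which [^<]* refuses to consume.
console.log(found); // [ '<div second></div>' ]
```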

What I need:
Somehow I need to replace the middle part of the expression with something
that says "as long as I don't encounter a closing tag, keep on testing"
instead of "as long as I don't encounter a <, keep on testing."

I've tried something like [^(</)]* for the middle part, and then adjusted
the third part of the expression accordingly, but with no good result. What
I want the preceding regexp to say is "as long as I don't encounter a </,
keep on testing", but I can't get the (pattern) to work like this:
[^(pattern)]* (zero or more not equal to pattern).
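
Inside a character class, parentheses are literal, so [^(</)] excludes the
four single characters "(", "<", "/" and ")" rather than the sequence "</".
One possible way to express "zero or more characters, as long as no </
starts here" is a negative lookahead, (?!...), assuming the regexp engine
supports it. The sketch below also refuses a nested opening tag of the same
name, which is an addition not described in this post; without it the outer
div would match up to the first </div>. The variable names are made up for
the example, and the pattern has only been checked against the sample
string, not against arbitrary HTML:

```javascript
// Middle part: consume one character at a time ([\s\S] = any character),
// but only where neither "</" nor "<" + the same tag name begins.
const lookaheadPattern = /<(\w+)[^>]*>(?:(?!<\/|<\1\b)[\s\S])*<\/\1>/ig;

const sample =
  "Beforetags<div first><div second></div><div third><hr></div></div>Aftertags";

// Both inner divs are found, including the one that contains the <hr>.
const innermost = sample.match(lookaheadPattern);
console.log(innermost); // [ '<div second></div>', '<div third><hr></div>' ]

// Removing them leaves the outer div for the next pass.
const remaining = sample.replace(lookaheadPattern, "");
console.log(remaining); // "Beforetags<div first></div>Aftertags"
```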

I'm sorry about my long question, but I hope someone has a good enough grasp
of pattern matching to solve this one. It's critical to me and I would be
most grateful.

/Cloak
