[Javascript] regexp - how to exclude a substring?

Mon May 23 03:37:55 CDT 2005

At 10:33 PM 5/21/2005, Shawn Milo wrote:

>On 5/21/05, Paul Novitski <paul at novitskisoftware.com> wrote:
>
> > Pending the discovery of the right piece of regexp to do my parsing for me,
> > I've used a temporary solution that's very much like what you suggest: I
> > split the HTML into an array at every < (not >) so that each array element
> > begins with either DIV or /DIV; then I walk the array beginning with my
>
>Are you splitting on '<' or '<div'? Because if it's the first, then
>how are you making sure that you deal only with divs?

I'm splitting at < so that each array element begins with either TAG or 
/TAG, e.g.:

    0   div id="wrapper">
    1    div>
    2   p>Here's some content
    3    /p>
    4    /div>
    5    div>Here's some other content
    6    /div>
    7    /div>
    8   ul>
    ...

If I were to split on <TAG such as "<div" then I'd have far fewer array 
elements to walk through looking for my desired close-tag:

    0   div id="wrapper">
    1    div><p>Here's some content</p></div>
    2    div>Here's some other content</div></div><ul>
    ...

On the other hand, the close-tags wouldn't occur at the beginning of each 
array element and I'd have to count the number of </div instances in each 
array element to determine when -- and where -- my initial target element 
closed.  Not difficult, but not as simple a loop as:

    for each array element:
    {
       if matching open-tag:
          nest++
       elseif matching close-tag:
          nest--

       if nest == 0:
          break
    }
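
As a rough JavaScript sketch of that loop (the function name 
extractElement and its arguments are made up for illustration, and it 
assumes well-formed markup with no self-closing <TAG/> and no stray "<" 
inside attribute values or comments):

```javascript
// Hypothetical helper: returns the text of the first <tag> element, from its
// open-tag through its matching close-tag, or null if it never closes.
// Assumes `tag` is a plain tag name such as "div".
function extractElement(html, tag) {
  const start = html.search(new RegExp("<" + tag + "\\b", "i"));
  if (start === -1) return null;

  // Split at "<" so that every piece begins with TAG or /TAG.
  const pieces = html.slice(start + 1).split("<");
  const open = new RegExp("^" + tag + "\\b", "i");
  const close = new RegExp("^/" + tag + "\\s*>", "i");

  let nest = 0;
  let result = "";
  for (const piece of pieces) {
    if (open.test(piece)) nest++;
    else if (close.test(piece)) nest--;

    if (nest === 0) {
      // Keep the close-tag itself, drop whatever text follows it.
      return result + "<" + piece.slice(0, piece.indexOf(">") + 1);
    }
    result += "<" + piece;
  }
  return null; // ran out of pieces before the element closed
}

const page =
  '<div id="wrapper"><div><p>Hi</p></div><div>other</div></div><ul>';
console.log(extractElement(page, "div"));
```

The first piece always matches the open-tag, so nest starts at 1 and only 
returns to 0 at the close-tag that balances it.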

> > The purpose of this is to extract segments of an HTML template for
> > selective processing.
> >
>Can you give more detail? If so, maybe we can be of more help.

I'm creating an HTML templating system such that the parent program can ask 
for chunks of HTML from the template file using CSS-style selectors.  As 
one of many possible examples, the template might contain a model of a 
thumbnail gallery such as:

         <div id="thumbnails">
            <div class="thumb">
               <img src="" alt="" title="" />
            </div>
         </div>

The parent program can read the "thumbnails" div and clone the "thumb" div 
for each thumbnail in its array or data table, outputting div#thumbnails to 
the generated page with the desired set of thumbs:

         <div id="thumbnails">
            <div class="thumb">
               <img src="Avocado.jpg" alt="Avocado" title="Avocado" />
            </div>
            <div class="thumb">
               <img src="Banana.jpg" alt="Banana" title="Banana" />
            </div>
            ...
         </div>
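
To make the cloning step concrete, here is a minimal JavaScript sketch. 
The names thumbTemplate, fillThumb and buildGallery, and the item fields 
file and name, are invented for illustration; the real system is being 
written in ASP and PHP. It does no HTML escaping of the values.

```javascript
// Hypothetical fragment, as it might be pulled out of the template file.
const thumbTemplate =
  '<div class="thumb">\n   <img src="" alt="" title="" />\n</div>';

// Hypothetical helper: fill the empty attributes for one thumbnail.
// Replacer functions are used so "$" in a value is inserted literally.
function fillThumb(template, item) {
  return template
    .replace('src=""', () => 'src="' + item.file + '"')
    .replace('alt=""', () => 'alt="' + item.name + '"')
    .replace('title=""', () => 'title="' + item.name + '"');
}

// Clone the thumb fragment once per item and wrap the set in div#thumbnails.
function buildGallery(template, items) {
  const thumbs = items.map((item) => fillThumb(template, item)).join("\n");
  return '<div id="thumbnails">\n' + thumbs + "\n</div>";
}

console.log(
  buildGallery(thumbTemplate, [
    { file: "Banana.jpg", name: "Banana" },
    { file: "Cherry.jpg", name: "Cherry" },
  ])
);
```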

The primary goal of the project is to separate HTML from server-side 
program logic as distinctly as we separate HTML from CSS & JavaScript.  A 
side-benefit is to be able to validate the HTML template and increase the 
likelihood that this validation will hold true for child pages generated 
from the template.  Ultimately I may create "templatesheets" that guide the 
creation of child pages in a way analogous to the way stylesheets guide the 
rendering of pages today, by merging database content, user input, and 
program logic with HTML templates using simplified scripting syntax.  This 
isn't a unique project -- there are plenty of templating systems out there 
already -- but I'm having a great time rolling my own.  I've done a lot of 
the work in meta-code and am now implementing it in ASP and PHP.

> > You say:
> > >I believe that regex lookaheads and lookbehinds are not supported in
> > >Javascript.
> >
> > Why would such be necessary in order to determine whether a matched pair of
> > <div/</div occurred within a string?  It seems like what I really need to
> > know is how to say in regexp, "match this string if it contains any
> > character but NOT the substring "</TAG"".  With that tool, I can filter for
> > nested TAGs inside my parent TAG.
>Because a lookbehind should, in theory, allow for a regex to say
>"find any /div that is not preceded by a /div earlier in the string."
>If we could do that, then we could cut out the hassle of having to
>split on every tag. Unless my thinking is faulty, which is a consideration
>at this late hour.

RegEx allows us to find strings that don't contain specific characters:

         /<div[^>]+id=\"bob\"/

which I believe means:

         find '<div' followed by one or more characters, none of them '>',
         followed by 'id="bob"'

...a regex way of selecting for div#bob.
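
A quick way to check that reading in JavaScript (the variable name divBob 
and the sample strings are mine, made up for the test):

```javascript
// [^>]+ means "one or more characters, none of which is '>'", so the match
// cannot run past the end of the <div ...> open-tag.
const divBob = /<div[^>]+id="bob"/;

console.log(divBob.test('<div class="x" id="bob">')); // true
console.log(divBob.test('<div id="bob">'));           // true: the space satisfies [^>]+
console.log(divBob.test('<div><p id="bob">'));        // false: a '>' sits in between
```

It is only an approximation of div#bob: it would also accept an attribute 
whose name merely ends in id, such as data-id="bob".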

What I want to do is say "BUT NOT '/div'" -- i.e. exclude a string, not 
just selected single characters.  I want to be able to select:

         <div...>...</div>

and be sure I'm not capturing:

         <div...>...</div>...</div>

Possible?
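
One candidate I'd want to test, as a sketch only: as far as I know, 
JavaScript regular expressions do support negative lookahead, (?!...), 
which can stand guard in front of every character of the element body. 
The names divWithoutCloseInside and firstDiv are made up for illustration.

```javascript
// (?!<\/div) succeeds only at positions where "</div" does NOT begin, so
// (?:(?!<\/div)[\s\S])* means "any run of characters that never contains
// the substring </div".
const divWithoutCloseInside = /<div\b[^>]*>(?:(?!<\/div)[\s\S])*<\/div>/i;

// Hypothetical helper: the first such match, or null if there is none.
function firstDiv(html) {
  const m = divWithoutCloseInside.exec(html);
  return m ? m[0] : null;
}

console.log(firstDiv('<div class="a">one</div> two</div>'));
// prints: <div class="a">one</div>
```

This excludes the substring, but it does not count nesting: given 
<div><div>x</div></div> it stops at the first </div and returns an 
unbalanced piece. Changing the guard to (?!<\/?div\b) would restrict the 
match to innermost divs, which still leaves the counting loop for 
anything nested.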

Paul