[thelist] RE: ColdFusion and checking HTML

Joshua Olson joshua at alphashop.net
Thu Apr 26 10:42:41 CDT 2001


Jon,

To be honest, the only *real* way to check if a page is nothing but white
space is to let the browser render the page, then check the result.  If you
are automating this process, then it's going to be very difficult to do it
with any single regular expression.

The perfect algorithm for stripping tags has yet to be discovered.  Generally,
tag strippers are overly greedy in what they strip.  For example:

<script>
  document.write('<hr>');
</script>

sent through a tag stripper may leave:

  document.write('');

or, other tag strippers may leave script blocks alone.
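
A minimal sketch of that first failure mode, written in Python rather than
ColdFusion purely for illustration.  The pattern and the strip_tags name are
assumptions made for this example, not anything posted in the thread; the
pattern is simply the naive "remove everything between '<' and '>'" idea.

```python
import re

# Naive tag stripper: delete everything from a '<' up to the next '>'.
# (Illustrative assumption, not a pattern taken from the thread.)
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(html):
    """Return html with every <...> run removed, wherever it appears."""
    return TAG_RE.sub("", html)


sample = "<script>\n  document.write('<hr>');\n</script>"

# The <hr> inside the JavaScript string literal is removed as well,
# so what remains is the line: document.write('');
print(strip_tags(sample).strip())
```

The regular expression has no idea that the <hr> sits inside a script string,
which is exactly why it over-strips.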

I'm not trying to discourage you from trying to find an answer, but the
answer you are seeking is not trivial.  You may find that much time will be
saved in the long run by rethinking the problem and finding another
approach.

Again, in this case you'll most likely have to settle for an answer that is
70% correct in 5% of the time.  The first answer I gave was nowhere near
complete--it just gave a potential starting point if you wanted to tackle
the problem using the algorithm that you specified.

Raymond provided a regular expression that was essentially a tag stripper.
Other people or sources may suggest other methods.  It's going to be your
call whether to combine different techniques until you get an
answer that suits you.
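
As a rough sketch of what combining techniques might look like, here is one
possible heuristic in Python (not ColdFusion; used only for illustration).
Every name and pattern in it is an assumption made for this example.  It
drops script/style blocks and comments, strips all tags except <img>, turns
&nbsp; into a plain space, trims, and checks the length.  It is the "70%
correct" kind of answer, not a substitute for rendering the page.

```python
import re

# All patterns below are illustrative assumptions, not a vetted HTML parser.
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1\s*>",
                             re.IGNORECASE | re.DOTALL)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Any tag except <img ...>, which is kept so that it counts as content.
NON_IMG_TAG_RE = re.compile(r"<(?!img\b)[^>]*>", re.IGNORECASE)
NBSP_RE = re.compile(r"&nbsp;|&#160;", re.IGNORECASE)


def looks_blank(html):
    """Guess whether a page would show nothing but white space."""
    text = SCRIPT_STYLE_RE.sub("", html)   # drop script/style blocks whole
    text = COMMENT_RE.sub("", text)        # drop HTML comments
    text = NON_IMG_TAG_RE.sub("", text)    # strip tags, keep <img>
    text = NBSP_RE.sub(" ", text)          # &nbsp; becomes an ordinary space
    return len(text.strip()) == 0


print(looks_blank("<html><body>&nbsp; <br>\n</body></html>"))
print(looks_blank('<body><img src="a.gif"></body>'))
```

Note the known blind spot: a page whose only content is produced by a script
calling document.write is reported as blank, because the script block is
thrown away rather than executed.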

Again, this particular problem is not new to the internet world.  Loosely
speaking, it may even be classified up there near NP-Complete status--no
general solution, short of actually rendering the page.

(But even then, without knowing what the images on the page contain, you
cannot be sure.  What if an image is completely transparent one day and not
the next?)

Sorry I cannot be of more help.

Maybe I can help with a related problem.  What is the nature of the
*overall* problem you are trying to solve?  Maybe a different set of eyes
(3000+) will help you come up with an alternate route.

-joshua

> Thanks Josh,
>
> Is there a way to strip all html tags from a string (remove everything
> between '<' and '>')? That is, only leave the text? Maybe I can alter
> that to strip &nbsp; and leave <img> tags, then Trim it and check that
> string's length. How could I get rid of the HTML tags?
>
> Thanks,
> Jon.




