[thelist] Excluding tags from a regular expression search

David Dorward david at dorward.me.uk
Thu Sep 13 01:56:24 CDT 2007


On 12 Sep 2007, at 19:59, E Michael Brandt wrote:
> str=str.replace(/<[^>]+>/g,'');

Unfortunately, this fails when you get content such as:

<foo attribute="3 > 2" anotherAttribute="bar">
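
As a quick check, here is that regex run over the tag above. The
function name, the "text" content and the closing tag are added here
purely for illustration:

```javascript
// The regex from the quoted message, wrapped in a function
// (the name stripTagsNaive is made up for this example).
function stripTagsNaive(str) {
  return str.replace(/<[^>]+>/g, '');
}

// The tag from above, with some content and a closing tag added.
const input = '<foo attribute="3 > 2" anotherAttribute="bar">text</foo>';
const result = stripTagsNaive(input);

// [^>]+ stops at the first ">", which here sits inside the quoted
// attribute value, so the rest of the opening tag is left behind.
console.log(JSON.stringify(result));
// prints " 2\" anotherAttribute=\"bar\">text"
```

The first match ends at the ">" inside attribute="3 > 2", so the
remainder of the opening tag survives as if it were text.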

Parsing HTML is quite hard and not something I'd like to leave to
regular expressions. You'd probably be better off running the code
through a proper HTML parser that can give you plain text (there's no
shortage of HTML->Text converters; you can use Lynx if you get really
stuck), storing that text alongside the markup, and then searching
the text rather than the HTML.
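
A minimal sketch of that approach in JavaScript, assuming a browser
environment where the built-in DOMParser is available (Node.js does
not ship one). The function names and the { html, text } record shape
are hypothetical, chosen only for this example:

```javascript
// Let the browser's HTML parser do the parsing, then read the text.
// Assumes a browser: DOMParser is not built into Node.js.
function htmlToText(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return doc.body.textContent;
}

// Keep the markup for display and the extracted text for searching.
// The extractor can be swapped out, e.g. for a server-side converter.
function makeRecord(html, toText = htmlToText) {
  return { html: html, text: toText(html) };
}

// Case-insensitive search over the stored plain text only, so tag
// names and attribute values can never produce a match.
function search(records, term) {
  const needle = term.toLowerCase();
  return records.filter(function (record) {
    return record.text.toLowerCase().includes(needle);
  });
}
```

Records would be built once, when the markup is saved, with
makeRecord(html), and queried later with search(records, term), so
the parsing cost is not paid on every search.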

-- 
David Dorward
http://dorward.me.uk/
http://blog.dorward.me.uk/




