[thelist] php preg_replace

Simon Willison cs1spw at bath.ac.uk
Thu Oct 2 17:13:02 CDT 2003


Robert Vreeland wrote:

> I'm working on a PHP-based HTML parser that takes IE JScript
> outerHTML as the input. Regardless of how well formed the
> HTML doc is, IE drops a lot of the 'optional' closing
> tags. I'm using preg_replace to re-insert the closing
> tag, but for some reason it replaces every other instance.
>
> My current workaround is to use preg_replace twice.

That way lies insanity! IE's HTML generator is notorious for producing 
horrible markup, and cleaning it up with regular expressions is a very 
tricky business. For one thing, IE allows users to paste any old junk 
into the contenteditable region (try pasting from Word or from another IE 
window), so you could end up with pretty much anything in there. For 
another thing, when the next version of IE eventually rears its ugly 
head, the chances are it will produce different HTML, breaking all of 
your hard work.

Your best bet is probably to run the HTML from IE through HTML Tidy, 
which can add all of the closing tags for you. That should be a lot more 
robust than regular expressions. You can either use PHP to execute the 
command line version of tidy, or you can take a look at the Tidy 
extension for PHP: http://www.coggeshall.org/tidy.php
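
As a rough sketch of the extension route, something like the following 
should do the job. It assumes a PHP build with the tidy extension loaded 
and uses tidy_repair_string() as found in current versions of that 
extension. The helper name close_optional_tags() is made up for 
illustration, and the two config keys are ordinary HTML Tidy options, so 
adjust them to taste.

```php
<?php
// Hypothetical helper (not part of any library): hand IE's outerHTML to
// HTML Tidy and let it re-insert the closing tags IE dropped.
// Returns the repaired markup, or false if the tidy extension is missing.
function close_optional_tags($html)
{
    if (!function_exists('tidy_repair_string')) {
        return false; // assumption: caller falls back to command-line tidy
    }

    $config = array(
        'output-xhtml'   => true, // well-formed output, every tag closed
        'show-body-only' => true, // return the fragment, not a whole document
    );

    return tidy_repair_string($html, $config, 'utf8');
}

// The kind of markup IE hands back, with the optional </li> tags dropped.
$fixed = close_optional_tags('<ul><li>one<li>two</ul>');
if ($fixed !== false) {
    echo $fixed;
}
```

If the extension isn't available, the same options can be passed to the 
command line version (for example `tidy -q -asxhtml --show-body-only yes`) 
with the markup piped in on standard input.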

Cheers,

Simon Willison
http://simon.incutio.com/



