I wrote this expression to remove all empty tags in the page (including tags that contain only whitespace).
$content = preg_replace('/<[^\/>]*>([\s]?)*<\/[^>]*>/', '', $content);
It worked a treat until it had to deal with content like this...
<blockquote>
<p >foo bar</p>
</blockquote>
<p ><a href="image.jpg" rel="lightbox" title=""><img title="image" src="image.jpg" /></a><br /></p>
and it outputs it as...
<blockquote>
<p >this is a test for the pluggin</p>
<p ><a href="image.jpg" rel="lightbox" title=""><img title="image" src="image.jpg" /></a><br /></p>
Thus removing the </blockquote>.
I have been scratching my head on this one and can't get it working. Can anyone see an obvious solution other than specifying which tags it should format? I should also say that it is formatting 'the_content' on a WordPress post.
Regular expressions and HTML are not a good match: HTML is not a regular language, and there is no end of edge cases and gotchas. You'll be better off using an HTML parser such as Simple HTML DOM and inspecting or manipulating the DOM object.
You might also like to take a look at HTML Purifier, which is more advanced than Simple HTML DOM, if you find that the parser doesn't catch all the tags.
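As a rough sketch of the parser-based approach, here is one way it could look using PHP's built-in DOMDocument and DOMXPath classes. Note that this swaps in the bundled DOM extension rather than Simple HTML DOM, purely because it ships with PHP. The function name remove_empty_elements, the wrapper div with id "root", and the $keep list are all assumptions made for illustration, so adjust them to suit your content.

```php
<?php
// Hypothetical helper (not from the original post): removes elements that
// contain nothing but whitespace, using PHP's built-in DOM extension.
function remove_empty_elements(string $html): string
{
    // Elements that are legitimately empty and must be kept. This list is an
    // assumption: void elements plus a few that are often empty on purpose.
    $keep = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr',
             'iframe', 'script', 'textarea', 'td', 'th'];

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    // Wrap the fragment in a known root so it can be read back afterwards;
    // the XML prolog is a common trick to make loadHTML treat input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    do {
        $removed = 0;
        // Descendants of the root with no child elements and no visible text.
        $nodes = $xpath->query(
            '//div[@id="root"]//*[not(*) and not(normalize-space())]'
        );
        foreach (iterator_to_array($nodes) as $node) {
            if (!in_array(strtolower($node->nodeName), $keep, true)) {
                $node->parentNode->removeChild($node);
                $removed++;
            }
        }
    } while ($removed > 0); // removing a child may leave its parent empty

    // Serialize only the children of the wrapper, not the wrapper itself.
    $root = $xpath->query('//div[@id="root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Because the parser knows which closing tag belongs to which opening tag, the </blockquote> in your example is left alone. On a WordPress post you could hook the function in with add_filter('the_content', 'remove_empty_elements'). Bear in mind that the parser normalizes the markup on output, so <p > comes back as <p> and <br /> as <br>.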