Remove all empty HTML tags?

I am imagining a function, which I figure would use Regex and would be recursive for instances like <p><strong></strong></p>, to remove all empty HTML tags within a string. This would have to account for whitespace too, if possible. There would be no crazy instances where the < character was being used in an attribute value.

I am pretty terrible at regex but I imagine this is possible. How can you do it?

Here is the method I have so far:

Public Shared Function stripEmptyHtmlTags(ByVal html As String) As String
    Dim newHtml As String = Regex.Replace(html, "/(<.+?>\s*</.+?>)/Usi", "")

    If html <> newHtml Then
        newHtml = stripEmptyHtmlTags(newHtml)
    End If

    Return newHtml
End Function

However my current Regex is in PHP format, and it doesn't seem to be working. I am not familiar with .NET regex syntax.

To all those saying don't use regex: I am curious what the pattern would be regardless. Surely there is a pattern which could match all opening/closing start tags with any amount of white space (or none) in between the tags? I've seen regex that matches HTML tags with any number of attributes, one empty tag (such as just <p></p>) etc.

So far I have tried the following regex patterns in the above method to no avail (as in, I have a text string with empty paragraph tags that didn't even get removed).

Regex.Replace(html, "/(<.+?>\s*</.+?>)/Usi", "")

Regex.Replace(html, "(<.+?>\s*</.+?>)", "")

Regex.Replace(html, "%<(\w+)\b[^>]*>\s*</\1\s*>%", "")

Regex.Replace(html, "<\w+\s*>\s*</\1\s*>", "")

First, note that empty HTML elements are, by definition, not nested.

Update: The solution below now applies the empty element regex recursively to remove "nested-empty-element" structures such as: <p><strong></strong></p> (subject to the caveats stated below).

Simple version:

The following (untested) VB.NET snippet works pretty well (see caveats below) for HTML whose start tag attributes contain no angle brackets (< or >):

Dim RegexObj As New Regex("<(\w+)\b[^>]*>\s*</\1\s*>")
Do While RegexObj.IsMatch(html)
    html = RegexObj.Replace(html, "")
Loop
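
For readers who want to try the same loop outside .NET, here is a minimal Python sketch of the simple version. The pattern is copied from the snippet above; the function name strip_empty_tags and the use of re.subn to detect when nothing more was removed are choices made for this illustration, not part of the original answer.

```python
import re

# Same pattern as the simple version above: a start tag, optional
# whitespace, then the matching end tag (\1 refers back to the tag name).
EMPTY_ELEMENT = re.compile(r"<(\w+)\b[^>]*>\s*</\1\s*>")


def strip_empty_tags(html: str) -> str:
    """Repeatedly delete empty elements until a pass removes nothing."""
    while True:
        html, count = EMPTY_ELEMENT.subn("", html)
        if count == 0:
            return html
```

Each pass removes only the innermost empty elements, so <p><strong></strong></p> needs two passes: the first leaves <p></p>, the second leaves an empty string.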

Enhanced version:

<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>

Here is the uncommented enhanced version in VB.NET (untested):

Dim RegexObj As New Regex("<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:""[^""]*""|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>")
Do While RegexObj.IsMatch(html)
    html = RegexObj.Replace(html, "")
Loop

This more complex regex correctly matches a valid empty HTML 4.01 element even if it has angle brackets in its attribute values (subject, once again, to the caveats below). In other words, this regex correctly handles all start tag attribute values, whether quoted (which can contain <>), unquoted (which can't) or empty. Here is a fully commented (and tested) PHP version:

function strip_empty_tags($text) {
    // Match empty elements (attribute values may have angle brackets).
    $re = '%
        # Regex to match an empty HTML 4.01 Transitional element.
        <                    # Opening tag opening "<" delimiter.
        (\w+)\b              # $1 Tag name.
        (?:                  # Non-capture group for optional attribute(s).
          \s+                # Attributes must be separated by whitespace.
          [\w\-.:]+          # Attribute name is required for attr=value pair.
          (?:                # Non-capture group for optional attribute value.
            \s*=\s*          # Name and value separated by "=" and optional ws.
            (?:              # Non-capture group for attrib value alternatives.
              "[^"]*"        # Double quoted string.
            | \'[^\']*\'     # Single quoted string.
            | [\w\-.:]+      # Non-quoted attrib value can be A-Z0-9-._:
            )                # End of attribute value alternatives.
          )?                 # Attribute value is optional.
        )*                   # Allow zero or more attribute=value pairs
        \s*                  # Whitespace is allowed before closing delimiter.
        >                    # Opening tag closing ">" delimiter.
        \s*                  # Content is zero or more whitespace.
        </\1\s*>             # Element closing tag.
        %x';
    while (preg_match($re, $text)) {
        // Repeatedly remove innermost empty elements until none remain.
        $text = preg_replace($re, '', $text);
    }
    return $text;
}

Caveats: This function does not parse HTML. It simply matches and removes any text pattern sequence corresponding to a valid empty HTML 4.01 element (which, by definition, is not nested). Note that this also erroneously matches and removes the same text pattern which may occur outside normal HTML markup, such as within SCRIPT and STYLE tags and HTML comments and the attributes of other start tags. This regex does not work with short tags.
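
The false positives described in these caveats are easy to reproduce. The Python sketch below uses the simple pattern from earlier in this answer; the sample strings are made up for illustration. It shows the regex removing text that only looks like an empty element, inside a comment and inside a script string.

```python
import re

# Simple empty-element pattern from earlier in the answer.
EMPTY_ELEMENT = re.compile(r"<(\w+)\b[^>]*>\s*</\1\s*>")


def strip_empty_tags(html: str) -> str:
    """Apply the pattern until no match is left."""
    while EMPTY_ELEMENT.search(html):
        html = EMPTY_ELEMENT.sub("", html)
    return html


# Hypothetical inputs: neither contains a real empty element,
# yet both are altered because the regex only sees text.
in_comment = "<!-- <b></b> -->"
in_script = '<script>var s = "<i></i>";</script>'

comment_result = strip_empty_tags(in_comment)
script_result = strip_empty_tags(in_script)
```

Here comment_result loses the <b></b> inside the comment, and script_result has its JavaScript string literal emptied, which changes what the script does.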

Update: This regex solution also does not work (and will erroneously remove valid markup) if you do something insanely unlikely (but perfectly valid) like this:

<div att="<p att='">stuff</div><div att="'></p>'">stuff</div>
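
To see what goes wrong, the enhanced pattern can be run against that exact string. This is a Python sketch, on the assumption that Python's re module treats this pattern the same way as the .NET and PHP engines for this input; the variable names are illustrative only.

```python
import re

# Enhanced pattern from above, written as a raw triple-quoted string
# so that both quote characters can appear unescaped.
ENHANCED = re.compile(
    r"""<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>"""
)

tricky = '''<div att="<p att='">stuff</div><div att="'></p>'">stuff</div>'''

result = tricky
while ENHANCED.search(result):
    result = ENHANCED.sub("", result)
```

The regex has no notion of being inside the first div's attribute value, so it starts matching at <p, treats everything between the two single quotes as one attribute value, and ends at </p>. The removed span covers the end of the first div and the start of the second, leaving a single <div att="'">stuff</div>.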

Summary:

On second thought, just use an HTML parser!

The problem you face is the arbitrary levels of nesting, which cannot be matched with a standard regex. I suppose you could apply the same regex replacement over and over again until no matches are left. But there are better solutions out there, such as a dedicated HTML parsing library.

You can't do it with a regular expression. You could probably use an XML parser, assuming the HTML is well formed.
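
As a rough illustration of the parser route, here is a Python sketch using the standard library's xml.etree.ElementTree. It assumes the input is well-formed XML with a single root element, which real-world HTML often is not. The VOID set and the function names are assumptions made for this example, and the root element itself is never removed.

```python
import xml.etree.ElementTree as ET

# Elements that are legitimately empty and should be kept
# (an assumed, non-exhaustive list for illustration).
VOID = {"br", "hr", "img", "input", "meta", "link"}


def _is_blank(text):
    """True for missing or whitespace-only text."""
    return text is None or text.strip() == ""


def _prune(parent):
    """Remove empty children of `parent`, deepest first."""
    for child in list(parent):
        _prune(child)
        if child.tag in VOID or len(child) or not _is_blank(child.text):
            continue
        # Keep any text that followed the removed element by moving it
        # onto the previous sibling's tail or the parent's text.
        tail = child.tail or ""
        index = list(parent).index(child)
        if index == 0:
            parent.text = (parent.text or "") + tail
        else:
            prev = parent[index - 1]
            prev.tail = (prev.tail or "") + tail
        parent.remove(child)


def strip_empty_elements(markup: str) -> str:
    """Parse `markup`, drop empty elements, and serialize it again."""
    root = ET.fromstring(markup)
    _prune(root)
    return ET.tostring(root, encoding="unicode")
```

Because the tree is walked depth first, nested empties such as <p><strong></strong></p> disappear in one call, and text inside comments or attribute values is never mistaken for markup. The output is serialized as XML, so kept void elements come back in the form <br />.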

Why recursive though, you could simply run