Regex to strip HTML tags

I have this HTML input:

<font size="5"><p>some text</p>
<p> another text</p></font>

I'd like to use regex to remove the HTML tags so that the output is:

some text
another text

Can anyone suggest how to do this with regex?

Since you asked, here's a quick and dirty solution:

String stripped = input.replaceAll("<[^>]*>", "");

(Ideone.com demo)

Using regexps to deal with HTML is a pretty bad idea though. The above hack won't deal with stuff like

<tag attribute=">">Hello</tag>
<script>if (a < b) alert('Hello>');</script>

etc.
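To make the failure concrete, here is a minimal sketch that runs the same regex over those two inputs. The class name NaiveStripDemo and the strip helper are made up for illustration; only String.replaceAll from the standard library is used.

```java
public class NaiveStripDemo {

    // The quick-and-dirty regex from above: drop everything between '<' and the next '>'.
    public static String strip(String input) {
        return input.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // A '>' inside an attribute value ends the match too early.
        System.out.println(strip("<tag attribute=\">\">Hello</tag>"));
        // prints: ">Hello

        // A '<' inside script code is mistaken for the start of a tag.
        System.out.println(strip("<script>if (a < b) alert('Hello>');</script>"));
        // prints: if (a ');
    }
}
```

In both cases the output is neither the original markup nor the visible text, which is why this pattern is only safe on input you already know to be simple.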

A better approach is to use an HTML parser such as Jsoup. To remove all tags from a string, you can call Jsoup.parse(html).text().

Use an HTML parser. Here's a Jsoup example.

String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = Jsoup.parse(input).text();
System.out.println(stripped);

Result:

some text another text

Or, if you want to preserve newlines, parse each line separately (this assumes no tag is split across lines):

String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
for (String line : input.split("\n")) {
    String stripped = Jsoup.parse(line).text();
    System.out.println(stripped);
}

Result:

some text
another text

Jsoup offers more advantages as well. You can easily extract specific parts of the HTML document using the select() method, which accepts jQuery-like CSS selectors. It only requires the document to be semantically well-formed. The presence of the <font> tag, deprecated since 1998, is not a good sign, but if you know the HTML structure in detail beforehand, it is still doable.

public class HtmlSanitizer {

    private static String pattern;

    private final static String [] tagsTab = {"!doctype","a","abbr","acronym","address","applet","area","article","aside","audio","b","base","basefont","bdi","bdo","bgsound","big","blink","blockquote","body","br","button","canvas","caption","center","cite","code","col","colgroup","content","data","datalist","dd","decorator","del","details","dfn","dir","div","dl","dt","element","em","embed","fieldset","figcaption","figure","font","footer","form","frame","frameset","h1","h2","h3","h4","h5","h6","head","header","hgroup","hr","html","i","iframe","img","input","ins","isindex","kbd","keygen","label","legend","li","link","listing","main","map","mark","marquee","menu","menuitem","meta","meter","nav","nobr","noframes","noscript","object","ol","optgroup","option","output","p","param","plaintext","pre","progress","q","rp","rt","ruby","s","samp","script","section","select","shadow","small","source","spacer","span","strike","strong","style","sub","summary","sup","table","tbody","td","template","textarea","tfoot","th","thead","time","title","tr","track","tt","u","ul","var","video","wbr","xmp"};

    static {
        StringBuffer tags = new StringBuffer();
        for (int i=0;i<tagsTab.length;i++) {
            tags.append(tagsTab[i].toLowerCase()).append('|').append(tagsTab[i].toUpperCase());
            if (i<tagsTab.length-1) {
                tags.append('|');
            }
        }
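        // \b keeps a short name such as "a" or "p" from matching the start of a longer, unlisted name;
        // [^>]* (instead of .*?) also matches attributes that span several lines.
        pattern = "</?(" + tags.toString() + ")\\b[^>]*>";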
    }

    public static String sanitize(String input) {
        return input.replaceAll(pattern, "");
    }

    public final static void main(String[] args) {
        System.out.println(HtmlSanitizer.pattern);

        System.out.println(HtmlSanitizer.sanitize("<font size=\"5\"><p>some text</p><br/> <p>another text</p></font>"));
    }

}

I wrote this to be Java 1.4 compliant (for some sad reasons), so feel free to use a for-each loop and StringBuilder instead.

Advantages:

You can choose which tags to strip, which means you can keep the ones you want
You avoid stripping text that isn't an HTML tag
You keep the whitespace

Drawbacks:

You have to list every HTML tag you want to strip from your string, which can be a lot, for example if you want to strip everything.

If you see any other drawbacks, I would really be glad to know them.
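As a rough sketch of the "keep the tags you want" idea, here is a shorter variant for newer Java versions. The class name SelectiveTagStripper and the three-entry tag list are made up for illustration, and this is one possible refinement rather than the code above. It assumes you want case-insensitive matching, so that mixed-case tags such as <Font> are caught too, and it uses a word boundary so a short name like p cannot match the start of <pre>. It is still a regex, so a '>' inside an attribute value will confuse it just the same.

```java
import java.util.regex.Pattern;

public class SelectiveTagStripper {

    // Hypothetical short list for illustration; a full tag array would work the same way.
    private static final String[] TAGS_TO_STRIP = {"font", "p", "br"};

    // CASE_INSENSITIVE covers <P>, <Font>, </FONT>, etc.
    // \b stops "p" from matching the start of "<pre>" or "<param>".
    private static final Pattern TAG_PATTERN = Pattern.compile(
            "</?(" + String.join("|", TAGS_TO_STRIP) + ")\\b[^>]*>",
            Pattern.CASE_INSENSITIVE);

    public static String sanitize(String input) {
        return TAG_PATTERN.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<Font size=\"5\"><p>some <b>text</b></p><br/> <pre>kept</pre></FONT>";
        System.out.println(sanitize(html));
        // prints: some <b>text</b> <pre>kept</pre>
    }
}
```

The listed tags are removed regardless of case, while <b> and <pre> survive because they are not in the list.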

If you use Jericho, then you just have to use something like this:

public String extractAllText(String htmlText){
    Source source = new Source(htmlText);
    return source.getTextExtractor().toString();
}

Of course you can do the same even with an Element:

for (Element link : links) {
  System.out.println(link.getTextExtractor().toString());
}
