I have this HTML input:
<font size="5"><p>some text</p>
<p> another text</p></font>
I'd like to use regex to remove the HTML tags so that the output is:
some text
another text
Can anyone suggest how to do this with regex?
Since you asked, here's a quick and dirty solution:
String stripped = input.replaceAll("<[^>]*>", "");
(Ideone.com demo)
Using regular expressions to process HTML is a pretty bad idea, though. The hack above breaks on input such as
<tag attribute=">">Hello</tag>
<script>if (a < b) alert('Hello>');</script>
etc.
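To make the failure concrete, here is a small sketch that runs the same replaceAll call against those two inputs. The class name RegexStripDemo and the helper strip are made up for illustration; only the pattern itself comes from the answer.

```java
class RegexStripDemo {
    // Same quick-and-dirty pattern as above: '<', then anything up to the next '>'.
    static String strip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Simple markup works as expected.
        System.out.println(strip("<p>some text</p>"));                  // some text

        // A '>' inside an attribute value ends the match too early,
        // so part of the tag leaks into the output.
        System.out.println(strip("<tag attribute=\">\">Hello</tag>"));  // ">Hello

        // A '<' inside script code is mistaken for the start of a tag,
        // so part of the script is swallowed and the rest is left behind.
        System.out.println(strip("<script>if (a < b) alert('Hello>');</script>")); // if (a ');
    }
}
```

In both problem cases the pattern has no idea whether it is inside an attribute value or a script body, which is exactly the context a real parser keeps track of.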
A better approach is to use an HTML parser such as Jsoup. To remove all tags from a string, you can for instance call Jsoup.parse(html).text().
Use an HTML parser. Here's a Jsoup example.
String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = Jsoup.parse(input).text();
System.out.println(stripped);
Result:
some text another text
Or if you want to preserve newlines:
String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
for (String line : input.split("\n")) {
    String stripped = Jsoup.parse(line).text();
    System.out.println(stripped);
}
Result:
some text
another text
Jsoup offers more advantages as well. You can easily extract specific parts of the HTML document using the select() method, which accepts jQuery-like CSS selectors. It only requires the document to be semantically well-formed. The presence of the <font> tag, deprecated since 1998, is not a very good sign, but if you know the HTML structure in detail beforehand, it'll still be doable.
See also:
- Pros and cons of leading HTML parsers in Java
You can use an HTML parser called Jericho HTML Parser.
You can download it from here: http://jericho.htmlparser.net/docs/index.html
Jericho HTML Parser is a Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognized or invalid HTML. It also provides high-level HTML form manipulation functions.
Badly formatted HTML does not interfere with the parsing.
Starting from aioobe's code, I tried something more daring:
String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = input.replaceAll("</?(font|p){1}.*?/?>", "");
System.out.println(stripped);
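As a rough check of what this does (my own sketch, not part of the original post; the class name SelectiveStripDemo and the <b> input are assumptions for illustration), the pattern removes only the listed tags and leaves the newline and any other markup alone:

```java
class SelectiveStripDemo {
    // Same pattern as above: only <font> and <p> tags (opening or closing) are removed.
    static String strip(String html) {
        return html.replaceAll("</?(font|p){1}.*?/?>", "");
    }

    public static void main(String[] args) {
        String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
        // The newline between the two paragraphs survives, so this prints two lines:
        // some text
        // another text
        System.out.println(strip(input));

        // Tags that are not in the list are kept as they are.
        System.out.println(strip("<p>some <b>bold</b> text</p>")); // some <b>bold</b> text
    }
}
```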
The code to strip every HTML tag would look like this:
public class HtmlSanitizer {
    private static String pattern;
    private final static String [] tagsTab = {"!doctype","a","abbr","acronym","address","applet","area","article","aside","audio","b","base","basefont","bdi","bdo","bgsound","big","blink","blockquote","body","br","button","canvas","caption","center","cite","code","col","colgroup","content","data","datalist","dd","decorator","del","details","dfn","dir","div","dl","dt","element","em","embed","fieldset","figcaption","figure","font","footer","form","frame","frameset","h1","h2","h3","h4","h5","h6","head","header","hgroup","hr","html","i","iframe","img","input","ins","isindex","kbd","keygen","label","legend","li","link","listing","main","map","mark","marquee","menu","menuitem","meta","meter","nav","nobr","noframes","noscript","object","ol","optgroup","option","output","p","param","plaintext","pre","progress","q","rp","rt","ruby","s","samp","script","section","select","shadow","small","source","spacer","span","strike","strong","style","sub","summary","sup","table","tbody","td","template","textarea","tfoot","th","thead","time","title","tr","track","tt","u","ul","var","video","wbr","xmp"};

    static {
        // Build one alternation containing the lower-case and upper-case form of every tag name.
        StringBuffer tags = new StringBuffer();
        for (int i=0;i<tagsTab.length;i++) {
            tags.append(tagsTab[i].toLowerCase()).append('|').append(tagsTab[i].toUpperCase());
            if (i<tagsTab.length-1) {
                tags.append('|');
            }
        }
        pattern = "</?("+tags.toString()+"){1}.*?/?>";
    }

    // Removes every opening or closing tag whose name is in tagsTab.
    public static String sanitize(String input) {
        return input.replaceAll(pattern, "");
    }

    public final static void main(String[] args) {
        System.out.println(HtmlSanitizer.pattern);
        System.out.println(HtmlSanitizer.sanitize("<font size=\"5\"><p>some text</p><br/> <p>another text</p></font>"));
    }
}
I wrote this to be Java 1.4 compliant, for some sad reasons, so feel free to use a for-each loop and StringBuilder instead.
Advantages:
- You can generate lists of tags you want to strip, which means you can keep those you want
- You avoid stripping stuff that isn't an HTML tag
- You keep the whitespace
Drawbacks:
- You have to list every HTML tag you want to strip from your string, which can be a lot, for example if you want to strip everything.
If you see any other drawbacks, I would really be glad to know them.
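One drawback that can be checked directly: the tag name in the pattern is not delimited, so p also matches the start of pre, and mixed-case names such as Font are not covered. A possible tightening is sketched below. It is my own variant under stated assumptions, not the author's code; TagListStripDemo, LOOSE, and STRICT are hypothetical names. It adds a word boundary after the tag name, limits the rest of the tag to [^>]*, and turns on case-insensitive matching with (?i).

```java
import java.util.regex.Pattern;

class TagListStripDemo {
    // Original-style pattern: a listed name followed by anything up to the next '>'.
    static final String LOOSE = "</?(font|p){1}.*?/?>";

    // Hypothetical tightened variant:
    //   (?i)   match tag names in any letter case
    //   \b     the name must end here, so "p" no longer matches "pre"
    //   [^>]*  the rest of the tag, which may span line breaks
    static final Pattern STRICT = Pattern.compile("(?i)</?(font|p)\\b[^>]*>");

    static String stripLoose(String html) {
        return html.replaceAll(LOOSE, "");
    }

    static String stripStrict(String html) {
        return STRICT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<pre>code</pre><p>text</p>";
        System.out.println(stripLoose(html));   // codetext
        System.out.println(stripStrict(html));  // <pre>code</pre>text

        String mixedCase = "<Font>x</Font>";
        System.out.println(stripLoose(mixedCase));   // <Font>x</Font>
        System.out.println(stripStrict(mixedCase));  // x
    }
}
```

The tightened pattern still shares the general weakness of any regex approach: a '>' inside an attribute value or a script body will confuse it.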
If you use Jericho, then you just have to use something like this:
public String extractAllText(String htmlText){
    Source source = new Source(htmlText);
    return source.getTextExtractor().toString();
}
Of course you can do the same even with an Element:
for (Element link : links) {
    System.out.println(link.getTextExtractor().toString());
}