I want to split the following string according to the td tags:
<html>
<body>
<table>
<tr><td>data1</td></tr>
<tr><td>data2</td></tr>
<tr><td>data3</td></tr>
<tr><td>data4</td></tr>
</table>
</body>
I've tried split("h2"); and split("[h2]"); but this way the split method splits the HTML code where it finds "h" or "2" and, if I am not mistaken, also "h2".
My ultimate goal is to retrieve everything between <td> and </td>.
Can anyone please tell me how to do this using only split()?
Thanks a lot.
No.
That would mean — in essence — parsing HTML with regex. We don't do that 'round these parts.
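For what it's worth, here is a small sketch of why the attempts in the question behave the way they do. String.split treats its argument as a regular expression, so "[h2]" is a character class (split at every 'h' or '2'), while "h2" only matches that literal two-character sequence. The class name, helper method, and sample string are made up for illustration.

```java
import java.util.Arrays;

class SplitRegexDemo {
    // Thin wrapper around String.split, which interprets its argument as a regex.
    static String[] pieces(String input, String regex) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        String s = "ah2b2ch";

        // "h2" matches only the literal sequence h followed by 2.
        System.out.println(Arrays.toString(pieces(s, "h2")));   // [a, b2ch]

        // "[h2]" is a character class: split at every single 'h' or '2'.
        // Trailing empty strings are dropped, empty strings in the middle are kept.
        System.out.println(Arrays.toString(pieces(s, "[h2]"))); // [a, , b, c]
    }
}
```

Either way, split only cuts the string at matches; it knows nothing about which tag opened or closed, which is why it is a poor fit for pulling out cell contents.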
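Here is how to solve your ultimate goal:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String html = ""; // your html
// Capture the text of each <td> that contains no nested tags.
Pattern p = Pattern.compile("<td>([^<]*)</td>", Pattern.MULTILINE | Pattern.DOTALL);
for (Matcher m = p.matcher(html); m.find(); ) {
    String tag = m.group(1);
    System.out.println(tag);
}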
Please note that this code was written without a compiler, but it gives the idea.
BUT why do you want to parse HTML using regex? I agree with the others: use an HTML or XML parser (the latter if your HTML is well-formed).
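If the markup really is well-formed, the XML route could look roughly like the sketch below. It uses the JDK's built-in DOM parser; the class name and method name are hypothetical, and the input is assumed to be well-formed XHTML without a DOCTYPE (real-world HTML usually is not, in which case this will throw).

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XmlTdExtractor {
    // Returns the text content of every <td> element, in document order.
    // Assumes the input is well-formed XML; otherwise an exception is thrown.
    static List<String> extractTds(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList cells = doc.getElementsByTagName("td");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                result.add(cells.item(i).getTextContent());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><table>"
                + "<tr><td>data1</td></tr>"
                + "<tr><td>data2</td></tr>"
                + "</table></body></html>";
        System.out.println(extractTds(xhtml)); // [data1, data2]
    }
}
```

Note that getTextContent also includes the text of any nested elements, so a cell containing another table would return all of the inner text concatenated.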
You cannot successfully parse HTML (or in your case, get the data between TD tags) with regular expressions. You should take a look at a simple HTML parser:
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class TdExtractor {
    // Collects the text of every <td> element in the given HTML.
    public static List<String> extractTDs(String html) throws IOException {
        final List<String> tdList = new ArrayList<String>();
        ParserDelegator parserDelegator = new ParserDelegator();
        ParserCallback parserCallback = new ParserCallback() {
            // Text seen since the last end tag.
            StringBuffer buffer = new StringBuffer();

            @Override
            public void handleText(final char[] data, final int pos) {
                buffer.append(data);
            }

            @Override
            public void handleEndTag(Tag t, final int pos) {
                if (Tag.TD.equals(t)) {
                    tdList.add(buffer.toString());
                }
                // Reset after every end tag, not just </td>.
                buffer = new StringBuffer();
            }
        };
        // The third argument (ignoreCharSet = true) skips any charset declaration in the document.
        parserDelegator.parse(new StringReader(html), parserCallback, true);
        return tdList;
    }
}
String.split or regexes should not be used to parse markup languages, as they have no notion of depth (HTML has a recursive grammar and needs a recursive parser). Consider what would happen if your <td> looked like:
<td>
<table><tr><td> td inside a td? </td></tr></table>
</td>
A greedy regex would match everything between the outer <td>...</td>, including the inner tags, while a non-greedy one would stop at the first </td> it finds. Either way you get unwanted results.
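To make the nesting problem concrete, here is a rough sketch that runs a greedy and a reluctant pattern over the nested cell above (collapsed onto one line). The two patterns and the class and method names are my own choices for illustration, not something from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedTdRegexDemo {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<td><table><tr><td> td inside a td? </td></tr></table></td>";

        // Greedy: runs to the LAST </td>, so the inner markup comes along.
        System.out.println(firstGroup("<td>(.*)</td>", html));
        // <table><tr><td> td inside a td? </td></tr></table>

        // Reluctant: stops at the FIRST </td>, which belongs to the inner cell.
        System.out.println(firstGroup("<td>(.*?)</td>", html));
        // <table><tr><td> td inside a td? 
    }
}
```

Neither pattern can pair each opening tag with its own closing tag, because that requires tracking depth.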
You should use an HTML parser like Johan mentioned.
You should really use an HTML parser, such as NekoHTML or HtmlParser.
If (and only if) you have a very small set of controlled HTML, you could (although I generally recommend against it) use a regex such as
(?<=\\<td\\>)\\w+(?=\\</td\\>)
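As a sketch of how that pattern might be used from Java (the class and method names here are made up, and the input is assumed to be the small, controlled kind of table from the question):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundTdDemo {
    // Lookbehind and lookahead assert the surrounding tags without consuming them,
    // so each match is just the cell text.
    private static final Pattern TD_TEXT =
            Pattern.compile("(?<=\\<td\\>)\\w+(?=\\</td\\>)");

    static List<String> cellTexts(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = TD_TEXT.matcher(html);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<tr><td>data1</td></tr>\n<tr><td>data2</td></tr>";
        System.out.println(cellTexts(html)); // [data1, data2]

        // \w+ only covers letters, digits and underscores, so a cell
        // containing a space is silently skipped.
        System.out.println(cellTexts("<td>two words</td>")); // []
    }
}
```

The second call shows how brittle this is: any whitespace, attribute on the tag, or nested markup makes the cell disappear from the results without an error.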