So I'm writing an application that will do a little screen scraping. I'm using the HTML Agility Pack to load an entire HTML page into an instance of HtmlDocument called doc. Now I want to parse that doc, looking for this:
<table border="0" cellspacing="3">
<tr><td>First rows stuff</td></tr>
<tr>
<td>
The data I want is in here <br />
and it's separated by these annoying <br /> 's.
No id's, classes, or even a single <p> tag. </p> Just a bunch of <br /> tags.
</td>
</tr>
</table>
So I just need to get the data within the 2nd row. How can I do this? Should I use a regex or something else?
Update: Here is how I'm loading my doc
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(Url);
Since you are using Html Agility Pack already I would suggest using the methods it provides to find the information you want. There are a few ways to navigate the document, but one of the most concise is to use XPath. In this case you could use something like this:
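// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load("input.html");
// Select the cell in the second row of the table with cellspacing="3".
// Note: SelectNodes returns null when nothing matches.
HtmlNode node = doc.DocumentNode
    .SelectNodes("//table[@cellspacing='3']/tr[2]/td")
    .Single();
string text = node.InnerText;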
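If you need the pieces between the <br /> tags as separate strings rather than one block of text, one possible approach is to walk the cell's child nodes and keep only the text nodes. The following is a minimal sketch, not a definitive implementation: the inline sample HTML, the class name and the variable names are made up for illustration, and it only picks up text that sits directly inside the <td> (text nested in other tags, such as a <p>, would be skipped).

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class BrSplitSketch
{
    static void Main()
    {
        // Hypothetical sample input, modelled on the table in the question.
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<table border=""0"" cellspacing=""3"">
<tr><td>First rows stuff</td></tr>
<tr><td>line one <br /> line two <br /> line three</td></tr>
</table>");

        // SelectSingleNode returns null when nothing matches, so check before use.
        HtmlNode cell = doc.DocumentNode
            .SelectSingleNode("//table[@cellspacing='3']/tr[2]/td");
        if (cell == null)
        {
            Console.WriteLine("Cell not found");
            return;
        }

        // Keep only the text nodes directly under the <td>; the <br /> elements
        // between them act as separators and are dropped.
        var lines = cell.ChildNodes
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(s => s.Length > 0)
            .ToList();

        foreach (var line in lines)
        {
            Console.WriteLine(line);
        }
    }
}
```

With a page loaded through HtmlWeb as in the question, the same selection and filtering should apply to that doc instead of the inline sample.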
"Something else" is the best answer -- HTML is best parsed by an HTML parser rather than via regular expressions. I'm no C# expert, but I hear the HTML Agility Pack is well-liked for this purpose.
I'd say som̡et̨hińg Else
You'd probably get better mileage with an xml parser.
If you're using the Agility Pack already, then it's just a matter of using something like doc.DocumentNode.SelectNodes("//table[@cellspacing='3']") to get the table in the document. Try looking through the documentation and coding examples. Since you already have structured data, it's ridiculous to go back to the text data and reparse.