I am parsing tabular data from an HTML file with the Html Agility Pack, and for simple tables this works.
The problem arises when the table I want to extract is the innermost one, or when I don't know at which nesting level it sits. There can be any number of nested tables, and from them I want to extract the data of the table whose column names are NAME and ADDRESS.
Example:
<table>
<table>
<tr><td>PHONE NO.</td><td>OTHER INFO.</td></tr>
<tr><td>
<table>
<tr><td>AMOUNT</td></tr>
<tr><td>50000</td></tr>
<tr><td>80000</td></tr>
</table>
</td></tr>
<tr><td>
<table>
<tr><td>
<table>
<tr><td>
<table>
<tr><td> NAME </td><td>ADDRESS</td></tr>
<tr><td> ABC </td><td> kfks </td></tr>
<tr><td> BCD </td><td> fdsa </td></tr>
</table>
</td></tr>
</table>
</td></tr>
</table>
</td></tr>
</table>
</table>
There are many tables, but I want to extract only the table whose column names are NAME and ADDRESS. What should I do?
Load the document as an HtmlDocument, then use an XPath query to find a table that contains no other tables and has a td in its first row containing "NAME". The comparison is case-sensitive, so the string in the query must match the header text exactly.
The XPath implementation is the standard .NET one from System.Xml.XPath, so any documentation about using XPath with XmlDocument will be applicable.
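As a rough illustration of that point, the same kind of query can be tried against a well-formed fragment using only XmlDocument from System.Xml. This is a sketch rather than part of the Html Agility Pack answer: the class and method names (XPathWithXmlDocument, FindInnermostNameTable) are made up for the example, and it assumes the input is well-formed XML, which real-world HTML often is not.

```csharp
using System;
using System.Xml;

public static class XPathWithXmlDocument
{
    // Hypothetical helper: returns the first table that contains no nested
    // table and whose first row has a cell reading "NAME", or null if none.
    // Assumes 'xml' is well-formed; LoadXml throws XmlException otherwise.
    public static XmlNode FindInnermostNameTable(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        // normalize-space() with no argument works on the current td,
        // so surrounding whitespace in the cell text is ignored.
        return doc.SelectSingleNode(
            "//table[not(descendant::table) and tr[1]/td['NAME' = normalize-space()]]");
    }
}
```

Calling FindInnermostNameTable on a fragment with an outer table wrapping a NAME/ADDRESS table should return the inner table element, since the outer one is excluded by not(descendant::table).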
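using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("file.html");
// Innermost table (no nested tables) whose first row has a cell reading "NAME"; null if nothing matches.
HtmlNode el = doc.DocumentNode.SelectSingleNode("//table[not(descendant::table) and tr[1]/td['NAME' = normalize-space()]]");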
If the position of the "NAME" column were fixed (for example, always the second cell), you could use something like 'NAME' = normalize-space(tr[1]/td[2]) instead.
To find a table based on several column names, without the innermost-table condition:
HtmlNode el = doc.DocumentNode.SelectSingleNode("//table[tr[1]/td['NAME' = normalize-space()] and tr[1]/td['ADDRESS' = normalize-space()]]");
var table = doc.DocumentNode.SelectSingleNode("//table [not(descendant::table) and tr[1]/td[normalize-space()='ADDRESS'] ]");
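Once the table node has been located, its data rows still have to be read out. The following is a minimal sketch of one way to do that, assuming the Html Agility Pack package is referenced and that the header is the first row of the table. The class and method names (NestedTableReader, ReadNameAddressRows) are hypothetical and only serve the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class NestedTableReader
{
    // Hypothetical helper: finds the innermost table whose first row contains
    // both NAME and ADDRESS cells and returns the trimmed text of every
    // following row. Returns an empty list when no such table exists.
    public static List<string[]> ReadNameAddressRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode table = doc.DocumentNode.SelectSingleNode(
            "//table[not(descendant::table)" +
            " and tr[1]/td[normalize-space()='NAME']" +
            " and tr[1]/td[normalize-space()='ADDRESS']]");

        var rows = new List<string[]>();
        if (table == null) return rows;

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection dataRows = table.SelectNodes("tr[position() > 1]");
        if (dataRows == null) return rows;

        foreach (HtmlNode row in dataRows)
        {
            HtmlNodeCollection cells = row.SelectNodes("td");
            if (cells == null) continue;
            rows.Add(cells.Select(c => c.InnerText.Trim()).ToArray());
        }
        return rows;
    }
}
```

The null checks matter because both SelectSingleNode and SelectNodes return null when the query matches nothing. To read from a file instead of a string, doc.Load("file.html") can replace the LoadHtml call.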