I'm trying to parse an HTML file with libxml2. Usually this works fine, but not in this case:
<p>
<b>Titles</b>
(Some Text)
<table>
<tr>
<td valign="top">
…Something1...
</td>
<td align="right" valign="top">
…Something2...
</td>
</tr>
</table>
</p>
I run this query to get the first <td>:
//p[b='Titles']/table/tr/td[0]
but nothing is returned, because libxml thinks that the <table> tag is not a child of the <p> tag but a sibling that follows it.
So the question is: why?
Are you using the HTML parser or the XML parser? As far as I recall, HTML allows only inline elements inside <p> (you cannot put a <table> in a <p>), so the HTML parser auto-closes the <p> tag as soon as it sees the <table> tag (in HTML, you don't have to close every tag). So your HTML is roughly equivalent to (attributes omitted):
<P>
<B>Titles</B>
Some text...
<TABLE>
<TR>
<TD>...Something1...
<TD>...Something2...
</TABLE>
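Try using the XML parser from libxml instead of the HTML parser. An XML parser does not auto-close tags, so <table> stays a child of <p>, provided the document is well-formed XML.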
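To make the auto-closing rule concrete, here is a toy model in Python. It is not libxml2's implementation: it uses the standard-library html.parser tokenizer, and the ParentTracker class and the CLOSES_P set are hypothetical names invented for this sketch. It applies only the one rule described above (a block-level start tag ends an open <p>) and records each element's parent.

```python
from html.parser import HTMLParser

# Hypothetical and deliberately incomplete set of start tags that end an open <p>.
CLOSES_P = {"table", "div", "p", "ul", "ol", "pre", "h1", "h2", "h3"}


class ParentTracker(HTMLParser):
    """Records each element's parent, applying only the <p> auto-close rule."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements, innermost last
        self.parent_of = {}  # tag name -> parent tag name (first occurrence)

    def handle_starttag(self, tag, attrs):
        # A block-level start tag implicitly ends an open <p>.
        if tag in CLOSES_P and self.stack and self.stack[-1] == "p":
            self.stack.pop()
        self.parent_of.setdefault(tag, self.stack[-1] if self.stack else None)
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Ignore end tags with no open element, such as the trailing </p>.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


parser = ParentTracker()
parser.feed(
    "<body><p><b>Titles</b>(Some Text)"
    "<table><tr><td>Something1</td><td>Something2</td></tr></table>"
    "</p></body>"
)
parser.close()

print(parser.parent_of["table"])  # parent recorded for <table>
print(parser.parent_of["p"])      # parent recorded for <p>
```

In this model <table> and <p> end up with the same parent, which is why a path of the form p/table finds nothing.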
//p[b='Titles']/table/tr/td[0]
The error is in the indexing. XPath positions are 1-based, so td[0] never matches anything.
The corrected XPath expression is:
//p[b='Titles']/table/tr/td[1]
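As a quick check of both points (XML parsing keeps <table> inside <p>, and positions start at 1), here is a minimal sketch. It assumes Python's standard-library xml.etree.ElementTree as a stand-in for libxml2's XML parser; ElementTree supports only a subset of XPath, but that subset covers this expression. The <body> wrapper and the simplified cell text are additions for the example, not part of the original markup.

```python
import xml.etree.ElementTree as ET

# The markup from the question, wrapped in a hypothetical <body> root so
# that <p> is a descendant of the node we search from.
DOC = """<body>
<p>
<b>Titles</b>
(Some Text)
<table>
<tr>
<td valign="top">Something1</td>
<td align="right" valign="top">Something2</td>
</tr>
</table>
</p>
</body>"""

root = ET.fromstring(DOC)

# An XML parser never auto-closes <p>, so <table> remains its child and
# the path from the question matches once the index is 1-based.
first = root.findall(".//p[b='Titles']/table/tr/td[1]")
second = root.findall(".//p[b='Titles']/table/tr/td[2]")

print([td.text for td in first])   # text of the first cell
print([td.text for td in second])  # text of the second cell
```

The leading "." is only there because ElementTree requires relative paths; with libxml2 the expression can start with // as in the answer above.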