I'm screen-scraping an HTML page which contains:
<table border=1 class="searchresult" cellpadding=2>
<tr><th colspan=2>Last search</th></tr>
<tr><th align=left>Search term</th><td>xxxxxx</td></tr>
<tr><th align=left>Result</th><td>yyyyyyyy/td></tr>
</table>
I want to write an XPATH expression which gets me the data cell containing "yyyyyyyy". I've gotten as far as
.//table[@class='searchresult']//tr/th
which gets me a list of all the table-header nodes in the table. I can iterate over them in user code, find the one whose .text is "Results" and then call .getnext() on that to get the table-data. But, is there a cleaner way to do this by writing a more specific XPATH pattern? It seems like there should be, but开发者_运维知识库 I haven't gotten my head that far around XPATH yet to figure out how.
If it matters, I'm doing this in Python with lxml.
.//table[@class='searchresult']//tr/td[preceding-sibling::th] might give you what you need.
Two comprehensive papers on semi-automatically creating XPath statements like this one, specifically for screen scraping purposes can be found here:
http://tobiasanton.com/Tobias_Anton/Academia.html
Use:
//table/tr[last()]/td
This selects any td
element that is a child of any tr
that is the last tr
child of any table
in this XHTML document.
This may select more than one td
element, depending on whether or not there is only one table
in the XHTML document. You need to make this expression more precise, if more than one table
element is present.
For example, if the table
in question is the first in the document, use:
(//table)[1]/tr[last()]/td
精彩评论