Hi all, I am having some problems that I think come down to my XPath. I am using the html module from the lxml package to try to get at some data. Below is the most simplified version of the situation, but keep in mind that the HTML I am actually working with is much uglier.
<table>
<tr>
<td>
<table>
<tr><td></td></tr>
<tr><td>
<table>
<tr><td><u><b>Header1</b></u></td></tr>
<tr><td>Data</td></tr>
</table>
</td></tr>
</table>
</td></tr>
</table>
What I really want is the deeply nested table, because it has the header text "Header1". I am trying like so:
from lxml import html
page = '...'
tree = html.fromstring(page)
print(tree.xpath('//table[//*[contains(text(), "Header1")]]'))
but that gives me all of the table elements. I just want the one table that contains this text. I understand what is going on but am having a hard time figuring out how to do this besides breaking out some nasty regex. Any thoughts?
Use:
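//td[. = 'Header1']/ancestor::table[1]
Comparing . (the string value of the td) rather than text() matters here, because the text sits inside the nested u and b elements rather than directly in the td.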
Find the header you are interested in and then pull out its table.
//u[b = 'Header1']/ancestor::table[1]
or
//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]
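As a rough check of the last two expressions, the sketch below runs them against a compacted copy of the markup from the question. The PAGE constant and the path_of helper are names made up for this illustration; they are not part of lxml.

```python
from lxml import html

# Compacted copy of the sample markup from the question (assumed equivalent).
PAGE = (
    "<table><tr><td>"
    "<table><tr><td></td></tr><tr><td>"
    "<table><tr><td><u><b>Header1</b></u></td></tr>"
    "<tr><td>Data</td></tr></table>"
    "</td></tr></table>"
    "</td></tr></table>"
)
tree = html.fromstring(PAGE)


def path_of(el):
    """Hypothetical helper: absolute path of an element, used to compare results."""
    return el.getroottree().getpath(el)


by_u = tree.xpath("//u[b = 'Header1']/ancestor::table[1]")
by_td = tree.xpath("//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]")
# Without the [1], the ancestor axis returns every enclosing table.
all_ancestors = tree.xpath("//u[b = 'Header1']/ancestor::table")

print(len(by_u), len(by_td), len(all_ancestors))
print(path_of(by_u[0]) == path_of(by_td[0]))
```

The [1] is what narrows the result to the nearest enclosing table; dropping it hands back all of the tables that wrap the header.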
Note that // always starts at the document root (!). You can't do:
//table[//*[contains(text(), "Header1")]]
and expect the inner predicate (//*…) to magically start at the right context. Use .// to start at the context node. Even then, this:
//table[.//*[contains(text(), "Header1")]]
won't work since even the outermost table contains the text 'Header1' somewhere deep down, so the predicate evaluates to true for every table in your example. Use not() like I did to make sure no other tables are nested.
Also, don't test the condition on every node .//*, since it can't be true for every node to begin with. It's more efficient to be specific.
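The difference between the three predicates can be counted directly. This is only a sketch: the wrapping div and the extra table containing "Other" are assumptions added so that the absolute and relative forms give different counts; they are not in the original markup.

```python
from lxml import html

# Question's nesting, plus one unrelated table (an assumption for illustration).
PAGE = (
    "<div>"
    "<table><tr><td>"
    "<table><tr><td></td></tr><tr><td>"
    "<table><tr><td><u><b>Header1</b></u></td></tr>"
    "<tr><td>Data</td></tr></table>"
    "</td></tr></table>"
    "</td></tr></table>"
    "<table><tr><td>Other</td></tr></table>"
    "</div>"
)
tree = html.fromstring(PAGE)

# Inner // restarts at the document root, so the test is the same for every table.
absolute = tree.xpath('//table[//*[contains(text(), "Header1")]]')
# Inner .// starts at each table, but outer tables still contain the header.
relative = tree.xpath('//table[.//*[contains(text(), "Header1")]]')
# Excluding tables that contain other tables leaves only the innermost match.
innermost = tree.xpath(
    '//table[not(.//table) and .//*[contains(text(), "Header1")]]'
)

print(len(absolute), len(relative), len(innermost))
```

With this input the absolute form also matches the unrelated table, the relative form drops it but keeps all three nested tables, and only the not() version narrows the result to one.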
Perhaps this would work for you:
tree.xpath("//table[not(descendant::table)][contains(., 'Header1')]")
The not(descendant::table) bit ensures that you're getting the innermost table, and the second predicate keeps only the tables whose text contains Header1.
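To see what not(descendant::table) selects, and why it matters whether the Header1 test is a predicate on the table or a step into its children, here is a small sketch against a compacted copy of the question's markup (the PAGE constant is a name made up for this example).

```python
from lxml import html

# Compacted copy of the sample markup from the question (assumed equivalent).
PAGE = (
    "<table><tr><td>"
    "<table><tr><td></td></tr><tr><td>"
    "<table><tr><td><u><b>Header1</b></u></td></tr>"
    "<tr><td>Data</td></tr></table>"
    "</td></tr></table>"
    "</td></tr></table>"
)
tree = html.fromstring(PAGE)

# Tables that contain no other table, i.e. the innermost ones.
innermost_tables = tree.xpath("//table[not(descendant::table)]")

# A child step after the table selects the matching child of the table.
stepped = tree.xpath("//table[not(descendant::table)]/*[contains(., 'Header1')]")

# A second predicate keeps the result on the table itself.
filtered = tree.xpath("//table[not(descendant::table)][contains(., 'Header1')]")

print(len(innermost_tables))
print([el.tag for el in stepped])
print([el.tag for el in filtered])
```

If the table element itself is what you need, the predicate form is the one to use; the step form hands back whatever child of the table holds the header.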
table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
//*[.="Header1"] selects elements anywhere in the document whose string value is Header1.
ancestor::table[1] selects the nearest ancestor of such an element that is a table.
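For what it's worth, text() and . are not interchangeable here. The sketch below (PAGE is again a made-up name holding a compacted copy of the question's markup) compares what each test matches, and checks that the ancestor step still collapses to a single table.

```python
from lxml import html

# Compacted copy of the sample markup from the question (assumed equivalent).
PAGE = (
    "<table><tr><td>"
    "<table><tr><td></td></tr><tr><td>"
    "<table><tr><td><u><b>Header1</b></u></td></tr>"
    "<tr><td>Data</td></tr></table>"
    "</td></tr></table>"
    "</td></tr></table>"
)
tree = html.fromstring(PAGE)

# text() looks only at an element's own child text nodes.
by_text = tree.xpath('//*[text()="Header1"]')
# . compares the full string value, including text in descendants.
by_value = tree.xpath('//*[.="Header1"]')

print([el.tag for el in by_text])
print([el.tag for el in by_value])

# Several elements match, but they share one nearest table, and the
# resulting node-set contains that table only once.
tables = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
print(len(tables))
```

Because the result is a single-element list, unpacking it with table, = works and would raise a ValueError if the expression ever matched zero tables or more than one.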
Complete example
#!/usr/bin/env python
from lxml import html
page = """
<table>
<tr>
<td>
<table>
<tr><td></td></tr>
<tr><td>
<table>
<tr><td><u><b>Header1</b></u></td></tr>
<tr><td>Data</td></tr>
</table>
</td></tr>
</table>
</td></tr>
</table>
"""
tree = html.fromstring(page)
table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
print(html.tostring(table, encoding="unicode"))