As is often the case, I'm struggling with the lack of proper lxml documentation (note to self: should write a proper lxml tutorial and get lots of traffic!).
I want to find all <li> items that do not contain an <a> tag with a particular class.
For example:
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>
I'd like to get hold of only the <li> that does not contain a link with class new, and I'd like to get hold of the text inside <small>. In other words, 'pudding'.
Can anyone help?
Thanks!
import lxml.html as lh
content='''\
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>
'''
tree = lh.fromstring(content)
for elt in tree.xpath('//li[not(descendant::a[@class="new"])]/small/text()'):
    print(elt)
    # pudding
The XPath has the following meaning:
//                      # from the root node, look at all descendants
li[                     # select <li> elements that
  not(descendant::a[    # do not have an <a> descendant
    @class="new"])]     # whose class attribute is exactly "new"
/small                  # select the <small> child of that <li>
/text()                 # return the text of that element
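One caveat worth illustrating: @class="new" compares the whole attribute value, so a link written as class="new external" would not count as having the class. The sketch below is one common workaround, not part of the answer above; the "new external" value in the sample markup is an assumption added purely to show the difference.

```python
import lxml.html as lh

# Sample markup; the "new external" class value is hypothetical.
content = '''\
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new external">St Marcellin</a></li>
</ul>
'''
tree = lh.fromstring(content)

# Exact comparison: "new external" is not equal to "new",
# so the cheese item is not excluded here.
exact = tree.xpath('//li[not(descendant::a[@class="new"])]/small/text()')

# Token match: pad the normalized class list with spaces, then look for " new ".
token = tree.xpath(
    '//li[not(descendant::a['
    'contains(concat(" ", normalize-space(@class), " "), " new ")'
    '])]/small/text()'
)

print(exact)  # ['pudding', 'cheese']
print(token)  # ['pudding']
```

If the class attribute in your pages only ever holds a single value, the simpler exact comparison is enough.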
I quickly hacked together this code:
from lxml import etree
from lxml.cssselect import CSSSelector  # needs the separate cssselect package
content = r"""
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>"""
html = etree.HTML(content)
bad_sel = CSSSelector('li > a.new')   # links with class "new" directly inside an <li>
good_sel = CSSSelector('li > small')  # every <small> directly inside an <li>
# the <li> elements to exclude
bad = [item.getparent() for item in bad_sel(html)]
# keep each <small> whose parent <li> is not excluded
good = [item for item in good_sel(html) if item.getparent() not in bad]
for item in good:
    print(item.text)
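It first builds a list of the <li> items you do not want, and then it keeps only the <small> elements whose parent <li> is not in that list. Note that 'li > a.new' only matches links that are direct children of the <li>.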
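The same exclude-then-keep idea can also be done in a single pass over the <li> elements, which handles links nested at any depth. This is only a sketch: the function name small_text_without_link_class and its parameters are made up for illustration, and it assumes the class attribute holds a single value.

```python
import lxml.html as lh


def small_text_without_link_class(html_text, class_name):
    """Return the <small> text of each <li> with no descendant <a> of that class."""
    tree = lh.fromstring(html_text)
    results = []
    for li in tree.iter('li'):
        # Relative XPath: search only inside this <li>.
        # $cls is an XPath variable bound through the keyword argument.
        if li.xpath('.//a[@class=$cls]', cls=class_name):
            continue  # this <li> contains an unwanted link, skip it
        text = li.findtext('small')
        if text is not None:
            results.append(text)
    return results


content = '''\
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>
'''
print(small_text_without_link_class(content, 'new'))  # ['pudding']
```

Passing the class name as an XPath variable avoids building the expression by string concatenation.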