locate element using lxml.html vs BeautifulSoup

Developer https://www.devze.com 2023-02-28 01:43 Source: Internet
I'm scraping an HTML document using lxml.html; there's one thing I can do in BeautifulSoup but can't manage to do with lxml.html. Here it is:

import re

from bs4 import BeautifulSoup

# Sample document: two <h2>/<p>/<table> groups; the second table is the target
doc = ['<html>',
       '<h2> some text </h2>',
       '<p> some more text </p>',
       '<table> <tr> <td> A table</td> </tr> </table>',
       '<h2> some special text </h2>',
       '<p> some more text </p>',
       '<table> <tr> <td> The table I want </td> </tr> </table>',
       '</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')
# Find the text node matching "special", then the first <table> that follows it
print(soup.find(string=re.compile("special")).find_next('table'))

I tried this with cssselect, but had no success. Any ideas on how I could locate this table using the methods in lxml.html?

Many thanks, D


You can use a regular expression in an lxml XPath by using the EXSLT regular-expressions extension. For example, given your document, the expressions below find the element whose text matches the regexp spe.*al and then select the siblings that follow it:

import lxml.html

# EXSLT regular-expressions namespace; bound to the "re" prefix in each query
NS = 'http://exslt.org/regular-expressions'
# DOC is the question's HTML as a single string
DOC = ''.join(doc)
tree = lxml.html.fromstring(DOC)

# select sibling table nodes after matching node
path = "//*[re:test(text(), 'spe.*al')]/following-sibling::table"
print(tree.xpath(path, namespaces={'re': NS}))

# select all sibling nodes after matching node
path = "//*[re:test(text(), 'spe.*al')]/following-sibling::*"
print(tree.xpath(path, namespaces={'re': NS}))

Output:

[<Element table at 7fe21acd3f58>]
[<Element p at 7f76ac2c3f58>, <Element table at 7f76ac2e6050>]
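
If you only want the one table closest to the matching heading, a positional predicate on the sibling axis should do it. Here is a minimal self-contained sketch for Python 3. The names DOC, EXSLT_RE and first_table_after are made up for this illustration, and passing the pattern as an XPath variable is just one way to avoid quoting problems. The last lines show a plain XPath 1.0 variant using contains(), which is a substring test rather than a regexp, in case that is all you need.

```python
import lxml.html

# Same markup as in the question, joined into one string (DOC is an illustrative name).
DOC = (
    "<html>"
    "<h2> some text </h2>"
    "<p> some more text </p>"
    "<table> <tr> <td> A table</td> </tr> </table>"
    "<h2> some special text </h2>"
    "<p> some more text </p>"
    "<table> <tr> <td> The table I want </td> </tr> </table>"
    "</html>"
)

# EXSLT regular-expressions namespace, bound to the "re" prefix in the query.
EXSLT_RE = "http://exslt.org/regular-expressions"


def first_table_after(tree, pattern):
    """Hypothetical helper: return the nearest following-sibling <table> of an
    element whose first text node matches `pattern`, or None if there is none."""
    # [1] on the following-sibling axis keeps only the closest table.
    path = "//*[re:test(text(), $pattern)]/following-sibling::table[1]"
    hits = tree.xpath(path, namespaces={"re": EXSLT_RE}, pattern=pattern)
    return hits[0] if hits else None


tree = lxml.html.fromstring(DOC)
table = first_table_after(tree, "spe.*al")
print(table.text_content().strip())

# Plain XPath 1.0 alternative: substring match with contains(), no EXSLT needed.
plain = tree.xpath("//h2[contains(., 'special')]/following-sibling::table[1]")
print(plain[0].text_content().strip())
```

Both print statements should show the text of the second table. Without the [1], following-sibling::table returns every later sibling table, which happens to be a single one in this document.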