I'm scraping an HTML document using lxml.html; there's one thing I can do in BeautifulSoup but can't manage to do with lxml.html. Here it is:
from BeautifulSoup import BeautifulSoup
import re
doc = ['<html>',
'<h2> some text </h2>',
'<p> some more text </p>',
'<table> <tr> <td> A table</td> </tr> </table>',
'<h2> some special text </h2>',
'<p> some more text </p>',
'<table> <tr> <td> The table I want </td> </tr> </table>',
'</html>']
soup = BeautifulSoup(''.join(doc))
print soup.find(text=re.compile("special")).findNext('table')
I tried this with cssselect, but had no success. Any ideas on how I could locate this table using the methods in lxml.html?
Many thanks, D
You can use a regular expression in an lxml XPath by using the EXSLT syntax. For example, given your document, this selects the siblings that follow the node whose text matches the regexp spe.*al:
import lxml.html
NS = 'http://exslt.org/regular-expressions'
# DOC is the HTML from the question as a single string
DOC = ''.join(doc)
tree = lxml.html.fromstring(DOC)
# select sibling table nodes after matching node
path = "//*[re:test(text(), 'spe.*al')]/following-sibling::table"
print tree.xpath(path, namespaces={'re': NS})
# select all sibling nodes after matching node
path = "//*[re:test(text(), 'spe.*al')]/following-sibling::*"
print tree.xpath(path, namespaces={'re': NS})
Output:
[<Element table at 7fe21acd3f58>]
[<Element p at 7f76ac2c3f58>, <Element table at 7f76ac2e6050>]
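Note that following-sibling only finds tables that share a parent with the matching node, whereas BeautifulSoup's findNext walks forward through the whole document. A closer equivalent might be the XPath following axis with a [1] predicate. The snippet below is a minimal, self-contained sketch of that idea in Python 3, assuming the sample document from the question; the names NS, DOC, tables and wanted_text are chosen here for illustration only.

```python
import lxml.html

# EXSLT regular-expressions namespace, as used in the answer above
NS = {'re': 'http://exslt.org/regular-expressions'}

# The sample document from the question, joined into one string
DOC = ''.join([
    '<html>',
    '<h2> some text </h2>',
    '<p> some more text </p>',
    '<table> <tr> <td> A table</td> </tr> </table>',
    '<h2> some special text </h2>',
    '<p> some more text </p>',
    '<table> <tr> <td> The table I want </td> </tr> </table>',
    '</html>',
])

tree = lxml.html.fromstring(DOC)

# Take the first node whose text matches the regexp, then the first
# <table> that comes after it in document order (sibling or not).
path = "(//*[re:test(text(), 'special')])[1]/following::table[1]"
tables = tree.xpath(path, namespaces=NS)

# text_content() flattens the table's text; strip() drops the padding
wanted_text = tables[0].text_content().strip()
print(wanted_text)
```

With the sample document this should pick the second table, the one after the "special" heading. If no node matches, tables is an empty list, so check it before indexing.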