How can I get the content of the <body> element using html5lib in Python?
Example input data: <html><head></head><body>xxx<b>yyy</b></hr></body></html>
Expected output: xxx<b>yyy</b></hr>
It should work even if the HTML is broken (unclosed tags, etc.).
html5lib allows you to parse your documents using a variety of standard tree formats. You can do this using lxml, as I've done below, or you can follow the instructions in their user documentation to do it with minidom, ElementTree or BeautifulSoup.
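import html5lib

with open("mydocument.html", "rb") as f:
    # namespaceHTMLElements=False keeps tag names plain ("body" rather than "{http://www.w3.org/1999/xhtml}body")
    doc = html5lib.parse(f, treebuilder="lxml", namespaceHTMLElements=False)
content = doc.findtext("body", default=None)  # path is relative to the root <html>; returns only the text directly inside <body>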
Response to comment
It is possible to achieve this without installing any external libraries by using html5lib's own simpletree.py, but judging by the comment at the start of that file I would guess this is not the recommended way...
# Really crappy basic implementation of a DOM-core like thing
If you still want to do this, however, you can parse the html document like so:
import html5lib

f = open("mydocument.html")
doc = html5lib.parse(f)
and then find the element you're looking for by doing a breadth-first search of the child nodes in the document. The nodes are kept in an array named childNodes and each node has a name stored in the field name.
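The search described above could look roughly like the sketch below. It only assumes what the answer states, namely that each node has a name field and a childNodes list. The Node class and the find_first helper are hypothetical stand-ins written for illustration, not part of html5lib, so that the example runs without any parsed document.

```python
from collections import deque


class Node:
    """Hypothetical stand-in for a tree node: a name plus a childNodes list."""

    def __init__(self, name, childNodes=None):
        self.name = name
        self.childNodes = childNodes or []


def find_first(root, name):
    """Return the first node called `name` in breadth-first order, or None."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.name == name:
            return node
        # Children go to the back of the queue, so shallower nodes are visited first.
        queue.extend(node.childNodes)
    return None


# Tiny hand-built tree mirroring <html><head></head><body><b></b></body></html>
body = Node("body", [Node("b")])
tree = Node("html", [Node("head"), body])

found = find_first(tree, "body")
print(found.name)
```

With a real parsed document you would pass the object returned by html5lib.parse in place of the hand-built tree, provided its nodes expose the same two attributes.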