
lxml can't parse <table>?

https://www.devze.com 2023-01-23 16:13 (source: web)

I want to parse tables in HTML, but I found that lxml can't parse them. What's wrong?

# -*- coding: utf-8 -*-
# Note: this is Python 2 code (urllib.urlopen and the print statement).
import urllib
import lxml.etree

keyword = 'lxml+tutorial'
url = 'http://www.baidu.com/s?wd='

if __name__ == '__main__':
    page = 0

    # Build the search URL by string concatenation.
    link = url + keyword + '&pn=' + str(page)

    # Fetch the page.
    f = urllib.urlopen(link)
    content = f.read()
    f.close()

    # Parse the (possibly broken) HTML.
    tree = lxml.etree.HTML(content)

    # Find every <table> element anywhere in the document.
    query_link = '//table'
    info_link = tree.xpath(query_link)

    print info_link

The printed result is just []...


lxml's documentation says, "The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the parser cannot handle them. There is also no guarantee that the resulting tree will contain all data from the original document. The parser may have to drop seriously broken parts when struggling to keep parsing."

And sure enough, the HTML returned by Baidu is invalid: the W3C validator reports "173 Errors, 7 warnings". I don't know (and haven't investigated) whether these particular errors have caused your trouble with lxml, because I think that your strategy of using lxml to parse HTML found "in the wild" (which is nearly always invalid) is doomed.

For parsing invalid HTML, you need a parser that implements the (surprisingly bizarre!) HTML error recovery algorithm. So I recommend swapping lxml for html5lib, which handles Baidu's invalid HTML with no problems:

>>> import urllib
>>> from html5lib import html5parser, treebuilders
>>> p = html5parser.HTMLParser(tree = treebuilders.getTreeBuilder('dom'))
>>> dom = p.parse(urllib.urlopen('http://www.baidu.com/s?wd=foo').read())
>>> len(dom.getElementsByTagName('table'))
12


I see several places where that code could be improved but, for your question, here are my suggestions:

  1. Use lxml.html.parse(link) rather than lxml.etree.HTML(content) so all the "just works" automatics can kick in (e.g. properly handling character-encoding declarations in headers).

  2. Try tree.findall(".//table") rather than tree.xpath("//table"). I'm not sure whether it will make a difference, but I used that syntax in a project of my own a few hours ago without issue and, as a bonus, it's compatible with non-lxml ElementTree APIs.
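Because findall(".//table") is part of the common ElementTree API, the exact same call works against the standard library's xml.etree on well-formed markup. A minimal, self-contained sketch (the HTML snippet here is invented for illustration):

```python
import xml.etree.ElementTree as ET

# A small, well-formed XHTML fragment invented for illustration.
snippet = """
<html>
  <body>
    <div><table><tr><td>a</td></tr></table></div>
    <table><tr><td>b</td></tr></table>
  </body>
</html>
"""

root = ET.fromstring(snippet)

# ".//table" matches tables at any depth. Note that ElementTree only
# supports a limited XPath subset, so "//table" would not work here.
tables = root.findall(".//table")
print(len(tables))  # 2
```

The same .findall call runs unchanged on an lxml tree, which is why it travels well between the two libraries.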

The other major thing I'd suggest is using Python's built-in functions for building URLs, so you can be sure the URL you're building is valid and properly escaped in all circumstances.
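In Python 3 those helpers live in urllib.parse (in Python 2, urllib.urlencode served the same role). A sketch of building the search URL from the question this way:

```python
from urllib.parse import urlencode

base = 'http://www.baidu.com/s'

# urlencode escapes each value, so spaces, '&', '+' and non-ASCII
# characters can't silently corrupt the query string.
params = {'wd': 'lxml tutorial', 'pn': 0}
link = base + '?' + urlencode(params)

print(link)  # http://www.baidu.com/s?wd=lxml+tutorial&pn=0
```

Compare this with the hand-concatenated 'lxml+tutorial' in the original code, which only works because the keyword was pre-escaped by hand.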

If lxml can't find a table but the browser shows one, I can only imagine it's one of these three problems:

  1. Bad request. lxml gets a page without a table in it (e.g. an error 404 or 500).
  2. Bad parsing. Something about the page confused lxml.etree.HTML when called directly.
  3. JavaScript needed. Maybe the table is generated client-side.
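A quick way to tell these three cases apart is to look at the raw response before parsing it. The diagnose helper below is hypothetical (invented for this answer, not part of lxml), and its heuristics are only a rough sketch:

```python
def diagnose(status, raw_html, tables_found):
    """Guess which of the three failure modes applies.

    status       -- HTTP status code of the response
    raw_html     -- the raw markup that was fetched
    tables_found -- number of <table> elements the parser returned

    This helper and its heuristics are invented for illustration.
    """
    if status != 200:
        return 'bad request'       # case 1: e.g. 404 or 500
    if tables_found:
        return 'ok'
    if '<table' in raw_html.lower():
        return 'bad parsing'       # case 2: markup has tables, parser lost them
    return 'javascript needed'     # case 3: probably generated client-side

print(diagnose(404, '', 0))                                   # bad request
print(diagnose(200, '<html><table></table></html>', 0))       # bad parsing
print(diagnose(200, '<html><div id="app"></div></html>', 0))  # javascript needed
```

For case 3, no server-side parser will help; you'd need a browser-driven tool instead.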
