I've been trying to parse an HTML page in Python using lxml.html.
I used the following code:
import lxml.html as H
page = open('page.html', 'r').read()
doc = H.fromstring(page)
print H.tostring(doc)
page.html is a web page I downloaded with a proxy program I wrote earlier, which handles proxying and encoding conversion. The program converted the file's encoding to UTF-8, but the charset declaration in the page still reads:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
btw, gb2312 is a kind of Chinese character set.
At first I ran the above Python code, but it printed only an empty HTML skeleton, not the content I wanted.
After some experimenting I found that the charset declaration was the cause: when I replaced 'charset=gb2312' with an empty string, the parsing code worked as I expected.
But I don't quite understand why this happens. Is removing the declaration the right fix, or did it just happen to work?
http://lxml.de/parsing.html#python-unicode-strings says:
You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.
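Following that advice, one way to keep the file's bytes intact while sidestepping the wrong declaration is to pass the real encoding to an `lxml.html.HTMLParser`: an explicit parser encoding takes precedence over the in-document `<meta>` charset. A minimal sketch (the sample bytes below are a hypothetical stand-in for the downloaded page, not the actual file):

```python
import lxml.html

# Hypothetical stand-in for page.html: UTF-8 bytes whose <meta>
# still declares gb2312, mirroring the mismatch described above.
html_bytes = (
    b'<html><head>'
    b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
    b'</head><body><p>\xe4\xbd\xa0\xe5\xa5\xbd</p></body></html>'
)

# Telling the parser the real encoding overrides the bogus declaration,
# so the file does not need to be edited or decoded beforehand.
parser = lxml.html.HTMLParser(encoding='utf-8')
doc = lxml.html.document_fromstring(html_bytes, parser=parser)
print(lxml.html.tostring(doc, encoding='unicode'))
```

With this approach the original bytes go straight into the parser, which is what the lxml documentation recommends.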