I've been trying to parse an HTML page in Python using lxml.html.
I used the following code:
import lxml.html as H
page = open('page.html', 'r').read()
doc = H.fromstring(page)
print H.tostring(doc)
page.html is a web page I downloaded with a proxy program I wrote earlier, which handles proxying and encoding conversion. The program converted the file's encoding to UTF-8, but the charset declaration in the page still reads:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
btw, gb2312 is a kind of Chinese character set.
At first I ran the above Python code, but it printed only an empty HTML skeleton, not the content I wanted.
After some experimenting I found that the charset declaration was the cause: when I replaced 'charset=gb2312' with an empty string, the parsing code worked as I expected.
But I don't quite understand why this happens. Is removing the declaration the right fix, or did it just happen to work?
http://lxml.de/parsing.html#python-unicode-strings says:
You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.
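Following that advice, one way to keep the file's bytes intact while sidestepping the wrong declaration is to pass the real encoding to an `lxml.html.HTMLParser`: an explicit parser encoding takes precedence over the in-document `<meta>` charset. A minimal sketch (the sample bytes below are a hypothetical stand-in for the downloaded page, not the actual file):

```python
import lxml.html

# Hypothetical stand-in for page.html: UTF-8 bytes whose <meta>
# still declares gb2312, mirroring the mismatch described above.
html_bytes = (
    b'<html><head>'
    b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
    b'</head><body><p>\xe4\xbd\xa0\xe5\xa5\xbd</p></body></html>'
)

# Telling the parser the real encoding overrides the bogus declaration,
# so the file does not need to be edited or decoded beforehand.
parser = lxml.html.HTMLParser(encoding='utf-8')
doc = lxml.html.document_fromstring(html_bytes, parser=parser)
print(lxml.html.tostring(doc, encoding='unicode'))
```

With this approach the original bytes go straight into the parser, which is what the lxml documentation recommends.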