
A charset problem when parsing HTML with lxml.html

Developer (https://www.devze.com), 2023-02-06 17:06, source: the web

I've been trying to parse an HTML page in Python using lxml.html.

I used the following code:

import lxml.html as H
page = open('page.html', 'r').read()
doc = H.fromstring(page)
print H.tostring(doc)

page.html is a web page I downloaded with a proxy program I wrote earlier, which handles the proxying and converts the encoding. The file itself has been converted to UTF-8, but the charset declaration in the page still reads:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

By the way, GB2312 is a Chinese character set.

When I first ran the Python code above, it printed nothing but an empty HTML structure, which is wrong and not what I wanted.

After some experimenting, I found that the charset declaration triggers the problem: when I replaced 'charset=gb2312' with an empty string, the parsing code worked as I expected.

But I don't quite understand why this happens. Is the way I solved the problem the right method, or did it work only by coincidence?
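A plausible reading of the symptom, offered as an assumption rather than something the post confirms, is that the parser trusts the declared charset and tries to read UTF-8 bytes as gb2312. The sketch below uses only the Python 3 standard library and a hypothetical two-character sample to show that the same bytes do not survive that mismatch:

```python
# Hypothetical sample text standing in for the Chinese content of page.html.
original = "中文"

# The file on disk is stored as UTF-8, as described in the question.
utf8_bytes = original.encode("utf-8")

# Decoding with the real encoding recovers the text.
roundtrip = utf8_bytes.decode("utf-8")

# Decoding with the declared charset (gb2312) does not; errors="replace"
# keeps the demonstration from raising on byte sequences gb2312 rejects.
misread = utf8_bytes.decode("gb2312", errors="replace")

matches = (misread == original)
print(repr(roundtrip))
print(repr(misread))
print("declared charset recovers the text:", matches)
```

If that is what happens inside the parser, removing the declaration works because nothing then contradicts the file's real encoding, but it leaves the outcome dependent on the parser's fallback guess.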


http://lxml.de/parsing.html#python-unicode-strings says:

You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.
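In line with that advice, one option is to pass the raw bytes to lxml and state the real encoding explicitly, instead of editing the meta tag. This is a minimal sketch for Python 3, not the original answer's code: the inline page is a hypothetical stand-in for page.html, and it assumes the file really is UTF-8. With a real file, the bytes would come from open('page.html', 'rb').read().

```python
import lxml.html as H

# Hypothetical stand-in for page.html: UTF-8 bytes whose <meta> tag
# still claims gb2312, mirroring the situation in the question.
page = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
    '</head><body><p>中文</p></body></html>'
).encode('utf-8')

# Give the parser the actual encoding so it does not depend on the
# charset declared inside the document.
parser = H.HTMLParser(encoding='utf-8')
doc = H.fromstring(page, parser=parser)

# Read back the paragraph text to check that it was decoded as UTF-8.
text = doc.xpath('//p')[0].text
print(text)
```

Compared with blanking out 'charset=gb2312', this leaves the downloaded file untouched and makes the encoding choice explicit in the code.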
