decoding problem with urllib2 in Python

Developer https://www.devze.com 2023-01-25 14:51 Source: the web

I'm trying to use urllib2 in Python 2.7 to fetch a page from the web. The page happens to be encoded in Unicode (UTF-8) and contains Greek characters. When I try to fetch and print it with the code below, I get gibberish instead of the Greek characters.

import urllib2
print urllib2.urlopen("http://www.pamestihima.gr").read()

The result is the same both in NetBeans 6.9.1 and in the Windows 7 command line.

I'm doing something wrong, but what?

Unicode is not UTF-8. UTF-8 is a string encoding, like ISO-8859-1, ASCII etc.
Always decode your data as soon as possible, to make real Unicode out of it: 'some string'.decode('utf-8') == u'some string'. Unicode objects are written u'', not ''.
When you have data leaving your app, always encode it in the proper encoding. For web stuff this is mostly UTF-8. For console stuff this is whatever your console encoding is. On Windows this is not UTF-8 by default.
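
The decode-early, encode-late advice above can be sketched roughly as follows. This is only an illustration under a few assumptions: the helper names `decode_page` and `encode_for_console` are made up for this example, the hard-coded bytes stand in for what `urllib2.urlopen(...).read()` would return, and `cp1253` (the Windows Greek code page) is used purely as an example target encoding.

```python
import sys


def decode_page(raw_bytes, encoding="utf-8"):
    """Turn the raw bytes of a response into a real Unicode string."""
    return raw_bytes.decode(encoding)


def encode_for_console(text, console_encoding=None):
    """Encode Unicode text for output (hypothetical helper).

    Falls back to whatever sys.stdout reports, then to ASCII.
    Characters the target encoding cannot represent become '?'.
    """
    target = console_encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(target, "replace")


# Stand-in for urllib2.urlopen(...).read(): four Greek letters as UTF-8 bytes.
sample = b"\xce\xa0\xce\xac\xce\xbc\xce\xb5"

# Decode as early as possible: 8 bytes become 4 characters.
text = decode_page(sample)

# Encode as late as possible, for the encoding the output side expects.
cp1253_bytes = encode_for_console(text, "cp1253")
ascii_bytes = encode_for_console(text, "ascii")
```

Printing `sample` directly sends UTF-8 bytes to a console that may expect a different code page, which is one plausible source of the gibberish in the question. Encoding `text` for the console's own encoding avoids that mismatch, at the cost of `?` placeholders when the console encoding has no Greek letters at all.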

It prints correctly for me, too.

Check the character encoding of the program in which you are viewing the HTML source code. For example, in a Linux terminal, you can find "Set Character Encoding" and make sure it is UTF-8.
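
One way to check this from inside Python is to ask the interpreter which encoding it thinks the output side uses. The snippet below is a small sketch, not a definitive diagnostic: `describe_output_encoding` and `can_display` are hypothetical helper names, and the values reported depend entirely on the terminal or IDE the script runs in.

```python
import locale
import sys


def describe_output_encoding():
    """Report the encodings Python associates with the current environment."""
    return {
        # May be None when output is piped or redirected.
        "stdout": getattr(sys.stdout, "encoding", None),
        # The locale's preferred encoding, without changing the locale.
        "preferred": locale.getpreferredencoding(False),
    }


def can_display(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    if not encoding:
        return False
    try:
        text.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        # LookupError covers encoding names Python does not know.
        return False


info = describe_output_encoding()
greek_ok = can_display(u"\u03a0\u03ac\u03bc\u03b5", info["stdout"])
```

If `greek_ok` is false, the console encoding itself cannot represent the Greek text, and changing the terminal's character encoding (or the code page on Windows) is the fix rather than anything in the fetching code.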