Decode function tries to encode Python

Developer https://www.devze.com 2023-02-06 13:55 Source: the web

I am trying to print a unicode string without the hex escape codes in it. I'm grabbing this data from Facebook, whose HTML headers declare the encoding as UTF-8. When I print the type, it says it's unicode, but when I try to decode it with unicode-escape, I get an encoding error. Why is it trying to encode when I use the decode method?

Code

a='really long string of unicode html text that i wont reprint'
print type(a)
 >>> <type 'unicode'>   
print a.decode('unicode-escape')
 >>> Traceback (most recent call last):
  File "scfbp.py", line 203, in myFunctionPage
    print a.decode('unicode-escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 1945: ordinal not in range(128)

It's not the decode that's failing. The error comes from trying to display the result on the console. When you use print, Python encodes the string using the default encoding, which here is ASCII. Don't use print and the decode works:

>>> a=u'really long string containing \\u20ac and some other text'
>>> type(a)
<type 'unicode'>
>>> a.decode('unicode-escape')
u'really long string containing \u20ac and some other text'
>>> print a.decode('unicode-escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 30: ordinal not in range(128)

I'd recommend using IDLE or some other interpreter that can output unicode, so that you won't run into this problem.

Update: Note that this is not the same as the situation with one less backslash, where it fails during the decode itself, but with the same error message:

>>> a=u'really long string containing \u20ac and some other text'
>>> type(a)
<type 'unicode'>
>>> a.decode('unicode-escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 30: ordinal not in range(128)
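The difference between the two cases can be seen by inspecting the strings themselves, without calling decode at all. A minimal sketch; the variable names `escaped` and `literal` are illustrative and do not come from the question:

```python
# Illustrative names; neither variable appears in the original question.
escaped = u'really long string containing \\u20ac and some other text'
literal = u'really long string containing \u20ac and some other text'

# The escaped form holds six ASCII characters (\, u, 2, 0, a, c) at that spot,
# so it is five characters longer than the form holding a single euro sign.
length_difference = len(escaped) - len(literal)

# Every character of the escaped form fits in ASCII ...
escaped_is_ascii = all(ord(ch) < 128 for ch in escaped)
# ... while the literal form contains U+20AC, which does not.
literal_is_ascii = all(ord(ch) < 128 for ch in literal)

print(length_difference)         # 5
print(escaped_is_ascii)          # True
print(literal_is_ascii)          # False
print(literal.index(u'\u20ac'))  # 30, the position reported in the traceback
```

So the first string can pass through an ASCII encode unharmed, while the second one cannot, which is why it fails before any decoding takes place.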

When you print to the console, Python tries to encode (convert) the string to the character set of your terminal. If that character set is not UTF-8 or some other one that can represent every character in the string, Python will whine and throw an exception.
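One way to sidestep the exception is to do the encoding yourself and tell the codec what to do with characters it cannot map. A rough sketch, assuming a sample string `text` that is not from the original post; the `replace` and `backslashreplace` error handlers are standard codec options:

```python
import sys

text = u'price: \u20ac5'  # illustrative sample, not from the original post

# Encoding the output stream reports; it can be None when output is piped,
# so fall back to ASCII as a conservative assumption.
target = getattr(sys.stdout, 'encoding', None) or 'ascii'

# errors='replace' substitutes '?' for characters the target cannot represent
# instead of raising UnicodeEncodeError.
for_terminal = text.encode(target, errors='replace')

# The same idea with a fixed ASCII target, so the results are predictable.
replaced = text.encode('ascii', errors='replace')          # euro becomes ?
escaped = text.encode('ascii', errors='backslashreplace')  # euro becomes \u20ac

print(replaced)
print(escaped)
```

This loses or rewrites the unmappable characters, so it suits quick inspection rather than output that has to stay faithful.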

This snags me every now and then when I do quick processing of data, with for example Turkish characters in it.

If you are running python.exe through the Windows command prompt, you can find some solutions in the question "What encoding/code page is cmd.exe using". Basically, you can change the code page with chcp, but it's quite cumbersome. I would follow Mark's advice and use something like IDLE.

>>> print type(a)
<type 'unicode'>
>>> a.decode('unicode-escape')

Why is it trying to encode when I use the decode method?

Because you decode to Unicode, and you encode from Unicode. You just tried to decode a unicode string to unicode. The first thing Python then does is try to convert it to a byte string, using the ascii codec. That's why you get:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2110' in position 3: ordinal not in range(128)

Remember: Unicode is not an encoding. Everything else is: ASCII, UTF-8, Latin-1 and so on.
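The hidden step can be spelled out as two explicit calls. The helper below is a hypothetical illustration of the behaviour described above, not Python's actual implementation, and `decode_like_python2` is a made-up name:

```python
def decode_like_python2(text, codec, implicit='ascii'):
    """Hypothetical helper mimicking unicode.decode() in Python 2."""
    # Step 1 (implicit in Python 2): text -> bytes using the default codec.
    # This is the step that raises UnicodeEncodeError.
    raw = text.encode(implicit)
    # Step 2 (the one actually requested): bytes -> text.
    return raw.decode(codec)

# Escaped backslash: step 1 succeeds, step 2 turns \u20ac into the euro sign.
result = decode_like_python2(u'containing \\u20ac', 'unicode-escape')

# Real euro sign: step 1 fails before any decoding happens.
try:
    decode_like_python2(u'containing \u20ac', 'unicode-escape')
    failed_in = None
except UnicodeEncodeError as exc:
    failed_in = exc.encoding  # name of the codec used in the failing step

print(result == u'containing \u20ac')  # True
print(failed_in)                       # ascii
```

Written this way, it is clear that the error belongs to the encode step, which the caller never asked for.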

This implicit encoding is gone in Python 3, btw, because it confuses people.
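A quick check of that separation, assuming a Python 3 interpreter (under Python 2 the results differ, because str there is a byte string):

```python
# Python 3 only: text (str) can be encoded, bytes can be decoded,
# and neither type offers the opposite method.
text = '\u20ac'
data = text.encode('utf-8')

print(hasattr(text, 'decode'))       # False
print(hasattr(data, 'encode'))       # False
print(data.decode('utf-8') == text)  # True
```

With no decode method on text, the mistake from the question becomes an AttributeError instead of a confusing UnicodeEncodeError.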
