Writing utf-8 string inside my python files

This line in my .py file is giving me a: "UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-13: unsupported Unicode code range"

if line.startswith(u"Fußnote"):

The file is saved in utf-8 and has the encoding at the top: # -- coding: utf-8 --

I've got a lot of other .py files with utf-8 encoded Chinese text in the comments and in arrays, for example arr = [u"chinese text",], so I'm wondering why this case in particular doesn't work for me.

Let's examine that error message very closely:

"UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-13: unsupported Unicode code range"

Note carefully that it says "bytes in position 8-13" -- that's a 6-byte UTF-8 sequence. That might have been valid in the dark ages, but since Unicode was frozen at 21 bits, the maximum is FOUR bytes. UTF-8 validations and error reporting were tightened up recently; as a matter of interest, exactly what version of Python are you running?

With 2.7.1 and 2.6.6 at least, that error becomes the more useful "... can't decode byte XXXX in position 8: invalid start byte", where XXXX can only be 0xfc or 0xfd if the old message suggested a 6-byte sequence. In ISO-8859-1 or cp1252, 0xfc represents U+00FC LATIN SMALL LETTER U WITH DIAERESIS (aka u-umlaut, a likely suspect); 0xfd represents U+00FD LATIN SMALL LETTER Y WITH ACUTE (less likely).
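Here is a small sketch of how a Latin-1 u-umlaut trips the UTF-8 codec. The sample word is made up for illustration; it is not the asker's data.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: a German name stored as Latin-1 bytes, not UTF-8.
raw = u"M\xfcller".encode("latin-1")   # b'M\xfcller'; 0xfc is u-umlaut in Latin-1

try:
    raw.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first byte the codec rejected.
    error_position = exc.start
    print("decode failed: %s" % exc)

# Decoding with the encoding the bytes were really written in succeeds.
text = raw.decode("latin-1")
```

The reported position points at the 0xfc byte, which is how the position in the asker's message can be used to find the offending character in line.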

The problem is NOT with the if line.startswith(u"Fußnote"): statement in your source file. You would have got a message at COMPILE time if it wasn't proper UTF-8, and the message would have started with "SyntaxError", not "UnicodeDecodeError". In any case the UTF-8 encoding of that string is only 8 bytes long, not 14.
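The byte count is easy to check. A quick sketch, writing the sharp s as an escape so the snippet does not depend on how it is saved:

```python
# -*- coding: utf-8 -*-
literal = u"Fu\xdfnote"              # same text as u"Fußnote", written with an escape
encoded = literal.encode("utf-8")

char_count = len(literal)            # number of characters
byte_count = len(encoded)            # number of bytes; only the sharp s needs two
print("%d characters, %d bytes" % (char_count, byte_count))
```

So a failure at byte positions 8-13 cannot be inside the literal itself.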

The problem is (as @Mark Tolonen has pointed out) in whatever "line" is referring to. It can only be a str object.

To get further you need to answer Mark's questions: (1) what does print repr(line) show, and (2) have you changed site.py?

At this stage it's a good idea to clear the air about mixing str and unicode objects (in many operations, not just a.startswith(b)).

Unless the operation is defined to produce a str object, Python will NOT coerce the unicode object to str, and a.startswith(b) is not such an operation. Instead, it will attempt to decode the str object using the default (usually 'ascii') encoding.

Examples:

>>> "\xff".startswith(u"\xab")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

>>> u"\xff".startswith("\xab")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 0: ordinal not in range(128)

Furthermore, it is NOT correct to say "Mix and you get UnicodeDecodeError". It is quite possible that the str object is validly encoded in the default encoding (usually 'ascii') -- no exception is raised.

Examples:

>>> "abc".startswith(u"\xff")
False
>>> u"\xff".startswith("abc")
False
>>>

I can reproduce the UnicodeDecodeError with this code:

#!/usr/bin/env python
# -- coding: utf-8 --

line='Fußnoteno'
if line.startswith(u"Fußnote"):
    print('Hi')

Note that line is a str (byte string) object, but u"Fußnote" is a unicode object. Because the two types are mixed, Python 2 tries to convert the str object to unicode in the call to startswith, decoding it with the default ascii codec. Since the bytes that encode "ß" in line can't be decoded with the ascii codec, a UnicodeDecodeError is raised.

The error can be avoided if you first make line a unicode object:

line='Fußnoteno'.decode('utf-8')
if line.startswith(u"Fußnote"):
    print('Hi')

or if you first make u"Fußnote" a string object:

line='Fußnoteno'
if line.startswith(u"Fußnote".encode('utf-8')):
    print('Hi')

The error indicates line is not a Unicode string. In X.startswith(Y), X and Y must both be Unicode strings or both be byte strings. Mix and you get UnicodeDecodeError. Use print repr(line) to inspect it. Also, have you altered site.py to change the default encoding from 'ascii' to 'utf8'? Normally it is the 'ascii' codec that is the default for Python 2.x.
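One way to keep both sides the same type is to normalize to unicode before comparing. A minimal sketch; to_text is a hypothetical helper name, and UTF-8 as the default is an assumption about the data:

```python
# -*- coding: utf-8 -*-
def to_text(value, encoding="utf-8"):
    """Return value as a unicode string, decoding byte strings first.

    to_text is a hypothetical helper; the UTF-8 default is an assumption.
    """
    if isinstance(value, bytes):      # bytes is an alias of str on Python 2.6+
        return value.decode(encoding)
    return value

line = b"Fu\xc3\x9fnote 1"            # byte string, e.g. as read from a file
is_footnote = to_text(line).startswith(u"Fu\xdfnote")
print(is_footnote)
```

If line is not actually UTF-8, the decode inside to_text raises the UnicodeDecodeError explicitly, which at least makes the failing step obvious.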

Without seeing your code, it's unclear if the problem is the code or the data file the code is reading.

When you open the file, are you doing:

file = open("essay.txt")

or:

import codecs
file = codecs.open("essay.txt", encoding="utf-8")

What does:

print file.encoding

say if you add it just below the open line?

Both of these ways work for me:

# -- coding: utf-8 --

file = open("essay.txt")

print file.encoding

for line in file:
    uline = line.decode("utf-8")
    print type(uline)
    if uline.startswith(u"Fußnote"):
        print "Footnote"
    else:
        print "Other"

and this way:

# -- coding: utf-8 --

import codecs
file = codecs.open("essay.txt", encoding="utf-8")

print file.encoding

for line in file:
    print type(line)
    if line.startswith(u"Fußnote"):
        print "Footnote"
    else:
        print "Other"

In the first one, I am letting Python default to opening the file as a byte stream, then converting each line from a byte string to a Unicode string using uline = line.decode("utf-8").

In the second one, I am opening the file as a UTF-8 encoded file, so Python returns Unicode strings when I iterate over the file.
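A third option, assuming Python 2.6 or later, is io.open, which also takes an encoding argument and yields unicode lines. The sketch below writes its own sample file so it is self-contained; the file contents are made up for illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Write a small UTF-8 sample file (hypothetical contents).
path = os.path.join(tempfile.mkdtemp(), "essay.txt")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"Fu\xdfnote 1\nBody text\n")

labels = []
with io.open(path, encoding="utf-8") as handle:
    for line in handle:               # each line is already a unicode string
        if line.startswith(u"Fu\xdfnote"):
            labels.append(u"Footnote")
        else:
            labels.append(u"Other")

print(labels)
```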

EDIT

Here's a simple way to find out whether the file contains any non-UTF-8 data.

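import codecs
file = open("baduni.txt")
try:
    # iterdecode yields one decoded chunk per line read from the file
    for chunk in codecs.iterdecode(file, "utf-8"):
        print chunk
except UnicodeDecodeError as e:
    # the message names the offending byte and its position
    print "error:", e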

And an example of it in use:

$ echo 'ABC\0200\0101DEF' > baduni.txt
$ od -c baduni.txt
0000000   A   B   C 200   A   D   E   F  \n
0000011
$ python testuni.py
error: 'utf8' codec can't decode byte 0x80 in position 3: invalid start byte

In the example, the 4th byte (position 3, counting from 0) is 200 octal/0x80 hexadecimal.
The Wikipedia UTF-8 article shows that such a byte is valid only as a continuation byte (the second or later byte of a multi-byte sequence), never as the first byte.
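The byte roles can be read straight off the bit patterns. A rough sketch follows; classify_utf8_byte is a hypothetical helper, and it looks at bit patterns only, whereas a real decoder also rejects 0xC0, 0xC1 and 0xF5-0xF7.

```python
def classify_utf8_byte(value):
    """Describe the role a byte value (0-255) can play in UTF-8.

    Hypothetical helper for illustration; bit patterns only.
    """
    if value < 0x80:
        return "ascii"                      # 0xxxxxxx: a complete character
    if value < 0xC0:
        return "continuation"               # 10xxxxxx: never valid as a first byte
    if value < 0xE0:
        return "start of 2-byte sequence"   # 110xxxxx
    if value < 0xF0:
        return "start of 3-byte sequence"   # 1110xxxx
    if value < 0xF8:
        return "start of 4-byte sequence"   # 11110xxx
    return "invalid"                        # 0xF8-0xFF: not allowed in modern UTF-8

for sample in (0x41, 0x80, 0xC3, 0xFC):
    print("0x%02x: %s" % (sample, classify_utf8_byte(sample)))
```

This covers both cases in the thread: 0x80 is a continuation byte with no start byte before it, and 0xfc or 0xfd fall in the range that used to introduce 6-byte sequences.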

Your file is saved in some other encoding, and not UTF-8. Figure out what encoding the file is in (possibly CP1252 or so), and declare that instead.
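If you are not sure which encoding you have, one crude approach is to try a short list of candidates in order. This is only a sketch: guess_decode and the candidate list are assumptions, and a decode that succeeds does not prove the guess is right.

```python
# -*- coding: utf-8 -*-
def guess_decode(data, candidates=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding).

    guess_decode and the default candidates are assumptions for illustration.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")

# 0xdf is the sharp s in cp1252; as UTF-8 it is an incomplete sequence.
text, encoding = guess_decode(b"Fu\xdfnote")
print(encoding)
```

Trying UTF-8 first makes sense because text in other encodings rarely happens to be valid UTF-8, while single-byte codecs such as cp1252 accept nearly any input.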