python3: readlines() indices issue?

Developer https://www.devze.com 2023-01-24 10:12 Source: web

Python 3.1.2 (r312:79147, Nov  9 2010, 09:41:54)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> open("/home/madsc13ntist/test_file.txt", "r").readlines()[6]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 2230: unexpected code byte

and yet...

Python 2.4.3 (#1, Sep  8 2010, 11:37:47)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> open("/home/madsc13ntist/test_file.txt", "r").readlines()[6]
'2010-06-14 21:14:43 613 xxx.xxx.xxx.xxx 200 TCP_NC_MISS 4198 635 GET http www.thelegendssportscomplex.com 80 /thumbnails/t/sponsors/145x138/007.gif - - - DIRECT www.thelegendssportscomplex.com image/gif http://www.thelegendssportscomplex.com/ "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; InfoPath.1; MS-RTC LM 8)" OBSERVED "Sports/Recreation" - xxx.xxx.xxx.xxx xxx.xxx.xxx.xxx\r\n'

Does anyone have any idea why .readlines()[6] doesn't work in Python 3 but does work in 2.4?

Also, I thought 0xAE was ®.

From the Python wiki:

The UnicodeDecodeError normally happens when decoding an str string from a certain coding. Since codings map only a limited number of str strings to unicode characters, an illegal sequence of str characters will cause the coding-specific decode() to fail.

It appears as though you have a different encoding than you think you do.
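As a minimal sketch of the mismatch (the variable names are illustrative): the byte 0xAE is ® in ISO-8859-1, but on its own it is not valid UTF-8, where it can only appear as a continuation byte inside a multi-byte sequence.

```python
raw = b"\xae"  # the byte reported in the traceback

# In ISO-8859-1 every byte maps directly to the code point of the same value.
as_latin1 = raw.decode("iso8859-1")

# In UTF-8, 0xAE is a continuation byte and cannot start a character.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The registered sign (U+00AE) encoded as UTF-8 takes two bytes.
as_utf8_bytes = "\u00ae".encode("utf-8")
```

So 0xAE really is ®, just not in the encoding Python 3 assumed for the file.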

open function doc:

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

Always pass an explicit encoding when reading files:

open("/home/madsc13ntist/test_file.txt", "r", encoding='iso8859-1').readlines()[6]

To ignore decoding errors instead, set errors='ignore'. The default, errors=None, behaves the same as 'strict'.
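A rough sketch of how these options behave, using a temporary file with made-up contents in place of the real log (the sample data and variable names are assumptions for illustration):

```python
import os
import tempfile

# Hypothetical sample data: the second line contains the Latin-1 byte 0xAE.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line one\nAcme\xae product\n")
    path = tmp.name

try:
    # Decoding with the file's real encoding recovers the character.
    with open(path, "r", encoding="iso8859-1") as f:
        latin1_line = f.readlines()[1]

    # errors='ignore' silently drops bytes that are not valid UTF-8.
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        ignored_line = f.readlines()[1]

    # errors='replace' substitutes U+FFFD so the loss stays visible.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        replaced_line = f.readlines()[1]
finally:
    os.remove(path)
```

Note that 'ignore' loses data without any warning, so naming the correct encoding is usually the better fix when the encoding is known.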

About two years have passed since the question was asked, so you probably already know the reason. Basically, Python 3 strings are Unicode strings. To turn the file's bytes into those abstract characters, you need to tell Python which encoding the file uses.

Python 2 strings are actually byte sequences, and Python 2 is happy to read whatever bytes are in the file. Some characters are interpreted (newlines, tabs, ...), but the rest are left untouched.

Python 3 open() is similar to Python 2 codecs.open().
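If the Python 2 behaviour is what you want, one option in Python 3 is to open the file in binary mode and decode later. This is only a sketch: the sample bytes are invented, and ISO-8859-1 is an assumption about the file's encoding.

```python
import os
import tempfile

# Hypothetical sample data standing in for the log file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first\r\nsecond \xae\r\n")
    path = tmp.name

try:
    # Binary mode returns bytes and performs no decoding, much like Python 2's open().
    with open(path, "rb") as f:
        raw_lines = f.readlines()
finally:
    os.remove(path)

# Decode explicitly once the encoding is known (ISO-8859-1 is assumed here).
text_line = raw_lines[1].decode("iso8859-1")
```

In binary mode the line endings are also left untouched, which matches the trailing \r\n seen in the Python 2.4 output.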

... the time has come ... to close the question by accepting one of the answers.