How do you read a file inside a zip file as text, not bytes?

Developer https://www.devze.com 2023-02-24 02:45 Source: Internet

A simple program for reading a CSV file inside a ZIP archive:

import csv, sys, zipfile

zip_file    = zipfile.ZipFile(sys.argv[1])
items_file  = zip_file.open('items.csv', 'rU')
    
for row in csv.DictReader(items_file):
    pass

works in Python 2.7:

$ python2.7 test_zip_file_py3k.py ~/data.zip
$

but not in Python 3.2:

$ python3.2 test_zip_file_py3k.py ~/data.zip
Traceback (most recent call last):
    File "test_zip_file_py3k.py", line 8, in <module>
    for row in csv.DictReader(items_file):
    File "/somedir/python3.2/csv.py", line 109, in __next__
    self.fieldnames
    File "/somedir/python3.2/csv.py", line 96, in fieldnames
    self._fieldnames = next(self.reader)
_csv.Error: iterator should return strings, not bytes (did you open the file 
in text mode?)

The csv module in Python 3 wants to see a text file, but zipfile.ZipFile.open returns a zipfile.ZipExtFile that is always treated as binary data.

How does one make this work in Python 3?


I just noticed that Lennart's answer didn't work with Python 3.1, but it does work with Python 3.2. They've enhanced zipfile.ZipExtFile in Python 3.2 (see release notes). These changes appear to make zipfile.ZipExtFile work nicely with io.TextIOWrapper.

Incidentally, it works in Python 3.1, if you uncomment the hacky lines below to monkey-patch zipfile.ZipExtFile, not that I would recommend this sort of hackery. I include it only to illustrate the essence of what was done in Python 3.2 to make things work nicely.

$ cat test_zip_file_py3k.py 
import csv, io, sys, zipfile

zip_file    = zipfile.ZipFile(sys.argv[1])
items_file  = zip_file.open('items.csv', 'rU')
# items_file.readable = lambda: True
# items_file.writable = lambda: False
# items_file.seekable = lambda: False
# items_file.read1 = items_file.read
items_file  = io.TextIOWrapper(items_file)
    
for idx, row in enumerate(csv.DictReader(items_file)):
    print('Processing row {0} -- row = {1}'.format(idx, row))

If I had to support py3k < 3.2, then I would go with the solution in my other answer.

Update for 3.6+

Starting with 3.6, support for mode='U' was removed:

Changed in version 3.6: Removed support of mode='U'. Use io.TextIOWrapper for reading compressed text files in universal newlines mode.

Starting with 3.8, zipfile gained a Path object whose open() method can be called like the built-in open() function (passing newline='' in the case of our CSV). From 3.9 onward it opens in text mode by default and returns an io.TextIOWrapper object that the csv readers accept. See Yuri's answer.
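
To make the 3.6 change concrete, here is a minimal sketch, assuming Python 3.6 or later. It builds a tiny archive in memory (the rows are made up for illustration; only the member name items.csv comes from the question), shows that 'rU' is rejected, and then gets universal-newline behaviour from io.TextIOWrapper instead:

```python
import io
import zipfile

# Self-contained setup: a tiny in-memory archive with Windows line endings.
# The member name 'items.csv' follows the question; the contents are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('items.csv', 'name,price\r\nspam,1\r\n')

with zipfile.ZipFile(buffer) as archive:
    # On Python 3.6+ ZipFile.open() only accepts mode 'r' or 'w'.
    try:
        archive.open('items.csv', 'rU')
        mode_u_rejected = False
    except ValueError:
        mode_u_rejected = True

    # Replacement: open in binary mode and let TextIOWrapper decode the bytes.
    # With the default newline=None, '\r\n' is translated to '\n' on read.
    with io.TextIOWrapper(archive.open('items.csv'), encoding='utf-8') as text_file:
        text = text_file.read()

print(mode_u_rejected)
print(repr(text))
```

For the csv module you would pass newline='' instead, so that the reader does its own newline handling.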


You can wrap it in an io.TextIOWrapper.

items_file  = io.TextIOWrapper(items_file, encoding='your-encoding', newline='')

Should work.
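
As a fuller sketch of that one-liner, assuming Python 3.2 or later and UTF-8 data: the archive below is built in memory purely so the example runs on its own, and the rows in it are invented for illustration.

```python
import csv
import io
import zipfile

# Build a small archive in memory so the sketch is self-contained.
# 'items.csv' is the member name from the question; the rows are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('items.csv', 'name,price\r\nspam,1\r\neggs,2\r\n')

rows = []
with zipfile.ZipFile(buffer) as archive:
    # archive.open() returns a binary stream; TextIOWrapper decodes it
    # incrementally, so the member is not loaded into memory all at once.
    # newline='' leaves line endings alone, as the csv module expects.
    with io.TextIOWrapper(archive.open('items.csv'),
                          encoding='utf-8', newline='') as items_file:
        for row in csv.DictReader(items_file):
            rows.append(row)

for row in rows:
    print(row['name'], row['price'])
```

Closing the wrapper also closes the underlying zip member, so a single with block is enough.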


And if you just want to read a file into a string:

from zipfile import ZipFile

with ZipFile('spam.zip') as myzip:
    with myzip.open('eggs.txt') as myfile:
        eggs = myfile.read().decode('UTF-8')
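
A slightly shorter variant, as a sketch: ZipFile.read() returns the whole member as bytes in one call, which you can then decode. The archive here is created in memory with made-up contents so the example runs without a spam.zip on disk, and UTF-8 is an assumption about the encoding.

```python
import io
import zipfile

# In-memory stand-in for spam.zip; the text content is invented.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as myzip:
    myzip.writestr('eggs.txt', 'scrambled caf\u00e9 eggs\n')

with zipfile.ZipFile(buffer) as myzip:
    # read() gives bytes; decode() turns them into a str.
    eggs = myzip.read('eggs.txt').decode('utf-8')

print(eggs)
```

Like the version with open(), this loads the entire member into memory, so it suits small files best.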


Lennart's answer is on the right track (Thanks, Lennart, I voted up your answer) and it almost works:

$ cat test_zip_file_py3k.py 
import csv, io, sys, zipfile

zip_file    = zipfile.ZipFile(sys.argv[1])
items_file  = zip_file.open('items.csv', 'rU')
items_file  = io.TextIOWrapper(items_file, encoding='iso-8859-1', newline='')

for idx, row in enumerate(csv.DictReader(items_file)):
    print('Processing row {0}'.format(idx))

$ python3.1 test_zip_file_py3k.py ~/data.zip
Traceback (most recent call last):
  File "test_zip_file_py3k.py", line 7, in <module>
    items_file  = io.TextIOWrapper(items_file, 
                                   encoding='iso-8859-1', 
                                   newline='')
AttributeError: readable

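The problem appears to be that io.TextIOWrapper's first required parameter is a buffer, not just any file object, and in Python 3.1 the object returned by ZipFile.open lacks the readable attribute the wrapper looks for.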

This appears to work:

items_file  = io.TextIOWrapper(io.BytesIO(items_file.read()))

This seems a little complex, and it is also annoying to have to read a whole (perhaps huge) archived file into memory. Any better way?

Here it is in action:

$ cat test_zip_file_py3k.py 
import csv, io, sys, zipfile

zip_file    = zipfile.ZipFile(sys.argv[1])
items_file  = zip_file.open('items.csv', 'rU')
items_file  = io.TextIOWrapper(io.BytesIO(items_file.read()))

for idx, row in enumerate(csv.DictReader(items_file)):
    print('Processing row {0}'.format(idx))

$ python3.1 test_zip_file_py3k.py ~/data.zip
Processing row 0
Processing row 1
Processing row 2
...
Processing row 250


Starting with Python 3.8, the zipfile module has the Path object. From Python 3.9 onward its open() method defaults to text mode and returns an io.TextIOWrapper object, which can be passed to the csv readers:

import csv, sys, zipfile

# Give a string path to the ZIP archive, and
# the archived file to read from 
items_zipf = zipfile.Path(sys.argv[1], at='items.csv')

# Then use the open method, like you'd usually
# use the built-in open()
items_f = items_zipf.open(newline='')

# Pass the TextIO-like file to your reader as normal
for row in csv.DictReader(items_f):
    print(row)
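
Here is a self-contained sketch of the same idea, assuming Python 3.9 or later (where Path.open() supports text mode). The archive is built in memory with made-up rows, the encoding is stated explicitly, and a with block closes the file when done:

```python
import csv
import io
import zipfile

# In-memory archive so the example runs on its own; the rows are invented.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('items.csv', 'name,price\r\nspam,1\r\n')

with zipfile.ZipFile(buffer) as archive:
    # zipfile.Path accepts an open ZipFile as well as a path string.
    items_zipf = zipfile.Path(archive, at='items.csv')

    # Text mode is the default; keyword arguments go to io.TextIOWrapper.
    with items_zipf.open(encoding='utf-8', newline='') as items_f:
        rows = list(csv.DictReader(items_f))

print(rows)
```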


Here's a minimal recipe to open a zip file and read a text file inside that zip. I found the trick to be the TextIOWrapper read() method, not mentioned in any answers above (BytesIO.read() was mentioned above, but Python docs recommend TextIOWrapper).

import zipfile
import io

# Create the ZipFile object
zf = zipfile.ZipFile('my_zip_file.zip')

# Open a file that is inside the zip; this returns a binary file-like object
my_file_binary = zf.open('my_text_file_inside_zip.txt')

# Convert the binary file-like object directly to text using TextIOWrapper and its read() method
my_file_text  = io.TextIOWrapper(my_file_binary, encoding='utf-8', newline='').read()

I wish they had kept the mode='U' parameter in the ZipFile open() method to do this same thing, since that was so succinct, but, alas, it has been removed.
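
If succinctness is the goal, one possible alternative on Python 3.8+ is zipfile.Path.read_text(), sketched below. The archive is created in memory so the example is runnable as is; the file name is the placeholder from the recipe above and the contents are made up.

```python
import io
import zipfile

# In-memory stand-in for my_zip_file.zip; the text is invented.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('my_text_file_inside_zip.txt', 'first line\nsecond line\n')

with zipfile.ZipFile(buffer) as zf:
    # read_text() opens the member, decodes it, and returns a str.
    member = zipfile.Path(zf, at='my_text_file_inside_zip.txt')
    my_file_text = member.read_text(encoding='utf-8')

print(my_file_text)
```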
