I have two large (~100 GB) text files that must be iterated through simultaneously.
zip() works well for smaller files, but I found out that it actually builds a list of line pairs from my two files. This means that every line gets stored in memory. I don't need to do anything with the lines more than once.
handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')
for i, j in zip(handle1, handle2):
    # do something with i and j
    # write to an output file
    # no need to do anything with i and j after this
Is there an alternative to zip() that acts as a generator and will allow me to iterate through these two files without using over 200 GB of RAM?
itertools has a function izip that does exactly that (in Python 2; in Python 3 the built-in zip is already lazy):

from itertools import izip

for i, j in izip(handle1, handle2):
    ...

If the files are of different sizes you may use izip_longest, as izip will stop at the end of the smaller file.
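A quick way to convince yourself the pairing is lazy: pull just a few pairs off the front and note that nothing is materialized. This sketch uses plain iterators as stand-ins for the file handles (an assumption for illustration; Python 3 zip shown):

```python
from itertools import islice

# Stand-ins for the two file handles: plain iterators over a large range.
a = iter(range(1000000))
b = iter(range(1000000))

pairs = zip(a, b)  # lazy in Python 3: no list of pairs is built
first_three = list(islice(pairs, 3))
print(first_three)  # [(0, 0), (1, 1), (2, 2)]
```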
You can use izip_longest like this to pad the shorter file with empty lines, in Python 2.6:

from itertools import izip_longest

with open('filea', 'r') as handle1:
    with open('fileb', 'r') as handle2:
        for i, j in izip_longest(handle1, handle2, fillvalue=""):
            ...
or in Python 3+, where the function is named zip_longest:

from itertools import zip_longest

with open('filea', 'r') as handle1, open('fileb', 'r') as handle2:
    for i, j in zip_longest(handle1, handle2, fillvalue=""):
        ...
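To see what fillvalue does here, a tiny sketch with short in-memory lists standing in for the file handles (the sample lines are made up):

```python
from itertools import zip_longest

# Made-up sample lines standing in for two files of unequal length.
lines_a = ["a1\n", "a2\n", "a3\n"]
lines_b = ["b1\n"]

# The shorter side is padded with fillvalue once it runs out.
result = list(zip_longest(lines_a, lines_b, fillvalue=""))
print(result)  # [('a1\n', 'b1\n'), ('a2\n', ''), ('a3\n', '')]
```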
If you want to truncate at the end of the shortest file:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')
try:
    while True:
        i = next(handle1)
        j = next(handle2)
        # do something with i and j
        # write to an output file
except StopIteration:
    pass
finally:
    handle1.close()
    handle2.close()
Else, to keep going until both files are exhausted, padding the side that ends first with empty strings:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')
i_ended = False
j_ended = False
while True:
    try:
        i = next(handle1)
    except StopIteration:
        i, i_ended = "", True   # pad with an empty string once filea ends
    try:
        j = next(handle2)
    except StopIteration:
        j, j_ended = "", True   # likewise for fileb
    if i_ended and j_ended:
        break
    # do something with i and j
    # write to an output file
handle1.close()
handle2.close()
Or, using readline(), which returns an empty string at end of file instead of raising:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')
while True:
    i = handle1.readline()
    j = handle2.readline()
    if not i and not j:
        break
    # do something with i and j
    # write to an output file
handle1.close()
handle2.close()
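The readline() loop can be exercised with small in-memory files; io.StringIO objects are used here as stand-ins for the real handles (an assumption for illustration):

```python
import io

# io.StringIO stands in for the real file handles.
handle1 = io.StringIO("x1\nx2\n")
handle2 = io.StringIO("y1\n")

pairs = []
while True:
    i = handle1.readline()  # returns "" at EOF rather than raising
    j = handle2.readline()
    if not i and not j:
        break
    pairs.append((i, j))
print(pairs)  # [('x1\n', 'y1\n'), ('x2\n', '')]
```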
Something like this? Wordy, but it seems to be what you're asking for.
It can be adjusted to do things like a proper merge to match keys between the two files, which is often more what's needed than the simplistic zip function. Also, this doesn't truncate: like a SQL OUTER JOIN, it keeps going until both inputs are exhausted, again different from what zip does and more typical of file processing.
with open("file1", "r") as file1:
    with open("file2", "r") as file2:
        for line1, line2 in parallel(file1, file2):
            # process lines
            ...

def parallel(file1, file2):
    if1_more, if2_more = True, True
    while if1_more or if2_more:
        line1, line2 = None, None  # Assume simplistic zip-style matching
        # If you're going to compare keys, then you'd do that before
        # deciding what to read.
        if if1_more:
            try:
                line1 = next(file1)
            except StopIteration:
                if1_more = False
        if if2_more:
            try:
                line2 = next(file2)
            except StopIteration:
                if2_more = False
        if line1 is None and line2 is None:
            break  # both files just ended; don't yield a (None, None) pair
        yield line1, line2
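The key-matching merge alluded to above could look roughly like this. It's a sketch, not the original poster's code: it assumes both files are sorted by a key in the first whitespace-delimited field, and the names merge_by_key and key_of are hypothetical:

```python
import io

def key_of(line):
    # Hypothetical helper: the key is the first whitespace-delimited field.
    return line.split(None, 1)[0]

def merge_by_key(file1, file2):
    # Sorted-merge (outer-join style): assumes both inputs are sorted by
    # key_of; yields (line1, line2) pairs, with None for a missing side.
    line1, line2 = next(file1, None), next(file2, None)
    while line1 is not None or line2 is not None:
        if line2 is None or (line1 is not None and key_of(line1) < key_of(line2)):
            yield line1, None
            line1 = next(file1, None)
        elif line1 is None or key_of(line2) < key_of(line1):
            yield None, line2
            line2 = next(file2, None)
        else:  # keys match: emit the pair and advance both sides
            yield line1, line2
            line1, line2 = next(file1, None), next(file2, None)

# Usage with small in-memory files standing in for the real ones:
f1 = io.StringIO("a 1\nb 2\nd 4\n")
f2 = io.StringIO("b x\nc y\n")
merged = list(merge_by_key(f1, f2))
print(merged)
# [('a 1\n', None), ('b 2\n', 'b x\n'), (None, 'c y\n'), ('d 4\n', None)]
```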