For one of my data analysis pipelines, I end up generating a lot of individual CSV files. I would like to transpose them, concatenate them, and transpose them again. However, the amount of data is large, so loading it all into memory is not practical.
Concatenating the rows of data from two csv files (if that's what you meant) without loading both of them entirely into memory is a relatively easy and fast operation: just read a single row from each one, join those together, and then write that to an output file, repeating until all the input data is exhausted.
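A minimal sketch of that row-wise concatenation (Python 2, like the rest of the code here; the filenames 'a.csv', 'b.csv' and 'joined.csv' are just placeholders):
import csv
from itertools import izip

with open('a.csv', 'rb') as fa, open('b.csv', 'rb') as fb, open('joined.csv', 'wb') as fout:
    writer = csv.writer(fout)
    # izip pulls one row at a time from each reader, so only the current pair of rows is in memory
    for row_a, row_b in izip(csv.reader(fa), csv.reader(fb)):
        writer.writerow(row_a + row_b)
If the two files have different numbers of rows, izip simply stops at the shorter one.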
Transposing the data in a csv file without reading the entire thing into memory is intrinsically going to be a much slower process, since it requires the entire input file to be reread in multiple passes, each time extracting the data from just one of its columns. If that's an acceptable (or necessary) trade-off, here's basically how it could be done using the built-in csv module:
import csv

input_filename = 'input.csv'
output_filename = 'output.csv'

with open(output_filename, 'wb') as outputf:
    writer = csv.writer(outputf)
    with open(input_filename, 'rb') as inputf:
        # determine the number of columns in the input file by counting those in its first row
        # (the number of columns in the input file determines the number of rows in the output file)
        numcols = len(csv.reader(inputf).next())
        # reread the entire input file multiple times, extracting one column from each row
        for col_index in xrange(numcols):
            # write all of the column's data as a single row of the output file
            inputf.seek(0)  # rewind the file for each pass
            writer.writerow(tuple(row[col_index] for row in csv.reader(inputf)))
Another short and Pythonic solution. I used this to transpose CSVs that are 15,000,000 x 12,000. It's fast and pure Python. Everything else you need to do is trivial, and this is definitely the hardest part.
Github link: https://gist.github.com/arose13/facfb91b609d453f3ad840417faa503a
def transpose_csv_out_of_core(csv_path, output_csv_path='transposed.csv', delimiter=','):
    """
    On my laptop it can transpose at ~375,000 lines a sec
    :param csv_path:
    :param output_csv_path:
    :param delimiter:
    :return:
    """
    import csv

    transposed_iterator = zip(*csv.reader(open(csv_path)))

    with open(output_csv_path, 'w') as out:
        for row in transposed_iterator:
            out.write(delimiter.join(row) + '\n')
The following code simulates reading from two csv files. The first one has the two rows
[1,2,1]
[3,4,1]
and the second one
[7,8,2]
[9,10,2]
The result is the two rows
[1,2,1,7,8,2]
[3,4,1,9,10,2]
Is that what you wanted?
def source1():
    for i in [[1, 2, 1], [3, 4, 1]]:
        yield i

def source2():
    for i in [[7, 8, 2], [9, 10, 2]]:
        yield i

def join(*sources):
    while True:
        row = []
        for s in sources:
            row.extend(s.next())
        yield row

for row in join(source1(), source2()):
    print row
In your case you have to replace the calls to source1() and source2() with the csv file iterators.
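For example, with two hypothetical files 'a.csv' and 'b.csv', something along these lines should work (csv.reader objects are already row iterators in Python 2, so .next() can be called on them directly):
import csv

reader1 = csv.reader(open('a.csv', 'rb'))
reader2 = csv.reader(open('b.csv', 'rb'))
# each iteration pulls one row from each file and yields the joined row
for row in join(reader1, reader2):
    print row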
Here is a solution that works when the fields have fixed widths:
import sys
import os

def main():
    path_in = sys.argv[-1]
    path_out = os.path.basename(path_in) + '.transposed'

    with open(path_in) as fd_in:
        line = fd_in.readline()
        l = line.split()
        field_width = int(len(line) / len(l))
        file_size = os.path.getsize(path_in)
        cols2 = rows1 = line_count = int(file_size / len(line))
        rows2 = cols1 = len(l)

    with open(path_in) as fd_in, open(path_out, 'w') as fd_out:
        for row in range(rows2):
            for col in range(cols2 - 1):
                fd_in.seek(col * len(line) + row * field_width)
                fd_out.write('{} '.format(fd_in.read(field_width - 1)))
            fd_in.seek((col + 1) * len(line) + row * field_width)
            fd_out.write('{}\n'.format(fd_in.read(field_width - 1)))

    return

if __name__ == '__main__':
    main()
Here is a solution that works if the fields don't have fixed widths:
import sys
import os

def main():
    path_in = sys.argv[-1]
    path_out = os.path.basename(path_in) + '.transposed'
    separator = ' '

    d_seek = {}
    with open(path_in) as fd_in:
        i = 0
        while True:
            tell = fd_in.tell()
            if fd_in.readline() == '':
                break
            d_seek[i] = tell
            i += 1
    cols2 = rows1 = i

    with open(path_in) as fd_in:
        line = fd_in.readline()
        rows2 = cols1 = len(line.split(separator))
    del line

    with open(path_in) as fd_in, open(path_out, 'w') as fd_out:
        for row2 in range(rows2):
            for row1 in range(rows1):
                fd_in.seek(d_seek[row1])
                j = 0
                s = ''
                while True:
                    char = fd_in.read(1)
                    j += 1
                    if char == separator or char == '\n':
                        break
                    s += char
                d_seek[row1] += len(s) + 1
                if row1 + 1 < rows1:
                    fd_out.write('{} '.format(s))
                else:
                    fd_out.write('{}\n'.format(s))

    return

if __name__ == '__main__':
    main()
Use a generator, e.g.
from itertools import izip

file1 = open("test", "r")
file2 = open("test2", "r")

def lazy(file):
    for line in file:
        # do something with the line
        yield line

for lines in izip(lazy(file1), lazy(file2)):
    print lines
http://wiki.python.org/moin/Generators
Edit: You can use the csv module to parse it; also, I realized that the readlines() method of file objects isn't lazy, so you have to use the for line in file pattern.
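For instance, the lazy() generator above could let the csv module do the parsing; this is just a sketch (in Python 2 the files would normally be opened in binary mode for csv):
import csv

def lazy(file):
    # csv.reader is itself lazy: it parses and yields one row per iteration
    for row in csv.reader(file):
        yield row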