For one of my data analysis pipelines, I end up generating a lot of individual CSV files. I would like to transpose them, concatenate them, and transpose them again. However, the amount of data is large, so loading it all into memory is not practical.
Concatenating the rows of data from two csv files (if that's what you meant) without loading both of them entirely into memory is a relatively easy and fast operation: just read a single row from each one, join those together, and then write that to an output file, repeating until all the input data is exhausted.
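A minimal sketch of that row-wise concatenation (Python 2, like the rest of the code here; the filenames 'a.csv', 'b.csv' and 'joined.csv' are just placeholders):
import csv
from itertools import izip

with open('a.csv', 'rb') as fa, open('b.csv', 'rb') as fb, open('joined.csv', 'wb') as fout:
    writer = csv.writer(fout)
    # izip pulls one row at a time from each reader, so only the current pair of rows is in memory
    for row_a, row_b in izip(csv.reader(fa), csv.reader(fb)):
        writer.writerow(row_a + row_b)
If the two files have different numbers of rows, izip simply stops at the shorter one.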
Transposing the data in a csv file without reading the entire thing into memory is intrinsically going to be a much slower process, since it requires the entire input file to be reread in multiple passes, each time extracting the data from just one of its columns. If that's an acceptable (or necessary) trade-off, here's basically how it could be done using the built-in csv module:
import csv

input_filename = 'input.csv'
output_filename = 'output.csv'

with open(output_filename, 'wb') as outputf:
    writer = csv.writer(outputf)
    with open(input_filename, 'rb') as inputf:
        # determine the number of columns in the input file by counting those in its first row
        # (the number of columns in the input file determines the number of rows in the output file)
        numcols = len(csv.reader(inputf).next())
        # reread the entire input file multiple times, extracting one column from each row
        for col_index in xrange(numcols):
            # write all of the column's data as a single row of the output file
            inputf.seek(0)  # rewind the file for each pass
            writer.writerow(tuple(row[col_index] for row in csv.reader(inputf)))
Another short and Pythonic solution. I used this to transpose CSVs that are 15,000,000 x 12,000. It's fast and pure Python. Everything else you need to do is trivial, and this is definitely the hardest part.
Github link: https://gist.github.com/arose13/facfb91b609d453f3ad840417faa503a
def transpose_csv_out_of_core(csv_path, output_csv_path='transposed.csv', delimiter=','):
    """
    On my laptop it can transpose at ~375,000 lines a sec
    :param csv_path:
    :param output_csv_path:
    :param delimiter:
    :return:
    """
    import csv

    transposed_iterator = zip(*csv.reader(open(csv_path)))

    with open(output_csv_path, 'w') as out:
        for row in transposed_iterator:
            out.write(delimiter.join(row) + '\n')
The following code simulates reading from two csv files. The first one has the two rows
[1,2,1]
[3,4,1]
and the second one
[7,8,2]
[9,10,2]
The result is the two rows
[1,2,1,7,8,2]
[3,4,1,9,10,2]
Is that what you wanted?
def source1():
    for i in [[1, 2, 1], [3, 4, 1]]:
        yield i

def source2():
    for i in [[7, 8, 2], [9, 10, 2]]:
        yield i

def join(*sources):
    while True:
        row = []
        for s in sources:
            row.extend(s.next())
        yield row

for row in join(source1(), source2()):
    print row
In your case you have to replace the calls to source1() and source2() with the csv file iterators.
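For example, with two hypothetical files 'a.csv' and 'b.csv', something along these lines should work (csv.reader objects are already row iterators in Python 2, so .next() can be called on them directly):
import csv

reader1 = csv.reader(open('a.csv', 'rb'))
reader2 = csv.reader(open('b.csv', 'rb'))
# each iteration pulls one row from each file and yields the joined row
for row in join(reader1, reader2):
    print row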
Here is a solution that works when the fields have fixed widths:
import sys
import os

def main():
    path_in = sys.argv[-1]
    path_out = os.path.basename(path_in) + '.transposed'

    with open(path_in) as fd_in:
        line = fd_in.readline()
        l = line.split()
        field_width = int(len(line) / len(l))
        file_size = os.path.getsize(path_in)
        cols2 = rows1 = line_count = int(file_size / len(line))
        rows2 = cols1 = len(l)

    with open(path_in) as fd_in, open(path_out, 'w') as fd_out:
        for row in range(rows2):
            for col in range(cols2 - 1):
                fd_in.seek(col * len(line) + row * field_width)
                fd_out.write('{} '.format(fd_in.read(field_width - 1)))
            fd_in.seek((col + 1) * len(line) + row * field_width)
            fd_out.write('{}\n'.format(fd_in.read(field_width - 1)))

    return

if __name__ == '__main__':
    main()
Here is a solution that works if the fields don't have fixed widths:
import sys
import os

def main():
    path_in = sys.argv[-1]
    path_out = os.path.basename(path_in) + '.transposed'
    separator = ' '

    d_seek = {}
    with open(path_in) as fd_in:
        i = 0
        while True:
            tell = fd_in.tell()
            if fd_in.readline() == '':
                break
            d_seek[i] = tell
            i += 1
    cols2 = rows1 = i

    with open(path_in) as fd_in:
        line = fd_in.readline()
        rows2 = cols1 = len(line.split(separator))
    del line

    with open(path_in) as fd_in, open(path_out, 'w') as fd_out:
        for row2 in range(rows2):
            for row1 in range(rows1):
                fd_in.seek(d_seek[row1])
                j = 0
                s = ''
                while True:
                    char = fd_in.read(1)
                    j += 1
                    if char == separator or char == '\n':
                        break
                    s += char
                d_seek[row1] += len(s) + 1
                if row1 + 1 < rows1:
                    fd_out.write('{} '.format(s))
                else:
                    fd_out.write('{}\n'.format(s))

    return

if __name__ == '__main__':
    main()
Use a generator, e.g.
from itertools import izip

file1 = open("test", "r")
file2 = open("test2", "r")

def lazy(file):
    for line in file:
        # do something with the line
        yield line

for lines in izip(lazy(file1), lazy(file2)):
    print lines
http://wiki.python.org/moin/Generators
Edit: You can use the csv module to parse it; also, I realized that the readlines() method of file objects isn't lazy, so you have to use the for line in file pattern.
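For instance, the lazy() generator above could let the csv module do the parsing; this is just a sketch (in Python 2 the files would normally be opened in binary mode for csv):
import csv

def lazy(file):
    # csv.reader is itself lazy: it parses and yields one row per iteration
    for row in csv.reader(file):
        yield row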