I have a bunch of files (almost 100) which contain data of the format: (number of people) \t (average age)
These files were generated from a random walk conducted on a population of a certain demographic. Each file has 100,000 lines, corresponding to the average age of populations of sizes from 1 to 100,000. Each file corresponds to a different locality in a third world country. We will be comparing these values to the average ages of similar sized localities in a developed country.
What I want to do is:

for each i (i ranges from 1 to 100,000):
    read in the first i values of average-age
    perform some statistics on these values

That is, for each run i (where i ranges from 1 to 100,000), read in the first i values of average-age, add them to a list, and run a few tests (like Kolmogorov-Smirnov or chi-square).
In order to open all these files in parallel, I figured the best way would be a dictionary of file objects. But I am stuck trying to do the above operations.
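Something like this is what I have in mind (just a sketch, not working code; the file names, the dictionary of handles, and the scipy call are my own placeholders):

#!/usr/bin/env python
# Sketch only: assumes files named 'locality-1.txt' .. 'locality-100.txt'
from scipy import stats

# dictionary of file objects, keyed by locality index
handles = dict((k, open('locality-%d.txt' % k)) for k in range(1, 101))
values = dict((k, []) for k in handles)   # growing lists of average ages, one per file

for i in range(1, 100001):
    # read one more line from every file, so values[k] holds the first i average ages
    for k, fh in handles.items():
        population, average_age = fh.readline().split('\t')
        values[k].append(float(average_age))
    # run a few tests on the first i values, e.g. a two-sample KS test against
    # the corresponding developed-country values (not shown here):
    # d, p = stats.ks_2samp(values[k], developed_country_values[:i])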
Is my method the best possible one (complexity-wise)?
Is there a better method?
Actually, it would be possible to hold 10,000,000 lines in memory.
Make a dictionary where the keys are the number of people and the values are lists of average ages, where each element of the list comes from a different file. Therefore, if there are 100 files, each of your lists will have 100 elements. This way, you don't need to store the file objects in a dict.
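Here is a minimal sketch of that structure (not from the original answer; it assumes the files are named 'locality-1.txt' through 'locality-100.txt' and are tab-separated as described -- adjust the names to your own):

#!/usr/bin/env python
# Build {number_of_people: [avg_age_from_file_1, ..., avg_age_from_file_100]}
ages_by_population = {}
for k in range(1, 101):
    f = open('locality-%d.txt' % k)
    for line in f:
        population, average_age = line.split('\t')
        ages_by_population.setdefault(int(population), []).append(float(average_age))
    f.close()

# ages_by_population[50000] is now a list of 100 average ages,
# one per locality, for a population of size 50,000.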
Hope this helps
Why not take a simple approach:
- Open each file sequentially and read its lines to fill an in-memory data structure
- Perform statistics on the in-memory data structure
Here is a self-contained example with 3 "files", each containing 3 lines. It uses StringIO
for convenience instead of actual files:
#!/usr/bin/env python
# coding: utf-8
from StringIO import StringIO
# for this example, each "file" has 3 lines instead of 100000
f1 = '1\t10\n2\t11\n3\t12'
f2 = '1\t13\n2\t14\n3\t15'
f3 = '1\t16\n2\t17\n3\t18'
files = [f1, f2, f3]
# data is a list of dictionaries mapping population to average age
# i.e. data[0][10000] contains the average age in location 0 (files[0]) with
# population of 10000.
data = []
for i, filename in enumerate(files):
    f = StringIO(filename)       # here 'filename' actually holds the file's contents
    # f = open(filename, 'r')    # use this line instead when reading real files
    data.append(dict())
    for line in f:
        population, average_age = (int(s) for s in line.split('\t'))
        data[i][population] = average_age
print data
# gather custom statistics on the data
# i.e. here's how to calculate the average age across all locations where
# population is 2:
num_locations = len(data)
pop2_avg = sum((data[loc][2] for loc in xrange(num_locations)))/num_locations
print 'Average age with population 2 is', pop2_avg, 'years old'
The output is:
[{1: 10, 2: 11, 3: 12}, {1: 13, 2: 14, 3: 15}, {1: 16, 2: 17, 3: 18}]
Average age with population 2 is 14 years old
I ... don't know if I like this approach, but it's possible that it could work for you. It has the potential to consume large amounts of memory, but may do what you need it to. I make the assumption that your data files are numbered. If that's not the case this may need adaptation.
# open the files.
handles = [open('file-%d.txt' % i) for i in range(1, 101)]
# loop for the number of lines.
for line in range(100000):
    lines = [fh.readline() for fh in handles]
    # Some sort of processing for the list of lines.
That may get close to what you need, but again, I don't know that I like it. If you have any files that don't have the same number of lines this could run into trouble.
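One possible guard against that (my own addition, not part of the answer above): readline() returns an empty string at end of file, so you can stop as soon as any handle runs out:

for line in range(100000):
    lines = [fh.readline() for fh in handles]
    if any(l == '' for l in lines):
        # at least one file has fewer lines than expected; stop (or handle it here)
        break
    # Some sort of processing for the list of lines.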