I have a bunch of files (almost 100) which contain data of the format: (number of people) \t (average age)
These files were generated from a random walk conducted on a population of a certain demographic. Each file has 100,000 lines, corresponding to the average age of populations of sizes from 1 to 100,000. Each file corresponds to a different locality in a third world country. We will be comparing these values to the average ages of similar sized localities in a developed country.
What I want to do is:

for each i (i ranges from 1 to 100,000):
    read in the first i values of average-age
    perform some statistics on these values

That is, for each run i (where i ranges from 1 to 100,000), read in the first i values of average-age, add them to a list, and run a few tests (like Kolmogorov-Smirnov or chi-square).
In order to open all these files in parallel, I figured the best way would be a dictionary of file objects. But I am stuck trying to do the above operations.
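Something like this is what I have in mind (just a sketch, not working code; the file names, the dictionary of handles, and the scipy call are my own placeholders):

#!/usr/bin/env python
# Sketch only: assumes files named 'locality-1.txt' .. 'locality-100.txt'
from scipy import stats

# dictionary of file objects, keyed by locality index
handles = dict((k, open('locality-%d.txt' % k)) for k in range(1, 101))
values = dict((k, []) for k in handles)   # growing lists of average ages, one per file

for i in range(1, 100001):
    # read one more line from every file, so values[k] holds the first i average ages
    for k, fh in handles.items():
        population, average_age = fh.readline().split('\t')
        values[k].append(float(average_age))
    # run a few tests on the first i values, e.g. a two-sample KS test against
    # the corresponding developed-country values (not shown here):
    # d, p = stats.ks_2samp(values[k], developed_country_values[:i])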
Is my method the best possible one (complexity-wise)?
Is there a better method?
Actually, it would be possible to hold 10,000,000 lines in memory.
Make a dictionary where the keys are the number of people and the values are lists of average ages, where each element of the list comes from a different file. Therefore, if there are 100 files, each of your lists will have 100 elements. This way, you don't need to store the file objects in a dict.
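Here is a minimal sketch of that structure (not from the original answer; it assumes the files are named 'locality-1.txt' through 'locality-100.txt' and are tab-separated as described -- adjust the names to your own):

#!/usr/bin/env python
# Build {number_of_people: [avg_age_from_file_1, ..., avg_age_from_file_100]}
ages_by_population = {}
for k in range(1, 101):
    f = open('locality-%d.txt' % k)
    for line in f:
        population, average_age = line.split('\t')
        ages_by_population.setdefault(int(population), []).append(float(average_age))
    f.close()

# ages_by_population[50000] is now a list of 100 average ages,
# one per locality, for a population of size 50,000.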
Hope this helps
Why not take a simple approach:
- Open each file sequentially and read its lines to fill an in-memory data structure
- Perform statistics on the in-memory data structure
Here is a self-contained example with 3 "files", each containing 3 lines. It uses StringIO
for convenience instead of actual files:
#!/usr/bin/env python
# coding: utf-8
from StringIO import StringIO
# for this example, each "file" has 3 lines instead of 100000
f1 = '1\t10\n2\t11\n3\t12'
f2 = '1\t13\n2\t14\n3\t15'
f3 = '1\t16\n2\t17\n3\t18'
files = [f1, f2, f3]
# data is a list of dictionaries mapping population to average age
# i.e. data[0][10000] contains the average age in location 0 (files[0]) with
# population of 10000.
data = []
for i, filename in enumerate(files):
    f = StringIO(filename)       # here 'filename' actually holds the file's contents
    # f = open(filename, 'r')    # use this line instead when reading real files
    data.append(dict())
    for line in f:
        population, average_age = (int(s) for s in line.split('\t'))
        data[i][population] = average_age
print data
# gather custom statistics on the data
# i.e. here's how to calculate the average age across all locations where
# population is 2:
num_locations = len(data)
pop2_avg = sum((data[loc][2] for loc in xrange(num_locations)))/num_locations
print 'Average age with population 2 is', pop2_avg, 'years old'
The output is:
[{1: 10, 2: 11, 3: 12}, {1: 13, 2: 14, 3: 15}, {1: 16, 2: 17, 3: 18}]
Average age with population 2 is 14 years old
I ... don't know if I like this approach, but it's possible that it could work for you. It has the potential to consume large amounts of memory, but may do what you need it to. I make the assumption that your data files are numbered. If that's not the case this may need adaptation.
# open the files.
handles = [open('file-%d.txt' % i) for i in range(1, 101)]
# loop for the number of lines.
for line in range(100000):
    lines = [fh.readline() for fh in handles]
    # Some sort of processing for the list of lines.
That may get close to what you need, but again, I don't know that I like it. If you have any files that don't have the same number of lines this could run into trouble.
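One possible guard against that (my own addition, not part of the answer above): readline() returns an empty string at end of file, so you can stop as soon as any handle runs out:

for line in range(100000):
    lines = [fh.readline() for fh in handles]
    if any(l == '' for l in lines):
        # at least one file has fewer lines than expected; stop (or handle it here)
        break
    # Some sort of processing for the list of lines.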