I have about 100 CSV files, each roughly 100,000 rows x 40 columns. I'd like to do some statistical analysis on them: pull out some sample data, plot general trends, do variance and R-squared analysis, and plot some spectral diagrams. For now, I'm considering numpy for the analysis.
I was wondering what issues I should expect with such large files? I've already checked for erroneous data. What are your recommendations for doing the statistical analysis? Would it be better if I just split the files up and did the whole thing in Excel?
I've found that Python + CSV is probably the fastest and simplest way to do some kinds of statistical processing.
We do a fair amount of reformatting and correcting for odd data errors, so Python helps us.
The availability of Python's functional programming features makes this particularly simple. You can do sampling with tools like this.
import csv

def someStatFunction( source ):
    for row in source:
        pass  # ...some processing on each row...

def someFilterFunction( source ):
    for row in source:
        if someFunction( row ):  # someFunction is a predicate on a row
            yield row

# All rows
with open( "someFile", "rb" ) as source:
    rdr = csv.reader( source )
    someStatFunction( rdr )

# Filtered by someFilterFunction applied to each row
with open( "someFile", "rb" ) as source:
    rdr = csv.reader( source )
    someStatFunction( someFilterFunction( rdr ) )
I really like being able to compose more complex functions from simpler functions.
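For instance, a sampling step can be written as another small generator and composed with the pieces above in exactly the same way (the every-nth sampler below is just an illustrative sketch, not part of the original code):

import csv
import itertools

def sampleEveryNth( source, n=100 ):
    # Systematic sample: yield every n-th row from the source.
    return itertools.islice( source, 0, None, n )

# Stats on a 1% systematic sample of the filtered rows
with open( "someFile", "rb" ) as source:
    rdr = csv.reader( source )
    someStatFunction( sampleEveryNth( someFilterFunction( rdr ), 100 ) )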
For massive datasets you might be interested in ROOT. It can be used to analyze and very effectively store petabytes of data. It also comes with some basic and more advanced statistics tools.
While it is written to be used with C++, there are also pretty complete python bindings. They don't make it extremely easy to get direct access to the raw data (e.g. to use them in R or numpy) -- but it is definitely possible (I do it all the time).
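A minimal PyROOT sketch, just to show the flavor (this assumes ROOT is installed, that the columns are numeric, and that the branch descriptor and file name below, which are placeholders, match your data):

import ROOT

# Pull a comma-separated file straight into a TTree; the branch
# descriptor ("name/type:...") must be adapted to your actual columns.
tree = ROOT.TTree("data", "csv data")
tree.ReadFile("someFile.csv", "x/D:y/D:z/D", ',')

# Quick histogram of one column with a simple cut applied.
tree.Draw("x", "y > 0")
print tree.GetEntries()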
Python is very nice for this kind of data processing, especially if your samples are "rows" and you can process each such row independently:
row1
row2
row3
etc.
In fact your program can have a very small memory footprint, thanks to generators and generator expressions, about which you can read here: http://www.dabeaz.com/generators/ (it's not basic stuff, but some mind-twisting applications of generators).
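As a concrete sketch of that style (the file name and column index below are made up), a single-pass statistic can be computed one row at a time without ever holding a whole file in memory:

import csv

def rows(filename):
    # Lazily yield rows from a CSV file, one at a time.
    with open(filename, "rb") as f:
        for row in csv.reader(f):
            yield row

def running_mean(source, column):
    # Single-pass mean of one numeric column; memory use stays O(1).
    total, count = 0.0, 0
    for row in source:
        total += float(row[column])
        count += 1
    return total / count if count else 0.0

print running_mean(rows("someFile.csv"), 3)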
Regarding S.Lott's answer: you probably want to avoid applying filter() to a sequence of rows, because in Python 2 filter() builds the entire result list in memory - it might explode your computer if the sequence you pass it is long enough (try filter(None, itertools.count()) - after saving all your data :-)). It's much better to replace filter with something like this:
def filter_generator(func, sequence):
    # Lazy version of filter(): yields matching items one at a time
    # (also handles func being None the way the builtin does).
    for item in sequence:
        if item if func is None else func(item):
            yield item
or shorter, as a generator expression:
filtered_sequence = (item for item in sequence if (item if func is None else func(item)))
This can be further optimized by extracting the condition before the loop, but that is an exercise for the reader :-)
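For what it's worth, the standard library already has a lazy equivalent: itertools.ifilter in Python 2 (the builtin filter is lazy in Python 3). A tiny self-contained example:

import itertools

# ifilter is lazy, so it is safe even on very long (or infinite) sequences.
is_positive = lambda x: x > 0
filtered_sequence = itertools.ifilter(is_positive, itertools.count(-5))
print list(itertools.islice(filtered_sequence, 3))  # prints [1, 2, 3]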
I've been having great success with Python for CSV file reading and generation. Using a modest Core 2 Duo laptop, I have been able to store close to the same amount of data as you and process it in memory in a few minutes. My main advice is to split up your jobs so that you can do things in separate steps, since batching all your jobs at once can be a pain when you only want one feature to execute. Come up with a good battle rhythm that lets you take advantage of your resources as much as possible.
Excel is nice for smaller batches of data, but check out matplotlib for doing graphs and charts normally reserved for Excel.
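For example, a basic trend plot of one column takes only a few lines with matplotlib (the file name and column index here are placeholders):

import csv
import matplotlib.pyplot as plt

# Read one numeric column and plot it as a simple trend line.
with open("someFile.csv", "rb") as f:
    values = [float(row[0]) for row in csv.reader(f)]

plt.plot(values)
plt.xlabel("row number")
plt.ylabel("column 0")
plt.savefig("trend.png")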
In general, don't worry too much about the size. If your files get bigger by a factor of 2-3, you might start running out of memory on a 32-bit system. I figure that if each field of the table is 100 bytes, i.e., each row is 4,000 bytes, you'll be using roughly 400 MB of RAM to store the data in memory, and if you add about as much for processing, you'll still only be using 800 MB or so. These calculations are very back-of-the-envelope and extremely generous: you'll only use this much memory if you have a lot of long strings or humongous integers in your data, since the maximum you'll use for the standard datatypes is 8 bytes for a float or a long.
If you do start running out of memory, 64-bit might be the way to go. But other than that, Python will handle large amounts of data with aplomb, especially when combined with numpy/scipy. Using Numpy arrays will almost always be faster than using native lists as well. Matplotlib will take care of most plotting needs and can certainly handle the simple plots you've described.
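As a rough sketch of that kind of workflow (the file name and column indices are assumptions, not from the question), numpy can load a file and give you a variance and an R-squared in a few lines:

import numpy as np

# Load one file into a 2-D float array (rows x columns).
data = np.genfromtxt("someFile.csv", delimiter=",")

x, y = data[:, 0], data[:, 1]  # two columns of interest
print "variance of y:", y.var()
print "R-squared of x vs y:", np.corrcoef(x, y)[0, 1] ** 2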
Finally, if you find something that Python can't do, but already have a codebase written in it, take a look at RPy.