
Dynamically parsing research data in Python


The long (winded) version: I'm gathering research data using Python. My initial parsing is ugly (but functional) code which gives me some basic information and turns my raw data into a format suitable for heavy-duty statistical analysis using SPSS. However, every time I modify the experiment, I have to dive into the analysis code.

For a typical experiment, I'll have 30 files, one per user. The field count is fixed within an experiment but varies from one experiment to another (10-20 fields). Files are typically 700-1000 records long with a header row. Records are tab-separated (see the sample below: 4 integers, 3 strings, and 9 floats).

I need to sort my list into categories. In a 1000-line file, I could have 4-256 categories. Rather than trying to pre-determine how many categories each file has, I'm using the code below to count them. The integers at the beginning of each line dictate which category the float values in the row belong to. Integer combinations can be modified by the string values to produce wildly different results, and multiple combinations can sometimes be lumped together.
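Since multiple combinations can be lumped together, one way to sketch the lumping (the combos and category names below are invented for illustration) is an explicit lookup table:

```python
# Hypothetical lumping table: several integer/string combos map to one
# category; keys and category names here are made up for illustration.
CATEGORY_MAP = {
    (3, 1, 3, 0, "Left"): "slow-left",
    (3, 1, 4, 2, "Left"): "slow-left",   # lumped with the combo above
    (10, 0, 2, 1, "Right"): "fast-right",
}

def categorize(combo):
    # Fall back to the raw combo for anything not explicitly lumped.
    return CATEGORY_MAP.get(combo, combo)
```

Anything not listed stays as its own raw-combo category, so new combinations don't get silently merged.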

Once they're in categories, number crunching begins. I get statistical info (mean, sd, etc. for each category for each file).
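The per-category number crunching could be sketched with the standard library's statistics module (the category names and values below are invented):

```python
from statistics import mean, stdev

# Toy grouping: category -> list of float observations (values invented).
by_category = {
    "A": [5.76, 8.00, 4.69],
    "B": [9.58, 7.65],
}

# Sample standard deviation needs at least two values per category.
stats = {cat: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
         for cat, vals in by_category.items()}
```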

The essentials: I need to parse data like the sample below into categories. Categories are combos of the non-floats in each record. I'm also trying to come up with a dynamic (graphical) way to associate column combinations with categories. I'll make a new post for this.

I'm looking for suggestions on how to do both.

    # data is a list of tab-separated records
    # fields is a list of my field names

    # get a list of field types via gettype on our first data row
    # (data[0] is the header row); gettype infers the type of a string
    # without changing the data
    fieldtype = [gettype(n) for n in data[1].split('\t')]

    # get the indexes of the fields that aren't floats
    mask = [i for i, field in enumerate(fieldtype) if field != "float"]

    # for each row of data (skipping the header row and the trailing
    # empty line) we split on tabs and take the ith element of that split,
    # where i is taken from mask, which tells us which fields are not floats
    records = [[row.split('\t')[i] for i in mask] for row in data[1:-1]]

    # we now get a unique set of combos
    # since set doesn't happily take a list of lists, we join each row of
    # values into a comma-separated string, so we end up with a set of strings
    uniquerecs = set(",".join(row) for row in records)

    print(len(uniquerecs))
    quit()

    def gettype(s):
        try:
            int(s)
            return "int"
        except ValueError:
            pass
        try:
            float(s)
            return "float"
        except ValueError:
            return "string"
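As a quick sanity check, gettype classifies the start of the first sample record like this (self-contained copy of the function above):

```python
def gettype(s):
    # Infer the type of a string field without converting the data.
    try:
        int(s)
        return "int"
    except ValueError:
        pass
    try:
        float(s)
        return "float"
    except ValueError:
        return "string"

# First eight fields of the first sample record.
row = "10\t0\t2\t1\tRight\tRight\tRight\t5.76765674196".split('\t')
types = [gettype(f) for f in row]
```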

Sample Data:

field0  field1  field2  field3  field4  field5  field6  field7  field8  field9  field10 field11 field12 field13 field14 field15
10  0   2   1   Right   Right   Right   5.76765674196   0.0310912272139 0.0573603238282 0.0582901376612 0.0648936500524 0.0655294305058 0.0720571099855 0.0748289246137 0.446033755751
3   1   3   0   Left    Left    Right   8.00982745764   0.0313840132052 0.0576521406854 0.0585844966069 0.0644905497442 0.0653386429438 0.0712603578765 0.0740345755708 0.2641076191
5   19  1   0   Right   Left    Left    4.69440026591   0.0313852052224 0.0583165354345 0.0592403274967 0.0659404609478 0.0666070804916 0.0715314027001 0.0743022054775 0.465994962101
3   1   4   2   Left    Right   Left    9.58648184552   0.0303649003017 0.0571579895338 0.0580911765412 0.0634304670863 0.0640132919609 0.0702920967445 0.0730697946335 0.556525293
9   0   0   7   Left    Left    Left    7.65374257547   0.030318719717  0.0568551744109 0.0577785415066 0.0640577002605 0.0647226582655 0.0711459854908 0.0739256050784 1.23421547397


Not sure if I understand your question, but here are a few thoughts:

For parsing the data files, you would usually use the Python csv module.

For categorizing the data you could use a defaultdict with the non-float fields joined as a key for the dict. Example:

from collections import defaultdict
import csv

with open('data.file', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    lines = list(reader)

mask = [i for i, n in enumerate(lines[1]) if gettype(n) != "float"]
data_of_category = defaultdict(list)
for line in lines[1:]:
    category = ','.join(line[i] for i in mask)
    data_of_category[category].append(line)

This way you don't have to count the categories in advance and can process the data in a single pass.

And I didn't understand the part about "a dynamic (graphical) way to associate column combinations with categories".


For at least part of your question, have a look at Named Tuples
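A minimal sketch of how a named tuple could wrap a parsed row (field names from the sample header, values from the first sample record, truncated to eight fields):

```python
from collections import namedtuple

# Field names taken from the sample header (first eight, for brevity).
DataRow = namedtuple('DataRow', ['field0', 'field1', 'field2', 'field3',
                                 'field4', 'field5', 'field6', 'field7'])

row = DataRow(10, 0, 2, 1, 'Right', 'Right', 'Right', 5.76765674196)

# Fields are accessible by name, and the leading ints make a natural key.
category_key = (row.field0, row.field1, row.field2, row.field3)
```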


Step 1: Use something like csv.DictReader to turn the text file into an iterable of rows.

Step 2: Turn that into a dict of first entry: rest of entries.

import csv

with open("...", newline='') as data_file:
    lines = csv.reader(data_file, some_custom_dialect)
    categories = {line[0]: line[1:] for line in lines}

Step 3: Iterate over the items() of the data and do something with each line.

for category, line in categories.items():
    do_stats_to_line(line)


Some useful answers already but I'll throw mine in as well. Key points:

  1. Use the csv module
  2. Use collections.namedtuple for each row
  3. Group the rows using a tuple of int field values as the key

If your source rows are sorted by the keys (the integer column values), you could use itertools.groupby instead, which would likely reduce memory consumption. Given your example data, and the fact that your files contain at most ~1000 rows, this is probably not an issue to worry about.
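For reference, a hedged sketch of that groupby alternative (rows invented, already sorted by their two-int key):

```python
from itertools import groupby

# Rows as (int, int, float) tuples, pre-sorted by the integer key --
# groupby only merges *adjacent* rows with equal keys.
rows = [
    (3, 1, 0.26),
    (3, 1, 0.55),
    (5, 19, 0.46),
]

groups = {key: [r[2] for r in group]
          for key, group in groupby(rows, key=lambda r: r[:2])}
```

Because each group is consumed before the next key is requested, this never holds more than one group's rows at a time if you stream instead of building the dict.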

def coerce_to_type(value):
    _types = (int, float)
    for _type in _types:
        try:
            return _type(value)
        except ValueError:
            continue
    return value

def parse_row(row):
    return [coerce_to_type(field) for field in row]

import csv
from collections import defaultdict, namedtuple

with open(datafile, newline='') as srcfile:
    data = csv.reader(srcfile, delimiter='\t')

    ## Read the header line, create the namedtuple type from the field names
    headers = next(srcfile).strip().split('\t')
    datarow = namedtuple('datarow', headers)

    ## Wrap with parser and namedtuple
    data = (parse_row(row) for row in data)
    data = (datarow(*row) for row in data)

    ## Group by the leading integer columns
    grouped_rows = defaultdict(list)
    for row in data:
        integer_fields = [field for field in row if isinstance(field, int)]
        grouped_rows[tuple(integer_fields)].append(row)

    ## DO SOMETHING INTERESTING WITH THE GROUPS
    import pprint
    pprint.pprint(dict(grouped_rows))

EDIT: You may find the code at https://gist.github.com/985882 useful.
