
Python Script for converting two columns into a 68 x 150 matrix

Developer https://www.devze.com 2023-03-30 14:54 Source: Internet
I am a PhD student with a data wrangling problem. I have two columns of data in a text file that follow this format:

Site  Species
A01   ACRB
A01   TBL
A02   TBL
A03   GRF   
...

I need to count how many of each species type (e.g. ACRB) there are for each Site (e.g. A01) and produce a matrix with about 60 sites and 150 species that looks like this:

Site  ACRB  TBL  GRF
A01   1      1    0
A02   0      1    0
A03   0      0    1

I would be most appreciative for any advice on how to best handle this task as I am very new to Python.

Thank you kindly, -Elizabeth


Here is a way to do it with Python 2.7:

from collections import Counter
with open("in.txt") as f:
    next(f)  # do this to skip the first row of the file
    c = Counter(tuple(row.split()) for row in f if not row.isspace())

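# The Counter keys are (site, species) pairs; collect each axis in sorted order.
sites = sorted(set(x[0] for x in c))
species = sorted(set(x[1] for x in c))

# Join every row with tabs only, so header and data columns line up.
print '\t'.join(['Site'] + species)
for site in sites:
    print '\t'.join([site] + [str(c[site, spec]) for spec in species])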
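
For readers on Python 3, the same Counter idea might be sketched as below. The function name `site_species_table` and the inline sample are illustrative assumptions, not part of the answer above; with a real file you would pass the open file object instead of `sample`.

```python
from collections import Counter

def site_species_table(lines):
    """Return (species, table), where table is a list of (site, counts) rows."""
    rows = (line.split() for line in lines if line.strip())
    next(rows)  # skip the "Site  Species" header row
    counts = Counter((site, spec) for site, spec in rows)
    sites = sorted({site for site, _ in counts})
    species = sorted({spec for _, spec in counts})
    # A Counter returns 0 for pairs that never occurred.
    table = [(site, [counts[site, spec] for spec in species]) for site in sites]
    return species, table

# Hypothetical sample mirroring the format in the question.
sample = """Site  Species
A01   ACRB
A01   TBL
A02   TBL
A03   GRF
""".splitlines()

species, table = site_species_table(sample)
print("\t".join(["Site"] + species))
for site, row in table:
    print("\t".join([site] + [str(n) for n in row]))
```

Because the species are sorted, the columns come out alphabetically (ACRB, GRF, TBL) rather than in the order shown in the question.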


from StringIO import StringIO

input = """Site  Species
A01   ACRB
A01   TBL
A02   TBL
A03   GRF 
"""

counts = {}
sites = set()
species = set()

# Count pairs (site, specie)    
for line in StringIO(input).readlines()[1:]:
     site, specie = line.strip().split()
     sites.add(site)
     species.add(specie)
     count = counts.get((site, specie), 0)
     counts[(site, specie)] = count + 1

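# Print first row (sorted, so the column order is deterministic).
print "Site\t",
for specie in sorted(species):
    print specie, "\t",
print

# Print other rows, tab-separated so they line up with the header.
for site in sorted(sites):
    print site, "\t",
    for specie in sorted(species):
        print counts.get((site, specie), 0), "\t",
    print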
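
With roughly 60 sites and 150 species, the matrix is probably easier to inspect in a file than on the console. The following Python 3 sketch uses the same dictionary counting and writes tab-separated rows with the standard `csv` module. The name `write_matrix` and the in-memory buffer are assumptions made for illustration; a real script would pass an open output file.

```python
import csv
import io
from collections import defaultdict

def write_matrix(pairs, out):
    """Count (site, species) pairs and write them as a tab-separated matrix."""
    counts = defaultdict(int)
    for site, spec in pairs:
        counts[site, spec] += 1
    sites = sorted({s for s, _ in counts})
    species = sorted({sp for _, sp in counts})
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Site"] + species)
    for site in sites:
        # .get() avoids adding zero entries to the defaultdict while reading.
        writer.writerow([site] + [counts.get((site, sp), 0) for sp in species])

pairs = [("A01", "ACRB"), ("A01", "TBL"), ("A02", "TBL"), ("A03", "GRF")]
buf = io.StringIO()  # stands in for a file opened with open(..., "w", newline="")
write_matrix(pairs, buf)
print(buf.getvalue(), end="")
```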


Let's see...

import itertools

l = [('A01', 'ACRB'), ('A01', 'TBL'), ('A02', 'TBL'), ('A03', 'GRF')]

def mygrouping(l):
    speclist = list(set(i[1] for i in l))
    yield tuple(speclist)
    l.sort()
    gr = itertools.groupby(l, lambda i:i[0]) # i[0] is the site; group by that...
    for site, items in gr:
        counts = [0] * len(speclist)
        for _site, species in items:
            counts[speclist.index(species)] += 1
        yield site, tuple(counts)

print list(mygrouping(l))

Another solution with namedtuples would be

import itertools
import collections

l = [('A01', 'ACRB'), ('A01', 'TBL'), ('A02', 'TBL'), ('A03', 'GRF')]

def mygrouping(l):
    speclist = list(set(i[1] for i in l))
    TupClass = collections.namedtuple('grouping', speclist)
    l.sort()
    gr = itertools.groupby(l, lambda i:i[0]) # i[0] is the site; group by that...
    for site, items in gr:
        counts = [0] * len(speclist)
        for _site, species in items:
            counts[speclist.index(species)] += 1
        yield site, TupClass(*counts)

print list(mygrouping(l))

I'll leave the display to you.
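
One possible way to handle the display, sketched in Python 3. The helper `format_table` is hypothetical, and `grouped` is a hand-written list in the shape the first `mygrouping` yields (a header tuple followed by `(site, counts)` pairs). For the namedtuple version, which yields no header row, the header could be taken from the `_fields` attribute of any yielded tuple.

```python
def format_table(header, rows):
    """Format a species header and (site, counts) rows as tab-separated text."""
    lines = ["\t".join(["Site"] + list(header))]
    for site, counts in rows:
        lines.append("\t".join([site] + [str(n) for n in counts]))
    return "\n".join(lines)

# Hand-written stand-in for list(mygrouping(l)); the species order is whatever
# order the header tuple carries.
grouped = [
    ("ACRB", "TBL", "GRF"),
    ("A01", (1, 1, 0)),
    ("A02", (0, 1, 0)),
    ("A03", (0, 0, 1)),
]
header, rows = grouped[0], grouped[1:]
text = format_table(header, rows)
print(text)
```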


This is a histogram2d problem, but the data are strings, so convert the strings to integer codes first:

x = ["A01","A01","A02","A03","A02","A04"]
y = ["ACRB","TBL","TBL","GRF","TBL","TBL"]

import numpy as np

def convert(data):
    tmp = sorted(set(data))
    d = dict(zip(tmp,range(len(tmp))))
    return tmp, np.array([d[x] for x in data])

xindex, xn = convert(x)
yindex, yn = convert(y)

print xindex, xn
print yindex, yn

The output is:

['A01', 'A02', 'A03', 'A04'] [0 0 1 2 1 3]
['ACRB', 'GRF', 'TBL'] [0 2 2 1 2 2]

xn and yn are the converted integer arrays, and xindex and yindex can be used to map the integers back to the original strings.

Then you can use numpy.histogram2d to count the occurrence quickly:

m = np.histogram2d(xn, yn, bins=(len(xindex), len(yindex)))[0]
print m

The output is:

[[ 1.  0.  1.]
 [ 0.  0.  2.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
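
To get integer counts with the site and species labels attached, the steps above could be wrapped roughly as follows (Python 3). This is a sketch under two assumptions that are not in the answer: `np.unique(..., return_inverse=True)` is used in place of the hand-written `convert`, and explicit bin edges at half-integers replace the bin counts so that each integer code always falls in its own bin. The name `count_matrix` is illustrative.

```python
import numpy as np

def count_matrix(x, y):
    """Return (row labels, column labels, integer count matrix) for paired labels."""
    xlabels, xn = np.unique(x, return_inverse=True)  # sorted labels + integer codes
    ylabels, yn = np.unique(y, return_inverse=True)
    # Edges at -0.5, 0.5, 1.5, ... give one bin per integer code.
    xedges = np.arange(len(xlabels) + 1) - 0.5
    yedges = np.arange(len(ylabels) + 1) - 0.5
    m = np.histogram2d(xn, yn, bins=(xedges, yedges))[0]
    return xlabels.tolist(), ylabels.tolist(), m.astype(int)

x = ["A01", "A01", "A02", "A03", "A02", "A04"]
y = ["ACRB", "TBL", "TBL", "GRF", "TBL", "TBL"]

sites, species, m = count_matrix(x, y)
print("\t".join(["Site"] + species))
for site, row in zip(sites, m):
    print("\t".join([site] + [str(n) for n in row]))
```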