I am a PhD student with a data wrangling problem. I have two columns of data in a text file that follow this format:
Site Species
A01 ACRB
A01 TBL
A02 TBL
A03 GRF
...
I need to count how many of each species type (e.g. ACRB) there are for each site (e.g. A01) and produce a matrix with about 60 sites and 150 species that looks like this:
Site ACRB TBL GRF
A01 1 1 0
A02 0 1 0
A03 0 0 1
I would be most appreciative for any advice on how to best handle this task as I am very new to Python.
Thank you kindly, -Elizabeth
Here is a way to do it with Python 2.7:
from collections import Counter

with open("in.txt") as f:
    next(f)  # skip the header row of the file
    c = Counter(tuple(row.split()) for row in f if not row.isspace())

sites = sorted(set(x[0] for x in c))
species = sorted(set(x[1] for x in c))

print 'Site\t', '\t'.join(species)
for site in sites:
    print site, '\t', '\t'.join(str(c[site, spec]) for spec in species)
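For anyone on Python 3, the same Counter idea might look roughly like the sketch below. The function name `site_species_table` is my own invention, and the sample text is just the four rows from the question; treat it as one possible arrangement rather than the definitive version.

```python
from collections import Counter

def site_species_table(lines):
    """Return (species, rows), where rows is a list of (site, counts)."""
    # Drop blank lines, then skip the "Site Species" header row.
    pairs = [tuple(line.split()) for line in lines if line.strip()][1:]
    counts = Counter(pairs)
    sites = sorted({site for site, _ in counts})
    species = sorted({sp for _, sp in counts})
    # A Counter returns 0 for (site, species) pairs that never occurred.
    rows = [(site, [counts[site, sp] for sp in species]) for site in sites]
    return species, rows

sample = """Site Species
A01 ACRB
A01 TBL
A02 TBL
A03 GRF
""".splitlines()

species, rows = site_species_table(sample)
print("\t".join(["Site"] + species))
for site, row in rows:
    print("\t".join([site] + [str(n) for n in row]))
```

Because the species are sorted, the columns come out alphabetically (ACRB, GRF, TBL) rather than in the order shown in the question. To read from a file instead, passing the open file object in place of `sample` should work, since it is also an iterable of lines.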
from StringIO import StringIO

text = """Site Species
A01 ACRB
A01 TBL
A02 TBL
A03 GRF
"""

counts = {}
sites = set()
species = set()

# Count (site, specie) pairs, skipping the header row.
for line in StringIO(text).readlines()[1:]:
    site, specie = line.strip().split()
    sites.add(site)
    species.add(specie)
    count = counts.get((site, specie), 0)
    counts[(site, specie)] = count + 1

# Print the header row (sorted so the column order is stable).
print "Site\t",
for specie in sorted(species):
    print specie, "\t",
print

# Print one row per site, with the same column order as the header.
for site in sorted(sites):
    print site, "\t",
    for specie in sorted(species):
        print counts.get((site, specie), 0), "\t",
    print
Let's see...
import itertools

l = [('A01', 'ACRB'), ('A01', 'TBL'), ('A02', 'TBL'), ('A03', 'GRF')]

def mygrouping(l):
    speclist = list(set(i[1] for i in l))
    yield tuple(speclist)
    l.sort()
    gr = itertools.groupby(l, lambda i: i[0])  # i[0] is the site; group by that...
    for site, items in gr:
        counts = [0] * len(speclist)
        for _site, species in items:
            counts[speclist.index(species)] += 1
        yield site, tuple(counts)

print list(mygrouping(l))
Another solution, using namedtuples, would be:
import itertools
import collections

l = [('A01', 'ACRB'), ('A01', 'TBL'), ('A02', 'TBL'), ('A03', 'GRF')]

def mygrouping(l):
    speclist = list(set(i[1] for i in l))
    TupClass = collections.namedtuple('grouping', speclist)
    l.sort()
    gr = itertools.groupby(l, lambda i: i[0])  # i[0] is the site; group by that...
    for site, items in gr:
        counts = [0] * len(speclist)
        for _site, species in items:
            counts[speclist.index(species)] += 1
        yield site, TupClass(*counts)

print list(mygrouping(l))
I will leave the display to you.
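If it helps, the display step could be sketched along these lines. This is written for Python 3 and follows the first groupby variant (the one that yields the species tuple before the per-site rows). The names `grouping` and `format_table` are made up for this example, and I sort the species so the column order is predictable, which the code above does not do.

```python
import itertools

def grouping(pairs):
    """Yield a header tuple of species, then (site, counts) for each site."""
    species = sorted({sp for _, sp in pairs})
    yield tuple(species)
    # groupby only merges adjacent items, so the pairs are sorted first.
    for site, items in itertools.groupby(sorted(pairs), key=lambda p: p[0]):
        counts = [0] * len(species)
        for _, sp in items:
            counts[species.index(sp)] += 1
        yield site, tuple(counts)

def format_table(grouped):
    """Turn the output of grouping() into tab-separated lines."""
    grouped = iter(grouped)
    header = next(grouped)  # the first item is the species tuple
    lines = ["\t".join(("Site",) + header)]
    for site, counts in grouped:
        lines.append("\t".join([site] + [str(n) for n in counts]))
    return lines

pairs = [('A01', 'ACRB'), ('A01', 'TBL'), ('A02', 'TBL'), ('A03', 'GRF')]
table = format_table(grouping(pairs))
print("\n".join(table))
```

With about 150 species, `species.index(sp)` is a linear search on every row of input; a dict mapping species to column position would likely be the better choice if that ever becomes slow.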
This is a histogram2d problem, but the data are strings. You can convert the strings to integers first:
import numpy as np

x = ["A01","A01","A02","A03","A02","A04"]
y = ["ACRB","TBL","TBL","GRF","TBL","TBL"]

def convert(data):
    tmp = sorted(set(data))
    d = dict(zip(tmp, range(len(tmp))))
    return tmp, np.array([d[x] for x in data])

xindex, xn = convert(x)
yindex, yn = convert(y)
print xindex, xn
print yindex, yn
The output is:
['A01', 'A02', 'A03', 'A04'] [0 0 1 2 1 3]
['ACRB', 'GRF', 'TBL'] [0 2 2 1 2 2]
xn and yn are the converted arrays, and xindex and yindex can be used to convert the integers back to strings.
Then you can use numpy.histogram2d to count the occurrences quickly:
m = np.histogram2d(xn, yn, bins=(len(xindex), len(yindex)))[0]
print m
The output is:
[[ 1.  0.  1.]
 [ 0.  0.  2.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
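To get from that matrix to the labelled table in the question, something like the following Python 3 sketch might do. It swaps the hand-written convert() for numpy.unique with return_inverse=True, which returns the sorted labels together with each item's index into them; the variable names `sites`, `species` and `lines` are my own.

```python
import numpy as np

x = ["A01", "A01", "A02", "A03", "A02", "A04"]
y = ["ACRB", "TBL", "TBL", "GRF", "TBL", "TBL"]

# np.unique returns the sorted labels and, with return_inverse=True,
# the integer index of each input item in that sorted array.
sites, xn = np.unique(x, return_inverse=True)
species, yn = np.unique(y, return_inverse=True)

# Count (site, species) pairs, then cast the float counts to integers.
m = np.histogram2d(xn, yn, bins=(len(sites), len(species)))[0].astype(int)

# Build tab-separated lines: a header row, then one row per site.
lines = ["\t".join(["Site"] + list(species))]
for site, row in zip(sites, m):
    lines.append("\t".join([site] + [str(n) for n in row]))
print("\n".join(lines))
```

Row i of `m` belongs to `sites[i]` and column j to `species[j]`, so the labels and counts stay aligned without any extra lookup.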