
Summing Arrays by Characteristics in Python

Developer https://www.devze.com 2023-04-04 15:42 Source: Internet

I'm wondering what is the most efficient way to sum elements of an array by given characteristics. For example, I have 1000 draws of data, and what I'm looking for is the sum of each draw (column) across sexes for a given year-disease (i.e., the draws are by sex, year, and disease, and I want the sum of both sexes for each year and disease).

import numpy as np
year = np.repeat((1980, 1990 , 2000, 2010), 10)
sex = np.array(['male', 'female']*20)
disease = np.repeat(('d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'), 5)
draws = np.random.normal(0, 1, size=(sex.shape[0], 1000))

Any thoughts on how to get an array that will be shape (20, 1000) that has the sum of the draw across both sexes for a given year-disease? I will also need to be able to do this in situations where the data isn't perfectly square (there are disease-years which only have 1 sex).


import numpy as np
import itertools   
import csv

year = np.repeat((1980, 1990 , 2000, 2010), 10)
sex = np.array(['male', 'female']*20)
disease = np.repeat(('d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'), 5)
draws = np.random.normal(0, 1, size=(sex.shape[0], 1000))

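years = np.unique(year)
diseases = np.unique(disease)

# Map each (year, disease) pair to the column-wise sum of its rows.
# Pairs that never occur in the data get a vector of zeros.
draw_sums = {(y, d): draws[(year == y) & (disease == d)].sum(axis=0)
             for y, d in itertools.product(years, diseases)}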

This results in a dict that maps each (year, disease) pair to the corresponding sum of the draws. To write draw_sums to a CSV file, you could do something like this:

with open('/tmp/test.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # Header: the two key columns, then draw1 ... draw1000.
    writer.writerow(['year', 'disease'] + ['draw{i}'.format(i=i) for i in range(1, 1001)])
    for year_disease, sums in sorted(draw_sums.items()):
        writer.writerow(list(year_disease) + sums.tolist())
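
If a single array is wanted rather than a dict, one possible sketch is to encode each (year, disease) pair as an integer group id and scatter-add the rows with np.add.at. This is an alternative to the loop over itertools.product above, not part of the original answer; the names combined, row_to_group, and group_sums are made up for illustration, and the seeded generator is only there to make the run reproducible. Only pairs that actually occur in the data get a row, so with the sample data the result has 8 rows (the sample year and disease arrays only produce 8 distinct pairs), and pairs with a single sex are simply summed over the rows that exist.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

year = np.repeat((1980, 1990, 2000, 2010), 10)
sex = np.array(['male', 'female'] * 20)
disease = np.repeat(('d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'), 5)
draws = rng.normal(0, 1, size=(sex.shape[0], 1000))

# Encode each key column as integer codes, then combine them into one id per row.
year_vals, year_codes = np.unique(year, return_inverse=True)
disease_vals, disease_codes = np.unique(disease, return_inverse=True)
combined = year_codes * len(disease_vals) + disease_codes

# Keep only the (year, disease) pairs that actually occur in the data.
group_ids, row_to_group = np.unique(combined, return_inverse=True)

# Scatter-add every row of draws into the output row of its group.
group_sums = np.zeros((len(group_ids), draws.shape[1]))
np.add.at(group_sums, row_to_group, draws)

# Recover the year and disease label for each output row.
group_years = year_vals[group_ids // len(disease_vals)]
group_diseases = disease_vals[group_ids % len(disease_vals)]

print(group_sums.shape)
```

Row i of group_sums then belongs to the pair (group_years[i], group_diseases[i]).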


This is a typical group-by problem, which can be solved efficiently in a fully vectorized manner using the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi

keys, values = npi.group_by((year, disease)).sum(draws)
for key, value in zip(zip(*keys), values):
    print(key, value.shape)
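
For readers who would rather not add a dependency, a sort-based group-by can be sketched in plain NumPy with np.lexsort and np.add.reduceat. This is only an illustration of the general idea and makes no claim about how numpy_indexed works internally. To exercise the non-square case from the question, the sketch assumes (hypothetically) that the female rows for 1980/d1 are missing; the names keep, is_start, and starts are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded only so the sketch is reproducible

year = np.repeat((1980, 1990, 2000, 2010), 10)
sex = np.array(['male', 'female'] * 20)
disease = np.repeat(('d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'), 5)
draws = rng.normal(0, 1, size=(sex.shape[0], 1000))

# Simulate non-square data: drop every female row for (1980, 'd1').
keep = ~((year == 1980) & (disease == 'd1') & (sex == 'female'))
year, disease, draws = year[keep], disease[keep], draws[keep]

# Sort rows so that equal (year, disease) pairs are contiguous.
# np.lexsort uses the LAST key in the tuple as the primary sort key.
order = np.lexsort((disease, year))
year_s, disease_s, draws_s = year[order], disease[order], draws[order]

# A new group starts wherever either key differs from the previous row.
is_start = np.ones(len(year_s), dtype=bool)
is_start[1:] = (year_s[1:] != year_s[:-1]) | (disease_s[1:] != disease_s[:-1])
starts = np.flatnonzero(is_start)

# Sum each contiguous block of rows in a single call.
group_sums = np.add.reduceat(draws_s, starts, axis=0)
group_keys = list(zip(year_s[starts], disease_s[starts]))

for key, total in zip(group_keys, group_sums):
    print(key, total.shape)
```

The group with the missing sex still gets one output row; it is just the sum over the rows that remain.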