Calculate within categories: Equivalent of R's ddply in Python?_问答_开发者

Calculate within categories: Equivalent of R's ddply in Python?

开发者 https://www.devze.com 2023-04-12 18:15 出处：网络

I have some R code I need to port to python. Howe开发者_如何学编程ver, R\'s magic data.frame and ddply are keeping me from finding a good way to do this in python.

相关专题：r

I have some R code I need to port to python. Howe开发者_如何学编程ver, R's magic data.frame and ddply are keeping me from finding a good way to do this in python.

Sample data (R):

x <- data.frame(d=c(1,1,1,2,2,2),c=c(rep(c('a','b','c'),2)),v=1:6)

Sample computation:

y <- ddply(x, 'd', transform, v2=(v-min(v))/(max(v)-min(v)))

Sample output:

  d c v  v2
1 1 a 1 0.0
2 1 b 2 0.5
3 1 c 3 1.0
4 2 a 4 0.0
5 2 b 5 0.5
6 2 c 6 1.0

So here's my question for the pythonistas out there: how would you do the same? You have a data structure with a couple of important dimensions.

For each (c), and each(d) compute (v-min(v))/(max(v)-min(v))) and associate it with the corresponding (d,c) pair.

Feel free to use whatever data structures you want, so long as they're quick on reasonably large datasets (those that fit in memory).

Indeed pandas is the right (and only, I believe) tool for this in Python. It's a bit less magical than plyr but here's how to do this using the groupby functionality:

df = DataFrame({'d' : [1.,1.,1.,2.,2.,2.],
                'c' : np.tile(['a','b','c'], 2),
                'v' : np.arange(1., 7.)})
# in IPython
In [34]: df
Out[34]: 
   c  d  v
0  a  1  1
1  b  1  2
2  c  1  3
3  a  2  4
4  b  2  5
5  c  2  6

Now write a small transform function:

def f(group):
    v = group['v']
    group['v2'] = (v - v.min()) / (v.max() - v.min())
    return group

Note that this also handles NAs since the v variable is a pandas Series object.

Now group by the d column and apply f:

In [36]: df.groupby('d').apply(f)
Out[36]: 
   c  d  v  v2 
0  a  1  1  0  
1  b  1  2  0.5
2  c  1  3  1  
3  a  2  4  0  
4  b  2  5  0.5
5  c  2  6  1

Sounds like you want pandas and group by or aggregate.

You can also achieve a more performance if you use numpy and scipy.

Despite some ugly code it will be faster, pandas way will be slow if number of groups is very large and may even be worse than R. This will always be faster than R:

import numpy as np
import numpy.lib.recfunctions
from scipy import ndimage

x = np.rec.fromarrays(([1,1,1,2,2,2],['a','b','c']*2,range(1, 7)), names='d,c,v')

unique, groups = np.unique(x['d'], False, True)
uniques = range(unique.size)
mins = ndimage.minimum(x['v'], groups, uniques)[groups]
maxs = ndimage.maximum(x['v'], groups, uniques)[groups]

x2 = np.lib.recfunctions.append_fields(x, 'v2', (x['v'] - mins)/(maxs - mins + 0.0))

#save as csv
np.savetxt('file.csv', x2, delimiter=';')