Most efficient way to sum huge 2D NumPy array, grouped by ID column?_问答_开发者

Most efficient way to sum huge 2D NumPy array, grouped by ID column?

开发者 https://www.devze.com 2023-03-28 11:09 出处：网络

I have a massive data array (500k rows) that looks like: idvaluescore 12020 11030 1150 2124 238 2569 3618 ... As you can see, there is a non-unique ID column to the left, and various scores in the

相关专题：numpy python

I have a massive data array (500k rows) that looks like:

id  value  score
1   20     20
1   10     30
1   15     0
2   12     4
2   3      8
2   56     9
3   6      18
...

As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column. 开发者_运维问答

I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id

With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.

table_trunc = table[(table == id).any(1)]
score       = sum(table_trunc[:,2])

Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?

you can use bincount():

import numpy as np

ids = [1,1,1,2,2,2,3]
data = [20,30,0,4,8,9,18]

print np.bincount(ids, weights=data)

the output is [ 0. 50. 21. 18.], which means the sum of id==0 is 0, the sum of id==1 is 50.

You can use a for loop and numba

from numba import njit

@njit
def wbcnt(b, w, k):
    bins = np.arange(k)
    bins = bins * 0
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins

Using @HYRY's variables

ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]

Then:

wbcnt(ids, data, 4)

array([ 0, 50, 21, 18])

Timing

%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)

1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop

I noticed the numpy tag but in case you don't mind using pandas (or if you read in these data using this module), this task becomes an one-liner:

import pandas as pd

df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})

So your dataframe would look like this:

  id  score
0   1     20
1   1     30
2   1      0
3   2      4
4   2      8
5   2      9
6   3     18

Now you can use the functions groupby() and sum():

df.groupby(['id'], sort=False).sum()

which gives you the desired output:

By default, the dataframe would be sorted, therefore I use the flag sort=False which might improve speed for huge dataframes.

You can try using boolean operations:

ids = [1,1,1,2,2,2,3]
data = [20,30,0,4,8,9,18]

[((ids == i)*data).sum() for i in np.unique(ids)]

This may be a bit more effective than using np.any, but will clearly have trouble if you have a very large number of unique ids to go along with large overall size of the data table.

If you're looking only for sum you probably want to go with bincount. If you also need other grouping operations like product, mean, std etc. have a look at https://github.com/ml31415/numpy-groupies . It's the fastest python/numpy grouping operations around, see the speed comparison there.

Your sum operation there would look like:

res = aggregate(id, score)

The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:

import numpy_indexed as npi
npi.group_by(id).sum(score)

Maybe using itertools.groupby, you can group on the ID and then iterate over the grouped data.

(The data must be sorted according to the group by func, in this case ID)

>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
        for y in i:
            if isinstance(y, int):
                print(y)
            else:
                for p in y:
                    print('-', p)

Output:

1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)

Most efficient way to sum huge 2D NumPy array, grouped by ID column?

精彩评论

关注公众号

热门标签

图文推荐

Most efficient way to sum huge 2D NumPy array, grouped by ID column?

更多 问答 相关资讯：

精彩评论

关注公众号

热门标签

图文推荐

更多问答相关资讯：