Removing rows in NumPy efficiently

Developer https://www.devze.com 2023-04-01 10:06 Source: Internet
I have a large numpy array with a lot of ID values (call it X):

X:
id   rating
1    88
2    99
3    77
4    66
...

etc. I also have another numpy array of "bad IDs" -- which signify rows I'd like to remove from X.

B: [2, 3]

So when I'm done, I'd like:

X:
id   rating
1    88
4    66

What is the cleanest way to do this, without iterating?


This is the fastest way I could come up with:

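import numpy

# 500,000 rows of (id, value); the ids in column 0 are 0, 2, 4, ...
x = numpy.arange(1000000, dtype=numpy.int32).reshape((-1,2))
# 500 ids to remove
bad = numpy.arange(0, 1000000, 2000, dtype=numpy.int32)

print(x.shape)
print(bad.shape)

# numpy.where returns a tuple; [0] picks out the array of row indices
# (newer NumPy releases recommend numpy.isin in place of numpy.in1d)
cleared = numpy.delete(x, numpy.where(numpy.in1d(x[:,0], bad))[0], 0)
print(cleared.shape)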

This prints:

(500000, 2)
(500,)
(499500, 2)

and runs much faster than a ufunc. It will use some extra memory, but whether that's okay for you depends on how big your array is.

Explanation:

  • numpy.in1d returns a boolean array the same length as x[:,0], containing True where the element is in the bad array and False otherwise.
  • numpy.where turns that True/False array into an array of integers containing the index values where the array was True.
  • The index locations are then passed to numpy.delete, telling it to delete along the first axis (0).
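
The three steps above can be traced on the small table from the question. This is only an illustrative sketch: it uses numpy.isin, which the NumPy documentation recommends over numpy.in1d for new code, and the names mask, rows and cleared are made up for the example.

```python
import numpy

# The table from the question: column 0 is the id, column 1 the rating.
x = numpy.array([[1, 88], [2, 99], [3, 77], [4, 66]])
bad = numpy.array([2, 3])

# Step 1: boolean mask, True where the id appears in bad.
mask = numpy.isin(x[:, 0], bad)
print(mask)      # [False  True  True False]

# Step 2: integer positions of the True entries.
rows = numpy.where(mask)[0]
print(rows)      # [1 2]

# Step 3: delete those rows (axis 0).
cleared = numpy.delete(x, rows, 0)
print(cleared)   # [[ 1 88]
                 #  [ 4 66]]
```

The intermediate index array in step 2 is where the extra memory goes; indexing with the inverted mask directly avoids it.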


Reproduce the problem spec from the OP:

import numpy as NP

X = NP.array('1 88 2 99 3 77 4 66'.split(), dtype=int).reshape(4, 2)
bad_ids = [3,2]
bad_ids = set(bad_ids)    # see jterrance comment below this Answer

Vectorize Python's built-in membership test, i.e., the X in Y syntax:

@NP.vectorize
def filter_bad_ids(id_):
    # True for ids that should be kept
    return id_ not in bad_ids


>>> X_clean = X[filter_bad_ids(X[:,0])]
>>> X_clean                                # result
   array([[ 1, 88],
          [ 4, 66]])
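
One rough way to compare this approach with the array-based answers is to time both on the same data. The sketch below is an illustration under assumptions, not a benchmark: the array layout is borrowed from the other answer at a smaller size, the helper names (not_bad, with_vectorize, with_isin) are made up, and numpy.isin stands in for numpy.in1d. Timings depend on the machine and NumPy version, so none are quoted here.

```python
import timeit
import numpy

x = numpy.arange(200000, dtype=numpy.int32).reshape((-1, 2))
bad = numpy.arange(0, 200000, 2000, dtype=numpy.int32)
bad_set = set(bad.tolist())   # plain Python ints for fast set lookups

# Element-by-element membership test, wrapped with numpy.vectorize.
not_bad = numpy.vectorize(lambda i: i not in bad_set, otypes=[bool])

def with_vectorize():
    return x[not_bad(x[:, 0])]

def with_isin():
    return x[numpy.isin(x[:, 0], bad, invert=True)]

# Both approaches should keep exactly the same rows.
assert numpy.array_equal(with_vectorize(), with_isin())

print("vectorize:", timeit.timeit(with_vectorize, number=5))
print("isin:     ", timeit.timeit(with_isin, number=5))
```

numpy.vectorize still calls the Python function once per element, so it is mainly a convenience wrapper rather than a speed-up.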


If you want to completely delete the information for bad IDs, try this:

x = x[numpy.in1d(x[:,0], bad, invert=True)]

This solution uses fairly little memory and should be very fast. (bad is converted to a numpy array, so it must be a sequence such as a list or an array rather than a set for this to work; see the note in http://docs.scipy.org/doc/numpy/reference/generated/numpy.in1d.html)
If bad is very small, it might be a little faster to do this instead:

from functools import reduce
x = x[~reduce(numpy.logical_or, (x[:,0] == b for b in bad))]

Note: The first line is required only in Python 3.
This also uses little memory because of the use of a generator.
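
The caveat about sets can be seen by checking what NumPy makes of a set, and the two expressions above can be checked against each other on the question's data. A minimal sketch, assuming a NumPy version that provides numpy.isin (the recommended replacement for numpy.in1d); the names bad_set, wrapped, keep_isin and keep_reduce are invented for the illustration.

```python
from functools import reduce
import numpy

x = numpy.array([[1, 88], [2, 99], [3, 77], [4, 66]])
bad_set = {2, 3}

# A set is not a sequence, so NumPy wraps it whole in a 0-d object array
# instead of producing an array of its elements.
wrapped = numpy.asarray(bad_set)
print(wrapped.dtype, wrapped.shape)   # object ()

# Converting to a list (or array) first gives the intended values.
bad = numpy.array(sorted(bad_set))

keep_isin = numpy.isin(x[:, 0], bad, invert=True)
keep_reduce = ~reduce(numpy.logical_or, (x[:, 0] == b for b in bad))

print(numpy.array_equal(keep_isin, keep_reduce))   # True
print(x[keep_isin])                                # [[ 1 88]
                                                   #  [ 4 66]]
```

The reduce version makes one pass over the id column per bad value, which is why it is only attractive when bad is very small.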
