I have a large numpy array with a lot of ID values (call it X):
X:
id rating
1 88
2 99
3 77
4 66
...
etc. I also have another numpy array of "bad IDs", which signify rows I'd like to remove from X.
B: [2, 3]
So when I'm done, I'd like:
X:
id rating
1 88
4 66
What is the cleanest way to do this, without iterating?
This is the fastest way I could come up with:
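import numpy
# 500,000 rows of (id, rating); the ids in column 0 are 0, 2, 4, ...
x = numpy.arange(1000000, dtype=numpy.int32).reshape((-1,2))
# 500 bad ids: 0, 2000, 4000, ...
bad = numpy.arange(0, 1000000, 2000, dtype=numpy.int32)
print(x.shape)
print(bad.shape)
# find the row indices whose id is in bad, then delete those rows (axis 0)
cleared = numpy.delete(x, numpy.where(numpy.in1d(x[:,0], bad))[0], 0)
print(cleared.shape)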
This prints:
(500000, 2)
(500,)
(499500, 2)
and runs much faster than a ufunc. It will use some extra memory, but whether that's okay for you depends on how big your array is.
Explanation:
- numpy.in1d returns a boolean array with one entry per row of x, containing True if that row's id is in the bad array, and False otherwise.
- numpy.where turns that True/False array into an array of integers containing the index values where the array was True.
- Those index locations are then passed to numpy.delete, telling it to delete along the first axis (0).
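To make the three steps concrete, here is a minimal sketch on the four-row example from the question. It uses numpy.isin, which for 1-D input behaves like numpy.in1d; that substitution is an assumption about your NumPy version (1.13 or later), not what the answer above ran. The names mask and rows are only for illustration.

```python
import numpy

x = numpy.array([[1, 88], [2, 99], [3, 77], [4, 66]])
bad = numpy.array([2, 3])

# Step 1: boolean mask, True where the id in column 0 is a bad id
mask = numpy.isin(x[:, 0], bad)        # [False, True, True, False]

# Step 2: integer row indices where the mask is True
rows = numpy.where(mask)[0]            # [1, 2]

# Step 3: delete those rows along axis 0
cleared = numpy.delete(x, rows, 0)     # [[1, 88], [4, 66]]
print(cleared)
```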
Reproduce the problem spec from the OP:
import numpy as NP
X = NP.array('1 88 2 99 3 77 4 66'.split(), dtype=int).reshape(4, 2)
bad_ids = [3,2]
bad_ids = set(bad_ids) # see jterrance comment below this Answer
Vectorize Python's built-in membership test, i.e., the X in Y syntax:
@NP.vectorize
def filter_bad_ids(id):
    return id not in bad_ids
>>> X_clean = X[filter_bad_ids(X[:,0])]
>>> X_clean # result
array([[ 1, 88],
[ 4, 66]])
If you want to completely delete the information for bad IDs, try this:
x = x[numpy.in1d(x[:,0], bad, invert=True)]
This solution uses fairly little memory and should be very fast. (bad is converted to a numpy array, so it should be a list or array rather than a set for this to work; see the note in http://docs.scipy.org/doc/numpy/reference/generated/numpy.in1d.html)
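As a small illustration of the point about sets, the sketch below converts a set of bad IDs to an array before building the mask. It uses numpy.isin, which accepts the same invert argument as numpy.in1d; treating the two as interchangeable here is an assumption about your NumPy version. The names bad_set and keep are hypothetical.

```python
import numpy

x = numpy.array([[1, 88], [2, 99], [3, 77], [4, 66]])
bad_set = {2, 3}

# Convert the set explicitly; passing a set directly is not handled element-wise
bad = numpy.array(list(bad_set))

# invert=True gives True for the rows to keep
keep = numpy.isin(x[:, 0], bad, invert=True)
x_clean = x[keep]
print(x_clean)
```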
If bad is very small, it might be a little faster to do this instead:
from functools import reduce
x = x[~reduce(numpy.logical_or, (x[:,0] == b for b in bad))]
Note: The first line is required only in Python 3.
This also uses little memory because of the use of a generator.
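One caveat with the reduce version: if bad is empty, reduce has nothing to combine and raises a TypeError. A possible variant, sketched here with a hypothetical helper name drop_rows_by_id, passes an all-False starting mask so that an empty bad simply removes nothing.

```python
from functools import reduce
import numpy

def drop_rows_by_id(x, bad):
    """Return the rows of x whose first column matches none of the ids in bad."""
    # Start from an all-False mask so an empty `bad` removes nothing
    nothing_bad = numpy.zeros(len(x), dtype=bool)
    # The generator yields one boolean mask per bad id; reduce ORs them together
    is_bad = reduce(numpy.logical_or, (x[:, 0] == b for b in bad), nothing_bad)
    return x[~is_bad]

x = numpy.array([[1, 88], [2, 99], [3, 77], [4, 66]])
kept = drop_rows_by_id(x, [2, 3])
unchanged = drop_rows_by_id(x, [])
print(kept)
```

Each comparison still scans the whole id column, so this only makes sense when bad holds a handful of values.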