I have a very large NumPy array:
1 40 3
4 50 4
5 60 7
5 49 6
6 70 8
8 80 9
8 72 1
9 90 7
....
I want to check whether a value exists in the 1st column of the array. I've got a bunch of homegrown ways (e.g. iterating through each row and checking), but given the size of the array I'd like to find the most efficient method.
Thanks!
How about
if value in my_array[:, col_num]:
    do_whatever
Edit: I think __contains__ is implemented in such a way that this is the same as @detly's version.
The most obvious to me would be:
np.any(my_array[:, 0] == value)
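As a minimal sketch of both suggestions, here they are applied to the first few rows from the question. The names `my_array` and `value` follow the answers above; the sample data is only illustrative.

```python
import numpy as np

# First rows of the array from the question (illustrative data only).
my_array = np.array([
    [1, 40, 3],
    [4, 50, 4],
    [5, 60, 7],
    [5, 49, 6],
])

value = 5

# Compare the whole first column at once, then reduce to a single bool.
found_any = bool(np.any(my_array[:, 0] == value))

# The `in` operator on the column slice answers the same question.
found_in = value in my_array[:, 0]

# A value that does not occur in the first column.
missing = bool(np.any(my_array[:, 0] == 7))

print(found_any, found_in, missing)
```

Both forms scan the whole column, so they differ mainly in readability rather than in the amount of work done.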
To check multiple values, you can use numpy.in1d(), which is an element-wise function version of the Python keyword in. If your data is sorted, you can use numpy.searchsorted():
import numpy as np
data = np.array([1, 4, 5, 5, 6, 8, 8, 9])  # must be sorted for searchsorted
values = [2, 3, 4, 6, 7]
print(np.in1d(values, data))
index = np.searchsorted(data, values)
# index equals len(data) for any value larger than data.max(); clip it before indexing in that case
print(data[index] == values)
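The searchsorted approach indexes out of bounds when a queried value is larger than every element of the data. A minimal sketch of a bounds-safe variant follows. The helper name `in_sorted` is hypothetical, the data is assumed to be sorted and non-empty, and `np.isin` (which NumPy's documentation recommends over `in1d` for new code) is used only as a cross-check.

```python
import numpy as np

def in_sorted(data, values):
    """Return a boolean array: which of `values` occur in sorted `data`."""
    data = np.asarray(data)
    values = np.asarray(values)
    # Leftmost insertion point for each value; may equal len(data).
    index = np.searchsorted(data, values)
    # Clip so that values larger than data.max() do not index out of bounds.
    index = np.clip(index, 0, len(data) - 1)
    return data[index] == values

data = np.array([1, 4, 5, 5, 6, 8, 8, 9])
values = [2, 3, 4, 6, 7, 10]  # 10 exceeds data.max()

result = in_sorted(data, values)
reference = np.isin(values, data)

print(result)
print(reference)
```

Each lookup here is a binary search, so this pays off mainly when the data is already sorted and many values are queried.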
Fascinating. I needed to improve the speed of a series of loops that must perform matching index determination in this same way. So I decided to time all the solutions here, along with some riffs.
Here are my speed tests for Python 2.7.10:
import timeit
timeit.timeit('N.any(N.in1d(sids, val))', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')
18.86137104034424
timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = [20010401010101+x for x in range(1000)]')
15.061666011810303
timeit.timeit('N.in1d(sids, val)', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')
11.613027095794678
timeit.timeit('N.any(val == sids)', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')
7.670552015304565
timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')
5.610057830810547
timeit.timeit('val == sids', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')
1.6632978916168213
timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = set([20010401010101+x for x in range(1000)])')
0.0548710823059082
timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = dict(zip([20010401010101+x for x in range(1000)],[True,]*1000))')
0.054754018783569336
Very surprising! Orders of magnitude difference!
To summarize, if you just want to know whether something's in a 1D list or not:
- 19s N.any(N.in1d(numpy array))
- 15s x in (list)
- 8s N.any(x == numpy array)
- 6s x in (numpy array)
- .1s x in (set or a dictionary)
If you want to know where something is in the list as well (order is important):
- 12s N.in1d(x, numpy array)
- 2s x == (numpy array)
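These timings depend heavily on the Python and NumPy versions and on the machine. A small harness along the following lines could be used to repeat the comparison on a current setup. This is only a sketch: the names `COMMON`, `CASES` and `run_benchmarks` are made up for illustration, `np.isin` stands in for `in1d`, and the default `number` is an arbitrary choice that is much smaller than timeit's default of one million runs used above.

```python
import timeit

# Same data as the tests above: 1000 consecutive integer IDs and one value to look up.
COMMON = (
    "import numpy as np; val = 20010401020091; "
    "ids = [20010401010101 + x for x in range(1000)]; "
)

# label -> (extra setup, statement to time)
CASES = {
    "val in list": ("sids = ids", "val in sids"),
    "val in ndarray": ("sids = np.array(ids)", "val in sids"),
    "np.any(val == ndarray)": ("sids = np.array(ids)", "np.any(val == sids)"),
    "np.isin(ndarray, val).any()": ("sids = np.array(ids)", "np.isin(sids, val).any()"),
    "val in set": ("sids = set(ids)", "val in sids"),
}

def run_benchmarks(number=10000):
    """Return {label: total seconds for `number` runs of the statement}."""
    return {
        label: timeit.timeit(stmt, setup=COMMON + extra, number=number)
        for label, (extra, stmt) in CASES.items()
    }

if __name__ == "__main__":
    # Print the fastest case first.
    for label, seconds in sorted(run_benchmarks().items(), key=lambda kv: kv[1]):
        print(f"{seconds:8.4f}s  {label}")
```

Note that the looked-up value is not among the IDs, so the list and array scans are measured in their worst case.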
Adding to @HYRY's answer: in1d seems to be the fastest for NumPy. This is using NumPy 1.8 and Python 2.7.6.
In this test in1d was fastest, although 10 in a looks cleaner:
a = arange(0,99999,3)
%timeit 10 in a
%timeit in1d(a, 10)
10000 loops, best of 3: 150 µs per loop
10000 loops, best of 3: 61.9 µs per loop
Constructing a set is slower than calling in1d, but once the set exists, checking whether a value is in it is far faster (nanoseconds rather than microseconds per lookup):
s = set(range(0, 99999, 3))
%timeit 10 in s
10000000 loops, best of 3: 47 ns per loop
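The set only pays off when its construction cost is spread over many lookups. A minimal sketch of that pattern, reusing the array from this answer (the names `lookup`, `queries` and `hits` are illustrative):

```python
import numpy as np

a = np.arange(0, 99999, 3)

# One-time cost: convert to plain Python ints and build the set.
lookup = set(a.tolist())

# Each membership test afterwards is a hash lookup, not a scan.
queries = [9, 10, 99996, 99999]
hits = [q in lookup for q in queries]

print(hits)
```

For a single lookup, building the set first is unlikely to be worthwhile, since construction has to touch every element anyway.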
In my opinion, the most convenient way is:
(Val in X[:, col_num])
where Val is the value that you want to check for and X is the array. In your example, suppose you want to check whether the value 8 exists in the third column. Simply write
(8 in X[:, 2])
This will return True if 8 is in the third column, else False.
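If you are looking up a list of non-negative integers, you can use a boolean lookup table and index into it. This also works with nd-arrays, but seems to be slower. It may be better when doing this more than once.
import numpy as np

def valuesInArray(values, array):
    values = np.asanyarray(values)
    array = np.asanyarray(array)
    # Both inputs are used as indices, so they must be (non-negative) integers.
    assert np.issubdtype(array.dtype, np.integer) and np.issubdtype(values.dtype, np.integer)
    # Size the table to cover both inputs, so values above array.max() do not raise IndexError.
    matches = np.zeros(max(array.max(), values.max()) + 1, dtype=np.bool_)
    matches[values] = True
    # res has the same shape as array: True where the element is one of values.
    res = matches[array]
    return np.any(res), res

array = np.random.randint(0, 1000, (10000, 3))
values = np.array((1, 6, 23, 543, 222))
matched, matches = valuesInArray(values, array)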
Using numba's njit, I got a speedup of roughly 10x for this.
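The answer does not show its numba code. As a loosely related sketch rather than a reconstruction of it, a compiled early-exit scan over one column might look like the following. The function name `column_contains` is hypothetical, numba is treated as an optional dependency, and without it the same function simply runs as plain (slow) Python.

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):
        # Fallback: run the plain Python function unchanged.
        return func

@njit
def column_contains(column, value):
    """Scan a 1-D array and stop at the first match."""
    for x in column:
        if x == value:
            return True
    return False

# A few rows from the question (illustrative data only).
X = np.array([[1, 40, 3], [4, 50, 4], [5, 60, 7], [8, 72, 1]])

found = column_contains(X[:, 0], 8)
not_found = column_contains(X[:, 0], 7)

print(found, not_found)
```

Unlike the vectorized comparisons, this loop can return as soon as a match is found, which matters most when matches tend to appear early in a large column.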