Numpy allows matrices of different sizes to be added/multiplied/divided provided certain broadcasting rules are followed. Also, creation of temporary arrays is a major speed impediment to numpy.
The following timeit results surprise me... what is going on?
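(For context, a minimal, hypothetical illustration of the broadcasting rule in play here: a (3, 1) array is stretched along its second axis to match a (3, 2) array. Small shapes are used just for display.)

import numpy as np
mat = np.arange(6.0).reshape(3, 2)      # shape (3, 2): [[0, 1], [2, 3], [4, 5]]
arr = np.array([[1.0], [2.0], [4.0]])   # shape (3, 1)
# Each row of mat is divided by the single value in the matching row of arr.
print(mat / arr)   # roughly [[0., 1.], [1., 1.5], [1., 1.25]]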
In [41]: def f_no_dot(mat, arr):
   ....:     return mat / arr
In [42]: def f_dot(mat, arr):
   ....:     denominator = scipy.dot(arr, scipy.ones((1, 2)))
   ....:     return mat / denominator
In [43]: mat = scipy.rand(360000,2)
In [44]: arr = scipy.rand(360000,1)
In [45]: timeit temp = f_no_dot(mat,arr)
10 loops, best of 3: 44.7 ms per loop
In [46]: timeit temp = f_dot(mat,arr)
100 loops, best of 3: 10.1 ms per loop
I thought that f_dot would be slower since it had to create the temporary array denominator, and I assumed that this step was skipped by f_no_dot. I should note that these times scale linearly with array size (up to length 1 billion) for f_no_dot, and slightly worse than linearly for f_dot.
For what it's worth, creating the temporary array is skipped, which is why f_no_dot is slower (but uses less memory).
Element-wise operations on arrays of the same size are faster, because numpy doesn't have to worry about the striding (dimensions, size, etc) of the arrays.
Operations that use broadcasting will generally be a bit slower than operations that don't have to.
If you have the memory to spare, creating a temporary copy can give you a speedup, but will use more memory.
For example, comparing these three functions:
import numpy as np
import timeit

def f_no_dot(x, y):
    return x / y

def f_dot(x, y):
    denom = np.dot(y, np.ones((1, 2)))
    return x / denom

def f_in_place(x, y):
    x /= y
    return x

num = 3600000
x = np.ones((num, 2))
y = np.ones((num, 1))

for func in ['f_dot', 'f_no_dot', 'f_in_place']:
    t = timeit.timeit('%s(x, y)' % func, number=100,
            setup='from __main__ import x, y, f_dot, f_no_dot, f_in_place')
    print func, 'time...'
    print t / 100.0
This yields similar timings to your results:
f_dot time...
0.184361531734
f_no_dot time...
0.619203259945
f_in_place time...
0.585789341927
However, if we compare the memory usage, things become a bit clearer...
The combined size of our x and y arrays is about 27.5 + 55 MB, or ~82 MB (the arrays are 64-bit floats). There's an additional ~11 MB of overhead from importing numpy, etc. Returning x / y as a new array (i.e. not doing x /= y) will require another 55 MB array.
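(Those sizes can be double-checked directly from each array's nbytes attribute; a quick sanity check, assuming the same num, x, and y as in the snippet above:)

import numpy as np
num = 3600000
x = np.ones((num, 2))   # float64 by default
y = np.ones((num, 1))
# nbytes reports the raw buffer size of each array
print('x: %.1f MB' % (x.nbytes / 1e6))   # ~57.6 MB (~55 MiB)
print('y: %.1f MB' % (y.nbytes / 1e6))   # ~28.8 MB (~27.5 MiB)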
100 runs of f_dot:
We're creating a temporary array here, so we'd expect to see 11 + 82 + 55 + 55 MB or ~203 MB of memory usage. And, that's what we see...
100 runs of f_no_dot:
If no temporary array is created, we'd expect a peak memory usage of 11 + 82 + 55 MB, or 148 MB...
So, x / y is not creating an additional num x 2 temporary array to do the division. Thus, the division takes quite a bit longer than it would if it were operating on two arrays of the same size.
100 runs of f_in_place:
If we can modify x in-place, we can save even more memory, if that's the main concern.
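(If you want a rough way to reproduce these peak-memory numbers yourself, one Linux-only option is the standard library's resource module; this is just a sketch, not the profiler used for the figures above.)

import resource
import numpy as np

def peak_mb():
    # ru_maxrss is the peak resident set size, reported in kilobytes on Linux
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

num = 3600000
x = np.ones((num, 2))
y = np.ones((num, 1))
print('after allocation: %.0f MB' % peak_mb())

for _ in range(100):
    temp = x / y             # the f_no_dot case
print('after 100 runs of x / y: %.0f MB' % peak_mb())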
Basically, numpy tries to conserve memory at the expense of speed, in some cases.
What you are seeing is most likely iteration overhead over the small (2,) dimension. Numpy (versions < 1.6) dealt efficiently only with operations involving contiguous arrays of the same shape. The effect goes away as the size of the last dimension increases.
To see the effect of contiguity, compare the following (first with Numpy 1.5.1):
In [1]: import numpy
In [2]: numpy.__version__
Out[2]: '1.5.1'
In [3]: arr_cont1 = numpy.random.rand(360000, 2)
In [4]: arr_cont2 = numpy.random.rand(360000, 2)
In [5]: arr_noncont = numpy.random.rand(360000, 4)[:,::2]
In [6]: arr_bcast = numpy.random.rand(360000, 1)
In [7]: %timeit arr_cont1 / arr_cont2
100 loops, best of 3: 5.75 ms per loop
In [8]: %timeit arr_noncont / arr_cont2
10 loops, best of 3: 54.4 ms per loop
In [9]: %timeit arr_bcast / arr_cont2
10 loops, best of 3: 55.2 ms per loop
(All the timings above are probably only accurate to 1 ms.)
The situation is however much improved in Numpy >= 1.6.0:
In [1]: import numpy
In [2]: numpy.__version__
Out[2]: '1.6.1'
In [3]: arr_cont1 = numpy.random.rand(360000, 2)
In [4]: arr_cont2 = numpy.random.rand(360000, 2)
In [5]: arr_noncont = numpy.random.rand(360000, 4)[:,::2]
In [6]: arr_bcast = numpy.random.rand(360000, 1)
In [7]: %timeit arr_cont1 / arr_cont2
100 loops, best of 3: 5.37 ms per loop
In [8]: %timeit arr_noncont / arr_cont2
100 loops, best of 3: 6.12 ms per loop
In [9]: %timeit arr_bcast / arr_cont2
100 loops, best of 3: 7.81 ms per loop
Note also that temporaries are not that expensive:
In [82]: %timeit arr_cont1.copy()
1000 loops, best of 3: 778 us per loop
EDIT: Note above that arr_noncont is also sort of contiguous, with a stride of 2*itemsize, so that the inner loop can be unraveled --- Numpy can make it about as fast as the contiguous array. With broadcasting (or with a really noncontiguous array such as numpy.random.rand(360000*2, 2)[::2,:]), the inner loop cannot be unraveled, and these cases are slightly slower. Improving that would still be possible if Numpy emitted tailored machine code on the fly for each loop, but it doesn't do that (at least yet :)
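(One way to see why arr_noncont still behaves almost like a contiguous array is to look at its strides; this is only an illustrative check of the point above:)

import numpy as np

arr_noncont = np.random.rand(360000, 4)[:, ::2]
really_noncont = np.random.rand(360000 * 2, 2)[::2, :]

# arr_noncont can be walked end to end with a single uniform stride of
# 2*itemsize (16 bytes for float64), so the inner loop can be unraveled.
print(arr_noncont.strides)                    # (32, 16)
print(arr_noncont.flags['C_CONTIGUOUS'])      # False

# really_noncont skips every other row, so there is no single uniform
# stride over the whole buffer and the inner loop cannot be unraveled.
print(really_noncont.strides)                 # (32, 8)
print(really_noncont.flags['C_CONTIGUOUS'])   # False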