What is the most platform- and Python-version-independent way to make a fast loop for use in Python?

I'm writing a scientific application in Python with a very processor-intensive loop at its core. I would like to optimise this as far as possible, at minimum inconvenience to end users, who will probably use it as an uncompiled collection of Python scripts, and will be using Windows, Mac, and (mainly Ubuntu) Linux.

It is currently written in Python with a dash of NumPy, and I've included the code below.

  1. Is there a solution which would be reasonably fast which would not require compilation? This would seem to be the easiest way to maintain platform-independence.
  2. If using something like Pyrex, which does require compilation, is there an easy way to bundle many modules and have Python choose between them depending on detected OS and Python version? Is there an easy way to build the collection of modules without needing access to every system with every version of Python?
  3. Does one method lend itself particularly to multi-processor optimisation?

(If you're interested, the loop is to calculate the magnetic field at a given point inside a crystal by adding together the contributions of a large number of nearby magnetic ions, treated as tiny bar magnets. Basically, a massive sum of these.)

# calculate_dipole
# -------------------------
# calculate_dipole works out the dipole field at a given point within the crystal unit cell
# ---
# INPUT
# mu = position at which to calculate the dipole field
# r_i = array of atomic positions
# mom_i = corresponding array of magnetic moments
# ---
# OUTPUT
# B = the B-field at this point

from numpy import zeros, dot, sqrt

def calculate_dipole(mu, r_i, mom_i):
    relative = mu - r_i
    r_unit = unit_vectors(relative)
    #mu0 / 4pi (the prefactor of the dipole equation)
    A = 1e-7
    #initialise dipole field
    B = zeros(3,float)

    for i in range(len(relative)):
        #work out the dipole field and add it to the estimate so far
        B += A*(3*dot(mom_i[i],r_unit[i])*r_unit[i] - mom_i[i]) / sqrt(dot(relative[i],relative[i]))**3
    return B


You can get this to run much, much faster if you eliminate the loop and use Numpy's vectorized operations. Put your data in numpy arrays of shape (3,N) and try the following:

import numpy as np

N = 20000
mu = np.random.random((3,1))
r_i = np.random.random((3,N))
mom_i = np.random.random((3,N))

def unit_vectors(r):
    return r / np.sqrt((r*r).sum(0))

def calculate_dipole(mu, r_i, mom_i):
    relative = mu - r_i
    r_unit = unit_vectors(relative)
    A = 1e-7

    num = A*(3*np.sum(mom_i*r_unit, 0)*r_unit - mom_i)
    den = np.sqrt(np.sum(relative*relative, 0))**3
    B = np.sum(num/den, 1)
    return B

This runs about 50 times faster for me than using a for loop.
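If you want to check the factor on your own machine, a rough timing along the following lines will do (an illustrative sketch using the arrays defined above; the exact numbers will depend on your hardware and on N):

import timeit

# time the vectorised version with the (3, N) test arrays defined above;
# the original loop-based version can be timed the same way for comparison
t = timeit.timeit(lambda: calculate_dipole(mu, r_i, mom_i), number=100)
print("vectorised: %.4f s per call" % (t / 100))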


NumPy already uses native (C-level) optimisations for its array operations. You can also combine NumPy arrays with Cython to gain further speed-ups.


Your Python code could probably be sped up a bit by replacing the loop with a generator expression and removing the repeated lookups of mom_i[i], relative[i] and r_unit[i] by iterating through all three sequences in parallel using itertools.izip.

i.e. replace

B = zeros(3,float)

for i in range(len(relative)):
    #work out the dipole field and add it to the estimate so far
    B += A*(3*dot(mom_i[i],r_unit[i])*r_unit[i] - mom_i[i]) / sqrt(dot(relative[i],relative[i]))**3
return B

with:

from itertools import izip
...
return sum((A*(3*dot(mom,ru)*ru - mom) / sqrt(dot(rel,rel))**3 
            for mom, ru, rel in izip(mom_i, r_unit, relative)),
           zeros(3,float)) 

This is also more readable, IMHO, since the core equation is not cluttered with [i] everywhere.

I suspect however that this will only get you marginal gains compared to doing the whole function in a compiled language such as Cython.


One simple but significant speed-up is to take the multiplication by A outside of your sum. You can simply multiply B by it as you return it:

for i in range(len(relative)):
    #work out the dipole field and add it to the estimate so far
    B += (3*dot(mom_i[i],r_unit[i])*r_unit[i] - mom_i[i]) / sqrt(dot(relative[i],relative[i]))**3

return A*B

This gave about an 8% speed-up using 20,000 random dipoles.

Beyond that easy speed-up, I would recommend using Cython (which is generally recommended over Pyrex) or Weave from Scipy. Take a look at Performance Python for some examples and comparisons of various ways to speed up Numpy/Scipy code.

If you want to try making this parallel, I would recommend looking at Scipy's Parallel Programming to get started.
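For example, one simple way to parallelise this particular sum (a minimal sketch using the standard multiprocessing module rather than anything from that page) is to split the ions into chunks, compute each chunk's partial field in a separate process, and add the partial fields together. calculate_dipole here is assumed to be the vectorised (3, N) version from Ray's answer, and these functions need to live in an importable module for multiprocessing to work on Windows.

import numpy as np
from multiprocessing import Pool

def _partial_field(args):
    # helper that unpacks one chunk and computes its contribution to B
    mu, r_chunk, mom_chunk = args
    return calculate_dipole(mu, r_chunk, mom_chunk)

def calculate_dipole_parallel(mu, r_i, mom_i, nprocs=4):
    # split the N ions into nprocs chunks along the second axis
    r_chunks = np.array_split(r_i, nprocs, axis=1)
    mom_chunks = np.array_split(mom_i, nprocs, axis=1)
    pool = Pool(processes=nprocs)
    try:
        partials = pool.map(_partial_field,
                            [(mu, r, m) for r, m in zip(r_chunks, mom_chunks)])
    finally:
        pool.close()
        pool.join()
    # the total field is just the sum of the partial fields
    return sum(partials)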

It's good to see another physicist on SO. There aren't very many on here.

Edit:

I decided to take this as a challenge to develop some Cython skills and got about a 10x speed improvement over a Psyco-optimized version. Let me know if you'd like to see my code.

Edit2:

Okay, I went back and found what was slowing things down in my Cython version. Now the speed-up is well over 100x. If you want or need another factor of 2x or so over Ray's sped-up Numpy version, let me know and I will post my code.

Cython source code:

Here's the Cython code that I drummed up:

import numpy as np
cimport numpy as np
cimport cython
cdef extern from "math.h":
    double sqrt(double theta)
ctypedef np.float64_t dtype_t

@cython.boundscheck(False)
@cython.wraparound(False)
def calculate_dipole_cython(np.ndarray[dtype_t,ndim=2,mode="c"] mu, 
                            np.ndarray[dtype_t,ndim=2,mode="c"] r_i, 
                            np.ndarray[dtype_t,ndim=2,mode="c"] mom_i):
    cdef Py_ssize_t i
    cdef np.ndarray[dtype_t,ndim=1,mode="c"] tmp = np.empty(3,np.float64)
    cdef np.ndarray[dtype_t,ndim=1,mode="c"] relative = np.empty(3,np.float64)
    cdef double A = 1e-7
    cdef double C, D, F
    cdef np.ndarray[dtype_t,ndim=1,mode="c"] B = np.zeros(3,np.float64)
    for i in xrange(r_i.shape[0]):
        relative[0] = mu[0,0] - r_i[i,0]
        relative[1] = mu[0,1] - r_i[i,1]
        relative[2] = mu[0,2] - r_i[i,2]
        C = relative[0]*relative[0] + relative[1]*relative[1] + relative[2]*relative[2]
        C = 1.0/sqrt(C)
        D = C**3
        tmp[0] = relative[0]*C
        F = mom_i[i,0]*tmp[0]
        tmp[1] = relative[1]*C
        F += mom_i[i,1]*tmp[1]
        tmp[2] = relative[2]*C
        F += mom_i[i,2]*tmp[2]
        F *= 3
        B[0] += (F*tmp[0] - mom_i[i,0])*D
        B[1] += (F*tmp[1] - mom_i[i,1])*D
        B[2] += (F*tmp[2] - mom_i[i,2])*D
    return A*B

I've optimized it quite a bit, I think, but there might be a little more you can get out of it. You could perhaps still replace the np.zeros and np.empty calls with direct calls to the Numpy C API, but that shouldn't make much of a difference. As it stands, this code gives a 2-3 times improvement over the Numpy-optimized code you have. However, you need to pass the arrays in correctly: they need to be in C order (which is the default for Numpy arrays, but note that the transpose of a C-ordered array is a Fortran-ordered array).

For instance, to run the code from your other question, you will need to replace the np.random.random((3,N)) calls with np.random.random((N,3)). Also,

r_test_fast = reshape_vector(r_test) 

needs to be changed to

r_test_fast = np.array(np.matrix(r_test))

This last line can be made simpler/faster, but this would be premature optimization in my opinion.
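More generally, if your data is currently in (3, N) arrays as in the Numpy version above, one way to get the C-contiguous (N, 3) arrays this routine expects is the following (an illustrative sketch, not code from this answer):

# .T alone returns a Fortran-ordered view; ascontiguousarray copies it
# into the C-ordered layout that the mode="c" declarations require
r_i_c   = np.ascontiguousarray(r_i.T)     # shape (N, 3)
mom_i_c = np.ascontiguousarray(mom_i.T)   # shape (N, 3)
mu_c    = np.ascontiguousarray(mu.T)      # shape (1, 3)
B = calculate_dipole_cython(mu_c, r_i_c, mom_i_c)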

If you haven't used Cython before and don't know how to compile this, then let me know and I'll be glad to help.
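In case it helps, a minimal build script for a module like this usually looks something like the sketch below; the module name "dipole" and the file name "dipole.pyx" are just placeholders, not names used anywhere in this answer.

# setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension("dipole", ["dipole.pyx"],
                           include_dirs=[numpy.get_include()])],
)

Building in place with "python setup.py build_ext --inplace" then produces a platform- and Python-version-specific extension, which is exactly why question 2 above asks about bundling several of them.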

Lastly, I would recommend looking at this paper; I used it as a guide for my optimizations. The next step would be to try BLAS functions that make use of the SSE2 instruction set, to use the SSE API directly, or to use more of the Numpy C API that interfaces with SSE2. You could also look into parallelizing.


Python isn't intended for high-performance computation. Write the core loop in C and call it from Python.
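For what it's worth, the Python side of that approach typically looks something like the sketch below. Everything here is an assumption made for illustration, not code from this answer: it supposes the inner loop has been written in C as void dipole_sum(const double *mu, const double *r_i, const double *mom_i, int n, double *B) and compiled into a shared library named dipole.so / dipole.dll, which can then be loaded through numpy.ctypeslib.

# a sketch only: the library name "dipole" and the C function dipole_sum
# are hypothetical, not something provided by this answer
import ctypes
import numpy as np
import numpy.ctypeslib as npct

vec    = npct.ndpointer(dtype=np.float64, ndim=1, flags='C_CONTIGUOUS')
matrix = npct.ndpointer(dtype=np.float64, ndim=2, flags='C_CONTIGUOUS')

lib = npct.load_library('dipole', '.')   # finds dipole.so / dipole.dll
lib.dipole_sum.restype = None
lib.dipole_sum.argtypes = [vec, matrix, matrix, ctypes.c_int, vec]

def calculate_dipole_c(mu, r_i, mom_i):
    # mu: shape (3,), r_i and mom_i: shape (N, 3), all C-contiguous float64
    B = np.zeros(3)
    lib.dipole_sum(mu, r_i, mom_i, r_i.shape[0], B)
    return B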
