Let's say I have a vector (array, list, whatever...) V of N elements, say V0 to V(N-1). For each element Vi, a function f(Vi, Vj) needs to be computed for every index j (including the case i=j). The function is symmetric, so once f(Vi, Vj) is computed there is no need to recompute f(Vj, Vi). That leaves N(N+1)/2 total evaluations of the function, making this an O(N^2) algorithm. Let's assume the time it takes to compute f is relatively long but consistent.
Now I want to parallelize the execution of the algorithm. I need to determine a schedule for (some number M of) worker threads so that two threads never use the same part of memory (i.e. the same element) at the same time. For example, f(V1, V2) could be evaluated in parallel with f(V3, V4), but not in parallel with f(V2, V3). The workflow is divided into steps such that in each step, every worker thread performs one evaluation of f. The threads are then synchronized, after which they proceed to the next step, and so on.
The question is, how do I determine (preferably optimally) the schedule for each thread as a series of index pairs (i,j) so that the complete problem is solved (i.e. each index pair visited exactly once, considering the symmetry)? While a direct answer would of course be nice, I'd also appreciate a pointer to an algorithm or even to relevant websites/literature.
This reminds me of a common sports scheduling problem: In a league with N teams, arrange N-1 gamedays, so that every team has one game per day and plays every other team once.
There is a quite illustrative solution to this problem from chess tournaments. Arrange all boards side by side on a long table. One player always remains in the same chair; the other players rotate around the table in the same direction, skipping that player.
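To make the rotation concrete, here is a minimal sketch of this "circle method" in C++. The name roundRobin and the dummy-index trick for odd N are illustrative additions, not part of the answer above; each returned round is a set of disjoint index pairs that can be evaluated in parallel without two threads touching the same element.

#include <algorithm>
#include <utility>
#include <vector>

// Circle method: fix ring[0] in its chair and rotate everyone else by
// one seat per round. For M players (M even) this yields M-1 rounds in
// which every index appears exactly once.
std::vector<std::vector<std::pair<int, int>>> roundRobin(int N) {
    int M = (N % 2 == 0) ? N : N + 1;   // odd N: add a dummy "bye" index
    std::vector<int> ring(M);
    for (int i = 0; i < M; ++i) ring[i] = i;

    std::vector<std::vector<std::pair<int, int>>> rounds;
    for (int r = 0; r < M - 1; ++r) {
        std::vector<std::pair<int, int>> round;
        for (int k = 0; k < M / 2; ++k) {
            int a = ring[k], b = ring[M - 1 - k];
            if (a < N && b < N)         // drop pairings with the dummy index
                round.push_back({a, b});
        }
        rounds.push_back(round);
        // everyone except ring[0] moves one seat in the same direction
        std::rotate(ring.begin() + 1, ring.end() - 1, ring.end());
    }
    return rounds;
}

The diagonal tasks f(Vi, Vi) touch only one element each, so they can be appended to any round that has an idle thread, for example the round in which index i would have been paired with the dummy.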
Let's look at the direct implementation:
for (i = 0; i < N; ++i) {
    for (j = 0; j <= i; ++j) {
        // compute the (i, j) pair and assign the result to both (i, j) and (j, i)
    }
}
I am a C++ programmer, so I think about parallelizing the outer loop, say, with OpenMP:
#pragma omp parallel for
for (i = 0; i < N; ++i) {
    ....
}
If you don't know OpenMP: it simply divides the loop into chunks according to the number of processors and executes them in parallel. I don't think the result will be good in this case, because the iterations of the outer loop have different lengths, so the work is unbalanced. Let's rewrite the algorithm so that all iterations have the same length. The total number of iterations will then be N/2: the first iteration handles rows 0 and N-1, the second handles rows 1 and N-2, and so on. Such a loop can be parallelized successfully, without conflicts. Simplified code, without the details about even/odd N etc.:
#pragma omp parallel for
for (i = 0; i < N / 2; ++i) {
    // handle row i ...
    // handle row N - 1 - i ...
}
This is just the general idea; there are incorrect details that you can fix. One possible completion is sketched below.
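As a hedged sketch of that completion: the names f, computeAll, V and result are placeholders, and the odd/even N cases are handled by counting the middle row only once. Each outer iteration then performs (i + 1) + (N - i) = N + 1 evaluations, so the work per iteration is constant.

#include <omp.h>
#include <vector>

// placeholder for the expensive symmetric function
double f(double a, double b) { return a * b; }

std::vector<std::vector<double>> computeAll(const std::vector<double>& V) {
    const int N = static_cast<int>(V.size());
    std::vector<std::vector<double>> result(N, std::vector<double>(N));
    const int half = (N + 1) / 2;         // middle row counted once for odd N
    #pragma omp parallel for
    for (int i = 0; i < half; ++i) {
        for (int j = 0; j <= i; ++j)      // row i: i + 1 tasks
            result[i][j] = result[j][i] = f(V[i], V[j]);
        const int k = N - 1 - i;          // partner row: N - i tasks
        if (k != i)                       // k == i only for the middle row of odd N
            for (int j = 0; j <= k; ++j)
                result[k][j] = result[j][k] = f(V[k], V[j]);
    }
    return result;
}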
I see two possibilities. One is to divide the K = N(N+1)/2 tasks among the M threads a priori. The following is just pseudo-code:
allocateTasks() {
    num = ceil(N*(N+1)/2 / M); // number of tasks per thread
    i0 = 0; j0 = 0;
    n = 0;
    for (i = 0; i < N; ++i) {
        for (j = i; j < N; ++j) { // skip symmetries
            if (n == 0) { i0 = i; j0 = j; }
            ++n;
            if (n == num) {
                thread(i0, j0, num);
                n = 0;
            }
        }
    }
    if (n > 0) thread(i0, j0, n);
}
thread(i0, j0, num) {
    // perform num tasks starting at (i0, j0)
    i = i0; j = j0;
    for (n = 0; n < num; ++n) {
        // f[i, j] = ...
        ++j;
        if (j >= N) { ++i; j = i; }
    }
}
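A minimal runnable version of this pre-allocation with std::thread might look as follows; the names worker and pool are mine, and the actual task body is left as a comment:

#include <thread>
#include <vector>

// worker matching the thread() pseudo-code above
void worker(int i0, int j0, int num, int N) {
    int i = i0, j = j0;
    for (int n = 0; n < num; ++n) {
        // f[i][j] = ...;                // the expensive evaluation
        ++j;
        if (j >= N) { ++i; j = i; }
    }
}

void allocateTasks(int N, int M) {
    const int total = N * (N + 1) / 2;
    const int num = (total + M - 1) / M; // ceil(total / M) in integer arithmetic
    std::vector<std::thread> pool;
    int n = 0, i0 = 0, j0 = 0;
    for (int i = 0; i < N; ++i)
        for (int j = i; j < N; ++j) {    // skip symmetric duplicates
            if (n == 0) { i0 = i; j0 = j; }
            if (++n == num) {
                pool.emplace_back(worker, i0, j0, num, N);
                n = 0;
            }
        }
    if (n > 0) pool.emplace_back(worker, i0, j0, n, N);
    for (auto& t : pool) t.join();
}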
The problem with this first approach is that when the computation of f(i, j) does not take the same time for every pair, the load is not balanced among the threads.
So perhaps the second approach is better: each thread takes the next available task. There is no global pre-allocation of tasks.
// init boolean array: done[0..N-1, 0..N-1] = false
thread() {
    for (i = 0; i < N; ++i) {
        for (j = i; j < N; ++j) { // skip symmetries
            boolean mytask = false;
            lock(mutex); // synchronize access to global array done[][]
            if (!done[i, j]) {
                done[i, j] = true;
                mytask = true;
            }
            unlock(mutex);
            if (mytask) {
                f(i, j) = ...
            }
        }
    }
}
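The same take-the-next-task idea can be written in C++ with a single atomic counter instead of the done[][] array and the mutex. This variant is a sketch of mine: each thread claims a linear task index and decodes it into an (i, j) pair of the upper triangle.

#include <atomic>
#include <thread>
#include <vector>

std::atomic<int> nextTask{0};  // shared counter, replaces done[][] and the mutex

void worker(int N) {
    const int total = N * (N + 1) / 2;
    for (;;) {
        int k = nextTask.fetch_add(1);   // claim the next free task
        if (k >= total) break;
        int i = 0;
        while (k >= N - i) { k -= N - i; ++i; } // row i holds N - i tasks
        int j = i + k;
        // f(i, j) = ...;                // the expensive evaluation
    }
}

int main() {
    const int N = 100, M = 4;            // illustrative sizes
    std::vector<std::thread> pool;
    for (int t = 0; t < M; ++t) pool.emplace_back(worker, N);
    for (auto& t : pool) t.join();
}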
In any case, access to f(i, j) would be via
g(i, j) = j >= i ? f(i, j) : f(j, i)
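A C++ version of this accessor, assuming the results are kept in an array named fval (an illustrative name) of which only the upper triangle is filled, could be:

#include <vector>

// results; only the upper triangle (j >= i) is assumed to be filled
std::vector<std::vector<double>> fval;

// accessor that hides the symmetry, mirroring g(i, j) above
double g(int i, int j) { return j >= i ? fval[i][j] : fval[j][i]; }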