Summing with OpenMP using C_问答_开发者_运维开发者技术经验分享

I've been trying to parallelize this piece of code for about two days and keep having logical errors. The program is to find the area of an integral using the sum of the very small dx and calculate each discrete value of the integral. I am trying to implement this with openmp but I actually have no experience with openmp. I would like your help please. The actual goal is to parallelize the suma variable in the threads so every thread calculates less values of the integral. The program compiles successful开发者_如何学Goly but when I execute the program it returns wrong results.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char *argv[]){
    float down = 1, up = 100, dx, suma = 0, j;
    int steps, i, nthreads, tid;
    long starttime, finishtime, runtime; 

    starttime = omp_get_wtime();
    steps = atoi(argv[1]);
    dx = (up - down) / steps;

    nthreads = omp_get_num_threads();
    tid = omp_get_thread_num();
    #pragma omp parallel for private(i, j, tid) reduction(+:suma)
    for(i = 0; i < steps; i++){
        for(j = (steps / nthreads) * tid; j < (steps / nthreads) * (tid + 1); j += dx){
            suma += ((j * j * j) + ((j + dx) * (j + dx) * (j + dx))) / 2 * dx;
        }
    }
    printf("For %d steps the area of the integral  3 * x^2 + 1 from %f to %f is: %f\n", steps, down, up, suma);
    finishtime = omp_get_wtime();
    runtime = finishtime - starttime;
    printf("Runtime: %ld\n", runtime);
    return (0);
}

The problem lies within your for-loop. If you use the for-pragma, OpenMP does the loop splitting for you:

#pragma omp parallel for private(i) reduction(+:suma)
for(i = 0; i < steps; i++) {
    // recover the x-position of the i-th step
    float x = down + i * dx;
    // evaluate the function at x
    float y = (3.0f * x * x + 1)
    // add the sum of the rectangle to the overall integral
    suma += y * dx
}

Even if you would convert to a parallelisation scheme where you would have to compute the indices by yourself, that would be problematic. The outer loop should be executed only nthread times.

You should also consider switching to double for increased accuracy.

Let's just consider the threads=1 case. This:

#pragma omp parallel for private(i, j, tid) reduction(+:suma)
for(i = 0; i < steps; i++){
    for(j = (steps / nthreads) * tid; j < (steps / nthreads) * (tid + 1); j += dx){
        suma += ((j * j * j) + ((j + dx) * (j + dx) * (j + dx))) / 2 * dx;
    }
}

turns into this:

for(i = 0; i < steps; i++){
    for(j = 0; j < steps; j += dx){
        suma += ((j * j * j) + ((j + dx) * (j + dx) * (j + dx))) / 2 * dx;
    }
}

and you can start to see the problem; you're basically looping over steps².

In addition, your second loop doesn't make any sense, as you're incrementing by dx. That same confusion between indicies (i, j) with locations in the physical domain (i*dx) shows up in your increment. j+dx doesn't make any sense. Presumably you want to be increasing suma by (f(x) + f(x'))*dx/2 (eg, trapezoidal rule); that should be

        float x = down + i*dx;
        suma += dx * ((3 * x * x + 1) + (3 * (x + dx) * (x + dx) + 1)) / 2;

As ebo points out, you want to be summing the integrand, not its antiderivative.

Now if we include a check on the answer:

printf("For %d steps the area of the integral  3 * x^2 + 1 from %f to %f is: %f (expected: %f)\n",
            steps, down, up, suma, up*up*up-down*down*down + up - down);

and we run it in serial, we start getting the right answer:

$ ./foo 10
For 10 steps the area of the integral  3 * x^2 + 1 from 1.000000 to 100.000000 is: 1004949.375000 (expected: 1000098.000000)
Runtime: 0
$ ./foo 100
For 100 steps the area of the integral  3 * x^2 + 1 from 1.000000 to 100.000000 is: 1000146.562500 (expected: 1000098.000000)
Runtime: 0
$ ./foo 1000
For 1000 steps the area of the integral  3 * x^2 + 1 from 1.000000 to 100.000000 is: 1000098.437500 (expected: 1000098.000000)
Runtime: 0

There's no point at all in worrying about the OpenMP case until the serial case works.

Once it comes time to OpenMP this, as ebo points out, the easiest thing to do is to just let OpenMP do your loop decomposition for you: eg,

#pragma omp parallel for reduction(+:suma)
    for(i = 0; i < steps; i++){
        float x = down + i*dx;
        suma += dx * ((3 * x * x + 1) + (3 * (x + dx) * (x + dx) + 1)) / 2;
    }

Running this, one gets

$ setenv OMP_NUM_THREADS 1
$ ./foo 1000
For 1000 steps the area of the integral  3 * x^2 + 1 from 1.000000 to 100.000000 is: 1000098.437500 (expected: 1000098.000000)
Runtime: 0
$ setenv OMP_NUM_THREADS 2
$ ./foo 1000
For 1000 steps the area of the integral  3 * x^2 + 1 from 1.000000 to 100.000000 is: 1000098.437500 (expected: 1000098.000000)
Runtime: 0
$ setenv OMP_NUM_THREADS 4
$ ./foo 1000
For 1000 steps the area of the integral  3 * x^2 + 1 from 1.000000 to 100.000000 is: 1000098.625000 (expected: 1000098.000000)
Runtime: 0
$ setenv OMP_NUM_THREADS 8
$ ./foo 1000
For 1000 steps the area of the integral  3 * x^2 + 1 from 1.000000 to 100.000000 is: 1000098.500000 (expected: 1000098.000000)

One can do the blocking explicitly in OpenMP if you really want to, but you should have a reason for doing that.