Low performance in an OpenMP program

Developer https://www.devze.com 2023-01-31 16:22 Source: Internet

I am trying to understand an OpenMP code from here. You can see the code below.

To measure the speedup, that is, the ratio of the serial run time to the OpenMP run time, I take timestamps with gettimeofday() from sys/time.h. Is this approach right?
The program runs on a 4-core machine. I set export OMP_NUM_THREADS="4" but cannot see a substantial speedup; I usually get 1.2 - 1.7. What problems am I facing in this parallelization?
Which debugging/performance tool could I use to find where the performance is lost?

Code (compiled with xlc_r -qsmp=omp omp_workshare1.c -o omp_workshare1.exe):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define CHUNKSIZE   1000000
#define N       100000000

int main (int argc, char *argv[]) 
{
    int nthreads, tid, i, chunk;
    /* static storage: three 400 MB arrays would overflow a typical stack limit */
    static float a[N], b[N], c[N];
    unsigned long elapsed;
    unsigned long elapsed_serial;
    unsigned long elapsed_omp;
    struct timeval start;
    struct timeval stop;


    chunk = CHUNKSIZE;

    // =================    SERIAL     start =======================
    /* Some initializations */
    for (i=0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    gettimeofday(&start,NULL); 
    for (i=0; i<N; i++)
    {
        c[i] = a[i] + b[i];
        //printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
    }
    gettimeofday(&stop,NULL);
    elapsed = 1000000 * (stop.tv_sec - start.tv_sec);
    elapsed += stop.tv_usec - start.tv_usec;
    elapsed_serial = elapsed ;
    printf ("   \n Time SEQ= %lu microsecs\n", elapsed_serial);
    // =================    SERIAL     end =======================


    // =================    OMP    start =======================
    /* Some initializations */
    for (i=0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    gettimeofday(&start,NULL); 
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        //printf("Thread %d starting...\n",tid);

#pragma omp for schedule(static,chunk)
        for (i=0; i<N; i++)
        {
            c[i] = a[i] + b[i];
            //printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
        }

    }  /* end of parallel section */
    gettimeofday(&stop,NULL);
    elapsed = 1000000 * (stop.tv_sec - start.tv_sec);
    elapsed += stop.tv_usec - start.tv_usec;
    elapsed_omp = elapsed ;
    printf ("   \n Time OMP= %lu microsecs\n", elapsed_omp);
    // =================    OMP    end =======================
    printf ("   \n speedup= %f \n\n", ((float) elapsed_serial) / ((float) elapsed_omp)) ;

}

There's nothing really wrong with the code above, but your speedup is limited because the main loop, c = a + b, does very little work per element. The time for the computation (a single addition) is dominated by memory access time (two loads and one store per element), and the more threads work on the arrays, the more they contend for the same memory bandwidth.
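One rough way to check the memory-bound explanation is to turn a measured loop time into an effective memory bandwidth. The sketch below is only an illustration and not part of the original program: the function names are hypothetical, the arrays are allocated on the heap, and it assumes 12 bytes of traffic per element (two float loads and one store), ignoring any extra traffic the cache hierarchy may generate. If the resulting figure is already close to what the machine's memory system can sustain, adding threads cannot help much.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* Bytes the loop c[i] = a[i] + b[i] has to move:
   two loads and one store of a float per element (assumed model). */
double vector_add_bytes(size_t n)
{
    return 3.0 * (double)n * (double)sizeof(float);
}

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes) for n elements
   processed in `seconds` of wall-clock time. */
double effective_bandwidth_gbs(size_t n, double seconds)
{
    assert(seconds > 0.0);
    return vector_add_bytes(n) / seconds / 1e9;
}

/* Time one pass of the vector add on heap-allocated arrays.
   Returns elapsed wall-clock seconds, or -1.0 if n is 0 or
   an allocation fails. */
double timed_vector_add(size_t n)
{
    struct timeval start, stop;
    double seconds = -1.0;
    float *a, *b, *c;
    long i;

    if (n == 0)
        return -1.0;

    a = malloc(n * sizeof *a);
    b = malloc(n * sizeof *b);
    c = malloc(n * sizeof *c);

    if (a && b && c) {
        for (i = 0; i < (long)n; i++)
            a[i] = b[i] = (float)i;

        gettimeofday(&start, NULL);
#pragma omp parallel for schedule(static)
        for (i = 0; i < (long)n; i++)
            c[i] = a[i] + b[i];
        gettimeofday(&stop, NULL);

        seconds = (double)(stop.tv_sec - start.tv_sec)
                + (double)(stop.tv_usec - start.tv_usec) / 1e6;

        /* Use one result so the stores are not trivially dead code. */
        if (c[n - 1] < 0.0f)
            printf("unexpected negative result\n");
    }

    free(a);
    free(b);
    free(c);
    return seconds;
}
```

Calling timed_vector_add(100000000) and passing the result to effective_bandwidth_gbs() gives a number that can be compared between OMP_NUM_THREADS=1 and OMP_NUM_THREADS=4. The pragma only takes effect when the program is built with the compiler's OpenMP option (for example -qsmp=omp with xlc_r); otherwise the loop simply runs serially.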

We can test this explanation by making the work inside the loop more compute-intensive (with math.h included):

c[i] = exp(sin(a[i])) + exp(cos(b[i]));

And then we get

$ ./apb

 Time SEQ= 17678571 microsecs
Number of threads = 4

 Time OMP= 4703485 microsecs

 speedup= 3.758611

which is a lot closer to the 4x speedup one would expect on four cores.
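To compare runs like these, it can help to express the timings as a speedup and as a parallel efficiency (speedup divided by the thread count). The helpers below are a minimal sketch with hypothetical names; they assume both timings are in the same unit and measure the same amount of work.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup: serial time divided by parallel time (same unit for both). */
double speedup(double t_serial, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency: fraction of the ideal nthreads-fold speedup
   that was actually achieved (1.0 means perfect scaling). */
double parallel_efficiency(double t_serial, double t_parallel, int nthreads)
{
    assert(nthreads > 0);
    return speedup(t_serial, t_parallel) / (double)nthreads;
}

/* Print a one-line summary for a pair of timings. */
void report_scaling(double t_serial, double t_parallel, int nthreads)
{
    printf("speedup = %.3f, efficiency = %.1f%%\n",
           speedup(t_serial, t_parallel),
           100.0 * parallel_efficiency(t_serial, t_parallel, nthreads));
}
```

With the timings shown above, speedup(17678571, 4703485) is about 3.76, which corresponds to an efficiency of roughly 94% on four threads. The 1.2 - 1.7 speedups reported for the original loop correspond to only about 30% to 43%.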

Update: As for the other questions: gettimeofday() is probably fine for timing. Since you're using xlc, is this AIX? If so, peekperf is a good overall performance tool, and the hardware performance monitors will give you access to memory access times. On x86 platforms, free tools for performance monitoring of threaded code include cachegrind/valgrind for cache performance debugging (not the problem here) and scalasca for general OpenMP issues; OpenSpeedShop is pretty useful, too.