I have a function which is passed two structures by reference. These structures are composed of dynamically allocated arrays. When I parallelize it with OpenMP I get a slowdown, not a speedup. I suspect this is caused by a data-sharing issue. Here's some of the code for your perusal (C):
void leap(MHD *mhd,GRID *grid,short int gchk)
{
/*-- V A R I A B L E S --*/
// Indexes
int i,j,k,tid;
double rhoinv[grid->nx][grid->ny][grid->nz];
double rhoiinv[grid->nx][grid->ny][grid->nz];
double rhoeinv[grid->nx][grid->ny][grid->nz];
double rhoninv[grid->nx][grid->ny][grid->nz]; // Rho Inversion
#pragma omp parallel shared(mhd->rho,mhd->rhoi,mhd->rhoe,mhd->rhon,grid,rhoinv,rhoiinv,rhoeinv,rhoninv) \
private(i,j,k,tid,stime)
{
tid=omp_get_thread_num();
printf("----- Thread %d Checking in!\n",tid);
#pragma omp barrier
if (tid == 0)
{
stime=clock();
printf("-----1) Calculating leap helpers");
}
#pragma omp for
for(i=0;i<grid->nx;i++)
{
for(j=0;j<grid->ny;j++)
{
for(k=0;k<grid->nz;k++)
{
// rho's
rhoinv[i][j][k]=1./mhd->rho[i][j][k];
rhoiinv[i][j][k]=1./mhd->rhoi[i][j][k];
rhoeinv[i][j][k]=1./mhd->rhoe[i][j][k];
rhoninv[i][j][k]=1./mhd->rhon[i][j][k];
}
}
}
if (tid == 0)
{
printf("........%04.2f [s] -----\n",(clock()-stime)/CLOCKS_PER_SEC);
stime=clock();
}
#pragma omp barrier
}/*-- End Parallel Region --*/
}
Now I've tried default(shared) and shared(mhd), but neither shows any sign of improvement. Could it be that, since the arrays are allocated with
mhd->rho=(double ***)newarray(nx,ny,nz,sizeof(double));
declaring the structure, or the pointer to a member of the structure, as shared does not actually share the memory, just the pointers to it? Oh, and nx=389, ny=7 and nz=739 in this example. Execution time for this section is 0.23 [s] in serial and 0.79 [s] with 8 threads.
My issue boiled down to a really simple mistake: clock(). While I did protect my timing code by having only one specific thread measure the time, I forgot one important thing about clock(): it does not return wall-clock time. It returns the processor time used by the program, which here is the total summed over all active threads. What I needed to call was omp_get_wtime(). With that change I suddenly see a speedup for many sections of my code. For the record, I've modified my code to include
#ifdef _OPENMP
#include <omp.h>
#define TIMESCALE 1
#else
#define omp_get_thread_num() 0
#define omp_get_num_procs() 0
#define omp_get_num_threads() 1
#define omp_set_num_threads(bob) 0
#define omp_get_wtime() clock()
#define TIMESCALE CLOCKS_PER_SEC
#endif
And my timing algorithm is now
#pragma omp barrier
if (tid == 0)
{
stime=omp_get_wtime();
printf("-----1) Calculating leap helpers");
}
#pragma omp for
for(i=0;i<grid->nx;i++)
{
for(j=0;j<grid->ny;j++)
{
for(k=0;k<grid->nz;k++)
{
// rho's
rhoinv[i][j][k]=1./mhd->rho[i][j][k];
rhoiinv[i][j][k]=1./mhd->rhoi[i][j][k];
rhoeinv[i][j][k]=1./mhd->rhoe[i][j][k];
rhoninv[i][j][k]=1./mhd->rhon[i][j][k];
// 1./(gamma-1.)
gaminv[i][j][k]=1./(mhd->gamma[i][j][k]-1.);
gamiinv[i][j][k]=1./(mhd->gammai[i][j][k]-1.);
gameinv[i][j][k]=1./(mhd->gammae[i][j][k]-1.);
gamninv[i][j][k]=1./(mhd->gamman[i][j][k]-1.);
}
}
}
if (tid == 0)
{
printf("........%04.2f [s] -----\n",(omp_get_wtime()-stime)/TIMESCALE);
stime=omp_get_wtime();
printf("-----2) Calculating leap helpers");
}
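To make the difference between the two timers visible in isolation, here is a minimal sketch. It is not part of the original code: the names wall_seconds, cpu_seconds, busy_work and compare_timers are made up for illustration, the workload is an arbitrary sum of reciprocals, and the non-OpenMP fallback assumes a C11 library that provides timespec_get (rather than the clock() fallback used in the macros above).

```c
#include <stdio.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock seconds: omp_get_wtime() with OpenMP, else C11 timespec_get().
 * (Hypothetical helper, not from the original post.) */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
#endif
}

/* Processor time in seconds, as reported by clock(). */
static double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Arbitrary workload: sum of 1/i for i = 1..n, split across threads. */
static double busy_work(long n)
{
    double sum = 0.0;
    long i;
#pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= n; i++)
        sum += 1.0 / (double)i;
    return sum;
}

/* Runs busy_work once and stores the elapsed wall and CPU time. */
static double compare_timers(long n, double *wall, double *cpu)
{
    double w0 = wall_seconds();
    double c0 = cpu_seconds();
    double result = busy_work(n);
    *wall = wall_seconds() - w0;
    *cpu = cpu_seconds() - c0;
    return result;
}
```

Calling compare_timers with a large n and printing both values should show the effect: on systems where clock() counts the time of every thread, the CPU figure typically grows with the thread count while the wall-clock figure shrinks.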
An important point here could be the upper bounds of your loops. Since you use grid->nz etc., the compiler may have to assume that they can change between iterations. Load these values into local variables and use those in the loop conditions.
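A minimal sketch of that suggestion, under some assumptions: the GRID struct here is a reduced stand-in with only the three sizes, the function name invert_density is hypothetical, and the arrays are flat nx*ny*nz blocks rather than the double *** arrays from the question.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for the GRID struct in the question. */
typedef struct { int nx, ny, nz; } GRID;

/* Writes 1/rho into rhoinv; both are flat arrays of nx*ny*nz doubles
 * (an assumption; the question uses double *** arrays). */
static void invert_density(const GRID *grid, const double *rho, double *rhoinv)
{
    /* Copy the bounds once so the loop conditions use plain locals. */
    const int nx = grid->nx, ny = grid->ny, nz = grid->nz;
    int i, j, k;

#pragma omp parallel for private(j, k)
    for (i = 0; i < nx; i++)
        for (j = 0; j < ny; j++)
            for (k = 0; k < nz; k++) {
                const size_t idx = ((size_t)i * ny + j) * nz + k;
                rhoinv[idx] = 1.0 / rho[idx];
            }
}
```

Whether this makes a measurable difference depends on the compiler and optimization level, so it is worth timing both versions.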
Well, you are also using doubles and division. Can you turn the division into multiplication?
On some CPUs the floating-point unit is shared between cores, and a division does not take a fixed number of cycles to complete (unlike a multiplication). On such hardware the threads end up serializing on access to the FP unit. Most current multicore CPUs give each core its own FP unit, so check whether this applies to your machine.
If that is the bottleneck, using integral types or multiplication should give you a speedup.
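As a sketch of what trading division for multiplication can look like when many elements are divided by the same value: compute the reciprocal once and multiply inside the loop. The function name scale_by_inverse and the parameter dx are hypothetical. This does not apply directly to the per-element reciprocals in the question (each 1./rho[i][j][k] still needs its own division), and the result can differ from a true division in the last bit.

```c
/* Divides every element of a[0..n-1] by dx using one division
 * and n multiplications. (Hypothetical helper for illustration.) */
static void scale_by_inverse(double *a, long n, double dx)
{
    const double inv_dx = 1.0 / dx; /* the only division */
    long i;
#pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] *= inv_dx;
}
```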