
OpenMP - dot product

https://www.devze.com 2023-02-18 03:16 Source: web

I am implementing a parallel dot product in OpenMP.

I have this code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <math.h>
#include <omp.h>
#define SIZE 1000

int main (int argc, char *argv[]) {

  float u[SIZE], v[SIZE], dp,dpp;
  int i, j, tid;

  dp=0.0;
  for(i=0;i<SIZE;i++){
      u[i]=1.0*(i+1);
      v[i]=1.0*(i+2);
  }
  printf("\n values of u and v:\n");

  for (i=0;i<SIZE;i++){
      printf(" u[%d]= %.1f\t v[%d]= %.1f\n",i,u[i],i,v[i]);
  }
  #pragma omp parallel shared(u,v,dp,dpp) private (tid,i)
  {
      tid=omp_get_thread_num();

      #pragma omp for private (i)
      for(i=0;i<SIZE;i++){
          dpp+=u[i]*v[i];
          printf("thread: %d\n", tid);
      }
      #pragma omp critical
      {
          dp=dpp;
          printf("thread %d\n",tid);
      }


  }

  printf("\n dot product is %f\n",dp);

 }

I am compiling it with: pgcc -B -Mconcur -Minfo -o prog prog.c

And the output I get in the console is:

33, Loop not parallelized: innermost 

39, Loop not vectorized/parallelized: contains call

48, Loop not vectorized/parallelized: contains call

What am I doing wrong?

From my point of view, everything looks OK.


First of all, a simple 1,000-element dot product does not have enough computational cost to justify multi-threading: you will pay so much more in communication and synchronization costs than you gain in performance that it is not worth it.

Secondly, it looks like you are computing the full dot product in each thread instead of dividing the computation across the threads and combining the results at the end.

Here is an example of how to do a vector dot product, from https://computing.llnl.gov/tutorials/openMP/#SHARED:

#include <omp.h>
#include <stdio.h>

int main ()
{
  int   i, n, chunk;
  float a[100], b[100], result;

  /* Some initializations */
  n = 100;
  chunk = 10;
  result = 0.0;
  for (i=0; i < n; i++) {
      a[i] = i * 1.0;
      b[i] = i * 2.0;
  }

  #pragma omp parallel for        \
      default(shared) private(i)  \
      schedule(static,chunk)      \
      reduction(+:result)
  for (i=0; i < n; i++)
      result += (a[i] * b[i]);

  printf("Final result= %f\n", result);
  return 0;
}

Basically, OpenMP is good for coarse-grained parallelism, when you have large, expensive loops. In general, when you are doing parallel programming, the larger the "chunks" of computation you can do before re-synchronizing, the better. As the number of cores grows, the communication and synchronization costs grow with it. Pretend that each synchronization (grabbing a new index or chunk of indexes to execute, entering a critical section, etc.) costs you 10 ms, or a million instructions; that gives you a better sense of when, where, and how to parallelize your code.


The problem is still the same as in your previous question: you are accumulating values in a variable, and you must tell OpenMP how to do that:

#pragma omp for reduction(+: dpp)
for(size_t i=0; i<SIZE; i++){
  dpp += u[i]*v[i];
}

Use a loop-local variable for the index, and that is all you need; forget about all the other stuff you are doing around it. If you want to see what the compiler does with your code, run it with -S and check the assembler output. This can be very instructive, because you learn what simple statements like that amount to once they are parallelized.

And don't use int for loop indices; sizes and the like are size_t.

