OpenMP: Which example can get a better performance gain?

Developer https://www.devze.com 2023-03-22 19:06 Source: web
Which one gains better performance?

Example 1

 #pragma omp parallel for private (i,j)
    for(i = 0; i < 100; i++) {
        for (j=0; j< 100; j++){
           ....do sth...
        }
    }

Example 2

   for(i = 0; i < 100; i++) {
        #pragma omp parallel for private (i,j)
        for (j=0; j< 100; j++){
           ....do sth...
        }
    }

Follow-up question: Is it valid to use Example 3?

 #pragma omp parallel for private (i)
   for(i = 0; i < 100; i++) {
        #pragma omp parallel for private (j)
        for (j=0; j< 100; j++){
           ....do sth...
        }
    }


In general, Example 1 is best, as it parallelizes the outermost loop, which minimizes thread fork/join overhead. Although many OpenMP implementations pre-allocate a thread pool, there is still overhead in dispatching logical tasks to the worker threads (a.k.a. the team of threads) and joining them. Also note that if you use dynamic scheduling (e.g., schedule(dynamic, 1)), this task-dispatch overhead can be problematic.

So, Example 2 may incur significant parallel overhead, especially when the trip count of for-i is large (100 is okay, though) and the amount of work in for-j is small. "Small" is an ambiguous term and depends on many variables, but a parallel region lasting less than a millisecond would definitely be a wasteful use of OpenMP.

However, in the case where for-i is not parallelizable and only for-j is parallelizable, Example 2 is the only option. You must then consider carefully whether the amount of parallel work can offset the parallel overhead.

Example 3 is perfectly valid as long as for-i and for-j are both safely parallelizable (i.e., neither loop has a loop-carried flow dependence). Example 3 is called nested parallelism. Nested parallelism should be used with care: in many OpenMP implementations you need to turn it on manually by calling omp_set_nested (or setting the OMP_NESTED environment variable). However, as nested parallelism may spawn a huge number of threads, its benefit may be significantly reduced.


It depends on the amount of work you're doing in the inner loop. If it's small, launching too many threads will represent an overhead. If the work is big, I would probably go with option 2, depending on the number of cores your machine has.

BTW, the only place where you actually need to flag a variable as private is j in Example 1. In all the other cases the loop control variable of the parallelized loop is private implicitly.
