OpenMP: poor performance of heap arrays (stack arrays work fine)_问答_开发者

I am a fairly experienced OpenMP user, but I have just run into a puzzling problem, and I am hopeful that someone here could help. The problem is that a simple hashing algorithm performs well for stack-allocated arrays, but poorly for arrays on the heap.

Example below uses i%M (i modulus M) to count every M-th integer in respective array element. For simplicity, imagine N=1000000, M=10. If N%M==0, then the result should be that every element of bins[] is equal to N/M:

#pragma omp for
  for (int i=0; i<N; i++) 
    bins[ i%M ]++;

Array bins[] is private to each thread (I sum results of all threads in a critical section afterwards).

When bins[] is allocated on the stack, the program works great, with performance scaling proportionally to the number of cores.

However, if bins[] is on the heap (pointer to bins[] is on the stack), performance drops drastically. And that is a major problem!

I want parallelize binning (hashing) of certain data into heap arrays with OpenMP, and this is a major performance hit.

It is definitely not something silly like all threads trying to write into the same area of memory. That is because each thread has its own bins[] array, results are correct with both heap- and stack-allocated bins, and there is no difference in performance for single-thread runs. I reproduced the problem on different hardware (Intel Xeon and AMD Opteron), with GCC and Intel C++ compilers. All tests were on Linux (Ubuntu and RedHat).

There seems no reason why good performance of OpenMP should be limited to stack arrays.

Any guesses? Maybe access of threads to the heap goes through some kind of shared gateway on Linux? How do I fix that?

Complete program to play around with is below:

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main(const int argc, const char* argv[])
{
  const int N=1024*1024*1024;
  const int M=4;
  double t1, t2;
  int checksum=0;

  printf("OpenMP threads: %d\n", omp_get_max_threads());

  //////////////////////////////////////////////////////////////////
  // Case 1: stack-allocated array
  t1=omp_get_wtime();
  checksum=0;
#pragma omp parallel
  { // Each openmp thread should have a private copy of 
    // bins_thread_stack on the stack:
    int bins_thread_stack[M];
    for (int j=0; j<M; j++) bins_thread_stack[j]=0;
#pragma omp for
    for (int i=0; i<N; i++) 
      { // Accumulating every M-th number in respective array element
        const int j=i%M;
        bins_thread_stack[j]++;
      }
#pragma omp critical
    for (int j=0; j<M; j++) checksum+=bins_thread_stack[j];
  }
  t2=omp_get_wtime();
  printf("Time with stack array: %12.3f sec, checksum=%d (must be %d).\n", t2-t1, checksum, N);
  //////////////////////////////////////////////////////////////////

  //////////////////////////////////////////////////////////////////
  // Case 2: heap-allocated array
  t1=omp_get_wtime();
  checksum=0;
  #pragma omp parallel 
  { // Each openmp thread should have a private copy of 
    // bins_thread_heap on the heap:
    int* bins_thread_heap=(int*)malloc(sizeof(int)*M); 
    for (int j=0; j<M; j++) bins_thread_heap[j]=0;
  #pragma omp for
    for (int i=0; i<N; i++) 
      { // Accumulating every M-th number in respective array element
        const int j=i%M;
        bins_thread_heap[j]++;
      }
  #pragma omp critical
    for (int j=0; j<M; j++) checksum+=bins_thread_heap[j];
    free(bins_thread_heap);
  }
  t2=omp_get_wtime();
  printf("Time with heap  array: %12.3f sec, checksum=%d (must be %d).\n", t2-t1, checksum开发者_JS百科, N);
  //////////////////////////////////////////////////////////////////

  return 0;
}

Sample outputs of the program are below:

for OMP_NUM_THREADS=1

OpenMP threads: 1
Time with stack array: 2.973 sec, checksum=1073741824 (must be 1073741824).
Time with heap  array: 3.091 sec, checksum=1073741824 (must be 1073741824).

and for OMP_NUM_THREADS=10

OpenMP threads: 10
Time with stack array: 0.329 sec, checksum=1073741824 (must be 1073741824).
Time with heap  array: 2.150 sec, checksum=1073741824 (must be 1073741824).

I would very much appreciate any help!

This is a cute problem: with the code as above (gcc4.4, Intel i7) with 4 threads I get

OpenMP threads: 4
Time with stack array:        1.696 sec, checksum=1073741824 (must be 1073741824).
Time with heap  array:        5.413 sec, checksum=1073741824 (must be 1073741824).

but if I change the malloc line to

    int* bins_thread_heap=(int*)malloc(sizeof(int)*M*1024);

(Update: or even

    int* bins_thread_heap=(int*)malloc(sizeof(int)*16);

)

then I get

OpenMP threads: 4
Time with stack array:        1.578 sec, checksum=1073741824 (must be 1073741824).
Time with heap  array:        1.574 sec, checksum=1073741824 (must be 1073741824).

The problem here is false sharing. The default malloc is being very (space-) efficient, and putting the requested small allocations all in one block of memory, next to each other; but since the allocations are so small that multiple fit in the same cache line, that means every time one thread updates its values, it dirties the cache line of the values in neighbouring threads. By making the requested memory large enough, this is no longer an issue.

Incidentally, it should be clear why the stack-allocated case does not see this problem; different threads - different stacks - memory far enough appart that false sharing isn't an issue.

As a side point -- it doesn't really matter for M of the size you're using here, but if your M (or number of threads) were larger, the omp critical would be a big serial bottleneck; you can use OpenMP reductions to sum the checksum more efficiently

#pragma omp parallel reduction(+:checksum)
    { // Each openmp thread should have a private copy of 
        // bins_thread_heap on the heap:
        int* bins_thread_heap=(int*)malloc(sizeof(int)*M*1024);
        for (int j=0; j<M; j++) bins_thread_heap[j]=0;
#pragma omp for
        for (int i=0; i<N; i++)
        { // Accumulating every M-th number in respective array element
            const int j=i%M;
            bins_thread_heap[j]++;
        }
        for (int j=0; j<M; j++)
            checksum+=bins_thread_heap[j];
        free(bins_thread_heap);
 }

The initial question implied that heap arrays are slower than stack arrays. Unfortunately the reason for this slowness related to a particular case of cache line clashes in multi-threaded applications. It does not justify the implication that in general heap arrays are slower than stack arrays. For most cases, there is no significant difference in performance, especially where the arrays are much larger than cache line size. The opposite can often be the case, as the use of allocatable heap arrays, targeted to the size required can lead to performance advantages over larger fixed size arrays, which demand more memory transfers.