I have the following program. nv is around 100, dgemm is 20x100 or so, so there is plenty of work to go around:
#pragma omp parallel for schedule(dynamic,1)
for (int c = 0; c < int(nv); ++c) {
omp::thread thread;
matrix &t3_c = vv_.at(omp::num_threads()+thread);
if (terms.first) {
blas::gemm(1, t2_, vvvo_, 1, t3_c);
blas::gemm(1, vvvo_, t2_, 1, t3_c);
}
matrix &t3_b = vv_[thread];
if (terms.second) {
matr开发者_C百科ix &t2_ci = vo_[thread];
blas::gemm(-1, t2_ci, Vjk_, 1, t3_c);
blas::gemm(-1, t2_ci, Vkj_, 0, t3_b);
}
}
however with GCC 4.4, GOMP v1, the gomp_barrier_wait_end
accounts for nearly 50% of runtime. Changing GOMP_SPINCOUNT
aleviates the overhead but then only 60% of cores are used. Same for OMP_WAIT_POLICY=passive
. The system is Linux, 8 cores.
How can i get full utilization without spinning/waiting overhread
The barrier is a symptom, not the problem. The reason that there's lots of waiting at the end of the loop is that some of the threads are done well before the others, and they all wait at the end of the for loop for quite a while until everyone's done.
This is a classic load imbalance problem, which is weird here, since it's just a bunch of matrix multiplies. Are they of varying sizes? How are they laid out in memory, in terms of NUMA stuff - are they all currently sitting in one core's cache, or are there other sharing issues? Or, more simply -- are there only 9 matricies, so that the remaining 8 are doomed to be stuck waiting for whoever got the last one?
When this sort of thing happens in a larger parallel block of code, sometime it's ok to proceed to the next block of code while some of the loop iterations aren't done yet; there you can add the nowait
directive to the for which will override the default behaviour and get rid of the implied barrier. Here, though, since the parallel block is exactly the size of the for loop, that can't really help.
Could it be that your BLAS implementation also calls OpenMP inside? Unless you only see one call to gomp_barrier_wait_end
.
精彩评论