I've written a C++ app that has to process a lot of data. Using OpenMP I parallelized the processing phase quite well and, embarrassingly, found that writing the output is now the bottleneck. I decided to use a parallel for there as well, since the order in which items are output is irrelevant; they just need to be written as coherent chunks.
Below is a simplified version of the output code, showing all the variables except for two custom iterators in the loop that collects data into related. My question is: is this the correct and optimal way to solve this problem? I read about the barrier pragma; do I need it here?
long i, n = nrows();

#pragma omp parallel for
for (i = 0; i < n; i++) {
    std::vector<MyData> related;
    for (size_t j = 0; j < data[i].size(); j++)
        related.push_back(data[i][j]);
    sort(related.rbegin(), related.rend());

    #pragma omp critical
    {
        std::cout << data[i].label << "\n";
        for (size_t j = 0; j < related.size(); j++)
            std::cout << " " << related[j].label << "\n";
    }
}
(I labeled this question c as I imagine OpenMP is very similar in C and C++. Please correct me if I'm wrong.)
One way to get around output contention is to write each iteration's output to a string stream (this can be done in parallel) and then push the contents to cout (which requires synchronization).
Something like this:
#pragma omp parallel for
for (i = 0; i < n; i++) {
    std::vector<MyData> related;
    for (size_t j = 0; j < data[i].size(); j++)
        related.push_back(data[i][j]);
    sort(related.rbegin(), related.rend());

    std::stringstream buf;
    buf << data[i].label << "\n";
    for (size_t j = 0; j < related.size(); j++)
        buf << " " << related[j].label << "\n";

    #pragma omp critical
    std::cout << buf.rdbuf();
}
This offers much more fine-grained locking and the performance should increase accordingly. On the other hand, it still uses locking. So another way would be to use an array of stream buffers, one per thread, and push them to cout sequentially after the parallel loop. This has the advantage of avoiding costly locks, and the output to cout must be serialized anyway.
On the other hand, you can even try to omit the critical section in the above code. In my experience this works, since the underlying streams have their own way of handling concurrent access. But I believe that behaviour is implementation-defined and not portable: without synchronization, characters from different threads may be interleaved arbitrarily.
cout contention is still going to be a problem here. Why not write the results to some thread-local storage and collate them to the desired location centrally, with no contention? For example, each thread in the parallel region could write to a separate file stream or memory stream, and you just concatenate them afterwards, since ordering is not important. Or postprocess the results from multiple places instead of one: no contention, and only a single write required.