I wrote a code which uses many host (OpenMP) threads per one GPU. Each thread has its own CUDA stream to order it requests. It looks very similar to below code:
#pragma omp parallel for num_threads(STREAM_NUMBER)
for (int sid = 0; sid < STREAM_NUMBER; sid++) {
cudaStream_t stream;
cudaStreamCreate(&s开发者_StackOverflow社区tream);
while (hasJob()) {
//... code to prepare job - dData, hData, dataSize etc
cudaError_t streamStatus = cudaStreamQuery(stream);
if (streamStatus == cudaSuccess) {
cudaMemcpyAsync(dData, hData, dataSize, cudaMemcpyHostToDevice, stream);
doTheJob<<<gridDim, blockDim, smSize, stream>>>(dData, dataSize);
else {
CUDA_CHECK(streamStatus);
}
cudaStreamSynchronize(stream);
}
cudaStreamDestroy(stream);
}
And everything were good till I got many small jobs. In that case, from time to time, cudaStreamQuery returns cudaErrorNotReady, which is for me unexpected because I use cudaStreamSynchronize. Till now I were thinking that cudaStreamQuery will always return cudaSuccess if it is called after cudaStreamSynchronize. Unfortunately it appeared that cudaStreamSynchronize may finish even when cudaStreamQuery still returns cudaErrorNotReady.
I changed the code into the following and everything works correctly.
#pragma omp parallel for num_threads(STREAM_NUMBER)
for (int sid = 0; sid < STREAM_NUMBER; sid++) {
cudaStream_t stream;
cudaStreamCreate(&stream);
while (hasJob()) {
//... code to prepare job - dData, hData, dataSize etc
cudaError_t streamStatus;
while ((streamStatus = cudaStreamQuery(stream)) == cudaErrorNotReady) {
cudaStreamSynchronize();
}
if (streamStatus == cudaSuccess) {
cudaMemcpyAsync(dData, hData, dataSize, cudaMemcpyHostToDevice, stream);
doTheJob<<<gridDim, blockDim, smSize, stream>>>(dData, dataSize);
else {
CUDA_CHECK(streamStatus);
}
cudaStreamSynchronize(stream);
}
cudaStreamDestroy(stream);
}
So my question is.... is it a bug or a feature?
EDIT: it is similar to JAVA
synchronize {
while(waitCondition) {
wait();
}
}
What is under
//... code to prepare job - dData, hData, dataSize etc
Do you have any functions of kind cudaMemcpyAsync
there, or the only memory transfer is in the code you have shown? Those are asynchronous functions may exit early, even when the code is not at the destination yet. When that happens cudaStreamQuery
will return cudaSuccess
only when memory transfers succeed.
Also, does hasJob()
uses any of the host-CUDA functions?
If I am not mistaken, in a single stream, it is not possible to execute both kernel and memory transfers. Therefore, calling cudaStreamQuery
is necessary only when a kernel depends on the data transferred by a different stream.
Didn't notice it earlier: cudaStreamSynchronize()
should take a parameter (stream)
. I am not sure which stream you are synchronising when parameter is ommited, could be that it defaults to stream 0.
精彩评论