CUDA 4.0 RC - many host threads per one GPU - cudaStreamQuery and cudaStreamSynchronize behaviour_问答_开发者

CUDA 4.0 RC - many host threads per one GPU - cudaStreamQuery and cudaStreamSynchronize behaviour

开发者 https://www.devze.com 2023-02-15 08:12 出处：网络

I wrote a code which uses many host (OpenMP) threads per one GPU. Each thread has its own CUDA stream to order it requests. It looks very similar to below code:

#pragma omp parallel for num_threads(STREAM_NUMBER)
for (int sid = 0; sid < STREAM_NUMBER; sid++) {
    cudaStream_t stream;
    cudaStreamCreate(&s开发者_StackOverflow社区tream);

    while (hasJob()) {

        //... code to prepare job - dData, hData, dataSize etc

        cudaError_t streamStatus = cudaStreamQuery(stream);
        if (streamStatus == cudaSuccess) {
             cudaMemcpyAsync(dData, hData, dataSize, cudaMemcpyHostToDevice, stream);
             doTheJob<<<gridDim, blockDim, smSize, stream>>>(dData, dataSize);
        else {
             CUDA_CHECK(streamStatus);
        }
        cudaStreamSynchronize(stream);
    }
    cudaStreamDestroy(stream);
}

And everything were good till I got many small jobs. In that case, from time to time, cudaStreamQuery returns cudaErrorNotReady, which is for me unexpected because I use cudaStreamSynchronize. Till now I were thinking that cudaStreamQuery will always return cudaSuccess if it is called after cudaStreamSynchronize. Unfortunately it appeared that cudaStreamSynchronize may finish even when cudaStreamQuery still returns cudaErrorNotReady.

I changed the code into the following and everything works correctly.

#pragma omp parallel for num_threads(STREAM_NUMBER)
for (int sid = 0; sid < STREAM_NUMBER; sid++) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    while (hasJob()) {

        //... code to prepare job - dData, hData, dataSize etc

        cudaError_t streamStatus;
        while ((streamStatus = cudaStreamQuery(stream)) == cudaErrorNotReady) {
             cudaStreamSynchronize();
        }
        if (streamStatus == cudaSuccess) {
             cudaMemcpyAsync(dData, hData, dataSize, cudaMemcpyHostToDevice, stream);
             doTheJob<<<gridDim, blockDim, smSize, stream>>>(dData, dataSize);
        else {
             CUDA_CHECK(streamStatus);
        }
        cudaStreamSynchronize(stream);
    }
    cudaStreamDestroy(stream);
}

So my question is.... is it a bug or a feature?

EDIT: it is similar to JAVA

synchronize {
    while(waitCondition) {
         wait();
    }
}

What is under

//... code to prepare job - dData, hData, dataSize etc

Do you have any functions of kind cudaMemcpyAsync there, or the only memory transfer is in the code you have shown? Those are asynchronous functions may exit early, even when the code is not at the destination yet. When that happens cudaStreamQuery will return cudaSuccess only when memory transfers succeed.

Also, does hasJob() uses any of the host-CUDA functions?

If I am not mistaken, in a single stream, it is not possible to execute both kernel and memory transfers. Therefore, calling cudaStreamQuery is necessary only when a kernel depends on the data transferred by a different stream.

Didn't notice it earlier: cudaStreamSynchronize() should take a parameter (stream). I am not sure which stream you are synchronising when parameter is ommited, could be that it defaults to stream 0.