Implementing a critical section in CUDA

I'm trying to implement a critical section in CUDA using atomic instructions, but I ran into some trouble. I have created the test program to show the problem:

#include <cuda_runtime.h>
#include <cutil_inline.h>
#include <stdio.h>

__global__ void k_testLocking(unsigned int* locks, int n) {
    int id = threadIdx.x % n;
    while (atomicExch(&(locks[id]), 1u) != 0u) {} //lock
    //critical section would go here
    atomicExch(&(locks[id]),0u); //unlock
}

int main(int argc, char** argv) {
    //initialize the locks array on the GPU to (0...0)
    unsigned int* locks;
    unsigned int zeros[10]; for (int i = 0; i < 10; i++) {zeros[i] = 0u;}
    cutilSafeCall(cudaMalloc((void**)&locks, sizeof(unsigned int)*10));
    cutilSafeCall(cudaMemcpy(locks, zeros, sizeof(unsigned int)*10, cudaMemcpyHostToDevice));

    //Run the kernel:
    k_testLocking<<<dim3(1), dim3(256)>>>(locks, 10);

    //Check the error messages:
    cudaError_t error = cudaGetLastError();
    cutilSafeCall(cudaFree(locks));
    if (cudaSuccess != error) {
        printf("error 1: CUDA ERROR (%d) {%s}\n", error, cudaGetErrorString(error));
        exit(-1);
    }
    return 0;
}

This code, unfortunately, hard freezes my machine for several seconds and finally exits, printing out the message:

fcudaSafeCall() Runtime API error in file <XXX.cu>, line XXX : the launch timed out and was terminated.

which means that one of those while loops is not returning, but it seems like this should work.

As a reminder, atomicExch(unsigned int* address, unsigned int val) atomically sets the value at the memory location address to val and returns the old value. So the idea behind my locking mechanism is that each lock is initially 0u, so one thread should get past the while loop and all other threads should spin in the while loop, since they will read locks[id] as 1u. Then, when the thread is done with the critical section, it resets the lock to 0u so another thread can enter.

What am I missing?

By the way, I am compiling with:

nvcc -arch sm_11 -Ipath/to/cuda/C/common/inc XXX.cu

Okay, I figured it out, and this is yet-another-one-of-the-cuda-paradigm-pains.

As any good CUDA programmer knows (notice that I did not remember this, which I think makes me a bad CUDA programmer), all threads in a warp execute the same code in lockstep. The code I wrote would work perfectly if not for this fact. As it is, however, two threads in the same warp are likely to contend for the same lock. If one of them acquires the lock, it is done with the loop, but it cannot continue past the loop until all other threads in its warp have also left the loop. Unfortunately, the other thread will never leave the loop, because it is waiting for the first one to unlock.

Here is a kernel that will do the trick without error:

__global__ void k_testLocking(unsigned int* locks, int n) {
    int id = threadIdx.x % n;
    bool leaveLoop = false;
    while (!leaveLoop) {
        if (atomicExch(&(locks[id]), 1u) == 0u) {
            //critical section
            leaveLoop = true;
            atomicExch(&(locks[id]),0u);
        }
    } 
}
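
To check that the loop above really lets every thread through its critical section exactly once, one option is to put a plain (non-atomic) increment inside it and compare the total with the thread count. The following is only a minimal sketch under assumptions: the names k_countWithLocks and runLockedCount are made up for illustration, the counters array is not part of the original test, and a __threadfence() is added so the counter update is published before the lock is released.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: same locking loop as above, with a plain increment
// of counters[id] standing in for the critical section.
__global__ void k_countWithLocks(unsigned int* locks, volatile unsigned int* counters, int n) {
    int id = threadIdx.x % n;
    bool leaveLoop = false;
    while (!leaveLoop) {
        if (atomicExch(&locks[id], 1u) == 0u) {
            counters[id] = counters[id] + 1u;  // critical section (not atomic on purpose)
            __threadfence();                   // publish the update before unlocking
            leaveLoop = true;
            atomicExch(&locks[id], 0u);        // unlock
        }
    }
}

// Hypothetical host helper: launches a single block and returns the sum of
// all counters, or -1 if any CUDA call fails.
int runLockedCount(int numThreads, int numLocks) {
    size_t bytes = numLocks * sizeof(unsigned int);
    unsigned int* locks = nullptr;
    unsigned int* counters = nullptr;
    if (cudaMalloc((void**)&locks, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&counters, bytes) != cudaSuccess) { cudaFree(locks); return -1; }
    cudaMemset(locks, 0, bytes);     // all locks start unlocked
    cudaMemset(counters, 0, bytes);  // all counters start at zero

    k_countWithLocks<<<1, numThreads>>>(locks, counters, numLocks);

    std::vector<unsigned int> host(numLocks, 0u);
    cudaError_t err = cudaGetLastError();                    // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();   // execution errors
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), counters, bytes, cudaMemcpyDeviceToHost);
    cudaFree(locks);
    cudaFree(counters);
    if (err != cudaSuccess) return -1;

    int total = 0;
    for (unsigned int c : host) total += (int)c;
    return total;
}
```

With the launch configuration from the question, runLockedCount(256, 10) should return 256 if no increment was lost.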

The poster has already found an answer to his own issue. Nevertheless, the code below provides a general framework for implementing a critical section in CUDA. In more detail, the code counts thread blocks, but it is easily modifiable to host other operations that must be performed in a critical section. After the code, I also give some explanation, along with some typical mistakes made when implementing critical sections in CUDA.

THE CODE

#include <stdio.h>

#include "Utilities.cuh"

#define NUMBLOCKS  512
#define NUMTHREADS (512 * 2)

/***************/
/* LOCK STRUCT */
/***************/
struct Lock {

    int *d_state;

    // --- Constructor
    Lock(void) {
        int h_state = 0;                                        // --- Host side lock state initializer
        gpuErrchk(cudaMalloc((void **)&d_state, sizeof(int)));  // --- Allocate device side lock state
        gpuErrchk(cudaMemcpy(d_state, &h_state, sizeof(int), cudaMemcpyHostToDevice)); // --- Initialize device side lock state
    }

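    // --- Destructor
    // NOTE: nvcc defines __CUDACC__ for .cu files, so the cudaFree branch below
    // is not compiled there and d_state is not released by this destructor.
    // Lock is passed to the kernels by value, so freeing here could release
    // d_state while another copy of the Lock still points to it.
    __host__ __device__ ~Lock(void) { 
#if !defined(__CUDACC__)
        gpuErrchk(cudaFree(d_state)); 
#else

#endif  
    }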

    // --- Lock function
    __device__ void lock(void) { while (atomicCAS(d_state, 0, 1) != 0); }

    // --- Unlock function
    __device__ void unlock(void) { atomicExch(d_state, 0); }
};

/*************************************/
/* BLOCK COUNTER KERNEL WITHOUT LOCK */
/*************************************/
__global__ void blockCountingKernelNoLock(int *numBlocks) {

    if (threadIdx.x == 0) { numBlocks[0] = numBlocks[0] + 1; }
}

/**********************************/
/* BLOCK COUNTER KERNEL WITH LOCK */
/**********************************/
__global__ void blockCountingKernelLock(Lock lock, int *numBlocks) {

    if (threadIdx.x == 0) {
        lock.lock();
        numBlocks[0] = numBlocks[0] + 1;
        lock.unlock();
    }
}

/****************************************/
/* BLOCK COUNTER KERNEL WITH WRONG LOCK */
/****************************************/
__global__ void blockCountingKernelDeadlock(Lock lock, int *numBlocks) {

    lock.lock();
    if (threadIdx.x == 0) { numBlocks[0] = numBlocks[0] + 1; }
    lock.unlock();
}

/********/
/* MAIN */
/********/
int main(){

    int h_counting, *d_counting;
    Lock lock;

    gpuErrchk(cudaMalloc(&d_counting, sizeof(int)));

    // --- Unlocked case
    h_counting = 0;
    gpuErrchk(cudaMemcpy(d_counting, &h_counting, sizeof(int), cudaMemcpyHostToDevice));

    blockCountingKernelNoLock<<<NUMBLOCKS, NUMTHREADS>>>(d_counting);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(&h_counting, d_counting, sizeof(int), cudaMemcpyDeviceToHost));
    printf("Counting in the unlocked case: %i\n", h_counting);

    // --- Locked case
    h_counting = 0;
    gpuErrchk(cudaMemcpy(d_counting, &h_counting, sizeof(int), cudaMemcpyHostToDevice));

    blockCountingKernelLock<<<NUMBLOCKS, NUMTHREADS>>>(lock, d_counting);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(&h_counting, d_counting, sizeof(int), cudaMemcpyDeviceToHost));
    printf("Counting in the locked case: %i\n", h_counting);

    gpuErrchk(cudaFree(d_counting));
}

CODE EXPLANATION

A critical section is a sequence of operations that must be executed by only one CUDA thread at a time.

Suppose we want to write a kernel that computes the number of thread blocks in a grid. One possible idea is to let the thread with threadIdx.x == 0 in each block increase a global counter. To prevent race conditions, the increases must occur one at a time, so they must be placed in a critical section.

The code above launches two kernel functions: blockCountingKernelNoLock and blockCountingKernelLock. The former does not use a critical section to increase the counter and, as one can see by running it, returns wrong results. The latter wraps the counter increase in a critical section and so produces correct results. But how does the critical section work?

The critical section is governed by a global state d_state. Initially, the state is 0. Two __device__ methods, lock and unlock, can change this state. In this scheme, lock and unlock must be invoked by only a single thread within each block, namely the thread with local thread index threadIdx.x == 0.

At some unpredictable point during the execution, one of the threads with local thread index threadIdx.x == 0 and global thread index, say, t will be the first to invoke the lock method. In particular, it will execute atomicCAS(d_state, 0, 1). Since initially d_state == 0, d_state will be updated to 1, atomicCAS will return 0, and the thread will exit the lock function and move on to the update instruction. While thread t performs these operations, the threads with threadIdx.x == 0 of all the other blocks will execute the lock method as well. They will, however, find d_state equal to 1, so atomicCAS(d_state, 0, 1) will perform no update and will return 1, leaving these threads spinning in the while loop. Once thread t has completed the update, it executes the unlock function, namely atomicExch(d_state, 0), thus restoring d_state to 0. At this point, another of the threads with threadIdx.x == 0, again in no predictable order, will acquire the lock.
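
The return values described above can be observed directly. The sketch below is only an illustration: the probe kernel and the host helper runProbe are hypothetical names, not part of the code above. A single thread calls atomicCAS twice and atomicExch once on a state that starts at 0, recording what each call returns.

```cuda
#include <cuda_runtime.h>

// Hypothetical probe kernel, meant to be launched with a single thread.
__global__ void probeAtomicCAS(int* state, int* out) {
    out[0] = atomicCAS(state, 0, 1);  // state is 0: swapped to 1, old value 0 returned
    out[1] = atomicCAS(state, 0, 1);  // state is 1: no update, 1 returned
    out[2] = atomicExch(state, 0);    // unlock: writes 0, old value 1 returned
    out[3] = atomicAdd(state, 0);     // atomic read of the state, now 0 again
}

// Hypothetical host helper: fills out[0..3] and returns 0 on success, -1 on error.
int runProbe(int out[4]) {
    int* d_state = nullptr;
    int* d_out = nullptr;
    if (cudaMalloc((void**)&d_state, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&d_out, 4 * sizeof(int)) != cudaSuccess) { cudaFree(d_state); return -1; }
    cudaMemset(d_state, 0, sizeof(int));  // lock state starts at 0 (unlocked)

    probeAtomicCAS<<<1, 1>>>(d_state, d_out);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(out, d_out, 4 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_state);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : -1;
}
```

After a successful call, out holds 0, 1, 1, 0: the first atomicCAS acquires the lock, the second one fails the way a waiting thread would, and atomicExch releases it.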

The code above also contains a third kernel function, blockCountingKernelDeadlock. This is another wrong implementation of the critical section, and it leads to deadlocks. Indeed, recall that the threads of a warp execute in lockstep. So, when we execute blockCountingKernelDeadlock, it is possible that one of the threads in a warp, say a thread with local thread index t≠0, will lock the state. In that case, the other threads in the same warp as t, including the one with threadIdx.x == 0, will keep executing the same while loop statement as thread t, because threads in the same warp are executed in lockstep. Thread t cannot proceed to unlock until its warp-mates leave the loop, and they cannot leave the loop until the state is unlocked. Accordingly, all the threads wait for someone to unlock the state, no thread is able to do so, and the code is stuck in a deadlock.

By the way, you have to remember that global memory writes and reads are not necessarily completed at the point where you write them in the code, so for this to work in practice you need to add a global memory fence, i.e. __threadfence().
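
One way to place the fence is sketched below. This is an assumption about where the fences would go, not the original poster's code: the functions lockWithFence, unlockWithFence, countBlocksFenced and countBlocks are hypothetical, and the lock state is passed as a raw pointer rather than through the Lock struct to keep the sketch short.

```cuda
#include <cuda_runtime.h>

// Hypothetical fenced lock: spin until the state is swapped from 0 to 1.
__device__ void lockWithFence(int* state) {
    while (atomicCAS(state, 0, 1) != 0) { }
    __threadfence();  // keep critical-section accesses after the acquisition
}

// Hypothetical fenced unlock.
__device__ void unlockWithFence(int* state) {
    __threadfence();       // make critical-section writes visible first
    atomicExch(state, 0);  // then release the lock
}

// One thread per block takes the lock, as in blockCountingKernelLock.
__global__ void countBlocksFenced(int* state, volatile int* numBlocks) {
    if (threadIdx.x == 0) {
        lockWithFence(state);
        numBlocks[0] = numBlocks[0] + 1;  // critical section
        unlockWithFence(state);
    }
}

// Hypothetical host helper: returns the counted blocks, or -1 on a CUDA error.
int countBlocks(int blocks, int threads) {
    int* d_state = nullptr;
    int* d_count = nullptr;
    if (cudaMalloc((void**)&d_state, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&d_count, sizeof(int)) != cudaSuccess) { cudaFree(d_state); return -1; }
    cudaMemset(d_state, 0, sizeof(int));  // unlocked
    cudaMemset(d_count, 0, sizeof(int));  // counter starts at zero

    countBlocksFenced<<<blocks, threads>>>(d_state, d_count);

    int h_count = -1;
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_state);
    cudaFree(d_count);
    return err == cudaSuccess ? h_count : -1;
}
```

The fence in unlockWithFence is the one that matters most here: without it, the next thread to acquire the lock could still read a stale counter value. The counter is also accessed through a volatile pointer so the compiler does not keep it in a register.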