开发者

CUDA: Max of array, how to prevent write collisions?

开发者 https://www.devze.com 2023-03-19 12:05 出处:网络
I have an array of doubles stored in GPU global memory and i need to find the maximum value in it. I have read some texts about parallel reduction, so i know that one should divide the array between b

I have an array of doubles stored in GPU global memory and i need to find the maximum value in it. I have read some texts about parallel reduction, so i know that one should divide the array between blocks and make them find their "global 开发者_如何学Pythonmaximum", and so on. But they never seem to address the issue of threads trying to write to the same memory position simultaneously.

Let's say that local_max=0.0 in the beginning of a block execution. Then each thread reads their value from the input vector, decides that is larger than local_max, and then try to write their value to local_max. When all of this happens at the exact same time (atleast when inside the same warp), how can this work and end up with the actual maximum within this block?

I would think either an atomic function or some kind of lock or critical section would be needed, but i haven't seen this addressed in the answers i have found. (ex http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf )


The answer to your questions are contained in the very document you linked to, and the SDK reduction example shows concrete implementations of the reduction concept.

For completeness, here is a concrete example of a reduction kernel:

template <typename T, int BLOCKSIZE>
__global__ reduction(T *inputvals, T *outputvals, int N)
{
    __shared__ volatile T data[BLOCKSIZE];

    T maxval = inputvals[threadIdx.x];
    for(int i=blockDim.x + threadIdx.x; i<N; i+=blockDim.x) 
    {
        maxfunc(maxval, inputvals[i]);
    }

    data[threadIdx.x] = maxval;
    __syncthreads();

    // Here maxfunc(a,b) sets a to the minimum of a and b
    if (threadIdx.x < 32) {

        for(int i=32+threadIdx.x; i < BLOCKSIZE; i+= 32) {
            maxfunc(data[threadIdx.x], data[i]);
        }

        if (threadIdx.x < 16) maxfunc(data[threadIdx.x], data[threadIdx.x+16]);
        if (threadIdx.x < 8) maxfunc(data[threadIdx.x], data[threadIdx.x+8]);
        if (threadIdx.x < 4) maxfunc(data[threadIdx.x], data[threadIdx.x+4]);
        if (threadIdx.x < 2) maxfunc(data[threadIdx.x], data[threadIdx.x+2]);
        if (threadIdx.x == 0) {
            maxfunc(data[0], data[1]);
            outputvals[blockIdx.x] = data[0];
        }
    }
}

The key point is using the synchronization that is implicit within a warp to perform the reduction in shared memory. The result is a single per-block maximum value. A second reduction pass is required to reduce the set of block maximums to the global maximum (often it is faster to o this on the host). In this example, maxvals is the "compare and set" function which could be as simple as

template<T>
__device__ void maxfunc(T & a, T  & b)
{
    a = (b > a) ? b : a;
}


Dont' cook your own code, use some thrust (included in version 4.0 of the Cuda sdk) :

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <iostream>

int main(void)
{

    thrust::host_vector<int> h_vec(10000);
    thrust::sequence(h_vec.begin(), h_vec.end());
    // show hvec
    thrust::copy(h_vec.begin(), h_vec.end(), 
                 std::ostream_iterator<int>(std::cout, "\n"));

    // transfer to device
    thrust::device_vector<int> d_vec = h_vec;

    int max_dvec_value = *thrust::max_element(d_vec.begin(), d_vec.end()); 

    std::cout << "max value: " << max_dvec_value << "\n";
    return 0;    
}

And watch out that thrust::max_element returns a pointer.


Your question is clearly answered in the document you link to. I think you just need to spend some more time reading it and understanding the CUDA concepts used in it. In particular, I would focus on shared memory, the __syncthreads() method, and how to uniquely identify a thread while inside a kernel. Additionally, you should try to understand why the reduction may need to be run in 2 passes to find the global maximum.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号