Using device variable by multiple threads on CUDA

Developer https://www.devze.com 2022-12-22 20:57 Source: web

I am playing around with CUDA and have run into a problem. I am testing a large array for particular responses, and when I get a response, I have to copy the data into another array.

For example, my test array of 6 elements looks like this:

[ ][ ][v1][ ][ ][v2]

Result must look like this:

[v1][v2]

The problem is: how do I calculate the address in the second array at which to store each result? All elements of the first array are checked in parallel.

I am thinking of declaring a device variable int addr = 0. Every time I find a response, I will increment addr. But I am not sure about that, because it means that addr may be accessed by multiple threads at the same time. Will that cause problems? Or will each thread wait until another thread finishes using that variable?
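To make the counter idea above concrete: a plain `addr++` from many threads is a data race, because threads do not wait for each other on an ordinary read-modify-write. An atomic increment such as `atomicAdd` does serialize the update and hands each thread a unique slot. The following is only a minimal sketch under assumptions that are not in the question: the names `compact_atomic` and `compact_with_atomic` are hypothetical, a "response" is assumed to mean a nonzero `int`, and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread tests one element and, on a hit,
// reserves a unique output slot by atomically incrementing *count.
// Assumption: a "response" is any nonzero value.
__global__ void compact_atomic(const int* in, unsigned int n,
                               int* out, unsigned int* count)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    if (in[idx] != 0) {
        // atomicAdd returns the value held before the increment,
        // so no two threads can receive the same slot.
        unsigned int slot = atomicAdd(count, 1u);
        out[slot] = in[idx];
    }
}

// Hypothetical host helper; CUDA error checking omitted for brevity.
std::vector<int> compact_with_atomic(const std::vector<int>& host_in)
{
    const unsigned int n = static_cast<unsigned int>(host_in.size());
    if (n == 0) return {};

    int* dev_in = nullptr;
    int* dev_out = nullptr;
    unsigned int* dev_count = nullptr;
    cudaMalloc(&dev_in, n * sizeof(int));
    cudaMalloc(&dev_out, n * sizeof(int));
    cudaMalloc(&dev_count, sizeof(unsigned int));
    cudaMemcpy(dev_in, host_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dev_count, 0, sizeof(unsigned int));

    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;
    compact_atomic<<<blocks, threads>>>(dev_in, n, dev_out, dev_count);

    // This blocking copy also waits for the kernel to finish.
    unsigned int count = 0;
    cudaMemcpy(&count, dev_count, sizeof(count), cudaMemcpyDeviceToHost);

    std::vector<int> result(count);
    if (count > 0) {
        cudaMemcpy(result.data(), dev_out, count * sizeof(int),
                   cudaMemcpyDeviceToHost);
    }
    cudaFree(dev_in);
    cudaFree(dev_out);
    cudaFree(dev_count);
    return result;
}
```

Note the trade-off: the slots are handed out in whatever order the threads happen to reach the `atomicAdd`, so the output is not guaranteed to keep the input order, and every hit contends on the same counter.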

It is not as trivial as it seems. I just finished implementing one, and I can tell you that what you need to read is the scan article in GPU Gems 3, in particular section 39.3.1, Stream Compaction.

To implement your own, start from the LargeArrayScan example in the SDK, which gives you just the prescan. Assume you have the following arrays in device memory, all of size N: dev_selection_array (an array of 1s and 0s, where 1 means select and 0 means discard), dev_elements_array (the elements to be selected from), dev_prescan_array, and dev_result_array. Then you do

prescan(dev_prescan_array,dev_selection_array, N);
scatter(dev_result_array, dev_prescan_array,
         dev_selection_array, dev_elements_array, N);

where the scatter is

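 // Index and flag arrays are assumed to hold unsigned int, not T.
 template <typename T>
 __global__ void scatter_kernel(T* dev_result_array,
                    const unsigned int* dev_prescan_array,
                    const unsigned int* dev_selection_array,
                    const T* dev_elements_array, std::size_t size){

     // One thread per input element.
     unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
     if (idx >= size) return;
     // The prescan value of a selected element is its output position.
     if (dev_selection_array[idx] == 1){
         dev_result_array[dev_prescan_array[idx]] = dev_elements_array[idx];
     }
 }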
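To see how the prescan and the scatter fit together, here is a minimal end-to-end sketch. It is not the SDK's LargeArrayScan code: it assumes `thrust::exclusive_scan` as a stand-in for the prescan step, and the names `scatter_sketch` and `compact_with_scan` are hypothetical. For the selection flags `0 0 1 0 0 1`, the exclusive scan yields `0 0 0 1 1 1`, so the two selected elements land in slots 0 and 1.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Same idea as the scatter kernel above, with unsigned int flags and offsets.
template <typename T>
__global__ void scatter_sketch(T* result,
                               const unsigned int* prescan,
                               const unsigned int* selection,
                               const T* elements, std::size_t size)
{
    std::size_t idx = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (idx >= size) return;
    if (selection[idx] == 1) {
        result[prescan[idx]] = elements[idx];
    }
}

// Hypothetical host helper: the flags must be 0 or 1, one per element.
std::vector<int> compact_with_scan(const std::vector<int>& elements,
                                   const std::vector<unsigned int>& selection)
{
    const std::size_t n = elements.size();
    if (n == 0) return {};

    thrust::device_vector<int> dev_elements(elements.begin(), elements.end());
    thrust::device_vector<unsigned int> dev_selection(selection.begin(), selection.end());
    thrust::device_vector<unsigned int> dev_prescan(n);

    // Exclusive scan: prescan[i] = number of selected elements before i.
    thrust::exclusive_scan(dev_selection.begin(), dev_selection.end(),
                           dev_prescan.begin());

    // Total selected = last offset + last flag.
    const unsigned int last_offset = dev_prescan[n - 1];
    const unsigned int last_flag = dev_selection[n - 1];
    const unsigned int count = last_offset + last_flag;

    thrust::device_vector<int> dev_result(count);
    const unsigned int threads = 256;
    const unsigned int blocks = static_cast<unsigned int>((n + threads - 1) / threads);
    scatter_sketch<int><<<blocks, threads>>>(
        thrust::raw_pointer_cast(dev_result.data()),
        thrust::raw_pointer_cast(dev_prescan.data()),
        thrust::raw_pointer_cast(dev_selection.data()),
        thrust::raw_pointer_cast(dev_elements.data()), n);
    cudaDeviceSynchronize();

    std::vector<int> result(count);
    thrust::copy(dev_result.begin(), dev_result.end(), result.begin());
    return result;
}
```

Because each output position comes from the scan rather than from a shared counter, the selected elements keep their original relative order and no atomics are needed.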

For other nice applications of the prescan, see the paper Ble93.

Have fun!

You're talking about classic stream compaction. Generally I would recommend looking at Thrust or CUDPP (those links go to the compaction documentation). Both of these are open source. If you want to roll your own, then I would also suggest looking at the 'scan' SDK sample.
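As a rough illustration of the Thrust route, `thrust::copy_if` performs the whole compaction in one call. This is only a sketch: the predicate `is_response` and the helper `compact_with_thrust` are hypothetical names, and treating a nonzero `int` as a "response" is an assumption, not something stated in the question.

```cuda
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Hypothetical predicate; assumption: a "response" is any nonzero value.
struct is_response {
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

// Hypothetical host helper: returns the elements that satisfy is_response.
std::vector<int> compact_with_thrust(const std::vector<int>& input)
{
    thrust::device_vector<int> dev_in(input.begin(), input.end());
    thrust::device_vector<int> dev_out(input.size());

    // copy_if returns an iterator one past the last element written.
    auto out_end = thrust::copy_if(dev_in.begin(), dev_in.end(),
                                   dev_out.begin(), is_response());

    std::vector<int> result(out_end - dev_out.begin());
    thrust::copy(dev_out.begin(), out_end, result.begin());
    return result;
}
```

`thrust::copy_if` is stable, so the selected elements come out in their input order.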