Here is my code trying to do a reduction to find the maximum of a 50-value array within a block. I have padded the array to 64.
For threads 1-31 the correct maxVal prints out, but for threads 32-49 it's a completely random number. I don't know what I am doing wrong.
By the way, I thought I didn't need to __syncthreads() after every line when unrolling, but apparently I do. Any suggestion about that?
Thanks in advance for any help.
//block size = 50
__syncthreads();
if (tid<32){
cptmp[tid] = (cptmp[tid] < cptmp[tid+32]) ? cptmp[tid+32] : cptmp[tid]; __syncthreads();
cptmp[tid] = (cptmp[tid] < cptmp[tid+16]) ? cptmp[tid+16] : cptmp[tid]; __syncthreads();
cptmp[tid] = (cptmp[tid] < cptmp[tid+ 8]) ? cptmp[tid+ 8] : cptmp[tid]; __syncthreads();
cptmp[tid] = (cptmp[tid] < cptmp[tid+ 4]) ? cptmp[tid+ 4] : cptmp[tid]; __syncthreads();
cptmp[tid] = (cptmp[tid] < cptmp[tid+ 2]) ? cptmp[tid+ 2] : cptmp[tid]; __syncthreads();
cptmp[tid] = (cptmp[tid] < cptmp[tid+ 1]) ? cptmp[tid+ 1] : cptmp[tid]; __syncthreads();
}
__syncthreads();
//if (tid==0) {
maxVal=cptmp[0];
if(bix==0 && biy==0) cuPrintf(" max:%f x:%d y:%d\n", maxVal, blockIdx.x, blockIdx.y);
//}
Here is a more efficient (at least on Fermi GPUs) and correct code using volatile. Replace T with your type (or use a template):
if (tid<32) {
volatile T *c = cptmp;
T t = c[tid];
c[tid] = t = (t < c[tid+32]) ? c[tid+32] : t;
c[tid] = t = (t < c[tid+16]) ? c[tid+16] : t;
c[tid] = t = (t < c[tid+ 8]) ? c[tid+ 8] : t;
c[tid] = t = (t < c[tid+ 4]) ? c[tid+ 4] : t;
c[tid] = t = (t < c[tid+ 2]) ? c[tid+ 2] : t;
c[tid] = t = (t < c[tid+ 1]) ? c[tid+ 1] : t;
}
Why is this more efficient? Well, for correctness in the absence of __syncthreads() we must use a volatile pointer to shared memory. But that forces the compiler to "honor" all reads from and writes to shared memory -- it can't optimize and keep anything in registers. So by explicitly keeping c[tid] in the temporary t, we save one shared memory load per line of code. And since Fermi is a load/store architecture that can only use registers as instruction operands, we save an instruction per line, or 6 instructions total (about 25% overall, I expect).
On the old T10/GT200 architecture and earlier, your code (with volatile and no __syncthreads()) would be equally efficient because that architecture could source one operand per instruction directly from shared memory.
This code should be equivalent if you prefer if statements over the ?: operator:
if (tid<32) {
volatile T *c = cptmp;
T t = c[tid];
if (t < c[tid+32]) c[tid] = t = c[tid+32];
if (t < c[tid+16]) c[tid] = t = c[tid+16];
if (t < c[tid+ 8]) c[tid] = t = c[tid+ 8];
if (t < c[tid+ 4]) c[tid] = t = c[tid+ 4];
if (t < c[tid+ 2]) c[tid] = t = c[tid+ 2];
if (t < c[tid+ 1]) c[tid] = t = c[tid+ 1];
}
Do not use __syncthreads() in divergent code! Either all threads of a given block or none of them should reach each __syncthreads(), at the same location.
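As a hedged sketch of the safe pattern (array and variable names here are illustrative, not from the question), keep the barrier outside the divergent branch:

```cuda
// Sketch: tree reduction where __syncthreads() is reached by ALL threads.
// sdata and tid are assumed to be set up earlier in the kernel.
for (int s = 32; s > 0; s >>= 1) {
    if (tid < s)                                 // divergent work...
        sdata[tid] = max(sdata[tid], sdata[tid + s]);
    __syncthreads();                             // ...but the barrier sits
}                                                // outside the if, so every
                                                 // thread in the block hits it
```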
All threads in a single warp (32 threads) are implicitly synchronised, so you don't need __syncthreads() to keep them together. However, if you are worried that shared-memory writes by one thread may not be visible to another thread in the same warp, use __threadfence_block().
To elaborate on the importance of __threadfence_block(), consider the following two lines:
cptmp[tid]=(cptmp[tid]< cptmp[tid+2]) ? cptmp[tid+2] : cptmp[tid];
cptmp[tid]=(cptmp[tid]< cptmp[tid+1]) ? cptmp[tid+1] : cptmp[tid];
It may compile into something like this:
int tmp; //assuming that cptmp is an array of int-s
tmp=cptmp[tid];
tmp=(tmp<cptmp[tid+2])?cptmp[tid+2]:tmp;
tmp=(tmp<cptmp[tid+1])?cptmp[tid+1]:tmp;
cptmp[tid]=tmp;
While this would be correct for single-threaded code, it obviously fails in CUDA: each thread would read neighbouring values before its neighbours have written their results back.
To prevent optimisations like that, either declare your cptmp array as volatile, or add a __threadfence_block() between the lines. That function ensures that all shared-memory writes of the current thread are visible to all threads of the same block before the function exits.
A similar __threadfence()
function exists to ensure global-memory visibility.
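For illustration, here is a minimal producer sketch of the usual pattern for which __threadfence() exists: publishing a result through global memory so that the data write is guaranteed to be visible before the flag write. All names here are hypothetical, not from the thread:

```cuda
// Sketch: make a global-memory data write visible device-wide before
// setting a "ready" flag. g_result/g_ready are illustrative names.
__device__ float g_result;
__device__ int   g_ready;

__global__ void producer(float value) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        g_result = value;   // write the data...
        __threadfence();    // ...ensure it is visible to the whole device...
        g_ready  = 1;       // ...then publish the flag
    }
}
```

A consumer polling g_ready would still need to read it through a volatile pointer (or an atomic) to prevent the compiler from caching the flag in a register.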
For everybody who stumbles upon this thread in the future, as I did, here is a tip in addition to harrism's answer: from a performance point of view it may be worth considering warp shuffle operations. The updated code to get the max of 64 elements using a single warp would look like this:
auto localMax = max(c[tid], c[tid + 32]);
for (auto i = 16; i >= 1; i /= 2)
{
    // On CUDA 9+, __shfl_xor is deprecated in favour of __shfl_xor_sync
    // with an explicit full-warp mask.
    localMax = max(localMax, __shfl_xor_sync(0xffffffff, localMax, i));
}
c[tid] = localMax;
Only two reads of and one write to the cptmp array are needed; the rest of the reduction happens entirely in registers via shuffles, so it is pretty neat.
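Putting the pieces together, a minimal self-contained kernel might look like the sketch below. The kernel name, the fixed 64-element block size, and the input/output layout are illustrative assumptions, not code from the thread:

```cuda
#include <cuda_runtime.h>

// Sketch: each block of 64 threads reduces 64 floats to one maximum.
// Launch as blockMax64<<<numBlocks, 64>>>(d_in, d_out);
__global__ void blockMax64(const float *in, float *out) {
    __shared__ float cptmp[64];
    int tid = threadIdx.x;                       // blockDim.x == 64 assumed
    cptmp[tid] = in[blockIdx.x * 64 + tid];      // padded input assumed
    __syncthreads();                             // all 64 threads reach this

    if (tid < 32) {
        // One shared-memory read per thread, then shuffles in registers.
        float localMax = max(cptmp[tid], cptmp[tid + 32]);
        for (int i = 16; i >= 1; i /= 2)
            localMax = max(localMax, __shfl_xor_sync(0xffffffff, localMax, i));
        if (tid == 0) out[blockIdx.x] = localMax;
    }
}
```

With __shfl_xor_sync every lane of the warp ends up holding the maximum, so lane 0 can write the result without any further synchronisation.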