
CUDA Thread divergence and branches, examples

开发者 https://www.devze.com 2023-03-08 17:10 Source: Internet

I have some examples that are giving me headaches: I produce thread divergence, but I cannot figure out which branch, or which statements, are executed first.

First example:

I launch the following kernel with 2 threads in 1 block, with a[0] = 0 and a[1] = 0.

__global__ void branchTest_kernel(float* a) {
  int tx = threadIdx.x;

  if (tx == 0) {               // or tx == 1
     a[1] = a[0] + 1;          // (a)
  } else if (tx == 1) {        // or tx == 0
     a[0] = a[1] + 1;          // (b)
  }
}

Output

a[0] = 1  
a[1] = 1 

I assume that, because the two threads are in one warp, they execute in lockstep, so (a) and (b) both read a[0] and a[1] at the same time, before either write lands.

Second example:

Exactly the same as the first, but with the else if replaced by a plain else:

__global__ void branchTest_kernel(float* a) {
  int tx = threadIdx.x;

  if (tx == 0) {
     a[1] = a[0] + 1;          // (a)
  } else {
     a[0] = a[1] + 1;          // (b)
  }
}

Output

a[0] = 1  
a[1] = 2 

What causes this behaviour, where (b) is suddenly executed first and (a) second (the innermost branch first, probably)? Can somebody explain what the precedence rules are for branches, or where to find such information?

I encountered this example while implementing a Gauss-Seidel solver: Gauss Seidel (see Figure 3, (a) diagonal block).


There are no precedence rules for branch execution order within a warp in CUDA; the behaviour is undefined. The compiler, assembler, and JIT runtime are free to reorder instructions as they see fit, and you absolutely must not rely on whatever order you deduce empirically, because it can change (as you have found out). The only way to enforce formal correctness in that sort of situation is to use an atomic memory access operation, which will force serialization. Better still, look for another algorithm.
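If what you actually want is a fixed order between (a) and (b), one option inside a single block is to make that order explicit with a barrier rather than hoping the branches run in a particular sequence. Note that this swaps in __syncthreads() instead of an atomic operation, and it only works because both threads are in the same block. The names orderedBranch_kernel and runOrdered are made up for this sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of branchTest_kernel: a block-wide barrier between
// the two updates pins the order to (a) first, then (b).
__global__ void orderedBranch_kernel(float* a) {
    int tx = threadIdx.x;

    if (tx == 0) {
        a[1] = a[0] + 1;  // (a)
    }
    // The barrier sits outside the branches, so every thread of the block
    // reaches it. Global-memory writes made before it are visible to the
    // whole block after it.
    __syncthreads();
    if (tx == 1) {
        a[0] = a[1] + 1;  // (b)
    }
}

// Hypothetical host helper: runs the kernel on a = {0, 0} with 2 threads
// in 1 block and copies the result back into out.
bool runOrdered(float out[2]) {
    float* d_a = nullptr;
    if (cudaMalloc(&d_a, 2 * sizeof(float)) != cudaSuccess) return false;
    cudaMemset(d_a, 0, 2 * sizeof(float));  // all-zero bytes are 0.0f
    orderedBranch_kernel<<<1, 2>>>(d_a);
    // cudaMemcpy waits for the kernel and reports any error it raised.
    cudaError_t err = cudaMemcpy(out, d_a, 2 * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    return err == cudaSuccess;
}

int main() {
    float a[2];
    if (!runOrdered(a)) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    printf("a[0] = %g\na[1] = %g\n", a[0], a[1]);
    return 0;
}
```

With (a) forced to run before (b), this should give a[1] = 1 and a[0] = 2 regardless of how the compiler lays out the branches. Across blocks there is no such barrier, which is one reason a different algorithm is usually the better fix.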

In your Gauss-Seidel case, the orthodox approach is to use a separate kernel launch for each color in the graph decomposition of the matrix or computational grid.
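As a rough illustration of the one-launch-per-color idea, here is a minimal red-black sketch. It assumes a 1D Poisson problem -u'' = f on a uniform grid with fixed boundary values, which is not the matrix from your paper, and the names gaussSeidelColor_kernel and redBlackGaussSeidel are hypothetical. The two colors are simply the even and odd grid indices:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Updates only the interior points whose index parity matches `color`.
// A point of one color depends only on neighbours of the other color, so
// no thread reads a value that another thread of the same launch writes.
__global__ void gaussSeidelColor_kernel(float* u, const float* f,
                                        float h2, int n, int color) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 1 && i < n - 1 && (i % 2) == color) {
        u[i] = 0.5f * (u[i - 1] + u[i + 1] + h2 * f[i]);
    }
}

// Hypothetical host driver: runs `sweeps` red-black sweeps with grid
// spacing h, keeping the boundary values u[0] and u[n-1] fixed.
bool redBlackGaussSeidel(std::vector<float>& u, const std::vector<float>& f,
                         float h, int sweeps) {
    int n = static_cast<int>(u.size());
    size_t bytes = n * sizeof(float);
    float* d_u = nullptr;
    float* d_f = nullptr;
    if (cudaMalloc(&d_u, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_f, bytes) != cudaSuccess) { cudaFree(d_u); return false; }
    cudaMemcpy(d_u, u.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_f, f.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    for (int s = 0; s < sweeps; ++s) {
        // One launch per color. Launches in the same stream run in order,
        // so the second color sees the first color's finished updates.
        for (int color = 0; color < 2; ++color) {
            gaussSeidelColor_kernel<<<blocks, threads>>>(d_u, d_f, h * h,
                                                         n, color);
        }
    }
    cudaError_t err = cudaMemcpy(u.data(), d_u, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_u);
    cudaFree(d_f);
    return err == cudaSuccess;
}

int main() {
    const int n = 9;
    std::vector<float> u(n, 0.0f), f(n, 0.0f);
    u[n - 1] = 1.0f;  // boundary conditions u(0) = 0, u(1) = 1
    if (!redBlackGaussSeidel(u, f, 1.0f / (n - 1), 200)) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) printf("u[%d] = %f\n", i, u[i]);
    return 0;
}
```

With f = 0 the exact solution is the straight line between the two boundary values, so the printed values should approach i / (n - 1). The ordering guarantee comes from the launch boundary between colors, not from anything that happens inside a warp.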
