CUDA - what is this loop doing

Hey, I've seen this example kernel on a website:

    __global__ void loop1( int N, float alpha, float* x, float* y ) {
        int i;
        int i0 = blockIdx.x*blockDim.x + threadIdx.x;

        for (i = i0; i < N; i += blockDim.x*gridDim.x) {
            y[i] = alpha*x[i] + y[i];
        }
    }

It computes the equivalent of this C loop:

   for(i=0;i<N;i++) {
      y[i] = alpha*x[i] + y[i];
   }

Surely the for loop inside the kernel isn't necessary? Couldn't you just do y[i0] = alpha*x[i0] + y[i0] and remove the for loop altogether?

I'm just curious as to why it's there and what its purpose is. This is assuming a kernel call such as loop1<<<64,256>>>, so presumably gridDim.x = 1.
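
For reference, a minimal sketch of the loop-free variant asked about above (the name loop1_noloop and the bounds check are illustrative additions, not from the original kernel); it is only correct if the launch configuration covers all N elements:

    // Hypothetical one-element-per-thread variant: only correct if
    // gridDim.x * blockDim.x >= N, hence the bounds check.
    __global__ void loop1_noloop( int N, float alpha, float* x, float* y ) {
        int i0 = blockIdx.x*blockDim.x + threadIdx.x;
        if (i0 < N) {
            y[i0] = alpha*x[i0] + y[i0];
        }
    }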


You need the for loop in the kernel if your vector has more entries than you have launched threads. If possible, it is of course more efficient to launch enough threads.
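
A rough sketch of "starting enough threads" on the host side (variable names are mine; d_x and d_y stand in for device pointers, and loop1_noloop is the hypothetical guarded kernel sketched above) would round the block count up so every element gets its own thread:

    // Round up so that blocks * threadsPerBlock >= N.
    int threadsPerBlock = 256;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    loop1_noloop<<<blocks, threadsPerBlock>>>(N, alpha, d_x, d_y);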


Interesting kernel. The loop inside the kernel is necessary because N is greater than the total number of threads, which is 16,384 (blockDim.x*gridDim.x), but I think it's not good practice to do it (the whole point of CUDA is to use the SIMT concept). According to the CUDA Programming Guide you can have at most 65535 thread blocks in one kernel launch. Furthermore, starting from Compute Capability 2.x (Fermi) you can have at most 1024 threads per block (512 before Fermi). Also, you can (if possible) separate the code into multiple (sequential) kernels.
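
One way to read the "multiple sequential kernels" suggestion (purely a sketch; the answer itself gives no code, and loop1_noloop is the hypothetical guarded kernel from above) is to launch the one-element-per-thread kernel repeatedly over chunks that stay under the 65535-block limit:

    // Process the vector in chunks so each launch needs at most 65535 blocks.
    int threadsPerBlock = 256;
    int chunk = 65535 * threadsPerBlock;
    for (int offset = 0; offset < N; offset += chunk) {
        int n = (N - offset < chunk) ? (N - offset) : chunk;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        loop1_noloop<<<blocks, threadsPerBlock>>>(n, alpha, d_x + offset, d_y + offset);
    }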


Much as we would like to believe that CUDA GPUs have infinite execution resources, they do not, and authors of highly optimized code are finding that unrolled for loops, often with fixed numbers of blocks, give the best performance. Makes for painful coding, but optimized CPU code is also pretty painful.

BTW, a commenter mentioned that this code would have coalescing problems, and I don't see why. If the base addresses are correctly aligned (64B, since those are floats), all of the memory transactions by this code will be coalesced, provided the threads per block is also divisible by 64.
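
To spell out the access pattern behind that claim (my own arithmetic, assuming the <<<64,256>>> launch from the question and a 32-thread warp):

    // In iteration k of the stride loop, thread t of block b reads/writes
    //   i = k*blockDim.x*gridDim.x + b*blockDim.x + t = k*16384 + b*256 + t
    // so the 32 threads of a warp touch 32 consecutive floats, i.e. one
    // contiguous 128-byte segment, which coalesces when the base is aligned.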
