Increasing block size decreases performance

In my CUDA code, if I increase blocksizeX and blocksizeY it actually takes more time (therefore I run it at 1x1). Also, a chunk of my execution time (e.g. 7 out of 9 s) is taken just by the call to the kernel. In fact, I am quite amazed that even if I comment out the entire kernel, the time is almost the same. Any suggestions on where and how to optimize?

P.S. I have edited this post with my actual code. I am downsampling an image, so every 4 neighboring pixels (e.g. pixels 1 and 2 from row 1 and pixels 1 and 2 from row 2) produce one output pixel. I get an effective bandwidth of 0.5 GB/s compared to a theoretical maximum of 86.4 GB/s. The time I use is the difference between calling the kernel with its instructions and calling an empty kernel. It looks pretty bad to me right now, but I can't figure out what I am doing wrong.
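For reference, a common way to time just the kernel is with CUDA events rather than a wall-clock difference against an empty launch; a minimal sketch (the grid/block configuration below is a placeholder, not the one used in this post):

    // Sketch: time only the kernel with CUDA events. The synchronize is needed,
    // otherwise the launch returns immediately and the measurement is meaningless.
    dim3 block(16, 16);                                        // placeholder block size
    dim3 grid((width/2 + 15) / 16, (height/2 + 15) / 16);      // one thread per output pixel

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    streamkernel<<<grid, block>>>(r_d, g_d, b_d, height, width, f_r, f_g, f_b);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed GPU time in milliseconds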

 __global__ void streamkernel(int *r_d, int *g_d, int *b_d, int height, int width,
                              int *f_r, int *f_g, int *f_b)
 {
     // Flatten the 2D grid and block into a single linear thread index.
     int id = blockIdx.x * blockDim.x * blockDim.y
            + threadIdx.y * blockDim.x + threadIdx.x
            + blockIdx.y * gridDim.x * blockDim.x * blockDim.y;

     // Top-left pixel of the 2x2 input block that produces output pixel `id`.
     int number = 2 * (id % (width / 2)) + (id / (width / 2)) * width * 2;

     if (id < height * width / 4)
     {
         // Average each 2x2 neighborhood, one channel at a time.
         f_r[id] = (r_d[number] + r_d[number+1] + r_d[number+width] + r_d[number+width+1]) / 4;
         f_g[id] = (g_d[number] + g_d[number+1] + g_d[number+width] + g_d[number+width+1]) / 4;
         f_b[id] = (b_d[number] + b_d[number+1] + b_d[number+width] + b_d[number+width+1]) / 4;
     }
 }


Try looking up the matrix multiplication example among the CUDA SDK samples to see how to use shared memory.

The problem with your current kernel is that it's doing 4 global memory reads and 1 global memory write for every 3 additions and 1 division. Each global memory access costs roughly 400 cycles. This means you're spending the vast majority of the time doing memory accesses (what GPUs are bad at) rather than compute (what GPUs are good at).

Shared memory in effect lets you cache these reads so that, amortized, you get roughly 1 read and 1 write per pixel for 3 additions and 1 division. That is still not doing so great on the CGMA ratio (compute to global memory access ratio, the holy grail of GPU computing).
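A minimal sketch of that idea for this kernel (single channel only, loosely in the style of the SDK matrix-multiplication sample): each block stages its 2*TILE x 2*TILE input patch in shared memory with cooperative, coalesced loads, then each thread averages one 2x2 neighborhood. The TILE constant, kernel name, and 2D launch are assumptions, not code from the question.

    #define TILE 16   // assumed block size: TILE x TILE threads, one output pixel per thread

    __global__ void downsample_shared(const int *r_d, int *f_r, int height, int width)
    {
        // Input patch covered by this block: (2*TILE) x (2*TILE) pixels.
        __shared__ int tile[2 * TILE][2 * TILE];

        int out_x = blockIdx.x * TILE + threadIdx.x;   // output column
        int out_y = blockIdx.y * TILE + threadIdx.y;   // output row
        int in_x0 = blockIdx.x * 2 * TILE;             // top-left corner of the input patch
        int in_y0 = blockIdx.y * 2 * TILE;

        // Cooperative load: each thread fetches 4 input pixels; threads in a row
        // read consecutive addresses, so the loads coalesce.
        for (int dy = threadIdx.y; dy < 2 * TILE; dy += TILE)
            for (int dx = threadIdx.x; dx < 2 * TILE; dx += TILE) {
                int gx = in_x0 + dx, gy = in_y0 + dy;
                if (gx < width && gy < height)
                    tile[dy][dx] = r_d[gy * width + gx];
            }
        __syncthreads();

        // Average one 2x2 neighborhood per thread and write the output pixel.
        if (out_x < width / 2 && out_y < height / 2) {
            int lx = 2 * threadIdx.x, ly = 2 * threadIdx.y;
            f_r[out_y * (width / 2) + out_x] =
                (tile[ly][lx]     + tile[ly][lx + 1] +
                 tile[ly + 1][lx] + tile[ly + 1][lx + 1]) / 4;
        }
    }

It would be launched with one thread per output pixel, e.g. downsample_shared<<<dim3((width/2 + TILE - 1)/TILE, (height/2 + TILE - 1)/TILE), dim3(TILE, TILE)>>>(r_d, f_r, height, width), once per channel.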

Overall, I think for a simple kernel like this, a CPU implementation is likely going to be faster given the overhead of transferring data across the PCI-E bus.
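For comparison, the CPU version this refers to is just a double loop per channel; a sketch (the function name is made up):

    // Average each 2x2 block of src into one pixel of dst (one channel).
    void downsample_cpu(const int *src, int *dst, int height, int width)
    {
        for (int y = 0; y < height / 2; ++y)
            for (int x = 0; x < width / 2; ++x) {
                int s = 2 * y * width + 2 * x;   // top-left of the 2x2 block
                dst[y * (width / 2) + x] =
                    (src[s] + src[s + 1] + src[s + width] + src[s + width + 1]) / 4;
            }
    }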


You're forgetting the fact that one multiprocessor can execute up to 8 blocks simultaneously, and maximum performance is reached exactly then. However, there are many factors that limit the number of blocks that can be resident in parallel (an incomplete list):

  • Maximum amount of shared memory per multiprocessor limits the number of blocks if #blocks * shared memory per block would be > total shared memory.
  • Maximum number of threads per multiprocessor limits the number of blocks if #blocks * #threads / block would be > max total #threads.
  • ...

You should try to find a kernel execution configuration that causes exactly 8 blocks to be run on one multiprocessor. This will almost always yield the highest performance, even if the occupancy is not 1.0! From this point on you can iteratively make changes that reduce the number of executed blocks per MP but increase the occupancy of your kernel, and see if the performance improves.

The NVIDIA occupancy calculator (an Excel sheet) will be of great help.
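On newer toolkits (CUDA 6.5 and later) the runtime API can report the same number programmatically; a minimal sketch, assuming the streamkernel from the question (the 16x16 block size is just an example):

    #include <cstdio>

    // Ask the runtime how many blocks of a given size can be resident on one
    // multiprocessor for this kernel, and how many multiprocessors the device has.
    void report_occupancy()
    {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, streamkernel,
            16 * 16,   // threads per block (example value)
            0);        // dynamic shared memory per block

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        std::printf("resident blocks per SM: %d (device has %d SMs)\n",
                    blocksPerSM, prop.multiProcessorCount);
    }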
