
Optimizing CUDA kernels regarding registers

Source: https://www.devze.com, 2023-03-07 11:41 (from the web)

I'm using the CUDA Occupancy Calculator to try to optimize my CUDA kernel. Currently I'm using 34 registers and zero shared memory, so the maximum occupancy is 63% for 310 threads per block. If I could somehow reduce the register count to 20 or below (e.g. by passing kernel parameters via shared memory), I could get an occupancy of 100%. Is this a good way to do it, or would you advise me to take another optimization path?

Further, I'm also wondering if there's a newer version of the Occupancy Calculator for Compute Capability 2.1!?


Some points to consider:

  1. 320 threads per block will give the same occupancy as 310, because occupancy is defined as active warps/maximum warps per SM, and the warp size is always 32 threads. You should never use a block size which is not a round multiple of 32. That just wastes cores and cycles.
  2. Kernel parameters are passed in constant memory on your compute 2.1 device, and they have no effect on occupancy or register usage.
  3. The GPU design has a pipeline latency of about 21 cycles. So for a Fermi GPU, you need about 43% occupancy to cover all of the internal scheduling latency. Once that is done, you may find that there is relatively little benefit in trying to achieve higher occupancy.
  4. Striving for 100% occupancy is rarely a constructive optimization goal. If you have not done so, I highly recommend looking over Vasily Volkov's presentation from GTC 2010, "Better Performance at Lower Occupancy", where he shows all sorts of surprising results, like code hitting 85% of peak memory bandwidth at 8% occupancy.
  5. The newest occupancy calculator doesn't cover compute 2.1, but the effective occupancy rules for compute 2.0 apply to 2.1 devices too. The extra cores in the compute 2.1 multiprocessor come into play via instruction level parallelism and what is almost out of order execution. That really doesn't change the occupancy characteristics of the device at all compared to compute 2.0 devices.
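To see where the 63% figure in the question comes from, here is a minimal sketch (in Python, not any NVIDIA API; the function name is my own) of the register-limited occupancy arithmetic the Occupancy Calculator performs for a compute 2.0/2.1 multiprocessor, using the published Fermi SM limits:

```python
import math

# Published Fermi (compute 2.0/2.1) per-SM limits
REGS_PER_SM = 32768      # 32K 32-bit registers per multiprocessor
MAX_WARPS_PER_SM = 48    # 1536 resident threads / 32
MAX_BLOCKS_PER_SM = 8
WARP_SIZE = 32
REG_ALLOC_UNIT = 64      # registers are allocated per warp in units of 64

def occupancy(regs_per_thread, threads_per_block):
    """Fraction of the SM's 48 warp slots that can be kept resident."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    # Register allocation is per warp, rounded up to the allocation unit.
    regs_per_warp = (math.ceil(regs_per_thread * WARP_SIZE / REG_ALLOC_UNIT)
                     * REG_ALLOC_UNIT)
    regs_per_block = regs_per_warp * warps_per_block
    # Resident blocks are limited by registers, warp slots, and the block cap.
    blocks = min(REGS_PER_SM // regs_per_block,
                 MAX_WARPS_PER_SM // warps_per_block,
                 MAX_BLOCKS_PER_SM)
    return blocks * warps_per_block / MAX_WARPS_PER_SM

print(occupancy(34, 310))  # 0.625 -> the "63%" in the question
print(occupancy(20, 256))  # 1.0   -> full occupancy at 20 registers
```

Note that 310 threads per block costs the same 10 warps as 320 (point 1 above), and that at 20 registers per thread full occupancy requires a block size whose warp count divides 48 evenly, such as 256 threads.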


talonmies is correct, occupancy is overrated.

Vasily Volkov had a great presentation at GTC2010 on this topic: "Better Performance at Lower Occupancy."

http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
