I know the number of blocks running on one SM is limited by the per-SM block limit, threads, shared memory, and registers. Is there any strategy to avoid using too many registers? I just don't want to use too many of them, because that eventually limits the number of blocks I can run on one SM.
Compiling with nvcc -Xptxas -v will print out the diagnostic information Edric mentioned. Additionally, you can force the compiler to conserve registers using the __launch_bounds__ qualifier. For example:
__global__ void
__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
MyKernel(...)
{
...
}
asks the compiler to limit register usage so that at least minBlocksPerMultiprocessor blocks of size maxThreadsPerBlock can be resident on a single SM, spilling registers to local memory if it has to. See Section B.16 of the CUDA Programming Guide for a complete explanation of __launch_bounds__.
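To make the qualifier concrete, here is a minimal sketch of a complete program that uses it. The kernel name scaleKernel, the bounds (256, 4), and the problem size are made up for illustration and are not tuned for any particular GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: it will be launched with at most 256 threads per block,
// and the compiler is asked to keep register use low enough that 4 such
// blocks can be resident on one SM.
__global__ void
__launch_bounds__(256, 4)
scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;  // illustrative problem size
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // The block size must not exceed the first __launch_bounds__ argument,
    // otherwise the launch fails.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads>>>(d_data, 2.0f, n);

    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

Building this with nvcc -Xptxas -v, once with and once without the __launch_bounds__ line, should show whether the bounds change the reported register count or introduce spills for your own kernel.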
One of the main drivers of the number of registers is the amount of local data you declare in your kernel. However, the PTX assembler can do quite a good job of re-using registers, so it's not always easy to work out how many will be used from the PTX code - you need to run ptxas to get the real answer.
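If you would rather check the final register count from inside the program, the runtime API can report it for a compiled kernel. The sketch below assumes a reasonably recent CUDA toolkit (the occupancy call was added in CUDA 6.5); accumulateKernel and the block size of 256 are hypothetical and only there to have something to query.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with a little local data, used only as a query target.
__global__ void accumulateKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = 0; k < 8; ++k)
            acc += in[i] * (float)k;
        out[i] = acc;
    }
}

int main()
{
    // Ask the runtime what ptxas actually allocated for this kernel.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, accumulateKernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread:    %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);

    // Ask how many blocks of the chosen size can be resident on one SM,
    // taking registers and the other per-SM limits into account.
    const int blockSize = 256;  // illustrative block size
    int blocksPerSM = 0;
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, accumulateKernel, blockSize, 0 /* dynamic shared memory */);
    if (err != cudaSuccess) {
        printf("occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("resident blocks per SM at %d threads: %d\n", blockSize, blocksPerSM);
    return 0;
}
```

A non-zero local memory size here can be a hint that registers were spilled, although local arrays also count toward it.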