CUDA warps and occupancy

https://www.devze.com 2023-02-27 02:26 (source: web)
I have always thought that the warp scheduler executes one warp at a time, depending on which warp is ready, and that this warp can come from any of the thread blocks on the multiprocessor. However, one of the Nvidia webinar slides states that "Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently". So more than one warp can run at one time? How does this work?

Thank you.


"Running" might be better interpreted as "having state on the SM and/or instructions in the pipeline". The GPU hardware schedules as many blocks as are available or as will fit into the resources of the SM (whichever is smaller), allocates state for every warp they contain (i.e. register file and local memory), and then starts scheduling those warps for execution. The instruction pipeline appears to be about 21-24 cycles long, so there are many threads in various stages of "running" at any given time.
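The "resident warps / maximum warps" bookkeeping behind the occupancy figure can be sketched as plain arithmetic. The limits below are hypothetical numbers loosely modeled on a compute 2.0 SM, chosen only to illustrate how per-block resource usage caps the number of resident blocks; real values vary by architecture.

```python
# Hypothetical SM resource limits (illustrative, not exact hardware specs).
MAX_WARPS_PER_SM = 48          # assumed ceiling on resident warps
MAX_REGISTERS_PER_SM = 32768   # assumed register file size
MAX_SHARED_MEM_PER_SM = 49152  # assumed shared memory, bytes
WARP_SIZE = 32

def occupancy(threads_per_block, regs_per_thread, shared_mem_per_block):
    """Fraction of the SM's warp slots occupied by resident warps."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    # Each resource limit independently caps the number of resident blocks.
    by_warps = MAX_WARPS_PER_SM // warps_per_block
    by_regs = MAX_REGISTERS_PER_SM // (regs_per_thread * threads_per_block)
    by_smem = (MAX_SHARED_MEM_PER_SM // shared_mem_per_block
               if shared_mem_per_block else by_warps)
    blocks = min(by_warps, by_regs, by_smem)
    resident_warps = blocks * warps_per_block
    return resident_warps / MAX_WARPS_PER_SM

# 256 threads/block, 20 registers/thread, 4 KB shared memory per block
print(occupancy(256, 20, 4096))  # -> 1.0 (6 blocks * 8 warps = 48 warps)
```

Note how raising register use per thread lowers the block count, and with it the number of resident warps; this is exactly the trade-off the occupancy metric captures.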

The first two generations of CUDA-capable GPUs (G80/G90 and G200) retire instructions from only a single warp every four clock cycles. Compute 2.0 devices dual-issue instructions from two warps every two clock cycles, so two warps retire instructions per clock. Compute 2.1 extends this by allowing what is effectively out-of-order execution: still only two warps per clock, but potentially two instructions from the same warp at a time. The extra 16 cores per SM are thus used for instruction-level parallelism, still issued from the same shared scheduler.
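The issue rates above, combined with the ~22-cycle pipeline estimate, give a rough idea of how many warp-instructions must be in flight to keep the scheduler busy (concurrency = latency x throughput, in the style of Little's law). This is a back-of-the-envelope sketch using the figures from the text, not exact hardware specs.

```python
# Rough latency-hiding estimate. PIPELINE_DEPTH is the ~21-24 cycle
# pipeline figure from the answer above; issue rates are per-generation:
# one warp-instruction per 4 clocks (G80/G200) vs. two per clock
# (compute 2.x dual-issue). All numbers are illustrative.
PIPELINE_DEPTH = 22  # cycles

def warp_instructions_in_flight(instructions_per_clock):
    """Independent warp-instructions needed to cover pipeline latency."""
    # If each instruction takes PIPELINE_DEPTH cycles to retire and the
    # scheduler can issue instructions_per_clock per cycle, this many
    # must be in the pipeline for issue slots never to go idle.
    return PIPELINE_DEPTH * instructions_per_clock

print(warp_instructions_in_flight(0.25))  # G80/G200: -> 5.5
print(warp_instructions_in_flight(2.0))   # compute 2.x: -> 44.0
```

This is why "running" warps far outnumber the warps actually retiring instructions on a given clock: most of them are simply somewhere in the pipeline.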

