CUDA - what if I choose too many blocks?

Developer https://www.devze.com 2023-02-20 09:24 Source: Internet
I'm still struggling with these matrices of unknown size, which may vary from 10 to 20,000 in each dimension.

I'm looking at the CUDA SDK and wondering: what if I choose too high a number of blocks?

Suppose I launch a grid of 9999 x 9999 blocks in the X and Y dimensions. If my hardware's SMs can't hold all of these blocks at once, will the kernel fail, or will performance simply collapse?

I don't know how to choose block and thread dimensions for something that may vary so much. I'm thinking of using the MAXIMUM number of blocks my hardware supports and then making the threads inside them work across the whole matrix. Is this the right way?


Thread blocks do not map one-to-one onto cores. Blocks are scheduled onto the multiprocessors as they become available, meaning you can request as many as you want (up to the grid dimension limits). Requesting a huge number of blocks that have nothing to do just slows the system down as it loads and unloads those do-nothing blocks.

You can specify the dimensions of the grid and blocks at run time.

Edit: Here are the limits on the dimensions of the grid and the blocks, from the documentation.

[Image: table of grid and block dimension limits from the CUDA documentation]
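
The limits mentioned in this answer can also be read from the device at run time rather than looked up. The following is a minimal sketch, not a definitive implementation: `printLaunchLimits` and `gridFitsLimits` are hypothetical helper names introduced here for illustration, while `cudaGetDeviceProperties` and the `cudaDeviceProp` fields are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if every grid dimension is within the given
// per-dimension limits (as reported in cudaDeviceProp::maxGridSize).
bool gridFitsLimits(dim3 grid, const int maxGrid[3]) {
    return grid.x >= 1 && grid.x <= (unsigned int)maxGrid[0] &&
           grid.y >= 1 && grid.y <= (unsigned int)maxGrid[1] &&
           grid.z >= 1 && grid.z <= (unsigned int)maxGrid[2];
}

// Hypothetical helper: print the launch limits of one device.
cudaError_t printLaunchLimits(int device) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;

    printf("device:                %s\n", prop.name);
    printf("multiprocessors:       %d\n", prop.multiProcessorCount);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dimensions:   %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return cudaSuccess;
}
```

With limits of 65535 per grid dimension in X and Y (an assumption that should be checked against the actual device), a 9999 x 9999 grid passes `gridFitsLimits`, so such a launch is legal even though the blocks cannot all be resident at once.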


If you choose an excessively large grid size, you waste some cycles while the "dead" blocks get retired (typically only of the order of a few tens of microseconds even for the maximum grid size on a "full size" Fermi or GT200 card). It isn't a huge penalty.

But the grid dimension should always be computable a priori. Usually there is a known relationship between the problem size and a quantifiable unit of data-parallel work (something like one thread per data point, or one block per matrix column, or whatever), which allows the required grid dimensions to be calculated at runtime.
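
As an illustration of computing the grid from the data size, here is a minimal sketch assuming one thread per matrix element and a row-major matrix already in device memory. The names `ceilDiv`, `scaleMatrix` and `launchScale`, and the 16 x 16 block shape, are assumptions made for this example and are not from the question.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: number of tiles of size `tile` needed to cover n items.
inline unsigned int ceilDiv(unsigned int n, unsigned int tile) {
    return (n + tile - 1) / tile;
}

// One thread per element. Threads that fall outside the matrix
// (in the partially filled edge blocks) do nothing.
__global__ void scaleMatrix(float* m, int rows, int cols, float factor) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {
        m[(size_t)row * cols + col] *= factor;
    }
}

// Grid dimensions are derived from the matrix size at run time.
cudaError_t launchScale(float* d_m, int rows, int cols, float factor) {
    dim3 block(16, 16);                      // assumed block shape
    dim3 grid(ceilDiv(cols, block.x),        // blocks across the columns
              ceilDiv(rows, block.y));       // blocks down the rows
    scaleMatrix<<<grid, block>>>(d_m, rows, cols, factor);
    return cudaGetLastError();
}
```

For a 20,000 x 20,000 matrix this gives a 1250 x 1250 grid, and for a 10 x 10 matrix a single block, so the same launcher covers the whole range in the question without any blocks that have nothing to do beyond the edge tiles.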

An alternative strategy would be to use a fixed number of blocks (usually it only needs to be something like 4-8 per multiprocessor on the GPU) and have each block/thread process multiple units of parallel work, so each block becomes "persistent". If there are a lot of fixed overhead costs in per-thread setup, this can be a good way to amortize those overheads across more work per thread.
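
The fixed-grid approach can be sketched with a grid-stride loop, where each thread steps through the data in strides of the total thread count. This is only one way to do it under stated assumptions: `persistentBlockCount`, `scaleAll` and `launchPersistent` are hypothetical names, and the choices of 8 blocks per multiprocessor and 256 threads per block are illustrative and would need tuning.

```cuda
#include <cuda_runtime.h>

// Hypothetical sizing rule: a fixed number of blocks per multiprocessor.
inline int persistentBlockCount(int multiprocessors, int blocksPerMP) {
    return multiprocessors * blocksPerMP;
}

// Grid-stride loop: each thread handles elements i, i + stride, i + 2*stride, ...
// so a grid of any size covers all n elements.
__global__ void scaleAll(float* data, size_t n, float factor) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        data[i] *= factor;
    }
}

// The grid size depends on the hardware, not on the matrix size.
cudaError_t launchPersistent(float* d_data, size_t n, float factor) {
    int device = 0, mpCount = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;
    err = cudaDeviceGetAttribute(&mpCount, cudaDevAttrMultiProcessorCount, device);
    if (err != cudaSuccess) return err;

    int blocks = persistentBlockCount(mpCount, 8);  // assumed 8 blocks per MP
    scaleAll<<<blocks, 256>>>(d_data, n, factor);   // assumed 256 threads per block
    return cudaGetLastError();
}
```

Here a matrix of `rows * cols` elements is treated as a flat array of `n` elements, which works for element-wise operations; kernels that need row and column indices can recover them from `i` inside the loop.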
