I'm still struggling with these unknown-size matrices, which may vary from 10 to 20,000 in each dimension.
I'm looking at the CUDA SDK and wondering: what if I choose a number of blocks that is too high?
Something like a grid of 9999 x 9999 blocks in the X and Y dimensions: if my hardware's SMs can't hold all these blocks at once, will the kernel have problems, or will performance simply collapse?
I don't know how to choose block/thread dimensions for something which may vary so much. I'm thinking of using the MAXIMUM number of blocks my hardware supports and then making the threads inside them work across the whole matrix. Is this the right way?
Thread blocks do not have a one-to-one mapping with the cores. Blocks are scheduled onto the SMs as they become available, meaning you can request as many as you want (up to a limit). Requesting a huge number of blocks would just slow the system down as it loads and unloads do-nothing thread blocks onto the cores.
You can specify the dimensions of the grid and blocks at run time.
Edit: Here are the limits on the dimensions of the grid and the blocks, from the documentation.
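You can also query those limits at runtime rather than hard-coding them. A minimal sketch using the CUDA runtime API (device 0 is assumed here for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```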
If you choose an excessively large grid size, you waste some cycles while the "dead" blocks get retired (typically only on the order of a few tens of microseconds, even for the maximum grid size on a "full size" Fermi or GT200 card). It isn't a huge penalty.
But the grid dimensions should always be computable a priori. Usually there is a known relationship between the problem size and a quantifiable unit of data-parallel work - something like one thread per data point, or one block per matrix column - which allows the required grid dimensions to be calculated at runtime, as sketched below.
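For example, with one thread per matrix element, you can derive the grid from the runtime-known matrix size using ceiling division. A minimal sketch (the kernel and its names are made up for illustration):

```cuda
// One thread per matrix element; grid dimensions computed at runtime
// from the matrix size, which is only known when the kernel launches.
__global__ void scale(float *a, int rows, int cols, float k)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols)          // guard the ragged edge
        a[row * cols + col] *= k;
}

void launch(float *d_a, int rows, int cols, float k)
{
    dim3 block(16, 16);                         // 256 threads per block
    dim3 grid((cols + block.x - 1) / block.x,   // ceiling division so the
              (rows + block.y - 1) / block.y);  // grid covers the whole matrix
    scale<<<grid, block>>>(d_a, rows, cols, k);
}
```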
An alternative strategy is to use a fixed number of blocks (usually only something like 4-8 per SM on the GPU) and have each block/thread process multiple units of parallel work, so each block becomes "persistent" (see the sketch after this paragraph). If there is a lot of fixed setup cost per thread, this can be a good way to amortize that fixed overhead across more work per thread.
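One common way to write such a "persistent" kernel is a grid-stride loop. A sketch, assuming a 1D problem and picking the block count as a small multiple of the SM count (the factor of 8 follows the 4-8 per SM rule of thumb above; the kernel is illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void scale_persistent(float *a, int n, float k)
{
    int stride = blockDim.x * gridDim.x;   // total threads in the grid
    // Each thread walks the array in grid-sized strides, so a fixed
    // number of blocks covers any input size.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        a[i] *= k;
}

void launch_persistent(float *d_a, int n, float k)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // device 0 assumed
    int blocks = 8 * prop.multiProcessorCount;  // ~8 blocks per SM
    scale_persistent<<<blocks, 256>>>(d_a, n, k);
}
```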