I am trying to implement a GEMM implmentation using AMD-APP-SDK 2.4 on a ATI HD 6990 card (Cayman architecture).
One of the optimizing techniques is the use of blocking/tiling.
In its implementation, is it faster if we store the sub-matrices in the shared local memory or is it faster when we use a texture cache? If possible please give the reason also.
Please also suggest which is easier to implement.
Thanks.
P.S. I want it for single precision only, if it matters!
Note: The size of the sub matrix is not an issue, however I feel that since the larger it is the better it would be. The only factor to be taken in consideration i开发者_开发百科s that if unit of memory is 128 bit (4 single precision) then, block size should be a multiple of 4.
The Cypress chips were used in the 5800 series Radeons. The 6900 series uses the Cayman core, which has several important differences, most notably that it is a VLIW4 architecture instead of the VLIW5 configuration used in earlier cores.
As always, the only definitive way to know which method is faster is to benchmark it. In particular, since you give no information about the size of the sub-matrices, it is hard to say where they will best fit.
精彩评论