From what I could find, for "global" memory access, coalescing (i.e., having neighboring memory addresses requested by the threads) is the key to optimal transactions, while for "shared" memory the key is that the addresses issued by the threads do not conflict. Did I understand this correctly?
From the NVIDIA CUDA Programming Guide:
To maximize global memory throughput, it is therefore important to maximize coalescing by:
- Following the most optimal access patterns based on Sections G.3.2 and G.4.2,
- Using data types that meet the size and alignment requirement detailed in Section 5.3.2.1.1,
- Padding data in some cases, for example, when accessing a two-dimensional array as described in Section 5.3.2.1.2.
This refers to the memory accesses of the threads in a warp, which are coalesced ("packed") into one or more transactions. These requirements have been relaxed for devices of compute capability 2.x.
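As a minimal sketch (hypothetical kernel names, not from the guide), the difference between a coalesced and a strided access pattern looks like this: in the first kernel consecutive threads of a warp read consecutive floats, so the warp's loads combine into few transactions, while in the second kernel each thread jumps `stride` elements ahead and the warp's loads spread over many memory segments.

```cuda
// Coalesced: consecutive threads access consecutive addresses.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads access addresses `stride` elements apart,
// so a warp's requests fall into many separate memory segments.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```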
On the other hand, for shared memory accesses you need to understand how this memory is implemented.
To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously.
If two or more threads access different addresses that fall into the same bank, the transfers are serialized; this is known as a bank conflict.
Appendix G. Compute Capabilities has more info about the architecture.
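A common way to avoid bank conflicts is the padding trick used in the classic shared-memory transpose. The sketch below (my own illustration, assuming a 32x32 thread block and a matrix width that is a multiple of the tile size) pads a 32x32 tile to 32x33 so that a column-wise read hits 32 different banks instead of hitting the same bank 32 times.

```cuda
#define TILE 32

__global__ void transpose_padded(const float *in, float *out, int width)
{
    // The extra column changes the stride between rows, so elements of one
    // column land in different banks.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced global load

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                  // transposed block offset
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free column read thanks to padding
}
```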
Regards!