Hi all, from the "NVIDIA CUDA Programming Guide 2.0", Section 5.1.2.1, "Coalescing on Devices with Compute Capability 1.2 and Higher":
"Find the memory segment that contains the address requested by the lowest numbered active thread. Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data."
I have a doubt here: since each half-warp has 16 threads, if all threads access 8-bit data, then the total size per half-warp should be 16 * 8 bits = 128 bits = 16 bytes, while the Guide says "32 bytes for 8-bit data". It seems half the bandwidth is wasted. Am I understanding this correctly?
Thanks Deryk
Yes. Memory access is always in chunks of 32, 64 or 128 bytes, regardless of how much you actually need from that memory line.
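For a concrete illustration (a minimal sketch; the kernel and array names are mine, not from the guide): if each thread of a half-warp reads one byte from a contiguous, aligned array, only 16 bytes are requested, but the hardware still performs a full 32-byte segment fetch.

// Minimal sketch: one 8-bit load per thread.
// A half-warp (16 threads) requests 16 contiguous bytes, but on
// compute capability 1.2/1.3 the hardware fetches the whole 32-byte
// segment containing them, so half the transferred bytes go unused.
__global__ void copyBytes(const char *in, char *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // 1-byte load -> one 32-byte segment per half-warp
}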
Update:
Question: How does that explain 64 bytes for 16-bit data?
The values 32 bytes for 1-byte words, 64 bytes for 2-byte words, and 128 bytes for larger words are the maximum size of the accessed segment. If, for example, each thread fetches a 2-byte word and your access is perfectly aligned, the memory transaction is reduced to a single 32-byte segment fetch.
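As a hedged sketch of that aligned 2-byte case (again, the names are hypothetical): the 16 threads of a half-warp touch exactly 32 contiguous, aligned bytes, so the initial 64-byte segment can be reduced to one 32-byte transaction.

// Sketch of a perfectly aligned 16-bit access pattern.
// The half-warp reads shorts at consecutive indices, i.e. 32 contiguous,
// aligned bytes, so the 64-byte segment is reduced to one 32-byte fetch
// on compute capability 1.2/1.3 hardware.
__global__ void copyShorts(const short *in, short *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // 2-byte load per thread, contiguous and aligned
}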
Check out section G.3.2.2, "Devices of Compute Capability 1.2 and 1.3", of the CUDA Programming Guide (v3.2).
I see you used the CUDA Programming Guide v2.0 (and probably the CUDA 2.0 compiler). There have been lots of improvements (in particular, bug fixes) since then.