
CUDA: memory transaction size for compute capability 1.2 or later


All, from the "NVIDIA CUDA Programming Guide 2.0", Section 5.1.2.1, "Coalescing on Devices with Compute Capability 1.2 and Higher":

"Find the memory segment that contains the address requested by the lowest numbered active thread. Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data."

I have a doubt here: since each half-warp has 16 threads, if all threads access 8-bit data, then the total size per half-warp should be 16 * 8 bits = 128 bits = 16 bytes, while the Guide says "32 bytes for 8-bit data". It seems half the bandwidth is wasted. Am I understanding this correctly?
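For concreteness, here is a minimal sketch (a hypothetical kernel, not from the Guide) of the access pattern I mean: each thread of a half-warp loads a single byte, so the 16 threads together touch only 16 consecutive bytes:

    // Each thread reads one 8-bit element; a half-warp of 16 threads
    // therefore requests only 16 contiguous bytes of global memory.
    __global__ void copy_bytes(const unsigned char *in, unsigned char *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];   // 1-byte load and 1-byte store per thread
    }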

Thanks Deryk


Yes. Memory access is always in chunks of 32, 64 or 128 bytes, regardless of how much you actually need from that memory line.
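As a sketch (my own example, not something the Guide prescribes here): one common way to avoid the waste the question describes is to have each thread load 4 bytes at once, e.g. through the built-in uchar4 vector type, so a half-warp requests 16 * 4 = 64 bytes and an aligned access can be served by a single, fully used 64-byte transaction:

    // Hypothetical sketch: each thread loads a uchar4 (4 bytes), so the
    // 16 threads of a half-warp request 64 bytes. For an aligned,
    // sequential pattern on compute capability 1.2/1.3 this is served by
    // one 64-byte transaction with no wasted bandwidth.
    __global__ void copy_bytes_vec4(const uchar4 *in, uchar4 *out, int n4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4)               // n4 = number of bytes / 4
            out[i] = in[i];       // one 32-bit load and store per thread
    }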


Update:

Question: How does that explain 64 bytes for 16-bit data?

The values (32 bytes for 1-byte words, 64 bytes for 2-byte words, 128 bytes for larger words) are the maximum size of the accessed segment. If, for example, each thread fetches a 2-byte word and your access is perfectly aligned, the memory access will be reduced to a single 32-byte fetch.
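A sketch of that 2-byte case (a hypothetical kernel; pointers from cudaMalloc are aligned well beyond the segment size): each thread of a half-warp loads one short, so the 16 threads request 16 * 2 = 32 contiguous bytes and the hardware can issue a single 32-byte transaction instead of the maximum 64-byte segment:

    // Perfectly aligned, sequential 2-byte accesses: a half-warp touches
    // exactly 32 bytes, so the 64-byte segment fetch is reduced to 32 bytes.
    __global__ void copy_shorts(const short *in, short *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }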

Check out Section G.3.2.2, "Devices of Compute Capability 1.2 and 1.3", of the "CUDA Programming Guide" (v3.2).

I see you used CUDA PG v. 2.0 (and probably the CUDA 2.0 compiler). There have been lots of improvements (in particular, bug fixes) since then.
