Say I want to load an array of short from global memory to shared memory. I am not sure how coalescing works here. The Best Practices Guide says that on devices of compute capability 1.0 or 1.1, the k-th thread in a half warp must access the k-th word in a segment aligned to 16 times the size of the elements being accessed.
If I understand it correctly, and I break my data into 32-byte segments (16 shorts), do threads 0, 16, 32, ... have to access the first element of each segment? Do I need to consider 64-byte or 128-byte alignment as well? I have a GTS 250, so I guess this is important. Any advice is welcome. Thanks.
According to Section G.3.2.1 of the CUDA Programming Guide, short will not coalesce on Compute Capability 1.0 and 1.1 devices under any circumstances. Specifically, it states:

The size of the words accessed by the threads must be 4, 8, or 16 bytes

You can, however, use vector types such as short2 or short4 to get coalesced access. The coalescing rules for these types are spelled out in Section G.3.2.1 as well. As far as coalescing is concerned, a short2 behaves just like a 32-bit int.
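As a concrete illustration, here is a minimal sketch (not from the original answer) of staging shorts through shared memory as short2, so each thread issues a single 4-byte load that can coalesce on 1.x hardware. The kernel name copyShorts and the TILE size are assumptions for the example; it also assumes the array holds an even number of shorts and is naturally aligned for short2.

__global__ void copyShorts(const short2 *in, short2 *out, int nPairs)
{
    // TILE shorts per block, staged as TILE/2 short2 values.
    const int TILE = 128;
    __shared__ short2 tile[TILE / 2];

    int idx = blockIdx.x * (TILE / 2) + threadIdx.x;
    if (idx < nPairs) {
        tile[threadIdx.x] = in[idx];   // coalesced 4-byte loads per thread
    }
    __syncthreads();

    // ... operate on tile[] as individual shorts via .x / .y ...

    if (idx < nPairs) {
        out[idx] = tile[threadIdx.x];  // coalesced 4-byte stores
    }
}

int main()
{
    const int nShorts = 1 << 20;        // even count assumed for this sketch
    const int nPairs  = nShorts / 2;
    short2 *dIn, *dOut;
    cudaMalloc(&dIn,  nPairs * sizeof(short2));
    cudaMalloc(&dOut, nPairs * sizeof(short2));

    int threads = 64;                   // = TILE / 2, one short2 per thread
    int blocks  = (nPairs + threads - 1) / threads;
    copyShorts<<<blocks, threads>>>(dIn, dOut, nPairs);
    cudaDeviceSynchronize();

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}

If your data really is a plain short array, you can reinterpret the pointer as short2* provided the allocation is suitably aligned (cudaMalloc returns well-aligned pointers) and the element count is even.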
FWIW, devices with Compute Capability 1.3 or greater handle types like char and short much better. Reading chars on a 1.3 device might give you as much as ~60% of peak memory bandwidth vs. ~10% of peak memory bandwidth on a 1.0 or 1.1 device.