How do GPUs (GeForce 9800) implement bitwise integer operations?

Developer https://www.devze.com 2023-01-26 10:27 Source: Internet
CUDA lets the programmer write an expression such as a & b | ~c (where a, b, and c are unsigned ints).

What does the GPU do internally? Does it somehow "emulate" bitwise operations on integers, or are they as efficient as on a traditional CPU?


According to the CUDA Programming Guide v2.3 (Section 5.1.1.1), bitwise operations run at full speed (8 operations per clock cycle), the same throughput as integer add:

Integer Arithmetic

Throughput of integer add is 8 operations per clock cycle.

Throughput of 32-bit integer multiplication is 2 operations per clock cycle, but __mul24 and __umul24 provide 24-bit integer multiplication with a throughput of 8 operations per clock cycle. On future architectures, however, __mul24 and __umul24 will be slower than 32-bit integer multiplication, so we recommend providing two kernels, one using the 24-bit intrinsics and the other using generic 32-bit integer multiplication, to be called appropriately by the application.
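The two-kernel recommendation could look roughly like the sketch below. The kernel names, the launchScale helper, and the preferMul24 policy are hypothetical; in particular, the cutoff at compute capability 2.0 is an assumption about where the 24-bit intrinsic stops being the faster choice, not something stated in the quoted passage.

```cuda
#include <cuda_runtime.h>

// Variant using the 24-bit multiply intrinsic. __umul24 only looks at the
// low 24 bits of each operand, so the inputs must fit in 24 bits.
__global__ void scaleMul24(const unsigned int* in, unsigned int factor,
                           unsigned int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __umul24(in[i], factor);
}

// Variant using generic 32-bit integer multiplication.
__global__ void scaleMul32(const unsigned int* in, unsigned int factor,
                           unsigned int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

// Hypothetical policy (assumption): prefer the 24-bit kernel only on
// devices with compute capability 1.x.
inline bool preferMul24(int computeMajor)
{
    return computeMajor < 2;
}

// Hypothetical host helper: picks a kernel for the current device.
// dIn and dOut are device pointers; error checking is omitted for brevity.
void launchScale(const unsigned int* dIn, unsigned int factor,
                 unsigned int* dOut, int n)
{
    if (n <= 0) return;

    int device = 0;
    cudaGetDevice(&device);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (preferMul24(prop.major))
        scaleMul24<<<blocks, threads>>>(dIn, factor, dOut, n);
    else
        scaleMul32<<<blocks, threads>>>(dIn, factor, dOut, n);
}
```

The two kernels only produce the same results when every operand fits in 24 bits, so the application has to guarantee that range before the faster variant is a safe substitute.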

Integer division and modulo operation are particularly costly and should be avoided if possible or replaced with bitwise operations whenever possible: If n is a power of 2, (i/n) is equivalent to (i>>log2(n)) and (i%n) is equivalent to (i&(n-1)); the compiler will perform these conversions if n is literal.
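As a small illustration of the power-of-two rewrite described above, the helpers below (hypothetical names) spell out both forms. The sketch assumes unsigned operands; for negative signed values a shift and a division round differently, so the equivalence does not carry over unchanged.

```cuda
#include <cuda_runtime.h>

// i / n for n == 2^log2n, written as a right shift (unsigned operands only).
__host__ __device__ inline unsigned int divPow2(unsigned int i, unsigned int log2n)
{
    return i >> log2n;
}

// i % n for n a power of two, written as a mask (unsigned operands only).
__host__ __device__ inline unsigned int modPow2(unsigned int i, unsigned int n)
{
    return i & (n - 1u);
}

// Example kernel: splits a flat index into (row, col) for a row width of
// 2^log2Width without using the division or modulo operators.
__global__ void splitIndex(unsigned int* rows, unsigned int* cols,
                           unsigned int log2Width, unsigned int count)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        rows[i] = divPow2(i, log2Width);
        cols[i] = modPow2(i, 1u << log2Width);
    }
}
```

When the divisor is a compile-time literal the compiler is expected to do this rewrite itself, as the quoted passage notes, so the manual form mainly matters when the divisor is only known at run time.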

Comparison

Throughput of compare, min, max is 8 operations per clock cycle.

Bitwise Operations

Throughput of any bitwise operation is 8 operations per clock cycle.
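For reference, the expression from the question can be written as an ordinary kernel. This is a minimal sketch; the names combine and combineKernel are made up for illustration, and the parentheses only make explicit the grouping that C operator precedence already gives a & b | ~c.

```cuda
#include <cuda_runtime.h>

// Same value as a & b | ~c: unary ~ binds tightest, then &, then |.
__host__ __device__ inline unsigned int combine(unsigned int a,
                                                unsigned int b,
                                                unsigned int c)
{
    return (a & b) | ~c;
}

// One thread per element; a, b, c and out are device pointers of length n.
__global__ void combineKernel(const unsigned int* a, const unsigned int* b,
                              const unsigned int* c, unsigned int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = combine(a[i], b[i], c[i]);
}
```

A launch would look like combineKernel<<<blocks, threads>>>(dA, dB, dC, dOut, n) with buffers allocated through cudaMalloc. Going by the throughput figures quoted above, each of the three operators should run at the same rate as an integer add rather than being emulated.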
