Newer NVIDIA GPUs support a __popc(x) instruction that counts the number of bits set in a 32 bit register.
I am 99% sure OpenCL does not support inline assembler unless it is a vendor kernel extension.
1) Does AMD hardware support this yet? (I am not aware of it).
2) For OS X and Linux, how do you intercept the NVIDIA intermediate language that it is compiled to so you could insert this?
I figured out how to dump the PTX "binary" in PyOpenCL, now I just need to figure out how to re-insert it with modifications.
# create the program
self.program = cl.Program(self.ctx, fstr).build()
# dump the compiled binary for the first device (PTX text on NVIDIA)
print(self.program.get_info(cl.program_info.BINARIES)[0])
NVIDIA's OpenCL compiler supports inline PTX assembly inside OpenCL code using the 'asm' keyword. The notation is similar to GCC inline assembly. I currently use this:
inline uint popcnt(const uint i) {
    uint n;
    asm("popc.b32 %0, %1;" : "=r"(n) : "r"(i));
    return n;
}
Tested and working on Ubuntu Linux.
If you want more information check NVIDIA's oclInlinePTX code sample and the PTX ISA documentation.
If you are using an AMD or Intel card, none of this is necessary: you can just use the built-in popcount function in OpenCL 1.2.
To the best of my knowledge, there is no inline assembly in any current OpenCL implementation, nor is there any way to intercept PTX (or CAL) during the JIT compilation cycle on OS X or Linux.
popc is a hardware instruction in NVIDIA compute 2.x hardware, but in compute 1.x hardware it is emulated. You can find the code for it in device_functions.h in the CUDA toolkit. You could always implement it as a function in OpenCL, at the expense of some speed.
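For what it's worth, here is a rough sketch of what such a software fallback could look like. It is the usual bit-parallel population count written in plain C, not a copy of what device_functions.h contains, and the name popcnt_sw is just made up for illustration.

```c
#include <stdint.h>

/* Software population count for a 32-bit value (hypothetical helper name).
 * It sums bits in parallel: first within 2-bit pairs, then 4-bit nibbles,
 * then bytes, and finally adds the four byte counts with one multiply. */
static uint32_t popcnt_sw(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* count per 2-bit pair */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* count per nibble */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* count per byte */
    return (x * 0x01010101u) >> 24;                   /* sum of the four bytes */
}
```

To use it in a kernel, it should be enough to drop the #include and replace uint32_t with uint, since OpenCL C has the same shift, mask and multiply operators on 32-bit unsigned integers.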