I made some kernels for testing bandwidth and they do no useful computations. A minimal example is
__global__ void testKernel(float* a)
{
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    float x;
    x = a[i];
}
When I compile, I get (not surprisingly)
warning: variable "x" was set but never used
and the kernel runs as quickly as an empty kernel:
__global__ void donothing()
{
}
This indicates that the read of a[i] has been optimized out.
I have tried tricks such as
volatile float x;
if(x);
(void)(x);
and they suppress the warning, but the kernel still finishes too quickly.
How can I make sure that the useless instructions actually get executed?
I found the option CU_JIT_OPTIMIZATION_LEVEL, but Google mostly turns up links to the documentation rather than examples of how to use it. Would this option help me, and how do I use it?
Try introducing a branch which stores the variable:
__global__ void testKernel(float* a, float* b)
{
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    float x;
    x = a[i];
    if(b)
    {
        *b = x;
    }
}
The cost of the branch is negligible compared to the cost of the global memory read.
At the kernel launch site, simply pass a null pointer:
testKernel<<<...>>>(a, static_cast<float*>(0));
The kernel is compiled without knowledge of the argument values at the launch site, so nvcc cannot prove that b is always null. Because the store might execute, the load of a[i] should not be removed as dead code.
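To check that the read really happens, the kernel above can be timed with CUDA events and the result converted to an effective bandwidth. The following is only a minimal sketch: the helper name timeRead, the buffer size, and the block size are assumptions chosen for illustration, and the element count is assumed to be a multiple of the block size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel from the answer: the store is guarded by a pointer the compiler
// cannot prove to be null, so the load of a[i] has to be kept.
__global__ void testKernel(float* a, float* b)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = a[i];
    if (b)
    {
        *b = x;
    }
}

// Hypothetical helper (not from the original post): launches the kernel
// over n floats and returns the elapsed time in milliseconds.
// Assumes n is a multiple of threads.
float timeRead(float* d_a, size_t n, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    unsigned int blocks = static_cast<unsigned int>(n / threads);
    cudaEventRecord(start);
    testKernel<<<blocks, threads>>>(d_a, static_cast<float*>(0));
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const size_t n = 1 << 24;     // 16M floats (64 MiB), an arbitrary size
    const int threads = 256;      // arbitrary block size

    float* d_a = 0;
    if (cudaMalloc(&d_a, n * sizeof(float)) != cudaSuccess)
    {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_a, 0, n * sizeof(float));

    timeRead(d_a, n, threads);                // warm-up launch
    float ms = timeRead(d_a, n, threads);     // timed launch

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
    {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(d_a);
        return 1;
    }

    // bytes / (ms * 1e-3 s) / 1e9 = bytes / (ms * 1e6)
    double gbPerSec = (n * sizeof(float)) / (ms * 1.0e6);
    printf("%.3f ms, %.2f GB/s\n", ms, gbPerSec);

    cudaFree(d_a);
    return 0;
}
```

If the load has been kept, the reported time should be clearly larger than that of the empty donothing kernel launched with the same grid. Inspecting the generated machine code (for example with cuobjdump -sass) for a global load instruction is another way to confirm it.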