I have some code that I want to make into a CUDA kernel. Behold:
for (r = Y; r < Y + H; r += 2)
{
    ch1RowSum = ch2RowSum = ch3RowSum = 0;
    for (c = X; c < X + W; c += 2)
    {
        chan1Value = // some calc'd value
        chan3Value = // some calc'd value
        chan2Value = // some calc'd value
        ch2RowSum += chan2Value;
        ch3RowSum += chan3Value;
        ch1RowSum += chan1Value;
    }
    ch1Mean += ch1RowSum / W;
    ch2Mean += ch2RowSum / W;
    ch3Mean += ch3RowSum / W;
}
Should this be split up into two kernels, one to calculate the RowSums and one to calculate the Means? And how should I handle the fact that my loop indices don't start at zero and end at N?
Let's suppose you have a kernel that computes the three values. Each thread in your launch configuration computes the three values for one (r,c) pair.
__global__ void value_kernel(int Y, int H, int X, int W,
                             float *chan1value, float *chan2value, float *chan3value)
{
    int r = blockIdx.x + Y;     // one block per row; adding Y handles the non-zero start
    int c = threadIdx.x + X;    // one thread per column (note: offset by X, not W)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // row-major slot in the output buffers
    chan1value[idx] = ...       // some calc'd value using r and c
    chan2value[idx] = ...
    chan3value[idx] = ...
}
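A minimal host-side sketch of how that might be launched, assuming the per-(r,c) results go into device buffers (the d_chan* names and sizes are my assumptions, not from your code) and that W fits in a single block (<= 1024 threads). The Y and X offsets inside the kernel are what handle your loops not starting at zero; if you keep your original += 2 stride, you would instead launch H/2 blocks of W/2 threads and compute r = 2 * blockIdx.x + Y and c = 2 * threadIdx.x + X.

// hypothetical launch: one block per row, one thread per column
float *d_chan1, *d_chan2, *d_chan3;                    // assumed device buffers, H * W floats each
cudaMalloc(&d_chan1, H * W * sizeof(float));
cudaMalloc(&d_chan2, H * W * sizeof(float));
cudaMalloc(&d_chan3, H * W * sizeof(float));
value_kernel<<<H, W>>>(Y, H, X, W, d_chan1, d_chan2, d_chan3);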
I don't believe you can calculate the sum (completely in parallel, at least) in the above kernel; the threads would race on a shared accumulator, so you can't use += across threads the way your loop does. You could put it all in one kernel if only one thread in each block (row) does the sum and mean. For thread 0 to see the values the other threads computed, they have to be staged in shared memory first, like this...
__global__ void both_kernel(int Y, int H, int X, int W,
                            float *ch1Mean, float *ch2Mean, float *ch3Mean)
{
    // per-block scratch space so thread 0 can see every thread's values;
    // launch with 3 * blockDim.x * sizeof(float) bytes of dynamic shared memory
    extern __shared__ float vals[];
    float *ch1 = vals;
    float *ch2 = vals + blockDim.x;
    float *ch3 = vals + 2 * blockDim.x;

    int r = blockIdx.x + Y;     // one block per row
    int c = threadIdx.x + X;    // one thread per column (offset by X, not W)

    ch1[threadIdx.x] = ...      // chan1value: some calc'd value using r and c
    ch2[threadIdx.x] = ...      // chan2value
    ch3[threadIdx.x] = ...      // chan3value
    __syncthreads();            // make every thread's values visible to thread 0

    if (threadIdx.x == 0)
    {
        float ch1RowSum = 0, ch2RowSum = 0, ch3RowSum = 0;
        for (int i = 0; i < blockDim.x; i++)
        {
            ch1RowSum += ch1[i];
            ch2RowSum += ch2[i];
            ch3RowSum += ch3[i];
        }
        // one mean per row; ch1Mean/ch2Mean/ch3Mean are per-row output buffers
        ch1Mean[blockIdx.x] = ch1RowSum / blockDim.x;
        ch2Mean[blockIdx.x] = ch2RowSum / blockDim.x;
        ch3Mean[blockIdx.x] = ch3RowSum / blockDim.x;
    }
}
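If you do go the single-kernel route, the launch has to reserve that dynamic shared memory; something like this (the d_ch*Mean names are assumed per-row output buffers of H floats, not from your code):

// hypothetical launch: one block per row, 3 floats of shared scratch per thread
float *d_ch1Mean, *d_ch2Mean, *d_ch3Mean;
cudaMalloc(&d_ch1Mean, H * sizeof(float));
cudaMalloc(&d_ch2Mean, H * sizeof(float));
cudaMalloc(&d_ch3Mean, H * sizeof(float));
both_kernel<<<H, W, 3 * W * sizeof(float)>>>(Y, H, X, W, d_ch1Mean, d_ch2Mean, d_ch3Mean);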
but it's probably better to use the first value kernel (writing its results out to global memory) and then a second kernel for both the sums and the means. The summing kernel below can be parallelized further, and keeping it separate lets you focus on that when you're ready.
__global__ void sum_kernel(int Y, int W,
                           const float *chan1value, const float *chan2value, const float *chan3value,
                           float *ch1Mean, float *ch2Mean, float *ch3Mean)
{
    // each block handles one row; the value buffers written by value_kernel are
    // indexed per block, so Y is only needed if you want the original row number back
    int row = blockIdx.x * W;   // start of this row in the value buffers

    float ch1RowSum = 0;
    float ch2RowSum = 0;
    float ch3RowSum = 0;
    for (int i = 0; i < W; i++)
    {
        ch1RowSum += chan1value[row + i];
        ch2RowSum += chan2value[row + i];
        ch3RowSum += chan3value[row + i];
    }
    ch1Mean[blockIdx.x] = ch1RowSum / W;
    ch2Mean[blockIdx.x] = ch2RowSum / W;
    ch3Mean[blockIdx.x] = ch3RowSum / W;
}
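When you're ready to parallelize the row sums themselves, the usual approach is a shared-memory tree reduction, where each step halves the number of active threads instead of one thread walking the whole row. This is only a sketch of the idea for one channel, under the same assumptions as above (one block per row, blockDim.x == W, value buffer written by value_kernel) plus the assumption that W is a power of two:

__global__ void rowsum_reduce_kernel(int W, const float *chan1value, float *ch1Mean)
{
    extern __shared__ float s[];                 // blockDim.x floats of scratch
    int row = blockIdx.x * W;

    // each thread loads one value from its row
    s[threadIdx.x] = chan1value[row + threadIdx.x];
    __syncthreads();

    // tree reduction: halve the number of active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        ch1Mean[blockIdx.x] = s[0] / W;          // s[0] now holds the row sum
}

Launched, for example, as rowsum_reduce_kernel<<<H, W, W * sizeof(float)>>>(W, d_chan1, d_ch1Mean);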