I'm trying to convolve a 256x256 array of data with a 3x3 filter on a GPU using shared memory. I understand that I should break the array up into blocks and then apply the filter within each block. This means that neighboring blocks will overlap along their edges, and that some padding is needed around the edges of the array, where there is no data, so that the filter works properly.
int grid = (256/(16+3-1))*(256/(16+3-1));
where 256 is the width (and height) of my array, 16 is the width of my block in shared memory, 3 is the width of my filter, and I subtract one so that the result is even.
int thread = (16+3-1)*(16+3-1);
Now I call my kernel as kernel<<<grid, thread>>>(output, input, 256), where input and output are arrays of size 256*256.
__global__ void kernel(float *input, float *output, int size)
{
    __shared__ float tile[16+3-1][16+3-1];
    unsigned int bIdx = blockIdx.x;
    unsigned int bIdy = blockIdx.y;
    unsigned int tIdx = threadIdx.x;
    unsigned int tIdy = threadIdx.y;
    // i is for input
    unsigned int iX = bIdx * 3 + tIdx;
    unsigned int iY = bIdy * 3 + tIdy;
    if (tIdx == 0 || tIdx == width || tIdy == 0 || tIdy == height)
    {
        // this will pad the outside edges
        tile[tIdy][tIdx] = 0;
    }
    else
    {
        // this will fill in the block with real data
        unsigned int iin = iY * size + iX;
        tile[tIdy][tIdx] = input[iin];
    }
    __syncthreads();
    // I believe the above is correct; below, where I do the convolution, I feel something is wrong
    float result = 0;
    for (int fX = -N/2; fX <= N/2; fX++) {
        for (int fY = -N/2; fY <= N/2; fY++) {
            if (iY+fX >= 0 && iY+fX < size && iX+fY >= 0 && iX+fY < size)
                result += tile[tIdx+fX][tIdy+fY];
        }
    }
    output[iY*size+iX] = result/(3*3);
}
When I run the code with the convolution part included, I get a kernel error. Any insights or suggestions?
Check out the sobelFilter SDK sample.
It uses a texture to deal with the edge cases, overfetches blocks slightly (but the texture cache makes that more efficient), and uses shared memory for the processing.
The subtle thing about shared memory is that you get 4-way bank conflicts if adjacent threads read adjacent bytes. One way around this, illustrated in the sobelFilter sample, is to unroll your loop 4x so that each thread accesses every fourth byte.
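For comparison with the kernel in the question, here is a minimal sketch of the overfetch-into-shared-memory idea without textures (so it is not how the sobelFilter sample does it). It assumes a 2D launch in which each 18x18 thread block loads an 18x18 tile and only the inner 16x16 threads write output, so blocks advance by 16 pixels and their tiles overlap by the halo. The names boxFilterTiled, boxFilter3x3, TILE, FILTER, RADIUS and SMEM are made up for illustration, and the filter is assumed to be the plain 3x3 average used in the question, with zero padding outside the image.

```cuda
#include <cuda_runtime.h>

#define TILE   16                    // output pixels per block side
#define FILTER 3                     // filter width
#define RADIUS (FILTER / 2)          // halo width on each side of the tile
#define SMEM   (TILE + FILTER - 1)   // shared tile side, 18 here

// 3x3 mean filter; pixels outside the image are treated as zero.
__global__ void boxFilterTiled(const float *input, float *output, int size)
{
    __shared__ float tile[SMEM][SMEM];

    const int tx = threadIdx.x, ty = threadIdx.y;
    // Blocks step by TILE (not SMEM), so neighbouring tiles overlap by the halo.
    const int x = blockIdx.x * TILE + tx - RADIUS;
    const int y = blockIdx.y * TILE + ty - RADIUS;

    // Every thread loads one element; positions outside the image become padding.
    const bool inImage = (x >= 0 && x < size && y >= 0 && y < size);
    tile[ty][tx] = inImage ? input[y * size + x] : 0.0f;
    __syncthreads();

    // Only threads away from the tile border have a full neighbourhood loaded.
    const bool interior = tx >= RADIUS && tx < SMEM - RADIUS &&
                          ty >= RADIUS && ty < SMEM - RADIUS;
    if (interior && inImage) {
        float sum = 0.0f;
        for (int fy = -RADIUS; fy <= RADIUS; ++fy)
            for (int fx = -RADIUS; fx <= RADIUS; ++fx)
                sum += tile[ty + fy][tx + fx];   // row index first, then column
        output[y * size + x] = sum / (FILTER * FILTER);
    }
}

// Hypothetical host helper: copies the image to the device, launches the
// kernel, and copies the result back. Returns the first CUDA error seen.
cudaError_t boxFilter3x3(const float *hIn, float *hOut, int size)
{
    const size_t bytes = (size_t)size * size * sizeof(float);
    float *dIn = NULL, *dOut = NULL;

    cudaError_t err = cudaMalloc((void **)&dIn, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc((void **)&dOut, bytes);
    if (err != cudaSuccess) { cudaFree(dIn); return err; }

    err = cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(SMEM, SMEM);                      // 18 x 18 threads
        dim3 grid((size + TILE - 1) / TILE,          // one block per 16 x 16
                  (size + TILE - 1) / TILE);         // output tile
        boxFilterTiled<<<grid, block>>>(dIn, dOut, size);
        err = cudaGetLastError();                    // catches launch errors
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return err;
}
```

Calling boxFilter3x3(in, out, 256) on two host arrays of 256*256 floats launches a 16x16 grid of 18x18-thread blocks. Compared with the question's kernel, the differences in this sketch are the 2D grid and block, the block stride of 16 rather than 3, the halo offset subtracted from the global coordinates, and the restriction of the convolution loop to threads whose whole neighbourhood lies inside the shared tile.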