I'm really new to CUDA and have been trying to traverse a 2D array. I have the following code which works as expected on plain C:
for (ty = 0; ty < s; ty++) {
    if (ty + pixY < s && ty + pixY >= 0) {
        for (tx = 0; tx < r; tx++) {
            T[ty/3][tx/3] += (tx + pixX < s && tx + pixX >= 0) ?
                *(image + M*(ty+pixY) + tx+pixX) * *(filter + fw*(ty%3) + tx%3) : 0;
        }
    }
}
Maybe I'm getting something wrong, but wouldn't this code translate to CUDA as follows?
tx = threadIdx.x;
ty = threadIdx.y;
T[ty/3][tx/3] += (tx + pixX < s && tx + pixX >= 0) ?
    *(image + M*(ty+pixY) + tx+pixX) * *(filter + fw*(ty%3) + tx%3) : 0;
provided I have defined my kernel launch parameters as dimGrid(1,1,1) and dimBlock(r,s,1)?
I ask because I'm getting unexpected results. Also, if I properly declare and allocate my arrays as 2D CUDA arrays instead of just one big 1D array, will this help?
Thanks for your help.
Leaving aside whether the array allocation and indexing schemes are correct (I am not sure there is enough information in the post to confirm that), and the fact that integer division and modulo are slow and should be avoided, you have a much more fundamental problem: a memory race.
Multiple threads within the single block you are using will attempt to read and write the same entry of T at the same time: every thread pair (tx, ty) with the same (tx/3, ty/3) targets the same entry, so nine threads race on each one. CUDA makes no guarantees about the correctness of this sort of operation, and it is almost certainly not going to work. The simplest alternative is to have a single thread compute each T[][] entry, rather than nine. This eliminates the memory race.
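A minimal sketch of that one-thread-per-entry approach, under some assumptions not stated in the post: the data is `float`, `image` and `filter` are flat device arrays, `T` is flattened with a hypothetical row stride `Tcols`, and `M`, `fw`, `r`, `s`, `pixX`, `pixY` have the same meanings as in the question's C loop. The kernel name and parameter layout are illustrative, not from the original code.

```cuda
// One thread per output entry of T: thread (k, j) is the sole writer of
// T[j][k], so there is no race. Each thread loops over the 3x3 window
// that nine separate threads covered in the original translation.
__global__ void filter_kernel(const float *image, const float *filter,
                              float *T, int Tcols,
                              int M, int fw, int r, int s,
                              int pixX, int pixY)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // output row
    if (3 * k >= r || 3 * j >= s) return;           // outside the tx/ty range

    float acc = 0.0f;
    for (int dy = 0; dy < 3; dy++) {
        int ty = 3 * j + dy;
        // Same bounds tests as the original inner loop body.
        if (ty >= s || ty + pixY < 0 || ty + pixY >= s) continue;
        for (int dx = 0; dx < 3; dx++) {
            int tx = 3 * k + dx;
            if (tx >= r || tx + pixX < 0 || tx + pixX >= s) continue;
            acc += image[M * (ty + pixY) + tx + pixX]
                 * filter[fw * dy + dx];
        }
    }
    T[j * Tcols + k] += acc;  // single writer per entry: race-free
}
```

Launched, for example, with a 2D grid sized to cover `(r + 2) / 3` columns and `(s + 2) / 3` rows of T, this also avoids the per-thread integer division and modulo of the original indexing.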