I have an image of size 512 x 512 x 512, and I need to process every voxel individually. How can I compute a thread ID for this? If I use a 1D thread ID, the number of blocks will exceed the limit of 65535.
int id = blockIdx.x*blockDim.x + threadIdx.x;
Note: my card doesn't support 3D grids.
You can use 3D grid indices with CUDA 4.0 and compute capability 2.0+. Example code:
int blocksInX = (nx+8-1)/8;
int blocksInY = (ny+8-1)/8;
int blocksInZ = (nz+8-1)/8;
dim3 Dg(blocksInX, blocksInY, blocksInZ);
dim3 Db(8, 8, 8);
foo_kernel<<<Dg, Db>>>(R, nx, ny, nz);
...
__global__ void foo_kernel( float* R, const int nx, const int ny, const int nz )
{
unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
unsigned int zIndex = blockDim.z * blockIdx.z + threadIdx.z;
if ( (xIndex < nx) && (yIndex < ny) && (zIndex < nz) )
{
unsigned int index_out = xIndex + nx*yIndex + nx*ny*zIndex;
...
R[index_out] = ...;
}
}
If your device doesn't support compute capability 2.0, there is a trick: fold the Z blocks into the Y dimension of a 2D grid and recover both block indices inside the kernel:
int threadsInX = 16;
int threadsInY = 4;
int threadsInZ = 4;
int blocksInX = (nx+threadsInX-1)/threadsInX;
int blocksInY = (ny+threadsInY-1)/threadsInY;
int blocksInZ = (nz+threadsInZ-1)/threadsInZ;
dim3 Dg = dim3(blocksInX, blocksInY*blocksInZ);
dim3 Db = dim3(threadsInX, threadsInY, threadsInZ);
foo_kernel<<<Dg, Db>>>(R, nx, ny, nz, blocksInY, 1.0f/(float)blocksInY);
__global__ void foo_kernel(float *R, const int nx, const int ny, const int nz,
unsigned int blocksInY, float invBlocksInY)
{
unsigned int blockIdxz = __float2uint_rd(blockIdx.y * invBlocksInY);
unsigned int blockIdxy = blockIdx.y - __umul24(blockIdxz, blocksInY);
unsigned int xIndex = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
unsigned int yIndex = __umul24(blockIdxy, blockDim.y) + threadIdx.y;
unsigned int zIndex = __umul24(blockIdxz, blockDim.z) + threadIdx.z;
if ( (xIndex < nx) && (yIndex < ny) && (zIndex < nz) )
{
unsigned int index = xIndex + nx*yIndex + nx*ny*zIndex;
...
R[index] = ...;
}
}
You could use a 2D grid, which gives you many more block indices.
Note that memory is not three-dimensional. The 3D layout is only a way of viewing the data, so you can treat your 3D image as a flat array behind a single pointer.
Array[i][j][z] is the same as Array2[i*cols + j + rows*cols*z];
Now pass Array2 to CUDA and work in a single dimension.
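As a rough sketch of how a flat array can be indexed when one grid dimension is not enough: a 2D grid of 1D blocks can be flattened into a single linear voxel ID. The names scale_voxels, make_grid_2d and launch_scale, and the scaling operation itself, are hypothetical placeholders; the 65535 cap assumes the per-dimension grid limit mentioned in the question.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical per-voxel operation: scale every voxel of a flattened volume.
__global__ void scale_voxels(float *data, size_t n, float factor)
{
    // Flatten the 2D grid into one linear block number, then into a thread ID.
    size_t blockId = (size_t)blockIdx.y * gridDim.x + blockIdx.x;
    size_t id = blockId * blockDim.x + threadIdx.x;
    if (id < n)   // the trailing blocks may run past the end of the volume
        data[id] *= factor;
}

// Pick a 2D grid with enough blocks for n voxels while keeping each
// grid dimension at or below 65535 (assumed limit).
dim3 make_grid_2d(size_t n, unsigned int threadsPerBlock)
{
    size_t blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    if (blocks == 0) blocks = 1;   // avoid an empty grid
    unsigned int gx = blocks < 65535 ? (unsigned int)blocks : 65535u;
    unsigned int gy = (unsigned int)((blocks + gx - 1) / gx);
    return dim3(gx, gy, 1);
}

// Host-side launch; d_data must already point to device memory.
void launch_scale(float *d_data, size_t n, float factor)
{
    const unsigned int threads = 256;
    scale_voxels<<<make_grid_2d(n, threads), threads>>>(d_data, n, factor);
}
```

For a 512 x 512 x 512 volume with 256 threads per block, this needs 524288 blocks, which make_grid_2d arranges as a 65535 x 9 grid; the bounds check in the kernel discards the surplus threads.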
If you need a larger grid, CUDA supports 2D grids on all hardware, and the most recent versions of the CUDA toolkit also support 3D grids on current Fermi hardware.
However, it isn't strictly necessary to have such large grids. If each voxel operation is independent, then why not just use a 1D grid, but have each thread process more than one voxel? Not only would such a scheme not need larger 2D or 3D grids, it might well be more efficient because the fixed costs associated with scheduling and initialization of a block can be amortized over multiple voxel calculations.
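A minimal sketch of that idea, assuming a grid-stride loop is an acceptable way to let each thread handle several voxels. The names scale_voxels_strided, blocks_for and launch_scale_strided are hypothetical, the scaling stands in for whatever the real per-voxel operation is, and the cap of 1024 blocks is an arbitrary choice for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical per-voxel operation: each thread walks the flattened volume
// in steps of the total number of threads in the grid.
__global__ void scale_voxels_strided(float *data, size_t n, float factor)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;   // total threads launched
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += stride)
    {
        data[i] *= factor;
    }
}

// Number of blocks: enough to cover n voxels, but never more than maxBlocks.
unsigned int blocks_for(size_t n, unsigned int threadsPerBlock,
                        unsigned int maxBlocks)
{
    size_t needed = (n + threadsPerBlock - 1) / threadsPerBlock;
    if (needed == 0) needed = 1;   // avoid an empty grid
    return needed < maxBlocks ? (unsigned int)needed : maxBlocks;
}

// Host-side launch; d_data must already point to device memory.
void launch_scale_strided(float *d_data, size_t n, float factor)
{
    const unsigned int threads = 256;
    const unsigned int blocks = blocks_for(n, threads, 1024);  // arbitrary cap
    scale_voxels_strided<<<blocks, threads>>>(d_data, n, factor);
}
```

With these numbers, a 512 x 512 x 512 volume is covered by 1024 blocks of 256 threads, so each thread processes 512 voxels and the grid stays well inside the 1D limit.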
I used something like this:
In the code, define your grid:
dim3 altgrid, altthreads;
altgrid.x = lx; altgrid.y = ly; altgrid.z = 1;
altthreads.x = lz; altthreads.y = 1; altthreads.z = 1;
and in the kernel
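int idx = blockIdx.x;   // 0 .. lx-1, since altgrid.x = lx
int idy = blockIdx.y;   // 0 .. ly-1, since altgrid.y = ly
int idz = threadIdx.x;  // 0 .. lz-1, since altthreads.x = lz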
Since the array on the device is only 1D, you retrieve the [idx][idy][idz] element of a matrix A as A[ind], where ind = idz + lz*(idy + ly*idx);
I hope it helps