How to make a CUDA histogram kernel?

Developer https://www.devze.com 2023-01-02 22:01 Source: Internet
I am writing a CUDA kernel to compute a histogram of a picture, but I have no idea how to return an array from the kernel, and the array will change while other threads are reading it. Is there any possible solution?

__global__ void Hist(
    TColor *dst, // input image
    int imageW,
    int imageH,
    int *data
){
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    if (ix < imageW && iy < imageH)
    {
        // this assigns the RED value of the image pixel to `pixel`
        int pixel = get_red(dst[imageW * (iy) + (ix)]);

        data[pixel]++; // ?? problem statement ...
    }
}

@param d_dst: the input image; TColor is equivalent to float4.

@param data: the histogram array, size [255]

extern "C" void
cuda_Hist(TColor *d_dst, int imageW, int imageH,int* data) 
{
  dim3 threads(BLOCKDIM_X, BLOCKDIM_Y);
  dim3 grid(iDivUp(imageW, BLOCKDIM_X), iDivUp(imageH, BLOCKDIM_Y));
  Hist<<<grid, threads>>>(d_dst, imageW, imageH, data);
}


Have you looked at the SDK sample? The "histogram" sample is available in the CUDA SDK (currently version 3.0 on the NVIDIA developer site, version 3.1 beta available for registered developers).

The documentation with the sample explains nicely how to handle your summation, either using global memory atomics on the GPU or by collecting the results for each block separately and then doing a separate reduction (either on the host or the GPU).
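As a rough sketch of the global-memory-atomics option (this is not the SDK sample's code), the kernel from the question could be changed to use `atomicAdd`, and the histogram is "returned" by copying the device array back to the host with `cudaMemcpy`. Everything here beyond the question's own names is an assumption: `get_red` is a hypothetical stand-in that maps a red value in [0, 1] to 0..255, `NUM_BINS` is 256, the bins are `unsigned int`, and the 16x16 block size is arbitrary. Error checking is left out for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define NUM_BINS 256  // assumption: 8-bit red channel, values 0..255

typedef float4 TColor;  // the question states that TColor is float4

// Hypothetical stand-in for the question's get_red(): red in [0,1] -> 0..255.
__device__ int get_red(TColor c)
{
    int r = (int)(c.x * 255.0f + 0.5f);
    return min(max(r, 0), NUM_BINS - 1);
}

// One thread per pixel; atomicAdd makes concurrent increments of a bin safe.
__global__ void HistAtomic(const TColor *src, int imageW, int imageH,
                           unsigned int *hist)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix < imageW && iy < imageH)
        atomicAdd(&hist[get_red(src[imageW * iy + ix])], 1u);
}

// Host wrapper: upload the image, zero the bins, launch, copy the bins back.
std::vector<unsigned int> histogram_red(const std::vector<TColor> &img,
                                        int imageW, int imageH)
{
    TColor *d_img = nullptr;
    unsigned int *d_hist = nullptr;
    const size_t imgBytes = img.size() * sizeof(TColor);
    const size_t histBytes = NUM_BINS * sizeof(unsigned int);
    cudaMalloc(&d_img, imgBytes);
    cudaMalloc(&d_hist, histBytes);
    cudaMemcpy(d_img, img.data(), imgBytes, cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, histBytes);  // bins must start at zero

    dim3 threads(16, 16);
    dim3 grid((imageW + threads.x - 1) / threads.x,
              (imageH + threads.y - 1) / threads.y);
    HistAtomic<<<grid, threads>>>(d_img, imageW, imageH, d_hist);

    // This blocking copy waits for the kernel to finish before reading.
    std::vector<unsigned int> hist(NUM_BINS);
    cudaMemcpy(hist.data(), d_hist, histBytes, cudaMemcpyDeviceToHost);
    cudaFree(d_img);
    cudaFree(d_hist);
    return hist;
}

int main()
{
    const int W = 640, H = 480;
    std::vector<TColor> img(W * H);
    std::vector<unsigned int> ref(NUM_BINS, 0);  // CPU reference histogram
    for (int i = 0; i < W * H; ++i) {
        int r = i % NUM_BINS;
        img[i] = make_float4(r / 255.0f, 0.0f, 0.0f, 1.0f);
        ++ref[r];
    }
    std::vector<unsigned int> hist = histogram_red(img, W, H);
    bool ok = (hist == ref);
    printf("%s\n", ok ? "GPU histogram matches CPU reference" : "mismatch");
    return ok ? 0 : 1;
}
```

Note that the device array has to be zeroed before the launch, and that the host only sees the result after the copy back; the kernel itself never returns anything.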


Histogramming is not particularly efficient when implemented with CUDA (or with GPGPU in general) - typically you need to generate lots of partial histograms in shared memory and then sum them. You might want to consider keeping this particular task on the CPU.
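The partial-histogram idea could look roughly like the sketch below: each block accumulates into its own shared-memory histogram and then merges it into the global one. This is only an illustration under stated assumptions, not a tuned implementation: it assumes a GPU that supports atomics on shared memory, it takes the red channel as a flat `unsigned char` array rather than the question's `TColor` image, and the names `HistShared` and `histogram_shared` as well as the 64x256 launch configuration are made up for the example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define NUM_BINS 256  // assumption: 8-bit values, one bin per value

// Each block builds a private histogram in shared memory, then merges it.
__global__ void HistShared(const unsigned char *red, int n, unsigned int *hist)
{
    __shared__ unsigned int local[NUM_BINS];

    // Zero the block's partial histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    // Grid-stride loop: conflicts are confined to this block's shared memory.
    const int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local[red[i]], 1u);
    __syncthreads();

    // Merge the partial histogram into the global result.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        if (local[b] != 0)
            atomicAdd(&hist[b], local[b]);
}

// Hypothetical host wrapper: upload, zero the bins, launch, copy back.
std::vector<unsigned int> histogram_shared(const std::vector<unsigned char> &red)
{
    unsigned char *d_red = nullptr;
    unsigned int *d_hist = nullptr;
    const int n = (int)red.size();
    const size_t histBytes = NUM_BINS * sizeof(unsigned int);
    cudaMalloc(&d_red, red.size());
    cudaMalloc(&d_hist, histBytes);
    cudaMemcpy(d_red, red.data(), red.size(), cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, histBytes);

    HistShared<<<64, 256>>>(d_red, n, d_hist);  // arbitrary launch shape

    std::vector<unsigned int> hist(NUM_BINS);
    cudaMemcpy(hist.data(), d_hist, histBytes, cudaMemcpyDeviceToHost);
    cudaFree(d_red);
    cudaFree(d_hist);
    return hist;
}

int main()
{
    const int n = 1 << 20;
    std::vector<unsigned char> red(n);
    std::vector<unsigned int> ref(NUM_BINS, 0);  // CPU reference histogram
    for (int i = 0; i < n; ++i) {
        red[i] = (unsigned char)((i * 7) % NUM_BINS);
        ++ref[red[i]];
    }
    std::vector<unsigned int> hist = histogram_shared(red);
    bool ok = (hist == ref);
    printf("%s\n", ok ? "GPU histogram matches CPU reference" : "mismatch");
    return ok ? 0 : 1;
}
```

The final merge costs at most `NUM_BINS` global atomics per block, instead of one global atomic per pixel. Whether that beats a CPU loop still depends on the image size and on whether the data is already on the GPU.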


You will have to either use atomic functions, which serialize conflicting updates from different threads to the same memory location, or use partial histograms. Either way, it is not that efficient unless the input image is very large.
