I'm trying to think of a way to implement the following algorithm using CUDA:
Working on a large volume of voxels, for each voxel I calculate an index i and a value c. After the calculation I need to perform histogram[i] += c. c is a float value and the histogram can have up to 15,000 bins.
I'm looking for a way to implement this efficiently using CUDA. The first obvious problem is that with compute capability 1.3, which is what I'm using, I can't even do an atomicAdd() on floats, so how can I accumulate anything reliably?
This example by NVIDIA does something somewhat simpler: the histograms are kept in shared memory (which I can't do because of my histogram's size) and it only accumulates integers. Can this approach be generalized to my case?
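For reference, the only workaround I know of for float atomic adds on this hardware is the usual atomicCAS loop, something like the sketch below, but I doubt that hammering global memory with it once per voxel is the efficient approach I'm after:

    // Usual CAS-based float add for GPUs without a native float atomicAdd();
    // atomicCAS on 32-bit words in global memory needs compute capability 1.1+.
    __device__ float atomicAddFloat(float *address, float val)
    {
        int *address_as_int = (int *)address;
        int old = *address_as_int, assumed;
        do {
            assumed = old;
            // Reinterpret the bits, add, and try to swap; retry if another
            // thread changed the bin in the meantime.
            old = atomicCAS(address_as_int, assumed,
                            __float_as_int(val + __int_as_float(assumed)));
        } while (assumed != old);
        return __int_as_float(old);
    }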
A two-step approach will probably be best. You can create multiple histograms, each covering a subset of the voxels, and then sum them all up into a global histogram.
Assuming you have N voxels, first allocate global device memory of size M x 15,000 (where M < N but M is big enough to keep all the cores busy).
Run a CUDA kernel in which each thread computes the histogram of its own N/M voxels. After all the threads have finished, run another CUDA kernel that sums the M partial histograms into your final histogram.
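A minimal sketch of what those two kernels might look like, assuming the voxels arrive as a flat float array and the per-voxel index/value computation is hidden behind a hypothetical compute_voxel() device function (the body below is just a placeholder):

    #include <cuda_runtime.h>

    #define NUM_BINS 15000

    // Hypothetical per-voxel computation (placeholder only): derives the bin
    // index i and the float contribution c from one voxel.
    __device__ void compute_voxel(float voxel, int *i, float *c)
    {
        *i = min(max((int)voxel, 0), NUM_BINS - 1);
        *c = 1.0f;
    }

    // Step 1: thread t accumulates roughly N/M voxels into its own private row
    // of an M x NUM_BINS array in global memory, so no atomics are needed.
    __global__ void partial_histograms(const float *voxels, int n, int m,
                                       float *partial)  // m * NUM_BINS, zeroed
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= m) return;
        float *my_hist = partial + (size_t)t * NUM_BINS;
        for (int v = t; v < n; v += m) {     // strided so the work stays balanced
            int i; float c;
            compute_voxel(voxels[v], &i, &c);
            my_hist[i] += c;                 // private row: plain add, no races
        }
    }

    // Step 2: one thread per bin sums that bin across the M partial histograms.
    __global__ void reduce_histograms(const float *partial, int m, float *histogram)
    {
        int bin = blockIdx.x * blockDim.x + threadIdx.x;
        if (bin >= NUM_BINS) return;
        float sum = 0.0f;
        for (int t = 0; t < m; ++t)
            sum += partial[(size_t)t * NUM_BINS + bin];
        histogram[bin] = sum;
    }

The price of avoiding atomics entirely is the M x 15,000 float scratch buffer, so you would pick M just large enough to keep the GPU busy (a few thousand threads, say).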
Histogramming directly to external memory using atomicAdd() is going to be very inefficient. Depending on the range of your histogram, you might want to consider multiple passes through your source data, updating a partial histogram in shared memory and then writing out that partial histogram after each pass. As long as you don't have to make a large number of passes through the input data, this should be a lot more efficient. Obviously you would want to make the partial histogram as large as possible, subject to the limits of available shared memory, in order to minimise the number of passes required.
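A rough sketch of one pass of this idea, assuming a hypothetical compute_voxel() helper, a CAS-based float add like the one sketched in the question (shared-memory atomicCAS needs compute capability 1.2 or later, so 1.3 is fine), and a BINS_PER_PASS chosen to fit in the 16 KB of shared memory on that hardware:

    #include <cuda_runtime.h>

    #define NUM_BINS      15000
    #define BINS_PER_PASS 3000   // 3,000 floats = 12 KB, fits in 16 KB shared memory

    // CAS-based float add (as sketched in the question); works on shared memory
    // addresses from compute capability 1.2 onwards.
    __device__ float atomicAddFloat(float *address, float val)
    {
        int *address_as_int = (int *)address;
        int old = *address_as_int, assumed;
        do {
            assumed = old;
            old = atomicCAS(address_as_int, assumed,
                            __float_as_int(val + __int_as_float(assumed)));
        } while (assumed != old);
        return __int_as_float(old);
    }

    // Hypothetical per-voxel computation (placeholder only).
    __device__ void compute_voxel(float voxel, int *i, float *c)
    {
        *i = min(max((int)voxel, 0), NUM_BINS - 1);
        *c = 1.0f;
    }

    // One pass: handles only bins in [bin_offset, bin_offset + BINS_PER_PASS).
    // Launch it repeatedly with bin_offset = 0, BINS_PER_PASS, 2*BINS_PER_PASS, ...
    __global__ void histogram_pass(const float *voxels, int n,
                                   int bin_offset, float *histogram)
    {
        __shared__ float s_hist[BINS_PER_PASS];

        // Zero this block's partial histogram.
        for (int b = threadIdx.x; b < BINS_PER_PASS; b += blockDim.x)
            s_hist[b] = 0.0f;
        __syncthreads();

        // Grid-stride loop over all voxels; only bins in this pass's window count.
        for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < n;
             v += gridDim.x * blockDim.x) {
            int i; float c;
            compute_voxel(voxels[v], &i, &c);
            if (i >= bin_offset && i < bin_offset + BINS_PER_PASS)
                atomicAddFloat(&s_hist[i - bin_offset], c);   // shared-memory add
        }
        __syncthreads();

        // Flush: one global atomic per non-empty bin per block, not one per voxel.
        for (int b = threadIdx.x; b < BINS_PER_PASS; b += blockDim.x)
            if (s_hist[b] != 0.0f)
                atomicAddFloat(&histogram[bin_offset + b], s_hist[b]);
    }

With 15,000 bins and 3,000 bins per pass that works out to five passes over the input, which is the trade-off described above: more passes versus fewer, far cheaper, global atomics.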