Short version of my question: I have a CUDA program where each thread needs to store numbers in different "bins", and I identify each bin by an integer. In a typical run, each CUDA thread might store numbers in only 100 out of millions of bins, so I'd like to know whether there is a data structure other than an array that could hold this data. Each thread would have its own copy of this structure. If I were programming in Python, I would just use a dictionary with the bin numbers as keys, for example mydict[0] = 1.0, mydict[2327632] = 3.0, and at the end of the run I would iterate over the keys and do something with them (bins with nothing stored in them are simply absent from the dictionary, so they are ignored). I tried implementing a hash table for every thread in my CUDA program and it killed performance.
Long version: I have a CUDA Monte Carlo simulation which simulates the transport of particles through a voxelized (simple volume elements) geometry. The particles deposit energy during their transport and this energy is tallied on a voxel-per-voxel basis. The voxels are represented as a linearized 3D grid which is quite large, around 180^3 elements. Each CUDA thread transports 1-100 particles and I usually try to maximize the number of threads that I spawn my kernel with. (Currently, I use 384*512 threads.) The energy deposited in a given voxel is added to the linearized 3D grid which resides in global memory through atomicAdd.
I'm running into problems with the part of my simulation that calculates uncertainties. For a given particle, I have to keep track of where (which voxel indices) it deposits energy, and how much energy it deposits in each voxel, so that I can square this number at the end of the particle's transport before moving on to a new particle. Since I assign each thread one (or a few) particles, this information has to be stored at per-thread scope. The reason I only run into this problem with the uncertainty calculation is that energy deposition can be done as an atomic operation on a global variable every time a thread deposits energy, but the uncertainty calculation has to be done at the end of a particle's transport, so each thread somehow has to keep track of the "history" of its assigned particles.
My first idea was to implement a hash table whose key would be the linearized voxel index, and value would be energy deposited, and I would just square every element in that hash table and add it to a global uncertainty grid after a particle is done transporting. I tried to implement uthash but it destroyed the performance of my code. I'm guessing it caused a huge amount of thread divergence.
I could simply use two dynamic arrays where one stores the voxel index and the other would store the energy deposited for that voxel, but I am thinking that it would also be very bad for performance. I'm hoping that there is a data structure that I don't know about which would lend itself well to being used in a CUDA program. I also tried to include many details in case I am completely wrong in my approach to the problem.
Thank you
Your question is a bit jargon-ful. If you can distill out the science and leave just the computer science, you might get more answers.
There have been CUDA hash tables implemented. The work at that link will be included in the 2.0 release of the CUDPP library. It is already working in the SVN trunk of CUDPP, if you would like to try it.
That said, if you really only need per-thread storage, and not shared storage, you might be able to do something much simpler, like some per-thread scratch space (in shared or global memory) or a local array.
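As a rough illustration of the per-thread scratch space idea, here is a minimal sketch, not a tuned implementation. It assumes a fixed capacity (`kMaxDeposits`, a hypothetical value chosen from the roughly 100 voxels per thread mentioned in the question) and uses a plain linear search in place of hashing. The names `DepositList`, `accumulate`, `flush_squared` and the `HD` macro are all made up for this example. The code is written so that it compiles as ordinary C++ on the host and as device code under nvcc.

```cpp
#include <cassert>

// Compiles as plain C++ on the host, or as host/device code under nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical capacity, sized from the question's estimate of ~100 voxels.
constexpr int kMaxDeposits = 128;

// Per-thread scratch space: parallel arrays of voxel index and energy.
struct DepositList {
    int   voxel[kMaxDeposits];
    float energy[kMaxDeposits];
    int   count = 0;

    // Add energy e to the running total for voxel v (current particle only).
    // Returns false if v is a new voxel and the scratch space is already full.
    HD bool add(int v, float e) {
        assert(v >= 0);
        for (int i = 0; i < count; ++i) {
            if (voxel[i] == v) { energy[i] += e; return true; }
        }
        if (count == kMaxDeposits) return false;
        voxel[count] = v;
        energy[count] = e;
        ++count;
        return true;
    }
};

// atomicAdd when compiled for the device, plain addition on the host.
HD inline void accumulate(float* addr, float value) {
#ifdef __CUDA_ARCH__
    atomicAdd(addr, value);
#else
    *addr += value;
#endif
}

// At the end of a particle's transport: add the square of each voxel's total
// to the uncertainty grid, then empty the list for the next particle.
HD inline void flush_squared(DepositList& d, float* squared_grid) {
    for (int i = 0; i < d.count; ++i) {
        accumulate(&squared_grid[d.voxel[i]], d.energy[i] * d.energy[i]);
    }
    d.count = 0;
}
```

In a kernel, each thread would declare its own `DepositList`, call `add` for every deposit, and call `flush_squared` once the particle is done. Arrays that are indexed dynamically like this generally end up in local memory rather than registers, so whether this beats a hash table is something to measure. What it does guarantee is that there is no dynamic allocation or pointer chasing, and the search stays short as long as each particle touches only a modest number of voxels.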