I have a short piece of code like this:
typedef struct {
    double sX;
    double sY;
    double vX;
    double vY;
    int rX;
    int rY;
    int mass;
    int species;
    int boxnum;
} particle;

typedef struct {
    double mX;
    double mY;
    double count;
    int rotDir;
    double cX;
    double cY;
    int superDir;
} box;
//....
int i, boxnum;
for (i = 0; i < PART_COUNT; i++) {
    particles[i].boxnum = ((((int)(particles[i].sX + boxShiftX)) / BOX_SIZE) % BWIDTH
                           + BWIDTH * ((((int)(particles[i].sY + boxShiftY)) / BOX_SIZE) % BHEIGHT));
}
for (i = 0; i < PART_COUNT; i++) {
    // sum the momenta
    boxnum = particles[i].boxnum;
    boxes[boxnum].mX += particles[i].vX * particles[i].mass;
    boxes[boxnum].mY += particles[i].vY * particles[i].mass;
    boxes[boxnum].count++;
}
Now, I want to port this to CUDA. The first step is easy: spreading the calculation across a bunch of threads is no problem. The issue is the second loop. Since any two particles can end up in the same box, I'm not sure how to partition the work so as to avoid conflicts.
The number of particles is on the order of 10,000 to 10,000,000, and the number of boxes is on the order of 1024 to 1048576.
Ideas?
You can try to use atomicAdd operations to modify your boxes array. Atomic operations on global memory are slow, but at the same time it's hardly possible to do any optimization involving shared memory here, for two reasons:
- Assuming that the boxnum values of particles[0]..particles[n] are not ordered and do not lie within any small range (on the order of a block size), you can't predict which boxes to load from global memory into shared memory. You would first have to collect all the box numbers.
- If you try to collect all the box numbers, you can't use an array indexed by every possible box number, since there are far too many boxes to fit into shared memory. So you'd have to collect indices with a queue (realized with an array, a pointer to the next free slot, and atomic operations), but then you'd still have conflicts because the same box number could occur multiple times in your queue.
Conclusion: atomicAdd will at least give you correct behavior. Try it out and measure the performance. If you aren't satisfied with it, think about whether there's another way to do the same computation that would profit from shared memory.
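A minimal sketch of what that could look like (the kernel name and launch configuration are mine, not from your code; also note that native atomicAdd on double requires compute capability 6.0 or newer, so on older hardware you'd fall back to float fields or an atomicCAS-based workaround):

// One thread per particle; each thread atomically folds its particle's
// momentum into the owning box directly in global memory.
__global__ void sumMomenta(const particle *particles, box *boxes, int partCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= partCount)
        return;

    int b = particles[i].boxnum;
    atomicAdd(&boxes[b].mX, particles[i].vX * particles[i].mass);
    atomicAdd(&boxes[b].mY, particles[i].vY * particles[i].mass);
    atomicAdd(&boxes[b].count, 1.0);   // count is declared double in the box struct
}

// possible launch:
// sumMomenta<<<(PART_COUNT + 255) / 256, 256>>>(d_particles, d_boxes, PART_COUNT);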
As an alternative, you could launch a 2D grid of blocks.
blocks.x = numParticles / threadsPerBlock / repeatPerBlock;
blocks.y = numOfBoxes / 1024;
Each block performs atomic additions in shared memory if and only if boxnum lies between 1024 * blockIdx.y and 1024 * (blockIdx.y + 1). This is followed by a reduction along blocks.x.
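A rough sketch of such a kernel, assuming 1024 boxes per blockIdx.y as described above (the names, the float accumulators, and folding the blocks.x reduction into global atomicAdds at the end are simplifications of mine, not part of the recipe):

#define BOXES_PER_TILE 1024   // boxes owned by one blockIdx.y

// Each blockIdx.y owns the box range [BOXES_PER_TILE*blockIdx.y, BOXES_PER_TILE*(blockIdx.y+1)).
// Blocks along x stride over the particle array, bin only the particles whose
// boxnum falls inside their tile into shared memory, and then flush the tile
// to global memory with atomic additions.
__global__ void binMomentaTiled(const particle *particles, int partCount,
                                float *mX, float *mY, float *count, int numBoxes)
{
    __shared__ float sMX[BOXES_PER_TILE];
    __shared__ float sMY[BOXES_PER_TILE];
    __shared__ float sCount[BOXES_PER_TILE];

    int tileStart = BOXES_PER_TILE * blockIdx.y;

    // zero this block's shared tile
    for (int b = threadIdx.x; b < BOXES_PER_TILE; b += blockDim.x) {
        sMX[b] = 0.0f;
        sMY[b] = 0.0f;
        sCount[b] = 0.0f;
    }
    __syncthreads();

    // stride over the particles; only those in this tile's box range are binned here
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < partCount;
         i += gridDim.x * blockDim.x) {
        int b = particles[i].boxnum - tileStart;
        if (b >= 0 && b < BOXES_PER_TILE) {
            atomicAdd(&sMX[b], (float)(particles[i].vX * particles[i].mass));
            atomicAdd(&sMY[b], (float)(particles[i].vY * particles[i].mass));
            atomicAdd(&sCount[b], 1.0f);
        }
    }
    __syncthreads();

    // combine the per-block partial tiles in global memory
    for (int b = threadIdx.x; b < BOXES_PER_TILE; b += blockDim.x) {
        int gb = tileStart + b;
        if (gb < numBoxes && sCount[b] > 0.0f) {
            atomicAdd(&mX[gb], sMX[b]);
            atomicAdd(&mY[gb], sMY[b]);
            atomicAdd(&count[gb], sCount[b]);
        }
    }
}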
This may or may not be faster than atomicAdd on global memory, since the particle data is read blocks.y times. That could, however, be fixed if the particles are sorted by boxnum in a sorting pass followed by a partitioning pass.
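If you go the sorting route, Thrust's sort_by_key is one way to do the sorting pass (just a suggestion; the approach above doesn't prescribe a particular tool):

#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Reorder the particles so that all particles with the same boxnum become
// contiguous; boxKeys[i] is assumed to already hold particles[i].boxnum.
void sortParticlesByBox(thrust::device_vector<int> &boxKeys,
                        thrust::device_vector<particle> &parts)
{
    thrust::sort_by_key(boxKeys.begin(), boxKeys.end(), parts.begin());
}

After the sort, each box's particles form a contiguous run, so a block only needs to read the slice of the array that belongs to its boxes.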
There may be several other ways to do it, but since the problem size varies so widely, you may end up having to write two or three different methods, each optimized for a given size range.