I have two large vectors, I am trying to do some sort of element multiplication, where an even-numbered element in the first ve开发者_运维问答ctor is multiplied by the next odd-numbered element in the second vector... and where the odd-numbered element in the first vector is multiplied by the preceding even-numbered element in the second vector.
For example:
vector 1 is V1(1) V1(2) V1(3) V1(4)
vector 2 is V2(1) V2(2) V2(3) V2(4) V1(1) * V2(2) V1(3) * V2(4) V1(2) * V2(1) V1(4) * V2(3)I have written Cuda code to do this (Pds has the elements of the first vector in shared memory, Nds the second Vector):
// instead of % 2, checking the first bit to decide if a number
// is odd/even is faster
if ((tx & 0x0001) == 0x0000)
Nds[tx+1] = Pds[tx] * Nds[tx+1];
else
Nds[tx-1] = Pds[tx] * Nds[tx-1];
__syncthreads();
Is there anyway to further accelerate this code or avoid divergence?
You should be able to eliminate the branch like this:
int tx_index = tx ^ 1; // equivalent to: tx_index = (tx & 1) ? tx - 1 : tx + 1
Nds[tx_index] = Pds[tx] * Nds[tx_index];
This is an old post, may be someone finds my answer useful. If in your code tx is threadIdx, then you have branching or warp divergence. You must avoid divergence in blocks, because it serializes the process. It means that the threads with even indices will run first, and then threads with odd indices will run. If tx is threadIdx, try to change your algorithm such that branching depends on blockIdx.
精彩评论