The CUDA programming guide states that
__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.
So if I need to synchronize threads inside a conditional branch that only some threads of the block may take (the branch that contains the __syncthreads() call), does this mean that it won't work?
I imagine there are all sorts of cases in which you might need to do this. For example, suppose you have a binary mask and need to apply a certain operation to pixels conditionally: if (mask(x, y) != 0), execute the code that includes __syncthreads(); otherwise do nothing. How would that be done?
If you need to go this route you could split the body into two phases:
if (condition)
{
    // code before sync
}
__syncthreads();   // outside the branch, so every thread in the block reaches it
if (condition)     // or remember the result in a flag and test that
{
    // code after sync
}
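To make the two-phase idea concrete for the mask case from the question, here is a minimal sketch. The kernel name maskedNeighbourSum, the host helper runMaskedNeighbourSum, the block size kBlock, and the operation itself (each masked-in pixel adds its right neighbour's staged value) are all made up for illustration; error checking is left out to keep it short.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlock = 256;  // hypothetical block size for this sketch

__global__ void maskedNeighbourSum(const float* in, const unsigned char* mask,
                                   float* out, int n)
{
    __shared__ float tile[kBlock];

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    // Evaluate the condition once and remember it as a flag.
    bool active = (i < n) && (mask[i] != 0);

    tile[t] = 0.0f;        // default, so masked-out slots hold a defined value
    if (active) {          // phase 1: conditional work before the barrier
        tile[t] = in[i];
    }

    __syncthreads();       // outside any branch: every thread reaches it

    if (active) {          // phase 2: conditional work after the barrier
        int right = (t + 1) % blockDim.x;
        out[i] = tile[t] + tile[right];
    }
}

// Hypothetical host helper: runs the kernel and returns the output array.
// CUDA error checking is omitted for brevity.
std::vector<float> runMaskedNeighbourSum(const std::vector<float>& in,
                                         const std::vector<unsigned char>& mask)
{
    assert(!in.empty() && in.size() == mask.size());
    int n = static_cast<int>(in.size());

    float* dIn = nullptr;
    float* dOut = nullptr;
    unsigned char* dMask = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMalloc(&dMask, n);
    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dMask, mask.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, n * sizeof(float));  // masked-out pixels stay 0

    int blocks = (n + kBlock - 1) / kBlock;
    maskedNeighbourSum<<<blocks, kBlock>>>(dIn, dMask, dOut, n);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dMask);
    return out;
}
```

The point is only that the barrier sits between the two guarded regions rather than inside either of them, so threads whose mask value is zero still arrive at it.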
Alternatively, you could use the condition to set a flag that disables certain operations. For example, if you're computing a delta update, you could do the following:
// *ALL* threads compute a delta update; the threads that would have failed
// the condition simply compute garbage that is never used.
// This part can include __syncthreads().
if (condition)
{
    // apply update
}
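A sketch of that second pattern, under some assumptions of my own: the "delta" here is minus the mean of the block's elements, computed with a shared-memory tree reduction that needs several barriers. The names maskedCenter, runMaskedCenter and kBlock are hypothetical, the block size is assumed to be a power of two, and error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlock = 256;  // hypothetical block size; must be a power of two

__global__ void maskedCenter(float* data, const unsigned char* mask, int n)
{
    __shared__ float partial[kBlock];

    int t = threadIdx.x;
    int base = static_cast<int>(blockIdx.x * blockDim.x);
    int i = base + t;
    bool inRange = (i < n);
    bool apply = inRange && (mask[i] != 0);   // the flag

    // Every thread takes part in computing the delta, masked or not.
    float v = inRange ? data[i] : 0.0f;
    partial[t] = v;
    __syncthreads();

    // Tree reduction: all threads run the loop, so each barrier is uniform.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride) {
            partial[t] += partial[t + stride];
        }
        __syncthreads();
    }

    int remaining = n - base;
    int count = (remaining < static_cast<int>(blockDim.x))
                    ? remaining : static_cast<int>(blockDim.x);
    float delta = -partial[0] / count;        // minus the block mean

    // Only the final write is guarded by the flag.
    if (apply) {
        data[i] = v + delta;
    }
}

// Hypothetical host helper: applies the kernel and returns the updated data.
// CUDA error checking is omitted for brevity.
std::vector<float> runMaskedCenter(std::vector<float> data,
                                   const std::vector<unsigned char>& mask)
{
    assert(!data.empty() && data.size() == mask.size());
    int n = static_cast<int>(data.size());

    float* dData = nullptr;
    unsigned char* dMask = nullptr;
    cudaMalloc(&dData, n * sizeof(float));
    cudaMalloc(&dMask, n);
    cudaMemcpy(dData, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dMask, mask.data(), n, cudaMemcpyHostToDevice);

    int blocks = (n + kBlock - 1) / kBlock;
    maskedCenter<<<blocks, kBlock>>>(dData, dMask, n);

    cudaMemcpy(data.data(), dData, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dMask);
    return data;
}
```

Masked-out threads do the same arithmetic as everyone else and just skip the store, which wastes a little work but keeps all the barriers in uniform control flow.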
From 3.0 you can use the warp vote functions to accomplish what __syncthreads() can't. Note that they evaluate a predicate across one warp, not across the whole block:
Warp vote functions are only supported by devices of compute capability 1.2 and higher.
int __all(int predicate); evaluates predicate for all threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for all of them.
int __any(int predicate); evaluates predicate for all threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for any of them.
unsigned int __ballot(int predicate); evaluates predicate for all threads of the warp and returns an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp. This function is only supported by devices of compute capability 2.x.
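For what it's worth, here is a small sketch of the vote functions in use. One caveat: since CUDA 9 the names quoted above are deprecated in favour of __all_sync, __any_sync and __ballot_sync, which take an extra mask of participating lanes as their first argument, so the sketch uses those. The VoteResult struct, the warpVote kernel and the voteOnOneWarp helper are hypothetical names, the launch is a single 32-thread warp, and error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

struct VoteResult {        // hypothetical container for one warp's results
    unsigned int ballot;   // bit N set if lane N's predicate was non-zero
    int any;               // non-zero if any lane's predicate was non-zero
    int all;               // non-zero if every lane's predicate was non-zero
};

__global__ void warpVote(const unsigned char* mask, int n, VoteResult* result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range lanes still vote, with a false predicate.
    int pred = (i < n) && (mask[i] != 0);

    const unsigned int fullWarp = 0xffffffffu;  // all 32 lanes participate
    VoteResult r;
    r.ballot = __ballot_sync(fullWarp, pred);
    r.any = __any_sync(fullWarp, pred);
    r.all = __all_sync(fullWarp, pred);

    // Every lane holds the same result; lane 0 writes it out.
    if (threadIdx.x % warpSize == 0) {
        result[i / warpSize] = r;
    }
}

// Hypothetical host helper: votes over at most 32 mask entries using one warp.
// CUDA error checking is omitted for brevity.
VoteResult voteOnOneWarp(const std::vector<unsigned char>& mask)
{
    assert(!mask.empty() && mask.size() <= 32);
    int n = static_cast<int>(mask.size());

    unsigned char* dMask = nullptr;
    VoteResult* dResult = nullptr;
    cudaMalloc(&dMask, n);
    cudaMalloc(&dResult, sizeof(VoteResult));
    cudaMemcpy(dMask, mask.data(), n, cudaMemcpyHostToDevice);

    warpVote<<<1, 32>>>(dMask, n, dResult);

    VoteResult r;
    cudaMemcpy(&r, dResult, sizeof(VoteResult), cudaMemcpyDeviceToHost);
    cudaFree(dMask);
    cudaFree(dResult);
    return r;
}
```

Because every lane of the warp gets the same vote result, a branch on it is uniform within that warp, but different warps of the same block can still disagree, so this alone does not make a __syncthreads() inside the branch safe.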
Otherwise, there are also the atomic bitwise functions atomicAnd, atomicOr, and atomicXor. See section B.11 of the CUDA Programming Guide.
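One way the atomics could be used here (my own sketch, not something taken from the guide): have each thread OR its per-thread condition into a shared flag, synchronize, and then branch on the flag. After the barrier the flag is the same for every thread in the block, so a __syncthreads() inside that branch is reached by all of them or none. The names blockAny, runBlockAny and kBlock are hypothetical and error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlock = 256;  // hypothetical block size for this sketch

__global__ void blockAny(const unsigned char* mask, int n, int* blockHasWork)
{
    __shared__ int flag;

    if (threadIdx.x == 0) {
        flag = 0;
    }
    __syncthreads();             // flag is initialised before anyone ORs into it

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && mask[i] != 0) {
        atomicOr(&flag, 1);      // per-thread condition folded into a block-wide flag
    }
    __syncthreads();             // all contributions are visible after this

    int blockActive = flag;      // same value in every thread of the block

    if (blockActive) {
        // Uniform branch: either the whole block is here or none of it is,
        // so a barrier inside the branch is reached by every thread.
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        blockHasWork[blockIdx.x] = blockActive;
    }
}

// Hypothetical host helper: returns one 0/1 flag per block.
// CUDA error checking is omitted for brevity.
std::vector<int> runBlockAny(const std::vector<unsigned char>& mask)
{
    assert(!mask.empty());
    int n = static_cast<int>(mask.size());
    int blocks = (n + kBlock - 1) / kBlock;

    unsigned char* dMask = nullptr;
    int* dFlags = nullptr;
    cudaMalloc(&dMask, n);
    cudaMalloc(&dFlags, blocks * sizeof(int));
    cudaMemcpy(dMask, mask.data(), n, cudaMemcpyHostToDevice);

    blockAny<<<blocks, kBlock>>>(dMask, n, dFlags);

    std::vector<int> flags(blocks);
    cudaMemcpy(flags.data(), dFlags, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dMask);
    cudaFree(dFlags);
    return flags;
}
```

This turns a divergent per-thread condition into a per-block one; the per-thread condition can still be tested inside the uniform branch to decide which threads actually do the work.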