Serge Rogatch Serge Rogatch - 2 months ago 20
C++ Question

How to reduce in CUDA if __syncthreads can't be called inside conditional branches?

The reduction method suggested by NVIDIA uses

__syncthreads()
inside conditional branching e.g.:

if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }


or

for (unsigned int s=blockDim.x/2; s>32; s>>=1)
{
if (tid < s)
sdata[tid] += sdata[tid + s];
__syncthreads();
}


In the second example
__syncthreads()
is inside
for
loop body, which is also a conditional branch.

However, a number of questions on SO raise the problem of
__syncthreads()
inside conditional branches (e.g. Can I use __syncthreads() after having dropped threads? and conditional syncthreads & deadlock (or not) ), and the answers say that
__syncthreads()
in conditional branches may lead to a deadlock. Consequently, reduction method suggested by NVIDIA may deadlock (if believing the documentation on which the answers are based).

Furthermore, if
_syncthreads()
can't be used inside conditional branches, then I'm afraid that many of the basic operations are blocked and reduction is just an example.

So how to do reduction in CUDA without using
__syncthreads()
in conditional branches? Or is it a bug in the documentation?

Answer

The limitation is not

__syncthreads cannot be used in conditional branches

The limitation is

__syncthreads cannot be used in branches which will not be traversed by all threads at the same time

Notice that in both the examples you give, __syncthreads is not covered by a condition that would depend on the thread ID (or some per-thread data). In the first case, it's after the } closing the if. In the second case, it's likewise after the if.

Yes, the for loop's s > 32 is a condition, but it is a condition whose truth value does not depend on the thread or its data in any way. blockdim.x is the same for all threads. And all threads will execute exactly the same modifications of s. Which means that all threads will reach the __syncthreads in exactly the same point of their control flow. Which is perfectly OK.

The other case, where you cannot use __syncthreads, is a condition which can be true for some threads and false for other ones. In such case, you have to close all conditions to use __syncthreads. So instead of this:

if (threadIdx.x < SOME_CONSTANT)
{
  operation1();
  __syncthreads();
  operation2();
}

You must do this:

if (threadIdx.x < SOME_CONSTANT)
{
  operation1();
}
__syncthreads();
if (threadIdx.x < SOME_CONSTANT)
{
  operation2();
}