
CUDA: memory shared among all threads

I started my adventure with CUDA today. I'm trying to share an unsigned int among all the threads. All the threads modify this value. I copied the value to the device using cudaMemcpy, but when the calculations are finished, the value I read back is 0.

Could it be that several threads are writing to this variable at the same time?
I'm not sure whether I should use a semaphore, lock the variable while a thread writes to it, or something else.

EDIT:

It's hard to say in more detail, because my question is a general one: how do I solve this? I'm not actually writing any algorithm, only testing CUDA.

But if you wish... I created a vector containing some values (unsigned int). I tried something like searching for values bigger than a given shared value: when a value from the vector is bigger, I add 1 to it and store the result in the shared value.

It looks like this:

__global__ void method(unsigned int *a, unsigned int *b, long long unsigned N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N && a[idx] > *b)
        *b = a[idx] + 1;
}


As I said, it's not useful code, only for testing, but I wonder how to do it...
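
For context, a minimal host-side sketch of how a kernel like this is typically driven (the sizes, names, and values here are illustrative, not from the original post):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const unsigned N = 1024;
    unsigned int h_a[N];                     // host input vector
    for (unsigned i = 0; i < N; ++i)
        h_a[i] = i;
    unsigned int h_b = 0;                    // the single shared value

    unsigned int *d_a, *d_b;
    cudaMalloc(&d_a, N * sizeof(unsigned int));
    cudaMalloc(&d_b, sizeof(unsigned int));
    cudaMemcpy(d_a, h_a, N * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &h_b, sizeof(unsigned int), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (N + threads - 1) / threads;
    method<<<blocks, threads>>>(d_a, d_b, N);

    // copy the shared value back and inspect it
    cudaMemcpy(&h_b, d_b, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("shared value after the kernel: %u\n", h_b);

    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}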

Answer

"My question is in general how to use shared memory global for every threads."

For reading you don't need anything special. What you did works; it is faster on Fermi devices because they have a cache, and slower on the others.

If you read the value after other threads have changed it, you have no way to wait for all threads to finish their operations before reading, so the result might not be what you expect. The only way to synchronize a value in global memory between all running threads is to use separate kernels: after you change a value you want to share between all threads, let the kernel finish, then launch a new one that works with the shared value.
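
As a sketch of that pattern (the kernel names and variables here are illustrative): launches on the same stream execute in order, so the second kernel is guaranteed to see what the first one wrote.

__global__ void producer(unsigned int *b) {
    // a single thread publishes the value everyone will share
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *b = 42;
}

__global__ void consumer(unsigned int *b, unsigned int *out, long long unsigned N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        out[idx] = *b;   // safe: producer finished before this kernel started
}

void run(unsigned int *d_b, unsigned int *d_out, long long unsigned N,
         int blocks, int threads) {
    // Kernels on the same (default) stream run in launch order, so no
    // explicit synchronization is needed between the two launches.
    producer<<<1, 1>>>(d_b);
    consumer<<<blocks, threads>>>(d_b, d_out, N);
}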

To make every thread write to the same memory location you must use atomic operations, but keep them to a minimum: atomics effectively serialize execution.
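
One common way to keep them to a minimum is to reduce within each block in shared memory first and issue a single atomic per block. A sketch, assuming the goal is a running maximum and a power-of-two block size of 256:

__global__ void blockMax(const unsigned int *a, unsigned int *b,
                         long long unsigned N) {
    __shared__ unsigned int smax[256];           // one slot per thread in the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    smax[threadIdx.x] = (idx < N) ? a[idx] : 0;  // pad out-of-range lanes with 0
    __syncthreads();

    // tree reduction: halve the number of active threads each step
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smax[threadIdx.x] = max(smax[threadIdx.x], smax[threadIdx.x + s]);
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicMax(b, smax[0]);                   // one atomic per block, not per thread
}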

For the list of available atomic functions, see section B.11 of the CUDA C Programming Guide.

What you asked would be:

__global__ void method(unsigned int *a, unsigned int *b, long long unsigned N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N && a[idx] > *b)          // check the bound first to avoid reading past the array
        //*b = a[idx] + 1;               // racy: many threads can write *b at once
        atomicAdd(b, a[idx] + 1);        // atomic read-modify-write instead
}
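
Note that atomicAdd accumulates into *b every time the condition holds, which is not quite the same as storing a[idx] + 1. If the intent is to end up with the largest a[idx] + 1 any thread saw, atomicMax is the closer primitive; it also makes the explicit comparison unnecessary, since the compare happens inside the atomic (a sketch, same signature as above):

__global__ void method(unsigned int *a, unsigned int *b, long long unsigned N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        atomicMax(b, a[idx] + 1);   // keeps the largest a[idx] + 1 seen so far
}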