Manuel Manuel - 5 days ago 5
C Question

OpenCL 2.0 - work_group operations on CPU and GPU

I was testing the following code in order to perform a parallel array element addition in OpenCL 2.0 with the work_groups built-in functions. (inclusive_add and reduce_add in this case)

kernel void(global const float *input,
global float *sum,
local float *scratch)
{
uint local_id = get_local_id(0);
scratch[local_id] = work_group_inclusive_add(input[get_global_id(0)]);

if (local_id == get_local_size(0)-1)
sum[get_group_id(0)] = work_group_reduce_add(scratch[local_id]);
}


If I test it with an array of floats from 0 to 15 with steps of 1, global_size = 16 and local_size = 4 I'm expecting as a result "6.0 22.0 38.0 54.0" and this works fine if I choose my CPU as a device.

But as soon as I choose the GPU and run the same code I get "0.0 4.0 8.0 12.0" (which is just the element in the first position for each work-group)

Am I missing something?

Things that I tried to do but didn't affect a thing:


  1. Adding "barrier(CLK_LOCAL_MEM_FENCE)" before the "if"

  2. Changing the local size and/or the array size / global size.



Notes:


  • I am passing the input array with clEnqueueWriteBuffer and then reading the sum with clEnqueueReadBuffer

  • CPU: i5 6200u

  • GPU: Intel HD Graphics 520



(yes they support OpenCL 2.0 and I can build the kernel successfully with ioc64 passing -cl-std=CL2.0 as I do while building the program at runtime)

Answer

You are getting different results because you are using work_group_reduce_add wrong way.

The OpenCL 2.0 spec says:

This built-in function must be encountered by all work-items in a work-group executing the kernel.

This isn't the case when you call work_group_reduce_add. You need to remove that if statement from there altogether. By adding if statement which allows only one work item to access it you are calculating sum of one just one value. And that is returned to you.

After work_group_scan_inclusive_add the numbers should be as follows:

w1: 0,1,2,3 -> 0,1,3,6
w2: 4,5,6,7 -> 4,9,15,22
w3: 8,9,10,11 -> 8,17,27,38
w4: 12,13,14,15 -> 12,25,39,54

After work_group_reduce_add:

w1: 10
w2: 50
w3: 90
w4: 130

And 2nd thing from the spec:

NOTE: The order of floating-point operations is not guaranteed for the work_group_reduce_, work_group_scan_inclusive_ and work_group_scan_exclusive_ built-in functions that operate on half, float and double data types. The order of these floating-point operations is also non-deterministic for a given workgroup.

So the results after inclusive scan I calculated may not necessary be the same and this is what you are observing on what GPU is returning (GPU is returning 0,4,8,12 which happens to be the last value of each buffer).

To summarize: removing if statement before work_group_reduce_add should fix the issue.

Comments