I was testing the following code in order to perform a parallel array element addition in OpenCL 2.0 with the work_groups built-in functions. (inclusive_add and reduce_add in this case)
kernel void(global const float *input,
global float *sum,
local float *scratch)
uint local_id = get_local_id(0);
scratch[local_id] = work_group_inclusive_add(input[get_global_id(0)]);
if (local_id == get_local_size(0)-1)
sum[get_group_id(0)] = work_group_reduce_add(scratch[local_id]);
You are getting different results because you are using
work_group_reduce_add wrong way.
The OpenCL 2.0 spec says:
This built-in function must be encountered by all work-items in a work-group executing the kernel.
This isn't the case when you call
You need to remove that
if statement from there altogether. By adding if statement which allows only one work item to access it you are calculating sum of one just one value. And that is returned to you.
work_group_scan_inclusive_add the numbers should be as follows:
w1: 0,1,2,3 -> 0,1,3,6 w2: 4,5,6,7 -> 4,9,15,22 w3: 8,9,10,11 -> 8,17,27,38 w4: 12,13,14,15 -> 12,25,39,54
w1: 10 w2: 50 w3: 90 w4: 130
And 2nd thing from the spec:
NOTE: The order of floating-point operations is not guaranteed for the work_group_reduce_, work_group_scan_inclusive_ and work_group_scan_exclusive_ built-in functions that operate on half, float and double data types. The order of these floating-point operations is also non-deterministic for a given workgroup.
So the results after inclusive scan I calculated may not necessary be the same and this is what you are observing on what GPU is returning (GPU is returning 0,4,8,12 which happens to be the last value of each buffer).
To summarize: removing if statement before
work_group_reduce_add should fix the issue.