I do understand whats the difference between global- and local-memory in general.
But I have problems to use local-memory.
1) What has to be considered by transforming a global-memory variables to local-memory variables?
2) How do I use the local-barriers?
Maybe someone can help me with a little example.
1) Query for CL_DEVICE_LOCAL_MEM_SIZE value, it is 16kB minimum and increses for different hardwares. If your local variables can fit in this and if they are re-used many times, you should put them in local memory before usage. Even if you don't, automatic usage of L2 cache when accessing global memory of a gpu can be still effective for utiliation of cores.
If global-local copy is taking important slice of time, you can do async work group copy while cores calculating things.
Another important part is, more free local memory space means more concurrent threads per core. If gpu has 64 cores per compute unit, only 64 threads can run when all local memory is used. When it has more space, 128,192,...2560 threads can be run at the same time if there are no other limitations.
A profiler can show bottlenecks so you can consider it worth a try or not.
For example, a naive matrix-matrix multiplication using nested loop relies on cache l1 l2 but submatices can fit in local memory. Maybe 48x48 submatices of floats can fit in a mid-range graphics card compute unit and can be used for N times for whole calculation before replaced by next submatrix.
CL_DEVICE_LOCAL_MEM_TYPE querying can return LOCAL or GLOBAL which also says that not recommended to use local memory if it is GLOBAL.
2) Barriers are a meeting point for all workgroup threads in a workgroup. Similar to cyclic barriers, they all stop there, wait for all until continuing. If it is a local barrier, all workgroup threads finish any local memory operations before departing from that point. If you want to give some numbers 1,2,3,4.. to a local array, you can't be sure if all threads writing these numbers or already written, until a local barrier is passed, then it is certain that array will have final values already written.
All workgroup threads must hit same barrier. If one cannot reach it, kernel stucks or you get an error.
__local int localArray; // not each thread. For all threads. // per compute unit. if(localThreadId!=0) localArray[localThreadId]=localThreadId; // 64 values written in O(1) // not sure if 2nd thread done writing, just like last thread if(localThreadId==0) // 1st core of each compute unit loads from VRAM localArray[localThreadId]=globalArray[globalThreadId]; barrier(CLK_LOCAL_MEM_FENCE); // probably all threads wait 1st thread // (maybe even 1st SIMD or // could be even whole 1st wavefront!) // here all threads written their own id to local array. safe to read. // except first element which is a variable from global memory // lets add that value to all other values if(localThreadId!=0) localArrray[localThreadId]+=localArray;