Lukas Palmer - 3 months ago
Question

Intel HD 6000 local memory bandwidth with OpenCL

I'm working on some local/global memory optimization in OpenCL; after looking at this question from two years ago, I think I'm doing something wrong since local memory IO seems to be considerably slower than it should be. My GPU is an Intel HD 6000.

This is my testing setup, with kernel source:

__kernel void vecAdd(__global float* results, const unsigned int n, __local float* loc)
{
    int id = get_global_id(0);
    if(id < n) {
        float rtemp = 0;
        loc[23] = 34;
        for(int i = 0; i < 1024; i++) {
            rtemp += loc[(i * 445) % 1024];
        }
        results[id] = rtemp;
    }
}


All the kernel does is take the local float array loc and accumulate pseudo-random values from it into a global output vector. The expression (i * 445) % 1024 ensures the local memory is accessed in a scattered pattern; without the randomization, performance is a little better (~30% faster) than the figure given at the end.


  • I enqueued the kernel with 16777216 (16M) work items, a work-group size of 256, and a local buffer of 1024 floats, all zeroes except loc[23].

  • Overall, this makes a total of 16M * 1 = 16M writes and 16M * 1024 = 16G reads to the local memory.

  • There are also around 16M * 1024 * 2 floating-point operations, likely more depending on how the modulo is computed, but the HD 6000's floating-point throughput is around 768 GFLOPS, which shouldn't be a bottleneck.

  • 16G reads of 4-byte floats mean 64 GiB read from local memory; the kernel took 453945 μs to complete, giving an estimated local memory bandwidth of 151 GB/s (worked out below).
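
Worked out explicitly (the 64G figure is in binary units, which is why the result is 151 rather than 141 GB/s):

    16777216 work items × 1024 reads × 4 bytes = 68719476736 bytes ≈ 64 GiB
    68719476736 bytes / 0.453945 s ≈ 151.4 GB/s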



The figures thrown around in the referenced question suggest that modern graphics cards (from 2014) can have much higher memory bandwidth than I measured on my machine; the figure quoted there (which may have been a random example for comparison) was 3-4 TB/s. While my card is integrated rather than dedicated, this still seems like a slow figure considering its release in 2015.

To make things even more confusing, I'm getting worse performance on some dedicated mid-range GPUs: both an AMD R9 M370X and an Nvidia GT 750M took 700-800 ms. These are slightly older cards than the Intel HD 6000, so that could have something to do with it.

Is there any potential way to squeeze more performance out of local memory, or am I utilizing local memory as efficiently as possible?

Answer

The actual answer is in the Edit2 part at the end of this answer.

If the dedicated GPU timings are bad, you can try pipelining the read, compute and write operations, as in this diagram:

[figure: timeline of read/compute/write commands for a 4-part pipeline; later parts start their transfers while earlier parts are still computing, and vertical gaps mark idle time at the synchronization points]

From left to right: the second part starts its overlapped operations while the first is still computing, so compute latency is hidden, and from the third part on, write latency is hidden too. This example divides a separable workload into 4 parts; more parts may actually be slower, so it should be benchmarked per device. The kernel here is just an addition, so its latency is always hidden, but heavier kernels may not be. This reduces I/O latency if the graphics card can read and write at the same time. The picture also shows idle (vertically empty) timelines, caused by redundant synchronizations that make it more readable than a packed but faster version.
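
A minimal host-side sketch of this technique, assuming an out-of-order command queue (created with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) and already-created queue, kernel (arguments already set), inputBuf, outputBuf, hostIn and hostOut; all of these names are hypothetical:

    enum { NPARTS = 4 };
    size_t total = 16777216;           /* elements */
    size_t chunk = total / NPARTS;
    size_t local = 256;

    cl_event wrote[NPARTS], computed[NPARTS];
    for (int p = 0; p < NPARTS; ++p) {
        size_t off = p * chunk;

        /* non-blocking upload of this part's input */
        clEnqueueWriteBuffer(queue, inputBuf, CL_FALSE,
                             off * sizeof(float), chunk * sizeof(float),
                             hostIn + off, 0, NULL, &wrote[p]);

        /* the kernel for part p depends only on part p's upload, so
           part p+1's upload can overlap part p's compute */
        clEnqueueNDRangeKernel(queue, kernel, 1, &off, &chunk, &local,
                               1, &wrote[p], &computed[p]);

        /* non-blocking download, dependent only on this part's kernel */
        clEnqueueReadBuffer(queue, outputBuf, CL_FALSE,
                            off * sizeof(float), chunk * sizeof(float),
                            hostOut + off, 1, &computed[p], NULL);
    }
    clFinish(queue);
    /* release the events with clReleaseEvent once done */

With in-order queues, the same overlap can be achieved by using a separate queue per part (or per stage) instead of events.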

Your iGPU's 151 GB/s bandwidth could actually be CPU cache bandwidth. The iGPU doesn't have an addressable register space, so even __private variables could be fetched from cache. Also, caches have different line widths on different CPUs and GPUs.

loc[23] = 34;

is written by every work item, which is a race condition, and the writes get serialized.
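
A sketch of the usual fix (Edit2 below takes a related approach with a barrier): let one work item per group do the write, then synchronize before anyone reads it.

    // only one work item per group initializes the shared value
    if (get_local_id(0) == 0) {
        loc[23] = 34;
    }
    // make the store visible to the whole group before the reads
    barrier(CLK_LOCAL_MEM_FENCE);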

And there is a possibility of

for(int i = 0; i < 1024; i ++) { rtemp += loc[(i * 445) % 1024]; }

being unrolled automatically by the compiler, which puts pressure on the instruction cache and on cache/memory. You could try different levels of unrolling.
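
For example, an explicit unroll hint lets you control the factor instead of leaving it to the compiler. #pragma unroll is a common vendor extension (OpenCL 2.0 also defines __attribute__((opencl_unroll_hint(n)))), so check your compiler's documentation:

    // ask the compiler to unroll by a factor of 4 instead of choosing itself
    #pragma unroll 4
    for(int i = 0; i < 1024; i++) {
        rtemp += loc[(i * 445) % 1024];
    }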

Are you sure all 8 cores of each execution unit of that iGPU are utilized? Maybe only 1 core per EU is used, which may not be enough to fully stress the cache/memory (for example, cache line collisions from all the first cores being used but nothing else). Try a float8 version instead of plain float. The iGPUs of the latest Intel CPUs reach over a TB per second.
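
A sketch of what the float8 inner loop could look like (note this widens each work item's reads to 8 consecutive floats, so the access pattern is no longer identical; vload8 offsets are in float8 units, hence the % 128 over a 1024-float buffer):

    float8 acc = (float8)(0.0f);
    for(int i = 0; i < 1024 / 8; i++) {
        // read 8 consecutive floats per iteration from local memory
        acc += vload8((i * 445) % 128, loc);
    }
    float rtemp = acc.s0 + acc.s1 + acc.s2 + acc.s3
                + acc.s4 + acc.s5 + acc.s6 + acc.s7;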

The GFLOPS limit is rarely approached: around 50% with optimized code, 75% with unreadable code, 90% with meaningless code.


Edit: the code below was run on an AMD R7 240 card at 900 MHz (no more than 30 GB/s memory bandwidth and 600 GFLOPS) for 16M elements of results.

        __kernel void vecAdd(__global float* results )
        {
           int id = get_global_id(0);
           __local float loc[1024]; // statically allocated local memory; some devices may be slower with this
           if(id < (4096*4096)) {   // 4096*4096 = 16M elements
              float rtemp = 0;
              loc[23] = 34;
              for(int i = 0; i < 1024; i++) {
                 rtemp += loc[(i * 445) % 1024];
              }
              results[id] = rtemp;
           }
        }

it took

  • 575 milliseconds (no pipeline) to write+compute+read
  • 530 milliseconds (2-part pipelined) to write+compute+read
  • 510 milliseconds (8-part pipelined) to write+compute+read
  • 455 milliseconds to compute (140 GB/s local memory bandwidth)

Edit2: optimizing for cache line utilization, simplifying the computation and reducing bubbles in the shader cores:

        __kernel void vecAdd(__global float* results )
        {
           int id = get_global_id(0);
           int idL = get_local_id(0);
           __local float loc[1024];
           float rtemp = 0;
           if(id < (4096*4096)) {
              loc[23] = 34;
           }

           // keep the SIMD lanes in lockstep after the store
           barrier(CLK_LOCAL_MEM_FENCE);

           if(id < (4096*4096)) {
              for(int i = 0; i < 1024; i++) {
                 // idL spreads the threads over different cache lines;
                 // & 1023 replaces the costlier % 1024
                 rtemp += loc[(i * 445 + idL) & 1023];
              }
              results[id] = rtemp;
           }
        }
  • 325 milliseconds (16-part pipelined) to write+compute+read
  • 270 milliseconds to compute (235 GB/s local memory bandwidth)

loc[(i * 445) % 1024];

evaluates to the same index for every thread: the sequence is scattered, but all threads compute the same value at each step, so they all access through the same cache line. Adding a per-thread variation (idL) that still yields the same sum in the end uses more cache lines.

% 1024

is optimized into

& 1023

which is valid because 1024 is a power of two and the index is non-negative.

Lastly, the barrier eliminates any instruction bubbles in the SIMD after the loc[23] = 34 store.

Edit3: adding some loop unrolling and increasing the local work group size from 64 to 256 (Edit and Edit2 used 64):

        __kernel void vecAdd(__global float* results )
        {
           int id = get_global_id(0);
           int idL = get_local_id(0);
           __local float loc[1024];
           // four independent accumulators for a 4-way manual unroll
           float rtemp = 0;
           float rtemp2 = 0;
           float rtemp3 = 0;
           float rtemp4 = 0;
           if(id < (4096*4096)) {
              loc[23] = 34;
           }

           barrier(CLK_LOCAL_MEM_FENCE);

           if(id < (4096*4096)) {
              // same stride-445 sequence as Edit2, but 4 reads per
              // iteration give the scheduler independent work
              int higherLimitOfI = 1024*445 + idL;
              int lowerLimitOfI = idL;
              int stepSize = 445*4;
              for(int i = lowerLimitOfI; i < higherLimitOfI; i += stepSize) {
                 rtemp += loc[i & 1023];
                 rtemp2 += loc[(i+445) & 1023];
                 rtemp3 += loc[(i+445*2) & 1023];
                 rtemp4 += loc[(i+445*3) & 1023];
              }
              results[id] = rtemp+rtemp2+rtemp3+rtemp4;
           }
        }
  • 240 milliseconds to compute (264 GB/s local memory bandwidth)
  • VGPR usage limits kernel occupancy to 60% according to the profiler.