Sagar Patel Sagar Patel - 17 days ago 5
C++ Question

is there an OpenCL std::queue equivalent

I have some code that is in a similar form to this:

std::queue<unsigned> q;
q.push(array[0]);

while (!q.empty())
{
unsigned index = q.front();
q.pop();

if(index >= someval){
index = index - someval;
q.push(array[index]);
}
else{
//do something
}
}
}


I would like to move this over to an OpenCL kernel, how would I replicate the functionality of the queue in the most efficient way possible.

At the moment I have implemented it with a fixed sized array which monitors the head and count of elements in the queue. I am wondering if there is a more elegant solution.

Thanks

Answer

You should check SYCL: https://www.khronos.org/sycl where where C++ template functions can contain both host and device code to construct complex algorithms that use OpenCL acceleration. If you can't use that and if you don't even have a device for opencl 2.0:

There isn't any equivalent for version 1.2. But in comments, you said:

Not sure that I understand what you mean but this would need to be individual to all threads on the GPU

Then there is no need for inter-thread communication, it should be simple to implement a FIFO one using a circular buffer for cache+memory+compute efficiency, just don't overflow it(64 elements max) and don't underflow(more pops than pushes but easy to implement a bounds check (with performance penalty)):

push

   bool push(__private uint * stack, uint value)
    {
        // pushing from bot, so you can pop it from top later (FIFO)
        // circular buffer for top performance
        uint bufLen=64;
        // zeroth element is counter for newest added element
        // first element is oldest element


        // circular buffer 
        uint nextIndex=(stack[0]%bufLen+2); // +2 because of top-bot headers

        // if overflows, it overwrites oldest elements one by one
        stack[nextIndex]=value;


        // if overflows, it still increments 
        stack[0]++;

        // simple and fast
        return true;
    }

checking if empty

        bool empty(__private uint * stack)
        {
            // tricky if you overflow both
            return (stack[0]==stack[1]);
        }

front value

        uint front(__private uint * stack)
        {
            uint bufLen=64;

            // oldest element value (top)
            uint ptr=stack[1]%bufLen+2; // circular adr + 2 header


            return stack[ptr];
        }

popping

        uint pop(__private uint * stack)
        {
               uint bufLen=64;
               uint ptr=stack[1]%bufLen+2;
               // pop from top (oldest)
               uint returnValue=stack[ptr];
               stack[ptr]=0;

               // this will be new top ctr for ptr
               stack[1]++;

               // if underflows, gets garbage, don't underflow

               return returnValue;
        }

example kernel for benchmarking:

        __kernel void queue0(__global uint * heap)
        {
            int id=get_global_id(0);
            __private uint q[100];
            for(int i=0;i<256;i++)
                q[i]=0;

            for(int i=0;i<55;i++)    
                push(q,i);

            for(int i=0;i<40;i++)    
                pop(q);

            for(int i=0;i<20;i++)    
                push(q,i);

            for(int i=0;i<35;i++)    
                pop(q);

            for(int i=0;i<35;i++)    
            {
                push(q,i);
                pop(q);
            }
            push(q,'h');
            push(q,'e');
            push(q,'l');
            push(q,'l');
            push(q,'o');
            push(q,' ');
            push(q,'w');
            push(q,'o');
            push(q,'r');
            push(q,'l');
            push(q,'d');
            for(int i=0;i<256;i++)
                heap[id*256+i]=q[i];
        }

output of buffer (which shows thread id = 0 calculations results)

121 110 0      0 0 0    0 0 0    0 0 0 // 121 pushes total, 110 pops total

0 0 0    0 0 0    0 0 0    0 0 0

0 0 0    0 0 0    0 0 0    0 0 0

0 0 0    0 0 0    0 0 0    0 0 0

104 101 108      108 111 32     119 111 114    108 100 0 
// hello world

more than 200k pushes+pops under 6.35 milliseconds(running kernel for 1024 threads each working on 256 elements but using only 64+2 elements for circular buffer) for a 1-channel 1600MHz ddr3 RAM and Intel HD Graphics 400 with 12 compute units(96 cores total @600 MHz).

If you construct a 64-element circular buffer using 64 x 4-element circular buffers, you can add elements between top and bottom of stack too!