user1296153 - 4 months ago
C++ Question

Intrinsics Neon Swap elements in vector

I would like to optimize the code below with NEON intrinsics. Given the input

0 1 2 3 4 5 6 7 8

it should produce the output

2 1 0 5 4 3 8 7 6

void func(uint8_t* src, uint8_t* dst, int size){
    // as written, size counts 3-byte groups, not bytes
    for (int i = 0; i < size; i++){
        dst[0] = src[2];
        dst[1] = src[1];
        dst[2] = src[0];
        dst = dst+3;
        src = src+3;
    }
}

The only way I can think of is to use

uint8x8x3_t data = vld3_u8(src);

to get 3 vectors, then access every single element of data.val[2], data.val[1] and data.val[0] and write it to memory.

Can someone please help?

Thank you.


This is dead easy in the underlying instruction set, because you're swapping two elements of a 3-element structure, which practically spells out the relevant instructions already:

vld3.u8 {d0-d2}, [r0]
vswp d0, d2
vst3.u8 {d0-d2}, [r0]

There's even this exact example in the NEON Programmer's Guide, because it's an RGB-BGR conversion, and that's exactly the kind of processing NEON was designed for.

With intrinsics it's a bit trickier, as there's no intrinsic for vswp; you just have to express it in C and trust the compiler to do the right thing:

uint8x8x3_t data = vld3_u8(src);
uint8x8_t tmp = data.val[0];
data.val[0] = data.val[2];
data.val[2] = tmp;
vst3_u8(dst, data);

That said, with the compilers to hand being various versions of GCC, I failed to convince any of them to actually emit a vswp - code generation ranged from suboptimal to idiotic. Clang did a lot better, but still no vswp; other compilers may be cleverer.
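For completeness, here is a rough sketch of how the intrinsics snippet above might be wrapped into a full function. The name swap_first_third and the groups parameter are my own, hypothetical choices, not anything from the question. It assumes groups counts 3-byte groups (as size does in the question's loop), handles 8 groups per NEON iteration, and falls back to plain C++ for the leftover groups or when NEON is not available at all.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: swaps the first and third byte of every 3-byte group
// (e.g. RGB -> BGR). `groups` is the number of 3-byte groups, not bytes.
void swap_first_third(const uint8_t* src, uint8_t* dst, std::size_t groups) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // NEON path: 8 groups (24 bytes) per iteration.
    for (; i + 8 <= groups; i += 8) {
        uint8x8x3_t data = vld3_u8(src + 3 * i);  // de-interleave into 3 vectors
        uint8x8_t tmp = data.val[0];              // swap vectors 0 and 2
        data.val[0] = data.val[2];
        data.val[2] = tmp;
        vst3_u8(dst + 3 * i, data);               // re-interleave and store
    }
#endif
    // Scalar tail, also the fallback when NEON is not compiled in.
    for (; i < groups; ++i) {
        const uint8_t* s = src + 3 * i;
        uint8_t* d = dst + 3 * i;
        // Read all three bytes first so that src == dst is safe.
        const uint8_t a = s[0], b = s[1], c = s[2];
        d[0] = c;
        d[1] = b;
        d[2] = a;
    }
}
```

Calling swap_first_third(buf, buf, 3) on the question's input 0 1 2 3 4 5 6 7 8 gives 2 1 0 5 4 3 8 7 6. Whether the compiler turns the three-line swap into a vswp or just renames registers is still up to the compiler, as noted above.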