MishMash95 - 1 month ago
C Question

Shuffle AVX 256 Vector elements by 1 position left/right - C intrinsics

I'm trying to find a more efficient way to "rotate" or shift the 32-bit floating point values within an AVX __m256 vector to the right or left by one place.

Such that:

a7, a6, a5, a4, a3, a2, a1, a0

becomes

0, a7, a6, a5, a4, a3, a2, a1

(I don't mind if the data gets lost, as I replace the cell anyway.)

I've already taken a look at this thread: Emulating shifts on 32 bytes with AVX
but I don't really understand what is going on, and it doesn't explain what _MM_SHUFFLE(0, 0, 3, 0) does as an input parameter.

I'm trying to optimise this code:

_mm256_store_ps(temp, array[POS(ii, jj)]);
_mm256_store_ps(left, array[POS(ii, jj-1)]);

tmp_array[POS(ii, jj)] = _mm256_set_ps(left[0], temp[7], temp[6], temp[5], temp[4], temp[3], temp[2], temp[1]);

I know that once a shift is in place, I can use an insert to replace the remaining cell. I feel this will be more efficient than unpacking into a float[8] array and reconstructing.

-- I'd also like to be able to shift both left and right, as I need to perform a similar operation elsewhere.

Any help is greatly appreciated! Thanks!


For AVX2:

Use VPERMPS to do it in one lane-crossing shuffle instruction.

rotated_right = _mm256_permutevar8x32_ps(src, _mm256_set_epi32(0,7,6,5,4,3,2,1));
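Since the question asks for both directions, here is a minimal sketch of the same VPERMPS idea with a second index vector for the left rotation. The function names and the float-pointer interface are my own choices for illustration, and the `target("avx2")` attribute is a GCC/Clang extension assumed here so the snippet builds without `-mavx2`. Each index says which source element ends up in that position, and `_mm256_set_epi32` lists them from element 7 down to element 0.

```c
#include <immintrin.h>

/* Rotate right by one element: out[i] = in[i+1] for i < 7, out[7] = in[0].
   Hypothetical helper; the attribute is GCC/Clang-specific. */
__attribute__((target("avx2")))
void rotate_right_avx2(const float *in, float *out)
{
    /* Index for element i is the source element to fetch: 1,2,...,7,0 */
    const __m256i idx = _mm256_set_epi32(0, 7, 6, 5, 4, 3, 2, 1);
    __m256 v = _mm256_loadu_ps(in);
    _mm256_storeu_ps(out, _mm256_permutevar8x32_ps(v, idx));
}

/* Rotate left by one element: out[i] = in[i-1] for i > 0, out[0] = in[7]. */
__attribute__((target("avx2")))
void rotate_left_avx2(const float *in, float *out)
{
    /* Index for element i is the source element to fetch: 7,0,1,...,6 */
    const __m256i idx = _mm256_set_epi32(6, 5, 4, 3, 2, 1, 0, 7);
    __m256 v = _mm256_loadu_ps(in);
    _mm256_storeu_ps(out, _mm256_permutevar8x32_ps(v, idx));
}
```

If you only need a shift rather than a rotate, you can still use these and then overwrite the wrapped element with a blend.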

For AVX (without AVX2):

Since you say the data is coming from memory already, this might be good:

  • use an unaligned load to get the 7 elements to the right place, solving all the lane-crossing problems.
  • Then blend the element that wrapped around into that vector of the other 7.
  • To get the element that wrapped in-place for the blend, maybe use a broadcast-load to get it to the high position. AVX can broadcast-load in one VBROADCASTSS instruction (so set1() is cheap), although it does need the shuffle port on Intel SnB and IvB (the only two Intel microarchitectures with AVX but not AVX2). (See perf links in the tag wiki.)

INSERTPS only works on XMM destinations, and can't reach the upper lane.

You could maybe use VINSERTF128 to do another unaligned load that ends up putting the element you want as the high element in the upper lane (with some don't-care vector in the low lane).
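A rough sketch of that VINSERTF128 variant, untested for performance. The helper name and the float-pointer interface are made up for illustration, and the `target("avx")` attribute is a GCC/Clang extension assumed so it builds without `-mavx`. It assumes the three floats before `src` and the one float after the 8-element block are readable, because both loads reach outside `src[0..7]`.

```c
#include <immintrin.h>

/* Rotate right by one element using two unaligned loads and a blend:
   out[i] = src[i+1] for i < 7, out[7] = src[0].
   Assumption: src[-3..8] is readable memory. Hypothetical helper. */
__attribute__((target("avx")))
void load_rotr_insertf128(const float *src, float *out)
{
    /* src[1..8] lands in elements 0..7; element 7 gets replaced below. */
    __m256 shifted = _mm256_loadu_ps(src + 1);

    /* 128-bit load of src[-3..0], so src[0] is its highest element. */
    __m128 tail = _mm_loadu_ps(src - 3);

    /* Put tail into the upper lane: src[0] is now element 7.
       The low lane is don't-care (here it also holds tail). */
    __m256 wrapped = _mm256_insertf128_ps(_mm256_castps128_ps256(tail), tail, 1);

    /* Bit 7 of the mask selects element 7 from wrapped. */
    _mm256_storeu_ps(out, _mm256_blend_ps(shifted, wrapped, 0x80));
}
```

Whether this beats the broadcast-load depends on the microarchitecture, so measure before committing to either.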

This compiles, but isn't tested.

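#include <immintrin.h>

// Load 8 floats and rotate right by one: result[i] = src[i+1], result[7] = src[0].
__m256 load_rotr(float *src)
{
#ifdef __AVX2__
    __m256 orig = _mm256_loadu_ps(src);
    __m256 rotated_right = _mm256_permutevar8x32_ps(orig, _mm256_set_epi32(0,7,6,5,4,3,2,1));
    return rotated_right;
#else
    // Loads src[1..8]; src[8] must be readable, but the blend discards it.
    __m256 shifted = _mm256_loadu_ps(src + 1);
    __m256 bcast = _mm256_set1_ps(*src);
    return _mm256_blend_ps(shifted, bcast, 0b10000000);  // element 7 comes from bcast
#endif
}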

See the code + asm on Godbolt
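For the other direction, and for the case in the question where the new element comes from a neighbouring vector rather than wrapping around, the same load-and-blend idea can be adapted. This is only a sketch: the function names and the float-pointer interface are hypothetical, the `target("avx")` attribute is a GCC/Clang extension assumed so it builds without `-mavx`, and the shifted loads assume one readable float just outside the 8-element block.

```c
#include <immintrin.h>

/* Rotate left by one element: out[i] = src[i-1] for i > 0, out[0] = src[7].
   Assumption: src[-1] is readable (it is loaded, then discarded). */
__attribute__((target("avx")))
void load_rotl(const float *src, float *out)
{
    __m256 shifted = _mm256_loadu_ps(src - 1);     /* src[-1..6] in elements 0..7 */
    __m256 bcast = _mm256_broadcast_ss(src + 7);   /* src[7] in every element */
    /* Bit 0 of the mask selects element 0 from bcast. */
    _mm256_storeu_ps(out, _mm256_blend_ps(shifted, bcast, 0x01));
}

/* Shift right by one element and fill the top slot from elsewhere:
   out[i] = src[i+1] for i < 7, out[7] = *fill.
   Assumption: src[8] is readable (it is loaded, then discarded). */
__attribute__((target("avx")))
void load_shiftr_fill(const float *src, const float *fill, float *out)
{
    __m256 shifted = _mm256_loadu_ps(src + 1);     /* src[1..8] in elements 0..7 */
    __m256 bcast = _mm256_broadcast_ss(fill);      /* *fill in every element */
    /* Bit 7 of the mask selects element 7 from bcast. */
    _mm256_storeu_ps(out, _mm256_blend_ps(shifted, bcast, 0x80));
}
```

In the question's loop, `fill` would point at element 0 of the `left` vector, which replaces the two stores and the `_mm256_set_ps` rebuild, provided the memory layout makes the shifted load safe.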