Martin Martin - 2 years ago 96
C Question

How to concatenate two vector efficiently using AVX2?

I have implemented an inline function (

_mm256_concat_epi16
). It concatenates two AVX2 vector containing 16-bit values. It works fine for first 8 numbers. If I want to use it for the rest of the vector I should change the implementation. But It would be better to use a single inline function in my main program.

The question is : Is there any better solution than mine or any suggestion to make this inline function more general which works on 16 values instead of my solution that works on 8 values? My solution concatenate 2 vectors but only 8 states of 16 possible state is solved.

**EDIT*My current solution for this question is using unaligned load function which exactly can read from any part from memory. But, when data is ready in register it might be better to reuse it. However, it might cause bottlenecks on port 5 which issues shuffle, permute, etc. But throughput might be enough (haven't test yet).



#include <stdio.h>
#include <x86intrin.h>

inline _mm256_print_epi16(__m256i a, char* name){
short temp[16], i;
_mm256_storeu_si256((__m256i *) &temp[0], a);
for(i=0; i<16; i++)
printf("%s[%d]=%4d , ",name,i+1,temp[i]);
printf("\n");
}

inline __m256i _mm256_concat_epi16(__m256i a, __m256i b, const int indx){
return _mm256_alignr_epi8(_mm256_permute2x128_si256(a,b,0x21),a,indx*2);
}

int main()
{
__m256i a = _mm256_setr_epi16(101,102,103,104,105,106,107,108,109,1010,1011,1012,1013,1014,1015,1016);_mm256_print_epi16(a, "a");
__m256i b = _mm256_setr_epi16(201,202,203,204,205,206,207,208,209,2010,2011,2012,2013,2014,2015,2016);_mm256_print_epi16(b, "b");

_mm256_print_epi16(_mm256_concat_epi16(a,b,8), "c");//numbers: 0-8
return 0;
}


The out put is :

// icc -march=native -O3 -D _GNU_SOURCE -o "concat" "concat.c"
[fedora@localhost concatination]$ "./concat"
a[1]= 101 , a[2]= 102 , a[3]= 103 , a[4]= 104 , a[5]= 105 , a[6]= 106 , a[7]= 107 , a[8]= 108 , a[9]= 109 , a[10]=1010 , a[11]=1011 , a[12]=1012 , a[13]=1013 , a[14]=1014 , a[15]=1015 , a[16]=1016 ,
b[1]= 201 , b[2]= 202 , b[3]= 203 , b[4]= 204 , b[5]= 205 , b[6]= 206 , b[7]= 207 , b[8]= 208 , b[9]= 209 , b[10]=2010 , b[11]=2011 , b[12]=2012 , b[13]=2013 , b[14]=2014 , b[15]=2015 , b[16]=2016 ,
c[1]= 109 , c[2]=1010 , c[3]=1011 , c[4]=1012 , c[5]=1013 , c[6]=1014 , c[7]=1015 , c[8]=1016 , c[9]= 201 , c[10]= 202 , c[11]= 203 , c[12]= 204 , c[13]= 205 , c[14]= 206 , c[15]= 207 , c[16]= 208 ,

Answer Source

It's impossible to give a general answer to this question. It's such a short fragment that the best strategy depends on the surrounding code and what CPU you're running on.

Sometimes we can rule out things that have no advantages on any CPU and just consume more of the same resources, but that's not the case when considering a tradeoff between unaligned loads vs. shuffles.


In a loop that bottlenecks on something other than loads, e.g. total uop throughput, or a specific ALU port, using an unaligned load will probably be the cheapest strategy. It's only 1 uop (or 2 on Ryzen, where all 256b ops are 2 uops), and probably won't be a latency problem. (Assuming your pointers are not on the critical path, the load addresses can be ready in plenty of time for out-of-order execution to get them done.)

On recent Intel CPUs, vector loads that cross a cache-line boundary still have pretty good throughput, but this is one reason why you might consider an ALU strategy, or a mix of shuffles and overlapping loads (in an unrolled loop you might alternate strategies so you don't bottleneck on either one).


The shuffle strategy:

Your current function can only compile if indx is known at compile time (because palignr needs the byte-shift-count as an immediate).

As @Mohammad suggested, you could pick from different shuffles at compile time, depending on the indx value. He seemed to be suggesting a CPP macro, but that would be ugly.

Much easier to simply use if(indx>=16) or something like that, which will optimize away. (You could make indx a template parameter if a compiler refused to compile your code with an apparently "variable" shift count.) Agner Fog uses this in his Vector Class Library (license=GPL), for functions like template <uint32_t d> static inline Vec8ui divide_by_ui(Vec8ui const & x).

Related: Emulating shifts on 32 bytes with AVX has an answer with different shuffle strategies depending on shift count. But it's only trying to emulate a shift, not a concat / lane-crossing palignr.

vperm2i128 is fast on Intel mainstream CPUs (but still a lane-crossing shuffle so 3c latency), but slow on Ryzen (8 uops with 3c latency/3c throughput). If you were tuning for Ryzen, you'd want to use an if() to figure out a combination of vextracti128 to get a high lane and/or vinserti128 on a low lane. You might also want to use separate shifts and then vpblendd the results together.


Designing the right shuffles:

The indx determines where the new bytes for each lane need to come from. Let's simplify by considering 64-bit elements:

 hi |  lo
D C | B A    # a
H G | F E    # b

palignr(b,a i) forms (H G D C) >> i | (F E B A) >> i
But what we want is

D C | B A    # concatq(b,a,0): no-op.  return a;

E D | C B    # concatq(b,a,1):  applies to 16-bit element counts from 1..7
          low lane needs  hi(a).lo(a)
          high lane needs lo(b).hi(a)
        return palignr(swapmerge(a,b), a, 2*i).  (Where we use vperm2i128 to lane-swap+merge a and b)
F E | D C    # concatq(b,a,2)
        special case of exactly half reg width: Just use vperm2i128.
        Or on Ryzen, blend then swap hi/lo with vpermq
G F | E D    # concatq(b,a,3): applies to 16-bit element counts from 9..15
        low lane needs  lo(b).hi(a)
        high lane needs hi(b).lo(b).  vperm2i128 -> palignr looks good
        return palignr(b, swapmerge(a,b), 2*i-16).

H G | F E    # concatq(b,a,4): no op: return b;

These design notes lead directly to this implementation:

// UNTESTED
// clang refuses to compile this, but gcc works.

// in many cases won't be faster than simply using unaligned loads.
static inline __m256i lanecrossing_alignr_epi16(__m256i a, __m256i  b, unsigned int count) {
#endif
   if (count == 0)
     return a;
   else if (count <= 7)
     return _mm256_alignr_epi8(_mm256_permute2x128_si256(a,b,0x21),a,count*2);
   else if (count == 8)
      return _mm256_permute2x128_si256(a,b,0x21);
   else if (count > 8 && count <= 15)
     // clang chokes on the negative shift count even when this branch is not taken
     return _mm256_alignr_epi8(b,_mm256_permute2x128_si256(a,b,0x21),count*2 - 16);
   else if (count == 16)
     return b;
   else
     assert(0 && "out-of-bounds shift count");

// can't get this to work without C++ constexpr :/
//   else
//     static_assert(count <= 16, "out-of-bounds shift count");
}

I put it on the Godbolt compiler explorer with some test functions that inline it with different constant shift counts. gcc6.3 compiles it to

test_alignr0:
    ret            # a was already in ymm0
test_alignr3:
    vperm2i128      ymm1, ymm0, ymm1, 33   # replaces b
    vpalignr        ymm0, ymm1, ymm0, 6
    ret
test_alignr8:
    vperm2i128      ymm0, ymm0, ymm1, 33
    ret
test_alignr11:
    vperm2i128      ymm0, ymm0, ymm1, 33   # replaces a
    vpalignr        ymm0, ymm1, ymm0, 6
    ret
test_alignr16:
    vmovdqa ymm0, ymm1
    ret

clang chokes on it. First, it says error: argument should be a value from 0 to 255 for the count*2 - 16 for counts that don't use that branch of the if/else chain.

Also, it can't wait and see that the alignr() count ends up being a compile-time constant: error: argument to '__builtin_ia32_palignr256' must be a constant integer, even when it is after inlining. You can solve that in C++ by making count a template parameter:

template<unsigned int count>
static inline __m256i lanecrossing_alignr_epi16(__m256i a, __m256i  b) {
   static_assert(count<=16, "out-of-bounds shift count");
   ...

In C, you could make it a CPP macro instead of a function to deal with that.

The count*2 - 16 problem is harder to solve for clang. You could make the shift count part of the macro name, like CONCAT256_EPI16_7. There's probably some CPP trickery you could use to do the 1..7 versions and the 9..15 versions separately. (Boost has some crazy CPP hacks.)


BTW, your print function is weird. It calls the first element c[1] instead of c[0]. Vector indices start at 0 for shuffles, so it's really confusing.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download