user1043761 user1043761 - 7 days ago 4x
C++ Question

AVX alternative of AVX2's vector shift?

In AVX2 we have

_mm256_srlv_epi32(a, b)
_mm256_sllv_epi32(a, b)
for shifting a set of 8 values in 'a' by the 8 values in 'b'. Is there an efficient alternative using AVX so that I can stay in AVX and not have to spit out to scalar code?


AVX1 doesn't have 256b integer operations, only FP. So I assume you're really looking for an alternative to __m128i _mm_srlv_epi32(). Using extractf128 / insertf128, you can easily do this for 256b vectors, but it's better to just use more 128b loads/stores, esp. if you have an AVX2 version that can run on CPUs with AVX2 support. (The existing AVX1-only CPUs all happen to have 128b load/store data paths, so 256b loads/stores are barely an advantage.)

The round trip from vector to scalar is pretty expensive (either store-forwarding stalls when reloading after scalar stores, or a lot of movd / pextrd / pinsrd), so even something pretty clunky might still be better than integer code, depending on whether throughput or latency are more important in the code you're using this in.

The best idea I have is basically scalar in vector regs: 4 shifts (one for each different shift count) and 3 immediate-blends to combine the results.

update: idea 2: left shift with a 32bit multiply by 2count. See the end of this answer.

If the shift counts aren't compile-time constants, you'll need to unpack the vector of shift counts so you have each shift count as the 64b of a vector. (The non-variable shift instructions can take their count in a register, but they look at the whole low 64b. And instead of masking (modulo word size) like scalar shifts, they saturate.

Getting each of the 4 elements of an xmm register isolated in an otherwise-zero destination is tricky. You can't just byte-shift them down to the bottom, because that would leave non-zero bytes from the second element.

Since this is for AVX without AVX2, I'm assuming you have a separate version for AVX2 CPUs. So for Intel, this version will be used on SnB/IvB. This means you have two 128b shuffle units, instead of just one on Haswell and later.

## 4 shift-counts in the elements of   xmm0 = [ D C B A ].  element 1 isolated in xmm1, etc.
vpsrlq      xmm2, xmm0, 32           ; xmm2 = [ 0 D 0 B ]
vpunpckhqdq xmm4, xmm2, xmm0         ; xmm4 = [ D C 0 D ]
vpshufd     xmm3, xmm4, 0b01010110   ; xmm3 = [ 0 0 0 C ]
vblendps    xmm1, xmm2, xmm0, 0b0001 ; xmm1 = [ 0 D 0 A ]
; or
vpblendw     xmm1, xmm2, xmm0, 0b00000011 ; xmm1 = [ 0 D 0 A ]

vblendps runs on p0/5 on SnB/IvB. The equivalent vpblendw runs on p1/p5 on SnB/IvB. On Haswell/SKL it's p015 vs. p5, so blendps is much better (same choice of ports as PAND). For SnB, maybe use a combo of both for blending the shift results. With intrinsics, using FP instructions on integer data requires a lot of casting, which makes the source ugly and harder to read. Unless you're going to tune this to fit best into the surrounding code with perf counters and microbenchmarks, just use pblendw for SnB/IvB. Otherwise just cast and use blendps.

Alternative if you have a [ 0 -1 0 -1 ] mask available, a vector AND can run on more ports, and shorten the dependency chain for xmm3. This is not enough better to justify loading or generating the mask, so prefer the previous version that does it all with shifts/shuffles/blends.

vpcmpeqw   xmm5, xmm5,xmm5            ; all-ones
vpsrlq     xmm5, xmm5, 32             ; [ 0 -1  0 -1 ]: generate the mask on the fly if desired

vpand       xmm1, xmm5, xmm0           ; [ 0 C 0 A ]
vpsrlq      xmm2, xmm0, 32             ; [ 0 D 0 B ]
vpunpckhqdq xmm3, xmm1,xmm1            ; [ 0 C 0 C ]  ; saves 1B vs. the equivalent pshufd: no imm8 byte
vpunpckhqdq xmm4, xmm2,xmm2            ; [ 0 D 0 D ]

Side note: bizarrely, on Skylake, VPSRLVD ymm,ymm,ymm is cheaper (1 uop) than PSRLD xmm,xmm,xmm (2 uops). Immediate-count PSRLD is only 1 uop, though. (From Agner Fog's insn tables).

@BeeOnRope's testing confirms that Agner's latency numbers are from the data input to the data output, with the shift-count not on the critical path. Latency from shift-count input to data output is 2c(xmm) or 4c(ymm), as usual for 1c a in-lane broadcast vs. 3c for a lane-crossing broadcast.

uop counts:

With scalar code for compile-time-constant shift counts, the whole thing might look like:

movaps    [rsp - 16], xmm0
shr       [rsp - 16], 3         ; 3 uops with a memory-destination.  5 uops for variable count with a memory destination
shr       [rsp - 12], 1
shr       [rsp -  8], 4
shr       [rsp -  4], 1
movaps    xmm0, [rsp - 16]      ; store-forwarding stall here from the 4x 32b stores to the 128b load

Or maybe for variable-count:

## data in xmm0,  shift counts in xmm1, results in xmm2
vmovd      eax, xmm0      ; 1 uop
vmovd      ecx, xmm1      ; 1 uop
shr        eax, cl        ; 3 uops because of CISC stupidity
vmovd      xmm2, eax      ; 1 uop

vpextrd    eax, xmm0, 1   ; 2 uops
vpextrd    ecx, xmm1, 1   ; 2 uops
shr        eax, cl        ; 3 uops because of CISC stupidity
vpinsrd    xmm2, eax, 1   ; 2 uops

... repeat twice more, for indices 2 and 3    

So the all-registers way for variable-count shifts is 6uops + 9uops * 3, total of 33 uops.

The memory-destination version is 14 fused-domain uops, since I counted a version that has the shift-counts as compile-time constants. It would be many more with loading or pextring counts into ecx, since each variable-count shift is 2 uops more than immediate-count shift.

So even though the SSE/AVX version is pretty nasty, it's not that nasty. The fully-variable vector version is still

  • 4 uops to unpack the counts
  • 8 uops for the four vpsrld xmm,xmm insns
  • 3 uops for the vpblendw or vblendps to merge those results.
  • total = 15 fused-domain uops for the fully variable AVX1.

So the fully-variable vector version is only as bad as the fully-constant store / scalar shuffle / reload version, and that has a store-forwarding stall in it.

Note that just counting fused-domain uops isn't always the only relevant thing. Latency may be important, and execution port pressure in the unfused-domain may matter.

For comparison:

  • Skylake: vpsrlvd ymm, ymm, ymm is 1 uop, 1c latency, one per 0.5c throughput.
  • Haswell/BDW: vpsrlvd ymm, ymm, ymm is 3 uops, 2c latency, one per 2c throughput.

And remember, that's for a 256b vector. All the counts I've done are for 128b vectors.

On Haswell (instead of SnB/IvB), my SSE version would probably bottleneck on shuffle port throughput. Latency will be somewhat worse, too, because resource conflicts limit the amount of insn level parallelism it can take advantage of.

Left shift by using SSE4.1 pmulld to multiply by powers of two.

On SnB/IvB, SSE4.1 pmulld is 1 uop, 5c latency, one per 1c throughput.
On Haswell, it's 2 uops, 10c latency, one per 2c throughput. (Twice the throughput on Skylake, since its uops can run on p1 as well as p0)

The trick is turning the shift-count into 2c. One way is by using a variable shift. This is fine if you can reuse the exponentiated vector of 2c to shift multiple other vectors, otherwise it's a chicken-and-egg problem.

If the range of shift counts is small (i.e. 0..7), you can use SSSE3 pshufb as a LUT to map a vector of counts to a vector of 2^c. 0 in the low byte of each element has to become 1 (20), but 0 in other bytes has to stay zero.

##           1<<8 or higher is 0, in an 8bit element
## xmm5 = _mm_set_epi8(0, 0, ..., 1<<7, ..., 1<<2, 1<<1, 1<<0);
## xmm4 = _mm_set1_epi32(0x000000ff);        
## data in xmm0, shift counts in xmm1
movdqa    xmm2, xmm5           ; avoid this with AVX
pshufb    xmm2, xmm5           ; 2^count
pand      xmm2, xmm4           ; zero all but the low byte in each element
pmulld    xmm0, xmm2           ; data * 2^count

Intel SnB/IvB: 3 uops (not counting the movdqa which isn't needed with AVX). Latency from shift-count to result: 7c. Latency from shift-data to result: 5c. Throughput: one per 1c (since all three uops can run on different ports).

With Haswell and later: 5c higher latency. Penryn/Nehalem also take more uops for pmulld than SnB, but not as bad latency as Haswell.

The LUT is all zero in the upper 64b, but it's non-trivial to convince a compiler to only store the relevant part and load it with movq. I won't go into that here.

To handle larger shift counts, we could use multiple LUTs with lookups from [ D-8 C-8 B-8 A-8 ] to get values for the 2nd byte of each 32b element, etc. etc. Note that C-8 has the sign bit set if C<8, and BLENDVB merges based on the sign bit being set. It's expensive, though, so a series of merges might not be better than just using the earlier shift/blend-immediate method.

Other than masking the pshufb result, you could instead add a vector of set1_epi32(1). Then the range of indices in the LUT with non-zero bytes would be 1..8, and the padding 0 bytes in the shift-count vector would look up the low element of the LUT (which should be 0). Doing it this way would make on-the-fly constant generation more feasible:

## xmm5 = _mm_set_epi8(0, 0, ..., 1<<7, ..., 1<<2, 1<<1, 1<<0, 0);
## data in xmm0, shift counts in xmm1
pcmpeqw   xmm4,xmm4            ; all-ones

psubd     xmm1, xmm4           ; shift_counts -= -1
movdqa    xmm2, xmm5
pshufb    xmm2, xmm1           ; 2^count
pmulld    xmm0, xmm2           ; data * 2^count

No advantage to this, unless you really care about generating a constant on the fly in one fewer insn. (set1_epi32(0xff) is fast to generate with pcmpeqw / psrld 24, but compilers often only generate on the fly when they can do it in one insn.)


The OP clarified in chat that the problem is actually much simpler: the data being shifted is a compile-time constant (0xF in particular). Also, only the low 8 bits of the result are needed.

This makes it trivial to implement with just PSHUFB as a LUT, no multiply needed. See the previous section of this answer that used pshufb to do 2<<count.

If you wanted a 32b result, you might generate [ 0 0 D+8 D | 0 0 C+8 C | ... ] for use as the control mask. With the right data in each half of the LUT, that will produce the right two bytes.