> In AVX2 we have `_mm_srlv_epi32()` for variable shifts.

AVX1 doesn't have 256b integer operations, only FP. So I assume you're really looking for an alternative to `__m128i _mm_srlv_epi32()`. Using `extractf128` / `insertf128`, you can easily do this for 256b vectors, but it's better to just use more 128b loads/stores, esp. if you have an AVX2 version that can run on CPUs with AVX2 support. (The existing AVX1-only CPUs all happen to have 128b load/store data paths, so 256b loads/stores are barely an advantage.)
The round trip from vector to scalar is pretty expensive (either store-forwarding stalls when reloading after scalar stores, or a lot of `pinsrd`), so even something pretty clunky might still be better than integer code, depending on whether throughput or latency is more important in the code you're using this in.
The best idea I have is basically scalar in vector regs: 4 shifts (one for each different shift count) and 3 immediate-blends to combine the results.
update: idea 2: left shift with a 32bit multiply by 2^count. See the end of this answer.
If the shift counts aren't compile-time constants, you'll need to unpack the vector of shift counts so you have each shift count isolated in the low 64b of its own vector. (The non-variable shift instructions can take their count in a register, but they look at the whole low 64b. And instead of masking (modulo word size) like scalar shifts, they saturate.)
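A minimal demo of that count behavior (my own illustration, not part of the original code):

```c
#include <immintrin.h>
#include <stdio.h>

// psrld (_mm_srl_epi32) reads the whole low 64 bits of its count operand
// and saturates for counts >= 32; scalar shr masks the count mod 32.
int main(void)
{
    __m128i v = _mm_set1_epi32((int)0x80000000u);
    __m128i count33 = _mm_set_epi32(0, 0, 0, 33);      // low 64b = 33
    unsigned vec = (unsigned)_mm_cvtsi128_si32(_mm_srl_epi32(v, count33));
    unsigned scl = 0x80000000u >> (33 & 31);           // shr masks: 33 & 31 = 1
    printf("vector: %#x  scalar: %#x\n", vec, scl);    // vector: 0  scalar: 0x40000000
    return 0;
}
```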
Getting each of the 4 elements of an xmm register isolated in an otherwise-zero destination is tricky. You can't just byte-shift them down to the bottom, because that would leave non-zero bytes from the second element.
Since this is for AVX without AVX2, I'm assuming you have a separate version for AVX2 CPUs. So for Intel, this version will be used on SnB/IvB. This means you have two 128b shuffle units, instead of just one on Haswell and later.
```asm
## 4 shift-counts in the elements of xmm0 = [ D C B A ]
## isolate each count in the low 64b of its own register: A in xmm1, B in xmm2, C in xmm3, D in xmm4
vpsrlq      xmm2, xmm0, 32               ; xmm2 = [ 0 D 0 B ]  (low 64b = B)
vpunpckhqdq xmm4, xmm2, xmm0             ; xmm4 = [ D C 0 D ]  (low 64b = D)
vpshufd     xmm3, xmm4, 0b01010110       ; xmm3 = [ 0 0 0 C ]
vblendps    xmm1, xmm2, xmm0, 0b0001     ; xmm1 = [ 0 D 0 A ]
; or
vpblendw    xmm1, xmm2, xmm0, 0b00000011 ; xmm1 = [ 0 D 0 A ]
```
`vblendps` runs on p0/p5 on SnB/IvB. The equivalent `vpblendw` runs on p1/p5 on SnB/IvB. On Haswell/SKL it's p015 vs. p5, so `blendps` is much better (same choice of ports as `PAND`). For SnB, maybe use a combo of both for blending the shift results. With intrinsics, using FP instructions on integer data requires a lot of casting, which makes the source ugly and harder to read. Unless you're going to tune this to fit best into the surrounding code with perf counters and microbenchmarks, just use `pblendw` for SnB/IvB. Otherwise just cast and use `vblendps`.
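Putting the pieces together with intrinsics, here's an untested sketch of the whole emulation (the function name is my own; it uses `_mm_blend_epi16` / `pblendw` throughout to avoid the FP-cast clutter, per the SnB/IvB advice above):

```c
#include <immintrin.h>

// Untested sketch: emulate _mm_srlv_epi32 with SSE4.1/AVX1 only.
// counts = [ D C B A ]; isolate each count in the low 64b of its own
// register, because psrld xmm,xmm looks at the whole low 64 bits.
static inline __m128i srlv_epi32_avx1(__m128i v, __m128i counts)
{
    __m128i cB = _mm_srli_epi64(counts, 32);         // [ 0 D 0 B ]: low 64b = B
    __m128i cD = _mm_unpackhi_epi64(cB, counts);     // [ D C 0 D ]: low 64b = D
    __m128i cC = _mm_shuffle_epi32(cD, 0x56);        // [ 0 0 0 C ]: low 64b = C
    __m128i cA = _mm_blend_epi16(cB, counts, 0x03);  // [ 0 D 0 A ]: low 64b = A

    __m128i rA = _mm_srl_epi32(v, cA);               // 4 shifts, one per count
    __m128i rB = _mm_srl_epi32(v, cB);
    __m128i rC = _mm_srl_epi32(v, cC);
    __m128i rD = _mm_srl_epi32(v, cD);

    __m128i rAB = _mm_blend_epi16(rA, rB, 0x0C);     // take element 1 from rB
    __m128i rCD = _mm_blend_epi16(rC, rD, 0xC0);     // take element 3 from rD
    return _mm_blend_epi16(rAB, rCD, 0xF0);          // elements 2,3 from rCD
}
```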
Alternatively, if you have a `[ 0 -1 0 -1 ]` mask available, a vector AND can run on more ports and shortens the dependency chain for xmm3. This isn't better by enough to justify loading or generating the mask, so prefer the previous version that does it all with shifts/shuffles/blends.
```asm
vpcmpeqw    xmm5, xmm5, xmm5  ; all-ones
vpsrlq      xmm5, xmm5, 32    ; [ 0 -1 0 -1 ]: generate the mask on the fly if desired
vpand       xmm1, xmm5, xmm0  ; [ 0 C 0 A ]
vpsrlq      xmm2, xmm0, 32    ; [ 0 D 0 B ]
vpunpckhqdq xmm3, xmm1, xmm1  ; [ 0 C 0 C ]   ; saves 1B vs. the equivalent pshufd: no imm8 byte
vpunpckhqdq xmm4, xmm2, xmm2  ; [ 0 D 0 D ]
```
Side note: bizarrely, on Skylake, `VPSRLVD ymm,ymm,ymm` is cheaper (1 uop) than `PSRLD xmm,xmm` (2 uops). Immediate-count `PSRLD` is only 1 uop, though. (From Agner Fog's insn tables.)
@BeeOnRope's testing confirms that Agner's latency numbers are from the data input to the data output, with the shift-count not on the critical path. Latency from shift-count input to data output is 2c (xmm) or 4c (ymm), as usual for a 1c in-lane broadcast vs. a 3c lane-crossing broadcast.
For comparison, here's the store / scalar-shift / reload version, with compile-time-constant shift counts:

```asm
movaps  [rsp - 16], xmm0
shr     dword [rsp - 16], 3   ; 3 uops with a memory-destination.  5 uops for variable count with a memory destination
shr     dword [rsp - 12], 1
shr     dword [rsp -  8], 4
shr     dword [rsp -  4], 1
movaps  xmm0, [rsp - 16]      ; store-forwarding stall here from the 4x 32b stores to the 128b load
```
Or maybe for variable-count:
```asm
## data in xmm0, shift counts in xmm1, results in xmm2
vmovd   eax, xmm0           ; 1 uop
vmovd   ecx, xmm1           ; 1 uop
shr     eax, cl             ; 3 uops because of CISC stupidity
vmovd   xmm2, eax           ; 1 uop

vpextrd eax, xmm0, 1        ; 2 uops
vpextrd ecx, xmm1, 1        ; 2 uops
shr     eax, cl             ; 3 uops because of CISC stupidity
vpinsrd xmm2, xmm2, eax, 1  ; 2 uops

; ... repeat twice more, for indices 2 and 3
```
So the all-registers way for variable-count shifts is 6 uops + 3 * 9 uops = 33 total uops.
The memory-destination version is 14 fused-domain uops, since I counted a version that has the shift-counts as compile-time constants. It would be many more with loading or `pextr`ing counts into ecx, since each variable-count shift is 2 uops more than an immediate-count shift.
So even though the SSE/AVX version is pretty nasty, it's not that nasty. The fully-variable vector version is still only about 15 fused-domain uops: 4 to unpack the shift counts, 2 each for the four `psrld xmm,xmm` shifts, and 3 `vblendps` to merge those results.
So the fully-variable vector version is only as bad as the fully-constant store / scalar-shift / reload version, and that one has a store-forwarding stall in it.
Note that just counting fused-domain uops isn't always the only relevant thing. Latency may be important, and execution port pressure in the unfused-domain may matter.
On Skylake, `vpsrlvd ymm, ymm, ymm` is 1 uop, 1c latency, one per 0.5c throughput. On Haswell, it's 3 uops, 2c latency, one per 2c throughput.
And remember, that's for a 256b vector. All the counts I've done are for 128b vectors.
On Haswell (instead of SnB/IvB), my SSE version would probably bottleneck on shuffle port throughput. Latency will be somewhat worse, too, because resource conflicts limit the amount of insn level parallelism it can take advantage of.
**Idea 2: variable left shift, using `pmulld` to multiply by powers of two.**
On SnB/IvB, SSE4.1 `pmulld` is 1 uop, 5c latency, one per 1c throughput. On Haswell, it's 2 uops, 10c latency, one per 2c throughput. (Twice the throughput on Skylake, since its uops can run on p1 as well as p0.)
The trick is turning the shift-count into 2^count. One way is by using a variable shift. This is fine if you can reuse the exponentiated vector of 2^count to shift multiple other vectors, otherwise it's a chicken-and-egg problem.
If the range of shift counts is small (e.g. 0..7), you can use SSSE3 `pshufb` as a LUT to map a vector of counts to a vector of 2^count values. A `0` in the low byte of each element has to become `1` (2^0), but a `0` in other bytes has to stay zero.
```asm
## 1<<8 or higher is 0, in an 8bit element
## xmm5 = _mm_set_epi8(0, 0, ..., 1<<7, ..., 1<<2, 1<<1, 1<<0);
## xmm4 = _mm_set1_epi32(0x000000ff);
## data in xmm0, shift counts in xmm1
movdqa  xmm2, xmm5      ; avoid this with AVX
pshufb  xmm2, xmm1      ; 2^count
pand    xmm2, xmm4      ; zero all but the low byte in each element
pmulld  xmm0, xmm2      ; data * 2^count
```
Intel SnB/IvB: 3 uops (not counting the movdqa which isn't needed with AVX). Latency from shift-count to result: 7c. Latency from shift-data to result: 5c. Throughput: one per 1c (since all three uops can run on different ports).
With Haswell and later: 5c higher latency. Penryn/Nehalem also take more uops for `pmulld` than SnB, but not as bad latency as Haswell.
The LUT is all zero in the upper 64b, but it's non-trivial to convince a compiler to only store the relevant part and load it with movq. I won't go into that here.
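For reference, here's an untested intrinsics sketch of this `pshufb`-LUT + `pmulld` left shift (the function name is mine; it assumes every count is in 0..7):

```c
#include <immintrin.h>

// Untested sketch of the pshufb-LUT + pmulld variable left shift.
// Requires every per-element shift count to be in 0..7.
static inline __m128i sllv_epi32_counts0to7(__m128i v, __m128i counts)
{
    // LUT: byte index c -> 1<<c for c = 0..7; the upper 64b are unused (zero).
    const __m128i pow2_lut = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
                                          (char)0x80, 0x40, 0x20, 0x10,
                                          0x08, 0x04, 0x02, 0x01);
    __m128i pow2 = _mm_shuffle_epi8(pow2_lut, counts);       // 2^count in the low byte
    pow2 = _mm_and_si128(pow2, _mm_set1_epi32(0x000000ff));  // the zero bytes of counts
                                                             // looked up 1<<0 = 1: clear them
    return _mm_mullo_epi32(v, pow2);                         // v * 2^count == v << count
}
```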
To handle larger shift counts, we could use multiple LUTs, with lookups from `[ D-8 C-8 B-8 A-8 ]` to get values for the 2nd byte of each 32b element, etc. etc. Note that `C-8` has the sign bit set if `C < 8`, and `pshufb` zeros an output byte when the sign bit of its control byte is set, so that byte correctly stays 0 for small counts. `PBLENDVB` also merges based on the sign bit being set, so it could combine the per-LUT results. It's expensive, though, so a series of merges might not be better than just using the earlier shift/blend-immediate method.
Instead of masking the `pshufb` result, you could instead add a vector of `set1_epi32(1)` to the shift counts. Then the range of indices in the LUT with non-zero bytes would be 1..8, and the padding `0` bytes in the shift-count vector would look up the low element of the LUT (which should be 0). Doing it this way would make on-the-fly constant generation more feasible:
```asm
## xmm5 = _mm_set_epi8(0, 0, ..., 1<<7, ..., 1<<2, 1<<1, 1<<0, 0);
## data in xmm0, shift counts in xmm1
pcmpeqw xmm4, xmm4      ; all-ones
psubd   xmm1, xmm4      ; shift_counts -= -1
movdqa  xmm2, xmm5
pshufb  xmm2, xmm1      ; 2^count
pmulld  xmm0, xmm2      ; data * 2^count
```
No advantage to this, unless you really care about generating a constant on the fly in one fewer insn. (`set1_epi32(0xff)` is fast to generate with `pcmpeqw` / `psrld xmm, 24`, but compilers often only generate constants on the fly when they can do it in one insn.)
The OP clarified in chat that the problem is actually much simpler: the data being shifted is a compile-time constant (0xF in particular). Also, only the low 8 bits of the result are needed. This makes it trivial to implement with just `PSHUFB` as a LUT, no multiply needed: see the previous section of this answer that used `pshufb` to look up 2^count, but put the precomputed low byte of `0xF << count` in the LUT instead.
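A sketch of that (my own, untested; assumes counts in 0..7 in the low byte of each element):

```c
#include <immintrin.h>

// Untested sketch for the simplified problem: the data is the constant 0xF
// and only the low 8 bits of (0xF << count) are needed.
static inline __m128i shl_0xF_low8(__m128i counts)
{
    const __m128i lut = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
        (char)(0xF << 7), (char)(0xF << 6), (char)(0xF << 5), (char)(0xF << 4),
        (char)(0xF << 3), (char)(0xF << 2), (char)(0xF << 1), 0x0F);
    // Byte 0 of each 32b element is the answer; the zero padding bytes of
    // counts look up lut[0] = 0x0F, so mask if the upper bytes must be zero.
    return _mm_shuffle_epi8(lut, counts);
}
```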
If you wanted a 32b result, you might generate `[ 0 0 D+8 D | 0 0 C+8 C | ... ]` for use as the control mask. With the right data in each half of the LUT, that will produce the right two bytes.