
Making g++ use SHLD/SHRD instructions

Consider the following code:

#include <limits>
#include <cstdint>

using T = uint32_t; // or uint64_t

T shift(T x, T y, T n)
{
    return (x >> n) | (y << (std::numeric_limits<T>::digits - n));
}
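As an aside, the expression has undefined behavior when n == 0, since y is then shifted by the full type width (32 for uint32_t). For reference, a hedged, branch-free variant (shift_safe is a name introduced here for illustration) that splits the left shift in two so every shift count stays in range:

#include <limits>
#include <cstdint>

using T = uint32_t;

// Valid for 0 <= n < 32: both left-shift counts stay in [0, 31],
// and for n == 0 the two shifts together yield 0, as desired.
T shift_safe(T x, T y, T n)
{
    return (x >> n)
         | ((y << (std::numeric_limits<T>::digits - 1 - n)) << 1);
}

Whether this form still pattern-matches to a double-shift instruction on any given compiler would need to be checked.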


According to Godbolt, clang 3.8.1 generates the following assembly at -O1, -O2, and -O3:

shift(unsigned int, unsigned int, unsigned int):
        movb    %dl, %cl
        shrdl   %cl, %esi, %edi
        movl    %edi, %eax
        retq


While gcc 6.2 (even with -mtune=haswell) generates:

shift(unsigned int, unsigned int, unsigned int):
        movl    $32, %ecx
        subl    %edx, %ecx
        sall    %cl, %esi
        movl    %edx, %ecx
        shrl    %cl, %edi
        movl    %esi, %eax
        orl     %edi, %eax
        ret


This seems far less optimized, since SHRD is very fast on Intel Sandy Bridge and later. Is there any way to rewrite the function to facilitate optimization by compilers (gcc in particular) and to favor the use of the SHLD/SHRD instructions?
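For reference, one candidate rewrite for the uint32_t case widens the pair into a single uint64_t so that only one hardware shift is needed (shift_wide is a name introduced here for illustration; whether a given compiler lowers this to a single 64-bit shift rather than the shift/or sequence has to be checked on Godbolt):

#include <cstdint>

// Combine (y:x) into one 64-bit value and shift once.
// Assumes 0 < n < 32, like the original.
uint32_t shift_wide(uint32_t x, uint32_t y, uint32_t n)
{
    uint64_t both = (static_cast<uint64_t>(y) << 32) | x;
    return static_cast<uint32_t>(both >> n);
}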

Or are there any gcc -mtune or other options that would encourage gcc to tune better for modern Intel CPUs?

With -march=haswell, it emits BMI2 shlx/shrx, but still not shrd.

Answer

No, I can see no way to get gcc to use the SHRD instruction. You can, however, manipulate the output gcc generates by changing the -mtune and -march options.
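As a narrower alternative to the global flags, gcc's per-function target attribute (a GNU extension) can enable BMI2 for a single function; this is a sketch (shift_bmi2 is an illustrative name), and whether a particular gcc version then emits shlx/shrx should be verified on Godbolt:

    #include <limits>
    #include <cstdint>

    using T = uint32_t;

    // GNU extension: compile only this function with BMI2 enabled,
    // as if it were built with an -march that includes shlx/shrx.
    __attribute__((target("bmi2")))
    T shift_bmi2(T x, T y, T n)
    {
        return (x >> n) | (y << (std::numeric_limits<T>::digits - n));
    }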

Or are there any gcc -mtune or other options that would encourage gcc to tune better for modern Intel CPUs?

Yes, you can get gcc to generate BMI2 code.

For example, x86-64 gcc 6.2 with -O3 -march=znver1 (AMD Zen) generates the following (Haswell timings):

    code            critical path latency     reciprocal throughput
    ---------------------------------------------------------------
    mov     eax, 32          *                     0.25
    sub     eax, edx         1                     0.25        
    shlx    eax, esi, eax    1                     0.5
    shrx    esi, edi, edx    *                     0.5
    or      eax, esi         1                     0.25
    ret
    TOTAL:                   3                     1.75

Compared with clang 3.8.1:

    code            critical path latency     reciprocal throughput
    ---------------------------------------------------------------
    mov    cl, dl            1                     0.25
    shrd   edi, esi, cl      4                     2
    mov    eax, edi          *                     0.25
    ret
    TOTAL:                   5                     2.25

Given the dependency chain here, SHRD is slower on Haswell, tied on Sandy Bridge, and slower on Skylake. The reciprocal throughput is also better for the shrx sequence.

So it depends: on post-BMI2 processors gcc produces the better code; pre-BMI2, clang wins. SHRD has wildly varying timings on different processors, so I can see why gcc is not overly fond of it. Even with -Os (optimize for size), gcc still does not select SHRD.

*) Not counted in the timing, because it is either not on the critical path or turns into a zero-latency register rename.
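If you really want SHRD regardless, you can force it with GNU extended inline asm. A hedged sketch (shift_shrd is an illustrative name; x86-64 only, and the usual caveat applies that inline asm blocks surrounding compiler optimizations):

    #include <cstdint>

    // Force SHRD directly. Assumes x86-64 and 0 < n < 32.
    uint32_t shift_shrd(uint32_t x, uint32_t y, uint32_t n)
    {
        __asm__("shrdl %%cl, %[hi], %[lo]"
                : [lo] "+r"(x)   // low half, shifted in place
                : [hi] "r"(y),   // high half supplies the incoming bits
                  "c"(n)         // the count must live in CL
                : "cc");         // SHRD writes the flags
        return x;
    }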
