plasmacel plasmacel - 1 month ago 17
C++ Question

Issues of compiler generated assembly for intrinsics

I'm using Intel SSE/AVX/FMA intrinsics to achieve perfectly inlining SSE/AVX instructions for some math functions.

Given the following code

#include <cmath>
#include <immintrin.h>

auto std_fma(float x, float y, float z)
{
return std::fma(x, y, z);
}

float _fma(float x, float y, float z)
{
_mm_store_ss(&x,
_mm_fmadd_ss(_mm_load_ss(&x), _mm_load_ss(&y), _mm_load_ss(&z))
);

return x;
}

float _sqrt(float x)
{
_mm_store_ss(&x,
_mm_sqrt_ss(_mm_load_ss(&x))
);

return x;
}


the clang 3.9 generated assembly with -march=x86-64 -mfma -O3

std_fma(float, float, float): # @std_fma(float, float, float)
vfmadd213ss xmm0, xmm1, xmm2
ret

_fma(float, float, float): # @_fma(float, float, float)
vxorps xmm3, xmm3, xmm3
vmovss xmm0, xmm3, xmm0 # xmm0 = xmm0[0],xmm3[1,2,3]
vmovss xmm1, xmm3, xmm1 # xmm1 = xmm1[0],xmm3[1,2,3]
vmovss xmm2, xmm3, xmm2 # xmm2 = xmm2[0],xmm3[1,2,3]
vfmadd213ss xmm0, xmm1, xmm2
ret

_sqrt(float): # @_sqrt(float)
vsqrtss xmm0, xmm0, xmm0
ret


while the generated code for
_sqrt
is fine, there are unnecessary
vxorps
(which sets the absolutely unused xmm3 register to zero) and
movss
instructions in
_fma
compared to
std_fma
(which rely on compiler intrinsic std::fma)

the GCC 6.2 generated assembly with -march=x86-64 -mfma -O3

std_fma(float, float, float):
vfmadd132ss xmm0, xmm2, xmm1
ret
_fma(float, float, float):
vinsertps xmm1, xmm1, xmm1, 0xe
vinsertps xmm2, xmm2, xmm2, 0xe
vinsertps xmm0, xmm0, xmm0, 0xe
vfmadd132ss xmm0, xmm2, xmm1
ret
_sqrt(float):
vinsertps xmm0, xmm0, xmm0, 0xe
vsqrtss xmm0, xmm0, xmm0
ret


and here are a lot of unnecessary
vinsertps
instructions

Working example: https://godbolt.org/g/q1BQym

The default x64 calling convention pass floating-point function arguments in XMM registers, so those
vmovss
and
vinsertps
instructions should be eliminated. Why do the mentioned compilers still emit them? Is it possible to get rid of them without inline assembly?

I also tried to use
_mm_cvtss_f32
instead of
_mm_store_ss
and multiple calling conventions, but nothing changed.

Answer

I write this answer based on the comments, some discussion and my own experiences.

As Ross Ridge pointed out in the comments, the compiler is not smart enough to recognize that only the lowest floating-point element of the XMM register is used, so it do zero out the other three elements with those vxorps vinsertps instructions. This is absolutely unnecessary, but what can you do?

Need to note that clang 3.9 does much better job than GCC 6.2 (or current snapshot of 7.0) at generating assembly for Intel intrinsics, since it only fails at _mm_fmadd_ss in my example. I tested more intrinsics as well and in most cases clang did perfect job to emit single instructions.

What can you do

You can use the standard <cmath> functions, with the hope that they are defined as compiler intrinsics if a proper CPU instructions is available.

This is not enough

Compilers, like GCC implement these functions with special handling of NaN and infinities. So in addition to the intrinsics, they can do some comparison, branching, and possible errno flag handling.

Compiler flags -fno-math-errno -fno-trapping-math do help GCC and clang to eliminate the additional floating-point special cases and errno handling, so they can emit single instructions if possible: https://godbolt.org/g/LZJyaB.

You can achieve the same with -ffast-math, since it also includes the above flags, but it includes much more than that, and those (like unsafe math optimizations) are probably not desired.

Unfortunately this is not a portable solution. It works in most cases (see the godbolt link), but still, you depend on the implementation.

What more

You can yet use inline assembly, which is also not portable, much more tricky and there are much more things to consider. In spite of that, for such simple one-line instructions it can be okay.

Things to consider:

1st GCC/clang and Visual Studio use different syntax for inline assembly, and Visual Studio doesn't allow it in x64 mode.

2nd You need to emit VEX encoded instructions (3 op variants, e.g. vsqrtss xmm0 xmm1 xmm2) for AVX targets, and non-VEX encoded (2 op variants, e.g. sqrtss xmm0 xmm1) variants for pre-AVX CPUs. VEX encoded instructions are 3 operand instructions, so they offer more freedom for the compiler to optimize. To take their advantage, register input/output parameters must be set properly. So something like below does the job.

#   if __AVX__
    asm("vsqrtss %1, %1, %0" :"=x"(x) : "x"(x));
#   else
    asm("sqrtss %1, %0" :"=x"(x) : "x"(x));
#   endif

But the following is a bad technique for VEX:

asm("vsqrtss %1, %1, %0" :"+x"(x));

It can yield to an unnecessary move instruction, check https://godbolt.org/g/VtNMLL.

3rd As Peter Cordes pointed out, you can lose common subexpression elimination (CSE) and constant folding (constant propagation) for inline assembly functions. However if the inline asm is not declared as volatile, the compiler can treat it as a pure function which depends only on its inputs and perform common subexpression elimination, which is great.

As Peter said:

"Don't use inline asm" isn't an absolute rule, it's just something you should be aware of and consider carefully before using. If the alternatives don't meet your requirements, and you don't end up with this inlining into places where it can't optimize, then go right ahead.