InsideLoop - 25 days ago
C++ Question

Performance difference between Windows and Linux using the Intel compiler: looking at the assembly

I am running a program on both Windows and Linux (x86-64). It has been compiled with the same compiler (Intel Parallel Studio XE 2017) and the same options, yet the Windows version is 3 times faster than the Linux one. The culprit is a call to std::erf, which is resolved in the Intel math library in both cases (by default it is linked dynamically on Windows and statically on Linux, but dynamic linking on Linux gives the same performance).

Here is a simple program to reproduce the problem.

#include <cmath>
#include <cstdio>

int main() {
    int n = 100000000;
    float sum = 1.0f;

    // Each iteration needs the previous sum, so the erf calls cannot overlap.
    for (int k = 0; k < n; k++) {
        sum += std::erf(sum);
    }

    std::printf("%7.2f\n", sum);
}
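
To compare the two platforms without a profiler, the same loop can be wrapped in a timer. The following is only a minimal sketch and was not part of the measurements reported here. The helper name time_erf_loop and its out_sum parameter are hypothetical, and std::chrono::steady_clock is assumed to be precise enough for a run of this length.

```cpp
#include <chrono>
#include <cmath>

// Hypothetical helper: runs the reproducer's loop n times and returns the
// elapsed wall-clock time in seconds. The final sum is written to *out_sum
// so the compiler cannot discard the loop as dead code.
double time_erf_loop(int n, float* out_sum) {
    float sum = 1.0f;

    const auto start = std::chrono::steady_clock::now();
    for (int k = 0; k < n; k++) {
        sum += std::erf(sum);  // same loop-carried dependency as the reproducer
    }
    const auto stop = std::chrono::steady_clock::now();

    *out_sum = sum;
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_erf_loop(100000000, &sum) from main on each system and dividing the returned time by n gives a rough cost per erf call that can be compared between Windows and Linux.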

When I profile this program with VTune, I find that the assembly differs slightly between the Windows and Linux versions. Here is the call site (the loop) on Windows:

Block 3:
vmovaps xmm0, xmm6
call 0x1400023e0 <erff>
Block 4:
inc ebx
vaddss xmm6, xmm6, xmm0
cmp ebx, 0x5f5e100
jl 0x14000103f <Block 3>

And the beginning of the erf function called on Windows:

Block 1:
push rbp
sub rsp, 0x40
lea rbp, ptr [rsp+0x20]
lea rcx, ptr [rip-0xa6c81]
movd edx, xmm0
movups xmmword ptr [rbp+0x10], xmm6
movss dword ptr [rbp+0x30], xmm0
mov eax, edx
and edx, 0x7fffffff
and eax, 0x80000000
add eax, 0x3f800000
mov dword ptr [rbp], eax
movss xmm6, dword ptr [rbp]
cmp edx, 0x7f800000

On Linux, the code is a bit different. The call site is:

Block 3:
vmovaps %xmm1, %xmm0
vmovssl %xmm1, (%rsp)
callq 0x400bc0 <erff>
Block 4:
inc %r12d
vmovssl (%rsp), %xmm1
vaddss %xmm0, %xmm1, %xmm1 <-------- hotspot here
cmp $0x5f5e100, %r12d
jl 0x400b6b <Block 3>

and the beginning of the called function (erf) is:

movd %xmm0, %edx
movssl %xmm0, -0x10(%rsp) <-------- hotspot here
mov %edx, %eax
and $0x7fffffff, %edx
and $0x80000000, %eax
add $0x3f800000, %eax
movl %eax, -0x18(%rsp)
movssl -0x18(%rsp), %xmm0
cmp $0x7f800000, %edx
jnl 0x400dac <Block 8>

I have marked the two points where the time is lost on Linux.

Does anyone understand assembly well enough to explain the difference between the two versions and why the Linux version is 3 times slower?


The Windows version passes the argument to erf() and gets the result back without using the stack - the data goes in and out through the xmm0 register.

Unlike Windows, the Linux version goes through the stack around the call. The argument and the result still travel in xmm0, but the call site stores the running sum to (%rsp) before calling erf and reloads it afterwards, whereas on Windows the sum simply stays in xmm6 across the call. The reloaded sum must be back in a register before it can participate in the addition:

vmovssl (%rsp), %xmm1
vaddss %xmm0, %xmm1, %xmm1   # Stalled because of data dependence
                             # on the previous instruction