C++ Question

Visual Studio C++ compiler generates 3x slower code when changing completely unrelated code

I have a nested for loop which generates the following assembly:

# branch target labels manually added for readability
002E20F8 mov ebx,esi
002E20FA mov dword ptr [ebp-10h],3B9ACA00h
002E2101 sub ebx,edi
002E2103 add ebx,7
002E2106 shr ebx,3
002E2109 nop dword ptr [eax]
002E2110 xor eax,eax
002E2112 xor ecx,ecx
002E2114 cmp edi,esi
002E2116 mov edx,ebx
002E2118 cmova edx,eax
002E211B mov eax,edi
002E211D test edx,edx
002E211F je main+107h (02E2137h) ;end_innerloop

002E2121 movsd xmm0,mmword ptr [eax]
002E2125 inc ecx ; inc/addsd swapped
002E2126 addsd xmm0,mmword ptr [k]
002E212B add eax,8
002E212E movsd mmword ptr [k],xmm0
002E2133 cmp ecx,edx
002E2135 jne main+0F1h (02E2121h) ;inner_loop
002E2137 sub dword ptr [ebp-10h],1
002E213B jne main+0E0h (02E2110h) ;outer_loop

If I change a line of code before the nested for loop to simply declare an int and then print it out after the for loop, the compiler pulls the store/reload of k out of the inner loop.

(The first version of the question described this as the compiler generating the instructions "in a slightly different order".)

003520F8 mov ebx,esi
003520FA mov dword ptr [ebp-10h],3B9ACA00h
00352101 sub ebx,edi
00352103 add ebx,7
00352106 shr ebx,3
00352109 nop dword ptr [eax]
00352110 xor eax,eax
00352112 xor ecx,ecx
00352114 cmp edi,esi
00352116 mov edx,ebx
00352118 cmova edx,eax
0035211B mov eax,edi
0035211D test edx,edx
0035211F je main+107h (0352137h) ;end_innerloop

00352121 movsd xmm0,mmword ptr [k] ; load of k hoisted out of the loop. Strangely not optimized to xorpd xmm0,xmm0

00352126 addsd xmm0,mmword ptr [eax]
0035212A inc ecx
0035212B add eax,8
0035212E cmp ecx,edx
00352130 jne main+0F6h (0352126h) ;inner_loop

00352132 movsd mmword ptr [k],xmm0 ; store of k moved out of the inner loop

00352137 sub dword ptr [ebp-10h],1
0035213B jne main+0E0h (0352110h) ;outer_loop

This second arrangement by the compiler is 3x faster. I'm slightly shocked by this. Does anyone know what is going on?

This was compiled with Visual Studio 2015.

Compiler flags (I can add more if requested):

Optimization: Maximize Speed (/O2)

The code:

#include <iostream>
#include <vector>
#include "Stopwatch.h"

static constexpr int N = 1000000000;

int main()
{
    std::vector<double> buffer;
    buffer.resize(10);

    for (auto& i : buffer)
        i = 1e-100;

    double k = 0;
    int h = 0; // removing this line and swapping the lines std::cout << "time = "... results in 3x slower code??!!

    Stopwatch watch;

    for (int i = 0; i < N; i++)
        for (auto& j : buffer)
            k += j;

    //std::cout << "time = " << watch.ElapsedMilliseconds() << " / " << k << std::endl;
    std::cout << "time = " << watch.ElapsedMilliseconds() << " / " << k << " / " << h << std::endl;

    std::cout << "Done...";
}


Stopwatch class:

#pragma once

#include <chrono>

class Stopwatch
{
    typedef std::chrono::high_resolution_clock clock;
    typedef std::chrono::microseconds microseconds;
    typedef std::chrono::milliseconds milliseconds;

    clock::time_point _start;

public:
    Stopwatch() { Restart(); }   // start timing on construction

    void Restart()
    {
        _start = clock::now();
    }

    double ElapsedMilliseconds()
    {
        return ElapsedMicroseconds() * 1E-3;
    }

    double ElapsedSeconds()
    {
        return ElapsedMicroseconds() * 1E-6;
    }

    Stopwatch(const Stopwatch&) = delete;
    Stopwatch& operator=(const Stopwatch&) = delete;

    double ElapsedMicroseconds()
    {
        return static_cast<double>(std::chrono::duration_cast<microseconds>(clock::now() - _start).count());
    }
};


After editing the question to fix the confusingly-placed line breaks and add branch-target labels, it became clear that the two loops are significantly different. The movsd isn't reordered within the loop; it's outside the loop.

Instead of leaving the misleading stuff in the question and correcting it in the answer, I decided to edit the question and talk about it here. The code blocks are long enough that future readers would just get bogged down trying to keep track of four versions of the code, and keeping the old version wouldn't help people with the same question find this one with a search engine.

The fast version keeps k in a register (xmm0), while the slow version reloads/stores it every iteration. This is often a sign that the compiler's alias analysis failed to prove that things couldn't overlap.
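
For contrast, here's a minimal sketch (my own example, not the OP's code) of a case where the compiler genuinely can't keep the accumulator in a register, because the aliasing really is unprovable:

#include <cstddef>
#include <vector>

// *out might point into buf, so the running sum must be stored through
// out on every iteration and buf[i] re-read afterwards; the accumulator
// can't simply live in a register for the whole loop.
void sum_into(const std::vector<double>& buf, double* out)
{
    for (std::size_t i = 0; i < buf.size(); ++i)
        *out += buf[i];
}

Nothing like that applies to k in the question, which is why the per-iteration store/reload looks like a missed optimization rather than a necessity.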

It's not the extra stores and loads themselves that hurt; it's that they lengthen the loop-carried dependency chain by the store-forwarding latency, from the store in one iteration to the reload in the next. Store-forwarding latency is something like 6 cycles on modern Intel CPUs, vs. 3 cycles for addsd (on Haswell, for example). So that perfectly explains the factor-of-3 speedup:

  • 9 cycles per iteration when the loop-carried dependency chain is addsd + store-forwarding
  • 3 cycles per iteration when the loop-carried dependency chain is just addsd
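
Here's a minimal sketch (mine, not code from the question) that reproduces the two patterns at the source level: volatile forces the store/reload round-trip that the slow asm does, while a plain local lets the sum stay in a register:

#include <vector>

// volatile accumulator: every addition becomes load, addsd, store, so
// the loop-carried chain includes the store-forwarding latency.
double sum_slow(const std::vector<double>& buf)
{
    volatile double k = 0;
    for (double j : buf)
        k = k + j;
    return k;
}

// plain local: the accumulator can stay in xmm0, so the loop-carried
// chain is just the addsd latency.
double sum_fast(const std::vector<double>& buf)
{
    double k = 0;
    for (double j : buf)
        k += j;
    return k;
}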

See Agner Fog's instruction tables and microarchitecture guide (https://www.agner.org/optimize/) for the details. Also other links in the x86 tag wiki.

IDK how MSVC didn't manage to prove that k doesn't overlap with anything, because it's a local whose address never escapes the function. (Its address isn't even taken.) MSVC does a horrible job there; it should just xorps xmm0,xmm0 to zero the accumulator ahead of the loop, instead of loading some zeroed memory. I don't even see where it zeroes any memory; are you sure that asm is for k as a local, not a global? Actually, that's probably not the asm for the whole function.

If you'd compiled with MSVC's equivalent of -ffast-math (/fp:fast), it could have vectorized the reduction (with addpd). With such a tiny vector that you're looping over many times, the non-multiple-of-4 element count is moderately inconvenient. Still, the loop overhead isn't an issue here; the loop-carried dependency chain dominates even when k is kept in a register, since your code only uses one accumulator. One addsd per 3 clocks leaves tons of time for other insns to run.
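
If you'd rather not rely on compiler flags, a common manual workaround (a sketch with my own naming, not code from the question) is to split the reduction across several accumulators so independent addsd chains can overlap:

#include <cstddef>
#include <vector>

// Four independent accumulators keep four addsd dependency chains in
// flight, hiding most of the addsd latency. Note that this reassociates
// the FP additions, which is the same freedom /fp:fast (or -ffast-math)
// gives the compiler.
double sum_unrolled(const std::vector<double>& buf)
{
    double a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= buf.size(); i += 4) {
        a0 += buf[i + 0];
        a1 += buf[i + 1];
        a2 += buf[i + 2];
        a3 += buf[i + 3];
    }
    for (; i < buf.size(); ++i)   // tail for a non-multiple-of-4 size
        a0 += buf[i];
    return (a0 + a1) + (a2 + a3);
}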

BTW, there are much better ways to initialize a vector:

Instead of resizing a vector (constructing new elements with the default constructor) and then writing new values, you should just do something like

std::vector<double> buffer(10, 1e-100);   // 10 elements set to 1e-100

That makes sure the asm doesn't waste time storing zeros before storing the value you want. I think resize can also take a value to copy into new elements, so you can still declare an empty vector and then resize it.
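
For example, assuming you do want to keep the declare-then-grow structure, resize has an overload that takes the fill value:

#include <vector>

int main()
{
    std::vector<double> buffer;
    buffer.resize(10, 1e-100);   // new elements are copies of 1e-100; no separate fill loop needed
}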