Tomáš Zato - 1 month ago
C Question

How does C compiler handle post-increment (i++), memory wise?

I've just been explaining `i++` vs `++i` details to a friend. I was telling him how, with no optimization, `i++` in a `for` loop essentially means making a copy of your `i` that is not used for anything, since `i++` can be described with this pseudocode:

tmp = i;
i = i + 1;
return tmp;


Well, I noticed I just don't really know one thing: where is the memory for our `tmp` allocated? Does it increase the memory size required for the whole procedure/function? (That is, is it on the stack?)

I suppose it is, but how can I test that? If (and only if) it matters, we're talking about the C99 standard and the GCC compiler. But I'd prefer a broader answer, to get some perspective on the matter.

Answer

Your assumption that compilers always produce different results for ++i and i++ without optimization is false. Here's a look at pre- and post-increment on godbolt, with gcc 6.2, no optimization:

The C Code

void pre() {
  int i = 0;
  ++i;
}

void post() {
  int i = 0;
  i++;
}

The Assembly (x86-64)

pre():
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], 0
        add     DWORD PTR [rbp-4], 1
        nop
        pop     rbp
        ret

post():
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], 0
        add     DWORD PTR [rbp-4], 1
        nop
        pop     rbp
        ret

Note that the compiled code is byte-for-byte identical here for i++ and ++i. Both simply add 1 to the memory location reserved on the stack for i. No temporary is created or needed.

You might complain that I'm not actually using the value of the incremented expression, so let's look at something that actually does use the value:

The C Code

int pre(int i) {
  return ++i;
}

int post(int i) {
  return i++;
}

The Assembly (x86-64)

pre(int):
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        add     DWORD PTR [rbp-4], 1
        mov     eax, DWORD PTR [rbp-4]
        pop     rbp
        ret

post(int):
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        mov     eax, DWORD PTR [rbp-4]
        lea     edx, [rax+1]
        mov     DWORD PTR [rbp-4], edx
        pop     rbp
        ret

Here, the assembly is different. The pre-increment version uses a memory RMW (read-modify-write) instruction to increment the variable, while the post-increment version computes the incremented value separately in edx. While looking at un-optimized code is always an exercise in futility, I'm quite sure the post-increment version is faster here, as its dependency chain is shorter: there is no RMW instruction, and no subsequent store-forwarding stall, in the critical path.

A key note is that even here there is no "temporary space" allocated in memory - only the assembly changes, and a register (eax here) is used for free as the resting place for the value of i before the post-increment.

Of course, you shouldn't really read anything into unoptimized code. It isn't going to be used in practice and you can't really learn much about the efficiency of any construct by studying it, because the optimized code will vary wildly across different idioms.

So finally, let's look at a realistic example where we are using optimization, and both the value of the increment expression and the underlying variable are actually used (so neither value is optimized away). Here we take an int & reference to i in each function so that the passed-in value is modified. We use -O2, although every other optimization level I tried produces identical results, other than -Os:

The C Code

int pre(int& i) {
  return ++i;
}

int post(int& i) {
  return i++;
}

The Assembly (x86-64)

pre(int&):
        mov     eax, DWORD PTR [rdi]
        add     eax, 1
        mov     DWORD PTR [rdi], eax
        ret

post(int&):
        mov     eax, DWORD PTR [rdi]
        lea     edx, [rax+1]
        mov     DWORD PTR [rdi], edx
        ret 

Both functions have almost exactly the same cost. They have the same number of instructions, and on modern Intel hardware produce the same number of uops (4 fused-domain) with the same cost. The functions take exactly the same number of instruction bytes¹.

The post-increment version differs in that it uses a lea instruction to put its result in edx, so that eax remains un-incremented as the return value. The pre-increment version simply uses eax for everything. The use of edx has no direct cost here because it is a scratch register in x86-64, so there is no need to save its value. In more complex code, the use of another register could increase register pressure, although that is fairly unlikely because the lifetime is very short and there are more opportunities for re-ordering.

The post-increment version actually has a smaller dependency chain for the return value - assuming the caller uses the return value in eax, it will take 1 additional cycle to be ready in the pre-increment case (since the add eax, 1 is part of its dependency chain). That's actually inherent in the pre-increment definition: in one way, pre-increment is slower, because the increment and subsequent use of the value must occur serially, while in the post-increment case they can occur in parallel, since the use of the value doesn't depend on the increment operation. Of course, this effect is very small - not usually more than a single cycle. The classic advice to use pre-increment probably still applies, because for objects it can make a big difference. For primitives, not so much.


¹ Interestingly, the pre-increment version could have been implemented with inc eax rather than add eax, 1, which is probably just as fast on modern hardware and saves a byte. It probably isn't done because of the mostly obsolete advice to avoid inc and dec due to partial-flag stalls. In fact, with -Os (optimize for size) gcc does use an inc here.