Leo Heinsaar Leo Heinsaar - 1 month ago 5x
C Question

Can num++ be atomic for an int num?

In general, for an

int num
), as a read-modify-write operation, is not atomic. But I often see compilers, for example gcc, generate the following code for it (try here):

enter image description here

Since line 5, which corresponds to
is one instruction, can we conclude that
is atomic in this case?

And if so, does it mean that so-generated
can be used in concurrent (multi-threaded) scenarios without any danger of data races
(i.e. we don't need to make it, for example,
and impose the associated costs, since it's atomic anyway)?


This is absolutely a Data Race as defined by C++. It wouldn't matter if one compiler happened to produce code that does what you hoped on some target machine; it's still Undefined Behaviour. You need to use std::atomic, but you can use it with memory_order_relaxed if you don't care about reordering.

But first, the asm part of the question:

Since num++ is one instruction (add dword [num], 1), can we conclude that num++ is atomic in this case?

Memory-destination instructions are read-modify-write operations. No architectural register is modified, but the CPU has to hold the data internally while it sends it through its ALU. The actual register file is only a small part of the data storage inside even the simplest CPU, with latches holding outputs of one stage as inputs for another stage, etc. etc.

Memory operations from other CPUs can become globally visible between the load and store. i.e. two threads running add dword [num], 1 in a loop would step on each other's stores. After 40k increments from each of two threads, the counter might have only gone up by ~60k, not 80k on real multi-core x86 hardware.

The lock prefix can be applied to many RMW instructions to make them atomic. That is why it exists. (See also this Q&A).

So lock add dword [num], 1 is atomic. A CPU core running that instruction would keep the cache line locked from when the load reads data from cache until the store commits its result back into cache. Operations by other cores appear to happen either before or after, not during. (This is basically the definition of atomic: that no observer can see the operation as separate steps, not that it physically / electrically happened simultaneously). I went into a lot more detail about this point in my answer to Atomicity on x86.

Note that the lock prefix also turns an instruction into a full memory barrier, stopping all reordering. (See Jeff Preshing's excellent blog post. His other posts are all excellent, too, and clearly explain a lot of good stuff about lock-free programming, from x86 and other hardware details to C++ rules.)

On a uniprocessor machine, or in a single-threaded process, a single RMW instruction actually is atomic without a lock prefix. The only way for other code to access the shared variable is for the CPU to do a context switch, which can't happen in the middle of an instruction. So a plain dec dword [num] can synchronize between a program and a signal handler, or in a multi-threaded program running on a single-core machine. See the second half of my answer on another Q, and the comment thread where I explain this in more detail.

Back to C++:

It's totally bogus to use num++ without telling the compiler that you need it to compile to a single read-modify-write implementation. It's free to compile it to this if it wants:

;; valid compiler output for num++
mov   eax, [num]
inc   eax
mov   [num], eax

That used to be more efficient (on some older x86 CPUs like Pentium4, I think), but modern x86 CPUs once again handle RMW operations at least as efficiently as separate simple instructions.

Don't confuse the target memory model (x86) with the C++ memory model

Compile-time reordering is allowed. The other part of what you get with std::atomic is control over compile-time reordering, to make sure your num++ becomes globally visible only after some other operation.

Classic example: storing some data into a buffer for another thread to look at, then setting a flag. Even though x86 does acquire loads/release stores for free, you still have to tell the compiler not to reorder by using flag.store(1, std::memory_order_release);.

(As I mentioned, the x86 lock prefix is a full memory barrier, so using num.fetch_add(1, std::memory_order_relaxed); generates the same code on x86 as num++ (the default is sequential consistency), but can be much more efficient on other architectures (like ARM).)

This is what gcc actually does on x86, for a few functions that operate on a global variable.

#include <atomic>
std::atomic<int> num;
void inc_relaxed() {
  num.fetch_add(1, std::memory_order_relaxed);

int load_num() { return num; }
void store_num(int val){ num = val; }  // even seq_cst loads are free on x86
void store_num_release(int val){
  // allowed to be delayed in becoming globally visible until after following ops.
  num.store(val, std::memory_order_release);

See the src+asm formatted nicely on the Godbolt compiler explorer. I omitted a couple function from the src, but their content should be obvious from the function names in the asm output.

# g++ 6.2 -O3, targeting x86-64 System V calling convention.  (First arg in edi)
    lock add        DWORD PTR num[rip], 1      #### Even relaxed RMWs need a lock.  There's no way to request just a single-instruction RMW with no lock, for synchronizing between a program and signal handler for example. :/  There is atomic_signal_fence for ordering, but nothing for RMW.
    lock add        DWORD PTR num[rip], 1
    mov     eax, DWORD PTR num[rip]
    mov     DWORD PTR num[rip], edi
    mfence                          ##### seq_cst stores need an mfence
    mov     DWORD PTR num[rip], edi
    ret                             ##### release and weaker doesn't.
    mov     DWORD PTR num[rip], edi