Leo Heinsaar - 11 months ago 41
C Question

# Can num++ be atomic for 'int num'?

In general, for

int num
,
num++
(or
++num
), as a read-modify-write operation, is not atomic. But I often see compilers, for example GCC, generate the following code for it (try here):

Since line 5, which corresponds to
num++
is one instruction, can we conclude that
num++
is atomic in this case?

And if so, does it mean that so-generated
num++
can be used in concurrent (multi-threaded) scenarios without any danger of data races
(i.e. we don't need to make it, for example,
std::atomic<int>
and impose the associated costs, since it's atomic anyway)?

This is absolutely a Data Race as defined by C++. It wouldn't matter if one compiler happened to produce code that does what you hoped on some target machine; it's still Undefined Behaviour. You need to use std::atomic, but you can use it with memory_order_relaxed if you don't care about reordering.

But first, the asm part of the question:

Since num++ is one instruction (add dword [num], 1), can we conclude that num++ is atomic in this case?

Memory-destination instructions are read-modify-write operations. No architectural register is modified, but the CPU has to hold the data internally while it sends it through its ALU. The actual register file is only a small part of the data storage inside even the simplest CPU, with latches holding outputs of one stage as inputs for another stage, etc. etc.

Memory operations from other CPUs can become globally visible between the load and store. i.e. two threads running add dword [num], 1 in a loop would step on each other's stores. (See @Margaret's answer for a nice diagram). After 40k increments from each of two threads, the counter might have only gone up by ~60k (not 80k) on real multi-core x86 hardware.

The lock prefix can be applied to many read-modify-write (memory destination) instructions to make them atomic. That is why it exists. (See also this Q&A).

So lock add dword [num], 1 is atomic. A CPU core running that instruction would keep the cache line locked from when the load reads data from cache until the store commits its result back into cache. Operations by other cores appear to happen either before or after, not during. (This is basically the definition of atomic: that no observer can see the operation as separate steps, not that it physically / electrically happened simultaneously). I went into a lot more detail about this point in my answer to Atomicity on x86.

Note that the lock prefix also turns an instruction into a full memory barrier, stopping all reordering. (See Jeff Preshing's excellent blog post. His other posts are all excellent, too, and clearly explain a lot of good stuff about lock-free programming, from x86 and other hardware details to C++ rules.)

On a uniprocessor machine, or in a single-threaded process, a single RMW instruction actually is atomic without a lock prefix. The only way for other code to access the shared variable is for the CPU to do a context switch, which can't happen in the middle of an instruction. So a plain dec dword [num] can synchronize between a single-threaded program and its signal handlers, or in a multi-threaded program running on a single-core machine. See the second half of my answer on another Q, and the comments under it, where I explain this in more detail.

### Back to C++:

It's totally bogus to use num++ without telling the compiler that you need it to compile to a single read-modify-write implementation. It's free to compile it to this if it wants:

;; valid compiler output for num++
mov   eax, [num]
inc   eax
mov   [num], eax


(That used to be more efficient (on some older x86 CPUs like Pentium4, I think; see the tag wiki and Agner Fog's instruction tables / guides for perf details), but modern x86 CPUs once again handle RMW operations at least as efficiently as separate simple instructions.)

### Don't confuse the target memory model (x86) with the C++ memory model

Compile-time reordering is allowed. The other part of what you get with std::atomic is control over compile-time reordering, to make sure your num++ becomes globally visible only after some other operation.

Classic example: storing some data into a buffer for another thread to look at, then setting a flag. Even though x86 does acquire loads/release stores for free, you still have to tell the compiler not to reorder by using flag.store(1, std::memory_order_release);.

You might expecting that this code will synchronize with other threads:

// flag is just a plain int global, not std::atomic<int>.
flag--;       // this isn't a real lock, but pretend it's somehow meaningful.
modify_a_data_structure(&foo);
flag++;


but it won't. The compiler is free to move the flag++ across the function call (if it inlines the function or knows that it doesn't look at flag). Then it can optimize away the modification entirely, because flag isn't even volatile. (And no, C++ volatile is not a useful substitute for std::atomic.)

As I mentioned, the x86 lock prefix is a full memory barrier, so using num.fetch_add(1, std::memory_order_relaxed); generates the same code on x86 as num++ (the default is sequential consistency), but can be much more efficient on other architectures (like ARM).

This is what gcc actually does on x86, for a few functions that operate on a std::atomic global variable.

See the src+asm formatted nicely on the Godbolt compiler explorer. You can select other target architectures, including ARM, MIPS, and PowerPC, to see what kind of asm you get from atomics for those targets.

#include <atomic>
std::atomic<int> num;
void inc_relaxed() {
}

int load_num() { return num; }            // even seq_cst loads are free on x86
void store_num(int val){ num = val; }
void store_num_release(int val){
num.store(val, std::memory_order_release);
}
// can the compiler collapse multiple atomic ops into one?  No, it can't.


# g++ 6.2 -O3, targeting x86-64 System V calling convention.  (First arg in edi/rdi)
inc_relaxed():
lock add        DWORD PTR num[rip], 1      #### Even relaxed RMWs need a lock.  There's no way to request just a single-instruction RMW with no lock, for synchronizing between a program and signal handler for example. :/  There is atomic_signal_fence for ordering, but nothing for RMW.
ret
inc_seq_cst():
lock add        DWORD PTR num[rip], 1
ret
mov     eax, DWORD PTR num[rip]
ret
store_num(int):
mov     DWORD PTR num[rip], edi
mfence                          ##### seq_cst stores need an mfence
ret
store_num_release(int):
mov     DWORD PTR num[rip], edi
ret                             ##### release and weaker doesn't.
store_num_relaxed(int):
mov     DWORD PTR num[rip], edi
ret


Notice how MFENCE (a full barrier) is needed after a sequential-consistency stores. x86 is strongly ordered in general, but StoreLoad reordering is allowed. Having a store buffer is essential for good performance on a pipelined out-of-order CPU. Jeff Preshing's Memory Reordering Caught in the Act shows the consequences of not using MFENCE, with real code to show reordering happening on real hardware.

re: discussion in comments about merging std::atomic num++; num-=2; operations into one:

You might hope that compilers would combine multiple cancelling operations into a single operation, because nothing guarantees that an observer could see the intermediate values. i.e. the ordering where nothing becomes globally visible between these ops is a possibility, so the compiler should pick it at compile time (so it always happens that way).

I think this is true, and the compiler is allowed to do that, but nobody's written the complicated code that would allow the compiler to do that safely (without ever getting it wrong). Don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much. It would also potentially violate the principle of least surprise, and lock-free code is hard enough to write correctly in the first place. Lock-free code needs to be carefully written for high performance, so it won't contain redundant atomic operations in the first place.

The only time you'd use atomics for all your variables "just in case" is when Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs, or something. :)

Even with mo_relaxed, you still get separate lock operations.

void multiple_ops_relaxed(std::atomic<unsigned int>& num) {