timlyo timlyo - 10 days ago 3
C Question

Why is integer assignment on a naturally aligned variable atomic?

I've been reading this article about atomic operations, and it mentions 32bit integer assignment being atomic on x86, as long as the variable is naturally aligned.

Why does natural alignment assure atomicity?


"Natural" alignment means aligned to it's own type width. Thus, the load/store will never be split across any kind of boundary wider than itself (e.g. cache-line or page).

The only sane way for an x86 C or C++ compiler to compile an assignment to a 32bit variable is with a single store instruction. It could store the bytes one at a time, but we can be sure it doesn't. The SystemV AMD64 ABI doesn't specifically forbid perverse compilers from making accesses to int variables non-atomic, even though it does require int to be 4B with a default alignment of 4B.

Data races are Undefined Behaviour in both C and C++, so compilers can and do assume that memory is not asynchronously modified. For code that is guaranteed not to break, and actually stores or loads instead of keeping values cached in registers, use C11 stdatomic or C++11 std::atomic.

std::atomic<int> shared;  // shared variable (in aligned memory)

int x;  // local variable (compiler can keep it in a register)
x = shared.load(std::memory_order_relaxed);
shared.store(x, std::memory_order_relaxed);
// shared = x;  // don't do that, the default is seq_cst so stores need MFENCE

Thus, we just need to talk about the behaviour of an insn like mov [shared], eax.

TL;DR: The x86 ISA guarantees that such stores and loads are atomic, up to 64bits wide. So we're fine as long as we ensure the compiler generates those.

IIRC, there were SMP 386 systems, but the current memory semantics weren't established until 486. This is why the manual says "486 and newer".

From the "Intel® 64 and IA-32 Architectures Software Developer Manuals, volume 3", with my notes in italics. (see also the tag wiki for links: current versions of all volumes, or direct link to page 256 of the vol3 pdf from Dec 2015)

In x86 terminology, a "word" is two bytes. 32bits are a double-word, or DWORD.

Section 8.1.1 Guaranteed Atomic Operations

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

  • Reading or writing a byte
  • Reading or writing a word aligned on a 16-bit boundary
  • Reading or writing a doubleword aligned on a 32-bit boundary (This is another way of saying "natural alignment")

That last point that I bolded is the answer to your question: This behaviour is part of what's required for a processor to be an x86 CPU (i.e. an implementation of the ISA).

The rest of the section provides further guarantees for newer CPUs. I haven't read the AMD manuals, but as I understand it, modern AMD CPUs provide guarantees at least as strong as those documented by Intel for P6. There have been attempts to formalize the x86 memory model, the latest one being the x86-TSO (extended version) paper from 2009 (link from the memory-ordering section of the tag wiki). It's not usefully skimable since they define some symbols to express things in their own notation, and I haven't tried to really read it.

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

  • Reading or writing a quadword aligned on a 64-bit boundary (e.g. x87 load/store of a double, or cmpxchg8b which was new in Pentium (P5))
  • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:

  • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line. (i.e. write-back or write-through memory regions, not uncacheable or write-combining. They don't mean that the cache-line has to already be hot in L1 cache)

The section goes on to point out that accesses split across cache lines (and page boundaries) are not guaranteed to be atomic, and:

"An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using multiple memory accesses."

So 64bit x87 and MMX/SSE loads/stores up to 64b (e.g. movsd, movq, movhps, pinsrq, extractps, etc) are atomic if all the data comes from or goes to the same cache line. On some CPUs with 128b or 256b data paths between execution units and L1 cache, 128b and even 256b vector loads/stores are atomic, but this is not guaranteed by any standard. If you want atomic 128b across all x86 systems, you must use cmpxchg16b (available only in 64bit mode).

Even CPUs that internally do atomic 128b loads/stores can exhibit non-atomic behaviour in multi-socket systems with a coherency protocol that operates in smaller chunks: e.g. AMD Opteron 2435 with threads running on separate sockets, connected with HyperTransport.

Atomic Read-Modify-Write

I mentioned cmpxchg8b, but I was only talking about the load and the store each separately being atomic (i.e. no "tearing" where one half of the load is from one store, the other half of the load is from a different store).

To prevent the contents of that memory location from being modified between the load and the store, you need lock cmpxchg8b, just like you need lock inc [mem] for the entire read-modify-write to be atomic.

The lock prefix makes even unaligned accesses that cross cache-line or page boundaries atomic, but you can't use it with mov to make an unaligned store or load atomic. It's only usable with memory-destination read-modify-write instructions like add [mem], eax.

(lock is implicit in xchg reg, [mem], so don't try to save code-size or instruction count with it if you care about performance. Only use it when you want the memory barrier and atomic effect, or when code-size is the only thing that matters, e.g. in a boot sector.)

See also: Can num++ be atomic for 'int num'?

Why lock mov [mem], reg doesn't exist for atomic unaligned stores

From the insn ref manual (Intel x86 manual vol2), cmpxchg:

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)

This design decision reduced chipset complexity before the memory controller was built into the CPU. It may still do so for locked instructions on MMIO regions that hit the PCI-express bus rather than DRAM. It would just be confusing for a lock mov reg, [MMIO_PORT] to produce a write as well as a read to the memory-mapped I/O register.

The other explanation is that it's not very hard to make sure your data has natural alignment, and lock store would perform horribly compared to just making sure your data is aligned. It would be silly to spend transistors on something that would be so slow it wouldn't be worth using. If you really need it (and don't mind reading the memory too), you could use xchg [mem], reg (XCHG has an implicit LOCK prefix), which is even slower than a hypothetical lock mov.

Using a lock prefix is also a full memory barrier, so it imposes a performance overhead beyond just the atomic RMW. (Fun fact: Before mfence existed, a common idiom was lock add [esp], 0, which is a no-op other than clobbering flags and doing a locked operation. [esp] is almost always hot in L1 cache and won't cause contention with any other core. This idiom may still be more efficient than MFENCE on AMD CPUs.)

Motivation for this design decision:

Without it, software would have to use 1-byte locks (or some kind of available atomic type) to guard accesses to 32bit integers, which is hugely inefficient compared to shared atomic read access for something like a global timestamp variable updated by a timer interrupt. It's probably basically free in silicon to guarantee for aligned accesses of bus-width or smaller.

For locking to be possible at all, some kind of atomic access is required. (Actually, I guess the hardware could provide some kind of totally different hardware-assisted locking mechanism.) For a CPU that does 32bit transfers on its external data bus, it just makes sense to have that be the unit of atomicity.

Since you offered a bounty, I assume you were looking for a long answer that wandered into all interesting side topics. Let me know if there are things I didn't cover that you think would make this Q&A more valuable for future readers.

BTW, I highly recommend reading more of Jeff Preshing's blog posts. They're excellent, and helped me put together the pieces of what I knew into an understanding of memory ordering in C/C++ source vs. for different hardware architectures, and how / when to tell the compiler what you want if you aren't writing asm directly.