judent judent - 4 months ago 17
Java Question

How does memory reordering help processors and compilers?

I studied the Java memory model and saw re-ordering problems. A simple example:

boolean first = false;
boolean second = false;

void setValues() {
first = true;
second = true;

void checkValues() {
assert first;

Reordering is very unpredictable and weird. Also, it ruins abstractions. I suppose that processor architectures must have a good reason to do something that's so inconvenient for programmers. What are those reasons?

There is a lot of information about how to handle reordering, but I can't find anything about why it is needed. Everywhere people just say something like "it is because of some performance benefit". What are the performance benefits in storing
, for example?

Can you recommend some article, paper or book about this, or explain it by yourself?


TL;DR: It gives the compiler and hardware more room to take advantage of the as-if rule by not requiring it to preserve all behaviour of the original source, only the result of the single thread itself.

Taking the externally-observable (from other threads) ordering of loads/stores out of the picture as something that optimizations must preserve gives the compiler a lot of room to merge things into fewer operations. For the hardware, delaying stores is the big one, but for compilers all kinds of reordering can help.

(See part-way down for a section on why it helps the compiler)

Why it helps hardware

Hardware reordering earlier stores with later loads (StoreLoad reordering) inside the CPU is essential for out-of-order execution. (See below).

Other kinds of reordering (e.g. StoreStore reordering, which is the subject of your question) aren't essential, and high performance CPUs can be built with only StoreLoad reordering, not the other three kinds. (The prime example being tag:x86, where every store is a release-store, every load is an acquire-load. See the tag wiki for more details.)

Some people, like Linus Torvalds, argue that reordering stores with other stores doesn't help the hardware much, because hardware already has to track store-ordering to support out-of-order execution of a single thread. (A single thread always runs as if all of its own stores/loads happen in program order.) See other posts in that thread on realworldtech if you're curious. And/or if you find Linus's mix of insults and sensible technical arguments entertaining :P

For Java, the issue is that, architectures exist where the hardware doesn't provide these ordering guarantees. Weak memory ordering is a common feature of RISC ISAs like ARM, PowerPC, and MIPS. (But not SPARC-TSO). The reasons behind that design decision are the same ones being argued over in the realworldtech thread I linked: make the hardware simpler, and let software request ordering when needed.

So Java doesn't have much of a choice: Implementing a JVM for an architecture with a weaker memory model than the Java standard would require a store-barrier instruction after every single store, and a load-barrier before every load. (Except when the JVM's JIT-compiler can prove that no other thread can have a reference to that variable.) Running barrier instructions all the time is slow.

Why it helps compilers

(see also Jeff Preshing's excellent blog post on C++ compile-time reordering. This basically applies to Java when you include JIT-compiling to native code as part of the process.)

Another reason for keeping the Java and C/C++ memory models weak is to allow more optimizations. Since other threads are allowed (by the weak memory model) to observe our stores and loads in any order, aggressive transformations are allowed even when the code involves stores to memory.

e.g. in a case like Davide's example:

c.a = 1;
c.b = 1;

// same observable effects as the much simpler
c.a = 2;
c.b = 2;

There's no requirement that other threads be able to observe the intermediate states. So a compiler can just compile that to c.a = 2; c.b = 2;, either at Java-compile time or when the bytecode is JIT-compiled to machine code.

It's common for a method that increments something to be called multiple times from another method. Without this rule, turning it into c.a += 4 could only happen if the compiler could prove that no other thread could observe the difference.

C++ programmers sometimes make the mistake of thinking that since they're compiling for x86, they don't need std::atomic<int> to get some ordering guarantees for a shared variable. This is wrong, because optimizations happen based on the as-if rule for the language memory model, not the target hardware.

More technical hardware explanations:

Why StoreLoad reordering helps performance:

Once a store is committed into cache, it becomes globally visible to threads running on other cores (via the cache-coherency protocol). At that point, it's too late to roll it back (another core might already have gotten a copy of the value). So it can't happen until it's known for certain that the store won't fault, and neither will any instruction before it. and the store's data is ready. And that there wasn't a branch-mispredict at some point earlier, etc. etc. i.e. we need to rule out all cases of mis-speculation before we can retire a store instruction.

Without StoreLoad reordering, every load would have to wait for all preceding stores to retire (i.e. be totally finished executing, having committed the data to cache) before they could read a value from cache for use by later instructions that depend on the value loaded. (The moment when a load copies a value from cache into a register is when it's globally visible to other threads.)

Since you can't know what's happening on other cores, I don't think hardware could hide this delay in starting loads by speculating that it isn't a problem, and then detecting mis-speculation after the fact. (And treat it like a branch mispredict: throw away all work done that depended on that load, and re-issue it.) A core might be able to allow speculative early loads from cache lines that were in Exclusive or Modified state, since they can't be present in other cores. (Detecting mis-speculation if a cache-coherency request for that cache line came in from another CPU before retiring the last store before the speculative load.) Anyway, this is obviously a large amount of complexity that isn't needed for anything else.

Note that I haven't even mentioned cache-misses for stores. That increases the latency of a store from a few cycles to hundreds of cycles.

How actual CPUs work (when StoreLoad reordering is allowed):

I included some links as part of a brief intro to computer architecture in the early part of my answer on Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs. That might be helpful, or more confusing, if you're finding this hard to follow.

CPUs avoid WAR and WAW pipeline hazards for stores by buffering them in a store queue until store instructions are ready to retire. Loads from the same core have to check the store queue (to preserve the appearance of in-order execution for a single thread, otherwise you'd need memory-barrier instructions before loading anything that might have been stored recently!). The store queue is invisible to other threads; stores only become globally visible when the store instruction retires, but loads become globally visible as soon as they execute. (And can use values prefetched into the cache well ahead of that).

See also wikipedia's article on the classic RISC pipeline.

So out-of-order execution is possible for stores, but they're only reordered inside the store queue. Since instructions have to retire in order to support precise exceptions, there doesn't appear to be much benefit at all to having the hardware enforce StoreStore ordering.

Since loads become globally visible when they execute, enforcing LoadLoad ordering may require delaying loads after a load that misses in cache. Of course, in reality the CPU would speculatively execute the following loads, and detect a memory-order mis-speculation if it occurs. This is nearly essential for good performance: A large part of the benefit of out-of-order execution is to keep doing useful work, hiding the latency of cache misses.

One of Linus' arguments is that weakly-ordered CPUs require multi-threaded code to use a lot of memory barrier instructions, so they'll need to be cheap for multi-threaded code to not suck. That's only possible if you have hardware tracking the dependency ordering of loads and stores.

But if you have that hardware tracking of dependencies, you can just have the hardware enforce ordering all the time, so software doesn't have to run as many barrier instructions. If you have hardware support to make barriers cheap, why not just make them implicit on every load / store, like x86 does.

His other major argument is that memory ordering is HARD, and a major source of bugs. Getting it right once in hardware is better than every software project having to get it right. (This argument only works because it's possible in hardware without huge performance overhead.)