Eloff Eloff - 2 months ago 6
C Question

Acquire/release semantics with non-temporal stores on x64

I have something like:

if (Foo* f = acquire_load()) {
    ... use Foo
}

and in another thread:

auto f = new Foo();
release_store(f);

You could easily imagine an implementation of acquire_load and release_store that uses atomic with load(memory_order_acquire) and store(memory_order_release). But now what if release_store is implemented with _mm_stream_si64, a non-temporal write, which is not ordered with respect to other stores on x64? How to get the same semantics?
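
For reference, a minimal sketch of that conventional std::atomic version might look like the following. The contents of Foo and the _plain names are made up for illustration, so they don't clash with the gFoo version further down.

```cpp
#include <atomic>
#include <cassert>

struct Foo { int value = 42; };            // hypothetical payload type

std::atomic<Foo*> gFooPlain{nullptr};      // hypothetical global, separate from gFoo

// Publish f: writes made before this store are visible to any thread
// whose acquire load observes f.
void release_store_plain(Foo* f) {
    assert(f != nullptr);                  // publishing a null pointer would be pointless here
    gFooPlain.store(f, std::memory_order_release);
}

// Returns nullptr until a pointer has been published.
Foo* acquire_load_plain() {
    return gFooPlain.load(std::memory_order_acquire);
}
```

On x86 both of these should compile to plain mov instructions, with the memory order only constraining the compiler.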

I think the following is the minimum required:

atomic<Foo*> gFoo;

Foo* acquire_load() {
    return gFoo.load(memory_order_relaxed);
}

void release_store(Foo* f) {
    _mm_stream_si64(*(Foo**)&gFoo, f);
}

And use it as so:

// thread 1
if (Foo* f = acquire_load()) {
    _mm_lfence();
    ... use Foo
}

// thread 2
auto f = new Foo();
_mm_sfence(); // ensures Foo is constructed by the time f is published to gFoo
release_store(f);

Is that correct? I'm pretty sure the sfence is absolutely required here. But what about the lfence? Is it required, or would a simple compiler barrier be enough for x64, e.g. asm volatile("": : :"memory")? According to the x86 memory model, loads are not re-ordered with other loads. So to my understanding, acquire_load() must happen before any load inside the if statement, as long as there's a compiler barrier.


I might be wrong about some things in this answer (proof-reading welcome from people that know this stuff!). It's based on reading the docs and Jeff Preshing's blog, not actual recent experience or testing.

Linus Torvalds strongly recommends against trying to invent your own locking, because it's so easy to get it wrong. It's more of an issue when writing portable code for the Linux kernel, rather than something that's x86-only, so I feel brave enough to try to sort things out for x86.

First of all, using NT stores for a single pointer global variable is insane. You might want to use NT stores into the Foo it points to, but evicting the pointer itself from cache is perverse. (And yes, movnt stores evict the cache line if it was in cache to start with, see vol1 ch Caching of Temporal vs. Non-Temporal Data). Your function names also don't really reflect what you're doing.

I think it would be a lot more sane to do a bunch of NT stores (e.g. for a memset or memcpy type of thing), then an SFENCE, then a normal release_store: done_flag.store(1, std::memory_order_release).
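
A rough sketch of that pattern, assuming x86-64 and a compiler that provides <immintrin.h>. The buffer, its size, and the produce / try_consume names are all hypothetical, not something from the question.

```cpp
#include <immintrin.h>   // _mm_stream_si32, _mm_sfence (x86 only)
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t kCount = 1024;       // hypothetical buffer size
alignas(64) int buffer[kCount];            // hypothetical shared buffer
std::atomic<int> done_flag{0};

// Producer: fill the buffer with non-temporal stores, then publish it.
void produce(int value) {
    assert(done_flag.load(std::memory_order_relaxed) == 0);  // publish only once in this sketch
    for (std::size_t i = 0; i < kCount; ++i)
        _mm_stream_si32(&buffer[i], value);          // weakly-ordered NT store (movnti)
    _mm_sfence();                                    // order the NT stores before the flag store
    done_flag.store(1, std::memory_order_release);   // ordinary release store
}

// Consumer: returns true and fills in the sum once the buffer is published.
bool try_consume(long long& sum) {
    if (done_flag.load(std::memory_order_acquire) == 0)
        return false;
    sum = 0;
    for (std::size_t i = 0; i < kCount; ++i)
        sum += buffer[i];
    return true;
}
```

The flag itself stays a normal cached store, so a consumer spinning on it doesn't keep missing in cache.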

I don't see how using a movnti store to the synchronization variable could possibly improve performance. The whole point of NT stores is for use with Non-Temporal data, which won't be used again (by any thread) for a long time if ever. The locks that control access to shared buffers, or the flags that producers/consumers use to mark data as read, are expected to be read.

x86 hardware is extremely heavily optimized for doing release-stores, because every normal store is a release-store. The hardware has to be good at it for x86 to run fast.

movnt stores can be reordered with other stores, but not with older reads. Intel's x86 manual vol3, chapter 8.2.2 (Memory Ordering in P6 and More Recent Processor Families) says that

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads. (note the lack of exceptions).
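  • Writes to memory are not reordered with other writes, with the following exceptions:
      ◦ streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
      ◦ string operations.
  • ... stuff about clflushopt and the fence instructions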

Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order. So a StoreStore barrier (SFENCE) is necessary but not sufficient. However, the x86 memory model for WB memory already prevents LoadStore reordering even for weakly-ordered stores, so we don't need an LFENCE for its LoadStore barrier effect, only a LoadStore compiler barrier (e.g. std::atomic_signal_fence(std::memory_order_release)). You might as well just use a thread_fence, though: it won't emit any instructions for x86, but it will make your code portable to other architectures with the _mm_ stuff taken out.

// The function can't be called release_store unless it actually is one (i.e. includes all necessary barriers)
// Your original function should be called relaxed_store
void release_store(const Foo* f) {
   // _mm_lfence();  // make sure all reads from the locked region are already globally visible.  nvm, this is already guaranteed
   std::atomic_thread_fence(std::memory_order_release);  // no insns emitted on x86 (since it assumes no NT stores), but still a compiler barrier
   _mm_sfence();  // make sure all writes to the locked region are already globally visible
   _mm_stream_si64((long long int*)&gFoo, (int64_t)f);
}

This stores to the atomic variable (note the lack of dereferencing &gFoo). Your function stores to the Foo it points to, which is super weird; IDK what the point of that was. Also note that it compiles as valid C++11 code.

When thinking about what a release-store means, think about it as the store that releases the lock on a shared data structure. In your case, when the release-store becomes globally visible, any thread that sees it should be able to safely dereference it.

To do an acquire-load, just tell the compiler you want one.

x86 doesn't need any barrier instructions, but specifying mo_acquire instead of mo_relaxed gives you the necessary compiler-barrier. As a bonus, this function is portable: you'll get any and all necessary barriers on other architectures:

Foo* acquire_load() {
    return gFoo.load(std::memory_order_acquire);
}
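
Putting the two functions together, a self-contained sketch that should compile on x86-64 might look like this. The Foo definition is made up, and writing to the std::atomic object through a cast pointer is an assumption about how the implementation lays it out, not something the C++ standard guarantees.

```cpp
#include <immintrin.h>   // _mm_sfence, _mm_stream_si64 (x86-64 only)
#include <atomic>
#include <cassert>
#include <cstdint>

struct Foo { int value = 0; };             // hypothetical payload type

std::atomic<Foo*> gFoo{nullptr};
static_assert(sizeof(gFoo) == sizeof(long long),
              "this sketch assumes std::atomic<Foo*> is a bare 8-byte pointer");

void release_store(Foo* f) {
    assert(f != nullptr);
    std::atomic_thread_fence(std::memory_order_release);  // compiler barrier only on x86
    _mm_sfence();                                         // earlier stores (incl. NT) before the store below
    // Assumption: the atomic's storage can be written as a raw 64-bit integer.
    _mm_stream_si64(reinterpret_cast<long long*>(&gFoo),
                    reinterpret_cast<std::intptr_t>(f));
}

Foo* acquire_load() {
    return gFoo.load(std::memory_order_acquire);
}
```

A reader thread would call acquire_load() and only dereference the result once it is non-null.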

You didn't say anything about storing gFoo in WC memory. It's probably really hard to arrange for your program's data segment to be mapped into WC memory... It would be a lot easier for gFoo to simply point to WC memory, after you mmap some video RAM or something. But if you want acquire-loads from WC memory, you probably do need LFENCE. IDK. Ask another question about that, because this answer mostly assumes you're using WB memory.

Note that using a pointer instead of a flag creates a data dependency. I think you should be able to use gFoo.load(std::memory_order_consume), which doesn't require barriers even on weakly-ordered CPUs (other than Alpha). Once compilers are sufficiently advanced to make sure they don't break the data dependency, they can actually make better code (instead of promoting mo_consume to mo_acquire). Read up on this before using mo_consume in production code, and be especially careful to note that testing it properly is impossible, because future compilers are expected to give weaker guarantees than current compilers do in practice.
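
A small sketch of what a consume-load would look like, using hypothetical Node / gNode / publish names. As far as I know, current compilers simply treat mo_consume as mo_acquire, and newer C++ standards discourage it, so take this as illustration only.

```cpp
#include <atomic>
#include <cassert>

struct Node { int payload = 0; };          // hypothetical type

std::atomic<Node*> gNode{nullptr};         // hypothetical global

void publish(Node* n) {
    assert(n != nullptr);
    gNode.store(n, std::memory_order_release);
}

// The dereference uses the loaded pointer value itself; that data
// dependency is what mo_consume is meant to rely on.
int read_payload_or(int fallback) {
    Node* n = gNode.load(std::memory_order_consume);
    return n ? n->payload : fallback;
}
```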

Initially I was thinking that we did need LFENCE to get a LoadStore barrier. ("Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions". This in turn prevents them from passing (becoming globally visible before) reads that are before the LFENCE).

Note that LFENCE + SFENCE is still weaker than a full MFENCE, because it's not a StoreLoad barrier. SFENCE's own documentation says it's ordered wrt. LFENCE, but that table of the x86 memory model from Intel manual vol3 doesn't mention that. If SFENCE can't execute until after an LFENCE, then sfence / lfence might actually be a slower equivalent to mfence, but lfence / sfence / movnti would give release semantics without a full barrier. Note that the NT store could become globally visible after some following loads/stores, unlike a normal strongly-ordered x86 store.

NT loads

I know you didn't ask this, but I wrote this part before realizing you hadn't actually mentioned them. Before researching this, I wasn't sure what kind of reordering NT loads could have, so it's something I wanted to know.

In x86, every load has acquire semantics, except for loads from WC memory. SSE4.1 MOVNTDQA is the only non-temporal load instruction, and it isn't weakly ordered when used on normal (WriteBack) memory. So it's an acquire-load, too (when used on WB memory).

Note that movntdq only has a store form, while movntdqa only has a load form. But apparently Intel couldn't just call them storentdqa and loadntdqa. They both have a 16B or 32B alignment requirement, so leaving off the a doesn't make a lot of sense to me. I guess SSE1 and SSE2 had already introduced some NT stores using the mov... mnemonic (like movntps), but no loads until years later in SSE4.1 (2nd-gen Core2: 45nm Penryn).

The docs for MOVNTDQA say it doesn't change the ordering semantics for the memory type it's used on.

... An implementation may also make use of the non-temporal hint associated with this instruction if the memory source is WB (write back) memory type.

A processor’s implementation of the non-temporal hint does not override the effective memory type semantics, but the implementation of the hint is processor dependent. For example, a processor implementation may choose to ignore the hint and process the instruction as a normal MOVDQA for any memory type.

My (untested) guess at how a uarch might implement it: insert the newly-loaded NT line into the cache at the LRU position, instead of at the usual MRU position. (See this article about IvB's adaptive L3 policy for a related idea.) So streaming-loads of a giant array might only pollute one "way" of set-associative caches. (TODO: test this theory!)
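
For what it's worth, here is a sketch of how MOVNTDQA is reached from C++, via the _mm_stream_load_si128 intrinsic. The copy_with_nt_loads helper is hypothetical, and the target attribute is a GCC/Clang extension used here so the example doesn't need -msse4.1. On WB memory this should behave like a normal load as far as ordering goes; whether the hint changes anything is up to the CPU.

```cpp
#include <immintrin.h>   // _mm_stream_load_si128 (SSE4.1), _mm_store_si128 (SSE2)
#include <cassert>
#include <cstddef>
#include <cstdint>

// Copy `bytes` bytes using MOVNTDQA loads and ordinary aligned stores.
// Both pointers must be 16-byte aligned and `bytes` a multiple of 16.
__attribute__((target("sse4.1")))
void copy_with_nt_loads(void* dst, const void* src, std::size_t bytes) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(bytes % 16 == 0);
    auto* d = static_cast<__m128i*>(dst);
    auto* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0; i < bytes / 16; ++i) {
        // Some headers declare the parameter as non-const, hence the const_cast.
        __m128i v = _mm_stream_load_si128(const_cast<__m128i*>(s + i));
        _mm_store_si128(d + i, v);
    }
}
```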

Also, if you are using it on WC memory (e.g. copying from video RAM, like in this Intel guide):

Because the WC protocol uses a weakly-ordered memory consistency model, an MFENCE or locked instruction should be used in conjunction with MOVNTDQA instructions if multiple processors might reference the same WC memory locations or in order to synchronize reads of a processor with writes by other agents in the system.

That doesn't spell out how it should be used, though. Maybe only writers need to fence? And I'm not totally sure why they say MFENCE rather than SFENCE or LFENCE. Maybe they're talking about a write-to-device-memory, read-from-device-memory situation where stores have to be ordered with respect to loads (StoreLoad barrier), not just with each other (StoreStore barrier).

I searched in Vol3 for movntdqa, and didn't get any hits (in the whole pdf). 3 hits for movntdq: All the discussion of weak ordering and memory types only talks about stores. Note that LFENCE was introduced long before SSE4.1. Presumably it's useful for something, but IDK what. For load ordering, probably only with WC memory, but I haven't read up on when that would be useful.

See also: Non-temporal loads and the hardware prefetcher, do they work together?

LFENCE appears to be more than just a LoadLoad barrier for weakly-ordered loads: it orders other instructions too. (Not the global-visibility of stores, though, just their local execution).

From Intel's insn ref manual:

Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.
Instructions following an LFENCE may be fetched from memory before the LFENCE, but they will not execute until the LFENCE completes.

The entry for rdtsc suggests using LFENCE;RDTSC to prevent it from executing ahead of previous instructions, when RDTSCP isn't available (and the weaker ordering guarantee is ok: rdtscp doesn't stop following instructions from executing ahead of it). (CPUID is a common suggestion for serializing the instruction stream around rdtsc).
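
A minimal sketch of the LFENCE;RDTSC idea, assuming GCC or Clang on x86 (MSVC has the same intrinsics in <intrin.h>). The fenced_rdtsc and time_sum names are hypothetical.

```cpp
#include <x86intrin.h>   // __rdtsc, _mm_lfence (GCC/Clang)
#include <cassert>
#include <cstdint>

// Read the TSC only after all earlier instructions have completed locally.
inline std::uint64_t fenced_rdtsc() {
    _mm_lfence();
    return __rdtsc();
}

struct Timed {
    std::uint64_t ticks;   // TSC ticks, which are reference cycles, not core clock cycles
    long long result;
};

// Time a simple summation loop between two fenced TSC reads.
Timed time_sum(const int* data, int n) {
    assert(n >= 0);
    std::uint64_t start = fenced_rdtsc();
    long long sum = 0;
    for (int i = 0; i < n; ++i)
        sum += data[i];
    std::uint64_t end = fenced_rdtsc();
    return {end - start, sum};
}
```

Keep in mind this only constrains the CPU. The compiler is still free to move pure computation across the timestamp reads, so for real measurements you'd want to check the generated asm.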