Siler Siler - 23 days ago 11
C++ Question

C++ Treiber Stack and atomic next pointers

The "Treiber Stack" is generally one of the simplest lock-free data structures, and so it is often used when teaching introductions to lock-free algorithms.

I've seen many implementations of Treiber Stacks using C++ atomics. The algorithm itself is trivial, so the real challenge is handling all the other incidental details of lock-free data-structures, such as providing some way of performing safe memory reclamation, avoiding the ABA problem, and allocating nodes in a lock-free manner. This can be solved in various ways, such as using atomic reference counting, hazard pointers, counted/tagged pointers to avoid ABA, and using a lock-free memory pool.

But ignoring all those details and focusing on the simple algorithm itself, one question that occurred to me is the fact that every implementation of Treiber Stacks that I can recall implements the node class using atomic next pointers. For example:

struct Node
T value;
std::atomic<Node*> next;

But after thinking about the algorithm, I'm not sure why the next pointer needs to be atomic.

The general PUSH algorithm (ignoring lock-free allocation, safe memory reclamation, backoff, ABA avoidance, etc.) is:

Node* n = new Node();
Node* front = m_front.load();
while (!m_front.compare_exchange_weak(front, n))

The general POP algorithm (again, ignoring all details except the actual algorithmic logic) is:

Node* front = m_front.load();
Node* next = front->next.load();
while (!m_front.compare_exchange_weak(front, next))
next = front->next.load();

And here is an real-world example implementation of the PUSH algorithm:

So I don't understand why the next pointer even needs to be atomic. Most C++ implementations use relaxed loads/stores with the
pointer, so we don't need any memory fences when reading/writing to the next pointer, but my thinking is that it doesn't need to be atomic at all.

From what I can see, at no time is the next pointer of any node concurrently written to. Rather, the next pointer may be concurrently loaded, but I never see any opportunity for the algorithm to concurrently load+store or concurrently store+store. In fact, in the PUSH algorithm, the next pointer is never accessed concurrently at all.

So it seems to me that next pointers are effectively "read-only" when accessed concurrently, so I'm not sure why it would even be necessary to make them atomic at all.

Yet, every C++ implementation of a Treiber Stack I've seen makes the next pointers to be atomic. So am I correct, or is there some reason the next pointer must be made atomic?

Answer Source

If it was as simple as the code you showed, you'd be right. A Node is never modified after a pointer to it is published. But you left out the part where Nodes are cleaned up so they can be garbage-collected. (You can't just delete after popping; another thread could still have a pointer to it and not yet have read it. This is a tricky problem for RCU as well.)

This is the function you left out, called after a CAS in pop succeeds:

    void clear_links( node_type * pNode ) CDS_NOEXCEPT
        pNode-> nullptr, memory_model::memory_order_relaxed );

Here's an ordering where a reader reads a next while it's being written:

A: Node* front = m_front.load();
                                 B: Node* front = m_front.load();  // same value
A: Node* next = front->next.load();
A: m_front.compare_exchange_weak(front, next) // succeeds, no loop
A: clear_links(front);  // i.e. front->;

                                 B: front->next.load();

Thus, C++ Undefined Behaviour, end of story as far as standards compliance is concerned.

In practice, a non-atomic load will happen to be atomic anyway, or at worst experience tearing on most CPU architectures. I'm not sure there's any scenario where a torn value will actually get used (put into m_front), because clear_links() can't run until after a successful CAS. And if CAS succeeded in one thread, it will fail in the other thread because it will only try a torn next value with the old front as the expected arg to CAS.

In practice, pretty much every implementation anyone cares about has no extra cost for relaxed-atomic loads/stores vs. regular for pointer-sized objects. In fact, this stack pretty much sucks if atomicity isn't "free" for a pointer.

e.g. on AVR (8-bit RISC microcontrollers which use 16-bit pointers), it would be cheaper to just take a lock on the data structure, instead of letting std::atomic use locks for every load/store in this algo. (Especially since there aren't multi-core AVR CPUs, so locks are probably very cheap to implement.)

Using relaxed atomic stores looks good to me. You definitely don't want to implement it the way you showed, with seq_cst stores as part of initializing a new object that hasn't had any pointers to it published yet. At that point atomicity is not needed, but it's free (on normal CPUs) so there's no benefit to avoiding it. None of the stores or loads could be optimized away.