C++ Question

Implement ABA counter with C++11 CAS

I am implementing a lock-free queue based on this algorithm, which uses a counter to solve the ABA problem. But I don't know how to implement this counter with C++11 CAS. For example, from the algorithm:

E9: if CAS(&tail.ptr->next, next, <node, next.count+1>)


It is an atomic operation, meaning: if tail.ptr->next is equal to next, make tail.ptr->next point to node and, simultaneously (atomically), set the count to next.count+1. However, using C++11 CAS, I can only implement:

std::atomic_compare_exchange_weak(&tail.ptr->next, next, node);


which cannot perform the next.count+1 update at the same time.

Answer

To atomically modify two things at once with a single atomic operation, you need to put them in adjacent memory, e.g. in a two-member struct. Then you can use std::atomic<my_struct> to get gcc to emit lock cmpxchg16b on x86-64, for example.
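
A minimal sketch of that idea, before worrying about making reads cheap (the names slot/install are just illustrative; the full example further down is the careful version):

#include <atomic>
#include <stdint.h>

struct my_struct {
  void     *ptr;
  uintptr_t count;  // bumped on every successful CAS, so a recycled pointer still compares unequal
};

alignas(16) std::atomic<my_struct> slot;  // 16 bytes on x86-64 -> lock cmpxchg16b with -mcx16

bool install(void *node) {
  my_struct expected = slot.load(std::memory_order_relaxed);
  my_struct desired  = { node, expected.count + 1 };
  return slot.compare_exchange_weak(expected, desired, std::memory_order_acq_rel);
}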

Unfortunately, clang 3.8 compiles it to a library call, so this method is not efficient with clang. I don't know about other compilers.

Unfortunately, with current compilers, you need to use a union to get efficient code for reading just one of the pair. The "obvious" way of doing an atomic load of the struct and then only using one member still leads to a lock cmpxchg16b to read the whole struct even though we only need one member.

You don't need inline asm for this, and it's worth a bit of C++ syntax pain to avoid inline asm. https://gcc.gnu.org/wiki/DontUseInlineAsm.


Here's an example of code with a CAS retry-loop that compiles to asm that looks right, and I think is free of UB or other unsafe C++. It's written in C style (non-member functions, etc.) but it would be the same if you wrote member functions.

See this code with gcc 5.3 on the Godbolt compiler explorer. Make sure you compile with -mcx16 to enable cmpxchg16b in code-gen. Your code won't run on early AMD K8 CPUs, but all later x86-64 hardware supports it, AFAIK. Without -mcx16 (or with clang instead of gcc), you get a library function call (which probably uses a global lock). On Godbolt's install of g++, <atomic> doesn't work with -m32, so I can't easily check whether -m32 uses cmpxchg8b, or whether -mx32 (32-bit pointers in long mode) can use a plain 8-byte cmpxchg.
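
If you want to check at run time whether a given target / compiler combination actually gives you a lock-free double-width CAS rather than the library fallback, a small probe like this works (the helper is mine, not part of the queue code):

#include <atomic>
#include <stdint.h>

struct two_words { void *a; uintptr_t b; };  // same size as the pointer+count pair

bool dwcas_is_lock_free() {
  std::atomic<two_words> probe({nullptr, 0});
  return probe.is_lock_free();  // false means operations on it may go through a locking library implementation
}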

This is fully portable C++11, with nothing x86-specific. It's only efficient on targets that can CAS an object the size of two pointers, though. e.g. it compiles to a call to a __atomic_compare_exchange_16 library function for ARM / ARM64 and MIPS64, as you can see on Godbolt.

// compile with -mcx16 when targeting x86-64
#include <atomic>
#include <stdint.h>

using namespace std;

struct node {
  struct counted_ptr {
    node *    ptr;
    uintptr_t count;  // use pointer-sized integers to avoid padding
  };

  // hack to allow reading just the pointer without lock-cmpxchg16b,
  // but still without any C++ data race
  struct counted_ptr_separate {
    atomic<node *>    ptr;
    atomic<uintptr_t> count_separate;  // var name emphasizes that accessing this way isn't atomic with ptr
  };

  static_assert(sizeof(atomic<counted_ptr>) == sizeof(counted_ptr_separate), "atomic<counted_ptr> isn't the same size as the separate version; union type-punning will be bogus");
  //static_assert(std::atomic<counted_ptr>{}.is_lock_free());  // is_lock_free() isn't constexpr, so this can't actually be a static_assert (C++17's is_always_lock_free would work)

  union {
    alignas(16) atomic<counted_ptr> next_and_count;
    counted_ptr_separate  next;
  };
  // TODO: write member functions to read next.ptr or read/write next_and_count

  int data[4];
};


// make sure read-only access is efficient.
node *follow(node *p) {   // good asm, just a mov load
  return p->next.ptr.load(memory_order_acquire);
}
node *follow_nounion(node *p) {  // really bad asm, using cmpxchg16b to load the whole thing
  return p->next_and_count.load(memory_order_acquire).ptr;
}


void update_next(node &target, node *desired)
{  // read the old value efficiently: tearing will just lead to a failed cmpxchg
  node::counted_ptr expected = {
      target.next.ptr.load(memory_order_relaxed),
      target.next.count_separate.load(memory_order_relaxed) };
  // node::counted_ptr expected = target.next_and_count.load(memory_order_acquire);  // gcc uses a slow lock cmpxchg16b for this
  bool success;
  do {
    node::counted_ptr newval = { desired, expected.count + 1 };

    // x86: compiles to cmpxchg16b with gcc, but clang uses a library call
    success = target.next_and_count.compare_exchange_weak(
                           expected, newval, memory_order_acq_rel);
    // compare_exchange_weak updates expected on failure
  } while( !success );
}
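
And a rough usage sketch, appended to the file above, just to see the helpers in action (the setup here is only for illustration; a real queue manages its own nodes and initialization):

#include <cassert>

int main() {
  static node a, b;          // static storage: zero-initialized, so a.next starts out as {nullptr, 0}

  update_next(a, &b);        // the CAS loop installs &b and bumps the counter from 0 to 1
  assert(follow(&a) == &b);  // cheap read of just the pointer
  assert(a.next.count_separate.load(memory_order_relaxed) == 1);
  return 0;
}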

It does depend on writing a union through one member and reading it through another, to do efficient reads of just the pointer without using a lock cmpxchg16b. Even if you had a compiler that didn't support that normally, it would probably still work, because we're always using std::atomic loads, which have to assume that the value was modified asynchronously. So we should be immune from aliasing-like problems where values in registers are still considered live after writing the value through another pointer (or union member).


To learn more about std::memory_order acquire and release, see Jeff Preshing's excellent articles.
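
The short version, since the code above leans on it: a release store paired with an acquire load of the same atomic variable guarantees that everything written before the store is visible after a load that observes the stored value. A tiny standalone illustration (not part of the queue code):

#include <atomic>

int payload;                        // plain, non-atomic data
std::atomic<bool> ready(false);

void producer() {
  payload = 42;
  ready.store(true, std::memory_order_release);      // publish: makes the write to payload visible
}

void consumer() {
  while (!ready.load(std::memory_order_acquire)) {}  // spin until the release store is observed
  // payload is guaranteed to be 42 here
}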