user1235831 user1235831 - 2 months ago 11
C Question

Locks around memory manipulation via inline assembly

I am new to the low level stuff so I am completely oblivious of what kind of problems you might face down there and I am not even sure if I understand the term "atomic" right. Right now I am trying to make simple atomic locks around memory manipulation via extended assembly. Why? For sake of curiosity. I know I am reinventing the wheel here and possibly oversimplifying the whole process.

The question?
Does the code I present here achive the goal of making memory manipulation both threadsafe and reentrant?

  • If it works, why?

  • If it doesn't work, why?

  • Not good enough? Should I for example make use of the register keyword in C?

What I simply want to do...

  • Before memory manipulation, lock.

  • After memory manipulation, unlock.

The code:

volatile int atomic_gate_memory = 0;

static inline void atomic_open(volatile int *gate)
asm volatile (
"cmp %[lock], %[gate]\n"
"je wait\n"
"mov %[lock], %[gate]\n"
: [gate] "=m" (*gate)
: [lock] "r" (1)

static inline void atomic_close(volatile int *gate)
asm volatile (
"mov %[lock], %[gate]\n"
: [gate] "=m" (*gate)
: [lock] "r" (0)

Then something like:

void *_malloc(size_t size)
void *mem = malloc(size);
return mem;
#define malloc(size) _malloc(size)

.. same for calloc, realloc, free and fork(for linux).

#ifdef _UNISTD_H
int _fork()
pid_t pid;
pid = fork();
return pid;
#define fork() _fork()

After loading the stackframe for atomic_open, objdump generates:

00000000004009a7 <wait>:
4009a7: 39 10 cmp %edx,(%rax)
4009a9: 74 fc je 4009a7 <wait>
4009ab: 89 10 mov %edx,(%rax)

Also, given the disassembly above; can I assume I am making an atomic operation because it is only one instruction?


Not good enough? Should I for example make use of the register keyword in C?

register is a meaningless hint in modern optimizing compilers.

I think a simple spinlock that doesn't have any of the really major / obvious performance problems on x86 is something like:

Implement as much of this as you like in inline asm, or preferably using C11 stdatomic, like this semaphore implementation.

;;;;; UNTESTED ;;;;;;;;
    ; first arg in rdi, in the AMD64 SysV ABI

;;;;;void spin_lock  (volatile char *lock)
global spin_unlock
    ;; debug: check that the old value was non-zero.  double-unlocking is a nasty bug
    mov   byte [rdi], 0
    ;; The store has release semantics, but not sequential-consistency (which you'd get from an xchg or something),
    ;; because acquire/release is enough to protect a critical section (hence the name)

;;;;;void spin_unlock(volatile char *lock)
global spin_lock
    cmp   byte [rdi], 0           ; avoid writing to the cache line if we don't own the lock: should speed up the other thread unlocking
    jnz   .spinloop

    mov   al, 1                   ; only need to do this the first time, otherwise we know al is non-zero
    xchg  al, [rdi]

    test  al,al                   ; check if we actually got the lock
    jnz   .spinloop
    ret                           ; no taken branches on the fast-path

    pause                     ; very old CPUs decode it as REP NOP, which is fine
    cmp   byte [rdi], 0       ; To get a compiler to do this in C++11, use a memory_order_acquire load
    jnz   .spinloop
    jmp   .retry

If you were using a bitfield of atomic flags, you could use lock bts (test and set) for the equivalent of xchg-with-1. You can spin on bt. To unlock, you'd need lock btr, not just btr, because it would be a non-atomic read-modify-write of the byte, or even the containing 32bits.

With a byte or word sized lock, you don't even need a locked operation to unlock; release semantics are enough. glibc's pthread_spin_unlock does the same as my unlock function: a simple store.

This avoids writing to the lock if we see it's already locked. This avoids invalidating the cache line in L1 of the core running the thread that owns it, so it can go back to "Modified" (MESIF or MOESI) with less cache coherence delay during unlock.

We also don't flood the CPU with locked operations in a loop. I'm not sure how much this slows things down in general, but 10 threads all waiting for the same spinlock will keep the memory arbitration hardware pretty busy. This might slow down the thread that does hold the lock, or other unrelated threads on the system, while they use other locks, or memory in general.

PAUSE is also essential, to avoid mis-speculation about memory ordering by the CPU. You exit the loop only when the memory you're reading was modified by another core. However, we don't want to pause in the un-contended case. On Skylake, PAUSE waits a lot longer, like ~100cycles IIRC, so you should definitely keep the spinloop separate from the initial check for unlocked.

I'm sure Intel's and AMD's optimization manuals talk about this, see the tag wiki for that and tons of other links.