Relative performance of swap vs compare-and-swap locks on x86 - c

Two common locking idioms are:
if (!atomic_swap(lockaddr, 1)) /* got the lock */
and:
if (!atomic_compare_and_swap(lockaddr, 0, val)) /* got the lock */
where val could simply be a constant or an identifier for the new prospective owner of the lock.
What I'd like to know is whether there tends to be any significant performance difference between the two on x86 (and x86_64) machines. I know this is a fairly broad question since the answer might vary a lot between individual cpu models, but that's part of the reason I'm asking SO rather than just doing benchmarks on a few cpus I have access to.

I assume atomic_swap(lockaddr, 1) gets translated to a xchg reg,mem instruction and atomic_compare_and_swap(lockaddr, 0, val) gets translated to a cmpxchg[8b|16b].
Some linux kernel developers think cmpxchg ist faster, because the lock prefix isn't implied as with xchg. So if you are on a uniprocessor, multithread or can otherwise make sure the lock isn't needed, you are probably better of with cmpxchg.
But chances are your compiler will translate it to a "lock cmpxchg" and in that case it doesn't really matter.
Also note that while latencies for this instructions are low (1 cycle without lock and about 20 with lock), if you happen to use are common sync variable between two threads, which is quite usual, some additional bus cycles will be enforced, which last forever compared to the instruction latencies. These will most likely completly be hidden by a 200 or 500 cpu cycles long cache snoop/sync/mem access/bus lock/whatever.

I found this Intel document, stating that there is no difference in practice:
http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/
One common myth is that the lock utilizing a cmpxchg instruction is cheaper than a lock utilizing an xchg instruction. This is used because cmpxchg will not attempt to get the lock in exclusive mode since the cmp will go through first. Figure 9 shows that the cmpxchg is just as expensive as the xchg instruction.

On x86, any instruction with a LOCK prefix does all memory operations as read-modify-write cycles. This means that XCHG (with its implicit LOCK) and LOCK CMPXCHG (in all cases, even if the comparison fails) always get an exclusive lock on the cache line. The result is that there is basically no difference in performance.
Note that many CPUs all spinning on the same lock can cause a lot of bus overhead in this model. This is one reason that spin-lock loops should contain PAUSE instructions. Some other architectures have better operations for this.

Are you sure you didn't mean
if (!atomic_load(lockaddr)) {
if (!atomic_swap(lockaddr, val)) /* got the lock */
for the second one?
Test and test and set locks (see Wikipedia https://en.wikipedia.org/wiki/Test_and_test-and-set ) are a quite common optimization for many platforms.
Depending on how compare and exchange is implemented it could be faster or slower than a test and test and set.
As x86 is a relatively stronger ordered platform HW optimizations that may make test and test and set locks faster may be less possible.
Figure 8 from the document that Bo Persson found
http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/ shows that Test and Test and Set locks are superior in performance.

Using xchg vs cmpxchg to aquire the lock
In terms of performance on Intel's processors, it is the same, but for the sake of simplicity, to have things easier to fathom, I prefer the first way from the examples that you have given. There is no reason to use cmpxchg for acquiring a lock if you can do this with xchg.
According to the Occam's razor principle, simple things are better.
Besides that, locking with xchg is more powerful - you can also check the correctness of the logic of your software, i.e. that you are not accessing the memory byte that has not been explicitly allocated for locking. Thus, you will check that you are using the correctly initialized synchronization variable. Besides that, you will be able to check that you don't unlock twice.
Using normal memory store to release the lock
There is no consensus on whether writing to the synchronization variable on releasing the lock should be done with just a normal memory store (mov) or a bus-locking memory store, i.e. an instruction with implicit or explicit lock-prefix, like xchg.
The approach of using normal memory store to release lock was recommended by Peter Cordes, see the comments below for details.
But there are implementations where both acquiring and releasing the lock is done with a bus-locking memory store, since this approach seems to be straightforward and intuitive. For example, LeaveCriticalSection under Windows 10 uses bus-locking store to release the lock even on a single-socket processor; while on multiple physical processors with Non-Uniform-Memory-Access (NUMA), this issue is even more important.
I've done a micro-benchmarking on a memory manager that does lots of memory allocate/reallocate/free, on a single-socket CPU (Kaby Lake). When there is no contention, i.e. there are fewer threads than the physical cores, with locked release the tests complete about 10% slower, but when there are more threads when physical cores, tests with locked release complete 2% faster. So, on average, normal memory store to release lock outperforms locked memory store.
Example of locking that checks synchronization variable for validity
See this example (Delphi programming language) of safer locking functions that checks data of a synchronization variable for validity, and catches attempts to release locks that were not acquired:
const
cLockAvailable = 107; // arbitrary constant, use any unique values that you like, I've chosen prime numbers
cLockLocked = 109;
cLockFinished = 113;
function AcquireLock(var Target: LONG): Boolean;
var
R: LONG;
begin
R := InterlockedExchange(Target, cLockByteLocked);
case R of
cLockAvailable: Result := True; // we've got a value that indicates that the lock was available, so return True to the caller indicating that we have acquired the lock
cLockByteLocked: Result := False; // we've got a value that indicates that the lock was already acquire by someone else, so return False to the caller indicating that we have failed to acquire the lock this time
else
begin
raise Exception.Create('Serious application error - tried to acquire lock using a variable that has not been properly initialized');
end;
end;
end;
procedure ReleaseLock(var Target: LONG);
var
R: LONG;
begin
// As Peter Cordes pointed out (see comments below), releasing the lock doesn't have to be interlocked, just a normal store. Even for debugging we use normal load. However, Windows 10 uses locked release on LeaveCriticalSection.
R := Target;
Target := cLockAvailable;
if R <> cLockByteLocked then
begin
raise Exception.Create('Serious application error - tried to release a lock that has not been actually locked');
end;
end;
Your main application goes here:
var
AreaLocked: LONG;
begin
AreaLocked := cLockAvailable; // on program initialization, fill the default value
....
if AcquireLock(AreaLocked) then
try
// do something critical with the locked area
...
finally
ReleaseLock(AreaLocked);
end;
....
AreaLocked := cLockFinished; // on program termination, set the special value to catch probable cases when somebody will try to acquire the lock
end.
Efficient pause-based spin-wait loops
Test, test-and-set
You may also use the following assembly code (see the "Assembly code example of pause-based spin-wait loop" section below) as a working example of the "pause"-based spin-wait loop.
This code it uses normal memory load while spinning to save resources, as suggested by Peter Cordes. This technique is called "test, test-and-set". You can find out more on this technique at https://stackoverflow.com/a/44916975/6910868
Number of iterations
The pause-based spin-wait loop in this example first tries to acquire the lock by reading the synchronization variable, and if it is not available, utilize pause instruction in a loop of 5000 cycles. After 5000 cycles it calls Windows API function SwitchToThread(). This value of 5000 cycles is empirical. It is based on my tests. Values from 500 to 50000 also seem to be OK, but in some scenarios lower values are better while in other scenarios higher values are better. You can read more on pause-based spin-wait loops at the URL that I gave in the preceding paragraph.
Availability of the pause instruction
Please note that you may use this code only on processors that support SSE2 - you should check the corresponding CPUID bit before calling pause instruction - otherwise there will just be a waste of power. On processors without pause just use other means, like EnterCriticalSection/LeaveCriticalSection or Sleep(0) and then Sleep(1) in a loop. Some people say that on 64-bit processors you may not check for SSE2 to make sure that the pause instruction is implemented, because the original AMD64 architecture adopted Intel's SSE and SSE2 as core instructions, and, practically, if you run 64-bit code, you have already have SSE2 for sure and thus the pause instruction. However, Intel discourages a practice of relying on a presence specific feature and explicitly states that certain feature may vanish in future processors and applications must always check features via CPUID. However, the SSE instructions became ubiquitous and many 64-bit compilers use them without checking (e.g. Delphi for Win64), so chances that in some future processors there will be no SSE2, let alone pause, are very slim.
Assembly code example of pause-based spin-wait loop
// on entry rcx = address of the byte-lock
// on exit: al (eax) = old value of the byte at [rcx]
#Init:
mov edx, cLockByteLocked
mov r9d, 5000
mov eax, edx
jmp #FirstCompare
#DidntLock:
#NormalLoadLoop:
dec r9
jz #SwitchToThread // for static branch prediction, jump forward means "unlikely"
pause
#FirstCompare:
cmp [rcx], al // we are using faster, normal load to not consume the resources and only after it is ready, do once again interlocked exchange
je #NormalLoadLoop // for static branch prediction, jump backwards means "likely"
lock xchg [rcx], al
cmp eax, edx // 32-bit comparison is faster on newer processors like Xeon Phi or Cannonlake.
je #DidntLock
jmp #Finish
#SwitchToThread:
push rcx
call SwitchToThreadIfSupported
pop rcx
jmp #Init
#Finish:

Related

xv6: reading ticks directly without taking the ticks lock?

I'm working on an assignment in operating system course on Xv6. I need to implement a data status structure for a process for its creation time, termination time, sleep time, etc...
As of now I decided to use the ticks variable directly without using the tickslock because it seems not a good idea to use a lock and slow down the system for such a low priority objective.
Since the ticks variable only used like so: ticks++, is there a way where I will try to retrieve the current number of ticks and get a wrong number?
I don't mind getting a wrong number by +-10 ticks but is there a way where it will be really off. Like when the number 01111111111111111 will increment it will need to change 2 bytes. So my question is this, is it possible that the CPU storing data in stages and another CPU will be able to fetch the data in that memory location between the start and complete of the store operation?
So as I see it, if the compiler will create a mov instruction or an inc instruction, what I want to know is if the store operation can be seen between the start and end of it.
There's no problem in asm: aligned loads/stores done with a single instruction on x86 are atomic up to qword (8-byte) width. Why is integer assignment on a naturally aligned variable atomic on x86?
(On 486, the guarantee is only for 4-byte aligned values, and maybe not even that for 386, so possibly this is why Xv6 uses locking? I'm not sure if it's supposed to be multi-core safe on 386; my understanding is that the rare 386 SMP machines didn't exactly implement the modern x86 memory model (memory ordering and so on).)
But C is not asm. Using a plain non-atomic variable from multiple "threads" at once is undefined behaviour, unless all threads are only reading. This means compilers can assume that a normal C variable isn't changed asynchronously by other threads.
Using ticks in a loop in C will let the compiler read it once and keep using the same value repeatedly. You need a READ_ONCE macro like the Linux kernel uses, e.g. *(volatile int*)&ticks. Or simply declare it as volatile unsigned ticks;
For a variable narrow enough to fit in one integer register, it's probably safe to assume that a sane compiler will write it with a single dword store, whether that's a mov or a memory-destination inc or add dword [mem], 1. (You can't assume that a compiler will use a memory-destination inc/add, though, so you can't depend on an increment being single-core-atomic with respect to interrupts.)
With one writer and multiple readers, yes the readers can simply read it without any need for any kind of locking, as long as they use volatile.
Even in portable ISO C, volatile sig_atomic_t has some very limited guarantees of working safely when written by a signal handler and read by the thread that ran the signal handler. (Not necessarily by other threads, though: in ISO C volatile doesn't avoid data-race UB. But in practice on x86 with non-hostile compilers it's fine.)
(POSIX signals are the user-space equivalent of interrupts.)
See also Can num++ be atomic for 'int num'?
For one thread to publish a wider counter in two halves, you'd usually use a SeqLock. With 1 writer and multiple readers, there's no actual locking, just retry by the readers if a write overlapped with their read. See Implementing 64 bit atomic counter with 32 bit atomics
First, using locks or not isn't a matter of whether your objective is low priority or not, but a matter of solving a race condition.
Second, in the specific case you describe, it will be safe to read ticks variable without any locks as this is not a race condition case because RAM access to the same region (even same address here) cannot be made by 2 separate CPUs simultaneously (read more) and because ticks writing only increments the value by 1 and not doing any major changes that you really miss.

How is load->store reordering possible with in-order commit?

ARM allows the reordering loads with subsequent stores, so that the following pseudocode:
// CPU 0 | // CPU 1
temp0 = x; | temp1 = y;
y = 1; | x = 1;
can result in temp0 == temp1 == 1 (and, this is observable in practice as well). I'm having trouble understanding how this occurs; it seems like in-order commit would prevent it (which, it was my understanding, is present in pretty much all OOO processors). My reasoning goes "the load must have its value before it commits, it commits before the store, and the store's value can't become visible to other processors until it commits."
I'm guessing that one of my assumptions must be wrong, and something like one of the following must hold:
Instructions don't need to commit all the way in-order. A later store could safely commit and become visible before an earlier load, so long as at the time the store commits the core can guarantee that the previous load (and all intermediate instructions) won't trigger an exception, and that the load's address is guaranteed to be distinct from the store's.
The load can commit before its value is known. I don't have a guess as to how this would be implemented.
Stores can become visible before they are committed. Maybe a memory buffer somewhere is allowed to forward stores to loads to a different thread, even if the load was enqueued earlier?
Something else entirely?
There's a lot of hypothetical microarchitectural features that would explain this behavior, but I'm most curious about the ones that are actually present in modern weakly ordered CPUs.
Your bullet points of assumptions all look correct to me, except that you could build a uarch where loads can retire from the OoO core after merely checking permissions (TLB) on a load to make sure it can definitely happen. There could be OoO exec CPUs that do that (update: apparently there are).
I think x86 CPUs require loads to actually have the data arrive before they can retire, but their strong memory model doesn't allow LoadStore reordering anyway. So ARM certainly could be different.
You're right that stores can't be made visible to any other cores before retirement. That way lies madness. Even on an SMT core (multiple logical threads on one physical core), it would link speculation on two logical threads together, requiring them both to roll back if either one detected mis-speculation. That would defeat the purpose of SMT of having one logical thread take advantage of stalls in others.
(Related: Making retired but not yet committed (to L1d) stores visible to other logical threads on the same core is how some real PowerPC implementations make it possible for threads to disagree on the global order of stores. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?)
CPUs with in-order execution can start a load (check the TLB and write a load-buffer entry) and only stall if an instruction tries to use the result before it's ready. Then later instructions, including stores, can run normally. This is basically required for non-terrible performance in an in-order pipeline; stalling on every cache miss (or even just L1d latency) would be unacceptable. Memory parallelism is a thing even on in-order CPUs; they can have multiple load buffers that track multiple outstanding cache misses. High(ish) performance in-order ARM cores like Cortex-A53 are still widely used in modern smartphones, and scheduling loads well ahead of when the result register is used is a well-known important optimization for looping over an array. (Unrolling or even software pipelining.)
So if the load misses in cache but the store hits (and commits to L1d before earlier cache-miss loads get their data), you can get LoadStore reordering. (Jeff Preshing intro to memory reording uses that example for LoadStore, but doesn't get into uarch details at all.)
A load can't fault after you've checked the TLB and / or whatever memory-region stuff for it. That part has to be complete before it retires, or before it reaches the end of an in-order pipeline. Just like a retired store sitting in the store buffer waiting to commit, a retired load sitting in a load buffer is definitely happening at some point.
So the sequence on an in-order pipeline is:
lw r0, [r1] TLB hit, but misses in L1d cache. Load execution unit writes the address (r1) into a load buffer. Any later instruction that tries to read r0 will stall, but we know for sure that the load didn't fault.
With r0 tied to waiting for that load buffer to be ready, the lw instruction itself can leave the pipeline (retire), and so can later instructions.
any amount of other instructions that don't read r0. That would stall an in-order pipeline.
sw r2, [r3] store execution unit writes address + data to the store buffer / queue. Then this instruction can retire.
Probing the load buffers finds that this store doesn't overlap with the pending load, so it can commit to L1d. (If it had overlapped, you couldn't commit it until a MESI RFO completed anyway, and fast restart would forward the incoming data to the load buffer. So it might not be too complicated to handle that case without even probing on every store, but let's only look at the separate-cache-line case where we can get LoadStore reordering)
Committing to L1d = becoming globally visible. This can happen while the earlier load is still waiting for the cache line to arrive.
For OoO CPUs, you'd need some way to tie load completion back into the OoO core for instructions waiting on the load result. I guess that's possible, but it means that the architectural/retirement value of a register might not be stored anywhere in the core. Pipeline flushes and other rollbacks from mis-speculation would have to hang on to that association between an incoming load and a physical and architectural register. (Not flushing store buffers on pipeline rollbacks is already a thing that CPUs have to do, though. Retired but not yet committed stores sitting in the store buffer have no way to be rolled back.)
That could be a good design idea for uarches with a small OoO window that's too small to come close to hiding a cache miss. (Which to be fair, is every high-performance OoO exec CPU: memory latency is usually too high to fully hide.)
We have experimental evidence of LoadStore reordering on an OoO ARM: section 7.1 of https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf shows non-zero counts for "load buffering" on Tegra 2, which is based on the out-of-order Cortex-A9 uarch. I didn't look up all the others, but I did rewrite the answer to suggest that this is the likely mechanism for out-of-order CPUs, too. I don't know for sure if that's the case, though.

When is CLREX actually needed on ARM Cortex M7?

I found a couple of places online which state that CLREX "must" be called whenever an interrupt routine is entered, which I don't understand. The docs for CLREX state (added the numbering for easier reference):
(1) Clears the local record of the executing processor that an address has had a request for an exclusive access.
(2) Use the CLREX instruction to return a closely-coupled exclusive access monitor to its open-access state. This removes the requirement for a dummy store to memory.
(3) It is implementation-defined whether CLREX also clears the global record of the executing processor that an address has had a request for an exclusive access.
I don't understand pretty much anything here.
I had the impression that writing something along the lines the example in the docs was enough to guarantee atomicity:
MOV r1, #0x1 ; load the ‘lock taken’ value
try: <---\
LDREX r0, [LockAddr] ; load the lock value |
CMP r0, #0 ; is the lock free? |
STREXEQ r0, r1, [LockAddr] ; try and claim the lock |
CMPEQ r0, #0 ; did this succeed? |
BNE try ; no - try again ------------/
.... ; yes - we have the lock
Why should the "local record" need to be cleared? I thought that LDREX/STREX are enough to guarantee atomic access to an address from several interrupts? I.e. GCC for ARM compiles all C11 atomic functions using LDREX/STREX and I don't see CLREX being called anywhere.
What "requirement for a dummy store" is the second paragraph referring to?
What is the difference between the global record and a local record? Is global record needed for multi-core scenarios?
Taking (and paraphrasing) your three questions separately:
1. Why clear the access record?
When strict nesting of code is enforced, such as when you're working with interrupts, then CLREX is not usually required. However, there are cases where it's important. Imagine you're writing a context switch for a preemptive operating system kernel, which can asynchronously suspend a running task and resume another. Now consider the following pathological situation, involving two tasks of equal priority (A and B) manipulating the same shared resource using LDREX and STREX:
Task A Task B
...
LDREX
-------------------- context switch
LDREX
STREX (succeeds)
...
LDREX
-------------------- context switch
STREX (succeeds, and should not)
...
Therefore the context switch must issue a CLREX to avoid this.
2. What 'requirement for a dummy store' is avoided?
If there wasn't a CLREX instruction then it would be necessary to use a STREX to relinquish the exclusive-access flag, which involves a memory transaction and is therefore slower than it needs to be if all you want to do is clear the flag.
3. Is the 'global record' for multi-core scenarios?
Yes, if you're using a single-core machine, there's only one record because there's only one CPU.
Actually CLREX isn't needed for exceptions/interrupts on the M7, it appears to only be included for compatibility reasons. From the documenation (Version c):
CLREX enables compatibility with other ARM Cortex processors that have
to force the failure of the store exclusive if the exception occurs
between a load exclusive instruction and the matching store exclusive
instruction in a synchronization operation. In Cortex-M processors,
the local exclusive access monitor clears automatically on an
exception boundary, so exception handlers using CLREX are optional.
So, since Cortex-M processors clear the local exclusive access flag on exception/interrupt entry/exit, this negates most (all?) of the use cases for CLREX.
With regard to your third question, as others have mentioned you are correct in thinking that the global record is used in multi-core scenarios. There may still be use cases for CLREX on multi-core processors depending on the implementation defined effects on local/global flags.
I can see why there is confusion around this, as the initial version of the M7 documentation doesn't include these sentences (not to mention the various other versions of more generic documentation on the ARM website). Even now, I cannot even link to the latest revision. The page displays 'Version a' by default and you have to manually change the version via a drop down box (hopefully this will change in future).
Update
In response to comments, an additional documentation link for this. This is the part of the manual that describes the usage of these instructions outside of the specific instruction documentation (and also has been there since the first revision):
The processor removes its exclusive access tag if:
It executes a CLREX instruction.
It executes a STREX instruction, regardless of whether the write succeeds.
An exception occurs. This means the processor can resolve semaphore conflicts between different threads.
In a multiprocessor implementation:
Executing a CLREX instruction removes only the local exclusive access tag for the processor.
Executing a STREX instruction, or an exception, removes the local exclusive access tags for the processor.
Executing a STREX instruction to a Shareable memory region can also remove the global exclusive access tags for the processor in the
system.

Locks around memory manipulation via inline assembly

I am new to the low level stuff so I am completely oblivious of what kind of problems you might face down there and I am not even sure if I understand the term "atomic" right. Right now I am trying to make simple atomic locks around memory manipulation via extended assembly. Why? For sake of curiosity. I know I am reinventing the wheel here and possibly oversimplifying the whole process.
The question?
Does the code I present here achive the goal of making memory manipulation both threadsafe and reentrant?
If it works, why?
If it doesn't work, why?
Not good enough? Should I for example make use of the register keyword in C?
What I simply want to do...
Before memory manipulation, lock.
After memory manipulation, unlock.
The code:
volatile int atomic_gate_memory = 0;
static inline void atomic_open(volatile int *gate)
{
asm volatile (
"wait:\n"
"cmp %[lock], %[gate]\n"
"je wait\n"
"mov %[lock], %[gate]\n"
: [gate] "=m" (*gate)
: [lock] "r" (1)
);
}
static inline void atomic_close(volatile int *gate)
{
asm volatile (
"mov %[lock], %[gate]\n"
: [gate] "=m" (*gate)
: [lock] "r" (0)
);
}
Then something like:
void *_malloc(size_t size)
{
atomic_open(&atomic_gate_memory);
void *mem = malloc(size);
atomic_close(&atomic_gate_memory);
return mem;
}
#define malloc(size) _malloc(size)
.. same for calloc, realloc, free and fork(for linux).
#ifdef _UNISTD_H
int _fork()
{
pid_t pid;
atomic_open(&atomic_gate_memory);
pid = fork();
atomic_close(&atomic_gate_memory);
return pid;
}
#define fork() _fork()
#endif
After loading the stackframe for atomic_open, objdump generates:
00000000004009a7 <wait>:
4009a7: 39 10 cmp %edx,(%rax)
4009a9: 74 fc je 4009a7 <wait>
4009ab: 89 10 mov %edx,(%rax)
Also, given the disassembly above; can I assume I am making an atomic operation because it is only one instruction?
I think a simple spinlock that doesn't have any of the really major / obvious performance problems on x86 is something like this. Of course a real implementation would use a system call (like Linux futex) after spinning for a while, and unlocking would have to check if it needs to notify any waiters with another system call. This is important; you don't want to spin forever wasting CPU time (and energy / heat) doing nothing. But conceptually this is the spin part of a spinlock before you take the fallback path. It's an important piece of how light-weight locking is implemented. (Only attempting to take the lock once before calling the kernel would be a valid choice, instead of spinning at all.)
Implement as much of this as you like in inline asm, or preferably using C11 stdatomic, like this semaphore implementation. This is NASM syntax. In GNU C, make sure you use a "memory" clobber to stop compile-time reordering of memory access (TTAS coherence issue?)
;;; UNTESTED ;;;;;;;;
;;; TODO: **IMPORTANT** fall back to OS-supported sleep/wakeup after spinning some
;;; e.g. Linux futex
; first arg in rdi as per AMD64 SysV ABI (Linux / Mac / etc)
;;;;;void spin_lock (volatile char *lock)
global spin_unlock
spin_unlock:
; movzx eax, byte [rdi] ; debug check for double-unlocking. Expect 1
mov byte [rdi], 0 ; lock.store(0, std::memory_order_release)
ret
align 16
;;;;;void spin_unlock(volatile char *lock)
global spin_lock
spin_lock:
mov eax, 1 ; only need to do this the first time, otherwise we know al is non-zero
.retry:
xchg al, [rdi]
test al,al ; check if we actually got the lock
jnz .spinloop
ret ; no taken branches on the fast-path
align 8
.spinloop: ; do {
pause
cmp byte [rdi], al ; C++11
jne .retry ; if (lock.load(std::memory_order_acquire) != 1)
jmp .spinloop
; if not translating this to inline asm, you could put the spin loop *before* the function entry point, saving the last jmp
; but since this is probably too simplistic for real use, I'm going to leave it as-is.
A plain store has release semantics, but not sequential-consistency (which you'd get from an xchg or something). Acquire/release is enough to protect a critical section (hence the name).
If you were using a bitfield of atomic flags, you could use lock bts (test and set) for the equivalent of xchg-with-1. You can spin on bt or test. To unlock, you'd need lock btr, not just btr, because it would be a non-atomic read-modify-write of the byte, or even the containing 32-bits.
With a byte or int sized lock like you should normally use, you don't even need a locked operation to unlock; release semantics are enough. glibc's pthread_spin_unlock does it the same as my unlock function: a simple store.
(lock bts is not necessary; xchg or lock cmpxchg are just as good if for a normal lock.)
The first access should be an atomic RMW
See discussion on Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock? - if the first access is read-only, the CPU might send out just a share request for that cache line. Then, if it sees the line unlocked (the hopefully-common low-contention case) it would have to send out an RFO (Read For Ownership) to actually be able to write the cache line. So that's twice as many off-core transactions.
The downside is that this will take MESI exclusive ownership of that cache line, but what really matters is that the thread owning the lock can efficiently store a 0 so we can see it unlocked. Either way, read-only or RMW, that core will lose exclusive ownership of the line and have to RFO before it can commit that unlocking store.
I think a read-only first access would just optimize for slightly less traffic between cores when multiple threads queue up to wait for a lock that's already taken. That would be a silly thing to optimize for.
(Fastest inline-assembly spinlock also tested the idea for a massively contended spinlock with multiple threads doing nothing but trying to take the lock, with poor results. That linked answer makes some incorrect claims about xchg globally locking a bus - aligned locks don't do that, just a cache lock (Can num++ be atomic for 'int num'?), and each core can be doing a separate atomic RMW on a different cache line at the same time.)
However, if that initial attempt finds it locks, we don't want to keep hammering on the cache line with atomic RMWs. That's when we fall back to read-only. 10 threads all spamming xchg for the same spinlock would keep the memory arbitration hardware pretty busy. It would likely delay the visibility of the store that unlocks (because that thread has to contend for exclusive ownership of the line), so it's directly counter-productive. It may also memory in general in general for other cores.
PAUSE is also essential, to avoid mis-speculation about memory ordering by the CPU. You exit the loop only when the memory you're reading was modified by another core. However, we don't want to pause in the un-contended case. On Skylake, PAUSE waits a lot longer, like ~100 cycles up from ~5, so you should definitely keep the spin-loop separate from the initial check for unlocked.
I'm sure Intel's and AMD's optimization manuals talk about this, see the x86 tag wiki for that and tons of other links.
Not good enough? Should I for example make use of the register keyword in C?
register is a meaningless hint in modern optimizing compilers, except in debug builds (gcc -O0).

ARM64: LDXR/STXR vs LDAXR/STLXR

On iOS, there are two similar functions, OSAtomicAdd32 and OSAtomicAdd32Barrier. I'm wondering when you would need the Barrier variant.
Disassembled, they are:
_OSAtomicAdd32:
ldxr w8, [x1]
add w8, w8, w0
stxr w9, w8, [x1]
cbnz w9, _OSAtomicAdd32
mov x0, x8
ret lr
_OSAtomicAdd32Barrier:
ldaxr w8, [x1]
add w8, w8, w0
stlxr w9, w8, [x1]
cbnz w9, _OSAtomicAdd32Barrier
mov x0, x8
ret lr
In which scenarios would you need the Load-Acquire / Store-Release semantics of the latter? Can LDXR/STXR instructions be reordered? If they can, is it possible for an atomic update to be "lost" in the absence of a barrier? From what I've read, it doesn't seem like that can happen, and if true, then why would you need the Barrier variant? Perhaps only if you also happened to need a DMB for other purposes?
Thanks!
Oh, the mind-bending horror of weak memory ordering...
The first snippet is your basic atomic read-modify-write - if someone else touches whatever address x1 points to, the store-exclusive will fail and it will try again until it succeeds. So far so good. However, this only applies to the address (or more rightly region) covered by the exclusive monitor, so whilst it's good for atomicity, it's ineffective for synchronisation of anything other than that value.
Consider a case where CPU1 is waiting for CPU0 to write some data to a buffer. CPU1 sits there waiting on some kind of synchronisation object (let's say a semaphore), waiting for CPU0 to update it to signal that new data is ready.
CPU0 writes to the data address.
CPU0 increments the semaphore (atomically, as you do) which happens to be elsewhere in memory.
???
CPU1 sees the new semaphore value.
CPU1 reads some data, which may or may not be the old data, the new data, or some mix of the two.
Now, what happened at step 3? Maybe it all occurred in order. Quite possibly, the hardware decided that since there was no address dependency it would let the store to the semaphore go ahead of the store to the data address. Maybe the semaphore store hit in the cache whereas the data didn't. Maybe it just did so because of complicated reasons only those hardware guys understand. Either way it's perfectly possible for CPU1 to see the semaphore update before the new data has hit memory, thus read back invalid data.
To fix this, CPU0 must have a barrier between steps 1 and 2, to ensure the data has definitely been written before the semaphore is written. Having the atomic write be a barrier is a nice simple way to do this. However since barriers are pretty performance-degrading you want the lightweight no-barrier version as well for situations where you don't need this kind of full synchronisation.
Now, the even less intuitive part is that CPU1 could also reorder its loads. Again since there is no address dependency, it would be free to speculate the data load before the semaphore load irrespective of CPU0's barrier. Thus CPU1 also needs its own barrier between steps 4 and 5.
For the more authoritative, but pretty heavy going, version have a read of ARM's Barrier Litmus Tests and Cookbook. Be warned, this stuff can be confusing ;)
As an aside, in this case the architectural semantics of acquire/release complicate things further. Since they are only one-way barriers, whilst OSAtomicAdd32Barrier adds up to a full barrier relative to code before and after it, it doesn't actually guarantee any ordering relative to the atomic operation itself - see this discussion from Linux for more explanation. Of course, that's from the theoretical point of view of the architecture; in reality it's not inconceivable that the A7 hardware has taken the 'simple' option of wiring up LDAXR to just do DMB+LDXR, and so on, meaning they can get away with this since they're at liberty to code to their own implementation, rather than the specification.
OSAtomicAdd32Barrier() exists for people that are using OSAtomicAdd() for something beyond just atomic increment. Specifically, they are implementing their own multi-processing synchronization primitives based on OSAtomicAdd(). For example, creating their own mutex library. OSAtomicAdd32Barrier() uses heavy barrier instructions to enforce memory ordering on both side of the atomic operation. This is not desirable in normal usage.
To summarize:
1) If you just want to increment an integer in a thread-safe way, use OSAtomicAdd32()
2) If you are stuck with a bunch of old code that foolishly assumes OSAtomicAdd32() can be used as an interprocessor memory ordering and speculation barrier, replace it with OSAtomicAdd32Barrier()
I would guess that this is simply a way of reproducing existing architecture-independent semantics for this operation.
With the ldaxr/stlxr pair, the above sequence will assure correct ordering if the AtomicAdd32 is used as a synchronization mechanism (mutex/semaphore) - regardless of whether the resulting higher-level operation is an acquire or release.
So - this is not about enforcing consistency of the atomic add, but about enforcing ordering between acquiring/releasing a mutex and any operations performed on the resource protected by that mutex.
It is less efficient than the ldxar/stxr or ldxr/stlxr you would use in a normal native synchronization mechanism, but if you have existing platform-independent code expecting an atomic add with those semantics, this is probably the best way to implement it.

Resources