How do mutex lock and unlock functions prevent CPU reordering?

As far as I know, a function call acts as a compiler barrier, but not as a CPU barrier.
This tutorial says the following:
acquiring a lock implies acquire semantics, while releasing a lock
implies release semantics! All the memory operations in between are
contained inside a nice little barrier sandwich, preventing any
undesireable memory reordering across the boundaries.
I assume that the above quote is talking about CPU reordering and not about compiler reordering.
But I don't understand how a mutex lock and unlock cause the CPU to give these functions acquire and release semantics.
For example, if we have the following C code:
pthread_mutex_lock(&lock);
i = 10;
j = 20;
pthread_mutex_unlock(&lock);
The above C code is translated into the following (pseudo) assembly instructions:
push the address of lock into the stack
call pthread_mutex_lock()
mov 10 into i
mov 20 into j
push the address of lock into the stack
call pthread_mutex_unlock()
Now what prevents the CPU from reordering mov 10 into i and mov 20 into j to above call pthread_mutex_lock() or to below call pthread_mutex_unlock()?
If it is the call instruction that prevents the CPU from doing the reordering, then why does the tutorial I quoted make it seem like it is the mutex lock and unlock functions that prevent the CPU reordering? Why didn't it say that any function call will prevent the CPU reordering?
My question is about the x86 architecture.

The short answer is that the body of the pthread_mutex_lock and pthread_mutex_unlock calls will include the necessary platform-specific memory barriers which will prevent the CPU from moving memory accesses within the critical section outside of it. The instruction flow will move from the calling code into the lock and unlock functions via a call instruction, and it is this dynamic instruction trace you have to consider for the purposes of reordering - not the static sequence you see in an assembly listing.
On x86 specifically, you probably won't find explicit, standalone memory barriers inside those methods, since you'll already have lock-prefixed instructions in order to perform the actual locking and unlocking atomically, and these instructions imply a full memory barrier, which prevents the CPU reordering you are concerned about.
For example, on my Ubuntu 16.04 system with glibc 2.23, pthread_mutex_lock is implemented using a lock cmpxchg (compare-and-exchange) and pthread_mutex_unlock is implemented using lock dec (decrement), both of which have full barrier semantics.
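To make the idea concrete, here is a minimal sketch of a lock built from C11 atomics. It is not glibc's implementation: the names toy_lock_acquire, toy_lock_release and run_toy_lock_demo are made up for illustration, and the lock spins instead of sleeping. The point is only that the ordering comes from the atomic operations inside the lock and unlock functions, not from the call instruction.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical toy lock for illustration; not how glibc implements mutexes. */
static atomic_flag toy_lock = ATOMIC_FLAG_INIT;
static long shared_counter;       /* plain data protected by toy_lock */

static void toy_lock_acquire(void)
{
    /* Atomic RMW with acquire ordering. On x86 this typically becomes an
       xchg, which is implicitly locked and acts as a full barrier. */
    while (atomic_flag_test_and_set_explicit(&toy_lock, memory_order_acquire)) {
        /* spin until the previous value was "clear" */
    }
}

static void toy_lock_release(void)
{
    /* Release store: earlier accesses cannot be reordered after it. */
    atomic_flag_clear_explicit(&toy_lock, memory_order_release);
}

static void *worker(void *arg)
{
    long iterations = *(long *)arg;
    for (long n = 0; n < iterations; n++) {
        toy_lock_acquire();
        shared_counter++;         /* stays inside the critical section */
        toy_lock_release();
    }
    return NULL;
}

/* Runs two threads and returns the final counter value. */
long run_toy_lock_demo(long iterations)
{
    pthread_t a, b;
    shared_counter = 0;
    pthread_create(&a, NULL, worker, &iterations);
    pthread_create(&b, NULL, worker, &iterations);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared_counter;
}
```

Calling run_toy_lock_demo(20000) should return 40000 every time; with the acquire/release operations replaced by plain loads and stores, updates could be lost.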

If i and j are local variables, nothing. The compiler can keep them in registers across the function call if it can prove that nothing outside the current function has their address.
But any global variables, or locals whose address might be stored in a global, do have to be "in sync" in memory for a non-inline function call. The compiler has to assume that any function call it can't inline modifies any / every variable it can possibly have a reference to.
So for example, if int i; is a local variable, after sscanf("0", "%d", &i); its address will have escaped the function and the compiler will then have to spill/reload it around function calls instead of keeping it in a call-preserved register.
See my answer on Understanding volatile asm vs volatile variable, with an example of asm volatile("":::"memory") being a barrier for a local variable whose address escaped the function (sscanf("0", "%d", &i);), but not for locals that are still purely local. It's exactly the same behaviour for exactly the same reason.
I assume that the above quote is talking about CPU reordering and not about compiler reordering.
It's talking about both, because both are necessary for correctness.
This is why the compiler can't reorder updates to shared variables with any function call. (This is very important: the weak C11 memory model allows lots of compile-time reordering. The strong x86 memory model only allows StoreLoad reordering, and local store-forwarding.)
pthread_mutex_lock being a non-inline function call takes care of compile-time reordering, and the fact that it does a locked operation, an atomic RMW, also means it includes a full runtime memory barrier on x86. (Not the call instruction itself, though, just the code in the function body.) This gives it acquire semantics.
Unlocking a spinlock only needs a release-store, not a RMW, so depending on the implementation details the unlock function might not be a StoreLoad barrier. (This is still ok: it keeps everything in the critical section from getting out. It's not necessary to stop later operations from appearing before the unlock. See Jeff Preshing's article explaining Acquire and Release semantics)
On a weakly-ordered ISA, those mutex functions would run barrier instructions, like ARM dmb (data memory barrier). Normal functions wouldn't, so the author of that guide is correct to point out that those functions are special.
Now what prevents the CPU from reordering mov 10 into i and mov 20 into j to above call pthread_mutex_lock()
This isn't the important reason (because on a weakly-ordered ISA pthread_mutex_unlock would run a barrier instruction), but it is actually true on x86 that the stores can't even be reordered with the call instruction, let alone with the actual locking/unlocking of the mutex done by the function body before the function returns.
x86 has strong memory-ordering semantics (stores don't reorder with other stores), and call is a store (pushing the return address).
So mov [i], 10 must appear in the global store order between the stores done by the two call instructions.
Of course in a normal program, nobody is observing the call stack of other threads, just the xchg to take the mutex or the release-store to release it in pthread_mutex_unlock.

Related

Are mutexes alone sufficient for thread safe operations?

Suppose we have multiple threads incrementing a common variable X, and each thread synchronizes by using a mutex M;
function_thread_n(){
ACQUIRE (M)
X++;
RELEASE (M)
}
The mutex ensures that only one thread is updating X at any time, but does a mutex also ensure that, once updated, the value of X is visible to the other threads? Say the initial value of X is 2; thread 1 increments it to 3. However, the cache of another processor might have the earlier value of 2, and another thread can still end up incrementing the value of 2 to 3. The third condition for cache coherence only requires that the order of writes made by different processors holds, right?
I guess this is what memory barriers are for and if a memory barrier is used before releasing the mutex, then the issue can be avoided.

This is a great question.
TL;DR: The short answer is "yes".
Mutexes provide three primary services:
Mutual exclusion, to ensure that only one thread is executing instructions within the critical section between acquire and release of a given mutex.
Compiler optimization fences, which prevent the compiler's optimizer from moving load/store instructions out of that critical section during compilation.
Architectural memory barriers appropriate to the current architecture, which in general includes a memory acquire fence instruction during mutex acquire and a memory release fence instruction during mutex release. These fences prevent superscalar processors from effectively reordering memory load/stores across the fence at runtime in a way that would cause them to appear to be "performed" outside the critical section.
The combination of all three ensures that data accesses within the critical section delimited by the mutex acquire/release will never observably race with data accesses from another thread that also protects its accesses using the same mutex.
Regarding the part of your question involving caches, coherent cache memory systems separately ensure that at any particular moment, a given line of memory is only writeable by at most one core at a time. Furthermore, memory store operations do not complete until they have evicted any "newly stale" copies cached elsewhere in the caching system (e.g. the L1 of other cores). See this question for more details.
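As a rough illustration, the pseudocode from the question could be written with POSIX threads like this. The helper run_increment_demo and the limit of 16 threads are assumptions made only for this sketch.

```c
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t M = PTHREAD_MUTEX_INITIALIZER;
static int X = 2;                 /* initial value used in the question */

static void *function_thread_n(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&M);       /* ACQUIRE (M) */
    X++;
    pthread_mutex_unlock(&M);     /* RELEASE (M) */
    return NULL;
}

/* Starts up to 16 threads (arbitrary cap for this sketch) and returns X
   after all of them have finished. */
int run_increment_demo(int nthreads)
{
    pthread_t tid[16];
    int started = 0;
    if (nthreads > 16)
        nthreads = 16;
    X = 2;
    for (int i = 0; i < nthreads; i++)
        if (pthread_create(&tid[started], NULL, function_thread_n, NULL) == 0)
            started++;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return X;                     /* 2 + number of threads that ran */
}
```

No explicit memory barrier appears in the calling code; the lock and unlock calls are expected to supply it, so each thread sees the previous thread's increment.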

Do I need to use smp_mb() after binding the CPU

Suppose my system is a multicore system. If I bind my program to one CPU core, do I still need smp_mb() to keep the CPU from reordering instructions?
I ask because I know that smp_mb() is not necessary on a single-core system, but I'm not sure whether that reasoning applies here.

You rarely need a full barrier anyway, usually acquire/release is enough. And usually you want to use C11 atomic_load_explicit(&var, memory_order_acquire), or in Linux kernel code, use one of its functions for an acquire-load, which can be done more efficiently on some ISAs than a plain load and an acquire barrier. (Notably AArch64 or 32-bit ARMv8 with ldar or ldapr)
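For reference, a minimal user-space sketch of the acquire/release pattern with C11 atomics might look like this. The names payload, ready and run_message_passing_demo are invented for the example, and this is ordinary pthread code rather than Linux kernel code.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;               /* plain data, published via `ready` */
static atomic_int ready;

static void *producer(void *unused)
{
    (void)unused;
    payload = 42;                                            /* ordinary store */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* publish */
    return NULL;
}

/* Spawns the producer, waits for the flag with acquire loads and
   returns the payload it observed (or -1 if the thread could not start). */
int run_message_passing_demo(void)
{
    pthread_t t;
    payload = 0;
    atomic_store_explicit(&ready, 0, memory_order_relaxed);
    if (pthread_create(&t, NULL, producer, NULL) != 0)
        return -1;
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) {
        /* spin; the acquire load pairs with the release store above */
    }
    int seen = payload;           /* ordered after the acquire load */
    pthread_join(t, NULL);
    return seen;
}
```

No full barrier is requested anywhere here; the release store and the acquire load are enough for the consumer to see the payload.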
But yeah, if all threads are sharing the same logical core, run-time memory reordering is impossible, only compile-time. So you just need a compiler memory barrier like asm("" ::: "memory") or C11 atomic_signal_fence(seq_cst), not a CPU run-time barrier like atomic_thread_fence(seq_cst) or the Linux kernel's SMP memory barrier (smp_mb() is x86 mfence or equivalent, or ARM dmb ish, for example).
See Why memory reordering is not a problem on single core/processor machines? for more details about the fact that all instructions on the same core observe memory effects to have happened in program order, regardless of interrupts. e.g. a later load must see the value from an earlier store, otherwise the CPU is not maintaining the illusion of instructions on that core running in program order.
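A small sketch of the compiler-only barrier mentioned above, assuming every thread that touches these variables is pinned to the same logical core. Note that the C standard only defines atomic_signal_fence relative to signal handlers, so treating it as a general compile-time barrier between pinned threads relies on how compilers implement it in practice. The function and variable names are hypothetical.

```c
#include <stdatomic.h>

static int data_word;
static volatile int data_ready;   /* volatile so each access is really emitted */

/* Writer side: store the data, then set the flag, in that order. */
void publish_on_same_core(int value)
{
    data_word = value;
    /* Blocks compile-time reordering only; typically emits no instruction. */
    atomic_signal_fence(memory_order_seq_cst);
    data_ready = 1;
}

/* Reader side: returns 1 and fills *out if the data has been published. */
int consume_on_same_core(int *out)
{
    if (!data_ready)
        return 0;
    atomic_signal_fence(memory_order_seq_cst);
    *out = data_word;
    return 1;
}
```

If the threads can run on different cores, this is not sufficient and a real acquire/release pair (or smp_mb() in kernel code) is needed instead.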
And if you can convince your compiler to emit atomic RMW instructions without the x86 lock prefix, for example, they'll be atomic wrt. context switches (and interrupts in general). Or use gcc -Wa,-momit-lock-prefix=yes to have GAS remove lock prefixes for you, so you can use <stdatomic.h> functions efficiently. At least on x86; for RISC ISAs, there's no way to do a read-modify-write of a memory location in a single instruction.
Or if there is (ARMv8.1), it implies an atomic RMW that's SMP-safe, like x86 lock add [mem], eax. But on a CISC like x86, we have instructions like add [mem], eax or whatever which are just like separate load / ADD / store glued into a single instruction, which either executes fully or not at all before an interrupt. (Note that "executing" a store just means writing into the store buffer, not globally visible cache, but that's sufficient for later code on the same core to see it.)
See also Is x86 CMPXCHG atomic, if so why does it need LOCK? for more about non-locked use-cases.

Locks around memory manipulation via inline assembly

I am new to the low level stuff so I am completely oblivious of what kind of problems you might face down there and I am not even sure if I understand the term "atomic" right. Right now I am trying to make simple atomic locks around memory manipulation via extended assembly. Why? For sake of curiosity. I know I am reinventing the wheel here and possibly oversimplifying the whole process.
The question?
Does the code I present here achieve the goal of making memory manipulation both threadsafe and reentrant?
If it works, why?
If it doesn't work, why?
Not good enough? Should I for example make use of the register keyword in C?
What I simply want to do...
Before memory manipulation, lock.
After memory manipulation, unlock.
The code:
volatile int atomic_gate_memory = 0;
static inline void atomic_open(volatile int *gate)
{
asm volatile (
"wait:\n"
"cmp %[lock], %[gate]\n"
"je wait\n"
"mov %[lock], %[gate]\n"
: [gate] "=m" (*gate)
: [lock] "r" (1)
);
}
static inline void atomic_close(volatile int *gate)
{
asm volatile (
"mov %[lock], %[gate]\n"
: [gate] "=m" (*gate)
: [lock] "r" (0)
);
}
Then something like:
void *_malloc(size_t size)
{
atomic_open(&atomic_gate_memory);
void *mem = malloc(size);
atomic_close(&atomic_gate_memory);
return mem;
}
#define malloc(size) _malloc(size)
.. same for calloc, realloc, free and fork(for linux).
#ifdef _UNISTD_H
int _fork()
{
pid_t pid;
atomic_open(&atomic_gate_memory);
pid = fork();
atomic_close(&atomic_gate_memory);
return pid;
}
#define fork() _fork()
#endif
After loading the stackframe for atomic_open, objdump generates:
00000000004009a7 <wait>:
4009a7: 39 10 cmp %edx,(%rax)
4009a9: 74 fc je 4009a7 <wait>
4009ab: 89 10 mov %edx,(%rax)
Also, given the disassembly above; can I assume I am making an atomic operation because it is only one instruction?

I think a simple spinlock that doesn't have any of the really major / obvious performance problems on x86 is something like this. Of course a real implementation would use a system call (like Linux futex) after spinning for a while, and unlocking would have to check if it needs to notify any waiters with another system call. This is important; you don't want to spin forever wasting CPU time (and energy / heat) doing nothing. But conceptually this is the spin part of a spinlock before you take the fallback path. It's an important piece of how light-weight locking is implemented. (Only attempting to take the lock once before calling the kernel would be a valid choice, instead of spinning at all.)
Implement as much of this as you like in inline asm, or preferably using C11 stdatomic, like this semaphore implementation. This is NASM syntax. In GNU C, make sure you use a "memory" clobber to stop compile-time reordering of memory access (TTAS coherence issue?)
;;; UNTESTED ;;;;;;;;
;;; TODO: **IMPORTANT** fall back to OS-supported sleep/wakeup after spinning some
;;; e.g. Linux futex
; first arg in rdi as per AMD64 SysV ABI (Linux / Mac / etc)
;;;;;void spin_unlock(volatile char *lock)
global spin_unlock
spin_unlock:
; movzx eax, byte [rdi] ; debug check for double-unlocking. Expect 1
mov byte [rdi], 0 ; lock.store(0, std::memory_order_release)
ret
align 16
;;;;;void spin_lock (volatile char *lock)
global spin_lock
spin_lock:
mov eax, 1 ; only need to do this the first time, otherwise we know al is non-zero
.retry:
xchg al, [rdi]
test al,al ; check if we actually got the lock
jnz .spinloop
ret ; no taken branches on the fast-path
align 8
.spinloop: ; do {
pause
cmp byte [rdi], al ; C++11
jne .retry ; if (lock.load(std::memory_order_acquire) != 1)
jmp .spinloop
; if not translating this to inline asm, you could put the spin loop *before* the function entry point, saving the last jmp
; but since this is probably too simplistic for real use, I'm going to leave it as-is.
A plain store has release semantics, but not sequential-consistency (which you'd get from an xchg or something). Acquire/release is enough to protect a critical section (hence the name).
If you were using a bitfield of atomic flags, you could use lock bts (test and set) for the equivalent of xchg-with-1. You can spin on bt or test. To unlock, you'd need lock btr, not just btr, because it would be a non-atomic read-modify-write of the byte, or even the containing 32-bits.
With a byte or int sized lock like you should normally use, you don't even need a locked operation to unlock; release semantics are enough. glibc's pthread_spin_unlock does it the same as my unlock function: a simple store.
(lock bts is not necessary; xchg or lock cmpxchg are just as good for a normal lock.)
The first access should be an atomic RMW
See discussion on Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock? - if the first access is read-only, the CPU might send out just a share request for that cache line. Then, if it sees the line unlocked (the hopefully-common low-contention case) it would have to send out an RFO (Read For Ownership) to actually be able to write the cache line. So that's twice as many off-core transactions.
The downside is that this will take MESI exclusive ownership of that cache line, but what really matters is that the thread owning the lock can efficiently store a 0 so we can see it unlocked. Either way, read-only or RMW, that core will lose exclusive ownership of the line and have to RFO before it can commit that unlocking store.
I think a read-only first access would just optimize for slightly less traffic between cores when multiple threads queue up to wait for a lock that's already taken. That would be a silly thing to optimize for.
(Fastest inline-assembly spinlock also tested the idea for a massively contended spinlock with multiple threads doing nothing but trying to take the lock, with poor results. That linked answer makes some incorrect claims about xchg globally locking a bus - aligned locks don't do that, just a cache lock (Can num++ be atomic for 'int num'?), and each core can be doing a separate atomic RMW on a different cache line at the same time.)
However, if that initial attempt finds it locked, we don't want to keep hammering on the cache line with atomic RMWs. That's when we fall back to read-only. 10 threads all spamming xchg for the same spinlock would keep the memory arbitration hardware pretty busy. It would likely delay the visibility of the store that unlocks (because that thread has to contend for exclusive ownership of the line), so it's directly counter-productive. It may also slow down memory access in general for other cores.
PAUSE is also essential, to avoid mis-speculation about memory ordering by the CPU. You exit the loop only when the memory you're reading was modified by another core. However, we don't want to pause in the un-contended case. On Skylake, PAUSE waits a lot longer, like ~100 cycles up from ~5, so you should definitely keep the spin-loop separate from the initial check for unlocked.
I'm sure Intel's and AMD's optimization manuals talk about this, see the x86 tag wiki for that and tons of other links.
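For comparison, here is roughly the same spin part written with C11 stdatomic instead of NASM. It is equally simplistic (no futex fallback), and the toy_spinlock type, the function names and the cpu_relax macro are hypothetical. On non-x86 targets the sketch simply spins without a hint.

```c
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()
#else
#define cpu_relax() ((void)0)     /* assumption: no spin hint elsewhere */
#endif

typedef struct { atomic_char locked; } toy_spinlock;   /* hypothetical type */

static inline bool toy_spin_trylock(toy_spinlock *l)
{
    /* First access is an atomic RMW (xchg with 1); acquire on success. */
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

static inline void toy_spin_lock(toy_spinlock *l)
{
    while (!toy_spin_trylock(l)) {
        /* Contended: spin read-only with pause until the lock looks free,
           then retry the exchange. Relaxed is enough for the load because
           the exchange that follows provides the acquire ordering. */
        do {
            cpu_relax();
        } while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0);
    }
}

static inline void toy_spin_unlock(toy_spinlock *l)
{
    /* Release store; on x86 this is typically a plain mov. */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The pause only runs on the contended path, so the uncontended case is a single exchange.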
Not good enough? Should I for example make use of the register keyword in C?
register is a meaningless hint in modern optimizing compilers, except in debug builds (gcc -O0).

Should I mutex lock a single variable?

If a single 32-bit variable is shared between multiple threads, should I put a mutex lock around the variable? For example, suppose 1 thread writes to a 32-bit counter and a 2nd thread reads it. Is there any chance the 2nd thread could read a corrupted value?
I'm working on a 32-bit ARM embedded system. The compiler always seems to align 32-bit variables so they can be read or written with a single instruction. If the 32-bit variable was not aligned, then the read or write would be broken down into multiple instructions and the 2nd thread could read a corrupted value.
Does the answer to this question change if I move to a multiple-core system in the future and the variable is shared between cores? (assuming a shared cache between cores)
Thanks!

A mutex protects you from more than just tearing - for example some ARM implementations use out-of-order execution, and a mutex will include memory (and compiler) barriers that may be necessary for your algorithm's correctness.
It is safer to include the mutex, then figure out a way to optimise it later if it shows as a performance problem.
Note also that if your compiler is GCC-based, you may have access to the GCC atomic builtins.
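A minimal sketch with the GCC __atomic builtins for the one-writer, one-reader counter from the question. The function names are invented, and acquire/release ordering is chosen here as a conservative assumption; a bare counter might get away with relaxed.

```c
#include <stdint.h>

static uint32_t counter;          /* the shared 32-bit variable */

/* Writer thread: publish a new value with a single atomic store. */
void counter_write(uint32_t value)
{
    __atomic_store_n(&counter, value, __ATOMIC_RELEASE);
}

/* Reader thread: a single atomic load, so no torn value is observed. */
uint32_t counter_read(void)
{
    return __atomic_load_n(&counter, __ATOMIC_ACQUIRE);
}
```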

If all the writing is done from one thread (i.e. other threads are only reading), then no you don't need a mutex. If more than one thread may be writing, then you do.

You don't need mutex.
On 32-bit ARM, single write or read is an atomic operation. (regardless of the number of cores)
Of course, you should declare that variable as volatile.

On a 32-bit system, reads and writes of 32-bit vars are atomic. However, it depends what else you are doing with the variable. E.g. if you manipulate it somehow (e.g. add a value), then this requires a read, manipulation and write. If the CPU and compiler do not support an atomic operation for this, then you will need to use a mutex to protect this multi-operation sequence.
There are other, lock-free techniques which can reduce the need for mutexes.
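For instance, if the toolchain supports C11 atomics, the read-modify-write case can be handled without a mutex. This is a sketch with made-up names (hits, run_fetch_add_demo); whether it compiles to a lock-free sequence depends on the target.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_uint hits;

static void *adder(void *arg)
{
    unsigned n = *(unsigned *)arg;
    for (unsigned i = 0; i < n; i++)
        /* one indivisible read-add-write; no mutex needed for the count */
        atomic_fetch_add_explicit(&hits, 1u, memory_order_relaxed);
    return NULL;
}

/* Two threads each add `per_thread` times; returns the final total. */
unsigned run_fetch_add_demo(unsigned per_thread)
{
    pthread_t a, b;
    atomic_store(&hits, 0u);
    pthread_create(&a, NULL, adder, &per_thread);
    pthread_create(&b, NULL, adder, &per_thread);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&hits);
}
```

With a plain unsigned and hits++ instead, the two threads could lose updates even though each individual load and store is atomic.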

Relative performance of swap vs compare-and-swap locks on x86

Two common locking idioms are:
if (!atomic_swap(lockaddr, 1)) /* got the lock */
and:
if (!atomic_compare_and_swap(lockaddr, 0, val)) /* got the lock */
where val could simply be a constant or an identifier for the new prospective owner of the lock.
What I'd like to know is whether there tends to be any significant performance difference between the two on x86 (and x86_64) machines. I know this is a fairly broad question since the answer might vary a lot between individual cpu models, but that's part of the reason I'm asking SO rather than just doing benchmarks on a few cpus I have access to.

I assume atomic_swap(lockaddr, 1) gets translated to a xchg reg,mem instruction and atomic_compare_and_swap(lockaddr, 0, val) gets translated to a cmpxchg[8b|16b].
Some Linux kernel developers think cmpxchg is faster, because the lock prefix isn't implied as it is with xchg. So if you are on a uniprocessor, multithread or can otherwise make sure the lock isn't needed, you are probably better off with cmpxchg.
But chances are your compiler will translate it to a "lock cmpxchg" and in that case it doesn't really matter.
Also note that while latencies for these instructions are low (1 cycle without lock and about 20 with lock), if you happen to use a common sync variable between two threads, which is quite usual, some additional bus cycles will be enforced, which last forever compared to the instruction latencies. These will most likely be completely hidden by a 200 or 500 CPU cycle long cache snoop/sync/mem access/bus lock/whatever.

I found this Intel document, stating that there is no difference in practice:
http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/
One common myth is that the lock utilizing a cmpxchg instruction is cheaper than a lock utilizing an xchg instruction. This is used because cmpxchg will not attempt to get the lock in exclusive mode since the cmp will go through first. Figure 9 shows that the cmpxchg is just as expensive as the xchg instruction.

On x86, any instruction with a LOCK prefix does all memory operations as read-modify-write cycles. This means that XCHG (with its implicit LOCK) and LOCK CMPXCHG (in all cases, even if the comparison fails) always get an exclusive lock on the cache line. The result is that there is basically no difference in performance.
Note that many CPUs all spinning on the same lock can cause a lot of bus overhead in this model. This is one reason that spin-lock loops should contain PAUSE instructions. Some other architectures have better operations for this.
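For reference, the two idioms from the question could be spelled out with C11 atomics as follows. The function names are hypothetical, and val is assumed to be non-zero since 0 means "unlocked" here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Idiom 1: unconditional swap; we own the lock if the old value was 0. */
bool try_lock_swap(atomic_int *lockaddr)
{
    return atomic_exchange_explicit(lockaddr, 1, memory_order_acquire) == 0;
}

/* Idiom 2: compare-and-swap 0 -> val; `val` can identify the new owner. */
bool try_lock_cas(atomic_int *lockaddr, int val)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        lockaddr, &expected, val,
        memory_order_acquire, memory_order_relaxed);
}

/* Either idiom can be released with a plain release store of 0. */
void release_lock(atomic_int *lockaddr)
{
    atomic_store_explicit(lockaddr, 0, memory_order_release);
}
```

On x86 a compiler will typically emit xchg for the first and lock cmpxchg for the second, which is the pair being compared in this thread.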

Are you sure you didn't mean
if (!atomic_load(lockaddr)) {
if (!atomic_swap(lockaddr, val)) /* got the lock */
for the second one?
Test and test and set locks (see Wikipedia https://en.wikipedia.org/wiki/Test_and_test-and-set ) are a quite common optimization for many platforms.
Depending on how compare and exchange is implemented it could be faster or slower than a test and test and set.
As x86 is a relatively strongly ordered platform, hardware optimizations that could make test and test-and-set locks faster may be less feasible.
Figure 8 from the document that Bo Persson found
http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/ shows that Test and Test and Set locks are superior in performance.

Using xchg vs cmpxchg to acquire the lock
In terms of performance on Intel's processors, it is the same, but for the sake of simplicity, to have things easier to fathom, I prefer the first way from the examples that you have given. There is no reason to use cmpxchg for acquiring a lock if you can do this with xchg.
According to the Occam's razor principle, simple things are better.
Besides that, locking with xchg is more powerful - you can also check the correctness of the logic of your software, i.e. that you are not accessing the memory byte that has not been explicitly allocated for locking. Thus, you will check that you are using the correctly initialized synchronization variable. Besides that, you will be able to check that you don't unlock twice.
Using normal memory store to release the lock
There is no consensus on whether writing to the synchronization variable on releasing the lock should be done with just a normal memory store (mov) or a bus-locking memory store, i.e. an instruction with implicit or explicit lock-prefix, like xchg.
The approach of using normal memory store to release lock was recommended by Peter Cordes, see the comments below for details.
But there are implementations where both acquiring and releasing the lock is done with a bus-locking memory store, since this approach seems to be straightforward and intuitive. For example, LeaveCriticalSection under Windows 10 uses bus-locking store to release the lock even on a single-socket processor; while on multiple physical processors with Non-Uniform-Memory-Access (NUMA), this issue is even more important.
I've done micro-benchmarking on a memory manager that does lots of memory allocate/reallocate/free, on a single-socket CPU (Kaby Lake). When there is no contention, i.e. there are fewer threads than physical cores, the tests complete about 10% slower with locked release, but when there are more threads than physical cores, tests with locked release complete 2% faster. So, on average, a normal memory store to release the lock outperforms a locked memory store.
Example of locking that checks synchronization variable for validity
See this example (Delphi programming language) of safer locking functions that checks data of a synchronization variable for validity, and catches attempts to release locks that were not acquired:
const
cLockAvailable = 107; // arbitrary constant, use any unique values that you like, I've chosen prime numbers
cLockByteLocked = 109;
cLockFinished = 113;
function AcquireLock(var Target: LONG): Boolean;
var
R: LONG;
begin
R := InterlockedExchange(Target, cLockByteLocked);
case R of
cLockAvailable: Result := True; // we've got a value that indicates that the lock was available, so return True to the caller indicating that we have acquired the lock
cLockByteLocked: Result := False; // we've got a value that indicates that the lock was already acquired by someone else, so return False to the caller indicating that we have failed to acquire the lock this time
else
begin
raise Exception.Create('Serious application error - tried to acquire lock using a variable that has not been properly initialized');
end;
end;
end;
procedure ReleaseLock(var Target: LONG);
var
R: LONG;
begin
// As Peter Cordes pointed out (see comments below), releasing the lock doesn't have to be interlocked, just a normal store. Even for debugging we use normal load. However, Windows 10 uses locked release on LeaveCriticalSection.
R := Target;
Target := cLockAvailable;
if R <> cLockByteLocked then
begin
raise Exception.Create('Serious application error - tried to release a lock that has not been actually locked');
end;
end;
Your main application goes here:
var
AreaLocked: LONG;
begin
AreaLocked := cLockAvailable; // on program initialization, fill the default value
....
if AcquireLock(AreaLocked) then
try
// do something critical with the locked area
...
finally
ReleaseLock(AreaLocked);
end;
....
AreaLocked := cLockFinished; // on program termination, set the special value to catch probable cases when somebody will try to acquire the lock
end.
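The same idea can be sketched in C with C11 atomics. This is a loose translation of the Delphi code above, not a drop-in equivalent: it returns error codes instead of raising exceptions, and the names are made up for the example.

```c
#include <stdatomic.h>

/* Same arbitrary sentinel values as in the Delphi example. */
enum { LOCK_AVAILABLE = 107, LOCK_LOCKED = 109, LOCK_FINISHED = 113 };

/* Returns 1 if the lock was acquired, 0 if someone else holds it,
   and -1 if the variable held an unexpected (uninitialized) value. */
int acquire_lock_checked(atomic_long *target)
{
    long old = atomic_exchange_explicit(target, LOCK_LOCKED,
                                        memory_order_acquire);
    if (old == LOCK_AVAILABLE)
        return 1;
    if (old == LOCK_LOCKED)
        return 0;
    return -1;
}

/* Returns 0 on success, -1 if the lock was not actually held.
   The release itself is a normal load followed by a release store. */
int release_lock_checked(atomic_long *target)
{
    long old = atomic_load_explicit(target, memory_order_relaxed);
    atomic_store_explicit(target, LOCK_AVAILABLE, memory_order_release);
    return old == LOCK_LOCKED ? 0 : -1;
}
```

As in the Delphi version, a failed validity check in acquire_lock_checked has already overwritten the variable with the "locked" value by the time the error is reported.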
Efficient pause-based spin-wait loops
Test, test-and-set
You may also use the following assembly code (see the "Assembly code example of pause-based spin-wait loop" section below) as a working example of the "pause"-based spin-wait loop.
This code uses a normal memory load while spinning to save resources, as suggested by Peter Cordes. This technique is called "test, test-and-set". You can find out more on this technique at https://stackoverflow.com/a/44916975/6910868
Number of iterations
The pause-based spin-wait loop in this example first tries to acquire the lock by reading the synchronization variable, and if it is not available, spins with the pause instruction for up to 5000 iterations. After 5000 iterations it calls the Windows API function SwitchToThread(). This value of 5000 is empirical. It is based on my tests. Values from 500 to 50000 also seem to be OK, but in some scenarios lower values are better while in other scenarios higher values are better. You can read more on pause-based spin-wait loops at the URL that I gave in the preceding paragraph.
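A portable C sketch of the same "spin a bounded number of times, then give up the CPU" structure is shown here. It uses POSIX sched_yield() as a stand-in for SwitchToThread(), which is an assumption of this example, and the function name lock_with_yield is hypothetical.

```c
#include <sched.h>
#include <stdatomic.h>

#define SPIN_LIMIT 5000           /* empirical value quoted in the text */

/* Acquires the lock (0 = free, 1 = taken) and returns how many times
   the thread had to yield before it succeeded. */
unsigned lock_with_yield(atomic_int *lock)
{
    unsigned yields = 0;
    for (;;) {
        for (int spins = 0; spins < SPIN_LIMIT; spins++) {
            /* test with a normal load first; only if the lock looks free,
               do the test-and-set with an atomic exchange */
            if (atomic_load_explicit(lock, memory_order_relaxed) == 0 &&
                atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
                return yields;
            /* on x86, a pause hint such as _mm_pause() would go here */
        }
        sched_yield();            /* stand-in for SwitchToThread() */
        yields++;
    }
}
```

On an uncontended lock the function returns 0 after a single load and exchange.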
Availability of the pause instruction
Please note that you may use this code only on processors that support SSE2. You should check the corresponding CPUID bit before executing the pause instruction; otherwise it will just waste power. On processors without pause, use other means, like EnterCriticalSection/LeaveCriticalSection or Sleep(0) and then Sleep(1) in a loop. Some people say that on 64-bit processors you need not check for SSE2 to make sure that the pause instruction is implemented, because the original AMD64 architecture adopted Intel's SSE and SSE2 as core instructions, so if you run 64-bit code you already have SSE2, and thus the pause instruction, for sure. However, Intel discourages relying on the presence of a specific feature and explicitly states that certain features may vanish in future processors and that applications must always check features via CPUID. Still, the SSE instructions have become ubiquitous and many 64-bit compilers use them without checking (e.g. Delphi for Win64), so the chances that some future processor will lack SSE2, let alone pause, are very slim.
Assembly code example of pause-based spin-wait loop
// on entry rcx = address of the byte-lock
// on exit: al (eax) = old value of the byte at [rcx]
#Init:
mov edx, cLockByteLocked
mov r9d, 5000
mov eax, edx
jmp #FirstCompare
#DidntLock:
#NormalLoadLoop:
dec r9
jz #SwitchToThread // for static branch prediction, jump forward means "unlikely"
pause
#FirstCompare:
cmp [rcx], al // we are using faster, normal load to not consume the resources and only after it is ready, do once again interlocked exchange
je #NormalLoadLoop // for static branch prediction, jump backwards means "likely"
lock xchg [rcx], al
cmp eax, edx // 32-bit comparison is faster on newer processors like Xeon Phi or Cannonlake.
je #DidntLock
jmp #Finish
#SwitchToThread:
push rcx
call SwitchToThreadIfSupported
pop rcx
jmp #Init
#Finish:
