The ARM Transactional Memory Extensions have a fairly straightforward description of how one would use them:
sem_post:
TSTART X0 // Start of outer transaction
CBNZ X0, test_fail // No reason for this routine to cancel or fail
LDR X1, [X2] // X2 points to semaphore
ADD X1, X1, #1 // Increment semaphore value
STR X1, [X2] // Store incremented value
TCOMMIT // Commit transaction and exit
What I'm trying to figure out is whether these transactions replay based on collisions with transactions in other parts of the code, and whether they replay based on collisions with any sort of access. To elaborate, let's say we have this routine:
sem_wait:
TSTART X0 // Start of outer transaction
CBNZ X0, retry_check // This routine checks for retry (RTRY) and restarts transaction
LDR X1, [X2] // X2 points to semaphore
CMP X1, #0 // Check if semaphore is already used
B.NE decrement // If it's non-zero, we can decrement the semaphore
TCANCEL #0xFF // If it's zero, we gotta retry
decrement:
SUB X1, X1, #1 // Decrement semaphore value
STR X1, [X2] // Store decremented value
TCOMMIT // Commit transaction and exit
So this transaction would be in another part of the code, but would be accessing the same locations in memory as the sem_post transaction.
My first question: would a thread executing the sem_post transaction potentially replay due to a thread executing the sem_wait transaction concurrently?
For the second part of my question, let's say we have a simple routine like this:
break_semaphore:
MOV X0, #0xFF
STR X0, [X1] // X1 points to semaphore
The above routine isn't a transaction at all; it's just messing with the semaphore.
My second question: Would the thread executing the sem_post transaction potentially replay due to any concurrent access to locations that are to be updated and committed in the sem_post transaction?
For clarity, I fully understand that this isn't really how the TME instructions are supposed to be used, and that locks would be implemented more like this: https://www.gem5.org/project/2020/10/27/tme.html
I'm more wondering what it is that transactions actually linearize: two transactions with common regions of code, all transactions with each other, or the transaction with respect to all other accesses to memory?
TME definitely linearizes accesses to shared memory regions. In the case of your example, the reason these transactions abort is not that they execute the same code, but that they access the same memory address.
From the ARM TME documentation, any conflicting access to a memory address touched by the transaction will cause it to fail, with the MEM bit set in the status that TSTART reports. In the context of your semaphore example, because there is no fallback code for the call to sem_post, the transaction would cancel and program execution would revert to the non-transactional state.
For a similar reason, transactions do not necessarily linearize just by executing the same code, because they may refer to different regions of memory (e.g., multiple semaphores with different pointers), which is perfectly legal.
Whether or not transactions linearize amongst each other is more difficult to answer, because it is typically hardware dependent. For example, two transactions could be legally executing on different cores with different memory objects, but if two transactions are attempting to be executed on the same core and registers (i.e., with hyper-threading), this behavior would be more difficult to define.
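To make the retry behaviour concrete, here is a minimal C sketch of sem_post. It assumes a toolchain exposing Arm's ACLE TME intrinsics (__tstart/__tcommit in arm_acle.h, built with +tme) — an assumption on my part, not part of the question — and the unbounded retry loop with no lock fallback is a simplification, not how production code should handle persistent aborts:
#include <arm_acle.h>
#include <stdint.h>

void sem_post_tme(volatile uint64_t *sem)
{
    /* __tstart() returns 0 when the transaction starts; on an abort
       (for example, a conflicting access to *sem from sem_wait or from
       break_semaphore), execution resumes here with a non-zero status. */
    while (__tstart() != 0) {
        /* retry; real code would bound this and fall back to a lock */
    }
    *sem += 1;      /* transactional read-modify-write of the semaphore */
    __tcommit();    /* commit makes the update visible atomically */
}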
Suppose we have multiple threads incrementing a common variable X, and each thread synchronizes by using a mutex M;
function_thread_n(){
    ACQUIRE (M)
    X++;
    RELEASE (M)
}
The mutex ensures that only one thread is updating X at any time, but does a mutex also ensure that, once updated, the value of X is visible to the other threads? Say the initial value of X is 2; thread 1 increments it to 3. However, the cache of another processor might still hold the earlier value of 2, and another thread could end up incrementing that stale 2 to 3. The third condition for cache coherence only requires that the order of writes made by different processors holds, right?
I guess this is what memory barriers are for and if a memory barrier is used before releasing the mutex, then the issue can be avoided.
This is a great question.
TL;DR: The short answer is "yes".
Mutexes provide three primary services:
Mutual exclusion, to ensure that only one thread is executing instructions within the critical section between acquire and release of a given mutex.
Compiler optimization fences, which prevent the compiler's optimizer from moving load/store instructions out of that critical section during compilation.
Architectural memory barriers appropriate to the current architecture, which in general includes a memory acquire fence instruction during mutex acquire and a memory release fence instruction during mutex release. These fences prevent superscalar processors from effectively reordering memory load/stores across the fence at runtime in a way that would cause them to appear to be "performed" outside the critical section.
The combination of all three ensures that data accesses within the critical section delimited by the mutex acquire/release will never observably race with data accesses from another thread that also protects its accesses using the same mutex.
Regarding the part of your question involving caches, coherent cache memory systems separately ensure that at any particular moment, a given line of memory is only writeable by at most one core at a time. Furthermore, memory store operations do not complete until they have evicted any "newly stale" copies cached elsewhere in the caching system (e.g. the L1 of other cores). See this question for more details.
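For illustration, here is a minimal sketch of the pattern from the question using POSIX threads (using pthread_mutex is my own choice, not something from the question); the lock acts as the acquire and the unlock as the release described above:
#include <pthread.h>

static long X = 2;
static pthread_mutex_t M = PTHREAD_MUTEX_INITIALIZER;

void *function_thread_n(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&M);    /* acquire: also makes the previous unlocker's write to X visible */
    X++;                       /* only one thread at a time is inside this critical section */
    pthread_mutex_unlock(&M);  /* release: publishes the new X before the mutex becomes free */
    return NULL;
}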
CAS is a very primitive lock-free technique, and I know that it is atomic.
Also, it is a much more complex operation than INC.
It should compare the value, and if the value has not changed, CAS sets the new value while guaranteeing that no other thread interferes.
Then, how can CAS be atomic while INC is not?
I also learned that LOCK INC is an atomic operation, but at a higher cost than INC.
If CAS also uses some technique similar to LOCK INC internally, then why is it called a lock-free technique?
Is the lock used in CAS different from the normal lock we usually know?
If so, how different are the costs of a normal lock and CAS?
CaS is different from locked inc. LOCK INC semantically locks the memory and performs the increment (the lock does not always happen, but the effect is as if it did). As a result, LOCK INC is guaranteed to increment the value, and if two LOCK INCs are issued at the same time from two different threads on the same value, the result is the value incremented exactly twice. LOCK INC can never fail.
CaS is a 'try and see' operation. The operation is attempted (namely, set the value to X if it is Y) and it can either succeed, if the value is indeed Y, or fail, if it is something else. There is no guarantee that it will succeed. If two threads issue the same CaS operation on the same value at the same time, only one of them will succeed, while the other will fail.
There is also a concept of 'atomic increment', which basically means 'increment the value, but do not lock it'. The way it is usually done is by trying the CaS in a loop with the new incremented value until it succeeds. Every failure means the new value and the check value are adjusted. As a result, atomic increment can potentially be slow on highly contended values.
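As a concrete illustration of that CaS-in-a-loop increment, here is a minimal C11 sketch (the function name and the use of <stdatomic.h> are mine, not the answer's):
#include <stdatomic.h>

void atomic_increment(_Atomic int *value)
{
    int expected = atomic_load(value);
    /* On failure, compare_exchange_weak reloads *value into `expected`,
       which is exactly the "adjust and retry" step described above. */
    while (!atomic_compare_exchange_weak(value, &expected, expected + 1)) {
        /* another thread won the race; try again with the fresh value */
    }
}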
Compare and swap (CAS) is implemented as an atomic processor operation on most processor architectures. Since it is atomic at the hardware level, no explicit lock is needed when using it. C compilers generally know if the target architecture has the instruction, so if you use the compare and swap operation in the atomic library, it will most likely use this hardware operation if available, without incurring the overhead of an explicit lock.
"Increment" is not well defined in a multithreaded environment; if another thread has changed the value since the current thread read it, should the result of the increment operation be one more than the value the current thread read, or one more than the value the other thread wrote? For most people the intuitive result is that it should be one more than the value the current thread read, but only if another thread hasn't written to it, which actually makes increment a more complex operation than a compare and swap.
On iOS, there are two similar functions, OSAtomicAdd32 and OSAtomicAdd32Barrier. I'm wondering when you would need the Barrier variant.
Disassembled, they are:
_OSAtomicAdd32:
ldxr w8, [x1]
add w8, w8, w0
stxr w9, w8, [x1]
cbnz w9, _OSAtomicAdd32
mov x0, x8
ret lr
_OSAtomicAdd32Barrier:
ldaxr w8, [x1]
add w8, w8, w0
stlxr w9, w8, [x1]
cbnz w9, _OSAtomicAdd32Barrier
mov x0, x8
ret lr
In which scenarios would you need the Load-Acquire / Store-Release semantics of the latter? Can LDXR/STXR instructions be reordered? If they can, is it possible for an atomic update to be "lost" in the absence of a barrier? From what I've read, it doesn't seem like that can happen, and if true, then why would you need the Barrier variant? Perhaps only if you also happened to need a DMB for other purposes?
Thanks!
Oh, the mind-bending horror of weak memory ordering...
The first snippet is your basic atomic read-modify-write - if someone else touches whatever address x1 points to, the store-exclusive will fail and it will try again until it succeeds. So far so good. However, this only applies to the address (or more rightly region) covered by the exclusive monitor, so whilst it's good for atomicity, it's ineffective for synchronisation of anything other than that value.
Consider a case where CPU1 is waiting for CPU0 to write some data to a buffer. CPU1 sits there waiting on some kind of synchronisation object (let's say a semaphore), waiting for CPU0 to update it to signal that new data is ready.
CPU0 writes to the data address.
CPU0 increments the semaphore (atomically, as you do) which happens to be elsewhere in memory.
???
CPU1 sees the new semaphore value.
CPU1 reads some data, which may or may not be the old data, the new data, or some mix of the two.
Now, what happened at step 3? Maybe it all occurred in order. Quite possibly, the hardware decided that since there was no address dependency it would let the store to the semaphore go ahead of the store to the data address. Maybe the semaphore store hit in the cache whereas the data didn't. Maybe it just did so because of complicated reasons only those hardware guys understand. Either way it's perfectly possible for CPU1 to see the semaphore update before the new data has hit memory, thus read back invalid data.
To fix this, CPU0 must have a barrier between steps 1 and 2, to ensure the data has definitely been written before the semaphore is written. Having the atomic write be a barrier is a nice simple way to do this. However since barriers are pretty performance-degrading you want the lightweight no-barrier version as well for situations where you don't need this kind of full synchronisation.
Now, the even less intuitive part is that CPU1 could also reorder its loads. Again since there is no address dependency, it would be free to speculate the data load before the semaphore load irrespective of CPU0's barrier. Thus CPU1 also needs its own barrier between steps 4 and 5.
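A language-level sketch of the same two-sided fix, using C11 acquire/release atomics rather than OSAtomic (the variable names are mine): the release on the store keeps step 1 before step 2, and the acquire on the load keeps step 5 after step 4.
#include <stdatomic.h>

static int data;               /* the buffer CPU1 is waiting for */
static _Atomic int sem;        /* the synchronisation object */

void cpu0_producer(void)
{
    data = 42;                                                 /* step 1 */
    atomic_fetch_add_explicit(&sem, 1, memory_order_release);  /* step 2 */
}

int cpu1_consumer(void)
{
    while (atomic_load_explicit(&sem, memory_order_acquire) == 0)
        ;                                                      /* step 4 */
    return data;                                               /* step 5: sees the new data */
}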
For the more authoritative, but pretty heavy going, version have a read of ARM's Barrier Litmus Tests and Cookbook. Be warned, this stuff can be confusing ;)
As an aside, in this case the architectural semantics of acquire/release complicate things further. Since they are only one-way barriers, whilst OSAtomicAdd32Barrier adds up to a full barrier relative to code before and after it, it doesn't actually guarantee any ordering relative to the atomic operation itself - see this discussion from Linux for more explanation. Of course, that's from the theoretical point of view of the architecture; in reality it's not inconceivable that the A7 hardware has taken the 'simple' option of wiring up LDAXR to just do DMB+LDXR, and so on, meaning they can get away with this since they're at liberty to code to their own implementation, rather than the specification.
OSAtomicAdd32Barrier() exists for people that are using OSAtomicAdd() for something beyond just atomic increment. Specifically, they are implementing their own multi-processing synchronization primitives based on OSAtomicAdd(). For example, creating their own mutex library. OSAtomicAdd32Barrier() uses heavy barrier instructions to enforce memory ordering on both sides of the atomic operation. This is not desirable in normal usage.
To summarize:
1) If you just want to increment an integer in a thread-safe way, use OSAtomicAdd32()
2) If you are stuck with a bunch of old code that foolishly assumes OSAtomicAdd32() can be used as an interprocessor memory ordering and speculation barrier, replace it with OSAtomicAdd32Barrier()
I would guess that this is simply a way of reproducing existing architecture-independent semantics for this operation.
With the ldaxr/stlxr pair, the above sequence will assure correct ordering if the AtomicAdd32 is used as a synchronization mechanism (mutex/semaphore) - regardless of whether the resulting higher-level operation is an acquire or release.
So - this is not about enforcing consistency of the atomic add, but about enforcing ordering between acquiring/releasing a mutex and any operations performed on the resource protected by that mutex.
It is less efficient than the ldaxr/stxr or ldxr/stlxr you would use in a normal native synchronization mechanism, but if you have existing platform-independent code expecting an atomic add with those semantics, this is probably the best way to implement it.
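For what it's worth, here is a rough C11 approximation of the two variants (my own mapping and function names, not Apple's implementation): the plain version only needs the add itself to be atomic, while the Barrier version also needs ordering on both sides of it.
#include <stdatomic.h>
#include <stdint.h>

int32_t MyAtomicAdd32(int32_t amount, _Atomic int32_t *value)
{
    /* atomic, but imposes no ordering on surrounding memory accesses */
    return atomic_fetch_add_explicit(value, amount, memory_order_relaxed) + amount;
}

int32_t MyAtomicAdd32Barrier(int32_t amount, _Atomic int32_t *value)
{
    /* atomic and fully ordered with respect to other memory accesses */
    return atomic_fetch_add_explicit(value, amount, memory_order_seq_cst) + amount;
}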
I'm studying multithreading and trying to understand the concept of semaphores and mutual exclusion. Most of the examples I find online use some sort of library (e.g. pthread) to implement the semaphore or mutex, but I'm more interested in the implementation of a simple semaphore that establishes a critical section -- no more than one thread accessing a particular region of memory.
For this task, I believe I would need a mutex (a.k.a. a binary semaphore if I understand the terminology correctly). I can see how the semaphore would prevent a race condition by "locking" the section of code to a single thread, but what prevents a race condition from occurring at the semaphore itself?
I imagine a binary semaphore to hold an int value to keep track of the lock:
Semaphore
---------
int lock = 1;
unsigned P(void){
if(lock > 0){
lock--;
return 0; /* success */
}
return 1; /* fail */
}
void V(void){
lock++;
}
Suppose two threads call the P function at the same time, they both reach the if(lock > 0) check at the same time and evaluate the condition as true -- this creates a race condition where both threads are granted access to the same region of memory at the same time.
So what prevents this race condition from occurring in real world implementations of semaphores?
Locking and releasing semaphores and/or mutexes happen as atomic operations; this means the CPU cannot be taken away from the current process while they run. This ensures that as soon as a mutex lock is started (it consists of either a single CPU instruction or a few), the process keeps the CPU until the locking/releasing is done.
There are also different ways to implement threading, which can either be supported directly by the CPU (kernel space) or provided through a library (such as pthreads) in user space.
From OSDev.org
An atomic operation is an operation that will always be executed without any other process being able to read or change state that is read or changed during the operation. It is effectively executed as a single step, and is an important quality in a number of algorithms that deal with multiple independent processes, both in synchronization and in algorithms that update shared data without requiring synchronization.
Here is a nice article on atomicity, too (although in Delphi).
The most common (although definitely not the only) way to implement most locking primitives is the compare-and-set instruction. A normal move instruction would just set the value of a memory location to whatever value you ask it to, while a compare-and-set instruction does "atomically set this memory location to value X only if the value of the memory location is Y, then set some flag indicating whether the operation succeeded or not". The key word is "atomic": the CPU can, in hardware, make sure that nothing else can interfere with that operation.
Using a compare-and-swap instruction, your example P could be implemented as:
int oldlock;
retry:
oldlock = lock;
if (oldlock > 0) {
    if (!compare_and_swap(&lock, oldlock, oldlock - 1))
        goto retry; /* another thread changed lock first; try again */
    return 0;
}
return 1;
Of course reality is much more complex than that, but compare-and-set is easy to understand and explain and has the nice property that it can implement (almost?) all other locking primitives.
Here's a wikipedia article.
The difference between a semaphore (or a mutex) and a "normal" variable isn't that big. Those libraries which offer you this functionality just make sure that the semaphore is only accessed through atomic operations. There are multiple ways to achieve that:
Special assembly instructions which guarantee atomic access, e.g.: TSL or XCHG.
Turning off the scheduler's interrupts before the variable is accessed and turning them back on afterwards, so the scheduler can't remove your process from the CPU. But be aware that this only works on single-CPU systems.
Using language-specific features like Java's synchronized keyword.
An example of how to use the TSL instruction:
enter_region: ; A "jump to" tag; function entry point.
tsl reg, flag ; Test and Set Lock; flag is the
; shared variable; it is copied
; into the register reg and flag
; then atomically set to 1.
cmp reg, #0 ; Was flag zero on enter_region?
jnz enter_region ; Jump to enter_region if
; reg is non-zero; i.e.,
; flag was non-zero on entry.
ret ; Exit; i.e., flag was zero on
; entry. If we get here, tsl
; will have set it non-zero; thus,
; we have claimed the resource
; associated with flag.
leave_region:
move flag, #0 ; store 0 in flag
ret ; return to caller
By the way, as you already and correctly pointed out, a mutex is just a special kind of semaphore, only allowing FALSE (represented by 0 in C) and TRUE (represented by 1 or any other value != 0) for its internal int variable, thus making it a so-called binary semaphore.
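For comparison, here is a minimal C11 sketch of the same enter_region/leave_region pair (my own, using <stdatomic.h>): atomic_flag_test_and_set plays the role of the TSL instruction, atomically setting the flag and returning its previous value.
#include <stdatomic.h>

static atomic_flag flag = ATOMIC_FLAG_INIT;

void enter_region(void)
{
    while (atomic_flag_test_and_set(&flag))
        ;                        /* flag was already set: spin until the owner clears it */
}

void leave_region(void)
{
    atomic_flag_clear(&flag);    /* store 0 in flag, releasing the resource */
}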
Two common locking idioms are:
if (!atomic_swap(lockaddr, 1)) /* got the lock */
and:
if (!atomic_compare_and_swap(lockaddr, 0, val)) /* got the lock */
where val could simply be a constant or an identifier for the new prospective owner of the lock.
What I'd like to know is whether there tends to be any significant performance difference between the two on x86 (and x86_64) machines. I know this is a fairly broad question since the answer might vary a lot between individual cpu models, but that's part of the reason I'm asking SO rather than just doing benchmarks on a few cpus I have access to.
I assume atomic_swap(lockaddr, 1) gets translated to an xchg reg,mem instruction and atomic_compare_and_swap(lockaddr, 0, val) gets translated to a cmpxchg[8b|16b].
Some Linux kernel developers think cmpxchg is faster, because the lock prefix isn't implied as it is with xchg. So if you are on a uniprocessor, or can otherwise make sure the lock isn't needed, you are probably better off with cmpxchg.
But chances are your compiler will translate it to a "lock cmpxchg" and in that case it doesn't really matter.
Also note that while latencies for these instructions are low (1 cycle without lock and about 20 with lock), if you happen to use a common sync variable between two threads, which is quite usual, some additional bus cycles will be enforced, which last forever compared to the instruction latencies. These will most likely be completely hidden by a 200 or 500 CPU cycle long cache snoop/sync/mem access/bus lock/whatever.
I found this Intel document, stating that there is no difference in practice:
http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/
One common myth is that the lock utilizing a cmpxchg instruction is cheaper than a lock utilizing an xchg instruction. This is used because cmpxchg will not attempt to get the lock in exclusive mode since the cmp will go through first. Figure 9 shows that the cmpxchg is just as expensive as the xchg instruction.
On x86, any instruction with a LOCK prefix does all memory operations as read-modify-write cycles. This means that XCHG (with its implicit LOCK) and LOCK CMPXCHG (in all cases, even if the comparison fails) always get an exclusive lock on the cache line. The result is that there is basically no difference in performance.
Note that many CPUs all spinning on the same lock can cause a lot of bus overhead in this model. This is one reason that spin-lock loops should contain PAUSE instructions. Some other architectures have better operations for this.
Are you sure you didn't mean
if (!atomic_load(lockaddr)) {
if (!atomic_swap(lockaddr, val)) /* got the lock */
for the second one?
Test and test and set locks (see Wikipedia https://en.wikipedia.org/wiki/Test_and_test-and-set ) are a quite common optimization for many platforms.
Depending on how compare and exchange is implemented it could be faster or slower than a test and test and set.
As x86 is a relatively strongly ordered platform, hardware optimizations that may make test and test-and-set locks faster may be less applicable.
Figure 8 from the document that Bo Persson found
http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/ shows that Test and Test and Set locks are superior in performance.
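A minimal C11 sketch of a test-and-test-and-set lock (my own illustration; __builtin_ia32_pause is a GCC/Clang x86 intrinsic and is an assumption here): spin on a plain load, and only attempt the expensive exchange once the lock looks free.
#include <stdatomic.h>

static _Atomic int lockvar;      /* 0 = free, 1 = held */

void ttas_lock(void)
{
    for (;;) {
        while (atomic_load_explicit(&lockvar, memory_order_relaxed) != 0)
            __builtin_ia32_pause();                         /* "test": read-only spin, no bus traffic */
        if (atomic_exchange_explicit(&lockvar, 1, memory_order_acquire) == 0)
            return;                                         /* "test-and-set": we got the lock */
    }
}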
Using xchg vs cmpxchg to acquire the lock
In terms of performance on Intel's processors, it is the same, but for the sake of simplicity, to have things easier to fathom, I prefer the first way from the examples that you have given. There is no reason to use cmpxchg for acquiring a lock if you can do this with xchg.
According to the Occam's razor principle, simple things are better.
Besides that, locking with xchg is more powerful - you can also check the correctness of the logic of your software, i.e. that you are not accessing a memory byte that has not been explicitly allocated for locking, and that you are using a correctly initialized synchronization variable. You will also be able to check that you don't unlock twice.
Using normal memory store to release the lock
There is no consensus on whether writing to the synchronization variable on releasing the lock should be done with just a normal memory store (mov) or a bus-locking memory store, i.e. an instruction with implicit or explicit lock-prefix, like xchg.
The approach of using a normal memory store to release the lock was recommended by Peter Cordes; see the comments below for details.
But there are implementations where both acquiring and releasing the lock are done with a bus-locking memory store, since this approach seems straightforward and intuitive. For example, LeaveCriticalSection under Windows 10 uses a bus-locking store to release the lock even on a single-socket processor; on multiple physical processors with Non-Uniform Memory Access (NUMA), this issue is even more important.
I've done micro-benchmarking on a memory manager that does lots of memory allocate/reallocate/free, on a single-socket CPU (Kaby Lake). When there is no contention, i.e. there are fewer threads than physical cores, the tests complete about 10% slower with locked release, but when there are more threads than physical cores, tests with locked release complete 2% faster. So, on average, a normal memory store to release the lock outperforms a locked memory store.
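In C11 terms, the two release styles discussed above look like this (a sketch of mine, not the author's code): on x86, the first compiles to a plain mov, the second to a bus-locking xchg.
#include <stdatomic.h>

void release_plain(_Atomic int *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);   /* normal memory store */
}

void release_locked(_Atomic int *lock)
{
    (void)atomic_exchange(lock, 0);                          /* bus-locking store */
}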
Example of locking that checks synchronization variable for validity
See this example (Delphi programming language) of safer locking functions that check the data of a synchronization variable for validity, and catch attempts to release locks that were not acquired:
const
  cLockAvailable  = 107; // arbitrary constants, use any unique values that you like; I've chosen prime numbers
  cLockByteLocked = 109;
  cLockFinished   = 113;

function AcquireLock(var Target: LONG): Boolean;
var
  R: LONG;
begin
  R := InterlockedExchange(Target, cLockByteLocked);
  case R of
    cLockAvailable:
      Result := True;  // we've got a value that indicates that the lock was available, so return True to the caller indicating that we have acquired the lock
    cLockByteLocked:
      Result := False; // we've got a value that indicates that the lock was already acquired by someone else, so return False to the caller indicating that we have failed to acquire the lock this time
  else
    begin
      raise Exception.Create('Serious application error - tried to acquire lock using a variable that has not been properly initialized');
    end;
  end;
end;

procedure ReleaseLock(var Target: LONG);
var
  R: LONG;
begin
  // As Peter Cordes pointed out (see comments below), releasing the lock doesn't have to be interlocked, just a normal store. Even for debugging we use a normal load. However, Windows 10 uses a locked release in LeaveCriticalSection.
  R := Target;
  Target := cLockAvailable;
  if R <> cLockByteLocked then
  begin
    raise Exception.Create('Serious application error - tried to release a lock that has not been actually locked');
  end;
end;
Your main application goes here:
var
  AreaLocked: LONG;
begin
  AreaLocked := cLockAvailable; // on program initialization, fill the default value
  ....
  if AcquireLock(AreaLocked) then
  try
    // do something critical with the locked area
    ...
  finally
    ReleaseLock(AreaLocked);
  end;
  ....
  AreaLocked := cLockFinished; // on program termination, set the special value to catch probable cases when somebody will try to acquire the lock
end.
Efficient pause-based spin-wait loops
Test, test-and-set
You may also use the following assembly code (see the "Assembly code example of pause-based spin-wait loop" section below) as a working example of the "pause"-based spin-wait loop.
This code uses a normal memory load while spinning, to save resources, as suggested by Peter Cordes. This technique is called "test, test-and-set". You can find out more on this technique at https://stackoverflow.com/a/44916975/6910868
Number of iterations
The pause-based spin-wait loop in this example first tries to acquire the lock by reading the synchronization variable, and if it is not available, utilizes the pause instruction in a loop of 5000 iterations. After 5000 iterations it calls the Windows API function SwitchToThread(). This value of 5000 is empirical; it is based on my tests. Values from 500 to 50000 also seem to be OK, but in some scenarios lower values are better while in other scenarios higher values are better. You can read more on pause-based spin-wait loops at the URL that I gave in the preceding paragraph.
Availability of the pause instruction
Please note that you may use this code only on processors that support SSE2 - you should check the corresponding CPUID bit before calling the pause instruction - otherwise you will just be wasting power. On processors without pause, use other means, like EnterCriticalSection/LeaveCriticalSection or Sleep(0) and then Sleep(1) in a loop. Some people say that on 64-bit processors you need not check for SSE2 to make sure that the pause instruction is implemented, because the original AMD64 architecture adopted Intel's SSE and SSE2 as core instructions and, practically, if you are running 64-bit code, you already have SSE2 and thus the pause instruction. However, Intel discourages relying on the presence of a specific feature and explicitly states that a feature may vanish in future processors and that applications must always check features via CPUID. That said, the SSE instructions became ubiquitous and many 64-bit compilers use them without checking (e.g. Delphi for Win64), so the chances that some future processor will lack SSE2, let alone pause, are very slim.
Assembly code example of pause-based spin-wait loop
// on entry rcx = address of the byte-lock
// on exit: al (eax) = old value of the byte at [rcx]
#Init:
mov edx, cLockByteLocked
mov r9d, 5000
mov eax, edx
jmp #FirstCompare
#DidntLock:
#NormalLoadLoop:
dec r9
jz #SwitchToThread // for static branch prediction, jump forward means "unlikely"
pause
#FirstCompare:
cmp [rcx], al // we use a faster, normal load so as not to consume resources, and only once the lock looks free do we do the interlocked exchange again
je #NormalLoadLoop // for static branch prediction, jump backwards means "likely"
lock xchg [rcx], al
cmp eax, edx // 32-bit comparison is faster on newer processors like Xeon Phi or Cannonlake.
je #DidntLock
jmp #Finish
#SwitchToThread:
push rcx
call SwitchToThreadIfSupported
pop rcx
jmp #Init
#Finish: