Lock-free atomic operations on ARM architectures prior to ARMv6 - c

Are there any implementations thereof in C? All those I've seen so far are based on the LDREX/STREX instructions, which were introduced only in the ARMv6 architecture. The only possible solution for previous architectures seems to be to disable/enable IRQs, which makes the operations blocking.

Are there any implementations thereof in C?
No. It is impossible to implement this in 'C' without support from the compiler or assembly (assembly support in the compiler). 'C' itself has no construct that guarantees something executes atomically.
The only possible solution for previous architectures seems to be to disable/enable IRQs, which makes the operations blocking.
Many lock-free algorithms need CAS (compare-and-swap). swp and swpb can be used for some primitive exchange-based operations, but they are not CAS. For instance, with four sources and one consumer you can give each of four bytes to a source, updated with swpb, and have the consumer use swp to transfer the four 'work' bytes at once. Most ARM CPUs prior to ARMv6 are single-core, and disabling interrupts is the common way to do things. ARMv6 cores have support for LDREX/STREX. The swp instruction is not multi-CPU friendly, as it locks the entire bus for the transaction (read/write). However, swp can be used for spin locks if it is the only thing available.
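For reference, here is a minimal spin-lock sketch for pre-ARMv6 ARM built on swp, assuming a word-sized lock where 0 means unlocked and 1 means locked; the wrapper names and inline-assembly constraints are illustrative, not a drop-in implementation:

typedef volatile unsigned int spinlock_t;

static inline void spin_lock(spinlock_t *lock)
{
    unsigned int old;
    do {
        /* Atomically: old = *lock; *lock = 1. Locks the bus for the exchange. */
        __asm__ __volatile__("swp %0, %1, [%2]"
                             : "=&r" (old)
                             : "r" (1), "r" (lock)
                             : "memory");
    } while (old != 0);   /* retry until we observed 0 (unlocked) */
}

static inline void spin_unlock(spinlock_t *lock)
{
    __asm__ __volatile__("" ::: "memory");  /* compiler barrier */
    *lock = 0;                              /* plain store releases the lock */
}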
Linux has support for a 'compare and exchange' with OS help (the ARM kernel user helpers). The gist is that a small fixed assembler sequence does the compare and exchange, and the interrupt and data-abort code is hooked so that if this sequence is interrupted it is restarted.
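For user-space code on such kernels, calling the helper usually looks something like the sketch below; the fixed address 0xffff0fc0 and the zero-on-success return convention come from the kernel's kernel_user_helpers documentation, while the wrapper name is made up here:

/* The kernel-provided cmpxchg helper lives at a fixed address in the vector page. */
typedef int (*kuser_cmpxchg_t)(int oldval, int newval, volatile int *ptr);
#define kuser_cmpxchg ((kuser_cmpxchg_t)0xffff0fc0)

/* Illustrative: atomically add 'amount' to *ptr using the kernel-assisted CAS. */
static int kuser_atomic_add(volatile int *ptr, int amount)
{
    int old;
    do {
        old = *ptr;
    } while (kuser_cmpxchg(old, old + amount, ptr) != 0);  /* 0 means the swap succeeded */
    return old + amount;
}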

Related

Do I need to use smp_mb() after binding the CPU

Suppose my system is a multicore system. If I bind my program to one CPU core, do I still need smp_mb() to guard against the CPU reordering instructions?
I ask because I know that smp_mb() is not necessary on single-core systems, but I'm not sure whether that reasoning applies here.
You rarely need a full barrier anyway; usually acquire/release is enough. And usually you want to use C11 atomic_load_explicit(&var, memory_order_acquire), or in Linux kernel code one of its functions for an acquire-load, which can be done more efficiently on some ISAs than a plain load plus an acquire barrier (notably AArch64, or 32-bit ARMv8 with ldar or ldapr).
But yeah, if all threads are sharing the same logical core, run-time memory reordering is impossible, only compile-time. So you just need a compiler memory barrier like asm("" ::: "memory") or C11 atomic_signal_fence(seq_cst), not a CPU run-time barrier like atomic_thread_fence(seq_cst) or the Linux kernel's SMP memory barrier (smp_mb() is x86 mfence or equivalent, or ARM dmb ish, for example).
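As a sketch of the difference, assuming plain C11 code (the variable names are illustrative):

#include <stdatomic.h>

int data;
atomic_int ready;

/* All threads pinned to the same logical core: a compiler barrier is enough. */
void publish_same_core(void)
{
    data = 42;
    atomic_signal_fence(memory_order_seq_cst);               /* compile-time barrier only */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Threads on different cores: the store itself needs run-time ordering too. */
void publish_smp(void)
{
    data = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);  /* pairs with an acquire load */
}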
See Why memory reordering is not a problem on single core/processor machines? for more details about the fact that all instructions on the same core observe memory effects to have happened in program order, regardless of interrupts. e.g. a later load must see the value from an earlier store, otherwise the CPU is not maintaining the illusion of instructions on that core running in program order.
And if you can convince your compiler to emit atomic RMW instructions without the x86 lock prefix, for example, they'll be atomic wrt. context switches (and interrupts in general). Or use gcc -Wa,-momit-lock-prefix=yes to have GAS remove lock prefixes for you, so you can use <stdatomic.h> functions efficiently. At least on x86; for RISC ISAs, there's no way to do a read-modify-write of a memory location in a single instruction.
Or if there is (ARMv8.1), it implies an atomic RMW that's SMP-safe, like x86 lock add [mem], eax. But on a CISC like x86, we have instructions like add [mem], eax or whatever which are just like separate load / ADD / store glued into a single instruction, which either executes fully or not at all before an interrupt. (Note that "executing" a store just means writing into the store buffer, not globally visible cache, but that's sufficient for later code on the same core to see it.)
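As a rough illustration of that x86 behaviour (a GNU C inline-asm sketch; the memory-destination add is atomic with respect to interrupts on the same core, but without a lock prefix it is not SMP-safe):

static inline void add_irq_atomic(volatile int *p, int v)
{
    /* One instruction, so an interrupt cannot split the load/add/store. */
    __asm__ __volatile__("addl %1, %0"
                         : "+m" (*p)
                         : "ri" (v)
                         : "cc", "memory");
}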
See also Is x86 CMPXCHG atomic, if so why does it need LOCK? for more about non-locked use-cases.

MultiCore Shared Memory Atomic Operations

Would it be necessary to use a mutex for atomic operations on shared memory, in a multicore environment, where one CPU is only ever reading and the other CPU is only ever writing? I am guessing that this may depend on architecture, so if an example is needed then ARM (Cortex) and/or ESP32?
I already know that a mutex is not needed for atomic operations in a single-core environment where one thread is only ever reading and the other thread only ever writing (https://www.freertos.org/FreeRTOS_Support_Forum_Archive/May_2019/freertos_Shared_variable_between_a_write_thred_and_a_read_thread_a0408decbaj.html).
One solution that has been around for decades (I already used this 30 years ago) is the concept of mailboxes.
Simplest mailbox is just a structure or buffer with a flag. This flag should be of the minimum size that can be accessed in an atomic operation from both processors sharing the memory. It should also be located at a memory address that both processors see as "aligned", to ensure single-cycle read/write accesses, e.g. 32-bit word boundaries in the case of 32-bit ARM processors. This might be trickier to arrange on non-RISC architectures.
The flag usage is very simple. The processor that writes the data waits for the flag to be signalled as "buffer empty" (maybe a simple null value), then writes the data to the mailbox's buffer and signals "buffer not empty" by setting a magic number in the flag (maybe a non-null value).
The processor receiving the data just has to wait for the flag to be signalled as "buffer not empty" before reading the data, and then set the flag back to "buffer empty".
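A minimal sketch of such a mailbox, with illustrative field names and flag values (on an out-of-order multicore system you would additionally need a barrier so the payload is visible before the flag changes, hence the __sync_synchronize() calls):

#include <stdint.h>

#define MAILBOX_EMPTY      0u
#define MAILBOX_NOT_EMPTY  0xA5A5A5A5u   /* the "magic" non-null marker */

typedef struct {
    volatile uint32_t flag;   /* aligned word, readable/writable in one access */
    uint8_t payload[64];      /* the shared data buffer */
} mailbox_t;

/* Writer core: wait for "buffer empty", fill the payload, then signal. */
void mailbox_send(mailbox_t *mb, const uint8_t *data, uint32_t len)
{
    while (mb->flag != MAILBOX_EMPTY)
        ;                                    /* poll until the consumer is done */
    for (uint32_t i = 0; i < len && i < sizeof mb->payload; i++)
        mb->payload[i] = data[i];
    __sync_synchronize();                    /* data must be visible before the flag */
    mb->flag = MAILBOX_NOT_EMPTY;
}

/* Reader core: wait for "buffer not empty", consume, then mark empty. */
void mailbox_receive(mailbox_t *mb, uint8_t *out, uint32_t len)
{
    while (mb->flag != MAILBOX_NOT_EMPTY)
        ;
    for (uint32_t i = 0; i < len && i < sizeof mb->payload; i++)
        out[i] = mb->payload[i];
    __sync_synchronize();
    mb->flag = MAILBOX_EMPTY;
}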
Whether you have primitives supporting this mechanism without relying on constant polling of the flag depends heavily on your hardware and operating system.
I've used this mechanism in heterogeneous architectures (a processor plus a co-processor of different architectures/capabilities running different applications), but homogeneous multicore processors are well supported by many RTOSes today, including freeRTOS, and other mechanisms such as queues and semaphores/mutexes are probably more appropriate for the synchronization part. Some current SoCs support hardware semaphores and memory-access interrupts that can improve performance greatly.
EDIT:
There is one freeRTOS feature that can assist you here: message buffers. There is an example using ST's STM32H745 dual-core SoC [here], which comes with a companion article [here] written by freeRTOS's Richard Barry.
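To give a rough idea of the message-buffer API itself (a single-application sketch using the standard FreeRTOS calls; the dual-core setup in the linked example additionally needs inter-core notification hooks, and the sizes and task names here are illustrative):

#include "FreeRTOS.h"
#include "task.h"
#include "message_buffer.h"

#define MB_STORAGE_BYTES 128   /* illustrative buffer size */

static MessageBufferHandle_t xDataBuffer;

void vSenderTask(void *pvParameters)
{
    const char msg[] = "sensor reading";
    for (;;) {
        /* Blocks until there is room for the whole message. */
        xMessageBufferSend(xDataBuffer, msg, sizeof msg, portMAX_DELAY);
        vTaskDelay(pdMS_TO_TICKS(100));
    }
}

void vReceiverTask(void *pvParameters)
{
    char rx[MB_STORAGE_BYTES];
    for (;;) {
        /* Returns the length of the message actually received. */
        size_t len = xMessageBufferReceive(xDataBuffer, rx, sizeof rx, portMAX_DELAY);
        (void)len;
    }
}

void vSetupMailboxDemo(void)
{
    xDataBuffer = xMessageBufferCreate(MB_STORAGE_BYTES);
}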

Software mutex not working with out of order execution

In the article https://en.m.wikipedia.org/wiki/Mutual_exclusion#Software_solutions it is stated that
These algorithms do not work if out-of-order execution is used on the platform that executes them. Programmers have to specify strict ordering on the memory operations within a thread.
But those algorithms are simple C programs, and if we can't be sure they work as expected on out-of-order systems, how can we be sure our other programs will work correctly?
What do these programs do that fails under out-of-order execution?
Basically, when do programs stop working on out-of-order processors, so that we can be careful in their usage?
Can we trust out-of-order processors to do what we code?
There are no problems with out-of-order execution within a program (more precisely, within a thread). It can only lead to problems when there is concurrency (two or more threads running in parallel), for example with a software mutex.
Barriers such as mfence (or lfence and sfence in special cases) can help on x86 platforms. They tell the processor that memory operations must not be reordered across that point. These are assembler instructions, so in C you have to write
asm volatile("mfence" ::: "memory");
or use the corresponding intrinsic (_mm_mfence()).
Another problem can be that the compiler arranges the instructions differently than they appear in the program, or makes other optimizations (for example, does not write the values to memory at all but keeps them in a register). To prevent this, the keyword volatile must be used on the variables you use for the software mutex.
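To make the failure mode concrete, here is a sketch of Peterson's algorithm (one of the software mutexes the Wikipedia article covers) with the barrier it needs; without the fence, each thread's store to its flag can be delayed past its load of the other thread's flag, so both threads may enter the critical section at once:

#include <stdatomic.h>
#include <stdbool.h>

volatile bool flag[2];   /* flag[i]: thread i wants to enter */
volatile int turn;

void peterson_lock(int self)
{
    int other = 1 - self;
    flag[self] = true;
    turn = other;
    /* Full barrier (mfence-class): the stores above must be globally
       visible before the loads below are performed. */
    atomic_thread_fence(memory_order_seq_cst);
    while (flag[other] && turn == other)
        ;                /* busy-wait */
}

void peterson_unlock(int self)
{
    flag[self] = false;
}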

What is meant by interleaved "multi-threading" in c?

Can anybody explain what interleaved multi-threading means?
Real-world examples are also welcome.
This is what Intel describes as hyper-threading: the CPU has a single core with two register sets.
These can be used to increase the utilisation of the core.
This is transparent to code, in that it behaves like two cores; however, only one of them can execute at a time.
If your code is multi-threaded, it still needs atomics, mutexes, etc.
The wiki says:
The purpose of interleaved multithreading is to remove all data dependency stalls from the execution pipeline. Since one thread is relatively independent from other threads, there's less chance of one instruction in one pipe stage needing an output from an older instruction in the pipeline. Conceptually, it is similar to preemptive multitasking used in operating systems; an analogy would be that the time slice given to each active thread is one CPU cycle.
This Wikipedia page explains it pretty well.
Basically it's about interleaving instructions from different OS-level threads on the CPU, to reduce the risk of costly dependencies between instructions.

builtin gcc spinlock

How can I be sure the data that is written by multiple CPU cores during a mutex lock is synchronized across all L1 caches of all cores ? I am not talking about the variable that represents the lock, I am talking about the memory locations that are involved during the lock.
This is for Linux, x86_64, and my code is:
#include <sys/types.h>
#include "dlog.h"
uint *dlog_line;
volatile int dlog_lock;
char *dlog_get_new_line(void) {
    uint val;

    /* Spin until the CAS succeeds in taking the lock (0 -> 1). */
    while (!__sync_bool_compare_and_swap(&dlog_lock, 0, 1))
        ;

    /* Critical section: update the shared line counter. */
    val = *dlog_line;
    if (val == DT_DLOG_MAX_LINES)
        val = 0;
    *dlog_line = val;

    /* Release the lock. */
    dlog_lock = 0;
}
Here, inside the dlog_get_new_line() function, I use a gcc builtin function, so there shouldn't be any problem with acquiring the lock. But how can I ensure that when the lock is released, the value pointed to by *dlog_line propagates into the L1 caches of all the other CPU cores in the system?
I do not use pthreads; each process runs on a different CPU core.
What you're interested in is called cache coherence. This is done automatically by the hardware.
So in short, you don't have to do anything if you are correctly using __sync_bool_compare_and_swap() (or any other locking intrinsic).
As an oversimplified explanation, the thread will not return from the call to __sync_bool_compare_and_swap() until all the other processors are able to see the new value or are aware that their local copy is out of date.
If you're interested in what happens underneath (in the hardware), there are various cache coherence algorithms that are used to ensure that a core doesn't read an outdated copy of data.
Here's a partial list of commonly taught protocols:
MSI
MESI
Firefly
Modern hardware will typically have much more complicated algorithms for it.
Gcc has two other builtins that were invented exactly for the purpose you describe: __sync_lock_test_and_set and __sync_lock_release. They have so-called acquire/release semantics, which guarantees that stores to other variables are visible as you need them while you hold your spinlock. These requirements are a bit weaker than what __sync_bool_compare_and_swap provides, so better use the tools that are tailored for the job.
They adapt well to the capabilities of different hardware. E.g. on my x86_64 this puts an mfence instruction before the final atomic store into dlog_lock, but on different hardware it will be adapted to the available instruction set.
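A sketch of the question's lock rewritten with those builtins (same dlog_lock variable; the wrapper names are only for illustration):

/* Acquire: __sync_lock_test_and_set() atomically stores 1 and returns the
   previous value, with acquire semantics; spin while it was already 1. */
static void dlog_lock_acquire(volatile int *lock)
{
    while (__sync_lock_test_and_set(lock, 1))
        ;
}

/* Release: __sync_lock_release() stores 0 with release semantics, so writes
   done inside the critical section (e.g. to *dlog_line) are visible to the
   next core that acquires the lock. */
static void dlog_lock_release(volatile int *lock)
{
    __sync_lock_release(lock);
}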
