Suppose my system is a multicore system. If I bind my program to a single CPU core, do I still need smp_mb() to guard against the CPU reordering memory operations?
I ask because I know that smp_mb() is not necessary on single-core systems, but I'm not sure whether the same reasoning applies when a program is merely pinned to one core of a multicore system.
You rarely need a full barrier anyway; usually acquire/release is enough. And usually you want to use C11 atomic_load_explicit(&var, memory_order_acquire), or in Linux kernel code one of its acquire-load functions (e.g. smp_load_acquire()), which can be done more efficiently on some ISAs than a plain load plus an acquire barrier (notably AArch64 or 32-bit ARMv8 with ldar or ldapr).
But yeah, if all threads share the same logical core, run-time memory reordering is impossible; only compile-time reordering matters. So you just need a compiler memory barrier like asm("" ::: "memory") or C11 atomic_signal_fence(memory_order_seq_cst), not a CPU run-time barrier like atomic_thread_fence(memory_order_seq_cst) or the Linux kernel's SMP memory barrier (smp_mb() is x86 mfence or ARM dmb ish, for example).
See Why memory reordering is not a problem on single core/processor machines? for more details about the fact that all instructions on the same core observe memory effects to have happened in program order, regardless of interrupts. e.g. a later load must see the value from an earlier store, otherwise the CPU is not maintaining the illusion of instructions on that core running in program order.
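As a minimal sketch of that distinction with C11 atomics (variable names are just for illustration): the signal fence compiles to zero instructions but still blocks compile-time reordering, while publishing to readers on other cores needs a real release operation.

#include <stdatomic.h>

int data;                  /* plain payload written before the flag */
_Atomic int flag;

void publish_same_core(int v) {
    data = v;
    /* all threads pinned to one core (assumed): only compile-time reordering
       matters, so a compiler-only barrier is enough -- it emits no instructions */
    atomic_signal_fence(memory_order_seq_cst);
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

void publish_smp(int v) {
    data = v;
    /* readers on other cores: need a real release store, which may compile to a
       special instruction (e.g. stlr on AArch64) or a barrier plus a store */
    atomic_store_explicit(&flag, 1, memory_order_release);
}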
And if you can convince your compiler to emit atomic RMW instructions without the x86 lock prefix, for example, they'll be atomic wrt. context switches (and interrupts in general). Or use gcc -Wa,-momit-lock-prefix=yes to have GAS remove lock prefixes for you, so you can use <stdatomic.h> functions efficiently. At least on x86; for RISC ISAs, there's no way to do a read-modify-write of a memory location in a single instruction.
Or if there is (ARMv8.1), it implies an atomic RMW that's SMP-safe, like x86 lock add [mem], eax. But on a CISC like x86, we have instructions like add [mem], eax or whatever which are just like separate load / ADD / store glued into a single instruction, which either executes fully or not at all before an interrupt. (Note that "executing" a store just means writing into the store buffer, not globally visible cache, but that's sufficient for later code on the same core to see it.)
See also Is x86 CMPXCHG atomic, if so why does it need LOCK? for more about non-locked use-cases.
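As a rough sketch of that same-core-only idea on x86 (a hypothetical helper using GNU C inline asm; it is atomic with respect to interrupts and context switches on the core that runs it, but NOT safe against concurrent writers on other cores):

/* one memory-destination RMW instruction, no lock prefix */
static inline void percpu_counter_inc(volatile unsigned *p) {
    asm volatile ("addl $1, %0" : "+m" (*p));
}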
Would it be necessary to use a mutex for atomic operations on shared memory, in a multicore environment, where one CPU is only ever reading and the other CPU is only ever writing? I am guessing this may depend on the architecture, so if an example is needed, assume ARM (Cortex) and/or ESP32.
I already know that a mutex is not needed for atomic operations in a single-core environment where one thread is only ever reading and the other thread only ever writing (https://www.freertos.org/FreeRTOS_Support_Forum_Archive/May_2019/freertos_Shared_variable_between_a_write_thred_and_a_read_thread_a0408decbaj.html).
One solution that has been around for decades (I already used this 30 years ago) is the concept of mailboxes.
The simplest mailbox is just a structure or buffer plus a flag. The flag should be of the minimum size that can be accessed atomically by both processors sharing the memory. It should also be located at a memory address that both processors see as aligned, to ensure single-cycle read/write accesses, e.g. a 32-bit word boundary in the case of 32-bit ARM processors. This might be tricky to guarantee on non-RISC-like architectures.
The flag usage is very simple. The processor that writes the data waits for the flag to signal "buffer empty" (maybe simply a null value), then writes the data into the mailbox's buffer and signals "buffer not empty" by setting a magic number in the flag (maybe any non-null value).
The processor receiving the data just waits for the flag to signal "buffer not empty" before reading the data, then sets the flag back to "buffer empty".
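A minimal sketch of that flag protocol in C (type and field names are made up for illustration; as discussed elsewhere on this page, a weakly ordered multicore system would additionally want release/acquire semantics or barriers around the flag accesses):

#include <stdint.h>

typedef struct {
    volatile uint32_t flag;   /* 0 = "buffer empty", non-zero = "buffer not empty";
                                 32-bit and word-aligned so both CPUs access it in one cycle */
    uint8_t data[64];         /* the payload buffer */
} mailbox_t;

#define MAILBOX_MAGIC 0xA5A5A5A5u

/* sending side */
void mailbox_send(mailbox_t *mb, const uint8_t *src, uint32_t len) {
    while (mb->flag != 0)                     /* wait for "buffer empty" */
        ;
    for (uint32_t i = 0; i < len && i < sizeof mb->data; i++)
        mb->data[i] = src[i];
    mb->flag = MAILBOX_MAGIC;                 /* signal "buffer not empty" */
}

/* receiving side */
void mailbox_receive(mailbox_t *mb, uint8_t *dst, uint32_t len) {
    while (mb->flag == 0)                     /* wait for "buffer not empty" */
        ;
    for (uint32_t i = 0; i < len && i < sizeof mb->data; i++)
        dst[i] = mb->data[i];
    mb->flag = 0;                             /* signal "buffer empty" again */
}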
Whether or not you have primitives supporting this mechanism without relying on constant flag polling is tightly dependent on your hardware and operating system.
I've used this mechanism in heterogeneous architectures (processor + co-processor of different architectures/capabilities running different applications), but homogeneous multicore processors are well supported by many RTOSes today, including freeRTOS, and other mechanisms such as queues and semaphores/mutexes are probably more appropriate for the synchronization part. Some current SoCs support hardware semaphores and memory-access interrupts that can improve performance greatly.
EDIT:
There is one freeRTOS feature that can assist you here, message buffers. There is one example using ST's STM32H745 dual-core SoC [here] that comes with a companion article [here] written by freeRTOS's Richard Barry.
In the article https://en.m.wikipedia.org/wiki/Mutual_exclusion#Software_solutions it is stated that:
These algorithms do not work if out-of-order execution is used on the platform that executes them. Programmers have to specify strict ordering on the memory operations within a thread.
But those algorithms are simple C programs, and if we can't be sure they work as expected on out-of-order (OoO) systems, how can we be sure that our other programs will work correctly?
What do these programs do that fails under out-of-order execution?
Basically, when do programs not work on OoO processors, so that we can be careful in those cases?
Can we trust OoO processors to do what we code?
There are no problems with out-of-order execution within a single program (more precisely, within a single thread). It can only lead to problems when there is concurrency (two or more threads running in parallel), for example with a software mutex.
Barriers with mfence (or lfence and sfence in special cases) can help on x86 platforms. They tell the processor that memory operations must not be reordered across that point. These are assembler instructions, so in C you have to write
asm volatile("mfence" ::: "memory");   /* the "memory" clobber also stops the compiler from reordering around it */
or use the corresponding intrinsic, _mm_mfence().
Another problem could be that the compiler arranges the instructions differently than they appear in the program, or makes other optimizations (for example, not writing the values to memory at all but keeping them in a register). To prevent this, the volatile keyword must be used on the variables involved in the software mutex.
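For illustration, here is a sketch of the store-then-load step that such software mutexes (Dekker/Peterson style) depend on; the flag names are made up, and this is not a complete mutual-exclusion algorithm. Without the fence, the store buffer lets the load of the other thread's flag complete before our own store is visible to it, which breaks the algorithm:

#include <immintrin.h>            /* _mm_mfence() */

volatile int want0, want1;        /* "I want to enter" flags, one per thread */

void thread0_try_enter(void) {
    want0 = 1;                    /* announce intent to enter */
    _mm_mfence();                 /* make the store globally visible before reading want1 */
    while (want1) {
        /* the other thread also wants in; a full algorithm breaks the tie here */
    }
    /* critical section */
    want0 = 0;                    /* leave */
}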
I'm working on an assignment in an operating systems course, on Xv6. I need to implement a data status structure for a process: its creation time, termination time, sleep time, etc.
As of now I've decided to use the ticks variable directly, without taking tickslock, because it doesn't seem like a good idea to take a lock and slow down the system for such a low-priority objective.
Since the ticks variable is only ever updated like ticks++, is there any way I could read the current number of ticks and get a wrong number?
I don't mind getting a number that's off by +-10 ticks, but could it be really far off? For example, when a number like 01111111111111111 is incremented, 2 bytes need to change. So my question is: is it possible that the CPU stores the data in stages, and that another CPU could fetch the data at that memory location between the start and the completion of the store?
So as I see it, whether the compiler emits a mov instruction or an inc instruction, what I want to know is whether the store can be observed partway between its start and its end.
There's no problem in asm: aligned loads/stores done with a single instruction on x86 are atomic up to qword (8-byte) width. Why is integer assignment on a naturally aligned variable atomic on x86?
(On 486, the guarantee is only for 4-byte aligned values, and maybe not even that for 386, so possibly this is why Xv6 uses locking? I'm not sure if it's supposed to be multi-core safe on 386; my understanding is that the rare 386 SMP machines didn't exactly implement the modern x86 memory model (memory ordering and so on).)
But C is not asm. Using a plain non-atomic variable from multiple "threads" at once is undefined behaviour, unless all threads are only reading. This means compilers can assume that a normal C variable isn't changed asynchronously by other threads.
Using ticks in a loop in C will let the compiler read it once and keep using the same value repeatedly. You need a READ_ONCE macro like the Linux kernel uses, e.g. *(volatile int*)&ticks. Or simply declare it as volatile unsigned ticks;
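A minimal stand-alone version of that idea (the kernel's real READ_ONCE is more elaborate, and typeof is a GNU C extension):

/* force a fresh load from memory on every use */
#define READ_ONCE(x)  (*(volatile typeof(x) *)&(x))

unsigned snapshot = READ_ONCE(ticks);   /* re-reads ticks even inside a loop */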
For a variable narrow enough to fit in one integer register, it's probably safe to assume that a sane compiler will write it with a single dword store, whether that's a mov or a memory-destination inc or add dword [mem], 1. (You can't assume that a compiler will use a memory-destination inc/add, though, so you can't depend on an increment being single-core-atomic with respect to interrupts.)
With one writer and multiple readers, yes the readers can simply read it without any need for any kind of locking, as long as they use volatile.
Even in portable ISO C, volatile sig_atomic_t has some very limited guarantees of working safely when written by a signal handler and read by the thread that ran the signal handler. (Not necessarily by other threads, though: in ISO C volatile doesn't avoid data-race UB. But in practice on x86 with non-hostile compilers it's fine.)
(POSIX signals are the user-space equivalent of interrupts.)
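For example, the classic pattern that ISO C does guarantee: a flag set by a signal handler and polled by the thread that runs it.

#include <signal.h>

volatile sig_atomic_t got_signal = 0;

void handler(int sig) {
    (void)sig;                    /* unused */
    got_signal = 1;               /* just a flag store: async-signal-safe */
}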
See also Can num++ be atomic for 'int num'?
For one thread to publish a wider counter in two halves, you'd usually use a SeqLock. With 1 writer and multiple readers, there's no actual locking, just retry by the readers if a write overlapped with their read. See Implementing 64 bit atomic counter with 32 bit atomics
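A sketch of that single-writer SeqLock pattern with C11 atomics (names are illustrative; the linked Q&A covers the ordering details):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t seq;       /* even = stable, odd = write in progress */
static _Atomic uint32_t lo, hi;    /* the two 32-bit halves of the 64-bit counter */

void counter_write(uint64_t v) {   /* single writer only */
    uint32_t s = atomic_load_explicit(&seq, memory_order_relaxed);
    atomic_store_explicit(&seq, s + 1, memory_order_relaxed);           /* now odd */
    atomic_thread_fence(memory_order_release);                          /* seq=odd ordered before the data stores */
    atomic_store_explicit(&lo, (uint32_t)v,         memory_order_relaxed);
    atomic_store_explicit(&hi, (uint32_t)(v >> 32), memory_order_relaxed);
    atomic_store_explicit(&seq, s + 2, memory_order_release);           /* even again, ordered after the data stores */
}

uint64_t counter_read(void) {      /* any number of readers, no locking */
    uint32_t s0, s1, l, h;
    do {
        s0 = atomic_load_explicit(&seq, memory_order_acquire);
        l  = atomic_load_explicit(&lo,  memory_order_relaxed);
        h  = atomic_load_explicit(&hi,  memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);                      /* data loads ordered before the re-check */
        s1 = atomic_load_explicit(&seq, memory_order_relaxed);
    } while ((s0 & 1) || s0 != s1);    /* retry if a write overlapped our read */
    return ((uint64_t)h << 32) | l;
}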
First, using locks or not isn't a matter of whether your objective is low priority or not, but a matter of solving a race condition.
Second, in the specific case you describe, it is safe to read the ticks variable without any locks. This is not a harmful race, because access to the same memory location (even the same address here) cannot be performed by two separate CPUs at exactly the same instant, and because a write to ticks only increments the value by 1 rather than making any larger change you would really miss.
Are there any implementations of atomic operations in C for ARM architectures prior to ARMv6? All those I've seen so far are based on the LDREX/STREX instructions, which were introduced only in the ARMv6 architecture. The only possible solution for previous architectures seems to be to disable/enable IRQs, which makes the operations blocking.
Are there any implementations of atomic operations in C for ARM architectures prior to ARMv6?
No. It is impossible to implement this in C without support from the compiler or from assembly (i.e. assembly support in the compiler). C itself has no construct that guarantees something executes atomically.
The only possible solution for previous architectures seems to be to disable/enable IRQs, which makes the operations blocking.
Many lock-free algorithms need CAS (compare-and-swap). swp and swpb can be used for some primitive multi-value operations, but they are not CAS. For instance, to serve four sources and one consumer, you can give each of four bytes to a source, updated with swpb, and have the consumer use swp to transfer the four 'work' bytes in one go. Most ARM CPUs prior to ARMv6 are single core, and locking interrupts is the common way to do things there. ARMv6 cores have support for LDREX/STREX. The swp instruction is not multi-CPU friendly, as it locks the entire bus for the transaction (read/write). However, swp can be used for spin locks if it is the only thing available.
Linux has support for a 'compare and exchange' with OS help. The gist is that a small, fixed assembler sequence does the compare and exchange, and the interrupt and data-abort handlers are hooked so that if this sequence is interrupted, it is restarted.
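For example, user-space code on ARM Linux can call the kernel's cmpxchg helper at its fixed address (this "kuser helper" mechanism is what the paragraph above refers to; treat the sketch below as illustrative rather than as a reference):

/* __kernel_cmpxchg: returns 0 if *ptr equalled oldval and has been set to newval */
typedef int (*kernel_cmpxchg_t)(int oldval, int newval, volatile int *ptr);
#define __kernel_cmpxchg ((kernel_cmpxchg_t)0xffff0fc0)

int atomic_add_return(volatile int *ptr, int amount) {
    int old;
    do {
        old = *ptr;
    } while (__kernel_cmpxchg(old, old + amount, ptr) != 0);   /* retry if interrupted or raced */
    return old + amount;
}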
How can I be sure that data written by multiple CPU cores while holding a mutex lock is synchronized across the L1 caches of all cores? I am not talking about the variable that represents the lock; I am talking about the memory locations that are written while the lock is held.
This is for Linux, x86_64, and my code is:
#include <sys/types.h>
#include "dlog.h"

uint *dlog_line;          /* shared line counter (points into shared memory) */
volatile int dlog_lock;   /* 0 = unlocked, 1 = locked */

char *dlog_get_new_line(void) {
    uint val;
    /* spin until we atomically change dlog_lock from 0 to 1 */
    while (!__sync_bool_compare_and_swap(&dlog_lock, 0, 1))
        ;
    /* critical section: update the shared line counter, wrapping at the end */
    val = *dlog_line;
    if (val == DT_DLOG_MAX_LINES)
        val = 0;
    *dlog_line = val;
    dlog_lock = 0;        /* release the lock */
    return NULL;          /* placeholder: the real code presumably returns the new line buffer */
}
Here, inside the dlog_get_new_line() function, I use a gcc builtin, so there shouldn't be any problem with acquiring the lock. But how can I ensure that, when the lock is released, the value pointed to by dlog_line propagates into the L1 caches of all the other CPU cores in the system?
I do not use pthreads; each process runs on a different CPU core.
What you're interested in is called cache coherence. This is done automatically by the hardware.
So in short, you don't have to do anything if you are correctly using __sync_bool_compare_and_swap() (or any other locking intrinsic).
As an oversimplified explanation, the thread will not return from the call to __sync_bool_compare_and_swap() until all the other processors are able to see the new value or are aware that their local copy is out-of-date.
If you're interested in what happens underneath (in the hardware), there are various cache coherence algorithms that are used to ensure that a core doesn't read an outdated copy of data.
Here's a partial list of commonly taught protocols:
MSI
MESI
Firefly
Modern hardware will typically have much more complicated algorithms for it.
Gcc has two other builtins that were invented exactly for the purpose you describe: __sync_lock_test_and_set and __sync_lock_release. They have so-called acquire/release semantics, which guarantees that the stored values of other variables are visible as you need them while you hold your spinlock. These requirements are a bit weaker than what __sync_bool_compare_and_swap provides, so it's better to use the tools that are tailored for the job.
They adapt well to the capabilities of different hardware. E.g. on my x86_64 this puts an mfence instruction before the final atomic store into dlog_lock, but on different hardware this will be adapted to the available instruction set.
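For illustration, a sketch of how the locking in the question might look with those builtins (same variable names as in the question; this shows just the locking pattern, not a drop-in replacement):

void dlog_advance_line(void) {
    uint val;
    while (__sync_lock_test_and_set(&dlog_lock, 1))   /* acquire: returns the previous value */
        ;                                             /* spin while it was already 1 */
    val = *dlog_line;                                 /* critical section */
    if (val == DT_DLOG_MAX_LINES)
        val = 0;
    *dlog_line = val;
    __sync_lock_release(&dlog_lock);                  /* release: stores 0 with release semantics */
}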