Does anyone know of a good/correct implementation of Peterson's Lock algorithm in C? I can't seem to find this. Thanks.
Peterson's algorithm cannot be implemented correctly in C99, as explained in "Who ordered memory fences on an x86?".
Peterson's algorithm is as follows:
LOCK (thread id):                     LOCK (thread other):
    interested[id] = 1                    interested[other] = 1
    turn = other                          turn = id

    while turn == other                   while turn == id
      and interested[other] == 1            and interested[id] == 1

UNLOCK (thread id):                   UNLOCK (thread other):
    interested[id] = 0                    interested[other] = 0
There are some hidden assumptions here. To begin with, each thread must note its interest in acquiring the lock before giving away its turn. Giving away the turn must make visible to the other thread that we are interested in acquiring the lock.
Also, as in every lock, memory accesses in the critical section cannot be hoisted past the lock() call, nor sunk past the unlock(). I.e.: lock() must have at least acquire semantics, and unlock() must have at least release semantics.
In C11, the simplest way to achieve this would be to use a sequentially consistent memory order, which makes the code run as if it were a simple interleaving of threads running in program order (WARNING: totally untested code, but it's similar to an example in Dmitriy V'jukov's Relacy Race Detector):
#include <stdatomic.h>

static atomic_int interested[2];
static atomic_int turn;

void lock(int id)
{
    atomic_store(&interested[id], 1);
    atomic_store(&turn, 1 - id);
    while (atomic_load(&turn) == 1 - id
           && atomic_load(&interested[1 - id]) == 1)
        ;  /* spin */
}

void unlock(int id)
{
    atomic_store(&interested[id], 0);
}
This ensures that the compiler doesn't make optimizations that break the algorithm (by hoisting/sinking loads/stores across atomic operations), and emits the appropriate CPU instructions to ensure the CPU also doesn't break the algorithm. The default memory model for C11/C++11 atomic operations that don't explicitly select a memory model is the sequentially consistent memory model.
C11/C++11 also support weaker memory models, which allow as much optimization as possible. The following is a C11 translation of Anthony Williams' C++11 translation of an algorithm originally written by Dmitriy V'jukov in the syntax of his own Relacy Race Detector
[petersons_lock_with_C++0x_atomics] [the-inscrutable-c-memory-model]. If this translation is incorrect, the fault is mine (WARNING: also untested code, but based on good code from Dmitriy V'jukov and Anthony Williams):
void lock(int id)
{
    atomic_store_explicit(&interested[id], 1, memory_order_relaxed);
    atomic_exchange_explicit(&turn, 1 - id, memory_order_acq_rel);
    while (atomic_load_explicit(&interested[1 - id], memory_order_acquire) == 1
           && atomic_load_explicit(&turn, memory_order_relaxed) == 1 - id)
        ;  /* spin */
}

void unlock(int id)
{
    atomic_store_explicit(&interested[id], 0, memory_order_release);
}
Notice the exchange with acquire and release semantics. An exchange is an
atomic RMW operation. Atomic RMW operations always read the last value stored
before the write in the RMW operation. Also, an acquire on an atomic object
that reads a write from a release on that same atomic object (or any later
write on that object from the thread that performed the release or any later
write from any atomic RMW operation) creates a synchronizes-with relation
between the release and the acquire.
So this operation is a synchronization point between the threads: there is
always a synchronizes-with relationship between the exchange in one thread and
the last exchange performed by any thread (or the initialization of turn, for
the very first exchange).
So we have a sequenced-before relationship between the store to interested[id]
and the exchange from/to turn, a synchronizes-with relationship between two
consecutive exchanges from/to turn, and a sequenced-before relationship
between the exchange from/to turn and the load of interested[1 - id]. This
amounts to a happens-before relationship between accesses to interested[x] in
different threads, with turn providing the synchronization between threads.
This forces all the ordering needed to make the algorithm work.
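Either version of lock()/unlock() above would be used the same way. As a minimal, untested usage sketch (hypothetical worker(); two threads each pass their own id of 0 or 1, protecting a plain shared counter):

#include <pthread.h>
#include <stdint.h>

/* lock() and unlock() as defined above */
extern void lock(int id);
extern void unlock(int id);

static long counter;                   /* protected by the Peterson lock */

static void *worker(void *arg)
{
    int id = (int)(intptr_t)arg;       /* 0 or 1 */
    for (int i = 0; i < 100000; i++) {
        lock(id);
        counter++;                     /* critical section */
        unlock(id);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    /* counter should end up at 200000 if the lock works */
    return 0;
}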
So how were these things done before C11? It involved using compiler and
CPU-specific magic. As an example, let's see the pretty strongly-ordered x86.
IIRC, all x86 loads have acquire semantics, and all stores have release
semantics (save non-temporal moves, in SSE, used precisely to achieve higher
performance at the cost of occasionally needing to issue CPU fences to achieve
coherence between CPUs). But this is not enough for Peterson's algorithm.
As Bartosz Milewski explains at who-ordered-memory-fences-on-an-x86,
for Peterson's algorithm to work we need to establish an ordering between
accesses to turn and interested; failing to do that may result in seeing loads
from interested[1 - id] before writes to interested[id], which is a bad thing.
So a way to do that with GCC on x86 would be (WARNING: although I tested something similar to the following, namely a modified version of the code at wrong-implementation-of-petersons-algorithm, testing is nowhere near enough to assure the correctness of multithreaded code):
static int interested[2];   /* plain ints: ordering is enforced below with
                               explicit fences and compiler barriers */
static int turn;

void lock(int id)
{
    interested[id] = 1;
    turn = 1 - id;

    __asm__ __volatile__("mfence" ::: "memory");
    do {
        __asm__ __volatile__("" ::: "memory");
    } while (turn == 1 - id
             && interested[1 - id] == 1);
}

void unlock(int id)
{
    /* compiler barrier: don't let critical-section accesses sink below this store */
    __asm__ __volatile__("" ::: "memory");
    interested[id] = 0;
}
The MFENCE prevents stores and loads to different memory addresses from being
reordered. Otherwise the write to interested[id] could be queued in the store
buffer while the load of interested[1 - id] proceeds. On many current
microarchitectures an SFENCE may be enough, since it may be implemented as a
store-buffer drain, but IIUC SFENCE doesn't need to be implemented that way
and may simply prevent reordering between stores. So SFENCE may not be enough everywhere, and we need a full MFENCE.
The compiler barrier (__asm__ __volatile__("":::"memory")) prevents
the compiler from deciding that it already knows the value of turn. We're
telling the compiler that we've clobbered memory, so all values cached in
registers must be reloaded from memory.
P.S: I feel this needs a closing paragraph, but my brain is drained.
I won't make any assertions about how good or correct the implementation is, but it was tested (briefly). This is a straight translation of the algorithm described on Wikipedia.
#include <assert.h>

struct petersonslock_t {
    volatile unsigned flag[2];
    volatile unsigned turn;
};
typedef struct petersonslock_t petersonslock_t;

petersonslock_t petersonslock (void) {
    petersonslock_t l = { { 0U, 0U }, ~0U };
    return l;
}

void petersonslock_lock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    l->flag[p] = 1;
    l->turn = !p;
    while (l->flag[!p] && (l->turn == !p)) {}
}

void petersonslock_unlock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    l->flag[p] = 0;
}
Greg points out that on an SMP architecture with slightly relaxed memory coherency (such as x86), although the loads to the same memory location are in order, loads to different locations on one processor may appear out of order to the other processor.
Jens Gustedt and ninjalj recommend modifying the original algorithm to use the atomic_flag type. This means setting the flag and turn would use atomic_flag_test_and_set, and clearing them would use atomic_flag_clear, from C11. Alternatively, a memory barrier could be imposed between the updates to flag/turn and the subsequent reads.
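A rough, untested sketch of the fence-based alternative (hypothetical name petersonslock_lock_fenced, reusing petersonslock_t from above): a C11 sequentially consistent fence between the stores to flag/turn and the subsequent loads plays the same role as the MFENCE in the GCC/x86 version earlier. Strictly speaking the plain volatile fields still make this a data race in the C11 abstract machine, so turning the fields into _Atomic objects remains the portable fix.

#include <assert.h>
#include <stdatomic.h>

void petersonslock_lock_fenced (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    l->flag[p] = 1;
    l->turn = !p;
    /* Full fence: keeps the stores above from being reordered with the
       loads below (the StoreLoad reordering that x86 otherwise allows). */
    atomic_thread_fence(memory_order_seq_cst);
    while (l->flag[!p] && (l->turn == !p)) {}
}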
Edit: I originally attempted to correct for this by writing to the same memory location for all the states. ninjalj pointed out that the bitwise operations turned the state operations into RMWs rather than the loads and stores of the original algorithm. So, atomic bitwise operations are required. C11 provides such operators, as does GCC with built-ins. The algorithm below uses GCC built-ins, but wrapped in macros so that it can easily be changed to some other implementation. However, modifying the original algorithm above is the preferred solution.
struct petersonslock_t {
    volatile unsigned state;   /* byte 0 = turn, byte 1 = flag[0], byte 2 = flag[1] */
};
typedef struct petersonslock_t petersonslock_t;

#define ATOMIC_OR(x,v)  __sync_or_and_fetch(&x, v)
#define ATOMIC_AND(x,v) __sync_and_and_fetch(&x, v)

petersonslock_t petersonslock (void) {
    petersonslock_t l = { 0x000000U };
    return l;
}

void petersonslock_lock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    unsigned mask = (p == 0) ? 0xFF0000 : 0x00FF00;                  /* other thread's flag byte */
    ATOMIC_OR(l->state, (p == 0) ? 0x000100 : 0x010000);             /* set our flag */
    (p == 0) ? ATOMIC_OR(l->state, 0x000001) : ATOMIC_AND(l->state, 0xFFFF00);  /* turn = other */
    while ((l->state & mask) && (l->state & 0x0000FF) == !p) {}
}

void petersonslock_unlock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    ATOMIC_AND(l->state, (p == 0) ? 0xFF00FF : 0x00FFFF);            /* clear our flag */
}
Suppose I wanted to implement a mechanism for calling a piece of code exactly once (e.g. for initialization purposes), even when multiple threads hit the call site repeatedly. Basically, I'm trying to implement something like pthread_once, but with GCC atomics and spin-locking. I have a candidate implementation below, but I'd like to know if
a) it could be faster in the common case (i.e. already initialized), and,
b) is the selected memory ordering strong enough / too strong?
Architectures of interest are x86_64 (primarily) and aarch64.
The intended usage is something like this:
void gets_called_many_times_from_many_threads(void)
{
static int my_once_flag = 0;
if (once_enter(&my_once_flag)) {
// do one-time initialization here
once_commit(&my_once_flag);
}
// do other things that assume the initialization has taken place
}
And here is the implementation:
int once_enter(int *b)
{
int zero = 0;
int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
if (got_lock) return 1;
while (2 != __atomic_load_n(b, __ATOMIC_ACQUIRE)) {
// on x86, insert a pause instruction here
};
return 0;
}
void once_commit(int *b)
{
(void) __atomic_store_n(b, 2, __ATOMIC_RELEASE);
}
I think that the RELAXED ordering on the compare exchange is okay, because we don't skip the atomic load in the while condition even if the compare-exchange gives us 2 (in the "zero" variable), so the ACQUIRE on that load synchronizes with the RELEASE in once_commit (I think), but maybe on a successful compare-exchange we need to use RELEASE? I'm unclear here.
Also, I just learned that lock cmpxchg is a full memory barrier on x86, and since we are hitting the __atomic_compare_exchange_n in the common case (initialization has already been done), that barrier is occurring on every function call. Is there an easy way to avoid this?
UPDATE
Based on the comments and accepted answer, I've come up with the following modified implementation. If anybody spots a bug please let me know, but I believe it's correct. Basically, the change amounts to implementing double-check locking. I also switched to using SEQ_CST because:
I mainly care that the common (already initialized) case is fast.
I observed that GCC doesn't emit a memory fence instruction on x86 for the first read (and it does do so on ARM even with ACQUIRE).
#ifdef __x86_64__
#define PAUSE() __asm __volatile("pause")
#else
#define PAUSE()
#endif
int once_enter(int *b)
{
if(2 == __atomic_load_n(b, __ATOMIC_SEQ_CST)) return 0;
int zero = 0;
int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
if (got_lock) return 1;
while (2 != __atomic_load_n(b, __ATOMIC_SEQ_CST)) {
PAUSE();
};
return 0;
}
void once_commit(int *b)
{
(void) __atomic_store_n(b, 2, __ATOMIC_SEQ_CST);
}
a) What you need is a double-checked lock.
Basically, instead of entering the lock every time, you do an acquiring-load to see if the initialisation has been done yet, and only invoke once_enter if it has not.
void gets_called_many_times_from_many_threads(void)
{
static int my_once_flag = 0;
if (__atomic_load_n(&my_once_flag, __ATOMIC_ACQUIRE) != 2) {
if (once_enter(&my_once_flag)) {
// do one-time initialization here
once_commit(&my_once_flag);
}
}
// do other things that assume the initialization has taken place
}
b) I believe this is enough: your initialisation happens before the releasing store of 2 to my_once_flag, and every other thread has to observe the value 2 with an acquiring load from the same variable.
I have learned from SO threads here and here, among others, that it is not safe to assume that reads/writes of data in multithreaded applications are atomic at the OS/hardware level, and corruption of data may result. I would like to know the simplest way of making reads and writes of int variables atomic, using the <stdatomic.h> C11 library with the GCC compiler on Linux.
If I currently have an int assignment in a thread: messageBox[i] = 2, how do I make this assignment atomic? Equally for a reading test, like if (messageBox[i] == 2).
For C11 atomics you don't even have to use functions. If your implementation (= compiler) supports atomics you can just add an atomic specifier to a variable declaration and then subsequently all operations on that are atomic:
_Atomic(int) toto = 65;
...
toto += 2; // is an atomic read-modify-write operation
...
if (toto == 67) // is an atomic read of toto
Atomics have their price (they need much more computing resources), but as long as you use them sparingly they are the perfect tool to synchronize threads.
If I currently have an int assignment in a thread: messageBox[i] = 2, how do I make this assignment atomic? Equally for a reading test, like if (messageBox[i] == 2).
You almost never have to do anything. In almost every case, the data which your threads share (or communicate with) is protected from concurrent access via such things as mutexes, semaphores and the like. The implementations of the base operations ensure the synchronization of memory.
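For example, a minimal sketch with a POSIX mutex (hypothetical messageBox size and accessor names; the point is that the lock/unlock calls themselves provide all the memory synchronization you need):

#include <pthread.h>

#define N_BOXES 16                          /* made-up size */
static int messageBox[N_BOXES];             /* plain ints, no atomics needed */
static pthread_mutex_t box_mtx = PTHREAD_MUTEX_INITIALIZER;

void write_box(int i, int value)
{
    pthread_mutex_lock(&box_mtx);
    messageBox[i] = value;                  /* protected write */
    pthread_mutex_unlock(&box_mtx);
}

int read_box(int i)
{
    pthread_mutex_lock(&box_mtx);
    int value = messageBox[i];              /* protected read */
    pthread_mutex_unlock(&box_mtx);
    return value;
}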
The reason for these atomics is to help you construct safer race conditions in your code. There are a number of hazards with them, including:
ai += 7;
would use an atomic protocol if ai were suitably defined. Trying to decipher race conditions is not aided by obscuring the implementation.
There is also a highly machine dependent portion to them. The line above, for example, could fail [1] on some platforms, but how is that failure communicated back to the program? It is not [2].
Only one operation has the option of dealing with failure: atomic_compare_exchange_(weak|strong). Weak just tries once, and lets the program choose how and whether to retry. Strong retries endlessly. It isn't enough to just try once -- spurious failures due to interrupts can occur -- but endless retries on a non-spurious failure is no good either.
Arguably, for robust programs or widely applicable libraries, the only one of these you should use is atomic_compare_exchange_weak().
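For illustration, a minimal (untested) sketch in that spirit: a hypothetical bounded_atomic_add() that uses atomic_compare_exchange_weak in an explicit retry loop, so the program, not the library, decides how many times to retry and what to do when it gives up.

#include <stdatomic.h>
#include <stdbool.h>

/* Add v to *target, giving up after max_tries failed attempts. */
static bool bounded_atomic_add(atomic_int *target, int v, int max_tries)
{
    int expected = atomic_load(target);
    for (int i = 0; i < max_tries; i++) {
        /* On failure (spurious or real), expected is reloaded with the
           current value of *target, so the next attempt uses fresh data. */
        if (atomic_compare_exchange_weak(target, &expected, expected + v))
            return true;
    }
    return false;   /* caller decides how to handle persistent contention */
}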
[1] Load-linked, store-conditional (ll-sc) is a common means for making atomic transactions on asynchronous bus architectures. The load-linked sets a little flag on a cache line, which will be cleared if any other bus agent attempts to modify that cache line. Store-conditional stores a value iff the little flag is set in the cache, and clears the flag; iff the flag is cleared, Store-conditional signals an error, so an appropriate retry operation can be attempted. From these two operations, you can construct any atomic operation you like on a completely asynchronous bus architecture.
ll-sc can have subtle dependencies on the caching attributes of the location. Permissible cache attributes are platform dependent, as is which operations may be performed between the ll and sc.
If you put an ll-sc operation on a poorly cached access, and blindly retry, your program will lock up. This isn't just speculation; I had to debug one of these on an ARMv7-based "safe" system.
[2]:
#include <stdatomic.h>
int f(atomic_int *x) {
return (*x)++;
}
f:
dmb ish
.L2:
ldrex r3, [r0]
adds r2, r3, #1
strex r1, r2, [r0]
cmp r1, #0
bne .L2 /* note the retry loop */
dmb ish
mov r0, r3
bx lr
The most portable way is to use one of the C11 atomic types. You can also use an atomic spinlock to guard non-atomic variables. Here is a simple pthread producer/consumer example to play with; modify it as desired. Notice that cnt_non and cnt_vol can be corrupted.
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

atomic_uint cnt_atomic;
int cnt_non;
volatile int cnt_vol;
typedef atomic_uint lock_t;
lock_t lockholder = 0;
#define LOCK_C 0x01
#define LOCK_P 0x02
int cnt_lock; /* not atomic on purpose to test spinlock */
atomic_int lock_held_c, lock_held_p;
void
lock(lock_t *bitarrayp, uint32_t desired)
{
uint32_t expected = 0; /* lock is not held */
/* the value in expected is updated if it does not match
* the value in bitarrayp. If the comparison fails then compare
* the returned value with the lock bits and update the appropriate
* counter.
*/
do {
if (expected & LOCK_P) lock_held_p++;
if (expected & LOCK_C) lock_held_c++;
expected = 0;
} while(!atomic_compare_exchange_weak(bitarrayp, &expected, desired));
}
void
unlock(lock_t *bitarrayp)
{
*bitarrayp = 0;
}
void*
fn_c(void *thr_data)
{
(void)thr_data;
for (int i=0; i<40000; i++) {
cnt_atomic++;
cnt_non++;
cnt_vol++;
/* lock, increment, unlock */
lock(&lockholder, LOCK_C);
cnt_lock++;
unlock(&lockholder);
}
return NULL;
}
void*
fn_p(void *thr_data)
{
(void)thr_data;
for (int i=0; i<30000; i++) {
cnt_atomic++;
cnt_non++;
cnt_vol++;
/* lock, increment, unlock */
lock(&lockholder, LOCK_P);
cnt_lock++;
unlock(&lockholder);
}
return NULL;
}
void
drv_pc(void)
{
pthread_t thr[2];
pthread_create(&thr[0], NULL, fn_c, NULL);
pthread_create(&thr[1], NULL, fn_p, NULL);
for(int n = 0; n < 2; ++n)
pthread_join(thr[n], NULL);
printf("cnt_a=%d, cnt_non=%d cnt_vol=%d\n", cnt_atomic, cnt_non, cnt_vol);
printf("lock %d held_c=%d held_p=%d\n", cnt_lock, lock_held_c, lock_held_p);
}
that it is not safe to assume that reads/writes of data in multithreaded applications are atomic at the OS/hardware level, and corruption of data may result
Actually, non-composite operations on types like int are atomic on all reasonable architectures. What you read is simply a hoax.
(An increment is a composite operation: it has a read, a calculation, and a write component. Each component is atomic but the whole composite operation is not.)
But atomicity at the hardware level isn't the issue. The high level language you use simply doesn't support that kind of manipulations on regular types. You need to use atomic types to even have the right to manipulate objects in such a way that the question of atomicity is relevant: when you are potentially modifying an object in use in another thread.
(Or volatile types. But don't use volatile. Use atomics.)
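Applied directly to the messageBox case from the question, a minimal sketch (assuming you control the declaration; N_BOXES and the function names are made up):

#include <stdatomic.h>

#define N_BOXES 16
static _Atomic int messageBox[N_BOXES];

void sender(int i)
{
    messageBox[i] = 2;          /* plain assignment to an _Atomic object
                                   is an atomic (seq_cst) store */
}

int receiver(int i)
{
    return messageBox[i] == 2;  /* atomic read of messageBox[i] */
}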
I've been having an implementation discussion where the idea that a CPU can choose to completely reorder the storing of memory has come up.
I was initializing a static array in C using code similar to:
static int array[10];
static int array_initialized = 0;
void initialize () {
array[0] = 1;
array[1] = 2;
...
array_initialized = -1;
}
and it is used later similar to:
int get_index(int index) {
if (!array_initialized) initialize();
if (index < 0 || index > 9) return -1;
return array[index];
}
is it possible for the CPU to reorder memory access in a multi-core intel architecture (or other architecture) such that it sets array_initialized before the initialize function has finished setting the array elements? or so that another execution thread can see array_initialized as non-zero before the entire array has been initialized in its view of the memory?
TL:DR: to make lazy-init safe if you don't do it before starting multiple threads, you need an _Atomic flag.
is it possible for the CPU to reorder memory access in a multi-core Intel (x86) architecture
No, such reordering is possible at compile time only. x86 asm effectively has acquire/release semantics for normal loads/stores. (seq_cst + a store buffer with store forwarding).
https://preshing.com/20120625/memory-ordering-at-compile-time/
(or other architecture)
Yes, most other ISAs have a weaker asm memory model that does allow StoreStore reordering and LoadLoad reordering. (Effectively memory_order_relaxed, or sort of like memory_order_consume on ISAs other than Alpha AXP, but compilers don't try to maintain data dependencies.)
None of this really matters from C's point of view, because the C memory model is very weak: it allows compile-time reordering, and simultaneous read/write or write+write of any object is data-race UB.
Data Race UB is what lets a compiler keep static variables in registers for the life of a function / inside a loop when compiling for "normal" ISAs.
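For example, with a plain non-atomic flag, nothing stops the compiler from loading it once and keeping it in a register, so a sketch like this (hypothetical, just to illustrate the point) can legally be compiled into an infinite loop when another thread is expected to set the flag:

static int done;                /* plain int, not _Atomic */

void wait_for_done(void)
{
    /* Data-race UB if another thread writes `done` concurrently: the
       compiler may hoist the load out of the loop, effectively turning
       this into `if (!done) for (;;) {}`. */
    while (!done) {}
}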
Having 2 threads run this function is C data-race UB if array_initialized isn't already set before either of them runs (e.g. by having the main thread run it once before starting any more threads). And you can remove the array_initialized flag entirely, unless you have a use for the lazy-init feature before starting any more threads.
It's 100% safe for a single thread, regardless of how many other threads are running: the C programming model guarantees that a single thread always sees its own operations in program order. (Just like asm for all normal ISAs; other than explicit parallelism in ISAs like Itanium, you always see your own operations in order. It's only other threads seeing your operations where things get weird).
Starting a new thread is (I think) always a "full barrier", or in C terms "synchronizes with" the new thread. Stuff in the new thread can't happen before anything in the parent thread. So just calling get_index once from the main thread makes it safe with no further barriers for other threads to run get_index after that.
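A minimal sketch of that approach (untested; worker() is a hypothetical thread function that calls get_index(), and get_index() is the function from the question):

#include <pthread.h>

extern int get_index(int index);        /* from the question */
extern void *worker(void *arg);         /* hypothetical thread function */

int main(void)
{
    /* Run the lazy init once while still single-threaded: everything it
       writes happens-before anything the new threads do. */
    (void) get_index(0);

    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}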
You could make lazy init thread-safe with an _Atomic flag
This is similar to what gcc does for function-local static variables with non-constant initializers. Check out the code-gen for that if you're curious: a read-only check of an already-init flag and then a call to an init function that makes sure only one thread runs the initializer.
This requires an acquire load in the fast-path for the already-initialized state. That's free on x86 and SPARC-TSO (same asm as a normal load), but not on weakly-ordered ISAs. AArch64 has an acquire-load instruction; other ISAs need some barrier instructions.
Turn your array_initialized flag into a 3-state _Atomic variable:
init not started (e.g. init == 0). Check for this with an acquire load.
init started but not finished (e.g. init == -1)
init finished (e.g. init == 1)
You can leave static int array[10]; itself non-atomic by making sure exactly 1 thread "claims" responsibility for doing the init, using atomic_compare_exchange_strong (which will succeed for exactly one thread). And then have other threads spin-wait for the INIT_FINISHED state.
Using initial state == 0 lets it be in the BSS, hopefully next to the data. Otherwise we might prefer INIT_FINISHED=0 for ISAs where branching on an int from memory being (non)zero is slightly more efficient than other numbers. (e.g. AArch64 cbnz, MIPS bne $reg, $zero).
We could get the best of both worlds (cheapest possible fast-path for the already-init case) while still having the flag in the BSS: Have the main thread write it with INIT_NOTSTARTED = -1 before starting any more threads.
Having the flag next to the array is helpful for a small array where the flag is probably in the same cache line as the data we want to index. Or at least the same 4k page.
#include <stdatomic.h>
#include <stdbool.h>
#ifdef __x86_64__
#include <immintrin.h>
#define SPINLOOP_BODY _mm_pause()
#else
#define SPINLOOP_BODY /**/
#endif
#ifdef __GNUC__
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)
#define NOINLINE __attribute__((noinline))
#else
#define unlikely(expr) (expr)
#define likely(expr) (expr)
#define NOINLINE /**/
#endif
enum init_states {
INIT_NOTSTARTED = 0,
INIT_STARTED = -1,
INIT_FINISHED = 1 // optional: make this 0 to speed up the fast-path on some ISAs, and store an INIT_NOTSTARTED before the first call
};
static int array[10];
static _Atomic int array_initialized = INIT_NOTSTARTED;
// called either before or during init.
// One thread claims responsibility for doing the init, others spin-wait
NOINLINE // this is rare, make sure it doesn't bloat the fast-path
void initialize(void) {
bool winner = false;
// check read-only if another thread has already claimed init
if (array_initialized == INIT_NOTSTARTED) {
int expected = INIT_NOTSTARTED;
winner = atomic_compare_exchange_strong(&array_initialized, &expected, INIT_STARTED);
// seq_cst memory order is fine. Weaker might be ok but it only has to run once
}
if (winner) {
array[0] = 1;
// ...
atomic_store_explicit(&array_initialized, INIT_FINISHED, memory_order_release);
} else {
// spin-wait for the winner in other threads
// yield(); optional.
// Or use some kind of mutex or condition var if init is really slow
// otherwise just spin on a seq_cst load. (Or acquire is fine.)
while(array_initialized != INIT_FINISHED)
SPINLOOP_BODY; // x86 only
// winner's release store syncs with our load:
// array[] stores Happened Before this point so we can read it without UB
}
}
int get_index(int index) {
// atomic acquire load is fine, doesn't need seq_cst. Cheaper than seq_cst on PowerPC
if (unlikely(atomic_load_explicit(&array_initialized, memory_order_acquire) != INIT_FINISHED))
initialize();
if (unlikely(index < 0 || index > 9)) return -1;
return array[index];
}
This does compile to correct-looking and efficient asm on Godbolt. Without unlikely() macros, gcc/clang think that at least the stand-alone version of get_index has initialize() and/or return -1 as the most likely fast-path.
And compilers wanted to inline the init function, which would be silly because it only runs once per thread at most. Hopefully profile-guided optimization would correct that.
I am studying the implementation of Seqlock. However, all the sources I have found implement it differently.
Linux Kernel
Linux kernel implements it like this:
static inline unsigned __read_seqcount_begin(const seqcount_t *s)
{
unsigned ret;
repeat:
ret = READ_ONCE(s->sequence);
if (unlikely(ret & 1)) {
cpu_relax();
goto repeat;
}
return ret;
}
static inline unsigned raw_read_seqcount_begin(const seqcount_t *s)
{
unsigned ret = __read_seqcount_begin(s);
smp_rmb();
return ret;
}
Basically, it uses a volatile read plus a read barrier with acquire semantics on the reader side.
When used, subsequent reads are unprotected:
struct Data {
u64 a, b;
};
// ...
read_seqcount_begin(&seq);
int v1 = d.a, v2 = d.b;
// ...
rigtorp/Seqlock
RIGTORP_SEQLOCK_NOINLINE T load() const noexcept {
T copy;
std::size_t seq0, seq1;
do {
seq0 = seq_.load(std::memory_order_acquire);
std::atomic_signal_fence(std::memory_order_acq_rel);
copy = value_;
std::atomic_signal_fence(std::memory_order_acq_rel);
seq1 = seq_.load(std::memory_order_acquire);
} while (seq0 != seq1 || seq0 & 1);
return copy;
}
The load of the data is still performed without atomic operations or other protection. However, an atomic_signal_fence with acquire-release semantics is placed around the read, in contrast to the smp_rmb (a read barrier with acquire semantics) in the kernel version.
Amanieu/seqlock (Rust)
pub fn read(&self) -> T {
loop {
// Load the first sequence number. The acquire ordering ensures that
// this is done before reading the data.
let seq1 = self.seq.load(Ordering::Acquire);
// If the sequence number is odd then it means a writer is currently
// modifying the value.
if seq1 & 1 != 0 {
// Yield to give the writer a chance to finish. Writing is
// expected to be relatively rare anyways so this isn't too
// performance critical.
thread::yield_now();
continue;
}
// We need to use a volatile read here because the data may be
// concurrently modified by a writer.
let result = unsafe { ptr::read_volatile(self.data.get()) };
// Make sure the seq2 read occurs after reading the data. What we
// ideally want is a load(Release), but the Release ordering is not
// available on loads.
fence(Ordering::Acquire);
// If the sequence number is the same then the data wasn't modified
// while we were reading it, and can be returned.
let seq2 = self.seq.load(Ordering::Relaxed);
if seq1 == seq2 {
return result;
}
}
}
There is no memory barrier between loading seq and the data; instead, a volatile read is used for the data.
Can Seqlocks Get Along with Programming Language Memory Models? (Variant 3)
T reader() {
int r1, r2;
unsigned seq0, seq1;
do {
seq0 = seq.load(m_o_acquire);
r1 = data1.load(m_o_relaxed);
r2 = data2.load(m_o_relaxed);
atomic_thread_fence(m_o_acquire);
seq1 = seq.load(m_o_relaxed);
} while (seq0 != seq1 || seq0 & 1);
// do something with r1 and r2;
}
Similar to the Rust implementation, but atomic operations instead of volatile_read are used on data.
Arguments in P1478R1: Byte-wise atomic memcpy
This paper claims that:
In the general case, there are good semantic reasons to require that all data accesses inside such a seqlock "critical section" must be atomic. If we read a pointer p as part of reading the data, and then read *p as well, the code inside the critical section may read from a bad address if the read of p happened to see a half-updated pointer value. In such cases, there is probably no way to avoid reading the pointer with a conventional atomic load, and that's exactly what's desired.
However, in many cases, particularly in the multiple process case, seqlock data consists of a single trivially copyable object, and the seqlock "critical section" consists of a simple copy operation. Under normal circumstances, this could have been written using memcpy. But that's unacceptable here, since memcpy does not generate atomic accesses, and is (according to our specification anyway) susceptable to data races.
Currently to write such code correctly, we need to basically decompose such data into many small lock-free atomic subobjects, and copy them a piece at a time. Treating the data as a single large atomic object would defeat the purpose of the seqlock, since the atomic copy operation would acquire a conventional lock. Our proposal essentially adds a convenient library facility to automate this decomposition into small objects.
My question
Which of the above implementations are correct? Which are correct but inefficient?
Can the volatile_read be reordered before the acquire-read of seqlock?
Your quotes from Linux seem wrong.
According to https://www.kernel.org/doc/html/latest/locking/seqlock.html the read process is:
Read path:
do {
seq = read_seqcount_begin(&foo_seqcount);
/* ... [[read-side critical section]] ... */
} while (read_seqcount_retry(&foo_seqcount, seq));
If you look at the GitHub link posted in the question, you'll find a comment describing nearly the same process.
It seems that you are only looking at one part of the read process. The linked file implements what you need in order to build readers and writers, but not the readers/writers themselves.
Also notice this comment from the top of the file:
* The seqlock seqcount_t interface does not prescribe a precise sequence of
* read begin/retry/end. For readers, typically there is a call to
* read_seqcount_begin() and read_seqcount_retry(), however, there are more
* esoteric cases which do not follow this pattern.
I have a strange situation under C/Visual Studio on a Windows 7 platform. There is a problem that occurs from time to time, and I spent a lot of time tracking it down. The problem is within a third-party library, for which I have the complete code. There a thread is created (the printLog statements are mine):
...
plafParams->eventThreadFlag = 2;
printLog("before CreateThread");
if (plafParams->hReadThread_p = CreateThread(NULL, 0, ( LPTHREAD_START_ROUTINE ) plafPortReadThread, ( void * ) dlmsInstance, 0,
&plafParams->portReadThreadID) )
{
printLog("after CreateThread: OK");
plafParams->eventThreadFlag = 3;
}
else
{
unsigned int lasterr = GetLastError();
printLog("error CreateThread, last error:%x", lasterr);
/* Could not create the read thread. */
...
...
return FAILURE;
}
printLog("SUCCESS");
...
...
The thread function is:
void *plafPortReadThread(DLMS_GLOBALS *dlmsInstance)
{
PLAF_PARAMS *plafParams;
plafParams = (PLAF_PARAMS *)(dlmsInstance->plafParams);
printLog("start - plafPortReadThread, plafParams->eventThreadFlag=%x", plafParams->eventThreadFlag);
while ((plafParams->eventThreadFlag != 1) && (plafParams->eventThreadFlag != 3))
{
if (plafParams->eventThreadFlag == 0)
{
printLog("exit 1 - plafPortReadThread, plafParams->eventThreadFlag=%x", plafParams->eventThreadFlag);
CloseHandle(plafParams->hReadThread_p);
plafFree((void **)&plafParams);
ExitThread(0);
break;
}
}
printLog("start - plafPortReadThread, proceed=%d", proceed);
...
Now, when the flag is set before the while loop is started within the thread, everything works OK:
SUCCESS
start - plafPortReadThread, plafParams->eventThreadFlag=3
But sometimes the thread is quick enough that the while loop starts before the flag is actually set in the outer part.
The output is then:
start - plafPortReadThread, plafParams->eventThreadFlag=2
SUCCESS
Most surprisingly the while loop doesn't exit, even after the flag has been set to 3.
It seems that the compiler "optimizes" the flag and assumes that it cannot be changed from outside.
What could be the problem? I'm really surprised. Or is there something else I have overlooked completely? I know that the code is not very elegant and that such things would better be done with semaphores or signals. But it is not my code and I want to change as little as possible.
After removing the whole while condition it works as expected.
Should I change the struct or its fields to volatile? Everybody says that volatile is useless these days and not needed anymore, except where a memory location is changed by peripherals...
Prior to C11 this is totally platform-dependent, because the effect you are observing is due to the memory model used by your platform. This is different from a compiler optimization, as synchronization points between threads require the compiler to insert barrier instructions instead of, e.g., making something a constant. For C11, section 7.17.3 specifies the different models. So your value is not optimized out statically; thread A just never reads the value thread B wrote, but still has its local value.
In practice many projects don't use C11 yet, and thus you will likely have to check the documentation of your platform. Note that in many cases you don't have to modify the type of the flag variable (in case you can't). Most memory models specify synchronization points that also forbid reordering of certain instructions, e.g. in:
int x = 3;
_Atomic int a = 1;
x = 5;
a = 2;
the compiler will often have to ensure that x has the value 3 when a has the value 1, and that when a is assigned the value 2, x will have the value 5. volatile does not participate in this relationship (in the C11/C++11 models; this is often confused because volatile does participate in Java's happens-before), and is mostly useless, unless your writes should never be optimized out because they have side effects the compiler can't understand, such as an LED blinking:
volatile int x = 1; // some special location - blink then clear
x = 1; // blink then clear
x = 1; // blink then clear
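Applied to the code from the question, a minimal sketch of the C11 approach (untested; assumes a compiler with <stdatomic.h> support and that you can change the field's type; the helper names are made up):

#include <stdatomic.h>

typedef struct {
    _Atomic int eventThreadFlag;        /* was: plain int */
    /* ... the remaining fields stay as they are ... */
} PLAF_PARAMS;

/* Creating side, after CreateThread succeeds: */
static void signal_thread_created(PLAF_PARAMS *p)
{
    atomic_store(&p->eventThreadFlag, 3);
}

/* Reader thread: the load is re-done on every iteration and synchronizes
   with the store above, so the loop cannot cache a stale value in a register. */
static int keep_waiting(PLAF_PARAMS *p)
{
    int f = atomic_load(&p->eventThreadFlag);
    return f != 1 && f != 3;
}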