Memory ordering for a spin-lock "call once" implementation - c

Suppose I wanted to implement a mechanism for calling a piece of code exactly once (e.g. for initialization purposes), even when multiple threads hit the call site repeatedly. Basically, I'm trying to implement something like pthread_once, but with GCC atomics and spin-locking. I have a candidate implementation below, but I'd like to know if
a) it could be faster in the common case (i.e. already initialized), and,
b) is the selected memory ordering strong enough / too strong?
Architectures of interest are x86_64 (primarily) and aarch64.
The intended use API is something like this
void gets_called_many_times_from_many_threads(void)
{
static int my_once_flag = 0;
if (once_enter(&my_once_flag)) {
// do one-time initialization here
once_commit(&my_once_flag);
}
// do other things that assume the initialization has taken place
}
And here is the implementation:
int once_enter(int *b)
{
int zero = 0;
int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
if (got_lock) return 1;
while (2 != __atomic_load_n(b, __ATOMIC_ACQUIRE)) {
// on x86, insert a pause instruction here
};
return 0;
}
void once_commit(int *b)
{
(void) __atomic_store_n(b, 2, __ATOMIC_RELEASE);
}
I think that the RELAXED ordering on the compare exchange is okay, because we don't skip the atomic load in the while condition even if the compare-exchange gives us 2 (in the "zero" variable), so the ACQUIRE on that load synchronizes with the RELEASE in once_commit (I think), but maybe on a successful compare-exchange we need to use RELEASE? I'm unclear here.
Also, I just learned that lock cmpxchg is a full memory barrier on x86, and since we hit the __atomic_compare_exchange_n even in the common case (initialization has already been done), that barrier occurs on every function call. Is there an easy way to avoid this?
UPDATE
Based on the comments and accepted answer, I've come up with the following modified implementation. If anybody spots a bug please let me know, but I believe it's correct. Basically, the change amounts to implementing double-check locking. I also switched to using SEQ_CST because:
I mainly care that the common (already initialized) case is fast.
I observed that GCC doesn't emit a memory fence instruction on x86 for the first read (and it does do so on ARM even with ACQUIRE).
#ifdef __x86_64__
#define PAUSE() __asm __volatile("pause")
#else
#define PAUSE()
#endif
int once_enter(int *b)
{
if(2 == __atomic_load_n(b, __ATOMIC_SEQ_CST)) return 0;
int zero = 0;
int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
if (got_lock) return 1;
while (2 != __atomic_load_n(b, __ATOMIC_SEQ_CST)) {
PAUSE();
};
return 0;
}
void once_commit(int *b)
{
(void) __atomic_store_n(b, 2, __ATOMIC_SEQ_CST);
}

a, What you need is a double-checked lock.
Basically, instead of entering the lock every time, you do an acquiring-load to see if the initialisation has been done yet, and only invoke once_enter if it has not.
void gets_called_many_times_from_many_threads(void)
{
static int my_once_flag = 0;
if (__atomic_load_n(&my_once_flag, __ATOMIC_ACQUIRE) != 2) {
if (once_enter(&my_once_flag)) {
// do one-time initialization here
once_commit(&my_once_flag);
}
}
// do other things that assume the initialization has taken place
}
b, I believe this is enough: your initialisation happens before the releasing store of 2 to my_once_flag, and every other thread has to observe the value 2 with an acquiring load from the same variable before it proceeds.
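A minimal sketch of how the double-checked fast path could be folded into once_enter itself, using the same GCC __atomic builtins as the question. The state names ONCE_* and the cpu_relax helper are hypothetical, introduced only for illustration, and the orderings shown are one plausible choice rather than the only correct one.

```c
#include <assert.h>

enum { ONCE_NEW = 0, ONCE_RUNNING = 1, ONCE_DONE = 2 };   /* hypothetical names */

static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause");
#endif
}

/* Returns 1 to exactly one caller, which must then call once_commit(). */
int once_enter(int *b)
{
    /* Fast path: this acquire load pairs with the release store in once_commit(). */
    if (__atomic_load_n(b, __ATOMIC_ACQUIRE) == ONCE_DONE)
        return 0;

    int expected = ONCE_NEW;
    if (__atomic_compare_exchange_n(b, &expected, ONCE_RUNNING, 0,
                                    __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE))
        return 1;                       /* we won the race: caller initializes */

    /* Lost the race: wait until the winner publishes ONCE_DONE. */
    while (__atomic_load_n(b, __ATOMIC_ACQUIRE) != ONCE_DONE)
        cpu_relax();
    return 0;
}

void once_commit(int *b)
{
    assert(__atomic_load_n(b, __ATOMIC_RELAXED) == ONCE_RUNNING);
    __atomic_store_n(b, ONCE_DONE, __ATOMIC_RELEASE);
}
```

On x86_64 an acquire load compiles to a plain mov, so the already-initialized path contains no locked instruction.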

Related

Trying to implement a spin-lock via "lock xchg" assembly

Basically, I'm trying to run void spinlock_counter(int) in two threads, and count ought to be 2000 (the parameter doesn't do anything, I'm just too lazy). However I made a breakpoint in the critical zone i.e. "count++" and printed "flag", the flag is "GO" (the flag should be "BLOCK" if everything worked).
I don't understand why this doesn't work.
Thank you for your answers!
code:
int xchg(volatile int *addr, int newval){
int result;
asm volatile("lock xchg %0, %1"
:"+m"(addr),
"=a"(result)
:"1"(newval)
:"cc");
return result;
}
#define GO 0
#define BLOCK 1
int flag = GO;
void lock(int *addr){
int note = BLOCK;
while(!xchg(addr, note));
}
void unlock_spin(int *addr){
int note = GO;
xchg(addr, note);
}
void spinlock_counter(int a){
while(1) {
lock(&flag);
count++;
unlock_spin(&flag);
}
printf("%d\n", count);
}
The condition in your while loop (in lock()) is backwards.
The consequence is that lock() effectively does nothing. If another thread already acquired the lock it won't spin at all, and if the lock was GO it'll spin once (setting the lock to BLOCK on the first iteration so that the spinning stops on the 2nd iteration).
I'd recommend writing code that describes your intent in a clearer/less confusing/less error prone way (and never relying on the value of true/false in C) so that these kinds of bugs are prevented. E.g.:
void lock(int *addr){
int note = BLOCK;
while(xchg(addr, note) != GO);
}
However I made a breakpoint in the critical zone i.e. "count++" and printed "flag", the flag is "GO"
This is probably a separate bug (the first bug can let a different thread unlock after your thread locks, but that would not be easy to reproduce). Specifically, nothing prevents the compiler from reordering your code, so you'll want some kind of barrier to stop it from transforming your code into the equivalent of lock(&flag); unlock_spin(&flag); then count++;. Adding "memory" to the clobber list of your inline assembly (i.e. "cc", "memory" as two separate strings) should prevent the compiler from moving memory accesses across the asm statement.
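Putting these fixes together, a hedged sketch of what a corrected lock might look like. It also passes *addr (the pointed-to int rather than the pointer variable) as the memory operand, which is an additional change assumed here. spin_lock and spin_unlock are illustrative names, and the non-x86 branch falls back to GCC's __atomic_exchange_n builtin.

```c
#include <assert.h>

#define GO    0
#define BLOCK 1

/* Atomically store newval into *addr and return the previous value. */
static inline int xchg(volatile int *addr, int newval)
{
#if defined(__x86_64__) || defined(__i386__)
    /* xchg with a memory operand is implicitly locked on x86.
       The "memory" clobber makes the asm a compiler barrier as well. */
    __asm__ __volatile__("xchg %1, %0"
                         : "+m"(*addr), "+r"(newval)
                         :
                         : "memory");
    return newval;
#else
    return __atomic_exchange_n(addr, newval, __ATOMIC_SEQ_CST);
#endif
}

void spin_lock(volatile int *addr)
{
    /* Keep swapping BLOCK in until the value we displaced was GO. */
    while (xchg(addr, BLOCK) != GO)
        ;
}

void spin_unlock(volatile int *addr)
{
    int prev = xchg(addr, GO);
    assert(prev == BLOCK);   /* unlocking a lock that was not held is a bug */
    (void)prev;
}
```

Comparing the returned value against GO explicitly, rather than relying on its truthiness, keeps the loop condition readable.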

would this be a functioning spinlock in c?

I am currently learning about operating systems and was wondering, using these definitions, would this be working as expected or am I missing some atomic operations?
int locked = 0;
void lsLock(){
while(1) {
if (locked){
continue;
} else {
locked = 1;
break;
}
}
}
void lsUnlock(){
locked = 0;
}
Thanks in advance!
The first problem is that the compiler assumes that nothing will be altered by anything "external" (unless it's marked volatile); which means that the compiler is able to optimize this:
int locked = 0;
void lsLock(){
while(1) {
if (locked){
continue;
} else {
locked = 1;
break;
}
}
}
..into this:
int locked = 0;
void lsLock(){
if (locked){
while(1) { }
}
locked = 1;
}
Obviously that won't work if something else modifies locked - once the while(1) {} starts it won't stop. To fix that, you could use volatile int locked = 0;.
This only prevents the compiler from assuming locked didn't change. Nothing prevents a theoretical CPU from playing similar tricks (e.g. even if volatile is used, a non-cache-coherent CPU might not notice that a different CPU altered locked). For a guarantee you need to use atomics or something else (e.g. memory barriers).
However; with volatile int locked it may work, especially for common 80x86 CPUs. Please note that "works" can be considered the worst possibility - it leads to assuming that the code is fine and then having a spectacular (and nearly impossible to debug due to timing issues) disaster when you compile the same code for a different CPU.
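Note also that even with volatile, the test of locked and the store locked = 1 are two separate operations, so two threads can both read 0 and both enter. A minimal sketch using C11 atomic_flag, whose test-and-set is a single atomic step. lsTryLock is a hypothetical helper added for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag locked = ATOMIC_FLAG_INIT;

/* One attempt: returns true if this call took the lock. */
bool lsTryLock(void)
{
    /* test_and_set returns the previous value; false means the lock was free. */
    return !atomic_flag_test_and_set_explicit(&locked, memory_order_acquire);
}

void lsLock(void)
{
    while (!lsTryLock())
        ;   /* spin until the holder calls lsUnlock() */
}

void lsUnlock(void)
{
    atomic_flag_clear_explicit(&locked, memory_order_release);
}
```

The acquire on lock and release on unlock keep accesses inside the critical section from moving outside it.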

Lazy-init an array with multi-threaded readers: is it safe without barriers or atomics?

I've been having an implementation discussion where the idea that a CPU can choose to completely reorder the storing of memory has come up.
I was initializing a static array in C using code similar to:
static int array[10];
static int array_initialized = 0;
void initialize () {
array[0] = 1;
array[1] = 2;
...
array_initialized = -1;
}
and it is used later similar to:
int get_index(int index) {
if (!array_initialized) initialize();
if (index < 0 || index > 9) return -1;
return array[index];
}
is it possible for the CPU to reorder memory access in a multi-core intel architecture (or other architecture) such that it sets array_initialized before the initialize function has finished setting the array elements? or so that another execution thread can see array_initialized as non-zero before the entire array has been initialized in its view of the memory?
TL;DR: to make lazy-init safe if you don't do it before starting multiple threads, you need an _Atomic flag.
is it possible for the CPU to reorder memory access in a multi-core Intel (x86) architecture
No, such reordering is possible at compile time only. x86 asm effectively has acquire/release semantics for normal loads/stores. (seq_cst + a store buffer with store forwarding).
https://preshing.com/20120625/memory-ordering-at-compile-time/
(or other architecture)
Yes, most other ISAs have a weaker asm memory model that does allow StoreStore reordering and LoadLoad reordering. (Effectively memory_order_relaxed, or sort of like memory_order_consume on ISAs other than Alpha AXP, but compilers don't try to maintain data dependencies.)
None of this really matters from C because the C memory model is very weak, allowing compile-time reordering and simultaneous read/write or write+write of any object is data-race UB.
Data Race UB is what lets a compiler keep static variables in registers for the life of a function / inside a loop when compiling for "normal" ISAs.
Having 2 threads run this function is C data-race UB unless array_initialized is already set before either of them runs (e.g. by having the main thread run it once before starting any more threads). In that case you can remove the array_initialized flag entirely, unless you have a use for the lazy-init feature before starting any more threads.
It's 100% safe for a single thread, regardless of how many other threads are running: the C programming model guarantees that a single thread always sees its own operations in program order. (Just like asm for all normal ISAs; other than explicit parallelism in ISAs like Itanium, you always see your own operations in order. It's only other threads seeing your operations where things get weird).
Starting a new thread is (I think) always a "full barrier", or in C terms "synchronizes with" the new thread. Stuff in the new thread can't happen before anything in the parent thread. So just calling get_index once from the main thread makes it safe with no further barriers for other threads to run get_index after that.
You could make lazy init thread-safe with an _Atomic flag
This is similar to what gcc does for function-local static variables with non-constant initializers. Check out the code-gen for that if you're curious: a read-only check of an already-init flag and then a call to an init function that makes sure only one thread runs the initializer.
This requires an acquire load in the fast-path for the already-initialized state. That's free on x86 and SPARC-TSO (same asm as a normal load), but not on weaker ISAs. AArch64 has an acquire load instruction, other ISAs need some barrier instructions.
Turn your array_initialized flag into a 3-state _Atomic variable:
init not started (e.g. init == 0). Check for this with an acquire load.
init started but not finished (e.g. init == -1)
init finished (e.g. init == 1)
You can leave static int array[10]; itself non-atomic by making sure exactly 1 thread "claims" responsibility for doing the init, using atomic_compare_exchange_strong (which will succeed for exactly one thread). And then have other threads spin-wait for the INIT_FINISHED state.
Using initial state == 0 lets it be in the BSS, hopefully next to the data. Otherwise we might prefer INIT_FINISHED=0 for ISAs where branching on an int from memory being (non)zero is slightly more efficient than other numbers. (e.g. AArch64 cbnz, MIPS bne $reg, $zero).
We could get the best of both worlds (cheapest possible fast-path for the already-init case) while still having the flag in the BSS: Have the main thread write it with INIT_NOTSTARTED = -1 before starting any more threads.
Having the flag next to the array is helpful for a small array where the flag is probably in the same cache line as the data we want to index. Or at least the same 4k page.
#include <stdatomic.h>
#include <stdbool.h>
#ifdef __x86_64__
#include <immintrin.h>
#define SPINLOOP_BODY _mm_pause()
#else
#define SPINLOOP_BODY /**/
#endif
#ifdef __GNUC__
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)
#define NOINLINE __attribute__((noinline))
#else
#define unlikely(expr) (expr)
#define likely(expr) (expr)
#define NOINLINE /**/
#endif
enum init_states {
INIT_NOTSTARTED = 0,
INIT_STARTED = -1,
INIT_FINISHED = 1 // optional: make this 0 to speed up the fast-path on some ISAs, and store an INIT_NOTSTARTED before the first call
};
static int array[10];
static _Atomic int array_initialized = INIT_NOTSTARTED;
// called either before or during init.
// One thread claims responsibility for doing the init, others spin-wait
NOINLINE // this is rare, make sure it doesn't bloat the fast-path
void initialize(void) {
bool winner = false;
// check read-only if another thread has already claimed init
if (array_initialized == INIT_NOTSTARTED) {
int expected = INIT_NOTSTARTED;
winner = atomic_compare_exchange_strong(&array_initialized, &expected, INIT_STARTED);
// seq_cst memory order is fine. Weaker might be ok but it only has to run once
}
if (winner) {
array[0] = 1;
// ...
atomic_store_explicit(&array_initialized, INIT_FINISHED, memory_order_release);
} else {
// spin-wait for the winner in other threads
// yield(); optional.
// Or use some kind of mutex or condition var if init is really slow
// otherwise just spin on a seq_cst load. (Or acquire is fine.)
while(array_initialized != INIT_FINISHED)
SPINLOOP_BODY; // x86 only
// winner's release store syncs with our load:
// array[] stores Happened Before this point so we can read it without UB
}
}
int get_index(int index) {
// atomic acquire load is fine, doesn't need seq_cst. Cheaper than seq_cst on PowerPC
if (unlikely(atomic_load_explicit(&array_initialized, memory_order_acquire) != INIT_FINISHED))
initialize();
if (unlikely(index < 0 || index > 9)) return -1;
return array[index];
}
This does compile to correct-looking and efficient asm on Godbolt. Without unlikely() macros, gcc/clang think that at least the stand-alone version of get_index has initialize() and/or return -1 as the most likely fast-path.
And compilers wanted to inline the init function, which would be silly because it only runs once per thread at most. Hopefully profile-guided optimization would correct that.

Why does thread not recognize change of a flag?

I have a strange situation under C/Visual Studio on a Windows 7 platform. There is a problem from time to time and I spent a lot of time to find it. The problem is within a third party library, for which I have the complete code. There a thread is created (the printLog statements are from myself):
...
plafParams->eventThreadFlag = 2;
printLog("before CreateThread");
if (plafParams->hReadThread_p = CreateThread(NULL, 0, ( LPTHREAD_START_ROUTINE ) plafPortReadThread, ( void * ) dlmsInstance, 0,
&plafParams->portReadThreadID) )
{
printLog("after CreateThread: OK");
plafParams->eventThreadFlag = 3;
}
else
{
unsigned int lasterr = GetLastError();
printLog("error CreateThread, last error:%x", lasterr);
/* Could not create the read thread. */
...
...
return FAILURE;
}
printLog("SUCCESS");
...
...
The thread function is:
void *plafPortReadThread(DLMS_GLOBALS *dlmsInstance)
{
PLAF_PARAMS *plafParams;
plafParams = (PLAF_PARAMS *)(dlmsInstance->plafParams);
printLog("start - plafPortReadThread, plafParams->eventThreadFlag=%x", plafParams->eventThreadFlag);
while ((plafParams->eventThreadFlag != 1) && (plafParams->eventThreadFlag != 3))
{
if (plafParams->eventThreadFlag == 0)
{
printLog("exit 1 - plafPortReadThread, plafParams->eventThreadFlag=%x", plafParams->eventThreadFlag);
CloseHandle(plafParams->hReadThread_p);
plafFree((void **)&plafParams);
ExitThread(0);
break;
}
}
printLog("start - plafPortReadThread, proceed=%d", proceed);
...
Now, when the flag is set before the while loop is started within the thread, everything works OK:
SUCCESS
start - plafPortReadThread, plafParams->eventThreadFlag=3
But sometimes the thread is quick enough so the while loop is started before the flag is actually set within the outer part.
The output is then:
start - plafPortReadThread, plafParams->eventThreadFlag=2
SUCCESS
Most surprisingly the while loop doesn't exit, even after the flag has been set to 3.
It seems, that the compiler "optimizes" the flag and assumes, that it cannot be changed from outside.
What could be the problem? I'm really surprised. Or is there something else I have overlooked completely? I know that the code is not very elegant and that such things would be better done with semaphores or signals. But it is not my code and I want to change as little as possible.
After removing the whole while condition it works as expected.
Should I change the struct or its fields to volatile ? Everybody says, that volatile is useless in our days and not needed anymore, except in the case, where a memory location is changed by peripherals...
Prior to C11 this is totally platform-dependent, because the effect you are observing is due to the memory model used by your platform. This is different from a compiler optimization, as synchronization points between threads require the compiler to insert barrier instructions instead of, e.g., making something a constant. For C11, section 7.17.3 specifies the different memory orders. So your value is not optimized out statically; thread A just never reads the value thread B wrote, but still has its local value.
In practice many projects don't use C11 yet, and thus you will likely have to check the documentation of your platform. Note that in many cases you don't have to modify the type of the variable for the flag (in case you can't). Most memory models specify synchronization points that also forbid reordering of certain instructions, i.e. in:
int x = 3;
_Atomic int a = 1;
x = 5;
a = 2;
the compiler will often have to ensure that x has the value 3 when a has the value 1, and that when a is assigned the value 2, x will have the value 5. volatile does not participate in this relationship (in the C/C++11 models; this is often confused because it does participate in Java's happens-before), and is mostly useless, unless your writes should never be optimized out because they have side effects the compiler can't understand, such as blinking an LED:
volatile int x = 1; // some special location - blink then clear
x = 1; // blink then clear
x = 1; // blink then clear
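As a rough illustration of the C11 route, the flag could be declared atomic so that every loop iteration performs a real load. The names below (event_thread_flag, wait_for_start, signal_started) are hypothetical stand-ins for the library's fields, and this assumes a toolchain that provides <stdatomic.h>, which older Visual Studio versions do not ship.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for plafParams->eventThreadFlag; 2 means "starting". */
static atomic_int event_thread_flag = 2;

/* Called by the creating thread once CreateThread has succeeded. */
void signal_started(void)
{
    atomic_store_explicit(&event_thread_flag, 3, memory_order_release);
}

/* Called by the new thread. Returns true to proceed (flag is 1 or 3),
   false if it was told to exit (flag is 0). */
bool wait_for_start(void)
{
    for (;;) {
        /* An atomic load cannot be hoisted out of the loop by the compiler. */
        int f = atomic_load_explicit(&event_thread_flag, memory_order_acquire);
        if (f == 1 || f == 3)
            return true;
        if (f == 0)
            return false;
    }
}
```

Reading the flag once per iteration into a local also avoids the original loop's repeated re-reads seeing different values within one pass.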

A tested implementation of Peterson Lock algorithm?

Does anyone know of a good/correct implementation of Peterson's Lock algorithm in C? I can't seem to find this. Thanks.
Peterson's algorithm cannot be implemented correctly in C99, as explained in who ordered memory fences on an x86.
Peterson's algorithm is as follows:
LOCK:
    interested[id] = 1                  interested[other] = 1
    turn = other                        turn = id
    while turn == other                 while turn == id
      and interested[other] == 1          and interested[id] == 1
UNLOCK:
    interested[id] = 0                  interested[other] = 0
There are some hidden assumptions here. To begin with, each thread must note its interest in acquiring the lock before giving away its turn. Giving away the turn must make visible to the other thread that we are interested in acquiring the lock.
Also, as in every lock, memory accesses in the critical section cannot be hoisted past the lock() call, nor sunk past the unlock(). I.e.: lock() must have at least acquire semantics, and unlock() must have at least release semantics.
In C11, the simplest way to achieve this would be to use a sequentially consistent memory order, which makes the code run as if it were a simple interleaving of threads running in program order (WARNING: totally untested code, but it's similar to an example in Dmitriy V'jukov's Relacy Race Detector):
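#include <stdatomic.h>

static atomic_int interested[2];   // both start at 0 (not interested)
static atomic_int turn;

void lock(int id)
{
    atomic_store(&interested[id], 1);   // announce interest first
    atomic_store(&turn, 1 - id);        // then give away the turn
    while (atomic_load(&turn) == 1 - id
           && atomic_load(&interested[1 - id]) == 1)
        ;                               // spin while the other thread has priority
}

void unlock(int id)
{
    atomic_store(&interested[id], 0);
}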
This ensures that the compiler doesn't make optimizations that break the algorithm (by hoisting/sinking loads/stores across atomic operations), and emits the appropriate CPU instructions to ensure the CPU also doesn't break the algorithm. The default memory order for C11/C++11 atomic operations that don't explicitly select one is sequentially consistent.
C11/C++11 also support weaker memory models, allowing as much optimization as possible. The following is a translation to C11 of the translation to C++11 by Anthony Williams of an algorithm originally by Dmitriy V'jukov in the syntax of his own Relacy Race Detector
[petersons_lock_with_C++0x_atomics] [the-inscrutable-c-memory-model]. If this algorithm is incorrect it's my fault (WARNING: also untested code, but based on good code from Dmitriy V'jukov and Anthony Williams):
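#include <stdatomic.h>

static atomic_int interested[2];   // both start at 0 (not interested)
static atomic_int turn;

void lock(int id)
{
    atomic_store_explicit(&interested[id], 1, memory_order_relaxed);
    // RMW on turn: synchronizes with the previous exchange by either thread
    atomic_exchange_explicit(&turn, 1 - id, memory_order_acq_rel);
    while (atomic_load_explicit(&interested[1 - id], memory_order_acquire) == 1
           && atomic_load_explicit(&turn, memory_order_relaxed) == 1 - id)
        ;
}

void unlock(int id)
{
    atomic_store_explicit(&interested[id], 0, memory_order_release);
}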
Notice the exchange with acquire and release semantics. An exchange is an
atomic RMW operation. Atomic RMW operations always read the last value stored
before the write in the RMW operation. Also, an acquire on an atomic object
that reads a write from a release on that same atomic object (or any later
write on that object from the thread that performed the release or any later
write from any atomic RMW operation) creates a synchronizes-with relation
between the release and the acquire.
So, this operation is a synchronization point between the threads, there is
always a synchronizes-with relationship between the exchange in one thread and
the last exchange performed by any thread (or the initialization of turn, for
the very first exchange).
So we have a sequenced-before relationship between the store to interested[id]
and the exchange from/to turn, a synchronizes-with relationship between two
consecutive exchanges from/to turn, and a sequenced-before relationship
between the exchange from/to turn and the load of interested[1 - id]. This
amounts to a happens-before relationship between accesses to interested[x] in
different threads, with turn providing the synchronization between threads.
This forces all the ordering needed to make the algorithm work.
So how were these things done before C11? It involved using compiler and
CPU-specific magic. As an example, let's see the pretty strongly-ordered x86.
IIRC, all x86 loads have acquire semantics, and all stores have release
semantics (save non-temporal moves, in SSE, used precisely to achieve higher
performance at the cost of occasionally needing to issue CPU fences to achieve
coherence between CPUs). But this is not enough for Peterson's algorithm. As
Bartosz Milewski explains at
who-ordered-memory-fences-on-an-x86 ,
for Peterson's algorithm to work we need to establish an ordering between
accesses to turn and interested; failing to do that may result in seeing loads
from interested[1 - id] before writes to interested[id], which is a bad thing.
So a way to do that in GCC/x86 would be (WARNING: although I tested something similar to the following, actually a modified version of the code at wrong-implementation-of-petersons-algorithm , testing is nowhere near assuring correctness of multithreaded code):
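// Assumed declarations: plain ints shared between the two threads.
static int interested[2];
static int turn;

void lock(int id)
{
    interested[id] = 1;
    __asm__ __volatile__("" ::: "memory");   // compiler must not swap the two stores
    turn = 1 - id;
    // Full fence; the "memory" clobber also stops the compiler from moving
    // the stores above or the loads below across it.
    __asm__ __volatile__("mfence" ::: "memory");
    do {
        __asm__ __volatile__("" ::: "memory");   // force turn/interested to be reloaded
    } while (turn == 1 - id
             && interested[1 - id] == 1);
}

void unlock(int id)
{
    // Compiler barrier: keep critical-section accesses before the unlocking store.
    __asm__ __volatile__("" ::: "memory");
    interested[id] = 0;
}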
The MFENCE prevents stores and loads to different memory addresses from being
reordered. Otherwise the write to interested[id] could be queued in the store
buffer while the load of interested[1 - id] proceeds. On many current
microarchitectures a SFENCE may be enough, since it may be implemented as a
store buffer drain, but IIUC SFENCE doesn't need to be implemented that way,
and may simply prevent reordering between stores. So SFENCE may not be enough everywhere, and we need a full MFENCE.
The compiler barrier (__asm__ __volatile__("":::"memory")) prevents
the compiler from deciding that it already knows the value of turn. We're
telling the compiler that we've clobbered memory, so all values cached in
registers must be reloaded from memory.
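For illustration, a small sketch of that idiom in isolation, assuming GCC or Clang: the empty asm with a "memory" clobber forces the flag to be re-read on each iteration. The names COMPILER_BARRIER and spin_until_set are made up for this example, and the spin limit exists only so the sketch terminates. A barrier like this constrains the compiler only, not the CPU.

```c
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

/* Spin until *flag becomes nonzero or max_spins iterations pass.
   Returns the number of iterations spent waiting, or -1 on timeout. */
long spin_until_set(const int *flag, long max_spins)
{
    for (long i = 0; i < max_spins; i++) {
        COMPILER_BARRIER();      /* discard any register-cached copy of *flag */
        if (*flag != 0)
            return i;
    }
    return -1;
}
```

Without the barrier the compiler would be free to load *flag once before the loop and never look at memory again.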
P.S: I feel this needs a closing paragraph, but my brain is drained.
I won't make any assertions about how good or correct the implementation is, but it was tested (briefly). This is a straight translation of the algorithm described on wikipedia.
#include <assert.h>

struct petersonslock_t {
    volatile unsigned flag[2];
    volatile unsigned turn;
};
typedef struct petersonslock_t petersonslock_t;

petersonslock_t petersonslock (void) {
    petersonslock_t l = { { 0U, 0U }, ~0U };
    return l;
}

void petersonslock_lock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    l->flag[p] = 1;                 // announce interest
    l->turn = !p;                   // give the other thread priority
    while (l->flag[!p] && (l->turn == !p)) {}
}

void petersonslock_unlock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    l->flag[p] = 0;
}
Greg points out that on an SMP architecture with slightly relaxed memory coherency (such as x86), although the loads to the same memory location are in order, loads to different locations on one processor may appear out of order to the other processor.
Jens Gustedt and ninjalj recommend modifying the original algorithm to use the atomic_flag type. This means setting the flags and turns would use the atomic_flag_test_and_set and clearing them would use atomic_flag_clear from C11. Alternatively, a memory barrier could be imposed between updates to flag.
Edit: I originally attempted to correct for this by writing to the same memory location for all the states. ninjalj pointed out that the bitwise operations turned the state operations into RMW rather than load and stores of the original algorithm. So, atomic bitwise operations are required. C11 provides such operators, as does GCC with built-ins. The algorithm below uses GCC built-ins, but wrapped in macros so that it can easily be changed to some other implementation. However, modifying the original algorithm above is the preferred solution.
#include <assert.h>

struct petersonslock_t {
    // bits 16-23: flag[1], bits 8-15: flag[0], bits 0-7: turn
    volatile unsigned state;
};
typedef struct petersonslock_t petersonslock_t;

#define ATOMIC_OR(x,v) __sync_or_and_fetch(&(x), (v))
#define ATOMIC_AND(x,v) __sync_and_and_fetch(&(x), (v))

petersonslock_t petersonslock (void) {
    petersonslock_t l = { 0x000000U };
    return l;
}

void petersonslock_lock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    unsigned mask = (p == 0) ? 0xFF0000 : 0x00FF00;      // the other thread's flag
    ATOMIC_OR(l->state, (p == 0) ? 0x000100 : 0x010000); // set our own flag
    if (p == 0)
        ATOMIC_OR(l->state, 0x000001);                   // turn = 1
    else
        ATOMIC_AND(l->state, 0xFFFF00);                  // turn = 0
    while ((l->state & mask) && (l->state & 0x0000FF) == !p) {}
}

void petersonslock_unlock (petersonslock_t *l, int p) {
    assert(p == 0 || p == 1);
    ATOMIC_AND(l->state, (p == 0) ? 0xFF00FF : 0x00FFFF); // clear our own flag
}

Resources