Trying to implement a spin-lock via "lock xchg" assembly - c

Basically, I'm trying to run void spinlock_counter(int) in two threads, and count ought to end up at 2000 (the parameter doesn't do anything, I was just too lazy to remove it). However, when I set a breakpoint in the critical section, i.e. at "count++", and printed "flag", the flag was "GO" (it should be "BLOCK" if everything worked).
I don't get why this doesn't work.
Thank you for your answers!
code:
int xchg(volatile int *addr, int newval){
    int result;
    asm volatile("lock xchg %0, %1"
                 : "+m"(*addr),
                   "=a"(result)
                 : "1"(newval)
                 : "cc");
    return result;
}
#define GO 0
#define BLOCK 1

int flag = GO;
int count = 0;

void lock(int *addr){
    int note = BLOCK;
    while(!xchg(addr, note));
}

void unlock_spin(int *addr){
    int note = GO;
    xchg(addr, note);
}

void spinlock_counter(int a){
    while(1) {
        lock(&flag);
        count++;
        unlock_spin(&flag);
    }
    printf("%d\n", count);
}

The condition in your while loop (in lock()) is backwards.
The consequence is that lock() effectively does nothing. If another thread already acquired the lock it won't spin at all, and if the lock was GO it'll spin once (setting the lock to BLOCK on the first iteration so that the spinning stops on the 2nd iteration).
I'd recommend writing code that describes your intent in a clearer, less confusing, and less error-prone way (and never relying on the value of true/false in C) so that these kinds of bugs are prevented. E.g.:
void lock(int *addr){
    int note = BLOCK;
    while(xchg(addr, note) != GO);
}
However I made a breakpoint in the critical zone i.e. "count++" and printed "flag", the flag is "GO"
This is probably a separate bug (although the first bug can let a different thread unlock right after your thread locks, that wouldn't be easily reproducible). Specifically, nothing prevents the compiler from re-ordering your code, so you'll want some kind of barrier to stop the compiler from transforming your code into the equivalent of lock(&flag); unlock_spin(&flag); and then count++;. Changing the clobber list of your inline assembly to "cc", "memory" should prevent the compiler from reordering memory accesses across the asm statement (and therefore across the lock/unlock calls).
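Applied to the code in the question, a sketch might look like this (same operands as the original, just the extra "memory" clobber and the corrected loop condition; a sketch, not tested):

int xchg(volatile int *addr, int newval){
    int result;
    asm volatile("lock xchg %0, %1"
                 : "+m"(*addr),
                   "=a"(result)
                 : "1"(newval)
                 : "cc", "memory");   /* "memory" keeps the compiler from moving
                                         count++ across the lock/unlock calls */
    return result;
}

void lock(int *addr){
    while(xchg(addr, BLOCK) != GO);   /* spin until we observed GO and swapped in BLOCK */
}

void unlock_spin(int *addr){
    xchg(addr, GO);                   /* store GO; on x86, xchg with a memory operand is implicitly locked */
}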

Related

Memory ordering for a spin-lock "call once" implementation

Suppose I wanted to implement a mechanism for calling a piece of code exactly once (e.g. for initialization purposes), even when multiple threads hit the call site repeatedly. Basically, I'm trying to implement something like pthread_once, but with GCC atomics and spin-locking. I have a candidate implementation below, but I'd like to know if
a) it could be faster in the common case (i.e. already initialized), and,
b) is the selected memory ordering strong enough / too strong?
Architectures of interest are x86_64 (primarily) and aarch64.
The intended API usage is something like this:
void gets_called_many_times_from_many_threads(void)
{
    static int my_once_flag = 0;
    if (once_enter(&my_once_flag)) {
        // do one-time initialization here
        once_commit(&my_once_flag);
    }
    // do other things that assume the initialization has taken place
}
And here is the implementation:
int once_enter(int *b)
{
    int zero = 0;
    int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
    if (got_lock) return 1;
    while (2 != __atomic_load_n(b, __ATOMIC_ACQUIRE)) {
        // on x86, insert a pause instruction here
    };
    return 0;
}

void once_commit(int *b)
{
    (void) __atomic_store_n(b, 2, __ATOMIC_RELEASE);
}
I think that the RELAXED ordering on the compare exchange is okay, because we don't skip the atomic load in the while condition even if the compare-exchange gives us 2 (in the "zero" variable), so the ACQUIRE on that load synchronizes with the RELEASE in once_commit (I think), but maybe on a successful compare-exchange we need to use RELEASE? I'm unclear here.
Also, I just learned that lock cmpxchg is a full memory barrier on x86, and since we are hitting the __atomic_compare_exchange_n in the common case (initialization has already been done), that barrier is incurred on every function call. Is there an easy way to avoid this?
UPDATE
Based on the comments and accepted answer, I've come up with the following modified implementation. If anybody spots a bug please let me know, but I believe it's correct. Basically, the change amounts to implementing double-checked locking. I also switched to using SEQ_CST because:
I mainly care that the common (already initialized) case is fast.
I observed that GCC doesn't emit a memory fence instruction on x86 for the first read (and it does do so on ARM even with ACQUIRE).
#ifdef __x86_64__
#define PAUSE() __asm __volatile("pause")
#else
#define PAUSE()
#endif
int once_enter(int *b)
{
    if (2 == __atomic_load_n(b, __ATOMIC_SEQ_CST)) return 0;
    int zero = 0;
    int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    if (got_lock) return 1;
    while (2 != __atomic_load_n(b, __ATOMIC_SEQ_CST)) {
        PAUSE();
    };
    return 0;
}

void once_commit(int *b)
{
    (void) __atomic_store_n(b, 2, __ATOMIC_SEQ_CST);
}
a) What you need is a double-checked lock.
Basically, instead of entering the lock every time, you do an acquiring-load to see if the initialisation has been done yet, and only invoke once_enter if it has not.
void gets_called_many_times_from_many_threads(void)
{
    static int my_once_flag = 0;
    if (__atomic_load_n(&my_once_flag, __ATOMIC_ACQUIRE) != 2) {
        if (once_enter(&my_once_flag)) {
            // do one-time initialization here
            once_commit(&my_once_flag);
        }
    }
    // do other things that assume the initialization has taken place
}
b) I believe this is enough: your initialisation happens before the releasing store of 2 to my_once_flag, and every other thread has to observe the value 2 via an acquiring load from the same variable.
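For comparison, here is a minimal sketch of the same usage with pthread_once, which the question mentions as the model (assuming POSIX threads are available rather than raw GCC atomics):

#include <pthread.h>

static pthread_once_t my_once = PTHREAD_ONCE_INIT;

static void do_init(void)
{
    // do one-time initialization here
}

void gets_called_many_times_from_many_threads(void)
{
    pthread_once(&my_once, do_init); // runs do_init exactly once; all callers synchronize with it
    // do other things that assume the initialization has taken place
}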

would this be a functioning spinlock in c?

I am currently learning about operating systems and was wondering: using these definitions, would this work as expected, or am I missing some atomic operations?
int locked = 0;

void lsLock(){
    while(1) {
        if (locked){
            continue;
        } else {
            locked = 1;
            break;
        }
    }
}

void lsUnlock(){
    locked = 0;
}
Thanks in advance!
The first problem is that the compiler assumes that nothing will be altered by anything "external" (unless it's marked volatile), which means the compiler is able to optimize this:
int locked = 0;

void lsLock(){
    while(1) {
        if (locked){
            continue;
        } else {
            locked = 1;
            break;
        }
    }
}
..into this:
int locked = 0;

void lsLock(){
    if (locked){
        while(1) { }
    }
    locked = 1;
}
Obviously that won't work if something else modifies locked - once the while(1) {} starts it won't stop. To fix that, you could use volatile int locked = 0;.
This only prevents the compiler from assuming locked doesn't change. Nothing prevents a theoretical CPU from playing similar tricks (e.g. even if volatile is used, a CPU without cache coherence might never notice that a different CPU altered locked). For a guarantee you need to use atomics or something else (e.g. memory barriers).
However, with volatile int locked it may work, especially on common 80x86 CPUs. Please note that "works" can be considered the worst possibility - it leads you to assume that the code is fine, and then to a spectacular (and nearly impossible to debug, due to timing issues) disaster when you compile the same code for a different CPU.
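For reference, a minimal sketch of the same interface using C11 atomics (assuming <stdatomic.h> is available), which covers both the compiler and the CPU:

#include <stdatomic.h>

static atomic_flag locked = ATOMIC_FLAG_INIT;

void lsLock(void){
    // test-and-set returns the previous value; spin until it was clear
    while (atomic_flag_test_and_set_explicit(&locked, memory_order_acquire))
        ; // optionally pause/yield here
}

void lsUnlock(void){
    atomic_flag_clear_explicit(&locked, memory_order_release);
}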

Lazy-init an array with multi-threaded readers: is it safe without barriers or atomics?

I've been having an implementation discussion where the idea that a CPU can choose to completely reorder the storing of memory has come up.
I was initializing a static array in C using code similar to:
static int array[10];
static int array_initialized = 0;

void initialize () {
    array[0] = 1;
    array[1] = 2;
    ...
    array_initialized = -1;
}
and it is used later similar to:
int get_index(int index) {
    if (!array_initialized) initialize();
    if (index < 0 || index > 9) return -1;
    return array[index];
}
Is it possible for the CPU to reorder memory accesses on a multi-core Intel architecture (or another architecture) such that it sets array_initialized before the initialize function has finished setting the array elements? Or so that another execution thread can see array_initialized as non-zero before the entire array has been initialized in its view of memory?
TL:DR: to make lazy-init safe if you don't do it before starting multiple threads, you need an _Atomic flag.
is it possible for the CPU to reorder memory access in a multi-core Intel (x86) architecture
No, such reordering is possible at compile time only. x86 asm effectively has acquire/release semantics for normal loads/stores. (seq_cst + a store buffer with store forwarding).
https://preshing.com/20120625/memory-ordering-at-compile-time/
(or other architecture)
Yes, most other ISAs have a weaker asm memory model that does allow StoreStore reordering and LoadLoad reordering. (Effectively memory_order_relaxed, or sort of like memory_order_consume on ISAs other than Alpha AXP, but compilers don't try to maintain data dependencies.)
None of this really matters from C because the C memory model is very weak, allowing compile-time reordering and simultaneous read/write or write+write of any object is data-race UB.
Data Race UB is what lets a compiler keep static variables in registers for the life of a function / inside a loop when compiling for "normal" ISAs.
Having 2 threads run this function is C data-race UB if array_initialized isn't already set before either of them run. (e.g. by having the main thread run it once before starting any more threads). And remove the array_initialized flag entirely, unless you have a use for the lazy-init feature before starting any more threads.
It's 100% safe for a single thread, regardless of how many other threads are running: the C programming model guarantees that a single thread always sees its own operations in program order. (Just like asm for all normal ISAs; other than explicit parallelism in ISAs like Itanium, you always see your own operations in order. It's only other threads seeing your operations where things get weird).
Starting a new thread is (I think) always a "full barrier", or in C terms "synchronizes with" the new thread. Stuff in the new thread can't happen before anything in the parent thread. So just calling get_index once from the main thread makes it safe with no further barriers for other threads to run get_index after that.
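A minimal sketch of that approach, assuming POSIX threads and a hypothetical worker() that calls get_index() later:

#include <pthread.h>

static void *worker(void *arg){
    (void)arg;
    // ... calls get_index() freely; init already happened ...
    return NULL;
}

int main(void){
    get_index(0);  // lazy init runs here, single-threaded: no data race
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);  // thread creation synchronizes-with the new threads
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}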
You could make lazy init thread-safe with an _Atomic flag
This is similar to what gcc does for function-local static variables with non-constant initializers. Check out the code-gen for that if you're curious: a read-only check of an already-init flag and then a call to an init function that makes sure only one thread runs the initializer.
This requires an acquire load in the fast-path for the already-initialized state. That's free on x86 and SPARC-TSO (same asm as a normal load), but not on weaker ISAs. AArch64 has an acquire load instruction, other ISAs need some barrier instructions.
Turn your array_initialized flag into a 3-state _Atomic variable:
init not started (e.g. init == 0). Check for this with an acquire load.
init started but not finished (e.g. init == -1)
init finished (e.g. init == 1)
You can leave static int array[10]; itself non-atomic by making sure exactly 1 thread "claims" responsibility for doing the init, using atomic_compare_exchange_strong (which will succeed for exactly one thread). And then have other threads spin-wait for the INIT_FINISHED state.
Using initial state == 0 lets it be in the BSS, hopefully next to the data. Otherwise we might prefer INIT_FINISHED=0 for ISAs where branching on an int from memory being (non)zero is slightly more efficient than other numbers. (e.g. AArch64 cbnz, MIPS bne $reg, $zero).
We could get the best of both worlds (cheapest possible fast-path for the already-init case) while still having the flag in the BSS: Have the main thread write it with INIT_NOTSTARTED = -1 before starting any more threads.
Having the flag next to the array is helpful for a small array where the flag is probably in the same cache line as the data we want to index. Or at least the same 4k page.
#include <stdatomic.h>
#include <stdbool.h>

#ifdef __x86_64__
#include <immintrin.h>
#define SPINLOOP_BODY _mm_pause()
#else
#define SPINLOOP_BODY /**/
#endif

#ifdef __GNUC__
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)
#define NOINLINE __attribute__((noinline))
#else
#define unlikely(expr) (expr)
#define likely(expr) (expr)
#define NOINLINE /**/
#endif

enum init_states {
    INIT_NOTSTARTED = 0,
    INIT_STARTED = -1,
    INIT_FINISHED = 1  // optional: make this 0 to speed up the fast-path on some ISAs, and store an INIT_NOTSTARTED before the first call
};

static int array[10];
static _Atomic int array_initialized = INIT_NOTSTARTED;

// called either before or during init.
// One thread claims responsibility for doing the init, others spin-wait
NOINLINE  // this is rare, make sure it doesn't bloat the fast-path
void initialize(void) {
    bool winner = false;
    // check read-only if another thread has already claimed init
    if (array_initialized == INIT_NOTSTARTED) {
        int expected = INIT_NOTSTARTED;
        winner = atomic_compare_exchange_strong(&array_initialized, &expected, INIT_STARTED);
        // seq_cst memory order is fine. Weaker might be ok but it only has to run once
    }

    if (winner) {
        array[0] = 1;
        // ...
        atomic_store_explicit(&array_initialized, INIT_FINISHED, memory_order_release);
    } else {
        // spin-wait for the winner in other threads
        // yield(); optional.
        // Or use some kind of mutex or condition var if init is really slow
        // otherwise just spin on a seq_cst load. (Or acquire is fine.)
        while (array_initialized != INIT_FINISHED)
            SPINLOOP_BODY;  // x86 only
        // winner's release store syncs with our load:
        // array[] stores Happened Before this point so we can read it without UB
    }
}

int get_index(int index) {
    // atomic acquire load is fine, doesn't need seq_cst. Cheaper than seq_cst on PowerPC
    if (unlikely(atomic_load_explicit(&array_initialized, memory_order_acquire) != INIT_FINISHED))
        initialize();

    if (unlikely(index < 0 || index > 9)) return -1;
    return array[index];
}
This does compile to correct-looking and efficient asm on Godbolt. Without unlikely() macros, gcc/clang think that at least the stand-alone version of get_index has initialize() and/or return -1 as the most likely fast-path.
And compilers wanted to inline the init function, which would be silly because it only runs once per thread at most. Hopefully profile-guided optimization would correct that.

Why does thread not recognize change of a flag?

I have a strange situation with C/Visual Studio on a Windows 7 platform. A problem shows up from time to time, and I spent a lot of time finding it. The problem is within a third-party library, for which I have the complete code. There a thread is created (the printLog statements are mine):
...
plafParams->eventThreadFlag = 2;
printLog("before CreateThread");
if (plafParams->hReadThread_p = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE) plafPortReadThread, (void *) dlmsInstance, 0,
                                             &plafParams->portReadThreadID))
{
    printLog("after CreateThread: OK");
    plafParams->eventThreadFlag = 3;
}
else
{
    unsigned int lasterr = GetLastError();
    printLog("error CreateThread, last error:%x", lasterr);
    /* Could not create the read thread. */
    ...
    ...
    return FAILURE;
}
printLog("SUCCESS");
...
...
The thread function is:
void *plafPortReadThread(DLMS_GLOBALS *dlmsInstance)
{
    PLAF_PARAMS *plafParams;
    plafParams = (PLAF_PARAMS *)(dlmsInstance->plafParams);
    printLog("start - plafPortReadThread, plafParams->eventThreadFlag=%x", plafParams->eventThreadFlag);
    while ((plafParams->eventThreadFlag != 1) && (plafParams->eventThreadFlag != 3))
    {
        if (plafParams->eventThreadFlag == 0)
        {
            printLog("exit 1 - plafPortReadThread, plafParams->eventThreadFlag=%x", plafParams->eventThreadFlag);
            CloseHandle(plafParams->hReadThread_p);
            plafFree((void **)&plafParams);
            ExitThread(0);
            break;
        }
    }
    printLog("start - plafPortReadThread, proceed=%d", proceed);
    ...
Now, when the flag is set before the while loop is started within the thread, everything works OK:
SUCCESS
start - plafPortReadThread, plafParams->eventThreadFlag=3
But sometimes the thread is quick enough that the while loop starts before the flag is actually set by the outer code.
The output is then:
start - plafPortReadThread, plafParams->eventThreadFlag=2
SUCCESS
Most surprisingly the while loop doesn't exit, even after the flag has been set to 3.
It seems that the compiler "optimizes" the flag access and assumes that it cannot be changed from outside.
What could be the problem? I'm really surprised. Or is there something else I have overlooked completely? I know the code is not very elegant and that such things would better be done with semaphores or signals, but it is not my code and I want to change as little as possible.
After removing the whole while condition it works as expected.
Should I change the struct or its fields to volatile? Everybody says that volatile is useless these days and not needed anymore, except where a memory location is changed by peripherals...
Prior to C11 this is totally platform-dependent, because the effect you are observing is due to the memory model used by your platform. This is different from a compiler optimization, as synchronization points between threads require the compiler to insert barrier instructions instead of, e.g., making something a constant. For C11, section 7.17.3 specifies the different models. So your value is not optimized out statically; thread A just never reads the value thread B wrote, but keeps using its local copy.
In practice many projects don't use C11 yet, and thus you will likely have to check the documentation of your platform. Note that in many cases you don't have to modify the type of the flag variable (in case you can't). Most memory models specify synchronization points that also forbid reordering of certain instructions, e.g. in:
int x = 3;
_Atomic int a = 1;
x = 5;
a = 2;
the compiler will often have to ensure that x has the value 3 when a has the value 1, and that x already has the value 5 when a is assigned the value 2. volatile does not participate in this relationship (in the C11/C++11 models - this is often confused because it does participate in Java's happens-before), and it is mostly useless, unless your writes must never be optimized out because they have side effects the compiler can't understand, such as making an LED blink:
volatile int x = 1; // some special location - blink then clear
x = 1; // blink then clear
x = 1; // blink then clear
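Coming back to the original problem: if your toolchain supports C11, a self-contained model of the flag handshake with _Atomic looks like the following sketch (it uses POSIX threads instead of CreateThread, and the names only mirror the question, not the actual library):

#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

static _Atomic int eventThreadFlag = 2;

static void *readThread(void *arg)
{
    (void)arg;
    // The compiler must re-read an atomic on every iteration, so the
    // writer's update is eventually observed and the loop terminates.
    while (atomic_load_explicit(&eventThreadFlag, memory_order_acquire) != 3)
        ;
    printf("saw flag == 3\n");
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, readThread, NULL);
    atomic_store_explicit(&eventThreadFlag, 3, memory_order_release);
    pthread_join(t, NULL);
    return 0;
}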

volatile and variable modification in C

A structure TsMyStruct is given as a parameter to some functions:
typedef struct
{
    uint16_t inc1;
    uint16_t inc2;
} TsMyStruct;

void func1(TsMyStruct* myStruct)
{
    myStruct->inc1 += 1;
}

void func2(TsMyStruct* myStruct)
{
    myStruct->inc1 += 2;
    myStruct->inc2 += 3;
}
func1 is called in non-interrupt context and func2 is called in interrupt context; the call stack of func2 has an interrupt vector as its origin. The C compiler does not know that func2 can be called (though the code isn't considered "unused", since the linker needs it in the interrupt vector table memory section), so code reading myStruct->inc2 outside func2 could be optimized in a way that prevents myStruct->inc2 from being reloaded from RAM. This is true for basic C types, but is it true for the inc2 structure member or for an array...? Is it true for function parameters?
As a general rule, can I say "every memory zone (of basic type? or not?) modified in interrupt context and read elsewhere must be declared as volatile"?
Yes, any memory that is used both inside and outside of an interrupt handler should be volatile, including structs and arrays, and pointers passed as function parameters. Assuming that you are targeting a single-core device, you do not need additional synchronization.
Still, you have to consider that func1 could be interrupted anywhere, which may lead to inconsistent results if you're not careful. For instance, consider this:
void func1(volatile TsMyStruct* myStruct)
{
    myStruct->inc1 += 1;
    if (myStruct->inc1 == 4)
    {
        print(myStruct->inc1); // assume "print" exists
    }
}

void func2(volatile TsMyStruct* myStruct)
{
    myStruct->inc1 += 2;
    myStruct->inc2 += 3;
}
Since interrupts are asynchronous, this could print numbers different from 4. That would happen, for instance, if func1 is interrupted after the check but before the print call.
No, volatile is not enough. You have to set an optimization barrier both for the compiler (which can be volatile) and for the processor. E.g. when one CPU core writes data, it can sit in some cache and not be visible to another core.
Usually you need some locking in your code (spin locks, or a mutex). Such functions usually contain an optimization barrier, so you do not need volatile.
Your code is racy; with proper locking it would look like:
void func1(TsMyStruct* myStruct)
{
    lock();
    myStruct->inc1 += 1;
    unlock();
}

void func2(TsMyStruct* myStruct)
{
    lock();
    myStruct->inc1 += 2;
    myStruct->inc2 += 3;
    unlock();
}
and the lock() + unlock() functions contain optimization barriers (e.g. __asm__ __volatile__("" ::: "memory") or just a call to a global function) which will cause the compiler to reload myStruct.
For nitpicking: lock() and unlock() are expected to do the right thing (e.g. disable IRQs). Real-world implementations would be e.g. spin_lock_irqsave() + spin_unlock_irqrestore() in Linux.
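As a rough illustration of that point, on a single-core bare-metal target the lock()/unlock() pair might look like the following sketch (disable_irq()/enable_irq() are hypothetical placeholders for whatever your toolchain provides, e.g. __disable_irq()/__enable_irq() with CMSIS on Cortex-M):

extern void disable_irq(void);  /* hypothetical: mask interrupts */
extern void enable_irq(void);   /* hypothetical: unmask interrupts */

static inline void lock(void)
{
    disable_irq();                          /* func2 (the ISR) cannot run inside the critical section */
    __asm__ __volatile__("" ::: "memory");  /* compiler barrier: don't cache myStruct fields across this */
}

static inline void unlock(void)
{
    __asm__ __volatile__("" ::: "memory");  /* compiler barrier: emit the writes before IRQs come back */
    enable_irq();
}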

Resources