Would this be a functioning spinlock in C?

I am currently learning about operating systems and was wondering: using these definitions, would this work as expected, or am I missing some atomic operations?
    int locked = 0;

    void lsLock(){
        while(1) {
            if (locked){
                continue;
            } else {
                locked = 1;
                break;
            }
        }
    }

    void lsUnlock(){
        locked = 0;
    }
Thanks in advance!

The first problem is that the compiler assumes that nothing will be altered by anything "external" (unless it's marked volatile), which means that the compiler is able to optimize this:
    int locked = 0;

    void lsLock(){
        while(1) {
            if (locked){
                continue;
            } else {
                locked = 1;
                break;
            }
        }
    }
..into this:
    int locked = 0;

    void lsLock(){
        if (locked){
            while(1) { }
        }
        locked = 1;
    }
Obviously that won't work if something else modifies locked - once the while(1) {} starts it won't stop. To fix that, you could use volatile int locked = 0;.
This only prevents the compiler from assuming locked didn't change. Nothing prevents a theoretical CPU from playing similar tricks (e.g. even with volatile, a CPU without cache coherency might never notice that a different CPU altered locked). There is also a more fundamental problem: the test and the set are two separate operations, so two threads can both see locked == 0 and both set it to 1, each believing it holds the lock. For a guarantee you need an atomic read-modify-write (e.g. test-and-set or compare-and-swap) or something else (e.g. memory barriers).
However, with volatile int locked it may appear to work, especially on common 80x86 CPUs. Please note that "appears to work" can be considered the worst possibility - it leads to assuming that the code is fine and then having a spectacular (and nearly impossible to debug, due to timing issues) disaster when you compile the same code for a different CPU.
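For comparison, here is a minimal sketch (my addition, not part of the original answer) of a spinlock built on C11 <stdatomic.h>; atomic_flag_test_and_set_explicit makes the test and the set one indivisible operation, and the acquire/release ordering handles the visibility that volatile alone does not:

    #include <stdatomic.h>

    static atomic_flag locked = ATOMIC_FLAG_INIT;

    void lsLock(void) {
        /* atomic test-and-set: returns the previous value, so we spin
           until we are the thread that changed it from clear to set */
        while (atomic_flag_test_and_set_explicit(&locked, memory_order_acquire)) {
            /* spin; a pause/yield hint could go here */
        }
    }

    void lsUnlock(void) {
        atomic_flag_clear_explicit(&locked, memory_order_release);
    }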

Related

Memory ordering for a spin-lock "call once" implementation

Suppose I wanted to implement a mechanism for calling a piece of code exactly once (e.g. for initialization purposes), even when multiple threads hit the call site repeatedly. Basically, I'm trying to implement something like pthread_once, but with GCC atomics and spin-locking. I have a candidate implementation below, but I'd like to know if
a) it could be faster in the common case (i.e. already initialized), and,
b) is the selected memory ordering strong enough / too strong?
Architectures of interest are x86_64 (primarily) and aarch64.
The intended use API is something like this
    void gets_called_many_times_from_many_threads(void)
    {
        static int my_once_flag = 0;
        if (once_enter(&my_once_flag)) {
            // do one-time initialization here
            once_commit(&my_once_flag);
        }
        // do other things that assume the initialization has taken place
    }
And here is the implementation:
    int once_enter(int *b)
    {
        int zero = 0;
        int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
        if (got_lock) return 1;
        while (2 != __atomic_load_n(b, __ATOMIC_ACQUIRE)) {
            // on x86, insert a pause instruction here
        };
        return 0;
    }

    void once_commit(int *b)
    {
        (void) __atomic_store_n(b, 2, __ATOMIC_RELEASE);
    }
I think that the RELAXED ordering on the compare-exchange is okay, because we don't skip the atomic load in the while condition even if the compare-exchange gives us 2 (in the "zero" variable), so the ACQUIRE on that load synchronizes with the RELEASE in once_commit (I think). But maybe on a successful compare-exchange we need to use RELEASE? I'm unclear here.
Also, I just learned that lock cmpxchg is a full memory barrier on x86, and since we are hitting the __atomic_compare_exchange_n in the common case (initialization has already been done), that barrier occurs on every function call. Is there an easy way to avoid this?
UPDATE
Based on the comments and accepted answer, I've come up with the following modified implementation. If anybody spots a bug please let me know, but I believe it's correct. Basically, the change amounts to implementing double-checked locking. I also switched to using SEQ_CST because:
I mainly care that the common (already initialized) case is fast.
I observed that GCC doesn't emit a memory fence instruction on x86 for the first read (and it does do so on ARM even with ACQUIRE).
    #ifdef __x86_64__
    #define PAUSE() __asm __volatile("pause")
    #else
    #define PAUSE()
    #endif

    int once_enter(int *b)
    {
        if (2 == __atomic_load_n(b, __ATOMIC_SEQ_CST)) return 0;
        int zero = 0;
        int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
        if (got_lock) return 1;
        while (2 != __atomic_load_n(b, __ATOMIC_SEQ_CST)) {
            PAUSE();
        };
        return 0;
    }

    void once_commit(int *b)
    {
        (void) __atomic_store_n(b, 2, __ATOMIC_SEQ_CST);
    }
a) What you need is a double-checked lock.
Basically, instead of entering the lock every time, you do an acquiring-load to see if the initialisation has been done yet, and only invoke once_enter if it has not.
    void gets_called_many_times_from_many_threads(void)
    {
        static int my_once_flag = 0;
        if (__atomic_load_n(&my_once_flag, __ATOMIC_ACQUIRE) != 2) {
            if (once_enter(&my_once_flag)) {
                // do one-time initialization here
                once_commit(&my_once_flag);
            }
        }
        // do other things that assume the initialization has taken place
    }
b) I believe this is enough: your initialisation happens before the releasing store of 2 to my_once_flag, and every other thread has to observe the value 2 with an acquiring load from the same variable.

Trying to implement a spin-lock via "lock xchg" assembly

Basically, I'm trying to run void spinlock_counter(int) in two threads, and count ought to be 2000 (the parameter doesn't do anything; I'm just too lazy). However, I set a breakpoint in the critical section, i.e. at "count++", and printed "flag"; the flag is "GO" (the flag should be "BLOCK" if everything worked).
I don't get why this doesn't work.
Thank you for your answers!
code:
    #include <stdio.h>   /* for printf */

    int xchg(volatile int *addr, int newval){
        int result;
        asm volatile("lock xchg %0, %1"
                     : "+m"(addr),
                       "=a"(result)
                     : "1"(newval)
                     : "cc");
        return result;
    }

    #define GO 0
    #define BLOCK 1

    int flag = GO;
    int count = 0;   /* shared counter */

    void lock(int *addr){
        int note = BLOCK;
        while(!xchg(addr, note));
    }

    void unlock_spin(int *addr){
        int note = GO;
        xchg(addr, note);
    }

    void spinlock_counter(int a){
        while(1) {
            lock(&flag);
            count++;
            unlock_spin(&flag);
        }
        printf("%d\n", count);
    }
The condition in your while loop (in lock()) is backwards.
The consequence is that lock() effectively does nothing. If another thread already acquired the lock it won't spin at all, and if the lock was GO it'll spin once (setting the lock to BLOCK on the first iteration so that the spinning stops on the 2nd iteration).
I'd recommend writing code that describes your intent in a clearer/less confusing/less error prone way (and never relying on the value of true/false in C) so that these kinds of bugs are prevented. E.g.:
    void lock(int *addr){
        int note = BLOCK;
        while(xchg(addr, note) != GO);
    }
However I made a breakpoint in the critical zone i.e. "count++" and printed "flag", the flag is "GO"
This is probably a separate bug (although the first bug can cause a different thread to unlock after your thread locks, that wouldn't be easily reproducible). Specifically, nothing prevents the compiler from re-ordering your code, and you'll want some kind of barrier to prevent the compiler transforming your code into the equivalent of lock(&flag); unlock_spin(&flag); then count++;. Adding "memory" to the clobber list of your inline assembly (i.e. : "cc", "memory") should prevent the compiler from reordering memory accesses across the xchg.
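As a sketch of what that might look like (my addition, not part of the original answer; the constraints shown are one reasonable way to write it), note that the memory operand should be *addr rather than addr so the exchange targets the flag itself rather than the local pointer, and that xchg with a memory operand is implicitly locked, so the lock prefix can be dropped:

    int xchg(volatile int *addr, int newval){
        int result;
        /* atomically swap newval with *addr and return the old value;
           the "memory" clobber acts as a compiler barrier so accesses to
           the protected data can't be moved out of the critical section */
        asm volatile("xchg %0, %1"
                     : "+m"(*addr), "=a"(result)
                     : "1"(newval)
                     : "cc", "memory");
        return result;
    }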

Optimization with interrupts or threads and global variables

I find that I have some difficulty with how to best write communication between functions that are out of the normal flow of code. A simple example is:
    int a = 0;
    volatile int v = 0;

    void __attribute__((used)) interrupt() {
        a++;
        v++;
    }

    int main() {
        while(1) {
            // asm("nop");
            a--;
            v--;
            if (v > 10 && a > 10)
                break;
        }
        return 0;
    }
It is not surprising that the main while loop can keep the a variable in a register and thus never see any changes from the interrupt. If the variable is volatile then it is annoying in that every time it is used it needs to be reread from, or rewritten to, memory; and with that technique every variable used for communication across threads would need to be volatile. A synchronization primitive (or even the commented-out "nop") solves the problem because it seemingly has a side effect that creates a compiler barrier. But if I understand correctly that would mean flushing the entire state of all the registers used in main, where maybe it's less harsh to just have a few variables as volatile. I currently use both techniques, but I wish I had a more standard method for dealing with the issue. Can anyone comment on best strategies here?
A link to some assembly
So you want a means of reducing the number of times a is looked up. The following reduces it to once per loop iteration:
    volatile int a = 0;
    volatile int v = 0;

    void __attribute__((used)) interrupt() {
        a++;
        v++;
    }

    int main() {
        while(1) {
            int b = --a;
            --v;
            if (v > 10 && b > 10)
                break;
        }
        return 0;
    }
Nothing stops you from checking even less often in the same way.
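As an alternative sketch to volatile (my addition, not from the answer), C11 <stdatomic.h> with relaxed ordering makes the communication points explicit while leaving the compiler free to keep everything else in registers; this assumes the target supports lock-free atomic int:

    #include <stdatomic.h>

    atomic_int a = 0;
    atomic_int v = 0;

    void __attribute__((used)) interrupt() {
        atomic_fetch_add_explicit(&a, 1, memory_order_relaxed);
        atomic_fetch_add_explicit(&v, 1, memory_order_relaxed);
    }

    int main() {
        while (1) {
            /* each atomic access is a real load/store, but only where written */
            int b = atomic_fetch_sub_explicit(&a, 1, memory_order_relaxed) - 1;
            int w = atomic_fetch_sub_explicit(&v, 1, memory_order_relaxed) - 1;
            if (w > 10 && b > 10)
                break;
        }
        return 0;
    }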

Is this a decent home-made mutex implementation? Criticism? Potential Problems?

I'm wondering if anyone sees anything that would likely cause problems in this code. I know there are other ways/API calls I could have used to do this, but I'm trying to lay the foundation for my own platform-independent / cross-platform mutex framework.
Obviously I need to do some #ifdef's and define some macros for the Win32 Sleep() and GetCurrentThreadID() calls...
    typedef struct aec {
        unsigned long long lastaudibleframe; /* time stamp of last audible frame */
        unsigned short aiws;                 /* Average mike input when speaker is playing */
        unsigned short aiwos;                /* Average mike input when speaker ISN'T playing */
        unsigned long long t_aiws, t_aiwos;  /* Internal running totals */
        unsigned int c_aiws, c_aiwos;        /* Internal counters */
        unsigned long lockthreadid;
        int stlc;                            /* Same thread lock count */
    } AEC;

    char lockecho( AEC *ec ) {
        unsigned long tid=0;
        static int inproc=0;
        while (inproc) {
            Sleep(1);
        }
        inproc=1;
        if (!ec) {
            inproc=0;
            return 0;
        }
        tid=GetCurrentThreadId();
        if (ec->lockthreadid==tid) {
            inproc=0;
            ec->stlc++;
            return 1;
        }
        while (ec->lockthreadid!=0) {
            Sleep(1);
        }
        ec->lockthreadid=tid;
        inproc=0;
        return 1;
    }

    char unlockecho( AEC *ec ) {
        unsigned long tid=0;
        if (!ec)
            return 1;
        tid=GetCurrentThreadId();
        if (tid!=ec->lockthreadid)
            return 0;
        if (tid==ec->lockthreadid) {
            if (ec->stlc>0) {
                ec->stlc--;
            } else {
                ec->lockthreadid=0;
            }
        }
        return 1;
    }
No it's not. AFAIK you can't implement a mutex in plain C code without some low-level atomic operations (RMW, test-and-set, etc.). In your particular example, consider what happens if a context switch interrupts the first thread before it gets a chance to set inproc: the second thread will resume and set it to 1, and now both threads "think" they have exclusive access to the struct. This is just one of many things that could go wrong with your approach.
Also note that even if a thread gets a chance to set inproc, the assignment is not guaranteed to be atomic (it could be interrupted in the middle of assigning the variable).
As mux points out, your proposed code is incorrect due to many race conditions. You could solve this using atomic instructions like "Compare and Set", but you'll need to define those separately for each platform anyway. You're better off just defining a high-level "Lock" and "Unlock" interface, and implementing those using whatever the platform provides.
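As a sketch of that kind of interface (my addition; the names Lock, lock_init, lock_acquire and lock_release are illustrative, not an existing API), backed by CRITICAL_SECTION on Win32 and pthread_mutex_t elsewhere:

    #ifdef _WIN32
    #include <windows.h>
    typedef CRITICAL_SECTION Lock;
    static void lock_init(Lock *l)    { InitializeCriticalSection(l); }
    static void lock_acquire(Lock *l) { EnterCriticalSection(l); }
    static void lock_release(Lock *l) { LeaveCriticalSection(l); }
    #else
    #include <pthread.h>
    typedef pthread_mutex_t Lock;
    static void lock_init(Lock *l)    { pthread_mutex_init(l, NULL); }
    static void lock_acquire(Lock *l) { pthread_mutex_lock(l); }
    static void lock_release(Lock *l) { pthread_mutex_unlock(l); }
    #endif

Note that CRITICAL_SECTION is recursive, which matches the stlc same-thread count in the original code; with pthreads you would need a mutex created with PTHREAD_MUTEX_RECURSIVE to get the same behaviour.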

Bakery Lock when used inside a struct doesn't work

I'm new at multi-threaded programming and I tried to code the Bakery Lock Algorithm in C.
Here is the code:
    int number[N];   // N is the number of threads
    int choosing[N];

    void lock(int id) {
        choosing[id] = 1;
        number[id] = max(number, N) + 1;
        choosing[id] = 0;

        for (int j = 0; j < N; j++)
        {
            if (j == id)
                continue;

            while (1)
                if (choosing[j] == 0)
                    break;

            while (1)
            {
                if (number[j] == 0)
                    break;
                if (number[j] > number[id]
                    || (number[j] == number[id] && j > id))
                    break;
            }
        }
    }

    void unlock(int id) {
        number[id] = 0;
    }
Then I ran the following example: I ran 100 threads, and each thread runs the following code:
    for (i = 0; i < 10; ++i) {
        lock(id);
        counter++;
        unlock(id);
    }
After all threads have been executed, the result of the shared counter is 10 * 100 = 1000, which is the expected value. I executed my program multiple times and the result was always 1000. So it seems that the implementation of the lock is correct. That seemed weird based on a previous question I had, because I didn't use any memory barriers/fences. Was I just lucky?
Then I wanted to create a multi-threaded program that will use many different locks. So I created this (full code can be found here):
    typedef struct {
        int number[N];
        int choosing[N];
    } LOCK;
and the code changes to:
    void lock(LOCK l, int id)
    {
        l.choosing[id] = 1;
        l.number[id] = max(l.number, N) + 1;
        l.choosing[id] = 0;
        ...
Now when executing my program, sometimes I get 997, sometimes 998, sometimes 1000. So the lock algorithm isn't correct.
What am I doing wrong? What can I do in order to fix it?
Is it perhaps a problem that I'm now reading the arrays number and choosing from a struct, and that's not atomic or something?
Should I use memory fences, and if so at which points? (I tried using asm("mfence") at various points of my code, but it didn't help.)
With pthreads, the standard states that accessing a variable in one thread while another thread is, or might be, modifying it is undefined behavior. Your code does this all over the place. For example:
    while (1)
        if (choosing[j] == 0)
            break;
This code accesses choosing[j] over and over while waiting for another thread to modify it. The compiler is entirely free to modify this code as follows:
    int cj = choosing[j];
    while (1)
        if (cj == 0)
            break;
Why? Because the standard is clear that another thread may not modify the variable while this thread may be accessing it, so the value can be assumed to stay the same. But clearly, that won't work.
It can also do this:
    while (1)
    {
        int cj = choosing[j];
        if (cj == 0) break;
        choosing[j] = cj;
    }
Same logic. It is perfectly legal for the compiler to write back a variable whether it has been modified or not, so long as it does so at a time when the code could be accessing the variable. (Because, at that time, it's not legal for another thread to modify it, so the value must be the same and the write is harmless. In some cases, the write really is an optimization and real-world code has been broken by such writebacks.)
If you want to write your own synchronization functions, you have to build them with primitive functions that have the appropriate atomicity and memory visibility semantics. You must follow the rules or your code will fail, and fail horribly and unpredictably.
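For illustration, here is a minimal sketch (my addition, not part of the original answer) of the same bakery algorithm written against C11 <stdatomic.h>, using the default sequentially consistent ordering for every access; max_number() stands in for the max() helper that isn't shown in the question:

    #include <stdatomic.h>

    #define N 100   /* number of threads (assumed) */

    atomic_int number[N];
    atomic_int choosing[N];

    static int max_number(void) {
        int m = 0;
        for (int i = 0; i < N; i++) {
            int val = atomic_load(&number[i]);
            if (val > m) m = val;
        }
        return m;
    }

    void lock(int id) {
        atomic_store(&choosing[id], 1);
        atomic_store(&number[id], max_number() + 1);
        atomic_store(&choosing[id], 0);

        for (int j = 0; j < N; j++) {
            if (j == id)
                continue;
            /* wait until thread j has finished picking its ticket */
            while (atomic_load(&choosing[j]) != 0)
                ;
            /* wait while thread j holds a smaller (higher-priority) ticket */
            while (atomic_load(&number[j]) != 0 &&
                   (atomic_load(&number[j]) < atomic_load(&number[id]) ||
                    (atomic_load(&number[j]) == atomic_load(&number[id]) && j < id)))
                ;
        }
    }

    void unlock(int id) {
        atomic_store(&number[id], 0);
    }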
