Optimization with interrupts or threads and global variables - c

I find that I have some difficulty with how to best write communication between functions that are out of the normal flow of code. A simple example is:
int a = 0;
volatile int v = 0;

void __attribute__((used)) interrupt() {
    a++;
    v++;
}

int main() {
    while (1) {
        // asm("nop");
        a--;
        v--;
        if (v > 10 && a > 10)
            break;
    }
    return 0;
}
It is not surprising that the main while loop can keep the a variable in a register and thus never see any changes made by the interrupt. If the variable is volatile, the annoyance is that every time it is used it needs to be reread from or rewritten to memory, and with that technique every variable used for communication across threads would need to be volatile. A synchronization primitive (or even the commented-out "nop") also solves the problem, because its side effect acts as a compiler barrier. But if I understand correctly, that means flushing the entire register state used in main, where it may be less harsh to just mark a few variables volatile. I currently use both techniques, but I wish I had a more standard method for dealing with the issue. Can anyone comment on best strategies here?
A link to some assembly

So you want a means of reducing the number of times a is looked up. The following reduces it to once per loop iteration:
volatile int a = 0;
volatile int v = 0;

void __attribute__((used)) interrupt() {
    a++;
    v++;
}

int main() {
    while (1) {
        int b = --a;
        --v;
        if (v > 10 && b > 10)
            break;
    }
    return 0;
}
Nothing stops you from checking even less often in a similar fashion.
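For instance, here is a minimal sketch of that idea (my own elaboration, not code from the answer). It assumes the same volatile a and v as above, and that batching the decrements is acceptable for the application; each batch touches the volatiles only a couple of times instead of sixteen:

enum { BATCH = 16 };

int main() {
    while (1) {
        /* ... BATCH iterations' worth of work that doesn't touch the
         * shared counters ... */

        int av = (a -= BATCH);   /* same idiom as above: use the value of the
                                  * volatile read-modify-write directly */
        int vv = (v -= BATCH);
        if (vv > 10 && av > 10)
            break;
    }
    return 0;
}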

Related

Memory ordering for a spin-lock "call once" implementation

Suppose I wanted to implement a mechanism for calling a piece of code exactly once (e.g. for initialization purposes), even when multiple threads hit the call site repeatedly. Basically, I'm trying to implement something like pthread_once, but with GCC atomics and spin-locking. I have a candidate implementation below, but I'd like to know if
a) it could be faster in the common case (i.e. already initialized), and,
b) is the selected memory ordering strong enough / too strong?
Architectures of interest are x86_64 (primarily) and aarch64.
The intended use API is something like this:
void gets_called_many_times_from_many_threads(void)
{
    static int my_once_flag = 0;

    if (once_enter(&my_once_flag)) {
        // do one-time initialization here
        once_commit(&my_once_flag);
    }
    // do other things that assume the initialization has taken place
}
And here is the implementation:
int once_enter(int *b)
{
    int zero = 0;
    int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
    if (got_lock) return 1;

    while (2 != __atomic_load_n(b, __ATOMIC_ACQUIRE)) {
        // on x86, insert a pause instruction here
    }
    return 0;
}

void once_commit(int *b)
{
    (void) __atomic_store_n(b, 2, __ATOMIC_RELEASE);
}
I think that the RELAXED ordering on the compare-exchange is okay, because we don't skip the atomic load in the while condition even if the compare-exchange gives us 2 (in the "zero" variable), so the ACQUIRE on that load synchronizes with the RELEASE in once_commit (I think). But maybe on a successful compare-exchange we need to use RELEASE? I'm unclear here.
Also, I just learned that lock cmpxchg is a full memory barrier on x86, and since we are hitting the __atomic_compare_exchange_n in the common case (initialization has already been done), that barrier occurs on every function call. Is there an easy way to avoid this?
UPDATE
Based on the comments and accepted answer, I've come up with the following modified implementation. If anybody spots a bug please let me know, but I believe it's correct. Basically, the change amounts to implementing double-checked locking. I also switched to using SEQ_CST because:
I mainly care that the common (already initialized) case is fast.
I observed that GCC doesn't emit a memory fence instruction on x86 for the first read (and it does do so on ARM even with ACQUIRE).
#ifdef __x86_64__
#define PAUSE() __asm __volatile("pause")
#else
#define PAUSE()
#endif

int once_enter(int *b)
{
    if (2 == __atomic_load_n(b, __ATOMIC_SEQ_CST)) return 0;

    int zero = 0;
    int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    if (got_lock) return 1;

    while (2 != __atomic_load_n(b, __ATOMIC_SEQ_CST)) {
        PAUSE();
    }
    return 0;
}

void once_commit(int *b)
{
    (void) __atomic_store_n(b, 2, __ATOMIC_SEQ_CST);
}
a) What you need is a double-checked lock.
Basically, instead of entering the lock every time, you do an acquiring-load to see if the initialisation has been done yet, and only invoke once_enter if it has not.
void gets_called_many_times_from_many_threads(void)
{
    static int my_once_flag = 0;

    if (__atomic_load_n(&my_once_flag, __ATOMIC_ACQUIRE) != 2) {
        if (once_enter(&my_once_flag)) {
            // do one-time initialization here
            once_commit(&my_once_flag);
        }
    }
    // do other things that assume the initialization has taken place
}
b) I believe this is enough: your initialisation happens before the releasing store of 2 to my_once_flag, and every other thread has to observe the value 2 via an acquiring load from the same variable before it can skip the initialisation.
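For readers more comfortable with C11 <stdatomic.h> than the GCC __atomic builtins, roughly the same pattern could be sketched as follows. This is my translation, not code from the question or answer; the flag values (0 = uninitialised, 1 = in progress, 2 = done) and the RELEASE/ACQUIRE pairing mirror the original, and the CAS ordering is chosen conservatively:

#include <stdatomic.h>

int once_enter(atomic_int *b)
{
    int expected = 0;
    if (atomic_compare_exchange_strong_explicit(b, &expected, 1,
                                                memory_order_acquire,
                                                memory_order_acquire))
        return 1;                       /* we won the race: caller initialises */

    while (atomic_load_explicit(b, memory_order_acquire) != 2)
        ;                               /* spin until the winner commits */
    return 0;
}

void once_commit(atomic_int *b)
{
    atomic_store_explicit(b, 2, memory_order_release);
}

void gets_called_many_times_from_many_threads(void)
{
    static atomic_int my_once_flag;     /* zero-initialised: "uninitialised" */

    /* double-checked: the common case is a single acquire load */
    if (atomic_load_explicit(&my_once_flag, memory_order_acquire) != 2) {
        if (once_enter(&my_once_flag)) {
            /* do one-time initialization here */
            once_commit(&my_once_flag);
        }
    }
    /* initialization has taken place here */
}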

would this be a functioning spinlock in c?

I am currently learning about operating systems and was wondering: using these definitions, would this work as expected, or am I missing some atomic operations?
int locked = 0;

void lsLock() {
    while (1) {
        if (locked) {
            continue;
        } else {
            locked = 1;
            break;
        }
    }
}

void lsUnlock() {
    locked = 0;
}
Thanks in advance!
The first problem is that the compiler assumes that nothing will be altered by anything "external" (unless it's marked volatile), which means that the compiler is able to optimize this:
int locked = 0;

void lsLock() {
    while (1) {
        if (locked) {
            continue;
        } else {
            locked = 1;
            break;
        }
    }
}
...into this:

int locked = 0;

void lsLock() {
    if (locked) {
        while (1) { }
    }
    locked = 1;
}
Obviously that won't work if something else modifies locked - once the while(1) {} starts it won't stop. To fix that, you could use volatile int locked = 0;.
This only prevents the compiler from assuming locked didn't change. Nothing prevents a theoretical CPU from playing similar tricks (e.g. even with volatile, a CPU without cache coherence might never notice that a different CPU altered locked). For a guarantee you need atomics or something else (e.g. memory barriers).
However, with volatile int locked it may work, especially on common 80x86 CPUs. Please note that "works" can be considered the worst possibility: it leads to assuming that the code is fine and then having a spectacular (and nearly impossible to debug, due to timing issues) disaster when you compile the same code for a different CPU.
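For comparison, here is a minimal sketch of the same API built on C11 atomics (my illustration, not part of the original answer). Note that the test-then-set in lsLock above is not a single atomic operation either, so two threads can both observe locked == 0 and both "acquire" the lock even on x86; an atomic test-and-set closes that window and also supplies the needed memory ordering:

#include <stdatomic.h>

static atomic_flag locked = ATOMIC_FLAG_INIT;

void lsLock(void) {
    /* Atomically set the flag and get its previous value; keep spinning
     * while it was already set. Acquire pairs with the release in unlock. */
    while (atomic_flag_test_and_set_explicit(&locked, memory_order_acquire)) {
        /* spin; optionally yield or pause here */
    }
}

void lsUnlock(void) {
    atomic_flag_clear_explicit(&locked, memory_order_release);
}

atomic_flag is the one atomic type C11 guarantees to be lock-free, which makes it a natural fit for a minimal spinlock like this.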

understand membarrier function in linux

Example of using membarrier function from linux manual: https://man7.org/linux/man-pages/man2/membarrier.2.html
#include <stdlib.h>

static volatile int a, b;

static void
fast_path(int *read_b)
{
    a = 1;
    asm volatile ("mfence" : : : "memory");
    *read_b = b;
}

static void
slow_path(int *read_a)
{
    b = 1;
    asm volatile ("mfence" : : : "memory");
    *read_a = a;
}

int
main(int argc, char **argv)
{
    int read_a, read_b;

    /*
     * Real applications would call fast_path() and slow_path()
     * from different threads. Call those from main() to keep
     * this example short.
     */
    slow_path(&read_a);
    fast_path(&read_b);

    /*
     * read_b == 0 implies read_a == 1 and
     * read_a == 0 implies read_b == 1.
     */
    if (read_b == 0 && read_a == 0)
        abort();

    exit(EXIT_SUCCESS);
}
The code above transformed to use membarrier() becomes:
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

static volatile int a, b;

static int
membarrier(int cmd, unsigned int flags, int cpu_id)
{
    return syscall(__NR_membarrier, cmd, flags, cpu_id);
}

static int
init_membarrier(void)
{
    int ret;

    /* Check that membarrier() is supported. */
    ret = membarrier(MEMBARRIER_CMD_QUERY, 0, 0);
    if (ret < 0) {
        perror("membarrier");
        return -1;
    }

    if (!(ret & MEMBARRIER_CMD_GLOBAL)) {
        fprintf(stderr,
                "membarrier does not support MEMBARRIER_CMD_GLOBAL\n");
        return -1;
    }

    return 0;
}

static void
fast_path(int *read_b)
{
    a = 1;
    asm volatile ("" : : : "memory");
    *read_b = b;
}

static void
slow_path(int *read_a)
{
    b = 1;
    membarrier(MEMBARRIER_CMD_GLOBAL, 0, 0);
    *read_a = a;
}

int
main(int argc, char **argv)
{
    int read_a, read_b;

    if (init_membarrier())
        exit(EXIT_FAILURE);

    /*
     * Real applications would call fast_path() and slow_path()
     * from different threads. Call those from main() to keep
     * this example short.
     */
    slow_path(&read_a);
    fast_path(&read_b);

    /*
     * read_b == 0 implies read_a == 1 and
     * read_a == 0 implies read_b == 1.
     */
    if (read_b == 0 && read_a == 0)
        abort();

    exit(EXIT_SUCCESS);
}
This "membarrier" description is taken from the Linux manual. I am still confused about how does trhe "membarrier" function add overhead to the slow side, and remove overhead from the fast side, thus resulting in an overall performance increase as long as the slow side is infrequent enough that the overhead of the membarrier() calls does not outweigh the performance gain on the fast side.
Could you please help me to describe it in more detail.
Thanks!
This pair of write-then-read-the-other-variable functions is the pattern from https://preshing.com/20120515/memory-reordering-caught-in-the-act/, a demo of StoreLoad reordering (the only kind x86 allows, given its program-order + store-buffer-with-store-forwarding memory model).
With only one local MFENCE you could still get reordering:
FAST                                        SLOW (using just mfence, not membarrier)

a = 1         exec
read_b = b;   // 0
                                            b = 1;
                                            mfence   (force b=1 to be visible before reading a)
                                            read_a = a;   // 0
a = 1         visible (global vis. delayed by store buffer)
But consider what would happen if an mfence-on-every-core had to be part of every possible order, between the slow-path's store and its reload.
This ordering would no longer be possible. If read_b = b has already read a 0, then a = 1 is already pending¹ (if it isn't visible already). It's impossible for it to stay private until after read_a = a, because membarrier() makes sure a full barrier runs on every core, and SLOW waits for that to happen (for membarrier to return) before reading a.
And there's no way to get 0,0 from having SLOW execute first; it runs membarrier itself so its store is definitely visible to other threads before it reads a.
Footnote 1: waiting to execute, or waiting in the store buffer to commit to L1d cache. The asm("" ::: "memory") ensures that, but is actually redundant because volatile itself guarantees that the accesses happen in the asm in program order. And we basically need volatile for other reasons when hand-rolling atomics instead of using C11 _Atomic. (But generally don't do that unless you're actually writing kernel code; use atomic_store_explicit(&a, 1, memory_order_release); instead.)
Note it's actually the store buffer that creates StoreLoad reordering (the only kind x86 allows), not so much OoO exec. In fact, a store buffer also lets x86 execute stores out-of-order and then make them globally visible in program order (if it turns out they weren't the result of mis-speculation or something!).
Also note that in-order CPUs can do their memory accesses out of order. They start instructions (including loads) in order, but can let them complete out of order, e.g. by scoreboarding loads to allow hit-under-miss. See also How is load->store reordering possible with in-order commit?
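As the footnote suggests, in user-space code the hand-rolled volatile + asm accesses would normally be written with C11 _Atomic instead. Here is a minimal sketch of what the two paths from the man-page example might look like that way (my illustration, not part of the answer): relaxed atomics replace volatile, the explicit compiler barrier in the fast path is kept, and the membarrier() wrapper, MEMBARRIER_CMD_GLOBAL constant, and init_membarrier() setup are assumed to be the same as in the listing above.

#include <stdatomic.h>

static atomic_int a, b;   /* instead of: static volatile int a, b; */

static void
fast_path(int *read_b)
{
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    asm volatile ("" ::: "memory");   /* compiler barrier, as in the man-page fast path */
    *read_b = atomic_load_explicit(&b, memory_order_relaxed);
}

static void
slow_path(int *read_a)
{
    atomic_store_explicit(&b, 1, memory_order_relaxed);
    membarrier(MEMBARRIER_CMD_GLOBAL, 0, 0);   /* the global barrier does the heavy lifting */
    *read_a = atomic_load_explicit(&a, memory_order_relaxed);
}

On x86 this compiles to essentially the same plain mov instructions as the volatile version, so the fast path stays cheap, while the data race is now well defined in the C memory model.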

Do we need a lock in a single writer multi-reader system?

In OS books it is said that there must be a lock to protect data from being accessed by a reader and a writer at the same time.
But when I test the simple example below on an x86 machine, it works well.
I want to know: is the lock really necessary here?
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>

struct doulnum
{
    int i;
    long int l;
    char c;
    unsigned int ui;
    unsigned long int ul;
    unsigned char uc;
};

long int global_array[100] = {0};

void* start_read(void *_notused)
{
    int i;
    struct doulnum d;
    int di;
    long int dl;
    char dc;
    unsigned char duc;
    unsigned long dul;
    unsigned int dui;

    while (1)
    {
        for (i = 0; i < 100; i++)
        {
            dl = global_array[i];
            //di = d.i;
            //dl = d.l;
            //dc = d.c;
            //dui = d.ui;
            //duc = d.uc;
            //dul = d.ul;
            if (dl > 5 || dl < 0)
                printf("error\n");
            /*if (di > 5 || di < 0 || dl > 10 || dl < 5)
            {
                printf("i l value %d,%ld\n", di, dl);
                exit(0);
            }
            if (dc > 15 || dc < 10 || dui > 20 || dui < 15)
            {
                printf("c ui value %d,%u\n", dc, dui);
                exit(0);
            }
            if (dul > 25 || dul < 20 || duc > 30 || duc < 25)
            {
                printf("uc ul value %u,%lu\n", duc, dul);
                exit(0);
            }*/
        }
    }
}

int start_write(void)
{
    int i;
    //struct doulnum dl;

    while (1)
    {
        for (i = 0; i < 100; i++)
        {
            //dl.i = random() % 5;
            //dl.l = random() % 5 + 5;
            //dl.c = random() % 5 + 10;
            //dl.ui = random() % 5 + 15;
            //dl.ul = random() % 5 + 20;
            //dl.uc = random() % 5 + 25;
            global_array[i] = random() % 5;
        }
    }
    return 0;
}

int main(int argc, char **argv)
{
    int i;
    cpu_set_t cpuinfo;
    pthread_t pt[3];
    //struct doulnum dl;
    //dl.i = 2;
    //dl.l = 7;
    //dl.c = 12;
    //dl.ui = 17;
    //dl.ul = 22;
    //dl.uc = 27;

    for (i = 0; i < 100; i++)
        global_array[i] = 2;

    for (i = 0; i < 3; i++)
        if (pthread_create(pt + i, NULL, start_read, NULL) < 0)
            return -1;

    /* for (i = 0; i < 3; i++)
    {
        CPU_ZERO(&cpuinfo);
        CPU_SET_S(i, sizeof(cpuinfo), &cpuinfo);
        if (0 != pthread_setaffinity_np(pt[i], sizeof(cpu_set_t), &cpuinfo))
        {
            printf("set affinity %d\n", i);
            exit(0);
        }
    }
    CPU_ZERO(&cpuinfo);
    CPU_SET_S(3, sizeof(cpuinfo), &cpuinfo);
    if (0 != pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuinfo))
    {
        printf("set affinity recver\n");
        exit(0);
    }*/

    start_write();
    return 0;
}
If you don't synchronise reads and writes, a reader could read while a writer is writing, and read the data in a half-written state if the write operation is not atomic. So yes, synchronisation would be necessary to keep that from happening.
You surely need synchronization here. The simple reason is that there is a distinct possibility of the data being in an inconsistent state while start_write is updating the information in the global array and one of your 3 threads tries to read the same data from it.
What you quote is also imprecise: "there must be a lock to protect data from being accessed by a reader and a writer at the same time" should really be "there must be a lock to protect data from being modified while another thread is accessing it".
If the shared data is being modified by one thread while another thread is reading it, you need a lock to protect it.
If the shared data is only being read by two or more threads, you don't need to protect it.
It would work fine if the threads were just reading from global_array. printf should be fine since it does a single I/O operation in append mode.
However, since the main thread calls start_write to update global_array at the same time the other threads are in start_read, they are going to be reading the values in a very unpredictable manner. It depends highly on how the threads are implemented in the OS, how many CPUs/cores you have, etc. This might work well on your dual-core development box but then fail spectacularly when you move to a 16-core production server.
For example, if the threads are not synchronizing, they might never see any updates to global_array under the right circumstances, or some threads might see changes sooner than others. It's all about the timing of when changes are written back to memory and when the threads see them in their caches. To ensure consistent results you need synchronization (memory barriers) to force the caches to be updated.
The general answer is you need some way to ensure/enforce necessary atomicity, so the reader doesn't see an inconsistent state.
A lock (done correctly) is sufficient but not always necessary. But in order to prove that it's not necessary, you need to be able to say something about the atomicity of the operations involved.
This involves both the architecture of the target host and, to some extent, the compiler.
In your example, you're writing a long to an array. In this case, the question is: is the store of a long atomic? It probably is, but it depends on the host. It's possible that the CPU writes out a portion of the long (upper/lower words/bytes) separately, and thus the reader could get a value that was never written. (This is, I believe, unlikely on most modern CPU architectures, but you'd have to check to be sure.)
It's also possible for there to be write buffering in the CPU. It's been a long time since I looked at this, but I believe it's possible to get store reordering if you don't have the necessary write barrier instructions. It's unclear from your example if you would be relying on this.
Finally, you'd probably need to flag the array as volatile (again, I haven't done this in a while so I'm rusty on the specifics) in order to ensure that the compiler doesn't make assumptions about the data not changing underneath it.
It depends on how much you care about portability.
At least on an actual Intel x86 processor, when you're reading/writing dword (32-bit) data that's also dword aligned, the hardware gives you atomicity "for free" -- i.e., without your having to do any sort of lock to enforce it.
Changing much of anything (up to and including compiler flags that might affect the data's alignment) can break that -- but in ways that might remain hidden for a long time (especially if you have low contention over a particular data item). It also leads to extremely fragile code -- for example, switching to a smaller data type can break the code, even if you're only using a subset of the values.
The current atomic "guarantee" is pretty much an accidental side-effect of the way the cache and bus happen to be designed. While I'm not sure I'd really expect a change that broke things, I wouldn't consider it particularly far-fetched either. The only place I've seen documentation of this atomic behavior was in the same processor manuals that cover things like model-specific registers that definitely have changed (and continue to change) from one model of processor to the next.
The bottom line is that you really should do the locking, but you probably won't see a manifestation of the problem with your current hardware, no matter how much you test (unless you change conditions like mis-aligning the data).
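For completeness, here is a minimal sketch of how the shared array itself could be made well-defined without a lock, using C11 atomics. This is my illustration, not code from any of the answers above, and the helper names are hypothetical; whether relaxed ordering is sufficient depends on whether readers also need to see other data in a consistent relationship to the array elements.

#include <stdatomic.h>

/* Each element is accessed atomically, so a reader can never observe a
 * torn (half-written) long, and the compiler cannot cache stale values. */
static _Atomic long global_array[100];

/* writer side */
void write_slot(int i, long value) {
    atomic_store_explicit(&global_array[i], value, memory_order_relaxed);
}

/* reader side */
long read_slot(int i) {
    return atomic_load_explicit(&global_array[i], memory_order_relaxed);
}

This addresses torn reads and the compiler-caching problem; if the readers also need updates to other shared data to become visible in a particular order relative to these elements, stronger orderings (or a lock) would still be required.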

Bakery Lock when used inside a struct doesn't work

I'm new to multi-threaded programming and I tried to code the Bakery Lock Algorithm in C.
Here is the code:
int number[N];   // N is the number of threads
int choosing[N];

void lock(int id) {
    choosing[id] = 1;
    number[id] = max(number, N) + 1;
    choosing[id] = 0;

    for (int j = 0; j < N; j++)
    {
        if (j == id)
            continue;

        while (1)
            if (choosing[j] == 0)
                break;

        while (1)
        {
            if (number[j] == 0)
                break;
            if (number[j] > number[id]
                || (number[j] == number[id] && j > id))
                break;
        }
    }
}

void unlock(int id) {
    number[id] = 0;
}
Then I run the following example. I run 100 threads and each thread runs the following code:
for (i = 0; i < 10; ++i) {
    lock(id);
    counter++;
    unlock(id);
}
After all the threads have finished, the result of the shared counter is 10 * 100 = 1000, which is the expected value. I executed my program multiple times and the result was always 1000, so it seems that the implementation of the lock is correct. That seemed weird based on a previous question I had, because I didn't use any memory barriers/fences. Was I just lucky?
Then I wanted to create a multi-threaded program that will use many different locks. So I created this (full code can be found here):
typedef struct {
    int number[N];
    int choosing[N];
} LOCK;
and the code changes to:
void lock(LOCK l, int id)
{
    l.choosing[id] = 1;
    l.number[id] = max(l.number, N) + 1;
    l.choosing[id] = 0;
    ...
Now when executing my program, sometimes I get 997, sometimes 998, sometimes 1000. So the lock algorithm isn't correct.
What am I doing wrong? What can I do in order to fix it?
Is it perhaps a problem now that I'm reading arrays number and choosing from a struct
and that's not atomic or something?
Should I use memory fences, and if so at which points (I tried using asm("mfence") at various points in my code, but it didn't help)?
With pthreads, the standard states that accessing a variable in one thread while another thread is, or might be, modifying it is undefined behavior. Your code does this all over the place. For example:
while (1)
    if (choosing[j] == 0)
        break;
This code accesses choosing[j] over and over while waiting for another thread to modify it. The compiler is entirely free to modify this code as follows:
int cj = choosing[j];

while (1)
    if (cj == 0)
        break;
Why? Because the standard is clear that another thread may not modify the variable while this thread may be accessing it, so the value can be assumed to stay the same. But clearly, that won't work.
It can also do this:
while (1)
{
    int cj = choosing[j];
    if (cj == 0) break;
    choosing[j] = cj;
}
Same logic. It is perfectly legal for the compiler to write back a variable whether it has been modified or not, so long as it does so at a time when the code could be accessing the variable. (Because, at that time, it's not legal for another thread to modify it, so the value must be the same and the write is harmless. In some cases, the write really is an optimization and real-world code has been broken by such writebacks.)
If you want to write your own synchronization functions, you have to build them with primitive functions that have the appropriate atomicity and memory visibility semantics. You must follow the rules or your code will fail, and fail horribly and unpredictably.
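To make that concrete, here is a sketch of what "primitive functions with the appropriate semantics" could look like for the busy-wait above, using C11 atomics. This is my illustration, not code from the answer, and it only shows the waiting loop (with N as defined in the question), not a full corrected bakery lock:

#include <stdatomic.h>

static _Atomic int choosing[N];   /* the shared flags become atomic objects */

/* Wait until thread j is no longer choosing a ticket. Every read goes
 * through an atomic load, so the compiler cannot hoist it out of the loop
 * or invent writebacks, and seq_cst provides the ordering the textbook
 * algorithm implicitly assumes (sequential consistency). */
static void wait_not_choosing(int j) {
    while (atomic_load_explicit(&choosing[j], memory_order_seq_cst) != 0)
        ;   /* spin */
}

The number[] array would need the same treatment, and the stores in lock()/unlock() would use atomic_store_explicit with matching ordering.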
