I'm new at multi-threaded programming and I tried to code the Bakery Lock Algorithm in C.
Here is the code:
int number[N]; // N is the number of threads
int choosing[N];

void lock(int id) {
    choosing[id] = 1;
    number[id] = max(number, N) + 1;
    choosing[id] = 0;
    for (int j = 0; j < N; j++)
    {
        if (j == id)
            continue;
        while (1)
            if (choosing[j] == 0)
                break;
        while (1)
        {
            if (number[j] == 0)
                break;
            if (number[j] > number[id]
                || (number[j] == number[id] && j > id))
                break;
        }
    }
}

void unlock(int id) {
    number[id] = 0;
}
Then I run the following example. I run 100 threads and each thread runs the following code:
for (i = 0; i < 10; ++i) {
lock(id);
counter++;
unlock(id);
}
After all threads have been executed, the result of the shared counter is 10 * 100 = 1000 which is the expected value. I executed my program multiple times and the result was always 1000. So it seems that the implementation of the lock is correct. That seemed weird based on a previous question I had because I didn't use any memory barriers/fences. Was I just lucky?
Then I wanted to create a multi-threaded program that will use many different locks. So I created this (full code can be found here):
typedef struct {
int number[N];
int choosing[N];
} LOCK;
and the code changes to:
void lock(LOCK l, int id)
{
l.choosing[id] = 1;
l.number[id] = max(l.number, N) + 1;
l.choosing[id] = 0;
...
Now when executing my program, sometimes I get 997, sometimes 998, sometimes 1000. So the lock algorithm isn't correct.
What am I doing wrong? What can I do in order to fix it?
Is it perhaps a problem now that I'm reading arrays number and choosing from a struct and that's not atomic or something?
Should I use memory fences and if so at which points (I tried using asm("mfence") in various points of my code, but it didn't help)?
With pthreads, the standard states that accessing a variable in one thread while another thread is, or might be, modifying it is undefined behavior. Your code does this all over the place. For example:
while (1)
if (choosing[j] == 0)
break;
This code accesses choosing[j] over and over while waiting for another thread to modify it. The compiler is entirely free to modify this code as follows:
int cj=choosing[j];
while(1)
if(cj == 0)
break;
Why? Because the standard is clear that another thread may not modify the variable while this thread may be accessing it, so the value can be assumed to stay the same. But clearly, that won't work.
It can also do this:
while(1)
{
int cj=choosing[j];
if(cj==0) break;
choosing[j]=cj;
}
Same logic. It is perfectly legal for the compiler to write back a variable whether it has been modified or not, so long as it does so at a time when the code could be accessing the variable. (Because, at that time, it's not legal for another thread to modify it, so the value must be the same and the write is harmless. In some cases, the write really is an optimization and real-world code has been broken by such writebacks.)
If you want to write your own synchronization functions, you have to build them with primitive functions that have the appropriate atomicity and memory visibility semantics. You must follow the rules or your code will fail, and fail horribly and unpredictably.
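As one way to follow that advice, here is a minimal sketch of the bakery lock from the question rebuilt on C11 <stdatomic.h> operations (sequentially consistent by default), assuming a C11 compiler and pthreads. The names bakery_t, bakery_lock, bakery_unlock, run_bakery_demo, NTHREADS and ITERS are made up for this illustration, error handling is omitted, and the sched_yield() calls are only there to keep the spin loops polite. The lock is passed by pointer: passing the struct by value, as in lock(LOCK l, int id), would hand every call its own private copy of the arrays.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define NTHREADS 4   /* hypothetical thread count for the demo */
#define ITERS 500    /* hypothetical increments per thread */

typedef struct {
    atomic_int number[NTHREADS];
    atomic_int choosing[NTHREADS];
} bakery_t;

static bakery_t demo_lock;   /* static storage, so all tickets start at 0 */
static long demo_counter;    /* plain variable, protected by demo_lock */

void bakery_lock(bakery_t *l, int id)
{
    assert(id >= 0 && id < NTHREADS);

    /* take a ticket one larger than every ticket currently visible */
    atomic_store(&l->choosing[id], 1);
    int max = 0;
    for (int j = 0; j < NTHREADS; j++) {
        int n = atomic_load(&l->number[j]);
        if (n > max)
            max = n;
    }
    atomic_store(&l->number[id], max + 1);
    atomic_store(&l->choosing[id], 0);

    for (int j = 0; j < NTHREADS; j++) {
        if (j == id)
            continue;
        /* every pass is a real atomic load, so it cannot be hoisted */
        while (atomic_load(&l->choosing[j]))
            sched_yield();
        for (;;) {
            int nj = atomic_load(&l->number[j]);
            int ni = atomic_load(&l->number[id]);
            if (nj == 0 || nj > ni || (nj == ni && j > id))
                break;
            sched_yield();
        }
    }
}

void bakery_unlock(bakery_t *l, int id)
{
    atomic_store(&l->number[id], 0);
}

static void *bakery_worker(void *arg)
{
    int id = *(int *)arg;
    for (int i = 0; i < ITERS; i++) {
        bakery_lock(&demo_lock, id);
        demo_counter++;
        bakery_unlock(&demo_lock, id);
    }
    return NULL;
}

/* Runs NTHREADS workers and returns the final counter value. */
long run_bakery_demo(void)
{
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, bakery_worker, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return demo_counter;
}
```

Calling run_bakery_demo() once should return NTHREADS * ITERS. In real code a pthread mutex is almost certainly the better choice; the sketch only shows what "building on primitives with defined semantics" can look like.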
Related
Suppose I wanted to implement a mechanism for calling a piece of code exactly once (e.g. for initialization purposes), even when multiple threads hit the call site repeatedly. Basically, I'm trying to implement something like pthread_once, but with GCC atomics and spin-locking. I have a candidate implementation below, but I'd like to know if
a) it could be faster in the common case (i.e. already initialized), and,
b) is the selected memory ordering strong enough / too strong?
Architectures of interest are x86_64 (primarily) and aarch64.
The intended use API is something like this
void gets_called_many_times_from_many_threads(void)
{
static int my_once_flag = 0;
if (once_enter(&my_once_flag)) {
// do one-time initialization here
once_commit(&my_once_flag);
}
// do other things that assume the initialization has taken place
}
And here is the implementation:
int once_enter(int *b)
{
int zero = 0;
int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
if (got_lock) return 1;
while (2 != __atomic_load_n(b, __ATOMIC_ACQUIRE)) {
// on x86, insert a pause instruction here
};
return 0;
}
void once_commit(int *b)
{
(void) __atomic_store_n(b, 2, __ATOMIC_RELEASE);
}
I think that the RELAXED ordering on the compare exchange is okay, because we don't skip the atomic load in the while condition even if the compare-exchange gives us 2 (in the "zero" variable), so the ACQUIRE on that load synchronizes with the RELEASE in once_commit (I think), but maybe on a successful compare-exchange we need to use RELEASE? I'm unclear here.
Also, I just learned that lock cmpxchg is a full memory barrier on x86, and since we are hitting the __atomic_compare_exchange_n in the common case (initialization has already been done), that barrier occurs on every function call. Is there an easy way to avoid this?
UPDATE
Based on the comments and accepted answer, I've come up with the following modified implementation. If anybody spots a bug please let me know, but I believe it's correct. Basically, the change amounts to implementing double-check locking. I also switched to using SEQ_CST because:
I mainly care that the common (already initialized) case is fast.
I observed that GCC doesn't emit a memory fence instruction on x86 for the first read (and it does do so on ARM even with ACQUIRE).
#ifdef __x86_64__
#define PAUSE() __asm __volatile("pause")
#else
#define PAUSE()
#endif
int once_enter(int *b)
{
if(2 == __atomic_load_n(b, __ATOMIC_SEQ_CST)) return 0;
int zero = 0;
int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
if (got_lock) return 1;
while (2 != __atomic_load_n(b, __ATOMIC_SEQ_CST)) {
PAUSE();
};
return 0;
}
void once_commit(int *b)
{
(void) __atomic_store_n(b, 2, __ATOMIC_SEQ_CST);
}
a) What you need is double-checked locking.
Basically, instead of entering the lock every time, you do an acquiring-load to see if the initialisation has been done yet, and only invoke once_enter if it has not.
void gets_called_many_times_from_many_threads(void)
{
static int my_once_flag = 0;
if (__atomic_load_n(&my_once_flag, __ATOMIC_ACQUIRE) != 2) {
if (once_enter(&my_once_flag)) {
// do one-time initialization here
once_commit(&my_once_flag);
}
}
// do other things that assume the initialization has taken place
}
b) I believe this is enough: your initialisation happens before the releasing store of 2 to my_once_flag, and every other thread has to observe the value 2 with an acquiring load from the same variable.
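For comparison, the same double-checked pattern can be sketched with portable C11 <stdatomic.h> calls instead of the GCC builtins. This is only an illustration under assumptions: the ONCE_* constants, config_value and count_inits are hypothetical names, and acquire ordering on a successful compare-exchange is a conservative choice rather than a proven minimum.

```c
#include <stdatomic.h>

enum { ONCE_IDLE = 0, ONCE_RUNNING = 1, ONCE_DONE = 2 };

/* Returns 1 to the single caller that must run the initialization. */
int once_enter(atomic_int *flag)
{
    /* fast path: one acquire load, no read-modify-write */
    if (atomic_load_explicit(flag, memory_order_acquire) == ONCE_DONE)
        return 0;

    int expected = ONCE_IDLE;
    if (atomic_compare_exchange_strong_explicit(flag, &expected, ONCE_RUNNING,
                                                memory_order_acquire,
                                                memory_order_relaxed))
        return 1;

    /* someone else is initializing: wait for the release store */
    while (atomic_load_explicit(flag, memory_order_acquire) != ONCE_DONE)
        ;   /* a pause or yield hint could go here */
    return 0;
}

void once_commit(atomic_int *flag)
{
    atomic_store_explicit(flag, ONCE_DONE, memory_order_release);
}

static int config_value;   /* hypothetical data set up by the initializer */

/* Calls the guarded block `calls` times; returns how often init ran. */
int count_inits(int calls)
{
    static atomic_int flag;   /* zero-initialized, i.e. ONCE_IDLE */
    int inits = 0;

    for (int i = 0; i < calls; i++) {
        if (once_enter(&flag)) {
            config_value = 42;   /* one-time initialization */
            inits++;
            once_commit(&flag);
        }
    }
    return inits;
}
```

The first count_inits(5) call should report a single initialization. The release store in once_commit pairs with the acquire loads in once_enter, so any thread that sees ONCE_DONE also sees config_value.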
I find that I have some difficulty with how to best write communication between functions that are out of the normal flow of code. A simple example is:
int a = 0;
volatile int v = 0;
void __attribute__((used)) interrupt() {
a++;
v++;
}
int main() {
while(1) {
// asm("nop");
a--;
v--;
if (v > 10 && a > 10)
break;
}
return 0;
}
It is not surprising that the compiler can keep the a variable in a register inside the main while loop, so the loop never sees any changes made by the interrupt. If the variable is volatile, then it is annoying in that every time it is used it needs to be reread from or rewritten to memory. And with that technique any communication variable shared across threads would need to be volatile. A synchronization primitive (or even the commented out "nop") solves the problem because it seemingly has the side effect of creating a compiler barrier. But if I understand correctly, that would mean flushing the entire state of all the registers used in main, where maybe it's less harsh to just have a few variables as volatile. I currently use the two techniques, but I wish I had a more standard method for dealing with the issue. Can anyone comment on the best strategies here?
A link to some assembly
So you want a means of reducing the number of times a is looked up. The following reduces it to once a loop:
volatile int a = 0;
volatile int v = 0;
void __attribute__((used)) interrupt() {
a++;
v++;
}
int main() {
while(1) {
int b = --a;
--v;
if (v > 10 && b > 10)
break;
}
return 0;
}
Nothing stops you from checking even less often in the same way.
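If C11 is available, another option (not the one used in the answer above) is to make the shared variable an atomic and use relaxed operations, which pins down exactly where the memory accesses happen without making every other variable volatile. The sketch below assumes atomic_int is lock-free on the target, which is what makes it usable from a handler; on_interrupt, wait_for_ticks and ticks are hypothetical names.

```c
#include <stdatomic.h>

static atomic_int ticks;   /* bumped by the (hypothetical) interrupt handler */

void on_interrupt(void)
{
    atomic_fetch_add_explicit(&ticks, 1, memory_order_relaxed);
}

/* Polls until more than `threshold` ticks were seen or `max_spins` runs out.
   Returns the tick count that was observed, or -1 on giving up. */
int wait_for_ticks(int threshold, long max_spins)
{
    for (long spin = 0; spin < max_spins; spin++) {
        /* exactly one real load per iteration; everything else may
           stay in registers */
        int seen = atomic_load_explicit(&ticks, memory_order_relaxed);
        if (seen > threshold)
            return seen;
    }
    return -1;
}
```

Relaxed ordering only guarantees that each access to ticks is a real, untorn access; if the handler also publishes other data, acquire/release ordering would be needed on top.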
I have an array of 100 requests (integers). I want to create 4 threads that all run the same function (thread_function), and I want the threads to take the requests one by one:
(thread0->request0,
thread1->request1,
thread2->request2,
thread3->request3
and then thread0->request4 etc. up to 100), all of this using mutexes.
Here is the code I have written so far:
threadRes = pthread_create(&(threadID[i]), NULL,thread_function, (void *)id_size);
This is inside my main, in a loop that runs 4 times. Now, outside my main:
void *thread_function(void *arg){
    int *val_p=(int *) arg;
    for(i=0; i<200; i=i+2)
    {
        f=false;
        for (j= 0; j<100; j++)
        {
            if (val_p[i]==cache[j].id)
                f=true;
        }
        if(f==true)
        {
            printf("The request %d has been served.\n",val_p[i]);
        }
        else
        {
            cache[k].id=val_p[i];
            printf("\nCurrent request to be served:%d \n",cache[k].id);
            k++;
        }
    }
}
Where val_p is the array with the requests and cache is an array of structs that stores the ids (requests).
So now I want mutexes to synchronize my threads. I considered using this inside my main:
pthread_join(threadID[0], NULL);
pthread_join(threadID[1], NULL);
pthread_join(threadID[2], NULL);
pthread_join(threadID[3], NULL);
pthread_mutex_destroy(&mutex);
and inside the function to use:
pthread_mutex_lock(&mutex);
pthread_mutex_unlock(&mutex);
Before I finish, I would like to say that so far the result of my program is that 4 threads run 100 requests each (400 in total), and what I want to achieve is that 4 threads run 100 requests in total.
Thanks for your time.
You need to use a loop that looks like this:
1. Acquire lock.
2. See if there's any work to be done. If not, release the lock and terminate.
3. Mark the work that we're going to do as not needing to be done anymore.
4. Release the lock.
5. Do the work.
6. (If necessary) Acquire the lock. Mark the work done and/or report results. Release the lock.
7. Go to step 1.
Notice how while holding the lock, the thread discovers what work it should do and then prevents any other thread from taking the same assignment before it releases the lock. Note also that the lock is not held while doing the work so that multiple threads can work concurrently.
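The loop described above can be sketched as follows. This is a minimal illustration, not the asker's program: the "work" is just bumping a per-request counter, and next_request, times_served, queue_mutex and run_workers are hypothetical names.

```c
#include <pthread.h>

#define NUM_REQUESTS 100
#define NUM_THREADS 4

static int times_served[NUM_REQUESTS];   /* how often each request was handled */
static int next_request;                 /* index of the next unclaimed request */
static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_mutex);
        if (next_request >= NUM_REQUESTS) {      /* no work left */
            pthread_mutex_unlock(&queue_mutex);
            break;
        }
        int mine = next_request++;               /* claim it before unlocking */
        pthread_mutex_unlock(&queue_mutex);

        /* do the work without holding the lock; slot `mine` belongs
           to this thread alone */
        times_served[mine]++;
    }
    return NULL;
}

/* Runs the workers and returns how many requests were served exactly once. */
int run_workers(void)
{
    pthread_t tid[NUM_THREADS];
    int served_once = 0;

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);

    for (int i = 0; i < NUM_REQUESTS; i++)
        if (times_served[i] == 1)
            served_once++;
    return served_once;
}
```

With this structure the four threads handle 100 requests in total rather than 100 each, because a request index can be claimed only once.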
You may want to post more of your code. How the arrays are set up, how the segment is passed to the individual threads, etc.
Note that using printf will perturb the timing of the threads. It takes its own lock for access to stdout, so it's probably better to compile these calls out. Or, have a set of per-thread logfiles so the printf calls don't block against one another.
Also, in your thread loop, once you set f to true, you can issue a break as there's no need to scan further.
val_p[i] is loop invariant, so we can fetch that just once at the start of the i loop.
We don't see k and cache, but you'd need to mutex wrap the code that sets these values.
But that does not protect against races in the for loop. You'd have to wrap the fetch of cache[j].id in a mutex lock/unlock pair inside the loop. You might get away without the mutex inside the loop on some architectures with strongly coherent caches (e.g. x86), but as far as the C standard is concerned it is still a data race.
You might be better off using stdatomic.h primitives. Here's a version that illustrates that. It compiles but I've not tested it:
#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

// shared state must be declared atomic for the atomic_* calls to be valid
atomic_int k;

struct cache {
    atomic_int id;
};
struct cache cache[100];

#ifdef DEBUG
#define dbgprt(...) \
    printf(__VA_ARGS__)
#else
#define dbgprt(...) \
    do { } while (0)
#endif

void *
thread_function(void *arg)
{
    int *val_p = arg;
    int i;
    int j;
    int cval;
    atomic_int *cptr;

    for (i = 0; i < 200; i += 2) {
        int pval = val_p[i];
        bool f = false;

        // decide if request has already been served
        for (j = 0; j < 100; j++) {
            cptr = &cache[j].id;
            cval = atomic_load(cptr);
            if (cval == pval) {
                f = true;
                break;
            }
        }

        if (f) {
            dbgprt("The request %d has been served.\n",pval);
            continue;
        }

        // increment the global k value [atomically]
        // on failure, the compare-exchange reloads kold with the current k
        int kold = atomic_load(&k);
        int knew;
        while (1) {
            knew = kold + 1;
            if (atomic_compare_exchange_strong(&k,&kold,knew))
                break;
        }

        // get current cache value
        cptr = &cache[kold].id;
        int oldval = atomic_load(cptr);

        // mark the cache
        // this should never loop because we atomically got our slot with
        // the k value
        while (1) {
            if (atomic_compare_exchange_strong(cptr,&oldval,pval))
                break;
        }

        dbgprt("\nCurrent request to be served:%d\n",pval);
    }

    return (void *) 0;
}
The problem is that we have to implement a kind of "running-contest" using pthreads. After one track we have to wait until all runners/threads are done until this point, so we use a barrier for that.
But now we also have to implement the probability of injuries. So we wrote a function that sometimes reduces the number of runners and reinitializes the barrier with a smaller count. Now the problem is that the program does not always terminate. I guess the reason is that some of the threads are already waiting at the barrier, and after it is reinitialized the required number of threads never arrives.
The code for the simulation of the injury looks like this:
void simulateInjury(int number) {
    int totalRunners = 0;
    int i = 0;
    if (rand() % 10 < 1) {
        printf("Runner of Team %i injured!\n", number);
        pthread_mutex_lock(&evaluate_teamsize);
        standings.teamSize[number]--;
        for (i = 0; i < teams; i++) {
            totalRunners += standings.teamSize[i];
        }
        pthread_barrier_destroy(&barrier_track1);
        pthread_barrier_destroy(&barrier_track4[number]);
        pthread_barrier_init(&barrier_track1, NULL, totalRunners);
        pthread_barrier_init(&barrier_track4[number], NULL, standings.teamSize[number]);
        pthread_mutex_unlock(&evaluate_teamsize);
        pthread_exit(NULL);
    }
}
Or is there maybe a way to just change the count argument of the barrier?
I see two errors:
1. You should not re-initialize a barrier while some thread is using it.
2. You should not execute the re-initialization of the barrier simultaneously by several threads.
For the first you can create a second barrier that you use in alternation with the first.
For the second you should use the return value of the wait function to designate one particular thread that will do the re-initialization.
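To illustrate the second point: pthread_barrier_wait returns PTHREAD_BARRIER_SERIAL_THREAD to exactly one of the waiting threads and 0 to the others, so that return value can elect the thread that does the bookkeeping. The sketch below only shows the election; RUNNERS, track_barrier, elected and run_track are hypothetical names, and the actual re-initialization of the second, currently unused barrier is left as a comment.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes pthread_barrier_* in strict -std= modes */
#include <pthread.h>

#define RUNNERS 4

static pthread_barrier_t track_barrier;
static int elected;   /* written only by the elected thread */

static void *runner(void *arg)
{
    (void)arg;
    /* ... run the track ... */
    int rc = pthread_barrier_wait(&track_barrier);
    if (rc == PTHREAD_BARRIER_SERIAL_THREAD) {
        /* exactly one thread per barrier cycle lands here; this is the
           thread that could re-initialize the other, unused barrier */
        elected++;
    }
    return NULL;
}

/* Runs one track and returns how many threads were elected. */
int run_track(void)
{
    pthread_t tid[RUNNERS];

    pthread_barrier_init(&track_barrier, NULL, RUNNERS);
    for (int i = 0; i < RUNNERS; i++)
        pthread_create(&tid[i], NULL, runner, NULL);
    for (int i = 0; i < RUNNERS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&track_barrier);
    return elected;
}
```

The barrier is destroyed only after every thread has been joined, which sidesteps the "re-initialized while in use" problem for this small demo.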
OS books say there must be a lock to protect data from being accessed by a reader and a writer at the same time.
But when I test this simple example on an x86 machine, it works well.
I want to know: is the lock necessary here?
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
struct doulnum
{
int i;
long int l;
char c;
unsigned int ui;
unsigned long int ul;
unsigned char uc;
};
long int global_array[100] = {0};
void* start_read(void *_notused)
{
int i;
struct doulnum d;
int di;
long int dl;
char dc;
unsigned char duc;
unsigned long dul;
unsigned int dui;
while(1)
{
for(i = 0;i < 100;i ++)
{
dl = global_array[i];
//di = d.i;
//dl = d.l;
//dc = d.c;
//dui = d.ui;
//duc = d.uc;
//dul = d.ul;
if(dl > 5 || dl < 0)
printf("error\n");
/*if(di > 5 || di < 0 || dl > 10 || dl < 5)
{
printf("i l value %d,%ld\n",di,dl);
exit(0);
}
if(dc > 15 || dc < 10 || dui > 20 || dui < 15)
{
printf("c ui value %d,%u\n",dc,dui);
exit(0);
}
if(dul > 25 || dul < 20 || duc > 30 || duc < 25)
{
printf("uc ul value %u,%lu\n",duc,dul);
exit(0);
}*/
}
}
}
int start_write(void)
{
int i;
//struct doulnum dl;
while(1)
{
for(i = 0;i < 100;i ++)
{
//dl.i = random() % 5;
//dl.l = random() % 5 + 5;
//dl.c = random() % 5 + 10;
//dl.ui = random() % 5 + 15;
//dl.ul = random() % 5 + 20;
//dl.uc = random() % 5 + 25;
global_array[i] = random() % 5;
}
}
return 0;
}
int main(int argc,char **argv)
{
int i;
cpu_set_t cpuinfo;
pthread_t pt[3];
//struct doulnum dl;
//dl.i = 2;
//dl.l = 7;
//dl.c = 12;
//dl.ui = 17;
//dl.ul = 22;
//dl.uc = 27;
for(i = 0;i < 100;i ++)
global_array[i] = 2;
for(i = 0;i < 3;i ++)
if(pthread_create(pt + i,NULL,start_read,NULL) != 0)
return -1;
/* for(i = 0;i < 3;i ++)
{
CPU_ZERO(&cpuinfo);
CPU_SET_S(i,sizeof(cpuinfo),&cpuinfo);
if(0 != pthread_setaffinity_np(pt[i],sizeof(cpu_set_t),&cpuinfo))
{
printf("set affinity %d\n",i);
exit(0);
}
}
CPU_ZERO(&cpuinfo);
CPU_SET_S(3,sizeof(cpuinfo),&cpuinfo);
if(0 != pthread_setaffinity_np(pthread_self(),sizeof(cpu_set_t),&cpuinfo))
{
printf("set affinity recver\n");
exit(0);
}*/
start_write();
return 0;
}
If you don't synchronise reads and writes, a reader could read while a writer is writing, and read the data in a half-written state if the write operation is not atomic. So yes, synchronisation would be necessary to keep that from happening.
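A half-written state is easiest to see with data that spans more than one word. The following sketch, with hypothetical names (struct pair, shared_mutex, run_pair_demo), keeps the invariant hi == lo + 5 and uses a mutex so that readers can never observe the moment between the two stores.

```c
#include <pthread.h>

#define ROUNDS 20000

struct pair { long lo; long hi; };   /* invariant: hi == lo + 5 */

static struct pair shared = { 0, 5 };
static pthread_mutex_t shared_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *writer(void *arg)
{
    (void)arg;
    for (long v = 0; v < ROUNDS; v++) {
        pthread_mutex_lock(&shared_mutex);
        shared.lo = v;        /* invariant is broken between these stores, */
        shared.hi = v + 5;    /* but no reader can look while we hold the lock */
        pthread_mutex_unlock(&shared_mutex);
    }
    return NULL;
}

static void *reader(void *arg)
{
    long *violations = arg;
    for (long i = 0; i < ROUNDS; i++) {
        pthread_mutex_lock(&shared_mutex);
        struct pair snap = shared;   /* consistent snapshot */
        pthread_mutex_unlock(&shared_mutex);
        if (snap.hi != snap.lo + 5)
            ++*violations;
    }
    return NULL;
}

/* Runs one writer and two readers; returns the number of broken snapshots. */
long run_pair_demo(void)
{
    pthread_t w, r[2];
    long violations[2] = { 0, 0 };

    pthread_create(&w, NULL, writer, NULL);
    for (int i = 0; i < 2; i++)
        pthread_create(&r[i], NULL, reader, &violations[i]);
    pthread_join(w, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(r[i], NULL);
    return violations[0] + violations[1];
}
```

With the lock in place run_pair_demo() should return 0. Removing the lock calls makes the program a data race, and whether a broken snapshot actually shows up then depends on the compiler and the hardware.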
You certainly need synchronization here. The simple reason is that the data can be in an inconsistent state while start_write is updating the global array and one of your 3 threads tries to read the same data from it.
What you quote is also imprecise. "must be a lock to protect data from accessed by reader and writer at the same time" should be "there must be a lock to protect data from being modified by one thread while another thread accesses it".
If the shared data is being modified by one of the threads and another thread is reading from it, you need to use a lock to protect it.
If the shared data is only being read by two or more threads, then you don't need to protect it.
It will work fine if the threads are just reading from global_array. printf should be fine since this does a single IO operation in append mode.
However, since the main thread calls start_write to update the global_array at the same time the other threads are in start_read, they are going to be reading the values in a very unpredictable manner. It depends highly on how the threads are implemented in the OS, how many CPUs/cores you have, etc. This might work well on your dual core development box but then fail spectacularly when you move to a 16 core production server.
For example, if the threads were not synchronizing, they might never see any updates to global_array in the right circumstances. Or some threads would see changes faster than others. It's all about the timing of when writes reach main memory and when the threads see the changes in their caches. To ensure consistent results you need synchronization (memory barriers) to force the caches to be updated.
The general answer is you need some way to ensure/enforce necessary atomicity, so the reader doesn't see an inconsistent state.
A lock (done correctly) is sufficient but not always necessary. But in order to prove that it's not necessary, you need to be able to say something about the atomicity of the operations involved.
This involves both the architecture of the target host and, to some extent, the compiler.
In your example, you're writing a long to an array. In this case, the question is: is storing a long atomic? It probably is, but it depends on the host. It's possible that the CPU writes out a portion of the long (upper/lower words/bytes) separately and thus the reader could get a value never written. (This is, I believe, unlikely on most modern CPU archs, but you'd have to check to be sure.)
It's also possible for there to be write buffering in the CPU. It's been a long time since I looked at this, but I believe it's possible to get store reordering if you don't have the necessary write barrier instructions. It's unclear from your example if you would be relying on this.
Finally, you'd probably need to flag the array as volatile (again, I haven't done this in a while so I'm rusty on the specifics) in order to ensure that the compiler doesn't make assumptions about the data not changing underneath it. Note that volatile by itself provides neither atomicity nor ordering between threads.
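With a C11 compiler, the atomicity question can be handed to the implementation by declaring the elements atomic. The sketch below is an assumption about how the array from the question could be declared, not the asker's code; global_slots, write_slot and read_slot are hypothetical names. Relaxed ordering is enough here only because each element is an independent value.

```c
#include <stdatomic.h>

#define SLOTS 100

static _Atomic long global_slots[SLOTS];   /* zero-initialized */

/* Store that can never be observed half-written. */
void write_slot(int i, long value)
{
    atomic_store_explicit(&global_slots[i], value, memory_order_relaxed);
}

/* Load that always returns some value that was actually stored. */
long read_slot(int i)
{
    return atomic_load_explicit(&global_slots[i], memory_order_relaxed);
}
```

A writer would call write_slot(i, random() % 5) and the readers read_slot(i); the compiler then has to emit whatever the target needs to keep each access whole, and it may not assume the value is unchanged between reads.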
It depends on how much you care about portability.
At least on an actual Intel x86 processor, when you're reading/writing dword (32-bit) data that's also dword aligned, the hardware gives you atomicity "for free" -- i.e., without your having to do any sort of lock to enforce it.
Changing much of anything (up to and including compiler flags that might affect the data's alignment) can break that -- but in ways that might remain hidden for a long time (especially if you have low contention over a particular data item). It also leads to extremely fragile code -- for example, switching to a smaller data type can break the code, even if you're only using a subset of the values.
The current atomic "guarantee" is pretty much an accidental side-effect of the way the cache and bus happen to be designed. While I'm not sure I'd really expect a change that broke things, I wouldn't consider it particularly far-fetched either. The only place I've seen documentation of this atomic behavior was in the same processor manuals that cover things like model-specific registers that definitely have changed (and continue to change) from one model of processor to the next.
The bottom line is that you really should do the locking, but you probably won't see a manifestation of the problem with your current hardware, no matter how much you test (unless you change conditions like mis-aligning the data).