Pthreads- Mutex Locks but variables don't change - c

I'm writing an extremely simple program to demonstrate a Pthreads implementation I ported from C++ back to C.
I create two lock-step threads and give them two jobs
One increments a1 once per step
One decrements a2 once per step
During the synchronized phase (When the mutexes are locked for both t1 and t2) I compare a1 and a2 to see if we should stop stepping.
I'm wondering if i'm going crazy here because not only does the variable not always change after stepping and locking but they sometimes change at different rates as if the threads were running even after the locks.
EDIT: Yes, I did research this. Yes, the C++ implementation works. Yes, the C++ implementation is nearly identical to this one, but I had to cast PTHREAD_MUTEX_INITIALIZER and PTHREAD_COND_INITIALIZER in c and pass this as the first argument to every function. I spent a while trying to debug this (short of whipping out gdb) to no avail.
#ifndef LOCKSTEPTHREAD_H
#define LOCKSTEPTHREAD_H
#include <pthread.h>
#include <stdio.h>
typedef struct {
pthread_mutex_t myMutex;
pthread_cond_t myCond;
pthread_t myThread;
int isThreadLive;
int shouldKillThread;
void (*execute)();
} lsthread;
void init_lsthread(lsthread* t);
void start_lsthread(lsthread* t);
void kill_lsthread(lsthread* t);
void kill_lsthread_islocked(lsthread* t);
void lock(lsthread* t);
void step(lsthread* t);
void* lsthread_func(void* me_void);
#ifdef LOCKSTEPTHREAD_IMPL
//function declarations
void init_lsthread(lsthread* t){
//pthread_mutex_init(&(t->myMutex), NULL);
//pthread_cond_init(&(t->myCond), NULL);
t->myMutex = (pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER;
t->myCond = (pthread_cond_t)PTHREAD_COND_INITIALIZER;
t->isThreadLive = 0;
t->shouldKillThread = 0;
t->execute = NULL;
}
void destroy_lsthread(lsthread* t){
pthread_mutex_destroy(&t->myMutex);
pthread_cond_destroy(&t->myCond);
}
void kill_lsthread_islocked(lsthread* t){
if(!t->isThreadLive)return;
//lock(t);
t->shouldKillThread = 1;
step(t);
pthread_join(t->myThread,NULL);
t->isThreadLive = 0;
t->shouldKillThread = 0;
}
void kill_lsthread(lsthread* t){
if(!t->isThreadLive)return;
lock(t);
t->shouldKillThread = 1;
step(t);
pthread_join(t->myThread,NULL);
t->isThreadLive = 0;
t->shouldKillThread = 0;
}
void lock(lsthread* t){
if(pthread_mutex_lock(&t->myMutex))
puts("\nError locking mutex.");
}
void step(lsthread* t){
if(pthread_cond_signal(&(t->myCond)))
puts("\nError signalling condition variable");
if(pthread_mutex_unlock(&(t->myMutex)))
puts("\nError unlocking mutex");
}
void* lsthread_func(void* me_void){
lsthread* me = (lsthread*) me_void;
int ret;
if (!me)pthread_exit(NULL);
if(!me->execute)pthread_exit(NULL);
while (!(me->shouldKillThread)) {
ret = pthread_cond_wait(&(me->myCond), &(me->myMutex));
if(ret)pthread_exit(NULL);
if (!(me->shouldKillThread) && me->execute)
me->execute();
}
pthread_exit(NULL);
}
void start_lsthread(lsthread* t){
if(t->isThreadLive)return;
t->isThreadLive = 1;
t->shouldKillThread = 0;
pthread_create(
&t->myThread,
NULL,
lsthread_func,
(void*)t
);
}
#endif
#endif
This is my driver program:
#define LOCKSTEPTHREAD_IMPL
#include "include/lockstepthread.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
unsigned char a1, a2;
void JobThread1(){
unsigned char copy = a1;
copy++;
a1 = copy;
}
void JobThread2(){
unsigned char copy = a2;
copy--;
a2 = copy;
}
int main(){
char inputline[2048];
inputline[2047] = '\0';
lsthread t1, t2;
init_lsthread(&t1);
init_lsthread(&t2);
t1.execute = JobThread1;
t2.execute = JobThread2;
printf(
"\nThis program demonstrates threading by having"
"\nTwo threads \"walk\" toward each other using unsigned chars."
"\nunsigned Integer overflow guarantees the two will converge."
);
printf("\nEnter a number for thread 1 to process: ");
fgets(inputline, 2047,stdin);
a1 = (unsigned char)atoi(inputline);
printf("\nEnter a number for thread 2 to process: ");
fgets(inputline, 2047,stdin);
a2 = (unsigned char)atoi(inputline);
start_lsthread(&t1);
start_lsthread(&t2);
unsigned int i = 0;
lock(&t1);
lock(&t2);
do{
printf("\n%u: a1 = %d, a2 = %d",i++,(int)a1,(int)a2);
fflush(stdout);
step(&t1);
step(&t2);
lock(&t1);
lock(&t2);
}while(a1 < a2);
kill_lsthread_islocked(&t1);
kill_lsthread_islocked(&t2);
destroy_lsthread(&t1);
destroy_lsthread(&t2);
return 0;
}
Example program usage:
Enter a number for thread 1 to process: 5
Enter a number for thread 2 to process: 10
0: a1 = 5, a2 = 10
1: a1 = 5, a2 = 10
2: a1 = 5, a2 = 10
3: a1 = 5, a2 = 10
4: a1 = 5, a2 = 10
5: a1 = 5, a2 = 10
6: a1 = 6, a2 = 9
7: a1 = 6, a2 = 9
8: a1 = 7, a2 = 9
9: a1 = 7, a2 = 9
10: a1 = 7, a2 = 9
11: a1 = 7, a2 = 9
12: a1 = 8, a2 = 9
So, what's the deal?

Generally speaking, it sounds like what you're really looking for is a barrier. Nevertheless, I answer the question as posed.
Yes, the C++ implementation works. Yes, the C++ implementation is
nearly identical to this one, but I had to cast
PTHREAD_MUTEX_INITIALIZER and PTHREAD_COND_INITIALIZER in c and pass
this as the first argument to every function. I spent a while trying
to debug this (short of whipping out gdb) to no avail.
That seems unlikely. There are data races and undefined behavior all over the code presented, whether interpreted as C or as C++.
General design
Since you provide an explicit lock() function, which seems reasonable, you should provide an explicit unlock() function as well. Any other functions that expect to be called with the mutex locked should return with the mutex locked, so that the caller can explicitly pair lock() calls with unlock() calls. Failure to adhere to this pattern invites bugs.
In particular, step() should not unlock the mutex unless it also locks it, but I think a non-locking version will suit the purpose.
Initialization
I had to cast PTHREAD_MUTEX_INITIALIZER and PTHREAD_COND_INITIALIZER in c
No, you didn't, because you can't, at least not if pthread_mutex_t and pthread_cond_t are structure types. Initializers for structure types are not values. They do not have types and cannot be cast. But you can form compound literals from them, and that is what you have inadvertently done. This is not a conforming way to assign a value to a pthread_mutex_t or a pthread_cond_t.* The initializer macros are designated for use only in initializing variables in their declarations. That's what "initializer" means in this context.
Example:
pthread_mutex_t mutex = PTREAD_MUTEX_INITIALIZER;
Example:
struct {
pthread_mutex_t mutex;
pthread_cond_t cv;
} example = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };
To initialize a mutex or condition variable object in any other context requires use of the corresponding initialization function, pthread_mutex_init() or pthread_cond_init().
Data races
Non-atomic accesses to shared data by multiple concurrently-running threads must be protected by a mutex or other synchronization object if any of the accesses are writes (exceptions apply for accesses to mutexes and other synchronization objects themselves). Shared data in your example include file-scope variables a1 and a2, and most of the members of your lsthread instances. Your lsthread_func and driver both sometimes fail to lock the appropriate mutex before accessing those shared data, and some of the accesses involved are indeed writes, so undefined behavior ensues. Observing unexpected values of a1 and a2 is an entirely plausible manifestation of that undefined behavior.
Condition variable usage
The thread that calls pthread_cond_wait() must do so while holding the specified mutex locked. Your lsthread_func() does not adhere to that requirement, so more undefined behavior ensues. If you're very lucky, that might manifest as an immediate spurious wakeup.
And speaking of spurious wakeups, you do not guard against them. If one does occur then lsthread_func() blithely goes on to perform another iteration of its loop. To avoid this, you need shared data somewhere upon which the condition of the condition variable is predicated. The standard usage of a CV is to check that predicate before waiting, and to loop back and check it again after waking up, repeatedly if necessary, not proceeding until the predicate evaluates true.
Synchronized stepping
The worker threads do not synchronize directly with each other, so only the driver can ensure that one does not run ahead of the other. But it doesn't. The driver does nothing at all to ensure that either thread has completed a step before signaling both threads to perform another. Condition variables do not store signals, so if, by some misfortune of scheduling or by the nature of the tasks involved, one thread should get a step ahead of the other, it will remain ahead until and unless the error happens to be spontaneously balanced by a misstep on the other side.
Probably you want to add a lsthread_wait() function that waits for the thread to complete a step. That would involve use of the CV from the opposite direction.
Overall, you could provide (better) for single-stepping by
Adding a member to type lsthread to indicate whether the thread should be or is executing a step vs. whether it is between steps and should wait.
typedef struct {
// ...
_Bool should_step;
} lsthread;
Adding the aforementioned lsthread_wait(), maybe something like this:
// The calling thread must hold t->myMutex locked
void lsthread_wait(lsthread *t) {
// Wait, if necessary, for the thread to complete a step
while (t->should_step) {
pthread_cond_wait(&t->myCond, &t->myMutex);
}
assert(!t->should_step);
// Prepare to perform another step
t->should_step = 1;
}
That would be paired with a revised version of lsthread_func():
void* lsthread_func(void* me_void){
lsthread* me = (lsthread*) me_void;
if (!me) pthread_exit(NULL);
lock(me); // needed to protect access to *me members and to globals
while (!me->shouldKillThread && me->execute) {
while (!me->should_step && !me->shouldKillThread) {
int ret = pthread_cond_wait(&(me->myCond), &(me->myMutex));
if (ret) {
unlock(me); // mustn't forget to unlock
pthread_exit(NULL);
}
}
assert(me->should_step || me->shouldKillThread);
if (!me->shouldKillThread && me->execute) {
me->execute();
}
// Mark and signal step completed
me->should_step = 0;
ret = pthread_cond_broadcast(me->myCond);
if (ret) break;
}
unlock(me);
pthread_exit(NULL);
}
Modifying step() to avoid it unlocking the mutex.
Modifying the driver loop to use the new wait function appropriately
lock(&t1);
lock(&t2);
do {
printf("\n%u: a1 = %d, a2 = %d", i++, (int) a1, (int) a2);
fflush(stdout);
step(&t1);
step(&t2);
lsthread_wait(&t1);
lsthread_wait(&t2);
} while (a1 < a2); // both locks must be held when this condition is evaluated
kill_lsthread_islocked(&t2);
kill_lsthread_islocked(&t1);
unlock(&t2);
unlock(&t1);
That's not necessarily all the changes that would be required, but I think I've covered the all the key points.
Final note
The above suggestions are based on the example program, in which different worker threads do not access any of the same shared data. That makes it feasible to use per-thread mutexes to protect the shared data they do access. If the workers accessed any of the same shared data, and those or any thread running concurrently with them modified that same data, then per-worker-thread mutexes would not offer sufficient protection.
* And if pthread_mutex_t and pthread_cond_t were pointer or integer types, which is allowed, then the compiler would have accepted the assignments in question without a cast (which would actually be a cast in that case), but those assignments still would be non-conforming as far as pthreads is concerned.

Related

Memory ordering for a spin-lock "call once" implementation

Suppose I wanted to implement a mechanism for calling a piece of code exactly once (e.g. for initialization purposes), even when multiple threads hit the call site repeatedly. Basically, I'm trying to implement something like pthread_once, but with GCC atomics and spin-locking. I have a candidate implementation below, but I'd like to know if
a) it could be faster in the common case (i.e. already initialized), and,
b) is the selected memory ordering strong enough / too strong?
Architectures of interest are x86_64 (primarily) and aarch64.
The intended use API is something like this
void gets_called_many_times_from_many_threads(void)
{
static int my_once_flag = 0;
if (once_enter(&my_once_flag)) {
// do one-time initialization here
once_commit(&my_once_flag);
}
// do other things that assume the initialization has taken place
}
And here is the implementation:
int once_enter(int *b)
{
int zero = 0;
int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
if (got_lock) return 1;
while (2 != __atomic_load_n(b, __ATOMIC_ACQUIRE)) {
// on x86, insert a pause instruction here
};
return 0;
}
void once_commit(int *b)
{
(void) __atomic_store_n(b, 2, __ATOMIC_RELEASE);
}
I think that the RELAXED ordering on the compare exchange is okay, because we don't skip the atomic load in the while condition even if the compare-exchange gives us 2 (in the "zero" variable), so the ACQUIRE on that load synchronizes with the RELEASE in once_commit (I think), but maybe on a successful compare-exchange we need to use RELEASE? I'm unclear here.
Also, I just learned that lock cmpxchg is a full memory barrier on x86, and since we are hitting the __atomic_compare_exchange_n in the common case (initialization has already been done), that barrier it is occurring on every function call. Is there an easy way to avoid this?
UPDATE
Based on the comments and accepted answer, I've come up with the following modified implementation. If anybody spots a bug please let me know, but I believe it's correct. Basically, the change amounts to implementing double-check locking. I also switched to using SEQ_CST because:
I mainly care that the common (already initialized) case is fast.
I observed that GCC doesn't emit a memory fence instruction on x86 for the first read (and it does do so on ARM even with ACQUIRE).
#ifdef __x86_64__
#define PAUSE() __asm __volatile("pause")
#else
#define PAUSE()
#endif
int once_enter(int *b)
{
if(2 == __atomic_load_n(b, __ATOMIC_SEQ_CST)) return 0;
int zero = 0;
int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
if (got_lock) return 1;
while (2 != __atomic_load_n(b, __ATOMIC_SEQ_CST)) {
PAUSE();
};
return 0;
}
void once_commit(int *b)
{
(void) __atomic_store_n(b, 2, __ATOMIC_SEQ_CST);
}
a, What you need is a double-checked lock.
Basically, instead of entering the lock every time, you do an acquiring-load to see if the initialisation has been done yet, and only invoke once_enter if it has not.
void gets_called_many_times_from_many_threads(void)
{
static int my_once_flag = 0;
if (__atomic_load_n(&my_once_flag, __ATOMIC_ACQUIRE) != 2) {
if (once_enter(&my_once_flag)) {
// do one-time initialization here
once_commit(&my_once_flag);
}
}
// do other things that assume the initialization has taken place
}
b, I believe this is enough, your initialisation happens before the releasing store of 2 to my_once_flag, and every other thread has to observe the value of 2 with an acquiring load from the same variable.

why my program throws 14000000 instead of 10000000 using threads?

i wrote a simple c program to make every thread multiplate its index by 1000000 and add it to sum , i created 5 threads so the logic answer would be (0+1+2+3+4)*1000000 which is 10000000 but it throws 14000000 instead .could anyone helps me understanding this?
#include<pthread.h>
#include<stdio.h>
typedef struct argument {
int index;
int sum;
} arg;
void *fonction(void *arg0) {
((arg *) arg0) -> sum += ((arg *) arg0) -> index * 1000000;
}
int main() {
pthread_t thread[5];
int order[5];
arg a;
for (int i = 0; i < 5; i++)
order[i] = i;
a.sum = 0;
for (int i = 0; i < 5; i++) {
a.index = order[i];
pthread_create(&thread[i], NULL, fonction, &a);
}
for (int i = 0; i < 5; i++)
pthread_join(thread[i], NULL);
printf("%d\n", a.sum);
return 0;
}
It is 140.. because the behavior is undefined. The results will differ on different machines and other environmental factors. The undefined behavior is caused as a result of all threads accessing the same object (see &a given to each thread) that is modified after the first thread is created.
When each thread runs it accesses the same index (as part of accessing a member of the same object (&a)). Thus the assumption that the threads will see [0,1,2,3,4] is incorrect: multiple threads likely see the same value of index (eg. [0,2,4,4,4]1) when they run. This depends on the scheduling with the loop creating threads as it also modifies the shared object.
When each thread updates sum it has to read and write to the same shared memory. This is inherently prone to race conditions and unreliable results. For example, it could be lack of memory visibility (thread X doesn’t see value updated from thread Y) or it could be a conflicting thread schedule between the read and write (thread X read, thread Y read, thread X write, thread Y write) etc..
If creating a new arg object for each thread, then both of these problems are avoided. While the sum issue can be fixed with the appropriate locking, the index issue can only be fixed by not sharing the object given as the thread input.
// create 5 arg objects, one for each thread
arg a[5];
for (..) {
a[i].index = i;
// give DIFFERENT object to each thread
pthread_create(.., &a[i]);
}
// after all threads complete
int sum = 0;
for (..) {
sum += a[i].result;
}
1 Even assuming that there is no race condition in the current execution wrt. the usage of sum, the sequence for the different threads seeing index values as [0,2,4,4,4], the sum of which is 14, might look as follows:
a.index <- 0 ; create thread A
thread A reads a.index (0)
a.index <- 1 ; create thread B
a.index <- 2 ; create thread C
thread B reads a.index (2)
a.index <- 3 ; create thread D
a.index <- 4 ; create thread E
thread D reads a.index (4)
thread C reads a.index (4)
thread E reads a.index (4)

multithreading with mutexes in c and running one thread at a time

I have an array of 100 requests(integers). I want to create 4 threads to which i call a function(thread_function) and with this function i want every thread to take one by one the requests:
(thread0->request0,
thread1->request1,
thread2->request2,
thread3->request3
and then thread0->request4 etc up to 100) all these by using mutexes.
Here is the code i have writen so far:
threadRes = pthread_create(&(threadID[i]), NULL,thread_function, (void *)id_size);
This is inside my main and it is in a loop for 4 times.Now outside my main:
void *thread_function(void *arg){
int *val_p=(int *) arg;
for(i=0; i<200; i=i+2)
{
f=false;
for (j= 0; j<100; j++)
{
if (val_p[i]==cache[j].id)
f=true;
}
if(f==true)
{
printf("The request %d has been served.\n",val_p[i]);
}
else
{
cache[k].id=val_p[i];
printf("\nCurrent request to be served:%d \n",cache[k].id);
k++;
}
}
Where: val_p is the array with the requests and cache is an array of structs to store the id(requests).
-So now i want mutexes to synchronize my threads. I considered using inside my main:
pthread_join(threadID[0], NULL);
pthread_join(threadID[1], NULL);
pthread_join(threadID[2], NULL);
pthread_join(threadID[3], NULL);
pthread_mutex_destroy(&mutex);
and inside the function to use:
pthread_mutex_lock(&mutex);
pthread_mutex_unlock(&mutex);
Before i finish i would like to say that so far my programm result is that 4threads run 100 requests each(400) and what i want to achieve is that 4threads run 100 threads total.
Thanks for your time.
You need to use a loop that looks like this:
Acquire lock.
See if there's any work to be done. If not, release the lock and terminate.
Mark the work that we're going to do as not needing to be done anymore.
Release the lock.
Do the work.
(If necessary) Acquire the lock. Mark the work done and/or report results. Release the lock.
Go to step 1.
Notice how while holding the lock, the thread discovers what work it should do and then prevents any other thread from taking the same assignment before it releases the lock. Note also that the lock is not held while doing the work so that multiple threads can work concurrently.
You may want to post more of your code. How the arrays are set up, how the segment is passed to the individual threads, etc.
Note that using printf will perturb the timing of the threads. It does its own mutex for access to stdout, so it's probably better to no-op this. Or, have a set of per-thread logfiles so the printf calls don't block against one another.
Also, in your thread loop, once you set f to true, you can issue a break as there's no need to scan further.
val_p[i] is loop invariant, so we can fetch that just once at the start of the i loop.
We don't see k and cache, but you'd need to mutex wrap the code that sets these values.
But, that does not protect against races in the for loop. You'd have to wrap the fetch of cache[j].id in a mutex pair inside the loop. You might be okay without the mutex inside the loop on some arches that have good cache snooping (e.g. x86).
You might be better off using stdatomic.h primitives. Here's a version that illustrates that. It compiles but I've not tested it:
#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>
int k;
#define true 1
#define false 0
struct cache {
int id;
};
struct cache cache[100];
#ifdef DEBUG
#define dbgprt(_fmt...) \
printf(_fmt)
#else
#define dbgprt(_fmt...) \
do { } while (0)
#endif
void *
thread_function(void *arg)
{
int *val_p = arg;
int i;
int j;
int cval;
int *cptr;
for (i = 0; i < 200; i += 2) {
int pval = val_p[i];
int f = false;
// decide if request has already been served
for (j = 0; j < 100; j++) {
cptr = &cache[j].id;
cval = atomic_load(cptr);
if (cval == pval) {
f = true;
break;
}
}
if (f == true) {
dbgprt("The request %d has been served.\n",pval);
continue;
}
// increment the global k value [atomically]
int kold = atomic_load(&k);
int knew;
while (1) {
knew = kold + 1;
if (atomic_compare_exchange_strong(&k,&kold,knew))
break;
}
// get current cache value
cptr = &cache[kold].id;
int oldval = atomic_load(cptr);
// mark the cache
// this should never loop because we atomically got our slot with
// the k value
while (1) {
if (atomic_compare_exchange_strong(cptr,&oldval,pval))
break;
}
dbgprt("\nCurrent request to be served:%d\n",pval);
}
return (void *) 0;
}

Allocate the resource of some threads to a global resource such that there is no concurrency(using only mutexes)

The summary of the problem is the following: Given a global resource of size N and M threads with their resource size of Xi (i=1,M) , syncronize the threads such that a thread is allocated,it does its stuff and then it is deallocated.
The main problem is that there are no resources available and the thread has to wait until there is enough memory. I have tried to "block" it with a while statement, but I realized that two threads can pass the while loop and the first to be allocated can change the global resource such that the second thread does not have enough space, but it has already passed the conditional section.
//piece of pseudocode
...
int MAXRES = 100;
// in thread function
{
while (thread_res > MAXRES);
lock();
allocate_res(); // MAXRES-=thread_res;
unlock();
// do own stuff
lock()
deallocate(); // MAXRES +=thread_res;
unlock();
}
To make a robust solution you need something more. As you noted, you need a way to wait until the condition ' there are enough resources available for me ', and a proper mutex has no such mechanism. At any rate, you likely want to bury that in the allocator, so your mainline thread code looks like:
....
n = Allocate()
do stuff
Release(n)
....
and deal with the contention:
int navail = N;
int Acquire(int n) {
lock();
while (n < navail) {
unlock();
lock();
}
navail -= n;
unlock();
return n;
}
void Release(int n) {
lock();
navail += n;
unlock();
}
But such a spin waiting system may fail because of priorities -- if the highest priority thread is spinning, a thread trying to Release may not be able to run. Spinning isn't very elegant, and wastes energy. You could put a short nap in the spin, but if you make the nap too short it wastes energy, too long and it increases latency.
You really want a signalling primitive like semaphores or condvars rather than a locking primitive. With a semaphore, it could look impressively like:
Semaphore *sem;
int Acquire(int n) {
senter(sem, n);
return n;
}
int Release(int n) {
sleave(sem, n);
return n;
}
void init(int N) {
sem = screate(N);
}
Update: Revised to use System V semaphores, which provides the ability to specify arbitrary 'checkout' and 'checkin' to the semaphore value. Overall logic the same.
Disclaimer: I did not use System V for few years, test before using, in case I missed some details.
int semid ;
// Call from main
do_init() {
shmkey = ftok(...) ;
semid = shmget(shmkey, 1, ...) ;
// Setup 100 as max resources.
struct sembuf so ;
so.sem_num = 0 ;
so.sem_op = 100 ;
so.sem_flg = 0 ;
semop(semid, &so, 1) ;
}
// Thread work
do_thread_work() {
int N = <number of resources>
struct sembuf so ;
so.sem_num = 0;
so.sem_op = -N ;
so.sem_flg = 0 ;
semop(semid, &so, 1) ;
... Do thread work
// Return resources
so.sem_op = +N ;
semop(semid, &so, 1) ;
}
As per: https://pubs.opengroup.org/onlinepubs/009695399/functions/semop.html, this will result in atomic checkout of multiple items.
Note: the ... as sections related to IPC_NOWAIT and SEM_UNDO, not relevant to this case.
If sem_op is a negative integer and the calling process has alter permission, one of the following shall occur:
If semval(see ) is greater than or equal to the absolute value of sem_op, the absolute value of sem_op is subtracted from semval. ...
...
If semval is less than the absolute value of sem_op ..., semop() shall increment the semncnt associated with the specified semaphore and suspend execution of the calling thread until one of the following conditions occurs:
The value of semval becomes greater than or equal to the absolute value of sem_op. When this occurs, the value of semncnt associated with the specified semaphore shall be decremented, the absolute value of sem_op shall be subtracted from semval and, ... .

Atomic swap in C

I think I'm missing something obvious here. I have code like this:
int *a = <some_address>;
int *b = <another_address>;
// ...
// desired atomic swap of addresses a and b here
int *tmp = a;
a = b; b = tmp;
I want to do this atomically. I've been looking at __c11_atomic_exchange and on OSX the OSAtomic methods, but nothing seems to perform a straight swap atomically. They can all write a value into 1 of the 2 variables atomically, but never both.
Any ideas?
It depends a bit on your architecture, what is effectively and efficiently possible. In any case you would need that the two pointers that you want to swap be adjacent in memory, the easiest that you have them as some like
typedef struct pair { void *a[2]; } pair;
Then you can, if you have a full C11 compiler use _Atomic(pair) as an atomic type to swap the contents of such a pair: first load the actual pair in a temporary, construct a new one with the two values swapped, do a compare-exchange to store the new value and iterate if necessary:
inline
void pair_swap(_Atomic(pair) *myPair) {
pair actual = { 0 };
pair future = { 0 };
while (!atomic_compare_exchange_weak(myPair, &actual, future)) {
future.a[0] = actual.a[1];
future.a[1] = actual.a[0];
}
}
Depending on your architecture this might be realized with a lock free primitive operation or not. Modern 64 bit architectures usually have 128bit atomics, there this could result in just some 128 bit assembler instructions.
But this would depend a lot on the quality of the implementation of atomics on your platform. P99 would provide you with an implementation that implements the "functionlike" part of atomic operations, and that supports 128 bit swap, when detected; on my machine (64 bit Intel linux) this effectively compiles into a lock cmpxchg16b instruction.
Relevant documentation:
https://en.cppreference.com/w/c/language/atomic
https://en.cppreference.com/w/c/thread
https://en.cppreference.com/w/c/atomic/atomic_exchange
https://en.cppreference.com/w/c/atomic/atomic_compare_exchange
You should use a mutex.
Or, if you don't want your thread to go to sleep if the mutex isn't immediately available, you can improve efficiency by using a spin lock, since the number of times the spin lock has to spin is frequently so small that it is faster to spin than to put a thread to sleep and have a whole context switch.
C11 has a whole suite of atomic types and functions which allow you to easily implement your own spin lock to accomplish this.
Keep in mind, however, that a basic spin lock must not be used on a bare metal (no operating system) single core (processor) system, as it will result in deadlock if the main context takes the lock and then while the lock is still taken, an ISR fires and tries to take the lock. The ISR will block indefinitely.
To avoid this, more-complicated locking mechanisms must be used, such as a spin lock with a timeout, an ISR which auto-reschedules itself for a slightly later time if the lock is taken, automatic yielding, etc.
Quick answer:
Here is what we will be building up to:
int *a = &some_int1;
int *b = &some_int2;
spinlock_t spinlock = SPINLOCK_UNLOCKED;
// swap `a` and `b`
atomic_swap_int_ptr(&spinlock, &a, &b);
Here are the details:
For a basic spin lock (mutex) on a multi-core system, which is where spin locks are best, however, try this:
Basic spin lock implementation in C11 using _Atomic types
#include <stdbool.h>
#include <stdatomic.h>
#define SPINLOCK_UNLOCKED 0
#define SPINLOCK_LOCKED 1
typedef volatile atomic_bool spinlock_t;
// This is how you'd create a new spinlock to use in the functions below.
// spinlock_t spinlock = SPINLOCK_UNLOCKED;
/// \brief Set the spin lock to `SPINLOCK_LOCKED`, and return the previous
/// state of the `spinlock` object.
/// \details If the previous state was `SPINLOCK_LOCKED`, you can know that
/// the lock was already taken, and you do NOT now own the lock. If the
/// previous state was `SPINLOCK_UNLOCKED`, however, then you have now locked
/// the lock and own the lock.
bool atomic_test_and_set(spinlock_t* spinlock)
{
bool spinlock_val_old = atomic_exchange(spinlock, SPINLOCK_LOCKED);
return spinlock_val_old;
}
/// Spin (loop) until you take the spin lock.
void spinlock_lock(spinlock_t* spinlock)
{
// Spin lock: loop forever until we get the lock; we know the lock was
// successfully obtained after exiting this while loop because the
// atomic_test_and_set() function locks the lock and returns the previous
// lock value. If the previous lock value was SPINLOCK_LOCKED then the lock
// was **already** locked by another thread or process. Once the previous
// lock value was SPINLOCK_UNLOCKED, however, then it indicates the lock
// was **not** locked before we locked it, but now it **is** locked because
// we locked it, indicating we own the lock.
while (atomic_test_and_set(spinlock) == SPINLOCK_LOCKED) {};
}
/// Release the spin lock.
void spinlock_unlock(spinlock_t* spinlock)
{
*spinlock = SPINLOCK_UNLOCKED;
}
Now, you can use the spin lock, for your example, in your multi-core system like this:
/// Atomically swap two pointers to int (`int *`).
/// NB: this does NOT swap their contents, meaning: the integers they point to.
/// Rather, it swaps the pointers (addresses) themselves.
void atomic_swap_int_ptr(spinlock_t * spinlock, int ** ptr1, int ** ptr2)
{
spinlock_lock(spinlock)
// critical, atomic section start
int * temp = *ptr1;
*ptr1 = *ptr2;
*ptr2 = temp;
// critical, atomic section end
spinlock_unlock(spinlock)
}
// Demo to perform an address swap
int *a = &some_int1;
int *b = &some_int2;
spinlock_t spinlock = SPINLOCK_UNLOCKED;
// swap `a` and `b`
atomic_swap_int_ptr(&spinlock, &a, &b); // <=========== final answer
References
https://en.wikipedia.org/wiki/Test-and-set#Pseudo-C_implementation_of_a_spin_lock
C11 Concurrency support library and atomic_bool and other types: https://en.cppreference.com/w/c/thread
C11 atomic_exchange() function: https://en.cppreference.com/w/c/atomic/atomic_exchange
Other links
Note that you could use C11's struct atomic_flag as well, and its corresponding atomic_flag_test_and_set() function, instead of creating your own atomic_test_and_set() function using atomic_exchange(), like I did above.
_Atomic types in C11: https://en.cppreference.com/w/c/language/atomic
C++11 Implementation of Spinlock using header <atomic>
TODO
This implementation would need to be modified to add safety anti-deadlock mechanisms such as automatic deferral, timeout, and retry, to prevent deadlock on single core and bare-metal systems. See my notes here: Add basic mutex (lock) and spin lock implementations in C11, C++11, AVR, and STM32 and here: What are the various ways to disable and re-enable interrupts in STM32 microcontrollers in order to implement atomic access guards?

Resources