Locking in OMP regions - c

I'm trying to instrument some functions in my application to see how long they take. I'm recording all of the times in-memory using a linked list.
In the process, I've introduced a global variable that keeps track of the end of the list. When I enter a new timing region, I insert a new record at the end of the list. Fairly simple stuff.
Some of the functions I want to track are called in OpenMP regions, however. Which means they'll likely be called multiple times in parallel. And this is where I'm stumped.
If this was using normal Pthreads, I'd simply wrap access to the global variable in a mutex and call it a day. However, I'm unsure: will this strategy still work with functions called in an OpenMP region? As in, will they respect the lock?
For example (won't compile, but I think gets the point across):
Record *head;
Record *tail;
void start_timing(char *name) {
Record *r = create_record(name);
tail->next_record = r;
tail = r;
return r;
}
int foo(void) {
Record r = start_timing("foo");
//Do something...
stop_timing(r);
}
int main(void) {
Record r = start_timing("main");
//Do something...
#pragma omp parallel for...
for (int i = 0; i < 42; i++) {
foo();
}
//Do some more...
stop_timing(r);
}
Which I would then update to:
void start_timing(char *name) {
Record *r = create_record(name);
acquire_mutex_on_tail();
tail->next_record = r;
tail = r;
release_mutex_on_tail();
return r;
}
(Apologies if this has an obvious answer - I'm relatively inexperience with the OpenMP framework and multithreading in general.)

The idiomatic mutex solution is to use OpenMP locks:
omp_set_lock(&taillock)
tail->next_record = r;
tail = r;
omp_unset_lock(&taillock)
and somewhere:
omp_lock_t taillock;
omp_init_lock(&taillock);
...
omp_destroy_lock(&taillock);
The simple OpenMP solution:
void start_timing(char *name) {
Record *r = create_record(name);
#pragma omp critical
{
tail->next_record = r;
tail = r;
}
return r;
}
That creates an implicit global lock bound to the source code line. For some detailed discussions see the answers to this question.
For practical purposes, using Pthread locks will also work, at least for scenarios where OpenMP is based on Pthreads.
A word of warning
Using locks in performance measurement code is dangerous. So is memory allocation, which also often implies using locks. This means, that start_time has significant cost and the performance will even get worse with more threads. That doesn't even consider the cache invalidation from having one thread allocating a chunk of memory (record) and then another thread modifying it (tail pointer).
Now that may be fine if the sections you measure take seconds, but it will cause great overhead and perturbation when your sections are only hundreds of cycles.
To create a scalable performance tracing facility, you must pre-allocate thread-local memory in larger chunks and have each thread write only to it's local part.
You can also chose to use some of the existing measurement infrastructures, such as Score-P.
Overhead & perturbation
First, distinguish between the two (linked concepts). Overhead is extra time you spend, while perturbation refers to the impact on what you measure (i.e. you now measure something different than what happens without the measurement). Overhead is undesirable in large quantities, but perturbation is much worse.
Yes, you can avoid some of the perturbation by pausing the timer during your expensive measurement runtime (the overhead remains). However, in a multi-threaded context this is still very problematic.
Slowing down progress in one thread, may lead to other threads waiting for it e.g. during an implicit barrier. How do you attribute the waiting time of that thread and others that follow transitively?
Memory allocation is usually locked - so if you allocate memory during measurement runtime, you will slow down other threads that depend on memory allocation. You could try to mitigate with memory pools, but I'd avoid the linked list in the first place.

Related

Manual synchronization in OpenMP while loop

I recently started working with OpenMP to do some 'research' for an project in university. I have a rectangular and evenly spaced grid on which I'm solving a partial differential equation with an iterative scheme. So I basically have two for-loops (one in x- and y-direction of the grid each) wrapped by a while-loop for the iterations.
Now I want to investigate different parallelization schemes for this. The first (obvious) approach was to do a spatial a parallelization on the for loops.
Works fine too.
The approach I have problems with is a more tricky idea. Each thread calculates all grid points. The first thread starts solving the equation at the first grid row (y=0). When it's finished the thread goes on with the next row (y=1) and so on. At the same time thread #2 can already start at y=0, because all the necessary information are already available. I just need to do a kind of a manual synchronization between the threads so they can't overtake each other.
Therefore I used an array called check. It contains the thread-id that is currently allowed to work on each grid row. When the upcoming row is not 'ready' (value in check[j] is not correct), the thread goes into an empty while-loop, until it is.
Things will get clearer with a MWE:
#include <stdio.h>
#include <math.h>
#include <omp.h>
int main()
{
// initialize variables
int iter = 0; // iteration step counter
int check[100] = { 0 }; // initialize all rows for thread #0
#pragma omp parallel num_threads(2)
{
int ID, num_threads, nextID;
double u[100 * 300] = { 0 };
// get parallelization info
ID = omp_get_thread_num();
num_threads = omp_get_num_threads();
// determine next valid id
if (ID == num_threads - 1) nextID = 0;
else nextID = ID + 1;
// iteration loop until abort criteria (HERE: SIMPLIFIED) are valid
while (iter<1000)
{
// rows (j=0 and j=99 are boundary conditions and don't have to be calculated)
for (int j = 1; j < (100 - 1); j++)
{
// manual sychronization: wait until previous thread completed enough rows
while (check[j + 1] != ID)
{
//printf("Thread #%d is waiting!\n", ID);
}
// gridpoints in row j
for (int i = 1; i < (300 - 1); i++)
{
// solve PDE on gridpoint
// replaced by random operation to consume time
double ignore = pow(8.39804,10.02938) - pow(12.72036,5.00983);
}
// update of check array in atomic to avoid race condition
#pragma omp atomic write
{
check[j] = nextID;
}
}// for j
#pragma omp atomic write
check[100 - 1] = nextID;
#pragma omp atomic
iter++;
#pragma omp single
{
printf("Iteration step: %d\n\n", iter);
}
}//while
}// omp parallel
}//main
The thing is, this MWE actually works on my machine. But if I copy it into my project, it doesn't. Additionally the outcome is always different: It stops either after the first iteration or after the third.
Another weird thing: when I remove the slashes of the comment in the inner while-loop it works! The output contains some
"Thread #1 is waiting!"
but that's reasonable. To me it looks like I created somehow a race condition, but I don't know where.
Does somebody has an idea what the problem could be? Or a hint how to realize this kind of synchronization?
I think you are mixing up atomicity and memory consitency. The OpenMP standard actually describes it very nicely in
1.4 Memory Model (emphasis mine):
The OpenMP API provides a relaxed-consistency, shared-memory model.
All OpenMP threads have access to a place to store and to retrieve
variables, called the memory. In addition, each thread is allowed to
have its own temporary view of the memory. The temporary view of
memory for each thread is not a required part of the OpenMP memory
model, but can represent any kind of intervening structure, such as
machine registers, cache, or other local storage, between the thread
and the memory. The temporary view of memory allows the thread to
cache variables and thereby to avoid going to memory for every
reference to a variable.
1.4.3 The Flush Operation
The memory model has relaxed-consistency because a thread’s temporary
view of memory is not required to be consistent with memory at all
times. A value written to a variable can remain in the thread’s
temporary view until it is forced to memory at a later time. Likewise,
a read from a variable may retrieve the value from the thread’s
temporary view, unless it is forced to read from memory. The OpenMP
flush operation enforces consistency between the temporary view and
memory.
To avoid that, you should also make the read of check[] atomic and specify the seq_cst clause to your atomic constructs. This clause forces an implicit flush to the operation. (It is called a sequentially consistent atomic construct)
int c;
// manual sychronization: wait until previous thread completed enough rows
do
{
#pragma omp atomic read
c = check[j + 1];
} while (c != ID);
Disclaimer: I can't really try the code right now.
Furhter Notes:
I think the iter stop criteria is bogus, the way you use it, but I guess that's irrelevant given that it is not your actual criteria.
I assume this variant will perform worse than the spatial decomposition. You loose a lot of data locality, especially on NUMA systems. But of course it is fine to try and measure.
There seems to be a discrepancy between your code (using check[j + 1]) and your description "At the same time thread #2 can already start at y=0"

Why doesn't this code scale linearly?

I wrote this SOR solver code. Don't bother too much what this algorithm does, it is not the concern here. But just for the sake of completeness: it may solve a linear system of equations, depending on how well conditioned the system is.
I run it with an ill conditioned 2097152 rows sparce matrix (that never converges), with at most 7 non-zero columns per row.
Translating: the outer do-while loop will perform 10000 iterations (the value I pass as max_iters), the middle for will perform 2097152 iterations, split in chunks of work_line, divided among the OpenMP threads. The innermost for loop will have 7 iterations, except in very few cases (less than 1%) where it can be less.
There is data dependency among the threads in the values of sol array. Each iteration of the middle for updates one element but reads up to 6 other elements of the array. Since SOR is not an exact algorithm, when reading, it can have any of the previous or the current value on that position (if you are familiar with solvers, this is a Gauss-Siedel that tolerates Jacobi behavior on some places for the sake of parallelism).
typedef struct{
size_t size;
unsigned int *col_buffer;
unsigned int *row_jumper;
real *elements;
} Mat;
int work_line;
// Assumes there are no null elements on main diagonal
unsigned int solve(const Mat* matrix, const real *rhs, real *sol, real sor_omega, unsigned int max_iters, real tolerance)
{
real *coefs = matrix->elements;
unsigned int *cols = matrix->col_buffer;
unsigned int *rows = matrix->row_jumper;
int size = matrix->size;
real compl_omega = 1.0 - sor_omega;
unsigned int count = 0;
bool done;
do {
done = true;
#pragma omp parallel shared(done)
{
bool tdone = true;
#pragma omp for nowait schedule(dynamic, work_line)
for(int i = 0; i < size; ++i) {
real new_val = rhs[i];
real diagonal;
real residual;
unsigned int end = rows[i+1];
for(int j = rows[i]; j < end; ++j) {
unsigned int col = cols[j];
if(col != i) {
real tmp;
#pragma omp atomic read
tmp = sol[col];
new_val -= coefs[j] * tmp;
} else {
diagonal = coefs[j];
}
}
residual = fabs(new_val - diagonal * sol[i]);
if(residual > tolerance) {
tdone = false;
}
new_val = sor_omega * new_val / diagonal + compl_omega * sol[i];
#pragma omp atomic write
sol[i] = new_val;
}
#pragma omp atomic update
done &= tdone;
}
} while(++count < max_iters && !done);
return count;
}
As you can see, there is no lock inside the parallel region, so, for what they always teach us, it is the kind of 100% parallel problem. That is not what I see in practice.
All my tests were run on a Intel(R) Xeon(R) CPU E5-2670 v2 # 2.50GHz, 2 processors, 10 cores each, hyper-thread enabled, summing up to 40 logical cores.
On my first set runs, work_line was fixed on 2048, and the number of threads varied from 1 to 40 (40 runs in total). This is the graph with the execution time of each run (seconds x number of threads):
The surprise was the logarithmic curve, so I thought that since the work line was so large, the shared caches were not very well used, so I dug up this virtual file /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size that told me this processor's L1 cache synchronizes updates in groups of 64 bytes (8 doubles in the array sol). So I set the work_line to 8:
Then I thought 8 was too low to avoid NUMA stalls and set work_line to 16:
While running the above, I thought "Who am I to predict what work_line is good? Lets just see...", and scheduled to run every work_line from 8 to 2048, steps of 8 (i.e. every multiple of the cache line, from 1 to 256). The results for 20 and 40 threads (seconds x size of the split of the middle for loop, divided among the threads):
I believe the cases with low work_line suffers badly from cache synchronization, while bigger work_line offers no benefit beyond a certain number of threads (I assume because the memory pathway is the bottleneck). It is very sad that a problem that seems 100% parallel presents such bad behavior on a real machine. So, before I am convinced multi-core systems are a very well sold lie, I am asking you here first:
How can I make this code scale linearly to the number of cores? What am I missing? Is there something in the problem that makes it not as good as it seems at first?
Update
Following suggestions, I tested both with static and dynamic scheduling, but removing the atomics read/write on the array sol. For reference, the blue and orange lines are the same from the previous graph (just up to work_line = 248;). The yellow and green lines are the new ones. For what I could see: static makes a significant difference for low work_line, but after 96 the benefits of dynamic outweighs its overhead, making it faster. The atomic operations makes no difference at all.
The sparse matrix vector multiplication is memory bound (see here) and it could be shown with a simple roofline model. Memory bound problems benefit from higher memory bandwidth of multisocket NUMA systems but only if the data initialisation is done in such a way that the data is distributed among the two NUMA domains. I have some reasons to believe that you are loading the matrix in serial and therefore all its memory is allocated on a single NUMA node. In that case you won't benefit from the double memory bandwidth available on a dual-socket system and it really doesn't matter if you use schedule(dynamic) or schedule(static). What you could do is enable memory interleaving NUMA policy in order to have the memory allocation spread among both NUMA nodes. Thus each thread would end up with 50% local memory access and 50% remote memory access instead of having all threads on the second CPU being hit by 100% remote memory access. The easiest way to enable the policy is by using numactl:
$ OMP_NUM_THREADS=... OMP_PROC_BIND=1 numactl --interleave=all ./program ...
OMP_PROC_BIND=1 enables thread pinning and should improve the performance a bit.
I would also like to point out that this:
done = true;
#pragma omp parallel shared(done)
{
bool tdone = true;
// ...
#pragma omp atomic update
done &= tdone;
}
is a probably a not very efficient re-implementation of:
done = true;
#pragma omp parallel reduction(&:done)
{
// ...
if(residual > tolerance) {
done = false;
}
// ...
}
It won't have a notable performance difference between the two implementations because of the amount of work done in the inner loop, but still it is not a good idea to reimplement existing OpenMP primitives for the sake of portability and readability.
Try running the IPCM (Intel Performance Counter Monitor). You can watch memory bandwidth, and see if it maxes out with more cores. My gut feeling is that you are memory bandwidth limited.
As a quick back of the envelope calculation, I find that uncached read bandwidth is about 10 GB/s on a Xeon. If your clock is 2.5 GHz, that's one 32 bit word per clock cycle. Your inner loop is basically just a multiple-add operation whose cycles you can count on one hand, plus a few cycles for the loop overhead. It doesn't surprise me that after 10 threads, you don't get any performance gain.
Your inner loop has an omp atomic read, and your middle loop has an omp atomic write to a location that could be the same one read by one of the reads. OpenMP is obligated to ensure that atomic writes and reads of the same location are serialized, so in fact it probably does need to introduce a lock, even though there isn't any explicit one.
It might even need to lock the whole sol array unless it can somehow figure out which reads might conflict with which writes, and really, OpenMP processors aren't necessarily all that smart.
No code scales absolutely linearly, but rest assured that there are many codes that do scale much closer to linearly than yours does.
I suspect you are having caching issues. When one thread updates a value in the sol array, it invalids the caches on other CPUs that are storing that same cache line. This forces the caches to be updated, which then leads to the CPUs stalling.
Even if you don't have an explicit mutex lock in your code, you have one shared resource between your processes: the memory and its bus. You don't see this in your code because it is the hardware that takes care of handling all the different requests from the CPUs, but nevertheless, it is a shared resource.
So, whenever one of your processes writes to memory, that memory location will have to be reloaded from main memory by all other processes that use it, and they all have to use the same memory bus to do so. The memory bus saturates, and you have no more performance gain from additional CPU cores that only serve to worsen the situation.

Atomic swap in C

I think I'm missing something obvious here. I have code like this:
int *a = <some_address>;
int *b = <another_address>;
// ...
// desired atomic swap of addresses a and b here
int *tmp = a;
a = b; b = tmp;
I want to do this atomically. I've been looking at __c11_atomic_exchange and on OSX the OSAtomic methods, but nothing seems to perform a straight swap atomically. They can all write a value into 1 of the 2 variables atomically, but never both.
Any ideas?
It depends a bit on your architecture, what is effectively and efficiently possible. In any case you would need that the two pointers that you want to swap be adjacent in memory, the easiest that you have them as some like
typedef struct pair { void *a[2]; } pair;
Then you can, if you have a full C11 compiler use _Atomic(pair) as an atomic type to swap the contents of such a pair: first load the actual pair in a temporary, construct a new one with the two values swapped, do a compare-exchange to store the new value and iterate if necessary:
inline
void pair_swap(_Atomic(pair) *myPair) {
pair actual = { 0 };
pair future = { 0 };
while (!atomic_compare_exchange_weak(myPair, &actual, future)) {
future.a[0] = actual.a[1];
future.a[1] = actual.a[0];
}
}
Depending on your architecture this might be realized with a lock free primitive operation or not. Modern 64 bit architectures usually have 128bit atomics, there this could result in just some 128 bit assembler instructions.
But this would depend a lot on the quality of the implementation of atomics on your platform. P99 would provide you with an implementation that implements the "functionlike" part of atomic operations, and that supports 128 bit swap, when detected; on my machine (64 bit Intel linux) this effectively compiles into a lock cmpxchg16b instruction.
Relevant documentation:
https://en.cppreference.com/w/c/language/atomic
https://en.cppreference.com/w/c/thread
https://en.cppreference.com/w/c/atomic/atomic_exchange
https://en.cppreference.com/w/c/atomic/atomic_compare_exchange
You should use a mutex.
Or, if you don't want your thread to go to sleep if the mutex isn't immediately available, you can improve efficiency by using a spin lock, since the number of times the spin lock has to spin is frequently so small that it is faster to spin than to put a thread to sleep and have a whole context switch.
C11 has a whole suite of atomic types and functions which allow you to easily implement your own spin lock to accomplish this.
Keep in mind, however, that a basic spin lock must not be used on a bare metal (no operating system) single core (processor) system, as it will result in deadlock if the main context takes the lock and then while the lock is still taken, an ISR fires and tries to take the lock. The ISR will block indefinitely.
To avoid this, more-complicated locking mechanisms must be used, such as a spin lock with a timeout, an ISR which auto-reschedules itself for a slightly later time if the lock is taken, automatic yielding, etc.
Quick answer:
Here is what we will be building up to:
int *a = &some_int1;
int *b = &some_int2;
spinlock_t spinlock = SPINLOCK_UNLOCKED;
// swap `a` and `b`
atomic_swap_int_ptr(&spinlock, &a, &b);
Here are the details:
For a basic spin lock (mutex) on a multi-core system, which is where spin locks are best, however, try this:
Basic spin lock implementation in C11 using _Atomic types
#include <stdbool.h>
#include <stdatomic.h>
#define SPINLOCK_UNLOCKED 0
#define SPINLOCK_LOCKED 1
typedef volatile atomic_bool spinlock_t;
// This is how you'd create a new spinlock to use in the functions below.
// spinlock_t spinlock = SPINLOCK_UNLOCKED;
/// \brief Set the spin lock to `SPINLOCK_LOCKED`, and return the previous
/// state of the `spinlock` object.
/// \details If the previous state was `SPINLOCK_LOCKED`, you can know that
/// the lock was already taken, and you do NOT now own the lock. If the
/// previous state was `SPINLOCK_UNLOCKED`, however, then you have now locked
/// the lock and own the lock.
bool atomic_test_and_set(spinlock_t* spinlock)
{
bool spinlock_val_old = atomic_exchange(spinlock, SPINLOCK_LOCKED);
return spinlock_val_old;
}
/// Spin (loop) until you take the spin lock.
void spinlock_lock(spinlock_t* spinlock)
{
// Spin lock: loop forever until we get the lock; we know the lock was
// successfully obtained after exiting this while loop because the
// atomic_test_and_set() function locks the lock and returns the previous
// lock value. If the previous lock value was SPINLOCK_LOCKED then the lock
// was **already** locked by another thread or process. Once the previous
// lock value was SPINLOCK_UNLOCKED, however, then it indicates the lock
// was **not** locked before we locked it, but now it **is** locked because
// we locked it, indicating we own the lock.
while (atomic_test_and_set(spinlock) == SPINLOCK_LOCKED) {};
}
/// Release the spin lock.
void spinlock_unlock(spinlock_t* spinlock)
{
*spinlock = SPINLOCK_UNLOCKED;
}
Now, you can use the spin lock, for your example, in your multi-core system like this:
/// Atomically swap two pointers to int (`int *`).
/// NB: this does NOT swap their contents, meaning: the integers they point to.
/// Rather, it swaps the pointers (addresses) themselves.
void atomic_swap_int_ptr(spinlock_t * spinlock, int ** ptr1, int ** ptr2)
{
spinlock_lock(spinlock)
// critical, atomic section start
int * temp = *ptr1;
*ptr1 = *ptr2;
*ptr2 = temp;
// critical, atomic section end
spinlock_unlock(spinlock)
}
// Demo to perform an address swap
int *a = &some_int1;
int *b = &some_int2;
spinlock_t spinlock = SPINLOCK_UNLOCKED;
// swap `a` and `b`
atomic_swap_int_ptr(&spinlock, &a, &b); // <=========== final answer
References
https://en.wikipedia.org/wiki/Test-and-set#Pseudo-C_implementation_of_a_spin_lock
C11 Concurrency support library and atomic_bool and other types: https://en.cppreference.com/w/c/thread
C11 atomic_exchange() function: https://en.cppreference.com/w/c/atomic/atomic_exchange
Other links
Note that you could use C11's struct atomic_flag as well, and its corresponding atomic_flag_test_and_set() function, instead of creating your own atomic_test_and_set() function using atomic_exchange(), like I did above.
_Atomic types in C11: https://en.cppreference.com/w/c/language/atomic
C++11 Implementation of Spinlock using header <atomic>
TODO
This implementation would need to be modified to add safety anti-deadlock mechanisms such as automatic deferral, timeout, and retry, to prevent deadlock on single core and bare-metal systems. See my notes here: Add basic mutex (lock) and spin lock implementations in C11, C++11, AVR, and STM32 and here: What are the various ways to disable and re-enable interrupts in STM32 microcontrollers in order to implement atomic access guards?

multithreaded environment in c

I'm just trying to get my head around multithreading environments, specifically how you would implement a cooperative one in c (on an AVR, but out of interest I would like to keep this general).
My problem comes with the thread switch itself: I'm pretty sure I could write this in assembler, flushing all the registers to a stack and then saving the PC to return to later.
How would one pull something like this off in c? I have been told it can do "everything".
I realize this is quite a general question, so any links with information on this topic would be greatly appreciated.
Thanks
You can do this with setjmp/longjmp on most systems -- here is some code I've use in the past for task switching:
void task_switch(Task *to, int exit)
{
int tmp;
int task_errno; /* save space for errno */
task_errno = errno;
if (!(tmp = setjmp(current_task->env))) {
tmp = exit ? (int)current_task : 1;
current_task = to;
longjmp(to->env, tmp); }
if (exit) {
/* if we get here, the stack pointer is pointing into an already
** freed block ! */
abort(); }
if (tmp != 1)
free((void *)tmp);
errno = task_errno;
}
This depends on sizeof(int) == sizeof(void *) in order to pass a pointer as the argument to setjmp/longjmp, but that could be avoided by using handles (indexes into a global array of all task structures) instead of raw pointers here, or by using a static pointer.
Of course, the tricky part is setting up jmpbuf objects for newly created tasks, each with their own stack. You can use a signal handler with sigaltstack for that:
static void (*tfn)(void *);
static void *tfn_arg;
static stack_t old_ss;
static int old_sm;
static struct sigaction old_sa;
Task *current_task = 0;
static Task *parent_task;
static int task_count;
static void newtask()
{
int sm;
void (*fn)(void *);
void *fn_arg;
task_count++;
sigaltstack(&old_ss, 0);
sigaction(SIGUSR1, &old_sa, 0);
sm = old_sm;
fn = tfn;
fn_arg = tfn_arg;
task_switch(parent_task);
sigsetmask(sm);
(*fn)(fn_arg);
abort();
}
Task *task_start(int ssize, void (*_tfn)(void *), void *_arg)
{
Task *volatile new;
stack_t t_ss;
struct sigaction t_sa;
old_sm = sigsetmask(~sigmask(SIGUSR1));
if (!current_task) task_init();
tfn = _tfn;
tfn_arg = _arg;
new = malloc(sizeof(Task) + ssize + ALIGN);
new->next = 0;
new->task_data = 0;
t_ss.ss_sp = (void *)(new + 1);
t_ss.ss_size = ssize;
t_ss.ss_flags = 0;
if ((unsigned long)t_ss.ss_sp & (ALIGN-1))
t_ss.ss_sp = (void *)(((unsigned long)t_ss.ss_sp+ALIGN) & ~(ALIGN-1));
t_sa.sa_handler = newtask;
t_sa.sa_mask = ~sigmask(SIGUSR1);
t_sa.sa_flags = SA_ONSTACK|SA_RESETHAND;
sigaltstack(&t_ss, &old_ss);
sigaction(SIGUSR1, &t_sa, &old_sa);
parent_task = current_task;
if (!setjmp(current_task->env)) {
current_task = new;
kill(getpid(), SIGUSR1); }
sigaltstack(&old_ss, 0);
sigaction(SIGUSR1, &old_sa, 0);
sigsetmask(old_sm);
return new;
}
If you wanted to keep it pure C, I think you might be able to use setjmp and longjmp, but I've never tried it myself, and I imagine there's probably some platforms on which this wouldn't work (i.e. certain registers/other settings not being saved). The only other alternative would be to write it in assembly.
As mentioned, setjmp/longjmp are standard C and are available even in the libc of 8-bit AVRs. They do exactly what you said you'd do in assembler: save the processor context. But one has to keep in mind that the intended purpose of those functions is just to jump backwards in the flow of control; switching between tasks is an abuse. It does work anyway, and looks like this is even frequently used in a variety of user-level thread libraries -- like GNU Pth. But still, is an abuse of the intended purpose, and requires being careful.
As Chris Dodd said, you still need to provide an stack for each new task. He used sigaltstack() and other signal-related functions, but those do not exist in standard C, only in unix-like environments. For example, the AVR libc does not provide them. So as an alternative you can try reserving a part of your existing stack (by declaring a big local array, or using alloca()) for use as the stack of the new thread. Just keep in mind that the main/scheduler thread will keep using its stack, each thread uses its own stack, and all of them will grow and shrink as stacks usually do, so they will need space for doing so without interfering with each other.
And since we're already mentioning unix-like, non-standard-C mechanisms, there is also makecontext()/swapcontext() and family, which are more powerful but harder to find than setjmp()/longjmp(). The names say it all really: the context functions let you manage full process contexts (stacks included), the jmp functions let you just jump around - you'll have to hack the rest.
For the AVR anyway, given that you won't probably have an OS to help nor much memory to blindly reserve, you'd be probably better off using assembler for the switching and stack initializing.
In my experience if people start writing schedulers it isn't too long before they start wanting things like network stacks, memory allocation and file systems too. It's almost never worth going down that route; you end up spending more time writing your own operating system than you're spending on your actual application.
First whiff of your project heading that way and it's almost always worth putting the effort to put in an existing OS (linux, VxWorks, etc). Of course, that might mean that you run into problems if the CPU isn't up to it. And AVR isn't exactly a whole lot of CPU, and fitting an existing OS on to it ranges from mostly impossible to tricky for the major OSes, though there are some tiny OSes (some open source, see http://en.wikipedia.org/wiki/List_of_real-time_operating_systems).
So at the commencement of a project you should carefully consider how you might wish to evolve it going into the future. This might influence your choice of CPU now to save having to do hideous things in software later.

A tested implementation of Peterson Lock algorithm?

Does anyone know of a good/correct implementation of Peterson's Lock algorithm in C? I can't seem to find this. Thanks.
Peterson's algorithm cannot be implemented correctly in C99, as explained in who ordered memory fences on an x86.
Peterson's algorithm is as follows:
LOCK:
interested[id] = 1 interested[other] = 1
turn = other turn = id
while turn == other while turn == id
and interested[other] == 1 and interested[id] == 1
UNLOCK:
interested[id] = 0 interested[other] = 0
There are some hidden assumptions here. To begin with, each thread must note its interest in acquiring the lock before giving away its turn. Giving away the turn must make visible to the other thread that we are interested in acquiring the lock.
Also, as in every lock, memory accesses in the critical section cannot be hoisted past the lock() call, nor sunk past the unlock(). I.e.: lock() must have at least acquire semantics, and unlock() must have at least release semantics.
In C11, the simplest way to achieve this would be to use a sequentially consistent memory order, which makes the code run as if it were a simple interleaving of threads running in program order (WARNING: totally untested code, but it's similar to an example in Dmitriy V'jukov's Relacy Race Detector):
lock(int id)
{
atomic_store(&interested[id], 1);
atomic_store(&turn, 1 - id);
while (atomic_load(&turn) == 1 - id
&& atomic_load(&interested[1 - id]) == 1);
}
unlock(int id)
{
atomic_store(&interested[id], 0);
}
This ensures that the compiler doesn't make optimizations that break the algorithm (by hoisting/sinking loads/stores across atomic operations), and emits the appropriate CPU instructions to ensure the CPU also doesn't break the algorithm. The default memory model for C11/C++11 atomic operations that don't explicitely select a memory model is the sequentially consistent memory model.
C11/C++11 also support weaker memory models, allowing as much optimization as possible. The following is a translation to C11 of the translation to C++11 by Anthony Williams of an algorithm originally by Dmitriy V'jukov in the syntax of his own Relacy Race Detector
[petersons_lock_with_C++0x_atomics] [the-inscrutable-c-memory-model]. If this algorithm is incorrect it's my fault (WARNING: also untested code, but based on good code from Dmitriy V'jukov and Anthony Williams):
lock(int id)
{
atomic_store_explicit(&interested[id], 1, memory_order_relaxed);
atomic_exchange_explicit(&turn, 1 - id, memory_order_acq_rel);
while (atomic_load_explicit(&interested[1 - id], memory_order_acquire) == 1
&& atomic_load_explicit(&turn, memory_order_relaxed) == 1 - id);
}
unlock(int id)
{
atomic_store_explicit(&interested[id], 0, memory_order_release);
}
Notice the exchange with acquire and release semantics. An exchange is an
atomic RMW operation. Atomic RMW operations always read the last value stored
before the write in the RMW operation. Also, an acquire on an atomic object
that reads a write from a release on that same atomic object (or any later
write on that object from the thread that performed the release or any later
write from any atomic RMW operation) creates a synchronizes-with relation
between the release and the acquire.
So, this operation is a synchronization point between the threads, there is
always a synchronizes-with relationship between the exchange in one thread and
the last exchange performed by any thread (or the initialization of turn, for
the very first exchange).
So we have a sequenced-before relationship between the store to interested[id]
and the exchange from/to turn, a synchronizes-with relationship between two
consecutive exchanges from/to turn, and a sequenced-before relationship
between the exchange from/to turn and the load of interested[1 - id]. This
amounts to a happens-before relationship between accesses to interested[x] in
different threads, with turn providing the synchronization between threads.
This forces all the ordering needed to make the algorithm work.
So how were these things done before C11? It involved using compiler and
CPU-specific magic. As an example, let's see the pretty strongly-ordered x86.
IIRC, all x86 loads have acquire semantics, and all stores have release
semantics (save non-temporal moves, in SSE, used precisely to achive higher
performance at the cost of ocassionally needing to issue CPU fences to achieve
coherence between CPUs). But this is not enough for Peterson's algorithm, as
Bartosz Milewsky explains at
who-ordered-memory-fences-on-an-x86 ,
for Peterson's algorithm to work we need to establish an ordering between
accesses to turn and interested, failing to do that may result in seeing loads
from interested[1 - id] before writes to interested[id], which is a bad thing.
So a way to do that in GCC/x86 would be (WARNING: although I tested something similar to the following, actually a modified version of the code at wrong-implementation-of-petersons-algorithm , testing is nowhere near assuring correctness of multithreaded code):
lock(int id)
{
interested[id] = 1;
turn = 1 - id;
__asm__ __volatile__("mfence");
do {
__asm__ __volatile__("":::"memory");
} while (turn == 1 - id
&& interested[1 - id] == 1);
}
unlock(int id)
{
interested[id] = 0;
}
The MFENCE prevents stores and loads to different memory addresses from being
reordered. Otherwise the write to interested[id] could be queued in the store
buffer while the load of interested[1 - id] proceeds. On many current
microarchitectures a SFENCE may be enough, since it may be implemented as a
store buffer drain, but IIUC SFENCE doesn't need to be implemented that way,
and may simply prevent reordering between stores. So SFENCE may not be enough everywhere, and we need a full MFENCE.
The compiler barrier (__asm__ __volatile__("":::"memory")) prevents
the compiler from deciding that it already knows the value of turn. We're
telling the compiler that we've clobbered memory, so all values cached in
registers must be reloaded from memory.
P.S: I feel this needs a closing paragraph, but my brain is drained.
I won't make any assertions about how good or correct the implementation is, but it was tested (briefly). This is a straight translation of the algorithm described on wikipedia.
struct petersonslock_t {
volatile unsigned flag[2];
volatile unsigned turn;
};
typedef struct petersonslock_t petersonslock_t;
petersonslock_t petersonslock () {
petersonslock_t l = { { 0U, 0U }, ~0U };
return l;
}
void petersonslock_lock (petersonslock_t *l, int p) {
assert(p == 0 || p == 1);
l->flag[p] = 1;
l->turn = !p;
while (l->flag[!p] && (l->turn == !p)) {}
};
void petersonslock_unlock (petersonslock_t *l, int p) {
assert(p == 0 || p == 1);
l->flag[p] = 0;
};
Greg points out that on an SMP architecture with slightly relaxed memory coherency (such as x86), although the loads to the same memory location are in order, loads to different locations on one processor may appear out of order to the other processor.
Jens Gustedt and ninjalj recommend modifying the original algorithm to use the atomic_flag type. This means setting the flags and turns would use the atomic_flag_test_and_set and clearing them would use atomic_flag_clear from C11. Alternatively, a memory barrier could be imposed between updates to flag.
Edit: I originally attempted to correct for this by writing to the same memory location for all the states. ninjalj pointed out that the bitwise operations turned the state operations into RMW rather than load and stores of the original algorithm. So, atomic bitwise operations are required. C11 provides such operators, as does GCC with built-ins. The algorithm below uses GCC built-ins, but wrapped in macros so that it can easily be changed to some other implementation. However, modifying the original algorithm above is the preferred solution.
struct petersonslock_t {
volatile unsigned state;
};
typedef struct petersonslock_t petersonslock_t;
#define ATOMIC_OR(x,v) __sync_or_and_fetch(&x, v)
#define ATOMIC_AND(x,v) __sync_and_and_fetch(&x, v)
petersonslock_t petersonslock () {
petersonslock_t l = { 0x000000U };
return l;
}
void petersonslock_lock (petersonslock_t *l, int p) {
assert(p == 0 || p == 1);
unsigned mask = (p == 0) ? 0xFF0000 : 0x00FF00;
ATOMIC_OR(l->state, (p == 0) ? 0x000100 : 0x010000);
(p == 0) ? ATOMIC_OR(l->state, 0x000001) : ATOMIC_AND(l->state, 0xFFFF00);
while ((l->state & mask) && (l->state & 0x0000FF) == !p) {}
};
void petersonslock_unlock (petersonslock_t *l, int p) {
assert(p == 0 || p == 1);
ATOMIC_AND(l->state, (p == 0) ? 0xFF00FF : 0x00FFFF);
};

Resources