How Compare and Swap works - c

I have read quite a few posts saying that compare-and-swap guarantees atomicity, but I still don't understand how it does so. Here is general pseudo code for compare and swap:
int CAS(int *ptr, int oldvalue, int newvalue)
{
    int temp = *ptr;
    if (*ptr == oldvalue)
        *ptr = newvalue;
    return temp;
}
How does this guarantee atomicity? For example, if I am using this to implement a mutex,
void lock(int *mutex)
{
    while (!CAS(mutex, 0, 1));
}
how does this prevent 2 threads from acquiring the mutex at the same time? Any pointers would be really appreciated.

"general pseudo code" is not an actual code of CAS (compare and swap) implementation. Special hardware instructions are used to activate special atomic hardware in the CPU. For example, in x86 the LOCK CMPXCHG can be used (http://en.wikipedia.org/wiki/Compare-and-swap).
In gcc, for example, there is the __sync_val_compare_and_swap() builtin, which maps to the hardware-specific atomic CAS. The operation is described in Paul E. McKenney's excellent book Is Parallel Programming Hard, And, If So, What Can You Do About It? (2014), section 4.3 "Atomic operations", pages 31-32.
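For illustration, here is a minimal sketch (an assumption of how it could look, not code from the book) of the asker's lock() built on that builtin:

void lock(int *mutex)
{
    /* __sync_val_compare_and_swap returns the value it observed;
       we acquired the lock only if it observed 0 and stored 1. */
    while (__sync_val_compare_and_swap(mutex, 0, 1) != 0)
        ;  /* spin */
}

void unlock(int *mutex)
{
    __sync_lock_release(mutex);  /* release barrier, then stores 0 */
}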
If you want to know more about building higher-level synchronization on top of atomic operations, and about saving your system from spinlocks and from burning CPU cycles on active spinning, read up on the futex mechanism in Linux. A good first paper is "Futexes Are Tricky" by Ulrich Drepper (2011); another is the LWN article http://lwn.net/Articles/360699/ (and the historic one is "Fuss, Futexes and Furwocks: Fast Userland Locking in Linux", 2002).
The mutex locks Drepper describes use atomic operations only for the "fast path" (when the mutex is unlocked and our thread is the only one trying to lock it). If the mutex is already locked, the thread goes to sleep with futex(FUTEX_WAIT, ...), after first marking the mutex variable with an atomic operation to tell the unlocking thread "somebody is sleeping, waiting on this mutex", so the unlocker knows it must wake them with futex(FUTEX_WAKE, ...).

How does it prevent two threads from acquiring the lock? Well, once any one thread succeeds, *mutex will be 1, so any other thread's CAS will fail (because it's called with expected value 0). The lock is released by storing 0 in *mutex.
Note that this is an odd use of CAS, since it's essentially requiring an ABA-violation. Typically you'd just use a plain atomic exchange:
while (exchange(mutex, 1) == 1) { /* spin */ }
// critical section
*mutex = 0; // atomically
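In C11 <stdatomic.h>, that plain-exchange spin lock might look like this (a sketch assuming an atomic_int mutex, not part of the original answer):

#include <stdatomic.h>

void lock(atomic_int *mutex)
{
    /* keep swapping in 1 until we see that the previous value was 0 */
    while (atomic_exchange_explicit(mutex, 1, memory_order_acquire) == 1)
        ;  /* spin */
}

void unlock(atomic_int *mutex)
{
    atomic_store_explicit(mutex, 0, memory_order_release);
}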
Or if you want to be slightly more sophisticated and store information about which thread has the lock, you can do tricks with atomic-fetch-and-add (see for example the Linux kernel spinlock code).

You cannot implement CAS yourself in plain C. It has to be done at the hardware level, with a single atomic instruction, which you reach through assembly, compiler builtins, or (since C11) <stdatomic.h>.
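For the curious, this is roughly what that looks like when the hardware instruction is wrapped in GCC inline assembly, x86-64 only (a sketch for illustration; in practice you would use compiler builtins or C11 <stdatomic.h>):

static inline int cas(int *ptr, int oldval, int newval)
{
    int prev;
    /* LOCK CMPXCHG: if *ptr == EAX (oldval), store newval into *ptr;
       either way, the value previously in *ptr ends up in EAX (prev). */
    __asm__ __volatile__("lock; cmpxchgl %2, %1"
                         : "=a"(prev), "+m"(*ptr)
                         : "r"(newval), "0"(oldval)
                         : "memory");
    return prev;   /* equals oldval iff the swap happened */
}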

Related

Memory order for a ticket-taking spin-lock mutex

Suppose I have the following ticket-taking spinlock mutex implementation (in C using GCC atomic builtins). As I understand it, the use of the "release" memory order in the unlock function is correct. I'm unsure, though, about the lock function. Because this is a ticket-taking mutex, there's a field indicating the next ticket number to be handed out, and a field to indicate which ticket number currently holds the lock. I've used acquire-release on the ticket increment and acquire on the spin load. Is that unnecessarily strong, and if so, why?
Separately, should those two fields (ticket and serving) be spaced so that they're on different cache lines, or does that not matter? I'm mainly interested in arm64 and amd64.
typedef struct {
    u64 ticket;
    u64 serving;
} ticket_mutex;

void
ticket_mutex_lock(ticket_mutex *m)
{
    u64 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_ACQ_REL);
    while (my_ticket != __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE));
}

void
ticket_mutex_unlock(ticket_mutex *m)
{
    (void) __atomic_fetch_add(&m->serving, 1, __ATOMIC_RELEASE);
}
void
ticket_mutex_unlock(ticket_mutex *m)
{
(void) __atomic_fetch_add(&m->serving, 1, __ATOMIC_RELEASE);
}
UPDATE: based on the advice in the accepted answer, I've adjusted the implementation to the following. This mutex is intended for the low-contention case.
typedef struct {
    u32 ticket;
    u32 serving;
} ticket_mutex;

void
ticket_mutex_lock(ticket_mutex *m)
{
    u32 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_RELAXED);
    while (my_ticket != __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE)) {
#ifdef __x86_64__
        __asm __volatile ("pause");
#endif
    }
}

void
ticket_mutex_unlock(ticket_mutex *m)
{
    u32 my_ticket = __atomic_load_n(&m->serving, __ATOMIC_RELAXED);
    (void) __atomic_store_n(&m->serving, my_ticket + 1, __ATOMIC_RELEASE);
}
The m->ticket increment only needs to be RELAXED. Each thread just has to get a different ticket number; the increment can happen as early or late as you want with respect to other operations in the same thread.
load(&m->serving, acquire) is the operation that orders the critical section, preventing it from starting until we've synchronized-with the RELEASE operation in the unlock function of the previous lock holder. So the m->serving load needs to be at least acquire.
Even if the m->ticket++ doesn't complete until after an acquire load of m->serving, that's fine. The while condition still determines whether execution proceeds (non-speculatively) into the critical section. Speculative execution into the critical section is fine, and good, since it probably means the work is ready to commit sooner, reducing the time the lock is held.
Extra ordering on the RMW operation won't make it any faster locally or in terms of inter-thread visibility, and would slow down the thread taking the lock.
One cache line or two
For performance, I think with high contention, there are advantages to keeping the members in separate cache lines.
Threads needing exclusive ownership of the cache line to get a ticket number won't contend with the thread unlocking .serving, so those inter-thread latency delays can happen in parallel.
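A sketch of that layout (my addition, assuming C11 _Alignas and a 64-byte cache line, and reusing the asker's u32 typedef):

typedef struct {
    _Alignas(64) u32 ticket;    /* incremented by threads taking a ticket */
    _Alignas(64) u32 serving;   /* written by the unlocking thread */
} ticket_mutex_padded;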
With multiple cores in the spin-wait while(load(serving)) loop, they can hit in their local L1d cache until something invalidates shared copies of the line, generating no extra traffic. But that wastes a lot of power unless you use something like x86 _mm_pause(), and it wastes execution resources that could be shared with the other logical core on the same physical core. x86 pause also avoids a branch mispredict when leaving the spin loop. Related:
What is the purpose of the "PAUSE" instruction in x86?
How does x86 pause instruction work in spinlock *and* can it be used in other scenarios?
Locks around memory manipulation via inline assembly
Exponential backoff up to some cap on the number of pauses between checks is a common recommendation, but here we can do better: use a number of pause instructions between checks that scales with my_ticket - m->serving, so you check more often when your ticket is coming up.
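A sketch of that proportional backoff (my addition, reusing the asker's types; the scale factor of 16 pauses per ticket is an arbitrary placeholder):

void ticket_mutex_lock_backoff(ticket_mutex *m)
{
    u32 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_RELAXED);
    for (;;) {
        u32 serving = __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE);
        if (serving == my_ticket)
            return;
        u32 distance = my_ticket - serving;   /* wrap-around safe for u32 */
        for (u32 i = 0; i < 16 * distance; i++) {
#ifdef __x86_64__
            __asm__ __volatile__("pause");
#endif
        }
    }
}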
In really high contention cases, fallback to OS-assisted sleep/wake is appropriate if you'll be waiting for long, like Linux futex. Or since we can see how close to the head of the queue we are, yield, nanosleep, or futex if your wait interval will be more than 3 or 8 ticket numbers or whatever. (Tunable depending on how long it takes to serve a ticket.)
(Using futex, you might introduce a read of m->ticket in the unlock to figure out whether there might be any threads sleeping, waiting for a notify, like C++20 atomic<>.wait() and atomic<>.notify_all(). Unfortunately I don't know a good way to figure out which thread to notify, instead of waking them all up to check whether they're the lucky winner.)
With low average contention, you should keep both in the same cache line. An access to .ticket is always immediately followed by a load of .serving. In the unlocked no-contention case, this means only one cache line is bouncing around, or having to stay hot for the same core to take/release the lock.
If the lock is already held, the thread wanting to unlock needs exclusive ownership of the cache line to RMW or store. It loses this whether another core does an RMW or just a pure load on the line containing .serving.
There won't be too many cases where multiple waiters are all spinning on the same lock, and where new threads getting a ticket number delay the unlock, and its visibility to the thread waiting for it.
This is my intuition, anyway; it's probably hard to microbenchmark, unless a cache-miss atomic RMW stops a later load from even starting to request the second line, in which case you could have two cache-miss latencies in taking the lock.
Avoiding an atomic RMW in the unlock?
The thread holding the lock knows it has exclusive ownership, no other thread will be modifying m->serving concurrently. If you had the lock owner remember its own ticket number, you could optimize the unlock to just a store.
void ticket_mutex_unlock(ticket_mutex *m, uint32_t ticket_num)
{
    (void) __atomic_store_n(&m->serving, ticket_num + 1, __ATOMIC_RELEASE);
}
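The matching lock for that API change might look like this (a sketch, not in the original answer):

uint32_t ticket_mutex_lock(ticket_mutex *m)
{
    uint32_t my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_RELAXED);
    while (my_ticket != __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE))
        ;  /* spin */
    return my_ticket;   /* caller passes this back to ticket_mutex_unlock() */
}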
Or without that API change (to return an integer from u32 ticket_mutex_lock())
void ticket_mutex_unlock(ticket_mutex *m)
{
    // We already own the lock and no other thread can be writing m->serving
    // concurrently, so a relaxed load and a plain increment are safe.
    uint32_t ticket = __atomic_load_n(&m->serving, __ATOMIC_RELAXED);
    (void) __atomic_store_n(&m->serving, ticket + 1, __ATOMIC_RELEASE);
}
This has a nice efficiency advantage on ISAs that need LL/SC retry loops for atomic RMWs, where spurious failure from another core reading the value can happen. It also helps on x86, where the only possible atomic RMW is a full barrier, stronger even than needed for C seq_cst semantics.
BTW, the lock fields would be fine as uint32_t; you're not going to have 2^32 threads waiting for a lock. So I used uint32_t instead of u64. Wrap-around is well-defined. Even subtraction like ticket - serving Just Works across the wrapping boundary: in 32-bit unsigned arithmetic, 1 - 0xffffffff gives 2, so you can still calculate how close you are to being served, for sleep decisions.
Not a big deal on x86-64, only saving a bit of code size, and probably not a factor at all on AArch64. But will help significantly on some 32-bit ISAs.

Implement semaphore in User Level C

An effective semaphore implementation necessarily relies on atomic instructions.
I see several user-level C implementations on the internet that implement semaphores using variables like a count, or a data structure like a queue. But instructions on ordinary variables do not run atomically, so how can anyone implement a semaphore in user-level C?
How does a C library's semaphore.h implement semaphores?
The answer is almost certainly "it doesn't" - instead it will call into kernel services which provide the necessary atomic operations.
It's not possible in standard C before C11. What you need is, as you said, atomic operations; C11 finally specifies them, see for example stdatomic.h.
If you're on an older version of the standard, you have to either use embedded assembly directly or rely on vendor-specific extensions of your compiler, see for example the GCC atomic builtins. Of course, processors support instructions for memory barriers, compare-and-swap operations, etc. They're just not accessible from pure C99 and earlier, because parallel execution wasn't in the scope of the standard.
After reading MartinJames' comment, I should add a clarification here: this only applies if you implement all your threading in user space, because a semaphore must block threads waiting on it. If the threads are managed by the kernel's scheduler (as is the case with pthreads on Linux, for example), a syscall is necessary. That's not in the scope of your question, but atomic operations might still be interesting for implementing e.g. lock-free data structures.
You could implement semaphore operations as simply as:
void sema_post(atomic_uint *value) {
    unsigned old = 0;
    while (!atomic_compare_exchange_weak(value, &old, old + 1));
}

void sema_wait(atomic_uint *value) {
    unsigned old;
    do {
        old = atomic_load(value);
    } while (old == 0 || !atomic_compare_exchange_weak(value, &old, old - 1));
}
It's OK semantically, but it busy-waits (spins) in sema_wait. (Note that sema_post is lock-free, although it too may spin.) Instead, it should sleep until value becomes positive. This problem cannot be solved with atomics alone, because all atomic operations are non-blocking; here you need help from the OS kernel. So an efficient semaphore could use a similar atomics-based algorithm but go into the kernel in two cases (see Linux futex for more details on this approach; a sketch follows after the list):
sema_wait: when it finds value == 0, ask to sleep
sema_post: when it has incremented value from 0 to 1, ask to wake another sleeping thread if any
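A Linux-specific sketch of that approach (my addition; the names and the futex() wrapper are mine, it assumes atomic_uint is a 32-bit futex-compatible word, and it is simplified, e.g. it may wake fewer or more threads than strictly necessary under heavy contention):

#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(atomic_uint *uaddr, int op, unsigned val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void sema_wait_blocking(atomic_uint *value)
{
    for (;;) {
        unsigned old = atomic_load(value);
        if (old == 0) {
            futex(value, FUTEX_WAIT, 0);   /* sleep only while *value == 0 */
            continue;
        }
        if (atomic_compare_exchange_weak(value, &old, old - 1))
            return;
    }
}

void sema_post_blocking(atomic_uint *value)
{
    unsigned old = atomic_fetch_add(value, 1);
    if (old == 0)
        futex(value, FUTEX_WAKE, 1);       /* wake one sleeper, if any */
}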
In general, to implement lock-free (atomics-only) operations on a data structure, every operation must be applicable in any state. For semaphores, wait isn't applicable when the value is 0.

Self-written Mutex for 2+ Threads

I have written the following code, and so far in all my tests it seems as if I have written a working Mutex for my 4 Threads, but I would like to get someone else's opinion on the validity of my solution.
typedef struct Mutex {
    int turn;
    int *waiting;
    int num_processes;
} Mutex;

void enterLock(Mutex *lock, int id) {
    int i;
    for (i = 0; i < lock->num_processes; i++) {
        lock->waiting[id] = 1;
        if (i != id && lock->waiting[i])
            i = -1;
        lock->waiting[id] = 0;
    }
    printf("ID %d Entered\n", id);
}

void leaveLock(Mutex *lock, int id) {
    printf("ID %d Left\n", id);
    lock->waiting[id] = 0;
}

void foo(Mutex *lock, int id) {
    enterLock(lock, id);
    // do stuff now that I have access
    leaveLock(lock, id);
}
I feel compelled to write an answer here because the question is a good one, and thinking it through could help others understand the general problem with mutual exclusion. In your case, you came a long way toward hiding this problem, but you can't avoid it. It boils down to this:
01 /* pseudo-code */
02 if (! mutex.isLocked())
03 mutex.lock();
You always have to expect a thread switch between lines 02 and 03. So there is a possible situation where two threads find the mutex unlocked and are interrupted right after that ... only to resume later and lock this mutex individually. You will have two threads inside the critical section at the same time.
What you definitely need to implement reliable mutual exclusion is therefore an atomic operation that tests a condition and at the same time sets a value without any chance to be interrupted meanwhile.
01 /* pseudo-code */
02 while (! test_and_lock(mutex));
As long as this test_and_lock function cannot be interrupted, your implementation is safe. Before C11, C didn't provide anything like this, so implementations of pthreads needed to use e.g. assembly or special compiler intrinsics. With C11, there is finally a "standard" way to write such atomic operations (see <stdatomic.h>). For general use, the pthreads library will give you what you need.
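A minimal sketch of how that can look in C11, using atomic_flag (added for illustration; it still spins rather than sleeping):

#include <stdatomic.h>

static atomic_flag mutex = ATOMIC_FLAG_INIT;

void lock(void)
{
    /* test_and_set returns the previous value: keep trying until it was clear */
    while (atomic_flag_test_and_set_explicit(&mutex, memory_order_acquire))
        ;  /* spin */
}

void unlock(void)
{
    atomic_flag_clear_explicit(&mutex, memory_order_release);
}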
edit: of course, this is still simplified -- in a multi-processor scenario, you need to ensure that even memory accesses are mutually exclusive.
The problem I see in your code:
The idea behind a mutex is to provide mutual exclusion, meaning that while thread_a is in the critical section, thread_b must wait (if it also wants to enter).
This waiting should be implemented in the enterLock function. But what you have is a for loop that may finish well before thread_a is done with the critical section, so thread_b can also enter; hence you don't have mutual exclusion.
A way to fix it:
Take a look, for example, at Peterson's algorithm or Dekker's (more complicated). What they do there is called busy waiting, which is basically a while loop that says:
while (i can't enter) { do nothing and wait... }
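For two threads, Peterson's algorithm does that busy wait roughly like this (a sketch using C11 atomics, since, as the next answer points out, plain loads and stores are not enough on modern hardware):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool flag[2];   /* flag[i]: thread i wants to enter */
static atomic_int turn;       /* whose turn it is to wait */

void peterson_lock(int id)    /* id is 0 or 1 */
{
    int other = 1 - id;
    atomic_store(&flag[id], true);    /* announce intent */
    atomic_store(&turn, other);       /* politely let the other go first */
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
        ;  /* busy wait: do nothing and wait... */
}

void peterson_unlock(int id)
{
    atomic_store(&flag[id], false);
}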
You are totally ignoring the topic of memory models. Unless you are on a machine with a sequentially consistent memory model (which none of today's PC CPUs provide), your code is incorrect, as a store executed by one thread is not necessarily immediately visible to the other CPUs. Yet exactly that seems to be an assumption in your code.
Bottom line: use the existing synchronization primitives provided by the OS or a runtime library such as the POSIX or Win32 APIs, and don't try to be smart and implement this yourself. Unless you have years of experience in parallel programming as well as in-depth knowledge of CPU architecture, chances are quite good that you'll end up with an incorrect implementation. And debugging parallel programs can be hell...
After enterLock() returns, the state of the Mutex object is the same as before the function was called. Hence it will not prevent a second thread from entering the same Mutex object, even before the first one has released it by calling leaveLock(). There is no mutual exclusion.

How are read/write locks implemented in pthread?

How are read/write locks implemented, especially in the case of pthreads? What pthread synchronization APIs do they use under the hood? A little bit of pseudocode would be appreciated.
I haven't done any pthreads programming for a while, but when I did, I never used POSIX read/write locks. The problem is that most of the time a mutex will suffice: i.e. your critical section is small, and the region isn't so performance-critical that the double barrier is worth worrying about.
In those cases where performance is an issue, using atomic operations (generally available as a compiler extension) is normally a better option (i.e. the extra barrier is the problem, not the size of the critical section).
By the time you eliminate all these cases, you are left with cases where you have specific performance/fairness/rw-bias requirements that require a true rw-lock; and that is when you discover that all the relevant performance/fairness parameters of POSIX rw-lock are undefined and implementation specific. At this point you are generally better off implementing your own so you can ensure the appropriate fairness/rw-bias requirements are met.
The basic algorithm is to keep a count of how many of each are in the critical section, and if a thread isn't allowed access yet, to shunt it off to an appropriate queue to wait. Most of your effort will be in implementing the appropriate fairness/bias between servicing the two queues.
The following C-like pthreads-like pseudo-code illustrates what I'm trying to say.
struct rwlock {
    mutex admin;          // used to serialize access to other admin fields, NOT the critical section.
    int count;            // threads in critical section: +ve for readers, -ve for writers.
    fifoDequeue dequeue;  // acts like a cond_var with fifo behaviour and both append and prepend operations.
    void *data;           // represents the data covered by the critical section.
};

void read(struct rwlock *rw, void (*readAction)(void *)) {
    lock(rw->admin);
    if (rw->count < 0) {
        append(rw->dequeue, rw->admin);
    }
    while (rw->count < 0) {
        prepend(rw->dequeue, rw->admin); // Used to avoid starvation.
    }
    rw->count++;
    // Wake the new head of the dequeue, which may be a reader.
    // If it is a writer it will put itself back on the head of the queue and wait for us to exit.
    signal(rw->dequeue);
    unlock(rw->admin);

    readAction(rw->data);

    lock(rw->admin);
    rw->count--;
    signal(rw->dequeue); // Wake the new head of the dequeue, which is probably a writer.
    unlock(rw->admin);
}

void write(struct rwlock *rw, void *(*writeAction)(void *)) {
    lock(rw->admin);
    if (rw->count != 0) {
        append(rw->dequeue, rw->admin);
    }
    while (rw->count != 0) {
        prepend(rw->dequeue, rw->admin);
    }
    rw->count--;
    // As we only allow one writer in at a time, we don't bother signaling here.
    unlock(rw->admin);

    // NOTE: This is the critical section, but it is not covered by the mutex!
    // The critical section is, rather, covered by the rw-lock itself.
    rw->data = writeAction(rw->data);

    lock(rw->admin);
    rw->count++;
    signal(rw->dequeue);
    unlock(rw->admin);
}
Something like the above code is a starting point for any rwlock implementation. Give some thought to the nature of your problem and replace the dequeue with the appropriate logic that determines which class of thread should be woken up next. It is common to allow a limited number/period of readers to leapfrog writers, or vice versa, depending on the application.
Of course my general preference is to avoid rw-locks altogether, generally by using some combination of atomic operations, mutexes, STM, message-passing, and persistent data structures. However, there are times when what you really need is a rw-lock, and when you do, it is useful to know how they work, so I hope this helped.
EDIT - In response to the (very reasonable) question, where do I wait in the pseudo-code above:
I have assumed that the dequeue implementation contains the wait, so that somewhere within append(dequeue, mutex) or prepend(dequeue, mutex) there is a block of code along the lines of:
while (!readyToLeaveQueue()) {
    wait(dequeue->cond_var, mutex);
}
which was why I passed in the relevant mutex to the queue operations.
Each implementation can be different, but normally they have to favor readers by default, due to the POSIX requirement that a thread be able to obtain the read lock on an rwlock multiple times. If they favored writers, then whenever a writer was waiting, a reader would deadlock on its second read-lock attempt, unless the implementation could determine that the reader already holds a read lock; but the only way to determine that is to store a list of all threads holding read locks, which is very inefficient in time and space.

Implementing critical section

Which way is better and faster for creating a critical section?
With a binary semaphore, between sem_wait and sem_post.
Or with atomic operations:
#include <sched.h>
#include <stdbool.h>

void critical_code() {
    static volatile bool lock = false;

    // Enter critical section
    while (!__sync_bool_compare_and_swap(&lock, false, true)) {
        sched_yield();
    }

    //...

    // Leave critical section
    lock = false;
}
Regardless of what method you use, the worst performance problem with your code has nothing to do with what type of lock you use, but the fact that you're locking code rather than data.
With that said, there is no reason to roll your own spinlocks like that. Either use pthread_spin_lock if you want a spinlock, or else pthread_mutex_lock or sem_wait (with a binary semaphore) if you want a lock that can yield to other processes when contended. The code you have written is the worst of both worlds in how it uses sched_yield. The call to sched_yield will ensure that the lock waits at least a few milliseconds (and probably a whole scheduling timeslice) in the case where there's both lock contention and cpu load, and it will burn 100% cpu when there's contention but no cpu load (due to the lock-holder being blocked in IO, for instance). If you want to get any of the benefits of a spin lock, you need to be spinning without making any syscalls. If you want any of the benefits of yielding the cpu, you should be using a proper synchronization primitive which will use (on Linux) futex (or equivalent) operations to yield exactly until the lock is available - no shorter and no longer.
And if by chance all that went over your head, don't even think about writing your own locks.
Spin-locks perform better if there is little contention for the lock and/or it is never held for a long period of time. Otherwise you are better off with a lock that blocks rather than spins. There are of course hybrid locks which will spin a few times, and if the lock cannot be acquired, then they will block.
Which is better for you depends on your application. Only you can answer that question.
You didn't look deep enough in the gcc documentation. The correct builtins for this type of lock are __sync_lock_test_and_set and __sync_lock_release. These have exactly the guarantees that you need for such a thing. In terms of the new C11 standard this would be the type atomic_flag, with the operations atomic_flag_test_and_set and atomic_flag_clear.
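Used for a lock, those builtins look roughly like this (a sketch, my addition):

static volatile int lock_flag = 0;

void lock(void)
{
    /* test-and-set: returns the previous value, with an acquire barrier */
    while (__sync_lock_test_and_set(&lock_flag, 1))
        ;  /* spin, without making any syscalls */
}

void unlock(void)
{
    __sync_lock_release(&lock_flag);  /* release barrier, then stores 0 */
}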
As R. already indicates, putting sched_yield into the loop is really a bad idea.
If the code inside the critical section is only a few cycles long, the probability that its execution falls across the boundary of a scheduling slice is small, and the number of threads blocked spinning actively will be at most the number of processors minus one. None of this holds if you yield execution as soon as you fail to obtain the lock immediately: if you have real contention on your lock and yield, you will have a multitude of context switches, which will bring your system almost to a halt.
As others have pointed out, it's not really about how fast the locking code is. Once a locked sequence is initiated with "xchg reg,mem", a lock signal is sent down through the caches and out to the devices on all buses. Only when the last device has acknowledged that it will hold off, and that acknowledgement has made its way back - which may take hundreds if not thousands of clock cycles - is the actual exchange performed. If your slowest device is a classic PCI card, it has a bus speed of 33 MHz, which is about one hundredth of the CPU's internal clock, and the PCI device (if active) will need several clock cycles (at 33 MHz) to respond. During that time the CPU is waiting for the acknowledgement to come back.
Most spinlocks are probably used in device drivers where the routine won't be pre-empted by the OS but might be interrupted by a higher-level driver.
A critical section is really just a spin-lock but with interfacing to the OS because it may be pre-empted.
