Suppose I have the following ticket-taking spinlock mutex implementation (in C using GCC atomic builtins). As I understand it, the use of the "release" memory order in the unlock function is correct. I'm unsure, though, about the lock function. Because this is a ticket-taking mutex, there's a field indicating the next ticket number to be handed out, and a field to indicate which ticket number currently holds the lock. I've used acquire-release on the ticket increment and acquire on the spin load. Is that unnecessarily strong, and if so, why?
Separately, should those two fields (ticket and serving) be spaced so that they're on different cache lines, or does that not matter? I'm mainly interested in arm64 and amd64.
typedef struct {
    u64 ticket;
    u64 serving;
} ticket_mutex;

void
ticket_mutex_lock(ticket_mutex *m)
{
    u64 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_ACQ_REL);
    while (my_ticket != __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE));
}

void
ticket_mutex_unlock(ticket_mutex *m)
{
    (void) __atomic_fetch_add(&m->serving, 1, __ATOMIC_RELEASE);
}
UPDATE: based on the advice in the accepted answer, I've adjusted the implementation to the following. This mutex is intended for the low-contention case.
typedef struct {
    u32 ticket;
    u32 serving;
} ticket_mutex;

void
ticket_mutex_lock(ticket_mutex *m)
{
    u32 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_RELAXED);
    while (my_ticket != __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE)) {
#ifdef __x86_64__
        __asm __volatile ("pause");
#endif
    }
}

void
ticket_mutex_unlock(ticket_mutex *m)
{
    u32 my_ticket = __atomic_load_n(&m->serving, __ATOMIC_RELAXED);
    (void) __atomic_store_n(&m->serving, my_ticket+1, __ATOMIC_RELEASE);
}
m->ticket increment only needs to be RELAXED. You only need each thread to get a different ticket number; it can happen as early or late as you want wrt. other operations in the same thread.
load(&m->serving, acquire) is the operation that orders the critical section, preventing its operations from starting until we've synchronized-with a RELEASE operation in the unlock function of the previous holder of the lock. So the m->serving load needs to be at least acquire.
Even if the m->ticket++ doesn't complete until after an acquire load of m->serving, that's fine. The while condition still determines whether execution proceeds (non-speculatively) into the critical section. Speculative execution into the critical section is fine, and good, since it probably means it's ready to commit sooner, reducing the time with the lock held.
Extra ordering on the RMW operation won't make it any faster locally or in terms of inter-thread visibility, and would slow down the thread taking the lock.
One cache line or two
For performance, I think with high contention, there are advantages to keeping the members in separate cache lines.
Threads needing exclusive ownership of the cache line to get a ticket number won't contend with the thread unlocking .serving, so those inter-thread latency delays can happen in parallel.
With multiple cores in the spin-wait while(load(serving)) loop, they can hit in their local L1d cache until something invalidates shared copies of the line, not creating any extra traffic. But they waste a lot of power unless you use something like x86 _mm_pause(), and they also waste execution resources that could be shared with another logical core on the same physical core. x86 pause also avoids a memory-order mis-speculation pipeline clear when leaving the spin loop. Related:
What is the purpose of the "PAUSE" instruction in x86?
How does x86 pause instruction work in spinlock *and* can it be used in other scenarios?
Locks around memory manipulation via inline assembly
Exponential backoff up to some number of pauses between checks is a common recommendation, but here we can do better: A number of pause instructions between checks that scales with my_ticket - m->serving, so you check more often when your ticket is coming up.
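A rough sketch of that proportional backoff, assuming GCC/Clang builtins. SPINS_PER_WAITER, cpu_relax() and the _backoff function names are made up for illustration, and the constant would need tuning for how long a ticket typically takes to serve:

```c
#include <stdint.h>

typedef struct {
    uint32_t ticket;
    uint32_t serving;
} ticket_mutex;

/* Hypothetical tuning knob: pause iterations per ticket ahead of us. */
#define SPINS_PER_WAITER 16u

static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause");
#elif defined(__aarch64__)
    __asm__ __volatile__("yield");
#endif
}

static void ticket_mutex_lock_backoff(ticket_mutex *m)
{
    uint32_t my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_RELAXED);
    for (;;) {
        uint32_t serving = __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE);
        uint32_t distance = my_ticket - serving;   /* wrap-safe unsigned math */
        if (distance == 0)
            return;                                /* our turn */
        /* Wait longer the further back in the queue we are, so we
         * re-check more often as our ticket comes up. */
        for (uint32_t i = 0; i < distance * SPINS_PER_WAITER; i++)
            cpu_relax();
    }
}

static void ticket_mutex_unlock_backoff(ticket_mutex *m)
{
    /* Only the lock holder writes serving, so load + store is enough. */
    uint32_t cur = __atomic_load_n(&m->serving, __ATOMIC_RELAXED);
    __atomic_store_n(&m->serving, cur + 1, __ATOMIC_RELEASE);
}
```

The same distance value is what you would compare against a threshold to decide whether to yield or sleep instead of spinning.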
In really high contention cases, fallback to OS-assisted sleep/wake is appropriate if you'll be waiting for long, like Linux futex. Or since we can see how close to the head of the queue we are, yield, nanosleep, or futex if your wait interval will be more than 3 or 8 ticket numbers or whatever. (Tunable depending on how long it takes to serve a ticket.)
(Using futex, you might introduce a read of m->ticket into the unlock to figure out if there might be any threads sleeping, waiting for a notify. Like C++20 atomic<>.wait() and atomic<>.notify_all(). Unfortunately I don't know a good way to figure out which thread to notify, instead of waking them all up to check if they're the lucky winner.)
With low average contention, you should keep both in the same cache line. An access to .ticket is always immediately followed by a load of .serving. In the unlocked no-contention case, this means only one cache line is bouncing around, or having to stay hot for the same core to take/release the lock.
If the lock is already held, the thread wanting to unlock needs exclusive ownership of the cache line to RMW or store. It loses this whether another core does an RMW or just a pure load on the line containing .serving.
There won't be too many cases where multiple waiters are all spinning on the same lock, and where new threads getting a ticket number delay the unlock, and its visibility to the thread waiting for it.
This is my intuition, anyway; it's probably hard to microbenchmark, unless a cache-miss atomic RMW stops a later load from even starting to request the later line, in which case you could have two cache-miss latencies in taking the lock.
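To make the two layouts concrete, here is a sketch using C11 alignas. The 64-byte line size is an assumption (typical for x86-64 and many AArch64 cores, but not guaranteed), and the type names are invented for this example:

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size, check your target */

/* Low contention: both counters share one cache line. */
typedef struct {
    alignas(CACHE_LINE) uint32_t ticket;
    uint32_t serving;
} ticket_mutex_packed;

/* High contention: each counter gets its own cache line. */
typedef struct {
    alignas(CACHE_LINE) uint32_t ticket;
    alignas(CACHE_LINE) uint32_t serving;
} ticket_mutex_split;
```

Aligning the packed version to the line size also keeps the pair from straddling a line boundary; the split version pays for its separation with twice the footprint.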
Avoiding an atomic RMW in the unlock?
The thread holding the lock knows it has exclusive ownership, no other thread will be modifying m->serving concurrently. If you had the lock owner remember its own ticket number, you could optimize the unlock to just a store.
void ticket_mutex_unlock(ticket_mutex *m, uint32_t ticket_num)
{
    (void) __atomic_store_n(&m->serving, ticket_num+1, __ATOMIC_RELEASE);
}
Or without that API change (to return an integer from u32 ticket_mutex_lock())
void ticket_mutex_unlock(ticket_mutex *m)
{
    uint32_t ticket = __atomic_load_n(&m->serving, __ATOMIC_RELAXED);  // we already own the lock
    // and no other thread can be writing concurrently, so a non-atomic increment is safe
    (void) __atomic_store_n(&m->serving, ticket+1, __ATOMIC_RELEASE);
}
This has a nice efficiency advantage on ISAs that need LL/SC retry loops for atomic RMWs, where spurious failure from another core reading the value can happen. And on x86 where the only possible atomic RMW is a full barrier, stronger even than needed for C seq_cst semantics.
BTW, the lock fields would be fine as uint32_t. You're not going to have 2^32 threads waiting for a lock. So I used uint32_t instead of u64. Wrap-around is well-defined. Even subtraction like ticket - serving Just Works, even across that wrapping boundary: with uint32_t operands, 1 - 0xffffffff gives 2, so you can still calculate how close you are to being served, for sleep decisions. (Keep the operands 32-bit: with an unsigned long constant like 0xffffffffUL on an LP64 target, the subtraction would be done in 64 bits and not wrap at 2^32.)
Not a big deal on x86-64, only saving a bit of code size, and probably not a factor at all on AArch64. But will help significantly on some 32-bit ISAs.
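A tiny sketch of that wrap-around arithmetic; tickets_ahead is a name invented for this example:

```c
#include <stdint.h>

/* Number of tickets that must still be served before ours.
 * Unsigned subtraction is modulo 2^32, so this stays correct
 * even when the ticket counter has wrapped but serving has not. */
static inline uint32_t tickets_ahead(uint32_t my_ticket, uint32_t serving)
{
    return my_ticket - serving;
}
```

For instance, tickets_ahead(1, 0xffffffff) is 2: tickets 0xffffffff and 0 are served first, then ticket 1.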
Related
Suppose we have multiple threads incrementing a common variable X, and each thread synchronizes by using a mutex M;
function_thread_n(){
    ACQUIRE (M)
    X++;
    RELEASE (M)
}
The mutex ensures that only one thread is updating X at any time, but does a mutex ensure that once updated the value of X is visible to the other threads too? Say the initial value of X is 2; thread 1 increments it to 3. However, the cache of another processor might have the earlier value of 2, and another thread can still end up incrementing the value of 2 to 3. The third condition for cache coherence only requires that the order of writes made by different processors holds, right?
I guess this is what memory barriers are for and if a memory barrier is used before releasing the mutex, then the issue can be avoided.
This is a great question.
TL;DR: The short answer is "yes".
Mutexes provide three primary services:
Mutual exclusion, to ensure that only one thread is executing instructions within the critical section between acquire and release of a given mutex.
Compiler optimization fences, which prevent the compiler's optimizer from moving load/store instructions out of that critical section during compilation.
Architectural memory barriers appropriate to the current architecture, which in general includes a memory acquire fence instruction during mutex acquire and a memory release fence instruction during mutex release. These fences prevent superscalar processors from effectively reordering memory load/stores across the fence at runtime in a way that would cause them to appear to be "performed" outside the critical section.
The combination of all three ensures that data accesses within the critical section delimited by the mutex acquire/release will never observably race with data accesses from another thread that also protects its accesses using the same mutex.
Regarding the part of your question involving caches, coherent cache memory systems separately ensure that at any particular moment, a given line of memory is only writeable by at most one core at a time. Furthermore, memory store operations do not complete until they have evicted any "newly stale" copies cached elsewhere in the caching system (e.g. the L1 of other cores). See this question for more details.
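As a small illustration of the scenario in the question, here is a sketch using POSIX threads (an assumption, the question's ACQUIRE/RELEASE are generic). The names worker and run_counter_demo are invented for this example. Because every increment happens under the same mutex, no update is lost and the final value is exactly 2 plus the total number of increments:

```c
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t M = PTHREAD_MUTEX_INITIALIZER;
static long X = 2;                  /* shared variable from the question */

static void *worker(void *p)
{
    long n = *(const long *)p;
    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&M);     /* acquire: see the previous holder's X */
        X++;
        pthread_mutex_unlock(&M);   /* release: publish our X to the next holder */
    }
    return NULL;
}

/* Starts up to 16 workers and returns the final value of X. */
static long run_counter_demo(int nthreads, long iterations)
{
    pthread_t tid[16];
    int started = 0;

    if (nthreads > 16)
        nthreads = 16;
    X = 2;
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[i], NULL, worker, &iterations) != 0)
            break;                  /* could not start more threads */
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return X;
}
```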
I have a problem that I need to understand if there is a better solution. I have written the following code to pass a few variables from a writer thread to a reader thread. These threads pinned to different CPUs sharing the same L2 cache (disabled hyperthreading).
writer_thread.h
struct a_few_vars {
    uint32_t x1;
    uint32_t x2;
    uint64_t x3;
    uint64_t x4;
} __attribute__((aligned(64)));
volatile uint32_t head;
struct a_few_vars xxx[UINT16_MAX] __attribute__((aligned(64)));
reader_thread.h
uint32_t tail;
struct a_few_vars *p_xxx;
The writer thread increases the head variable and the reader thread checks whether the head variable and the tail is equal. If they are not equal then it reads the new data as follows
while (true) {
    if (tail != head) {
        .. process xxx[head] ..
        .. update tail ..
    }
}
Performance is by far the most important issue. I'm using Intel Xeon processors and the reader thread fetches the head value and the xxx[head] data from memory each time. I used the aligned array to be lock free.
In my case, is there any method to flush the variables to the reader CPU cache as soon as possible? Can I trigger a prefetch for the reader CPU from the writer CPU? I can use special Intel instructions using __asm__ if they exist. In conclusion, what is the fastest way to pass the variables in the struct between threads pinned to different CPUs?
Thanks in advance
It's undefined behaviour for one thread to write a volatile variable while another thread reads it, according to C11. volatile accesses are also not ordered with respect to other accesses. You want atomic_store_explicit(&head, new_value, memory_order_release) in the writer and atomic_load_explicit(&head, memory_order_acquire) in the reader to create acq/rel synchronization, and force the compiler to make the stores into your struct visible before the store to head which indicates to the reader that there's new data.
(tail is private to the reader thread, so there's no mechanism for the writer to wait for the reader to have seen new data before writing more. So technically there's a possible race on the struct contents if the writer thread writes again while the reader is still reading. So the struct should also be _Atomic).
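A minimal sketch of the release/acquire hand-off with C11 stdatomic. The ring size SLOTS and the function names publish and try_consume are invented for this example. Like the code in the question, it has no back-pressure, so it is only correct if the writer never laps the reader:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct a_few_vars {
    uint32_t x1;
    uint32_t x2;
    uint64_t x3;
    uint64_t x4;
};

#define SLOTS 1024u                 /* hypothetical ring size */

static struct a_few_vars ring[SLOTS];
static _Atomic uint32_t head;       /* written only by the writer thread */

/* Writer: fill the slot first, then publish it with a release store. */
static void publish(const struct a_few_vars *v)
{
    uint32_t h = atomic_load_explicit(&head, memory_order_relaxed);
    ring[h % SLOTS] = *v;
    atomic_store_explicit(&head, h + 1, memory_order_release);
}

/* Reader: the acquire load of head makes the slot contents visible.
 * tail is private to the reader. Returns false if nothing is new. */
static bool try_consume(uint32_t *tail, struct a_few_vars *out)
{
    uint32_t h = atomic_load_explicit(&head, memory_order_acquire);
    if (*tail == h)
        return false;
    *out = ring[*tail % SLOTS];
    (*tail)++;
    return true;
}
```

Note that the reader indexes with its own tail, not with head, so it does not skip entries when the writer has published more than one.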
You might want a seq-lock where the writer updates a sequence number and the reader checks it before and after copying out the variables. https://en.wikipedia.org/wiki/Seqlock This lets you detect and retry in the rare cases where the writer was in the middle of an update when the reader copied the data.
It's pretty good for write-only / read-only situations, especially if you don't need to worry about the reader missing an update.
See my attempt at a SeqLock in C++11: Implementing 64 bit atomic counter with 32 bit atomics and also how to implement a seqlock lock using c++11 atomic library
And GCC reordering up across load with `memory_order_seq_cst`. Is this allowed? shows another example (this one causes a gcc bug).
Porting these from C++11 std::atomic to C11 stdatomic should be straightforward. Make sure to use atomic_store_explicit, because the default memory ordering for plain atomic_store is memory_order_seq_cst which is slower.
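For reference, a single-writer seq-lock in C11 could look roughly like the sketch below. This is not a port of the linked answers, just one common formulation. The struct and function names are invented for this example, and the payload fields are relaxed atomics so that a reader racing with the writer is not a data race as far as the C11 model is concerned:

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint32_t seq;   /* even = stable, odd = write in progress */
    _Atomic uint64_t x3;
    _Atomic uint64_t x4;
} seqlock_vars;

/* Assumes exactly one writer thread. */
static void seqlock_write(seqlock_vars *s, uint64_t x3, uint64_t x4)
{
    uint32_t seq = atomic_load_explicit(&s->seq, memory_order_relaxed);
    atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);  /* odd */
    atomic_thread_fence(memory_order_release);  /* odd store before data stores */
    atomic_store_explicit(&s->x3, x3, memory_order_relaxed);
    atomic_store_explicit(&s->x4, x4, memory_order_relaxed);
    atomic_store_explicit(&s->seq, seq + 2, memory_order_release);  /* even */
}

/* Copies out a consistent snapshot, retrying if a write overlapped. */
static void seqlock_read(seqlock_vars *s, uint64_t *x3, uint64_t *x4)
{
    uint32_t before, after;
    do {
        before = atomic_load_explicit(&s->seq, memory_order_acquire);
        *x3 = atomic_load_explicit(&s->x3, memory_order_relaxed);
        *x4 = atomic_load_explicit(&s->x4, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);  /* data loads before re-check */
        after = atomic_load_explicit(&s->seq, memory_order_relaxed);
    } while ((before & 1u) || before != after);
}
```

A real reader loop would probably also want a pause or backoff between retries.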
Not much you can do will actually speed up the writer making its stores globally visible. A CPU core already commits stores from its store buffer to its L1d as quickly as possible (obeying the restrictions of the x86 memory model, which doesn't allow StoreStore reordering).
On a Xeon, see When CPU flush value in storebuffer to L1 Cache? for some info about different Snoop Modes and their effect on inter-core latency within a single socket.
The caches on multiple cores are coherent, using MESI to maintain coherency.
A reader spin-waiting on an atomic variable is probably the best you can do, using _mm_pause() inside the spin loop to avoid a memory-order mis-speculation pipeline clear when exiting the spin-loop.
You also don't want to wake up in the middle of a write and have to retry. You might want to put the seq-lock counter in the same cache line as the data, so those stores can hopefully be merged in the store buffer of the writing core.
I have 2 questions regarding threads: one is about race conditions and the other is about mutexes.
So the first question :
I've read about race condition in wikipedia page :
http://en.wikipedia.org/wiki/Race_condition
And in the example of race condition between 2 threads this is shown :
http://i60.tinypic.com/2vrtuz4.png
Now so far I believed that threads work in parallel to each other, but judging from this picture it seems that I misunderstood how actions are done by the computer.
From this picture only 1 action is done at a time, and although the threads gets switched from time to time and the other thread gets to do some actions this is still 1 action at a time done by the computer. Is it really like this ? There's no "real" parallel computing, just 1 action done at a time in a very fast rate which gives the illusion of parallel computing ?
This leads me to my second question about mutex.
I've read that if threads read/write to the same memory we need some sort of synchronization mechanism. I've read the normal data types won't do and we need a mutex.
Let's take for example the following code :
#include <stdio.h>
#include <stdbool.h>
#include <windows.h>
#include <process.h>

bool lock = false;

void increment(void*);
void decrement(void*);

int main()
{
    int n = 5;
    HANDLE hIncrement = (HANDLE)_beginthread(increment, 0, (void*)&n);
    HANDLE hDecrement = (HANDLE)_beginthread(decrement, 0, (void*)&n);
    WaitForSingleObject(hIncrement, 1000 * 500);
    WaitForSingleObject(hDecrement, 1000 * 500);
    return 0;
}

void increment(void *p)
{
    int *n = p;
    for(int i = 0; i < 10; i++)
    {
        while (lock)
        {
        }
        lock = true;
        (*n)++;
        lock = false;
    }
}

void decrement(void *p)
{
    int *n = p;
    for(int i = 0; i < 10; i++)
    {
        while (lock)
        {
        }
        lock = true;
        (*n)--;
        lock = false;
    }
}
Now in my example here, I use bool lock as my synchronization mechanism to avoid a race condition between the 2 threads over the memory space pointed by pointer n.
Now what I did here won't obviously work because although I avoided a race condition over the memory space pointed by pointer n between the 2 threads a new race condition over bool lock variable may occur.
Let's consider the following sequence of events (A = increment thread, B = decrement thread) :
A gets out of the while loop since lock is false
A gets to set lock to true
B waits in the while loop because lock is set to true
A increments the value pointed by n
A sets lock to false
A gets to the while loop
A gets out of the while loop since lock is false
B gets out of the while loop since lock is false
A sets lock to true
B sets lock to true
and from here we get an unexpected behavior of 2 un-synchronized threads because the bool lock is not race condition proof.
Ok, so far this is my understanding and the solution to our problem above we need a mutex.
I'm fine with that, a data type that will magically be race condition proof.
I just don't understand how this won't happen with a mutex type when it will with every other type, and here lies my problem: I want to understand why a mutex works and how.
About your first question: Whether or not there are actually several different threads running at once, or whether it is just implemented as fast switching, is a matter of your hardware. Typical PCs these days have several cores (often with more than one thread each), so you have to assume that things actually DO happen at the same time.
But even if you have only a single-core system, things are not quite so easy. This is because the compiler is usually allowed to re-order instructions in order to optimize code. It can also e.g. choose to cache a variable in a CPU register instead of loading it from memory every time you access it, and it also doesn't have to write it back to memory every time you write to that variable. The compiler is allowed to do that as long as the result is the same AS IF it had run your original code in its original order - as long as nobody else is looking closely at what's actually going on, such as a different thread.
And once you actually do have different cores, consider that they all have their own CPU registers and even their own cache. Even if a thread on one core wrote to a certain variable, as long as the new value has not yet become visible outside that core (for example because it still sits in a register or in the core's store buffer), a different core won't see that change.
In short, you have to be very careful in making any assumptions about what happens when two threads access variables at the same time, especially in C/C++. The interactions can be so surprising that I'd say, to stay on the safe side, you should make sure that there are no race conditions in your code, e.g. by always using mutexes for accessing memory that is shared between threads.
Which is where we can neatly segue into the second question: What's so special about mutexes, and how can they work if all basic data types are not threadsafe?
The thing about mutexes is that they are implemented with a lot of knowledge about the system for which they are being used (hardware and operating system), and with either the direct help or a deep knowledge of the compiler itself.
The C language does not give you direct access to all the capabilities of your hardware and operating system, because platforms can be very different from each other. Instead, C focuses on providing a level of abstraction that allows you to compile the same code for many different platforms. The different "basic" data types are just something that the C standard came up with as a set of data types which can in some way be supported on almost any platform - but the actual hardware that your program will be compiled for is usually not limited to those types and operations.
In other words, not everything that you can do with your PC can be expressed in terms of C's ints, bytes, assignments, arithmetic operators and so on. For example, PCs often calculate with 80-bit floating point types which are usually not mapped directly to a C floating point type at all. More to the point of our topic, there are also CPU instructions that influence how multiple CPU cores will work together. Additionally, if you know the CPU, you often know a few things about the behaviour of the basic types that the C standard doesn't guarantee (for example, whether loads and stores to 32-bit integers are atomic). With that extra knowledge, it can become possible to implement mutexes for that particular platform, and it will often require code that is e.g. written directly in assembly language, because the necessary features are not available in plain C.
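Since C11, the language itself exposes some of those special instructions through <stdatomic.h>. As a sketch of the key difference from the bool lock in the question: atomic_flag_test_and_set reads the old value and sets the flag as one indivisible step, so two threads can never both observe "unlocked". The spin_mutex type and function names here are invented for illustration, and a real mutex would also block in the OS instead of spinning forever:

```c
#include <stdatomic.h>

typedef struct {
    atomic_flag held;   /* set = locked, clear = unlocked */
} spin_mutex;

static void spin_mutex_lock(spin_mutex *m)
{
    /* Test and set in ONE atomic operation: returns the previous value.
     * We own the lock only if the flag was clear before our set. */
    while (atomic_flag_test_and_set_explicit(&m->held, memory_order_acquire))
        ;   /* someone else holds it, try again */
}

static void spin_mutex_unlock(spin_mutex *m)
{
    atomic_flag_clear_explicit(&m->held, memory_order_release);
}
```

In the question's example, replacing the while (lock) {} / lock = true pair with spin_mutex_lock, and lock = false with spin_mutex_unlock, closes the window between the check and the set.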
How are they implemented especially in case of pthreads. What pthread synchronization APIs do they use under the hood? A little bit of pseudocode would be appreciated.
I haven't done any pthreads programming for a while, but when I did, I never used POSIX read/write locks. The problem is that most of the time a mutex will suffice: ie. your critical section is small, and the region isn't so performance critical that the double barrier is worth worrying about.
In those cases where performance is an issue, using atomic operations (generally available as a compiler extension) is normally a better option (ie. the extra barrier is the problem, not the size of the critical section).
By the time you eliminate all these cases, you are left with cases where you have specific performance/fairness/rw-bias requirements that require a true rw-lock; and that is when you discover that all the relevant performance/fairness parameters of POSIX rw-lock are undefined and implementation specific. At this point you are generally better off implementing your own so you can ensure the appropriate fairness/rw-bias requirements are met.
The basic algorithm is to keep a count of how many of each are in the critical section, and if a thread isn't allowed access yet, to shunt it off to an appropriate queue to wait. Most of your effort will be in implementing the appropriate fairness/bias between servicing the two queues.
The following C-like pthreads-like pseudo-code illustrates what I'm trying to say.
struct rwlock {
    mutex admin;         // used to serialize access to other admin fields, NOT the critical section.
    int count;           // threads in critical section +ve for readers, -ve for writers.
    fifoDequeue dequeue; // acts like a cond_var with fifo behaviour and both append and prepend operations.
    void *data;          // represents the data covered by the critical section.
};
void read(struct rwlock *rw, void (*readAction)(void *)) {
    lock(rw->admin);
    if (rw->count < 0) {
        append(rw->dequeue, rw->admin);
    }
    while (rw->count < 0) {
        prepend(rw->dequeue, rw->admin); // Used to avoid starvation.
    }
    rw->count++;
    // Wake the new head of the dequeue, which may be a reader.
    // If it is a writer it will put itself back on the head of the queue and wait for us to exit.
    signal(rw->dequeue);
    unlock(rw->admin);

    readAction(rw->data);

    lock(rw->admin);
    rw->count--;
    signal(rw->dequeue); // Wake the new head of the dequeue, which is probably a writer.
    unlock(rw->admin);
}

void write(struct rwlock *rw, void *(*writeAction)(void *)) {
    lock(rw->admin);
    if (rw->count != 0) {
        append(rw->dequeue, rw->admin);
    }
    while (rw->count != 0) {
        prepend(rw->dequeue, rw->admin);
    }
    rw->count--;
    // As we only allow one writer in at a time, we don't bother signaling here.
    unlock(rw->admin);

    // NOTE: This is the critical section, but it is not covered by the mutex!
    // The critical section is rather, covered by the rw-lock itself.
    rw->data = writeAction(rw->data);

    lock(rw->admin);
    rw->count++;
    signal(rw->dequeue);
    unlock(rw->admin);
}
Something like the above code is a starting point for any rwlock implementation. Give some thought to the nature of your problem and replace the dequeue with the appropriate logic that determines which class of thread should be woken up next. It is common to allow a limited number/period of readers to leapfrog writers or vice versa depending on the application.
Of course my general preference is to avoid rw-locks altogether; generally by using some combination of atomic operations, mutexes, STM, message-passing, and persistent data-structures. However there are times when what you really need is a rw-lock, and when you do it is useful to know how they work, so I hope this helped.
EDIT - In response to the (very reasonable) question, where do I wait in the pseudo-code above:
I have assumed that the dequeue implementation contains the wait, so that somewhere within append(dequeue, mutex) or prepend(dequeue, mutex) there is a block of code along the lines of:
while(!readyToLeaveQueue()) {
    wait(dequeue->cond_var, mutex);
}
which was why I passed in the relevant mutex to the queue operations.
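For something compilable, here is a much-simplified sketch with real pthreads calls. It swaps the FIFO dequeue for a single condition variable with broadcast wake-ups, so it has no fairness policy at all (writers can starve under a steady stream of readers). The simple_rwlock type and function names are invented for this example:

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t admin;    /* protects count only, not the critical section */
    pthread_cond_t  changed;  /* signalled whenever count may allow entry */
    int count;                /* >0: number of readers, -1: one writer, 0: free */
} simple_rwlock;

#define SIMPLE_RWLOCK_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

static void simple_read_lock(simple_rwlock *rw)
{
    pthread_mutex_lock(&rw->admin);
    while (rw->count < 0)                       /* a writer is inside */
        pthread_cond_wait(&rw->changed, &rw->admin);
    rw->count++;
    pthread_mutex_unlock(&rw->admin);
}

static void simple_read_unlock(simple_rwlock *rw)
{
    pthread_mutex_lock(&rw->admin);
    if (--rw->count == 0)                       /* last reader out */
        pthread_cond_broadcast(&rw->changed);
    pthread_mutex_unlock(&rw->admin);
}

static void simple_write_lock(simple_rwlock *rw)
{
    pthread_mutex_lock(&rw->admin);
    while (rw->count != 0)                      /* readers or a writer inside */
        pthread_cond_wait(&rw->changed, &rw->admin);
    rw->count = -1;
    pthread_mutex_unlock(&rw->admin);
}

static void simple_write_unlock(simple_rwlock *rw)
{
    pthread_mutex_lock(&rw->admin);
    rw->count = 0;
    pthread_cond_broadcast(&rw->changed);       /* wake readers and writers */
    pthread_mutex_unlock(&rw->admin);
}
```

As in the pseudo-code, the admin mutex is dropped before the caller touches the protected data; the count is what actually guards the critical section.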
Each implementation can be different, but normally they have to favor readers by default due to the requirement by POSIX that a thread be able to obtain the read-lock on an rwlock multiple times. If they favored writers, then whenever a writer was waiting, the reader would deadlock on the second read-lock attempt unless the implementation could determine the reader already has a read lock, but the only way to determine that is storing a list of all threads that hold read locks, which is very inefficient in time and space requirements.
What way is better and faster to create a critical section?
With a binary semaphore, between sem_wait and sem_post.
Or with atomic operations:
#include <sched.h>
#include <stdbool.h>

void critical_code(){
    static volatile bool lock = false;
    //Enter critical section
    while ( !__sync_bool_compare_and_swap (&lock, false, true ) ){
        sched_yield();
    }
    //...
    //Leave critical section
    lock = false;
}
Regardless of what method you use, the worst performance problem with your code has nothing to do with what type of lock you use, but the fact that you're locking code rather than data.
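To illustrate what locking data rather than code can look like, here is a sketch where each object carries its own lock, so threads working on different objects never contend. The account type and its functions are purely hypothetical:

```c
#include <pthread.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;   /* protects this object's fields only */
    long balance;
} account;

static void account_init(account *a, long initial)
{
    pthread_mutex_init(&a->lock, NULL);
    a->balance = initial;
}

static void account_deposit(account *a, long amount)
{
    pthread_mutex_lock(&a->lock);
    a->balance += amount;
    pthread_mutex_unlock(&a->lock);
}

static long account_balance(account *a)
{
    pthread_mutex_lock(&a->lock);
    long b = a->balance;
    pthread_mutex_unlock(&a->lock);
    return b;
}
```

With a single static lock inside a function, as in the question, every caller serializes on that one lock no matter which data it is touching.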
With that said, there is no reason to roll your own spinlocks like that. Either use pthread_spin_lock if you want a spinlock, or else pthread_mutex_lock or sem_wait (with a binary semaphore) if you want a lock that can yield to other processes when contended. The code you have written is the worst of both worlds in how it uses sched_yield. The call to sched_yield will ensure that the lock waits at least a few milliseconds (and probably a whole scheduling timeslice) in the case where there's both lock contention and cpu load, and it will burn 100% cpu when there's contention but no cpu load (due to the lock-holder being blocked in IO, for instance). If you want to get any of the benefits of a spin lock, you need to be spinning without making any syscalls. If you want any of the benefits of yielding the cpu, you should be using a proper synchronization primitive which will use (on Linux) futex (or equivalent) operations to yield exactly until the lock is available - no shorter and no longer.
And if by chance all that went over your head, don't even think about writing your own locks.
Spin-locks perform better if there is little contention for the lock and/or it is never held for a long period of time. Otherwise you are better off with a lock that blocks rather than spins. There are of course hybrid locks which will spin a few times, and if the lock cannot be acquired, then they will block.
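A hybrid lock of that kind can be sketched on top of a plain pthread mutex: try a bounded number of times without blocking, then fall back to a blocking lock. SPIN_TRIES and hybrid_lock are names invented for this example, and the spin count would need tuning:

```c
#include <pthread.h>

#define SPIN_TRIES 100   /* hypothetical tuning value */

static inline void spin_pause(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause");
#elif defined(__aarch64__)
    __asm__ __volatile__("yield");
#endif
}

static void hybrid_lock(pthread_mutex_t *m)
{
    /* Phase 1: spin briefly, hoping the holder releases soon. */
    for (int i = 0; i < SPIN_TRIES; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return;          /* got the lock without blocking */
        spin_pause();
    }
    /* Phase 2: give up spinning and let the OS put us to sleep. */
    pthread_mutex_lock(m);
}
```

Unlocking is just pthread_mutex_unlock. Some pthread implementations already offer an adaptive mutex type that does something similar internally.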
Which is better for you depends on your application. Only you can answer that question.
You didn't look deep enough in the gcc documentation. The correct builtins for such type of lock are __sync_lock_test_and_set and __sync_lock_release. These have exactly the guarantees that you need for such a thing. In terms of the new C11 standard this would be the type atomic_flag with operations atomic_flag_test_and_set and atomic_flag_clear.
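A minimal sketch using those two builtins (GCC/Clang only). The function names tas_lock and tas_unlock are invented for this example:

```c
/* Lock word: 0 = free, 1 = held. */

static void tas_lock(volatile int *l)
{
    /* Atomically stores 1 and returns the previous value (acquire barrier).
     * A previous value of 0 means we took the lock. */
    while (__sync_lock_test_and_set(l, 1)) {
        /* busy-wait until the holder releases */
    }
}

static void tas_unlock(volatile int *l)
{
    /* Stores 0 with release semantics. */
    __sync_lock_release(l);
}
```

Compared with the question's code, the unlock is no longer a plain store to a volatile: __sync_lock_release carries the release ordering that keeps the critical section's accesses from leaking out past the unlock.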
As R. already indicates, putting sched_yield into the loop, is really a bad idea.
If the code inside the critical section is only some cycles, the probability that the execution of it falls across the boundary of a scheduling slice is small. The number of threads that will be blocked spinning actively will be at most the number of processors minus one. All this doesn't hold if you yield execution as soon as you don't obtain the lock immediately. If you have real contention on your lock and yield, you will have a multitude of context switches, which will bring your system almost to a halt.
As others have pointed out, it's not really about how fast the locking code is. This is because once a lock sequence is initiated using "xchg reg,mem", a lock signal is sent down through the caches and out to the devices on all buses. Only when the last device has acknowledged that it will hold, which may take hundreds if not a thousand clock cycles, is the actual exchange performed. If your slowest device is a classic PCI card, it will have a bus speed of 33 MHz, which is about one hundredth of the CPU's internal clock. And the PCI device (if active) will need several clock cycles (at 33 MHz) to respond. During that time the CPU will be waiting for the acknowledgement to come back.
Most spinlocks are probably used in device drivers where the routine won't be pre-empted by the OS but might be interrupted by a higher-level driver.
A critical section is really just a spin-lock but with interfacing to the OS because it may be pre-empted.