Suppose we have multiple threads incrementing a common variable X, and each thread synchronizes by using a mutex M;
function_thread_n(){
ACQUIRE (M)
X++;
RELEASE (M)
}
The mutex ensures that only one thread is updating X at any time, but does a mutex ensure that once updated the value of X is visible to the other threads too. Say the initial values of X is 2; thread 1 increments it to 3. However, the cache of another processor might have the earlier value of 2, and another thread can still end up incrementing the value of 2 to 3. The third condition for cache coherence only requires that the order of writes made by different processors holds, right?
I guess this is what memory barriers are for and if a memory barrier is used before releasing the mutex, then the issue can be avoided.
This is a great question.
TL;DR: The short answer is "yes".
Mutexes provide three primary services:
Mutual exclusion, to ensure that only one thread is executing instructions within the critical section between acquire and release of a given mutex.
Compiler optimization fences, which prevent the compiler's optimizer from moving load/store instructions out of that critical section during compilation.
Architectural memory barriers appropriate to the current architecture, which in general includes a memory acquire fence instruction during mutex acquire and a memory release fence instruction during mutex release. These fences prevent superscalar processors from effectively reordering memory load/stores across the fence at runtime in a way that would cause them to appear to be "performed" outside the critical section.
The combination of all three ensure that data accesses within the critical section delimited by the mutex acquire/release will never observably race with data accesses from another thread who also protects its accesses using the same mutex.
Regarding the part of your question involving caches, coherent cache memory systems separately ensure that at any particular moment, a given line of memory is only writeable by at most one core at a time. Furthermore, memory store operations do not complete until they have evicted any "newly stale" copies cached elsewhere in the caching system (e.g. the L1 of other cores). See this question for more details.
Related
Suppose I have the following ticket-taking spinlock mutex implementation (in C using GCC atomic builtins). As I understand it, the use of the "release" memory order in the unlock function is correct. I'm unsure, though, about the lock function. Because this is a ticket-taking mutex, there's a field indicating the next ticket number to be handed out, and a field to indicate which ticket number currently holds the lock. I've used acquire-release on the ticket increment and acquire on the spin load. Is that unnecessarily strong, and if so, why?
Separately, should those two fields (ticket and serving) be spaced so that they're on different cache lines, or does that not matter? I'm mainly interested in arm64 and amd64.
typedef struct {
u64 ticket;
u64 serving;
} ticket_mutex;
void
ticket_mutex_lock(ticket_mutex *m)
{
u64 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_ACQ_REL);
while (my_ticket != __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE));
}
void
ticket_mutex_unlock(ticket_mutex *m)
{
(void) __atomic_fetch_add(&m->serving, 1, __ATOMIC_RELEASE);
}
UPDATE: based on the advice in the accepted answer, I've adjusted the implementation to the following. This mutex is intended for the low-contention case.
typedef struct {
u32 ticket;
u32 serving;
} ticket_mutex;
void
ticket_mutex_lock(ticket_mutex *m)
{
u32 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_RELAXED);
while (my_ticket != __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE)) {
#ifdef __x86_64__
__asm __volatile ("pause");
#endif
}
}
void
ticket_mutex_unlock(ticket_mutex *m)
{
u32 my_ticket = __atomic_load_n(&m->serving, __ATOMIC_RELAXED);
(void) __atomic_store_n(&m->serving, my_ticket+1, __ATOMIC_RELEASE);
}
m->ticket increment only needs to be RELAXED. You only need each thread to get a different ticket number; it can happen as early or late as you want wrt. other operations in the same thread.
load(&m->serving, acquire) is the operation that orders the critical section, preventing those from starting until we've synchronized-with a RELEASE operation in the unlock function of the previous holder of the lock. So the m->serving loads needs to be at least acquire.
Even if the m->ticket++ doesn't complete until after an acquire load of m->serving, that's fine. The while condition still determines whether execution proceeds (non-speculatively) into the critical section. Speculative execution into the critical section is fine, and good since it probably means it's ready commit sooner, reducing the time with the lock held.
Extra ordering on the RMW operation won't make it any faster locally or in terms of inter-thread visibility, and would slow down the thread taking the lock.
One cache line or two
For performance, I think with high contention, there are advantages to keeping the members in separate cache lines.
Threads needing exclusive ownership of the cache line to get a ticket number won't contend with the thread unlocking .serving, so those inter-thread latency delays can happen in parallel.
With multiple cores in the spin-wait while(load(serving)) loop, they can hit in their local L1d cache until something invalidates shared copies of the line, not creating any extra traffic. But wasting a lot of power unless you use something like x86 _mm_pause(), as well as wasting execution resources that could be shared with another logical core on the same physical. x86 pause also avoids a branch mispredict when leaving the spin loop. Related:
What is the purpose of the "PAUSE" instruction in x86?
How does x86 pause instruction work in spinlock *and* can it be used in other scenarios?
Locks around memory manipulation via inline assembly
Exponential backoff up to some number of pauses between checks is a common recommendation, but here we can do better: A number of pause instructions between checks that scales with my_ticket - m->serving, so you check more often when your ticket is coming up.
In really high contention cases, fallback to OS-assisted sleep/wake is appropriate if you'll be waiting for long, like Linux futex. Or since we can see how close to the head of the queue we are, yield, nanosleep, or futex if your wait interval will be more than 3 or 8 ticket numbers or whatever. (Tunable depending on how long it takes to serve a ticket.)
(Using futex, you might introduce a read of m->ticket into the unlock to figure out if there might be any threads sleeping, waiting for a notify. Like C++20 atomic<>.wait() and atomic.notify_all(). Unfortunately I don't know a good way to figure out which thread to notify, instead of waking them all up to check if they're the lucky winner.
With low average contention, you should keep both in the same cache line. An access to .ticket is always immediately followed by a load of .serving. In the unlocked no-contention case, this means only one cache line is bouncing around, or having to stay hot for the same core to take/release the lock.
If the lock is already held, the thread wanting to unlock needs exclusive ownership of the cache line to RMW or store. It loses this whether another core does an RMW or just a pure load on the line containing .serving.
There won't be too many cases where multiple waiters are all spinning on the same lock, and where new threads getting a ticket number delay the unlock, and its visibility to the thread waiting for it.
This is my intuition, anyway; it's probably hard to microbenchmark, unless a cache-miss atomic RMW stops later load from even starting to request the later line, in which case you could have two cache-miss latencies in taking the lock.
Avoiding an atomic RMW in the unlock?
The thread holding the lock knows it has exclusive ownership, no other thread will be modifying m->serving concurrently. If you had the lock owner remember its own ticket number, you could optimize the unlock to just a store.
void ticket_mutex_unlock(ticket_mutex *m, uint32_t ticket_num)
{
(void) __atomic_store_n(&m->serving, ticket_num+1, __ATOMIC_RELEASE);
}
Or without that API change (to return an integer from u32 ticket_mutex_lock())
void ticket_mutex_unlock(ticket_mutex *m)
{
uint32_t ticket = __atomic_load_n(&m->serving, __ATOMIC_RELAXED); // we already own the lock
// and no other thread can be writing concurrently, so a non-atomic increment is safe
(void) __atomic_store_n(&m->serving, ticket+1, __ATOMIC_RELEASE);
}
This has a nice efficiency advantage on ISAs that need LL/SC retry loops for atomic RMWs, where spurious failure from another core reading the value can happen. And on x86 where the only possible atomic RMW is a full barrier, stronger even than needed for C seq_cst semantics.
BTW, the lock fields would be fine as uint32_t. You're not going to have 2^32 threads waiting for a lock. So I used uint32_t instead of u64. Wrap-around is well-defined. Even subtraction like ticket - serving Just Works, even across that wrapping boundary, like 1 - 0xffffffffUL gives 2, so you can still calculate how close you are to being served, for sleep decisions.
Not a big deal on x86-64, only saving a bit of code size, and probably not a factor at all on AArch64. But will help significantly on some 32-bit ISAs.
It seems _lwsync is synchronizing multiple processors and
_sync_synchronize is synchronizing in all threads using memory barriers.
But I want to know more specific about differences.
__lwsync:
The lwsync barrier is broadly similar to
sync, including cumulativity properties, except that does not
order store/load pairs and it is cheaper to execute; it suffices to guarantee SC behaviour in MP+lwsyncs (MP with
lwsync in each thread), WRC+lwsync+addr (WRC with
lwsync on Thread 1 and an address dependency on Thread
2), and ISA2+lwsync+data+addr, while SB+lwsyncs and
IRIW+lwsyncs are still allowed.
__sync_synchronize:
This function synchronizes data in all threads.
A full memory barrier is created when this function is invoked.It is a atomic builtin for full memory barrier.No memory operand will be moved across the operation, either forward or backward. Further, instructions will be issued as necessary to prevent the processor from speculating loads across the operation and from queuing stores after the operation.
There are 2 threads,one only reads the signal,the other only sets the signal.
Is it necessary to create a mutex for signal and the reason?
UPDATE
All I care is whether it'll crash if two threads read/set the same time
You will probably want to use atomic variables for this, though a mutex would work as well.
The problem is that there is no guarantee that data will stay in sync between threads, but using atomic variables ensures that as soon as one thread updates that variable, other threads immediately read its updated value.
A problem could occur if one thread updates the variable in cache, and a second thread reads the variable from memory. That second thread would read an out-of-date value for the variable, if the cache had not yet been flushed to memory. Atomic variables ensure that the value of the variable is consistent across threads.
If you are not concerned with timely variable updates, you may be able to get away with a single volatile variable.
It depends. If writes are atomic then you don't need a mutual exclusion lock. If writes are not atomic, then you do need a lock.
There is also the issue of compilers caching variables in the CPU cache which may cause the copy in main memory to not get updating on every write. Some languages have ways of telling the compiler to not cache a variable in the CPU like that (volatile keyword in Java), or to tell the compiler to sync any cached values with main memory (synchronized keyword in Java). But, mutex's in general don't solve this problem.
If all you need is synchronization between threads (one thread must complete something before the other can begin something else) then mutual exclusion should not be necessary.
Mutual exclusion is only necessary when threads are sharing some resource where the resource could be corrupted if they both run through the critical section at roughly the same time. Think of two people sharing a bank account and are at two different ATM's at the same time.
Depending on your language/threading library you may use the same mechanism for synchronization as you do for mutual exclusion- either a semaphore or a monitor. So, if you are using Pthreads someone here could post an example of synchronization and another for mutual exclusion. If its java, there would be another example. Perhaps you can tell us what language/library you're using.
If, as you've said in your edit, you only want to assure against a crash, then you don't need to do much of anything (at least as a rule). If you get a collision between threads, about the worst that will happen is that the data will be corrupted -- e.g., the reader might get a value that's been partially updated, and doesn't correspond directly to any value the writing thread ever wrote. The classic example would be a multi-byte number that you added something to, and there was a carry, (for example) the old value was 0x3f ffff, which was being incremented. It's possible the reading thread could see 0x3f 0000, where the lower 16 bits have been incremented, but the carry to the upper 16 bits hasn't happened (yet).
On a modern machine, an increment on that small of a data item will normally be atomic, but there will be some size (and alignment) where it's not -- typically if part of the variable is in one cache line, and part in another, it'll no longer be atomic. The exact size and alignment for that varies somewhat, but the basic idea remains the same -- it's mostly just a matter of the number having enough digits for it to happen.
Of course, if you're not careful, something like that could cause your code to deadlock or something on that order -- it's impossible to guess what might happen without knowing anything about how you plan to use the data.
I would like to copy the content of a an array without using a for loop. The copy is made when owning a spinlock.
Is there any chance that memcpy() can sleep?
Things that might happen with memcpy (or with really any memory access in general):
If part of the source or destination is inaccessible (invalid) memory, memcpy could crash your process, which might leave a shared spinlock in a bad state.
If part of the source memory needs to be paged in, memcpy can block while the kernel grabs the memory for you.
If part of the source or destination is memory-mapped to I/O, memcpy might block while the kernel performs that I/O. (In extreme cases, like memory-mapped network files, memcpy might block indefinitely).
The kernel is also free to swap your process out at any point during the copy, which means the copy could take arbitrarily long to actually complete.
However, memcpy does not do anything that a regular memory access wouldn't do. So, using it with a spinlock should be safe (as safe as accessing the memory normally would be, anyway).
I detect some inconsitency in your question. I'll explain myself.
A spinlock or a busy lock in general, maintains the process (or thread) that is waiting for the lock to be acquired without releasing the cpu to another process (or thread) This means a very fast unlocking and reschedule mechanism when the lock is freed, but a very expensive model for long wait times...
Once said this.... if you are using a spinlock, the reason must be that the loop the process or thread is using to check when the lock is freed should not execute more than three or four times, or the cpu will be wasted just checking once after another time if the lock has been freed.
This completely discourages doing blocking operations like the one you ask for (a memory copy normally is strange that has to deal with a non-present resource ---memory page---, but when it does, your spinlock will go into a loop of millions of checks)
spinlocks where designed to protect very small chuncks of memory, where access could signify at most two or three accesses to memory. In that case, a spinlock is going to solve the problem, as putting the thread to wait and rescheduling it will be milion times faster with the spinlock than with the wait/awake process. But this is in clear antagony to the use of memcpy(3) function, as it is a general copy function that allows for large memory copies in one shot. This means the time the resource is locked for one thread, can signify millions of checks of another thread (in a different core, as this is another reason to use a spinlock, when you have a different core that is going to wait two or three accesses to the lock to see it unlocked)
In my opinion, the only use a spinlock can have is to protect a semaphore's counter, or to protect the access to a cond variable or a mutex, but never to be used as a general memory copy or large resource protection. In those cases, it is better to use a normal, sleeping lock. If you plan to use memcpy(3) the only thing I can assume is that you use the lock to protect large amounts of memory while they are copied into.... that's better handler with a sempahore or a mutex.
In modern kernels, the awakening of a process is so efficient that makes user mode spinlocks almost unusable at all.
As a conclussion, my guess is that you don't have to consider the use of memcpy() to protect a shared memory region... but to consider to use a spinlock itself to do the protection. In most cases it will be a lost of resources, and will make your system heavier and slower.
If a single 32-bit variable is shared between multiple threads, should I put a mutex lock around the variable? For example, suppose 1 thread writes to a 32-bit counter and a 2nd thread reads it. Is there any chance the 2nd thread could read a corrupted value?
I'm working on a 32-bit ARM embedded system. The compiler always seems to align 32-bit variables so they can be read or written with a single instruction. If the 32-bit variable was not aligned, then the read or write would be broken down into multiple instructions and the 2nd thread could read a corrupted value.
Does the answer to this question change if I move to a multiple-core system in the future and the variable is shared between cores? (assuming a shared cache between cores)
Thanks!
A mutex protects you from more than just tearing - for example some ARM implementations use out-of-order execution, and a mutex will include memory (and compiler) barriers that may be necessary for your algorithm's correctness.
It is safer to include the mutex, then figure out a way to optimise it later if it shows as a performance problem.
Note also that if your compiler is GCC-based, you may have access to the GCC atomic builtins.
If all the writing is done from one thread (i.e. other threads are only reading), then no you don't need a mutex. If more than one thread may be writing, then you do.
You don't need mutex.
On 32-bit ARM, single write or read is an atomic operation. (regardless of the number of cores)
Of course, you should declare that variable as volatile.
On a 32-bit system, reads and writes of 32-bit vars are atomic. However, it depends what else you are doing with the variable. E.g. if you maniputale it somehow (e.g. add a value), then this requires a read, manipulation and write. If the CPU and compiler do not support an atomic operation for this, then you will need to use a mutex to protect this multi-operation sequence.
There are other, lock-free techniques which can reduce the need for mutexes.