Is a matching __sync_lock_release mandatory? (C)

I am reviewing some circa 2013 C code that uses __sync_xxx built-ins. Mostly these look fine but in one case I found a __sync_lock_test_and_set() without __sync_lock_release(). It looks like the original programmer was using __sync_lock_test_and_set() as an atomic store for pointer values in a hash table. The code doesn't wait on these memory locations (e.g. check the __sync_lock_test_and_set result in a while loop), it just sets them and moves on. Searching on "is __sync_lock_release mandatory" and similar isn't bringing up a clear answer (although that does bring up useful discussions [1][2] making clear the newer GNU atomic built-ins should be used, and I may be able to convince the client to do that).
What problem might be caused by not calling __sync_lock_release() to release the acquire memory barrier? Do such barriers "survive" in any way and possibly affect other code?
[1] What is the GCC builtin for an atomic set?
[2] Is my spin lock implementation correct and optimal?

Memory barriers (or memory orders for operations) aren't things you have to release in the first place. See https://preshing.com/20120913/acquire-and-release-semantics/ and https://preshing.com/20130922/acquire-and-release-fences
release is one of the possible orderings an operation can have, one that's appropriate for releasing a lock but also for the writing side of acquire/release synchronization.
An acquire operation is one you could use in implementing a function that takes a lock. There isn't a hidden lock anywhere that gets taken behind the scenes. (A non-lock-free implementation of __sync or __atomic builtins might use a lock internally, but it would take and release it within one invocation of the builtin.)
void __sync_lock_release(ptr) is just __atomic_store_n(ptr, 0, __ATOMIC_RELEASE). (Actually the docs call it a "barrier", not just an operation with ordering wrt. other operations, so perhaps atomic_thread_fence(memory_order_release) and then a relaxed atomic store.)
The documentation wording of "This built-in function releases the lock acquired by __sync_lock_test_and_set." is just an example use-case, there is no resource or lock that "needs to be released".
All legacy __sync builtins are like __atomic_...(__ATOMIC_SEQ_CST), except apparently for __sync_lock_test_and_set (an acquire operation) and __sync_lock_release (a release store); that's the only reason there's a special function with "release" in the name: the old __sync builtins didn't otherwise allow release-ordered operations.
If the code only wants to do atomic_exchange_explicit(ptr, 1, memory_order_acquire) at that point, that's totally fine and normal, and it can do it with an obsolete __sync_lock_test_and_set if it wants.
If you were going to rewrite anything, use C11 stdatomic.h, or use __atomic builtins if you don't want to make the array _Atomic. Either way, figure out if acquire ordering for that operation is appropriate, and if not use a weaker or stronger order, whatever is required. On x86, any atomic RMW will effectively be seq_cst except for possible compile-time reordering of weaker orders.
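For illustration, here is a minimal sketch of such a rewrite (the table layout and names, slots/idx/newp, are assumptions, not the original code): it shows the obsolete builtin next to a C11 atomic store with an explicitly chosen ordering.

#include <stdatomic.h>
#include <stddef.h>

struct node;

/* Old style: acquire-ordered exchange whose result is discarded. */
void put_sync(struct node **slots, size_t idx, struct node *newp)
{
    (void)__sync_lock_test_and_set(&slots[idx], newp);
}

/* C11 style: pick the ordering you actually need; release is typical
 * when publishing a pointer so the node's contents are visible first.
 * Without _Atomic you could use
 * __atomic_store_n(&slots[idx], newp, __ATOMIC_RELEASE) instead. */
void put_c11(struct node *_Atomic *slots, size_t idx, struct node *newp)
{
    atomic_store_explicit(&slots[idx], newp, memory_order_release);
}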

Related

When is it ok to break up atomic Read-Modify-Write operations into constituent relaxed operations + barrier?

In the model-checked implementation of the CL-Deque, they use the following strategy to decrement the bottom pointer:
size_t b = atomic_load_explicit(&q->bottom, memory_order_relaxed) - 1;
Array *a = (Array *) atomic_load_explicit(&q->array, memory_order_relaxed);
atomic_store_explicit(&q->bottom, b, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
So they load the bottom pointer, decrement it locally, then store it. Why is it valid to do this? Wouldn't a concurrent thief be able to see either one of the bottom values?
An alternative way to perform this would be to combine the read-modify-write operation into a single atomic_fetch_sub, like so:
Array *a = (Array *) atomic_load_explicit(&q->array, memory_order_relaxed);
size_t b = atomic_fetch_sub_explicit(&q->bottom, 1, memory_order_seq_cst) - 1;
This would remove that possible race condition.
I think that the broken-up version is valid because the CAS inside the steal and take functions resolves this race later, and only if the deque is small enough for it to matter. But is this a general technique?
It's a single producer/multi-consumer queue, where the producer uses push/take operations and consumers use the "steal" operation. The producer is the only one who modifies the "bottom" variable. Therefore, using the atomic RMW fetch_sub is redundant. The consumer (thief) will see either the value before or after the store/sub operation, in any case. The important aspect is the transaction semantics, which, as you pointed out, are concluded with a CAS operation on both ends.
Here is the original paper.
A barrier after all the relaxed operations doesn't make them like a seq_cst operation; it won't have release semantics wrt. earlier loads/stores because there's no barrier separating them from it. If there's a later relaxed store after that, you could make it seq_cst, or if you're not sure of everything that's being ordered, keep the separate fence; fences are stronger than operations.
Breaking up a ++ or -- (or other RMW) is safe when you're the only thread that could be concurrently writing this variable (ever, or right now because we hold a lock or otherwise have exclusive ownership), e.g. in a SeqLock or single-producer queue.
Readers are fine, that's why you're using std::atomic. But by definition, readers can't step on our atomic-load/modify/atomic-store sequence. Thus we don't need RMW atomicity, only atomicity of the store itself, which is all that reader threads will be able to see. Whether that's the store side of an RMW or just a pure store doesn't matter.
If this isn't the case, you can't "repair" it later; the damage is potentially already done. You can however design a whole algorithm so that if there was a problem earlier, it can be detected later and you can bail out. That might be what you're talking about, but IDK, I didn't investigate the details of the code you linked, just wanted to answer the general question.
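As a minimal sketch of that single-writer case (the names are illustrative, not from the linked code): only the producer writes head, so the increment can be split into a relaxed load, a private computation, and a single atomic store that readers observe whole.

#include <stdatomic.h>
#include <stddef.h>

static atomic_size_t head;

void producer_advance(void)   /* only ever called from the one writer thread */
{
    size_t h = atomic_load_explicit(&head, memory_order_relaxed);
    h += 1;                                                  /* private computation */
    atomic_store_explicit(&head, h, memory_order_release);   /* publish */
}

size_t consumer_peek(void)    /* readers see either the old or the new value */
{
    return atomic_load_explicit(&head, memory_order_acquire);
}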

Is incrementing and returning an integer in a thread an atomic operation?

Consider following posix thread code:
void *thr(void *arg)
{
    static int i = 0;
    return ++i;
}
I think this is an atomic and safe operation, as it should translate into one assembly instruction, inc i, to change the value. However, I am not a great expert on CPU internals and multithreading, so I am doubtful.
EDIT: Since it is not safe, say I use a mutex lock to make this safe; how much extra cost would that be in terms of time?
EDIT: I got my answer from Overhead of pthread mutexes for time cost. Thanks a bunch.
It's neither atomic nor safe for many reasons:
No standard says it is. So if it happens to be, it's only by luck and you'd be crazy to rely on that.
Who says a single instruction is atomic? An increment requires a fetch, an add, and a store. On most CPUs, that's not atomic. (Why do you think x86 CPUs offer a LOCK prefix? If single instructions were already atomic, what purpose would it serve?)
The compiler can optimize in all kinds of crazy ways.
But more importantly, you are making a very fundamental mistake. It is impossible to conclude that something is safe merely because you can't think of any way it can fail. This will never work and you should never, ever do it. If you don't have guarantees, and you don't in this case, it's not safe.
Since it is not safe, say I use a mutex lock to make this safe; how much extra cost would that be in terms of time?
Probably very little. On most platforms, acquiring an uncontested lock is roughly comparable in cost to an atomic increment. Since you need an atomic increment anyway, you have to pay this cost somehow.
In other words, in the uncontested case (where two threads don't try to increment at the same time), a lock/increment/unlock will perform about the same as an atomic increment on most platforms. However, if there's a lot of contention, the atomic increment will perform much better.
If you don't have to deal with contention super-effectively, just use a lock and be done with it. If you do, then you should be using some platform or library that provides effective ways to deal with contention, such as a C++ implementation with std::atomic, a library that provides inline assembly atomic operations, or a compiler with atomic intrinsics.
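In C, a hedged sketch of the same counter done safely with C11 <stdatomic.h> (the cast in the return value mirrors the original code and is only for illustration):

#include <stdatomic.h>
#include <stdint.h>

void *thr(void *arg)
{
    static atomic_int i;                      /* static storage: zero-initialized */
    (void)arg;
    int v = atomic_fetch_add(&i, 1) + 1;      /* atomic RMW, seq_cst by default */
    return (void *)(intptr_t)v;
}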
The operation is not atomic. It implies reading the variable from memory (it's a variable with static storage), incrementing it and then storing back in memory.

multithreaded C/C++ variable no cache (Linux)

I use 2 pthreads, where one thread "notifies" the other one of an event, and for that there is a variable (a normal integer), which is set by the second thread.
This works, but my question is, is it possible that the update is not seen immediately by the first (reading) thread, meaning the cache is not updated directly? And if so, is there a way to prevent this behaviour, e.g. like the volatile keyword in java?
(the frequency which the event occurs is approximately in microsecond range, so more or less immediate update needs to be enforced).
/edit: 2nd question: is it possible to enforce that the variable is held in the cache of the core where thread 1 runs, since that thread is reading it all the time?
It sounds to me as though you should be using a pthread condition variable as your signaling mechanism. This takes care of all the issues you describe.
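A minimal sketch of that approach, assuming a simple boolean event (the names are illustrative, not the asker's code):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool event_pending = false;

void notify(void)                 /* thread 2: raise the event */
{
    pthread_mutex_lock(&lock);
    event_pending = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

void wait_for_event(void)         /* thread 1: wait for it */
{
    pthread_mutex_lock(&lock);
    while (!event_pending)        /* loop guards against spurious wakeups */
        pthread_cond_wait(&cond, &lock);
    event_pending = false;
    pthread_mutex_unlock(&lock);
}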
It may not be immediately visible to the other processors, but that won't be because of the cache (caches are kept coherent). The biggest visibility problems will be due to your processor's out-of-order execution or your compiler reordering instructions while optimizing.
In order to avoid both these problems, you have to use memory barriers. I believe that most pthread primitives are natural memory barriers which means that you shouldn't expect loads or stores to be moved beyond the boundaries formed by the lock and unlock calls. The volatile keyword can also be useful to disable a certain class of compiler optimizations that can be useful when doing lock-free algorithms but it's not a substitute for memory barriers.
That being said, I recommend you don't do this manually; there are quite a few pitfalls associated with lock-free algorithms. Leaving these headaches to library writers should make you a happier camper (unless you're like me and you love headaches :) ). So my final recommendation is to ignore everything I said and use what vromanov or David Heffman suggested.
The most appropriate way to pass a signal from one thread to another should be to use the runtime library's signalling mechanisms, such as mutexes, condition variables, semaphores, and so forth.
If these have too high an overhead, my first thought would be that there was something wrong with the structure of the program. If it turned out that this really was the bottleneck, and restructuring the program was inappropriate, then I would use atomic operations provided by the compiler or a suitable library.
Using plain int variables, or even volatile-qualified ones is error prone, unless the compiler guarantees they have the appropriate semantics. e.g. MSVC makes particular guarantees about the atomicity and ordering constraints of plain loads and stores to volatile variables, but gcc does not.
A better way is to use atomic variables. For example, you can use libatomic. The volatile keyword is not enough.

Are memory barriers necessary for atomic reference counting shared immutable data?

I have some immutable data structures that I would like to manage using reference counts, sharing them across threads on an SMP system.
Here's what the release code looks like:
void avocado_release(struct avocado *p)
{
    if (atomic_dec(p->refcount) == 0) {
        free(p->pit);
        free(p->juicy_innards);
        free(p);
    }
}
Does atomic_dec need a memory barrier in it? If so, what kind of memory barrier?
Additional notes: The application must run on PowerPC and x86, so any processor-specific information is welcomed. I already know about the GCC atomic builtins. As for immutability, the refcount is the only field that changes over the duration of the object.
On x86, it will turn into a lock prefixed assembly instruction, like LOCK XADD.
Being a single instruction, it is non-interruptible. As an added "feature", the lock prefix results in a full memory barrier:
"...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)." ..."Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor." - Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 8.1.2.
A memory barrier is in fact implemented as a dummy LOCK OR or LOCK AND in both the .NET and the Java JIT on x86/x64, because mfence is slower on many CPUs even when it's guaranteed to be available, like in 64-bit mode. (Does lock xchg have the same behavior as mfence?)
So you have a full fence on x86 as an added bonus, whether you like it or not. :-)
On PPC, it is different. An LL/SC pair - lwarx & stwcx - with a subtraction inside can be used to load the memory operand into a register, subtract one, then either write it back if there was no other store to the target location, or retry the whole loop if there was. An LL/SC can be interrupted (meaning it will fail and retry).
It also does not mean an automatic full fence.
This does not however compromise the atomicity of the counter in any way.
It just means that in the x86 case, you happen to get a fence as well, "for free".
On PPC, one can insert a (partial or) full fence by emitting a (lw)sync instruction.
All in all, explicit memory barriers are not necessary for the atomic counter to work properly.
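For concreteness, a hedged sketch of how atomic_dec (the question's hypothetical helper) could be written with the GCC builtins the asker mentions; on x86 the compiler typically emits a LOCK-prefixed instruction, and on PowerPC an lwarx/stwcx. retry loop:

/* Returns the new value; call as atomic_dec(&p->refcount). */
static inline int atomic_dec(int *refcount)
{
    return __atomic_sub_fetch(refcount, 1, __ATOMIC_SEQ_CST);
}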
It is important to distinguish between atomic accesses (which guarantee that the read/modify/write of the value executes as one atomic unit) vs. memory reordering.
Memory barriers prevent reordering of reads and writes. Reordering is completely orthogonal to atomicity. For instance, on PowerPC if you implement the most efficient atomic increment possible then it will not prevent reordering. If you want to prevent reordering then you need an lwsync or sync instruction, or some equivalent high-level (C++ 11?) memory barrier.
Claims that there is "no possibility of the compiler reordering things in a problematic way" seem naive as general statements because compiler optimizations can be quite surprising and because CPUs (PowerPC/ARM/Alpha/MIPS in particular) aggressively reorder memory operations.
A coherent cache doesn't save you either. See https://preshing.com/archives/ to see how memory reordering really works.
In this case, however, I believe the answer is that no barriers are required. That is because for this specific case (reference counting) there is no need for a relationship between the reference count and the other values in the object. The one exception is when the reference count hits zero. At that point it is important to ensure that all updates from other threads are visible to the current thread so a read-acquire barrier may be necessary.
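In C11 terms, that recipe is usually written as a release decrement paired with an acquire fence taken only when the count hits zero; a hedged sketch using the question's struct (assuming the refcount field is made atomic):

#include <stdatomic.h>
#include <stdlib.h>

struct avocado {
    atomic_int refcount;
    void *pit;
    void *juicy_innards;
};

void avocado_release(struct avocado *p)
{
    /* release: this thread's earlier writes happen-before the count reaching zero */
    if (atomic_fetch_sub_explicit(&p->refcount, 1, memory_order_release) == 1) {
        /* acquire: see every other thread's writes before tearing the object down */
        atomic_thread_fence(memory_order_acquire);
        free(p->pit);
        free(p->juicy_innards);
        free(p);
    }
}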
Are you intending to implement your own atomic_dec or are you just wondering whether a system-supplied function will behave as you want?
As a general rule, system-supplied atomic increment/decrement facilities will apply whatever memory barriers are required to just do the right thing. You generally don't have to worry about memory barriers unless you are doing something wacky like implementing your own lock-free data structures or an STM library.

Using C/Pthreads: do shared variables need to be volatile?

Using the C programming language and Pthreads as the threading library, do variables/structures that are shared between threads need to be declared as volatile? Assuming that they might be protected by a lock or not (barriers perhaps).
Does the pthread POSIX standard have any say about this, is this compiler-dependent or neither?
Edit to add: Thanks for the great answers. But what if you're not using locks; what if you're using barriers for example? Or code that uses primitives such as compare-and-swap to directly and atomically modify a shared variable...
As long as you are using locks to control access to the variable, you do not need volatile on it. In fact, if you're putting volatile on any variable you're probably already wrong.
https://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming/
The answer is absolutely, unequivocally, NO. You do not need to use 'volatile' in addition to proper synchronization primitives. Everything that needs to be done is done by these primitives.
The use of 'volatile' is neither necessary nor sufficient. It's not necessary because the proper synchronization primitives are sufficient. It's not sufficient because it only disables some optimizations, not all of the ones that might bite you. For example, it does not guarantee either atomicity or visibility on another CPU.
But unless you use volatile, the compiler is free to cache the shared data in a register for any length of time... if you want your data to be predictably written to actual memory and not just cached in a register by the compiler at its discretion, you will need to mark it as volatile. Alternatively, if you only access the shared data after you have left a function modifying it, you might be fine. But I would suggest not relying on blind luck to make sure that values are written back from registers to memory.
Right, but even if you do use volatile, the CPU is free to cache the shared data in a write posting buffer for any length of time. The set of optimizations that can bite you is not precisely the same as the set of optimizations that 'volatile' disables. So if you use 'volatile', you are relying on blind luck.
On the other hand, if you use synchronization primitives with defined multi-threaded semantics, you are guaranteed that things will work. As a plus, you don't take the huge performance hit of 'volatile'. So why not do things that way?
I think one very important property of volatile is that it makes the variable be written to memory when modified, and reread from memory each time it is accessed. The other answers here mix volatile and synchronization, and it is clear from answers other than this one that volatile is NOT a sync primitive (credit where credit is due).
But unless you use volatile, the compiler is free to cache the shared data in a register for any length of time... if you want your data to be predictably written to actual memory and not just cached in a register by the compiler at its discretion, you will need to mark it as volatile. Alternatively, if you only access the shared data after you have left a function modifying it, you might be fine. But I would suggest not relying on blind luck to make sure that values are written back from registers to memory.
Especially on register-rich machines (i.e., not x86), variables can live for quite long periods in registers, and a good compiler can cache even parts of structures or entire structures in registers. So you should use volatile, but for performance, also copy values to local variables for computation and then do an explicit write-back. Essentially, using volatile efficiently means doing a bit of load-store thinking in your C code.
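A minimal sketch of that "load-store thinking" (the names are illustrative; note this only controls what the compiler emits, it does not order anything with respect to other CPUs):

volatile int shared_progress;

void update_progress(int work_done)
{
    int local = shared_progress;   /* one explicit load from memory */
    local += work_done;            /* compute on the local copy */
    shared_progress = local;       /* one explicit write-back */
}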
In any case, you positively have to use some kind of OS-level provided sync mechanism to create a correct program.
For an example of the weakness of volatile, see my Dekker's algorithm example at http://jakob.engbloms.se/archives/65, which proves pretty well that volatile does not work to synchronize.
There is a widespread notion that the keyword volatile is good for multi-threaded programming.
Hans Boehm points out that there are only three portable uses for volatile:
volatile may be used to mark local variables in the same scope as a setjmp whose value should be preserved across a longjmp. It is unclear what fraction of such uses would be slowed down, since the atomicity and ordering constraints have no effect if there is no way to share the local variable in question. (It is even unclear what fraction of such uses would be slowed down by requiring all variables to be preserved across a longjmp, but that is a separate matter and is not considered here.)
volatile may be used when variables may be "externally modified", but the modification in fact is triggered synchronously by the thread itself, e.g. because the underlying memory is mapped at multiple locations.
A volatile sigatomic_t may be used to communicate with a signal handler in the same thread, in a restricted manner. One could consider weakening the requirements for the sigatomic_t case, but that seems rather counterintuitive.
If you are multi-threading for the sake of speed, slowing down code is definitely not what you want. For multi-threaded programming, there are two key issues that volatile is often mistakenly thought to address:
atomicity
memory consistency, i.e. the order of a thread's operations as seen by another thread.
Let's deal with (1) first. Volatile does not guarantee atomic reads or writes. For example, a volatile read or write of a 129-bit structure is not going to be atomic on most modern hardware. A volatile read or write of a 32-bit int is atomic on most modern hardware, but volatile has nothing to do with it. It would likely be atomic without the volatile. The atomicity is at the whim of the compiler. There's nothing in the C or C++ standards that says it has to be atomic.
Now consider issue (2). Sometimes programmers think of volatile as turning off optimization of volatile accesses. That's largely true in practice. But that's only the volatile accesses, not the non-volatile ones. Consider this fragment:
volatile int Ready;
int Message[100];
void foo( int i ) {
    Message[i/10] = 42;
    Ready = 1;
}
It's trying to do something very reasonable in multi-threaded programming: write a message and then send it to another thread. The other thread will wait until Ready becomes non-zero and then read Message. Try compiling this with "gcc -O2 -S" using gcc 4.0, or icc. Both will do the store to Ready first, so it can be overlapped with the computation of i/10. The reordering is not a compiler bug. It's an aggressive optimizer doing its job.
You might think the solution is to mark all your memory references volatile. That's just plain silly. As the earlier quotes say, it will just slow down your code. Worse yet, it might not fix the problem. Even if the compiler does not reorder the references, the hardware might. In this example, x86 hardware will not reorder it. Neither will an Itanium(TM) processor, because Itanium compilers insert memory fences for volatile stores. That's a clever Itanium extension. But chips like Power(TM) will reorder. What you really need for ordering are memory fences, also called memory barriers. A memory fence prevents reordering of memory operations across the fence, or in some cases, prevents reordering in one direction. Volatile has nothing to do with memory fences.
So what's the solution for multi-threaded programming? Use a library or language extension that implements the atomic and fence semantics. When used as intended, the operations in the library will insert the right fences. Some examples:
POSIX threads
Windows(TM) threads
OpenMP
TBB
Based on an article by Arch Robison (Intel).
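As a hedged illustration of the fix the article calls for, here is the same fragment rewritten with C11 atomics: the release store on Ready keeps both the compiler and the hardware from moving the Message write past it, and the reader pairs it with an acquire load (the try_read helper is an assumption added for completeness).

#include <stdatomic.h>

atomic_int Ready;
int Message[100];

void foo(int i)
{
    Message[i / 10] = 42;
    atomic_store_explicit(&Ready, 1, memory_order_release);
}

int try_read(int i, int *out)
{
    if (atomic_load_explicit(&Ready, memory_order_acquire)) {
        *out = Message[i / 10];
        return 1;
    }
    return 0;
}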
In my experience, no; you just have to properly mutex yourself when you write to those values, or structure your program such that the threads will stop before they need to access data that depends on another thread's actions. My project, x264, uses this method; threads share an enormous amount of data but the vast majority of it doesn't need mutexes because it's either read-only or a thread will wait for the data to become available and finalized before it needs to access it.
Now, if you have many threads that are all heavily interleaved in their operations (they depend on each others' output on a very fine-grained level), this may be a lot harder--in fact, in such a case I'd consider revisiting the threading model to see if it can possibly be done more cleanly with more separation between threads.
NO.
Volatile is only required when reading a memory location that can change independently of the CPU's read/write commands. In a threading situation, the CPU is in full control of reads/writes to memory for each thread, so the compiler can assume the memory is coherent and optimize the CPU instructions to reduce unnecessary memory access.
The primary usage for volatile is for accessing memory-mapped I/O. In this case, the underlying device can change the value of a memory location independently from CPU. If you do not use volatile under this condition, the CPU may use a previously cached memory value, instead of reading the newly updated value.
POSIX 7 guarantees that functions such as pthread_mutex_lock also synchronize memory
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11 "4.12 Memory Synchronization" says:
The following functions synchronize memory with respect to other threads:
pthread_barrier_wait()
pthread_cond_broadcast()
pthread_cond_signal()
pthread_cond_timedwait()
pthread_cond_wait()
pthread_create()
pthread_join()
pthread_mutex_lock()
pthread_mutex_timedlock()
pthread_mutex_trylock()
pthread_mutex_unlock()
pthread_spin_lock()
pthread_spin_trylock()
pthread_spin_unlock()
pthread_rwlock_rdlock()
pthread_rwlock_timedrdlock()
pthread_rwlock_timedwrlock()
pthread_rwlock_tryrdlock()
pthread_rwlock_trywrlock()
pthread_rwlock_unlock()
pthread_rwlock_wrlock()
sem_post()
sem_timedwait()
sem_trywait()
sem_wait()
semctl()
semop()
wait()
waitpid()
Therefore if your variable is guarded between pthread_mutex_lock and pthread_mutex_unlock then it does not need further synchronization as you might attempt to provide with volatile.
Related questions:
Does guarding a variable with a pthread mutex guarantee it's also not cached?
Does pthread_mutex_lock contains memory fence instruction?
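A minimal sketch of that guarded pattern (the names are illustrative): POSIX guarantees the lock/unlock pair synchronizes memory, so the shared variable needs no volatile.

#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int shared_value;          /* deliberately not volatile */

void set_value(int v)
{
    pthread_mutex_lock(&m);
    shared_value = v;
    pthread_mutex_unlock(&m);
}

int get_value(void)
{
    pthread_mutex_lock(&m);
    int v = shared_value;
    pthread_mutex_unlock(&m);
    return v;
}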
Volatile would only be useful if you need absolutely no delay between when one thread writes something and another thread reads it. Without some sort of lock, though, you have no idea of when the other thread wrote the data, only that it's the most recent possible value.
For simple values (int and float in their various sizes) a mutex might be overkill if you don't need an explicit synch point. If you don't use a mutex or lock of some sort, you should declare the variable volatile. If you use a mutex you're all set.
For complicated types, you must use a mutex. Operations on them are non-atomic, so you could read a half-changed version without a mutex.
Volatile means that we have to go to memory to get or set this value. If you don't set volatile, the compiled code might store the data in a register for a long time.
What this means is that you should mark variables that you share between threads as volatile so that you don't have situations where one thread starts modifying the value but doesn't write its result before a second thread comes along and tries to read the value.
Volatile is a compiler hint that disables certain optimizations. The output assembly of the compiler might have been safe without it but you should always use it for shared values.
This is especially important if you are NOT using the expensive thread sync objects provided by your system - you might for example have a data structure where you can keep it valid with a series of atomic changes. Many stacks that do not allocate memory are examples of such data structures, because you can add a value to the stack then move the end pointer or remove a value from the stack after moving the end pointer. When implementing such a structure, volatile becomes crucial to ensure that your atomic instructions are actually atomic.
The underlying reason is that the C language semantic is based upon a single-threaded abstract machine. And the compiler is within its own right to transform the program as long as the program's 'observable behaviors' on the abstract machine stay unchanged. It can merge adjacent or overlapping memory accesses, redo a memory access multiple times (upon register spilling for example), or simply discard a memory access, if it thinks the program's behaviors, when executed in a single thread, doesn't change. Therefore as you may suspect, the behaviors do change if the program is actually supposed to be executing in a multi-threaded way.
As Paul McKenney pointed out in a famous Linux kernel document (Documentation/memory-barriers.txt):
It _must_not_ be assumed that the compiler will do what you want
with memory references that are not protected by READ_ONCE() and
WRITE_ONCE(). Without them, the compiler is within its rights to
do all sorts of "creative" transformations, which are covered in
the COMPILER BARRIER section.
READ_ONCE() and WRITE_ONCE() are defined as volatile casts on referenced variables. Thus:
int y;
int x = READ_ONCE(y);
is equivalent to:
int y;
int x = *(volatile int *)&y;
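For reference, a simplified sketch of how such macros can be defined (the real kernel definitions add size checks and other plumbing; this is only to make the volatile-cast idea concrete):

#define READ_ONCE(x)      (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))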
So, unless you make a 'volatile' access, you are not assured that the access happens exactly once, no matter what synchronization mechanism you are using. Calling an external function (pthread_mutex_lock for example) may force the compiler to do memory accesses to global variables. But this happens only when the compiler fails to figure out whether the external function changes these global variables or not. Modern compilers employing sophisticated inter-procedural analysis and link-time optimization make this trick unreliable.
In summary, you should mark variables shared by multiple threads volatile or access them using volatile casts.
As Paul McKenney has also pointed out:
I have seen the glint in their eyes when they discuss optimization techniques that you would not want your children to know about!
But see what happens with C11/C++11.
Some people obviously are assuming that the compiler treats the synchronization calls as memory barriers. "Casey" is assuming there is exactly one CPU.
If the sync primitives are external functions and the symbols in question are visible outside the compilation unit (global names, exported pointer, exported function that may modify them) then the compiler will treat them -- or any other external function call -- as a memory fence with respect to all externally visible objects.
Otherwise, you are on your own. And volatile may be the best tool available for making the compiler produce correct, fast code. It generally won't be portable though, when you need volatile and what it actually does for you depends a lot on the system and compiler.
No.
First, volatile is not necessary. There are numerous other operations that provide guaranteed multithreaded semantics that don't use volatile. These include atomic operations, mutexes, and so on.
Second, volatile is not sufficient. The C standard does not provide any guarantees about multithreaded behavior for variables declared volatile.
So being neither necessary nor sufficient, there's not much point in using it.
One exception would be particular platforms (such as Visual Studio) where it does have documented multithreaded semantics.
Variables that are shared among threads should be declared 'volatile'. This tells the compiler that when one thread writes to such variables, the write should be to memory (as opposed to a register).

Resources