g_atomic_int_get guarantees visibility into another thread? - c

This is related to this question.
rmmh's claim on that question was that on certain architectures, no special magic is needed to implement atomic get and set. Specifically, in this case that g_atomic_int_get(&foo) from glib gets expanded simply to (*(&foo)). I understand that this means that foo will not be in an internally consistent state. However am I also guaranteed that foo won't be cached by a given CPU or core?
Specifically, if one thread is setting foo, and another reading it (using the glib g_atomic_* functions), can I assume that the reader will see the updates to the variable made by the writer. Or is it possible for the writer to simply update the value in a register? For reference my target platform is gcc (4.1.2) on a multi-core multi-CPU X86_64 machine.

What most architecture ensures (included) is atomicity and coherence of reads and writes of suitably sized and aligned read/write (so every processors see a subsequence of the same master sequence of values for a given memory adress (*)), and int is most probably suitably size and compilers generally ensure that they are also correctly aligned.
But compilers rarely ensures that they aren't optimizing out some reads or writes if they aren't marked in a way or another. I've tried to compile:
int f(int* ptr)
{
int i, r=0;
*ptr = 5;
for (i = 0; i < 100; ++i) {
r += i*i;
}
*ptr = r;
return *ptr;
}
with gcc 4.1.2 gcc optimized out without problem the first write to *ptr, something you probably don't want for an atomic write.
(*) Coherence is not to be confused with consistency: the relationship between reads and writes at different address is often relaxed with respect to the intuitive, but costly to implement, sequential consistency. That's why memory barriers are needed.

Volatile will only ensure that the compiler doesn't use a register to hold the variable. Volatile will not prevent the compiler from re-ordering code; although, it might act as a hint to not reorder.
Depending on the architecture, certain instructions are atomic. writing to an integer and reading from an integer are often atomic. If gcc uses atomic instructions for reading/writing to/from an integer memory location, there will be no "intermediate garbage" read by one thread if another thread is in the middle of a write.
But, you might run into problems because of compiler reordering and instruction reordering.
With optimizations enabled, gcc reorders code. Gcc usually doesn't reorder code when global variables or function calls are involved since gcc can't guarantee the same outcome. Volatile might act as a hint to gcc wrt reordering, but I don't know. If you do run into reordering problems, this will act as a general purpose compiler barrier for gcc:
__asm__ __volatile__ ("" ::: "memory");
Even if the compiler doesn't reorder code, the CPU constantly reorders instructions during execution. Here is a very good article on the subject. A "memory barrier" is used to prevent the cpu from reordering instructions over a barrier. Here is one possible way to make a memory barrier using gcc:
__sync_synchronize();
You can also execute asm instructions to do different kinds of barriers.
That said, we read and write global integers without using atomic operations or mutexes from multiple threads and have no problems. This is most likely because A) we run on Intel and Intel does not reorder instructions aggressively and B) there is enough code executing before we do something "bad" with an early read of a flag. Also in our favor is the fact that a lot of system calls have barriers and the gcc atomic operations are barriers. We use a lot of atomic operations.
Here is a good discussion in stack overflow of a similar question.

Related

Is assigning a pointer in C program considered atomic on x86-64

https://www.gnu.org/software/libc/manual/html_node/Atomic-Types.html#Atomic-Types says - In practice, you can assume that int is atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these assumptions are true on all of the machines that the GNU C Library supports and on all POSIX systems we know of.
My question is whether pointer assignment can be considered atomic on x86_64 architecture for a C program compiled with gcc m64 flag. OS is 64bit Linux and CPU is Intel(R) Xeon(R) CPU D-1548. One thread will be setting a pointer and another thread accessing the pointer. There is only one writer thread and one reader thread. Reader should either be getting the previous value of the pointer or the latest value and no garbage value in between.
If it is not considered atomic, please let me know how can I use the gcc atomic builtins or maybe memory barrier like __sync_synchronize to achieve the same without using locks. Interested only in C solution and not C++. Thanks!
Bear in mind that atomicity alone is not enough for communicating between threads. Nothing prevents the compiler and CPU from reordering previous/subsequent load and store instructions with that "atomic" store. In old days people used volatile to prevent that reordering but that was never intended for use with threads and doesn't provide means to specify less or more restrictive memory order (see "Relationship with volatile" in there).
You should use C11 atomics because they guarantee both atomicity and memory order.
For almost all architectures, pointer load and store are atomic. A once notable exception was 8086/80286 where pointers could be seg:offset; there was an l[des]s instruction which could make an atomic load; but no corresponding atomic store.
The integrity of the pointer is only a small concern; your bigger issue revolves around synchronization: the pointer was at value Y, you set it to X; how will you know when nobody is using the (old) Y value?
A somewhat related problem is that you may have stored things at X, which the other thread expects to find. Without synchronization, other might see the new pointer value, however what it points to might not be up to date yet.
A plain global char *ptr should not be considered atomic. It might work sometimes, especially with optimization disabled, but you can get the compiler to make safe and efficient optimized asm by using modern language features to tell it you want atomicity.
Use C11 stdatomic.h or GNU C __atomic builtins. And see Why is integer assignment on a naturally aligned variable atomic on x86? - yes the underlying asm operations are atomic "for free", but you need to control the compiler's code-gen to get sane behaviour for multithreading.
See also LWN: Who's afraid of a big bad optimizing compiler? - weird effects of using plain vars include several really bad well-known things, but also more obscure stuff like invented loads, reading a variable more than once if the compiler decides to optimize away a local tmp and load the shared var twice, instead of loading it into a register. Using asm("" ::: "memory") compiler barriers may not be sufficient to defeat that depending on where you put them.
So use proper atomic stores and loads that tell the compiler what you want: You should generally use atomic loads to read them, too.
#include <stdatomic.h> // C11 way
_Atomic char *c11_shared_var; // all access to this is atomic, functions needed only if you want weaker ordering
void foo(){
atomic_store_explicit(&c11_shared_var, newval, memory_order_relaxed);
}
char *plain_shared_var; // GNU C
// This is a plain C var. Only specific accesses to it are atomic; be careful!
void foo() {
__atomic_store_n(&plain_shared_var, newval, __ATOMIC_RELAXED);
}
Using __atomic_store_n on a plain var is the functionality that C++20 atomic_ref exposes. If multiple threads access a variable for the entire time that it needs to exist, you might as well just use C11 stdatomic because every access needs to be atomic (not optimized into a register or whatever). When you want to let the compiler load once and reuse that value, do char *tmp = c11_shared_var; (or atomic_load_explicit if you only want acquire instead of seq_cst; cheaper on a few non-x86 ISAs).
Besides lack of tearing (atomicity of asm load or store), the other key parts of _Atomic foo * are:
The compiler will assume that other threads may have changed memory contents (like volatile effectively implies), otherwise the assumption of no data-race UB will let the compiler hoist loads out of loops. Without this, dead-store elimination might only do one store at the end of a loop, not updating the value multiple times.
The read side of the problem is usually what bites people in practice, see Multithreading program stuck in optimized mode but runs normally in -O0 - e.g. while(!flag){} becomes if(!flag) infinite_loop; with optimization enabled.
Ordering wrt. other code. e.g. you can use memory_order_release to make sure that other threads that see the pointer update also see all changes to the pointed-to data. (On x86 that's as simple as compile-time ordering, no extra barriers needed for acquire/release, only for seq_cst. Avoid seq_cst if you can; mfence or locked operations are slow.)
Guarantee that the store will compile to a single asm instruction. You'd be depending on this. It does happen in practice with sane compilers, although it's conceivable that a compiler might decide to use rep movsb to copy a few contiguous pointers, and that some machine somewhere might have a microcoded implementation that does some stores narrower than 8 bytes.
(This failure mode is highly unlikely; the Linux kernel relies on volatile load/store compiling to a single instruction with GCC / clang for its hand-rolled intrinsics. But if you just used asm("" ::: "memory") to make sure a store happened on a non-volatile variable, there's a chance.)
Also, something like ptr++ will compile to an atomic RMW operation like lock add qword [mem], 4, rather than separate load and store like volatile would. (See Can num++ be atomic for 'int num'? for more about atomic RMWs). Avoid that if you don't need it, it's slower. e.g. atomic_store_explicit(&ptr, ptr + 1, mo_release); - seq_cst loads are cheap on x86-64 but seq_cst stores aren't.
Also note that memory barriers can't create atomicity (lack of tearing), they can only create ordering wrt other ops.
In practice x86-64 ABIs do have alignof(void*) = 8 so all pointer objects should be naturally aligned (except in a __attribute__((packed)) struct which violates the ABI, so you can use __atomic_store_n on them. It should compile to what you want (plain store, no overhead), and meet the asm requirements to be atomic.
See also When to use volatile with multi threading? - you can roll your own atomics with volatile and asm memory barriers, but don't. The Linux kernel does that, but it's a lot of effort for basically no gain, especially for a user-space program.
Side note: an often repeated misconception is that volatile or _Atomic are needed to avoid reading stale values from cache. This is not the case.
All machines that run C11 threads across multiple cores have coherent caches, not needing explicit flush instructions in the reader or writer. Just ordinary load or store instructions, like x86 mov. The key is to not let the compiler keep values of shared variable in CPU registers (which are thread-private). It normally can do this optimization because of the assumption of no data-race Undefined Behaviour. Registers are very much not the same thing as L1d CPU cache; managing what's in registers vs. memory is done by the compiler, while hardware keeps cache in sync. See When to use volatile with multi threading? for more details about why coherent caches is sufficient to make volatile work like memory_order_relaxed.
See Multithreading program stuck in optimized mode but runs normally in -O0 for an example.
"Atomic" is treated as this quantum state where something can be both atomic and not atomic at the same time because "it's possible" that "some machines" "somewhere" "might not" write "a certain value" atomically. Maybe.
That is not the case. Atomicity has a very specific meaning, and it solves a very specific problem: threads being pre-empted by the OS to schedule another thread in its place on that core. And you cannot stop a thread from executing mid-assembly instruction.
What that means is that any single assembly instruction is "atomic" by definition. And since you have registry moving instructions, any register-sized copy is atomic by definition. That means a 32-bit integer on a 32-bit CPU, and a 64-bit integer on a 64-bit CPU are all atomic -- and of course that includes pointers (ignore all the people who will tell you "some architectures" have pointers of "different size" than registers, that hasn't been the case since 386).
You should however be careful not to hit variable caching problems (ie one thread writing a pointer, and another trying to read it but getting an old value from the cache), use volatile as needed to prevent this.

c - volatile on heap variable in multithreaded application [duplicate]

If there are two threads accessing a global variable then many tutorials say make the variable volatile to prevent the compiler caching the variable in a register and it thus not getting updated correctly.
However two threads both accessing a shared variable is something which calls for protection via a mutex isn't it?
But in that case, between the thread locking and releasing the mutex the code is in a critical section where only that one thread can access the variable, in which case the variable doesn't need to be volatile?
So therefore what is the use/purpose of volatile in a multi-threaded program?
Short & quick answer: volatile is (nearly) useless for platform-agnostic, multithreaded application programming. It does not provide any synchronization, it does not create memory fences, nor does it ensure the order of execution of operations. It does not make operations atomic. It does not make your code magically thread safe. volatile may be the single-most misunderstood facility in all of C++. See this, this and this for more information about volatile
On the other hand, volatile does have some use that may not be so obvious. It can be used much in the same way one would use const to help the compiler show you where you might be making a mistake in accessing some shared resource in a non-protected way. This use is discussed by Alexandrescu in this article. However, this is basically using the C++ type system in a way that is often viewed as a contrivance and can evoke Undefined Behavior.
volatile was specifically intended to be used when interfacing with memory-mapped hardware, signal handlers, and the setjmp machine code instruction. This makes volatile directly applicable to systems-level programming rather than normal applications-level programming.
The 2003 C++ Standard does not say that volatile applies any kind of Acquire or Release semantics on variables. In fact, the Standard is completely silent on all matters of multithreading. However, specific platforms do apply Acquire and Release semantics on volatile variables.
[Update for C++11]
The C++11 Standard now does acknowledge multithreading directly in the memory model and the language, and it provides library facilities to deal with it in a platform-independent way. However the semantics of volatile still have not changed. volatile is still not a synchronization mechanism. Bjarne Stroustrup says as much in TCPPPL4E:
Do not use volatile except in low-level code that deals directly
with hardware.
Do not assume volatile has special meaning in the memory model. It
does not. It is not -- as in some later languages -- a
synchronization mechanism. To get synchronization, use atomic, a
mutex, or a condition_variable.
[/End update]
The above all applies to the C++ language itself, as defined by the 2003 Standard (and now the 2011 Standard). Some specific platforms however do add additional functionality or restrictions to what volatile does. For example, in MSVC 2010 (at least) Acquire and Release semantics do apply to certain operations on volatile variables. From the MSDN:
When optimizing, the compiler must maintain ordering among references
to volatile objects as well as references to other global objects. In
particular,
A write to a volatile object (volatile write) has Release semantics; a
reference to a global or static object that occurs before a write to a
volatile object in the instruction sequence will occur before that
volatile write in the compiled binary.
A read of a volatile object (volatile read) has Acquire semantics; a
reference to a global or static object that occurs after a read of
volatile memory in the instruction sequence will occur after that
volatile read in the compiled binary.
However, you might take note of the fact that if you follow the above link, there is some debate in the comments as to whether or not acquire/release semantics actually apply in this case.
In C++11, don't use volatile for threading, only for MMIO
But TL:DR, it does "work" sort of like atomic with mo_relaxed on hardware with coherent caches (i.e. everything); it is sufficient to stop compilers keeping vars in registers. atomic doesn't need memory barriers to create atomicity or inter-thread visibility, only to make the current thread wait before/after an operation to create ordering between this thread's accesses to different variables. mo_relaxed never needs any barriers, just load, store, or RMW.
For roll-your-own atomics with volatile (and inline-asm for barriers) in the bad old days before C++11 std::atomic, volatile was the only good way to get some things to work. But it depended on a lot of assumptions about how implementations worked and was never guaranteed by any standard.
For example the Linux kernel still uses its own hand-rolled atomics with volatile, but only supports a few specific C implementations (GNU C, clang, and maybe ICC). Partly that's because of GNU C extensions and inline asm syntax and semantics, but also because it depends on some assumptions about how compilers work.
It's almost always the wrong choice for new projects; you can use std::atomic (with std::memory_order_relaxed) to get a compiler to emit the same efficient machine code you could with volatile. std::atomic with mo_relaxed obsoletes volatile for threading purposes. (except maybe to work around missed-optimization bugs with atomic<double> on some compilers.)
The internal implementation of std::atomic on mainstream compilers (like gcc and clang) does not just use volatile internally; compilers directly expose atomic load, store and RMW builtin functions. (e.g. GNU C __atomic builtins which operate on "plain" objects.)
Volatile is usable in practice (but don't do it)
That said, volatile is usable in practice for things like an exit_now flag on all(?) existing C++ implementations on real CPUs, because of how CPUs work (coherent caches) and shared assumptions about how volatile should work. But not much else, and is not recommended. The purpose of this answer is to explain how existing CPUs and C++ implementations actually work. If you don't care about that, all you need to know is that std::atomic with mo_relaxed obsoletes volatile for threading.
(The ISO C++ standard is pretty vague on it, just saying that volatile accesses should be evaluated strictly according to the rules of the C++ abstract machine, not optimized away. Given that real implementations use the machine's memory address-space to model C++ address space, this means volatile reads and assignments have to compile to load/store instructions to access the object-representation in memory.)
As another answer points out, an exit_now flag is a simple case of inter-thread communication that doesn't need any synchronization: it's not publishing that array contents are ready or anything like that. Just a store that's noticed promptly by a not-optimized-away load in another thread.
// global
bool exit_now = false;
// in one thread
while (!exit_now) { do_stuff; }
// in another thread, or signal handler in this thread
exit_now = true;
Without volatile or atomic, the as-if rule and assumption of no data-race UB allows a compiler to optimize it into asm that only checks the flag once, before entering (or not) an infinite loop. This is exactly what happens in real life for real compilers. (And usually optimize away much of do_stuff because the loop never exits, so any later code that might have used the result is not reachable if we enter the loop).
// Optimizing compilers transform the loop into asm like this
if (!exit_now) { // check once before entering loop
while(1) do_stuff; // infinite loop
}
Multithreading program stuck in optimized mode but runs normally in -O0 is an example (with description of GCC's asm output) of how exactly this happens with GCC on x86-64. Also MCU programming - C++ O2 optimization breaks while loop on electronics.SE shows another example.
We normally want aggressive optimizations that CSE and hoist loads out of loops, including for global variables.
Before C++11, volatile bool exit_now was one way to make this work as intended (on normal C++ implementations). But in C++11, data-race UB still applies to volatile so it's not actually guaranteed by the ISO standard to work everywhere, even assuming HW coherent caches.
Note that for wider types, volatile gives no guarantee of lack of tearing. I ignored that distinction here for bool because it's a non-issue on normal implementations. But that's also part of why volatile is still subject to data-race UB instead of being equivalent to relaxed atomic.
Note that "as intended" doesn't mean the thread doing exit_now waits for the other thread to actually exit. Or even that it waits for the volatile exit_now=true store to even be globally visible before continuing to later operations in this thread. (atomic<bool> with the default mo_seq_cst would make it wait before any later seq_cst loads at least. On many ISAs you'd just get a full barrier after the store).
C++11 provides a non-UB way that compiles the same
A "keep running" or "exit now" flag should use std::atomic<bool> flag with mo_relaxed
Using
flag.store(true, std::memory_order_relaxed)
while( !flag.load(std::memory_order_relaxed) ) { ... }
will give you the exact same asm (with no expensive barrier instructions) that you'd get from volatile flag.
As well as no-tearing, atomic also gives you the ability to store in one thread and load in another without UB, so the compiler can't hoist the load out of a loop. (The assumption of no data-race UB is what allows the aggressive optimizations we want for non-atomic non-volatile objects.) This feature of atomic<T> is pretty much the same as what volatile does for pure loads and pure stores.
atomic<T> also make += and so on into atomic RMW operations (significantly more expensive than an atomic load into a temporary, operate, then a separate atomic store. If you don't want an atomic RMW, write your code with a local temporary).
With the default seq_cst ordering you'd get from while(!flag), it also adds ordering guarantees wrt. non-atomic accesses, and to other atomic accesses.
(In theory, the ISO C++ standard doesn't rule out compile-time optimization of atomics. But in practice compilers don't because there's no way to control when that wouldn't be ok. There are a few cases where even volatile atomic<T> might not be enough control over optimization of atomics if compilers did optimize, so for now compilers don't. See Why don't compilers merge redundant std::atomic writes? Note that wg21/p0062 recommends against using volatile atomic in current code to guard against optimization of atomics.)
volatile does actually work for this on real CPUs (but still don't use it)
even with weakly-ordered memory models (non-x86). But don't actually use it, use atomic<T> with mo_relaxed instead!! The point of this section is to address misconceptions about how real CPUs work, not to justify volatile. If you're writing lockless code, you probably care about performance. Understanding caches and the costs of inter-thread communication is usually important for good performance.
Real CPUs have coherent caches / shared memory: after a store from one core becomes globally visible, no other core can load a stale value. (See also Myths Programmers Believe about CPU Caches which talks some about Java volatiles, equivalent to C++ atomic<T> with seq_cst memory order.)
When I say load, I mean an asm instruction that accesses memory. That's what a volatile access ensures, and is not the same thing as lvalue-to-rvalue conversion of a non-atomic / non-volatile C++ variable. (e.g. local_tmp = flag or while(!flag)).
The only thing you need to defeat is compile-time optimizations that don't reload at all after the first check. Any load+check on each iteration is sufficient, without any ordering. Without synchronization between this thread and the main thread, it's not meaningful to talk about when exactly the store happened, or ordering of the load wrt. other operations in the loop. Only when it's visible to this thread is what matters. When you see the exit_now flag set, you exit. Inter-core latency on a typical x86 Xeon can be something like 40ns between separate physical cores.
In theory: C++ threads on hardware without coherent caches
I don't see any way this could be remotely efficient, with just pure ISO C++ without requiring the programmer to do explicit flushes in the source code.
In theory you could have a C++ implementation on a machine that wasn't like this, requiring compiler-generated explicit flushes to make things visible to other threads on other cores. (Or for reads to not use a maybe-stale copy). The C++ standard doesn't make this impossible, but C++'s memory model is designed around being efficient on coherent shared-memory machines. E.g. the C++ standard even talks about "read-read coherence", "write-read coherence", etc. One note in the standard even points the connection to hardware:
http://eel.is/c++draft/intro.races#19
[ Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note ]
There's no mechanism for a release store to only flush itself and a few select address-ranges: it would have to sync everything because it wouldn't know what other threads might want to read if their acquire-load saw this release-store (forming a release-sequence that establishes a happens-before relationship across threads, guaranteeing that earlier non-atomic operations done by the writing thread are now safe to read. Unless it did further writes to them after the release store...) Or compilers would have to be really smart to prove that only a few cache lines needed flushing.
Related: my answer on Is mov + mfence safe on NUMA? goes into detail about the non-existence of x86 systems without coherent shared memory. Also related: Loads and stores reordering on ARM for more about loads/stores to the same location.
There are I think clusters with non-coherent shared memory, but they're not single-system-image machines. Each coherency domain runs a separate kernel, so you can't run threads of a single C++ program across it. Instead you run separate instances of the program (each with their own address space: pointers in one instance aren't valid in the other).
To get them to communicate with each other via explicit flushes, you'd typically use MPI or other message-passing API to make the program specify which address ranges need flushing.
Real hardware doesn't run std::thread across cache coherency boundaries:
Some asymmetric ARM chips exist, with shared physical address space but not inner-shareable cache domains. So not coherent. (e.g. comment thread an A8 core and an Cortex-M3 like TI Sitara AM335x).
But different kernels would run on those cores, not a single system image that could run threads across both cores. I'm not aware of any C++ implementations that run std::thread threads across CPU cores without coherent caches.
For ARM specifically, GCC and clang generate code assuming all threads run in the same inner-shareable domain. In fact, the ARMv7 ISA manual says
This architecture (ARMv7) is written with an expectation that all processors using the same operating system or hypervisor are in the same Inner Shareable shareability domain
So non-coherent shared memory between separate domains is only a thing for explicit system-specific use of shared memory regions for communication between different processes under different kernels.
See also this CoreCLR discussion about code-gen using dmb ish (Inner Shareable barrier) vs. dmb sy (System) memory barriers in that compiler.
I make the assertion that no C++ implementation for other any other ISA runs std::thread across cores with non-coherent caches. I don't have proof that no such implementation exists, but it seems highly unlikely. Unless you're targeting a specific exotic piece of HW that works that way, your thinking about performance should assume MESI-like cache coherency between all threads. (Preferably use atomic<T> in ways that guarantees correctness, though!)
Coherent caches makes it simple
But on a multi-core system with coherent caches, implementing a release-store just means ordering commit into cache for this thread's stores, not doing any explicit flushing. (https://preshing.com/20120913/acquire-and-release-semantics/ and https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/). (And an acquire-load means ordering access to cache in the other core).
A memory barrier instruction just blocks the current thread's loads and/or stores until the store buffer drains; that always happens as fast as possible on its own. (Or for LoadLoad / LoadStore barriers, block until previous loads have completed.) (Does a memory barrier ensure that the cache coherence has been completed? addresses this misconception). So if you don't need ordering, just prompt visibility in other threads, mo_relaxed is fine. (And so is volatile, but don't do that.)
See also C/C++11 mappings to processors
Fun fact: on x86, every asm store is a release-store because the x86 memory model is basically seq-cst plus a store buffer (with store forwarding).
Semi-related re: store buffer, global visibility, and coherency: C++11 guarantees very little. Most real ISAs (except PowerPC) do guarantee that all threads can agree on the order of a appearance of two stores by two other threads. (In formal computer-architecture memory model terminology, they're "multi-copy atomic").
Will two atomic writes to different locations in different threads always be seen in the same order by other threads?
Concurrent stores seen in a consistent order
Another misconception is that memory fence asm instructions are needed to flush the store buffer for other cores to see our stores at all. Actually the store buffer is always trying to drain itself (commit to L1d cache) as fast as possible, otherwise it would fill up and stall execution. What a full barrier / fence does is stall the current thread until the store buffer is drained, so our later loads appear in the global order after our earlier stores.
Are loads and stores the only instructions that gets reordered?
x86 mfence and C++ memory barrier
Globally Invisible load instructions
(x86's strongly ordered asm memory model means that volatile on x86 may end up giving you closer to mo_acq_rel, except that compile-time reordering with non-atomic variables can still happen. But most non-x86 have weakly-ordered memory models so volatile and relaxed are about as weak as mo_relaxed allows.)
(Editor's note: in C++11 volatile is not the right tool for this job and still has data-race UB. Use std::atomic<bool> with std::memory_order_relaxed loads/stores to do this without UB. On real implementations it will compile to the same asm as volatile. I added an answer with more detail, and also addressing the misconceptions in comments that weakly-ordered memory might be a problem for this use-case: all real-world CPUs have coherent shared memory so volatile will work for this on real C++ implementations. But still don't do it.
Some discussion in comments seems to be talking about other use-cases where you would need something stronger than relaxed atomics. This answer already points out that volatile gives you no ordering.)
Volatile is occasionally useful for the following reason: this code:
/* global */ bool flag = false;
while (!flag) {}
is optimized by gcc to:
if (!flag) { while (true) {} }
Which is obviously incorrect if the flag is written to by the other thread. Note that without this optimization the synchronization mechanism probably works (depending on the other code some memory barriers may be needed) - there is no need for a mutex in 1 producer - 1 consumer scenario.
Otherwise the volatile keyword is too weird to be useable - it does not provide any memory ordering guarantees wrt both volatile and non-volatile accesses and does not provide any atomic operations - i.e. you get no help from the compiler with volatile keyword except disabled register caching.
You need volatile and possibly locking.
volatile tells the optimiser that the value can change asynchronously, thus
volatile bool flag = false;
while (!flag) {
/*do something*/
}
will read flag every time around the loop.
If you turn optimisation off or make every variable volatile a program will behave the same but slower. volatile just means 'I know you may have just read it and know what it says, but if I say read it then read it.
Locking is a part of the program. So ,by the way, if you are implementing semaphores then among other things they must be volatile. (Don't try it, it is hard, will probably need a little assembler or the new atomic stuff, and it has already been done.)
#include <iostream>
#include <thread>
#include <unistd.h>
using namespace std;
bool checkValue = false;
int main()
{
std::thread writer([&](){
sleep(2);
checkValue = true;
std::cout << "Value of checkValue set to " << checkValue << std::endl;
});
std::thread reader([&](){
while(!checkValue);
});
writer.join();
reader.join();
}
Once an interviewer who also believed that volatile is useless argued with me that Optimisation wouldn't cause any issues and was referring to different cores having separate cache lines and all that (didn't really understand what he was exactly referring to). But this piece of code when compiled with -O3 on g++ (g++ -O3 thread.cpp -lpthread), it shows undefined behaviour. Basically if the value gets set before the while check it works fine and if not it goes into a loop without bothering to fetch the value (which was actually changed by the other thread). Basically i believe the value of checkValue only gets fetched once into the register and never gets checked again under the highest level of optimisation. If its set to true before the fetch, it works fine and if not it goes into a loop. Please correct me if am wrong.

embedded C - using "volatile" to assert consistency

Consider the following code:
// In the interrupt handler file:
volatile uint32_t gSampleIndex = 0; // declared 'extern'
void HandleSomeIrq()
{
gSampleIndex++;
}
// In some other file
void Process()
{
uint32_t localSampleIndex = gSampleIndex; // will this be optimized away?
PrevSample = RawSamples[(localSampleIndex + 0) % NUM_RAW_SAMPLE_BUFFERS];
CurrentSample = RawSamples[(localSampleIndex + 1) % NUM_RAW_SAMPLE_BUFFERS];
NextSample = RawSamples[(localSampleIndex + 2) % NUM_RAW_SAMPLE_BUFFERS];
}
My intention is that PrevSample, CurrentSample and NextSample are consistent, even if gSampleIndex is updated during the call to Process().
Will the assignment to the localSampleIndex do the trick, or is there any chance it will be optimized away even though gSampleIndex is volatile?
In principle, volatile is not enough to guarantee that Process only sees consistent values of gSampleIndex. In practice, however, you should not run into any issues if uinit32_t is directly supported by the hardware. The proper solution would be to use atomic accesses.
The problem
Suppose that you are running on a 16-bit architecture, so that the instruction
localSampleIndex = gSampleIndex;
gets compiled into two instructions (loading the upper half, loading the lower half). Then the interrupt might be called between the two instructions, and you'll get half of the old value combined with half of the new value.
The solution
The solution is to access gSampleCounter using atomic operations only. I know of three ways of doing that.
C11 atomics
In C11 (supported since GCC 4.9), you declare your variable as atomic:
#include <stdatomic.h>
atomic_uint gSampleIndex;
You then take care to only ever access the variable using the documented atomic interfaces. In the IRQ handler:
atomic_fetch_add(&gSampleIndex, 1);
and in the Process function:
localSampleIndex = atomic_load(gSampleIndex);
Do not bother with the _explicit variants of the atomic functions unless you're trying to get your program to scale across large numbers of cores.
GCC atomics
Even if your compiler does not support C11 yet, it probably has some support for atomic operations. For example, in GCC you can say:
volatile int gSampleIndex;
...
__atomic_add_fetch(&gSampleIndex, 1, __ATOMIC_SEQ_CST);
...
__atomic_load(&gSampleIndex, &localSampleIndex, __ATOMIC_SEQ_CST);
As above, do not bother with weak consistency unless you're trying to achieve good scaling behaviour.
Implementing atomic operations yourself
Since you're not trying to protect against concurrent access from multiple cores, just race conditions with an interrupt handler, it is possible to implement a consistency protocol using standard C primitives only. Dekker's algorithm is the oldest known such protocol.
In your function you access volatile variable just once (and it's the only volatile one in that function) so you don't need to worry about code reorganization that compiler may do (and volatile prevents). What standard says for these optimizations at §5.1.2.3 is:
In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).
Note last sentence: "...no needed side effects are produced (...accessing a volatile object)".
Simply volatile will prevent any optimization compiler may do around that code. Just to mention few: no instruction reordering respect other volatile variables. no expression removing, no caching, no value propagation across functions.
BTW I doubt any compiler may break your code (with or without volatile). Maybe local stack variable will be elided but value will be stored in a registry (for sure it won't repeatedly access a memory location). What you need volatile for is value visibility.
EDIT
I think some clarification is needed.
Let me safely assume you know what you're doing (you're working with interrupt handlers so this shouldn't be your first C program): CPU word matches your variable type and memory is properly aligned.
Let me also assume your interrupt is not reentrant (some magic cli/sti stuff or whatever your CPU uses for this) unless you're planning some hard-time debugging and tuning.
If these assumptions are satisfied then you don't need atomic operations. Why? Because localSampleIndex = gSampleIndex is atomic (because it's properly aligned, word size matches and it's volatile), with ++gSampleIndex there isn't any race condition (HandleSomeIrq won't be called again while it's still in execution). More than useless they're wrong.
One may think: "OK, I may not need atomic but why I can't use them? Even if such assumption are satisfied this is an *extra* and it'll achieve same goal" . No, it doesn't. Atomic has not same semantic of volatile variables (and seldom volatile is/should be used outside memory mapped I/O and signal handling). Volatile (usually) is useless with atomic (unless a specific architecture says it is) but it has a great difference: visibility. When you update gSampleIndex in HandleSomeIrq standard guarantees that value will be immediately visible to all threads (and devices). with atomic_uint standard guarantees it'll be visible in a reasonable amount of time.
To make it short and clear: volatile and atomic are not the same thing. Atomic operations are useful for concurrency, volatile are useful for lower level stuff (interrupts, devices). If you're still thinking "hey they do *exactly* what I need" please read few useful links picked from comments: cache coherency and a nice reading about atomics.
To summarize:
In your case you may use an atomic variable with a lock (to have both atomic access and value visibility) but no one on this earth would put a lock inside an interrupt handler (unless absolutely definitely doubtless unquestionably needed, and from code you posted it's not your case).

Read and Write atomic operation implementation in the Linux Kernel

Recently I've peeked into the Linux kernel implementation of an atomic read and write and a few questions came up.
First the relevant code from the ia64 architecture:
typedef struct {
int counter;
} atomic_t;
#define atomic_read(v) (*(volatile int *)&(v)->counter)
#define atomic64_read(v) (*(volatile long *)&(v)->counter)
#define atomic_set(v,i) (((v)->counter) = (i))
#define atomic64_set(v,i) (((v)->counter) = (i))
For both read and write operations, it seems that the direct approach was taken to read from or write to the variable. Unless there is another trick somewhere, I do not understand what guarantees exist that this operation will be atomic in the assembly domain. I guess an obvious answer will be that such an operation translates to one assembly opcode, but even so, how is that guaranteed when taking into account the different memory cache levels (or other optimizations)?
On the read macros, the volatile type is used in a casting trick. Anyone has a clue how this affects the atomicity here? (Note that it is not used in the write operation)
I think you are misunderstanding the (very much vague) usage of the word "atomic" and "volatile" here. Atomic only really means that the words will be read or written atomically (in one step, and guaranteeing that the contents of this memory position will always be one write or the other, and not something in between). And the volatile keyword tells the compiler to never assume the data in that location due to an earlier read/write (basically, never optimize away the read).
What the words "atomic" and "volatile" do NOT mean here is that there's any form of memory synchronization. Neither implies ANY read/write barriers or fences. Nothing is guaranteed with regards to memory and cache coherence. These functions are basically atomic only at the software level, and the hardware can optimize/lie however it deems fit.
Now as to why simply reading is enough: the memory models for each architecture are different. Many architectures can guarantee atomic reads or writes for data aligned to a certain byte offset, or x words in length, etc. and vary from CPU to CPU. The Linux kernel contains many defines for the different architectures that let it do without any atomic calls (CMPXCHG, basically) on platforms that guarantee (sometimes even only in practice even if in reality their spec says the don't actually guarantee) atomic reads/writes.
As for the volatile, while there is no need for it in general unless you're accessing memory-mapped IO, it all depends on when/where/why the atomic_read and atomic_write macros are being called. Many compilers will (though it is not set in the C spec) generate memory barriers/fences for volatile variables (GCC, off the top of my head, is one. MSVC does for sure.). While this would normally mean that all reads/writes to this variable are now officially exempt from just about any compiler optimizations, in this case by creating a "virtual" volatile variable only this particular instance of a read/write is off-limits for optimization and re-ordering.
The reads are atomic on most major architectures, so long as they are aligned to a multiple of their size (and aren't bigger than the read size of a give type), see the Intel Architecture manuals. Writes on the other hand many be different, Intel states that under x86, single byte write and aligned writes may be atomic, under IPF (IA64), everything use acquire and release semantics, which would make it guaranteed atomic, see this.
the volatile prevents the compiler from caching the value locally, forcing it to be retrieve where ever there is access to it.
If you write for a specific architecture, you can make assumptions specific to it.
I guess IA-64 does compile these things to a single instruction.
The cache shouldn't be an issue, unless the counter crosses a cache line boundry. But if 4/8 byte alignment is required, this can't happen.
A "real" atomic instruction is required when a machine instruction translates into two memory accesses. This is the case for increments (read, increment, write) or compare&swap.
volatile affects the optimizations the compiler can do.
For example, it prevents the compiler from converting multiple reads into one read.
But on the machine instruction level, it does nothing.

how can I convert non atomic operation to atomic

I am trying to understand atomic and non atomic operations.With respect to Operating System and also with respect to C.
As per the wikipedia page here
Consider a simple counter which different processes can increment.
Non-atomic
The naive, non-atomic implementation:
reads the value in the memory location;
adds one to the value;
writes the new value back into the memory location.
Now, imagine two processes are running incrementing a single, shared memory location:
the first process reads the value in memory location;
the first process adds one to the value;
but before it can write the new value back to the memory location it is suspended, and the second process is allowed to run:
the second process reads the value in memory location, the same value that the first process read;
the second process adds one to the value;
the second process writes the new value into the memory location.
How can the above operation be made an atmoic operation.
My understanding of atomic operation is that any thing which executes without interruption is atomic.
So for example
int b=1000;
b+=1000;
Should be an atomic operation as per my understanding because both the instructions executed without an interruption,how ever I learned from some one that in C there is nothing known as atomic operation so above both statements are non atomic.
So what I want to understand is what is atomicity is different when it comes to programming languages than the Operating Systems?
C99 doesn't have any way to make variables atomic with respect to other threads. C99 has no concept of multiple threads of execution. Thus, you need to use compiler-specific extensions, and/or CPU-level instructions to achieve atomicity.
The next C standard, currently known as C1x, will include atomic operations.
Even then, mere atomicity just guarantees that an operation is atomic, it doesn't guarantee when that operation becomes visible to other CPUs. To achieve visibility guarantees, in C99 you would need to study your CPU's memory model, and possibly use a special kind of CPU instructions known as fences or memory barriers. You also need to tell the compiler about it, using some compiler-specific compiler barrier. C1x defines several memory orderings, and when you use an atomic operation you can decide which memory ordering to use.
Some examples:
/* NOT atomic */
b += 1000;
/* GCC-extension, only in newish GCCs
* requirements on b's loads are CPU-specific
*/
__sync_add_and_fetch(&b, 1000);
/* GCC-extension + x86-assembly,
* b should be aligned to its size (natural alignment),
* or loads will not be atomic
*/
__asm__ __volatile__("lock add $1000, %0" : "+r"(b));
/* C1x */
#include <stdatomic.h>
atomic_int b = ATOMIC_INIT(1000);
int r = atomic_fetch_add(&b, 1000) + 1000;
All of this is as complex as it seems, so you should normally stick to mutexes, which makes things easier.
int b = 1000;
b+=1000;
gets turned into multiple statements at the instruction level. At the very least, preparing a register or memory, assigning 1000, then getting the contents of that register/memory, adding 1000 to the contents, and re-assigning the new value (2000) to that register. Without locking, the OS can suspend the process/thread at any point in that operation. In addition, on multiproc systems, a different processor could access that memory (wouldn't be a register in this case) while your operation is in progress.
When you take a lock out (which is how you would make this atomic), you are, in part, informing the OS that it is not ok to suspend this process/thread, and that this memory should not be accessed by other processes.
Now the above code would probably be optimized by the compiler to a simple assignment of 2000 to the memory location for b, but I'm ignoring that for the purposes of this answer.
b+=1000 is compiled, on all systems that I know, to multiple instructions. Thus it is not atomic.
Even b=1000 can be non atomic although you have to work hard to construct a situation where it is not atomic.
In fact C has no concept of threads and so there is nothing that is atomic in C. You need to rely on implementation specific details of your compiler and tools.
The above statements are non atomic because it becomes a move instruction to load b into a register (if it isnt) then add 1000 to it and the store back into memory. Many instruction sets allow for atomicity through atomic increment easiest being x86 with lock addl dest, src; some other instruction sets use cmpxchg to achieve the same result.
So what I want to understand is what
is atomicity is different when it
comes to programming languages than
the Operating Systems?
I'm a bit confused by this question. What do you mean exactly? The atomicity concept is the same both in prog. languages and OS.
Regarding atomicity and language, here is for example a link about atomicity in JAVA, that might give you a different perspective: What operations in Java are considered atomic?

Resources