Why can `asm volatile("" ::: "memory")` serve as a compiler barrier? - c

It is known that asm volatile ("" ::: "memory") can serve as a compiler barrier that prevents the compiler from reordering instructions across it. For example, it is mentioned in https://preshing.com/20120625/memory-ordering-at-compile-time/, section "Explicit Compiler Barriers".
However, all the articles I can find only mention the fact that asm volatile ("" ::: "memory") can serve as a compiler barrier, without explaining why the "memory" clobber effectively forms one. The GCC online documentation only says that the special clobber "memory" tells the compiler that the assembly code may perform memory reads or writes other than those specified in the operand lists. But how do those semantics stop the compiler from reordering memory instructions across it? I tried to answer this myself but failed, so I ask here: why can asm volatile ("" ::: "memory") serve as a compiler barrier, based on the semantics of the "memory" clobber? Please note that I am asking about a "compiler barrier" (in effect at compile time), not the stronger "memory barrier" (in effect at run time). For convenience, I excerpt the semantics of the "memory" clobber from the GCC online doc below:
The "memory" clobber tells the compiler that the assembly code
performs memory reads or writes to items other than those listed in
the input and output operands (for example, accessing the memory
pointed to by one of the input parameters). To ensure memory contains
correct values, GCC may need to flush specific register values to
memory before executing the asm. Further, the compiler does not assume
that any values read from memory before an asm remain unchanged after
that asm; it reloads them as needed. Using the "memory" clobber
effectively forms a read/write memory barrier for the compiler.

If a variable is potentially read or written, it matters what order that happens in. The point of a "memory" clobber is to make sure the reads and/or writes in an asm statement happen at the right point in the program's execution.
(Or more specifically, in this thread's execution, since a compiler barrier is like atomic_signal_fence not atomic_thread_fence. Except on ISAs like x86 where acquire or release thread fences only require blocking compile-time reordering to take advantage of the hardware's strong run-time ordering. e.g. asm("":::"memory") is a possible implementation of atomic_thread_fence(memory_order_release) on x86, but not on AArch64.)
Any read of a C variable's value that happens in the source after an asm statement must be after the memory-clobbering asm statement in the compiler-generated assembly output for the target machine, otherwise it might be reading a value before the asm statement would have changed it.
Any read of a C var in the source before an asm statement similarly must stay sequenced before, otherwise it might incorrectly read a modified value.
Similar reasoning applies to assignments to (writes of) C variables before/after any asm statement with a "memory" clobber. It is just like a call to an "opaque" function, one whose definition the compiler can't see.
No reads or writes can reorder (at compile time) with the barrier in either direction, therefore no operation before the barrier can reorder with any operation after the barrier, or vice versa.
Another way to look at it: the actual machine memory contents must match the C abstract machine at that point. The compiler-generated asm has to respect that, by storing any variable values from registers to memory before the start of an asm("":::"memory") statement, and afterwards it has to assume that any registers that had copies of variable values might not be up to date anymore. So they have to be reloaded if they're needed.
This reads-everything / writes-everything assumption for the "memory" clobber is what keeps the asm statement from reordering at all at compile time wrt. all accesses, even non-volatile ones. The volatile is already implicit from being an asm() statement with no "=..." output operands, and is what stops it from being optimized away entirely (and with it the memory clobber).
Note that only potentially "reachable" C variables are affected. For example, escape analysis can still let the compiler keep a local int i in a register across a "memory" clobber, as long as the asm statement itself doesn't have the address as an input.
Just like a function call: for (int i=0;i<10;i++) {foobar("%d\n", i);} can keep the loop counter in a register, and just copy it to the 2nd arg-passing register for foobar every iteration. There's no way foobar can have a reference to i because its address hasn't been stored anywhere or passed anywhere.
(This is fine for the memory barrier use-case; no other thread could have its address either.)
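As a minimal sketch of the reasoning above (the names compiler_barrier, shared_data, shared_flag, publish and sum_below are made up for illustration), the following shows both effects: stores to reachable globals stay on their side of the barrier, while a private local may still live in a register across it.

```c
/* Compiler-only barrier: emits no instructions, but GCC/clang must assume
   it may read or write any reachable memory. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

int shared_data;   /* hypothetical globals, reachable from other code */
int shared_flag;

void publish(int value)
{
    shared_data = value;
    compiler_barrier();   /* the store above cannot be sunk below this point */
    shared_flag = 1;      /* and this store cannot be hoisted above it */
}

int sum_below(int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += i;
        compiler_barrier();   /* i and total never escape, so they may stay in registers */
    }
    return total;
}
```

Comparing the compiler output with and without the barrier (for example on a compiler explorer) should show the two stores in publish kept in source order, and no extra loads or stores for i and total in sum_below.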
Related:
How does a mutex lock and unlock functions prevents CPU reordering? - why opaque function calls work as compiler barriers.
How can I indicate that the memory *pointed* to by an inline ASM argument may be used? - cases where a "memory" clobber is needed for a non-empty asm statement (or other dummy operands to tell the asm statement which memory is read / written.)

I'll add that the "memory" clobber is only a compiler directive. A CPU that executes out of order may still reorder memory accesses at run time. To prevent this, an explicit memory barrier is necessary. See the Linux kernel documentation on memory barriers.
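For the run-time side, a hedged sketch using the C11 fences from stdatomic.h rather than inline asm: atomic_thread_fence orders accesses for other threads as well as for the compiler, whereas atomic_signal_fence is only a compiler barrier. The names payload, ready, producer and consumer are hypothetical.

```c
#include <stdatomic.h>

static int payload;        /* plain data, published through the flag below */
static atomic_int ready;   /* zero-initialized: nothing published yet */

void producer(int value)
{
    payload = value;
    /* Release fence: earlier writes become visible before the flag store.
       On x86 this only has to block compile-time reordering; on weakly
       ordered ISAs the compiler also emits a barrier instruction. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Returns 1 and fills *out if a value has been published, else 0. */
int consumer(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return 0;
    /* Acquire fence: the payload read cannot be satisfied before the flag read. */
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return 1;
}
```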

Related

How does Google's `DoNotOptimize()` function enforce statement ordering

I'm trying to understand exactly how Google's DoNotOptimize() is supposed to work.
For completeness, here is its definition (for clang, and non-const data):
template <class Tp>
inline BENCHMARK_ALWAYS_INLINE void DoNotOptimize(Tp& value) {
asm volatile("" : "+r,m"(value) : : "memory");
}
As I understand we can use this in code like this:
start_time = time();
bench_output = run_bench(bench_inputs);
result = time() - start_time;
To ensure that the benchmark stays in the critical section:
start_time = time();
DoNotOptimize(bench_inputs);
bench_output = run_bench(bench_inputs);
DoNotOptimize(bench_output);
result = time() - start_time;
Specifically what I don't understand is why this guarantees (does it?) that run_bench() is not moved above start_time = time().
(Someone asked exactly this in this comment, however I don't understand the answer).
As I understand, the above DoNotOptimize() does several things:
It forces value to the stack, as it is passed by C++ reference. You can't have a pointer to a register, so it must be in memory.
Because value is now on the stack, subsequently clobbering memory (as done in the asm constraints) will force the compiler to assume that value is both read and written by the call to DoNotOptimize(value).
(it's not clear to me if the +r,m constraint is relevant. As far as I know this says that the pointer itself may be stored in a register or in memory, but the pointer value itself may be read and/or written.)
And this is where things get fuzzy for me.
If start_time is also stack allocated, the memory clobbering in DoNotOptimize() will mean that the compiler must assume that DoNotOptimize() potentially reads start_time. Therefore the order of the statements can only be:
start_time = time(); // on the stack
DoNotOptimize(bench_inputs); // reads start_time, writes bench_inputs
bench_output = run_bench(bench_inputs)
But if start_time is not stored in memory, but instead in a register, then clobbering memory will not clobber start_time, right? In that case the desired ordering of start_time = time() and DoNotOptimize(bench_inputs) is lost and the compiler is free to do:
DoNotOptimize(bench_inputs); // only writes bench_inputs
bench_output = run_bench(bench_inputs)
start_time = time(); // in a register
Obviously I've misunderstood something. Can anyone help explain? Thanks :)
I'm wondering if this is because reordering optimisations happen prior to register allocation, and thus everything is assumed to be stack allocated at that time. But if that were the case, then DoNotOptimize() would be redundant, as ClobberMemory() would be sufficient.
Summary: DoNotOptimize is ordered wrt. time() by the "memory" clobbers, as if it were another function call to an opaque function that could modify any global state.
DoNotOptimize is ordered wrt. the computation of output from input by the data dependency of the calculation on the input, and the output on the calculation, as Chandler Carruth explained in the Q&A you linked. The "memory" clobber is irrelevant for this part.
"memory" clobber is like a non-inline function call
DoNotOptimize's asm statement contains a "memory" clobber. As far as the optimizer is concerned, that's equivalent to an opaque function call: it has to be assumed to read and write every globally-reachable object (see footnote 1). (Even ones this compilation unit might not know about.)
Since time() itself doesn't have an inline definition in any header, it can't reorder with DoNotOptimize at compile time for the same reason that a compiler can't reorder calls to foo() and bar() when it can't see the definitions of those functions. Same reason compilers don't need any special logic to stop them from reordering puts("hi"); puts("mom");.
(A hypothetical time() that could inline and only contained an asm statement would have to use asm volatile to make sure repeated calls didn't just use the first one's output. asm volatile statements can't reorder with each other or accesses to volatile variables, so that would be ok too, for a different reason.)
Footnote 1: Globally reachable = any object that might be pointed-to by any hypothetical global variable. i.e. anything except local variables within this function, or memory freshly allocated with new, if escape analysis can prove that nothing outside this function could have pointers to them.
How the asm statement works
I think you're pretty seriously misunderstanding how the asm works. "+r,m" tells the compiler to materialize the value in a register (or memory if it wants), and then use the value there at the end of the (empty) asm template as the new value of that C++ object.
So it forces the compiler to actually materialize (produce) the value somewhere, which means it has to be computed. And it means the compiler has to forget what it previously knew about the value (e.g. that it was a compile time constant 5, or non-negative, or anything) because the "+" modifier declares a read/write operand.
The point of DoNotOptimize on the input is to defeat constant-propagation that would let the benchmark optimize away.
And on the output to make sure a final result is actually materialized in a register (or memory) instead of optimizing away all the computation leading to an unused result. (This is where being asm volatile is relevant; defeating constant-propagation still works with non-volatile asm.)
So the computation you want to benchmark has to happen between the two DoNotOptimize() statements, and separately those two statements can't reorder with time().
The compiler has to assume that the asm statement modifies the value like val ^= random for all it knows, along with changing the value in memory of any/every other object except for private locals that weren't operands, so e.g. the "memory" clobber doesn't stop the compiler from keeping a local loop counter in a register. (It doesn't special case an empty asm template string here; programs don't contain asm statements like this by accident so nobody wants them optimized away.)
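A rough C analogue of the pattern described above, written as a macro because C has no templates. DO_NOT_OPTIMIZE, sum_squares and time_sum_squares are names invented for this sketch, and clock() merely stands in for whatever timer the benchmark uses.

```c
#include <time.h>

/* Same operand and clobber list as the quoted DoNotOptimize, applied to an lvalue. */
#define DO_NOT_OPTIMIZE(value) \
    __asm__ __volatile__("" : "+r,m"(value) : : "memory")

/* Work to be measured: 1*1 + 2*2 + ... + n*n. */
static unsigned sum_squares(unsigned n)
{
    unsigned total = 0;
    for (unsigned i = 1; i <= n; i++)
        total += i * i;
    return total;
}

double time_sum_squares(unsigned n, unsigned *result)
{
    clock_t start = clock();     /* opaque call: ordered against the "memory" clobbers */
    DO_NOT_OPTIMIZE(n);          /* compiler forgets any known value of n */
    unsigned r = sum_squares(n); /* depends on the value coming out of the first asm */
    DO_NOT_OPTIMIZE(r);          /* r must be materialized before this asm */
    clock_t end = clock();
    *result = r;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

The computation is pinned between the two asm statements by data dependencies, and the asm statements are pinned between the two clock() calls by the "memory" clobbers, which matches the two-part argument in the summary.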
Misconceptions about the reference arg and picking "m"
I only got part way into the details of your attempt to reason about the "+r,m" operand and the reference function-arg before deciding it would probably be better to just explain from scratch. The correct reason isn't that complicated. But a couple things are worth specifically correcting:
The C++ function containing the asm statement can inline, letting the by-reference function arg optimize away. (It's even declared inline __attribute__((always_inline)) to force inlining even with optimization disabled, although in that case the reference variable won't optimize away.)
The net result is as if the asm statement were used directly on the C++ variable passed to DoNotOptimize. e.g. DoNotOptimize(foo) is like asm volatile("" : "+r,m"(foo) :: "memory")
The compiler can always pick register if it wants to, e.g. choosing to load a variable's value into a register before an asm statement. (And if the C++ semantics demand updating the variable's value in memory, also emitting a store instruction after the asm statement.)
For example, we can see that GCC does choose to do that. (I guess I could have used incl %0 as the example, but I just chose nop as a way to show what the compiler picked for the operand location as an alternative to # %0 pure comment, so the Godbolt compiler explorer wouldn't filter it out.)
void foo(int *p)
{
asm volatile("nop # operand picked %0" : "+r,m" (p[4]) );
}
# GCC 11.2 -O2
foo(int*):
movl 16(%rdi), %eax
nop # operand picked %eax
movl %eax, 16(%rdi)
ret
vs. clang choosing to leave the value in memory, so every instruction in the asm template would be accessing memory instead of a register. (If there were any instructions).
# clang 12.0.1 -O2 -fPIE
foo(int*): # #foo(int*)
nop # operand picked 16(%rdi)
retq
Fun fact: "r,m" is an attempt to work around a clang missed-optimization bug that makes it always pick memory for "rm" constraints, even if the value was already in a register. Spilling it first, even if it has to invent a temporary location for the value of an expression as an input.

Is assigning a pointer in C program considered atomic on x86-64

https://www.gnu.org/software/libc/manual/html_node/Atomic-Types.html#Atomic-Types says - In practice, you can assume that int is atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these assumptions are true on all of the machines that the GNU C Library supports and on all POSIX systems we know of.
My question is whether pointer assignment can be considered atomic on the x86_64 architecture for a C program compiled with the gcc -m64 flag. The OS is 64-bit Linux and the CPU is an Intel(R) Xeon(R) CPU D-1548. One thread will be setting a pointer and another thread accessing the pointer. There is only one writer thread and one reader thread. The reader should get either the previous value of the pointer or the latest value, and no garbage value in between.
If it is not considered atomic, please let me know how can I use the gcc atomic builtins or maybe memory barrier like __sync_synchronize to achieve the same without using locks. Interested only in C solution and not C++. Thanks!
Bear in mind that atomicity alone is not enough for communicating between threads. Nothing prevents the compiler and CPU from reordering previous/subsequent load and store instructions with that "atomic" store. In old days people used volatile to prevent that reordering but that was never intended for use with threads and doesn't provide means to specify less or more restrictive memory order (see "Relationship with volatile" in there).
You should use C11 atomics because they guarantee both atomicity and memory order.
For almost all architectures, pointer load and store are atomic. A once notable exception was 8086/80286 where pointers could be seg:offset; there was an l[des]s instruction which could make an atomic load; but no corresponding atomic store.
The integrity of the pointer is only a small concern; your bigger issue revolves around synchronization: the pointer was at value Y, you set it to X; how will you know when nobody is using the (old) Y value?
A somewhat related problem is that you may have stored things at X, which the other thread expects to find. Without synchronization, the other thread might see the new pointer value while what it points to is not up to date yet.
A plain global char *ptr should not be considered atomic. It might work sometimes, especially with optimization disabled, but you can get the compiler to make safe and efficient optimized asm by using modern language features to tell it you want atomicity.
Use C11 stdatomic.h or GNU C __atomic builtins. And see Why is integer assignment on a naturally aligned variable atomic on x86? - yes the underlying asm operations are atomic "for free", but you need to control the compiler's code-gen to get sane behaviour for multithreading.
See also LWN: Who's afraid of a big bad optimizing compiler? - weird effects of using plain vars include several really bad well-known things, but also more obscure stuff like invented loads, reading a variable more than once if the compiler decides to optimize away a local tmp and load the shared var twice, instead of loading it into a register. Using asm("" ::: "memory") compiler barriers may not be sufficient to defeat that depending on where you put them.
So use proper atomic stores that tell the compiler what you want. You should generally use atomic loads to read the variable, too.
#include <stdatomic.h> // C11 way
// Atomic pointer to char (note: `_Atomic char *` would declare a plain pointer to an atomic char).
char *_Atomic c11_shared_var; // all access to this is atomic, functions needed only if you want weaker ordering
void foo(char *newval) {
    atomic_store_explicit(&c11_shared_var, newval, memory_order_relaxed);
}
char *plain_shared_var; // GNU C
// This is a plain C var. Only specific accesses to it are atomic; be careful!
void foo_gnu(char *newval) {
    __atomic_store_n(&plain_shared_var, newval, __ATOMIC_RELAXED);
}
Using __atomic_store_n on a plain var is the functionality that C++20 atomic_ref exposes. If multiple threads access a variable for the entire time that it needs to exist, you might as well just use C11 stdatomic because every access needs to be atomic (not optimized into a register or whatever). When you want to let the compiler load once and reuse that value, do char *tmp = c11_shared_var; (or atomic_load_explicit if you only want acquire instead of seq_cst; cheaper on a few non-x86 ISAs).
Besides lack of tearing (atomicity of asm load or store), the other key parts of _Atomic foo * are:
The compiler will assume that other threads may have changed memory contents (like volatile effectively implies), otherwise the assumption of no data-race UB will let the compiler hoist loads out of loops. Without this, dead-store elimination might only do one store at the end of a loop, not updating the value multiple times.
The read side of the problem is usually what bites people in practice, see Multithreading program stuck in optimized mode but runs normally in -O0 - e.g. while(!flag){} becomes if(!flag) infinite_loop; with optimization enabled.
Ordering wrt. other code. e.g. you can use memory_order_release to make sure that other threads that see the pointer update also see all changes to the pointed-to data. (On x86 that's as simple as compile-time ordering, no extra barriers needed for acquire/release, only for seq_cst. Avoid seq_cst if you can; mfence or locked operations are slow.)
Guarantee that the store will compile to a single asm instruction. You'd be depending on this. It does happen in practice with sane compilers, although it's conceivable that a compiler might decide to use rep movsb to copy a few contiguous pointers, and that some machine somewhere might have a microcoded implementation that does some stores narrower than 8 bytes.
(This failure mode is highly unlikely; the Linux kernel relies on volatile load/store compiling to a single instruction with GCC / clang for its hand-rolled intrinsics. But if you just used asm("" ::: "memory") to make sure a store happened on a non-volatile variable, there's a chance.)
Also, something like ptr++ will compile to an atomic RMW operation like lock add qword [mem], 4, rather than separate load and store like volatile would. (See Can num++ be atomic for 'int num'? for more about atomic RMWs). Avoid that if you don't need it, it's slower. e.g. atomic_store_explicit(&ptr, ptr + 1, mo_release); - seq_cst loads are cheap on x86-64 but seq_cst stores aren't.
Also note that memory barriers can't create atomicity (lack of tearing), they can only create ordering wrt other ops.
In practice x86-64 ABIs do have alignof(void*) = 8 so all pointer objects should be naturally aligned (except in a __attribute__((packed)) struct, which violates the ABI), so you can use __atomic_store_n on them. It should compile to what you want (plain store, no overhead), and meet the asm requirements to be atomic.
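To make the single-writer / single-reader case from the question concrete, here is a small sketch with the GNU C __atomic builtins on a plain pointer, using release/acquire so the reader also sees the pointed-to data. The struct config type and the install_config / current_limit functions are hypothetical.

```c
#include <stddef.h>

struct config {
    int version;
    int limit;
};

/* Plain pointer object; it is only ever accessed through the __atomic builtins. */
static struct config *active_config;

/* Writer: fill in *next completely, then publish it. */
void install_config(struct config *next)
{
    __atomic_store_n(&active_config, next, __ATOMIC_RELEASE);
}

/* Reader: load the pointer once, then use only that snapshot. */
int current_limit(int fallback)
{
    struct config *snap = __atomic_load_n(&active_config, __ATOMIC_ACQUIRE);
    return snap ? snap->limit : fallback;
}
```

This does not answer when the old object may be freed or reused; that lifetime question still needs a separate mechanism.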
See also When to use volatile with multi threading? - you can roll your own atomics with volatile and asm memory barriers, but don't. The Linux kernel does that, but it's a lot of effort for basically no gain, especially for a user-space program.
Side note: an often repeated misconception is that volatile or _Atomic are needed to avoid reading stale values from cache. This is not the case.
All machines that run C11 threads across multiple cores have coherent caches, not needing explicit flush instructions in the reader or writer. Just ordinary load or store instructions, like x86 mov. The key is to not let the compiler keep values of shared variables in CPU registers (which are thread-private). It normally can do this optimization because of the assumption of no data-race Undefined Behaviour. Registers are very much not the same thing as L1d CPU cache; managing what's in registers vs. memory is done by the compiler, while hardware keeps cache in sync. See When to use volatile with multi threading? for more details about why coherent caches are sufficient to make volatile work like memory_order_relaxed.
See Multithreading program stuck in optimized mode but runs normally in -O0 for an example.
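A sketch of the read side done with an atomic flag, so the load cannot be hoisted out of the loop the way it can with a plain variable. The spin is bounded here only so the example terminates when run single-threaded; flag, set_flag and spin_until_set are illustrative names.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool flag;   /* zero-initialized to false */

void set_flag(void)
{
    atomic_store_explicit(&flag, true, memory_order_release);
}

/* Polls the flag up to max_spins times; returns true once it is seen set. */
bool spin_until_set(unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++) {
        /* Each iteration performs a real load from memory. */
        if (atomic_load_explicit(&flag, memory_order_acquire))
            return true;
    }
    return false;
}
```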
"Atomic" is treated as this quantum state where something can be both atomic and not atomic at the same time because "it's possible" that "some machines" "somewhere" "might not" write "a certain value" atomically. Maybe.
That is not the case. Atomicity has a very specific meaning, and it solves a very specific problem: threads being pre-empted by the OS to schedule another thread in its place on that core. And you cannot stop a thread from executing mid-assembly instruction.
What that means is that any single assembly instruction is "atomic" by definition. And since you have register move instructions, any register-sized copy is atomic by definition. That means a 32-bit integer on a 32-bit CPU, and a 64-bit integer on a 64-bit CPU are all atomic -- and of course that includes pointers (ignore all the people who will tell you "some architectures" have pointers of "different size" than registers, that hasn't been the case since 386).
You should however be careful not to hit variable caching problems (ie one thread writing a pointer, and another trying to read it but getting an old value from the cache), use volatile as needed to prevent this.

understanding GCC inline asm function

i will write my assumptions (based on my research) in the question below. i assume that there are mistakes in my assumptions outside the question itself:
i'm looking into some code written for ARM:
this function (taken from FreeRTOS port code):
portFORCE_INLINE static uint32_t ulPortRaiseBASEPRI(void)
{
uint32_t ulOriginalBASEPRI, ulNewBASEPRI;
__asm volatile(" mrs %0, basepri \n"
" mov %1, %2 \n"
" msr basepri, %1 \n"
" isb \n"
" dsb \n"
: "=r"(ulOriginalBASEPRI), "=r"(ulNewBASEPRI)
: "i"(configMAX_SYSCALL_INTERRUPT_PRIORITY));
/* This return will not be reached but is necessary to prevent compiler
warnings. */
return ulOriginalBASEPRI;
}
i understand in gcc "=r" is output operand. so we save values from asm to C variable
now the code in my understanding is equivalent to:
ulOriginalBASEPRI = basepri
ulNewBASEPRI = configMAX_SYSCALL_INTERRUPT_PRIORITY
basepri = ulNewBASEPRI
i understand we are returning the original value of BASEPRI so thats the first line. however, i didn't understand why we assign variable ulNewBASEPRI then we use it in MSR instruction..
so I've looked in the ARMV7 instruction set and i saw this:
i assume there is no (MSR immediate) in thumb instruction and "Encoding A1" means its only in Arm instruction mode.
so we have to use =r output operand to let assembler to auto select a register for our variable am i correct?
EDIT: ignore this section because i miscounted colons
: "i"(configMAX_SYSCALL_INTERRUPT_PRIORITY));
from my understanding for assembly template:
asm ( assembler template
: output operands /* optional */
: input operands /* optional */
: list of clobbered registers /* optional */
);
isn't "i" just means (immediate) or constant in the assembly?
does this mean the third colon is not only for clobber list?
if that so, isn't it more appropriate to find the constraint "i" in the input operands?
EDIT: ignore this section because i miscounted colons
i understand isb, dsb are memory barrier stuff but i really dont understand the description of them. what do they really do?
what happens if we remove the dsb or isb instruction, for example?
so we have to use =r output operand to let assembler to auto select a register for our variable am i correct?
Yes, but it's the compiler that does register allocation. It just fills in the %[operand] in the asm template string as a text substitution and feeds that to the assembler.
Alternatively, you could hard-code a specific register in the asm template string, and use a register-asm local variable to make sure an "=r" constraint picked it. Or use an "=m" memory output operand and str a result into it, and declare a clobber on any registers you used. But those alternatives are obviously terrible compared to just telling the compiler about how your block of asm can produce an output.
I don't understand why the comment says the return statement doesn't run:
/* This return will not be reached but is necessary to prevent compiler
warnings. */
return ulOriginalBASEPRI;
Raising the basepri (ARM docs) to a higher number might allow an interrupt handler to run right away, before later instructions, but if that exception ever returns, execution will eventually reach the C outside the asm statement. That's the whole point of saving the old basepri into a register and having an output operand for it, I assume.
(I had been assuming that "raise" meant higher number = more interrupts allowed. But Ross comments that it will never allow more interrupts; they're "raising the bar" = lower number = fewer interrupts allowed.)
If execution really never comes out the end of your asm, you should tell the compiler about it. There is asm goto, but that needs a list of possible branch targets. The GCC manual says:
GCC assumes that asm execution falls through to the next statement (if this is not the case, consider using the __builtin_unreachable() intrinsic after the asm statement).
Failing to do this might lead to the compiler planning to do something after the asm, and then it never happening even though in the source it's before the asm.
It might be a good idea to use a "memory" clobber to make sure the compiler has memory contents in sync with the C abstract machine. (At least for variables other than locals, which an interrupt handler might access). This is usually desirable around asm barrier instructions like dsb, but it seems here we maybe don't care about being an SMP memory barrier, just about consistent execution after changing basepri? I don't understand why that's necessary, but if you do then worth considering one way or another whether compile-time reordering of memory access around the asm statement is or isn't a problem.
You'd use a third colon-separated section in the asm statement (after the inputs) : "memory"
Without that, compilers might decide to do an assignment after this asm instead of before, leaving a value just in registers.
// actual C source
global_var = 1;
uint32_t oldpri = ulPortRaiseBASEPRI();
global_var = 2;
could optimize (via dead-store elimination) into asm that worked like this
// possible asm
global_var = 2;
uint32_t oldpri = ulPortRaiseBASEPRI();
// or global_var = 2; here *instead* of before the asm
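A sketch of the quoted function with the suggested "memory" clobber added as the third colon-separated section. The asm text and operand names are taken from the question; the 0x50 priority value and the non-ARM fallback branch (present only so the sketch compiles on a host machine) are assumptions.

```c
#include <stdint.h>

/* Assumed value for illustration; the real one comes from FreeRTOSConfig.h. */
#define configMAX_SYSCALL_INTERRUPT_PRIORITY 0x50

int global_var;

static inline uint32_t ulPortRaiseBASEPRI(void)
{
    uint32_t ulOriginalBASEPRI = 0, ulNewBASEPRI = 0;
#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
    __asm__ __volatile__(" mrs %0, basepri \n"
                         " mov %1, %2      \n"
                         " msr basepri, %1 \n"
                         " isb             \n"
                         " dsb             \n"
                         : "=r"(ulOriginalBASEPRI), "=r"(ulNewBASEPRI)
                         : "i"(configMAX_SYSCALL_INTERRUPT_PRIORITY)
                         : "memory");   /* clobber list: third section */
#else
    /* Host stand-in (assumption): no BASEPRI register, keep only the barrier. */
    __asm__ __volatile__("" : "+r"(ulOriginalBASEPRI), "+r"(ulNewBASEPRI)
                         : : "memory");
#endif
    return ulOriginalBASEPRI;
}

uint32_t enter_critical(void)
{
    global_var = 1;                          /* must be stored before the asm */
    uint32_t oldpri = ulPortRaiseBASEPRI();
    global_var = 2;                          /* must be stored after the asm */
    return oldpri;
}
```

With the clobber in place, the compiler can no longer merge the two assignments to global_var into one store on either side of the asm.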
Concerning ARM/Thumb instruction set differences on msr: you should be able to answer this yourself from the documentation. ;-) It is just 2 pages later. Edit: Chapter A8.1.3 of the linked manual clearly states how encodings are documented on instructions.
dsb (data synchronization barrier) makes sure that all memory accesses are finished before the next instruction is executed. This is only a very brief summary; for the full details you need to read the documentation. If you have further specific questions about this operation, please post another question.
isb (instruction synchronization barrier) purges the instruction pipeline. This pipeline buffers instructions which are already fetched from memory but are not yet executed. So the next instruction will be fetched with possibly changed memory access, and this is what a programmer expects. The note above applies here, too.

GCC inline assembly - what's difference from __volatile__ and "memory"?

In GCC inline assembly, there are two ways to prevent from being optimized-out: __volatile__ keyword and inserting "memory" into clobber registers list.
My question is what is difference from __volatile__ and "memory" - It seems that they're the same... However, today I faced the strange situation, which shows they're definitely different! (My program had a bug in port I/O functions when I used "memory", but it becomes fine when I used __volatile__.)
What's the difference?
My reading of the GCC documentation is that the __volatile__ keyword is for assembly that has side-effects: that is, it does something other than produce outputs from inputs. In your case, I imagine "port I/O functions" would cause side-effects.
The "memory" clobber is just for assembly that reads/writes memory other than the input/output operands. While this is a side-effect, not all side-effects involve memory.
From the manual:
The typical use of Extended asm statements is to manipulate input values to produce output values. However, your asm statements may also produce side effects. If so, you may need to use the volatile qualifier to disable certain optimizations.
and
The "memory" clobber tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands (for example accessing the memory pointed to by one of the input parameters).
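To illustrate that the two are independent knobs, here is a small sketch with empty asm templates (tag_plain and tag_volatile are made-up names). The comments reflect my reading of the manual text quoted above, not a guarantee about any particular compiler version.

```c
/* Not volatile: it has an output operand, so the compiler is allowed to drop
   the statement when the result is unused. The "memory" clobber only says
   which memory the asm may touch; it is not documented to prevent removal. */
static inline int tag_plain(int x)
{
    __asm__("" : "+r"(x) : : "memory");
    return x;
}

/* volatile: treated as having side effects, so it is kept each time it is
   reached, but without "memory" the compiler may still keep values cached
   in registers across it. */
static inline int tag_volatile(int x)
{
    __asm__ __volatile__("" : "+r"(x));
    return x;
}
```

For something like port I/O, where the read itself is the side effect, the volatile form is the one that keeps an apparently unused access from being deleted.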
Using __volatile__ you ensure that the value is always retrieved from RAM, bypassing the CPU cache. This, as stated in the answer by Michael Rawson, produces side effects, but only in the sense that the normal optimization through the CPU cache is "disabled" and nothing else.
In your case a value read from an I/O port (and stored in a variable) can be updated faster than the CPU cache is invalidated, so you may read an "old" value. Using __volatile__ you always read the non-cached value.
You can also see: this post (I don't know if your architecture is ARM, but the concept is the same).

g_atomic_int_get guarantees visibility into another thread?

This is related to this question.
rmmh's claim on that question was that on certain architectures, no special magic is needed to implement atomic get and set. Specifically, in this case g_atomic_int_get(&foo) from glib gets expanded simply to (*(&foo)). I understand that this means that foo will not be seen in an internally inconsistent state. However, am I also guaranteed that foo won't be cached by a given CPU or core?
Specifically, if one thread is setting foo, and another reading it (using the glib g_atomic_* functions), can I assume that the reader will see the updates to the variable made by the writer. Or is it possible for the writer to simply update the value in a register? For reference my target platform is gcc (4.1.2) on a multi-core multi-CPU X86_64 machine.
What most architectures (yours included) ensure is atomicity and coherence of suitably sized and aligned reads and writes (so every processor sees a subsequence of the same master sequence of values for a given memory address (*)). An int is most probably suitably sized, and compilers generally ensure that it is also correctly aligned.
But compilers rarely ensure that they aren't optimizing out some reads or writes if these aren't marked in one way or another. I've tried to compile:
int f(int* ptr)
{
int i, r=0;
*ptr = 5;
for (i = 0; i < 100; ++i) {
r += i*i;
}
*ptr = r;
return *ptr;
}
With gcc 4.1.2, the first write to *ptr was optimized out without any problem, which is something you probably don't want for an atomic write.
(*) Coherence is not to be confused with consistency: the relationship between reads and writes at different address is often relaxed with respect to the intuitive, but costly to implement, sequential consistency. That's why memory barriers are needed.
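For comparison, a sketch of the same function written with C11 atomics (f_atomic is a hypothetical name; stdatomic.h is not available in gcc 4.1.2, so this assumes a newer compiler). As far as I know, current GCC and clang do not eliminate atomic stores like the first one, although the standard does not strictly forbid it.

```c
#include <stdatomic.h>

int f_atomic(atomic_int *ptr)
{
    int r = 0;
    /* First store: marked atomic, so it is not treated as a dead store in practice. */
    atomic_store_explicit(ptr, 5, memory_order_relaxed);
    for (int i = 0; i < 100; ++i) {
        r += i * i;
    }
    atomic_store_explicit(ptr, r, memory_order_relaxed);
    return atomic_load_explicit(ptr, memory_order_relaxed);
}
```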
Volatile will only ensure that the compiler doesn't use a register to hold the variable. Volatile will not prevent the compiler from re-ordering code; although, it might act as a hint to not reorder.
Depending on the architecture, certain instructions are atomic. writing to an integer and reading from an integer are often atomic. If gcc uses atomic instructions for reading/writing to/from an integer memory location, there will be no "intermediate garbage" read by one thread if another thread is in the middle of a write.
But, you might run into problems because of compiler reordering and instruction reordering.
With optimizations enabled, gcc reorders code. Gcc usually doesn't reorder code when global variables or function calls are involved since gcc can't guarantee the same outcome. Volatile might act as a hint to gcc wrt reordering, but I don't know. If you do run into reordering problems, this will act as a general purpose compiler barrier for gcc:
__asm__ __volatile__ ("" ::: "memory");
Even if the compiler doesn't reorder code, the CPU constantly reorders instructions during execution. Here is a very good article on the subject. A "memory barrier" is used to prevent the cpu from reordering instructions over a barrier. Here is one possible way to make a memory barrier using gcc:
__sync_synchronize();
You can also execute asm instructions to do different kinds of barriers.
That said, we read and write global integers without using atomic operations or mutexes from multiple threads and have no problems. This is most likely because A) we run on Intel and Intel does not reorder instructions aggressively and B) there is enough code executing before we do something "bad" with an early read of a flag. Also in our favor is the fact that a lot of system calls have barriers and the gcc atomic operations are barriers. We use a lot of atomic operations.
Here is a good discussion in stack overflow of a similar question.

Resources