What's the purpose of glib's g_atomic_int_get? - c

glib a provides g_atomic_int_get function to atomically read a standard C int type. Isn't reading 32-bit integers from memory into registers not already guaranteed to be an atomic operation by the processor (e.g. mov <reg32>, <mem>)?
If yes, then what's the purpose of glib's g_atomic_int_get function?

Some processors allow reading unaligned data, but that may take more than a single cycle. I.e. it's no longer atomic. On others it might not be an atomic operation at all to begin with.

The x86 mov instruction is not always atomic, either: it is non-atomic if the addresses involved are not naturally aligned.
Even if it were always atomic, it is not a memory barrier, which means the compiler is free to reorder the instruction with reference to other instructions nearby; and the processor is free to reorder the instruction with reference to other instructions in the instruction stream at runtime.
Unless you are writing code targeting only a single platform (and are sure that code will never need to be ported to another platform), you must always use explicit atomic instructions if you want atomic guarantees.

Related

Is assigning a pointer in C program considered atomic on x86-64

https://www.gnu.org/software/libc/manual/html_node/Atomic-Types.html#Atomic-Types says - In practice, you can assume that int is atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these assumptions are true on all of the machines that the GNU C Library supports and on all POSIX systems we know of.
My question is whether pointer assignment can be considered atomic on x86_64 architecture for a C program compiled with gcc m64 flag. OS is 64bit Linux and CPU is Intel(R) Xeon(R) CPU D-1548. One thread will be setting a pointer and another thread accessing the pointer. There is only one writer thread and one reader thread. Reader should either be getting the previous value of the pointer or the latest value and no garbage value in between.
If it is not considered atomic, please let me know how can I use the gcc atomic builtins or maybe memory barrier like __sync_synchronize to achieve the same without using locks. Interested only in C solution and not C++. Thanks!
Bear in mind that atomicity alone is not enough for communicating between threads. Nothing prevents the compiler and CPU from reordering previous/subsequent load and store instructions with that "atomic" store. In old days people used volatile to prevent that reordering but that was never intended for use with threads and doesn't provide means to specify less or more restrictive memory order (see "Relationship with volatile" in there).
You should use C11 atomics because they guarantee both atomicity and memory order.
For almost all architectures, pointer load and store are atomic. A once notable exception was 8086/80286 where pointers could be seg:offset; there was an l[des]s instruction which could make an atomic load; but no corresponding atomic store.
The integrity of the pointer is only a small concern; your bigger issue revolves around synchronization: the pointer was at value Y, you set it to X; how will you know when nobody is using the (old) Y value?
A somewhat related problem is that you may have stored things at X, which the other thread expects to find. Without synchronization, other might see the new pointer value, however what it points to might not be up to date yet.
A plain global char *ptr should not be considered atomic. It might work sometimes, especially with optimization disabled, but you can get the compiler to make safe and efficient optimized asm by using modern language features to tell it you want atomicity.
Use C11 stdatomic.h or GNU C __atomic builtins. And see Why is integer assignment on a naturally aligned variable atomic on x86? - yes the underlying asm operations are atomic "for free", but you need to control the compiler's code-gen to get sane behaviour for multithreading.
See also LWN: Who's afraid of a big bad optimizing compiler? - weird effects of using plain vars include several really bad well-known things, but also more obscure stuff like invented loads, reading a variable more than once if the compiler decides to optimize away a local tmp and load the shared var twice, instead of loading it into a register. Using asm("" ::: "memory") compiler barriers may not be sufficient to defeat that depending on where you put them.
So use proper atomic stores and loads that tell the compiler what you want: You should generally use atomic loads to read them, too.
#include <stdatomic.h> // C11 way
_Atomic char *c11_shared_var; // all access to this is atomic, functions needed only if you want weaker ordering
void foo(){
atomic_store_explicit(&c11_shared_var, newval, memory_order_relaxed);
}
char *plain_shared_var; // GNU C
// This is a plain C var. Only specific accesses to it are atomic; be careful!
void foo() {
__atomic_store_n(&plain_shared_var, newval, __ATOMIC_RELAXED);
}
Using __atomic_store_n on a plain var is the functionality that C++20 atomic_ref exposes. If multiple threads access a variable for the entire time that it needs to exist, you might as well just use C11 stdatomic because every access needs to be atomic (not optimized into a register or whatever). When you want to let the compiler load once and reuse that value, do char *tmp = c11_shared_var; (or atomic_load_explicit if you only want acquire instead of seq_cst; cheaper on a few non-x86 ISAs).
Besides lack of tearing (atomicity of asm load or store), the other key parts of _Atomic foo * are:
The compiler will assume that other threads may have changed memory contents (like volatile effectively implies), otherwise the assumption of no data-race UB will let the compiler hoist loads out of loops. Without this, dead-store elimination might only do one store at the end of a loop, not updating the value multiple times.
The read side of the problem is usually what bites people in practice, see Multithreading program stuck in optimized mode but runs normally in -O0 - e.g. while(!flag){} becomes if(!flag) infinite_loop; with optimization enabled.
Ordering wrt. other code. e.g. you can use memory_order_release to make sure that other threads that see the pointer update also see all changes to the pointed-to data. (On x86 that's as simple as compile-time ordering, no extra barriers needed for acquire/release, only for seq_cst. Avoid seq_cst if you can; mfence or locked operations are slow.)
Guarantee that the store will compile to a single asm instruction. You'd be depending on this. It does happen in practice with sane compilers, although it's conceivable that a compiler might decide to use rep movsb to copy a few contiguous pointers, and that some machine somewhere might have a microcoded implementation that does some stores narrower than 8 bytes.
(This failure mode is highly unlikely; the Linux kernel relies on volatile load/store compiling to a single instruction with GCC / clang for its hand-rolled intrinsics. But if you just used asm("" ::: "memory") to make sure a store happened on a non-volatile variable, there's a chance.)
Also, something like ptr++ will compile to an atomic RMW operation like lock add qword [mem], 4, rather than separate load and store like volatile would. (See Can num++ be atomic for 'int num'? for more about atomic RMWs). Avoid that if you don't need it, it's slower. e.g. atomic_store_explicit(&ptr, ptr + 1, mo_release); - seq_cst loads are cheap on x86-64 but seq_cst stores aren't.
Also note that memory barriers can't create atomicity (lack of tearing), they can only create ordering wrt other ops.
In practice x86-64 ABIs do have alignof(void*) = 8 so all pointer objects should be naturally aligned (except in a __attribute__((packed)) struct which violates the ABI, so you can use __atomic_store_n on them. It should compile to what you want (plain store, no overhead), and meet the asm requirements to be atomic.
See also When to use volatile with multi threading? - you can roll your own atomics with volatile and asm memory barriers, but don't. The Linux kernel does that, but it's a lot of effort for basically no gain, especially for a user-space program.
Side note: an often repeated misconception is that volatile or _Atomic are needed to avoid reading stale values from cache. This is not the case.
All machines that run C11 threads across multiple cores have coherent caches, not needing explicit flush instructions in the reader or writer. Just ordinary load or store instructions, like x86 mov. The key is to not let the compiler keep values of shared variable in CPU registers (which are thread-private). It normally can do this optimization because of the assumption of no data-race Undefined Behaviour. Registers are very much not the same thing as L1d CPU cache; managing what's in registers vs. memory is done by the compiler, while hardware keeps cache in sync. See When to use volatile with multi threading? for more details about why coherent caches is sufficient to make volatile work like memory_order_relaxed.
See Multithreading program stuck in optimized mode but runs normally in -O0 for an example.
"Atomic" is treated as this quantum state where something can be both atomic and not atomic at the same time because "it's possible" that "some machines" "somewhere" "might not" write "a certain value" atomically. Maybe.
That is not the case. Atomicity has a very specific meaning, and it solves a very specific problem: threads being pre-empted by the OS to schedule another thread in its place on that core. And you cannot stop a thread from executing mid-assembly instruction.
What that means is that any single assembly instruction is "atomic" by definition. And since you have registry moving instructions, any register-sized copy is atomic by definition. That means a 32-bit integer on a 32-bit CPU, and a 64-bit integer on a 64-bit CPU are all atomic -- and of course that includes pointers (ignore all the people who will tell you "some architectures" have pointers of "different size" than registers, that hasn't been the case since 386).
You should however be careful not to hit variable caching problems (ie one thread writing a pointer, and another trying to read it but getting an old value from the cache), use volatile as needed to prevent this.

Read and Write atomic operation implementation in the Linux Kernel

Recently I've peeked into the Linux kernel implementation of an atomic read and write and a few questions came up.
First the relevant code from the ia64 architecture:
typedef struct {
int counter;
} atomic_t;
#define atomic_read(v) (*(volatile int *)&(v)->counter)
#define atomic64_read(v) (*(volatile long *)&(v)->counter)
#define atomic_set(v,i) (((v)->counter) = (i))
#define atomic64_set(v,i) (((v)->counter) = (i))
For both read and write operations, it seems that the direct approach was taken to read from or write to the variable. Unless there is another trick somewhere, I do not understand what guarantees exist that this operation will be atomic in the assembly domain. I guess an obvious answer will be that such an operation translates to one assembly opcode, but even so, how is that guaranteed when taking into account the different memory cache levels (or other optimizations)?
On the read macros, the volatile type is used in a casting trick. Anyone has a clue how this affects the atomicity here? (Note that it is not used in the write operation)
I think you are misunderstanding the (very much vague) usage of the word "atomic" and "volatile" here. Atomic only really means that the words will be read or written atomically (in one step, and guaranteeing that the contents of this memory position will always be one write or the other, and not something in between). And the volatile keyword tells the compiler to never assume the data in that location due to an earlier read/write (basically, never optimize away the read).
What the words "atomic" and "volatile" do NOT mean here is that there's any form of memory synchronization. Neither implies ANY read/write barriers or fences. Nothing is guaranteed with regards to memory and cache coherence. These functions are basically atomic only at the software level, and the hardware can optimize/lie however it deems fit.
Now as to why simply reading is enough: the memory models for each architecture are different. Many architectures can guarantee atomic reads or writes for data aligned to a certain byte offset, or x words in length, etc. and vary from CPU to CPU. The Linux kernel contains many defines for the different architectures that let it do without any atomic calls (CMPXCHG, basically) on platforms that guarantee (sometimes even only in practice even if in reality their spec says the don't actually guarantee) atomic reads/writes.
As for the volatile, while there is no need for it in general unless you're accessing memory-mapped IO, it all depends on when/where/why the atomic_read and atomic_write macros are being called. Many compilers will (though it is not set in the C spec) generate memory barriers/fences for volatile variables (GCC, off the top of my head, is one. MSVC does for sure.). While this would normally mean that all reads/writes to this variable are now officially exempt from just about any compiler optimizations, in this case by creating a "virtual" volatile variable only this particular instance of a read/write is off-limits for optimization and re-ordering.
The reads are atomic on most major architectures, so long as they are aligned to a multiple of their size (and aren't bigger than the read size of a give type), see the Intel Architecture manuals. Writes on the other hand many be different, Intel states that under x86, single byte write and aligned writes may be atomic, under IPF (IA64), everything use acquire and release semantics, which would make it guaranteed atomic, see this.
the volatile prevents the compiler from caching the value locally, forcing it to be retrieve where ever there is access to it.
If you write for a specific architecture, you can make assumptions specific to it.
I guess IA-64 does compile these things to a single instruction.
The cache shouldn't be an issue, unless the counter crosses a cache line boundry. But if 4/8 byte alignment is required, this can't happen.
A "real" atomic instruction is required when a machine instruction translates into two memory accesses. This is the case for increments (read, increment, write) or compare&swap.
volatile affects the optimizations the compiler can do.
For example, it prevents the compiler from converting multiple reads into one read.
But on the machine instruction level, it does nothing.

Are assignment = and subtraction assignment -= atomic operations in C?

int b = 1000;
b -= 20;
Is any of the above an atomic operation? What is an atomic operation in C?
It depends on the implementation. By the standard, nothing is atomic in C. If you need atomic ops you can look at your compiler's builtins.
It is architecture/implementation dependent.
If you want atomic operations, I think sig_atomic_t type is standardized by C99, but not sure.
From the GNU LibC docs:
In practice, you can assume that int and other integer types no longer than int are atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these are true on all of the machines that the GNU C library supports, and on all POSIX systems we know of.
This link seems to me to be on the right track in telling us what an atomic operation is in C:
http://odetocode.com/blogs/scott/archive/2006/05/17/atomic-operations.aspx
And it says, "...computer science adopted the term 'atomic operation' to describe an instruction that is indivisible and uninterruptible by other threads of execution."
And by that definition, the first line of code in the original question
int b=1000;
b-=20;
ought to be an atomic operation. The second could be an atomic operation if the CPU's instruction set includes an instruction to subtract directly from memory. The reason I think so is that the first code line would most likely require only one assembly (machine) instruction. And instructions either execute or not. I don't think any machine instruction can be interrupted in the middle.
At that same link, says, "If thread A is writing a 32-bit value to memory as an atomic operation, thread B will never be able to read the memory location and see only the first 16 of 32 bits written out." Seems that any single machine instruction cannot be interrupted in the middle, therefore would automatically be atomic between threads.
Incrementing and decrementing a number is not an atomic operation in C. Certain architectures support atomic incrementing and decrementing instructions, but there is no guarantee that the compiler would use them. You can look as an example at Qt reference counting. It uses atomic reference counting, on certain platforms it is implemented with platform-specific assembly code, and on the rest it is using a mutex to lock the counter.
If you're not incrementing or decrementing in a performance-critical part of your code, you'd simply use a mutex while doing it. If you're using it in performance-critical part of your code, you might want to try to rewrite your code in a way that doesn't use shared memory for this operation accessed from multiple places for this operation or use mutexes with higher granularity so that they don't affect the performance, or use assembly to ensure that the operation is atomic.
Quoting from ISO C89, 7.7 Signal handling <signal.h>
The type defined is sig_atomic_t which is the integral type of an
object that can be accessed as an atomic entity, even in the presence
of asynchronous interrupts.

Are memory barriers necessary for atomic reference counting shared immutable data?

I have some immutable data structures that I would like to manage using reference counts, sharing them across threads on an SMP system.
Here's what the release code looks like:
void avocado_release(struct avocado *p)
{
if (atomic_dec(p->refcount) == 0) {
free(p->pit);
free(p->juicy_innards);
free(p);
}
}
Does atomic_dec need a memory barrier in it? If so, what kind of memory barrier?
Additional notes: The application must run on PowerPC and x86, so any processor-specific information is welcomed. I already know about the GCC atomic builtins. As for immutability, the refcount is the only field that changes over the duration of the object.
On x86, it will turn into a lock prefixed assembly instruction, like LOCK XADD.
Being a single instruction, it is non-interruptible. As an added "feature", the lock prefix results in a full memory barrier:
"...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)." ..."Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor." - Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 8.1.2.
A memory barrier is in fact implemented as a dummy LOCK OR or LOCK AND in both the .NET and the JAVA JIT on x86/x64, because mfence is slower on many CPUs even when it's guaranteed to be available, like in 64-bit mode. (Does lock xchg have the same behavior as mfence?)
So you have a full fence on x86 as an added bonus, whether you like it or not. :-)
On PPC, it is different. An LL/SC pair - lwarx & stwcx - with a subtraction inside can be used to load the memory operand into a register, subtract one, then either write it back if there was no other store to the target location, or retry the whole loop if there was. An LL/SC can be interrupted (meaning it will fail and retry).
It also does not mean an automatic full fence.
This does not however compromise the atomicity of the counter in any way.
It just means that in the x86 case, you happen to get a fence as well, "for free".
On PPC, one can insert a (partial or) full fence by emitting a (lw)sync instruction.
All in all, explicit memory barriers are not necessary for the atomic counter to work properly.
It is important to distinguish between atomic accesses (which guarantee that the read/modify/write of the value executes as one atomic unit) vs. memory reordering.
Memory barriers prevent reordering of reads and writes. Reordering is completely orthogonal to atomicity. For instance, on PowerPC if you implement the most efficient atomic increment possible then it will not prevent reordering. If you want to prevent reordering then you need an lwsync or sync instruction, or some equivalent high-level (C++ 11?) memory barrier.
Claims that there is "no possibility of the compiler reordering things in a problematic way" seem naive as general statements because compiler optimizations can be quite surprising and because CPUs (PowerPC/ARM/Alpha/MIPS in particular) aggressively reorder memory operations.
A coherent cache doesn't save you either. See https://preshing.com/archives/ to see how memory reordering really works.
In this case, however, I believe the answer is that no barriers are required. That is because for this specific case (reference counting) there is no need for a relationship between the reference count and the other values in the object. The one exception is when the reference count hits zero. At that point it is important to ensure that all updates from other threads are visible to the current thread so a read-acquire barrier may be necessary.
Are you intending to implement your own atomic_dec or are you just wondering whether a system-supplied function will behave as you want?
As a general rule, system-supplied atomic increment/decrement facilities will apply whatever memory barriers are required to just do the right thing. You generally don't have to worry about memory barriers unless you are doing something wacky like implementing your own lock-free data structures or an STM library.

In C is "i+=1;" atomic? [duplicate]

This question already has answers here:
Can num++ be atomic for 'int num'?
(13 answers)
Closed 1 year ago.
In C, is i+=1; atomic?
The C standard does not define whether it is atomic or not.
In practice, you never write code which fails if a given operation is atomic, but you might well write code which fails if it isn't. So assume it isn't.
No.
The only operation guaranteed by the C language standard to be atomic is assigning or retrieving a value to/from a variable of type sig_atomic_t, defined in <signal.h>.
(C99, chapter 7.14 Signal handling.)
Defined in C, no. In practice, maybe. Write it in assembly.
The standard make no guarantees.
Therefore a portable program would not make the assumption. It's not clear if you mean "required to be atomic", or "happens to be atomic in my C code", and the answer to that second question is that it depends on a lot of things:
Not all machines even have an increment memory op. Some need to load and store the value in order to operate on it, so the answer there is "never".
On machines that do have an increment memory op, there is no assurance that the compiler will not output a load, increment, and store sequence anyway, or use some other non-atomic instruction.
On machines that do have an increment memory operation, it may or may not be atomic with respect to other CPU units.
On machines that do have an atomic increment memory op, it may not be specified as part of the architecture, but just a property of a particular edition of the CPU chip, or even just of certain core logic or motherboard designs.
As to "how do I do this atomically", there is generally a way to do this quickly rather than resort to (more expensive) negotiated mutual exclusion. Sometimes this involves special collision-detecting repeatable code sequences. It's best to implement these in an assembly language module, because it's target-specific anyway so there is no portability benefit to the HLL.
Finally, because atomic operations that do not require (expensive) negotiated mutual exclusion are fast and hence useful, and in any case needed for portable code, systems typically have a library, generally written in assembly, that already implements similar functions.
Whether the expression is atomic or not depends only on the machine code that the compiler generates, and the CPU architectre that it will run on. Unless the addition can be achieved in one machine instruction, its unlikely to be atomic.
If you are using Windows then you can use the InterlockedIncrement() API function to do a guaranteed atomic increment. There are similar functions for decrement, etc.
Although i may not be atomic for the C language, it should be noted that is atomic on most platforms. The GNU C Library documentation states:
In practice, you can assume that int and other integer types no longer than int are atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these assumptions are true on all of the machines that the GNU C library supports and on all POSIX systems we know of.
It really depends on your target and the mnemonic set of your uC/processor.
If i is a variable held in a register then it is possible to have it atomic.
No, it isn't. If the value of i is not loaded to one of the registers already, it cannot be done in one single assembly instruction.
The C / C++ language itself makes no claim of atomicity or lack thereof. You need to rely on intrinsics or library functions to ensure atomic behavior.
Just put a mutex or a semaphore around it. Of course it is not atomic and you can make a test program with 50 or so threads accessing the same variable and incrementing it, to check it for yourself
Not usually.
If i is volatile, then it would depend on your CPU architecure and compiler - if adding two integers in main memory is atomic on your CPU, then that C statement might be atomic with a volatile int i.
No, the C standard doesn't guarantee atomicity, and in practice, the operation won't be atomic. You have to use a library (eg the Windows API) or compiler builtin functions (GCC, MSVC).
The answer to your question depends on whether i is a local, static, or global variable. If i is a static or global variable, then no, the statement i += 1 is not atomic. If, however, i is a local variable, then the statement is atomic for modern operating systems running on the x86 architecture and probably other architectures as well. #Dan Cristoloveanu was on the right track for the local variable case, but there is also more that can be said.
(In what follows, I assume a modern operating system with protection on an x86 architecture with threading entirely implemented with task switching.)
Given that this is C code, the syntax i += 1 implies that i is some kind of integer variable, the value of which, if it is a local variable, is stored in either a register such as %eax or in the stack. Handling the easy case first, if the value of i is stored in a register, say %eax, then the C compiler will most likely translate the statement to something like:
addl $1, %eax
which of course is atomic because no other process/thread should be able to modify the running thread's %eax register, and the thread itself cannot modify %eax again until this instruction completes.
If the value of i is stored in the stack, then this means that there is a memory fetch, increment, and commit. Something like:
movl -16(%esp), %eax
addl $1, %eax
movl %eax, -16(%esp) # this is the commit. It may actually come later if `i += 1` is part of a series of calculations involving `i`.
Normally this series of operations is not atomic. However, on a modern operating system, processes/threads should not be able to modify another thread's stack, so these operations do complete without other processes being able to interfere. Thus, the statement i += 1 is atomic in this case as well.

Resources