In C is "i+=1;" atomic? [duplicate] - c

This question already has answers here:
Can num++ be atomic for 'int num'?
In C, is i+=1; atomic?

The C standard does not define whether it is atomic or not.
In practice, you never write code which fails if a given operation is atomic, but you might well write code which fails if it isn't. So assume it isn't.

No.
The only operation guaranteed by the C language standard to be atomic is assigning or retrieving a value to/from a variable of type volatile sig_atomic_t, defined in <signal.h>.
(C99, chapter 7.14 Signal handling.)
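A minimal sketch of that one guarantee, a flag shared with a signal handler (illustrative only; the names are mine):

#include <signal.h>
#include <stdio.h>

/* The guarantee applies to volatile sig_atomic_t: the handler's write
 * is atomic with respect to the interrupted thread. */
static volatile sig_atomic_t got_signal = 0;

static void handler(int sig)
{
    (void)sig;
    got_signal = 1;
}

int main(void)
{
    signal(SIGINT, handler);
    while (!got_signal)
        ;                       /* spin until the handler fires */
    puts("signal received");
    return 0;
}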

Defined in C, no. In practice, maybe. Write it in assembly.
The standard makes no guarantees.
Therefore a portable program should not make the assumption. It's not clear whether you mean "required to be atomic" or "happens to be atomic in my C code", and the answer to that second question depends on a lot of things:
Not all machines even have an increment memory op. Some need to load and store the value in order to operate on it, so the answer there is "never".
On machines that do have an increment memory op, there is no assurance that the compiler will not output a load, increment, and store sequence anyway, or use some other non-atomic instruction.
On machines that do have an increment memory operation, it may or may not be atomic with respect to other CPU units.
On machines that do have an atomic increment memory op, it may not be specified as part of the architecture, but just a property of a particular edition of the CPU chip, or even just of certain core logic or motherboard designs.
As to "how do I do this atomically", there is generally a way to do this quickly rather than resort to (more expensive) negotiated mutual exclusion. Sometimes this involves special collision-detecting repeatable code sequences. It's best to implement these in an assembly language module, because it's target-specific anyway so there is no portability benefit to the HLL.
Finally, because atomic operations that do not require (expensive) negotiated mutual exclusion are fast and hence useful, and in any case needed for portable code, systems typically have a library, generally written in assembly, that already implements similar functions.
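Since C11, the standard library itself provides such operations portably, so the hand-written assembly module is rarely needed anymore. A minimal sketch, assuming a C11 compiler:

#include <stdatomic.h>

atomic_int counter;   /* equivalent to _Atomic int; zero-initialized at file scope */

void increment(void)
{
    /* An atomic read-modify-write: compiles to e.g. lock add/xadd on x86,
     * or an LL/SC retry loop on architectures without such an instruction. */
    atomic_fetch_add(&counter, 1);
}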

Whether the expression is atomic or not depends only on the machine code that the compiler generates and the CPU architecture it will run on. Unless the addition can be achieved in one machine instruction, it's unlikely to be atomic.
If you are using Windows then you can use the InterlockedIncrement() API function to do a guaranteed atomic increment. There are similar functions for decrement, etc.
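A minimal sketch of that API (assuming <windows.h>; the counter name is mine):

#include <windows.h>

volatile LONG counter = 0;

void increment(void)
{
    /* Atomically increments and returns the new value. */
    InterlockedIncrement(&counter);
}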

Although access to i may not be atomic as far as the C language is concerned, it should be noted that it is atomic on most platforms. The GNU C Library documentation states:
In practice, you can assume that int and other integer types no longer than int are atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these assumptions are true on all of the machines that the GNU C library supports and on all POSIX systems we know of.

It really depends on your target and the instruction set of your microcontroller/processor.
If i is a variable held in a register, then it is possible for the increment to be atomic.

No, it isn't. If the value of i is not already loaded into a register, the increment cannot be done in one single assembly instruction on a load/store architecture, and even a single read-modify-write instruction is not necessarily atomic on a multi-core system.

The C / C++ language itself makes no claim of atomicity or lack thereof. You need to rely on intrinsics or library functions to ensure atomic behavior.

Just put a mutex or a semaphore around it. Of course it is not atomic; you can write a test program with 50 or so threads accessing and incrementing the same variable to check it for yourself.
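A sketch of such a test program using POSIX threads (thread and iteration counts are illustrative). The final count usually falls short of the expected total, demonstrating that i += 1 is not atomic:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 50
#define NITERS   100000

static int counter = 0;         /* deliberately unsynchronized */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++)
        counter += 1;           /* load, add, store: increments can be lost */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("expected %d, got %d\n", NTHREADS * NITERS, counter);
    return 0;
}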

Not usually.
If i is volatile, then it would depend on your CPU architecture and compiler - if adding two integers in main memory is atomic on your CPU, then that C statement might be atomic with a volatile int i.

No, the C standard doesn't guarantee atomicity, and in practice the operation won't be atomic. You have to use a library (e.g. the Windows API) or compiler builtin functions (GCC, MSVC).

The answer to your question depends on whether i is a local, static, or global variable. If i is a static or global variable, then no, the statement i += 1 is not atomic. If, however, i is a local variable, then the statement is atomic for modern operating systems running on the x86 architecture and probably other architectures as well. @Dan Cristoloveanu was on the right track for the local variable case, but there is also more that can be said.
(In what follows, I assume a modern operating system with protection on an x86 architecture with threading entirely implemented with task switching.)
Given that this is C code, the syntax i += 1 implies that i is some kind of integer variable, the value of which, if it is a local variable, is stored in either a register such as %eax or in the stack. Handling the easy case first, if the value of i is stored in a register, say %eax, then the C compiler will most likely translate the statement to something like:
addl $1, %eax
which of course is atomic because no other process/thread should be able to modify the running thread's %eax register, and the thread itself cannot modify %eax again until this instruction completes.
If the value of i is stored in the stack, then this means that there is a memory fetch, increment, and commit. Something like:
movl -16(%esp), %eax
addl $1, %eax
movl %eax, -16(%esp) # this is the commit. It may actually come later if `i += 1` is part of a series of calculations involving `i`.
Normally this series of operations is not atomic. However, on a modern operating system other processes cannot touch this thread's memory, and other threads of the same process have no reason to touch this thread's stack unless the address of i has been shared with them, so in practice these operations complete without interference. Thus, the statement i += 1 is effectively atomic in this case as well.

Related

Is assigning a pointer in C program considered atomic on x86-64

https://www.gnu.org/software/libc/manual/html_node/Atomic-Types.html#Atomic-Types says - In practice, you can assume that int is atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these assumptions are true on all of the machines that the GNU C Library supports and on all POSIX systems we know of.
My question is whether pointer assignment can be considered atomic on the x86_64 architecture for a C program compiled with the gcc -m64 flag. The OS is 64-bit Linux and the CPU is an Intel(R) Xeon(R) CPU D-1548. One thread will be setting a pointer and another thread accessing the pointer. There is only one writer thread and one reader thread. The reader should get either the previous value of the pointer or the latest value, and no garbage value in between.
If it is not considered atomic, please let me know how I can use the gcc atomic builtins or maybe a memory barrier like __sync_synchronize to achieve the same without using locks. Interested only in a C solution and not C++. Thanks!
Bear in mind that atomicity alone is not enough for communicating between threads. Nothing prevents the compiler and CPU from reordering previous/subsequent load and store instructions with that "atomic" store. In old days people used volatile to prevent that reordering but that was never intended for use with threads and doesn't provide means to specify less or more restrictive memory order (see "Relationship with volatile" in there).
You should use C11 atomics because they guarantee both atomicity and memory order.
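A minimal sketch of that approach for the pointer case (names are hypothetical; release/acquire pairing shown so a reader that sees the new pointer also sees the pointed-to data):

#include <stdatomic.h>

struct data { int payload; };

static struct data *_Atomic shared_ptr;   /* atomic pointer: qualifier after the * */

/* Writer: the release store makes the initialization of *d visible to
 * any reader whose acquire load observes the new pointer. */
void publish(struct data *d)
{
    atomic_store_explicit(&shared_ptr, d, memory_order_release);
}

/* Reader: pairs with the release store above; returns either the old
 * or the new pointer, never a torn mixture. */
struct data *take_snapshot(void)
{
    return atomic_load_explicit(&shared_ptr, memory_order_acquire);
}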
For almost all architectures, pointer load and store are atomic. A once notable exception was 8086/80286, where pointers could be seg:offset; there were the lds/les/lss instructions, which could make an atomic load, but no corresponding atomic store.
The integrity of the pointer is only a small concern; your bigger issue revolves around synchronization: the pointer was at value Y, you set it to X; how will you know when nobody is using the (old) Y value?
A somewhat related problem is that you may have stored things at X which the other thread expects to find. Without synchronization, the other thread might see the new pointer value, but what it points to might not be up to date yet.
A plain global char *ptr should not be considered atomic. It might work sometimes, especially with optimization disabled, but you can get the compiler to make safe and efficient optimized asm by using modern language features to tell it you want atomicity.
Use C11 stdatomic.h or GNU C __atomic builtins. And see Why is integer assignment on a naturally aligned variable atomic on x86? - yes the underlying asm operations are atomic "for free", but you need to control the compiler's code-gen to get sane behaviour for multithreading.
See also LWN: Who's afraid of a big bad optimizing compiler? - weird effects of using plain vars include several really bad well-known things, but also more obscure stuff like invented loads, reading a variable more than once if the compiler decides to optimize away a local tmp and load the shared var twice, instead of loading it into a register. Using asm("" ::: "memory") compiler barriers may not be sufficient to defeat that depending on where you put them.
So use proper atomic stores and loads that tell the compiler what you want: You should generally use atomic loads to read them, too.
#include <stdatomic.h>          // C11 way

char *_Atomic c11_shared_var;   // atomic pointer (note the qualifier placement:
                                // char *_Atomic is an atomic pointer to char).
                                // All access to this is atomic; the _explicit
                                // functions are needed only for weaker ordering.

void set_c11(char *newval) {
    atomic_store_explicit(&c11_shared_var, newval, memory_order_relaxed);
}

char *plain_shared_var;         // GNU C: a plain C var. Only specific accesses
                                // to it are atomic; be careful!

void set_gnu(char *newval) {
    __atomic_store_n(&plain_shared_var, newval, __ATOMIC_RELAXED);
}
Using __atomic_store_n on a plain var is the functionality that C++20 atomic_ref exposes. If multiple threads access a variable for the entire time that it needs to exist, you might as well just use C11 stdatomic because every access needs to be atomic (not optimized into a register or whatever). When you want to let the compiler load once and reuse that value, do char *tmp = c11_shared_var; (or atomic_load_explicit if you only want acquire instead of seq_cst; cheaper on a few non-x86 ISAs).
Besides lack of tearing (atomicity of the asm load or store), the other key parts of an _Atomic pointer are:
The compiler will assume that other threads may have changed memory contents (like volatile effectively implies), otherwise the assumption of no data-race UB will let the compiler hoist loads out of loops. Without this, dead-store elimination might only do one store at the end of a loop, not updating the value multiple times.
The read side of the problem is usually what bites people in practice, see Multithreading program stuck in optimized mode but runs normally in -O0 - e.g. while(!flag){} becomes if(!flag) infinite_loop; with optimization enabled (a minimal sketch follows this list).
Ordering wrt. other code. e.g. you can use memory_order_release to make sure that other threads that see the pointer update also see all changes to the pointed-to data. (On x86 that's as simple as compile-time ordering, no extra barriers needed for acquire/release, only for seq_cst. Avoid seq_cst if you can; mfence or locked operations are slow.)
Guarantee that the store will compile to a single asm instruction. You'd be depending on this. It does happen in practice with sane compilers, although it's conceivable that a compiler might decide to use rep movsb to copy a few contiguous pointers, and that some machine somewhere might have a microcoded implementation that does some stores narrower than 8 bytes.
(This failure mode is highly unlikely; the Linux kernel relies on volatile load/store compiling to a single instruction with GCC / clang for its hand-rolled intrinsics. But if you just used asm("" ::: "memory") to make sure a store happened on a non-volatile variable, there's a chance.)
Also, something like ptr++ will compile to an atomic RMW operation like lock add qword [mem], 4, rather than separate load and store like volatile would. (See Can num++ be atomic for 'int num'? for more about atomic RMWs). Avoid that if you don't need it, it's slower. e.g. atomic_store_explicit(&ptr, ptr + 1, memory_order_release); - seq_cst loads are cheap on x86-64 but seq_cst stores aren't.
Also note that memory barriers can't create atomicity (lack of tearing), they can only create ordering wrt other ops.
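To illustrate the read-side point above, a minimal sketch of a spin flag (names are mine):

#include <stdatomic.h>
#include <stdbool.h>

static _Atomic bool flag;   /* with a plain bool, the compiler may hoist the
                               load and turn the loop into if(!flag) for(;;); */

void wait_for_flag(void)
{
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;   /* a fresh load every iteration; acquire orders later reads */
}

void set_flag(void)
{
    atomic_store_explicit(&flag, true, memory_order_release);
}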
In practice x86-64 ABIs do have alignof(void*) == 8, so all pointer objects should be naturally aligned (except in a __attribute__((packed)) struct, which violates the ABI), so you can use __atomic_store_n on them. It should compile to what you want (a plain store, no overhead) and meet the asm requirements to be atomic.
See also When to use volatile with multi threading? - you can roll your own atomics with volatile and asm memory barriers, but don't. The Linux kernel does that, but it's a lot of effort for basically no gain, especially for a user-space program.
Side note: an often repeated misconception is that volatile or _Atomic are needed to avoid reading stale values from cache. This is not the case.
All machines that run C11 threads across multiple cores have coherent caches, not needing explicit flush instructions in the reader or writer. Just ordinary load or store instructions, like x86 mov. The key is to not let the compiler keep values of shared variables in CPU registers (which are thread-private). It normally can do this optimization because of the assumption of no data-race Undefined Behaviour. Registers are very much not the same thing as L1d CPU cache; managing what's in registers vs. memory is done by the compiler, while hardware keeps cache in sync. See When to use volatile with multi threading? for more details about why coherent caches are sufficient to make volatile work like memory_order_relaxed.
See Multithreading program stuck in optimized mode but runs normally in -O0 for an example.
"Atomic" is treated as this quantum state where something can be both atomic and not atomic at the same time because "it's possible" that "some machines" "somewhere" "might not" write "a certain value" atomically. Maybe.
That is not the case. Atomicity has a very specific meaning, and it solves a very specific problem: threads being pre-empted by the OS to schedule another thread in its place on that core. And you cannot stop a thread from executing mid-assembly instruction.
What that means is that any single assembly instruction is "atomic" by definition. And since you have register move instructions, any register-sized copy is atomic by definition. That means a 32-bit integer on a 32-bit CPU, and a 64-bit integer on a 64-bit CPU, are all atomic -- and of course that includes pointers (ignore all the people who will tell you "some architectures" have pointers of "different size" than registers, that hasn't been the case since the 386).
You should however be careful not to hit variable caching problems (i.e. one thread writing a pointer, and another trying to read it but getting an old value from the cache); use volatile as needed to prevent this.

Does "volatile" guarantee anything at all in portable C code for multi-core systems?

After looking at a bunch of other questions and their answers, I get the impression that there is no widespread agreement on what the "volatile" keyword in C means exactly.
Even the standard itself does not seem to be clear enough for everyone to agree on what it means.
Among other problems:
It seems to provide different guarantees depending on your hardware and depending on your compiler.
It affects compiler optimizations but not hardware optimizations, so on an advanced processor that does its own run-time optimizations, it is not even clear whether the compiler can prevent whatever optimization you want to prevent. (Some compilers do generate instructions to prevent some hardware optimizations on some systems, but this does not appear to be standardized in any way.)
To summarize the problem, it appears (after reading a lot) that "volatile" guarantees something like: The value will be read/written not just from/to a register, but at least to the core's L1 cache, in the same order that the reads/writes appear in the code. But this seems useless, since reading/writing from/to a register is already sufficient within the same thread, while coordinating with L1 cache doesn't guarantee anything further regarding coordination with other threads. I can't imagine when it could ever be important to sync just with L1 cache.
USE 1
The only widely-agreed-upon use of volatile seems to be for old or embedded systems where certain memory locations are hardware-mapped to I/O functions, like a bit in memory that controls (directly, in the hardware) a light, or a bit in memory that tells you whether a keyboard key is down or not (because it is connected by the hardware directly to the key).
It seems that "use 1" does not occur in portable code whose targets include multi-core systems.
USE 2
Not too different from "use 1" is memory that could be read or written at any time by an interrupt handler (which might control a light or store info from a key). But already for this we have the problem that depending on the system, the interrupt handler might run on a different core with its own memory cache, and "volatile" does not guarantee cache coherency on all systems.
So "use 2" seems to be beyond what "volatile" can deliver.
USE 3
The only other undisputed use I see is to prevent mis-optimization of accesses via different variables pointing to the same memory that the compiler doesn't realize is the same memory. But this is probably only undisputed because people aren't talking about it -- I only saw one mention of it. And I thought the C standard already recognized that "different" pointers (like different args to a function) might point to the same item or nearby items, and already specified that the compiler must produce code that works even in such cases. However, I couldn't quickly find this topic in the latest (500 page!) standard.
So "use 3" maybe doesn't exist at all?
Hence my question:
Does "volatile" guarantee anything at all in portable C code for multi-core systems?
EDIT -- update
After browsing the latest standard, it is looking like the answer is at least a very limited yes:
1. The standard repeatedly specifies special treatment for the specific type "volatile sig_atomic_t". However the standard also says that use of the signal function in a multi-threaded program results in undefined behavior. So this use case seems limited to communication between a single-threaded program and its signal handler.
2. The standard also specifies a clear meaning for "volatile" in relation to setjmp/longjmp. (Example code where it matters is given in other questions and answers.)
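For instance, a minimal sketch of that setjmp/longjmp case (my own illustrative example; removing volatile leaves n indeterminate after longjmp):

#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

int main(void)
{
    volatile int n = 0;   /* without volatile, n is indeterminate after longjmp */
    if (setjmp(env) == 0) {
        n = 1;            /* modified between setjmp and longjmp */
        longjmp(env, 1);
    }
    printf("%d\n", n);    /* with volatile, guaranteed to print 1 */
    return 0;
}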
So the more precise question becomes:
Does "volatile" guarantee anything at all in portable C code for multi-core systems, apart from (1) allowing a single-threaded program to receive information from its signal handler, or (2) allowing setjmp code to see variables modified between setjmp and longjmp?
This is still a yes/no question.
If "yes", it would be great if you could show an example of bug-free portable code which becomes buggy if "volatile" is omitted. If "no", then I suppose a compiler is free to ignore "volatile" outside of these two very specific cases, for multi-core targets.
I'm no expert, but cppreference.com has what appears to me to be some pretty good information on volatile. Here's the gist of it:
Every access (both read and write) made through an lvalue expression of volatile-qualified type is considered an observable side effect for the purpose of optimization and is evaluated strictly according to the rules of the abstract machine (that is, all writes are completed at some time before the next sequence point). This means that within a single thread of execution, a volatile access cannot be optimized out or reordered relative to another visible side effect that is separated by a sequence point from the volatile access.
It also gives some uses:
Uses of volatile
1) static volatile objects model memory-mapped I/O ports, and static const volatile objects model memory-mapped input ports, such as a real-time clock;
2) static volatile objects of type sig_atomic_t are used for communication with signal handlers;
3) volatile variables that are local to a function that contains an invocation of the setjmp macro are the only local variables guaranteed to retain their values after longjmp returns;
4) in addition, volatile variables can be used to disable certain forms of optimization, e.g. to disable dead store elimination or constant folding for microbenchmarks.
And of course, it mentions that volatile is not useful for thread synchronization:
Note that volatile variables are not suitable for communication between threads; they do not offer atomicity, synchronization, or memory ordering. A read from a volatile variable that is modified by another thread without synchronization, or concurrent modification from two unsynchronized threads, is undefined behavior due to a data race.
First of all, there have historically been various hiccups regarding different interpretations of the meaning of volatile access and similar. See this study: Volatiles Are Miscompiled, and What to Do about It.
Apart from the various issues mentioned in that study, the behavior of volatile is portable, save for one aspect: its use as a memory barrier. A memory barrier is a mechanism that prevents memory operations from being reordered across it, so that concurrently executing code observes them in the intended order. Using volatile as a memory barrier is certainly not portable.
Whether the C language guarantees memory behavior or not from volatile is apparently arguable, though personally I think the language is clear. First we have the formal definition of side effects, C17 5.1.2.3:
Accessing a volatile object, modifying an object, modifying a file, or calling a function that does any of those operations are all side effects, which are changes in the state of the execution environment.
The standard defines the term sequencing, as a way of determining order of evaluation (execution). The definition is formal and cumbersome:
Sequenced before is an asymmetric, transitive, pair-wise relation between evaluations executed by a single thread, which induces a partial order among those evaluations. Given any two evaluations A and B, if A is sequenced before B, then the execution of A shall precede the execution of B. (Conversely, if A is sequenced before B, then B is sequenced after A.) If A is not sequenced before or after B, then A and B are unsequenced. Evaluations A and B are indeterminately sequenced when A is sequenced either before or after B, but it is unspecified which. The presence of a sequence point between the evaluation of expressions A and B implies that every value computation and side effect associated with A is sequenced before every value computation and side effect associated with B. (A summary of the sequence points is given in annex C.)
The TL;DR of the above is basically: if an expression A contains side effects, it must be done executing before another expression B whenever B is sequenced after A.
Optimizations of C code are made possible through this part:
In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).
This means that the program may evaluate (execute) expressions in the order that the standard mandates elsewhere (order of evaluation etc.), but it need not evaluate (execute) a value if it can deduce that the value is not used. For example, the operation 0 * x doesn't need to evaluate x; the compiler may simply replace the whole expression with 0.
Unless accessing a variable is a side effect. Meaning that if x is volatile, the compiler must evaluate (execute) 0 * x even though the result will always be 0. Optimization is not allowed.
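A tiny sketch of that rule (illustrative only):

volatile int vol_x;
int plain_x;

int f(void)
{
    int a = 0 * plain_x;  /* may be folded to a = 0 with no load of plain_x */
    int b = 0 * vol_x;    /* vol_x must still be read: the volatile access
                             is a side effect, even though b is always 0 */
    return a + b;
}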
Furthermore, the standard speaks of observable behavior:
The least requirements on a conforming implementation are:
Accesses to volatile objects are evaluated strictly according to the rules of the abstract machine.
/--/
This is the observable behavior of the program.
Given all of the above, a conforming implementation (compiler + underlying system) may not execute accesses of volatile objects in an unsequenced order where the semantics of the written C source say otherwise.
This means that in this example
volatile int x;
volatile int y;
int z;

z = x;
z = y;
Both assignment expressions must be evaluated, and z = x; must be evaluated before z = y;. A multi-processor implementation that outsources these two operations to two different unsequenced cores is not conforming!
The dilemma is that compilers can't do much about things like pre-fetch caching and instruction pipelining etc., particularly not when running on top of an OS. And so compilers hand that problem over to the programmers, telling them that memory barriers are now the programmer's responsibility, while the C standard clearly states that the problem needs to be solved by the compiler.
The compiler doesn't necessarily care to solve the problem though, and so volatile for the sake of acting as a memory barrier is non-portable. It has become a quality-of-implementation issue.
To summarize the problem, it appears (after reading a lot) that "volatile" guarantees something like: the value will be read/written not just from/to a register, but at least to the core's L1 cache, in the same order that the reads/writes appear in the code.
No, it absolutely does not. And that makes volatile almost useless for the purpose of MT safe code.
If it did, then volatile would be quite good for variables shared by multiple threads, as ordering the events in the L1 cache is all you need on a typical CPU (that is, either multi-core or multi-CPU on a motherboard) capable of cooperating in a way that makes a normal implementation of C/C++ or Java multithreading possible with the typical expected costs (that is, not a huge cost on most atomic or non-contended mutex operations).
But volatile does not provide any guaranteed ordering (or "memory visibility") in the cache either in theory or in practice.
(Note: the following is based on a sound interpretation of the standard documents, the standard's intent, historical practice, and a deep understanding of the expectations of compiler writers. This approach is based on history, actual practices, and the expectations and understanding of real people in the real world, which is much stronger and more reliable than parsing the words of a document that is not known to be stellar specification writing and which has been revised many times.)
In practice, volatile does guarantee ptrace-ability, that is, the ability to use debug information for the running program at any level of optimization, and the fact that the debug information makes sense for these volatile objects:
you may use ptrace (or a ptrace-like mechanism) to set meaningful breakpoints at the sequence points after operations involving volatile objects: you can really break at exactly these points (note that this works only if you are willing to set many breakpoints, as any C/C++ statement may be compiled to many different assembly start and end points, as in a massively unrolled loop);
while a thread of execution is stopped, you may read the value of all volatile objects, as they have their canonical representation (following the ABI for their respective type); a non-volatile local variable could have an atypical representation, e.g. a shifted one: a variable used for indexing an array might be multiplied by the size of individual objects for easier indexing, or it might be replaced by a pointer to an array element (as long as all uses of the variable are similarly converted) (think changing dx to du in an integral);
you can also modify those objects (as long as the memory mappings allow that, as volatile objects with static lifetime that are const-qualified might be in a memory range mapped read-only).
Volatile guarantees in practice a little more than the strict ptrace interpretation: it also guarantees that volatile automatic variables have an address on the stack, as they aren't allocated to registers, a register allocation which would make ptrace manipulations more delicate (a compiler can output debug information to explain how variables are allocated to registers, but reading and changing register state is slightly more involved than accessing memory addresses).
Note that full program debug-ability, that is, considering all variables volatile at least at sequence points, is provided by the "zero optimization" mode of the compiler, a mode which still performs trivial optimizations like arithmetic simplifications (there is usually no guaranteed no-optimization-at-all mode). But volatile is stronger than non-optimization: x - x can be simplified for a non-volatile integer x but not for a volatile object.
So volatile means guaranteed to be compiled as-is, in the way a compiler's translation of a system call isn't reinterpreted, changed, or optimized in any way. Note that library calls may or may not be system calls. Many official system functions are actually library functions that offer a thin layer of interposition and generally defer to the kernel at the end. (In particular, getpid doesn't need to go to the kernel and could well read a memory location provided by the OS containing the information.)
Volatile interactions are interactions with the outside world of the real machine, which must follow the "abstract machine". They aren't internal interactions of program parts with other program parts. The compiler can only reason about what it knows, that is the internal program parts.
The code generation for a volatile access should follow the most natural interaction with that memory location: it should be unsurprising. That means that some volatile accesses are expected to be atomic: if the natural way to read or write the representation of a long on the architecture is atomic, then it's expected that a read or write of a volatile long will be atomic, as the compiler should not generate silly inefficient code to access volatile objects byte by byte, for example.
You should be able to determine that by knowing the architecture. You don't have to know anything about the compiler, as volatile means that the compiler should be transparent.
But volatile does no more than force the emission of the assembly expected for the least-optimized, general case of a memory operation: volatile semantics means general-case semantics.
The general case is what the compiler does when it doesn't have any information about a construct: e.g. calling a virtual function on an lvalue via dynamic dispatch is a general case; making a direct call to the overrider after determining at compile time the type of the object designated by the expression is a particular case. The compiler always has a general-case handling of all constructs, and it follows the ABI.
Volatile does nothing special to synchronize threads or provide "memory visibility": volatile only provides guarantees at the abstract level seen from inside a thread executing or stopped, that is the inside of a CPU core:
volatile says nothing about which memory operations reach main RAM (you may set specific memory caching types with assembly instructions or system calls to obtain these guarantees);
volatile doesn't provide any guarantee about when memory operations will be committed to any level of cache (not even L1).
Only the second point means volatile is not useful in most inter threads communication problems; the first point is essentially irrelevant in any programming problem that doesn't involve communication with hardware components outside the CPU(s) but still on the memory bus.
The property of volatile providing guaranteed behavior from the point of view of the core running the thread means that asynchronous signals delivered to that thread, which run from the point of view of the execution ordering of that thread, see operations in source code order.
Unless you plan to send signals to your threads (an extremely useful approach to consolidation of information about currently running threads with no previously agreed point of stopping), volatile is not for you.
The ISO C standard, no, but in practice all machines that we run threads across have coherent shared memory, so volatile in practice works somewhat like _Atomic with memory_order_relaxed, at least for pure-load / pure-store operations on small-enough types. (But of course only _Atomic will give you atomic RMWs for stuff like n += 1;)
There's also the question of what exactly volatile means to a compiler. The standard allows wiggle room, but in real-world compilers, it means the load or store has to actually happen in the asm. No more, no less. (A compiler that didn't work this way couldn't correctly compile pre-C11 multi-threaded code that used hand-rolled volatile, so that de-facto standard is a requirement for compilers to be generally useful and for anyone to want to actually use them. ISO C leaves enough choice up to the implementation that a DeathStation 9000 could be ISO C compliant and almost totally unusable for real programs, and break most real code bases.)
The requirement that volatile accesses are guaranteed to happen in source order is normally interpreted as putting the asm in that order, leaving runtime reordering at the mercy of the target machine's memory model. volatile accesses aren't ordered wrt. anything else, so plain operations can still optimize away separately from them.
When to use volatile with multi threading? is a C++ version of the question. Answer: basically never, use stdatomic. My answer there explains why cache-coherency makes volatile useful in practice: there are no C or C++ implementations I'm aware of where shared_var.store(1, std::memory_order_relaxed) needs to explicitly flush anything to make the store visible to other cores. It compiles to just a normal asm store instruction, for variables narrow enough to be "naturally" atomic.
(Memory barriers just make this core wait, e.g. until the store commits from the store buffer to L1d cache and thus becomes globally visible, before doing later loads/stores. So they order this core's accesses to coherent shared memory.)
For example, the Linux kernel depends on this, using volatile for inter-thread visibility, and asm() for memory barriers to order those accesses, and for atomic-RMW operations. All multi-core systems that can run a single instance of Linux across those cores have coherent shared memory.
There are some rare systems with shared memory that isn't coherent, for example some clusters. But you don't run threads of the same process across different coherency domains. (Or run a single instance of the OS on it). Instead, the shared memory has to get mapped differently from normal write-back cacheable, or you have to do explicit flushing.

What's the purpose of glib's g_atomic_int_get?

glib provides a g_atomic_int_get function to atomically read a standard C int type. Isn't reading 32-bit integers from memory into registers already guaranteed to be an atomic operation by the processor (e.g. mov <reg32>, <mem>)?
If yes, then what's the purpose of glib's g_atomic_int_get function?
Some processors allow reading unaligned data, but that may take more than a single cycle. I.e. it's no longer atomic. On others it might not be an atomic operation at all to begin with.
The x86 mov instruction is not always atomic, either: it is non-atomic if the addresses involved are not naturally aligned.
Even if it were always atomic, it is not a memory barrier, which means the compiler is free to reorder the instruction with reference to other instructions nearby; and the processor is free to reorder the instruction with reference to other instructions in the instruction stream at runtime.
Unless you are writing code targeting only a single platform (and are sure that code will never need to be ported to another platform), you must always use explicit atomic instructions if you want atomic guarantees.
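For illustration, a minimal sketch of the glib calls under discussion (assuming glib is available; the counter name is mine; per the glib documentation these calls also act as full memory barriers):

#include <glib.h>

static gint counter = 0;

/* Writer thread: atomic increment. */
void bump(void)
{
    g_atomic_int_inc(&counter);
}

/* Reader thread: atomic read that the compiler cannot reorder or cache,
 * unlike a plain load of a non-volatile int. */
gint current(void)
{
    return g_atomic_int_get(&counter);
}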

How can I convert a non-atomic operation to atomic?

I am trying to understand atomic and non-atomic operations, with respect to operating systems and also with respect to C.
As per the wikipedia page here
Consider a simple counter which different processes can increment.
Non-atomic
The naive, non-atomic implementation:
reads the value in the memory location;
adds one to the value;
writes the new value back into the memory location.
Now, imagine two processes are running incrementing a single, shared memory location:
the first process reads the value in memory location;
the first process adds one to the value;
but before it can write the new value back to the memory location it is suspended, and the second process is allowed to run:
the second process reads the value in memory location, the same value that the first process read;
the second process adds one to the value;
the second process writes the new value into the memory location.
How can the above operation be made an atomic operation?
My understanding of atomic operation is that any thing which executes without interruption is atomic.
So for example
int b=1000;
b+=1000;
Should be an atomic operation as per my understanding, because both statements execute without interruption. However, I learned from someone that in C there is nothing known as an atomic operation, so both statements above are non-atomic.
So what I want to understand is: how is atomicity different when it comes to programming languages versus operating systems?
C99 doesn't have any way to make variables atomic with respect to other threads. C99 has no concept of multiple threads of execution. Thus, you need to use compiler-specific extensions, and/or CPU-level instructions to achieve atomicity.
The next C standard, currently known as C1x, will include atomic operations.
Even then, mere atomicity just guarantees that an operation is atomic, it doesn't guarantee when that operation becomes visible to other CPUs. To achieve visibility guarantees, in C99 you would need to study your CPU's memory model, and possibly use a special kind of CPU instructions known as fences or memory barriers. You also need to tell the compiler about it, using some compiler-specific compiler barrier. C1x defines several memory orderings, and when you use an atomic operation you can decide which memory ordering to use.
Some examples:
/* NOT atomic */
b += 1000;

/* GCC extension, only in newish GCCs.
 * Alignment requirements on b are CPU-specific. */
__sync_add_and_fetch(&b, 1000);

/* GCC extension + x86 assembly.
 * b must be aligned to its size (natural alignment), or the operation
 * will not be atomic. Note the "+m" constraint: the lock prefix is only
 * valid with a memory operand, not a register. */
__asm__ __volatile__("lock addl $1000, %0" : "+m"(b));

/* C1x (standardized as C11) */
#include <stdatomic.h>
atomic_int b = ATOMIC_VAR_INIT(1000);
int r = atomic_fetch_add(&b, 1000) + 1000;
All of this is as complex as it seems, so you should normally stick to mutexes, which make things easier.
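A minimal sketch of the mutex approach with POSIX threads (names are illustrative): the read-modify-write becomes effectively atomic because every thread that touches b agrees to take the lock first.

#include <pthread.h>

static int b = 1000;
static pthread_mutex_t b_lock = PTHREAD_MUTEX_INITIALIZER;

void add_1000(void)
{
    pthread_mutex_lock(&b_lock);
    b += 1000;                   /* protected: no lost updates */
    pthread_mutex_unlock(&b_lock);
}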
int b = 1000;
b+=1000;
gets turned into multiple statements at the instruction level. At the very least: preparing a register or memory location, assigning 1000, then getting the contents of that register/memory, adding 1000 to the contents, and re-assigning the new value (2000) to that register. Without locking, the OS can suspend the process/thread at any point in that operation. In addition, on multiprocessor systems, a different processor could access that memory (it wouldn't be a register in this case) while your operation is in progress.
When you take a lock out (which is how you would make this atomic), you are, in part, informing the OS that it is not ok to suspend this process/thread, and that this memory should not be accessed by other processes.
Now the above code would probably be optimized by the compiler to a simple assignment of 2000 to the memory location for b, but I'm ignoring that for the purposes of this answer.
b+=1000 is compiled, on all systems that I know, to multiple instructions. Thus it is not atomic.
Even b = 1000 can be non-atomic, although you have to work hard to construct a situation where it is not atomic.
In fact C has no concept of threads and so there is nothing that is atomic in C. You need to rely on implementation specific details of your compiler and tools.
The above statements are non-atomic because they become a move instruction to load b into a register (if it isn't in one already), then an add of 1000 to it, and then a store back into memory. Many instruction sets allow for atomicity through an atomic increment, the easiest being x86 with lock addl; some other instruction sets use cmpxchg to achieve the same result.
So what I want to understand is: how is atomicity different when it comes to programming languages versus operating systems?
I'm a bit confused by this question. What do you mean exactly? The atomicity concept is the same in both programming languages and operating systems.
Regarding atomicity and language, here is for example a link about atomicity in JAVA, that might give you a different perspective: What operations in Java are considered atomic?

Are assignment = and subtraction assignment -= atomic operations in C?

int b = 1000;
b -= 20;
Is any of the above an atomic operation? What is an atomic operation in C?
It depends on the implementation. By the standard, nothing is atomic in C. If you need atomic ops you can look at your compiler's builtins.
It is architecture/implementation dependent.
If you want atomic operations, I think the sig_atomic_t type is standardized by C99, but I'm not sure.
From the GNU LibC docs:
In practice, you can assume that int and other integer types no longer than int are atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these are true on all of the machines that the GNU C library supports, and on all POSIX systems we know of.
This link seems to me to be on the right track in telling us what an atomic operation is in C:
http://odetocode.com/blogs/scott/archive/2006/05/17/atomic-operations.aspx
And it says, "...computer science adopted the term 'atomic operation' to describe an instruction that is indivisible and uninterruptible by other threads of execution."
And by that definition, the first line of code in the original question
int b=1000;
b-=20;
ought to be an atomic operation. The second could be an atomic operation if the CPU's instruction set includes an instruction to subtract directly from memory. The reason I think so is that the first code line would most likely require only one assembly (machine) instruction. And instructions either execute or not. I don't think any machine instruction can be interrupted in the middle.
That same link also says, "If thread A is writing a 32-bit value to memory as an atomic operation, thread B will never be able to read the memory location and see only the first 16 of 32 bits written out." It seems that any single machine instruction cannot be interrupted in the middle, and therefore would automatically be atomic between threads.
Incrementing and decrementing a number is not an atomic operation in C. Certain architectures support atomic increment and decrement instructions, but there is no guarantee that the compiler would use them. You can look, as an example, at Qt reference counting. It uses atomic reference counting; on certain platforms it is implemented with platform-specific assembly code, and on the rest it uses a mutex to lock the counter.
If you're not incrementing or decrementing in a performance-critical part of your code, simply use a mutex while doing it. If it is in a performance-critical part of your code, you might want to rewrite your code so that this operation doesn't use shared memory accessed from multiple places, or use mutexes with higher granularity so that they don't affect performance, or use assembly to ensure that the operation is atomic.
Quoting from ISO C89, 7.7 Signal handling <signal.h>
The type defined is sig_atomic_t, which is the integral type of an object that can be accessed as an atomic entity, even in the presence of asynchronous interrupts.
