Is a C compiler allowed to coalesce sequential assignments to volatile variables? - c

I have a theoretical (non-deterministic, hard to test, never seen in practice) hardware issue reported by the hardware vendor, where a double-word write to certain memory ranges may corrupt any future bus transfers.
While I don't have any double-word writes explicitly in C code, I'm worried the compiler is allowed (in current or future implementations) to coalesce multiple adjacent word assignments into a single double-word assignment.
The compiler is not allowed to reorder assignments of volatiles, but it is unclear (to me) whether coalescing counts as reordering. My gut says it does, but I've been corrected by language lawyers before!
Example:
typedef struct
{
    volatile unsigned reg0;
    volatile unsigned reg1;
} Module;
volatile Module* module = (volatile Module*)0xFF000000u;
// two word stores, or one double-word store?
module->reg0 = 1;
module->reg1 = 2;
(I'll ask my compiler vendor about this separately, but I'm curious what the canonical/community interpretation of the standard is.)

No, the compiler is absolutely not allowed to optimize those two writes into a single double word write. It's kind of hard to quote the standard since the part regarding optimizations and side effects is so fuzzily written. The relevant parts are found in C17 5.1.2.3:
The semantic descriptions in this International Standard describe the behavior of an
abstract machine in which issues of optimization are irrelevant.
Accessing a volatile object, modifying an object, modifying a file, or calling a function that does any of those operations are all side effects, which are changes in the state of the execution environment.
In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).
Accesses to volatile objects are evaluated strictly according to the rules of the abstract machine.
When you access part of a struct, that in itself is a side-effect, which may have consequences that the compiler can't determine. Suppose for example that your struct is a hardware register map and those registers need to be written in a certain order. Like for example some microcontroller documentation could be along the lines of: "reg0 enables the hardware peripheral and must be written to before you can configure the details in reg1".
A compiler that would merge the volatile object writes into a single one would be non-conforming and plain broken.
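To make concrete what such a merge would amount to, here is a rough C-level sketch of the transformation being ruled out; the 64-bit access type and little-endian layout are just assumptions for illustration:

// Required behavior: two separate word stores, exactly as written.
module->reg0 = 1;
module->reg1 = 2;

// The coalesced form a broken compiler might effectively emit, expressed
// back in C (assumes 32-bit unsigned and little-endian ordering):
*(volatile unsigned long long *)&module->reg0 =
    ((unsigned long long)2 << 32) | 1u;   // one double-word store: not allowed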

The compiler is not allowed to make two such assignments into a single memory write. There must be two independent writes from the core. The answer from @Lundin gives relevant references to the C standard.
However, be aware that a cache - if present - may trick you. The keyword volatile doesn't imply "uncached" memory. So besides using volatile, you also need to make sure that the address 0xFF000000 is mapped as uncached. If the address is mapped as cached, the cache HW may turn the two assignments into a single memory write. In other words - for cached memory, two core memory write operations may end up as a single write operation on the system's memory interface.

The behavior of volatile seems to be up to the implementation, partly because of a curious sentence which says: "What constitutes an access to an object that has volatile-qualified type is implementation-defined".
In ISO C 99, section 5.1.2.3, there is also:
3 In the abstract machine, all expressions are evaluated as specified by the semantics. An
actual implementation need not evaluate part of an expression if it can deduce that its
value is not used and that no needed side effects are produced (including any caused by
calling a function or accessing a volatile object).
So although requirements are given that a volatile object must be treated in accordance with the abstract semantics (i.e. not optimized), curiously, the abstract semantics itself allows for the elimination of dead code and data flows, which are examples of optimizations!
I'm afraid that to know what volatile will and will not do, you have to go by your compiler's documentation.

The C Standard is agnostic to any relationship between operations on volatile objects and operations on the actual machine. While most implementations would specify that a construct like *(char volatile*)0x1234 = 0x56; would generate a byte store with value 0x56 to hardware address 0x1234, an implementation could, at its leisure, allocate space for e.g. an 8192-byte array and specify that *(char volatile*)0x1234 = 0x56; would immediately store 0x56 to element 0x1234 of that array, without ever doing anything with hardware address 0x1234. Alternatively, an implementation may include some process that periodically stores whatever happens to be in element 0x1234 of that array to hardware address 0x1234.
All that is required for conformance is that all operations on volatile objects within a single thread are, from the standpoint of the Abstract machine, regarded as absolutely sequenced. From the point of view of the Standard, implementations can convert such accesses into real machine operations in whatever fashion they see fit.

Coalescing the two volatile writes would change the observable behavior of the program, so the compiler is not allowed to do so.

Related

Does "volatile" guarantee anything at all in portable C code for multi-core systems?

After looking at a bunch of other questions and their answers, I get the impression that there is no widespread agreement on what the "volatile" keyword in C means exactly.
Even the standard itself does not seem to be clear enough for everyone to agree on what it means.
Among other problems:
It seems to provide different guarantees depending on your hardware and depending on your compiler.
It affects compiler optimizations but not hardware optimizations, so on an advanced processor that does its own run-time optimizations, it is not even clear whether the compiler can prevent whatever optimization you want to prevent. (Some compilers do generate instructions to prevent some hardware optimizations on some systems, but this does not appear to be standardized in any way.)
To summarize the problem, it appears (after reading a lot) that "volatile" guarantees something like: The value will be read/written not just from/to a register, but at least to the core's L1 cache, in the same order that the reads/writes appear in the code. But this seems useless, since reading/writing from/to a register is already sufficient within the same thread, while coordinating with L1 cache doesn't guarantee anything further regarding coordination with other threads. I can't imagine when it could ever be important to sync just with L1 cache.
USE 1
The only widely-agreed-upon use of volatile seems to be for old or embedded systems where certain memory locations are hardware-mapped to I/O functions, like a bit in memory that controls (directly, in the hardware) a light, or a bit in memory that tells you whether a keyboard key is down or not (because it is connected by the hardware directly to the key).
It seems that "use 1" does not occur in portable code whose targets include multi-core systems.
USE 2
Not too different from "use 1" is memory that could be read or written at any time by an interrupt handler (which might control a light or store info from a key). But already for this we have the problem that depending on the system, the interrupt handler might run on a different core with its own memory cache, and "volatile" does not guarantee cache coherency on all systems.
So "use 2" seems to be beyond what "volatile" can deliver.
USE 3
The only other undisputed use I see is to prevent mis-optimization of accesses via different variables pointing to the same memory that the compiler doesn't realize is the same memory. But this is probably only undisputed because people aren't talking about it -- I only saw one mention of it. And I thought the C standard already recognized that "different" pointers (like different args to a function) might point to the same item or nearby items, and already specified that the compiler must produce code that works even in such cases. However, I couldn't quickly find this topic in the latest (500 page!) standard.
So "use 3" maybe doesn't exist at all?
Hence my question:
Does "volatile" guarantee anything at all in portable C code for multi-core systems?
EDIT -- update
After browsing the latest standard, it is looking like the answer is at least a very limited yes:
1. The standard repeatedly specifies special treatment for the specific type "volatile sig_atomic_t". However the standard also says that use of the signal function in a multi-threaded program results in undefined behavior. So this use case seems limited to communication between a single-threaded program and its signal handler.
2. The standard also specifies a clear meaning for "volatile" in relation to setjmp/longjmp. (Example code where it matters is given in other questions and answers.)
So the more precise question becomes:
Does "volatile" guarantee anything at all in portable C code for multi-core systems, apart from (1) allowing a single-threaded program to receive information from its signal handler, or (2) allowing setjmp code to see variables modified between setjmp and longjmp?
This is still a yes/no question.
If "yes", it would be great if you could show an example of bug-free portable code which becomes buggy if "volatile" is omitted. If "no", then I suppose a compiler is free to ignore "volatile" outside of these two very specific cases, for multi-core targets.
I'm no expert, but cppreference.com has what appears to me to be some pretty good information on volatile. Here's the gist of it:
Every access (both read and write) made through an lvalue expression
of volatile-qualified type is considered an observable side effect for
the purpose of optimization and is evaluated strictly according to the
rules of the abstract machine (that is, all writes are completed at
some time before the next sequence point). This means that within a
single thread of execution, a volatile access cannot be optimized out
or reordered relative to another visible side effect that is separated
by a sequence point from the volatile access.
It also gives some uses:
Uses of volatile
1) static volatile objects model memory-mapped I/O ports, and static
const volatile objects model memory-mapped input ports, such as a
real-time clock
2) static volatile objects of type sig_atomic_t are used for
communication with signal handlers.
3) volatile variables that are local to a function that contains an
invocation of the setjmp macro are the only local variables guaranteed
to retain their values after longjmp returns.
4) In addition, volatile variables can be used to disable certain
forms of optimization, e.g. to disable dead store elimination or
constant folding for microbenchmarks.
And of course, it mentions that volatile is not useful for thread synchronization:
Note that volatile variables are not suitable for communication
between threads; they do not offer atomicity, synchronization, or
memory ordering. A read from a volatile variable that is modified by
another thread without synchronization or concurrent modification from
two unsynchronized threads is undefined behavior due to a data race.
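For what it's worth, the signal-handler use (item 2 above) seems to be the least controversial one; here is a minimal sketch, with the flag and handler names being my own:

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t got_signal = 0;

static void handler(int sig)
{
    (void)sig;
    got_signal = 1;   // setting a volatile sig_atomic_t is essentially the only
                      // portable thing a handler may do with shared data
}

int main(void)
{
    signal(SIGINT, handler);
    while (!got_signal)
        ;             // volatile forces a fresh read on every iteration
    puts("interrupted");
    return 0;
}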
First of all, there have historically been various hiccups regarding different interpretations of the meaning of volatile access and similar. See this study: Volatiles Are Miscompiled, and What to Do about It.
Apart from the various issues mentioned in that study, the behavior of volatile is portable, save for one aspect: when it acts as a memory barrier. A memory barrier is a mechanism which is there to prevent concurrent unsequenced execution of your code. Using volatile as a memory barrier is certainly not portable.
Whether the C language guarantees memory behavior or not from volatile is apparently arguable, though personally I think the language is clear. First we have the formal definition of side effects, C17 5.1.2.3:
Accessing a volatile object, modifying an object, modifying a file, or calling a function that does any of those operations are all side effects, which are changes in the state of the execution environment.
The standard defines the term sequencing, as a way of determining order of evaluation (execution). The definition is formal and cumbersome:
Sequenced before is an asymmetric, transitive, pair-wise relation between evaluations
executed by a single thread, which induces a partial order among those evaluations.
Given any two evaluations A and B, if A is sequenced before B, then the execution of A
shall precede the execution of B. (Conversely, if A is sequenced before B, then B is
sequenced after A.) If A is not sequenced before or after B, then A and B are
unsequenced. Evaluations A and B are indeterminately sequenced when A is sequenced
either before or after B, but it is unspecified which.13) The presence of a sequence point
between the evaluation of expressions A and B implies that every value computation and
side effect associated with A is sequenced before every value computation and side effect
associated with B. (A summary of the sequence points is given in annex C.)
The TL;DR of the above is basically that if we have an expression A which contains side effects, it must finish executing before another expression B, if B is sequenced after A.
Optimizations of C code are made possible through this part:
In the abstract machine, all expressions are evaluated as specified by the semantics. An actual
implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a
volatile object).
This means that the program may evaluate (execute) expressions in the order that the standard mandates elsewhere (order of evaluation etc.), but it need not evaluate (execute) a value if it can deduce that it is not used. For example, the operation 0 * x doesn't need to evaluate x; the compiler may simply replace the whole expression with 0.
Unless accessing the variable is a side effect: if x is volatile, the compiler must evaluate (execute) 0 * x even though the result will always be 0. The optimization is not allowed.
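A tiny sketch of that difference (the function is my own example, not from the standard):

int scaled(int x, volatile int *p)
{
    int a = 0 * x;    // x is not volatile: may be folded to 0 without reading x
    int b = 0 * *p;   // *p is a volatile access: a needed side effect, so the
                      // load must still happen even though b is always 0
    return a + b;
}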
Furthermore, the standard speaks of observable behavior:
The least requirements on a conforming implementation are:
Accesses to volatile objects are evaluated strictly according to the rules of the abstract machine.
/--/
This is the observable behavior of the program.
Given all of the above, a conforming implementation (compiler + underlying system) may not execute the access of volatile objects in an unsequenced order, in case the semantics of the written C source says otherwise.
This means that in this example
volatile int x;
volatile int y;
int z;

z = x;
z = y;
Both assignment expressions must be evaluated, and z = x; must be evaluated before z = y;. A multi-processor implementation that outsources these two operations to two different unsequenced cores is not conforming!
The dilemma is that compilers can't do much about things like pre-fetch caching and instruction pipelining etc., particularly not when running on top of an OS. And so compilers hand that problem over to the programmers, telling them that memory barriers are now the programmer's responsibility, while the C standard clearly states that the problem needs to be solved by the compiler.
The compiler doesn't necessarily care to solve the problem though, and so volatile for the sake of acting as a memory barrier is non-portable. It has become a quality of implementation issue.
To summarize the problem, it appears (after reading a lot) that
"volatile" guarantees something like: The value will be read/written
not just from/to a register, but at least to the core's L1 cache, in
the same order that the reads/writes appear in the code.
No, it absolutely does not. And that makes volatile almost useless for the purpose of MT safe code.
If it did, then volatile would be quite good for variables shared by multiple threads, as ordering the events in the L1 cache is all you need to do on a typical CPU (that is, either multi-core, or multi-CPU on a motherboard) capable of cooperating in a way that makes a normal implementation of either C/C++ or Java multithreading possible with the typically expected costs (that is, not a huge cost on most atomic or non-contended mutex operations).
But volatile does not provide any guaranteed ordering (or "memory visibility") in the cache either in theory or in practice.
(Note: the following is based on sound interpretation of the standard documents, the standard's intent, historical practice, and a deep understanding of the expectations of compiler writers. This approach is based on history, actual practice, and the expectations and understanding of real persons in the real world, which is much stronger and more reliable than parsing the words of a document that is not known to be stellar specification writing and which has been revised many times.)
In practice, volatile does guarantee ptrace-ability, that is, the ability to use debug information for the running program at any level of optimization, and the fact that the debug information makes sense for these volatile objects:
you may use ptrace (or a ptrace-like mechanism) to set meaningful breakpoints at the sequence points after operations involving volatile objects: you can really break at exactly these points (note that this works only if you are willing to set many breakpoints, as any C/C++ statement may be compiled to many different assembly start and end points, as in a massively unrolled loop);
while a thread of execution is stopped, you may read the value of all volatile objects, as they have their canonical representation (following the ABI for their respective type); a non-volatile local variable could have an atypical representation, e.g. a shifted representation: a variable used for indexing an array might be multiplied by the size of individual objects for easier indexing; or it might be replaced by a pointer to an array element (as long as all uses of the variable are similarly converted) (think of changing dx to du in an integral);
you can also modify those objects (as long as the memory mappings allow that, as volatile objects with static lifetime that are const-qualified might be in a memory range mapped read-only).
Volatile guarantees in practice a little more than the strict ptrace interpretation: it also guarantees that volatile automatic variables have an address on the stack, as they aren't allocated to a register; such register allocation would make ptrace manipulations more delicate (the compiler can output debug information to explain how variables are allocated to registers, but reading and changing register state is slightly more involved than accessing memory addresses).
Note that full program debug-ability, that is, treating all variables as volatile at least at sequence points, is provided by the "zero optimization" mode of the compiler, a mode which still performs trivial optimizations like arithmetic simplifications (there is usually no guaranteed "no optimization at all" mode). But volatile is stronger than non-optimization: x-x can be simplified to 0 for a non-volatile integer x but not for a volatile one.
So volatile means guaranteed to be compiled as is, just as the translation of a system call from source to binary/assembly isn't reinterpreted, changed, or optimized in any way by the compiler. Note that library calls may or may not be system calls. Many official system functions are actually library functions that offer a thin layer of interposition and generally defer to the kernel in the end. (In particular, getpid doesn't need to go to the kernel and could well read a memory location provided by the OS containing the information.)
Volatile interactions are interactions with the outside world of the real machine, which must follow the "abstract machine". They aren't internal interactions of program parts with other program parts. The compiler can only reason about what it knows, that is the internal program parts.
The code generation for a volatile access should follow the most natural interaction with that memory location: it should be unsurprising. That means that some volatile accesses are expected to be atomic: if the natural way to read or write the representation of a long on the architecture is atomic, then it's expected that a read or write of a volatile long will be atomic, as the compiler should not generate silly inefficient code to access volatile objects byte by byte, for example.
You should be able to determine that by knowing the architecture. You don't have to know anything about the compiler, as volatile means that the compiler should be transparent.
But volatile does no more than force the emission of the expected assembly, the least optimized for the particular case, to do a memory operation: volatile semantics means general-case semantics.
The general case is what the compiler does when it doesn't have any information about a construct: for example, calling a virtual function on an lvalue via dynamic dispatch is a general case, while making a direct call to the overrider after determining at compile time the type of the object designated by the expression is a particular case. The compiler always has a general-case handling of all constructs, and it follows the ABI.
Volatile does nothing special to synchronize threads or provide "memory visibility": volatile only provides guarantees at the abstract level seen from inside a thread executing or stopped, that is the inside of a CPU core:
volatile says nothing about which memory operations reach main RAM (you may set specific memory caching types with assembly instructions or system calls to obtain these guarantees);
volatile doesn't provide any guarantee about when memory operations will be committed to any level of cache (not even L1).
Only the second point means volatile is not useful for most inter-thread communication problems; the first point is essentially irrelevant in any programming problem that doesn't involve communication with hardware components outside the CPU(s) but still on the memory bus.
The property of volatile providing guaranteed behavior from the point of the view of the core running the thread means that asynchronous signals delivered to that thread, which are run from the point of view of the execution ordering of that thread, see operations in source code order.
Unless you plan to send signals to your threads (an extremely useful approach to consolidation of information about currently running threads with no previously agreed point of stopping), volatile is not for you.
The ISO C standard, no, but in practice all machines that we run threads across have coherent shared memory, so volatile in practice works somewhat like _Atomic with memory_order_relaxed, at least for pure-load / pure-store operations on small-enough types. (But of course only _Atomic will give you atomic RMWs for stuff like n += 1;)
There's also the question of what exactly volatile means to a compiler. The standard allows wiggle room, but in real-world compilers, it means the load or store has to actually happen in the asm. No more, no less. (A compiler that didn't work this way couldn't correctly compile pre-C11 multi-threaded code that used hand-rolled volatile, so that de-facto standard is a requirement for compilers to be generally useful and for anyone to want to actually use them. ISO C leaves enough choice up to the implementation that a DeathStation 9000 could be ISO C compliant and almost totally unusable for real programs, and break most real code bases.)
The requirement that volatile accesses are guaranteed to happen in source order is normally interpreted as putting the asm in that order, leaving runtime reordering at the mercy of the target machine's memory model. volatile accesses aren't ordered wrt. anything else, so plain operations can still optimize away separately from them.
When to use volatile with multi threading? is a C++ version of the question. Answer: basically never, use stdatomic. My answer there explains why cache-coherency makes volatile useful in practice: there are no C or C++ implementations I'm aware of where shared_var.store(1, std::memory_order_relaxed) needs to explicitly flush anything to make the store visible to other cores. It compiles to just a normal asm store instruction, for variables narrow enough to be "naturally" atomic.
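As a rough sketch of that practical equivalence (names are mine; the volatile version is still formally a data race in ISO C, it just happens to work on the cache-coherent implementations described here):

#include <stdatomic.h>

volatile int ready_v;   // de-facto behavior: a plain load/store instruction
_Atomic int  ready_a;   // guaranteed by C11, normally lock-free for int

void publish(void)
{
    ready_v = 1;                                               // plain store
    atomic_store_explicit(&ready_a, 1, memory_order_relaxed);  // typically the same asm
}

int check(void)
{
    return ready_v | atomic_load_explicit(&ready_a, memory_order_relaxed);
}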
(Memory barriers just make this core wait, e.g. until the store commits from the store buffer to L1d cache and thus becomes globally visible, before doing later loads/stores. So they order this core's accesses to coherent shared memory.)
For example, the Linux kernel depends on this, using volatile for inter-thread visibility, and asm() for memory barriers to order those accesses, and for atomic-RMW operations. All multi-core systems that can run a single instance of Linux across those cores have coherent shared memory.
There are some rare systems with shared memory that isn't coherent, for example some clusters. But you don't run threads of the same process across different coherency domains. (Or run a single instance of the OS on it). Instead, the shared memory has to get mapped differently from normal write-back cacheable, or you have to do explicit flushing.

Race condition when accessing adjacent members in a shared struct, according to CERT coding rule POS49-C?

According to CERT coding rule POS49-C it is possible that different threads accessing different fields of the same structure may conflict.
Instead of bit-field, I use regular unsigned int.
struct multi_threaded_flags {
    unsigned int flag1;
    unsigned int flag2;
};

struct multi_threaded_flags flags;

void thread1(void) {
    flags.flag1 = 1;
}

void thread2(void) {
    flags.flag2 = 2;
}
I can see that even with unsigned int, there can still be a race condition IF the compiler decides to use 8-byte loads/stores instead of 4-byte ones.
I think the compiler will never do that and the race condition will never happen here, but that's completely just my guess.
Is there any well-defined assembly/compiler documentation regarding this case? I hope locking, which is costly, is the last resort when this situation happens to be undefined.
FYI, I use gcc.
The C11 memory model guarantees that accesses to distinct structure members (which aren't part of a bit-field) are independent, so you'll run into no problems modifying the two flags from different threads (i.e., the "load 8 bytes, modify 4, and write back 8" scenario is not allowed).
This guarantee does not extend in general to bitfields, so you have to be careful there.
Of course, if you are concurrently modifying the same flag from more than one thread, you'll likely trigger the prohibition against data races, so don't do that.
Before C11, ISO C had nothing to say about threads, and writing multi-threaded code relied on other standards (e.g. POSIX which defines a memory model for pthreads), and multi-threaded code essentially depended on the way real compilers worked.
Note that this CERT coding-standard rule is in the POSIX section, and appears to be about pthreads without C11. (There's a CON32-C. Prevent data races when accessing bit-fields from multiple threads rule for C11, where they solve the bit-field concurrency problem by simply promoting the bit-fields to unsigned char, which C11 defines as "separate memory locations". This rule appears to be an insufficiently-edited copy of that, because many of its suggestions suck.)
But unfortunately POSIX pthreads doesn't clearly define what a "memory location is", and this is all they have to say on the subject:
Single UNIX® Specification, Version 4, 2016 Edition (online HTML copy, requires free registration)
4.12 Memory Synchronization
Applications shall ensure that access to any memory location by more
than one thread of control (threads or processes) is restricted such
that no thread of control can read or modify a memory location while
another thread of control may be modifying it. Such access is
restricted using functions that synchronize thread execution and also
synchronize memory with respect to other threads.
This is why C11 defines it more clearly, where only bitfields are dangerous to write from different threads (barring compiler bugs).
However, I think everyone (including all compilers) agreed that separate int variables / struct members / array elements were separate "memory locations". Most real-world software doesn't take any special precautions for int or char variables that may be written by separate threads (especially outside of structs).
A compiler that gets int wrong will cause problems all over the place unless the bug is limited to very specific circumstances.
Most bugs like this are very hard to detect with testing, because usually the other data that's non-atomically loaded and stored back isn't written by another thread very often / ever. But if a compiler always did that for every int, problems would show up in some software pretty quickly.
Normally, separate char members would also be considered separate "memory locations", but some pre-C11 implementations might have exceptions to that rule. (Especially on early Alpha AXP which famously has no byte store instruction (so a C11 implementation would have to use 32-bit char), but optimizations that invent writes when updating multiple members can happen anywhere, either by accident or because the compiler developers define "memory location" as a 32 or 64-bit word.)
There's also the issue of compiler bugs. This can affect even compilers that intend to conform to C11. For example gcc bug 52080 which affected some non-x86 architectures. (Discovered in gcc4.7 in 2012, fixed in gcc4.8 a couple months later). Using a bitfield "tricked" the compiler into doing a non-atomic read-modify-write of the containing 64-bit word, even though that included a non-bitfield member. (Bitfields are bait for compiler bugs. Any defensive / safe-coding standard should recommend avoiding them in structs where different members can be modified from different threads. And definitely don't put them next to the actual lock.)
Herb Sutter's talk atomic<> Weapons: The C++ Memory Model and Modern Hardware part 2 goes into some detail about the kinds of compiler bugs that have affected multi-threaded code. Most of these should be shaken out by now (2017) if you're using a modern compiler. Most things like inventing writes (or non-atomic read and write-back of the same value) were usually still considered bugs before C11; C11 mostly just firmed up the rules compilers were already trying to follow. It also made it easier to report such bugs, because you could say unequivocally that it violates the standard instead of just "it breaks my code".
That coding-rule article is poorly written. Its examples with adjacent bit-fields are unsafe, but it claims that all variables are at risk. This is not true in general, especially not with C11. Many users of pthreads can or already do compile with C11 compilers.
(The phrasing I'm referring to is "Bit-fields are especially prone to this behavior", which incorrectly implies that this is allowed to happen with ordinary members of structs, or variables that happen to be adjacent outside of structs)
It's part of a defensive coding standard, but it should definitely make the distinction between what the standards require and what is just belt-and-suspenders defense against compiler bugs.
Also, putting variables that will usually be accessed by different threads into one struct is generally terrible. False sharing of a cache line (typically 64 bytes) is really bad for performance, causing cache misses and (on out-of-order x86 CPUs) memory-ordering mis-speculation (like a branch mispredict requiring a roll-back.) Putting separately-used shared variables into the same byte with bit-fields is even worse, because it prevents efficient stores (any store has to be a RMW of the containing byte).
Solving the bit-field problem by promoting the two bit-fields to unsigned char makes much more sense than using a mutex if they need to be independently writeable from separate threads. Or even unsigned long if you're paranoid.
If the two members are often used together, it makes sense to put them nearby. But if you're going to pad to a whole long to contain both members (like that article does), you might as well make them at least unsigned char or bool instead of 1-byte bitfields.
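A sketch of that promotion, reusing the field names from the question:

struct multi_threaded_flags {
    unsigned char flag1;   // distinct memory locations per C11 3.14,
    unsigned char flag2;   // so concurrent writes to flag1 and flag2
                           // from different threads don't conflict
};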
Although honestly, having two threads modify separate members of a struct at the same time seems like poor design unless one of the members is the lock, and the modification is part of an attempt to take the lock. Using a bit-field as a lock is a bad idea unless you're writing for a specific ISA and building your own lock primitive using something like x86's lock bts instruction to atomically test-and-set a bit. Even then it's a bad idea unless you need to pack it with other bitfields for space saving; the Linux code that exposed the gcc bug with an int lock:1 member was a horrible idea.
In addition, the flags are declared volatile to ensure that the compiler will not attempt to move operations on them outside the mutex.
If your compiler needs this, your compiler is seriously broken, and will create broken code for most multi-threaded programs. (Unless the compiler bug only happens with bit-fields, because shared bit-fields are rare).
Most code doesn't make shared variables volatile, and relies on the guarantee that mutex lock/unlock stops operations from reordering at compile or run time out of the critical section.
Back in 2012, and possibly still today, gcc -pthread might affect code-gen choices in C89/C99 mode (-std=gnu99). In discussion on an LWN article about that gcc bug, this user claimed that -pthread would prohibit the compiler from doing a 64-bit load/store when modifying a 32-bit variable, but that without -pthread it could do so (although on most architectures, IDK why it would). But it turns out that gcc bug manifested even with -pthread, so it was really a bug rather than an aggressive optimization choice.
ISO C11 standard:
N1570, section 3.14 definitions:
memory location: either an object of scalar type, or a maximal sequence of adjacent bit-fields all having nonzero width
NOTE 1 Two threads of execution can update and access separate memory locations without interfering with each other.
A bit-field and an adjacent non-bit-field member are in separate memory locations. ... It is not safe to concurrently update two non-atomic bit-fields in the same structure if all members declared between them are also (non-zero-length) bit-fields, no matter what the sizes of those intervening bit-fields happen to be.
(...gives an example of a struct with bit-fields...)
So in C11, you can't assume anything about the compiler munging other bit-fields when writing one bit-field, but otherwise you're safe. Unless you use a separator :0 field to force the compiler to pad enough (or use atomic bit-ops) so that it can update your bit-field without concurrency problems for other fields. But if you want to be safe, it's probably not a good idea to use bit-fields at all in structs that are written by multiple threads at once.
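For reference, the :0 separator trick would look something like this (a sketch; C11 3.14 treats bit-fields separated by a zero-length bit-field declaration as separate memory locations):

struct two_flags {
    unsigned a : 1;
    unsigned   : 0;   // zero-length bit-field: 'a' and 'b' now occupy separate
                      // memory locations, so different threads may update them
    unsigned b : 1;
};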
See also other Notes in the C11 standard, e.g. the one linked by @Fuz in Is it well-defined behavior to modify one element of an array while another thread modifies another element of the same array? that explicitly says that compiler transformations that would make this dangerous are disallowed.

Enforcing correct ordering of volatile memory location accesses

Consider the following snippet:
volatile uint32_t *registers;
// ...
// registers gets mmap'd and so on
// ...
registers[0] = 0x1;
registers[1] = 0x1;
In this registers is initialised from an mmap to some peripheral address space (hence the volatile).
I've yet to see this discussed anywhere, but in my mind in the general case, these register writes (or indeed any register accesses) should be protected from each other by a memory barrier. The problem being if the peripheral is expecting the accesses to be correctly ordered, the compiler may well ignore that.
So it should look something like:
volatile uint32_t *registers;
pthread_mutex_t reg_mutex;
// ...
// registers gets mmap'd and so on
// ...
pthread_mutex_lock(&reg_mutex);
registers[0] = 0x1;
pthread_mutex_unlock(&reg_mutex);

pthread_mutex_lock(&reg_mutex);
registers[1] = 0x1;
pthread_mutex_unlock(&reg_mutex);
Is my reasoning correct? Have I missed something? Is there a better way to do this?
It seems to me that this should be core to any understanding of using memory mapped devices.
EDIT: In response to the questions, I've noted that there is potential issues with the out-of-order execution of instructions. In which case, I'll rephrase the question to address that explicitly: Is a memory barrier the correct approach to constrain the processor to order correctly? (and is a mutex a valid strategy for this?)
According to the C and C++ standards, accesses to volatile memory locations are never reordered, so you don't have to do anything special. They are also never combined, so if you do something like this:
registers[0] = 0x1;
registers[0] = 0x1;
registers[0] = 0x1;
registers[0] = 0x1;
The compiler will generate 4 memory writes.
See the latest draft of C11 standard - http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
6.7.3 Type qualifiers
...
7 An object that has volatile-qualified type may be modified in ways unknown to the
implementation or have other unknown side effects. Therefore any expression referring
to such an object shall be evaluated strictly according to the rules of the abstract machine,
as described in 5.1.2.3. Furthermore, at every sequence point the value last stored in the
object shall agree with that prescribed by the abstract machine, except as modified by the
unknown factors mentioned previously.134) What constitutes an access to an object that
has volatile-qualified type is implementation-defined.
In your example each line is actually a "full expression" (as each one is terminated with a ;), so each line contains a sequence point. The above requirement forces all the accesses to be performed in the given order, with no optimizations, reordering or any clever tricks.
However, do note that this only states that the compiler will generate instructions that access these memory locations in exactly the order and number given in the source code. If you, for example, need to synchronize these accesses between multiple cores, then this is all hardware-dependent and the C standard won't help you here. Anyway, a mutex may be a bit too much to force such synchronization; you'd be better off using some intrinsic functions of the compiler for that.
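For instance, with GCC-family compilers one such sketch might look like this (purely illustrative; the correct barrier for a given peripheral and bus is platform-specific):

#include <stdint.h>

extern volatile uint32_t *registers;   // mmap'd as in the question

void start_device(void)
{
    registers[0] = 0x1;
    __sync_synchronize();   // GCC builtin: full memory barrier between the writes
    registers[1] = 0x1;
}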

Rationale for pointer comparisons outside an array to be UB

So, the standard (referring to N1570) says the following about comparing pointers:
C99 6.5.8/5 Relational operators
When two pointers are compared, the result depends on the relative
locations in the address space of the objects pointed to.
... [snip obvious definitions of comparison within aggregates] ...
In all other cases,
the behavior is undefined.
What is the rationale for this instance of UB, as opposed to specifying (for instance) conversion to intptr_t and comparison of that?
Is there some machine architecture where a sensible total ordering on pointers is hard to construct? Is there some class of optimization or analysis that unrestricted pointer comparisons would impede?
A deleted answer to this question mentions that this piece of UB allows for skipping comparison of segment registers and only comparing offsets. Is that particularly valuable to preserve?
(That same deleted answer, as well as one here, note that in C++, std::less and the like are required to implement a total order on pointers, whether the normal comparison operator does or not.)
Various comments in the ub mailing list discussion Justification for < not being a total order on pointers? strongly allude to segmented architectures being the reason. Including the following comments, 1:
Separately, I believe that the Core Language should simply recognize the fact that all machines these days have a flat memory model.
and 2:
Then we maybe need an new type that guarantees a total order when
converted from a pointer (e.g. in segmented architectures, conversion
would require taking the address of the segment register and adding the
offset stored in the pointer).
and 3:
Pointers, while historically not totally ordered, are practically so
for all systems in existence today, with the exception of the ivory tower
minds of the committee, so the point is moot.
and 4:
But, even if segmented architectures, unlikely though it is, do come
back, the ordering problem still has to be addressed, as std::less
is required to totally order pointers. I just want operator< to be an
alternate spelling for that property.
Why should everyone else pretend to suffer (and I do mean pretend,
because outside of a small contingent of the committee, people already
assume that pointers are totally ordered with respect to operator<) to
meet the theoretical needs of some currently non-existent
architecture?
Counter to the trend of comments from the ub mailing list, FUZxxl points out that supporting DOS is a reason not to support totally ordered pointers.
Update
This is also supported by the Annotated C++ Reference Manual(ARM) which says this was due to burden of supporting this on segmented architectures:
The expression may not evaluate to false on segmented architectures
[...] This explains why addition, subtraction and comparison of
pointers are defined only for pointers into an array and one element
beyond the end. [...] Users of machines with a nonsegmented address
space developed idioms, however, that referred to the elements beyond
the end of the array [...] was not portable to segmented architectures
unless special effort was taken [...] Allowing [...] would be costly
and serve few useful purposes.
The 8086 is a processor with 16 bit registers and a 20 bit address space. To cope with the lack of bits in its registers, a set of segment registers exists. On memory access, the dereferenced address is computed like this:
address = 16 * segment + register
Notice that among other things, an address has generally multiple ways to be represented. Comparing two arbitrary addresses is tedious as the compiler has to first normalize both addresses and then compare the normalized addresses.
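For example, the far pointers 1234:0005 and 1230:0045 denote the same linear address:

16 * 0x1234 + 0x0005 = 0x12345
16 * 0x1230 + 0x0045 = 0x12345

so comparing the raw segment:offset pairs bit for bit would wrongly report two equal addresses as different.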
Many compilers specify (in the memory models where this is possible) that when doing pointer arithmetic, the segment part is to be left untouched. This has several consequences:
objects can have a size of at most 64 kB
all addresses in an object have the same segment part
comparing addresses in an object can be done just by comparing the register part; that can be done in a single instruction
This fast comparison of course only works when the pointers are derived from the same base-address, which is one of the reasons why the C standard defines pointer comparisons only for when both pointers point into the same object.
If you want a well-ordered comparison for all pointers, consider converting the pointers to uintptr_t values first.
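Something along these lines (a sketch; the pointer-to-integer mapping is implementation-defined, but the resulting integer comparison is well-defined and total):

#include <stdint.h>

int ptr_before(const void *a, const void *b)
{
    return (uintptr_t)a < (uintptr_t)b;   // defined even for unrelated pointers,
                                          // unlike a < b
}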
I believe it's undefined so that C can be run on architectures where, in effect, "smart pointers" are implemented in hardware, with various checks to ensure that pointers never accidentally point outside of the memory regions they're defined to refer to. I've never personally used such a machine, but the way to think about them is that computing an invalid pointer is precisely as forbidden as dividing by 0; you're likely to get a run-time exception that terminates your program. Furthermore, what's forbidden is computing the pointer, you don't even have to dereference it to get the exception.
Yes, I believe the definition also ended up permitting more-efficient comparisons of offset registers in old 8086 code, but that was not the only reason.
Yes, a compiler for one of these protected pointer architectures could theoretically implement the "forbidden" comparisons by converting to unsigned or the equivalent, but (a) it would likely be significantly less efficient to do so and (b) that would be a wantonly deliberate circumvention of the architecture's intended protection, protection which at least some of the architecture's C programmers would presumably want to have enabled (not disabled).
Historically, saying that an action invoked Undefined Behavior meant that any program which made use of such actions could be expected to work correctly only on those implementations which defined, for that action, behavior meeting their requirements. Specifying that an action invoked Undefined Behavior didn't mean that programs using such an action should be considered "illegitimate", but was rather intended to allow C to be used to run programs that didn't require such actions on platforms which could not efficiently support them.
Generally, the expectation was that a compiler would either output the sequence of instructions which would most efficiently perform the indicated action in the cases required by the standard, and do whatever that sequence of instructions happened to do in other cases, or would output a sequence of instructions whose behavior in such cases was deemed to be in some fashion more "useful" than the natural sequence. In cases where an action might trigger a hardware trap, or where triggering an OS trap might plausibly in some cases be considered preferable to executing the "natural" sequence of instructions, and where a trap might cause behaviors outside the control of the C compiler, the Standard imposes no requirements. Such cases are thus labeled as "Undefined Behavior".
As others have noted, there are some platforms where p1 < p2, for unrelated pointers p1 and p2, could be guaranteed to yield 0 or 1, but where the most efficient means of comparing p1 and p2 that would work in the cases defined by the Standard might not uphold the usual expectation that p1 < p2 || p1 > p2 || p1 == p2. If a program written for such a platform knows that it will never deliberately compare unrelated pointers (implying that any such comparison would represent a program bug) it may be helpful to have stress-testing or troubleshooting builds generate code which traps on any such comparisons. The only way for the Standard to allow such implementations is to make such comparisons Undefined Behavior.
Until recently, the fact that a particular action would invoke behavior that was not defined by the Standard would generally only pose difficulties for people trying to write code on platforms where the action would have undesirable consequences. Further, on platforms where an action could only have undesirable consequences if a compiler went out of its way to make it do so, it was generally accepted practice for programmers to rely upon such an action behaving sensibly.
If one accepts the notions that:
The authors of the Standard expected that comparisons between unrelated pointers would work usefully on those platforms, and only those platforms, where the most natural means of comparing related pointers would also work with unrelated ones, and
There exist platforms where comparing unrelated pointers would be problematic
Then it makes complete sense for the Standard to regard unrelated-pointer comparisons as Undefined Behavior. Had they anticipated that even compilers for platforms which define a disjoint global ranking for all pointers might make unrelated-pointer comparisons negate the laws of time and causality (e.g. given:
int needle_in_haystack(char const *hs_base, int hs_size, char *needle)
{ return needle >= hs_base && needle < hs_base+hs_size; }
a compiler may infer that the program will never receive any input which would cause needle_in_haystack to be given unrelated pointers, and any code which would only be relevant when the program receives such input may be eliminated) I think they would have specified things differently. Compiler writers would probably argue that the proper way to write needle_in_haystack would be:
int needle_in_haystack(char const *hs_base, int hs_size, char *needle)
{
    for (int i = 0; i < hs_size; i++)
        if (hs_base + i == needle) return 1;
    return 0;
}
since their compilers would recognize what the loop is doing and also recognize that it's running on a platform where unrelated pointer comparisons work, and thus generate the same machine code as older compilers would have generated for the earlier-stated formulation. As to whether it would be better to require that compilers provide a means of specifying that code resembling the former version should either behave sensibly on platforms that will support it or refuse compilation on those that won't, or better to require that programmers intending the former semantics should write the latter and hope that optimizers turn it into something useful, I leave that to the reader's judgment.

embedded C - using "volatile" to assert consistency

Consider the following code:
// In the interrupt handler file:
volatile uint32_t gSampleIndex = 0; // declared 'extern'

void HandleSomeIrq()
{
    gSampleIndex++;
}

// In some other file
void Process()
{
    uint32_t localSampleIndex = gSampleIndex; // will this be optimized away?

    PrevSample = RawSamples[(localSampleIndex + 0) % NUM_RAW_SAMPLE_BUFFERS];
    CurrentSample = RawSamples[(localSampleIndex + 1) % NUM_RAW_SAMPLE_BUFFERS];
    NextSample = RawSamples[(localSampleIndex + 2) % NUM_RAW_SAMPLE_BUFFERS];
}
My intention is that PrevSample, CurrentSample and NextSample are consistent, even if gSampleIndex is updated during the call to Process().
Will the assignment to the localSampleIndex do the trick, or is there any chance it will be optimized away even though gSampleIndex is volatile?
In principle, volatile is not enough to guarantee that Process only sees consistent values of gSampleIndex. In practice, however, you should not run into any issues if uint32_t is directly supported by the hardware. The proper solution would be to use atomic accesses.
The problem
Suppose that you are running on a 16-bit architecture, so that the instruction
localSampleIndex = gSampleIndex;
gets compiled into two instructions (loading the upper half, loading the lower half). Then the interrupt might be called between the two instructions, and you'll get half of the old value combined with half of the new value.
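Roughly, the copy then behaves as if it were split like this (a sketch; which half is read first, and where each half sits, is up to the compiler and target):

uint16_t lo = ((volatile uint16_t *)&gSampleIndex)[0];
// <-- the IRQ may fire here and increment gSampleIndex
uint16_t hi = ((volatile uint16_t *)&gSampleIndex)[1];
uint32_t localSampleIndex = ((uint32_t)hi << 16) | lo;   // possibly a torn value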
The solution
The solution is to access gSampleCounter using atomic operations only. I know of three ways of doing that.
C11 atomics
In C11 (supported since GCC 4.9), you declare your variable as atomic:
#include <stdatomic.h>
atomic_uint gSampleIndex;
You then take care to only ever access the variable using the documented atomic interfaces. In the IRQ handler:
atomic_fetch_add(&gSampleIndex, 1);
and in the Process function:
localSampleIndex = atomic_load(&gSampleIndex);
Do not bother with the _explicit variants of the atomic functions unless you're trying to get your program to scale across large numbers of cores.
GCC atomics
Even if your compiler does not support C11 yet, it probably has some support for atomic operations. For example, in GCC you can say:
volatile int gSampleIndex;
...
__atomic_add_fetch(&gSampleIndex, 1, __ATOMIC_SEQ_CST);
...
__atomic_load(&gSampleIndex, &localSampleIndex, __ATOMIC_SEQ_CST);
As above, do not bother with weak consistency unless you're trying to achieve good scaling behaviour.
Implementing atomic operations yourself
Since you're not trying to protect against concurrent access from multiple cores, just race conditions with an interrupt handler, it is possible to implement a consistency protocol using standard C primitives only. Dekker's algorithm is the oldest known such protocol.
In your function you access the volatile variable just once (and it's the only volatile one in that function), so you don't need to worry about the code reorganization the compiler may do (and volatile prevents). What the standard says for these optimizations, at §5.1.2.3, is:
In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).
Note last sentence: "...no needed side effects are produced (...accessing a volatile object)".
Simply put, volatile will prevent any optimization the compiler may do around that code. Just to mention a few: no instruction reordering with respect to other volatile variables, no expression removal, no caching, no value propagation across functions.
BTW, I doubt any compiler would break your code (with or without volatile). Maybe the local stack variable will be elided, but the value will be kept in a register (it certainly won't repeatedly access a memory location). What you need volatile for is value visibility.
EDIT
I think some clarification is needed.
Let me safely assume you know what you're doing (you're working with interrupt handlers, so this shouldn't be your first C program): the CPU word size matches your variable type and the memory is properly aligned.
Let me also assume your interrupt is not reentrant (some magic cli/sti stuff or whatever your CPU uses for this), unless you're planning on a hard time of debugging and tuning.
If these assumptions are satisfied then you don't need atomic operations. Why? Because localSampleIndex = gSampleIndex is atomic (because it's properly aligned, the word size matches and it's volatile), and with ++gSampleIndex there isn't any race condition (HandleSomeIrq won't be called again while it's still executing). More than useless, atomics would be wrong here.
One may think: "OK, I may not need atomics, but why can't I use them? Even if such assumptions are satisfied, this is an *extra* and it'll achieve the same goal." No, it doesn't. Atomics don't have the same semantics as volatile variables (and volatile seldom is/should be used outside memory-mapped I/O and signal handling). Volatile is (usually) useless with atomics (unless a specific architecture says otherwise), but there is a great difference: visibility. When you update gSampleIndex in HandleSomeIrq, the standard guarantees that the value will be immediately visible to all threads (and devices); with atomic_uint the standard guarantees it'll be visible in a reasonable amount of time.
To make it short and clear: volatile and atomic are not the same thing. Atomic operations are useful for concurrency; volatile is useful for lower-level stuff (interrupts, devices). If you're still thinking "hey, they do *exactly* what I need", please read a few useful links picked from the comments: cache coherency and a nice read about atomics.
To summarize:
In your case you may use an atomic variable with a lock (to have both atomic access and value visibility), but no one on this earth would put a lock inside an interrupt handler (unless absolutely, definitely, doubtlessly, unquestionably needed, and from the code you posted it's not your case).
