C11 mixing atomic and non-atomic access to variable - c

Sometimes you may want to access a variable both atomically and non-atomically, which is why I find it convenient that on gcc you can write something like:
int var = 0;
var++;
atomic_fetch_add(&var, 1);
However, this does not compile with clang 4.0.1:
error: address argument to atomic operation must be a pointer to _Atomic type ('int *' invalid)
atomic_fetch_add(&var, 1);
The best solution I could find is a cast:
_Atomic int var = 0;
(*(int*)&var)++;
atomic_fetch_add(&var, 1);
Is there a simpler and portable way to achieve this?

C11 offers two interfaces that let you act on an atomic object with fewer restrictions.
First, you can always overwrite an atomic object when you know that you are the only one accessing it, usually during an initialization phase. Use atomic_init for that.
Second, if you need fewer guarantees for an access during execution, even with several threads, you can use a less restrictive memory order. For example, you could call atomic_fetch_add_explicit(&var, 1, memory_order_relaxed). This still guarantees that your access is indivisible (one of the properties that you want from an atomic), but it doesn't guarantee when another thread sees the updated value.
But generally speaking, if atomic accesses are performance critical, you are doing something wrong. So before you try semantically difficult dealings with atomics, benchmark your code and see if this really is a bottleneck. If so, think first of a way to change your algorithm, e.g. by doing more computations in local variables that are not subject to races. Only if all of that fails to give you the performance you want, have a look into the different memory semantics that C11 offers.
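The two interfaces mentioned in this answer could be combined roughly as follows. This is only a sketch: the counter name and the three helper functions are made up for illustration, and it assumes a C11 compiler that provides <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared counter; the name is illustrative only. */
static atomic_int var;

/* Setup phase: no other thread can access var yet, so atomic_init
   may store the initial value without any ordering guarantees. */
void counter_setup(int start)
{
    atomic_init(&var, start);
}

/* Indivisible increment that imposes no ordering on other memory
   accesses. Returns the value held before the increment. */
int counter_bump(void)
{
    return atomic_fetch_add_explicit(&var, 1, memory_order_relaxed);
}

/* Relaxed read of the current value. */
int counter_read(void)
{
    return atomic_load_explicit(&var, memory_order_relaxed);
}
```

Calling counter_setup(5) once before any thread starts, and then counter_bump() from the threads, keeps every access on the _Atomic object and avoids the cast from the question.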

The abstract machine defined by the C Standard has a rather different view of storage than most real machines. In particular, rather than thinking of memory accesses as actions which can be performed in a variety of different ways depending upon the circumstances, it instead views each object as supporting one kind of read and at most one kind of write (const-qualified objects don't support any kind of write); the kind of read and write required to access an object depends upon its type.
Such an approach may be useful for some kinds of hardware platforms, or for some optimization strategies, but is grossly unsuitable for many kinds of programs running on real-world platforms. Unfortunately, the Standard doesn't recognize any practical way by which programmers can indicate that certain objects should be treated as "ordinary" storage most of the time but require more precise memory semantics at certain specific times during program execution.

Related

Is a C compiler allowed to coalesce sequential assignments to volatile variables?

I'm having a theoretical (non-deterministic, hard to test, never happened in practice) hardware issue reported by hardware vendor where double-word write to certain memory ranges may corrupt any future bus transfers.
While I don't have any double-word writes explicitly in C code, I'm worried the compiler is allowed (in current or future implementations) to coalesce multiple adjacent word assignments into a single double-word assignment.
The compiler is not allowed to reorder assignments of volatiles, but it is unclear (to me) whether coalescing counts as reordering. My gut says it is, but I've been corrected by language lawyers before!
Example:
typedef struct
{
volatile unsigned reg0;
volatile unsigned reg1;
} Module;
volatile Module* module = (volatile Module*)0xFF000000u;
// two word stores, or one double-word store?
module->reg0 = 1;
module->reg1 = 2;
(I'll ask my compiler vendor about this separately, but I'm curious what the canonical/community interpretation of the standard is.)
No, the compiler is absolutely not allowed to optimize those two writes into a single double word write. It's kind of hard to quote the standard since the part regarding optimizations and side effects is so fuzzily written. The relevant parts are found in C17 5.1.2.3:
The semantic descriptions in this International Standard describe the behavior of an
abstract machine in which issues of optimization are irrelevant.
Accessing a volatile object, modifying an object, modifying a file, or calling a function that does any of those operations are all side effects, which are changes in the state of the execution environment.
In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).
Accesses to volatile objects are evaluated strictly according to the rules of the abstract machine.
When you access part of a struct, that in itself is a side-effect, which may have consequences that the compiler can't determine. Suppose for example that your struct is a hardware register map and those registers need to be written in a certain order. Like for example some microcontroller documentation could be along the lines of: "reg0 enables the hardware peripheral and must be written to before you can configure the details in reg1".
A compiler that would merge the volatile object writes into a single one would be non-conforming and plain broken.
The compiler is not allowed to make two such assignments into a single memory write. There must be two independent writes from the core. The answer from @Lundin gives relevant references to the C standard.
However, be aware that a cache, if present, may trick you. The keyword volatile doesn't imply "uncached" memory. So besides using volatile, you also need to make sure that the address 0xFF000000 is mapped as uncached. If the address is mapped as cached, the cache hardware may turn the two assignments into a single memory write. In other words, for cached memory, two core memory write operations may end up as a single write operation on the system's memory interface.
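The two-store pattern from the question can be exercised on a host machine by pointing the same code at ordinary RAM instead of 0xFF000000. The sketch below assumes 32-bit registers (uint32_t stands in for the question's unsigned), and module_configure is a hypothetical helper name. It cannot prove how wide the emitted stores are; for that you still have to inspect the generated assembly and the memory mapping on the real target.

```c
#include <assert.h>
#include <stdint.h>

/* Register layout from the question, with an explicit 32-bit width. */
typedef struct {
    volatile uint32_t reg0;
    volatile uint32_t reg1;
} Module;

/* Hypothetical helper: performs two word-sized volatile stores
   in source order. */
void module_configure(volatile Module *m)
{
    m->reg0 = 1;  /* first volatile access */
    m->reg1 = 2;  /* second volatile access, sequenced after the first */
}
```

On the target you would pass (volatile Module *)0xFF000000u; in a host-side test you can pass the address of a local Module object.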
The behavior of volatile seems to be up to the implementation, partly because of a curious sentence which says: "What constitutes an access to an object that has volatile-qualified type is implementation-defined".
In ISO C 99, section 5.1.2.3, there is also:
3 In the abstract machine, all expressions are evaluated as specified by the semantics. An
actual implementation need not evaluate part of an expression if it can deduce that its
value is not used and that no needed side effects are produced (including any caused by
calling a function or accessing a volatile object).
So although requirements are given that a volatile object must be treated in accordance with the abstract semantics (i.e. not optimized), curiously, the abstract semantics itself allows for the elimination of dead code and data flows, which are examples of optimizations!
I'm afraid that to know what volatile will and will not do, you have to go by your compiler's documentation.
The C Standard is agnostic to any relationship between operations on volatile objects and operations on the actual machine. While most implementations would specify that a construct like *(char volatile*)0x1234 = 0x56; would generate a byte store with value 0x56 to hardware address 0x1234, an implementation could, at its leisure, allocate space for e.g. an 8192-byte array and specify that *(char volatile*)0x1234 = 0x56; would immediately store 0x56 to element 0x1234 of that array, without ever doing anything with hardware address 0x1234. Alternatively, an implementation may include some process that periodically stores whatever happens to be in element 0x1234 of that array to hardware address 0x1234.
All that is required for conformance is that all operations on volatile objects within a single thread are, from the standpoint of the Abstract machine, regarded as absolutely sequenced. From the point of view of the Standard, implementations can convert such accesses into real machine operations in whatever fashion they see fit.
Merging the two stores would change the observable behavior of the program, so the compiler is not allowed to do so.

Does "volatile" guarantee anything at all in portable C code for multi-core systems?

After looking at a bunch of other questions and their answers, I get the impression that there is no widespread agreement on what the "volatile" keyword in C means exactly.
Even the standard itself does not seem to be clear enough for everyone to agree on what it means.
Among other problems:
It seems to provide different guarantees depending on your hardware and depending on your compiler.
It affects compiler optimizations but not hardware optimizations, so on an advanced processor that does its own run-time optimizations, it is not even clear whether the compiler can prevent whatever optimization you want to prevent. (Some compilers do generate instructions to prevent some hardware optimizations on some systems, but this does not appear to be standardized in any way.)
To summarize the problem, it appears (after reading a lot) that "volatile" guarantees something like: The value will be read/written not just from/to a register, but at least to the core's L1 cache, in the same order that the reads/writes appear in the code. But this seems useless, since reading/writing from/to a register is already sufficient within the same thread, while coordinating with L1 cache doesn't guarantee anything further regarding coordination with other threads. I can't imagine when it could ever be important to sync just with L1 cache.
USE 1
The only widely-agreed-upon use of volatile seems to be for old or embedded systems where certain memory locations are hardware-mapped to I/O functions, like a bit in memory that controls (directly, in the hardware) a light, or a bit in memory that tells you whether a keyboard key is down or not (because it is connected by the hardware directly to the key).
It seems that "use 1" does not occur in portable code whose targets include multi-core systems.
USE 2
Not too different from "use 1" is memory that could be read or written at any time by an interrupt handler (which might control a light or store info from a key). But already for this we have the problem that depending on the system, the interrupt handler might run on a different core with its own memory cache, and "volatile" does not guarantee cache coherency on all systems.
So "use 2" seems to be beyond what "volatile" can deliver.
USE 3
The only other undisputed use I see is to prevent mis-optimization of accesses via different variables pointing to the same memory that the compiler doesn't realize is the same memory. But this is probably only undisputed because people aren't talking about it -- I only saw one mention of it. And I thought the C standard already recognized that "different" pointers (like different args to a function) might point to the same item or nearby items, and already specified that the compiler must produce code that works even in such cases. However, I couldn't quickly find this topic in the latest (500 page!) standard.
So "use 3" maybe doesn't exist at all?
Hence my question:
Does "volatile" guarantee anything at all in portable C code for multi-core systems?
EDIT -- update
After browsing the latest standard, it is looking like the answer is at least a very limited yes:
1. The standard repeatedly specifies special treatment for the specific type "volatile sig_atomic_t". However the standard also says that use of the signal function in a multi-threaded program results in undefined behavior. So this use case seems limited to communication between a single-threaded program and its signal handler.
2. The standard also specifies a clear meaning for "volatile" in relation to setjmp/longjmp. (Example code where it matters is given in other questions and answers.)
So the more precise question becomes:
Does "volatile" guarantee anything at all in portable C code for multi-core systems, apart from (1) allowing a single-threaded program to receive information from its signal handler, or (2) allowing setjmp code to see variables modified between setjmp and longjmp?
This is still a yes/no question.
If "yes", it would be great if you could show an example of bug-free portable code which becomes buggy if "volatile" is omitted. If "no", then I suppose a compiler is free to ignore "volatile" outside of these two very specific cases, for multi-core targets.
I'm no expert, but cppreference.com has what appears to me to be some pretty good information on volatile. Here's the gist of it:
Every access (both read and write) made through an lvalue expression
of volatile-qualified type is considered an observable side effect for
the purpose of optimization and is evaluated strictly according to the
rules of the abstract machine (that is, all writes are completed at
some time before the next sequence point). This means that within a
single thread of execution, a volatile access cannot be optimized out
or reordered relative to another visible side effect that is separated
by a sequence point from the volatile access.
It also gives some uses:
Uses of volatile
1) static volatile objects model memory-mapped I/O ports, and static
const volatile objects model memory-mapped input ports, such as a
real-time clock
2) static volatile objects of type sig_atomic_t are used for
communication with signal handlers.
3) volatile variables that are local to a function that contains an
invocation of the setjmp macro are the only local variables guaranteed
to retain their values after longjmp returns.
4) In addition, volatile variables can be used to disable certain
forms of optimization, e.g. to disable dead store elimination or
constant folding for microbenchmarks.
And of course, it mentions that volatile is not useful for thread synchronization:
Note that volatile variables are not suitable for communication
between threads; they do not offer atomicity, synchronization, or
memory ordering. A read from a volatile variable that is modified by
another thread without synchronization or concurrent modification from
two unsynchronized threads is undefined behavior due to a data race.
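Use 3 from the quoted list (volatile locals around setjmp) can be sketched as follows. The function names are made up for illustration. The point is that attempts is modified between the setjmp call and the longjmp, so without volatile its value after the jump would be indeterminate.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf env;

/* Hypothetical helper: jumps back to the setjmp call in count_retries. */
static void fail_once(void)
{
    longjmp(env, 1);
}

/* Returns the number of attempts made. */
int count_retries(void)
{
    /* Modified between setjmp and longjmp, so it must be volatile
       to keep a determinate value after the jump. */
    volatile int attempts = 0;

    (void)setjmp(env);   /* execution resumes here after each longjmp */
    attempts++;
    if (attempts < 3)
        fail_once();
    return attempts;
}
```

With the volatile qualifier in place, count_retries() returns 3: the increments made before each longjmp are preserved.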
First of all, there have historically been various hiccups regarding different interpretations of the meaning of volatile access and similar. See this study: Volatiles Are Miscompiled, and What to Do about It.
Apart from the various issues mentioned in that study, the behavior of volatile is portable, save for one aspect of them: when they act as memory barriers. A memory barrier is some mechanism which is there to prevent concurrent unsequenced execution of your code. Using volatile as a memory barrier is certainly not portable.
Whether the C language guarantees memory behavior or not from volatile is apparently arguable, though personally I think the language is clear. First we have the formal definition of side effects, C17 5.1.2.3:
Accessing a volatile object, modifying an object, modifying a file, or calling a function that does any of those operations are all side effects, which are changes in the state of the execution environment.
The standard defines the term sequencing, as a way of determining order of evaluation (execution). The definition is formal and cumbersome:
Sequenced before is an asymmetric, transitive, pair-wise relation between evaluations
executed by a single thread, which induces a partial order among those evaluations.
Given any two evaluations A and B, if A is sequenced before B, then the execution of A
shall precede the execution of B. (Conversely, if A is sequenced before B, then B is
sequenced after A.) If A is not sequenced before or after B, then A and B are
unsequenced. Evaluations A and B are indeterminately sequenced when A is sequenced
either before or after B, but it is unspecified which. The presence of a sequence point
between the evaluation of expressions A and B implies that every value computation and
side effect associated with A is sequenced before every value computation and side effect
associated with B. (A summary of the sequence points is given in annex C.)
The TL;DR of the above is basically that in case we have an expression A which contains side-effects, it must be done executing before another expression B, in case B is sequenced after A.
Optimizations of C code are made possible through this part:
In the abstract machine, all expressions are evaluated as specified by the semantics. An actual
implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a
volatile object).
This means that the program may evaluate (execute) expressions in the order that the standard mandates elsewhere (order of evaluation etc). But it need not evaluate (execute) a value if it can deduce that it is not used. For example, for the operation 0 * x the implementation doesn't need to evaluate x and can simply replace the expression with 0.
Unless accessing a variable is a side-effect. Meaning that in case x is volatile, it must evaluate (execute) 0 * x even though the result will always be 0. Optimization is not allowed.
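A small sketch of the 0 * x case, with made-up function names. Both functions return 0; the difference is only visible in the generated code, where a typical compiler is expected to keep the load in the volatile version and may drop it in the plain one (you can compare the output of something like gcc -O2 -S yourself).

```c
#include <assert.h>

/* The read of *x is a side effect, so it has to be performed
   even though the product is always 0. */
int zero_times(volatile int *x)
{
    return 0 * *x;
}

/* Same expression on a plain int: the value is not needed,
   so the compiler may skip the load entirely. */
int zero_times_plain(const int *x)
{
    return 0 * *x;
}
```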
Furthermore, the standard speaks of observable behavior:
The least requirements on a conforming implementation are:
Accesses to volatile objects are evaluated strictly according to the rules of the abstract machine.
/--/
This is the observable behavior of the program.
Given all of the above, a conforming implementation (compiler + underlying system) may not execute the access of volatile objects in an unsequenced order, in case the semantics of the written C source says otherwise.
This means that in this example
volatile int x;
volatile int y;
z = x;
z = y;
Both assignment expressions must be evaluated and z = x; must be evaluated before z = y;. A multi-processor implementation that outsources these two operations to two different unsequenced cores is not conforming!
The dilemma is that compilers can't do much about things like pre-fetch caching and instruction pipelining etc., particularly not when running on top of an OS. And so compilers hand that problem over to the programmers, telling them that memory barriers are now the programmer's responsibility, while the C standard clearly states that the problem needs to be solved by the compiler.
The compiler doesn't necessarily care to solve the problem though, and so volatile for the sake of acting as a memory barrier is non-portable. It has become a quality of implementation issue.
To summarize the problem, it appears (after reading a lot) that
"volatile" guarantees something like: The value will be read/written
not just from/to a register, but at least to the core's L1 cache, in
the same order that the reads/writes appear in the code.
No, it absolutely does not. And that makes volatile almost useless for the purpose of MT-safe code.
If it did, then volatile would be quite good for variables shared by multiple threads, as ordering the events in the L1 cache is all you need to do on a typical CPU (multi-core, or multi-CPU on one motherboard) capable of cooperating in a way that makes a normal implementation of C/C++ or Java multithreading possible with typical expected costs (that is, not a huge cost on most atomic or non-contended mutex operations).
But volatile does not provide any guaranteed ordering (or "memory visibility") in the cache either in theory or in practice.
(Note: the following is based on a sound interpretation of the standard documents, the standard's intent, historical practice, and a deep understanding of the expectations of compiler writers. This approach is based on history, actual practices, and the expectations and understanding of real persons in the real world, which is much stronger and more reliable than parsing the words of a document that is not known to be stellar specification writing and which has been revised many times.)
In practice, volatile does guarantee ptrace-ability, that is, the ability to use debug information for the running program at any level of optimization, and the fact that the debug information makes sense for these volatile objects:
you may use ptrace (a ptrace-like mechanism) to set meaningful break points at the sequence points after operations involving volatile objects: you can really break at exactly these points (note that this works only if you are willing to set many break points as any C/C++ statement may be compiled to many different assembly start and end points, as in a massively unrolled loop);
while a thread of execution is stopped, you may read the value of all volatile objects, as they have their canonical representation (following the ABI for their respective type); a non-volatile local variable could have an atypical representation, e.g. a shifted representation: a variable used for indexing an array might be multiplied by the size of individual objects, for easier indexing; or it might be replaced by a pointer to an array element (as long as all uses of the variable are similarly converted) (think changing dx to du in an integral);
you can also modify those objects (as long as the memory mappings allow that, as volatile object with static lifetime that are const qualified might be in a memory range mapped read only).
Volatile guarantees in practice a little more than the strict ptrace interpretation: it also guarantees that volatile automatic variables have an address on the stack, as they aren't allocated to a register; register allocation would make ptrace manipulations more delicate (the compiler can output debug information to explain how variables are allocated to registers, but reading and changing register state is slightly more involved than accessing memory addresses).
Note that full program debug-ability, that is, considering all variables volatile at least at sequence points, is provided by the "zero optimization" mode of the compiler, a mode which still performs trivial optimizations like arithmetic simplifications (there is usually no guaranteed "no optimization at all" mode). But volatile is stronger than non-optimization: x-x can be simplified for a non-volatile integer x but not for a volatile object.
So volatile means guaranteed to be compiled as is, just as the compiler's translation of a system call from source to binary/assembly isn't reinterpreted, changed, or optimized in any way. Note that library calls may or may not be system calls. Many official system functions are actually library functions that offer a thin layer of interposition and generally defer to the kernel at the end. (In particular getpid doesn't need to go to the kernel and could well read a memory location provided by the OS containing the information.)
Volatile interactions are interactions with the outside world of the real machine, which must follow the "abstract machine". They aren't internal interactions of program parts with other program parts. The compiler can only reason about what it knows, that is the internal program parts.
The code generation for a volatile access should follow the most natural interaction with that memory location: it should be unsurprising. That means that some volatile accesses are expected to be atomic: if the natural way to read or write the representation of a long on the architecture is atomic, then it's expected that a read or write of a volatile long will be atomic, as the compiler should not generate silly inefficient code to access volatile objects byte by byte, for example.
You should be able to determine that by knowing the architecture. You don't have to know anything about the compiler, as volatile means that the compiler should be transparent.
But volatile does no more than force the emission of the expected, least specialized assembly for a memory operation: volatile semantics means general-case semantics.
The general case is what the compiler does when it doesn't have any information about a construct: e.g. calling a virtual function on an lvalue via dynamic dispatch is a general case, while making a direct call to the overrider after determining at compile time the type of the object designated by the expression is a particular case. The compiler always has a general-case handling of all constructs, and it follows the ABI.
Volatile does nothing special to synchronize threads or provide "memory visibility": volatile only provides guarantees at the abstract level seen from inside a thread executing or stopped, that is the inside of a CPU core:
volatile says nothing about which memory operations reach main RAM (you may set specific memory caching types with assembly instructions or system calls to obtain these guarantees);
volatile doesn't provide any guarantee about when memory operations will be committed to any level of cache (not even L1).
The second point alone means volatile is not useful in most inter-thread communication problems; the first point is essentially irrelevant in any programming problem that doesn't involve communication with hardware components outside the CPU(s) but still on the memory bus.
The property of volatile providing guaranteed behavior from the point of the view of the core running the thread means that asynchronous signals delivered to that thread, which are run from the point of view of the execution ordering of that thread, see operations in source code order.
Unless you plan to send signals to your threads (an extremely useful approach to consolidation of information about currently running threads with no previously agreed point of stopping), volatile is not for you.
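The signal case can be sketched in a single-threaded program with the standard signal and raise functions. The helper names below are made up for illustration, and SIGINT is chosen only because ISO C defines it. As noted in the question, the standard leaves the use of signal in a multi-threaded program undefined, so this sketch deliberately stays single-threaded.

```c
#include <assert.h>
#include <signal.h>

/* Flag shared between the handler and the main flow of control. */
static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;   /* volatile sig_atomic_t is safe to write from a handler */
}

/* Installs the handler, raises SIGINT synchronously, and reports
   the flag. Returns -1 if the handler could not be installed. */
int signal_roundtrip(void)
{
    got_signal = 0;
    if (signal(SIGINT, on_signal) == SIG_ERR)
        return -1;
    raise(SIGINT);             /* the handler runs before raise returns */
    signal(SIGINT, SIG_DFL);   /* restore the default disposition */
    return got_signal;
}
```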
The ISO C standard, no, but in practice all machines that we run threads across have coherent shared memory, so volatile in practice works somewhat like _Atomic with memory_order_relaxed, at least for pure-load / pure-store operations on small-enough types. (But of course only _Atomic will give you atomic RMWs for stuff like n += 1;)
There's also the question of what exactly volatile means to a compiler. The standard allows wiggle room, but in real-world compilers, it means the load or store has to actually happen in the asm. No more, no less. (A compiler that didn't work this way couldn't correctly compile pre-C11 multi-threaded code that used hand-rolled volatile, so that de-facto standard is a requirement for compilers to be generally useful and for anyone to want to actually use them. ISO C leaves enough choice up to the implementation that a DeathStation 9000 could be ISO C compliant and almost totally unusable for real programs, and break most real code bases.)
The requirement that volatile accesses are guaranteed to happen in source order is normally interpreted as putting the asm in that order, leaving runtime reordering at the mercy of the target machine's memory model. volatile accesses aren't ordered wrt. anything else, so plain operations can still optimize away separately from them.
When to use volatile with multi threading? is a C++ version of the question. Answer: basically never, use stdatomic. My answer there explains why cache-coherency makes volatile useful in practice: there are no C or C++ implementations I'm aware of where shared_var.store(1, std::memory_order_relaxed) needs to explicitly flush anything to make the store visible to other cores. It compiles to just a normal asm store instruction, for variables narrow enough to be "naturally" atomic.
(Memory barriers just make this core wait, e.g. until the store commits from the store buffer to L1d cache and thus becomes globally visible, before doing later loads/stores. So they order this core's accesses to coherent shared memory.)
For example, the Linux kernel depends on this, using volatile for inter-thread visibility, and asm() for memory barriers to order those accesses, and for atomic-RMW operations. All multi-core systems that can run a single instance of Linux across those cores have coherent shared memory.
There are some rare systems with shared memory that isn't coherent, for example some clusters. But you don't run threads of the same process across different coherency domains. (Or run a single instance of the OS on it). Instead, the shared memory has to get mapped differently from normal write-back cacheable, or you have to do explicit flushing.
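For comparison, here is what the recommended stdatomic route looks like for a read-modify-write, which volatile cannot make atomic. This is a sketch under assumptions: it uses POSIX threads (so it typically needs to be built with -pthread), and the names hits, worker, run_counters and the iteration count are arbitrary choices for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

enum { ITERATIONS = 100000 };   /* arbitrary count for the sketch */

static atomic_int hits;

/* Each worker performs ITERATIONS atomic increments. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++)
        atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
    return NULL;
}

/* Runs two workers and returns the final count,
   or -1 if a thread could not be created. */
int run_counters(void)
{
    pthread_t a, b;

    atomic_store_explicit(&hits, 0, memory_order_relaxed);
    if (pthread_create(&a, NULL, worker, NULL) != 0)
        return -1;
    if (pthread_create(&b, NULL, worker, NULL) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);   /* joining makes all increments visible here */
    return atomic_load_explicit(&hits, memory_order_relaxed);
}
```

Because every increment is an indivisible RMW, the result is exactly 2 * ITERATIONS. With a volatile int and hits++ instead, increments could be lost and the program would contain a data race.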

Race condition when accessing adjacent members in a shared struct, according to CERT coding rule POS49-C?

According to CERT coding rule POS49-C it is possible that different threads accessing different fields of the same structure may conflict.
Instead of bit-field, I use regular unsigned int.
struct multi_threaded_flags {
unsigned int flag1;
unsigned int flag2;
};
struct multi_threaded_flags flags;
void thread1(void) {
flags.flag1 = 1;
}
void thread2(void) {
flags.flag2 = 2;
}
I can see that even with unsigned int there can still be a race condition IF the compiler decides to use 8-byte loads/stores instead of 4-byte ones.
I think the compiler will never do that and a race condition will never happen here, but that's just my guess.
Is there any well-defined assembly/compiler documentation regarding this case? I hope locking, which is costly, is the last resort when this situation happens to be undefined.
FYI, I use gcc.
The C11 memory model guarantees that accesses to distinct structure members (which aren't part of a bit-field) are independent, so you'll run into no problems modifying the two flags from different threads (i.e., the "load 8 bytes, modify 4, and write back 8" scenario is not allowed).
This guarantee does not extend in general to bitfields, so you have to be careful there.
Of course, if you are concurrently modifying the same flag from more than one thread, you'll likely trigger the prohibition against data races, so don't do that.
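The question's example can be turned into a complete program along these lines. It is a sketch that assumes POSIX threads and a C11 compiler (typically built with -pthread); run_flag_writers is a hypothetical driver added for the example. Each thread writes only its own member, which C11 treats as a separate memory location, and the main thread reads the flags only after joining both threads.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct multi_threaded_flags {
    unsigned int flag1;
    unsigned int flag2;
};

static struct multi_threaded_flags flags;

/* Each thread touches exactly one member. */
static void *thread1(void *arg) { (void)arg; flags.flag1 = 1; return NULL; }
static void *thread2(void *arg) { (void)arg; flags.flag2 = 2; return NULL; }

/* Hypothetical driver: returns 0 on success,
   -1 if a thread could not be created. */
int run_flag_writers(void)
{
    pthread_t t1, t2;

    flags.flag1 = 0;
    flags.flag2 = 0;
    if (pthread_create(&t1, NULL, thread1, NULL) != 0)
        return -1;
    if (pthread_create(&t2, NULL, thread2, NULL) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);   /* after the joins, both writes are visible */
    return 0;
}
```

If the two members were adjacent bit-fields instead, the same code would no longer be safe, which is the case the CERT rule is really about.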
Before C11, ISO C had nothing to say about threads, and writing multi-threaded code relied on other standards (e.g. POSIX which defines a memory model for pthreads), and multi-threaded code essentially depended on the way real compilers worked.
Note that this CERT coding-standard rule is in the POSIX section, and appears to be about pthreads without C11. (There's a CON32-C. Prevent data races when accessing bit-fields from multiple threads rule for C11, where they solve the bit-field concurrency problem by simply promoting the bit-fields to unsigned char, which C11 defines as "separate memory locations". This rule appears to be an insufficiently-edited copy of that, because many of its suggestions suck.)
But unfortunately POSIX pthreads doesn't clearly define what a "memory location" is, and this is all they have to say on the subject:
Single UNIX® Specification, Version 4, 2016 Edition (online HTML copy, requires free registration)
4.12 Memory Synchronization
Applications shall ensure that access to any memory location by more
than one thread of control (threads or processes) is restricted such
that no thread of control can read or modify a memory location while
another thread of control may be modifying it. Such access is
restricted using functions that synchronize thread execution and also
synchronize memory with respect to other threads.
This is why C11 defines it more clearly, where only bitfields are dangerous to write from different threads (barring compiler bugs).
However, I think everyone (including all compilers) agreed that separate int variables / struct members / array elements were separate "memory locations". Most real-world software doesn't take any special precautions for int or char variables that may be written by separate threads (especially outside of structs).
A compiler that gets int wrong will cause problems all over the place unless the bug is limited to very specific circumstances.
Most bugs like this are very hard to detect with testing, because usually the other data that's non-atomically loaded and stored back isn't written by another thread very often / ever. But if a compiler always did that for every int, problems would show up in some software pretty quickly.
Normally, separate char members would also be considered separate "memory locations", but some pre-C11 implementations might have exceptions to that rule. (Especially on early Alpha AXP which famously has no byte store instruction (so a C11 implementation would have to use 32-bit char), but optimizations that invent writes when updating multiple members can happen anywhere, either by accident or because the compiler developers define "memory location" as a 32 or 64-bit word.)
There's also the issue of compiler bugs. This can affect even compilers that intend to conform to C11. For example gcc bug 52080 which affected some non-x86 architectures. (Discovered in gcc4.7 in 2012, fixed in gcc4.8 a couple months later). Using a bitfield "tricked" the compiler into doing a non-atomic read-modify-write of the containing 64-bit word, even though that included a non-bitfield member. (Bitfields are bait for compiler bugs. Any defensive / safe-coding standard should recommend avoiding them in structs where different members can be modified from different threads. And definitely don't put them next to the actual lock.)
Herb Sutter's talk atomic<> Weapons: The C++ Memory Model and Modern Hardware part 2 goes into some detail about the kinds of compiler bugs that have affected multi-threaded code. Most of these should be shaken out by now (2017) if you're using a modern compiler. Most things like inventing writes (or non-atomic read and write-back of the same value) were usually still considered bugs before C11; C11 mostly just firmed up the rules compilers were already trying to follow. It also made it easier to report such bugs, because you could say unequivocally that it violates the standard instead of just "it breaks my code".
That coding-rule article is poorly written. Its examples with adjacent bit-fields are unsafe, but it claims that all variables are at risk. This is not true in general, especially not with C11. Many users of pthreads can or already do compile with C11 compilers.
(The phrasing I'm referring to is "Bit-fields are especially prone to this behavior", which incorrectly implies that this is allowed to happen with ordinary members of structs, or variables that happen to be adjacent outside of structs.)
It's part of a defensive coding standard, but it should definitely make the distinction between what the standards require and what is just belt-and-suspenders defense against compiler bugs.
Also, putting variables that will usually be accessed by different threads into one struct is generally terrible. False sharing of a cache line (typically 64 bytes) is really bad for performance, causing cache misses and (on out-of-order x86 CPUs) memory-ordering mis-speculation (like a branch mispredict requiring a roll-back.) Putting separately-used shared variables into the same byte with bit-fields is even worse, because it prevents efficient stores (any store has to be a RMW of the containing byte).
Solving the bit-field problem by promoting the two bit-fields to unsigned char makes much more sense than using a mutex if they need to be independently writeable from separate threads. Or even unsigned long if you're paranoid.
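As a rough sketch of that fix, assuming POSIX threads and a C11 compiler: the struct, the function names and the loop count below are made up for illustration and are not taken from the CERT article. Each flag is its own unsigned char, so each is a separate memory location and the two writers need no mutex.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical struct: each flag is its own unsigned char, and therefore
 * its own C11 "memory location", instead of a bit-field sharing a byte. */
struct flags {
    unsigned char flag1;
    unsigned char flag2;
};

static struct flags shared_flags;

static void *set_flag1(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++)
        shared_flags.flag1 = 1;   /* plain store, modifies only flag1 */
    return NULL;
}

static void *set_flag2(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++)
        shared_flags.flag2 = 2;   /* plain store, modifies only flag2 */
    return NULL;
}

/* Runs both writers concurrently and returns flag1 + flag2. */
int run_flag_writers(void) {
    pthread_t t1, t2;
    shared_flags.flag1 = 0;
    shared_flags.flag2 = 0;
    int rc = pthread_create(&t1, NULL, set_flag1, NULL);
    assert(rc == 0);
    rc = pthread_create(&t2, NULL, set_flag2, NULL);
    assert(rc == 0);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return shared_flags.flag1 + shared_flags.flag2;
}
```

Calling run_flag_writers() from main (built with something like cc -std=c11 -pthread) should return 3 on a conforming compiler, since neither store is allowed to rewrite the neighbouring member.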
If the two members are often used together, it makes sense to put them nearby. But if you're going to pad to a whole long to contain both members (like that article does), you might as well make them at least unsigned char or bool instead of 1-bit bit-fields.
Although honestly, having two threads modify separate members of a struct at the same time seems like poor design unless one of the members is the lock, and the modification is part of an attempt to take the lock. Using a bit-field as a lock is a bad idea unless you're writing for a specific ISA and building your own lock primitive using something like x86's lock bts instruction to atomically test-and-set a bit. Even then it's a bad idea unless you need to pack it with other bitfields for space saving; the Linux code that exposed the gcc bug with an int lock:1 member was a horrible idea.
In addition, the flags are declared volatile to ensure that the compiler will not attempt to move operations on them outside the mutex.
If your compiler needs this, your compiler is seriously broken, and will create broken code for most multi-threaded programs. (Unless the compiler bug only happens with bit-fields, because shared bit-fields are rare).
Most code doesn't make shared variables volatile, and relies on the guarantee that mutex lock/unlock stops operations from reordering at compile or run time out of the critical section.
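To illustrate, here is a minimal sketch of the usual pattern, assuming POSIX threads. The names counter, counter_lock and run_locked_counter are hypothetical. The shared variable is deliberately not volatile; the lock and unlock calls alone keep the increment inside the critical section.

```c
#include <assert.h>
#include <pthread.h>

#define INCREMENTS_PER_THREAD 100000

static long counter;   /* shared, deliberately NOT volatile */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment_many(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS_PER_THREAD; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;     /* may not be moved out of the locked region */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs two incrementing threads and returns the final count. */
long run_locked_counter(void) {
    pthread_t a, b;
    counter = 0;
    int rc = pthread_create(&a, NULL, increment_many, NULL);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, increment_many, NULL);
    assert(rc == 0);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;
}
```

With a working compiler and pthreads implementation, run_locked_counter() returns 200000 every time; a lost update would show up as a smaller number.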
Back in 2012, and possibly still today, gcc -pthread might affect code-gen choices in C89/C99 mode (-std=gnu99). In discussion on an LWN article about that gcc bug, this user claimed that -pthread would prohibit the compiler from doing a 64-bit load/store when modifying a 32-bit variable, but that without -pthread it could do so (although on most architectures, IDK why it would). But it turns out that gcc bug manifested even with -pthread, so it was really a bug rather than an aggressive optimization choice.
ISO C11 standard:
N1570, section 3.14 definitions:
memory location: either an object of scalar type, or a maximal sequence of adjacent bit-fields all having nonzero width
NOTE 1 Two threads of execution can update and access separate memory locations without interfering with each other.
A bit-field and an adjacent non-bit-field member are in separate memory locations. ... It is not safe to concurrently update two non-atomic bit-fields in the same structure if all members declared between them are also (non-zero-length) bit-fields, no matter what the sizes of those intervening bit-fields happen to be.
(...gives an example of a struct with bit-fields...)
So in C11, you can't assume anything about the compiler munging other bit-fields when writing one bit-field, but otherwise you're safe. The exception is if you use a separator :0 field to force the compiler to pad enough (or use atomic bit-ops) so that it can update your bit-field without concurrency problems for other fields. But if you want to be safe, it's probably not a good idea to use bit-fields at all in structs that are written by multiple threads at once.
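A small sketch of the :0 separator, assuming POSIX threads and a compiler that follows the C11 memory model. The struct and function names are hypothetical. The zero-width bit-field ends the "maximal sequence of adjacent bit-fields", so a and b become two separate memory locations.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical struct: the unnamed zero-width bit-field splits a and b
 * into two separate C11 memory locations. */
struct split_fields {
    unsigned int a : 4;
    unsigned int   : 0;
    unsigned int b : 4;
};

static struct split_fields fields;

static void *write_a(void *arg) {
    (void)arg;
    for (unsigned i = 0; i < 100000; i++)
        fields.a = i & 0xFu;   /* read-modify-write of a's storage unit only */
    fields.a = 5;
    return NULL;
}

static void *write_b(void *arg) {
    (void)arg;
    for (unsigned i = 0; i < 100000; i++)
        fields.b = i & 0xFu;   /* read-modify-write of b's storage unit only */
    fields.b = 9;
    return NULL;
}

/* Runs both writers concurrently and returns (a << 4) | b. */
int run_split_writers(void) {
    pthread_t ta, tb;
    fields.a = 0;
    fields.b = 0;
    int rc = pthread_create(&ta, NULL, write_a, NULL);
    assert(rc == 0);
    rc = pthread_create(&tb, NULL, write_b, NULL);
    assert(rc == 0);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return (int)((fields.a << 4) | fields.b);
}
```

Without the :0 line, a and b would share one memory location and the same two threads would have a data race. How much padding the separator adds is up to the implementation.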
See also other Notes in the C11 standard, e.g. the one linked by @Fuz in "Is it well-defined behavior to modify one element of an array while another thread modifies another element of the same array?" that explicitly says that compiler transformations that would make this dangerous are disallowed.

Threading and Thread Safety in C

When there is a common set of global data that needs to be shared among several threaded processes, I typically have used a thread token to protect the shared resource:
Edit - 7/22/15 (to incorporate atomics as a viable option, per Jens' comments)
My [First] question is, in C, if I write my routines in such a way as to guarantee each thread accesses one, and only one element of an array:
Is there any reason to think that asynchronous and simultaneous access to different indices of the same unprotected array (as shown in diagram) would be a problem?
Second question: Given that an object that can be accessed as an atomic entity, even in the presence of asynchronous interrupts (C99 - 7.14 Signal handling), would using atomics be an effective method for thread protection for an otherwise unprotected variable?
Edit (Clarifications to address questions in comments to this point):
- Specifics for this application:
- Target OS: Windows 7/8/10
- Compiler : C99 compliant (cannot use C11, which includes the _Atomic() type specifier)
- H/W : Intel i7 family
This (which looks like a C standard of some sort)
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf sayeth:
NOTE 1 Two threads of execution can update and access separate memory
locations without interfering with each other
NOTE 13 Compiler transformations that introduce assignments to a
potentially shared memory location that would not be modified by the
abstract machine are generally precluded by this standard, since such
an assignment might overwrite another assignment by a different thread
in cases in which an abstract machine execution would not have
encountered a data race. This includes implementations of data member
assignment that overwrite adjacent members in separate memory
locations. We also generally preclude reordering of atomic loads in
cases in which the atomics in question may alias, since this may
violate the "visible sequence" rules.
The way I understand it, this would preclude quamrana's concerns and guarantee you that unprotected writes to separate memory locations should never result in undefined behavior if there is no data race.
In C it will depend on your platform, that is your combination of compiler, processor architecture and operating system.
Your compiler can choose how to use the internal registers and instructions of the CPU to make the executable seem to perform the intent of the program. And C may know nothing about threads. It is usually the job of the operating system to provide a threading library.
There may be processors which perform the write to an element of your array by reading a much larger patch of memory than just one element, overwriting just the right bits that form one element within internal registers, and then writing the whole patch back. A single-threaded program would work just fine, but two or more threads which interrupt each other could cause chaos in the array.
On the other hand it may work out just fine.
And as has been said, read-only access is always just fine.
Also, Google is your friend. It found this Stack Overflow question.
If each thread is accessing a different array element, and only the element it is "assigned", this shouldn't be a problem. Both scenarios above are essentially equivalent, since each array element has its own address.
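A minimal sketch of that arrangement, assuming a C11-style memory model. POSIX threads are used here only for illustration (the question targets Windows, which has its own threading API), and the names results, fill_slot and run_slot_workers are hypothetical. Each thread receives a pointer to its own element and never touches any other.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_WORKERS 4

static int results[NUM_WORKERS];   /* unprotected; one element per thread */

static void *fill_slot(void *arg) {
    int *slot = arg;                     /* this thread's element only */
    int index = (int)(slot - results);   /* 0 .. NUM_WORKERS-1 */
    for (int i = 0; i < 1000; i++)
        *slot += index + 1;
    return NULL;
}

/* Starts one thread per element and returns the sum of all elements. */
int run_slot_workers(void) {
    pthread_t workers[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        results[i] = 0;
    for (int i = 0; i < NUM_WORKERS; i++) {
        int rc = pthread_create(&workers[i], NULL, fill_slot, &results[i]);
        assert(rc == 0);
    }
    int sum = 0;
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(workers[i], NULL);  /* join before reading the element */
        sum += results[i];
    }
    return sum;
}
```

The sum is 1000 * (1 + 2 + 3 + 4) = 10000. The only synchronization is pthread_create and pthread_join; the elements themselves need no lock because each is a separate memory location.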

multithreaded C/C++ variable no cache (Linux)

I use 2 pthreads, where one thread "notifies" the other one of an event, and for that there is a variable ( normal integer ), which is set by the second thread.
This works, but my question is: is it possible that the update is not seen immediately by the first (reading) thread, meaning the cache is not updated directly? And if so, is there a way to prevent this behaviour, e.g. like the volatile keyword in Java?
(the frequency which the event occurs is approximately in microsecond range, so more or less immediate update needs to be enforced).
/edit: 2nd question: is it possible to enforce that the variable is held in the cache of the core where thread 1 is, since this one is reading it all the time?
It sounds to me as though you should be using a pthread condition variable as your signaling mechanism. This takes care of all the issues you describe.
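A minimal sketch of that approach, assuming POSIX threads. The names event_lock, event_cond, event_pending, event_value and the value 42 are hypothetical. The notifying thread sets a flag under the mutex and signals; the waiting thread re-checks the flag in a loop because condition variables can wake spuriously.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t event_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  event_cond = PTHREAD_COND_INITIALIZER;
static bool event_pending = false;   /* the predicate, protected by event_lock */
static int  event_value = 0;         /* data that travels with the event */

static void *notifier(void *arg) {
    (void)arg;
    pthread_mutex_lock(&event_lock);
    event_value = 42;                /* arbitrary illustrative payload */
    event_pending = true;
    pthread_cond_signal(&event_cond);
    pthread_mutex_unlock(&event_lock);
    return NULL;
}

/* Starts the notifier, blocks until the event arrives, returns its value. */
int wait_for_event(void) {
    pthread_t t;
    int rc = pthread_create(&t, NULL, notifier, NULL);
    assert(rc == 0);

    pthread_mutex_lock(&event_lock);
    while (!event_pending)           /* loop guards against spurious wakeups */
        pthread_cond_wait(&event_cond, &event_lock);
    event_pending = false;           /* consume the event */
    int seen = event_value;
    pthread_mutex_unlock(&event_lock);

    pthread_join(t, NULL);
    return seen;
}
```

Because the flag is only read and written with the mutex held, the waiter cannot miss the event even if the notifier runs first, and no volatile or explicit barrier is needed.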
It may not be immediately visible by the other processors but not because of cache coherence. The biggest problems of visibility will be due to your processor's out-of-order execution schemes or due to your compiler re-ordering instructions while optimizing.
In order to avoid both these problems, you have to use memory barriers. I believe that most pthread primitives are natural memory barriers which means that you shouldn't expect loads or stores to be moved beyond the boundaries formed by the lock and unlock calls. The volatile keyword can also be useful to disable a certain class of compiler optimizations that can be useful when doing lock-free algorithms but it's not a substitute for memory barriers.
That being said, I recommend you don't do this manually, because there are quite a few pitfalls associated with lock-free algorithms. Leaving these headaches to library writers should make you a happier camper (unless you're like me and you love headaches :) ). So my final recommendation is to ignore everything I said and use what vromanov or David Heffman suggested.
The most appropriate way to pass a signal from one thread to another should be to use the runtime library's signalling mechanisms, such as mutexes, condition variables, semaphores, and so forth.
If these have too high an overhead, my first thought would be that there was something wrong with the structure of the program. If it turned out that this really was the bottleneck, and restructuring the program was inappropriate, then I would use atomic operations provided by the compiler or a suitable library.
Using plain int variables, or even volatile-qualified ones is error prone, unless the compiler guarantees they have the appropriate semantics. e.g. MSVC makes particular guarantees about the atomicity and ordering constraints of plain loads and stores to volatile variables, but gcc does not.
A better way is to use atomic variables. For example, you can use libatomic. The volatile keyword is not enough.
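As one possible illustration, here is a sketch using C11 <stdatomic.h> (a swapped-in alternative, not the libatomic library mentioned above) together with POSIX threads. The names payload, ready and consume_when_ready are hypothetical. The writer publishes plain data with a release store to an atomic flag, and the reader spins on an acquire load.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;        /* plain data, published through the flag */
static atomic_int ready;   /* static storage, so it starts out as 0 */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;          /* arbitrary illustrative value */
    /* Release: the payload write becomes visible before ready == 1. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Starts the producer, spins until the flag is set, returns the payload. */
int consume_when_ready(void) {
    pthread_t t;
    atomic_store_explicit(&ready, 0, memory_order_relaxed);
    int rc = pthread_create(&t, NULL, producer, NULL);
    assert(rc == 0);

    /* Acquire: once 1 is observed, the payload write is visible too. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                  /* busy-wait; fine for a short demo only */

    int seen = payload;
    pthread_join(t, NULL);
    return seen;
}
```

Unlike a plain or volatile int, the atomic flag cannot be cached in a register by the compiler, and the release/acquire pair orders the payload access on weakly-ordered hardware as well.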

Resources