Is volatile really not useful in concurrent programming? - c

According to the Linux kernel documentation (https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt), volatile is never useful for concurrent programming.
I agree it's true if you stick to the C11 standard. The standard explicitly forbids modifying a memory location at a point where the abstract machine does not touch it; however, this is not the case in C89 and C99. A compiler conforming to C89/C99 is free to optimize the code by adding "invented" stores to a shared variable at a point where the C code does nothing with that variable (that is, outside a critical section). So the only way to correctly synchronize inter-thread data is to mark all shared data as volatile and, in addition, use a lock or mutex. Why are the Linux developers ignoring this point?
Note from C11, which is not present in C89 and C99:
NOTE 13 Compiler transformations that introduce assignments to a potentially shared memory location that would not be modified by the abstract machine are generally precluded by this standard, since such an assignment might overwrite another assignment by a different thread in cases in which an abstract machine execution would not have encountered a data race. This includes implementations of data member assignment that overwrite adjacent members in separate memory locations. We also generally preclude reordering of atomic loads in cases in which the atomics in question may alias, since this may violate the "visible sequence" rules.
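To make the concern concrete, here is a minimal sketch of the kind of transformation NOTE 13 rules out. The names set_if and set_if_invented are hypothetical, and the second function is a hand-written model of what a pre-C11 compiler could in principle emit, not the output of any real compiler.

```c
#include <assert.h>
#include <stdbool.h>

int shared = 0;  /* imagine another thread also writes this under a lock */

/* What the programmer wrote: the store happens only when cond is true. */
void set_if(bool cond)
{
    if (cond)
        shared = 1;
}

/* Hand-written model of an "invented store": same result in a single
 * thread, but it writes `shared` even when cond is false. If another
 * thread updates `shared` between the two stores, that update is lost. */
void set_if_invented(bool cond)
{
    int saved = shared;   /* speculative save */
    shared = 1;           /* unconditional store */
    if (!cond)
        shared = saved;   /* "undo" store the abstract machine never performs */
}
```

Both functions leave shared with the same value when run single-threaded, which is why such a rewrite looks legal without a memory model that mentions threads.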

Related

Is `int` an atomic type?

Quoting gnu:
In practice, you can assume that int is atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these assumptions are true on all of the machines that the GNU C Library supports and on all POSIX systems we know of.
How is this possible? All examples I've seen related to locks are made with an int counter, like this https://www.delftstack.com/howto/c/mutex-in-c/.
This text in the glibc manual is only relevant for things like volatile sig_atomic_t being safe between a thread and its signal handler. They're guaranteeing that volatile int will work that way, too, in GNU systems. Note the context you omitted:
To avoid uncertainty about interrupting access to a variable, you can use a particular data type for which access is always atomic: sig_atomic_t. Reading and writing this data type is guaranteed to happen in a single instruction, so there’s no way for a handler to run “in the middle” of an access.
This says nothing about atomicity wrt. other threads or concurrency. An interrupt handler takes over the CPU and runs instead of your code. Asynchronously at any point, but the main thread and its signal handler aren't both running at once (especially not on different CPU cores).
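A small sketch of the thread-versus-its-own-signal-handler case that the glibc text is about. The names got_signal, on_signal and demo_signal_flag are invented for this example; it relies only on signal() and raise() from standard C.

```c
#include <assert.h>
#include <signal.h>

/* Written by the handler, read by the interrupted code in the same thread. */
static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;   /* the handler cannot be seen "half way" through this store */
}

/* Installs the handler, raises SIGINT in the calling thread, and reports
 * whether the flag was observed. Returns 1 on success, 0 otherwise. */
int demo_signal_flag(void)
{
    got_signal = 0;
    if (signal(SIGINT, on_signal) == SIG_ERR)
        return 0;
    raise(SIGINT);            /* the handler runs before raise() returns */
    return got_signal == 1;
}
```

Nothing here involves a second thread, which is the point: the guarantee covers interruption, not concurrent access from another core.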
For threading, atomicity isn't sufficient; you also need a guarantee of visibility between threads (which you can get from _Atomic int with memory_order_relaxed), and sometimes ordering, which you can get with stronger memory orders.
See the early parts of Why is integer assignment on a naturally aligned variable atomic on x86? for more discussion of why a C type having a width that is naturally atomic on the target you're compiling for is not sufficient for much of anything.
You sometimes also need RMW atomicity, like the ability to do an atomic_fetch_add such that if 1000 such +=1 operations happen across multiple threads, the total result will be like +=1000. For that you absolutely need compiler support (or inline asm), like C11 _Atomic int. Can num++ be atomic for 'int num'?
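As a rough illustration of RMW atomicity, the sketch below has four threads each perform 250 atomic increments on a C11 atomic_int. The thread count, the loop count and the names worker and run_counter_demo are choices made for this example only.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

enum { NUM_THREADS = 4, INCREMENTS_PER_THREAD = 250 };

static atomic_int counter;   /* C11 _Atomic int */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS_PER_THREAD; i++)
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    return NULL;
}

/* Runs 4 threads x 250 increments and returns the final count. */
int run_counter_demo(void)
{
    pthread_t threads[NUM_THREADS];

    atomic_store(&counter, 0);
    for (int i = 0; i < NUM_THREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);   /* join orders the workers before the read */
    return atomic_load(&counter);
}
```

With a plain int and counter++ the same program would contain a data race, and increments could be lost.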
The guarantee that int is atomic means atomic_int should always be lock-free, and cheap, but it absolutely does not mean plain int is remotely safe for data shared between threads. That's data-race UB, and even if you use memory barriers like GNU C asm("" ::: "memory") to try to get the compiler to not keep the value in a register, your code can break in lots of interesting ways that are more subtle than some of the obvious breakage mechanisms. See Who's afraid of a big bad optimizing compiler? on LWN for some caveats about doing that in the Linux kernel, where they use volatile for atomicity.
(Fun fact: GNU C at least de-facto gives pure-load and pure-store atomicity with volatile int64_t on 64-bit machines, unlike with plain int64_t. ISO C doesn't guarantee even that for volatile, which is why the Linux kernel depends on being compiled with GCC or maybe clang.)

Do I need to write explicit memory barrier for multithreaded C code?

I'm writing some code on Linux using the pthread multithreading library and I'm currently wondering if the following code is safe when compiled with -Ofast -lto -pthread.
// shared global
long shared_event_count = 0;
// ...
pthread_mutex_lock(mutex);
while (shared_event_count <= *last_seen_event_count)
    pthread_cond_wait(cond, mutex);
*last_seen_event_count = shared_event_count;
pthread_mutex_unlock(mutex);
Are the calls to pthread_* functions enough, or should I also include a memory barrier to make sure that the change to the global variable shared_event_count is actually seen during the loop? Without a memory barrier the compiler would be free to keep the variable in a register only, right? Of course, I could declare the shared integer as volatile, which would prevent keeping the contents of that variable in a register during the loop, but if I used the variable multiple times within the loop, it could make sense to fetch a fresh value for the loop condition only, because that could allow for more compiler optimizations.
From testing the above code as-is, it appears that the generated code actually sees the changes made by another thread. However, is there any spec or documentation that actually guarantees this?
The common solution seems to be "don't optimize multithreaded code too aggressively" but that seems like a poor man's workaround instead of really fixing the issue. I'd rather write correct code and let the compiler optimize as much as possible within the specs (any code that gets broken by optimizations is in reality using e.g. undefined behavior of C standard as assumed stable behavior, except for some rare cases where compiler actually outputs invalid code but that seems to be very very rare these days).
I'd much prefer writing the code that works with any optimizing compiler – as such, it should only use features specified in the C standard and the pthread library documentation.
I found an interesting article at https://www.alibabacloud.com/blog/597460 which contains a trick like this:
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
This was actually used first in Linux kernel and it triggered a compiler bug in old GCC versions: https://lwn.net/Articles/624126/
As such, let's assume that the compiler is actually following the spec and doesn't contain a bug but implements every possible optimization known to man allowed by the specs. Is the above code safe with that assumption?
Also, does pthread_mutex_lock() include memory barrier by the spec or could compiler re-order the statements around it?
The compiler will not reorder memory accesses across pthread_mutex_lock() (this is an oversimplification and not strictly true, see below).
First I’ll justify this by talking about how compilers work, then I’ll justify this by looking at the spec, and then I’ll justify this by talking about convention.
I don’t think I can give you a perfect justification from the spec. In general, I would not expect a spec to give you a perfect justification—it’s turtles all the way down (do you have a spec for how to interpret the spec?), and the spec is designed to be read and understood by actual humans who understand the relevant background concepts.
How This Works
The compiler, by default, assumes that a function it doesn't know can access any global variable. So it must emit the store to shared_event_count before the call to pthread_mutex_lock(): as far as the compiler knows, pthread_mutex_lock() reads the value of shared_event_count.
Inside pthread_mutex_lock is a memory fence for the CPU, if necessary.
Justification
From n1548:
In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).
Yes, there’s LTO. LTO can do some very surprising things. However, the fact is that writing to shared_event_count does have side effects and those side effects do affect the behavior of pthread_mutex_lock() and pthread_mutex_unlock().
The POSIX spec states that pthread_mutex_lock() provides synchronization. I could not find an explanation in the POSIX spec of what synchronization is, so this may have to suffice.
POSIX 4.12
Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads. The following functions synchronize memory with respect to other threads:
Yes, in theory the store to shared_event_count could be moved or eliminated—but the compiler would have to somehow prove that this transformation is legal. There are various ways you could imagine this happening. For example, the compiler might be configured to do “whole program optimization”, and it may observe that shared_event_count is never read by your program—at which point, it’s a dead store and can be eliminated by the compiler.
Convention
This is how pthread_mutex_lock() has been used since the dawn of time. If compilers did this optimization, pretty much everyone’s code would break.
Volatile
I would generally not use this macro in ordinary code:
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
This is a weird thing to do, useful in weird situations. Ordinary multithreaded code is not a sufficiently weird situation to use tricks like this. Generally, you want to either use locks or atomics to read or write shared values in a multithreaded program. ACCESS_ONCE does not use a lock and it does not use atomics—so, what purpose would you use it for? Why wouldn’t you use atomic_store() or atomic_load()?
In other words, it is unclear what you would be trying to do with volatile. The volatile keyword is easily the most abused keyword in C. It is rarely useful except when writing to memory-mapped IO registers.
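As a rough sketch of the alternative suggested above, a shared flag can be declared with a C11 atomic type and accessed through atomic_store and atomic_load. The names stop_requested, request_stop and should_stop are made up for this example, and the release/acquire orders are one reasonable choice rather than a requirement.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool stop_requested;   /* static storage: starts out false */

/* Called by the thread that wants the worker to stop. */
void request_stop(void)
{
    atomic_store_explicit(&stop_requested, true, memory_order_release);
}

/* Polled by the worker. Unlike a plain bool, the compiler may not assume
 * the value is unchanged between calls just because this thread did not
 * write it. */
bool should_stop(void)
{
    return atomic_load_explicit(&stop_requested, memory_order_acquire);
}
```

A worker loop would then read `while (!should_stop()) { ... }`, with no volatile cast involved.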
Conclusion
The code is fine. Don’t use volatile.
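For reference, here is a self-contained sketch of the pattern from the question, with a producer added so it can actually run. The functions post_event, wait_for_event and run_event_demo are hypothetical, and the mutex and condition variable are static objects here rather than the pointers used in the question.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static long shared_event_count = 0;   /* plain long: only touched with mutex held */

/* Producer: publish one event and wake any waiter. */
static void *post_event(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mutex);
    shared_event_count++;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mutex);
    return NULL;
}

/* Consumer: block until the count exceeds *last_seen_event_count. */
static void wait_for_event(long *last_seen_event_count)
{
    pthread_mutex_lock(&mutex);
    while (shared_event_count <= *last_seen_event_count)
        pthread_cond_wait(&cond, &mutex);   /* releases and re-acquires mutex */
    *last_seen_event_count = shared_event_count;
    pthread_mutex_unlock(&mutex);
}

/* Starts one producer thread, waits for its event, returns the count seen. */
long run_event_demo(void)
{
    pthread_t producer;
    long last_seen = 0;

    int rc = pthread_create(&producer, NULL, post_event, NULL);
    assert(rc == 0);
    (void)rc;
    wait_for_event(&last_seen);
    pthread_join(producer, NULL);
    return last_seen;
}
```

The shared counter needs neither volatile nor an explicit barrier because every access happens between a lock and the matching unlock.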

Does C standard mandate that platforms must not define behaviors beyond those given in standard

The C standard makes clear that a compiler/library combination is allowed to do whatever it likes with the following code:
#include <stdlib.h>

/* Deliberately frees p twice: undefined behavior per the C standard. */
int doubleFree(char *p)
{
    int temp = *p;
    free(p);
    free(p);
    return temp;
}
In the event that a compiler does not require use of a particular bundled library, however, is there anything in the C standard which would forbid a library from defining a meaningful behavior? As a simple example, suppose code were written for a platform which had reference-counted pointers, such that following p = malloc(1234); __addref(p); __addref(p); the first two calls to free(p) would decrement the counter but not free the memory. Any code written for use with such a library would naturally work only with such a library (and the __addref() calls would likely fail on most others), but such a feature could be helpful in many cases when e.g. it is necessary to pass a string repeatedly to a method which expects to be given a string produced with strdup and consequently calls free on it.
In the event that a library would define a useful behavior for some action like double-freeing a pointer, is there anything in the C standard which would authorize a compiler to unilaterally break it?
There are really two questions here: your formally stated one, and your broader one outlined in your comments on questions raised by others.
Your formal question is answered by the definition of undefined behavior and by section 4 on conformance. The definition says (emphasis mine):
behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements
With emphasis on nonportable and imposes no requirements. This really says it all: the compiler is free to optimize in unpleasant ways, or it can choose to document the behavior and make it well defined. That, of course, means the program is no longer strictly conforming, which brings us to section 4:
A strictly conforming program shall use only those features of the language and library specified in this International Standard. It shall not produce output dependent on any unspecified, undefined, or implementation-defined behavior, and shall not exceed any minimum implementation limit.
but a conforming implementation is allowed extensions as long as they don't break a conforming program:
A conforming implementation may have extensions (including additional library functions), provided they do not alter the behavior of any strictly conforming program.
As the C FAQ says:
There are very few realistic, useful, strictly conforming programs. On the other hand, a merely conforming program can make use of any compiler-specific extension it wants to.
Your informal question deals with compilers taking more aggressive optimization opportunities with undefined behavior, and the fear that in the long run this will make real-world systems programming impossible. While I do understand how this relatively new aggressive stance seems very programmer-unfriendly to many, in the end a compiler won't last very long if people cannot build useful programs with it. A related blog post by John Regehr: Proposal for a Friendly Dialect of C.
One could argue the opposite, that compilers have made a lot of effort to build extensions to support varying needs not supported by the standard. I think the article GCC hacks in the Linux kernel demonstrates this well. It goes into the many gcc extensions that the Linux kernel relies on and clang has in general attempted to support as many gcc extensions as possible.
Whether compilers have removed useful handling of undefined behavior in ways that hamper effective systems programming is not clear to me. I think specific questions on alternatives for individual cases of undefined behavior that have been exploited in systems programming and no longer work would be useful and interesting to the community.
Does C standard mandate that platforms must not define behaviors beyond those given in standard
Quite simply, no, it does not. The standard says:
An implementation shall be accompanied by a document that defines all implementation-defined and locale-specific characteristics and all extensions.
There is no restriction anywhere in the standard that prohibits implementations from providing any other documentation they like. If you like, you can read N1570, the latest freely available draft of the ISO C standard, and confirm the lack of any such prohibition.
In the event that a library would define a useful behavior for some action like double-freeing a pointer, is there anything in the C standard which would authorize a compiler to unilaterally break it?
A C implementation includes both the compiler and the standard library. free() is part of the standard library. The standard does not define the behavior of passing the same pointer value to free() twice, but an implementation is free to define the behavior. Any such documentation is not required, and is outside the scope of the C standard.
If a C implementation documented, for example, that calling free() a second time on the same pointer value has no effect, but then doing so actually causes the program to crash, that would violate the implementation's own documentation, but it would not violate the C standard. There is no specific requirement in the C standard that says an implementation must conform to its own documentation, beyond the documentation that's required by the standard. An implementation's conformance to its own documentation is enforced by the market and by common sense, not by the C standard.
In the event that a library would define a useful behavior for some action like double-freeing a pointer, is there anything in the C standard which would authorize a compiler to unilaterally break it?
The compiler and the standard library (i.e. the one in which free is defined) are both part of the implementation - it isn't really coherent to talk about one of them doing something "unilaterally".
If a compiler "does not require use of a particular bundled library", then (other than perhaps as a freestanding implementation) it alone is not an implementation, so the standard doesn't apply to it at all. The behavior of a combination of a library and a compiler is the responsibility of whoever chooses to combine them (which may be the author of either component, or someone else entirely) and label this combination as an implementation. It would, of course, be wise not to document extensions implemented by the library as features of this implementation without confirming that the compiler does not break them. For that matter, you would also need to make sure that the compiler doesn't break anything used internally by the library.
In answer to your main question: no, it does not. If the end result of combining a library and a compiler (and kernel, dynamic loader, etc.) is a conforming hosted environment, it is a conforming implementation even if some extensions that the library's author would like to have provided are not supported by the final result; the standard does not require them to work. Conversely, if the result does not conform - for example if the compiler breaks the internals of the library and thereby causes some library function not to conform - then it is not a conforming implementation. Any program which calls free twice on the same pointer, or uses any reserved identifier starting with two underscores, causes undefined behavior and therefore is not a strictly conforming program.
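The reference-counting idea from the question can also be had without leaning on undefined behavior, by wrapping malloc and free instead of redefining them. The sketch below is one possible shape for that; rc_alloc, rc_addref and rc_release are hypothetical names, the code assumes C11 for max_align_t, and it is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header stored in front of each payload. The union pads the header so
 * that the payload following it is suitably aligned for any object. */
typedef union {
    size_t refs;
    max_align_t align;
} rc_header;

/* Allocate `size` payload bytes with a reference count of 1. */
void *rc_alloc(size_t size)
{
    rc_header *h = malloc(sizeof *h + size);
    if (h == NULL)
        return NULL;
    h->refs = 1;
    return h + 1;
}

/* Take an extra reference (the role of the question's __addref). */
void rc_addref(void *p)
{
    rc_header *h = (rc_header *)p - 1;
    h->refs++;
}

/* Drop one reference; returns 1 if the block was actually freed. */
int rc_release(void *p)
{
    rc_header *h = (rc_header *)p - 1;
    if (--h->refs > 0)
        return 0;
    free(h);
    return 1;
}
```

Because the standard free() is only ever called once per block, a program using these wrappers never performs a double free, at the cost of callers having to use rc_release instead of free.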

Restrictions of the OpenACC aware CAPS compiler

I'm currently writing a report on the state of automatic parallelisation techniques at the compiler level. Concerning the OpenACC standard, several compilers are available, such as the PGI compiler, CAPS, or the CRAY compiler. However, I was wondering if there are specific restrictions in the CAPS compiler which are not documented within the OpenACC standard? I'm aware that there are probably restrictions for 2.0a, as this standard is not yet completely implemented, but are there any pitfalls I should take care of?
The most common problem with OpenACC 2.0 when people rely on automatic parallelization is that scalars are implicitly copy (in kernels) or firstprivate (in parallel regions).
This means that unless the compiler is able to privatize these scalars, automatic parallelization of loops that contain such scalars, if they are written to, will likely fail (that is, not "promote" a loop to parallel execution).
At the present time, CAPS Compilers does not aggressively privatize scalars, so automatic parallelization may not work as well as you'd expect. Does that answer your question?
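A minimal sketch of the situation described above, with the scalar privatized by hand instead of relying on the compiler. The function square_doubled and its data clauses are assumptions made for illustration and have not been tried with CAPS; a compiler without OpenACC support simply ignores the pragma and runs the loop serially.

```c
#include <assert.h>
#include <stddef.h>

/* b[i] = (2*a[i])^2. `tmp` is declared outside the loop and written on
 * every iteration, so each iteration needs its own copy for the loop to
 * run in parallel. The private(tmp) clause states that explicitly. */
void square_doubled(const float *a, float *b, int n)
{
    float tmp;

    #pragma acc parallel loop private(tmp) copyin(a[0:n]) copyout(b[0:n])
    for (int i = 0; i < n; i++) {
        tmp = 2.0f * a[i];
        b[i] = tmp * tmp;
    }
}
```

Declaring tmp inside the loop body would have the same effect, since a variable local to the loop body is private to each iteration.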

Using __thread in c99

I would like to define a few variables as thread-specific using the __thread storage class. But three questions make me hesitate:
Is it really standard in c99? Or more to the point, how good is the compiler support?
Will the variables be initialised in every thread?
Do non-multi threaded programs treat them as plain-old-globals?
To answer your specific questions:
No, it is not part of C99. You will not find it mentioned anywhere in the n1256.pdf (C99+TC1/2/3) or the original C99 standard.
Yes, __thread variables start out with their initialized value in every new thread.
From a standpoint of program behavior, thread-local storage class variables behave pretty much the same as plain globals in non-multi-threaded programs. However, they do incur a bit more runtime cost (memory and startup time), and there can be issues with limits on the size and number of thread-local variables. All this is rather complicated and varies depending on whether your program is static- or dynamic-linked and whether the variables reside in the main program or a shared library...
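The second point can be checked with a short sketch, assuming GCC or clang with pthreads. The names tls_counter, read_initial and tls_initial_value_in_new_thread are invented for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* GNU C extension; C11 spells this _Thread_local. */
static __thread int tls_counter = 5;

/* Runs in a new thread: reports the value it starts with, then modifies
 * its own copy. */
static void *read_initial(void *out)
{
    *(int *)out = tls_counter;
    tls_counter = 99;          /* affects this thread's copy only */
    return NULL;
}

/* Changes the caller's copy, starts a thread, and returns the value the
 * new thread initially observed. */
int tls_initial_value_in_new_thread(void)
{
    pthread_t t;
    int seen = -1;

    tls_counter = 42;          /* caller's copy */
    int rc = pthread_create(&t, NULL, read_initial, &seen);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, NULL);
    assert(tls_counter == 42); /* untouched by the other thread's write */
    return seen;
}
```

The new thread sees the initializer value 5 rather than the caller's 42, and its own write of 99 never reaches the caller.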
Outside of implementing C/POSIX (e.g. errno, etc.), thread-local storage class is actually not very useful, in my opinion. It's pretty much a crutch for avoiding cleanly passing around the necessary state in the form of a context pointer or similar. You might think it could be useful for getting around broken interfaces like qsort that don't take a context pointer, but unfortunately there is no guarantee that qsort will call the comparison function in the same thread that called qsort. It might break the job down and run it in multiple threads. Same goes for most other interfaces where this sort of workaround would be possible.
You probably want to read this:
http://www.akkadia.org/drepper/tls.pdf
1) MSVC doesn't support C99. GCC does and other compilers attempt GCC compatibility.
edit A breakdown of compiler support for __thread is available here:
http://chtekk.longitekk.com/index.php?/archives/2011/02/C8.html
2) Only C++ supports an initializer and it must be constant.
3) Non-multi-threaded applications are single-threaded applications.

Resources