Questions regarding (non-)volatile and optimizing compilers - c

I have the following C code:
#include <stdbool.h>

/* the memory entry points to can be changed from another thread but
 * it is not declared volatile */
struct myentry {
    int status; /* field assumed from the usage below; the full definition is elided */
};

struct myentry *entry;

bool isready(void)
{
    return entry->status == 1;
}

bool isready2(int idx)
{
    struct myentry *x = entry + idx;
    return x->status == 1;
}

int main(void)
{
    /* busy loop */
    while (!isready())
        ;
    while (!isready2(5))
        ;
}
As I note in the comment, entry is not declared as volatile even though the array it points to can be changed from another thread (or actually even directly from kernel space).
Is the above code incorrect/unsafe? My thinking is that no optimization can be performed in the bodies of isready and isready2, and since I repeatedly perform function calls from within main, the appropriate memory location should be read on every call.
On the other hand, the compiler could inline these functions. Is it possible that it does so in a way that results in a single read happening (hence causing an infinite loop) instead of multiple reads (even if those reads are satisfied from a load/store buffer)?
And a second question: is it possible to prevent the compiler from performing such optimizations by casting to volatile only in certain places, like this?

void func(void)
{
    entry->status = 1;
    while (((volatile struct myentry *) entry)->status != 2)
        ;
}
Thanks.

If the memory entry points to can be modified by another thread, then the program has a data race and therefore the behaviour is undefined. This is still true even if volatile is used.
To have a variable accessed concurrently by multiple threads, in ISO C11, it must either be an atomic type, or protected by correct synchronization.
If you are using an older standard revision, then the Standard provides no guarantees about multithreading at all, so you are at the mercy of any idiosyncratic behaviour of your compiler.
If you are using POSIX threads, there are no portable atomic operations, but POSIX does define synchronization primitives.
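For illustration, here is a minimal sketch of the busy-wait from the question rewritten with C11 atomics (this assumes the status field can be made _Atomic; everything else is as in the question):

#include <stdatomic.h>
#include <stdbool.h>

struct myentry {
    atomic_int status;
};

struct myentry *entry;

bool isready(void)
{
    /* a defined, race-free read; plain atomic_load is a seq_cst load */
    return atomic_load(&entry->status) == 1;
}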
See also:
Why is volatile not considered useful in multithreaded C or C++ programming?
As for the second question: casting to volatile only at the point of access is a grey area. I would suggest not doing it, because different compilers may interpret the meaning of the cast differently, and the behaviour is still formally undefined either way.

Related

Reason for volatile nonstatic local variable in C

Refer to the following code:
#include <stdint.h>

void calledFunction(volatile uint8_t **inPtr);

volatile uint8_t buffer[] = {0, 0, 0, 0, 0, 0};
volatile uint8_t *headPtr = buffer;

void foo(void)
{
    volatile uint8_t *tmpPtr = NULL;
    tmpPtr = headPtr;
    // This function modifies tmpPtr
    calledFunction(&tmpPtr);
    headPtr = tmpPtr;
    return;
}
This is a simplified version of code that I am attempting to make interrupt-safe, and I am not sure why this local is declared volatile. I know that there is no performance reason for it, because this function should run as efficiently as possible.
This function can be called in both main execution and inside interrupts, but since tmpPtr is a nonstatic local variable, it should not be able to be modified by any other instance of foo().
I can't see any access pattern that would require the volatile keyword in this context.
In short, What is the purpose of the volatile keyword for tmpPtr in function foo()?
EDIT: Forgot a & in the function argument
EDIT2: I have inherited this code and need to modify it.
My main question is whether the volatile keyword has any special effective reason for being in this context.
EDIT3: Added the prototype for calledFunction()
EDIT4: Added important clarification in original code that headPtr and buffer both have volatile
tmpPtr is declared with volatile because it needs to point to a volatile uint8_t, not because tmpPtr itself is volatile (it isn't).
As initially pointed out by @Eugene Sh., this question came up due to a misunderstanding of the syntax for declaring volatile pointers versus pointers to volatile data. This question has a great explanation of the syntax for pointers to volatile vs. volatile pointers.
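To spell the distinction out, a purely illustrative sketch:

#include <stdint.h>

volatile uint8_t *p;          /* pointer to volatile data: reads/writes through p are volatile */
uint8_t *volatile q;          /* volatile pointer to plain data: q itself is volatile */
volatile uint8_t *volatile r; /* both the pointer and the pointed-to data are volatile */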
volatile prevents the compiler from optimizing away accesses (reads or writes) to the data the pointer points to.
This comes up often in embedded code and interrupt handlers, because memory-mapped peripherals must not have what would normally be "extraneous" reads or writes optimized out.
E.g.,
int32_t variable = 0;
variable = 1;
variable = 2;
variable = 3;
An optimizing compiler would skip setting the value of variable to 0, 1, and 2 and just set it to 3. That's fine generally, but if instead of writing to a normal variable we are writing to a memory-mapped port, we actually want each of those writes to happen.
This can happen even outside the world of hardware interfaces. If a separate thread is in a loop waiting for variable to become 2, this optimization would prevent that thread from ever observing the value 2.
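As a sketch of the memory-mapped case (the port address is made up for illustration):

#include <stdint.h>

/* hypothetical memory-mapped data port */
#define PORT (*(volatile int32_t *)0x40001000u)

void write_sequence(void)
{
    /* volatile forbids collapsing these four stores into one */
    PORT = 0;
    PORT = 1;
    PORT = 2;
    PORT = 3;
}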
The fact that it's local is not material. It's just that most use cases for volatile happen to be implemented with translation-unit (or, using extern, cross-translation-unit) scope.
Two examples are memory-mapped register definitions (the struct is typically global, often in a header file, and commonly the instance of the pointer to the struct is global though it doesn't have to be), and flags like our thread example.
I don't advocate for either design, but you will come across it frequently in embedded development.
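For concreteness, a register definition of the first kind might look like the following sketch (the names and the base address are made up):

#include <stdint.h>

/* hypothetical UART register block, typically placed in a header */
typedef struct {
    volatile uint32_t CTRL;
    volatile uint32_t STATUS;
    volatile uint32_t DATA;
} uart_regs_t;

#define UART0 ((uart_regs_t *)0x4000C000u) /* made-up base address */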

Thread-safely initializing a pointer just once

I'm writing a library function, say, count_char(const char *str, int len, char ch), that detects the supported SIMD extensions of the CPU it's running on and dispatches the call to, say, an AVX2- or SSE4.2-optimized version. Since I'd like to avoid the penalty of executing a couple of cpuid instructions on each call, I'm trying to do this just once, the first time the function is called (which might happen on several threads simultaneously).
In C++ land I'd just do something like
int count_char(const char *str, int len, char ch) {
    static const auto fun_ptr = select_simd_function();
    return (*fun_ptr)(str, len, ch);
}
and rely on C++ semantics of static to guarantee that it's called exactly once without any race conditions. But what's the best way to do this in pure C?
This is what I've come up with:
1. Using atomic variables (which are also present in C) — rather error-prone and a bit harder to maintain.
2. Using pthread_once — not sure about what overhead it has, plus it might give headaches on Windows.
3. Forcing the library user to call another library function to initialize the pointer — in short, it won't work in my case, since this is actually the C bits of a library for another language.
4. Aligning the pointer by 8 bytes and relying on x86 word-sized accesses being atomic — unportable to other architectures (should I later implement some PowerPC- or ARM-specific SIMD versions, say) and technically UB (at least in C++).
5. Using thread-local storage, marking fun_ptr as thread_local and then doing something like

   static thread_local fun_ptr_t fun_ptr = NULL;
   if (!fun_ptr) {
       fun_ptr = select_simd_function();
   }
   return (*fun_ptr)(str, len, ch);

   The upside is that the code is very clear and apparently correct, but I'm not sure about the performance implications of TLS, plus every thread will have to call select_simd_function() once (but that's probably not a big deal).
For me personally, (5) is the winner so far, followed closely by (1) (I'd probably even go with (1) if it weren't somebody else's very foundational library and I didn't want to embarrass myself with a likely faulty implementation).
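For reference, a minimal sketch of what option (1) could look like with C11 atomics (fun_ptr_t and select_simd_function() as above; note the benign race, where threads reaching the detection simultaneously each run it once):

#include <stdatomic.h>

static _Atomic(fun_ptr_t) fun_ptr = NULL;

int count_char(const char *str, int len, char ch)
{
    fun_ptr_t f = atomic_load_explicit(&fun_ptr, memory_order_acquire);
    if (!f) {
        /* harmless if this runs twice: detection is idempotent */
        f = select_simd_function();
        atomic_store_explicit(&fun_ptr, f, memory_order_release);
    }
    return (*f)(str, len, ch);
}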
So, what'd be the best option? Did I miss anything else?
If you can use C11, this would work (assuming your implementation supports threads - it's an optional feature):
#include <threads.h>

static fun_ptr_t fun_ptr = NULL;

static void init_fun_ptr(void)
{
    fun_ptr = select_simd_function();
}

fun_ptr_t get_simd_function(void)
{
    static once_flag flag = ONCE_FLAG_INIT;
    call_once(&flag, init_fun_ptr);
    return fun_ptr;
}
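The dispatching function would then reduce to a one-liner, as a sketch:

int count_char(const char *str, int len, char ch)
{
    return (*get_simd_function())(str, len, ch);
}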
Of course, you mentioned Windows. I doubt MSVC supports this.

volatile keyword with mutex and semaphores

The question is simple: does a variable shared between multiple threads need to be volatile even when it is only accessed inside a critical section (i.e. under a mutex or semaphore) in C? Why / why not?
#include <pthread.h>

volatile int account_balance;
pthread_mutex_t flag = PTHREAD_MUTEX_INITIALIZER;

void debit(int amount) {
    pthread_mutex_lock(&flag);
    account_balance -= amount; // inside critical section
    pthread_mutex_unlock(&flag);
}
What about the example above, or, thinking along the same lines, the equivalent with a semaphore?
Does a variable shared between multiple threads need to be volatile even when it is only accessed inside a critical section (i.e. under a mutex or semaphore) in C? Why / why not?
No.
volatile is logically irrelevant for concurrency, because it is not sufficient.
Actually, that's not quite true: volatile is not irrelevant, because it can hide concurrency problems in your code, making it work "most of the time".
All volatile does is tell the compiler "this variable can change outside the current thread of execution". volatile in no way enforces ordering, atomicity, or - critically - visibility. Just because thread 2 on CPU A changes int x, that doesn't mean thread 1 on CPU D can even see the change at any specific time - it has its own cached value, and volatile means almost nothing with respect to memory coherence because it doesn't guarantee ordering.
The last comment at the bottom of the Intel article Volatile: Almost Useless for Multi-Threaded Programming says it best:
If you are simply adding 'volatile' to variables that are shared between threads thinking that fixes your shared-data problem without bothering to understand why it may not, you will eventually reap the reward you deserve.
Yes, lock-free code can make use of volatile. Such code is written by people who can likely write tutorials on the use of volatile, multithreaded code, and other extremely detailed subjects regarding compilers.
No, volatile should not be used on shared variables which are accessed under the protection of pthreads synchronisation functions like pthread_mutex_lock().
The reason is that the synchronisation functions themselves are guaranteed by POSIX to provide all the necessary compiler barriers and synchronisation to ensure consistency (as long as you follow the POSIX rules on concurrent access - ie. that you have used pthreads synchronisation functions to ensure that no thread can be writing to a shared variable whilst another thread is writing to or reading from it).
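For example, with correct locking the shared variable can be a plain int. A minimal sketch reusing the question's names:

#include <pthread.h>

int account_balance; /* plain int: no volatile needed */
pthread_mutex_t flag = PTHREAD_MUTEX_INITIALIZER;

int get_balance(void) {
    /* the lock/unlock pair provides all required compiler
     * barriers and visibility guarantees */
    pthread_mutex_lock(&flag);
    int balance = account_balance;
    pthread_mutex_unlock(&flag);
    return balance;
}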
I have no idea why there's so much misinformation about volatile everywhere on the internet. The answer to your question is yes, you should make variables you use within a critical section volatile.
I'll give a contrived example. Let's say you want to run this function on multiple threads:
int a;
void inc_a(void) {
    for (int i = 0; i < 5; ++i) {
        a += 5;
    }
}
Everybody, as it would seem, on this site will tell you that it's enough to put a += 5 in a critical section like so:
int a;
void inc_a(void) {
    for (int i = 0; i < 5; ++i) {
        enter_critical_section();
        a += 5;
        exit_critical_section();
    }
}
As I said, it's contrived, but people will tell you this is correct, and it absolutely is not! If the compiler isn't given prior knowledge of what the critical-section functions are and what their semantics are, there's nothing stopping it from emitting this code:
/* pseudocode: the compiler has cached a in a register across the loop */
int a;
void inc_a(void) {
    register int eax = a;
    for (int i = 0; i < 5; ++i) {
        enter_critical_section();
        eax += 5;
        exit_critical_section();
    }
    a = eax;
}
This code produces the same result in a single-threaded context, so the compiler is allowed to do that. But in a multithreaded context, the final value of a can be anything between 25 and 25 times the thread count. One way to solve this issue is to use an atomic construct, but that has performance implications; instead, what you should do is make the variable volatile. That is, unless you want to be like the rest of this community and blindly put your faith in your C compiler.
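What this answer proposes amounts to the following sketch (enter_critical_section/exit_critical_section are the question's hypothetical primitives; note that the other answers here argue volatile is unnecessary when real pthreads primitives are used):

volatile int a;
void inc_a(void) {
    for (int i = 0; i < 5; ++i) {
        enter_critical_section();
        a += 5; /* volatile forces a fresh load and store of a on every iteration */
        exit_critical_section();
    }
}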

SPSC thread safe with fences

I just want my code as simple as possible and thread safe.
With C11 atomics
Regarding part "7.17.4 Fences" of the ISO/IEC 9899:201x draft, which defines when a release fence A synchronizes with an acquire fence B:

A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.
Is this code thread-safe (with "w_i" as "object M")?
Do "w_i" and "r_i" both need to be declared _Atomic?
If only w_i is _Atomic, can the main thread keep an old value of r_i in cache, consider the queue not full (while it actually is full), and write data?
What happens if I read an atomic without atomic_load?
I have made some tests, and all of my attempts seem to give the right results.
However, I know that my tests are not really conclusive where multithreading is concerned: I just run my program several times and look at the result.
Even if neither w_i nor r_i is declared _Atomic, my program works, but fences alone are not sufficient as far as the C11 standard is concerned, right?
typedef int rbuff_data_t;

struct rbuf {
    rbuff_data_t *buf;
    unsigned int bufmask;
    _Atomic unsigned int w_i;
    _Atomic unsigned int r_i;
};
typedef struct rbuf rbuf_t;

static inline int
thrd_tryenq(struct rbuf *queue, rbuff_data_t val) {
    size_t next_w_i;

    next_w_i = (queue->w_i + 1) & queue->bufmask;

    /* if ring full */
    if (atomic_load(&queue->r_i) == next_w_i) {
        return 1;
    }

    queue->buf[queue->w_i] = val;
    atomic_thread_fence(memory_order_release);
    atomic_store(&queue->w_i, next_w_i);
    return 0;
}

static inline int
thrd_trydeq(struct rbuf *queue, rbuff_data_t *val) {
    size_t next_r_i;

    /* if ring empty */
    if (queue->r_i == atomic_load(&queue->w_i)) {
        return 1;
    }

    next_r_i = (queue->r_i + 1) & queue->bufmask;
    atomic_thread_fence(memory_order_acquire);
    *val = queue->buf[queue->r_i];
    atomic_store(&queue->r_i, next_r_i);
    return 0;
}
I call these functions as follows.
The main thread enqueues some data:
while (thrd_tryenq(thrd_get_queue(&tinfo[tnum]), i)) {
    usleep(10);
    continue;
}
Other threads dequeue the data:
static void *
thrd_work(void *arg) {
    struct thrd_info *tinfo = arg;
    int elt;

    atomic_init(&tinfo->alive, true);

    /* busy waiting when queue empty */
    while (atomic_load(&tinfo->alive)) {
        if (thrd_trydeq(&tinfo->queue, &elt)) {
            sched_yield();
            continue;
        }
        printf("Thread %zu deq %d\n",
               tinfo->thrd_num, elt);
    }

    pthread_exit(NULL);
}
With asm fences
Regarding a specific platform, x86, with lfence and sfence:
If I remove all C11 code and just replace fences by
asm volatile ("sfence" ::: "memory");
and
asm volatile ("lfence" ::: "memory");
(My understanding of these statements is: a compiler fence preventing memory accesses from being reordered or optimized away, plus a hardware fence.)
Do my variables then need to be declared volatile, for instance?
I have already seen the ring buffer code above written with only these asm fences and no atomic types, and I was really surprised; I want to know whether that code is correct.
I'll reply only regarding C11 atomics; platform specifics are too complicated and should be phased out.
Synchronization between threads in C11 is only guaranteed through some system calls (e.g. for mtx_t) and through atomics. Don't even try to do it without them.
That said, synchronization works via atomics; that is, visibility of side effects is guaranteed to propagate via the visibility of effects on atomics. E.g., for the simplest consistency model, sequential consistency, whenever thread T2 sees a modification that thread T1 has effected on an atomic variable A, all side effects before that modification in thread T1 are visible to T2.
So not all your shared variables need to be atomic; you only have to ensure that your state is properly propagated via an atomic. In that sense, fences buy you nothing when you use sequential or acquire-release consistency; they only complicate the picture.
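For instance, the enqueue path could drop its explicit fence entirely and rely on the (default) sequentially consistent atomic accesses, as in this sketch:

static inline int
thrd_tryenq(struct rbuf *queue, rbuff_data_t val) {
    /* plain evaluation of an _Atomic variable is a seq_cst load */
    unsigned int w = queue->w_i;
    unsigned int next_w_i = (w + 1) & queue->bufmask;

    /* if ring full */
    if (queue->r_i == next_w_i)
        return 1;

    queue->buf[w] = val;
    queue->w_i = next_w_i; /* seq_cst store publishes the buffer write */
    return 0;
}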
Some more general remarks:

- Since you seem to use the sequential consistency model, which is the default, the functional writing of atomic operations (e.g. atomic_load) is superfluous. Just evaluating the atomic variable is exactly the same (see the sketch below these remarks).
- I have the impression that you are attempting optimization much too early in your development. I think you should first produce an implementation whose correctness you can prove. Then, if and only if you notice a performance problem, you should start to think about optimization. It is very unlikely that such an atomic data structure is a real bottleneck for your application. You'd have to have a very large number of threads all simultaneously hammering on your poor little atomic variable to see a measurable bottleneck here.
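A tiny illustration of the first remark, as a sketch:

#include <stdatomic.h>

_Atomic unsigned int w_i;

unsigned int read_explicit(void) { return atomic_load(&w_i); } /* seq_cst load */
unsigned int read_plain(void)    { return w_i; }               /* identical: also a seq_cst load */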

Why pthread_self is marked with attribute(const)?

In Glibc's pthread.h the pthread_self function is declared with the const attribute:
extern pthread_t pthread_self (void) __THROW __attribute__ ((__const__));
In GCC that attribute means:
Many functions do not examine any values except their arguments, and have no effects except the return value. Basically this is just slightly more strict class than the pure attribute below, since function is not allowed to read global memory.
I wonder how that is supposed to work. Since it takes no arguments, pthread_self would then be allowed only to always return the same value, which is obviously not the case. That is, I would have expected pthread_self to read global memory, and therefore to be marked pure instead:
Many functions have no effects except the return value and their return value depends only on the parameters and/or global variables. Such a function can be subject to common subexpression elimination and loop optimization just as an arithmetic operator would be. These functions should be declared with the attribute pure.
The implementation on x86-64 seems to be actually reading global memory:
# define THREAD_SELF \
  ({ struct pthread *__self;                                   \
     asm ("mov %%fs:%c1,%0" : "=r" (__self)                    \
          : "i" (offsetof (struct pthread, header.self)));     \
     __self; })

pthread_t
__pthread_self (void)
{
  return (pthread_t) THREAD_SELF;
}
strong_alias (__pthread_self, pthread_self)
Is this a bug or am I not seeing something?
The attribute was most likely added on the assumption that GCC would only use it locally (within a function) and would never be able to use it for inter-procedural optimizations. Today, some Glibc developers are questioning the correctness of the attribute, exactly because powerful inter-procedural optimization could potentially lead to miscompilation; quoting a post by Torvald Riegel to the Glibc developers' mailing list:
The const attribute is specified as asserting that the function does not
examine any data except the arguments. __errno_location has no
arguments, so it would have to return the same values every time.
This works in a single-threaded program, but not in a multi-threaded
one. Thus, I think that strictly speaking, it should not be const.
We could argue that this magically is meant to always be in the context
of a specific thread. Ignoring that GCC doesn't define threads itself
(especially in something like NPTL which is about creating a notion of
threads), we could still assume that this works because in practice, the
compiler and its passes can't leak knowledge across a function used in
one thread and other one used in another thread.
(__errno_location() and pthread_self() both are marked with __attribute__((const)) and receive no arguments).
Here's a small example that could plausibly be miscompiled with powerful interprocedural analysis:
#include <pthread.h>
#include <errno.h>
#include <stdlib.h>

static void *errno_pointer;

static void *thr(void *unused)
{
    if (!errno_pointer || errno_pointer == &errno)
        abort();
    return 0;
}

int main()
{
    errno_pointer = &errno;
    pthread_t t;
    pthread_create(&t, 0, thr, 0);
    pthread_join(t, 0);
}
(the compiler can observe that errno_pointer is static, it does not escape the translation unit, and the only store into it assigns the same "const" value, given by __errno_location(), that is tested in thr()). I've used this example in my email asking to improve documentation of pure/const attributes, but unfortunately it didn't get much traction.
I wonder how that's supposed to be?
This attribute is telling the compiler that in a given context pthread_self will always return the same value. In other words, the two loops below are exactly equivalent, and the compiler is allowed to optimize out the second (and all subsequent) calls to pthread_self:
// loop A
std::map<pthread_t, int> m;
for (int j = 0; j < 1000; ++j)
    m[pthread_self()] += 1;

// loop B
std::map<pthread_t, int> m;
const pthread_t self = pthread_self();
for (int j = 0; j < 1000; ++j)
    m[self] += 1;
The implementation on x86-64 seems to be actually reading global memory
No, it does not. It reads thread-local memory.
