The question is simple. Does/Should a variable used with multi-threads be volatile even accessed in critical section(i.e. mutex, semaphore) in C? Why / Why not?
#include <pthread.h>
volatile int account_balance;
pthread_mutex_t flag = PTHREAD_MUTEX_INITIALIZER;
void debit(int amount) {
pthread_mutex_lock(&flag);
account_balance -= amount;//Inside critical section
pthread_mutex_unlock(&flag);
}
What about the example or equivalently thinking for semaphore?
Does/Should a variable used with multi-threads be volatile even accessed in critical section(i.e. mutex, semaphore) in C? Why / Why not?
No.
volatile is logically irrelevant for concurency, because it's not sufficient.
Actually, that's not really true - volatile is not irrelevant because it can hide concurrency problems in your code, so it works "most of the time".
All volatile does is tell the compiler "this variable can change outside the current thread of execution". Volatile in no way enforces any ordering, atomicity, or - critically - visibility. Just because thread 2 on CPU A changes int x, that doesn't mean thread 1 on CPU D can even see the change at any specific time - it has it's own cached value, and volatile means almost nothing with respect to memory coherence because it doesn't guarantee ordering.
The last comment at the bottom of the Intel article Volatile: Almost Useless for Multi-Threaded Programming says it best:
If you are simply adding 'volatile' to variables that are shared
between threads thinking that fixes your shared-data problem without
bothering to understand why it may not, you will eventually reap the
reward you deserve.
Yes, lock-free code can make use of volatile. Such code is written by people who can likely write tutorials on the use of volatile, multithreaded code, and other extremely detailed subjects regarding compilers.
No, volatile should not be used on shared variables which are accessed under the protection of pthreads synchronisation functions like pthread_mutex_lock().
The reason is that the synchronisation functions themselves are guaranteed by POSIX to provide all the necessary compiler barriers and synchronisation to ensure consistency (as long as you follow the POSIX rules on concurrent access - ie. that you have used pthreads synchronisation functions to ensure that no thread can be writing to a shared variable whilst another thread is writing to or reading from it).
I have no idea why there's so much misinformation about volatile everywhere on the internet. The answer to your question is yes, you should make variables you use within a critical section volatile.
I'll give a contrived example. Let's say you want to run this function on multiple threads:
int a;
void inc_a(void) {
for (int i = 0; i < 5; ++i) {
a += 5;
}
}
Everybody, as it would seem, on this site will tell you that it's enough to put a += 5 in a critical section like so:
int a;
void inc_a(void) {
for (int i = 0; i < 5; ++i) {
enter_critical_section();
a += 5;
exit_critical_section();
}
}
As i said, it's contrived, but people will tell you this is correct, and it absolutely is not! If the compiler wasn't given prior knowledge as to what the critical section functions are, and what their semantic meaning is, there's nothing stopping the compiler from outputting this code:
int a;
void inc_a(void) {
register eax = a;
for (int i = 0; i < 5; ++i) {
enter_critical_section();
eax += 5;
exit_critical_section();
}
a = eax;
}
This code produces the same output in a single threaded context, so the compiler is allowed to do that. But in a multithreaded context, this can output anything between 25 and 25 times the thread count. One way to solve this issue is to use an atomic construct, but that has performance implications, instead what you should do is make the variable volatile. That is, unless you want to be like the rest of this community and blindly put your faith in your C compiler.
Related
I have the following C code:
/* the memory entry points to can be changed from another thread but
* it is not declared volatile */
struct myentry *entry;
bool isready(void)
{
return entry->status == 1;
}
bool isready2(int idx)
{
struct myentry *x = entry + idx;
return x->status == 1;
}
int main(void) {
/* busy loop */
while (!isready())
;
while (!isready2(5))
;
}
As I note in the comment, entry is not declared as volatile even though the array it points to can be changed from another thread (or actually even directly from kernel space).
Is the above code incorrect / unsafe? My thinking is that no optimization could be performed in the bodies of isready, isready2 and since I repeatedly perform function calls from within main the appropriate memory location should be read on every call.
On the other hand, the compiler could inline these functions. Is it possible that it does it in a way that results in a single read happening (hence causing an infinite loop) instead of multiple reads (even if these reads come from a load/store buffer)?
And a second question. Is it possible to prevent the compiler from doing optimizations by casting to volatile only in certain places like that?
void func(void)
{
entry->status = 1;
while (((volatile struct myentry *) entry)->status != 2)
;
}
Thanks.
If the memory entry points to can be modified by another thread, then the program has a data race and therefore the behaviour is undefined . This is still true even if volatile is used.
To have a variable accessed concurrently by multiple threads, in ISO C11, it must either be an atomic type, or protected by correct synchronization.
If using an older standard revision then there are no guarantees provided by the Standard about multithreading so you are at the mercy of any idiosyncratic behaviour of your compiler.
If using POSIX threads, there are no portable atomic operations, but it does define synchronization primitives.
See also:
Why is volatile not considered useful in multithreaded C or C++ programming?
The second question is a bugbear, I would suggest not doing it because different compilers may interpret the meaning differently, and the behaviour is still formally undefined either way.
Let's suppose I want to set atomic instructions into a function.
I declared
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
as a global variable.
Instead of:
int main() {
myFoo();
...
}
void myFoo() {
pthread_mutex_lock(&mutex);
myGlobal++;
pthread_mutex_unlock(&mutex);
}
can I do:
int main() {
pthread_mutex_lock(&mutex);
myFoo();
pthread_mutex_unlock(&mutex);
...
}
void myFoo() {
myGlobal++;
}
So that every instructions in myFoo become atomic?
In first example, you are protecting myGlobal and in 2nd you are protecting myFoo. Your code works as you expect (if you call it everywhere between lock/unlock), but you need to use terms correctly or its meaning will be wrong.
No it will not be atomic, but access to myFoo will be synchronized, meaning no other thread can access that part code when a another thread is using it.
Atomic operation term normally is used showing that an instruction is run without any interruption (sometimes considered lock-free). For example, C11's atomic_flag provides such functionality. On the other hand, mutex is for creating mutual exclusion. You can protect a part of your code from simultaneous access from different threads. These 2 terms are not similar.
Side note:
Only atomic_ type that is guaranteed to be really atomic and lock-free is atomic_flag is both C and C++. Other ones such as atomic_int may be implemented using synchronization method and is not lock-free.
Your use of the term atomic is not really correct but I guess the question is more about whether the two code snippets will behave the same.
If myFoo is only called between lock/unlock, the answer is yes, they are the same.
However, in the second case you have lost protection of myFoo. Another thread could call myFoo without calling lock first which would cause problems.
So the second example is bad as it opens up for more mistakes. Stick to the first one, i.e. keep the lock/unlock inside the function.
Also notice:
Since myGlobal is a global variable, you can't make sure that the threads do not access it directly. There are several ways to avoid that. The example below shows a single function with a static variable. The function can be used to receive the static variable and do an increment if desired.
int myFoo(int doIncrement)
{
static int myStatic = 0;
int result;
pthread_mutex_lock(&mutex);
if (doIncrement) myStatic++;
result = myStatic;
pthread_mutex_unlock(&mutex);
return result;
}
Now the variable myStatic is hidden from all the threads and can only be accessed through myFoo.
int x = myFoo(1); // Increment and read
int y = myFoo(0); // Read only
I just want my code as simple as possible and thread safe.
With C11 atomics
Regarding part "7.17.4 Fences" of the ISO/IEC 9899/201X draft
X and Y , both operating on some atomic object M, such that A is
sequenced before X, X modifies M, Y is sequenced before B, and Y reads
the value written by X or a value written by any side effect in the
hypothetical release sequence X would head if it were a release
operation.
Is this code thread safe (with "w_i" as "object M") ?
Are "w_i" and "r_i" need both to be declared as _Atomic ?
If only w_i is _Atomic, can the main thread keep an old value of r_i in cache and consider the queue as not full (while it's full) and write data ?
What's going on if I read an atomic without atomic_load ?
I have made some tests but all of my attempts seems to give the right results.
However, I know that my tests are not really correct regarding multithread : I run my program several times and look at the result.
Even if neither w_i not r_i are declared as _Atomic, my program work, but only fences are not sufficient regarding C11 standard, right ?
typedef int rbuff_data_t;
struct rbuf {
rbuff_data_t * buf;
unsigned int bufmask;
_Atomic unsigned int w_i;
_Atomic unsigned int r_i;
};
typedef struct rbuf rbuf_t;
static inline int
thrd_tryenq(struct rbuf * queue, rbuff_data_t val) {
size_t next_w_i;
next_w_i = (queue->w_i + 1) & queue->bufmask;
/* if ring full */
if (atomic_load(&queue->r_i) == next_w_i) {
return 1;
}
queue->buf[queue->w_i] = val;
atomic_thread_fence(memory_order_release);
atomic_store(&queue->w_i, next_w_i);
return 0;
}
static inline int
thrd_trydeq(struct rbuf * queue, rbuff_data_t * val) {
size_t next_r_i;
/*if ring empty*/
if (queue->r_i == atomic_load(&queue->w_i)) {
return 1;
}
next_r_i = (queue->r_i + 1) & queue->bufmask;
atomic_thread_fence(memory_order_acquire);
*val = queue->buf[queue->r_i];
atomic_store(&queue->r_i, next_r_i);
return 0;
}
I call theses functions as follow :
Main thread enqueue some data :
while (thrd_tryenq(thrd_get_queue(&tinfo[tnum]), i)) {
usleep(10);
continue;
}
Others threads dequeue data :
static void *
thrd_work(void *arg) {
struct thrd_info *tinfo = arg;
int elt;
atomic_init(&tinfo->alive, true);
/* busy waiting when queue empty */
while (atomic_load(&tinfo->alive)) {
if (thrd_trydeq(&tinfo->queue, &elt)) {
sched_yield();
continue;
}
printf("Thread %zu deq %d\n",
tinfo->thrd_num, elt);
}
pthread_exit(NULL);
}
With asm fences
Regarding a specific platform x86 with lfence and sfence,
If I remove all C11 code and just replace fences by
asm volatile ("sfence" ::: "memory");
and
asm volatile ("lfence" ::: "memory");
(My understanding of these macro is : compiler fence to prevent memory access to be reoganized/optimized + hardware fence)
do my variables need to be declared as volatile for instance ?
I have already seen this ring buffer code above with only these asm fences but with no atomic types and I was really surprised, I want to know if this code was correct.
I just reply regarding C11 atomics, platform specifics are too complicated and should be phased out.
Synchronization between threads in C11 is only guaranteed through some system calls (e.g for mtx_t) and atomics. Don't even try to do it without.
That said, sychronization works via atomics, that is visibility of side effects is guaranteed to propagate via the visibility of effects on atomics. E.g for the simplest consistency model, sequential, whenever thread T2 sees a modification thread T1 has effected on an atomic variable A, all side effects before that modication in thread T1 are visible to T2.
So not all your shared variables need to be atomic, you only must ensure that your state is properly propagated via an atomic. In that sense fences buy you nothing when you use sequential or acquire-release consistency, they only complicate the picture.
Some more general remarks:
Since you seem to use the sequential consistency model, which is the
default, the functional writing of atomic operations (e.g
atomic_load) is superfluous. Just evaluating the atomic variable is
exactly the same.
I have the impression that you are attempting optimization much too
early in your development. I think you should do an implementation
for which you can prove correctness, first. Then, if and only if
you notice a performance problem, you should start to think about
optimization. It is very unlikely that such an atomic data structure
is a real bottleneck for your applcation. You'd have to have a very
large number of threads that all simultaneously hammer on your poor
little atomic variable, to see a measurable bottleneck here.
In Glibc's pthread.h the pthread_self function is declared with the const attribute:
extern pthread_t pthread_self (void) __THROW __attribute__ ((__const__));
In GCC that attribute means:
Many functions do not examine any values except their arguments, and have no effects except the return value. Basically this is just slightly more strict class than the pure attribute below, since function is not allowed to read global memory.
I wonder how that's supposed to be? Since it does not take any argument, pthread_self is therefore allowed only to always return the same value, which is obviously not the case. That is, I would have expected pthread_self to read global memory, and therefore eventually be marked as pure instead:
Many functions have no effects except the return value and their return value depends only on the parameters and/or global variables. Such a function can be subject to common subexpression elimination and loop optimization just as an arithmetic operator would be. These functions should be declared with the attribute pure.
The implementation on x86-64 seems to be actually reading global memory:
# define THREAD_SELF \
({ struct pthread *__self; \
asm ("mov %%fs:%c1,%0" : "=r" (__self) \
: "i" (offsetof (struct pthread, header.self))); \
__self;})
pthread_t
__pthread_self (void)
{
return (pthread_t) THREAD_SELF;
}
strong_alias (__pthread_self, pthread_self)
Is this a bug or am I not seeing something?
The attribute was most likely added in the assumption that GCC would only use it locally (within a function), and would never be able to use it for inter-procedural optimizations. Today, some of Glibc developers are questioning the correctness of the attribute exactly because powerful inter-procedural optimization could, potentially, lead to miscompilation; quoting post by Torvald Riegel to Glibc developers' mailing list,
The const attribute is specified as asserting that the function does not
examine any data except the arguments. __errno_location has no
arguments, so it would have to return the same values every time.
This works in a single-threaded program, but not in a multi-threaded
one. Thus, I think that strictly speaking, it should not be const.
We could argue that this magically is meant to always be in the context
of a specific thread. Ignoring that GCC doesn't define threads itself
(especially in something like NPTL which is about creating a notion of
threads), we could still assume that this works because in practice, the
compiler and its passes can't leak knowledge across a function used in
one thread and other one used in another thread.
(__errno_location() and pthread_self() both are marked with __attribute__((const)) and receive no arguments).
Here's a small example that could plausibly be miscompiled with powerful interprocedural analysis:
#include <pthread.h>
#include <errno.h>
#include <stdlib.h>
static void *errno_pointer;
static void *thr(void *unused)
{
if (!errno_pointer || errno_pointer == &errno)
abort();
return 0;
}
int main()
{
errno_pointer = &errno;
pthread_t t;
pthread_create(&t, 0, thr, 0);
pthread_join(t, 0);
}
(the compiler can observe that errno_pointer is static, it does not escape the translation unit, and the only store into it assigns the same "const" value, given by __errno_location(), that is tested in thr()). I've used this example in my email asking to improve documentation of pure/const attributes, but unfortunately it didn't get much traction.
I wonder how that's supposed to be?
This attribute is telling the compiler that in a given context pthread_self will always return the same value. In other words, the two loops below are exactly equivalent, and the compiler is allowed to optimize out the second (and all subsequent) calls to pthread_self:
// loop A
std::map<pthread_t, int> m;
for (int j = 0; j < 1000; ++j)
m[pthread_self()] += 1;
// loop B
std::map<pthread_t, int> m;
const pthread_t self = pthread_self();
for (int j = 0; j < 1000; ++j)
m[self] += 1;
The implementation on x86-64 seems to be actually reading global memory
No, it does not. It reads thread-local memory.
How do you declare a particular member of a struct as volatile?
Exactly the same as non-struct fields:
#include <stdio.h>
int main (int c, char *v[]) {
struct _a {
int a1;
volatile int a2;
int a3;
} a;
a.a1 = 1;
a.a2 = 2;
a.a3 = 3;
return 0;
}
You can mark the entire struct as volatile by using "volatile struct _a {...}" but the method above is for individual fields.
Should be pretty straight forward according to this article:
Finally, if you apply volatile to a
struct or union, the entire contents
of the struct/union are volatile. If
you don't want this behavior, you can
apply the volatile qualifier to the
individual members of the
struct/union.
I need to clarify volatile for C/C++ because there was a wrong answer here. I've been programming microcontroleurs since 1994 where this keyword is very useful and needed often.
volatile will never break your code, it is never risky to use it. The keyword will basically make sure the variable is not optimized by the compiler. The worst that shold happen if you overuse this keyword is that your program will be a bit bigger and slower.
Here is when you NEED this keyword for a variable :
- You have a variable that is written to inside an interrupt function.
AND
- This same variable is read or written to outside interrupt functions.
OR
If you have 2 interrupt functions of different priority that use the variable, then you should also use 'volatile'.
Otherwise, the keyword is not needed.
As for hardware registers, they should be treated as volatile even without the keyword if you don't do weird stuff in your program.
I just finished a data structure in which it was obvious where the volatile qualifier was required, but for a different reason than the ones stated above: It is simply because the struct requires a forceful locking mechanism because of (i) direct access and (ii) equivalent invocation.
Direct access deals with sustained RAM reading and writing.
Equivalent invocation deals with interchangeable method flows.
I haven't had much luck with this keyword unless the compiler knows exactly what to do about it. And that's my own personal experience. But I am interested in studying how it directly impacts a cross-platform compilation such as between a low-level system call and a back-end database.