SPSC thread-safe with fences - C

I just want my code to be as simple as possible and thread safe.
With C11 atomics
Regarding section "7.17.4 Fences" of the ISO/IEC 9899:201x draft:
A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.
Is this code thread safe (with w_i as "object M")?
Do w_i and r_i both need to be declared as _Atomic?
If only w_i is _Atomic, can the main thread keep an old value of r_i in cache, consider the queue not full (while it is in fact full), and write data?
What happens if I read an atomic without atomic_load?
I have made some tests, but all of my attempts seem to give the right results.
However, I know that my tests are not really valid for multithreaded code: I just run my program several times and look at the results.
Even if neither w_i nor r_i is declared as _Atomic, my program works, but fences alone are not sufficient under the C11 standard, right?
typedef int rbuff_data_t;

struct rbuf {
    rbuff_data_t *buf;
    unsigned int bufmask;
    _Atomic unsigned int w_i;
    _Atomic unsigned int r_i;
};
typedef struct rbuf rbuf_t;

static inline int
thrd_tryenq(struct rbuf *queue, rbuff_data_t val) {
    size_t next_w_i;
    next_w_i = (queue->w_i + 1) & queue->bufmask;
    /* if ring full */
    if (atomic_load(&queue->r_i) == next_w_i) {
        return 1;
    }
    queue->buf[queue->w_i] = val;
    atomic_thread_fence(memory_order_release);
    atomic_store(&queue->w_i, next_w_i);
    return 0;
}

static inline int
thrd_trydeq(struct rbuf *queue, rbuff_data_t *val) {
    size_t next_r_i;
    /* if ring empty */
    if (queue->r_i == atomic_load(&queue->w_i)) {
        return 1;
    }
    next_r_i = (queue->r_i + 1) & queue->bufmask;
    atomic_thread_fence(memory_order_acquire);
    *val = queue->buf[queue->r_i];
    atomic_store(&queue->r_i, next_r_i);
    return 0;
}
I call these functions as follows:
Main thread enqueue some data :
while (thrd_tryenq(thrd_get_queue(&tinfo[tnum]), i)) {
    usleep(10);
    continue;
}
Others threads dequeue data :
static void *
thrd_work(void *arg) {
    struct thrd_info *tinfo = arg;
    int elt;
    atomic_init(&tinfo->alive, true);
    /* busy waiting when queue empty */
    while (atomic_load(&tinfo->alive)) {
        if (thrd_trydeq(&tinfo->queue, &elt)) {
            sched_yield();
            continue;
        }
        printf("Thread %zu deq %d\n",
               tinfo->thrd_num, elt);
    }
    pthread_exit(NULL);
}
With asm fences
Regarding a specific x86 platform with lfence and sfence:
If I remove all the C11 code and just replace the fences with
asm volatile ("sfence" ::: "memory");
and
asm volatile ("lfence" ::: "memory");
(my understanding of these macros is: a compiler fence to prevent memory accesses from being reorganized/optimized away, plus a hardware fence)
do my variables need to be declared as volatile, for instance?
I have already seen the ring buffer code above with only these asm fences and no atomic types, and I was really surprised; I want to know whether that code is correct.

I will reply only regarding C11 atomics; platform specifics are too complicated and should be phased out.
Synchronization between threads in C11 is only guaranteed through certain system calls (e.g. for mtx_t) and atomics. Don't even try to do it without them.
That said, synchronization works via atomics; that is, visibility of side effects is guaranteed to propagate via the visibility of effects on atomics. E.g. for the simplest consistency model, sequential consistency: whenever thread T2 sees a modification that thread T1 has effected on an atomic variable A, all side effects before that modification in T1 become visible to T2.
So not all your shared variables need to be atomic; you only must ensure that your state is properly propagated via an atomic. In that sense, fences buy you nothing when you use sequential or acquire-release consistency; they only complicate the picture.
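For illustration, here is a sketch of the question's enqueue without any fence, folding the synchronization into a release store and an acquire load (this assumes the same struct rbuf as in the question and exactly one producer and one consumer):

static inline int
thrd_tryenq(struct rbuf *queue, rbuff_data_t val) {
    unsigned int w = atomic_load_explicit(&queue->w_i, memory_order_relaxed);
    unsigned int next_w_i = (w + 1) & queue->bufmask;
    /* acquire pairs with the consumer's store to r_i */
    if (atomic_load_explicit(&queue->r_i, memory_order_acquire) == next_w_i) {
        return 1; /* ring full */
    }
    queue->buf[w] = val;
    /* the release store publishes the write to buf[] to the consumer */
    atomic_store_explicit(&queue->w_i, next_w_i, memory_order_release);
    return 0;
}

The dequeue side is symmetric: an acquire load of w_i before reading buf[], then a release store to r_i.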
Some more general remarks:
Since you seem to use the sequential consistency model, which is the default, the functional writing of atomic operations (e.g. atomic_load) is superfluous. Just evaluating the atomic variable is exactly the same.
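For example (with <stdatomic.h> included), the following two reads are equivalent:

_Atomic unsigned int r_i;

void example(void)
{
    unsigned int a = r_i;               /* plain evaluation: a seq_cst load */
    unsigned int b = atomic_load(&r_i); /* explicit form: identical semantics */
}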
I have the impression that you are attempting optimization much too early in your development. I think you should first do an implementation whose correctness you can prove. Then, if and only if you notice a performance problem, should you start to think about optimization. It is very unlikely that such an atomic data structure is a real bottleneck for your application. You'd need a very large number of threads all simultaneously hammering on your poor little atomic variable to see a measurable bottleneck here.

Related

Questions regarding (non-)volatile and optimizing compilers

I have the following C code:
/* the memory entry points to can be changed from another thread but
 * it is not declared volatile */
struct myentry *entry;

bool isready(void)
{
    return entry->status == 1;
}

bool isready2(int idx)
{
    struct myentry *x = entry + idx;
    return x->status == 1;
}

int main(void) {
    /* busy loop */
    while (!isready())
        ;
    while (!isready2(5))
        ;
}
As I note in the comment, entry is not declared as volatile even though the array it points to can be changed from another thread (or actually even directly from kernel space).
Is the above code incorrect/unsafe? My thinking is that no optimization could be performed in the bodies of isready and isready2, and since I repeatedly perform function calls from within main, the appropriate memory location should be read on every call.
On the other hand, the compiler could inline these functions. Is it possible that it does so in a way that results in a single read happening (hence causing an infinite loop) instead of multiple reads (even if these reads come from a load/store buffer)?
And a second question: is it possible to prevent the compiler from doing optimizations by casting to volatile only in certain places, like this?
void func(void)
{
    entry->status = 1;
    while (((volatile struct myentry *) entry)->status != 2)
        ;
}
Thanks.
If the memory entry points to can be modified by another thread, then the program has a data race and therefore the behaviour is undefined. This is still true even if volatile is used.
To have a variable accessed concurrently by multiple threads in ISO C11, it must either be of an atomic type or be protected by correct synchronization.
If using an older standard revision then there are no guarantees provided by the Standard about multithreading so you are at the mercy of any idiosyncratic behaviour of your compiler.
If using POSIX threads, there are no portable atomic operations, but it does define synchronization primitives.
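For instance, here is a minimal sketch of the busy loop made well-defined with C11 atomics, assuming the writer also stores to status atomically:

#include <stdatomic.h>
#include <stdbool.h>

struct myentry { _Atomic int status; };
struct myentry *entry;

bool isready(void)
{
    /* an explicit atomic load; a plain read of the _Atomic member
     * would have the same sequentially consistent semantics */
    return atomic_load(&entry->status) == 1;
}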
See also:
Why is volatile not considered useful in multithreaded C or C++ programming?
The second question is a bugbear; I would suggest not doing it, because different compilers may interpret the meaning differently, and the behaviour is still formally undefined either way.

Can I insert a function inside a pthread_mutex_lock and unlock statements?

Let's suppose I want to make the instructions inside a function atomic.
I declared
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
as a global variable.
Instead of:
int main() {
    myFoo();
    ...
}

void myFoo() {
    pthread_mutex_lock(&mutex);
    myGlobal++;
    pthread_mutex_unlock(&mutex);
}
can I do:
int main() {
    pthread_mutex_lock(&mutex);
    myFoo();
    pthread_mutex_unlock(&mutex);
    ...
}

void myFoo() {
    myGlobal++;
}
So that every instruction in myFoo becomes atomic?
In the first example you are protecting myGlobal, and in the second you are protecting myFoo. Your code works as you expect (if you call it everywhere between lock/unlock), but you need to use the terms correctly or their meaning will be wrong.
No, it will not be atomic, but access to myFoo will be synchronized, meaning no other thread can execute that code while another thread is using it.
The term atomic operation is normally used to indicate that an instruction runs without any interruption (sometimes considered lock-free). For example, C11's atomic_flag provides such functionality. A mutex, on the other hand, is for creating mutual exclusion: you can protect a part of your code from simultaneous access by different threads. These two terms are not interchangeable.
Side note:
The only atomic_ type that is guaranteed to be truly atomic and lock-free is atomic_flag, in both C and C++. Other types such as atomic_int may be implemented using a synchronization method and need not be lock-free.
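As an aside: if the only operation to protect really is the increment, C11 atomics admit a mutex-free sketch (assuming a C11 toolchain; this changes the design rather than answering the question as asked):

#include <stdatomic.h>

_Atomic int myGlobal;

void myFoo(void) {
    atomic_fetch_add(&myGlobal, 1); /* one atomic read-modify-write */
}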
Your use of the term atomic is not really correct, but I guess the question is more about whether the two code snippets will behave the same.
If myFoo is only called between lock/unlock, the answer is yes, they are the same.
However, in the second case you have lost the protection of myFoo. Another thread could call myFoo without calling lock first, which would cause problems.
So the second example is bad, as it opens the door to more mistakes. Stick to the first one, i.e. keep the lock/unlock inside the function.
Also notice:
Since myGlobal is a global variable, you can't make sure that the threads do not access it directly. There are several ways to avoid that. The example below shows a single function with a static variable; the function can be used to retrieve the value of the static variable and, if desired, increment it.
int myFoo(int doIncrement)
{
    static int myStatic = 0;
    int result;
    pthread_mutex_lock(&mutex);
    if (doIncrement) myStatic++;
    result = myStatic;
    pthread_mutex_unlock(&mutex);
    return result;
}
Now the variable myStatic is hidden from all the threads and can only be accessed through myFoo.
int x = myFoo(1); // Increment and read
int y = myFoo(0); // Read only

How to implement TAS ("test and set")?

Does anyone know how to implement TAS ("test and set")?
This code should later be able to protect the code between acquire and release from multithreading-related loss of state.
typedef volatile long lock_t;

void acquire(lock_t* lock){
    while(TAS(lock))
        ;
}

void release(lock_t* lock){
    *lock = 0;
}

lock_t REQ_lock = 0;

int main(){
    acquire(&REQ_lock);
    //non-atomic code (critical section)
    release(&REQ_lock);
}
The C11 standard brings atomic types into the language.
Among them is the atomic_flag type, which has an associated function atomic_flag_test_and_set.
Here it is from the standard, C11 working draft section 7.17.8.1:
Synopsis

#include <stdatomic.h>
_Bool atomic_flag_test_and_set(volatile atomic_flag *object);
_Bool atomic_flag_test_and_set_explicit(volatile atomic_flag *object, memory_order order);

Description
Atomically sets the value pointed to by object to true. Memory is affected according to the value of order. These operations are atomic read-modify-write operations.
Returns
Atomically, the value of the object immediately before the effects.
And along with it, you'll want its sister operation, atomic_flag_clear.
Section 7.17.8.2:
Synopsis

#include <stdatomic.h>
void atomic_flag_clear(volatile atomic_flag *object);
void atomic_flag_clear_explicit(volatile atomic_flag *object, memory_order order);

Description
The order argument shall not be memory_order_acquire nor memory_order_acq_rel. Atomically sets the value pointed to by object to false. Memory is affected according to the value of order.
Returns
The atomic_flag_clear functions return no value.
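Put together, a minimal sketch of the question's acquire/release pair built on atomic_flag (a sketch assuming C11, not the only possible formulation) could look like this:

#include <stdatomic.h>

typedef atomic_flag lock_t;

void acquire(lock_t* lock){
    /* spin until test_and_set returns false, i.e. the flag was clear
     * and we are the ones who set it */
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;
}

void release(lock_t* lock){
    atomic_flag_clear_explicit(lock, memory_order_release);
}

lock_t REQ_lock = ATOMIC_FLAG_INIT;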
Yeah, no. One of the virtues of C is to bring you as close to the machine-language boundary as possible, but no further. This is further.
You could go for atomic support, which is provided in some C augmentation-standard, but that is a nest of problems.
Write TAS in assembly. Restrict the visibility of the variables used for TAS, for containment. For any given architecture it should be no more than a handful of lines of assembly. Your program will have contained its dependency; the lack of such containment is among the critical flaws in the last decade of C augmentations.
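For illustration, such a TAS might look like the sketch below (x86-64 and GCC-style inline assembly are assumptions; other architectures need their own handful of lines):

/* xchg with a memory operand is implicitly locked on x86,
 * so the exchange itself is atomic */
static inline long TAS(volatile long *lock){
    long old = 1;
    __asm__ __volatile__("xchg %0, %1"
                         : "+r"(old), "+m"(*lock)
                         :
                         : "memory");
    return old; /* 0 means the lock was free and is now ours */
}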

getter and setters best practice with mutex

I'm feeling a bit overwhelmed when using multiple threads in embedded programming since each and every shared resource ends up with a getter/setter protected by a mutex.
I would really like to understand if a getter of the following sort
static float static_raw;

float get_raw() {
    os_mutex_get(mutex, OS_WAIT_FOREVER);
    float local_raw = static_raw;
    os_mutex_put(mutex);
    return local_raw;
}

makes sense, or whether float assignment can be considered atomic, e.g. on ARM (unlike, say, 64-bit variables), making this superfluous.
I can understand something like this:
raw = raw > VALUE ? raw + compensation() : raw;
where the value is handled multiple times, but what about when reading or returning it?
Can you clear this up for me?
EDIT 1:
Regarding the second question below.
let's assume we have a "heavy" function in terms of execution time; let's call it
void foo(int a, int b, int c)
where a,b,c are potentially values from shared resources.
When the foo function is called, should it be wrapped in a mutex, holding the lock for a long time even if it just needs a copy of the values? E.g.
os_mutex_get(mutex, OS_WAIT_FOREVER);
foo(a,b,c);
os_mutex_put(mutex);
does it make any sense to do
os_mutex_get(mutex, OS_WAIT_FOREVER);
int la = a;
int lb = b;
int lc = c;
os_mutex_put(mutex);
foo(la,lb,lc);
locking only around the copying of the variables instead of around the full execution?
EDIT2:
Given there could exist getter and setter for "a", "b" and "c".
Is it problematic, in terms of performance or anything else, to do something like this?
int static_a;

void get_a(int* la){
    os_mutex_get(mutex, OS_WAIT_FOREVER);
    *la = static_a;
    os_mutex_put(mutex);
}
or
int static_b;

int get_b(){
    os_mutex_get(mutex, OS_WAIT_FOREVER);
    int lb = static_b;
    os_mutex_put(mutex);
    return lb;
}
using them as
void main(){
    int la = 0;
    get_a(&la);
    foo(la, get_b());
}
I'm asking this because I'm locking and relocking the same mutex sequentially for potentially no reason.
if float assignment can be considered atomic
Nothing can be considered atomic in C unless you use C11 _Atomic or inline assembler. The underlying hardware is irrelevant, because even if a word of a certain size can be read in a single instruction on the given hardware, there is never a guarantee that a given C statement will compile down to a single instruction.
does it make any sense to do
os_mutex_get(mutex, OS_WAIT_FOREVER);
int la = a;
int lb = b;
int lc = c;
os_mutex_put(mutex);
foo(a,b,c);
Assuming you mean foo(la,lb,lc);, then yes, it makes a lot of sense. This is how you should ideally use a mutex: minimize the code between the lock and the unlock so that it is just raw variable copying and nothing else.
The C standard does not dictate anything about the atomicity of the assignment operator. You cannot consider an assignment atomic, as it is completely implementation dependent.
However, in C11 the _Atomic type qualifier (C11 §6.7.3, page 121 here) can be used (if supported by your compiler) to declare variables to be atomically read and written, so you could for example do the following:

static _Atomic float static_raw;

float get_raw(void) {
    return static_raw;
}
Don't forget to compile with -std=c11 if you do so.
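For completeness, a matching setter could be just as small (the name set_raw is made up for illustration):

void set_raw(float value) {
    /* a plain write to an _Atomic object is an atomic store */
    static_raw = value;
}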
Addressing your first edit:
When the foo function is called should it be enveloped by a mutex, locking it for plenty of time even if it just needs a copy of the value?
While it would be correct, it certainly would not be the best solution. If the function only needs a copy of the variables, then your second snippet is without a doubt much better and should be the ideal solution:
os_mutex_get(mutex, OS_WAIT_FOREVER);
int la = a;
int lb = b;
int lc = c;
os_mutex_put(mutex);
foo(la,lb,lc);
If you lock the whole function, you'll block any other thread trying to acquire the lock for much longer than needed, slowing down everything. Locking before calling the function and passing copies of the values will instead only lock for the needed amount of time leaving much more free time to other threads.
Addressing your second edit:
Given there could exist getter and setter for "a", "b" and "c". Is it problematic in terms of performance/or anything else in doing something like this?
That code is correct. In terms of performance it would certainly be much better to have one mutex per variable, if you can. With only one mutex, any thread holding the mutex will "block" any other thread that is trying to lock it, even if they are trying to access a different variable.
If you cannot use multiple mutexes, then it's a matter of choosing between these two options:
Lock inside the getters:
void get_a(int* la){
    os_mutex_get(mutex, OS_WAIT_FOREVER);
    *la = static_a;
    os_mutex_put(mutex);
}

void get_b(int* lb){
    os_mutex_get(mutex, OS_WAIT_FOREVER);
    *lb = static_b;
    os_mutex_put(mutex);
}
/* ... */
int var1, var2;
get_a(&var1);
get_b(&var2);
Lock outside the getters (leave the duty to the caller):
int get_a(void){
    return static_a;
}

int get_b(void){
    return static_b;
}
/* ... */
os_mutex_get(mutex, OS_WAIT_FOREVER);
int var1 = get_a();
int var2 = get_b();
os_mutex_put(mutex);
At this point you wouldn't even need to have getters and you could just do:
os_mutex_get(mutex, OS_WAIT_FOREVER);
int var1 = a;
int var2 = b;
os_mutex_put(mutex);
If your code frequently requests multiple values, then locking/unlocking outside the getters is better, since it causes less overhead. As an alternative, you could also keep the locking inside but create separate functions that retrieve multiple variables at once, so that the mutex is only locked and released once; see the sketch below.
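For example, a combined getter might look like this sketch (the name get_ab is made up for illustration, reusing the question's os_mutex API):

void get_ab(int* la, int* lb){
    os_mutex_get(mutex, OS_WAIT_FOREVER);
    *la = static_a;
    *lb = static_b;
    os_mutex_put(mutex); /* one lock/unlock pair for both values */
}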
On the other hand, if your code is only rarely requesting multiple values, then it's ok to keep the locking inside each getter.
It isn't possible to say what's the best solution ahead of time, you should run different tests and see what's best for your scenario.

volatile keyword with mutex and semaphores

The question is simple: does/should a variable shared between threads be declared volatile even when it is only accessed inside a critical section (i.e. protected by a mutex or semaphore) in C? Why or why not?
#include <pthread.h>

volatile int account_balance;
pthread_mutex_t flag = PTHREAD_MUTEX_INITIALIZER;

void debit(int amount) {
    pthread_mutex_lock(&flag);
    account_balance -= amount; /* inside critical section */
    pthread_mutex_unlock(&flag);
}
What about the example above, or the equivalent reasoning for a semaphore?
Does/Should a variable used with multi-threads be volatile even accessed in critical section(i.e. mutex, semaphore) in C? Why / Why not?
No.
volatile is logically irrelevant for concurrency, because it is not sufficient.
Actually, that's not really true: volatile is not irrelevant, because it can hide concurrency problems in your code, so that it works "most of the time".
All volatile does is tell the compiler "this variable can change outside the current thread of execution". volatile in no way enforces any ordering, atomicity, or, critically, visibility. Just because thread 2 on CPU A changes int x, that doesn't mean thread 1 on CPU D can see the change at any specific time: it has its own cached value, and volatile means almost nothing with respect to memory coherence because it doesn't guarantee ordering.
The last comment at the bottom of the Intel article Volatile: Almost Useless for Multi-Threaded Programming says it best:
If you are simply adding 'volatile' to variables that are shared
between threads thinking that fixes your shared-data problem without
bothering to understand why it may not, you will eventually reap the
reward you deserve.
Yes, lock-free code can make use of volatile. Such code is written by people who can likely write tutorials on the use of volatile, multithreaded code, and other extremely detailed subjects regarding compilers.
No, volatile should not be used on shared variables which are accessed under the protection of pthreads synchronisation functions like pthread_mutex_lock().
The reason is that the synchronisation functions themselves are guaranteed by POSIX to provide all the necessary compiler barriers and synchronisation to ensure consistency (as long as you follow the POSIX rules on concurrent access - ie. that you have used pthreads synchronisation functions to ensure that no thread can be writing to a shared variable whilst another thread is writing to or reading from it).
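For instance, the question's debit example is fully correct without volatile, assuming every access to account_balance (reads included) happens under the mutex:

#include <pthread.h>

int account_balance; /* no volatile needed */
pthread_mutex_t flag = PTHREAD_MUTEX_INITIALIZER;

void debit(int amount) {
    pthread_mutex_lock(&flag);   /* provides the required barriers */
    account_balance -= amount;
    pthread_mutex_unlock(&flag); /* publishes the update to other threads */
}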
I have no idea why there's so much misinformation about volatile everywhere on the internet. The answer to your question is yes, you should make variables you use within a critical section volatile.
I'll give a contrived example. Let's say you want to run this function on multiple threads:
int a;

void inc_a(void) {
    for (int i = 0; i < 5; ++i) {
        a += 5;
    }
}
Everybody on this site, it would seem, will tell you that it's enough to put a += 5 in a critical section, like so:
int a;

void inc_a(void) {
    for (int i = 0; i < 5; ++i) {
        enter_critical_section();
        a += 5;
        exit_critical_section();
    }
}
As I said, it's contrived, but people will tell you this is correct, and it absolutely is not! If the compiler isn't given prior knowledge of what the critical-section functions are and what their semantic meaning is, there's nothing stopping it from outputting this code:
int a;

void inc_a(void) {
    register eax = a;
    for (int i = 0; i < 5; ++i) {
        enter_critical_section();
        eax += 5;
        exit_critical_section();
    }
    a = eax;
}
This code produces the same output in a single-threaded context, so the compiler is allowed to do that. But in a multithreaded context, this can output anything between 25 and 25 times the thread count. One way to solve this issue is to use an atomic construct, but that has performance implications; instead, what you should do is make the variable volatile. That is, unless you want to be like the rest of this community and blindly put your faith in your C compiler.
