I have to use two threads; one to do various operations on matrices, and the other to monitor virtual memory at various points in the matrix operation process. This method is required to use a global state variable 'flag'.
So far I have the following (leaving some out for brevity):
int flag = 0;

int allocate_matrices(int dimension)
{
    while (flag == 0) {} // busy wait while main prints memory state
    int *matrix = (int *) malloc(sizeof(int)*dimension*dimension);
    int *matrix2 = (int *) malloc(sizeof(int)*dimension*dimension);
    flag = 0;
    while (flag == 0) {} // busy wait while main prints memory state
    // more similar actions...
}

int memory_stats()
{
    while (flag == 0)
    { system("top"); flag = 1; }
}

int main()
{
    // threads are created and joined for these two functions
}
As you might expect, the system("top") call happens once, then the matrices are allocated, then the program falls into an infinite loop. It seems apparent to me that this is because the thread assigned to the memory_stats function has already completed its duty, so flag will never be updated again.
Is there an elegant way around this? I know I have to print memory stats four times, so it occurs to me that I could write four while loops in the memory_stats function with busy waiting contingent on the global flag in between each of them, but that seems clunky to me. Any help or pointers would be appreciated.
One possible reason for the hang is that flag is a regular variable and the compiler sees that it is never set to a non-zero value between flag = 0; and the following while (flag == 0) {} inside allocate_matrices(). So it "thinks" the variable stays 0 and turns the loop into an infinite one. The compiler is entirely oblivious to your threads.
You could define flag as volatile to prevent the above from happening, but you'll likely run into other issues after adding volatile. For one thing, volatile does not guarantee atomicity of variable modifications.
Another issue is that if the compiler sees an infinite loop with no side effects, it may treat it as undefined behavior, and then anything can happen, or at least not what you think should happen.
You need to use proper synchronization primitives like mutexes.
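For example, the flag handshake can be replaced by a mutex and a condition variable so that neither thread spins. A minimal sketch, assuming pthreads (the names turn, wait_for_turn and give_turn are placeholders, not from your code):

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int turn = 0;   /* 0: memory_stats' turn, 1: allocate_matrices' turn */

static void wait_for_turn(int who)
{
    pthread_mutex_lock(&lock);
    while (turn != who)                  /* sleeps instead of burning CPU */
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

static void give_turn(int who)
{
    pthread_mutex_lock(&lock);
    turn = who;
    pthread_cond_signal(&cond);          /* wake the other thread */
    pthread_mutex_unlock(&lock);
}

memory_stats() would then loop four times doing wait_for_turn(0); system("top"); give_turn(1); and allocate_matrices() would do wait_for_turn(1); ...one allocation step...; give_turn(0); in its own loop, so each thread runs exactly when it is its turn.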
You can lock it with a mutex. I assume you use pthreads.

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

pthread_mutex_lock(&mutex);
flag = 1;
pthread_mutex_unlock(&mutex);
Here is a very good tutorial about pthreads, mutexes and other stuff: https://computing.llnl.gov/tutorials/pthreads/
Your problem could be solved with a C compiler that follows the latest C standard, C11. C11 has threads and a data type called atomic_flag that can basically be used for a spin lock like the one you have in your question.
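A minimal sketch of such a spin lock built on C11 atomic_flag (the spin_lock/spin_unlock names are mine, not part of the standard):

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

static void spin_lock(void)
{
    /* test_and_set returns the previous value: keep spinning while it was already set */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}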
First of all, the variable flag needs to be declared volatile or else the compiler has license to omit reads to it after the first one.
With that out of the way, a sequencer/event_counter can be used: one thread may increment the variable when it's odd, the other when it's even. Since one thread always "owns" the variable, and transfers the ownership with the increment, there is no race condition.
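A sketch of that sequencer scheme for the four steps of the original problem, using a C11 atomic counter for the sequencer instead of the plain volatile int this answer mentions (the odd/even ownership idea is unchanged; all names and the placeholder work are assumptions):

#include <stdatomic.h>
#include <stdlib.h>

#define STEPS 4

static atomic_uint seq = 0;   /* even: monitor owns it, odd: worker owns it */

static void *memory_stats(void *arg)
{
    for (int i = 0; i < STEPS; ++i) {
        while (atomic_load(&seq) % 2 != 0)   /* wait for our (even) turn */
            ;
        system("top");                       /* print memory state, as in the question */
        atomic_fetch_add(&seq, 1);           /* transfer ownership to the worker */
    }
    return NULL;
}

static void *allocate_matrices(void *arg)
{
    int dimension = *(int *)arg;
    for (int i = 0; i < STEPS; ++i) {
        while (atomic_load(&seq) % 2 != 1)   /* wait for our (odd) turn */
            ;
        int *matrix = malloc(sizeof(int) * dimension * dimension);
        free(matrix);                        /* placeholder for the real matrix work */
        atomic_fetch_add(&seq, 1);           /* transfer ownership back to the monitor */
    }
    return NULL;
}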
I have a few questions regarding memory barriers.
Say I have the following C code (it will be run both from C++ and C code, so atomics are not possible) that writes an array into another one. Multiple threads may call thread_func(), and I want to make sure that my_str is returned only after it was initialized fully. In this case, it is a given that the last byte of the buffer can't be 0. As such, checking for the last byte as not 0, should suffice.
Due to reordering by compiler/CPU, this can be a problem as the last byte might get written before previous bytes, causing my_str to be returned with a partially copied buffer. So to get around this, I want to use a memory barrier. A mutex will work of course, but would be too heavy for my uses.
Keep in mind that all threads will call thread_func() with the same input, so even if multiple threads call init() a couple of times, it's OK as long as in the end, thread_func() returns a valid my_str, and that all subsequent calls after initialization return my_str directly.
Please tell me if all the following different code approaches work, or if there could be issues in some scenarios as aside from getting the solution to the problem, I'd like to get some more information regarding memory barriers.
__sync_bool_compare_and_swap on the last byte. If I understand correctly, no memory store/load would be reordered across it, not just the one for the particular variable that is passed to the builtin. Is that correct? If so, I would expect this to work, as all the writes of the previous bytes should be completed before the barrier lets execution move on.
#define STR_LEN 100
static uint8_t my_str[STR_LEN] = {0};

static void init(uint8_t input_buf[STR_LEN])
{
    for (int i = 0; i < STR_LEN - 1; ++i) {
        my_str[i] = input_buf[i];
    }
    __sync_bool_compare_and_swap(&my_str[STR_LEN - 1], 0, input_buf[STR_LEN - 1]);
}

const char * thread_func(char input_buf[STR_LEN])
{
    if (my_str[STR_LEN - 1] == 0) {
        init((uint8_t *) input_buf);
    }
    return (const char *) my_str;
}
__sync_bool_compare_and_swap on each write. I would expect this to work as well, but to be slower than the first one.
static void init(char input_buf[STR_LEN])
{
    for (int i = 0; i < STR_LEN; ++i) {
        __sync_bool_compare_and_swap(my_str + i, 0, input_buf[i]);
    }
}
__sync_synchronize before each byte copy. I would expect this to work as well, but is this slower or faster than (2)? __sync_bool_compare_and_swap is supposed to be a full barrier as well, so which would be preferable?
static void init(char input_buf[STR_LEN])
{
    for (int i = 0; i < STR_LEN; ++i) {
        __sync_synchronize();
        my_str[i] = input_buf[i];
    }
}
__sync_synchronize by condition. As I understand it, __sync_synchronize is both a HW and SW memory barrier. As such, since the compiler can't tell the value of use_sync, it shouldn't reorder. And the HW reordering will be done only if use_sync is true. Is that correct?
static void init(char input_buf[STR_LEN], bool use_sync)
{
    for (int i = 0; i < STR_LEN; ++i) {
        if (use_sync) {
            __sync_synchronize();
        }
        my_str[i] = input_buf[i];
    }
}
GNU C legacy __sync builtins are not recommended for new code, as the manual says.
Use the __atomic builtins which can take a memory-order parameter like C11 stdatomic. But they're still builtins and still work on plain types not declared _Atomic, so using them is like C++20 std::atomic_ref. In C++20, use std::atomic_ref<unsigned char>(my_str[STR_LEN - 1]), but C doesn't provide an equivalent so you'd have to use compiler builtins to hand-roll it.
Just do the last store separately with a release store in the writer, not an RMW, and definitely not a full memory barrier (__sync_synchronize()) between every byte! That's far slower than necessary and defeats any optimization into memcpy. You also need the store of the final byte to be at least RELEASE, not a plain store, so readers can synchronize with it. See also Who's afraid of a big bad optimizing compiler? regarding how exactly compilers can break your code if you try to hand-roll lockless code with just barriers rather than atomic loads or stores. (It's written for Linux kernel code, where a macro would use *(volatile char*) to hand-roll something close to __atomic_store_n with __ATOMIC_RELAXED.)
So something like
__atomic_store_n(&my_str[STR_LEN - 1], input_buf[STR_LEN - 1], __ATOMIC_RELEASE);
The if (my_str[STR_LEN - 1] == 0) load in thread_func is of course data-race UB when there are concurrent writers.
For safety it needs to be an acquire load, like __atomic_load_n(&my_str[STR_LEN - 1], __ATOMIC_ACQUIRE) == 0, since you need a thread that loads a non-0 value to also see all other stores by another thread that ran init(). (Which did a release-store to that location, creating acquire/release synchronization and guaranteeing a happens-before relationship between these threads.)
See https://preshing.com/20120913/acquire-and-release-semantics/
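Putting the pieces together, a minimal sketch of the writer/reader pair with the __atomic builtins (GNU C). Only the final byte is accessed atomically; the release/acquire pair on it publishes the plain copies of the other bytes:

#define STR_LEN 100
static unsigned char my_str[STR_LEN];

static void init(const unsigned char input_buf[STR_LEN])
{
    for (int i = 0; i < STR_LEN - 1; ++i)
        my_str[i] = input_buf[i];            /* plain stores */
    /* release store: a reader that sees this byte as non-zero
       also sees all the stores above */
    __atomic_store_n(&my_str[STR_LEN - 1], input_buf[STR_LEN - 1],
                     __ATOMIC_RELEASE);
}

const unsigned char *thread_func(const unsigned char input_buf[STR_LEN])
{
    /* acquire load pairs with the release store in init() */
    if (__atomic_load_n(&my_str[STR_LEN - 1], __ATOMIC_ACQUIRE) == 0)
        init(input_buf);
    return my_str;
}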
Writing the same value non-atomically is also UB in ISO C and ISO C++. See Race Condition with writing same value in C++? and others.
But in practice it should be fine, except with clang -fsanitize=thread. In theory a DeathStation9000 could implement non-atomic stores by storing value+1 and then subtracting 1, so temporarily there'd be a different value in memory. But AFAIK there aren't real compilers that do that. I'd have a look at the generated asm on any new compiler / ISA combination you're trying, just to make sure.
It would be hard to test; the init stuff can only race once per program invocation. But there's no fully safe way to do it that doesn't totally suck for performance, AFAIK. Perhaps doing the init with a cast to _Atomic unsigned char* or typedef _Atomic unsigned long __attribute__((may_alias)) aliasing_atomic_ulong; as a building block for a manual copy loop?
Bonus question: if(use_sync) __sync_synchronize() inside the loop.
Since the compiler can't tell the value of use_sync it shouldn't reorder.
Optimization is possible to asm that works something like if(use_sync) { slow barrier loop } else { no-barrier loop }. This is called "loop unswitching": making two loops and branching once to decide which to run, instead of every iteration. GCC has been able to do that optimization (in some cases) since 3.4. So that defeats your attempt to take advantage of how the compiler would compile to trick it into doing more ordering than the source actually requires.
And the HW reordering will be done only if use_sync is true.
Yes, that part is correct.
Also, inlining and constant-propagation of use_sync could easily defeat this, unless use_sync was a volatile global or something. At that point you might as well just make a separate _Atomic unsigned char array_init_done flag / guard variable.
And you can use it for mutual exclusion by having threads try to set it to 1 with int old = guard.exchange(1), with the winner of the race being the one to run init while they spin-wait (or C++20 .wait(1)) for the guard variable to become 2 or -1 or something, which the winner of the race will set after finishing init.
Have a look at the asm GCC makes for non-constant-initialized static local vars; they check a guard variable with an acquire load, only doing locking to have one thread do the run_once init stuff and the others wait for that result. IIRC there's a Q&A about doing that yourself with atomics.
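A sketch of that guard-variable idea with C11 atomics, using a compare-exchange for the 0 -> 1 transition so a latecomer can't disturb an already-finished guard (array_init_done comes from the answer above; everything else is an assumption, with init() and STR_LEN as in the earlier sketch):

#include <stdatomic.h>

static _Atomic unsigned char array_init_done = 0;  /* 0 = not started, 1 = running, 2 = done */

void ensure_init(const unsigned char input_buf[STR_LEN])
{
    unsigned char expected = 0;
    /* the winner of the race moves the guard 0 -> 1 and runs init() */
    if (atomic_compare_exchange_strong(&array_init_done, &expected, 1)) {
        init(input_buf);
        atomic_store_explicit(&array_init_done, 2, memory_order_release);
        return;
    }
    /* someone else won (or already finished): wait until the result is published */
    while (atomic_load_explicit(&array_init_done, memory_order_acquire) != 2)
        ;
}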
I have a general question that occurred to me while trying to implement a thread synchronization problem with semaphores. I do not want to get into much (unrelated) detail, so I am going to give the code that I think is important to clarify my question.
sem_t *mysema;
volatile int counter;

struct my_info {
    pthread_t t;
    int id;
};

void *barrier(void *arg)
{
    struct my_info *a = arg;
    int thrid = a->id;
    while (counter > 0) {
        do_work(&mysema[thrid]);
        sem_wait(&mysema[thrid]);
        display_my_work(arg);
        counter--;
        sem_post(&mysema[thrid + 1]);
    }
    return NULL;
}

int main(int argc, char *argv[])
{
    int M = atoi(argv[1]);
    mysema = malloc(M * sizeof(*mysema));
    struct my_info *tinfo = malloc(M * sizeof(*tinfo));
    counter = 50;

    /* semaphore initialisations */
    for (int i = 0; i < M; i++) {
        sem_init(&mysema[i], 0, 0);
    }
    for (int i = 0; i < M; i++) {
        tinfo[i].id = i;
    }
    for (int i = 0; i < M; i++) {
        pthread_create(&tinfo[i].t, NULL, barrier, &tinfo[i]);
    }

    /* init: wake up the first semaphore */
    sem_post(&mysema[0]);
    .
    .
    .
We have an array of M semaphores initialised to 0, where M is given on the command line by the user.
I know I am done when all M threads together have done the necessary computations 50 times in total.
Each thread blocks itself until the previous thread sem_posts it. The very first thread will be woken up by init.
My question is whether the threads will stop when counter = 0. Do they all see the same variable counter? (It is a global one, initialised in main.)
If thread zero is the first to set counter = 49, do all the other threads (threads 1, 2, ..., M-1) see that?
These are different questions:
Do [the threads] all see the same variable counter? (It is a global one, initialised in main.)
If thread zero is the first to set counter = 49, do all the other threads (threads 1, 2, ..., M-1) see that?
The first is fairly simple: yes. An object declared at file scope and without storage class specifier _Thread_local is a single object whose storage duration is the entire run of the program. Wherever that object's identifier is in-scope and visible, it identifies the same object regardless of which thread is accessing it.
The answer to the second question is more complicated. In a multi-threaded program there is the potential for data races, and the behavior of a program containing a data race is undefined. The volatile qualifier does not protect against these; instead, you need proper synchronization for all accesses to each shared variable, both reads and writes. This can be provided by a semaphore or more often a mutex, among other possibilities.
Your code's decrement of counter may be adequately protected, but I suspect not, on account of the threads using different semaphores. If this allows for multiple different threads to execute the ...
display_my_work(arg);
counter--;
... lines at the same time then you have a data race. Even if your protection is adequate there, however, the read of counter in the while condition clearly is not properly synchronized, and you definitely have a data race there.
One of the common manifestations of the undefined behavior brought on by data races is that threads do not see each others' updates, so not only does your program's undefined behavior generally mean that threads 1 ... M-1 may not see thread 0's update of counter, it also specifically makes such a failure comparatively probable.
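For illustration, one way to synchronize every access to counter is to funnel both the read in the while condition and the decrement through a mutex. A minimal sketch assuming pthreads (the helper names are mine); note that this only removes the data race, it does not by itself fix any counting logic:

#include <pthread.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
int counter = 50;   /* the question's global; volatile is no longer needed once all access is locked */

static int counter_positive(void)        /* synchronized read for the loop condition */
{
    pthread_mutex_lock(&counter_lock);
    int positive = counter > 0;
    pthread_mutex_unlock(&counter_lock);
    return positive;
}

static void counter_decrement(void)      /* synchronized write inside the loop */
{
    pthread_mutex_lock(&counter_lock);
    counter--;
    pthread_mutex_unlock(&counter_lock);
}

The loop in barrier() would then read while (counter_positive()) { ... counter_decrement(); ... }, so every thread observes the updates made by every other thread.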
I'm trying to understand the details in the TCB (thread control block) and the differences between per-thread state and shared state. My book has its own implementation of pthreads, so it gives an example with this mini C program (I've not typed the whole thing out):
#include "thread.h"
static void go(int n);
static thread_t threads[NTHREADS];
#define NTHREADS 10
int main(int argh, char **argv) {
int i;
long exitValue;
for (i = 0; i < NTHREADS; i++) {
thread_create(&threads[i]), &go, i);
}
for (i = 0; i < NTHREADS; i++) {
exitValue = thread_join(threads[i]);
}
printf("Main thread done".\n);
return 0;
}
void go(int n) {
printf("Hello from thread %d\n", n);
thread_exit(100 + n);
}
What would the variables i and exitValue (in the main() function) be examples of? They're not shared state since they're not global variables, but I'm not sure if they're per-thread state either. The i is used as the parameter for the go function when each thread is being created, so I'm a bit confused about it. The exitValue's scope is limited only to main() so that seems like it would just be stored on the process' stack. The int n as the parameter for the void go() would be a per-thread variable because its value is independent for each thread. I don't think I fully understand these concepts so any help would be appreciated! Thanks!
Short Answer
All of the variables in your example program are automatic variables. Each time one of them comes into scope, storage for it is allocated, and when it leaves its scope it is no longer valid. This concept is independent of whether the variable is shared or not.
Longer Answer
The scope of a variable determines where it can be accessed, and for an automatic variable its lifetime is tied to each execution of that scope. In your program the variables i and exitValue are scoped to the main function. Typically a compiler will allocate space on the stack to store the values of these variables.
The variable n in function go is a parameter to the function, so it also acts as a local variable in go. Each time go is executed the compiler will allocate space in the stack frame for the variable n (although the compiler may be able to keep local variables in registers rather than actually allocating stack space). As a parameter, however, n is initialized with whatever value the function was called with (its actual argument).
To make this more concrete, here is what the values of the variables in the program would be after the first loop has completed 2 iterations (assuming that the spawned threads haven't finished executing).
Main thread: i = 2, exitValue = 0
Thread 0: n = 0
Thread 1: n = 1
The thing to note is that there are multiple independent copies of the variable n. And that n gets a copy of the value in i when thread_create is executed, but that the values of i and n are independent after that.
Finally I'm not certain what is supposed to happen with the statement exitValue = thread_join(threads[i]); since this is a variation of pthreads. But what probably happens is that it makes the value available when another thread calls thread_join. So in that way you do get some data sharing between threads, but the sharing is synchronized by the thread_join command.
They're objects with automatic storage, casually known as "local variables", although that term is ambiguous since C and C++ both allow objects that have local scope but only a single instance for the whole program via the static keyword.
Recently I needed to compare two uint arrays (one volatile and the other non-volatile) and the results were confusing; there must be something I have misunderstood about volatile arrays.
I need to read an array from an input device and write it to a local variable before comparing it to a global volatile array. If there is any difference I need to copy the new one onto the global one and publish the new array to other platforms. The code is something like below:
#define ARRAYLENGTH 30
volatile uint8 myArray[ARRAYLENGTH];

void myFunc(void){
    uint8 shadow_array[ARRAYLENGTH], change = 0;
    readInput(shadow_array);
    for(int i = 0; i < ARRAYLENGTH; i++){
        if(myArray[i] != shadow_array[i]){
            change = 1;
            myArray[i] = shadow_array[i];
        }
    }
    if(change){
        char arrayStr[ARRAYLENGTH*4];
        array2String(arrayStr, myArray);
        publish(arrayStr);
    }
}
However, this didn't work, and every time myFunc runs it turns out that a new message is published, mostly identical to the earlier message.
So I inserted a log line into code:
for(int i = 0; i < ARRAYLENGTH; i++){
    if(myArray[i] != shadow_array[i]){
        change = 1;
        log("old:%d,new:%d\r\n", myArray[i], shadow_array[i]);
        myArray[i] = shadow_array[i];
    }
}
Logs I got was as below:
old:0,new:0
old:8,new:8
old:87,new:87
...
Since solving the bug was time-critical, I worked around the issue as below:
    char arrayStr[ARRAYLENGTH*4];
    char arrayStr1[ARRAYLENGTH*4];
    array2String(arrayStr, myArray);
    array2String(arrayStr1, shadow_array);
    if(strCompare(arrayStr, arrayStr1)){
        publish(arrayStr1);
    }
}
But this approach is far from efficient. If anyone has a reasonable explanation, I would like to hear it.
Thank you.
[updated from comments:]
For the volatile part, global array has to be volatile, since other threads are accessing it.
If the global array is volatile, your tracing code could be inaccurate:
for(int i = 0; i < ARRAYLENGTH; i++){
    if(myArray[i] != shadow_array[i]){
        change = 1;
        log("old:%d,new:%d\r\n", myArray[i], shadow_array[i]);
        myArray[i] = shadow_array[i];
    }
}
The trouble is that the comparison line reads myArray[i] once, but the logging message reads it again, and since it is volatile, there's no guarantee that the two reads will give the same value. An accurate logging technique would be:
for (int i = 0; i < ARRAYLENGTH; i++)
{
    uint8 value;
    if ((value = myArray[i]) != shadow_array[i])
    {
        change = 1;
        log("old:%d,new:%d\r\n", value, shadow_array[i]);
        myArray[i] = shadow_array[i];
    }
}
This copies the value used in the comparison and reports that. My gut feel is it is not going to show a difference, but in theory it could.
global array has to be volatile, since other threads are accessing it
As you "nicely" observe declaring an array volatile is not the way to protect it against concurrent read/write access by different threads.
Use a mutex for this. For example by wrapping access to the "global array" into a function which locks and unlocks this mutex. Then only use this function to access the "global array".
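One possible shape of such a wrapper, assuming pthreads and keeping the question's uint8/ARRAYLENGTH names (the function names are mine). Every reader and writer goes through the same mutex, so the volatile qualifier is no longer needed:

#include <pthread.h>
#include <string.h>

static uint8 myArray[ARRAYLENGTH];
static pthread_mutex_t myArray_lock = PTHREAD_MUTEX_INITIALIZER;

/* copy the current contents into the caller's buffer */
void array_read(uint8 out[ARRAYLENGTH])
{
    pthread_mutex_lock(&myArray_lock);
    memcpy(out, myArray, ARRAYLENGTH);
    pthread_mutex_unlock(&myArray_lock);
}

/* replace the contents; returns 1 if anything actually changed */
int array_update(const uint8 in[ARRAYLENGTH])
{
    pthread_mutex_lock(&myArray_lock);
    int changed = memcmp(myArray, in, ARRAYLENGTH) != 0;
    if (changed)
        memcpy(myArray, in, ARRAYLENGTH);
    pthread_mutex_unlock(&myArray_lock);
    return changed;
}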
References:
Why is volatile not considered useful in multithreaded C or C++ programming?
https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
Also for printf()ing unsigned integers use the conversion specifier u not d.
A variable (or array) should be declared volatile when it may change outside the current program execution flow. This may happen through concurrent threads or an ISR.
If, however, there is only one writer and all others are just readers, then the writing code may treat it as non-volatile (even though there is no way to tell the compiler so).
So if the comparison function is the only point in the project where the global array is actually changed (updated), then there is no problem with multiple reads: the code can be designed with the (external) knowledge that there will be no change by an external source, despite the volatile declaration.
The readers, however, do know that the variable (or the array content) may change and won't buffer their reads (e.g. by keeping a previously read value in a register for further use); but the array content may still change while they are reading it, leaving the whole information inconsistent.
So the suggested use of a mutex is a good idea.
It does not, however, help with the original problem that the comparison loop fails even though nobody is messing with the array from outside.
Also, I wonder why myArray is declared volatile if it is only used locally and the publishing is done by sending out a pointer to arrayStr (which points to a non-volatile char array).
There is no reason why myArray should be volatile. Actually, there is no reason for its existence at all:
Just read in the data, create a temporary string, and if it differs from the stored one, replace the old string and publish it. It may be less efficient to always build the string, but it makes the code much shorter and apparently works.
    static char arrayStr[ARRAYLENGTH*4] = {0};
    char tempStr[ARRAYLENGTH*4];
    array2String(tempStr, shadow_array);
    if(strCompare(arrayStr, tempStr)){
        strCopy(arrayStr, tempStr);
        publish(arrayStr);
    }
}
In my code I have a buffer, and my code to add data to it is:
bool push_string(file_buffer *cb, const char* message, const unsigned short msglen)
{
    unsigned int size = msglen;
    if(cb->head >= (cb->size - size))
    {
        size = cb->size - cb->head - 1;
    }
    if(size < 1) return false;
    char* dest = cb->head += size;
    memcpy(dest, message, size);
    return (size == msglen);
}
Since I add data from multiple interrupts (which can preempt each other), I was wondering if this code is thread-safe? I marked 'cb->head' as volatile, but if another interrupt preempts exactly between the increase of 'head' and the assignment to 'dest', things could go wrong.
How can I improve this code to make it safer?
EDIT: Maybe I shouldn't have used the term 'thread-safe' because there are no threads running in parallel, just the possibility of interrupts.
C99 has no concept of threads and thus none of thread-safety either; only C11 does.
In C99 the only data type that is interrupt-safe is sig_atomic_t, but evidently this says nothing about threads either.
Generally, you are mistaken in attempting to access data structures concurrently like this; volatile is no guarantee at all that you receive sensible data. There is no guarantee of atomicity for any of these operations, even in C11, so you could e.g. be in a situation where the lower half of a pointer value has already been written but not the upper half. This would give you a completely bogus result. Since such a thing might happen only once in a million runs, or only under special circumstances (heavy load, e.g.), it can lead to bugs that are very difficult to trace.
Don't do that.
C11 gives you new tools to handle such things, in particular atomic operations. It is not completely implemented but many compilers already have extensions that could help you. I have wrapped some of these in the P99 macro package, so with certain compilers you could start to use these features as of today.
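For instance, with C11 <stdatomic.h> (or the corresponding compiler extensions) the buffer index could be reserved atomically before copying, so two interrupting contexts can never claim the same bytes. A minimal sketch under the assumption that head is an atomic index into a data array; unlike the original push_string() it simply fails instead of truncating:

#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

typedef struct {
    char          *data;
    size_t         size;
    atomic_size_t  head;   /* next free index, updated atomically */
} file_buffer;

bool push_string_atomic(file_buffer *cb, const char *message, size_t msglen)
{
    /* atomically claim msglen bytes; no other context can claim the same range */
    size_t start = atomic_fetch_add(&cb->head, msglen);
    if (start + msglen > cb->size) {
        atomic_fetch_sub(&cb->head, msglen);   /* give the claimed space back */
        return false;
    }
    memcpy(cb->data + start, message, msglen);
    return true;
}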
Think about signals interrupting signals... if you really need that:
You could block all relevant signals while in push_string().
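A sketch of that first option for calls made outside a handler, assuming POSIX signals and reusing the push_string()/file_buffer from the question; for the handlers themselves the same effect can be had by listing the other signals in sa_mask when installing them with sigaction():

#include <signal.h>
#include <stdbool.h>

bool push_string_blocked(file_buffer *cb, const char *message, unsigned short msglen)
{
    sigset_t all, old;
    sigfillset(&all);                       /* block every blockable signal */
    sigprocmask(SIG_BLOCK, &all, &old);     /* enter the critical section */
    bool ok = push_string(cb, message, msglen);
    sigprocmask(SIG_SETMASK, &old, NULL);   /* restore the previous mask */
    return ok;
}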
Another, application-dependent possibility might be moving the signal handler code into the main 'thread' (the signal handlers just generate 'events' that wake up the main thread of execution). I don't have enough information about your app to say whether that is a good choice or not.