Reading an update across threads - c

In my application, I have a block of shared memory that one thread periodically writes to and another thread periodically reads (and then resets to 0):
Thread 1:
#onevent:
__atomic_store(addr, val, __ATOMIC_SEQ_CST);
Thread 2:
while((val = __atomic_exchange_n(addr, 0, __ATOMIC_SEQ_CST)) == 0);
... work on val
I find that occasionally thread 2 spins forever.
In addition, if I place any kind of debugging statement, say a print of addr after the atomic store or after each atomic exchange, everything works fine (so it seems to be some kind of race condition).
I'm really stuck, since the same code in a separate, isolated program seems to work fine. Any help would be much appreciated. For reference, I am running on a high-core-count dual-socket node.

Related

Are condition signals better performance-wise than semaphores in multi-release cases?

I currently have some working code that computes a histogram using semaphores.
Here's a rough pseudocode outline:
// init multi and one to 0, initial lock state
void* helper(void*)
{
sem_wait(multi); // wait for one to finish its work before starting
cnt++;
...
cnt--;
if (cnt == 0) sem_post(one); // release one when the work is finished
}
compute_histogram(void*)
{
... initialize globals that multi will be using ...
for (all threads) { sem_post(multi); } // releases every thread waiting at multi
sem_wait(one); // makes one wait, before returning, until the helpers' work has finished
return;
}
The speedup over the single-threaded version is there, about 6-8x, though a less threaded, less complicated version I had did about the same; still, I can't help but think that I could be doing more. I'm extremely new to multithreading (I learned it this week). Looking at the man pages for pthread_cond, I saw that pthread_cond_broadcast() releases all waiting threads at once, which seems much quicker than the for loop of sem_post() calls, since, as I understand it, each of those must call into the OS to release exactly one thread.
My questions are:
Would broadcasting/condition variables be better suited to this case, or substantially faster? I understand that semaphores combine waiting and mutual exclusion in one primitive, and that I could instead break this down into condition variables and mutexes.
How would the initialization of such an implementation look? I believe I understand how the signal and wait calls work, but I struggle to see how they relate to the mutex that must be passed in, and how the condition variable itself is initialized. I would be interested in pseudocode and an explanation of what it's doing.

Do any mechanisms, similar to pthread_barrier, exist that allow thread communication across functions?

I'm trying to find out if there exists some sort of barrier-like mechanism to prevent threads in one function from proceeding until some condition has been met.
I attempted to implement this as follows:
function1 (multiple threads are here):
{...
while(suspended){} /* runs forever while suspended = 1, could also do while(!__sync_bool_compare_and_swap(&suspended, 1, 0)) {} */
...}
function2(one thread is here):
{...updates some resource the function1 threads need
suspended = 0; // takes out the suspended factor, could also do atomically
while(!donewithwork){} // will be released upon end of work of func1
return;
}
This implementation typically doesn't work and causes a full program crash. Looking in gdb wasn't very enlightening: the global variable "suspended" has been set to 0, yet in both functions all threads are still stuck in their while loops.
Online, I found that there is a family of "pthread_cond" functions devoted to this, but after talking to some people, I found that blocking several threads and releasing them that way ends up having the same cost (in terms of system calls) as the following implementation:
// init multi and one to 0, initial lock state
void* helper(void*)
{
sem_wait(multi); // wait for one to finish its work before starting
cnt++;
...
cnt--;
if (cnt == 0) sem_post(one); // release one when the work is finished
}
compute_histogram(void*)
{
... initialize globals that multi will be using ...
for (all threads) { sem_post(multi); } // releases every thread waiting at multi
sem_wait(one); // makes one wait, before returning, until the helpers' work has finished
return;
}
This gives a significant speedup (6-8x) over the single-threaded version, but the system calls required by the semaphores add significant overhead.
I'm wondering whether there is a pthread_barrier-like mechanism that would release the threads once a condition has been met (by some other thread).

How to use sched_yield() properly?

For an assignment, I need to use sched_yield() to synchronize threads. I understand a mutex lock/condition variables would be much more effective, but I am not allowed to use those.
The only functions we are allowed to use are sched_yield(), pthread_create(), and pthread_join(). We cannot use mutexes, locks, semaphores, or any type of shared variable.
I know sched_yield() is supposed to relinquish the CPU so another thread can run, so it should move the calling thread to the back of the run queue.
The code below is supposed to print 'abc' in order and then the newline after all three threads have executed. I looped sched_yield() in functions b() and c() because it wasn't working as I expected, but I'm pretty sure all that does is delay the printing because the loop runs so many times, not because sched_yield() is working.
The server it needs to run on has 16 CPUs. I saw somewhere that sched_yield() may immediately assign the thread to a new CPU.
Essentially I'm unsure of how, using only sched_yield(), to synchronize these threads given everything I could find and troubleshoot with online.
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <sched.h>
void* a(void*);
void* b(void*);
void* c(void*);
int main( void ){
pthread_t a_id, b_id, c_id;
pthread_create(&a_id, NULL, a, NULL);
pthread_create(&b_id, NULL, b, NULL);
pthread_create(&c_id, NULL, c, NULL);
pthread_join(a_id, NULL);
pthread_join(b_id, NULL);
pthread_join(c_id, NULL);
printf("\n");
return 0;
}
void* a(void* ret){
printf("a");
return ret;
}
void* b(void* ret){
for(int i = 0; i < 10; i++){
sched_yield();
}
printf("b");
return ret;
}
void* c(void* ret){
for(int i = 0; i < 100; i++){
sched_yield();
}
printf("c");
return ret;
}
There are 4 cases:
a) the scheduler doesn't use multiplexing (e.g. doesn't use "round robin" but uses "highest priority thread that can run does run", or "earliest deadline first", or ...) and sched_yield() does nothing.
b) the scheduler does use multiplexing in theory, but you have more CPUs than threads so the multiplexing doesn't actually happen, and sched_yield() does nothing. Note: with 16 CPUs and 3 threads, this is likely what you'd get for the "default scheduling policy" on an OS like Linux: sched_yield() just does a "Hrm, no other thread I could use this CPU for, so I guess the calling thread can keep using the same CPU!".
c) the scheduler does use multiplexing and there's more threads than CPUs, but to improve performance (avoid task switches) the scheduler designer decided that sched_yield() does nothing.
d) sched_yield() does cause a task switch (yielding the CPU to some other task), but that is not enough to do any kind of synchronization on its own (e.g. you'd need an atomic variable or something for the actual synchronization, maybe like "while( atomic_variable_not_set_by_other_thread ) { sched_yield(); }"). Note that with an atomic variable (introduced in C11) it'd work without sched_yield(); the sched_yield() (if it does anything) merely makes busy waiting less awful/wasteful.
"Essentially I'm unsure of how, using only sched_yield(), to synchronize these threads given everything I could find and troubleshoot with online."
That would be because sched_yield() is not well suited to the task. As I wrote in comments, sched_yield() is about scheduling, not synchronization. There is a relationship between the two, in the sense that synchronization events affect which threads are eligible to run, but that goes in the wrong direction for your needs.
You are probably looking at the problem from the wrong end. You need each of your threads to wait to execute until it is their turn, and for them to do that, they need some mechanism to convey information among them about whose turn it is. There are several alternatives for that, but if "only sched_yield()" is taken to mean that no library functions other than sched_yield() may be used for that purpose then a shared variable seems the expected choice. The starting point should therefore be how you could use a shared variable to make the threads take turns in the appropriate order.
Flawed starting point
Here is a naive approach that might spring immediately to mind:
/* FLAWED */
void *b(void *data){
char *whose_turn = data;
while (*whose_turn != 'b') {
// nothing?
}
printf("b");
*whose_turn = 'c';
return NULL;
}
That is, the thread executes a busy loop, monitoring the shared variable to await it taking a value signifying that the thread should proceed. When it has done its work, the thread modifies the variable to indicate that the next thread may proceed. But there are several problems with that, among them:
1. Supposing that at least one other thread writes to the object designated by *whose_turn, the program contains a data race, and therefore its behavior is undefined. As a practical matter, a thread that once entered the body of the loop in that function might loop infinitely, notwithstanding any action by other threads.
2. Without making additional assumptions about thread scheduling, such as a fairness policy, it is not safe to assume that the thread that will make the needed modification to the shared variable will be scheduled in bounded time.
3. While a thread is executing the loop in that function, it prevents any other thread from executing on the same core, yet it cannot make progress until some other thread takes action. To the extent that we can assume preemptive thread scheduling, this is an efficiency issue and contributes to issue 2. However, if we assume neither preemptive thread scheduling nor the threads being scheduled each on a separate core, then this is an invitation to deadlock.
Possible improvements
The conventional and most appropriate way to do that in a pthreads program is with the use of a mutex and condition variable. Properly implemented, that resolves the data race (issue 1) and it ensures that other threads get a chance to run (issue 3). If that leaves no other threads eligible to run besides the one that will modify the shared variable then it also addresses issue 2, to the extent that the scheduler is assumed to grant any CPU to the process at all.
But you are forbidden to do that, so what else is available? Well, you could make the shared variable _Atomic. That would resolve the data race, and in practice it would likely be sufficient for the wanted thread ordering. In principle, however, it does not resolve issue 3, and as a practical matter, it does not use sched_yield(). Also, all that busy-looping is wasteful.
But wait! You have a clue in that you are told to use sched_yield(). What could that do for you? Suppose you insert a call to sched_yield() in the body of the busy loop:
/* (A bit) better */
void* b(void *data){
char *whose_turn = data;
while (*whose_turn != 'b') {
sched_yield();
}
printf("b");
*whose_turn = 'c';
return NULL;
}
That resolves issues 2 and 3, explicitly affording the possibility for other threads to run and putting the calling thread at the tail of the scheduler's thread list. Formally, it does not resolve issue 1 because sched_yield() has no documented effect on memory ordering, but in practice, I don't think it can be implemented without a (full) memory barrier. If you are allowed to use atomic objects then combining an atomic shared variable with sched_yield() would tick all three boxes. Even then, however, there would still be a bunch of wasteful busy-looping.
Final remarks
Note well that pthread_join() is a synchronization function; thus, as I understand the task, you may not use it to ensure that the main thread's output is printed last.
Note also that I have not spoken to how the main() function would need to be modified to support the approach I have suggested. Changes would be needed for that, and they are left as an exercise.

Odd behavior of Windows thread

I am combining GTK+, WinAPI and Winsock to create a graphical client-server interface, a waiting room. nUsers is a variable that holds the number of successfully connected clients.
Within a Windows thread, created by:
CreateThread(NULL, 0, action_to_users, NULL, 0, NULL);
I use a do-nothing while loop so that the thread waits until a user connects.
while(!nUsers);
However, it never gets past the loop, as if nUsers never became > 0. nUsers counts the number of connected clients properly, as I constantly monitor it and use it in a variety of different functions.
To prove my point, something even stranger happens.
If I make the loop
while(!nUsers) { printf("(%i)\n", nUsers); }
so that it spams the console with text (it doesn't matter what text, as long as it is not an empty string), it works as intended.
What could possibly be going on here?
Regarding the original problem: the compiler is free to cache the value of nUsers, since the variable is not modified within this loop. Marking the variable volatile prevents this optimization, as described here.
Regarding what you're trying to achieve - it looks like a producer-consumer pattern, where the thread(s) handling the sockets are producers and your GUI thread is a consumer. You can slow down your consumer loop to only loop when new data is available using:
Semaphores, as showcased here: the producer thread increments the semaphore's count, while the consumer decrements it upon dequeuing a work item.
Events, like here: the producer thread signals an event while the consumer thread waits for it to become signalled. You can put the work in a queue to allow more than one item to be processed.
Condition variables (Vista+): here a variable you're waiting on gets signalled when certain criteria are met.

Unpredictable results with two threads calling a function

I'm hoping someone can help me solve some unpredictable behaviour in a C program I need to fix:
I have two Xenomai real-time tasks (threads) that wait until they receive an incoming message from one of two CAN buses.
Each task calls a function checkMessageNumber(); however, I'm getting unpredictable results.
Please note that I am using a priority-based, single-threaded system. One thread has priority over the other; however, one thread could be part-way through executing when the other thread takes priority.
In the future it is possible that the hardware could be upgraded to a multi-threading system; however, this part of the program would still be confined to a single thread (one CPU core).
It is my understanding that each thread would invoke its own instance of this function, so I don't know what's happening.
int getMessageIndex(unsigned int msg_number)
{
unsigned int i = 0;
while(i < global_number_message_boxes)
{
if (global_message_box[i].id == msg_number)
return i; // matched the msg number, so return the index number
i++;
}
return -1; // no match found
}
Originally this function was highly unpredictable, and as messages streamed in and were processed by the two tasks (depending on which hardware bus the message came from), this function would sometimes return -1 even though the incoming 'msg_number' did match an 'id' in the 'global_message_box' struct.
I was able to make it work better by replacing 'global_number_message_boxes' with an integer constant:
e.g. while (i < 50)
however the function still sometimes returns -1 even though there should be a match.
I am only reading global variables, so why are they getting corrupted? What do I need to learn about this?
My idea is to simplify things so the incoming 'msg_number' simply just is the 'id' in the 'global_message_box'.
Each thread will then write to the struct directly without having to check which 'id' to write to.
How important is it to use a mutex? Due to the system design, the two threads never write to the same part of the struct, so I am unsure whether it matters.
Thanks.
This likely comes down to a lack of thread synchronisation around the global struct. You say this function is just reading. Sure, but what if another thread calls another function that writes global_number_message_boxes or global_message_box? In a system where multiple threads access globals, the safest rule is: put a lock around every access. Maybe the platform you use even supports read/write locks, so multiple threads can read at the same time as long as none is writing.
Locks and semaphores will be your friends here. Writing data from two threads is going to cause any number of problems.
When the thread enters the function, you will have to BLOCK the other threads and UNBLOCK those threads at exit. This will ensure thread-safe operations and produce consistent results.
