As far as I know, a mutex is used to synchronize threads that share the same data, following the principle that when one thread is using that data, all other threads should be blocked from using the common resource until it is unlocked. Recently, in a blog post, I saw code explaining this concept, and some commenters wrote that blocking all the threads while one thread is accessing the resource is a very bad idea and that it goes against the concept of threading, which seems true somehow. So my question is: how do you synchronize threads without blocking?
Here is the link of that blogpost
http://www.thegeekstuff.com/2012/05/c-mutex-examples/
You cannot synchronize threads without blocking, by the very definition of synchronization. However, good synchronization technique will limit the scope of where things are blocked to the absolute minimum. To illustrate, and to point out exactly why the article is wrong, consider the following:
From the article:
pthread_t tid[2];
int counter;
pthread_mutex_t lock;

void* doSomeThing(void *arg)
{
    pthread_mutex_lock(&lock);

    unsigned long i = 0;
    counter += 1;
    printf("\n Job %d started\n", counter);
    for (i = 0; i < (0xFFFFFFFF); i++);
    printf("\n Job %d finished\n", counter);

    pthread_mutex_unlock(&lock);
    return NULL;
}
What it should be:
pthread_t tid[2];
int counter;
pthread_mutex_t lock;

void* doSomeThing(void *arg)
{
    unsigned long i = 0;

    pthread_mutex_lock(&lock);
    counter += 1;
    int myJobNumber = counter;
    pthread_mutex_unlock(&lock);

    printf("\n Job %d started\n", myJobNumber);
    for (i = 0; i < (0xFFFFFFFF); i++);
    printf("\n Job %d finished\n", myJobNumber);
    return NULL;
}
Notice that in the article, the work being done (the pointless for loop) is done while holding the lock. This is complete nonsense, since the work is supposed to be done concurrently. The reason the lock is needed is only to protect the counter variable. Thus the threads only need to hold the lock when changing that variable as in the second example.
Mutex locks protect critical sections of code: areas that only one thread at a time should touch, and all the other threads must block if they try to enter the critical section at the same time. However, if thread 1 is in the critical section and thread 2 is not, then it's perfectly fine for both to run concurrently.
The term you are looking for is lock-free data structures.
The general idea is that the state shared between threads is contorted into one of those.
Implementations vary and are often compiler- or platform-specific. For example, MSVC has a set of _Interlocked* functions to perform simple atomic operations.
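In portable C11, <stdatomic.h> plays the same role as those intrinsics. As a rough sketch (not from the article), a shared counter like the one in the example above could be updated with no lock at all:

#include <stdatomic.h>
#include <stddef.h>

static atomic_int counter = 0;           /* C11 atomic: increments cannot be torn or lost */

void* doSomeThing(void *arg)
{
    /* atomic_fetch_add returns the previous value, so add 1 to get this job's number */
    int myJobNumber = atomic_fetch_add(&counter, 1) + 1;

    /* ... do the actual work concurrently, using myJobNumber ... */
    (void) arg;
    return NULL;
}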
blocking all the threads while one thread is accessing the resources is a very bad idea and it goes against the concept of threading which is true somehow
This is a fallacy. Locks block only contending threads, allowing all non-contending threads to run concurrently. Running the work that's the most efficient to run at any particular time rather than forcing any particular ordering is not against the concept of threading at all.
Now if so many of your threads contend so badly that blocking contending threads is harming performance, there are two possibilities:
1. Most likely you have a very poor design and you should fix it. Don't blame the locks for a high-contention design.
2. You are in the rare case where other synchronization mechanisms are more appropriate (such as lock-free collections). But this requires significant expertise and analysis of the specific use case to find the best solution.
Generally, if your use case is a perfect fit for atomics, use them. Otherwise, mutexes (possibly in combination with condition variables) should be your first thought. That will cover 99% of the cases a typical multi-threaded C programmer will face.
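That first-choice combination looks roughly like this (a minimal sketch; the ready flag and the function names are just illustrative):

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int ready = 0;                    /* the shared state the mutex protects */

void wait_until_ready(void)
{
    pthread_mutex_lock(&lock);
    while (!ready)                       /* loop guards against spurious wakeups */
        pthread_cond_wait(&cond, &lock); /* atomically unlocks, sleeps, relocks */
    pthread_mutex_unlock(&lock);
}

void signal_ready(void)
{
    pthread_mutex_lock(&lock);
    ready = 1;
    pthread_cond_signal(&cond);          /* wake one waiter; broadcast wakes all */
    pthread_mutex_unlock(&lock);
}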
You can use pthread_mutex_trylock() to attempt a lock. If that fails then you know you would have blocked. You can't do what you want to do, but your thread is not blocked, so it can attempt to do something else. I think most of the comments on that blog are about avoiding contention between threads though, i.e. that maximising multi-threaded performance is about avoiding threads working on the same resource at the same time. If you avoid that by design then by design you don't need locks as you never have contention.
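A rough sketch of that trylock approach (the function name is illustrative):

#include <errno.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void poll_shared_resource(void)
{
    int rc = pthread_mutex_trylock(&lock);
    if (rc == 0) {
        /* got the lock: touch the shared resource, then release it */
        pthread_mutex_unlock(&lock);
    } else if (rc == EBUSY) {
        /* lock is currently held: do some other useful work and try again later */
    }
}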
There are a number of tricks that can be used to avoid concurrency bottlenecks.
Immutable Data Structures. The idea here is that concurrent reads are okay, but writes are not. To implement something like this you basically need to think of business units as factories for these immutable data structures, which are then used by other business units.
Asynchronous Callbacks. This is the essence of event-driven development. If you have concurrent tasks, use the observer pattern to execute some logic when a resource becomes available. Basically we execute some code up until a shared resource is needed, then add a listener for when the resource becomes available; a rough sketch of the registration idea follows below. This typically results in less readable code and heavier strain on the stack, but you never block a thread waiting on a resource. If you have enough tasks ready to keep the CPUs running hot, this pattern will do it for you.
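In plain C, the registration side might look something like this (hypothetical names; in a real program the listener list itself would still need some protection):

#include <stdlib.h>

typedef void (*callback_fn)(void *resource);

struct listener { callback_fn fn; struct listener *next; };
static struct listener *listeners;

/* register some deferred work instead of blocking until the resource exists */
void on_resource_ready(callback_fn fn)
{
    struct listener *l = malloc(sizeof *l);
    if (!l)
        return;
    l->fn = fn;
    l->next = listeners;
    listeners = l;
}

/* called by whichever thread produces the resource */
void resource_became_available(void *resource)
{
    for (struct listener *l = listeners; l; l = l->next)
        l->fn(resource);                 /* run the deferred work now */
}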
Even with these tools, you'll never completely remove the need for some synchronization (counters come to mind).
I currently have some (working) code that computes a histogram using semaphores.
Here's a rough pseudocode outline:
// init multi and one to 0, the initial locked state

void* helper(void*)
{
    sem_wait(multi);            // wait for compute_histogram to finish its setup before starting
    cnt++;
    ...
    cnt--;
    if (cnt == 0) sem_post(one);   // release "one" when the work is finished
}

compute_histogram(void*)
{
    ... initialize globals that the helpers will be using ...
    for (all threads) { sem_post(multi); }   // releases every thread waiting at multi
    sem_wait(one);              // wait here until the helpers' work has finished
    return;
}
Performance increases compared to the single-threaded version are there, about 6-8x, though I had a less-threaded, less complicated version that did about the same; still, I can't help but think I could be doing more. I'm extremely new to multi-threading (I learned it this week) and looked at the man pages for pthread_cond, where I saw that pthread_cond_broadcast() releases all waiting threads at once. That seems much quicker than the for loop over sem_post, which must call into the OS every time in order to release exactly one thread.
My questions are:
Would broadcasting/condition variables be better suited to this case and substantially faster? I understand that semaphores combine the features of waiting and mutual exclusion all in one, and that I could instead break things down into condition variables and mutexes.
How would the initialization of such an implementation look? I believe I understand how the signal and wait calls work, but I struggle to see their relation to the mutex that must be passed in, as well as how the cond variable itself is initialized. I would be interested in pseudocode and an explanation of what it's doing.
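For reference, one shape such a mutex + condition variable version might take (a rough sketch with illustrative names, assuming the number of helper threads is known up front):

#include <pthread.h>

static pthread_mutex_t m    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  go   = PTHREAD_COND_INITIALIZER;   /* plays the role of "multi" */
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;   /* plays the role of "one"   */
static int start = 0;       /* set once the globals are initialized */
static int remaining;       /* helpers that have not finished yet   */

void* helper(void *arg)
{
    pthread_mutex_lock(&m);
    while (!start)                       /* wait (in a loop) for the broadcast */
        pthread_cond_wait(&go, &m);
    pthread_mutex_unlock(&m);

    /* ... this helper's share of the histogram work ... */

    pthread_mutex_lock(&m);
    if (--remaining == 0)
        pthread_cond_signal(&done);      /* last helper wakes the main thread */
    pthread_mutex_unlock(&m);
    return arg;
}

void compute_histogram(int nthreads)
{
    /* ... initialize globals that the helpers will be using ... */
    pthread_mutex_lock(&m);
    remaining = nthreads;
    start = 1;
    pthread_cond_broadcast(&go);         /* one call releases every waiting helper */
    while (remaining > 0)
        pthread_cond_wait(&done, &m);    /* sleep until the helpers are all finished */
    pthread_mutex_unlock(&m);
}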
For an assignment, I need to use sched_yield() to synchronize threads. I understand a mutex lock/condition variables would be much more effective, but I am not allowed to use those.
The only functions we are allowed to use are sched_yield(), pthread_create(), and pthread_join(). We cannot use mutexes, locks, semaphores, or any type of shared variable.
I know sched_yield() is supposed to relinquish the CPU so another thread can run, so it should move the calling thread to the back of the run queue.
The code below is supposed to print 'abc' in order and then the newline after all three threads have executed. I looped sched_yield() in functions b() and c() because it wasn't working as I expected, but I'm pretty sure all that is doing is delaying the printing because the functions spin so many times, not because sched_yield() is actually synchronizing anything.
The server it needs to run on has 16 CPUs. I saw somewhere that sched_yield() may immediately assign the thread to a new CPU.
Essentially I'm unsure of how, using only sched_yield(), to synchronize these threads given everything I could find and troubleshoot with online.
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <sched.h>
void* a(void*);
void* b(void*);
void* c(void*);
int main( void ){
    pthread_t a_id, b_id, c_id;

    pthread_create(&a_id, NULL, a, NULL);
    pthread_create(&b_id, NULL, b, NULL);
    pthread_create(&c_id, NULL, c, NULL);

    pthread_join(a_id, NULL);
    pthread_join(b_id, NULL);
    pthread_join(c_id, NULL);

    printf("\n");
    return 0;
}

void* a(void* ret){
    printf("a");
    return ret;
}

void* b(void* ret){
    for(int i = 0; i < 10; i++){
        sched_yield();
    }
    printf("b");
    return ret;
}

void* c(void* ret){
    for(int i = 0; i < 100; i++){
        sched_yield();
    }
    printf("c");
    return ret;
}
There are 4 cases:
a) the scheduler doesn't use multiplexing (e.g. doesn't use "round robin" but uses "highest priority thread that can run does run", or "earliest deadline first", or ...) and sched_yield() does nothing.
b) the scheduler does use multiplexing in theory, but you have more CPUs than threads so the multiplexing doesn't actually happen, and sched_yield() does nothing. Note: With 16 CPUs and 2 threads, this is likely what you'd get for "default scheduling policy" on an OS like Linux - the sched_yield() just does a "Hrm, no other thread I could use this CPU for, so I guess the calling thread can keep using the same CPU!").
c) the scheduler does use multiplexing and there's more threads than CPUs, but to improve performance (avoid task switches) the scheduler designer decided that sched_yield() does nothing.
d) sched_yield() does cause a task switch (yielding the CPU to some other task), but that is not enough to do any kind of synchronization on its own (e.g. you'd need an atomic variable or something for the actual synchronization - maybe like "while( atomic_variable_not_set_by_other_thread ) { sched_yield(); }"). Note that with an atomic variable (introduced in C11) it'd work without sched_yield() - the sched_yield() (if it does anything) merely makes the busy waiting less awful/wasteful.
Essentially I'm unsure of how, using only sched_yield(), to
synchronize these threads given everything I could find and
troubleshoot with online.
That would be because sched_yield() is not well suited to the task. As I wrote in comments, sched_yield() is about scheduling, not synchronization. There is a relationship between the two, in the sense that synchronization events affect which threads are eligible to run, but that goes in the wrong direction for your needs.
You are probably looking at the problem from the wrong end. You need each of your threads to wait to execute until it is their turn, and for them to do that, they need some mechanism to convey information among them about whose turn it is. There are several alternatives for that, but if "only sched_yield()" is taken to mean that no library functions other than sched_yield() may be used for that purpose then a shared variable seems the expected choice. The starting point should therefore be how you could use a shared variable to make the threads take turns in the appropriate order.
Flawed starting point
Here is a naive approach that might spring immediately to mind:
/* FLAWED */
void *b(void *data){
    char *whose_turn = data;

    while (*whose_turn != 'b') {
        // nothing?
    }
    printf("b");
    *whose_turn = 'c';
    return NULL;
}
That is, the thread executes a busy loop, monitoring the shared variable to await it taking a value signifying that the thread should proceed. When it has done its work, the thread modifies the variable to indicate that the next thread may proceed. But there are several problems with that, among them:
1. Supposing that at least one other thread writes to the object designated by *whose_turn, the program contains a data race, and therefore its behavior is undefined. As a practical matter, a thread that once entered the body of the loop in that function might loop infinitely, notwithstanding any action by other threads.
2. Without making additional assumptions about thread scheduling, such as a fairness policy, it is not safe to assume that the thread that will make the needed modification to the shared variable will be scheduled in bounded time.
3. While a thread is executing the loop in that function, it prevents any other thread from executing on the same core, yet it cannot make progress until some other thread takes action. To the extent that we can assume preemptive thread scheduling, this is an efficiency issue and contributory to (2). However, if we assume neither preemptive thread scheduling nor the threads being scheduled each on a separate core then this is an invitation to deadlock.
Possible improvements
The conventional and most appropriate way to do that in a pthreads program is with the use of a mutex and condition variable. Properly implemented, that resolves the data race (issue 1) and it ensures that other threads get a chance to run (issue 3). If that leaves no other threads eligible to run besides the one that will modify the shared variable then it also addresses issue 2, to the extent that the scheduler is assumed to grant any CPU to the process at all.
But you are forbidden to do that, so what else is available? Well, you could make the shared variable _Atomic. That would resolve the data race, and in practice it would likely be sufficient for the wanted thread ordering. In principle, however, it does not resolve issue 3, and as a practical matter, it does not use sched_yield(). Also, all that busy-looping is wasteful.
But wait! You have a clue in that you are told to use sched_yield(). What could that do for you? Suppose you insert a call to sched_yield() in the body of the busy loop:
/* (A bit) better */
void* b(void *data){
    char *whose_turn = data;

    while (*whose_turn != 'b') {
        sched_yield();
    }
    printf("b");
    *whose_turn = 'c';
    return NULL;
}
That resolves issues 2 and 3, explicitly affording the possibility for other threads to run and putting the calling thread at the tail of the scheduler's thread list. Formally, it does not resolve issue 1 because sched_yield() has no documented effect on memory ordering, but in practice, I don't think it can be implemented without a (full) memory barrier. If you are allowed to use atomic objects then combining an atomic shared variable with sched_yield() would tick all three boxes. Even then, however, there would still be a bunch of wasteful busy-looping.
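For completeness, a sketch of that combination (a C11 _Atomic turn variable plus sched_yield(); still busy-waiting, just less wastefully):

#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

void* b(void *data)
{
    _Atomic char *whose_turn = data;     /* main() would pass an _Atomic char initialized to 'a' */

    while (atomic_load(whose_turn) != 'b')
        sched_yield();                   /* give other threads a chance to run meanwhile */
    printf("b");
    atomic_store(whose_turn, 'c');       /* hand the turn to thread c */
    return NULL;
}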
Final remarks
Note well that pthread_join() is a synchronization function, thus, as I understand the task, you may not use it to ensure that the main thread's output is printed last.
Note also that I have not spoken to how the main() function would need to be modified to support the approach I have suggested. Changes would be needed for that, and they are left as an exercise.
I am looking for a lock implementation that degrades gracefully in the situation where you have two threads that constantly try to release and re-acquire the same lock, at a very high frequency.
Of course it is clear that in this case the two threads won't significantly progress in parallel. Theoretically, the best result would be achieved by running the whole thread 1, and then the whole thread 2, without any switching---because switching just creates massive overhead here. So I am looking for a lock implementation that would handle this situation gracefully by keeping the same thread running for a while before switching, instead of constantly switching.
Long version of the question
As I would myself be tempted to answer this question by "your program is broken, don't do that", here is some justification about why we end up in this kind of situation.
The lock is a "single global lock", i.e. a very coarse lock. (It is the Global Interpreter Lock (GIL) inside PyPy, but the question is about how to do it in general, say if you have a C program.)
We have the following situation:
There is constantly contention. That's expected in this case: the lock is a global lock that needs to be acquired for most threads to progress. So we expect that a large fraction of them are waiting for the lock. Only one of these threads can progress.
The thread that holds the lock might sometimes do bursts of short releases. A typical example would be if this thread makes repeated calls to "something external", e.g. many short writes to a file. Each of these writes is usually completed very quickly. The lock still has to be released just in case this external thing turns out to take longer than expected (e.g. if the write actually needs to wait for disk I/O), so that another thread can acquire the lock in this case.
If we use some standard mutex for the lock, then the lock will often switch to another thread as soon as the owner releases the lock. But the problem is what if the program runs several threads that each wants to do a long burst of short releases. The program ends up spending most of its time switching the lock between CPUs.
It is much faster to run the same thread for a while before switching, at least as long as the lock is released for very short periods of time. (E.g. on Linux/pthread a release immediately followed by an acquire will sometimes re-acquire the lock instantly even if there are other waiting threads; but we'd like this result in a large majority of cases, not just sometimes.)
Of course, as soon as the lock is released for a longer period of time, then it becomes a good idea to transfer ownership of the lock to a different thread.
So I'm looking for general ideas about how to do that. I guess it should exist already somewhere---in a paper, or in some multithreading library?
For reference, PyPy tries to implement something like this by polling: the lock is just a global variable, manipulated with synchronized compare-and-swap but no OS calls; one of the waiting threads is given the role of "stealer"; that "stealer" thread wakes up every 100 microseconds to check the variable. This is not horribly bad (it costs maybe 1-2% of CPU time in addition to the 100% consumed by the running thread). This actually implements what I'm asking for here, but the problem is that it is a hack that doesn't cleanly support more traditional uses of locks: for example, if thread 1 tries to send a message to thread 2 and waits for the answer, the two thread switches will take on average 100 microseconds each---which is far too much if the message is processed quickly.
For reference, let me describe how we finally implemented it. I was unsure about it as it still feels like a hack, but it seems to work for PyPy's use case in practice.
We did it as described in the last paragraph of the question, with one addition: the "stealer" thread, which checks some global variable every 100 microseconds, does this by calling pthread_cond_timedwait or WaitForSingleObject with a regular, system-provided mutex, with a timeout of 100 microseconds. This gives a "composite lock" with both the global variable and the regular mutex. The "stealer" will succeed in stealing the "lock" either when it notices a value 0 in the global variable (checked every 100 microseconds), or immediately if the regular mutex is released by another thread.
It's then a matter of choosing how to release the composite lock on a case-by-case basis. Most external functions (writes to files, etc.) are expected to generally complete quickly, so we release and re-acquire the composite lock by writing to the global variable. Only around a few specific functions---like sleep() or lock_acquire()---where we expect the calling thread to often block, do we release the composite lock by actually releasing the mutex instead.
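Very roughly, the composite lock can be pictured like this (an illustrative sketch, not PyPy's actual code; it assumes C11 atomics, keeps the 100-microsecond polling interval, and lets every waiter poll instead of electing a single "stealer"):

#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int gil = 0;                /* the global variable: 0 = free, 1 = held */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

static int try_take(void)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&gil, &expected, 1);
}

/* Waiting side: re-check the variable every 100 microseconds, or wake up
   immediately when a slow release signals the condition variable. */
void composite_acquire(void)
{
    if (try_take())
        return;
    pthread_mutex_lock(&m);
    while (!try_take()) {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_nsec += 100 * 1000;                     /* 100 microseconds */
        if (ts.tv_nsec >= 1000000000L) { ts.tv_sec++; ts.tv_nsec -= 1000000000L; }
        pthread_cond_timedwait(&c, &m, &ts);
    }
    pthread_mutex_unlock(&m);
}

/* Fast release around short external calls: just a store, no OS call. */
void composite_release_fast(void)
{
    atomic_store(&gil, 0);
}

/* Slow release around calls expected to block (sleep(), lock acquire, ...):
   also signal the condvar so a waiter takes over immediately. */
void composite_release_slow(void)
{
    atomic_store(&gil, 0);
    pthread_mutex_lock(&m);
    pthread_cond_signal(&c);
    pthread_mutex_unlock(&m);
}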
If I understand the problem statement, you are asking the kernel scheduler to do an educated guess on whether your userspace application "hot" thread will try to reacquire the lock in the very near future, to avoid implicitly preempting it by allowing a "not-so-hot" thread to acquire the mutex.
I wouldn't know how the kernel could do that. The only two things that come to my mind:
Do not release the mutex unless the hot thread is actually transitioning to idle (an application-specific condition). On Linux you can use CLOCK_MONOTONIC_COARSE to try to reduce the overhead of checking the wall clock when implementing some sort of timer.
Increase the hot thread's priority. This is more of a mitigation strategy, in an attempt to reduce the amount of preemption of the hot thread. If the "hot" thread can be identified, you could do something like:
pthread_t thread = pthread_self();

// Set max priority, FIFO scheduling
struct sched_param params;
params.sched_priority = sched_get_priority_max(SCHED_FIFO);
int rv = pthread_setschedparam(thread, SCHED_FIFO, &params);
if (rv != 0) {
    // Print error
    // ...
}
Spinlocks might work better in your case. They avoid context switching and are highly efficient if the threads are likely to hold the lock only for a short duration of time.
For this very reason, they are widely used in OS kernels.
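A minimal spinlock in C11 looks roughly like this (only sensible when the critical section is very short, and without the preemption-disabling tricks kernels add):

#include <stdatomic.h>

static atomic_flag spin = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&spin, memory_order_acquire))
        ;                                /* busy-wait; a CPU "pause" hint could go here */
}

void spin_unlock(void)
{
    atomic_flag_clear_explicit(&spin, memory_order_release);
}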
I found another way to synchronize threads in the source code of strongswan. It synchronizes threads by changing the thread's scheduling policy to SCHED_FIFO. Does this have any advantage over the mutex approach?
The code:
int oldpolicy;
struct sched_param oldparams, params;

pthread_getschedparam(thread_id, &oldpolicy, &oldparams);
params.__sched_priority = sched_get_priority_max(SCHED_FIFO);
pthread_setschedparam(thread_id, SCHED_FIFO, &params);
...
critical section
...
pthread_setschedparam(thread_id, oldpolicy, &oldparams);
PS: strongswan uses a malloc hook to detect memory leaks. To support multi-threading, it uses this approach to synchronize the threads.
PPS: It seems that they have since modified the code. That piece of code is from Strongswan 4.5.0.
That does not synchronize anything!
What this does is prevent the thread from being scheduled off the CPU while the critical section is running. Since we now have multiple CPUs, and a different thread can run on another CPU, it does not exclude anything at all. It does not even completely prevent preemption: the thread can still sleep if it has to wait on a page fault or other I/O.
The reason for it is to avoid starving other threads when something very important is being calculated without which the other threads can't continue. It does help that cause, but it's a very specialized case (search for "priority inversion").
It's broken if you have more than one core, unless you lock all threads that might conflict to the same core. And, even then, it's still broken if you block on I/O. (For example, a page fault.) Yuck.
I am developing a user-level thread library as part of a project. I came up with an approach to implement mutexes and would like to hear your views before going ahead with it. Basically, I need to implement just 3 functions in my library:
mutex_init, mutex_lock and mutex_unlock
I thought my mutex_t structure would look something like
typedef struct
{
    int available;                // indicates whether the mutex is locked or unlocked
    queue listofwaitingthreads;
    gtthread_t owningthread;
} mutex_t;
In my mutex_lock function, I will first check in a while loop whether the mutex is available. If it is not, I will yield the processor so the next thread can execute.
In my mutex_unlock function, I will check whether the owning thread is the current thread. If it is, I will set available to 0.
Is this the way to go about it? Also, what about deadlock? Should I take care of those conditions in my user-level library, or should I leave it to the application programmers to write their code properly?
This won't work, because you have a race condition. If 2 threads try to grab the lock at the same time, both will see available == 0, and both will think they succeeded in taking the mutex.
If you want to do this properly, and without using an already-existing lock, you must use hardware-supported atomic operations like TAS (test-and-set), CAS (compare-and-swap), etc.
There are algorithms that give you mutual exclusion without such hardware support, but they rely on assumptions that often don't hold. For more details about this, I highly recommend reading Herlihy and Shavit's The Art of Multiprocessor Programming, chapter 7.
You shouldn't worry about deadlocks at this level - mutex locks should be simple enough, and there is an assumption that the programmer using them takes care not to cause deadlocks (advanced mutexes can check for self-deadlock, meaning a thread that calls lock twice without calling unlock in between).
Not only do you have to use atomic operations to read and modify the flag (as Eran pointed out), you also have to make sure your queue can handle concurrent accesses. This is not completely trivial - a sort of chicken-and-egg problem.
But if you really implemented this by spinning, you wouldn't even need such a queue. The access order to the lock would then be mostly random, though.
Just yielding would probably not be enough either; that can be quite costly if you have threads holding the lock for more than a few processor cycles. Consider using nanosleep with a small time value for the wait.
In general, a mutex implementation should look like:
Lock:
    while (trylock() == failed) {
        atomic_inc(waiter_cnt);
        atomic_sleep_if_locked();
        atomic_dec(waiter_cnt);
    }
Trylock:
    return atomic_swap(&lock, 1);
Unlock:
    atomic_store(&lock, 0);
    if (waiter_cnt) wakeup_sleepers();
Things get more complex if you want recursive mutexes, mutexes that can synchronize their own destruction (i.e. freeing the mutex is safe as soon as you get the lock), etc.
Note that atomic_sleep_if_locked and wakeup_sleepers correspond to FUTEX_WAIT and FUTEX_WAKE ops on Linux. The other atomics are probably CPU instructions, but could be system calls or kernel-assisted userspace function code, as in the case of Linux/ARM and the 0xffff0fc0 atomic compare-and-swap call.
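As a rough Linux-specific illustration of the pseudocode above (C11 atomics plus the raw futex syscall; a real implementation such as the ones in glibc or musl handles many more details):

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lock_word  = 0;     /* 0 = unlocked, 1 = locked */
static atomic_int waiter_cnt = 0;

static long futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static int trylock(void)
{
    return atomic_exchange(&lock_word, 1);    /* returns 0 on success, 1 if already locked */
}

void lock(void)
{
    while (trylock() != 0) {
        atomic_fetch_add(&waiter_cnt, 1);
        /* FUTEX_WAIT only sleeps if lock_word is still 1, so no wakeup can be missed */
        futex(&lock_word, FUTEX_WAIT, 1);
        atomic_fetch_sub(&waiter_cnt, 1);
    }
}

void unlock(void)
{
    atomic_store(&lock_word, 0);
    if (atomic_load(&waiter_cnt) > 0)
        futex(&lock_word, FUTEX_WAKE, 1);     /* wake one sleeper */
}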
You do not need atomic instructions for a user level thread library, because all the threads are going to be user level threads of the same process. So actually when your process is given the time slice to execute, you are running multiple threads during that time slice but on the same processor. So, no two threads are going to be in the library function at the same time. Considering that the functions for mutex are already in the library, mutual exclusion is guaranteed.