Sync thread by changing schedule priority - c

I found another way to synchronize threads in the source code of strongSwan. It synchronizes threads by changing the thread's scheduling policy to SCHED_FIFO. Does it have any advantage over using a mutex?
The code:
int oldpolicy;
struct sched_param oldparams, params;

/* save the thread's current scheduling policy and priority */
pthread_getschedparam(thread_id, &oldpolicy, &oldparams);

/* boost to the highest real-time FIFO priority */
params.sched_priority = sched_get_priority_max(SCHED_FIFO);
pthread_setschedparam(thread_id, SCHED_FIFO, &params);
...
critical section
...
/* restore the original policy and priority */
pthread_setschedparam(thread_id, oldpolicy, &oldparams);
PS: strongSwan uses malloc hooks to detect memory leaks. To support multiple threads, it uses this approach to synchronize them.
PPS: It seems that they have since modified the code. That piece of code is from strongSwan 4.5.0.

That does not synchronize anything!
What this does is prevent the thread from being scheduled off the CPU while the critical section is running. Since we now have multiple CPUs, and since a different thread can run on another CPU, it does not exclude anything at all. And it does not even completely prevent preemption; the thread can still sleep if it waits on a page fault or other I/O.
The reason for doing this is to avoid starving other threads while something very important is being calculated, without which the other threads can't continue. It does help that cause, but it's a very specialized case (search for "priority inversion").

It's broken if you have more than one core, unless you pin all threads that might conflict to the same core. And even then, it's still broken if you block on I/O (for example, a page fault). Yuck.
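For contrast, a minimal sketch of what actually provides mutual exclusion here: protect the critical section with a plain pthread mutex (names are illustrative, not strongSwan's):

#include <pthread.h>

static pthread_mutex_t cs_mutex = PTHREAD_MUTEX_INITIALIZER;

void do_critical_work(void)
{
    pthread_mutex_lock(&cs_mutex);    /* blocks until the lock is free */
    /* ... critical section ... */
    pthread_mutex_unlock(&cs_mutex);
}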

Related

Lock that handles a high-contention, high-frequency situation

I am looking for a lock implementation that degrades gracefully in the situation where you have two threads that constantly try to release and re-acquire the same lock, at a very high frequency.
Of course it is clear that in this case the two threads won't significantly progress in parallel. Theoretically, the best result would be achieved by running the whole thread 1, and then the whole thread 2, without any switching---because switching just creates massive overhead here. So I am looking for a lock implementation that would handle this situation gracefully by keeping the same thread running for a while before switching, instead of constantly switching.
Long version of the question
As I would myself be tempted to answer this question by "your program is broken, don't do that", here is some justification about why we end up in this kind of situation.
The lock is a "single global lock", i.e. a very coarse lock. (It is the Global Interpreter Lock (GIL) inside PyPy, but the question is about how to do it in general, say if you have a C program.)
We have the following situation:
There is constantly contention. That's expected in this case: the lock is a global lock that needs to be acquired for most threads to progress. So we expect that a large fraction of them are waiting for the lock. Only one of these threads can progress.
The thread that holds the lock might sometimes do bursts of short releases. A typical example would be a thread that makes repeated calls to "something external", e.g. many short writes to a file. Each of these writes usually completes very quickly. The lock still has to be released just in case this external thing turns out to take longer than expected (e.g. if the write actually needs to wait for disk I/O), so that another thread can acquire the lock in that case.
If we use some standard mutex for the lock, then ownership will often switch to another thread as soon as the owner releases it. The problem is when the program runs several threads that each want to do a long burst of short releases: the program ends up spending most of its time switching the lock between CPUs.
It is much faster to run the same thread for a while before switching, at least as long as the lock is released for very short periods of time. (E.g. on Linux/pthread a release immediately followed by an acquire will sometimes re-acquire the lock instantly even if there are other waiting threads; but we'd like this result in a large majority of cases, not just sometimes.)
Of course, as soon as the lock is released for a longer period of time, then it becomes a good idea to transfer ownership of the lock to a different thread.
So I'm looking for general ideas about how to do that. I guess it should exist already somewhere---in a paper, or in some multithreading library?
For reference, PyPy tries to implement something like this by polling: the lock is just a global variable, manipulated with synchronized compare-and-swap but no OS calls; one of the waiting threads is given the role of "stealer"; that "stealer" thread wakes up every 100 microseconds to check the variable. This is not horribly bad (it costs maybe 1-2% of CPU time in addition to the 100% consumed by the running thread). This actually implements what I'm asking for here, but the problem is that it is a hack that doesn't cleanly support more traditional uses of locks: for example, if thread 1 tries to send a message to thread 2 and waits for the answer, the two thread switches will take on average 100 microseconds each---which is far too much if the message is processed quickly.
For reference, let me describe how we finally implemented it. I was unsure about it as it still feels like a hack, but it seems to work for PyPy's use case in practice.
We did it as described in the last paragraph of the question, with one addition: the "stealer" thread, which checks some global variable every 100 microseconds, does this by calling pthread_cond_timedwait or WaitForSingleObject with a regular, system-provided mutex, with a timeout of 100 microseconds. This gives a "composite lock" made of both the global variable and the regular mutex. The "stealer" succeeds in stealing the "lock" either when it notices a value of 0 in the global variable (checked every 100 microseconds), or immediately if the regular mutex is released by another thread.
It's then a matter of choosing how to release the composite lock on a case-by-case basis. Most external functions (writes to files, etc.) are expected to generally complete quickly, so we release and re-acquire the composite lock by writing to the global variable. Only around a few specific functions---like sleep() or lock_acquire()---do we expect the calling thread to often block; around these functions, we release the composite lock by actually releasing the mutex instead.
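To make the idea concrete, here is a minimal sketch of such a composite lock, under some assumptions of mine: it uses C11 atomics for the global flag, approximates the slow-path release with a condition-variable signal rather than a real mutex handoff, and ignores how the single "stealer" is elected among several waiters. All names are illustrative, not PyPy's.

#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int gil_flag;                 /* 1 = held, 0 = free */
static pthread_mutex_t gil_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t gil_cond = PTHREAD_COND_INITIALIZER;

static int gil_try_acquire(void)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&gil_flag, &expected, 1);
}

/* Fast path: release around a short external call; no OS calls involved. */
static void gil_fast_release(void)
{
    atomic_store(&gil_flag, 0);
}

/* The "stealer": wait with a ~100 microsecond timeout and retry the flag;
   an explicit signal from the slow-path release wakes it up immediately. */
static void gil_steal(void)
{
    pthread_mutex_lock(&gil_mutex);
    while (!gil_try_acquire()) {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_nsec += 100 * 1000;           /* 100 microseconds */
        if (ts.tv_nsec >= 1000000000L) {
            ts.tv_sec++;
            ts.tv_nsec -= 1000000000L;
        }
        pthread_cond_timedwait(&gil_cond, &gil_mutex, &ts);
    }
    pthread_mutex_unlock(&gil_mutex);
}

/* Slow path: used around calls expected to block, e.g. sleep(). */
static void gil_slow_release(void)
{
    atomic_store(&gil_flag, 0);
    pthread_mutex_lock(&gil_mutex);
    pthread_cond_signal(&gil_cond);         /* wake the stealer right away */
    pthread_mutex_unlock(&gil_mutex);
}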
If I understand the problem statement, you are asking the kernel scheduler to make an educated guess about whether your userspace application's "hot" thread will try to reacquire the lock in the very near future, to avoid implicitly preempting it by allowing a "not-so-hot" thread to acquire the mutex.
I wouldn't know how the kernel could do that. The only two things that come to mind:
Do not release the mutex unless the hot thread is actually transitioning to idle (an application-specific condition). On Linux you can use CLOCK_MONOTONIC_COARSE to reduce the overhead of checking the wall clock when implementing some sort of timer; see the sketch after the code below.
Increase the hot thread's priority. This is more of a mitigation strategy, in an attempt to reduce the amount of preemption of the hot thread. If the "hot" thread can be identified, you could do something like:
pthread_t thread = pthread_self();

/* set max priority, FIFO policy */
struct sched_param params;
params.sched_priority = sched_get_priority_max(SCHED_FIFO);
int rv = pthread_setschedparam(thread, SCHED_FIFO, &params);
if (rv != 0) {
    /* print error */
    /* ... */
}
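For the first option, a rough sketch (my own assumption of how it could look, with illustrative names) of using CLOCK_MONOTONIC_COARSE for cheap time checks to decide whether the hot thread has been idle long enough to actually hand over the mutex:

#include <stdint.h>
#include <time.h>

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC_COARSE, &ts);   /* low resolution, cheap */
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* release the lock only if the hot thread has been idle past a threshold */
static int should_release(int64_t last_active_ns, int64_t threshold_ns)
{
    return now_ns() - last_active_ns > threshold_ns;
}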
A spinlock might work better in your case. Spinlocks avoid context switches and are highly efficient if threads are likely to hold the lock only for a short duration.
For this very reason, they are widely used in OS kernels.
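A minimal usage sketch with POSIX spinlocks (illustrative names; error handling omitted):

#include <pthread.h>

static pthread_spinlock_t spin;
static long shared_counter;

void counter_init(void)
{
    pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
}

void counter_bump(void)
{
    pthread_spin_lock(&spin);     /* busy-waits instead of sleeping */
    shared_counter++;             /* keep the critical section tiny */
    pthread_spin_unlock(&spin);
}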

Scheduling from thread in kernel space

I wrote a netfilter module, in which I had to create an extra thread to clean up a data structure on a regular basis. I called the schedule() function from that thread after each round of cleanup was done. Before I added semaphore locks it was working just fine, except that a "General Protection Fault" would occur. So I used semaphores to lock and unlock the data structure during both the insert and delete operations. Now it shows a bug saying "BUG: scheduling while atomic". After searching on Google I learned that this appears because schedule() is called explicitly where it should not be.
What are the ways to resolve it? Is there any other way in which the thread will yield the CPU to other threads?
So a basic summary (can't give a detailed answer for the reasons specified in the other comments):
A semaphore or mutex will attempt to grab a lock, and if the lock is unattainable, it will allow something else to run while it waits (this is done through the schedule() call). A spinlock, on the other hand, is a locking mechanism that constantly tries to grab the lock and does not relinquish control until it succeeds (note: there are also variants like spin_lock_irqsave, which additionally prevent you from being interrupted by an ISR; you can look these up).
As far as "atomic" goes, it means the kernel cannot allow another task to run on that CPU in the meantime. If you are in an interrupt, for example, the kernel cannot let someone else run because of stack management. If you are within a spinlock, the kernel cannot let someone else run, because if it schedules another task that waits on the same spinlock, you could end up in a deadlock. You can tell whether you're atomic by using the in_atomic() macro.
For your case, you mention insert and delete functions which are obviously called (at least some of the time) from atomic context. If they are only called from atomic context, then you can use a spinlock and be done with it. If they are called from both, you can use spin_lock_irqsave (which prevents interrupts from interrupting the spinlocked area). Again, be careful about disabling interrupts for too long.
It is also possible to defer the freeing until later if the freeing involves something you don't want to do from atomic context. In that case, create a spinlock-protected linked list which is populated by your delete. Then, in your thread, pop items off the list and free them there. For insertion, do the allocation before locking the list.
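A rough sketch of that deferred-free pattern, under my assumptions about the surrounding module (illustrative names; the payload, error handling, and thread setup are omitted):

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct dead_entry {
    struct list_head node;
    /* ... payload ... */
};

static LIST_HEAD(free_list);
static DEFINE_SPINLOCK(free_list_lock);

/* may be called from atomic context: only queue the item, never free it */
void delete_entry(struct dead_entry *e)
{
    unsigned long flags;

    spin_lock_irqsave(&free_list_lock, flags);
    list_add_tail(&e->node, &free_list);
    spin_unlock_irqrestore(&free_list_lock, flags);
}

/* called from the cleanup thread (process context): do the real kfree() */
void cleanup_pass(void)
{
    unsigned long flags;

    spin_lock_irqsave(&free_list_lock, flags);
    while (!list_empty(&free_list)) {
        struct dead_entry *e =
            list_first_entry(&free_list, struct dead_entry, node);
        list_del(&e->node);
        spin_unlock_irqrestore(&free_list_lock, flags);
        kfree(e);
        spin_lock_irqsave(&free_list_lock, flags);
    }
    spin_unlock_irqrestore(&free_list_lock, flags);
}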
Hope this helps.
John

Linux Scheduling: OS vs "virtual"

How does one implement a multithreaded, single-process model in C on Linux (Fedora), where a single scheduler thread runs on a "main" core watching I/O availability (e.g. TCP/IP, UDP), and one "execution thread" per core (started at init) parses the data and then writes a small amount of information to shared memory? (It is my understanding that pthreads within a single process share data.)
I believe my options are:
Pthreads or the Linux OS scheduler
I have a naive model in mind consisting of starting a certain number of these execution threads plus a single scheduler thread.
What is the best solution one could think of, given that I can use this sort of model?
Completing Benoit's answer: in order to communicate between your master and your worker threads, you could use a condition variable. The workers would do something like:
while (true)
{
    pthread_mutex_lock(workQueueMutex);
    while (workQueue.empty())
        pthread_cond_wait(workQueueCond, workQueueMutex);
    /* if we get here then (a) we have work and (b) we hold workQueueMutex */
    work = pop(workQueue);
    pthread_mutex_unlock(workQueueMutex);
    /* do work */
}
and the master:
/* I/O received */
pthread_mutex_lock(workQueueMutex);
push(workQueue, work);
pthread_cond_signal(workQueueCond);
pthread_mutex_unlock(workQueueMutex);
This would wake up one idle worker to process the request immediately. If no worker is idle, the work stays in the queue and will be dequeued and processed later.
Modifying the Linux scheduler is quite a tough job; I would just forget about it. Pthreads are usually preferred. If I understand correctly, you want to have one core dedicated to the control plane, and a pool of other cores dedicated to data plane processing? Then create a pool of threads from your master thread and set up core affinity for these slave threads with pthread_setaffinity_np(...).
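For example, a small helper (a hedged sketch of mine; names and core numbering are illustrative) that pins a thread to a given core, leaving core 0 for the master:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* pin the given thread to one core; call this for each slave thread,
   e.g. worker i on core i + 1, keeping core 0 for the master */
static int pin_to_core(pthread_t t, int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(t, sizeof(set), &set);
}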
Indeed, the threads of a process share the same address space, and global variables are accessible by any thread of that process.
It looks to me like you have a version of the producer-consumer problem with a single consumer aggregating the results of n producers. This is a pretty standard problem, so I definitely think that pthreads are more than enough for you. You don't need to go and mess around with the scheduler.
As one of the answers states, a thread-safe queue like the one described here works nicely for this sort of issue. Your original idea of spawning a bunch of threads is a good one. You seem to be worried that the threads' ability to share global state will cause you problems. I don't think this is an issue if you keep shared state to a minimum and use sane locking discipline. Sharing state is fine as long as you do so responsibly.
Finally, unless you really know what you're doing, I would advise against manually messing with thread affinity. Just spawn the threads and let the scheduler handle when and on what core a thread runs. The thing to optimize is the number of threads you use. One for each core may not actually be the fastest approach if other threads are running.
Generally speaking, this is more or less exactly what the POSIX select and Linux-specific epoll functions are for.

suspend pthread?

I want to implement a mutex lock.
From my understanding, mutex.lock() should work like
1) check lock owner
2) if lock is owned, put thread in waiting queue
3) suspend this thread until another thread sends a wake-up signal
However, there is nothing like pthread_suspend(), so how do I suspend the thread?
I found someone suggesting pthread_cond_wait(), but it seems that if I want to use that function, I have to set up a pthread_mutex lock first, and it doesn't make sense to use a pthread_mutex inside my own mutex.
Well, if my understanding of mutex is wrong, please correct me.
Thanks.
Mutexes, locks, and wait conditions are all different, distinct things. You need a mutex variable in order to implement both a lock and a wait condition.
A lock is a simple mechanism that prevents more than one thread from executing the same code at once by making all but one thread wait for the lock to become unlocked.
A wait condition is a slightly more complex structure that allows a thread to monitor a condition (usually a boolean flag) and only wake up when the flag has changed favourably.
In both cases, when a thread blocks (i.e. sleeps), the operating system's scheduling primitives automatically take care of descheduling the thread and using the available computing time elsewhere. Thread and task scheduling is not something you would normally have to worry about manually.
You can only make things that are at least as complex as the simplest pieces you have. If the simplest pieces you have are mutexes, then you can't make mutexes from the pieces you have. You can only make things at least as complex as a mutex or more so. If you have any pieces simpler than a mutex, tell us what they are, and we can tell you how to make a mutex out of them.
I suppose, if you want, you can make your own mutex out of pthread mutexes and condition variables. I'm not sure what the point is, but it's trivial to do. As you noted, you can use pthread_cond_wait to wait on your own kind of mutex.
The reason the pthreads standard gives you a mutex is because it's about the most flexible of the possible synchronization primitives.
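For illustration, a minimal sketch (my own, with illustrative names) of such a hand-rolled mutex built on a pthread mutex plus a condition variable, following the three steps from the question:

#include <pthread.h>

typedef struct {
    pthread_mutex_t guard;   /* protects the fields below */
    pthread_cond_t freed;    /* signalled when the lock is released */
    int locked;              /* 0 = free, 1 = held */
    pthread_t owner;         /* valid only while locked == 1 */
} my_mutex_t;

void my_mutex_init(my_mutex_t *m)
{
    pthread_mutex_init(&m->guard, NULL);
    pthread_cond_init(&m->freed, NULL);
    m->locked = 0;
}

void my_mutex_lock(my_mutex_t *m)
{
    pthread_mutex_lock(&m->guard);
    while (m->locked)                            /* steps 1-2: owned, so wait */
        pthread_cond_wait(&m->freed, &m->guard); /* step 3: suspend the thread */
    m->locked = 1;                               /* take ownership */
    m->owner = pthread_self();
    pthread_mutex_unlock(&m->guard);
}

void my_mutex_unlock(my_mutex_t *m)
{
    pthread_mutex_lock(&m->guard);
    m->locked = 0;
    pthread_cond_signal(&m->freed);              /* wake one waiting thread */
    pthread_mutex_unlock(&m->guard);
}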
mutex.lock() should work like:
1) check lock owner
2) if lock is owned, put thread in waiting queue
3) suspend this thread until THE THREAD THAT OWNS THE LOCK sends a wake up signal. No other thread can release the lock.
These steps should be performed as an atomic operation so that the correct behaviour is followed for all threads acquiring/releasing the mutex, no matter how such calls may be interrupted and reentered from other threads.
'However, there is nothing like pthread_suspend(), so how do I suspend the thread?' - usually, you don't. The OS kernel provides synchronization primitives that can block threads that should not continue to run. To implement a 'suspend' purely in user space, you can only spin-wait - something that is a good strategy in a few cases (an underloaded multi-core box where the lock is only held for a very short time), but certainly not all (and it can lead to spectacularly disastrous livelocks across whole clusters of machines).
If you want a mutex, use an OS mutex - that's what any cross-platform lib. will do.

Implementing mutex in a user level thread library

I am developing a user-level thread library as part of a project. I came up with an approach to implement mutexes. I would like to hear your views before going on with it. Basically, I need to implement just 3 functions in my library:
mutex_init, mutex_lock and mutex_unlock
I thought my mutex_t structure would look something like
typedef struct
{
    int available;                  // indicates whether the mutex is locked or unlocked
    queue listofwaitingthreads;
    gtthread_t owningthread;
} mutex_t;
In my mutex_lock function, I will first check in a while loop whether the mutex is available. If it is not, I will yield the processor for the next thread to execute.
In my mutex_unlock function, I will check if the owning thread is the current thread. If it is, I will set available to 0.
Is this the way to go about it? Also, what about deadlock? Should I take care of those conditions in my user-level library, or should I leave it to the application programmers to write code properly?
This won't work, because you have a race condition. If two threads try to grab the lock at the same time, both will see available == 0, and both will think they succeeded in taking the mutex.
If you want to do this properly, and without using an already-existing lock, you must use hardware operations such as TAS, CAS, etc.
There are algorithms that give you mutual exclusion without such hardware support, but they make assumptions that are often false. For more details about this, I highly recommend reading Herlihy and Shavit's The Art of Multiprocessor Programming, chapter 7.
You shouldn't worry about deadlocks at this level - mutex locks should be simple enough, and there is an assumption that the programmer using them takes care not to cause deadlocks (advanced mutexes can check for self-deadlock, meaning a thread that calls lock twice without calling unlock in between).
Not only do you have to use atomic operations to read and modify the flag (as Eran pointed out), you also have to make sure that your queue can handle concurrent access. This is not completely trivial; it's sort of a chicken-and-egg problem.
But if you really implemented this by spinning, you wouldn't even need such a queue. The access order to the lock would then be mostly random, though.
Just yielding would probably not be enough either; this can get quite costly if threads hold the lock for more than a few processor cycles. Consider using nanosleep with a small time value for the wait.
In general, a mutex implementation should look like:
Lock:
    while (trylock() == failed) {
        atomic_inc(waiter_cnt);
        atomic_sleep_if_locked();
        atomic_dec(waiter_cnt);
    }
Trylock:
    return atomic_swap(&lock, 1);
Unlock:
    atomic_store(&lock, 0);
    if (waiter_cnt) wakeup_sleepers();
Things get more complex if you want recursive mutexes, mutexes that can synchronize their own destruction (i.e. freeing the mutex is safe as soon as you get the lock), etc.
Note that atomic_sleep_if_locked and wakeup_sleepers correspond to FUTEX_WAIT and FUTEX_WAKE ops on Linux. The other atomics are probably CPU instructions, but could be system calls or kernel-assisted userspace function code, as in the case of Linux/ARM and the 0xffff0fc0 atomic compare-and-swap call.
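To make this concrete, here is a Linux-only sketch of my own that maps the pseudocode above onto FUTEX_WAIT/FUTEX_WAKE via the raw syscall (illustrative names; spurious wakeups and error handling are glossed over):

#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lock_word;    /* 0 = free, 1 = locked */
static atomic_int waiter_cnt;

static int trylock(void)
{
    return atomic_exchange(&lock_word, 1) == 0;   /* atomic_swap */
}

static void lock(void)
{
    while (!trylock()) {
        atomic_fetch_add(&waiter_cnt, 1);
        /* sleep only if the lock word is still 1 when the kernel checks it */
        syscall(SYS_futex, &lock_word, FUTEX_WAIT, 1, NULL, NULL, 0);
        atomic_fetch_sub(&waiter_cnt, 1);
    }
}

static void unlock(void)
{
    atomic_store(&lock_word, 0);
    if (atomic_load(&waiter_cnt))
        syscall(SYS_futex, &lock_word, FUTEX_WAKE, 1, NULL, NULL, 0);
}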
You do not need atomic instructions for a user level thread library, because all the threads are going to be user level threads of the same process. So actually when your process is given the time slice to execute, you are running multiple threads during that time slice but on the same processor. So, no two threads are going to be in the library function at the same time. Considering that the functions for mutex are already in the library, mutual exclusion is guaranteed.

Resources