I wrote a netfilter module in which I had to write an extra thread to clean up a data structure on a regular basis. I called schedule() from that thread after each round of cleanup was done. Before I added any locking it was working just fine, except that a "General Protection Fault" would occasionally occur. So I used semaphores to lock and unlock the data structure during both the insert and the delete operations. Now it reports "BUG: scheduling while atomic". After searching Google, I learned that this shows up because schedule() is being called explicitly from a context where it must not be called.
What are the ways to resolve it? Is there any other way in which the thread will yield the CPU to other threads?
So a basic summary (can't give a detailed answer for the reasons specified in the other comments):
A semaphore or mutex will attempt to grab a lock, and if the lock is unavailable, it will let something else run while it waits (this is done through the schedule() call). A spinlock, on the other hand, is a locking mechanism that keeps trying to grab the lock and does not relinquish control until it is successful (note: there are also spin_lock_irqsave and friends, which additionally prevent you from being interrupted by an ISR; you can look these up).
As far as "atomic" goes, it means the kernel cannot allow another process to run in the meantime. If you are in an interrupt, for example, the kernel cannot let someone else run due to stack management. If you are within a spinlock, the kernel cannot let someone else run, because if it schedules another process that is waiting on the same spinlock, you could get into a deadlock. You can tell whether you are atomic by using the in_atomic() macro.
For your case, you mention insert and delete functions which are obviously called (at least some of the time) from atomic context. If they are only called from atomic context, you can use a spinlock and be done with it. If they are called from both, you can use spin_lock_irqsave (which prevents interrupts from interrupting the spinlocked region). Again, be careful about disabling interrupts for too long a time.
It is also possible to defer the freeing until later if the freeing involves something you don't want to do from atomic context. In that case, create a spinlock-protected linked list which is populated by your delete function. Then, in your thread, pop items off the list and free them there (see the sketch below). For insertion, do the allocation before locking the list.
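Here is a rough sketch of that deferred-free pattern, assuming a simple list-based structure; the struct and function names are purely illustrative, not from your module:

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct my_entry {
    struct list_head node;
    /* ... payload ... */
};

static LIST_HEAD(pending_free);
static DEFINE_SPINLOCK(pending_free_lock);

/* Called from atomic context (e.g. the netfilter hook): just queue the entry. */
static void my_delete(struct my_entry *e)
{
    unsigned long flags;

    spin_lock_irqsave(&pending_free_lock, flags);
    list_add_tail(&e->node, &pending_free);
    spin_unlock_irqrestore(&pending_free_lock, flags);
}

/* Called from the cleanup thread (process context): safe to do the freeing here. */
static void my_flush_pending(void)
{
    LIST_HEAD(tmp);
    struct my_entry *e, *next;
    unsigned long flags;

    /* Detach the whole pending list while holding the lock... */
    spin_lock_irqsave(&pending_free_lock, flags);
    list_splice_init(&pending_free, &tmp);
    spin_unlock_irqrestore(&pending_free_lock, flags);

    /* ...then free the entries outside the spinlock. */
    list_for_each_entry_safe(e, next, &tmp, node) {
        list_del(&e->node);
        kfree(e);
    }
}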
Hope this helps.
John
I am looking for a lock implementation that degrades gracefully in the situation where you have two threads that constantly try to release and re-acquire the same lock, at a very high frequency.
Of course it is clear that in this case the two threads won't significantly progress in parallel. Theoretically, the best result would be achieved by running the whole thread 1, and then the whole thread 2, without any switching---because switching just creates massive overhead here. So I am looking for a lock implementation that would handle this situation gracefully by keeping the same thread running for a while before switching, instead of constantly switching.
Long version of the question
As I would myself be tempted to answer this question by "your program is broken, don't do that", here is some justification about why we end up in this kind of situation.
The lock is a "single global lock", i.e. a very coarse lock. (It is the Global Interpreter Lock (GIL) inside PyPy, but the question is about how to do it in general, say if you have a C program.)
We have the following situation:
There is constantly contention. That's expected in this case: the lock is a global lock that needs to be acquired for most threads to progress. So we expect that a large fraction of them are waiting for the lock. Only one of these threads can progress.
The thread that holds the lock might sometimes do bursts of short releases. A typical example would be this thread doing repeated calls to "something external", e.g. many short writes to a file. Each of these writes is usually completed very quickly. The lock still has to be released just in case this external thing turns out to take longer than expected (e.g. if the write actually needs to wait for disk I/O), so that another thread can acquire the lock in that case.
If we use some standard mutex for the lock, then ownership will often switch to another thread as soon as the owner releases it. The problem arises when the program runs several threads that each want to do a long burst of short releases: the program ends up spending most of its time passing the lock between CPUs.
It is much faster to keep the same thread running for a while before switching, at least as long as the lock is released for very short periods of time. (E.g. on Linux/pthread, a release immediately followed by an acquire will sometimes re-acquire the lock instantly even if there are other waiting threads; but we'd like that to happen in the large majority of cases, not just sometimes.)
Of course, as soon as the lock is released for a longer period of time, then it becomes a good idea to transfer ownership of the lock to a different thread.
So I'm looking for general ideas about how to do that. I guess it should exist already somewhere---in a paper, or in some multithreading library?
For reference, PyPy tries to implement something like this by polling: the lock is just a global variable, manipulated with synchronized compare-and-swap but no OS calls; one of the waiting threads is given the role of "stealer"; that "stealer" thread wakes up every 100 microseconds to check the variable. This is not horribly bad (it costs maybe 1-2% of CPU time in addition to the 100% consumed by the running thread). It actually implements what I'm asking for here, but the problem is that it is a hack that doesn't cleanly support more traditional uses of locks: for example, if thread 1 tries to send a message to thread 2 and waits for the answer, the two thread switches will each take 100 microseconds on average, which is far too much if the message is processed quickly.
For reference, let me describe how we finally implemented it. I was unsure about it as it still feels like a hack, but it seems to work for PyPy's use case in practice.
We did it as described in the last paragraph of the question, with one addition: the "stealer" thread, which checks a global variable every 100 microseconds, does so by calling pthread_cond_timedwait or WaitForSingleObject on a regular, system-provided mutex, with a timeout of 100 microseconds. This gives a "composite lock" made of both the global variable and the regular mutex. The "stealer" will succeed in stealing the "lock" either when it notices a value of 0 in the global variable (checked every 100 microseconds), or immediately if the regular mutex is released by another thread.
It's then a matter of choosing how to release the composite lock on a case-by-case basis. Most external functions (writes to files, etc.) are expected to complete quickly, so we release and re-acquire the composite lock just by writing to the global variable. Only around a few specific functions, like sleep() or lock_acquire(), where we expect the calling thread to often block, do we release the composite lock by actually releasing the mutex instead.
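For illustration, here is a very rough sketch of such a composite lock in C; the structure and names are mine, not the actual PyPy code, and error handling is omitted:

#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int      gil = 0;                 /* 0 = free, 1 = held */
static pthread_mutex_t gil_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gil_cond  = PTHREAD_COND_INITIALIZER;

/* Fast path, used around short external calls: just flip the flag. */
static int try_fast_acquire(void)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&gil, &expected, 1);
}

static void fast_release(void)
{
    atomic_store(&gil, 0);
}

/* The "stealer": wake up every ~100 microseconds (or immediately when the
 * mutex side is signalled) and try to grab the flag. */
static void steal_lock(void)
{
    pthread_mutex_lock(&gil_mutex);
    while (!try_fast_acquire()) {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_nsec += 100 * 1000;               /* 100 microseconds */
        if (ts.tv_nsec >= 1000000000L) { ts.tv_sec++; ts.tv_nsec -= 1000000000L; }
        pthread_cond_timedwait(&gil_cond, &gil_mutex, &ts);
    }
    pthread_mutex_unlock(&gil_mutex);
}

/* Slow release, used around calls expected to block (sleep(), lock_acquire()):
 * actually wake a stealer instead of only clearing the flag. */
static void slow_release(void)
{
    fast_release();
    pthread_mutex_lock(&gil_mutex);
    pthread_cond_signal(&gil_cond);
    pthread_mutex_unlock(&gil_mutex);
}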
If I understand the problem statement, you are asking the kernel scheduler to do an educated guess on whether your userspace application "hot" thread will try to reacquire the lock in the very near future, to avoid implicitly preempting it by allowing a "not-so-hot" thread to acquire the mutex.
I wouldn't know how the kernel could do that. The only two things that come to my mind:
Do not release the mutex unless the hot thread is actually transitioning to idle (an application-specific condition). On Linux you can use CLOCK_MONOTONIC_COARSE to reduce the overhead of checking the clock if you implement this with some sort of timer.
Increase the hot thread's priority. This is more of a mitigation strategy, attempting to reduce how often the hot thread is preempted. If the "hot" thread can be identified, you could do something like:
pthread_t thread = pthread_self();

//Set max prio, FIFO
struct sched_param params;
params.sched_priority = sched_get_priority_max(SCHED_FIFO);
int rv = pthread_setschedparam(thread, SCHED_FIFO, &params);
if (rv != 0) {
    //Print error
    //...
}
Spinlocks might work better in your case. They avoid context switches and are highly efficient if the threads are likely to hold the lock only for a short duration.
For this very reason, they are widely used in OS kernels.
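For example, with POSIX threads you could use a pthread spinlock around a very short critical section. A hedged sketch; shared_counter just stands in for whatever the lock protects:

#include <pthread.h>

static pthread_spinlock_t lock;
static long shared_counter;

void init_lock(void)
{
    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
}

void bump(void)
{
    pthread_spin_lock(&lock);    /* busy-waits instead of sleeping */
    shared_counter++;            /* keep the critical section very short */
    pthread_spin_unlock(&lock);
}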
Let's say two semaphores are protecting a critical piece of code, and you only want a critical piece of code to execute if both of them are available. Is there a pattern for writing this?
In other words, is there a statement that reads, "If semaphore a and b are available, then run... otherwise sleep"?
The simplest way to implement this is to use a single pthread_mutex_t to protect some state, and a single pthread_cond_t to notify all threads when the state has changed. If you always broadcast on the condvar, then you will always wake all waiting threads. The threads can then perform arbitrarily complex tests and updates to the shared state.
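Applied to the "run only if both are available" question, a sketch of that pattern might look like this (the two boolean flags stand in for the two semaphores; all names are illustrative):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static bool a_free = true, b_free = true;

void enter_critical(void)
{
    pthread_mutex_lock(&m);
    while (!(a_free && b_free))          /* arbitrary condition on shared state */
        pthread_cond_wait(&cv, &m);
    a_free = b_free = false;             /* "take" both resources in one step */
    pthread_mutex_unlock(&m);
}

void leave_critical(void)
{
    pthread_mutex_lock(&m);
    a_free = b_free = true;
    pthread_cond_broadcast(&cv);         /* wake everyone; they re-test the condition */
    pthread_mutex_unlock(&m);
}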
Of course, this is not the most efficient solution since it potentially wakes threads when the state does not satisfy the condition they are waiting for (and they have to go back to sleep). It could also lead to starvation since a thread may always find itself at the back of the queue whenever it waits on the condvar, and never find an acceptable state when it awakens.
Without knowing more details of the problem you are trying to solve, it is hard to give an airtight answer.
pthreads does not allow you to acquire multiple locks/semaphores atomically; however, as pointed out by @Greg, you can avoid deadlock by assigning an order to the locks/semaphores and having the threads always acquire them in that order. Of course, you have to know which locks you intend to acquire before you start acquiring any of them. This will not work if you cannot determine the next lock to acquire until you have acquired the current one, since you may then be required to take a lock out of order. If you release all of the locks and start over, you may find the state has changed, requiring you to acquire a different set of locks, which could lead to livelock.
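A minimal sketch of the fixed-ordering idea, assuming two semaphores sem_a and sem_b initialized elsewhere:

#include <semaphore.h>

extern sem_t sem_a, sem_b;   /* assumed initialized elsewhere; always acquire a before b */

void critical_with_both(void)
{
    sem_wait(&sem_a);        /* first in the global order */
    sem_wait(&sem_b);        /* second; no thread ever holds b while waiting for a */
    /* ... critical section needing both ... */
    sem_post(&sem_b);
    sem_post(&sem_a);
}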
I found another way to synchronize threads in the source code of strongswan. It synchronizes threads by changing the thread's scheduling policy (SCHED_FIFO). Does it have any advantage over the mutex approach?
The code:
int oldpolicy;
struct sched_param oldparams, params;
pthread_getschedparam(thread_id, &oldpolicy, &oldparams);
params.__sched_priority = sched_get_priority_max(SCHED_FIFO);
pthread_setschedparam(thread_id, SCHED_FIFO, &params);
...
critical section
...
pthread_setschedparam(thread_id, oldpolicy, &oldparams);
PS: strongswan uses a malloc hook to detect memory leaks. To support multiple threads, it uses this approach to synchronize them.
PPS: It seems that they have since modified the code. That piece of code is from Strongswan 4.5.0.
That does not synchronize anything!
What this does is prevent the thread from being scheduled off the CPU while the critical section is running. Since we now have multiple CPUs, and since a different thread can run on another CPU, it does not exclude anything at all. And it does not even completely prevent preemption; the thread can still sleep if it waits on a page fault or other I/O.
The reason for it is to avoid starving other threads when something very important is being calculated without which the other threads can't continue. It does help that cause, but it's a very specialized case (search for "priority inversion").
It's broken if you have more than one core, unless you lock all threads that might conflict to the same core. And, even then, it's still broken if you block on I/O. (For example, a page fault.) Yuck.
I am developing a user-level thread library as part of a project. I came up with an approach to implement a mutex, and I would like to hear your views before going ahead with it. Basically, I need to implement just 3 functions in my library:
mutex_init, mutex_lock and mutex_unlock
I thought my mutex_t structure would look something like
typedef struct
{
    int available; //indicates whether the mutex is locked or unlocked
    queue listofwaitingthreads;
    gtthread_t owningthread;
} mutex_t;
In my mutex_lock function, I will first check in a while loop whether the mutex is available. If it is not, I will yield the processor so the next thread can execute.
In my mutex_unlock function, I will check whether the owning thread is the current thread. If it is, I will set available to 0.
Is this the way to go about it? Also, what about deadlock? Should I take care of those conditions in my user-level library, or should I leave it to application programmers to write their code properly?
This won't work, because you have a race condition. If two threads try to take the lock at the same time, both can see available == 0, and both will think they succeeded in taking the mutex.
If you want to do this properly, and without using an already-existing lock, you must use hardware atomic operations such as TAS (test-and-set), CAS (compare-and-swap), etc.
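For example, a test-and-set based lock loop might look roughly like this, using the C11 atomic_flag type; gtthread_yield() is an assumed name for your library's own yield primitive:

#include <stdatomic.h>

extern void gtthread_yield(void);        /* assumed: the library's yield call */

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void mutex_lock_sketch(void)
{
    /* test-and-set: atomically sets the flag and returns its previous value */
    while (atomic_flag_test_and_set(&lock_flag))
        gtthread_yield();                /* lock busy: let another thread run */
}

void mutex_unlock_sketch(void)
{
    atomic_flag_clear(&lock_flag);
}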
There are algorithms that give you mutual exclusion without such hardware support, but they make assumptions that often do not hold in practice. For more details, I highly recommend reading Herlihy and Shavit's The Art of Multiprocessor Programming, chapter 7.
You shouldn't worry about deadlocks at this level - mutex locks should be simple enough, and there is an assumption that the programmer using them takes care not to cause deadlocks (advanced mutexes can check for self-deadlock, meaning a thread that calls lock twice without calling unlock in between).
In addition to needing atomic operations to read and modify the flag (as Eran pointed out), you also have to make sure your queue can handle concurrent accesses. This is not completely trivial; it is something of a chicken-and-egg problem.
But if you really implemented this by spinning, you wouldn't even need such a queue. The order in which threads acquire the lock would then be mostly random, though.
Probably just yielding would also not be enough; it can be quite costly if threads hold the lock for more than a few processor cycles. Consider using nanosleep with a small time value for the wait.
In general, a mutex implementation should look like:
Lock:
    while (trylock() == failed) {
        atomic_inc(waiter_cnt);
        atomic_sleep_if_locked();
        atomic_dec(waiter_cnt);
    }

Trylock:
    return atomic_swap(&lock, 1);

Unlock:
    atomic_store(&lock, 0);
    if (waiter_cnt) wakeup_sleepers();
Things get more complex if you want recursive mutexes, mutexes that can synchronize their own destruction (i.e. freeing the mutex is safe as soon as you get the lock), etc.
Note that atomic_sleep_if_locked and wakeup_sleepers correspond to FUTEX_WAIT and FUTEX_WAKE ops on Linux. The other atomics are probably CPU instructions, but could be system calls or kernel-assisted userspace function code, as in the case of Linux/ARM and the 0xffff0fc0 atomic compare-and-swap call.
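For illustration, a minimal futex-backed version of the pseudocode above might look like this on Linux (a sketch, not production code; error handling and fairness subtleties are glossed over):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

typedef struct {
    atomic_int state;    /* 0 = unlocked, 1 = locked */
    atomic_int waiters;  /* threads currently sleeping in FUTEX_WAIT */
} lock_t;

static long futex(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void lock_acquire(lock_t *l)
{
    /* trylock == atomic_swap(&lock, 1) */
    while (atomic_exchange(&l->state, 1) != 0) {
        atomic_fetch_add(&l->waiters, 1);
        futex(&l->state, FUTEX_WAIT, 1);   /* sleep only if still locked */
        atomic_fetch_sub(&l->waiters, 1);
    }
}

static void lock_release(lock_t *l)
{
    atomic_store(&l->state, 0);
    if (atomic_load(&l->waiters) > 0)
        futex(&l->state, FUTEX_WAKE, 1);   /* wake one sleeper */
}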
You do not need atomic instructions for a user level thread library, because all the threads are going to be user level threads of the same process. So actually when your process is given the time slice to execute, you are running multiple threads during that time slice but on the same processor. So, no two threads are going to be in the library function at the same time. Considering that the functions for mutex are already in the library, mutual exclusion is guaranteed.
I am trying to implement a checkpointing scheme for multithreaded applications by using fork. I will take the checkpoint at a safe location such as a barrier. One thread will call fork to replicate the address space and signals will be sent to all other threads so that they can save their contexts and write it to a file.
The forked process will not run initially. Only when a restart from the checkpoint is required will a signal be sent to it so it can start running. At that point, the threads that were not forked but whose contexts were saved will be recreated from those saved contexts.
My first question is whether it is enough to recreate threads from the saved contexts and run them from there, assuming no locks were held, no signals were pending during the checkpoint, etc. Lastly, how can a thread be created so that it runs from a known context?
What you want is not possible without major integration with the pthreads implementation. Internal thread structures will likely contain their own kernel-space thread ids, which will be different in the restored contexts.
It sounds to me like what you really want is forkall, which is non-trivial to implement. I don't think barriers are useful at all for what you're trying to accomplish. Asynchronous interruption and checkpointing is just as good as a synchronized one.
If you want to try hacking forkall into glibc, you should start out by looking at the setxid code NPTL uses for synchronizing setuid() calls between threads using signals. The same principle is what's needed to implement forkall, but you'd basically call setjmp instead of setuid in the signal handlers, and then longjmp back into them after making new threads in the child. After that you'd have to patch up the thread structures to have the right pid/tid values, free the excess new stacks that were created, etc.
Edit: Since the setxid code in glibc/NPTL is rather dense reading for someone not familiar with the codebase, you might instead look at the corresponding code I have in musl, called __synccall:
http://git.etalabs.net/cgi-bin/gitweb.cgi?p=musl;a=blob;f=src/thread/synccall.c;h=91ac5eb77322da7393f778da29d35fb3c2def15d;hb=HEAD
It uses a signal to synchronize all threads, then runs a callback sequentially in each thread one-by-one. To implement forkall, you'd want to do something like this prior to the fork, but instead of a callback, simply save jump buffers for each thread except the calling thread (you can't use a callback for this because the return would invalidate the jump buffer you just saved), then perform the fork from the calling thread. After that, you would make N new threads, and have them jump back to the old threads' saved jump buffers, and destroy their new (unneeded) stacks. You'd also need to make the right syscall to update their thread register (e.g. %gs on x86) and tid address.
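To make the shape of that a bit more concrete, here is an extremely simplified, hypothetical sketch of the "save a jump buffer per thread from a signal handler, fork, then jump back into those buffers from new threads in the child" idea; slot assignment, async-signal-safety, fixing tid/thread-register values, and freeing the new threads' stacks are all glossed over, and every name is illustrative:

#include <pthread.h>
#include <setjmp.h>
#include <semaphore.h>
#include <signal.h>

#define MAX_THREADS 64

static sigjmp_buf  saved_ctx[MAX_THREADS];
static sem_t       ctx_saved;        /* posted once per thread that saved its context */
static sem_t       fork_done;        /* posted by the checkpointer after fork() */
static __thread int my_slot;         /* per-thread index, assigned elsewhere */

static void checkpoint_handler(int sig)
{
    (void)sig;
    if (sigsetjmp(saved_ctx[my_slot], 1) == 0) {
        sem_post(&ctx_saved);        /* context captured; checkpointer may fork now */
        sem_wait(&fork_done);        /* park until the address space has been copied */
        return;                      /* in the parent: carry on as before */
    }
    /* Nonzero return: we are a freshly created thread in the child, now running
     * in the old thread's saved context; simply return from the handler. */
}

static void *restorer(void *arg)
{
    siglongjmp(saved_ctx[(int)(long)arg], 1);  /* jump into a pre-fork context */
    return NULL;                               /* never reached */
}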
Then you need to take these ideas and integrate them with glibc's thread allocation and thread stack cache framework. :-)