Lock that handles a high-contention, high-frequency situation

I am looking for a lock implementation that degrades gracefully in the situation where you have two threads that constantly try to release and re-acquire the same lock, at a very high frequency.
Of course it is clear that in this case the two threads won't significantly progress in parallel. Theoretically, the best result would be achieved by running the whole of thread 1 and then the whole of thread 2, without any switching, because switching just creates massive overhead here. So I am looking for a lock implementation that handles this situation gracefully by keeping the same thread running for a while before switching, instead of switching constantly.
Long version of the question
As I would myself be tempted to answer this question by "your program is broken, don't do that", here is some justification about why we end up in this kind of situation.
The lock is a "single global lock", i.e. a very coarse lock. (It is the Global Interpreter Lock (GIL) inside PyPy, but the question is about how to do it in general, say if you have a C program.)
We have the following situation:
There is constant contention. That's expected in this case: the lock is a global lock that must be acquired before most threads can make progress, so we expect a large fraction of them to be waiting for it. Only one of these threads can progress at a time.
The thread that holds the lock might sometimes perform bursts of short releases. A typical example would be if this thread makes repeated calls to "something external", e.g. many short writes to a file. Each of these writes usually completes very quickly. The lock still has to be released just in case this external operation turns out to take longer than expected (e.g. if the write actually needs to wait for disk I/O), so that another thread can acquire the lock in that case.
If we use some standard mutex for the lock, ownership will often switch to another thread as soon as the owner releases the lock. The problem is what happens when the program runs several threads that each want to do a long burst of short releases: the program ends up spending most of its time switching the lock between CPUs.
It is much faster to run the same thread for a while before switching, at least as long as the lock is released for very short periods of time. (E.g. on Linux/pthread a release immediately followed by an acquire will sometimes re-acquire the lock instantly even if there are other waiting threads; but we'd like this result in a large majority of cases, not just sometimes.)
Of course, as soon as the lock is released for a longer period of time, then it becomes a good idea to transfer ownership of the lock to a different thread.
So I'm looking for general ideas about how to do that. I guess it should exist already somewhere---in a paper, or in some multithreading library?
For reference, PyPy tries to implement something like this by polling: the lock is just a global variable, manipulated with synchronized compare-and-swap but no OS calls; one of the waiting threads is given the role of "stealer"; that "stealer" thread wakes up every 100 microseconds to check the variable. This is not horribly bad (it costs maybe 1-2% of CPU time in addition to the 100% consumed by the running thread). This actually implements what I'm asking for here, but the problem is that it is a hack that doesn't cleanly support more traditional uses of locks: for example, if thread 1 tries to send a message to thread 2 and waits for the answer, the two thread switches will take on average 100 microseconds each, which is far too much if the message is processed quickly.

For reference, let me describe how we finally implemented it. I was unsure about it as it still feels like a hack, but it seems to work for PyPy's use case in practice.
We did it as described in the last paragraph of the question, with one addition: the "stealer" thread, which checks a global variable every 100 microseconds, does this by calling pthread_cond_timedwait or WaitForSingleObject with a regular, system-provided mutex, with a timeout of 100 microseconds. This gives a "composite lock" made of both the global variable and the regular mutex. The "stealer" succeeds in stealing the "lock" either when it notices a value of 0 in the global variable (checked every 100 microseconds), or immediately if the regular mutex is released by another thread.
It's then a matter of choosing how to release the composite lock on a case-by-case basis. Most external functions (writes to files, etc.) are expected to generally complete quickly, so we release and re-acquire the composite lock by writing to the global variable. Only around a few specific functions, like sleep() or lock_acquire(), where we expect the calling thread to often block, do we release the composite lock by actually releasing the mutex instead.
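For illustration, here is a minimal sketch of such a composite lock in C. All names are hypothetical; it substitutes pthread_mutex_timedlock for the pthread_cond_timedwait-based wait described above, and it omits the hand-off bookkeeping the real implementation needs (electing a single stealer, and reconciling mutex ownership after a steal through the variable):

#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int      gil_value = 0;   /* 1 = held, 0 = released */
static pthread_mutex_t gil_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Short release around an external call expected to complete quickly:
   no OS call, only the shared variable is touched. */
static void gil_release_short(void) { atomic_store(&gil_value, 0); }

static int gil_reacquire_short(void) {          /* returns 1 on success */
    int expected = 0;
    return atomic_compare_exchange_strong(&gil_value, &expected, 1);
}

/* Long release around calls expected to block, like sleep() or
   lock_acquire(): actually release the mutex so a waiter takes over. */
static void gil_release_long(void) { pthread_mutex_unlock(&gil_mutex); }

/* The designated "stealer": sleep on the real mutex with a timeout of
   100 microseconds, so it succeeds immediately if the mutex is released,
   and otherwise polls the shared variable every 100 microseconds. */
static void gil_steal(void) {
    for (;;) {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_nsec += 100 * 1000;                       /* +100 us */
        if (ts.tv_nsec >= 1000000000L) { ts.tv_sec++; ts.tv_nsec -= 1000000000L; }
        if (pthread_mutex_timedlock(&gil_mutex, &ts) == 0)
            return;                  /* mutex was released: lock stolen */
        int expected = 0;
        if (atomic_compare_exchange_strong(&gil_value, &expected, 1))
            return;                  /* variable was 0: lock stolen */
    }
}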

If I understand the problem statement, you are asking the kernel scheduler to make an educated guess about whether your userspace application's "hot" thread will try to reacquire the lock in the very near future, to avoid implicitly preempting it by letting a "not-so-hot" thread acquire the mutex.
I wouldn't know how the kernel could do that. Only two things come to my mind:
Do not release the mutex unless the hot thread is actually transitioning to idle (an application-specific condition). On Linux you can use CLOCK_MONOTONIC_COARSE to reduce the overhead of checking the wall clock when implementing some sort of timer (see the sketch after the code below).
Increase the hot thread's priority. This is more of a mitigation strategy, in an attempt to reduce the amount of preemption of the hot thread. If the "hot" thread can be identified, you could do something like:
#include <stdio.h>
#include <sched.h>
#include <pthread.h>

/* Give the current ("hot") thread the maximum FIFO priority */
pthread_t thread = pthread_self();
struct sched_param params;
params.sched_priority = sched_get_priority_max(SCHED_FIFO);
int rv = pthread_setschedparam(thread, SCHED_FIFO, &params);
if (rv != 0) {
    fprintf(stderr, "pthread_setschedparam failed: %d\n", rv);
}
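For the timer idea in the first point, a cheap low-resolution time check might look like this (a sketch; CLOCK_MONOTONIC_COARSE is Linux-specific):

#include <time.h>

/* Low-overhead, low-resolution clock read, e.g. to decide how long the
   hot thread has been holding the lock before voluntarily yielding */
struct timespec now;
clock_gettime(CLOCK_MONOTONIC_COARSE, &now);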

A spinlock might work better in your case. Spinlocks avoid context switching and are highly efficient if the threads are likely to hold the lock only for a short duration.
For this very reason, they are widely used in OS kernels.
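To give a sense of how lightweight they are, here is a minimal C11 sketch of a spinlock built on atomic_flag (POSIX exposes the same idea as a library API via pthread_spin_lock/pthread_spin_unlock):

#include <stdatomic.h>

static atomic_flag flag = ATOMIC_FLAG_INIT;

/* Busy-wait until the flag was previously clear, i.e. we got the lock */
void spin_lock(void) {
    while (atomic_flag_test_and_set_explicit(&flag, memory_order_acquire))
        ;  /* spin */
}

void spin_unlock(void) {
    atomic_flag_clear_explicit(&flag, memory_order_release);
}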

Related

What is the disadvantage of calling sleep() inside a mutex lock?

For example:
pthread_mutex_lock(&mutex);
// Do something
sleep(1); // causes issues: waiting while holding the lock
pthread_mutex_unlock(&mutex);
What is the solution if we don't want to use sleep() inside the mutex lock?
As a rule of thumb, you usually (but not always) don't want to hold a mutex for a long period of time (otherwise, other threads locking the same mutex would wait too long), and a full second is a long period for a processor doing billions of elementary operations each second.
You might want to use condition variables (since pthread_cond_wait atomically releases the mutex), or do the sleep (or some poll(2)...) outside of the locked region. You might even, on Linux, use pipe(7)s, or the cheaper but Linux-specific eventfd(2), to communicate between threads running event loops.
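A minimal sketch of the condition-variable approach, with hypothetical names (lock, cond, ready): instead of calling sleep(1) inside the critical section, wait on the condition with a one-second deadline, so the mutex is available to other threads while sleeping:

#include <pthread.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int ready = 0;    /* the condition another thread will set */

void wait_up_to_one_second(void) {
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;

    pthread_mutex_lock(&lock);
    /* pthread_cond_timedwait atomically releases the mutex while
       sleeping, so other threads can lock it during the wait */
    while (!ready)
        if (pthread_cond_timedwait(&cond, &lock, &deadline) != 0)
            break;   /* timed out after about one second */
    pthread_mutex_unlock(&lock);
}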
The Coverity static source analyzer is heuristic and might give false alarms.
Take time to read a good Pthread tutorial.

Thread overhead and performance

When programming with threads in C on Linux, I am trying to reduce the thread overhead, basically lowering CPU time (and making it more efficient).
Now in the program lots of threads are being created, and each needs to do a job before it terminates. Only one thread can do the job at a time because of mutual exclusion.
I know how long a thread will take to complete a job before it starts.
Other threads have to wait while one thread is doing that job. The way they check whether they can do the job is by waiting on a condition variable.
For waiting threads, this is the specific code used to wait (a, b, c, and d are just arbitrary placeholders; this is just an example):
while (a == b || c != d) {
    pthread_cond_wait(&open, &mylock);
}
How efficient is this? What's happening inside pthread_cond_wait? Is it a while loop (behind the scenes) that constantly checks the condition variable?
Also, since I know how long a thread's job will take, is it more efficient to enforce a shortest-job-first scheduling policy? Or does that not matter, since in any order of threads doing the job the program will take the same total time to finish? In other words, does shortest job first lower the CPU overhead of the threads doing the waiting? It does seem to lower waiting times.
Solve your problem with a single thread first, and then ask for help identifying the best places to expose parallelisation if you can't already see an avenue that requires the least locking. The optimal number of threads to use will depend on the computer you use. It doesn't make much sense to use more than n+1 threads, where n is the number of processors/cores available to your program. To reduce thread-creation overhead, it's a good idea to give each thread multiple jobs.
The following is in response to your clarification edit:
Now in the program lots of threads are being created, and each needs to do a job before it terminates. Only one thread can do the job at a time because of mutual exclusion.
No. At most n+1 threads should be created, as described above. What do you mean by mutual exclusion? I take mutual exclusion to mean "only one thread includes task x in its work queue". This means that no other thread requires locking on task x.
Other threads have to wait while one thread is doing that job. The way they check whether they can do the job is by waiting on a condition variable.
Give each thread an independent list of tasks to complete. If job x is a prerequisite to job y, then job x and job y would ideally be in the same list so that the thread doesn't have to deal with thread mutex objects on either job. Have you explored this avenue?
while (a == b || c != d) {
    pthread_cond_wait(&open, &mylock);
}
How efficient is this? What's happening inside pthread_cond_wait? Is it a while loop (behind the scenes) that constantly checks the condition variable?
In order to avoid undefined behaviour, mylock must be locked by the current thread before calling pthread_cond_wait, so I presume your code calls pthread_mutex_lock to acquire the mylock lock before this loop is entered.
pthread_mutex_lock blocks the thread until it acquires the lock, which means that one thread at a time can execute the code between the pthread_mutex_lock and pthread_cond_wait (the pre-pthread_cond_wait code).
pthread_cond_wait releases the lock, allowing some other thread to run the code between pthread_mutex_lock and pthread_cond_wait. Before pthread_cond_wait returns, it waits until it can acquire the lock again. This step is repeated each time around the loop, for as long as (a == b || c != d) holds.
pthread_mutex_unlock is later called when the task is complete. Until then, only one thread at a time can execute the code between the pthread_cond_wait and the pthread_mutex_unlock (the post-pthread_cond_wait code). In addition, if one thread is running pre-pthread_cond_wait code then no other thread can be running post-pthread_cond_wait code, and vice versa.
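To complete the picture, the thread that changes the condition would signal the waiters; a minimal sketch reusing the question's placeholder names (a, b, c, d, open, mylock):

/* Signal side: change the condition under the lock, then wake waiters */
pthread_mutex_lock(&mylock);
a = b + 1;   /* now a != b ... */
c = d;       /* ... and c == d, so the wait condition is false */
pthread_cond_broadcast(&open);  /* wake all waiters so they re-check */
pthread_mutex_unlock(&mylock);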
Hence, you might as well be running single-threaded code that stores jobs in a priority queue. At least you wouldn't have the unnecessary and excessive context switches. As I said earlier, "Solve your problem with a single thread". You can't make meaningful statements about how much time an optimisation saves until you have something to measure it against.
Also, since I know how long a thread's job will take, is it more efficient to enforce a shortest-job-first scheduling policy? Or does that not matter, since in any order of threads doing the job the program will take the same total time to finish? In other words, does shortest job first lower the CPU overhead of the threads doing the waiting? It does seem to lower waiting times.
If you're going to enforce a scheduling policy, then do it in a single-threaded project. If you believe that concurrency will help you solve your problem quickly, then expose your completed single-threaded project to concurrency and derive tests to verify your beliefs. I suggest exposing concurrency in ways that threads don't have to share work.
Pthread primitives are generally fairly efficient; things that block usually consume no or negligible CPU time while blocking. If you are having performance problems, look elsewhere first.
Don't worry about the scheduling policy. If your application is designed such that only one thread can run at a time, you are losing most of the benefits of being threaded in the first place while imposing all of the costs. (And if you're not imposing all the costs, like locking shared variables because only one thread is running at a time, you're asking for trouble down the road.)

Why is "sleeping" not allowed while holding a spinlock? [duplicate]

Possible Duplicate:
Why can't you sleep while holding spinlock?
As far as I know, spinlocks should be held only for a short duration, and they are the only choice in code such as interrupt handlers, where sleeping (preemption) is not allowed.
However, I do not know why there is such a "rule" that there SHOULD BE no sleeping at all while holding a spinlock. I know that it is not a recommended practice (since it is detrimental to performance), but I see no reason why sleeping SHOULD NOT be allowed while holding a spinlock.
You cannot hold a spin lock while you acquire a semaphore, because you might have to sleep while waiting for the semaphore, and you cannot sleep while holding a spin lock (from "Linux Kernel Development" by Robert Love).
The only reason I can see is portability: on uniprocessors, spinlocks are implemented by disabling interrupts, and with interrupts disabled sleeping is of course not allowed (but sleeping would not break code on SMP systems).
But I am wondering if my reasoning is correct or if there are any other reasons.
There are several reasons why, at least in Linux, sleeping in spinlocks is not allowed:
If thread A sleeps in a spinlock, and thread B then tries to acquire the same spinlock, a uniprocessor system will deadlock. Thread B will never go to sleep (because spinlocks don't have the waitlist necessary to awaken B when A is done), and thread A will never get a chance to wake up.
Spinlocks are used over semaphores precisely because they're more efficient - provided you do not contend for long. Allowing sleeping means that you will have long contention periods, erasing all the benefit of using a spinlock. Your system would be faster just using a semaphore in this case.
Spinlocks are often used to synchronize with interrupt handlers, by additionally disabling interrupts. This use case is not possible if you sleep (once you enter the interrupt handler, you cannot switch back to the thread to let it wake up and finish its spinlock critical section).
Use the right tool for the right job - if you need to sleep, semaphores and mutexes are your friends.
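To make the interrupt-handler point concrete, the usual Linux kernel idiom combines the spinlock with disabling local interrupts (a sketch; my_lock and touch_shared_state are hypothetical names):

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);

void touch_shared_state(void) {
    unsigned long flags;

    /* take the lock and disable local interrupts; the critical section
       must be short and must not sleep */
    spin_lock_irqsave(&my_lock, flags);
    /* ... touch data shared with an interrupt handler ... */
    spin_unlock_irqrestore(&my_lock, flags);
}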
Actually, you can sleep with interrupts disabled or some other sort of exclusion active. If you don't, the condition for which you are sleeping could change state due to an interrupt and then you would never wake up. The sleep code would normally never be entered without an elevated priority or some other critical section that encloses the execution path between the decision to sleep and the context switch.
But for spinlocks, sleep is a disaster, as the lock stays set. Other threads will spin when they hit it, and they won't stop spinning until you wake up from the sleep. That could be an eternity compared to the handful of spins expected in the worst case at a spinlock, because spinlocks exist just to synchronize access to memory locations; they aren't supposed to interact with the context-switching mechanism. (For that matter, every other thread might eventually hit the spinlock, and then you would have wedged every thread of every core of the entire system.)
You cannot, when you use a spin lock as it is meant to be used. Spin locks are used where really necessary to protect critical regions and shared data structures. If you acquire one while also holding a semaphore, you lock access to whichever critical region your lock protects (it is typically a member of a specific larger data structure), while allowing this process to possibly be put to sleep. If, say, an IRQ is raised while this process sleeps, and the IRQ handler needs access to the critical region that is still locked away, it gets blocked, and blocking must never happen in IRQ context. Obviously, you could make up examples where your spin lock isn't used the way it should be (a hypothetical spin lock protecting a nop loop, say); but that's simply not a real spin lock as found in Linux kernels.

Implementing mutex in a user level thread library

I am developing a user-level thread library as part of a project. I came up with an approach to implement a mutex, and I would like your views before going on with it. Basically, I need to implement just 3 functions in my library:
mutex_init, mutex_lock and mutex_unlock
I thought my mutex_t structure would look something like
typedef struct
{
    int available;              // indicates whether the mutex is locked or unlocked
    queue listofwaitingthreads;
    gtthread_t owningthread;
} mutex_t;
In my mutex_lock function, I will first check in a while loop whether the mutex is available. If it is not, I will yield the processor so the next thread can execute.
In my mutex_unlock function, I will check whether the owning thread is the current thread. If it is, I will set available to 0.
Is this the way to go about it? Also, what about deadlock? Should I take care of those conditions in my user-level library, or should I leave it to application programmers to write their code properly?
This won't work, because you have a race condition. If two threads try to take the lock at the same time, both will see available == 0, and both will think they succeeded in taking the mutex.
If you want to do this properly, and without using an already-existing lock, you must use hardware atomic operations such as TAS (test-and-set), CAS (compare-and-swap), etc.
There are algorithms that give you mutual exclusion without such hardware support, but they make assumptions that are often false. For more details about this, I highly recommend reading Herlihy and Shavit's The Art of Multiprocessor Programming, chapter 7.
You shouldn't worry about deadlocks at this level - mutex locks should be simple enough, and there is some assumption that the programmer using them takes care not to cause deadlocks (advanced mutexes can check for self-deadlock, meaning a thread that calls lock twice without calling unlock in between).
Not only do you have to use atomic operations to read and modify the flag (as Eran pointed out), you also have to make sure that your queue can handle concurrent accesses. This is not completely trivial - a sort of chicken-and-egg problem.
But if you'd really implement this by spinning, you wouldn't even need such a queue. The access order to the lock would then be mostly random, though.
Just yielding would probably also not be enough; that can become quite costly if threads hold the lock for more than a few processor cycles. Consider using nanosleep with a small time value for the wait.
In general, a mutex implementation should look like:
Lock:
    while (trylock() == failed) {
        atomic_inc(waiter_cnt);
        atomic_sleep_if_locked();
        atomic_dec(waiter_cnt);
    }

Trylock:
    return atomic_swap(&lock, 1);

Unlock:
    atomic_store(&lock, 0);
    if (waiter_cnt) wakeup_sleepers();
Things get more complex if you want recursive mutexes, mutexes that can synchronize their own destruction (i.e. freeing the mutex is safe as soon as you get the lock), etc.
Note that atomic_sleep_if_locked and wakeup_sleepers correspond to FUTEX_WAIT and FUTEX_WAKE ops on Linux. The other atomics are probably CPU instructions, but could be system calls or kernel-assisted userspace function code, as in the case of Linux/ARM and the 0xffff0fc0 atomic compare-and-swap call.
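To make the pseudocode concrete, here is a minimal sketch of it on Linux using the futex system call and GCC/Clang atomic builtins. The wrapper names (futex_wait, futex_wake, my_lock, my_unlock) are hypothetical, and a production lock would handle more corner cases (the retry loop already tolerates spurious wakeups):

#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static uint32_t lock_word = 0;   /* 0 = unlocked, 1 = locked */
static uint32_t waiter_cnt = 0;

/* Sleep only if *addr still equals expected (atomic check-and-sleep) */
static void futex_wait(uint32_t *addr, uint32_t expected) {
    syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

static void futex_wake(uint32_t *addr, int nwaiters) {
    syscall(SYS_futex, addr, FUTEX_WAKE, nwaiters, NULL, NULL, 0);
}

void my_lock(void) {
    /* trylock: atomic swap; an old value of 0 means we took the lock */
    while (__atomic_exchange_n(&lock_word, 1, __ATOMIC_ACQUIRE) != 0) {
        __atomic_fetch_add(&waiter_cnt, 1, __ATOMIC_RELAXED);
        futex_wait(&lock_word, 1);   /* returns at once if already 0 */
        __atomic_fetch_sub(&waiter_cnt, 1, __ATOMIC_RELAXED);
    }
}

void my_unlock(void) {
    __atomic_store_n(&lock_word, 0, __ATOMIC_RELEASE);
    if (__atomic_load_n(&waiter_cnt, __ATOMIC_RELAXED) != 0)
        futex_wake(&lock_word, 1);
}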
You do not need atomic instructions for a user level thread library, because all the threads are going to be user level threads of the same process. So actually when your process is given the time slice to execute, you are running multiple threads during that time slice but on the same processor. So, no two threads are going to be in the library function at the same time. Considering that the functions for mutex are already in the library, mutual exclusion is guaranteed.

When is pthread_spin_lock the right thing to use (over e.g. a pthread mutex)?

Given that pthread_spin_lock is available, when would I use it, and when should one not use it?
i.e. how would I decide to protect some shared data structure with either a pthread mutex or a pthread spinlock ?
The short answer is that a spinlock can be better when you plan to hold the lock for an extremely short interval (for example to do nothing but increment a counter), and contention is expected to be rare, but the operation is occurring often enough to be a potential performance bottleneck. The advantages of a spinlock over a mutex are:
On unlock, there is no need to check if other threads may be waiting for the lock and waking them up. Unlocking is simply a single atomic write instruction.
Failure to immediately obtain the lock does not put your thread to sleep, so it may be able to obtain the lock with much lower latency as soon as the lock becomes available.
There is no risk of cache pollution from entering kernelspace to sleep or wake other threads.
Point 1 will always stand, but points 2 and 3 are of somewhat diminished usefulness if you consider that good mutex implementations will probably spin a decent number of times before asking the kernel for help with waiting.
Now, the long answer:
What you need to ask yourself before using spinlocks is whether these potential advantages outweigh one rare but very real disadvantage: what happens when the thread that holds the lock gets interrupted by the scheduler before it can release the lock. This is of course rare, but it can happen even if the lock is just held for a single variable-increment operation or something else equally trivial. In this case, any other threads attempting to obtain the lock will keep spinning until the thread that holds the lock gets scheduled and has a chance to release it. This may never happen if the threads trying to obtain the lock have higher priorities than the thread that holds it. That may be an extreme case, but even without different priorities in play, there can be very long delays before the lock owner gets scheduled again; and worst of all, once this situation begins, it can quickly escalate as many threads, all hoping to get the lock, begin spinning on it, tying up more processor time and further delaying the scheduling of the one thread that could release the lock.
As such, I would be careful with spinlocks... :-)
The spinlock is a "busy waiting" lock. Its main advantage is that it keeps the thread active and won't cause a context switch, so if you know that you will only be waiting for a very short time (because your critical operation is very quick), it may give better performance than a mutex. Conversely, a mutex puts less demand on the system when the critical section takes a long time and a context switch is desirable.
TL;DR: It depends.
The safest method with a performance boost is a hybrid of the two: an adaptive mutex.
When your system has multiple cores you spin for a few thousand cycles to capture the best case of low or no contention, then defer to a full mutex to yield to other threads for long contended locks.
Both glibc's pthreads (PTHREAD_MUTEX_ADAPTIVE_NP) and Win32 (SetCriticalSectionSpinCount) offer adaptive mutexes; many platforms don't have a POSIX spinlock API at all.
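A sketch of requesting the adaptive type from glibc (the _NP suffix marks it as non-portable, so guard its use appropriately):

#define _GNU_SOURCE
#include <pthread.h>

static pthread_mutex_t m;

void init_adaptive_mutex(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* spin briefly on contention before sleeping in the kernel */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);
}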
Spinlocks are only of interest in a multiprocessor (MP) context. They are used to execute pseudo-atomic tasks. On a uniprocessor system the principle is the following:
Lock the scheduler (if the task deals with interrupts, lock interrupts instead)
Do my atomic task
Unlock the scheduler
But on MP systems we have no guarantee that another core will not execute another thread that could enter our code section. To prevent this, the spinlock was created; its purpose is to make the other cores wait, preventing concurrency issues. The critical section becomes:
Lock the scheduler
SpinLock (prevent entering of other cores)
My task
SpinUnlock
Task Unlock
If the task lock is omitted, then while the holder is scheduled out, another thread could try to enter the section and will loop at 100% CPU waiting for the next scheduling. If this spinning task is a high-priority one, it will produce a deadlock.
