thread overhead performance - c

I am programming with threads in C on Linux, and I am trying to reduce the thread overhead, basically to lower CPU time and make the program more efficient.
In the program, lots of threads are created, and each needs to do a job before it terminates. Only one thread can do the job at a time because of mutual exclusion.
I know how long a thread will take to complete its job before it starts.
Other threads have to wait while a thread is doing that job. They check whether they can proceed by waiting on a condition variable.
The waiting threads use this specific code to wait on that condition variable (a, b, c, and d are just arbitrary placeholders; this is only an example):
while (a == b || c != d) {
    pthread_cond_wait(&open, &mylock);
}
How efficient is this? What is happening inside pthread_cond_wait? Is it a while loop behind the scenes that constantly checks the condition variable?
Also, since I know how long each job will take, is it more efficient to enforce a shortest-job-first scheduling policy? Or does that not matter, since the program will take the same total time to finish no matter in which order the threads do the job? In other words, does shortest-job-first lower the CPU overhead for the threads that are waiting, given that it seems to reduce waiting times?

Solve your problem with a single thread first, and then ask for help identifying the best places to expose parallelism if you can't already see an avenue that requires the least locking. The optimal number of threads to use will depend on the computer you use. It doesn't make much sense to use more than n+1 threads, where n is the number of processors/cores available to your program. To reduce thread-creation overhead, it's a good idea to give each thread multiple jobs.
The following is in response to your clarification edit:
In the program, lots of threads are created, and each needs to do a job before it terminates. Only one thread can do the job at a time because of mutual exclusion.
No. At most n+1 threads should be created, as described above. What do you mean by mutual exclusion? I consider mutual exclusion to be "Only one thread includes task x in its work queue". This means that no other threads require locking on task x.
Other threads have to wait while a thread is doing that job. They check whether they can proceed by waiting on a condition variable.
Give each thread an independent list of tasks to complete. If job x is a prerequisite to job y, then job x and job y would ideally be in the same list so that the thread doesn't have to deal with thread mutex objects on either job. Have you explored this avenue?
while (a == b || c != d) {
    pthread_cond_wait(&open, &mylock);
}
How efficient is this? What is happening inside pthread_cond_wait? Is it a while loop behind the scenes that constantly checks the condition variable?
In order to avoid undefined behaviour, mylock must be locked by the current thread before calling pthread_cond_wait, so I presume your code calls pthread_mutex_lock to acquire the mylock lock before this loop is entered.
pthread_mutex_lock blocks the thread until it acquires the lock, which means that one thread at a time can execute the code between the pthread_mutex_lock and pthread_cond_wait (the pre-pthread_cond_wait code).
pthread_cond_wait releases the lock, allowing some other thread to run the code between the pthread_mutex_lock and the pthread_cond_wait. Before pthread_cond_wait returns, it waits until it can acquire the lock again. This step repeats for as long as (a == b || c != d) remains true.
pthread_mutex_unlock is later called when the task is complete. Until then, only one thread at a time can execute the code between the pthread_cond_wait and the pthread_mutex_unlock (the post-pthread_cond_wait code). In addition, if one thread is running pre-pthread_cond_wait code then no other thread can be running post-pthread_cond_wait code, and vice versa.
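For concreteness, here is a minimal sketch of the full pattern being described, with the waiting and signalling sides shown together. The condition variable is renamed open_cv here only to avoid clashing with the open() function, and the job itself is indicated by comments; this is an illustration, not your actual code:

#include <pthread.h>

static pthread_mutex_t mylock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  open_cv = PTHREAD_COND_INITIALIZER;   /* "open" in the question */
static int a, b, c, d;            /* the arbitrary condition from the question */

static void do_the_job(void)
{
    pthread_mutex_lock(&mylock);              /* pre-pthread_cond_wait code */
    while (a == b || c != d)                  /* re-checked after every wakeup */
        pthread_cond_wait(&open_cv, &mylock);

    /* ... the job runs here with mylock still held,
       so only one thread at a time ever reaches this point ... */

    /* ... update a/b/c/d so that another waiter's condition can change ... */
    pthread_cond_signal(&open_cv);            /* wake one of the waiters */
    pthread_mutex_unlock(&mylock);            /* task complete */
}

There is no hidden busy loop inside pthread_cond_wait itself: the waiting thread is blocked by the kernel and consumes essentially no CPU until it is signalled.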
Hence, you might as well be running single-threaded code that stores jobs in a priority queue. At least you wouldn't have the unnecessary and excessive context switches. As I said earlier, "Solve your problem with a single thread". You can't make meaningful statements about how much time an optimisation saves until you have something to measure it against.
Also, since I know how long each job will take, is it more efficient to enforce a shortest-job-first scheduling policy? Or does that not matter, since the program will take the same total time to finish no matter in which order the threads do the job? In other words, does shortest-job-first lower the CPU overhead for the threads that are waiting, given that it seems to reduce waiting times?
If you're going to enforce a scheduling policy, then do it in a single-threaded project. If you believe that concurrency will help you solve your problem quickly, then expose your completed single-threaded project to concurrency and derive tests to verify your beliefs. I suggest exposing concurrency in ways where threads don't have to share work.

Pthread primitives are generally fairly efficient; things that block usually consume no or negligible CPU time while blocking. If you are having performance problems, look elsewhere first.
Don't worry about the scheduling policy. If your application is designed such that only one thread can run at a time, you are losing most of the benefits of being threaded in the first place while imposing all of the costs. (And if you skip some of those costs, for example by not locking shared variables because only one thread runs at a time, you're asking for trouble down the road.)

Related

Priority based multithreading?

I have written code for two threads, one assigned priority 20 (lower) and the other priority 10 (higher). Upon executing my code, about 70% of the time I get the expected result, i.e. the high_prio thread (priority 10) executes first and then low_prio (priority 20).
Why does my code not produce the expected result in 100% of the executions? Is there a conceptual mistake I am making?
void *low_prio(){
    Something here;
}

void *high_prio(){
    Something here;
}

int main(){
    Thread with priority 10 calls high_prio;
    Thread with priority 20 calls low_prio;
    return 0;
}
Is there a conceptual mistake I am making?
Yes — you have an incorrect expectation regarding what thread priorities do. Thread priorities are not meant to force one thread to execute before another thread.
In fact, in a scenario where there is no CPU contention (i.e. where there are always at least as many CPU cores available as there are threads that currently want to execute), thread priorities will have no effect at all -- because there would be no benefit to forcing a low-priority thread not to run when there is a CPU core available for it to run on. In this no-contention scenario, all of the threads will get to run simultaneously and continuously for as long as they want to.
The only time thread priorities may make a difference is when there is CPU contention -- i.e. there are more threads that want to run than there are CPU cores available to run them. At that point, the OS's thread-scheduler has to make a decision about which thread will get to run and which thread will have to wait for a while. In this instance, thread priorities can be used to indicate to the scheduler which thread it should prefer to allow to run.
Note that it's even more complicated than that, however -- for example, in your posted program, both of your threads are calling printf() rather a lot, and printf() invokes I/O, which means that the thread may be temporarily put to sleep while the I/O (e.g. to your Terminal window, or to a file if you have redirected stdout to file) completes. And while that thread is sleeping, the thread-scheduler can take advantage of the now-available CPU core to let another thread run, even if that other thread is of lower priority. Later, when the I/O operation completes, your high-priority thread will be re-awoken and re-assigned to a CPU core (possibly "bumping" a low-priority thread off of that core in order to get it).
Note that inconsistent results are normal for multithreaded programs -- threads are inherently non-deterministic, since their execution patterns are determined by the thread-scheduler's decisions, which in turn are determined by lots of factors (e.g. what other programs are running on the computer at the time, the system clock's granularity, etc).
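For reference, if you do want to hint to the scheduler, thread priority is normally set through the creation attributes. The following is a minimal, hedged sketch, not your program: the helper create_with_priority is invented for illustration, under SCHED_FIFO a larger number means a higher priority, and setting a real-time policy usually requires elevated privileges (otherwise pthread_create returns EPERM):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static void *high_prio(void *arg) { (void)arg; puts("high_prio running"); return NULL; }
static void *low_prio(void *arg)  { (void)arg; puts("low_prio running");  return NULL; }

/* Invented helper: create a thread with an explicit SCHED_FIFO priority. */
static int create_with_priority(pthread_t *t, void *(*fn)(void *), int prio)
{
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = prio };

    pthread_attr_init(&attr);
    /* Without EXPLICIT_SCHED the attribute values below are ignored and
       the new thread inherits the creator's policy and priority. */
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);
    return pthread_create(t, &attr, fn, NULL);
}

int main(void)
{
    pthread_t hi, lo;
    int rv1 = create_with_priority(&hi, high_prio, sched_get_priority_max(SCHED_FIFO));
    int rv2 = create_with_priority(&lo, low_prio,  sched_get_priority_min(SCHED_FIFO));

    if (rv1 != 0) fprintf(stderr, "high_prio: %s\n", strerror(rv1));
    else          pthread_join(hi, NULL);
    if (rv2 != 0) fprintf(stderr, "low_prio: %s\n", strerror(rv2));
    else          pthread_join(lo, NULL);
    return 0;
}

Even with this, the answer above still applies: on an otherwise idle multi-core machine both threads simply run, and the priorities change nothing.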

Lock that handles a high-contention, high-frequency situation

I am looking for a lock implementation that degrades gracefully in the situation where you have two threads that constantly try to release and re-acquire the same lock, at a very high frequency.
Of course it is clear that in this case the two threads won't significantly progress in parallel. Theoretically, the best result would be achieved by running the whole of thread 1 and then the whole of thread 2, without any switching---because switching just creates massive overhead here. So I am looking for a lock implementation that would handle this situation gracefully by keeping the same thread running for a while before switching, instead of constantly switching.
Long version of the question
As I would myself be tempted to answer this question by "your program is broken, don't do that", here is some justification about why we end up in this kind of situation.
The lock is a "single global lock", i.e. a very coarse lock. (It is the Global Interpreter Lock (GIL) inside PyPy, but the question is about how to do it in general, say if you have a C program.)
We have the following situation:
There is constantly contention. That's expected in this case: the lock is a global lock that needs to be acquired for most threads to progress. So we expect that a large fraction of them are waiting for the lock. Only one of these threads can progress.
The thread that holds the lock might sometimes do bursts of short releases. A typical example would be if this thread does repeated calls to "something external", e.g. many short writes to a file. Each of these writes is usually completed very quickly. The lock still has to be released just in case this external thing turns out to take longer than expected (e.g. if the write actually needs to wait for disk I/O), so that another thread can acquire the lock in this case.
If we use some standard mutex for the lock, then the lock will often switch to another thread as soon as the owner releases it. But the problem is: what if the program runs several threads that each want to do a long burst of short releases? The program ends up spending most of its time switching the lock between CPUs.
It is much faster to run the same thread for a while before switching, at least as long as the lock is released for very short periods of time. (E.g. on Linux/pthread a release immediately followed by an acquire will sometimes re-acquire the lock instantly even if there are other waiting threads; but we'd like this result in a large majority of cases, not just sometimes.)
Of course, as soon as the lock is released for a longer period of time, then it becomes a good idea to transfer ownership of the lock to a different thread.
So I'm looking for general ideas about how to do that. I guess it should exist already somewhere---in a paper, or in some multithreading library?
For reference, PyPy tries to implement something like this by polling: the lock is just a global variable, with synchronized compare-and-swap but no OS calls; one of the waiting threads is given the role of "stealer"; that "stealer" thread wakes up every 100 microseconds to check the variable. This is not horribly bad (it costs maybe 1-2% of CPU time in addition to the 100% consumed by the running thread). This actually implements what I'm asking for here, but the problem is that this is a hack that doesn't cleanly support more traditional cases of locks: for example, if thread 1 tries to send a message to thread 2 and wait for the answer, the two thread switches will take on average 100 microseconds each---which is far too much if the message is processed quickly.
For reference, let me describe how we finally implemented it. I was unsure about it as it still feels like a hack, but it seems to work for PyPy's use case in practice.
We did it as described in the last paragraph of the question, with one addition: the "stealer" thread, which checks some global variable every 100 microseconds, does this by calling pthread_cond_timedwait or WaitForSingleObject with a regular, system-provided mutex, with a timeout of 100 microseconds. This gives a "composite lock" made of both the global variable and the regular mutex. The "stealer" will succeed in stealing the "lock" either when it notices a value of 0 in the global variable (checked every 100 microseconds), or immediately if the regular mutex is released by another thread.
It's then a matter of choosing how to release the composite lock on a case-by-case basis. Most external functions (writes to files, etc.) are expected to generally complete quickly, so we release and re-acquire the composite lock by writing to the global variable. Only around a few specific functions---like sleep() or lock_acquire()---do we expect the calling thread to block often; around these, we release the composite lock by actually releasing the mutex instead.
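To make the idea a bit more concrete, here is a rough, partial sketch of such a "composite lock" on Linux. All names are invented, error handling is omitted, and the real PyPy code is certainly different; it only illustrates the two release paths and the 100-microsecond stealer described above:

#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int gil = 0;                       /* 0 = free, 1 = held */
static pthread_mutex_t slow_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  slow_cond  = PTHREAD_COND_INITIALIZER;

/* Fast release/re-acquire around a call expected to finish quickly:
   just flip the global flag, no system call in the common case. */
static void fast_release(void) { atomic_store(&gil, 0); }
static int  fast_acquire(void)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&gil, &expected, 1);
}

/* The designated "stealer" blocks on the condition variable but wakes
   every 100 microseconds to see whether the flag was dropped. */
static void stealer_acquire(void)
{
    pthread_mutex_lock(&slow_mutex);
    while (!fast_acquire()) {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_nsec += 100 * 1000;                /* 100 microseconds */
        if (ts.tv_nsec >= 1000000000L) { ts.tv_sec++; ts.tv_nsec -= 1000000000L; }
        pthread_cond_timedwait(&slow_cond, &slow_mutex, &ts);
    }
    pthread_mutex_unlock(&slow_mutex);
}

/* Around calls expected to block for a long time (sleep(), acquiring
   another lock, ...), release "for real" and wake the stealer at once. */
static void slow_release(void)
{
    atomic_store(&gil, 0);
    pthread_mutex_lock(&slow_mutex);
    pthread_cond_signal(&slow_cond);
    pthread_mutex_unlock(&slow_mutex);
}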
If I understand the problem statement, you are asking the kernel scheduler to make an educated guess about whether your userspace application's "hot" thread will try to reacquire the lock in the very near future, to avoid implicitly preempting it by allowing a "not-so-hot" thread to acquire the mutex.
I wouldn't know how the kernel could do that. The only two things that come to my mind:
Do not release the mutex unless the hot thread is actually transitioning to idle (an application-specific condition). On Linux you can use CLOCK_MONOTONIC_COARSE to reduce the overhead of reading the clock if you implement some sort of timer to detect this.
Increase the hot thread's priority. This is more of a mitigation strategy, an attempt to reduce how often the hot thread is preempted. If the "hot" thread can be identified, you could do something like:
pthread_t thread = pthread_self();

// Set max priority, FIFO policy
struct sched_param params;
params.sched_priority = sched_get_priority_max(SCHED_FIFO);

int rv = pthread_setschedparam(thread, SCHED_FIFO, &params);
if (rv != 0) {
    // Print error
    // ...
}
A spinlock might work better in your case. Spinlocks avoid context switching and are highly efficient if the threads are likely to hold the lock only for a short duration.
For this very reason, they are widely used in OS kernels.
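A minimal user-space spinlock sketch with C11 atomics, purely for illustration (a real implementation would also consider back-off and the risk of spinning while the lock holder is preempted):

#include <stdatomic.h>

static atomic_flag spin = ATOMIC_FLAG_INIT;

static void spin_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&spin, memory_order_acquire))
        ;   /* busy-wait: only sensible for very short critical sections */
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&spin, memory_order_release);
}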

Implementing mutex in a user level thread library

I am developing a user-level thread library as part of a project. I came up with an approach to implement a mutex. I would like to hear your views before going on with it. Basically, I need to implement just 3 functions in my library:
mutex_init, mutex_lock and mutex_unlock
I thought my mutex_t structure would look something like
typedef struct
{
    int available;               // indicates whether the mutex is locked or unlocked
    queue listofwaitingthreads;
    gtthread_t owningthread;
} mutex_t;
In my mutex_lock function, I will first check in a while loop whether the mutex is available. If it is not, I will yield the processor so the next thread can execute.
In my mutex_unlock function, I will check if the owning thread is the current thread. If it is, I will set available to 0.
Is this the way to go about it? Also, what about deadlock? Should I take care of those conditions in my user-level library, or should I leave it to the application programmers to write their code properly?
This won't work, because you have a race condition. If 2 threads try to grab the lock at the same time, both will see available == 0, and both will think they succeeded in taking the mutex.
If you want to do this properly, and without using an already-existing lock, you must use hardware atomic operations such as TAS, CAS, etc.
There are algorithms that give you mutual exclusion without such hardware support, but they make assumptions that are often false in practice. For more details, I highly recommend reading chapter 7 of Herlihy and Shavit's The Art of Multiprocessor Programming.
You shouldn't worry about deadlocks at this level - mutex locks should be kept simple, and there is an assumption that the programmer using them will take care not to cause deadlocks (advanced mutexes can detect self-deadlock, i.e. a thread that calls lock twice without calling unlock in between).
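To illustrate the point about atomics, here is a hedged sketch of how the proposed mutex_lock/mutex_unlock could avoid the race by using an atomic exchange instead of a plain read of available. The structure is simplified to a single field, and gtthread_yield() is an assumed name for the library's own yield call:

#include <stdatomic.h>

void gtthread_yield(void);      /* assumed: the library's own yield function */

typedef struct {
    atomic_int locked;          /* 0 = unlocked, 1 = locked */
} mutex_t;

void mutex_init(mutex_t *m)
{
    atomic_init(&m->locked, 0);
}

void mutex_lock(mutex_t *m)
{
    /* The exchange reads the old value and writes 1 in one atomic step,
       so two threads can never both observe "unlocked" and proceed. */
    while (atomic_exchange(&m->locked, 1) == 1)
        gtthread_yield();       /* give the CPU back until the holder unlocks */
}

void mutex_unlock(mutex_t *m)
{
    atomic_store(&m->locked, 0);
}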
Not only do you have to use atomic operations to read and modify the flag (as Eran pointed out), you also have to make sure your queue can handle concurrent accesses. That is not completely trivial, a sort of chicken-and-egg problem.
But if you really implemented this by spinning, you wouldn't even need such a queue. The access order to the lock would then be essentially random, though.
Just yielding would probably not be enough either; that can be quite costly if threads hold the lock for more than a few processor cycles. Consider using nanosleep with a small time value for the wait.
In general, a mutex implementation should look like:
Lock:
    while (trylock() == failed) {
        atomic_inc(waiter_cnt);
        atomic_sleep_if_locked();
        atomic_dec(waiter_cnt);
    }

Trylock:
    return atomic_swap(&lock, 1);

Unlock:
    atomic_store(&lock, 0);
    if (waiter_cnt) wakeup_sleepers();
Things get more complex if you want recursive mutexes, mutexes that can synchronize their own destruction (i.e. freeing the mutex is safe as soon as you get the lock), etc.
Note that atomic_sleep_if_locked and wakeup_sleepers correspond to FUTEX_WAIT and FUTEX_WAKE ops on Linux. The other atomics are probably CPU instructions, but could be system calls or kernel-assisted userspace function code, as in the case of Linux/ARM and the 0xffff0fc0 atomic compare-and-swap call.
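For concreteness, here is a minimal sketch of that pseudocode using the Linux futex system call directly (there is no glibc wrapper for it). The struct name, the helper, and the three-state encoding (0 = unlocked, 1 = locked, 2 = locked with possible waiters) are illustrative; the scheme follows Ulrich Drepper's "Futexes Are Tricky" paper rather than any particular library:

#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct { uint32_t state; } ftx_mutex_t;   /* state: 0, 1 or 2 as above */

static long futex(uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void ftx_lock(ftx_mutex_t *m)
{
    uint32_t c = 0;
    /* Fast path: 0 -> 1 with a single CAS, no system call. */
    if (__atomic_compare_exchange_n(&m->state, &c, 1, 0,
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        return;
    /* Slow path: mark the mutex contended (2) and sleep until it reads 0. */
    if (c != 2)
        c = __atomic_exchange_n(&m->state, 2, __ATOMIC_ACQUIRE);
    while (c != 0) {
        futex(&m->state, FUTEX_WAIT_PRIVATE, 2);      /* sleep while state == 2 */
        c = __atomic_exchange_n(&m->state, 2, __ATOMIC_ACQUIRE);
    }
}

static void ftx_unlock(ftx_mutex_t *m)
{
    /* If the old state was 2, someone may be sleeping: wake one waiter. */
    if (__atomic_exchange_n(&m->state, 0, __ATOMIC_RELEASE) == 2)
        futex(&m->state, FUTEX_WAKE_PRIVATE, 1);
}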
You do not need atomic instructions for a user-level thread library, because all the threads are user-level threads of the same process. So when your process is given a time slice to execute, you are running multiple threads during that time slice, but on the same processor. Hence, no two threads are going to be inside the library function at the same time. Since the mutex functions are part of the library itself, mutual exclusion is guaranteed.

How to setup and manage persistent multiple threads?

I have POSIX in mind for implementation, though this question is more about architecture.
I am starting from an update loop that has several main jobs to do. I can group those jobs into four or five main tasks that have common memory-access requirements. My idea is to break those jobs off into their own threads and have each complete one cycle of "update" and then sleep until the next frame.
But how do I synchronize? Should I detach four or five threads at the start of each cycle, have them run once and die, and then detach another 4-5 threads on the next pass? That sounds expensive.
It sounds more reasonable to create these threads once, and have them go to sleep until a synchronized call wakes them up.
Is this a wise approach? I'm open to accepting responses from just ideas to implementations of any kind.
EDIT: based on the answers so far, I'd like to add:
concurrency is desired
these worker threads are intended to run for very short durations (<250 ms)
the work done by each thread will always be the same
I'm considering 4-5 threads, with 20 being a hard limit.
That depends on the granularity of the tasks that the threads are performing. If they're doing long tasks (e.g. a second or longer), then the cost of creating and destroying threads is negligible compared to the work the threads are doing, so I'd recommend keeping things simple and creating the threads on demand.
Conversely, if you have very short tasks (e.g. less than 10-100 ms or so), you will definitely start to notice the cost of creating and destroying lots of threads. In that case, yes, you should create the threads only once and have them sleep until work arrives for them. You'll want to use some sort of condition variable (e.g. pthread_cond_t) for this: the thread waits on the condition variable, and when work arrives, you signal the condition variable.
If you always have the same work to do every cycle, and you need to wait for all the work to finish before the next cycle starts, then you're thinking about the right solution.
You'll need some synchronization objects: a "start of frame semaphore", an "end of frame semaphore", and an "end of frame event". If you have n independent tasks each frame, start n threads, with loops that look like this (pseudocode):
while true:
    wait on "start of frame semaphore"
    <do work>
    enter lock
    decrement "worker count"
    if "worker count" = 0 then set "end of frame event"
    release lock
    wait on "end of frame semaphore"
You can then have a controller thread run:
while true:
    set "worker count" to n
    increment "start of frame semaphore" by n
    wait on "end of frame event"
    increment "end of frame semaphore" by n
This will work well for small n. If the number of tasks you need to complete each cycle becomes large, then you will probably want to use a thread pool coupled with a task queue, so that you don't overwhelm the system with threads. But there's more complexity with that solution, and with threading complexity is the enemy.
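Here is a hedged sketch of that scheme in C with POSIX semaphores, using a mutex/condition-variable pair as the "end of frame event". All names are invented, error checking is omitted, and the number of frames is fixed just to keep the example short:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 4                            /* number of worker threads */

static sem_t start_of_frame;           /* "start of frame semaphore" */
static sem_t end_of_frame;             /* "end of frame semaphore"   */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  frame_done = PTHREAD_COND_INITIALIZER;   /* "end of frame event" */
static int worker_count;               /* workers still busy this frame */

static void *worker(void *arg)
{
    long id = (long)arg;
    for (;;) {
        sem_wait(&start_of_frame);               /* wait on "start of frame semaphore" */

        printf("worker %ld doing its work\n", id);   /* <do work> */

        pthread_mutex_lock(&lock);               /* enter lock */
        if (--worker_count == 0)                 /* decrement "worker count" */
            pthread_cond_signal(&frame_done);    /* set "end of frame event" */
        pthread_mutex_unlock(&lock);             /* release lock */

        sem_wait(&end_of_frame);                 /* wait on "end of frame semaphore" */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[N];
    sem_init(&start_of_frame, 0, 0);
    sem_init(&end_of_frame, 0, 0);
    for (long i = 0; i < N; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    for (int frame = 0; frame < 3; frame++) {    /* controller: run three frames */
        pthread_mutex_lock(&lock);
        worker_count = N;                        /* set "worker count" to n */
        pthread_mutex_unlock(&lock);

        for (int i = 0; i < N; i++)              /* increment start semaphore by n */
            sem_post(&start_of_frame);

        pthread_mutex_lock(&lock);
        while (worker_count != 0)                /* wait on "end of frame event" */
            pthread_cond_wait(&frame_done, &lock);
        pthread_mutex_unlock(&lock);

        for (int i = 0; i < N; i++)              /* increment end semaphore by n */
            sem_post(&end_of_frame);
    }
    return 0;                                    /* process exit also ends the workers */
}

Each worker is created once, does its share of the frame, and then sleeps in sem_wait until the controller opens the next frame, which is exactly the "create once, wake per cycle" approach the question leans towards.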
The best is probably to use a task queue.
A task queue can be seen as a set of threads waiting for jobs to be submitted to them. If many jobs are submitted at once, they are executed in FIFO order.
That way, you maintain 4-5 threads, and each of them executes the jobs you feed it, without needing to spawn a new thread for each job.
The only problem is that I don't know many implementations of task queues in C. Apple has Grand Central Dispatch, which does just that; FreeBSD has an implementation of it too. Apart from those, I don't know of any others. (I didn't look very hard, though.)
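For illustration, a very small FIFO task queue can be built from a mutex, a condition variable and a linked list. This is a hedged sketch with invented names (task_t, queue_push, worker_main), not a reference to any existing library:

#include <pthread.h>
#include <stdlib.h>

typedef struct task {
    void (*fn)(void *);          /* the job to run */
    void *arg;
    struct task *next;
} task_t;

static task_t *head, *tail;      /* FIFO list of pending tasks */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

void queue_push(void (*fn)(void *), void *arg)
{
    task_t *t = malloc(sizeof *t);
    t->fn = fn; t->arg = arg; t->next = NULL;
    pthread_mutex_lock(&qlock);
    if (tail) tail->next = t; else head = t;
    tail = t;
    pthread_cond_signal(&qcond);      /* wake one idle worker */
    pthread_mutex_unlock(&qlock);
}

void *worker_main(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (head == NULL)          /* sleep until a task is submitted */
            pthread_cond_wait(&qcond, &qlock);
        task_t *t = head;
        head = t->next;
        if (head == NULL) tail = NULL;
        pthread_mutex_unlock(&qlock);

        t->fn(t->arg);                /* run the job outside the lock */
        free(t);
    }
    return NULL;
}

You would start your 4-5 worker threads once with pthread_create(&tid, NULL, worker_main, NULL) and then feed them with queue_push; a real version would also need a way to shut the workers down and to handle allocation failure.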
Your idea is known as a thread pool. Thread pools are found in the WinAPI, Intel TBB and Visual Studio's ConcRT; I don't know much about POSIX and therefore cannot point you at a particular implementation, but they are an excellent structure with many desirable properties, such as excellent scaling if the work being posted can be split up.
However, I wouldn't trivialize the time the work takes. If you have five tasks, and you have a performance issue so desperate that multiple threads are the key, then creating the threads is almost certainly a negligible problem.

Implementing a Priority queue with a Condition Variable in C

My current understanding of condition variables is that all blocked (waiting) threads are inserted into a basic FIFO queue, the first item of which is awakened when signal() is called.
Is there any way to modify this queue (or create a new structure) so that it behaves as a priority queue instead? I've been thinking about it for a while, but most solutions I come up with end up being hampered by the existing queue structure inherent to condition variables and mutexes.
Thanks!
I think you should rethink what you're trying to do. If you're trying to optimize your performance, you're probably barking up the wrong tree.
pthread_cond_signal() isn't even guaranteed to unblock exactly one thread -- it's guaranteed to unblock at least one thread, so your code better be able to handle the situation where multiple threads are unblocked simultaneously. The typical way to do this is for each thread to re-check the condition after becoming unblocked, and, if false, return to waiting again.
You could implement some sort of scheme where you kept your own priority queue of waiting threads, where each thread adds itself to that queue immediately before it begins waiting and checks the queue when it unblocks, but this would add a lot of complexity and a lot of potential for serious problems (race conditions, deadlocks, etc.). It would also add a non-trivial amount of overhead.
Also, what happens if a higher-priority thread starts waiting on a condition variable at the same moment that condition variable is being signalled? Who gets unblocked, the newly arrived high-priority thread or the former highest priority thread?
The order in which threads get unblocked is entirely dependent on the kernel's thread scheduler, so you are at its mercy. I wouldn't assume FIFO ordering, either.
Since condition variables are basically just a barrier, and you have no control over the queue of waiting threads, there's no real way to apply priorities. It's invalid to assume waiting threads will act in a FIFO manner.
With a combination of atomics, additional condition variables, and pre-knowledge of the threads/priorities involved, you could construct a solution where a signalled thread re-signals the master CV and then re-blocks on a priority CV, but it certainly wouldn't be a generic solution. That's also off the top of my head, so it might have some other flaw.
It's the scheduler that determines which thread will run. You can look at pthread_setschedparam and pthread_getschedparam and fiddle with the policies (SCHED_OTHER, SCHED_FIFO, or SCHED_RR) and the priorities. But it probably won't get you to where I suspect you want to go.
It sounds as if you want to make something predictable from the inherently non-deterministic. As Andrew notes, you might hack something together, but my guess is that this will lead to heartache, or a lot of code you will hate yourself for writing in six months (or both).
