How to set up and manage multiple persistent threads? - C

I have POSIX in mind for implementation, though this question is more about architecture.
I am starting from an update loop that has several main jobs to do. I can group those jobs into four or five main tasks that have common memory access requirements. It's my idea to break off those jobs into their own threads and have them complete one cycle of "update" and sleep until the next frame.
But how do I synchronize? Should I spawn four or five threads at the start of each cycle, have them run once and die, and then spawn another 4-5 threads on the next pass? That sounds expensive.
It sounds more reasonable to create these threads once and have them go to sleep until a synchronized call wakes them up.
Is this a wise approach? I'm open to responses of any kind, from general ideas to implementations.
EDIT: based on the answers so far, I'd like to add:
concurrency is desired
these worker threads are intended to run for very short durations (<250 ms)
the work done by each thread will always be the same
I'm considering 4-5 threads, with 20 being a hard limit.

That depends on the granularity of the tasks that the threads are performing. If they're doing long tasks (e.g. a second or longer), then the cost of creating and destroying threads is negligible compared to the work the threads are doing, so I'd recommend keeping things simple and creating the threads on demand.
Conversely, if you have very short tasks (e.g. less than 10-100 ms or so), you will definitely start to notice the cost of creating and destroying lots of threads. In that case, yes, you should create the threads only once and have them sleep until work arrives for them. You'll want to use some sort of condition variable (e.g. pthread_cond_t) for this: the thread waits on the condition variable, and when work arrives, you signal the condition variable.
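A minimal sketch of that wait/signal handshake, assuming a single boolean work flag (all names here are illustrative):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_ready = PTHREAD_COND_INITIALIZER;
static bool have_work = false;

/* Worker side: sleep until work arrives. */
void wait_for_work(void)
{
    pthread_mutex_lock(&lock);
    while (!have_work)                      /* loop guards against spurious wakeups */
        pthread_cond_wait(&work_ready, &lock);
    have_work = false;                      /* claim the work */
    pthread_mutex_unlock(&lock);
}

/* Producer side: hand over work and wake a sleeping worker. */
void submit_work(void)
{
    pthread_mutex_lock(&lock);
    have_work = true;
    pthread_cond_signal(&work_ready);
    pthread_mutex_unlock(&lock);
}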

If you always have the same work to do every cycle, and you need to wait for all the work to finish before the next cycle starts, then you're thinking about the right solution.
You'll need some synchronization objects: a "start of frame semaphore", an "end of frame semaphore", and an "end of frame event". If you have n independent tasks each frame, start n threads, with loops that look like this (pseudocode):
while true:
    wait on "start of frame semaphore"
    <do work>
    enter lock
    decrement "worker count"
    if "worker count" = 0 then set "end of frame event"
    release lock
    wait on "end of frame semaphore"
You can then have a controller thread run:
while true:
    set "worker count" to n
    increment "start of frame semaphore" by n
    wait on "end of frame event"
    increment "end of frame semaphore" by n
This will work well for small n. If the number of tasks you need to complete each cycle becomes large, then you will probably want to use a thread pool coupled with a task queue, so that you don't overwhelm the system with threads. But there's more complexity with that solution, and with threading, complexity is the enemy.

The best is probably to use a task queue.
Task queues can be seen as threads waiting for jobs to be submitted to them. If many jobs are submitted at once, they are executed in FIFO order.
That way, you maintain 4-5 threads, and each of them executes the jobs you feed it, without needing to spawn a new thread for each job.
The only problem is that I don't know of many implementations of task queues in C. Apple has Grand Central Dispatch, which does just that; FreeBSD has an implementation of it too. Apart from those, I don't know of any others. (I didn't look very hard, though.)
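To show how little is needed, here is a sketch of a minimal FIFO task queue in C with pthreads; the fixed ring-buffer size, the names, and the missing full-queue handling are all simplifications:

#include <pthread.h>

#define QSIZE 64

typedef void (*job_fn)(void *);

struct task_queue {
    struct { job_fn fn; void *arg; } jobs[QSIZE];  /* FIFO ring buffer */
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
};

/* Producer side: enqueue a job and wake one worker. */
void queue_submit(struct task_queue *q, job_fn fn, void *arg)
{
    pthread_mutex_lock(&q->lock);
    q->jobs[q->tail].fn = fn;
    q->jobs[q->tail].arg = arg;
    q->tail = (q->tail + 1) % QSIZE;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Each of the 4-5 worker threads loops here, executing jobs in FIFO order. */
void *queue_worker(void *arg)
{
    struct task_queue *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->not_empty, &q->lock);
        job_fn fn = q->jobs[q->head].fn;
        void *a = q->jobs[q->head].arg;
        q->head = (q->head + 1) % QSIZE;
        q->count--;
        pthread_mutex_unlock(&q->lock);
        fn(a);                                 /* run the job outside the lock */
    }
    return NULL;
}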

Your idea is known as a thread pool. Thread pools are found in the WinAPI, Intel TBB and Visual Studio's ConcRT; I don't know much about POSIX and therefore cannot help you there, but they are an excellent structure with many desirable properties, such as excellent scaling if the work being posted can be split up.
However, I wouldn't trivialize the time the work takes. If you have five tasks, and you have a performance issue so desperate that multiple threads are the key, then creating the threads is almost certainly a negligible problem.

Related

Can multithreading be coupled with event programming?

I have a single thread C program implemented using event driven programming - a callback triggers every time the event happens.
The callback takes way too long to execute (do a bunch of calculations) and this processing time is important. Currently is 500 microseconds and need it to be less than 100.
Most of the calculations are independent, can be done in parallel.
I have a machine with many cores and was wondering whether using multiple threads to do the calculations in parallel would be possible or helpful.
I think that an approach where I spawn multiple threads at the beginning of the callback and then hand the different calculations to them will not work well, because spawning the threads takes time.
Is it possible to have a few threads up, waiting to be used, and that every time the callback is triggered I can send the calculations there without having to generate the threads in each callback?
You can use a thread pool for this (often called a worker pool). The basic idea is to create some number of threads in advance and have them all sleep, waiting on a semaphore, whenever there is no work to do.
Your code will be simpler if you can get away with one thread for each processing task, but you can also implement it (carefully) with a queue, where each worker tries to handle the next job in the queue and then sleeps when the job queue is empty.
Either way, a single round of processing will look something like this:
1. assign or queue tasks to your worker pool
2. notify worker pool to wake up and begin processing tasks
3. wait for worker pool to signal all tasks complete (*)
(*) remember, "all tasks complete" is not the same as "task queue empty"
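A sketch of the completion check behind point (*), assuming the pool keeps an outstanding-task counter (the pool fields here are illustrative):

/* Wait on a counter of outstanding tasks, not on queue emptiness: a worker
 * may have dequeued the last job but still be running it. Workers decrement
 * tasks_outstanding and signal all_done when it reaches zero. */
pthread_mutex_lock(&pool.lock);
while (pool.tasks_outstanding > 0)
    pthread_cond_wait(&pool.all_done, &pool.lock);
pthread_mutex_unlock(&pool.lock);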
Now your main timing bottlenecks will depend on the mutex/semaphore implementation and your OS thread scheduler. It may be appropriate to set a high priority on all your worker threads.
If you have events at regular intervals, a common improvement to the above is to also double-buffer (i.e. output the result for the previous event, and assign the workers to begin processing input for the current event). To achieve that, you would move step 3 to happen before step 1.
This may or may not be suitable for your purposes. But it can provide some extra leeway with timing, if you're still having trouble processing fast enough. Try something simple first. Problems like this can get hairy very quickly when you start introducing extra requirements.

Thread overhead and performance

When programming in C using threads on Linux, I am trying to reduce the thread overhead, basically to lower CPU time (and make the program more efficient).
In the program, lots of threads are being created, and each needs to do a job before it terminates. Only one thread can do the job at a time because of mutual exclusion.
I know, before a thread starts, how long it will take to complete its job.
Other threads have to wait while another thread is doing the job. They check whether they can do the job by testing a condition and waiting on a condition variable.
The waiting threads wait using that condition variable, with this specific code (the a, b, c, and d are just arbitrary variables; this is just an example):
while (a == b || c != d) {
    pthread_cond_wait(&open, &mylock);
}
How efficient is this? What's happening inside pthread_cond_wait? Is it a while loop (behind the scenes) that constantly checks the condition variable?
Also, since I know how long each job will take, is it more efficient to enforce a shortest-job-first scheduling policy? Or does that not matter, since in any order of threads doing the job the program will take the same amount of time to finish? In other words, does shortest job first lower CPU overhead for the threads that are waiting, since it seems to lower waiting times?
Solve your problem with a single thread, and then ask us for help identifying the best place for exposing parallelisation if you can't already see an avenue where the least locking is required. The optimal number of threads to use will depend upon the computer you use. It doesn't make much sense to use more than n+1 threads, where n is the number of processors/cores available to your program. To reduce thread creation overhead, it's a good idea to give each thread multiple jobs.
The following is in response to your clarification edit:
In the program, lots of threads are being created, and each needs to do a job before it terminates. Only one thread can do the job at a time because of mutual exclusion.
No. At most n+1 threads should be created, as described above. What do you mean by mutual exclusion? I consider mutual exclusion to be "Only one thread includes task x in its work queue". This means that no other threads require locking on task x.
Other threads have to wait while another thread is doing the job. They check whether they can do the job by testing a condition and waiting on a condition variable.
Give each thread an independent list of tasks to complete. If job x is a prerequisite to job y, then job x and job y would ideally be in the same list so that the thread doesn't have to deal with thread mutex objects on either job. Have you explored this avenue?
while (a == b || c != d) {
    pthread_cond_wait(&open, &mylock);
}
How efficient is this? What's happening inside pthread_cond_wait? Is it a while loop (behind the scenes) that constantly checks the condition variable?
In order to avoid undefined behaviour, mylock must be locked by the current thread before calling pthread_cond_wait, so I presume your code calls pthread_mutex_lock to acquire the mylock lock before this loop is entered.
pthread_mutex_lock blocks the thread until it acquires the lock, which means that only one thread at a time can execute the code between the pthread_mutex_lock and pthread_cond_wait (the pre-pthread_cond_wait code).
pthread_cond_wait releases the lock, allowing some other thread to run the code between the pthread_mutex_lock and the pthread_cond_wait. Before pthread_cond_wait returns, it waits until it can acquire the lock again. These steps repeat for as long as (a == b || c != d) holds.
pthread_mutex_unlock is later called when the task is complete. Until then, only one thread at a time can execute the code between the pthread_cond_wait and the pthread_mutex_unlock (the post-pthread_cond_wait code). In addition, if one thread is running pre-pthread_cond_wait code then no other thread can be running post-pthread_cond_wait code, and vice versa.
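In code, the canonical shape of that pattern, using the names from the question, is:

pthread_mutex_lock(&mylock);
while (a == b || c != d)                 /* re-check the predicate after every wakeup */
    pthread_cond_wait(&open, &mylock);   /* atomically unlocks, sleeps, then relocks */
/* ... do the job while holding mylock ... */
pthread_mutex_unlock(&mylock);

/* Elsewhere, after another thread changes a, b, c or d under mylock: */
pthread_cond_broadcast(&open);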
Hence, you might as well be running single-threaded code that stores jobs in a priority queue. At least you wouldn't have the unnecessary and excessive context switches. As I said earlier, "Solve your problem with a single thread". You can't make meaningful statements about how much time an optimisation saves until you have something to measure it against.
Also, since I know how long each job will take, is it more efficient to enforce a shortest-job-first scheduling policy? Or does that not matter, since in any order of threads doing the job the program will take the same amount of time to finish? In other words, does shortest job first lower CPU overhead for the threads that are waiting, since it seems to lower waiting times?
If you're going to enforce a scheduling policy, then do it in a single-threaded project. If you believe that concurrency will help you solve your problem quickly, then expose your completed single-threaded project to concurrency and derive tests to verify your beliefs. I suggest exposing concurrency in ways that threads don't have to share work.
Pthread primitives are generally fairly efficient; things that block usually consume no or negligible CPU time while blocking. If you are having performance problems, look elsewhere first.
Don't worry about the scheduling policy. If your application is designed such that only one thread can run at a time, you are losing most of the benefits of being threaded in the first place while imposing all of the costs. (And if you're not imposing all the costs, like locking shared variables because only one thread is running at a time, you're asking for trouble down the road.)

How to reuse threads - pthreads, C

I am programming using pthreads in C.
I have a parent thread which needs to create 4 child threads with id 0, 1, 2, 3.
When the parent thread gets data, it will split the data and assign it to 4 separate context variables - one for each sub-thread.
The sub-threads have to process this data, and in the meantime the parent thread should wait on them.
Once these sub-threads have done executing, they will set the output in their corresponding context variables and wait(for reuse).
Once the parent thread knows that all these sub-threads have completed this round, it computes the global output and prints it out.
Now it waits for new data (the sub-threads are not killed yet; they are just waiting).
If the parent thread gets more data the above process is repeated - albeit with the already created 4 threads.
If the parent thread receives a kill command (assume a specific kind of data), it indicates to all the sub-threads and they terminate themselves. Now the parent thread can terminate.
I am a Master's research student and I am encountering the need for the above scenario. I know that this can be done using pthread_cond_wait and pthread_cond_signal. I have written the code, but it just runs indefinitely and I cannot figure out why.
My guess is that, the way I have coded it, I have over-complicated the scenario. It would be very helpful to know how this can be implemented. If there is a need, I can post a simplified version of my code to show what I am trying to do (even though I think that my approach is flawed!)...
Can you please give me any insights into how this scenario can be implemented using pthreads?
As far as can be seen from your description, there seems to be nothing wrong with the principle.
What you are trying to implement is a worker pool, I guess; there should be a lot of implementations out there. If the work that your threads are doing is a substantial computation (say, at least a CPU-second or so), such a scheme is complete overkill. Modern implementations of POSIX threads are efficient enough that they support the creation of a lot of threads, really a lot, and the overhead is not prohibitive.
The only thing that would be important if you have your workers communicate through shared variables, mutexes etc (and not via the return value of the thread) is that you start your threads detached, by using the attribute parameter to pthread_create.
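The attribute setup looks roughly like this (worker and ctx are placeholders for your thread function and its argument):

pthread_attr_t attr;
pthread_t tid;

pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
pthread_create(&tid, &attr, worker, ctx);   /* no pthread_join needed later */
pthread_attr_destroy(&attr);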
Once you have such an implementation for your task, measure. Only then, if your profiler tells you that you spend a substantial amount of time in the pthread routines, start thinking of implementing (or using) a worker pool to recycle your threads.
One producer-consumer queue with 4 threads hanging off it. The thread that wants to queue the four tasks assembles the four context structs containing, as well as all the other data, a function pointer to an 'OnComplete' func. Then it submits all four contexts to the queue, atomically incrementing a taskCount up to 4 as it does so, and waits on an event/condvar/semaphore.
The four threads get a context from the P-C queue and work away.
When done, the threads call the 'OnComplete' function pointer.
In OnComplete, the threads atomically count down taskCount. If a thread decrements it to zero, it signals the event/condvar/semaphore and the originating thread runs on, knowing that all the tasks are done.
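A sketch of that countdown using C11 atomics, with a semaphore standing in for the event (names are illustrative):

#include <stdatomic.h>
#include <semaphore.h>

static atomic_int task_count;       /* set to 4 before the contexts are submitted */
static sem_t all_done;              /* what the originating thread waits on */

static void on_complete(void)
{
    /* atomic_fetch_sub returns the previous value, so 1 means we were last */
    if (atomic_fetch_sub(&task_count, 1) == 1)
        sem_post(&all_done);        /* wake the thread that queued the work */
}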
It's not that difficult to arrange it so that the assembly of the contexts and the synchro waiting are done in a task as well, allowing the pool to process multiple 'ForkAndWait' operations at once for multiple requesting threads.
I have to add that operations like this are a huge pile easier in an OO language. The latest Java, for example, has a fork/join thread pool class that should do exactly this kind of stuff, but C++ (or even C#, if you're into serfdom) is better than plain C.

The disadvantages of using sleep()

In C programming, if I want to coordinate two concurrently executing processes, I can use sleep(). However, I have heard that sleep() is not a good way to implement the ordering of events between processes. Are there any reasons?
sleep() is not a coordination function. It never has been. sleep() makes your process do just that - go to sleep, not running at all for a certain period of time.
You have been misinformed. Perhaps your source was referring to what is known as a backoff after an acquisition of a lock fails, in which case a randomized sleep may be appropriate.
The way one generally establishes a relative event ordering between processes (ie, creates a happens-before edge) is to use a concurrency-control structure such as a condition variable which is only raised at a certain point, or a more-obtuse barrier which causes each thread hitting it to wait until all others have also reached that point in the program.
Using sleep() will impact the latency and CPU load. Let's say you sleep for 1ms and check some atomic shared variable. The average latency will be (at least) 0.5ms. You will be consuming CPU cycles in this non-active thread to poll the shared atomic variable. There are also often no guarantees about the sleep time.
The OS provides services to communicate/synchronize between threads/processes. Those have low latency, consume less CPU cycles, and often have other guarantees - those are the ones you should use... (E.g. condition variables, events, semaphores etc.). When you use those the thread/process does not need to "poll". The kernel wakes up the waiting threads/processes when needed (the thread/process "blocks").
There are some rare situations where polling is the best solution for thread/process synchronization, e.g. a spinlock, usually when the overhead of going through the kernel is larger than the time spent polling.
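The difference in a nutshell, as a sketch (ready, lock and cond stand for whatever shared state and primitives are in play):

/* Polling with a sleep: the thread wakes ~1000 times a second just to
 * check the flag, and still sees the change up to 1 ms late. */
while (!ready)
    usleep(1000);

/* Blocking on a condition variable: the kernel wakes the thread when
 * the state actually changes, with no polling in between. */
pthread_mutex_lock(&lock);
while (!ready)
    pthread_cond_wait(&cond, &lock);
pthread_mutex_unlock(&lock);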
Sleep would not be a very robust way to handle event ordering between processes as there are so many things that can go wrong.
What if your sleep() is interrupted?
You need to be a bit more specific about what you mean by "implement the order of events between processes".
In my case, I was using this function in Celery. I was doing time.sleep(10), and it was working fine if the celery_task was called once or twice per minute. But it created chaos in one case:
the celery_task was called 1000 times
I had 4 Celery workers, so the above 1000 calls were queued for execution.
The first 4 calls were executed by the 4 workers and the remaining 996 were still in the queue.
The workers were busy with those 4 tasks for 10 seconds, and after 10 seconds they took the next 4 tasks. Going this way, it would take around 1000 / 4 * 10 = 2500 seconds.
Eventually, we had to remove time.sleep(), as it was blocking a worker for 10 seconds per task in my case.

What could produce this bizarre behavior with two threads sleeping at the same time?

There are two threads. One is an events thread, and another does rendering. The rendering thread uses variables from the events thread. There are mutex locks, but they are irrelevant since I noticed the behavior is the same even if I remove them completely (for testing).
If I do a sleep() in the rendering thread alone, for 10 milliseconds, the FPS is normally 100.
If I do no sleep at all in the rendering thread and a sleep in the events thread, the rendering thread does not slow down at all.
But if I do a sleep of 10 milliseconds in the rendering thread and 10 in the events thread, the FPS is not 100 but lower, about 84! (Notice it's the same even if the mutex locks are removed completely.)
(If neither of them sleeps, the FPS normally goes high.)
What could produce this behavior?
--
The sleep command used is Sleep() of Windows or SDL_Delay() (which probably ends up calling Sleep() on Windows).
I believe I have found an answer (own answer).
Sleeping is not guaranteed to wait exactly the requested period; it will wait at least that long, due to OS scheduling.
A better approach is to measure the actual elapsed time explicitly, and proceed only once the required time has passed.
The threads run asynchronously unless you synchronise them, and will be scheduled according to the OS's scheduling policy. I would suggest that the behaviour will at best be non-deterministic (unless you were running on an RTOS perhaps).
You might do better to have one thread trigger another by some synchronisation mechanism such as a semaphore, then only have one thread Sleep, and the other wait on the semaphore.
I do not know what your "Events" thread does but given its name, perhaps it would be better to wait on the events themselves rather than simply sleep and then poll for events (if that is what it does). Making the rendering periodic probably makes sense, but waiting on events would be better doing exactly that.
The behavior will vary depending on many factors such as the OS version (e.g. Win7 vs. Win XP) and number of cores. If you have two cores and two threads with no synchronization objects they should run concurrently and Sleep() on one thread should not impact the other (for the most part).
It sounds like you have some other synchronization between the threads because otherwise when you have no sleep at all in your rendering thread you should be running at >100FPS, no?
In case that there is absolutely no synchronization then depending on how much processing happens in the two threads having them both Sleep() may increase the probability of contention for a single core system. That is if only one thread calls Sleep() it is generally likely to be given the next quanta once it wakes up and assuming it does very little processing, i.e. yields right away, that behavior will continue. If two threads are calling Sleep() there is some probability they will wake up in the same quanta and if at least one of them needs to do any amount of processing the other will be delayed and the observed frequency will be lower. This should only apply if there's a single core available to run the two threads on.
If you want to maintain a 100FPS update rate you should keep track of the next scheduled update time and only Sleep for the remaining time. This will ensure that even if your thread gets bumped by some other thread for a CPU quanta you will be able to keep the rate (assuming there is enough CPU time for all processing). Something like:
DWORD next_frame_time = GetTickCount(); // Milliseconds. Note the resolution of GetTickCount()
while (1)
{
    next_frame_time += 10;                             // Time of next frame update in ms
    DWORD wait_for = next_frame_time - GetTickCount(); // How much time remains until the next update
    // A simplistic test for the case where we're already too late: if we overshot,
    // the unsigned subtraction wraps to a huge value and we skip the Sleep()
    if (wait_for < 11)
    {
        Sleep(wait_for);
    }
    // Do periodic processing here
}
Depending on the target OS and your accuracy requirements you may want to use a higher resolution time function such as QueryPerformanceCounter(). The code above will not work well on Windows XP where the resolution of GetTickCount() is ~16ms but should work in Win7 - it's mostly to illustrate my point rather than meant to be copied literally in all situations.
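For reference, a sketch of a millisecond clock built on QueryPerformanceCounter might look like this:

#include <windows.h>

/* Millisecond timestamp with sub-millisecond resolution. */
static double now_ms(void)
{
    static LARGE_INTEGER freq;      /* ticks per second; fetched once */
    LARGE_INTEGER t;
    if (freq.QuadPart == 0)
        QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t);
    return 1000.0 * (double)t.QuadPart / (double)freq.QuadPart;
}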
