I have a single thread C program implemented using event driven programming - a callback triggers every time the event happens.
The callback takes far too long to execute (it does a bunch of calculations), and this processing time is important. It currently takes 500 microseconds and needs to be under 100.
Most of the calculations are independent, can be done in parallel.
I have a machine with many cores and was wondering whether using multiple threads to do the calculations in parallel would be possible or helpful.
I think that the approach in which at the beginning of the callback I generate multiple threads, and then send the different calculations to the multiple threads will not work well because generating the threads takes time.
Is it possible to have a few threads up, waiting to be used, and that every time the callback is triggered I can send the calculations there without having to generate the threads in each callback?
You can use a thread pool for this (often called a worker pool). The basic idea is create some number of threads in advance and have them all sleep, waiting on a semaphore whenever there is no work to do.
Your code will be simpler if you can get away with one thread for each processing task, but you can also implement it (carefully) with a queue, where each worker tries to handle the next job in the queue and then sleep when the job queue is empty.
Either way, a single round of processing will look something like this:
1. assign or queue tasks to your worker pool
2. notify worker pool to wake up and begin processing tasks
3. wait for worker pool to signal all tasks complete (*)
(*) remember, "all tasks complete" is not the same as "task queue empty"
Now your main timing bottlenecks will depend on the mutex/semaphore implementation and your OS thread scheduler. It may be appropriate to set a high priority on all your worker threads.
If you have events at regular intervals, a common improvement to the above is to also double-buffer (i.e. output the result for the previous event, and assign the workers to begin processing input for the current event). To achieve that, you would move step 3 to happen before step 1.
This may or may not be suitable for your purposes. But it can provide some extra leeway with timing, if you're still having trouble processing fast enough. Try something simple first. Problems like this can get hairy very quickly when you start introducing extra requirements.
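As a rough illustration of the simple "one thread per task" variant, here is a minimal sketch using POSIX threads and unnamed semaphores (sem_init, as available on Linux). The names pool_start, pool_run_round, pool_stop and NUM_WORKERS are made up for this example, and the squaring is only a placeholder for the real calculations. Whether a round fits inside 100 microseconds depends on the semaphore wake-up latency on your system, so treat it as something to measure rather than a guaranteed fix.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

#define NUM_WORKERS 4            /* hypothetical: one worker per independent task */

typedef struct {
    sem_t start;                 /* posted by the callback to wake this worker */
    int input;                   /* per-round input for this worker's task */
    int output;                  /* per-round result */
    int quit;                    /* set before the final wake-up to end the thread */
} worker_t;

static worker_t workers[NUM_WORKERS];
static pthread_t threads[NUM_WORKERS];
static sem_t done;               /* posted once by each worker per round */

static void *worker_main(void *arg)
{
    worker_t *w = arg;
    for (;;) {
        sem_wait(&w->start);     /* sleep until there is work */
        if (w->quit)
            break;
        w->output = w->input * w->input;   /* placeholder calculation */
        sem_post(&done);         /* report this task complete */
    }
    return NULL;
}

/* Create the threads once, at program start-up. */
void pool_start(void)
{
    sem_init(&done, 0, 0);
    for (int i = 0; i < NUM_WORKERS; i++) {
        workers[i].quit = 0;
        sem_init(&workers[i].start, 0, 0);
        int rc = pthread_create(&threads[i], NULL, worker_main, &workers[i]);
        assert(rc == 0);
        (void)rc;
    }
}

/* Body of the event callback: assign, notify, wait. Returns the sum of results. */
int pool_run_round(int base)
{
    for (int i = 0; i < NUM_WORKERS; i++) {
        workers[i].input = base + i;       /* assign this worker's task */
        sem_post(&workers[i].start);       /* wake the worker */
    }
    for (int i = 0; i < NUM_WORKERS; i++)
        sem_wait(&done);                   /* one token per finished task */
    int sum = 0;
    for (int i = 0; i < NUM_WORKERS; i++)
        sum += workers[i].output;
    return sum;
}

/* Wake every worker one last time with quit set, then join them. */
void pool_stop(void)
{
    for (int i = 0; i < NUM_WORKERS; i++) {
        workers[i].quit = 1;
        sem_post(&workers[i].start);
    }
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(threads[i], NULL);
}
```

Call pool_start() once at start-up and pool_run_round() from each callback. Counting tokens on the done semaphore is what distinguishes "all tasks complete" from "nothing left to hand out".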
I have a multi-threaded program in which one thread (Thread A) sleeps unconditionally, with no timeout. When an event happens in another thread (Thread B), it wakes up Thread A by signaling. Now I know there are multiple ways to do it.
When my program runs in windows environment, I use WaitForSingleObject in Thread-A and SetEvent in the Thread-B. It is working without any issues.
I can also use a file descriptor based model where I do poll or select. There is more than one way to do it.
However, I am trying to find which is the most efficient way. I want to wake up Thread A asap whenever Thread B signals. What do you think is the best option?
I am ok to explore a driver based option.
Thanks
As said, triggering a SetEvent in thread B and a WaitForSingleObject in thread A is fast.
However some conditions have to be taken into account:
Single core/processor: As Martin says, the waiting thread will preempt the signalling thread. With such a scheme you should take care that the signalling thread (B) goes idle right after the SetEvent. This can be done with a Sleep(0), for example.
Multi core/processor: One might think there is an advantage to putting the two threads onto different cores/processors, but this is not really such a good idea. If both threads are on the same core/processor, the time span between calling SetEvent and the return of WaitForSingleObject is much shorter.
Handling both threads on one core (SetThreadAffinityMask) also lets you control their behavior through their priority settings (SetThreadPriority). You may run the waiting thread at a higher priority, or you have to ensure that the signalling thread is really not doing anything after it has set the event.
You have to deal with another synchronization matter: When is the next event going to happen? Will thread A have completed its task? The most effective way to solve this is a second event: when thread A is done, it sets an event to indicate that thread B is allowed to set its event again. Thread B then effectively sets the event and waits for the feedback event, which also meets the requirement that it go idle immediately.
If you want to allow thread B to set the event even when thread A is not finished and not yet in a wait state, you should consider using semaphores instead of events. This way the number of "calls/events" from thread B is kept and the wait function in thread A can follow up, because it is returning for the number of times the semaphore has been released. Semaphore objects are about as fast as events.
Summary:
Have both threads on the same core/cpu by means of SetThreadAffinityMask.
Extend the SetEvent/WaitForSingleObject by another event to establish a Handshake.
Depending on the details of the processing you may also consider semaphore objects.
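To illustrate the point about semaphores keeping count, here is a small sketch using a POSIX semaphore as a stand-in for the Win32 one (on Windows the corresponding calls would be CreateSemaphore, ReleaseSemaphore and WaitForSingleObject). The function name count_delivered_signals is made up for this example. Thread B signals n times before thread A waits at all, and A still gets exactly n wake-ups, whereas an auto-reset event would have collapsed them into one.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t pending;   /* counts signals that thread A has not consumed yet */

/* Thread B: signal n times without waiting for thread A. */
static void *thread_b(void *arg)
{
    int n = *(int *)arg;
    for (int i = 0; i < n; i++)
        sem_post(&pending);   /* each post is remembered, even if A is busy */
    return NULL;
}

/* Plays the role of thread A. Returns how many signals were delivered to it. */
int count_delivered_signals(int n)
{
    pthread_t b;
    int handled = 0;

    sem_init(&pending, 0, 0);
    int rc = pthread_create(&b, NULL, thread_b, &n);
    assert(rc == 0);
    (void)rc;
    pthread_join(b, NULL);                /* B has finished signalling */

    while (sem_trywait(&pending) == 0)    /* A catches up: one wake-up per post */
        handled++;

    sem_destroy(&pending);
    return handled;
}
```

In a real program thread A would block in sem_wait rather than drain with sem_trywait; the non-blocking call is used here only so the sketch terminates.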
In C programming, if I want to coordinate two concurrently executing processes, I can use sleep(). However, I have heard that sleep() is not a good way to implement the order of events between processes. Are there any reasons?
sleep() is not a coordination function. It never has been. sleep() makes your process do just that - go to sleep, not running at all for a certain period of time.
You have been misinformed. Perhaps your source was referring to what is known as a backoff after an acquisition of a lock fails, in which case a randomized sleep may be appropriate.
The way one generally establishes a relative event ordering between processes (ie, creates a happens-before edge) is to use a concurrency-control structure such as a condition variable which is only raised at a certain point, or a more-obtuse barrier which causes each thread hitting it to wait until all others have also reached that point in the program.
Using sleep() will impact the latency and CPU load. Let's say you sleep for 1ms and check some atomic shared variable. The average latency will be (at least) 0.5ms. You will be consuming CPU cycles in this non-active thread to poll the shared atomic variable. There are also often no guarantees about the sleep time.
The OS provides services to communicate/synchronize between threads/processes. Those have low latency, consume less CPU cycles, and often have other guarantees - those are the ones you should use... (E.g. condition variables, events, semaphores etc.). When you use those the thread/process does not need to "poll". The kernel wakes up the waiting threads/processes when needed (the thread/process "blocks").
There are some rare situations where polling is the best solution for thread/process synchronization, e.g. a spinlock, usually when the overhead of going through the kernel is larger than the time spent polling.
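As a sketch of the blocking approach between two processes, the example below uses a pair of pipes, which is my own choice of primitive here (the answer names condition variables, events and semaphores; a pipe is simply another kernel object that blocks the reader until data arrives and works across fork without extra setup). The function name ordered_handoff is hypothetical. The child cannot proceed until the parent has written, so the ordering holds without any sleep() or polling.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parent publishes `value`; the child wakes, adds 1 and sends it back.
 * Returns the child's answer, or -1 on error. */
int ordered_handoff(int value)
{
    int to_child[2], to_parent[2];

    if (pipe(to_child) != 0)
        return -1;
    if (pipe(to_parent) != 0) {
        close(to_child[0]);
        close(to_child[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: read() blocks in the kernel until the parent has written. */
        int v;
        close(to_child[1]);
        close(to_parent[0]);
        if (read(to_child[0], &v, sizeof v) == (ssize_t)sizeof v) {
            v += 1;
            if (write(to_parent[1], &v, sizeof v) != (ssize_t)sizeof v)
                _exit(1);
        }
        _exit(0);
    }

    /* Parent: this write is the event the child is waiting for. */
    int result = -1;
    close(to_child[0]);
    close(to_parent[1]);
    if (write(to_child[1], &value, sizeof value) == (ssize_t)sizeof value) {
        if (read(to_parent[0], &result, sizeof result) != (ssize_t)sizeof result)
            result = -1;
    }
    close(to_child[1]);
    close(to_parent[0]);

    pid_t reaped = waitpid(pid, NULL, 0);   /* avoid leaving a zombie */
    assert(reaped == pid);
    (void)reaped;
    return result;
}
```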
Sleep would not be a very robust way to handle event ordering between processes as there are so many things that can go wrong.
What if your sleep() is interrupted?
You need to be a bit more specific about what you mean by "implement the order of events between processes".
In my case, I was using this function in celery. I was doing time.sleep(10). And it was working fine if the celery_task was called once or twice per minute. But it created chaos in one case.
If the celery_task is called 1000 times
I had 4 celery workers, so the above 1000 celery calls were queued for execution.
The first 4 calls were executed by the 4 workers and the remaining 996 were still in the queue.
The workers were busy with those 4 tasks for 10 seconds, and after 10 seconds they took the next 4 tasks. Going this way, it may take around 1000/4*10 = 2500 seconds.
Eventually, we had to remove time.sleep as it was blocking the worker for 10 seconds in my case.
I do understand what an APC is, how it works, and how Windows uses it, but I don't understand when I (as a programmer) should use QueueUserAPC instead of, say, a fiber, or thread pool thread.
When should I choose to use QueueUserAPC, and why?
QueueUserAPC is a neat tool that can often be a shortcut for some tasks that are otherwise handled with synchronization objects. It allows you to tell a particular thread to do something whenever it is convenient for that thread (i.e. when it finishes its current work and starts waiting on something).
Let's say you have a main thread and a worker thread. The worker thread opens a socket to a file server and starts downloading a 10GB file by calling recv() in a loop. The main thread wants to have the worker thread do something else in its downtime while it is waiting for net packets; it can queue a function to be run on the worker while it would otherwise be waiting and doing nothing.
You have to be careful with APCs, because as in the scenario I mentioned you would not want to make another blocking WinSock call (which would result in undefined behavior). You really have to be watching in order to find any good uses of this functionality because you can do the same thing in other ways. For example, by having the other thread check an event every time it is about to go to sleep, rather than giving it a function to run while it is waiting. Obviously the APC would be simpler in this scenario.
It is like when you have a call desk employee sitting and waiting for phone calls, and you give that person little tasks to do during their downtime. "Here, solve this Rubik's cube while you're waiting." Although, when a phone call comes in, the person would not put down the Rubik's cube to answer the phone (the APC has to return before the thread can go back to waiting).
QueueUserAPC is also useful if there is a single thread (Thread A) that is in charge of some data structure, and you want to perform some operation on the data structure from another thread (Thread B), but you don't want to have the synchronization overhead / complexity of trying to share that data between two threads. By having Thread B queue the operation to run on Thread A, which solely maintains that structure, you are executing any arbitrary function you want on that data without having to worry about synchronization.
It is just another tool like a thread pool. However with a thread pool you cannot send a task to a particular thread. You have no control over where the work is done. When you queue up a task that may end up creating a whole new thread. You may queue two tasks and they get done simultaneously on two different threads. With QueueUserAPC, you can be guaranteed that the tasks would get done in order and on the thread you designate.
I have POSIX in mind for implementation, though this question is more about architecture.
I am starting from an update loop that has several main jobs to do. I can group those jobs into four or five main tasks that have common memory access requirements. It's my idea to break off those jobs into their own threads and have them complete one cycle of "update" and sleep until the next frame.
But how to synchronize? If I detach four or five threads at the start of each cycle, have them run once, die, and then detach another 4-5 threads on each pass? That sounds expensive.
It sounds more reasonable to create these threads once, and have them go to sleep until a synchronized call wakes it up.
Is this a wise approach? I'm open to accepting responses from just ideas to implementations of any kind.
EDIT: based on the answers so far, I'd like to add:
concurrency is desired
these worker threads are intended to run at very short durations <250ms
the work done by each thread will always be the same
i'm considering 4-5 threads, 20 being a hard limit.
That depends on the granularity of the tasks that the threads are performing. If they're doing long tasks (e.g. a second or longer), then the cost of creating and destroying threads is negligible compared to the work the threads are doing, so I'd recommend keeping things simple and creating the threads on demand.
Conversely, if you have very short tasks (e.g. less than 10-100 ms or so), you will definitely start to notice the cost of creating and destroying lots of threads. In that case, yes, you should create the threads only once and have them sleep until work arrives for them. You'll want to use some sort of condition variable (e.g. pthread_cond_t) for this: the thread waits on the condition variable, and when work arrives, you signal the condition variable.
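Here is a minimal sketch of that "create once, sleep until work arrives" pattern with pthread_cond_t. All names (start_worker, submit_and_wait, stop_worker and the flag variables) are invented for the example, there is only one worker, and the doubling is a placeholder for real work. Note the while loops around pthread_cond_wait: they guard against spurious wake-ups.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_ready = PTHREAD_COND_INITIALIZER;
static pthread_cond_t work_done = PTHREAD_COND_INITIALIZER;
static bool has_work, finished, shutting_down;
static int job_input, job_output;

static void *worker(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&lock);
    for (;;) {
        while (!has_work && !shutting_down)
            pthread_cond_wait(&work_ready, &lock);   /* sleep until signalled */
        if (shutting_down)
            break;
        has_work = false;
        int in = job_input;
        pthread_mutex_unlock(&lock);

        int out = in * 2;                            /* placeholder work */

        pthread_mutex_lock(&lock);
        job_output = out;
        finished = true;
        pthread_cond_signal(&work_done);             /* tell the submitter */
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Create the worker once; it then sleeps until work arrives. */
pthread_t start_worker(void)
{
    pthread_t t;
    shutting_down = false;
    has_work = false;
    int rc = pthread_create(&t, NULL, worker, NULL);
    assert(rc == 0);
    (void)rc;
    return t;
}

/* Hand one job to the sleeping worker and block until it is done. */
int submit_and_wait(int input)
{
    pthread_mutex_lock(&lock);
    job_input = input;
    finished = false;
    has_work = true;
    pthread_cond_signal(&work_ready);
    while (!finished)
        pthread_cond_wait(&work_done, &lock);
    int out = job_output;
    pthread_mutex_unlock(&lock);
    return out;
}

void stop_worker(pthread_t t)
{
    pthread_mutex_lock(&lock);
    shutting_down = true;
    pthread_cond_signal(&work_ready);
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
}
```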
If you always have the same work to do every cycle, and you need to wait for all the work to finish before the next cycle starts, then you're thinking about the right solution.
You'll need some synchronization objects: a "start of frame semaphore", an "end of frame semaphore", and an "end of frame event". If you have n independent tasks each frame, start n threads, with loops that look like this (pseudocode):
while true:
    wait on "start of frame semaphore"
    <do work>
    enter lock
    decrement "worker count"
    if "worker count" = 0 then set "end of frame event"
    release lock
    wait on "end of frame semaphore"
You can then have a controller thread run:
while true:
    set "worker count" to n
    increment "start of frame semaphore" by n
    wait on "end of frame event"
    increment "end of frame semaphore" by n
This will work well for small n. If the number of tasks you need to complete each cycle becomes large, then you will probably want to use a thread pool coupled with a task queue, so that you don't overwhelm the system with threads. But there's more complexity with that solution, and with threading complexity is the enemy.
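For what it's worth, the pseudocode above could be translated to POSIX roughly as follows. This is a sketch under a few assumptions: POSIX has no event object, so a third semaphore (frame_done) stands in for the "end of frame event"; the names frames_init, run_frame, frames_shutdown and N_TASKS are invented; the quit flag is an addition so the threads can be joined; and the per-task work is a placeholder.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

#define N_TASKS 4                      /* hypothetical number of per-frame tasks */

static sem_t start_of_frame;           /* "start of frame semaphore" */
static sem_t end_of_frame;             /* "end of frame semaphore" */
static sem_t frame_done;               /* stands in for the "end of frame event" */
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static int worker_count;
static int quit;                       /* addition: lets the workers exit */
static int frame_no;
static int results[N_TASKS];
static int ids[N_TASKS];
static pthread_t threads[N_TASKS];

static void *frame_worker(void *arg)
{
    int id = *(int *)arg;
    for (;;) {
        sem_wait(&start_of_frame);
        if (quit)
            break;
        results[id] = frame_no + id;   /* placeholder for <do work> */
        pthread_mutex_lock(&count_lock);
        if (--worker_count == 0)
            sem_post(&frame_done);     /* last worker out signals the controller */
        pthread_mutex_unlock(&count_lock);
        /* Without this wait a fast worker could take two start tokens in one frame. */
        sem_wait(&end_of_frame);
    }
    return NULL;
}

void frames_init(void)
{
    quit = 0;
    sem_init(&start_of_frame, 0, 0);
    sem_init(&end_of_frame, 0, 0);
    sem_init(&frame_done, 0, 0);
    for (int i = 0; i < N_TASKS; i++) {
        ids[i] = i;
        int rc = pthread_create(&threads[i], NULL, frame_worker, &ids[i]);
        assert(rc == 0);
        (void)rc;
    }
}

/* One pass of the controller loop; returns the sum of the workers' results. */
int run_frame(int frame)
{
    frame_no = frame;
    worker_count = N_TASKS;
    for (int i = 0; i < N_TASKS; i++)
        sem_post(&start_of_frame);     /* increment by n */
    sem_wait(&frame_done);             /* wait on the end of frame event */
    for (int i = 0; i < N_TASKS; i++)
        sem_post(&end_of_frame);       /* release workers into the next frame */
    int sum = 0;
    for (int i = 0; i < N_TASKS; i++)
        sum += results[i];
    return sum;
}

void frames_shutdown(void)
{
    quit = 1;
    for (int i = 0; i < N_TASKS; i++)
        sem_post(&start_of_frame);     /* wake each worker so it sees quit */
    for (int i = 0; i < N_TASKS; i++)
        pthread_join(threads[i], NULL);
}
```

Call frames_init() once, run_frame() once per update cycle, and frames_shutdown() at exit.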
The best is probably to use a task queue.
Task queues can be seen as threads waiting for a job to be submitted to them. If there are many sent at once, they are executed in FIFO order.
That way, you maintain 4-5 threads, and each of them executes the job you feed them, without needing to detach a new thread for each job.
The only problem is that I don't know many implementations of task queues in C. Apple has Grand Central Dispatch that does just that; FreeBSD has an implementation of it too. Except those, I don't know any other. (I didn't look very hard, though.)
Your idea is known as a thread pool. Thread pools are found in WinAPI, Intel TBB and the Visual Studio ConcRT. I don't know much about POSIX and therefore cannot help you there, but they are an excellent structure with many desirable properties, such as excellent scaling, if the work being posted can be split up.
However, I wouldn't trivialize the time the work takes. If you have five tasks, and you have a performance issue so desperate that multiple threads are the key, then creating the threads is almost certainly a negligible problem.
I am an embedded programmer attempting to simulate a real time preemptive scheduler in a Win32 environment using Visual Studio 2010 and MingW (as two separate build environments). I am very green on the Win32 scheduling environment and have hit a brick wall with what I am trying to do. I am not trying to achieve real time behaviour - just to get the simulated tasks to run in the same order and sequence as they would on the real target hardware.
The real time scheduler being simulated has a simple objective - always execute the highest priority task (thread) that is able to run. As soon a task becomes able to run - it must preempt the currently running task if it has a priority higher than the currently running task. A task can become able to run due to an external event it was waiting for, or a time out/block time/sleep time expiring - with a tick interrupt generating the time base.
In addition to this preemptive behaviour, a task can yield or volunteer to give up its time slice because it is executing a sleep or wait type function.
I am simulating this by creating a low priority Win32 thread for each task that is created by the real time scheduler being simulated (the thread effectively does the context switching the scheduler would do on a real embedded target), a medium priority Win32 thread as a pseudo interrupt handler (handles simulated tick interrupts and yield requests that are signalled to it using a Win32 event object), and a higher priority Win32 thread to simulate the peripheral that generates the tick interrupts.
When the pseudo interrupt handler establishes that a task switch should occur it suspends the currently executing thread using SuspendThread() and resumes the thread that executes the newly selected task using ResumeThread(). Of the many tasks and their associated Win32 threads that may be created, only one thread that manages the task will ever be out of the suspended state at any one time.
It is important that a suspended thread suspends immediately that SuspendThread() is called, and that the pseudo interrupt handling thread executes as soon as the event telling it that an interrupt is pending is signalled - but this is not the behaviour I am seeing.
As an example problem that I already have a work around for: When a task/thread yields, the yield event is latched in a variable and the interrupt handling thread is signalled, as there is a pseudo interrupt (the yield) that needs processing. Now in a real time system as I am used to programming I would expect the interrupt handling thread to execute immediately that it is signalled because it has a higher priority than the thread that signals it. What I am seeing in the Win32 environment is that the thread that signals the higher priority thread continues for some time before being suspended - either because it takes some time before the signalled higher priority thread starts to execute or because it takes some time for the suspended task to actually stop running - I'm not sure which. In any case this can easily be corrected by making the signalling Win32 thread block on a semaphore after signalling the Win32 interrupt handling thread, and having the interrupt handling Win32 thread unblock the thread when it has finished its function (handshake). Effectively using thread synchronisation to force the scheduling pattern to what I need. I am using SignalObjectAndWait() for this purpose.
Using this technique the simulation works perfectly when the real time scheduler being simulated is functioning in co-operative mode - but not (as is needed) in preemptive mode.
The problem with preemptive task switching is, I guess, the same: the task continues to execute for some time after it has been told to suspend before it actually stops running, so the system cannot be guaranteed to be left in a consistent state when the thread that runs the task suspends. In the preemptive case though, because the task does not know when it is going to happen, the same technique of using a semaphore to prevent the Win32 thread continuing until it is next resumed cannot be used.
Has anybody made it this far down this post - sorry for its length!
My questions then are:
How I can force Win32 (XP) scheduling to start and stop tasks immediately that the suspend and resume thread functions are called - or - how can I force a higher priority Win32 thread to start executing immediately that it is able to do so (the object it is blocked on is signalled). Effectively forcing Win32 to reschedule its running processes.
Is there some way of asynchronously stopping a task to wait for an event when it's not in the task/thread's sequential execution path?
The simulator works well in a Linux environment where POSIX signals are used to effectively interrupt threads - is there an equivalent in Win32?
Thanks to anybody who has taken the time to read this long post, and especially thanks in advance to anybody that can hold my 'real time engineers' hand through this Win32 maze.
If you need to do your own scheduling, then you might consider using fibers instead of threads. Fibers are like threads, in that they are separate blocks of executable code, however fibers can be scheduled in user code whereas threads are scheduled by the OS only. A single thread can host and manage scheduling of multiple fibers, and fibers can even schedule each other.
Firstly, what priority values are you using for your threads?
If you set the high priority thread to THREAD_PRIORITY_TIME_CRITICAL it should run pretty much immediately --- only those threads associated with a real-time process will have higher priority.
Secondly, how do you know that the suspend and resume aren't happening immediately? Are you sure this is the problem?
You cannot force a thread to wait on something from outside without suspending the thread to inject the wait code; if SuspendThread isn't working for you then this isn't going to help.
The closest to a signal is probably QueueUserAPC, which will schedule a callback to run the next time the thread enters an "alertable wait state", e.g. by calling SleepEx or WaitForSingleObjectEx or similar.
@Anthony W - thanks for the advice. I was running the Win32 threads that simulated the real time tasks at THREAD_PRIORITY_ABOVE_NORMAL, and the threads that ran the pseudo interrupt handler and the tick interrupt generator at THREAD_PRIORITY_HIGHEST. The threads that were suspended I was changing to THREAD_PRIORITY_IDLE in case that made any difference. I just tried your suggestion of using THREAD_PRIORITY_TIME_CRITICAL but unfortunately it didn't make any difference.
With regards to your question am I sure that the suspend and resume not happening immediately is the problem - well no I'm not. It is my best guess in an environment I am unfamiliar with. My thinking regarding the failure of suspend and resume to work immediately stems from my observation when a task yields. If I make the call to yield (signal [using a Win32 event] a higher priority Win32 thread to switch to the next real time task) I can place a break point after the yield and that gets hit before a break point in the higher priority thread. It is unclear whether a delay in signalling the event and the higher priority task running, or a delay in suspending the thread and the thread actually stopping running was causing this - but the behaviour was definitely observed. This was fixed using a semaphore handshake, but that cannot be done for preemptions caused by tick interrupts.
I know the simulation is not running as I expect because a set of tests that check the sequence of scheduling of real time tasks is failing. It is always possible the scheduler has a problem, or the test has a problem, but the test will run for weeks without failing on a real real time target so I'm inclined to think the test and the scheduler are ok. A big difference is on the real time target the tick frequency is 1 ms, whereas on the Win32 simulated target it is 15ms with quite a lot of variation even then.
@Remy - I have done quite a bit of reading about fibers today, and my conclusion is that for simulating the scheduler in cooperative mode they would be perfect. However, as far as I can see they can only be scheduled by the fibers themselves calling the SwitchToFiber() function. Can a thread be made to block on a timer or sleep so it runs periodically, effectively preempting the fiber that was running at the time? From what I have read the answer is no because blocking one fiber will block all fibers running in the thread. If it could be made to work, could the periodically executing fiber then call the SwitchToFiber() function to select the next fiber to run before again sleeping for a fixed period? Again I think the answer is no because once it switches to another fiber it will no longer be executing and so will not actually call the Sleep() function until the next time the executing fiber switches back to it. Please correct my logic here if I have got the wrong idea of how fibers work.
I think it could work if the periodic functionality could remain in its own thread, separate from the thread that executed the fibers - but (again from what I have read) I don't think one thread can influence the execution of fibers running in a different thread. Again I would be grateful if you could correct my conclusions here if they are wrong.
[EDIT] - simpler than the hack below - it seems just ensuring all the threads run on the same CPU core also fixes the problem :o) After all that. The only problem then is the CPU runs at nearly 100% and I'm not sure if the heat is damaging to it.
[/EDIT]
Ahaa! I think I have a work around for this - but it's ugly. The ugliness is kept in the port layer though.
What I do now is store the thread ID each time a thread is created to run a task (a Win32 thread is created for each real time task that is created). I then added the function below - which is called using trace macros. The trace macros can be defined to do whatever you want, and have proven very useful in this case. The comments in the code below explain. The simulation is not perfect, and all this does is correct the thread scheduling when it has already deviated from the real time scheduling whereas I would prefer it not to go wrong in the first place, but the positioning of the trace macros makes the code containing this solution pass all the tests:
void vPortCheckCorrectThreadIsRunning( void )
{
    xThreadState *pxThreadState;

    /* When switching threads, Windows does not always seem to run the selected
    thread immediately. This function can be called to check if the thread
    that is currently running is the thread that is responsible for executing
    the task selected by the real time scheduler. The demo project for the Win32
    port calls this function from the trace macros which are seeded throughout
    the real time kernel code at points where something significant occurs.
    Adding this functionality allows all the standard tests to pass, but users
    should still be aware that extra calls to this function could be required
    if their application requires absolute fixes and predictable sequencing (as
    the port tests do). This is still a simulation - not the real thing! */
    if( xTaskGetSchedulerState() != taskSCHEDULER_NOT_STARTED )
    {
        /* Obtain the real time task to Win32 mapping state information. */
        pxThreadState = ( xThreadState * ) *( ( unsigned long * ) pxCurrentTCB );

        if( GetCurrentThreadId() != pxThreadState->ulThreadId )
        {
            SwitchToThread();
        }
    }
}