Given a work-stealing thread pool system where each work item can generate new tasks on a thread's local work queue, which can spill over to a global queue when full:
How would you safely and efficiently coordinate the shutdown of such a system? Assume only basic atomic operations and critical-section locks are available.
To clarify and simplify: say each thread grabs tasks from its local work queue only (no stealing from other threads' queues, to keep things simple). If its local work queue is exhausted, it takes a lock on the global work queue and steals work to add to its local queue. The local work queues require no locks because each one is specific to its worker thread.
Using a simple flag or an atomic count of 'active' worker threads won't work: another worker may spill new work onto the global queue at the very moment a thread concludes, from its own view, that it was the only worker left with work.
All workers should exit only when there is no work left.
The biggest requirement is to have some way of saving the definition of each task, so the state of pending tasks can be saved to persistent storage. Then implement a "stop" flag (protected by a mutex). The method that gets a task from the pool for execution checks that flag and, if it is set, returns a "terminate worker thread" indication, distinct from the "no tasks available" result that makes a thread wait and try again. Threads terminate when they get that indication, and the overall pool management thread waits until all worker threads have terminated, then terminates the pool.

The main program has to wait until the pool is terminated and the pool management thread exits; once that happens, it is safe to terminate the program. If the program needs to continue running and restart the pool later, that is also the condition that must be met before it does anything that would affect the pool configuration or restart the pool.
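A minimal sketch of that stop-flag scheme, in C with pthreads. The task type and global_queue_pop() are assumptions standing in for whatever the real pool uses:

#include <pthread.h>
#include <stdbool.h>

/* Hypothetical task type and global-queue accessor (assumed). */
struct task;
struct task *global_queue_pop(void);   /* returns NULL when the queue is empty */

enum get_task_result { TASK_READY, NO_TASKS_YET, TERMINATE_WORKER };

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pool_wake = PTHREAD_COND_INITIALIZER;
static bool stop_requested = false;

/* Workers call this; TERMINATE_WORKER is distinct from NO_TASKS_YET. */
enum get_task_result get_task(struct task **out)
{
    pthread_mutex_lock(&pool_lock);
    if (stop_requested) {
        pthread_mutex_unlock(&pool_lock);
        return TERMINATE_WORKER;       /* exit the worker thread */
    }
    *out = global_queue_pop();
    pthread_mutex_unlock(&pool_lock);
    return *out ? TASK_READY : NO_TASKS_YET;  /* NO_TASKS_YET: wait and retry */
}

/* The pool manager sets the flag, then joins every worker. */
void request_stop(void)
{
    pthread_mutex_lock(&pool_lock);
    stop_requested = true;
    pthread_mutex_unlock(&pool_lock);
    pthread_cond_broadcast(&pool_wake);  /* wake workers parked on "no tasks" */
}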
I am working on a multi-threaded network server application. At the moment I am having issues with lock recovery. If a thread dies unexpectedly while it is holding a lock (a mutex, rwlock, spinlock, etc.), is it possible to recover the lock from a different thread without going into the lock struct itself and manually disassociating the owner from the lock? I would rather not go to that extreme, since it would make the code non-portable. I have attempted to force a lock-owner change by doing a pthread_kill on the offending thread and looking at the return code, but even using a mutex type attribute of PTHREAD_MUTEX_ERRORCHECK, I still cannot gain control of the mutex from another thread once the locking thread has quit. This can be a problem if some internal table is being updated when the thread bails out, as it will eventually cause the entire server application to halt.
I have used Google extensively and I'm getting conflicting information, even on here. Any suggestions or ideas that I can explore?
This is on FreeBSD 9.3 using clang-llvm compiler.
For mutexes which are shared between processes (PTHREAD_PROCESS_SHARED) you can set them PTHREAD_MUTEX_ROBUST... but you are stuck with the problem that the state protected by the mutex may be invalid -- depending on the application.
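A hedged sketch of that robust-mutex recovery path (POSIX.1-2008; the repair step in the EOWNERDEAD branch is whatever your application needs to make the protected state valid again):

#include <errno.h>
#include <pthread.h>

static pthread_mutex_t m;

void init_robust_mutex(void)
{
    pthread_mutexattr_t a;
    pthread_mutexattr_init(&a);
    pthread_mutexattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&a, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(&m, &a);
    pthread_mutexattr_destroy(&a);
}

int lock_with_recovery(void)
{
    int rc = pthread_mutex_lock(&m);
    if (rc == EOWNERDEAD) {
        /* The previous owner died holding the lock; the protected
           state may be invalid and must be repaired here. */
        pthread_mutex_consistent(&m);   /* mark the mutex usable again */
        rc = 0;
    }
    return rc;
}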
For mutexes which are not shared between processes, there is no standard notion of "robustness", because a thread cannot spontaneously die on its own -- a thread will run until either it is cancelled, it exits or the process exits or dies.
You can use:
void pthread_cleanup_push(void (*routine)(void*), void *arg);
void pthread_cleanup_pop(int execute);
to arrange for a mutex to be released if the thread is cancelled or exits while holding the mutex -- something like:
pthread_mutex_lock(&foo) ;                          // as now
pthread_cleanup_push((void (*)(void*))pthread_mutex_unlock, &foo) ;  // extra step
....                                                // the critical section
pthread_cleanup_pop(1) ;                            // replacing the pthread_mutex_unlock()
HOWEVER: you still need to think very carefully about what state the data protected by the mutex is in when the thread is cancelled or exits !!
You may be much better off examining why the thread dies while holding the lock in the first place, and perhaps sorting out the error/exception handling so that the error is passed up and out of the critical section (leaving the critical section cleanly).
I have an application that waits for clients to connect. Each time a client connects, a new thread gets created (with the new socket file descriptor). I know how many clients will connect; after I reach that number I just run pthread_join in a for loop.
My problem is that I would like the main thread to control all the other threads. My goal is to have each thread send the same message back to the client, at the same time, and only once. There are multiple messages a thread can send.
My current thinking is to define a list of commands, as follows:
const char *commands[] = {
    "TERMINATE",    /* string literals are NUL-terminated already */
    .... };
And then specify a command number that represents which command to use in that char* array. All threads will do something like
write(sockfd, buffer[commandNumber], length[commandNumber]);
I thought about waiting on a condition variable, but I see two problems:
1) I want to make sure that each thread, although synchronized, executes the command only once.
2) The main thread that initiates the command has to know when all those threads are done executing the command.
The only way I see to achieve 2) is to keep a counter (protected by mutexes): each thread increments it after executing the command. I am not sure I will be able to stop a thread from running the command twice.
What is the best way to coordinate multiple threads so they execute a single action at once, and to know when that action has finished executing on every thread?
You might use a barrier to gate the operation.
Synchronizing the send
The main thread initializes a barrier named "Ready" to N+1. Then it begins accept()ing N client connections, spawning a worker thread for each. The new worker threads immediately wait on barrier "Ready".
After spawning the Nth (and last) worker, the main thread sets the desired command (perhaps using a global commandNumber). Then the main thread waits on barrier "Ready". As soon as all workers and the main thread have arrived (reaching the barrier's limit of N+1), all threads are released, knowing that they are ready to issue their command immediately.
(A common alternate approach is to use a predicate and condition variable rather than a barrier. For example, the main thread might spawn the Nth worker and then cond_broadcast() that it has set a flag ready = 1. This approach is flawed. The main thread cannot know that the Nth worker — or, indeed, any of the workers — are yet waiting on that condition. The barrier solves this problem.)
Indicating completion
Another N+1 barrier, "AllDone", could be used to indicate that the workers are all done. A semaphore posted once by each worker, and waited on N times by the main thread, would do the same. Having the workers close() their connections while the main thread select()s or poll()s the connections would convey the same information, too.
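A minimal sketch of the two-barrier scheme with POSIX barriers; N, commands, and commandNumber are placeholders from the question:

#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define N 4                 /* number of clients, known in advance */

static const char *commands[] = { "TERMINATE" /* , ... */ };
static int commandNumber;   /* main sets this before the Ready barrier */
static pthread_barrier_t ready, all_done;   /* both initialized to N + 1 */

static void *worker(void *arg)
{
    int sockfd = *(int *)arg;
    pthread_barrier_wait(&ready);      /* gate: everyone arrives, then go */
    write(sockfd, commands[commandNumber],
          strlen(commands[commandNumber]));   /* exactly once per worker */
    pthread_barrier_wait(&all_done);   /* report completion to main */
    return NULL;
}

/* In main, after pthread_barrier_init(&ready, NULL, N + 1) and
   pthread_barrier_init(&all_done, NULL, N + 1), and after spawning
   the N workers:
       commandNumber = 0;                // choose the command
       pthread_barrier_wait(&ready);     // release all workers at once
       pthread_barrier_wait(&all_done);  // returns when every worker is done
*/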
I do understand what an APC is, how it works, and how Windows uses it, but I don't understand when I (as a programmer) should use QueueUserAPC instead of, say, a fiber, or thread pool thread.
When should I choose to use QueueUserAPC, and why?
QueueUserAPC is a neat tool that can often be a shortcut for some tasks that are otherwise handled with synchronization objects. It allows you to tell a particular thread to do something whenever it is convenient for that thread (i.e. when it finishes its current work and starts waiting on something).
Let's say you have a main thread and a worker thread. The worker thread opens a socket to a file server and starts downloading a 10GB file by calling recv() in a loop. The main thread wants to have the worker thread do something else in its downtime while it is waiting for net packets; it can queue a function to be run on the worker while it would otherwise be waiting and doing nothing.
You have to be careful with APCs: in the scenario I mentioned, you would not want the queued function to make another blocking WinSock call (which would result in undefined behavior). You really have to look hard to find good uses of this functionality, because you can usually do the same thing in other ways -- for example, by having the other thread check an event every time it is about to go to sleep, rather than giving it a function to run while it is waiting. The APC is obviously simpler in this scenario, though.
It is like when you have a call desk employee sitting and waiting for phone calls, and you give that person little tasks to do during their downtime. "Here, solve this Rubik's cube while you're waiting." Although, when a phone call comes in, the person would not put down the Rubik's cube to answer the phone (the APC has to return before the thread can go back to waiting).
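A hedged sketch of the mechanism; the worker parks in an alertable SleepEx() (standing in for a real wait), and the queued routine runs on that thread at its next alertable wait:

#include <windows.h>
#include <stdio.h>

/* Runs on the worker thread the next time it enters an alertable wait. */
static VOID CALLBACK DoSideTask(ULONG_PTR param)
{
    printf("side task %lu ran on the worker\n", (unsigned long)param);
}

static DWORD WINAPI Worker(LPVOID arg)
{
    (void)arg;
    /* TRUE = alertable: queued APCs are delivered during this wait.
       SleepEx returns WAIT_IO_COMPLETION whenever an APC has run. */
    while (SleepEx(5000, TRUE) == WAIT_IO_COMPLETION)
        ;   /* loop back to waiting; a real worker would recv() etc. */
    return 0;
}

int main(void)
{
    HANDLE worker = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    QueueUserAPC(DoSideTask, worker, 42);  /* "do this when convenient" */
    WaitForSingleObject(worker, INFINITE); /* worker exits after a quiet 5s */
    CloseHandle(worker);
    return 0;
}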
QueueUserAPC is also useful if there is a single thread (Thread A) that is in charge of some data structure, and you want to perform some operation on the data structure from another thread (Thread B), but you don't want to have the synchronization overhead / complexity of trying to share that data between two threads. By having Thread B queue the operation to run on Thread A, which solely maintains that structure, you are executing any arbitrary function you want on that data without having to worry about synchronization.
It is just another tool, like a thread pool. With a thread pool, however, you cannot send a task to a particular thread: you have no control over where the work is done. Queuing a task may end up creating a whole new thread, and two queued tasks may get done simultaneously on two different threads. With QueueUserAPC, you are guaranteed that the tasks get done in order, and on the thread you designate.
I wrote a simple program that implements a master/worker scheme, where the master is the main thread and the workers are created by it.
The main thread writes something to a shared buffer and the worker threads read it; writing and reading are coordinated by a read/write lock.
Unfortunately, this scheme leads to starvation of the main thread, since a single write has to wait on several reads to complete. One possible solution is to increase the priority of the master thread, so that when it wants to write it gets immediate access to the shared buffer.
According to a great post on a similar issue, I discovered that manipulating the priority of a thread under the SCHED_OTHER policy is probably not allowed; only the nice value can be changed.
I wrote a procedure to give the worker threads lower priority than the master thread, but it does not seem to work correctly.
void assignWorkerThreadPriority(pthread_t* worker)
{
    struct sched_param worker_sched_param;   // no need to malloc this (the original also never freed it)
    worker_sched_param.sched_priority = 0;   // any value other than 0 gives error?
    int policy = SCHED_OTHER;

    // pthread_setschedparam() returns the error code directly; it does not set errno
    int rc = pthread_setschedparam(*worker, policy, &worker_sched_param);
    printf("Result of changing priority is: %d - %s\n", rc, strerror(rc));
}
I have a two-fold question:
How can I set the nice value of the worker threads to avoid starving the main thread?
If that is not possible, how can I change the scheduling policy to one that allows changing the priority?
Edit: I managed to run the program using other policies, such as SCHED_FIFO; all I had to do was run the program as a superuser.
You cannot avoid problems using a read/write lock when the read and write usage is so even. You need a different method. You need a lock-free message queue or independent work queues or one of many other techniques.
Here is another way to do the job, the way I would do it. The worker can take the buffer away and work on it rather than keeping it shared:
Write thread:
Create work item.
Lock the mutex or CriticalSection protecting the current queue and pointer to queue.
Add work item to queue.
Release the lock.
Optionally signal a condition variable or Event. Another option is for worker threads to check for work on a timer.
Worker thread:
Create a new queue.
Wait for a condition variable or event or other signal, or wait on a timer.
Lock the mutex or CriticalSection protecting the current queue and pointer to queue.
Set the current queue pointer to the new queue.
Release the lock.
Proceed to work on the now private queue.
Delete the queue when all work items complete.
Now the write thread creates more work items. Once all the worker threads have their own private queues to work on, the write thread will be able to add many items in peace.
You can modify this. For example, a worker thread may lock the queue and move a limited number of work items off into its own internal queue instead of taking the whole thing.
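A minimal sketch of the queue-swap idea in C with pthreads; work_item is a hypothetical type, and the "queue" is just a singly linked list whose head the worker takes away wholesale:

#include <pthread.h>
#include <stdlib.h>

struct work_item {                     /* hypothetical work item */
    struct work_item *next;
    void (*run)(void *);
    void *arg;
};

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;
static struct work_item *current_queue = NULL;   /* shared queue head */

/* Write thread: add an item under the lock, then signal. */
void push_work(struct work_item *item)
{
    pthread_mutex_lock(&q_lock);
    item->next = current_queue;        /* LIFO for brevity */
    current_queue = item;
    pthread_mutex_unlock(&q_lock);
    pthread_cond_signal(&q_nonempty);
}

/* Worker thread: take the whole queue away and work on it privately. */
void *worker(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (current_queue == NULL)
            pthread_cond_wait(&q_nonempty, &q_lock);
        struct work_item *mine = current_queue;  /* swap: take the queue... */
        current_queue = NULL;                    /* ...leaving a fresh, empty one */
        pthread_mutex_unlock(&q_lock);

        while (mine) {                 /* now private: no locks needed */
            struct work_item *next = mine->next;
            mine->run(mine->arg);
            free(mine);
            mine = next;
        }
    }
    return NULL;
}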
I'm extending the functionality of a semaphore. I ran into a roadblock when I realized I don't know how an actual semaphore is implemented, and I need to know this to make sure my code runs correctly.
I know a semaphore works by blocking threads that call sem_wait() while another thread has it locked. The blocked thread is then put on a wait list for that semaphore.
My question relates to what happens on a sem_post(). Is the next thread pulled off the waiting list, set as the locking thread, and allowed to be unblocked? Or is the scheme for posting completely different?
Thanks!
The next thread to unblock from its sem_wait() will be whichever thread the OS decides to context-switch into next. Nobody makes any guarantee of ordering; it depends on your OS's scheduling strategy. It might be the thread that has been off the CPU for the longest, or the one that has been assigned the highest "priority", or the one that has historically had certain resource-usage statistics, or whatever.
Most likely, your current thread (the one that called sem_post()) will continue running for a while, until it either starts waiting for user input, blocks on another semaphore, or runs out of its OS-allotted time slice. Then the OS will switch in some totally unrelated process to run for a fraction of a second (probably Firefox or something), then go off and handle some network traffic, get itself a cup of tea, and, finally, when it gets around to it, pick whichever of your other threads it feels like, based on something like whether past history suggests the particular thread is more CPU-bound or I/O-bound.
In many OSes, priority is given to I/O-bound processes that haven't been around for very long. The theory is that new processes might be short-lived (if it's been around for five hours already, odds are it won't be finishing up in the next 1ms) so we might as well get them over with. I/O-bound processes are likely to continue to be I/O-bound, which means that chances are they are going to switch off the CPU shortly while waiting for other resources. Basically, the OS wants to find the process that it's going to be able to be done with ASAP, so it can get back to sipping its tea and running your malware.
Semaphores have two operations:
P() To acquire the semaphore (you seem to call this sem_wait)
V() To release the semaphore (you seem to call this sem_post)
Semaphores also have an integer associated to them, which is the number of concurrent threads allowed to pass P() without blocking. Other calls to P() will block until V() is called to free up spots.
That is the classic definition of a semaphore.
Edit: Semaphores make no guarantee of order. They don't have to actually use a queue or other FIFO structure. When only one thread is allowed through at a time and it calls V(), another (possibly random) thread will then return from its P() call and continue.
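As an illustration of that classic definition, here is a minimal counting semaphore built from a mutex and a condition variable (a sketch, not how any particular sem_t is really implemented):

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  nonzero;
    unsigned        count;   /* threads still allowed through P() */
} csem;

void csem_init(csem *s, unsigned initial)
{
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->nonzero, NULL);
    s->count = initial;
}

void csem_P(csem *s)                 /* acquire: like sem_wait() */
{
    pthread_mutex_lock(&s->lock);
    while (s->count == 0)            /* block until V() frees up a spot */
        pthread_cond_wait(&s->nonzero, &s->lock);
    s->count--;
    pthread_mutex_unlock(&s->lock);
}

void csem_V(csem *s)                 /* release: like sem_post() */
{
    pthread_mutex_lock(&s->lock);
    s->count++;
    pthread_mutex_unlock(&s->lock);
    pthread_cond_signal(&s->nonzero);  /* wake one waiter; which one is unspecified */
}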
According to the IEEE standard, the behavior of POSIX semaphores on sem_post() is:
If the semaphore value resulting from this operation is positive, then no threads were blocked waiting for the semaphore to become unlocked; the semaphore value is simply incremented.
If the value of the semaphore resulting from this operation is zero, then one of the threads blocked waiting for the semaphore shall be allowed to return successfully from its call to sem_wait(). If the Process Scheduling option is supported, the thread to be unblocked shall be chosen in a manner appropriate to the scheduling policies and parameters in effect for the blocked threads. In the case of the schedulers SCHED_FIFO and SCHED_RR, the highest priority waiting thread shall be unblocked, and if there is more than one highest priority thread blocked waiting for the semaphore, then the highest priority thread that has been waiting the longest shall be unblocked. If the Process Scheduling option is not defined, the choice of a thread to unblock is unspecified.
If the Process Sporadic Server option is supported, and the scheduling policy is SCHED_SPORADIC, the semantics are as per SCHED_FIFO above.