I am trying to implement a checkpointing scheme for multithreaded applications by using fork. I will take the checkpoint at a safe location such as a barrier. One thread will call fork to replicate the address space and signals will be sent to all other threads so that they can save their contexts and write it to a file.
The forked process will not run initially. Only when restart from checkpoint is required, a signal would be sent to it so it can start running. At that point, the threads who were not forked but whose contexts were saved, will be recreated from the saved contexts.
My first question is if it is enough to recreate threads from saved contexts and run them from there, if i assume there was no lock held, no signal pending during checkpoint etc... . Lastly, how a thread can be created to run from a known context.
What you want is not possible without major integration with the pthreads implementation. Internal thread structures will likely contain their own kernel-space thread ids, which will be different in the restored contexts.
It sounds to me like what you really want is forkall, which is non-trivial to implement. I don't think barriers are useful at all for what you're trying to accomplish. Asynchronous interruption and checkpointing is just as good as synchronized.
If you want to try hacking forkall into glibc, you should start out by looking at the setxid code NPTL uses for synchronizing setuid() calls between threads using signals. The same principle is what's needed to implement forkall, but you'd basically call setjmp instead of setuid in the signal handlers, and then longjmp back into them after making new threads in the child. After that you'd have to patch up the thread structures to have the right pid/tid values, free the excess new stacks that were created, etc.
Edit: Since the setxid code in glibc/NPTL is rather dense reading for someone not familiar with the codebase, you might instead look at the corresponding code I have in musl, called __synccall:
http://git.etalabs.net/cgi-bin/gitweb.cgi?p=musl;a=blob;f=src/thread/synccall.c;h=91ac5eb77322da7393f778da29d35fb3c2def15d;hb=HEAD
It uses a signal to synchronize all threads, then runs a callback sequentially in each thread one-by-one. To implement forkall, you'd want to do something like this prior to the fork, but instead of a callback, simply save jump buffers for each thread except the calling thread (you can't use a callback for this because the return would invalidate the jump buffer you just saved), then perform the fork from the calling thread. After that, you would make N new threads, and have them jump back to the old threads' saved jump buffers, and destroy their new (unneeded) stacks. You'd also need to make the right syscall to update their thread register (e.g. %gs on x86) and tid address.
Then you need to take these ideas and integrate them with glibc's thread allocation and thread stack cache framework. :-)
Related
I'm writing a piece of software that does a single very long task. To allow interruption, we have added a check-pointing function that periodically (on the order of minutes) dumps an image of the program state to disk. This takes some time, however, so I would like to switch to a model where the checkpoints are written on a separate thread rather than blocking the primary worker. (Yes, I know I need to keep it thread-safe.)
As I see it, there are two primary methods of accomplishing this task:
For each checkpoint, I pthread_create() a thread which will execute the checkpointing function once and then terminate.
For each checkpoint, I pthread_cond_signal() a single waiting thread that executes the checkpointing function and then returns to waiting.
Both methods require making an atomic copy of my working state and passing it to the checkpoint thread, as well as ensuring that the checkpoint complete successfully before I try another.
My question is if there is a compelling reason to use one method over the other.
I would argue that pthreads are a bad fit for your requirements:Regardless of whether you spawn a new thread for each backup or use a threadpool, you need to make a deep copy of your working-set, which is expensive. Also, you may need extensive synchronization if you go with the thread-pool. Instead, there's a much easier way to do it:fork().The child process inherits the entire memory-space of the parent, but on modern OSs, the copy is lazy (copy on write). Also, you don't need to worry about cleaning up the thread you started, because the fork()ed child releases its resources when it terminates. If your original program is already multithreaded, you may wish to make sure to only use async-safe functions in the child, but thankfully write() is async-safe (as is open() and unlink()). To avoid your child turning into a zombie, you need to call waitid(P_ALL, 0, siginfo_t *infop, WEXITED | WNOHANG) in a loop until it returns nonzero or the siginfo_t * indicates that the child has not yet exited. This avoids stalling the parent in case the child is not done with the backup before the next backup-point is reached.
Don't go with continually creating/terminating/destroying/joining threads if you can possibly avoid it. It's expensive in terms of latency and cycles, has the risk of unwanted multiple threads doing overlapping work and is difficult to debug.
Just create one thread once, at app startup, and don't terminate it. Loop it round some synchro object and sSignal it when you need to, or run a timer or sleep loop to perform your image dumps.
(Working in Win32 api , in C environment with VS2010)
I have a two thread app. The first thread forks the second and waits for a given interval - 'TIMEOUT', and then calls TerminateThread() on it.
Meanwhile, second thread calls NetServerEnum().
It appears that when timeout is reached , whether NetServerEnum returned successfully or not, the first thread get deadlocked.
I've already noticed that NetServerEnum creates worker threads of it's own.
I ultimately end up with one of those threads in deadlock, typically on ntdll.dll!RtlInitializeExceptionChain, unable to exit my process gracefully.
As this to too long for a comment:
Verbatim from MSDN, allow me to use te answer form (emphasis by me):
TerminateThread is a dangerous function that should only be used in the most extreme cases. You should call TerminateThread only if you know exactly what the target thread is doing, and you control all of the code that the target thread could possibly be running at the time of the termination. For example, TerminateThread can result in the following problems:
If the target thread owns a critical section, the critical section will not be released.
If the target thread is allocating memory from the heap, the heap lock will not be released.
*If the target thread is executing certain kernel32 calls when it is terminated, the kernel32 state for the thread's process could be inconsistent.
If the target thread is manipulating the global state of a shared DLL, the state of the DLL could be destroyed, affecting other users of the DLL.
From reading this it is easy to understanf why it is a bad idea to cancel (terminate) a thread stucking in a system call.
A possible alternative approach to the OP's design might be to spawn off a thread calling NetServerEnum() and simply let it run until the system call returned.
In the mean while the main thread could do other things like for example informing the user that scanning the net takes longer as expected.
I am programming using pthreads in C.
I have a parent thread which needs to create 4 child threads with id 0, 1, 2, 3.
When the parent thread gets data, it will set split the data and assign it to 4 seperate context variables - one for each sub-thread.
The sub-threads have to process this data and in the mean time the parent thread should wait on these threads.
Once these sub-threads have done executing, they will set the output in their corresponding context variables and wait(for reuse).
Once the parent thread knows that all these sub-threads have completed this round, it computes the global output and prints it out.
Now it waits for new data(the sub-threads are not killed yet, they are just waiting).
If the parent thread gets more data the above process is repeated - albeit with the already created 4 threads.
If the parent thread receives a kill command (assume a specific kind of data), it indicates to all the sub-threads and they terminate themselves. Now the parent thread can terminate.
I am a Masters research student and I am encountering the need for the above scenario. I know that this can be done using pthread_cond_wait, pthread_Cond_signal. I have written the code but it is just running indefinitely and I cannot figure out why.
My guess is that, the way I have coded it, I have over-complicated the scenario. It will be very helpful to know how this can be implemented. If there is a need, I can post a simplified version of my code to show what I am trying to do(even though I think that my approach is flawed!)...
Can you please give me any insights into how this scenario can be implemented using pthreads?
As far what can be seen from your description, there seems to be nothing wrong with the principle.
What you are trying to implement is a worker pool, I guess, there should be a lot of implementations out there. If the work that your threads are doing is a substantial computation (say at least a CPU second or so) such a scheme is a complete overkill. Mondern implementations of POSIX threads are efficient enough that they support the creation of a lot of threads, really a lot, and the overhead is not prohibitive.
The only thing that would be important if you have your workers communicate through shared variables, mutexes etc (and not via the return value of the thread) is that you start your threads detached, by using the attribute parameter to pthread_create.
Once you have such an implementation for your task, measure. Only then, if your profiler tells you that you spend a substantial amount of time in the pthread routines, start thinking of implementing (or using) a worker pool to recycle your threads.
One producer-consumer thread with 4 threads hanging off it. The thread that wants to queue the four tasks assembles the four context structs containing, as well as all the other data stuff, a function pointer to an 'OnComplete' func. Then it submits all four contexts to the queue, atomically incrementing a a taskCount up to 4 as it does so, and waits on an event/condvar/semaphore.
The four threads get a context from the P-C queue and work away.
When done, the threads call the 'OnComplete' function pointer.
In OnComplete, the threads atomically count down taskCount. If a thread decrements it to zero, is signals the the event/condvar/semaphore and the originating thread runs on, knowing that all the tasks are done.
It's not that difficult to arrange it so that the assembly of the contexts and the synchro waiting is done in a task as well, so allowing the pool to process multiple 'ForkAndWait' operations at once for multiple requesting threads.
I have to add that operations like this are a huge pile easier in an OO language. The latest Java, for example, has a 'ForkAndWait' threadpool class that should do exactly this kind of stuff, but C++, (or even C#, if you're into serfdom), is better than plain C.
As you might know, all threads in the application die in a forked process, other than the thread doing the fork. However, I plan to ressurrect those threads in the forked process by calling pthread_create and using pthread_attr_setstack, so as to assign the newly created threads the same stack as the dead threads. Something like as follows.
// stackAddr and stacksize taken from the dead thread
pthread_attr_setstack(&attr, stackAddr, stacksize);
rc = pthread_create(&thread, &attr, threadRoutine, NULL);
However, I would still need to get the CPU register values, such as stack pointer, base pointer, instruction pointer etc, to restart threads from the same point. How can I do that? And what else do I need to do to successfully achieve my goal?
Also note that I'm using a 64-bit architecture. What additional difficulties would it have as compared to 32-bit one?
I see two possible ways to shoot yourself in the foot and lose hair^W^W^W^W^W^W^W^Wtry to do this:
Try to force each thread into calling getcontext() before the fork(), and then restore the context of each thread via setcontext(). Probably won't work, but you can try for fun.
Save ptrace(PTRACE_GETREGS), ptrace(PTRACE_GETFPREGS), and restore with ptrace(PTRACE_SETREGS), ptrace(PTRACE_SETFPREGS).
The other threads in the current process aren't killed by a fork -- they're still there and running in the parent. The problem you seem to have is that fork only forks a SINGLE thread in the current procces, creating a new process running one thread with a copy of all non-thread resources in the parent.
What you apparently want is a way of duplicating an entire multithreaded task, forking all the threads in it and creating a new process/task with the same number of threads.
In order to do THAT, you would need to find and pause all the other threads in the process, dump their current state (including all locks they hold), fork a new process, and then (re)create each of those other threads in the child, rewiring the lock state to refer to the new child threads where needed.
Unfortunately, the POSIX pthread interface is hopelessly underspecified, and provides no way of doing that. In particular, it lacks any sort of reflective interface allowing you to figure out what threads are actually running.
If you want to try to do this anyway, I can see two ways of trying to approach this:
poke around in /proc/self/task to figure out what threads are running in your process, effectively getting that reflective interface in a highly non-portable way. You'll likely end up having to ptrace(2) the other threads to get their internal state. This will be very difficult.
wrap the pthreads library -- instead of using library directly, intercept every call and keep track of all the threads/mutexes/locks that get created, so that you have that information available when you want to fork. This will work fine as long as you don't want to use any third-party libraries that use pthreads
The second option is much easier (and somewhat portable), but only works well if you have access to all the source code of your entire application, and can modify it to use your wrappers properly.
Just googling around I found that solaris has a forkall() call that does exactly what you want, see the documentation here:
http://download.oracle.com/docs/cd/E19963-01/html/821-1601/gen-1.html
I assume you're running on linux, but it is possible to run solaris on x86 hardware. So maybe that is an option for you.
I am trying to checkpoint a multithreaded application. For single threaded applications, forking a process as a checkpoint is an efficient technique. However, there is no such thing as a mulithreaded fork. Any idea of how to implement your own mulithreaded fork? Any reference to such work will be greatly appreciated.
There is no portable way to implement a variant of fork that preserves all threads using the interfaces provided by POSIX. On some systems such as Linux, you could implement a highly non-portable, highly fragile version of this either:
using ptrace to trace all threads (to stop them), then making new kernel threads in the child process to duplicate each thread in the parent and assigning them the original stack addresses, instruction pointers, register values, etc. You'd also need to patch up the thread descriptors to know their new kernelspace thread ids, and you'd need to avoid race conditions in this if the thread was in the middle of querying its thread id.
using vfork followed by SIGSTOP to halt the parent process and give yourself a chance to recreate its thread state without things changing under you. This seems possible but sufficiently difficult I'd get a headache trying to go into detail, I think...
(newly added) catch each thread in signal handlers before forking, and save the ucontext_t argument to the signal handler. Then fork and make new kernel threads (using clone), have them signal themselves, then overwrite the ucontext_t the signal handler gets to have the signal handler return back into the context of the original thread you're trying to duplicate. Of course this would all require very clever synchronization...
Alternatively, you could look for a kernel-based "process hibernation" approach to checkpointing that would not be so hackish...
What do you mean by "multithreaded fork"? A function that makes a copy of a multithreaded process, so that the forked process has just as many threads as the old one? A function that makes a new thread which copies the state of the old one?
The latter isn't possible, since the address space is shared. A copy of the current thread's state would be using the current thread's stack, and the new thread and the old thread would fight over the stack.
See also:
Multithreaded fork
fork and existing threads?