Implementing posix_spawn on Linux - c

I am curious to see if it would be possible to implement posix_spawn in Linux using a combination of vfork+exec. In a very simplified way (leaving out most optional arguments) this could look more or less like this:
int my_posix_spawn(pid_t *ppid, char **argv, char **env)
{
pid_t pid;
pid = vfork();
if (pid == -1)
return errno;
if (pid == 0)
{
/* Child */
execve(argv[0], argv, env);
/* If we got here, execve failed. How to communicate this to
* the parent? */
_exit(-1);
}
/* Parent */
if (ppid != NULL)
*ppid = pid;
return 0;
}
However I am wondering how to cope with the case where vfork succeeds (so the child process is created) but the exec call fails. There seems to be no way to communicate this to the parent, which would only see that it could apparently create a child process successfully (as it would get a valid pid back)
Any ideas?

As others have noted in the comments, posix_spawn is permitted to create a child process that immediately dies to due to exec failure or other post-fork failures; the calling application needs to be prepared for this. But of course it's preferable not to do so.
The general procedure for communicating exec failure to the parent is described in an answer I wrote on this question: What can cause exec to fail? What happens next?.
Unfortunately, however, some of the operations you need to perform are not legal after vfork due to its nasty returns-twice semantics. I've covered this topic in the past in an article on ewontfix.com. The solution for making a posix_spawn that avoids duplicating the VM seems to be using clone with CLONE_VM (and possibly CLONE_VFORK) to get a new process that shares memory but doesn't run on the same stack. However, this still requires a lot of care to avoid making any calls to libc functions that might modify memory used by the parent. My current implementation is here:
http://git.musl-libc.org/cgit/musl/tree/src/process/posix_spawn.c?id=v1.1.4
and as you can see it's rather complicated. Reading the git history may be informative regarding some of the design decisions that were made.

I don't think there's any good way to do this with the current set of system calls. You've correctly identified the biggest problem -- the absence of any reliable way to report failure after the vfork. Other problems include race conditions in setting child state, and Linux's lack of interest in picking up closefrom.
Several years ago I sketched a new system-level API that would solve this problem: the key addition is a system call, which I called egg(), that creates a process without giving it an address space, and inheriting no state from the parent. Obviously, an egg process can't execute code; but you can (with a whole bunch more new system calls) set all of its kernelside state, and then (with yet another system call, hatch()) load an executable into it and set it going. Crucially, all of the new system calls report failure in the parent. For instance, there's a dup_into(pid, to_fd, from_fd) call that copies parent file descriptor from_fd to egg-state process pid's file descriptor to_fd; if it fails, the parent gets the failure code.
I never had time to flesh all of that out into a coherent API specification and code it up (and I'm not a kernel hacker, anyway) but I still think the concept has legs and I would be happy to work with someone to get it done.

Related

Wait for child process without using wait()

When using fork(), is it possible to ensure that the child process executes before the parent without using wait() in the parent?
This is related to a homework problem in the Process API chapter of Operating Systems: Three Easy Pieces, a free online operating systems book.
The problem says:
Write another program using fork(). The child process should
print "hello"; the parent process should print "goodbye". You should
try to ensure that the child process always prints first; can you do
this without calling wait() in the parent?
Here's my solution using wait():
#include <stdio.h>
#include <stdlib.h> // exit
#include <sys/wait.h> // wait
#include <unistd.h> // fork
int main(void) {
int f = fork();
if (f < 0) { // fork failed
fprintf(stderr, "fork failed\n");
exit(1);
} else if (f == 0) { // child
printf("hello\n");
} else { // parent
wait(NULL);
printf("goodbye\n");
}
}
After thinking about it, I decided the answer to the last question was "no, you can't", but then a later question seems to imply that you can:
Now write a program that uses wait() to wait for the child process
to finish in the parent. What does wait() return? What happens if
you use wait() in the child?
Am I interpreting the second question wrong? If not, how do you do what the first question asks? How can I make the child print first without using wait() in the parent?
I hope this answer is not too late.
Minutes ago, I have emailed Remiz(this book's author), and got such a replay(extract some segments):
Without calling wait() is hard, and not really the main point.
What you did -- learning about signals on your own -- is a good sign,
showing you will seek out deeper knowledge. Good for you!
Later, you'll be able to use a shared memory segment, and
either condition variables or semaphores, to solve this problem.
Create a pipe in the parent. After fork, close the write half in the parent and the read half in the child.
Then, poll for readability. Since the child never writes to it, it will wait until the child (and all grandchildren, unless you take special care) no longer exists, at which time poll will give a "read with hangup" response. (Alternatively, you could actually communicate over the pipe).
You should read about O_CLOEXEC. As a general rule, that flag should always be set unless you have a good reason to clear it.
I can't see why second question would imply that answer is "yes" to the first.
Yes there is plenty of solutions to obtain what asked, but of course I suspect that all are not in the "spirit" of the problem/question where the focus in on fork/wait primitives. The point is always to remember that you can't assume anything after a fork regarding the way processes ran relatively to each other.
To ensure the child process print first you need a kind of synchronization in between both processes, and there is a lot of system primitives that have a semantic of "communication" between processes (for example locks, semaphores, signals, etc). I doubt one of these is to be used her, as they are generally introduced slightly later in such a course.
Any other attempt only that will only rely on time assumption (like using sleep or loops to "slow" down the parent, etc) can lead to failure, means that you will not be able to prove that it will always succeed. Even if testing would probably show you that it seems correct, most of the runs you would try will not have the bad characteristics that lead to failure. Remember that except in realtime OSes, scheduling is almost an approximation of fair concurrency.
NOTE:
As Jonathan Leffler commented, I also suppose that using other wait-like primitives is forbidden (aka wait4, waitpid, etc) -- "spirit" argument.
I'm not sure whether this is against the spirit of the question, but I think that calling the pause system call in the parent process branch will cause the scheduler to immediately run the child process (if it didn't already run).

Need clarification regarding vfork use

I want to run the child process earlier than the parent process. I just want to use execv call from the child process, So i am using vfork instead of fork.
But suppose execv fails and returns, i want to return non-zero value from the function in which i am calling vfork.
something like this,
int start_test()
{
int err = 0;
pid_t pid;
pid = vfork();
if(pid == 0) {
execv(APP, ARGS);
err = -1;
_exit(1);
} else if(pid > 0) {
//Do something else
}
return err;
}
Is above code is proper, Or i should use some other mechanism to run child earlier than parent?
On some links, I have read that, We should not modify any parent resource in child process created through vfork(I am modifying err variable).
Is above code is proper?
No. The err variable is already in the child's memory, thus parent will not see its modification.
We should not modify any parent resource in child process created through vfork (I am modifying err variable).
The resource is outdated. Some *nix systems in the past tried to play tricks with the fork()/exec() pair, but generally those optimizations have largely backfired, resulting in unexpected, hard to reproduce problems. Which is why the vfork() was removed from the recent versions of POSIX.
Generally you can make this assumptions about vfork() on modern systems which support it: If the child does something unexpected by the OS, the child would be upgraded from vfork()ed to normal fork()ed one.
For example on Linux the only difference between fork() and vfork() is that in later case, more of the child's data are made into lazy COW data. The effect is that vfork() is slightly faster than fork(), but child is likely to sustain some extra performance penalty if it tries to access the yet-not-copied data, since they are yet to be duplicated from the parent. (Notice the subtle problem: parent can also modify the data. That would also trigger the COW and duplication of the data between parent and child processes.)
I just want to use execv call from the child process, So i am using vfork instead of fork.
The handling should be equivalent regardless of the vfork() vs fork(): child should return with a special exit code, and parent should do the normal waitpid() and check the exit status.
Otherwise, if you want to write a portable application, do not use the vfork(): it is not part of the POSIX standard.
P.S. This article might be also of interest. If you still want to use the vfork(), then read this scary official manpage.
Yes. You should not modify the variable err. Use waitpid in the parent process to check the exit code of the child process. Also check the return value from execv and the errno variable(see man execv), to determine why your execv fails.

vfork VS fork in MPI mutiple thread

My program is very very large. So, I can't list it here. My program uses openMPI & mutiple_thread.
The problem has been solved. (using vfork() instead of fork()) But I don't know why it works. So, could anyone give me an explaination about it?
The problem is caused by free().
There are some segments of code in my program. All these segments are in threads which is created by pthread_create. The logic of these segments are like:
{
*p = malloc();
fun(p);
free(p);
}
All errors are at free(). It report a segment fault error. I ran the program more than 100 times. I found that there is always a fork() being called before each corruption at free.
The logic of fork segment is like(in thread):
{
MPI_program_code...
if(!fork())
{
execv(exe_file,arg);
}
MPI_program_code...
}
(Note that, in exe_file no MPI_function is used.)
When I use vfork() instead of fork(), there is no problem at all. But I don't know why it works.
So, could anyone explain why it works?
You might find the Open MPI FAQ topic on forking child processes very useful. Also an explanation on why using fork() with InfiniBand is dangerous can be found here.
vfork(2) differs from fork(2) in that it is specifically designed to be as lightweight as possible and is only meant to be used together with an immediately following execve(2) (or any of its wrappers from the C library) or _exit(2) call. The reason for that is that vfork(2) creates a child process that shares all memory with the parent instead of having it copy-on-write mapped, i.e. the new child is more like a thread than like a full-blown process. Since the child also uses the stack of the original thread, the parent is blocked until the child has either execve'd another executable or exited.
Open MPI registers a fork() handler using pthread_atfork(). The handler is not called when vfork() is used on modern Linux systems, therefore no actions are taken by the parent process upon forking.

What is the use of fork() - ing before exec()?

In *nix systems, processes are created by using fork() system call. Consider for example, init process creates another process.. First it forks itself and creates the a process which has the context like init. Only on calling exec(), this child process turns out to be a new process. So why is the intermediate step ( of creating a child with same context as parent ) needed? Isn't that a waste of time and resource, because we are creating a context ( consumes time and wastes memory ) and then over writing it?
Why is this not implemented as allocating a vacant memory area and then calling exec()? This would save time and resources right?
The intermediate step enables you to set up shared resources in the child process without the external program being aware of it. The canonical example is constructing a pipe:
// read output of "ls"
// (error checking omitted for brevity)
int pipe_fd[2];
pipe(&pipe_fd);
if (fork() == 0) { // child:
close(pipe_fd[0]); // we don't want to read from the pipe
dup2(pipe_fd[1], 1); // redirect stdout to the write end of the pipe
execlp("ls", "ls", (char *) NULL);
_exit(127); // in case exec fails
}
// parent:
close(pipe_fd[1]);
fp = fdopen(pipe_fd[0], "r");
while (!feof(fp)) {
char line[256];
fgets(line, sizeof line, fp);
...
}
Note how the redirection of standard output to the pipe is done in the child, between fork and exec. Of course, for this simple case, there could be a spawning API that would simply do this automatically, given the proper parameters. But the fork() design enables arbitrary manipulation of per-process resources in the child — one can close unwanted file descriptors, modify per-process limits, drop privileges, manipulate signal masks, and so on. Without fork(), the API for spawning processes would end up either extremely fat or not very useful. And indeed, the process spawning calls of competing operating systems typically fall somewhere in between.
As for the waste of memory, it is avoided with the copy on write technique. fork() doesn't allocate new memory for the child process, but points the child to the parent's memory, with the instructions to make a copy of a page only if the page is ever written to. This makes fork() not only memory-efficient, but also fast, because it only needs to copy a "table of contents".
This is an old complaint. Many people have asked Why fork() first? and typically they suggest an operation that will both create a new process from scratch and run a program in it. This operation is called something like spawn().
And they always say, Won't that be faster?
And in fact, every system other than the Unix family does go the "spawn" way. Only Unix is based on fork() and exec().
But it's funny, Unix has always been much faster than other full-featured systems. It has always handled way more users and load.
And Unix has been made even faster over the years. Fork() no longer really duplicates the address space, it just shares it using a technique called copy-on-write. (A very old fork optimization called vfork() is also still around.)
Drink the Kool-Aid.
I don't know exactly how the init process works on a kernel in terms of forking but to answer you question of why you need to call fork then exec is simply because once you exec there is no turning back.
If you check out the documentation here, it essentially requires a new process to be spawned (the fork call) in order for the parent process to resume control and either wait for it to finish or sit as a daemon probably would.
Only on calling exec(), this child process turns out to be a new
process.
Not really. After a fork, you already have new process, even not that much different from its parent. There are some cases where no exec need to follow a fork.
So why is the intermediate step ( of creating a child with same
context as parent ) needed?
One reason would be because it is an efficient way to create the whole shebang. Cloning is usually less complex than creating from scratch.
Isn't that a waste of time and resource, because we are creating a
context ( consumes time and wastes memory ) and then over writing it?
It is not a waste of time and resource as most of this resource is virtual, due to the copy on write mechanism used. Moreover, it is incorrect to state the created context is overwritten. Nothing is rewritten given the fact nothing was actually written in the first place. That's the whole point of COW. "Only" the process address space (code, heap and stack) are substituted, not overwritten. A lot of the process context is partially or totally preserved, including environment, file descriptors, priority, ignored signals, current and root directory, limits, various masks, processor bindings, privileges and several other things foreign to the process address space.

How do I spawn a daemon in uClinux using vfork?

This would be easy with fork(), but I've got no MMU. I've heard that vfork() blocks the parent process until the child exits or executes exec(). How would I accomplish something like this?:
pid_t pid = vfork();
if (pid == -1) {
// fail
exit(-1);
}
if (pid == 0) {
// child
while(1) {
// Do my daemon stuff
}
// Let's pretend it exits sometime
exit();
}
// Continue execution in parent without blocking.....
It seems there is no way to do this exactly as you have it here. exec or _exit have to get called for the parent to continue execution. Either put the daemon code into another executable and exec it, or use the child to spawn the original task. The second approach is the sneaky way, and is described here.
daemon() function for uClinux systems without MMU and fork(), by Jamie Lokier, in patch format
You can't do daemon() with vfork(). To create something similar to a daemon on !MMU using vfork(), the parent process doesn't die (so there are extra processes), and you should call your daemon on the background (i.e. by appending & to the command line on the shell).
On the other hand, Linux provides clone(). Armed with that, knowledge and care, it's possible to implement daemon() for !MMU. Jamie Lokier has a function to do just that on ARM and i386, get it from here.
Edit: made the link to Jamie Lokier's daemon() for !MMU Linux more prominent.
I would have thought that this would be the type of problem that many others had run into before, but I've had a hard time finding anyone talking about the "kill the parent" problems.
I initially thought that you should be able to do this with a (not quite so, but sort of) simple call to clone, like this:
pid_t new_vfork(void) {
return clone(child_func, /* child function */
child_stack, /* child stack */
SIGCHLD | CLONE_VM, /* flags */
NULL, /* argument to child */
NULL, /* pid of the child */
NULL, /* thread local storage for child */
NULL); /* thread id of child in child's mem */
}
Except that determining the child_stack and the child_func to work the way that it does with vfork is pretty difficult since child_func would need to be the return address from the clone call and the child_stack would need to be the top of the stack at the point that the actual system call (sys_clone) is made.
You could probably try to call sys_clone directly with
pid_t new_vfork(void) {
return sys_clone( SIGCHLD | CLONE_VM, NULL);
}
Which I think might get what you want. Passing NULL as the second argument, which is the child_stack pointer, causes the kernel to do the same thing as it does in vfork and fork, which is to use the same stack as the parent.
I've never used sys_clone directly and haven't tested this, but I think it should work. I believe that:
sys_clone( SIGCHLD | CLONE_VM | CLONE_VFORK, NULL);
is equivalent to vfork.
If this doesn't work (and you can't figure out how to do something similar) then you may be able to use the regular clone call along with setjump and longjmp calls to emulate it, or you may be able to get around the need for the "return's twice" semantics of fork and vfork.

Resources