Clone-equivalent of fork? - c

I'd like to use the namespacing features of the clone function. Reading the manpage, it seems like clone has lots of intricate details I need to worry about.
Is there an equivalent clone invocation to good ol' fork()?
I'm already familiar with fork, and believe that if I have a starting point in clone, I can add flags and options from there.

I think that this will work, but I'm not entirely certain about some of the pointer arguments.
pid_t child = clone( child_f, child_stack,
/* int flags */ SIGCHLD,
/* argument to child_f */ NULL,
/* pid_t *pid */ NULL,
/* struct user_desc *tls */ NULL,
/* pid_t *ctid */ NULL );
The low byte of the flags parameter specifies which signal is sent to notify the parent when the child does things like dying or stopping. I believe that all of the actual flags turn on behaviour that differs from fork. Looking at the kernel code suggests this is the case.
If you really want something close to fork, you may want to call sys_clone, which does not take a function pointer and instead returns twice, like fork.
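A minimal sketch of that idea, assuming Linux and glibc's generic syscall() wrapper: the raw clone system call is invoked with only SIGCHLD in the flags and every pointer argument zero, so the argument order differences between architectures do not matter. The name fork_via_clone is made up for this illustration. Note that this bypasses whatever bookkeeping the C library does inside fork() (for example pthread_atfork handlers), so treat it as a starting point for experiments rather than a drop-in replacement.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns twice, like fork().
 * 0 in the child, the child's pid in the parent, -1 on error. */
pid_t fork_via_clone(void)
{
    /* flags = SIGCHLD only; stack and the pid/tls pointers are all zero,
     * so the child continues on a copy of the parent's stack. */
    return (pid_t)syscall(SYS_clone, (unsigned long)SIGCHLD, 0UL, 0UL, 0UL, 0UL);
}
```

From here, namespace flags such as CLONE_NEWUTS can be OR-ed into the first argument, subject to the privilege requirements described in the manpage.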

You could fork a normal child process using fork(), then use unshare() to create a new namespace.
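A rough sketch of the fork-then-unshare approach, assuming Linux with glibc. The helper name try_unshare_in_child is invented here. Whether unshare() succeeds depends on privileges and kernel configuration (for instance, CLONE_NEWUTS alone normally needs CAP_SYS_ADMIN, while combining it with CLONE_NEWUSER may work unprivileged), so the helper only reports what happened.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child and have it try to enter new namespaces.
 * Returns 0 if unshare() succeeded in the child, 1 if it failed there,
 * and -1 if the child could not be created or reaped. */
int try_unshare_in_child(int ns_flags)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        /* Only the child changes namespaces; the parent is unaffected. */
        _exit(unshare(ns_flags) == 0 ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A real program would do its work in the child after unshare() succeeds instead of exiting immediately.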
Namespaces are a bit weird, I can't see a lot of use-cases for them.

clone() is used to create a thread. The big difference between clone() and fork() is that clone() is meant to start executing at a separate entry point, a function, whereas fork() just continues from the point in the code where it was invoked. int (*fn)(void *) in the manpage definition is that function; the int it returns on exit is the exit status.
The closest call to clone is pthread_create() which is essentially a wrapper for clone().
This does not get you a way to get fork() behavior.
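For comparison, here is a small sketch of the glibc clone() wrapper used with only SIGCHLD in the flags, which gives a separate child process (its own copy of the address space) that starts in a function instead of returning twice. It assumes Linux with glibc; run_child_with_clone and child_f are illustrative names, and the 1 MiB stack size is an arbitrary choice.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Example entry point: the child's exit status is the int it returns. */
int child_f(void *arg)
{
    return *(int *)arg;
}

/* Hypothetical helper: run fn(arg) in a cloned child and return its
 * exit status, or -1 on error. */
int run_child_with_clone(int (*fn)(void *), void *arg)
{
    const size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    if (stack == NULL)
        return -1;

    /* The stack grows down on common architectures, so pass its top. */
    pid_t child = clone(fn, stack + stack_size, SIGCHLD, arg);

    int status;
    int result = -1;
    if (child != -1 && waitpid(child, &status, 0) == child && WIFEXITED(status))
        result = WEXITSTATUS(status);

    free(stack);
    return result;
}
```

Because CLONE_VM is not set, the child works on its own copy of memory, much like a forked process; adding CLONE_NEW* flags to SIGCHLD is where the namespace features come in.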

Related

Get system calls IDs and store them in a .txt file(LINUX)

So I've been struggling with this exercise. I must get all of the system calls made by any given Linux command of my choice (e.g. ls or cd), list them in a .txt file, and have their unique IDs listed beside them.
So far here's what I've got:
strace -o filename.txt ls
This when executed in the Linux shell gives me a "filename.txt" file containing all the system calls of the ls command. Now in my C script:
#include <stdio.h>
#include <stdlib.h>
int main(){
system("strace -o filename.txt ls");
return 0;
}
This should do the same as the previous command, but it's not returning anything, although the code compiles successfully. How would I go about fixing this, and then get the IDs? I'm using the "stdlib" library because in my research I found that it has some relation to system call IDs, but I haven't found any indication of how to get them. Basically, I must read the file I created and give each system call its ID.
The exercise is obviously designed to be solved by using the ptrace() facility, because the strace utility does not have an option to print the syscall number (as far as I know).
Technically, you can use something like
printf '#include <sys/syscall.h>\n' | gcc -dD -E - | awk '$1 == "#define" { m[$2] = $3 } END { for (name in m) if (name ~ /^SYS_/) { v = name; while (v in m) v = m[v]; sub(/^SYS_/, "", name); printf "%s %s\n", v, name } }'
to generate a number of syscall-number syscall-name lines, to be used for mapping syscall names back to syscall numbers, but this would be silly and error-prone. Silly, because being able to use ptrace() gives you much more control than using the strace utility, and using a "clever hack" like above just means you avoid learning how to do that, which in my opinion is by definition self-defeating and therefore utterly silly; and error-prone, because there is absolutely no guarantee that the installed headers match the running architecture. This is especially problematic on multiarch architectures, where you can use -m32 and -m64 compiler options to switch between 32-bit and 64-bit architectures. They typically have completely different syscall numbers.
Essentially, your program should:
fork() a child process.
In the child process:
Enable ptracing by calling prctl(PR_SET_DUMPABLE, 1L)
Make parent process the tracer by calling ptrace(PTRACE_TRACEME, (pid_t)0, (void *)0, (void *)0)
Optionally, set tracing options. For example, call ptrace(PTRACE_SETOPTIONS, getpid(), PTRACE_O_TRACECLONE | PTRACE_O_TRACEEXEC | PTRACE_O_TRACEEXIT | PTRACE_O_TRACEFORK) so that you catch at least clone(), fork(), and exec() family of syscalls.
If you do not set the PTRACE_O_TRACEEXEC option, you should stop the child process at this point using e.g. raise(SIGSTOP);, so that the parent process can start tracing this child.
Execute the command to be traced using e.g. execv(). In particular, if the first command line parameter is the command to run, optionally followed by its options, you can use execvp(argv[1], argv + 1);.
If you set the PTRACE_O_TRACEEXEC option above, then the kernel will auto-pause the child process just before executing the new binary.
If the exec fails, the child process should exit. I like to use exit(127);, to return exit status 127.
In the parent process, use waitpid(childpid, &status, WUNTRACED | WCONTINUED) in a loop, to catch events in the child process.
The very first event should be the initial pause, i.e. WIFSTOPPED(status) being true. (If not, something else went wrong.)
There are three different reasons why waitpid(childpid, &status, WUNTRACED | WCONTINUED) may return:
When the child exits (WIFEXITED(status) will be true).
This should obviously end the tracing, and have the parent tracer process exit, too.
When the child resumes execution (WIFCONTINUED(status) will be true).
You cannot assume that a PTRACE_SYSCALL, PTRACE_SYSEMU, PTRACE_CONT etc. commands have actually caused the child process to continue, until the parent gets this signal. In other words, you cannot just fire ptrace() commands to the child process, and expect them to take place in an orderly fashion! The ptrace() facility is asynchronous, and the call will return immediately; you need to waitpid() for the WIFCONTINUED(status) type of event to know that the child process heeded the command.
When the kernel stopped the child (with SIGTRAP) because the child process is about to execute a syscall. (In the parent, WIFSTOPPED(status) will be true.)
Whenever the child process gets stopped because it is about to execute a syscall, you need to use ptrace(PTRACE_GETREGS, childpid, (void *)0, &regs) to obtain the CPU register state in the child process at the point of syscall execution.
regs is of type struct user_regs_struct, defined in <sys/user.h>. For Intel/AMD architectures, regs.orig_eax (for 32-bit) or regs.orig_rax (for 64-bit) contains the syscall number (SYS_foo as defined in <sys/syscall.h>).
You then need to call ptrace(PTRACE_SYSCALL, childpid, (void *)0, (void *)0) to tell the kernel to execute that syscall, and waitpid() again to wait for the WIFCONTINUED(status) event notifying that it did.
The next WIFSTOPPED(status) type event from waitpid() will occur when the syscall is completed. If you want, you can use PTRACE_GETREGS again to examine regs.eax or regs.rax, which contains the syscall return value; on Intel/AMD, if an error occurred, it will be a negative errno value (i.e. -EACCES, -EINVAL, or similar).
You need to call ptrace(PTRACE_SYSCALL, childpid, (void *)0, (void *)0) to tell the kernel to continue running the child, until the next syscall.
There are quite a few examples on-line showing some of the details above, although most that I have personally seen are pretty lax on error checking, and occasionally omit checking the WIFCONTINUED(status) waitpid() events. I've even written an answer detailing how to stop and continue individual threads on StackOverflow. Since the technique can be used as a very powerful custom debugging tool, I do recommend you try to learn the facility so you can leverage it in your work, rather than just copy-paste some existing code to get a passing grade on the exercise.
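As a starting point only, here is a compact sketch of such a tracer. It makes several assumptions that go beyond the description above: it targets Linux on x86-64 and reads the syscall number from orig_rax in struct user_regs_struct; it uses a plain blocking waitpid() with no option flags and does not wait for WIFCONTINUED events; and it has the parent set PTRACE_O_TRACESYSGOOD so that syscall stops can be told apart from other SIGTRAP stops. The names trace_syscalls and syscall_number are invented for this example.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Syscall number at a syscall-entry stop; only x86-64 is covered here. */
static long syscall_number(pid_t pid)
{
#if defined(__x86_64__)
    struct user_regs_struct regs;
    if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) == -1)
        return -1;
    return (long)regs.orig_rax;
#else
    (void)pid;
    return -1; /* other architectures are not handled in this sketch */
#endif
}

/* Runs argv under ptrace, writes one syscall number per line to out,
 * and returns how many syscall entries were seen (-1 on error). */
long trace_syscalls(char *const argv[], FILE *out)
{
    pid_t child = fork();
    if (child == -1)
        return -1;
    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);              /* let the parent take over */
        execvp(argv[0], argv);
        _exit(127);                  /* exec failed */
    }

    int status;
    if (waitpid(child, &status, 0) != child || !WIFSTOPPED(status))
        return -1;                   /* the initial pause did not happen */

    long count = 0;
    int entering = 1;                /* syscall stops alternate entry/exit */
    int sig = 0;                     /* signal to pass on when resuming */
    if (ptrace(PTRACE_SETOPTIONS, child, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD) != -1) {
        for (;;) {
            if (ptrace(PTRACE_SYSCALL, child, NULL, (void *)(long)sig) == -1)
                break;
            if (waitpid(child, &status, 0) != child)
                break;
            if (WIFEXITED(status) || WIFSIGNALED(status))
                return count;        /* child is gone: tracing is done */
            sig = 0;
            if (WSTOPSIG(status) == (SIGTRAP | 0x80)) {
                if (entering) {
                    fprintf(out, "%ld\n", syscall_number(child));
                    count++;
                }
                entering = !entering;
            } else if (WSTOPSIG(status) != SIGTRAP) {
                sig = WSTOPSIG(status);  /* forward ordinary signals */
            }
        }
    }
    kill(child, SIGKILL);            /* error path: clean up the child */
    waitpid(child, NULL, 0);
    return -1;
}
```

Calling it with a FILE opened on filename.txt and an argv such as { "ls", NULL } would give the list of numbers; mapping them back to names is a separate step.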

Implementing posix_spawn on Linux

I am curious to see if it would be possible to implement posix_spawn in Linux using a combination of vfork+exec. In a very simplified way (leaving out most optional arguments) this could look more or less like this:
int my_posix_spawn(pid_t *ppid, char **argv, char **env)
{
pid_t pid;
pid = vfork();
if (pid == -1)
return errno;
if (pid == 0)
{
/* Child */
execve(argv[0], argv, env);
/* If we got here, execve failed. How to communicate this to
* the parent? */
_exit(-1);
}
/* Parent */
if (ppid != NULL)
*ppid = pid;
return 0;
}
However I am wondering how to cope with the case where vfork succeeds (so the child process is created) but the exec call fails. There seems to be no way to communicate this to the parent, which would only see that it could apparently create a child process successfully (as it would get a valid pid back)
Any ideas?
As others have noted in the comments, posix_spawn is permitted to create a child process that immediately dies due to exec failure or other post-fork failures; the calling application needs to be prepared for this. But of course it's preferable not to do so.
The general procedure for communicating exec failure to the parent is described in an answer I wrote on this question: What can cause exec to fail? What happens next?.
Unfortunately, however, some of the operations you need to perform are not legal after vfork due to its nasty returns-twice semantics. I've covered this topic in the past in an article on ewontfix.com. The solution for making a posix_spawn that avoids duplicating the VM seems to be using clone with CLONE_VM (and possibly CLONE_VFORK) to get a new process that shares memory but doesn't run on the same stack. However, this still requires a lot of care to avoid making any calls to libc functions that might modify memory used by the parent. My current implementation is here:
http://git.musl-libc.org/cgit/musl/tree/src/process/posix_spawn.c?id=v1.1.4
and as you can see it's rather complicated. Reading the git history may be informative regarding some of the design decisions that were made.
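For illustration, one widely used way to report exec failure to the parent is a close-on-exec pipe; I am assuming this is the kind of procedure the linked answer refers to. This sketch deliberately uses plain fork() rather than vfork() or clone(), so it sidesteps the shared-memory problems discussed above at the cost of duplicating the address space. The name spawn_checked is made up, and pipe2() is Linux-specific.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 and stores the pid if exec succeeded,
 * otherwise returns the errno value from pipe2, fork, or execve. */
int spawn_checked(pid_t *ppid, char *const argv[], char *const envp[])
{
    int fds[2];
    if (pipe2(fds, O_CLOEXEC) == -1)
        return errno;

    pid_t pid = fork();
    if (pid == -1) {
        int e = errno;
        close(fds[0]);
        close(fds[1]);
        return e;
    }
    if (pid == 0) {
        execve(argv[0], argv, envp);
        /* Only reached if execve failed: send errno to the parent. */
        int e = errno;
        ssize_t w = write(fds[1], &e, sizeof e);
        (void)w;
        _exit(127);
    }

    /* Parent: a successful exec closes the write end (O_CLOEXEC), so
     * read() returns 0; a failed exec delivers the child's errno. */
    close(fds[1]);
    int child_errno = 0;
    ssize_t n;
    do {
        n = read(fds[0], &child_errno, sizeof child_errno);
    } while (n == -1 && errno == EINTR);
    close(fds[0]);

    if (n == (ssize_t)sizeof child_errno) {
        waitpid(pid, NULL, 0);       /* reap the failed child */
        return child_errno;
    }
    if (ppid != NULL)
        *ppid = pid;
    return 0;
}
```

With this, the parent sees ENOENT (or similar) as the return value instead of a valid pid for a child that never ran the program.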
I don't think there's any good way to do this with the current set of system calls. You've correctly identified the biggest problem -- the absence of any reliable way to report failure after the vfork. Other problems include race conditions in setting child state, and Linux's lack of interest in picking up closefrom.
Several years ago I sketched a new system-level API that would solve this problem: the key addition is a system call, which I called egg(), that creates a process without giving it an address space, and inheriting no state from the parent. Obviously, an egg process can't execute code; but you can (with a whole bunch more new system calls) set all of its kernelside state, and then (with yet another system call, hatch()) load an executable into it and set it going. Crucially, all of the new system calls report failure in the parent. For instance, there's a dup_into(pid, to_fd, from_fd) call that copies parent file descriptor from_fd to egg-state process pid's file descriptor to_fd; if it fails, the parent gets the failure code.
I never had time to flesh all of that out into a coherent API specification and code it up (and I'm not a kernel hacker, anyway) but I still think the concept has legs and I would be happy to work with someone to get it done.

vfork VS fork in MPI multiple thread

My program is very large, so I can't list it here. It uses Open MPI and multiple threads.
The problem has been solved (by using vfork() instead of fork()), but I don't know why it works. Could anyone give me an explanation?
The problem is caused by free().
There are some segments of code in my program. All of these segments run in threads created by pthread_create. The logic of these segments is like:
{
*p = malloc();
fun(p);
free(p);
}
All errors occur at free(), which reports a segmentation fault. I ran the program more than 100 times and found that fork() is always called before each crash at free().
The logic of the fork segment is like this (in a thread):
{
MPI_program_code...
if(!fork())
{
execv(exe_file,arg);
}
MPI_program_code...
}
(Note that, in exe_file no MPI_function is used.)
When I use vfork() instead of fork(), there is no problem at all. But I don't know why it works.
So, could anyone explain why it works?
You might find the Open MPI FAQ topic on forking child processes very useful. Also an explanation on why using fork() with InfiniBand is dangerous can be found here.
vfork(2) differs from fork(2) in that it is specifically designed to be as lightweight as possible and is only meant to be used together with an immediately following execve(2) (or any of its wrappers from the C library) or _exit(2) call. The reason for that is that vfork(2) creates a child process that shares all memory with the parent instead of having it copy-on-write mapped, i.e. the new child is more like a thread than like a full-blown process. Since the child also uses the stack of the original thread, the parent is blocked until the child has either execve'd another executable or exited.
Open MPI registers a fork() handler using pthread_atfork(). The handler is not called when vfork() is used on modern Linux systems, therefore no actions are taken by the parent process upon forking.
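The difference can be observed with a small experiment. This is only a sketch, assuming Linux with glibc: it registers a prepare handler with pthread_atfork() and counts how often it runs for fork() versus vfork(). The names prepare_calls_during and on_prepare are invented, and it says nothing about what Open MPI's own handler actually does.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int prepare_calls;

/* Runs in the parent just before fork() creates the child. */
static void on_prepare(void)
{
    prepare_calls++;
}

/* Hypothetical helper: returns how many times the prepare handler ran
 * while creating one child with fork() or vfork(), or -1 on error. */
int prepare_calls_during(int use_vfork)
{
    static int registered;
    if (!registered) {
        if (pthread_atfork(on_prepare, NULL, NULL) != 0)
            return -1;
        registered = 1;
    }

    prepare_calls = 0;
    pid_t pid = use_vfork ? vfork() : fork();
    if (pid == 0)
        _exit(0);                    /* child: leave immediately */
    if (pid == -1)
        return -1;
    waitpid(pid, NULL, 0);
    return prepare_calls;
}
```

On a glibc system I would expect 1 for fork() and 0 for vfork(), which matches the explanation above.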

setuid() before calling execv() in vfork() / clone()

I need to fork and exec from a server. Since my server's memory footprint is large, I intend to use vfork() / Linux clone(). I also need to open pipes for stdin / stdout / stderr. Is this allowed with clone() / vfork()?
From the standard:
[..] the behaviour is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit() or one of the exec family of functions.
The problem with calling functions like setuid or pipe is that they could affect memory in the address space shared between the parent and child processes. If you need to do anything before exec, the best way is to write a small shim process that does whatever you need it to and then execs to the eventual child process (perhaps arguments supplied through argv).
shim.c
======
enum {
/* initial arguments */
ARGV_FILE = 5, ARGV_ARGS
};
int main(int argc, char *argv[]) {
/* consume instructions from argv */
/* setuid, pipe() etc. */
return execvp(argv[ARGV_FILE], argv + ARGV_ARGS);
}
I'd use clone() instead, using CLONE_VFORK|CLONE_VM flags; see man 2 clone for details.
Because CLONE_FILES is not set, the child process has its own file descriptors, and can close and open standard descriptors without affecting the parent at all.
Because the cloned process is a separate process, it has its own user and group ids, so setting them via setresgid() and setresuid() (perhaps calling setgroups() or initgroups() first to set the additional groups -- see man 2 setresuid, man 2 setgroups, and man 3 initgroups for details) will not affect the parent at all.
The CLONE_VFORK|CLONE_VM flags mean this clone() should behave like vfork(), with the child process running in the same memory space as the parent process up till the execve() call.
This approach avoids the latency of using an intermediate executable, which is pretty significant, but the approach is completely Linux-specific.
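A minimal sketch of this approach, assuming Linux with glibc. The names spawn_clone_vfork, child_main, and struct child_args are made up, and the 64 KiB stack is an arbitrary size. The child only redirects stdout and execs; the comment marks where credential changes would go. Because the child shares the parent's memory until exec, anything it calls there has to be chosen with care, and this sketch does not try to cover signal handling or multithreaded parents.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct child_args {
    char *const *argv;   /* program and arguments for execvp() */
    int stdout_fd;       /* descriptor to become stdout, or -1 */
};

/* Runs in the child, on its own stack but in the parent's memory. */
static int child_main(void *p)
{
    struct child_args *a = p;
    /* CLONE_FILES is not set, so this dup2() affects only the child. */
    if (a->stdout_fd >= 0 && dup2(a->stdout_fd, STDOUT_FILENO) == -1)
        _exit(126);
    /* Credential changes (setgroups, setresgid, setresuid) would go here. */
    execvp(a->argv[0], a->argv);
    _exit(127);
}

/* Hypothetical helper: returns the child's pid, or -1 on error. */
pid_t spawn_clone_vfork(char *const argv[], int stdout_fd)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    if (stack == NULL)
        return -1;

    struct child_args args = { argv, stdout_fd };
    pid_t pid = clone(child_main, stack + stack_size,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, &args);

    /* CLONE_VFORK kept us suspended until the child exec'd or exited,
     * so the temporary stack and args are no longer in use. */
    free(stack);
    return pid;
}
```

The parent would create the pipes beforehand, pass the appropriate ends in, and close its copies after the call returns.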

How do I spawn a daemon in uClinux using vfork?

This would be easy with fork(), but I've got no MMU. I've heard that vfork() blocks the parent process until the child exits or executes exec(). How would I accomplish something like this?:
pid_t pid = vfork();
if (pid == -1) {
// fail
exit(-1);
}
if (pid == 0) {
// child
while(1) {
// Do my daemon stuff
}
// Let's pretend it exits sometime
exit();
}
// Continue execution in parent without blocking.....
It seems there is no way to do this exactly as you have it here. exec or _exit have to get called for the parent to continue execution. Either put the daemon code into another executable and exec it, or use the child to spawn the original task. The second approach is the sneaky way, and is described here.
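The first option (daemon code in a separate executable) can be sketched roughly as below. The helper name start_daemon_exec is invented, and the daemon binary itself is assumed to do its own setup (setsid(), closing descriptors, and so on) after the exec. The parent is blocked only for the short time between vfork() and the exec.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start cmd[0] as a separate program via vfork().
 * Returns the child's pid, or -1 if vfork() failed. */
pid_t start_daemon_exec(char *const cmd[])
{
    pid_t pid = vfork();
    if (pid == 0) {
        /* Child: do nothing except exec or _exit, as vfork() requires. */
        execv(cmd[0], cmd);
        _exit(127);                  /* exec failed; unblocks the parent */
    }
    /* Parent: resumes here as soon as the child has exec'd or exited. */
    return pid;
}
```

The parent can then carry on without waiting, or reap the child later with waitpid() if it ever exits.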
daemon() function for uClinux systems without MMU and fork(), by Jamie Lokier, in patch format
You can't do daemon() with vfork(). To create something similar to a daemon on !MMU using vfork(), the parent process doesn't die (so there are extra processes), and you should run your daemon in the background (i.e. by appending & to the command line in the shell).
On the other hand, Linux provides clone(). Armed with that, knowledge and care, it's possible to implement daemon() for !MMU. Jamie Lokier has a function to do just that on ARM and i386, get it from here.
I would have thought that this would be the type of problem that many others had run into before, but I've had a hard time finding anyone talking about the "kill the parent" problems.
I initially thought that you should be able to do this with a (not quite so, but sort of) simple call to clone, like this:
pid_t new_vfork(void) {
return clone(child_func, /* child function */
child_stack, /* child stack */
SIGCHLD | CLONE_VM, /* flags */
NULL, /* argument to child */
NULL, /* pid of the child */
NULL, /* thread local storage for child */
NULL); /* thread id of child in child's mem */
}
Except that determining the child_stack and the child_func to work the way that it does with vfork is pretty difficult since child_func would need to be the return address from the clone call and the child_stack would need to be the top of the stack at the point that the actual system call (sys_clone) is made.
You could probably try to call sys_clone directly with
pid_t new_vfork(void) {
return sys_clone( SIGCHLD | CLONE_VM, NULL);
}
Which I think might get what you want. Passing NULL as the second argument, which is the child_stack pointer, causes the kernel to do the same thing as it does in vfork and fork, which is to use the same stack as the parent.
I've never used sys_clone directly and haven't tested this, but I think it should work. I believe that:
sys_clone( SIGCHLD | CLONE_VM | CLONE_VFORK, NULL);
is equivalent to vfork.
If this doesn't work (and you can't figure out how to do something similar) then you may be able to use the regular clone call along with setjmp and longjmp calls to emulate it, or you may be able to get around the need for the "returns twice" semantics of fork and vfork.

Resources