vfork+execve strange when using syscall - c

If you run the code below, you'll see that execve appears to return a process ID and the parent branch never executes. I looked for documentation but either couldn't find it or couldn't understand it. The clone man page talks about vfork (CLONE_VFORK) and says the following, but the parent never seems to run. If you uncomment the non-syscall vfork, or use the fork syscall, it works as expected:
the execution of the calling process is suspended until the child releases its virtual memory resources via a call to execve(2) or _exit(2) (as with vfork(2)).
#include <unistd.h>
#include <syscall.h>
#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[])
{
//int a = vfork();
//int a = syscall(__NR_fork);
int a = syscall(__NR_vfork);
if (a) {
write(2, "parent\n", 7);
} else {
char*args[] = {"/usr/bin/true", (char*)0};
int res = execve(args[0], args, &argv[2]);
char buf[256];
sprintf(buf, "child got %d\n", res);
write(2, buf, strlen(buf));
}
write(2, "Done\nChild\n", a?5:11);
}

There are multiple instances of undefined behavior in the code.
You are invoking undefined behavior by making calls such as sprintf() and write() after execve() fails. Per POSIX:
... the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit() or one of the exec family of functions.
Even simply returning from main() after vfork() invokes undefined behavior.
@Barmar summed it up best: "you should just not use vfork() at all"
This code also invokes undefined behavior:
char*args[] = {"/usr/bin/true", (char*)0};
int res = execve(args[0], args, &argv[2]);
argv[2] doesn't exist, so passing its address to execve() invokes undefined behavior. Note that taking the address of argv[2] does not in itself invoke undefined behavior: an address one past the actual end of an array does exist. But it can't be safely dereferenced, which execve() will do.
execve() expects a pointer to an array of environment pointers as its third argument:
Using execve()
The following example passes arguments to the ls command in the cmd
array, and specifies the environment for the new process image using
the env argument.
#include <unistd.h>
int ret;
char *cmd[] = { "ls", "-l", (char *)0 };
char *env[] = { "HOME=/usr/home", "LOGNAME=home", (char *)0 };
...
ret = execve ("/bin/ls", cmd, env);
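As a rough sketch of what passing a real, NULL-terminated environment array looks like in a complete function (run_with_env is a hypothetical helper name of mine, not something from the question or from POSIX, and it uses plain fork() to stay clear of the vfork() restrictions):

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `path` with the given argv and envp, wait for
 * it, and return its exit status, or -1 if it did not exit normally. */
int run_with_env(const char *path, char *const argv[], char *const envp[])
{
    assert(path != NULL && argv != NULL && envp != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;                  /* fork failed */

    if (pid == 0) {
        execve(path, argv, envp);   /* envp is a proper NULL-terminated array */
        _exit(127);                 /* only reached if execve failed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with argv = { "sh", "-c", "test \"$GREETING\" = hello", 0 } and envp = { "GREETING=hello", 0 } on "/bin/sh" should return 0, since the child sees exactly the environment you passed.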

I was curious what exactly did happen. I used strace -f ./a.out to see output like this, showing that it's the parent making a write(2, "Done\nChild\n", 11) system call. (lower-numbered PID, and not the new PID strace reports attaching to after vfork)
...
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
munmap(0x7f7e48c59000, 193483) = 0
vfork(strace: Process 515667 attached
<unfinished ...>
[pid 515667] execve("/usr/bin/true", ["/usr/bin/true"], 0x7ffc4447ce18 /* 60 vars */ <unfinished ...>
[pid 515666] <... vfork resumed>) = 515667
[pid 515666] write(2, "child got 515667\n", 17child got 515667
) = 17
[pid 515667] <... execve resumed>) = 0
[pid 515666] write(2, "Done\nChild\n", 11Done
Child
) = 11
[pid 515667] brk(NULL <unfinished ...>
[pid 515666] exit_group(0 <unfinished ...>
[pid 515667] <... brk resumed>) = 0x5603b644c000
[pid 515666] <... exit_group resumed>) = ?
[pid 515667] arch_prctl(0x3001 /* ARCH_??? */, 0x7ffc878f2720) = -1 EINVAL (Invalid argument)
[pid 515666] +++ exited with 0 +++
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
... the parent has exited by now, leaving just the child running the dynamic linker for /usr/bin/true
This is terminal output mixed with strace output; I could have used strace -f -o vfork.trace ./a.out to capture the log separately, or ./a.out &>/dev/null.
The child overwrites the parent's return address, to the execve call site
The actual behaviour of this C code with undefined behaviour happened to be the same with gcc (-O0 by default), gcc -O3, and clang -O3. So for asm that was easier to single-step with GDB, I built it with gcc -O3 -fno-plt on my Arch GNU/Linux system (GCC12.2 in case it matters). -fno-plt means that dynamic linking isn't "lazy", so we can step into library functions.
It was also handy to look at the compiler's asm source with symbolic names (https://godbolt.org/z/j6ME6rWaa).
After vfork, GDB detaches the child and lets it run, so you're still single-stepping the parent.
The parent's return from the glibc syscall() wrapper function is not to the test eax,eax instruction after call syscall; it's to the instruction after a different call. It seems that after the child returns from vfork, it ends up overwriting the return address on the stack before the parent has a chance to run. That makes sense; the compiler-generated asm for main doesn't adjust RSP after function entry, so any other call would push a return address to the same place, overwriting the return address in the other process.
The glibc wrapper for vfork avoids this by popping the return address into a register before the syscall and pushing it back right after, so that it works under the conditions where POSIX and the Linux man page say it should. (Those don't include the way you're using it, but even in a safe usage, the child's call execve before the parent can ret from a wrapper function would be a problem.) The glibc wrapper's correctness also relies on the kernel not running the parent until after the child has exited or execve'd (see a later section below); looking at just the user-space asm, you'd think there was a possible race condition and that it might only usually work.
The actual place it returned to was a RIP-relative LEA following a call, not a test eax,eax. That was the lightbulb moment, the clue that a return address would have been overwritten. That LEA is setting up args for sprintf; the preceding call was call execve.
That makes sense; execve is the last thing the child did since it only returns on error; on success it replaces the process with a fresh address space that's no longer shared with the parent.
After the child returned from syscall(__NR_vfork), it branched and called execve, pushing that return address and overwriting the parent's return address from call syscall, because they share an address space including the stack.
That leaves just the parent, executing from the return path of execve(), which in a non-buggy (or non-hacky) program would only be reachable on error.
So it does the sprintf. It prints child got 515667 because that PID was the value in EAX as the parent was returning from vfork (to this block of code which takes res from the EAX return value of this other call site.)
As for how it manages to pick 11 instead of 5 as the length for the write system call, the details probably differ in debug vs. optimized builds. In an optimized build, different branches of the if(a) leave a different number in a register which the call to write() uses.
In a debug build, only the child returned to the vfork call site and stored an a value to the stack.
Shenanigans like this are why nobody uses vfork anymore; a couple copy-on-write page-faults are cheap enough that it's not worth playing with fire.
It's also why the rules on how you're allowed to use vfork are very restrictive; you'd better have your args for execve already constructed before you call vfork, so the very next thing can be a call execve.
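For what it's worth, here is a minimal sketch of the usage pattern those rules allow, assuming glibc on Linux. vfork_exec is a hypothetical name of mine; the point is only that everything is built before vfork(), the libc wrapper is used rather than syscall(__NR_vfork), and the child does nothing except execve() and _exit().

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* asks glibc to declare vfork() in <unistd.h> */
#endif
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: argv and envp must be fully constructed by the
 * caller, so the child has nothing to do but exec. Returns the child's
 * exit status, or -1 on failure or abnormal termination. */
int vfork_exec(const char *path, char *const argv[], char *const envp[])
{
    assert(path != NULL && argv != NULL && envp != NULL);

    pid_t pid = vfork();        /* libc wrapper: protects the return address */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: no stores to locals, no stdio, no return from this function. */
        execve(path, argv, envp);
        _exit(127);             /* execve failed; _exit, never exit() */
    }

    /* Parent: the kernel resumes us only after the child exec'd or exited. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Even in this form it buys very little over fork() or posix_spawn(), so treat it as an illustration of the rules rather than a recommendation.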
syscall(__NR_vfork) isn't safe; it needs special handling
Single-stepping into the glibc wrapper (stepi aka si in GDB, in layout asm TUI mode), we can see its asm.
│ 0x7ffff7e7d830 <vfork> endbr64
│ 0x7ffff7e7d834 <vfork+4> pop rdi
│ 0x7ffff7e7d835 <vfork+5> mov eax,0x3a
│ 0x7ffff7e7d83a <vfork+10> syscall
│ 0x7ffff7e7d83c <vfork+12> push rdi
│ > 0x7ffff7e7d83d <vfork+13> cmp eax,0xfffff001 # EAX >= -ERRNO_MAX
│ 0x7ffff7e7d842 <vfork+18> jae 0x7ffff7e7d858 <vfork+40>
# else no-error return path.
│ 0x7ffff7e7d844 <vfork+20> xor esi,esi
│ 0x7ffff7e7d846 <vfork+22> rdsspq rsi
│ 0x7ffff7e7d84b <vfork+27> test rsi,rsi # if shadow stack not in use
│ 0x7ffff7e7d84e <vfork+30> je 0x7ffff7e7d857 <vfork+39>
│ 0x7ffff7e7d850 <vfork+32> test eax,eax # in parent, normal return
│ 0x7ffff7e7d852 <vfork+34> jne 0x7ffff7e7d857 <vfork+39>
│ 0x7ffff7e7d854 <vfork+36> pop rdi # pop real return address
│ 0x7ffff7e7d855 <vfork+37> jmp rdi # and manually return to the correct address from the shadow stack?
# no shadow-stack path of execution, return normally.
│ 0x7ffff7e7d857 <vfork+39> ret
# error handling, set errno and return -1
│ 0x7ffff7e7d858 <vfork+40> mov rcx,QWORD PTR [rip+0x105509] # 0x7ffff7f82d68
│ 0x7ffff7e7d85f <vfork+47> neg eax
│ 0x7ffff7e7d861 <vfork+49> mov DWORD PTR fs:[rcx],eax
│ 0x7ffff7e7d864 <vfork+52> or rax,0xffffffffffffffff # code-size optimization for mov rax,-1 (really rarely executed for most system calls)
│ 0x7ffff7e7d868 <vfork+56> ret
rdsspq reads the "shadow stack" pointer, in case the caller was using CET, Control-flow Enforcement Technology. I'm not familiar with CET, so my comments on that part are guesswork based on what this function probably needs to do, and how it's using these instructions.
I should have just looked at the hand-written glibc source which has comments, glibc/sysdeps/unix/sysv/linux/x86_64/vfork.S; updated with some from there.
It seems like there could still be a race with the child, like if our push rdi runs before the child returns and calls execve. Under normal scheduling conditions, though, the child does run first.
But no, there's special logic to handle that:
https://man7.org/linux/man-pages/man2/vfork.2.html
vfork() differs from fork(2) in that the calling thread is
suspended until the child terminates (either normally, by calling
_exit(2), or abnormally, after delivery of a fatal signal), or it
makes a call to execve(2). Until that point, the child shares
all memory with its parent, including the stack. The child must
not return from the current function or call exit(3) (which would
have the effect of calling exit handlers established by the
parent process and flushing the parent's stdio(3) buffers), but
may call _exit(2).
As you mentioned in comments, if you wanted to use this for concurrency / threading, use pthread_create(3) to start threads, not vfork()! Or the same raw system call it uses, clone(CLONE_THREAD). (Note that the glibc wrapper for clone uses the new thread's stack memory to store a code pointer to be called; the kernel API/ABI doesn't have a code-pointer arg; see the C library / kernel differences part of the man page, and maybe the glibc source code for clone().)
These days, vfork is implemented inside the kernel as clone( flags=CLONE_VM | CLONE_VFORK | SIGCHLD ).

Related

Why exit_group flushes output buffer?

From the manual page I know that:
exit() flushes output buffers while _exit,_Exit,exit_group don't.
In the code below, the content of test.log will be hello\nhello\n only if exit() was called in child process, which is the same as I tested.
But when there's no statements in child process, it will behave like it's calling exit(), the content of test.log will be hello\nhello\n too.
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>
int main()
{
int pid;
FILE *fp = fopen("test.log", "w");
fprintf(fp, "hello\n");
pid = fork();
if (pid == 0)
{
// do something
;
}
else
{
wait(NULL);
exit(1);
}
}
Through ltrace and strace I can see both parent and child called exit_group.
Since I called exit in the parent, ltrace shows that exit ends up calling exit_group.
But for the child, I can't tell whether it called exit.
Does gcc call exit in the child process implicitly? Many people say it may be harmful to call exit in a child process. If it doesn't, why was the buffer flushed?
Testing gcc and glibc version:
gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)
GNU C Library (Debian GLIBC 2.24-11+deb9u4) stable release version
2.24
Promoting PSkocik's comment to an answer:
Returning from main is always equivalent to calling exit(). Either way is supposed to represent "normal" termination of a process, where you want for all the data you buffered to actually get written out, all atexit() cleanup functions called, and so on.
Indeed, the C library's startup code is often written simply as exit(main(argc, argv));
If you want the "harsher" effect of _exit() or _Exit() then you have to call them explicitly. And a forked child process is a common instance where this is what you do want.
But when there's no statements in child process, it will behave like it's calling exit(), the content of test.log will be hello\nhello\n too.
You have a statement in child process (implicit):
When the child process returns from main(), the C runtime makes a call to exit(0); (or better, something equivalent to exit(main(argc, argv, envp));, indeed), which implies the flushing of buffers you are trying to avoid.
If you want to eliminate one of the hello\ns, just fflush(fp); before calling to fork() system call, as the call will duplicate fp in both processes, each with a partially full buffer the same way, and this is the reason you get the buffer contents duplicated. Or better flush all output buffers with fflush(NULL);.
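A small sketch to see both behaviours side by side. lines_after_fork is a hypothetical helper I made up for this; it writes one line through stdio, forks, lets the child terminate via exit() (the same thing returning from main does), and then counts the lines that actually reached the file.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns how many lines end up in `path`,
 * or -1 on error. */
int lines_after_fork(const char *path, int flush_before_fork)
{
    assert(path != NULL);

    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    fprintf(fp, "hello\n");     /* stays in the stdio buffer (regular file) */
    if (flush_before_fork)
        fflush(fp);             /* empty the buffer before it gets duplicated */

    pid_t pid = fork();
    if (pid < 0) {
        fclose(fp);
        return -1;
    }
    if (pid == 0)
        exit(0);                /* flushes the child's copy of the buffer */

    waitpid(pid, NULL, 0);
    fclose(fp);                 /* flushes the parent's copy of the buffer */

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    int lines = 0, c;
    while ((c = getc(fp)) != EOF)
        if (c == '\n')
            lines++;
    fclose(fp);
    return lines;
}
```

Without the flush, each process writes its own copy of the buffered line, so the file ends up with two lines; with the flush there is only one. Replacing exit(0) with _exit(0) in the child would be the other way to get a single line.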

Who sets the RIP register when you call the clone syscall?

I am trying to implement a minimal kernel and I am trying to implement the clone syscall. In the man pages you can see the clone syscall defined as such:
int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...
/* pid_t *parent_tid, void *tls, pid_t *child_tid */ );
As you can see, it receives a function pointer. If you read the man page more closely you can actually see that the actual syscall implementation in the kernel does not receive a function pointer:
long clone(unsigned long flags, void *stack,
int *parent_tid, int *child_tid,
unsigned long tls);
So, my question is, who modifies the RIP register after a thread is created? Is it the libc?
I found this code in glibc: https://elixir.bootlin.com/glibc/latest/source/sysdeps/unix/sysv/linux/x86_64/clone.S but I am not sure at what point the function is actually called.
Extra information:
When looking at the clone.S source code you can see that it jumps to a thread_start branch after the syscall. On the branch after the clone syscall (so only the child does this) it pops the function address and the arguments from the stack. Who actually pushed these arguments and the function address on the stack? I guess it has to happen somewhere in the kernel because at the point of the syscall instruction they were not there.
Here is some gdb output:
Right before the syscall:
[-------------------------------------code-------------------------------------]
0x7ffff7d8af22 <clone+34>: mov r8,r9
0x7ffff7d8af25 <clone+37>: mov r10,QWORD PTR [rsp+0x8]
0x7ffff7d8af2a <clone+42>: mov eax,0x38
=> 0x7ffff7d8af2f <clone+47>: syscall
0x7ffff7d8af31 <clone+49>: test rax,rax
0x7ffff7d8af34 <clone+52>: jl 0x7ffff7d8af49 <clone+73>
0x7ffff7d8af36 <clone+54>: je 0x7ffff7d8af39 <clone+57>
0x7ffff7d8af38 <clone+56>: ret
Guessed arguments:
arg[0]: 0x3d0f00
arg[1]: 0x7ffff8020b60 --> 0x7ffff7d3fb30 (<do_something>: push rbx)
arg[2]: 0x7fffffffda90 --> 0x0
[------------------------------------stack-------------------------------------]
0000| 0x7fffffffda78 --> 0x7ffff7d3f52c (<main+172>: pop rsi)
0008| 0x7fffffffda80 --> 0x7fffffffda94 --> 0x73658b0000000000
0016| 0x7fffffffda88 --> 0x7fffffffda94 --> 0x73658b0000000000
0024| 0x7fffffffda90 --> 0x0
0032| 0x7fffffffda98 --> 0x492e085573658b00
0040| 0x7fffffffdaa0 --> 0x7ffff7d3f0d0 (<_init>: sub rsp,0x8)
0048| 0x7fffffffdaa8 --> 0x7ffff7d40830 (<__libc_csu_init>: push r15)
0056| 0x7fffffffdab0 --> 0x7ffff7d408d0 (<__libc_csu_fini>: push rbp)
[------------------------------------------------------------------------------]
After the syscall instruction on the child thread (check the top of the stack - this does not happen on the parent's thread):
[-------------------------------------code-------------------------------------]
0x7ffff7d8af25 <clone+37>: mov r10,QWORD PTR [rsp+0x8]
0x7ffff7d8af2a <clone+42>: mov eax,0x38
0x7ffff7d8af2f <clone+47>: syscall
=> 0x7ffff7d8af31 <clone+49>: test rax,rax
0x7ffff7d8af34 <clone+52>: jl 0x7ffff7d8af49 <clone+73>
0x7ffff7d8af36 <clone+54>: je 0x7ffff7d8af39 <clone+57>
0x7ffff7d8af38 <clone+56>: ret
0x7ffff7d8af39 <clone+57>: xor ebp,ebp
[------------------------------------stack-------------------------------------]
0000| 0x7ffff8020b60 --> 0x7ffff7d3fb30 (<do_something>: push rbx)
0008| 0x7ffff8020b68 --> 0x7ffff7dd5add --> 0x4c414d0074736574 ('test')
0016| 0x7ffff8020b70 --> 0x0
0024| 0x7ffff8020b78 --> 0x411
0032| 0x7ffff8020b80 ("Parameters: 0x7ffff7d3fb30 4001536 0x7ffff8020b70 0x7fffffffda90 0x7ffff8000b60 0x7fffffffda94\n")
0040| 0x7ffff8020b88 ("rs: 0x7ffff7d3fb30 4001536 0x7ffff8020b70 0x7fffffffda90 0x7ffff8000b60 0x7fffffffda94\n")
0048| 0x7ffff8020b90 ("fff7d3fb30 4001536 0x7ffff8020b70 0x7fffffffda90 0x7ffff8000b60 0x7fffffffda94\n")
0056| 0x7ffff8020b98 ("30 4001536 0x7ffff8020b70 0x7fffffffda90 0x7ffff8000b60 0x7fffffffda94\n")
[------------------------------------------------------------------------------]
Normally the way it works is that, when the computer boots, Linux sets up a MSR (Model Specific Register) to work with the assembly instruction syscall. The assembly instruction syscall will make the RIP register jump to the address specified in the MSR to enter kernel mode. As stated in 64-ia-32-architectures-software-developer-vol-2b-manual from Intel:
SYSCALL invokes an OS system-call handler at privilege level 0.
It does so by loading RIP from the IA32_LSTAR MSR
Once in kernel mode, the kernel looks at the values passed in registers (on x86-64, RAX holds the syscall number and RDI, RSI, RDX, R10, R8 and R9 hold the arguments) to determine what the syscall is asking for. Then the kernel will invoke one of the sys_XXX functions whose prototypes are in linux/syscalls.h (https://elixir.bootlin.com/linux/latest/source/include/linux/syscalls.h#L217). The definition of sys_clone is in kernel/fork.c.
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
int __user *, parent_tidptr,
int __user *, child_tidptr,
unsigned long, tls)
#endif
{
return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
}
The SYSCALL_DEFINE5 macro takes the first argument and prefixes sys_ to it. This function is actually sys_clone and it calls _do_fork.
It means there really isn't a clone() function which is invoked by glibc to call into the kernel. The kernel is called with the syscall instruction, it jumps to an address specified in the MSR and then it invokes one of the syscalls in the sys_call_table.
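To make that concrete, here is a small sketch that enters the kernel with the syscall instruction directly, assuming x86-64 Linux and GCC/Clang inline asm. raw_getpid is a hypothetical name; on other architectures the sketch just falls back to libc's generic syscall() function.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* asks glibc to declare syscall() */
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: getpid without going through the getpid() wrapper. */
pid_t raw_getpid(void)
{
    long ret;
#if defined(__x86_64__) && defined(__linux__)
    /* The syscall number goes in RAX and the result comes back in RAX.
     * The instruction itself clobbers RCX and R11. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"((long)SYS_getpid)
                      : "rcx", "r11", "memory");
#else
    ret = syscall(SYS_getpid);  /* portable fallback through libc */
#endif
    assert(ret > 0);            /* getpid cannot fail */
    return (pid_t)ret;
}
```

The value should match what getpid() returns; the only thing the libc wrapper adds for a call like this is the errno convention.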
The entry point to the kernel for x86 is here: https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/entry_64.S. If you scroll down you'll see the line: call *sys_call_table(, %rax, 8). Basically, call one of the functions of the sys_call_table. The implementation of the sys_call_table is here: https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/syscall_64.c#L20.
// SPDX-License-Identifier: GPL-2.0
/* System call table for x86-64. */
#include <linux/linkage.h>
#include <linux/sys.h>
#include <linux/cache.h>
#include <linux/syscalls.h>
#include <asm/unistd.h>
#include <asm/syscall.h>
#define __SYSCALL_X32(nr, sym)
#define __SYSCALL_COMMON(nr, sym) __SYSCALL_64(nr, sym)
#define __SYSCALL_64(nr, sym) extern long __x64_##sym(const struct pt_regs *);
#include <asm/syscalls_64.h>
#undef __SYSCALL_64
#define __SYSCALL_64(nr, sym) [nr] = __x64_##sym,
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
/*
* Smells like a compiler bug -- it doesn't work
* when the & below is removed.
*/
[0 ... __NR_syscall_max] = &__x64_sys_ni_syscall,
#include <asm/syscalls_64.h>
};
I recommend you read the following: https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-2.html. On this website is stated that
As you can see, we include the asm/syscalls_64.h header at the end of the array. This header file is generated by the special script at arch/x86/entry/syscalls/syscalltbl.sh and generates our header file from the syscall table (https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/syscalls/syscall_64.tbl).
...
...
So, after this, our sys_call_table takes the following form:
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
[0 ... __NR_syscall_max] = &sys_ni_syscall,
[0] = sys_read,
[1] = sys_write,
[2] = sys_open,
...
...
...
};
Once you have the table generated, one of its entries is being jumped to when you use the syscall assembly instruction. For clone() it will call sys_clone() which itself calls _do_fork(). Which is defined as such:
long _do_fork(unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
int __user *parent_tidptr,
int __user *child_tidptr,
unsigned long tls)
{
struct task_struct *p;
int trace = 0;
long nr;
/*
* Determine whether and which event to report to ptracer. When
* called from kernel_thread or CLONE_UNTRACED is explicitly
* requested, no event is reported; otherwise, report if the event
* for the type of forking is enabled.
*/
if (!(clone_flags & CLONE_UNTRACED)) {
if (clone_flags & CLONE_VFORK)
trace = PTRACE_EVENT_VFORK;
else if ((clone_flags & CSIGNAL) != SIGCHLD)
trace = PTRACE_EVENT_CLONE;
else
trace = PTRACE_EVENT_FORK;
if (likely(!ptrace_event_enabled(current, trace)))
trace = 0;
}
p = copy_process(clone_flags, stack_start, stack_size,
child_tidptr, NULL, trace, tls);
/*
* Do this prior waking up the new thread - the thread pointer
* might get invalid after that point, if the thread exits quickly.
*/
if (!IS_ERR(p)) {
struct completion vfork;
struct pid *pid;
trace_sched_process_fork(current, p);
pid = get_task_pid(p, PIDTYPE_PID);
nr = pid_vnr(pid);
if (clone_flags & CLONE_PARENT_SETTID)
put_user(nr, parent_tidptr);
if (clone_flags & CLONE_VFORK) {
p->vfork_done = &vfork;
init_completion(&vfork);
get_task_struct(p);
}
wake_up_new_task(p);
/* forking complete and child started to run, tell ptracer */
if (unlikely(trace))
ptrace_event_pid(trace, pid);
if (clone_flags & CLONE_VFORK) {
if (!wait_for_vfork_done(p, &vfork))
ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
}
put_pid(pid);
} else {
nr = PTR_ERR(p);
}
return nr;
}
It calls wake_up_new_task() which puts the task on the runqueue and wakes it. I'm surprised it even wakes the task immediately. I would have guessed that the scheduler would have done it instead and that it would have been given a high priority to run as soon as possible. In itself, the kernel doesn't have to receive a function pointer because as stated on the manpage for clone():
The raw clone() system call corresponds more closely to fork(2)
in that execution in the child continues from the point of the
call. As such, the fn and arg arguments of the clone() wrapper
function are omitted.
The child continues execution where the syscall was made. I don't understand exactly the mechanism but in the end the child will continue execution in a new thread. The parent thread (which created the new child thread) returns and the child thread jumps to the specified function instead.
I think it works with the following lines (on the link you provided):
testq %rax,%rax
jl SYSCALL_ERROR_LABEL
jz L(thread_start) //Child jumps to thread_start
ret //Parent returns to where it was
Because rax is a 64-bit register, they use the 'q' suffix of the GNU-syntax test instruction. They test whether rax is zero. If it is less than zero, there was an error. If it is zero, jump to thread_start. If it is positive (the case of the parent thread), continue execution and return. The new thread is created with rax as 0, which makes it possible to differentiate between the parent and the child thread.
EDIT
As stated on the link you provided,
The parameters are passed in register and on the stack from userland:
rdi: fn
rsi: child_stack
rdx: flags
rcx: arg
r8d: TID field in parent
r9d: thread pointer
So when your program executes the following lines:
/* Insert the argument onto the new stack. */
subq $16,%rsi
movq %rcx,8(%rsi)
/* Save the function pointer. It will be popped off in the
child in the ebx frobbing below. */
movq %rdi,0(%rsi)
it inserts the function pointer and arguments onto the new stack. Then it calls the kernel which itself doesn't have to push anything onto the stack. It just receives the new stack as an argument and then makes the child's thread RSP register point to it. I would guess this happens in the copy_process() function (called from _do_fork()) along the lines of:
retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
if (retval)
goto bad_fork_cleanup_io;
It seems to be done in the copy_thread_tls() function which itself calls copy_thread(). copy_thread() has its prototype in include/linux/sched.h and it is defined based on the architecture. I'm not sure where it is defined for x86.
Yes, libc; the kernel interface is like fork: it returns twice to the same place, but with different return values. (0 in the child or a PID/TID in the parent). The man page documents the glibc wrapper vs. kernel differences, like for other system calls where there's a difference.
The libc wrapper stashes the function pointer and arg you pass in the new thread's stack space, where the new thread can load it. (The kernel starts it with its RSP set to the void *stack arg passed to clone(), so it doesn't have access to old locals in stack memory or registers, and using a global wouldn't be thread-safe if multiple threads are cloning themselves at the same time.)
Note that there's also a clone3 system call that takes a struct arg, and is also more like the raw kernel interface for clone. (Or at least there is no glibc wrapper for it.)
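Seen from the caller's side, the glibc wrapper can be exercised with something like the sketch below, assuming Linux with glibc. clone_demo, child_fn and CHILD_STACK_SIZE are names I made up; the flags are deliberately minimal (CLONE_VM plus SIGCHLD as the exit signal), which gives a child process sharing memory rather than a real thread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* clone() is a GNU extension in <sched.h> */
#endif
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)

/* Runs in the child, on the stack we hand to clone(). */
static int child_fn(void *arg)
{
    *(int *)arg = 42;           /* visible to the parent because of CLONE_VM */
    return 7;                   /* becomes the child's exit status */
}

/* Hypothetical helper: returns the child's exit status, or -1 on error. */
int clone_demo(int *shared)
{
    assert(shared != NULL);

    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows down on x86-64, so pass the TOP of the buffer.
     * The wrapper stores child_fn and its argument there before the syscall. */
    pid_t pid = clone(child_fn, stack + CHILD_STACK_SIZE,
                      CLONE_VM | SIGCHLD, shared);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);                /* safe: the child has terminated */
    if (waited < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The kernel never sees child_fn; it only gets the new stack pointer, and the wrapper's child-side code loads the function pointer back from that stack and calls it.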

waitpid - WIFEXITED returning 0 although child exited normally

I have been writing a program that spawns a child process, and calls waitpid to wait for the termination of the child process. The code is below:
// fork & exec the child
pid_t pid = fork();
if (pid == -1)
// here is error handling code that is **not** triggered
if (!pid)
{
// binary_invocation is an array of the child process program and its arguments
execv(args.binary_invocation[0], (char * const*)args.binary_invocation);
// here is some error handling code that is **not** triggered
}
else
{
int status = 0;
pid_t res = waitpid(pid, &status, 0);
// here I see pid_t being a positive integer > 0
// and status being 11, which means WIFEXITED(status) is 0.
// this triggers a warning in my programs output.
}
The manpage of waitpid states for WIFEXITED:
WIFEXITED(status)
returns true if the child terminated normally, that is, by calling exit(3) or
_exit(2), or by returning from main().
Which I interpret to mean it should return an integer != 0 on success, which is not happening in the execution of my program, since I observe WIFEXITED(status) == 0
However, executing the same program from the command line results in $? == 0, and starting from gdb results in:
[Inferior 1 (process 31934) exited normally]
The program behaves normally, except for the triggered warning, which makes me think something else is going on here, that I am missing.
EDIT:
as suggested below in the comments, I checked if the child is terminated via segfault, and indeed, WIFSIGNALED(status) returns 1, and WTERMSIG(status) returns 11, which is SIGSEGV.
What I don't understand though, is why a call via execv would fail with a segfault while the same call via gdb, or a shell would succeed?
EDIT2:
The behaviour of my application heavily depends on the behaviour of the child process, in particular on a file the child writes in a function declared __attribute__ ((destructor)). After the waitpid call returns, this file exists and is generated correctly which means the segfault occurs somewhere in another destructor, or somewhere outside of my control.
On Unix and Linux systems, the status returned from wait or waitpid (or any of the other wait variants) has this structure:
bits meaning
0-6 signal number that caused child to exit,
or 0177 if child stopped / continued
or zero if child exited without a signal
7 1 if core dumped, else 0
8-15 low 8 bits of value passed to _exit/exit or returned by main,
or signal that caused child to stop/continue
(Note that Posix doesn't define the bits, just macros, but these are the bit definitions used by at least Linux, Mac OS X/iOS, and Solaris. Also note that waitpid only returns for stop events if you pass it the WUNTRACED flag and for continue events if you pass it the WCONTINUED flag.)
So a status of 11 means the child exited due to signal 11, which is SIGSEGV (again, not Posix but conventionally).
Either your program is passing invalid arguments to execv (which is a C library wrapper around execve or some other kernel-specific call), or the child runs differently when you execv it and when you run it from the shell or gdb.
If you are on a system that supports strace, run your (parent) program under strace -f to see whether execv is causing the signal.
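Rather than picking the bits apart by hand, the portable way is to go through the macros. A minimal sketch follows; describe_status and status_of_child_killed_by are hypothetical helpers of mine, the second one existing only to produce a status like the one in the question.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: write a human-readable description of a wait status. */
void describe_status(int status, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    if (WIFEXITED(status))
        snprintf(buf, len, "exited with code %d", WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        snprintf(buf, len, "killed by signal %d", WTERMSIG(status));
    else
        snprintf(buf, len, "stopped or continued");
}

/* Hypothetical helper: fork a child that terminates itself with `sig`
 * and return the raw status from waitpid(), or -1 on error. */
int status_of_child_killed_by(int sig)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        signal(sig, SIG_DFL);   /* make sure the default action applies */
        raise(sig);
        _exit(0);               /* only reached if the signal did not kill us */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

For a child killed by SIGSEGV, WIFEXITED is false, WIFSIGNALED is true and WTERMSIG gives the signal number, which is the situation described in the question's first edit. The raw status may also have the core-dump bit set, which is one more reason to use the macros instead of comparing against 11.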

stack smashing error detected

When I try the following snippet I get an error called "stack smashing detected". What could be the reason for this? Can someone explain?
#include<stdio.h>
#include<unistd.h>
#include<sys/types.h>
int glob=88;
int main()
{
int loc=2;
pid_t pid=vfork();
if(pid==0)
{
printf("Inside child");
glob++;
loc++;
printf("%d %d" ,glob,loc);
}
else
{
printf("Inside parent");
glob++;
loc++;
printf("%d %d",glob,loc);
}
}
and the output when I run this code is like that
user018@sshell ~ $ gcc one.c
user018@sshell ~ $ ./a.out
Inside child89 3Inside parent90 945733057*** stack smashing detected ***: a.out
- terminated
a.out: stack smashing attack in function <unknown> - terminated
KILLED
From the Linux man page (and POSIX):
The vfork() function has the same effect as fork(2), except that the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions.
You're modifying data, calling other functions (printf), and returning from the function in which vfork was invoked; all of these lead to undefined behavior. vfork is not equivalent to fork: the things you can do in a vfork'd child are very, very limited. It should only be used in very specific circumstances, essentially when the only thing you need to do in the child is exec something else.
See your operating system's man page for the full details.
vfork() is used to create new processes without copying the page tables of the parent process. So the child runs in the parent's memory, and any variable it modifies is modified for the parent as well. Use fork() instead.
One more thing, it's better to add a \n to the end of printf() because stdout is line buffered by default.
1) I would definitely add a "return 0", since you declared "int main()".
2) If you wanted to disable the warning, use -fno-stack-protector in your compile line.
3) If you wanted to debug where the error is coming from, use "-g" in your compile line, and run the program from gdb (instead of running ./a.out).
My closest to "what's wrong" is this man page about vfork():
http://linux.die.net/man/3/vfork
The vfork() function shall be equivalent to fork(), except that the
behavior is undefined if the process created by vfork() either
modifies any data other than a variable of type pid_t used to store
the return value from vfork(), or returns from the function in which
vfork() was called, or calls any other function before successfully
calling _exit() or one of the exec family of functions.
For Linux, just use "fork()", and I think you'll be happy :)
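Roughly what the fork() version of the same experiment looks like, as a sketch (fork_demo is a hypothetical helper, and the child reports its value through its exit status only to keep the example short):

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int glob = 88;

/* Hypothetical helper: the child increments its own copy of glob and
 * reports it through its exit status. The parent stores that value in
 * *child_glob and returns its own, untouched glob (or -1 on error). */
int fork_demo(int *child_glob)
{
    assert(child_glob != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        glob++;                 /* modifies the child's private copy only */
        _exit(glob);            /* 89 fits in the 8-bit exit status */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    *child_glob = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return glob;                /* still 88 in the parent */
}
```

With fork() the child gets its own copy-on-write copy of memory and its own stack, so modifying variables and calling functions in the child is fine, and the parent's stack is never touched.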

vfork never ends

The following code never ends. Why is that?
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <unistd.h>
#define SIZE 5
int nums[SIZE] = {0, 1, 2, 3, 4};
int main()
{
int i;
pid_t pid;
pid = vfork();
if(pid == 0){ /* Child process */
for(i = 0; i < SIZE; i++){
nums[i] *= -i;
printf(”CHILD: %d “, nums[i]); /* LINE X */
}
}
else if (pid > 0){ /* Parent process */
wait(NULL);
for(i = 0; i < SIZE; i++)
printf(”PARENT: %d “, nums[i]); /* LINE Y */
}
return 0;
}
Update:
This code is just to illustrate some of the confusion I have regarding vfork(). It seems that when I use vfork(), the child process doesn't copy the address space of the parent. Instead, it shares the address space. In that case, I would expect the nums array to get updated by both of the processes. My question is: in what order? How does the OS synchronize the two?
As for why the code never ends, it is probably because I don't explicitly call _exit() or an exec() function in the child. Am I right?
UPDATE2:
I just read: 56. Difference between the fork() and vfork() system call?
and I think this article helps me with my first confusion.
The child process from vfork() system call executes in the parent’s
address space (this can overwrite the parent’s data and stack ) which
suspends the parent process until the child process exits.
To quote from the vfork(2) man page:
The vfork() function has the same effect as fork(), except that the behaviour is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit() or one of the exec family of functions.
You're doing a whole bunch of those things, so you shouldn't expect it to work. I think the real question here is: why are you using vfork() rather than fork()?
Don't use vfork. That's the simplest advice you can get. The only thing that vfork gives you is suspending the parent until the child either calls exec* or _exit. The part about sharing the address space is not guaranteed: some operating systems do it, others choose not to because it's very unsafe and has caused serious bugs.
Last time I looked at how applications use vfork in reality the absolute majority did it wrong. It was so bad that I threw away the 6 character change that enabled address space sharing on the operating system I was working on at that time. Almost everyone who uses vfork at least leaks memory if not worse.
If you really want to use vfork, don't do anything other than immediately call _exit or execve after it returns in the child process. Anything else and you're entering undefined territory. And I really mean "anything". You start parsing your strings to make arguments for your exec call and you're pretty much guaranteed that something will touch something it's not supposed to touch. And I also mean execve, not some other function from the exec family. Many libc out there do things in execvp, execl, execle, etc. that are unsafe in a vfork context.
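As a rough sketch of what that advice looks like in practice (the helper name run_program and the exit code 127 are my own choices, not anything from the question): everything the child needs is prepared before vfork(), and the child does nothing except execve() and, if that fails, _exit().

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `path` with the given argv/envp and return its
 * exit status, or -1 on error. The caller builds argv and envp beforehand,
 * so the vfork() child never parses strings or allocates memory. */
int run_program(const char *path, char *const argv[], char *const envp[])
{
    pid_t pid = vfork();
    if (pid < 0)
        return -1;                 /* vfork failed */
    if (pid == 0) {
        execve(path, argv, envp);  /* returns only if the exec failed */
        _exit(127);                /* no return, no exit(), no stdio */
    }
    /* Parent: resumes once the child has called execve() or _exit(). */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that the child cannot safely report why execve() failed (writing to errno-derived buffers or calling printf would already be outside what vfork allows), which is one more reason most programs are better served by fork().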
What is specifically happening in your example:
If your operating system shares the address space, the child returning from main means that your environment cleans things up (flushing stdout since you called printf, freeing memory that was allocated by printf, and such things). This means that other functions are called that will overwrite the stack frame the parent was stuck in. vfork returning in the parent returns to a stack frame that has been overwritten and anything can happen; it might not even have a return address on the stack to return to anymore. You first entered undefined behavior country by calling printf, then the return from main brought you into undefined behavior continent, and the cleanup run after the return from main made you travel to undefined behavior planet.
From the official specification:
the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(),
In your program you modify data other than the pid variable, meaning the behavior is undefined.
You also have to call _exit to end the process, or call one of the exec family of functions.
The child must _exit rather than returning from main. If the child returns from main, then the stack frame does not exist for the parent when it returns from vfork.
Just call _exit() instead of returning: for example, insert _exit(0) as the last line of the child branch. return 0 from main calls exit(0), which flushes and closes stdout, so when another printf follows in the parent, the program crashes.
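The difference between exit() and _exit() mentioned above can be seen with a small experiment. This is only a sketch using fork() (so it is safe to run) and a made-up helper, bytes_seen_by_parent: the child printf()s two characters into a pipe, then terminates one way or the other, and the parent counts what actually arrived.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that printf()s "hi" into a pipe and then
 * terminates with exit() (use_exit != 0) or _exit() (use_exit == 0).
 * Returns how many bytes the parent could read from the pipe, -1 on error. */
long bytes_seen_by_parent(int use_exit)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;
    fflush(stdout);                    /* child must not inherit pending output */
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);   /* child's stdout now feeds the pipe */
        printf("hi");                  /* stays in the stdio buffer for now */
        if (use_exit)
            exit(0);                   /* flushes stdio, runs atexit handlers */
        _exit(0);                      /* terminates without any cleanup */
    }
    close(fds[1]);
    char buf[16];
    ssize_t n = read(fds[0], buf, sizeof buf);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return (long)n;
}
```

With exit() the buffered output is flushed during cleanup and reaches the parent; with _exit() it is silently dropped. In a vfork() child that cleanup would run on the parent's stdio state, which is why the child has to use _exit().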
