understanding overlaying process image and execl call

I want to build my own debugger from scratch, so I am trying to pick up some of the concepts behind it. I am starting easy, using the ptrace system call. But even at this point I am having some issues. Let me run through this code:
int main(int argc, char** argv)
{
    pid_t child_pid;
    if (argc < 2) {
        fprintf(stderr, "Expected a program name as argument\n");
        return -1;
    }
    child_pid = fork();
    if (child_pid == 0)
        run_target(argv[1]);
    else if (child_pid > 0)
        run_debugger(child_pid);
    else {
        perror("fork");
        return -1;
    }
    return 0;
}
This is nothing really special: I am creating a child process using fork().
The next function is what I really cannot understand:
void run_target(const char* programname)
{
    procmsg("target started. will run '%s'\n", programname);
    /* Allow tracing of this process */
    if (ptrace(PTRACE_TRACEME, 0, 0, 0) < 0) {
        perror("ptrace");
        return;
    }
    /* Replace this process's image with the given program.
       The argument list must end with a null pointer of type char *. */
    execl(programname, programname, (char *)NULL);
    /* execl returns only if it failed */
    perror("execl");
}
The last call is the issue. This call represents the concept of overlaying the process image. I am not fully getting what is happening. This is what the author says:
I've highlighted the part that interests us in this example. Note that the very next thing run_target does after ptrace is invoke the program given to it as an argument
with execl. This, as the highlighted part explains, causes the OS kernel to stop the process just before it begins executing the program in execl and send a signal to the parent.
To run the debugger, the parent process must trace the child process, which declares that it wants to be traced by using PTRACE_TRACEME. But I can't figure out what that execl is doing. I can understand the purpose and the output, but I can't figure out HOW. I consulted the man pages but could not wrap my head around this.
I would appreciate if someone could give me a clear explanation of what’s going on with this execl function.

I think you agree on this: the concept is that the debugger must debug program TARGET, and in this scheme TARGET can only be debugged if its process calls ptrace with PTRACE_TRACEME.
Naturally, TARGET does not call ptrace with PTRACE_TRACEME argument in its source code.
So, the debugger must do it for it.
Initially, the debugger forks. At this time we have two processes:
Father: it calls run_debugger()
Child: it calls run_target()
Child is a process that the debugger has control over, therefore it can call ptrace with argument PTRACE_TRACEME (in run_target()). But this process is not TARGET.
Thus, the next step is associating the "image" of TARGET (namely the program we want to debug) with the child. A process image is the in-memory representation of a program while it executes, and it is composed of the four classical segments:
Code (text segment)
Data
Stack
Heap
execl and friends belong to the exec family, namely functions which replace the current process image with a new process image. To the best of my knowledge, execl differs from its friend execv only in the way the arguments are passed (a variable-length argument list rather than an array), but the concept is the same.
So, what you need to know is that:
exec replaces the currently running program with another program (i.e., TARGET)
inside an EXISTING process.
That process has already called ptrace with argument PTRACE_TRACEME, so the new program runs as a traced process and will not ignore the future ptrace calls made by the debugger.
If your question is about the implementation details of the exec system call, I do not know them perfectly, but I can give some suggestions:
Read "The exec-like Functions" in the book "Understanding the Linux Kernel".
Have a look at the source code: this question points you to the source code of execve, which is totally fine for you, since it is the same concept as execl.
If you want to create your own exec, this question can be useful as well.

Related

why a shell creates a new process for each command?

I am trying to program a shell in C, and I found that each command is executed in a new process. My question is: why do we make a new process to execute the command? Can't we just execute the command in the current process?

It's because of how the UNIX system was designed, where the exec family of calls replace the current process. Therefore you need to create a new process for the exec call if you want the shell to continue afterward.

When you execute a command, one of the following happens:
You're executing a builtin command
You're executing an executable program
An executable program needs many things to work: different memory sections (stack, heap, code, ...), it is executed with a specific set of privileges, and many more things are happening.
If you run this new executable program in your current process, you're going to replace the current program (your shell) with the new one. It works perfectly fine but when the new executable program is done, you cannot go back to your shell since it's not in memory anymore. This is why we create a new process and run the executable program in this new process. The shell waits for this new process to be done, then it collects its exit status and prompts you again for a new command to execute.

can't we just execute the command in the current process?
Sure we can, but that would then replace the shell program with the program of the command called, and that's probably not something you want in this particular application. There are, in fact, many situations in which replacing the process's program via execve is the most straightforward way to implement something. But in the case of a shell, that's likely not what you want.
You should not think of processes as something to be avoided or "feared". As a matter of fact, segregating different things into different processes is the foundation of reliability and security features. Processes are (mostly) isolated from each other, so if a process gets terminated for whatever reason (bug, crash, etc.), this at first affects only that particular process.
Here's something to try out:
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Deliberately invokes undefined behavior to crash the calling process. */
int segfault_crash(void)
{
    fprintf(stderr, "I will SIGSEGV...\n");
    fputs(NULL, stderr);
    return 0;
}

int main(int argc, char *argv[])
{
    int status = -1;
    pid_t const forked_pid = fork();
    if( -1 == forked_pid ){
        perror("fork");
        return 1;
    }
    if( 0 == forked_pid ){
        /* child: crashes here without affecting the parent */
        return segfault_crash();
    }
    /* parent: collect the child's termination status */
    waitpid(forked_pid, &status, 0);
    if( WIFSIGNALED(status) ){
        fprintf(stderr, "Child process %lld terminated by signal %d\n",
            (long long)forked_pid,
            (int)WTERMSIG(status) );
    } else {
        fprintf(stderr, "Child process %lld terminated normally\n",
            (long long)forked_pid);
    }
    return 0;
}
This little program forks itself, then calls a function in the child that deliberately invokes undefined behavior, which on commonplace systems triggers some kind of memory protection fault (Access Violation on Windows, Segmentation Fault on *nix systems). But because this crash has been isolated into a dedicated process, the parent process (and also any siblings) do not crash together with it.
Furthermore, processes may drop their privileges, limit themselves to only a subset of system calls, and be moved into namespaces/containers, each of which prevents a bug in the process from damaging the rest of the system. This is how modern browsers (for example) implement sandboxing to improve security.

Why does the kretprobe of the _do_fork() only return once?

When I write a small program with fork, the syscall returns twice (once per process):
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[]) {
    int pid = fork();
    if (pid == 0) {
        // child
    } else if (pid > 0) {
        // parent
    }
}
If I instrument that with systemtap, I only find one return value:
// fork() in libc calls clone on Linux
probe syscall.clone.return {
printf("Return from clone\n")
}
(SystemTap installs probes on _do_fork instead of clone, but that shouldn't change anything.)
This confuses me. A couple of related questions:
Why does the syscall only return once?
If I understand the _do_fork code correctly, the process is cloned in the middle of the function. (copy_process and wake_up_new_task). Shouldn't the subsequent code run in both processes?
Does the kernel code after a syscall run in the same thread / process as the user code before the syscall?

creation of the child can fail, thus errors have to be detected and handled
the child has a different return value and this also has to be handled
it may be the parent has clean ups / additional actions to do
Thus the code would have to differentiate between executing as a parent and a child. But there are no checks of the sort, which is already a strong hint that the child does not execute this code in the first place. Thus one should look for a dedicated place new children return to.
Since the code is quite big and hairy, one can try to cheat and just look for 'fork' in arch-specific code, which quickly reveals ret_from_fork.
It is set as the child's starting point via do_fork -> copy_process -> copy_thread_tls http://lxr.free-electrons.com/source/arch/x86/kernel/process_64.c#L158
Thus
Why does the syscall only return once?
It does not return once. There are 2 returning threads, except the other one uses a different code path. Since the probe is installed only on the first one, you don't see the other one. Also see below.
If I understand the _do_fork code correctly, the process is cloned in the middle of the function. (copy_process and wake_up_new_task). Shouldn't the subsequent code run in both processes?
I noted earlier this is false. The real question is what the benefit would be of making the child return in the same place as the parent. I don't see any, and it would be troublesome (extra special casing, as noted above). To restate: making the child return elsewhere means callers do not have to handle the returning child. They only need to check for errors.
Does the kernel code after a syscall run in the same thread / process as the user code before the syscall?
What is 'kernel code after a syscall'? If you are thread X and enter the kernel, you are still the thread X.

Implementing posix_spawn on Linux

I am curious to see if it would be possible to implement posix_spawn in Linux using a combination of vfork+exec. In a very simplified way (leaving out most optional arguments) this could look more or less like this:
int my_posix_spawn(pid_t *ppid, char **argv, char **env)
{
    pid_t pid;
    pid = vfork();
    if (pid == -1)
        return errno;
    if (pid == 0)
    {
        /* Child */
        execve(argv[0], argv, env);
        /* If we got here, execve failed. How to communicate this to
         * the parent? */
        _exit(-1);
    }
    /* Parent */
    if (ppid != NULL)
        *ppid = pid;
    return 0;
}
However I am wondering how to cope with the case where vfork succeeds (so the child process is created) but the exec call fails. There seems to be no way to communicate this to the parent, which would only see that it could apparently create a child process successfully (as it would get a valid pid back)
Any ideas?

As others have noted in the comments, posix_spawn is permitted to create a child process that immediately dies due to exec failure or other post-fork failures; the calling application needs to be prepared for this. But of course it's preferable not to do so.
The general procedure for communicating exec failure to the parent is described in an answer I wrote on this question: What can cause exec to fail? What happens next?.
Unfortunately, however, some of the operations you need to perform are not legal after vfork due to its nasty returns-twice semantics. I've covered this topic in the past in an article on ewontfix.com. The solution for making a posix_spawn that avoids duplicating the VM seems to be using clone with CLONE_VM (and possibly CLONE_VFORK) to get a new process that shares memory but doesn't run on the same stack. However, this still requires a lot of care to avoid making any calls to libc functions that might modify memory used by the parent. My current implementation is here:
http://git.musl-libc.org/cgit/musl/tree/src/process/posix_spawn.c?id=v1.1.4
and as you can see it's rather complicated. Reading the git history may be informative regarding some of the design decisions that were made.

I don't think there's any good way to do this with the current set of system calls. You've correctly identified the biggest problem -- the absence of any reliable way to report failure after the vfork. Other problems include race conditions in setting child state, and Linux's lack of interest in picking up closefrom.
Several years ago I sketched a new system-level API that would solve this problem: the key addition is a system call, which I called egg(), that creates a process without giving it an address space, and inheriting no state from the parent. Obviously, an egg process can't execute code; but you can (with a whole bunch more new system calls) set all of its kernelside state, and then (with yet another system call, hatch()) load an executable into it and set it going. Crucially, all of the new system calls report failure in the parent. For instance, there's a dup_into(pid, to_fd, from_fd) call that copies parent file descriptor from_fd to egg-state process pid's file descriptor to_fd; if it fails, the parent gets the failure code.
I never had time to flesh all of that out into a coherent API specification and code it up (and I'm not a kernel hacker, anyway) but I still think the concept has legs and I would be happy to work with someone to get it done.

Win XP, C program: Query regarding int main() for child process

I am creating a child process and passing some arguments to it.
Now, the child process starts execution from the next line of code, but will I have to write another int main () separately for the child process, as below, or would it just use the already written code for int main() of the parent process?
createProcess(All required arguments);
if (pid == child_process)
{
int main ()
{
......
}
}
ENV: WinXP, VS2005
NOTE: The above code just describes the flow and may have syntax errors.

Are you confusing Windows CreateProcess with UNIX fork()? The two operating systems are different in the way that processes are created. With Windows you have to execute an exe file from the beginning, you can't continue as the child process after CreateProcess as you can with fork on UNIX. Your statement "the child process starts execution from the next line of code" is mistaken for Windows.
Mind you, your code would be illegal on UNIX as well, you can't have two functions called main, and you can't have nested functions in C.

Please read the documentation of CreateProcess() again.
The function takes the filename of the program to run in the new process. The nested function you're showing is not valid C.

How do I spawn a daemon in uClinux using vfork?

This would be easy with fork(), but I've got no MMU. I've heard that vfork() blocks the parent process until the child exits or executes exec(). How would I accomplish something like this?:
pid_t pid = vfork();
if (pid == -1) {
    // fail
    exit(-1);
}
if (pid == 0) {
    // child
    while(1) {
        // Do my daemon stuff
    }
    // Let's pretend it exits sometime
    // (a vfork child should leave with _exit, not exit)
    _exit(0);
}
// Continue execution in parent without blocking.....

It seems there is no way to do this exactly as you have it here. exec or _exit have to get called for the parent to continue execution. Either put the daemon code into another executable and exec it, or use the child to spawn the original task. The second approach is the sneaky way, and is described here.

daemon() function for uClinux systems without MMU and fork(), by Jamie Lokier, in patch format
You can't do daemon() with vfork(). To create something similar to a daemon on !MMU using vfork(), the parent process doesn't die (so there are extra processes), and you should start your daemon in the background (i.e. by appending & to the command line in the shell).
On the other hand, Linux provides clone(). Armed with that, knowledge and care, it's possible to implement daemon() for !MMU. Jamie Lokier has a function to do just that on ARM and i386, get it from here.
Edit: made the link to Jamie Lokier's daemon() for !MMU Linux more prominent.

I would have thought that this would be the type of problem that many others had run into before, but I've had a hard time finding anyone talking about the "kill the parent" problems.
I initially thought that you should be able to do this with a (not quite so, but sort of) simple call to clone, like this:
pid_t new_vfork(void) {
    return clone(child_func,         /* child function */
                 child_stack,        /* child stack */
                 SIGCHLD | CLONE_VM, /* flags */
                 NULL,               /* argument to child */
                 NULL,               /* pid of the child */
                 NULL,               /* thread local storage for child */
                 NULL);              /* thread id of child in child's mem */
}
Except that determining the child_stack and the child_func to work the way that it does with vfork is pretty difficult since child_func would need to be the return address from the clone call and the child_stack would need to be the top of the stack at the point that the actual system call (sys_clone) is made.
You could probably try to call sys_clone directly with
pid_t new_vfork(void) {
return sys_clone( SIGCHLD | CLONE_VM, NULL);
}
Which I think might get what you want. Passing NULL as the second argument, which is the child_stack pointer, causes the kernel to do the same thing as it does in vfork and fork, which is to use the same stack as the parent.
I've never used sys_clone directly and haven't tested this, but I think it should work. I believe that:
sys_clone( SIGCHLD | CLONE_VM | CLONE_VFORK, NULL);
is equivalent to vfork.
If this doesn't work (and you can't figure out how to do something similar) then you may be able to use the regular clone call along with setjmp and longjmp calls to emulate it, or you may be able to get around the need for the "returns twice" semantics of fork and vfork.
