Given the pid of a Linux process, I want to check, from a C program, if the process is still running.
Issue a kill(2) system call with 0 as the signal. If the call succeeds, it means that a process with this pid exists.
If the call fails and errno is set to ESRCH, a process with such a pid does not exist.
Quoting the POSIX standard:
If sig is 0 (the null signal), error checking is performed but no
signal is actually sent. The null signal can be used to check the
validity of pid.
Note that you are not safe from race conditions: the target process may have exited and another process with the same pid may have been started in the meantime. Or the process may exit right after you check it, and you could make a decision based on outdated information.
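A minimal sketch of the null-signal check described above. The function name process_exists is made up for illustration, and treating EPERM as "exists" is my reading of kill(2): the call can only fail with EPERM if there is a target process that you are not allowed to signal.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/*
 * Probe a pid with the null signal (hypothetical helper).
 * Returns 1 if a process with this pid exists, 0 if it does not,
 * and -1 if kill() failed for some other reason.
 * Only meaningful for pid > 0; zero and negative values address process groups.
 */
int process_exists(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return 1;   /* exists, and we are allowed to signal it */
    if (errno == EPERM)
        return 1;   /* exists, but we lack permission to signal it */
    if (errno == ESRCH)
        return 0;   /* no process with this pid */
    return -1;      /* unexpected error */
}
```

A zombie that has not been waited for still counts as existing here, and the answer can be stale by the time the function returns.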
If the given pid belongs to a child process (fork'ed from the current one), you can instead use waitpid(2) with the WNOHANG option, or catch SIGCHLD signals. These are safe from race conditions, but only work for child processes.
Use procfs.
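#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
[...]
char path[32];
struct stat sts;
/* Build "/proc/<pid>" for the pid being checked (pid is a pid_t). */
snprintf(path, sizeof path, "/proc/%ld", (long)pid);
if (stat(path, &sts) == -1 && errno == ENOENT) {
    /* process doesn't exist */
}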
Easily portable to
Solaris
IRIX
Tru64 UNIX
BSD
Linux
IBM AIX
QNX
Plan 9 from Bell Labs
kill(pid, 0) is the typical approach, as #blagovest-buyukliev said. But if the process you are checking might be owned by a different user, and you don't want to take the extra steps to check whether errno == ESRCH, it turns out that
(getpgid(pid) >= 0)
is an effective one-step method for determining if any process has the given PID (since you are allowed to inspect the process group ID even for processes that don't belong to you).
You can issue a kill(2) system call with 0 as the signal.
There's nothing unsafe about kill -0. The program
must be aware that the result can become obsolete at any time
(including that the pid can get reused before kill is called),
that's all. And using procfs instead does use the pid too,
and doing so in a more cumbersome and nonstandard way.
As an addendum to the /proc filesystem method, you can check the /proc/<pid>/cmdline (assuming it was started from the command line) to see if it is the process you want.
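Here is a rough sketch of that cmdline check, assuming a Linux-style /proc where the arguments in /proc/<pid>/cmdline are separated by NUL bytes. The helper name read_proc_argv0 is made up for illustration.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Copy argv[0] of the given process into buf (hypothetical helper).
 * Returns the length of argv[0], 0 if cmdline is empty (for example for
 * kernel threads or zombies), or -1 if the file cannot be opened.
 */
long read_proc_argv0(pid_t pid, char *buf, size_t size)
{
    char path[64];
    FILE *fp;
    size_t n;

    if (size == 0)
        return -1;
    snprintf(path, sizeof path, "/proc/%ld/cmdline", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;              /* no such process, or no procfs */
    n = fread(buf, 1, size - 1, fp);
    fclose(fp);
    buf[n] = '\0';              /* arguments are NUL-separated, so buf now reads as argv[0] */
    return (long)strlen(buf);
}
```

You would then compare buf against the program name you expect. Keep in mind that a process can overwrite its own argv, so this is a hint rather than proof.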
ps -p $PID > /dev/null 2>&1; echo $?
This command returns 0 if a process with $PID is still running. Otherwise it returns 1.
It works in the OS X terminal too.
So I've been struggling with this exercise. I must get all of the system calls made by any given Linux command of my choice (e.g. ls or cd), list them in a .txt file, and have their unique IDs listed beside them.
So far here's what I've got:
strace -o filename.txt ls
When executed in the Linux shell, this gives me a "filename.txt" file containing all the system calls of the ls command. Now in my C program:
#include <stdio.h>
#include <stdlib.h>
int main(){
system("strace -o filename.txt ls");
return 0;
}
This should do the same as the previous command, but it's not returning anything, although the code compiles successfully. How would I go about fixing this, and then get the IDs? I'm using the "stdlib" library because in my research I found that it has some relation to system call IDs, but I haven't found any indication of how to get them. Basically I must read the file I created and have it give each system call its ID.
The exercise is obviously designed to be solved by using the ptrace() facility, because the strace utility does not have an option to print the syscall number (as far as I know).
Technically, you can use something like
printf '#include <sys/syscall.h>\n' | gcc -dD -E - | awk '$1 == "#define" { m[$2] = $3 } END { for (name in m) if (name ~ /^SYS_/) { v = name; while (v in m) v = m[v]; sub(/^SYS_/, "", name); printf "%s %s\n", v, name } }'
to generate a number of syscall-number syscall-name lines, to be used for mapping syscall names back to syscall numbers, but this would be silly and error-prone. Silly, because being able to use ptrace() gives you much more control than using the strace utility, and using a "clever hack" like the one above just means you avoid learning how to do that, which in my opinion is self-defeating. Error-prone, because there is absolutely no guarantee that the installed headers match the running architecture. This is especially problematic on multiarch systems, where you can use the -m32 and -m64 compiler options to switch between 32-bit and 64-bit targets. They typically have completely different syscall numbers.
Essentially, your program should:
fork() a child process.
In the child process:
Enable ptracing by calling prctl(PR_SET_DUMPABLE, 1L)
Make parent process the tracer by calling ptrace(PTRACE_TRACEME, (pid_t)0, (void *)0, (void *)0)
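Optionally, set tracing options. Note that PTRACE_SETOPTIONS is a request made by the tracer (the parent) while the child is stopped, not by the child itself: for example, the parent can call ptrace(PTRACE_SETOPTIONS, childpid, (void *)0, (void *)(PTRACE_O_TRACECLONE | PTRACE_O_TRACEEXEC | PTRACE_O_TRACEEXIT | PTRACE_O_TRACEFORK)) so that you catch at least clone(), fork(), and exec() family of syscalls.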
If you do not set the PTRACE_O_TRACEEXEC option, you should stop the child process at this point using e.g. raise(SIGSTOP);, so that the parent process can start tracing this child.
Execute the command to be traced using e.g. execv(). In particular, if the first command line parameter is the command to run, optionally followed by its options, you can use execvp(argv[1], argv + 1);.
If you set the PTRACE_O_TRACEEXEC option above, then the kernel will auto-pause the child process just before executing the new binary.
If the exec fails, the child process should exit. I like to use exit(127);, to return exit status 127.
In the parent process, use waitpid(childpid, &status, WUNTRACED | WCONTINUED) in a loop, to catch events in the child process.
The very first event should be the initial pause, i.e. WIFSTOPPED(status) being true. (If not, something else went wrong.)
There are three different reasons why waitpid(childpid, &status, WUNTRACED | WCONTINUED) may return:
When the child exits (WIFEXITED(status) will be true).
This should obviously end the tracing, and have the parent tracer process exit, too.
When the child resumes execution (WIFCONTINUED(status) will be true).
You cannot assume that a PTRACE_SYSCALL, PTRACE_SYSEMU, PTRACE_CONT etc. commands have actually caused the child process to continue, until the parent gets this signal. In other words, you cannot just fire ptrace() commands to the child process, and expect them to take place in an orderly fashion! The ptrace() facility is asynchronous, and the call will return immediately; you need to waitpid() for the WIFCONTINUED(status) type of event to know that the child process heeded the command.
When the kernel stopped the child (with SIGTRAP) because the child process is about to execute a syscall. (In the parent, WIFSTOPPED(status) will be true.)
Whenever the child process gets stopped because it is about to execute a syscall, you need to use ptrace(PTRACE_GETREGS, childpid, (void *)0, &regs) to obtain the CPU register state in the child process at the point of syscall execution.
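regs is of type struct user, defined in <sys/user.h>. For Intel/AMD architectures, regs.regs.orig_eax (for 32-bit) or regs.regs.orig_rax (for 64-bit) contains the syscall number (SYS_foo as defined in <sys/syscall.h>). At syscall entry, eax/rax itself has already been overwritten with -ENOSYS, which is why the orig_ field is the one to read.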
You then need to call ptrace(PTRACE_SYSCALL, childpid, (void *)0, (void *)0) to tell the kernel to execute that syscall, and waitpid() again to wait for the WIFCONTINUED(status) event notifying that it did.
The next WIFSTOPPED(status) type event from waitpid() will occur when the syscall is completed. If you want, you can use PTRACE_GETREGS again to examine regs.regs.eax or regs.regs.rax, which contains the syscall return value; on Intel/AMD, if an error occurred, it will be a negative errno value (i.e. -EACCES, -EINVAL, or similar.)
You need to call ptrace(PTRACE_SYSCALL, childpid, (void *)0, (void *)0) to tell the kernel to continue running the child, until the next syscall.
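The steps above can be sketched roughly as follows. This is an illustration under several assumptions, not a reference implementation: it only knows the x86 register layout, the names trace_syscalls and syscall_number are made up, and it reads into a struct user_regs_struct (the type of the regs member of struct user). It also adds PTRACE_O_TRACESYSGOOD, which is not mentioned above, so that syscall stops show up as SIGTRAP | 0x80 and can be told apart from other stops. For brevity it uses a plain waitpid(child, &status, 0) and relies on the next waitpid() reporting the next stop, rather than waiting for WIFCONTINUED events.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read the syscall number at a syscall-entry stop (x86 only, hypothetical helper). */
static long syscall_number(pid_t child)
{
    struct user_regs_struct regs;

    if (ptrace(PTRACE_GETREGS, child, (void *)0, &regs) == -1)
        return -1;
#if defined(__x86_64__)
    return (long)regs.orig_rax;
#elif defined(__i386__)
    return (long)regs.orig_eax;
#else
#error "this sketch only knows the x86 register layout"
#endif
}

/*
 * Run argv[0] under ptrace and print one syscall number per line to out.
 * Returns the number of syscalls seen, or -1 on error.
 */
long trace_syscalls(char *const argv[], FILE *out)
{
    int status, sig = 0, entering = 1;
    long count = 0;
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0) {
        /* Child: ask to be traced, stop so the parent can attach options, then exec. */
        if (ptrace(PTRACE_TRACEME, 0, (void *)0, (void *)0) == -1)
            _exit(127);
        raise(SIGSTOP);
        execvp(argv[0], argv);
        _exit(127);
    }

    /* The first event should be the child's initial SIGSTOP. */
    if (waitpid(child, &status, 0) == -1 || !WIFSTOPPED(status))
        goto fail;
    if (ptrace(PTRACE_SETOPTIONS, child, (void *)0,
               (void *)(long)PTRACE_O_TRACESYSGOOD) == -1)
        goto fail;

    for (;;) {
        /* Resume until the next syscall entry or exit, forwarding any pending signal. */
        if (ptrace(PTRACE_SYSCALL, child, (void *)0, (void *)(long)sig) == -1)
            goto fail;
        if (waitpid(child, &status, 0) == -1)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            return count;
        sig = 0;
        if (!WIFSTOPPED(status))
            continue;
        if (WSTOPSIG(status) == (SIGTRAP | 0x80)) {
            /* Syscall stop: stops alternate between entry and exit. */
            if (entering) {
                fprintf(out, "%ld\n", syscall_number(child));
                count++;
            }
            entering = !entering;
        } else if (WSTOPSIG(status) != SIGTRAP) {
            sig = WSTOPSIG(status);   /* an ordinary signal: deliver it on resume */
        }
        /* A plain SIGTRAP (sent after a successful exec) is swallowed. */
    }

fail:
    kill(child, SIGKILL);
    waitpid(child, &status, 0);
    return -1;
}
```

Calling it with something like char *cmd[] = { "ls", NULL }; trace_syscalls(cmd, stdout); prints the raw numbers. Mapping them back to names is a separate step, and a multithreaded or forking tracee would need more bookkeeping than this.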
There are quite a few examples on-line showing some of the details above, although most that I have personally seen are pretty lax on error checking, and occasionally omit checking the WIFCONTINUED(status) waitpid() events. I've even written an answer detailing how to stop and continue individual threads on StackOverflow. Since the technique can be used as a very powerful custom debugging tool, I do recommend you try to learn the facility so you can leverage it in your work, rather than just copy-paste some existing code to get a passing grade on the exercise.
I've recently had a problem with signals. I'd like to write a program in C that prints something when a signal is sent to the process. For example: if I send SIGTERM to my process (which is simply a running program), I want the program to print, say, "killing the process denied" instead of being killed. How do I do that? How do I make the process catch such a signal and change its meaning? I also wonder whether it is possible to kill the init process (I know it's kind of a stupid question, but I was wondering how Linux deals with such a signal, and what would technically happen if I typed: sudo kill -9 1).
Don't use the signal handler to print. You can set a variable of type volatile sig_atomic_t instead, and have your main thread check this (see this example).
When your main thread has nothing else to do (which should be most of the time), let it block on a blocking function call (e.g. sleep()) that will wake up immediately when the signal is received (and set errno to EINTR).
C++ gotcha: Unlike the C sleep() function, std::this_thread::sleep_for() (in recent versions of glibc) does not wake up when a signal is received.
Regarding if it's possible to kill pid 1, see this question. The answer seems to be no, but I remember that Linux got very grumpy once I booted with init=/bin/bash and later exited this shell – had to hard reboot.
If you're looking for trouble, better kill pid -1.
I am starting a process using execv and letting it write to a file. I simultaneously start a thread that monitors the file, using stat.st_size, so that its size does not exceed a certain limit. Now, when the limit is hit, I waitpid for the child process, but this throws an error and the process I start in the background becomes a zombie. When I do the stop using the same waitpid from the main thread, the process is killed without becoming a zombie. Any ideas?
Edit: The errno is 10 and waitpid returns -1. This is on a linux platform.
This is difficult to debug without code, but errno 10 is ECHILD.
Per the man page, this is returned as follows:
ECHILD (for waitpid() or waitid()) The process specified by pid (waitpid()) or idtype and id (waitid()) does not exist or is not a child of the calling process. (This can happen for one's own child if the action for SIGCHLD is set to SIG_IGN. See also the Linux Notes section about threads.)
In short, the pid you are specifying is not a child of the process calling waitpid() (or is no longer, perhaps because it has terminated).
Note the parenthetical section:
"This can happen for one's own child if the action for SIGCHLD is set to SIG_IGN" - if you've set up a signal handler for SIGCHLD to be SIG_IGN, the wait is effectively done automatically, and therefore waitpid won't work as the child will have already terminated (will not go through zombie state).
"See also the Linux Notes section about threads." - In Linux, threads are essentially processes. Modern Linux will allow one thread to wait for children of other threads (provided they are in the same thread group, broadly the parent process). If you are using Linux prior to 2.4, this is not the case. See the documentation on __WNOTHREAD for details.
I'm guessing the thread thing is a red herring, and the problem is actually the signal handler, as this accords with your statement 'the process is killed without becoming a zombie.'
I have the pid of a forked process. Now, from my C code (running on Linux), I have to check periodically whether this process is still running or has terminated. I do not want to use blocking calls like wait() or waitpid(). I need (preferably) a non-blocking system call which would just check whether this pid is still running and return the status of the child.
What is the best and easiest way to do it?
The waitpid() function can take the option value WNOHANG to not block. See the manual page, as always.
That said, I'm not sure that pids are guaranteed to not be recycled by the system, which would open this up to race conditions.
kill(pid, 0);
This will "succeed" (return 0) if the PID exists. Of course, it could exist because the original process ended and something new took its place...it's up to you to decide if that matters.
You might consider instead registering a handler for SIGCHLD. That will not depend on the PID which could be recycled.
Use the WNOHANG option in waitpid().
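For example, a non-blocking check might look like this (a sketch; the name poll_child and its return convention are made up for illustration):

```c
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Non-blocking check on a child process (hypothetical helper).
 * Returns 0 if the child is still running, 1 if it has terminated
 * (*status then holds its wait status), or -1 on error, for example
 * ECHILD when pid is not a child of the caller.
 */
int poll_child(pid_t child, int *status)
{
    pid_t r = waitpid(child, status, WNOHANG);

    if (r == 0)
        return 0;   /* child exists but has not changed state yet */
    if (r == child)
        return 1;   /* child terminated and has now been reaped */
    return -1;
}
```

Once it has returned 1 the child is reaped, so do not call it again for the same pid; the pid may be reused after that.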
This is regarding the application that runs on POSIX (Linux) environment. Most signals (e.g. Ctrl+C - signal 2, SIGINT), and few others are handled. When that is done the exit() system call is called from the handler with a desirable exit code.
However, there are some signals, like signal 9 and signal 15, that can't be handled.
Unfortunately, the parent process (an external script) which launches the given application needs to know and clean up some stuff if the signal 9 or 15 was the reason for termination.
Is there a predefined exit code that can be received by parent process to know the above?
The script that launches the app is a bash_script. The application itself is in C.
The return status from wait() or waitpid() encodes the information you need.
The POSIX macros are:
WIFEXITED(status) returns true if the child exited via exit() or one of its relatives.
WEXITSTATUS(status) tells you what that exit status was (0..255).
WIFSIGNALED(status) returns true if the child exited because of a signal (any signal).
WTERMSIG(status) returns the signal number that killed the child.
The non-standard but common macro WCOREDUMP(status) tells you if the process dumped core. You can also tell whether status reflects that the process was stopped or continued (and what the stop signal was).
Note that signal 15 is usually SIGTERM and SIGTERM can be trapped by an application. The signals that cannot be trapped are SIGKILL (9) and SIGSTOP (17 on Mac OS X; may not be the same everywhere).
The question then is if bash provides this info for a script.
The answer is yes, but only indirectly and not 100% unambiguously. The status value reported by bash will be 128 + <signum> for processes that terminate due to signal <signum>, but you can't distinguish between a process that exits with status 130, say, and a process that was interrupted by SIGINT, aka signal 2.
15 (SIGTERM) could be caught and handled by the application, if it so chose to do so, but perhaps it does not at the moment
9 (SIGKILL) obviously cannot be caught by any application.
However, typically the operating system sets the exit status in such a way that the signal which terminated the process can be identified. Normally only the lower 8 bits of the status parameter to the exit(3) function [and thus the _exit(2) system call] are copied into the status value returned by wait(2) to the parent process (the shell running the external script in your example). So, that leaves sizeof(int)-1 bytes of space in the status value for the OS to use to fill in other information about the terminated process. Typically the wait(2) manual page will describe the way to interpret the wait status and thus split apart any additional information about the process termination from the status the process passed to _exit(2), IFF the process exited.
Unfortunately whether or not this extra information is made available to a script depends on how the shell executing the script might handle it.
First check your shell's manual page for details on how to interpret $?.
If the shell makes the whole status int value available verbatim to the script (in the $? variable), then it will be possible to parse apart the value and determine how and why the program exited. Most shells don't seem to do this completely (and for various reasons, not the least of which might be standards compliance), but they do at least go far enough to make it possible to solve your query (and must, to be POSIX compatible).
Here for example I'm running the AT&T version of KSH on Mac OS X. My ksh(1) manual page says that the exit status is 0-255 if the program just run terminated normally (where the value is presumably what was passed to _exit(2)) and 256+signum if the process was terminated by a signal (numbered "signum"). I don't know about on Linux, but on OS X bash gives a different exit status than Ksh does (with bash using the 8'th bit to represent a signal and thus only allowing 0-127 as valid exit values). (There is discrepancy in the POSIX standard between wait(2)'s claim that 8 low-order bits of _exit(2) being available, and the shell's conversion of wait status to $? preserving only 7 bits. Go figure! Ksh's behaviour is in violation of POSIX, but it is safer, since a strictly compatible shell may not be able to distinguish between a process passing a value of 128-255 to _exit(2) and having been terminated by a signal.)
So, anyway, I start a cat process, then I send it a SIGQUIT from the terminal (by pressing ^\) (I use SIGQUIT because there's no easy way to send SIGTERM from the terminal keyboard):
22:01 [2389] $ cat
^\Quit(coredump)
ksh: exit code: 259
(I have a shell EXIT trap defined to print $? if it is not zero, so you see it above too)
22:01 [2390] $ echo $?
259
(259 is an integer value representing the status returned by wait(2) to the shell)
22:02 [2391] $ bc
obase=16
259
103
^D22:03 [2392] $
(see that 259 has the hex value 0x0103, note that 0x0100 is 256 decimal)
22:03 [2392] $ signo SIGQUIT
#define SIGQUIT 3 /* quit */
(I have a shell alias called signo that searches headers to find the number representing a symbolic signal name. See here that 0x03 from the status value is the same number as SIGQUIT.)
Further exploration of the wait(2) system call, and the related macros from <sys/wait.h> will allow us to understand a bit more of what's going on.
In C the basic logic for decoding a wait status makes use of the macros from <sys/wait.h>:
if (!WIFEXITED(status)) {
    if (WIFSIGNALED(status)) {
        termsig = WTERMSIG(status);    /* killed by this signal */
    } else if (WIFSTOPPED(status)) {
        stopsig = WSTOPSIG(status);    /* stopped by this signal */
    }
} else {
    exit_value = WEXITSTATUS(status);  /* normal exit */
}
I hope that helps!
It is not possible for a parent process to detect the SIGKILL or Signal 9 - given the SIGNAL occurs outside of the user space.
A suggestion would be to have your parent process detect whether your child process has gone away and deal with it accordingly. A great example is seen in mysqld-safe etc.