Are there any limitations on the number of child processes that can be reaped?
Let's say my system is running a parent process and 500+ child processes.
The parent calls waitpid(-1,status,0) in blocking mode.
I sometimes see waitpid return -1.
If 500 children finish at the same time and report their status to the parent, can a child process be missed?
When a system call returns an error (such as when waitpid returns -1), consult errno (usually via perror) if you need to determine what error occurred.
According to man 2 waitpid on my system, the possible errors are pretty limited:
ECHILD: The process specified by pid does not exist or is not a child of the calling process. (This can happen for one's own child if the action for SIGCHLD is set to SIG_IGN. See also the Linux Notes section about threads.)
EINTR: WNOHANG was not set and an unblocked signal or a SIGCHLD was caught; see signal(7).
EINVAL: The options argument was invalid.
Additionally, EFAULT could be returned if you pass a bad address for the second argument. That appears to be the case based on the code you said you used:
waitpid(-1,status,0)
should be
waitpid(-1,&status,0)
If you misspoke or if you're still getting an error after fixing this problem, two possibilities are left:
The process has no children. Any children it might have created have already been reaped.
You set up a signal handler, and a signal came in while you were waiting for a child to end. Just call waitpid again.
ALWAYS enable your compiler's warnings, and address them as if they were errors! With gcc, I use -Wall -Wextra -pedantic.
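To make the two remaining cases concrete, here is a minimal sketch of a reaping loop that retries on EINTR and stops on ECHILD. The function names reap_all_children and spawn_and_reap are made up for illustration. A terminated child stays around as a zombie until it is waited for, so children that finish at the same time are collected one per call rather than lost.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap every child of the calling process.
 * Returns the number of children reaped, or -1 on an unexpected error. */
int reap_all_children(void)
{
    int reaped = 0;

    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, 0);   /* &status, not status */

        if (pid > 0)
            reaped++;                 /* one child collected */
        else if (errno == EINTR)
            continue;                 /* a signal handler ran; wait again */
        else if (errno == ECHILD)
            return reaped;            /* no children left to wait for */
        else
            return -1;
    }
}

/* Hypothetical demo: fork n children that exit at once, then reap them. */
int spawn_and_reap(int n)
{
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;                /* sketch: earlier children are not reaped */
        if (pid == 0)
            _exit(0);
    }
    return reap_all_children();
}
```

Calling spawn_and_reap(500) should return 500 as long as the system lets the process create that many children.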
The waiting works fine with pidfd_open and poll.
The problem I’m facing, after the process quits, apparently the poll() API removes the information about the now dead process, so the waitid with P_PIDFD argument fails at once saying code 22 “Invalid argument”
I don't think I can afford to launch a thread for every child process to sleep in a blocking waitpid; I have multiple processes, plus other handles that aren't processes, and I need to poll them all efficiently.
Any workarounds?
If it matters, I only need to support Linux 5.13.12 and newer running on ARM64 and ARMv7 CPUs.
The approximate sequence of kernel calls is as follows:
fork
In the child: setresuid, setresgid, execvpe
In the new child: printf, sleep, _exit
Meanwhile in the parent: pidfd_open, poll, once completed waitid with P_PIDFD first argument.
Expected result: waitid should give me the exit code of the child.
Actual result: it does nothing and sets errno to EINVAL
There is one crucial bit. From man waitid:
Applications shall specify at least one of the flags WEXITED, WSTOPPED, or WCONTINUED to be OR'ed in with the options argument.
I was passing only WNOHANG.
And you want to pass WNOHANG | WEXITED ;)
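A minimal sketch of the corrected sequence, assuming Linux 5.4 or newer. The function name exit_code_via_pidfd is invented for illustration, the child simply exits instead of calling exec, and the two fallback #define values are the Linux numbers for toolchains whose headers predate them.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434   /* assumption: headers older than Linux 5.3 */
#endif
#ifndef P_PIDFD
#define P_PIDFD 3            /* assumption: C library without P_PIDFD */
#endif

/* Fork a child that exits with `code`, wait for it through a pidfd,
 * and return the exit code reported by waitid(), or -1 on failure. */
int exit_code_via_pidfd(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);

    /* The C library may lack a pidfd_open() wrapper, so use the raw syscall. */
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0) {
        waitpid(pid, NULL, 0);        /* still reap the child */
        return -1;
    }

    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    int n;
    do {
        n = poll(&pfd, 1, -1);        /* readable once the child has exited */
    } while (n < 0 && errno == EINTR);

    /* WEXITED is mandatory; WNOHANG alone yields EINVAL. */
    siginfo_t info = { 0 };
    int rc = waitid(P_PIDFD, (id_t)pidfd, &info, WEXITED | WNOHANG);
    close(pidfd);

    if (rc < 0 || info.si_pid != pid || info.si_code != CLD_EXITED)
        return -1;
    return info.si_status;            /* exit code for CLD_EXITED */
}
```

In a real event loop the pidfd would sit in the same poll set as the other handles, and waitid would run only for descriptors that poll reports as readable.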
You can use a single reaper thread, looping on waitpid(-1, &status, 0). Whenever it reaps a child process, it looks it up in the set of current child processes, handles possible notifications (semaphore or callback), and stores the exit status.
There is one notable situation that needs special consideration: the child process may exit before fork() returns in the parent process. This means it is possible for the reaper to see a child process exiting before the code that did the fork() manages to register the child process ID in any data structure. Thus, both the reaper and the fork() registering functions must be ready to look up or create the record in the data store keeping track of child processes; including calling the callback or posting the semaphore. It is not complicated at all, but unless you are used to thinking in asynchronous terms, it is easy to miss these corner cases.
Because wait(...)/waitpid(-1,...) returns immediately when there are no child processes to wait for (with -1 and errno set to ECHILD), the reaper thread should probably wait on a condition variable when there are no child processes to wait for, with the code that registers the child process ID signaling on that condition variable to minimize resource use in the no-child-processes case. (Also, do remember to minimize the reaper thread stack size, as it is unreasonably large (order of 8 MiB) by default, and wastes resources. I often use 2*PTHREAD_STACK_MIN, myself.)
Currently, I'm learning about processes on the UNIX system.
My issue is, I need to do something every time a background process terminates. That means that I can't use the typical functionality of waitpid because then the process won't be running in the background and it'll hang the program.
I'm also aware of the SIGCHLD signal which is sent whenever a child of the parent process is terminated however I'm not aware of how to get the process id of the said process which I will need.
What is the proper way to go about this in C? I've tried things such as WNOHANG option on waitpid however that of course only gets called once so I don't see how I could make that apply to my current situation.
waitpid because then the process won't be running in the background and it'll hang the program.
If the process won't be running in the background, waitpid with the pid argument will exit immediately (assuming there are no pid clashes). And still, that's not true - just use WNOHANG...
however I'm not aware of how to get the process id of the said process which I will need. What is the proper way to go about this in C?
Use sigaction to register the signal handler and use the field si_pid from the second signal handler argument of type siginfo_t. From man sigaction:
SIGCHLD fills in si_pid, si_uid, si_status, si_utime, and si_stime,
providing information about the child. The si_pid field is the
process ID of the child
A working example that uses it is in the man 3p wait page under section Waiting for a Child Process in a Signal Handler for SIGCHLD.
What is the proper way to go about this in C?
The C standard is not aware of child processes and SIGCHLD signals. These are part of your operating system. In this case the behavior is standardized by POSIX.
I am learning about forks, execl and parent and child processes in my systems programming class. One thing that is confusing me is waitpid() and getpid(). Could someone confirm or correct my understanding of these two functions?
getpid() will return the process ID of whatever process calls it. If the parent calls it, it returns the pid of the parent. Likewise for the child. (It actually returns a value of type pid_t, according to the manpages).
waitpid() seems more complex. I know that if I use it in the parent process, without any flags to prevent it from blocking (using WNOHANG), it will halt the parent process until the child process terminates. I'm a little unsure as to how waitpid() manages all this, however. waitpid() also returns pid_t. What is the value of the pid_t waitpid() returns? How does this change depending on whether or not a parent or child calls it, and whether or not a child process is still running, or has terminated?
Your understanding of getpid is correct, it returns the PID of the running process.
waitpid is used (as you said) to block the execution of a process (unless
WNOHANG is passed) and resume execution when a (or more) child of the process
ends. waitpid returns the pid of the child whose state has changed, -1 on
failure. It also can return 0 if WNOHANG was specified but the child has not
changed the state. See:
man waitpid
RETURN VALUE
waitpid(): on success, returns the process ID of the child whose state has changed; if WNOHANG
was specified and one or more child(ren) specified by pid exist, but have not yet changed state,
then 0 is returned. On error, -1 is returned.
Depending on the arguments passed to waitpid, it will behave differently. Here
I'll quote the man page again:
man waitpid
pid_t waitpid(pid_t pid, int *wstatus, int options);
...
The waitpid() system call suspends execution of the calling process until a child specified by pid argument
has changed state. By default, waitpid() waits only for terminated children, but this behavior is modifiable
via the options argument, as described below:
The value of pid can be:
< -1: meaning wait for any child process whose process group ID is equal to the absolute value of pid.
-1: meaning wait for any child process.
0: meaning wait for any child process whose process group ID is equal to that of the calling process.
> 0: meaning wait for the child whose process ID is equal to the value of pid.
The value of options is an OR of zero or more of the following constants:
WNOHANG: return immediately if no child has exited.
WUNTRACED: also return if a child has stopped (but not traced via ptrace(2)).
Status for traced children which have stopped is provided even if this option is not specified.
WCONTINUED (since Linux 2.6.10): also return if a stopped child has been resumed by delivery of SIGCONT.
I'm a little unsure as to how waitpid() manages all this
waitpid is a syscall and the OS handles this.
How does this change depending on whether or not a parent or child calls it, and whether or not a child process is still running, or has terminated?
wait should only be called by a process that has executed fork(). So the parent
process should call wait()/waitpid(). If the child process hasn't called
fork(), then it doesn't need to call either one of these functions. If however
the child process has called fork(), then it also should call
wait()/waitpid().
The behaviour of these functions is very well explained in the man page; I quoted the important parts of it. You should read the whole man page
to get a better understanding of it.
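The three kinds of return value can be observed with a small experiment. The sketch below is illustrative only: waitpid_return_values is a made-up name, and the pipe is just a way to keep the child alive until the parent has made its first call.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if waitpid() behaved as the man page
 * describes, 0 if it did not, and -1 if setup failed. */
int waitpid_return_values(void)
{
    int gate[2];
    if (pipe(gate) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        char c;
        close(gate[1]);
        if (read(gate[0], &c, 1) < 0)   /* blocks until the parent closes */
            _exit(1);
        _exit(42);
    }
    close(gate[0]);

    int status = 0;
    pid_t r1 = waitpid(pid, &status, WNOHANG);  /* child still running: 0 */

    close(gate[1]);                             /* let the child exit */
    pid_t r2 = waitpid(pid, &status, 0);        /* blocks, then child's PID */
    int code = WIFEXITED(status) ? WEXITSTATUS(status) : -1;

    pid_t r3 = waitpid(pid, &status, WNOHANG);  /* already reaped: -1 */
    int no_child = (r3 == -1 && errno == ECHILD);

    return r1 == 0 && r2 == pid && code == 42 && no_child;
}
```

The value waitpid returns is always either a child's PID, 0 (only with WNOHANG), or -1. The exit code itself comes back through the status argument and is decoded with WIFEXITED and WEXITSTATUS.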
waitpid "shall only return the status of a child process" (from the POSIX spec). So the pid_t waitpid returns belongs to one of the current or former children of the process calling waitpid. For example, if a child has recently terminated, it returns that child's PID.
waitpid is only useful when called from a parent process. If called from a process that does not have any children, it returns -1 and sets errno to ECHILD.
waitpid can check the status of children that have terminated, or that have recently stopped or continued (e.g., ^Z from a shell). The various pid/option argument combinations in the spec tell you the various types of information you can retrieve. For example, the WCONTINUED option requests the status of recently-continued children in addition to recently-terminated children.
For example, in the parent process, I forked a child process and wait on the child process:
int main() {
    int status;
    setSignal(SIGCHLD, sigchld_handler);
    while (1) {
        // fork some child processes
        myForkFunction();
        waitpid(-1, &status, 0);
    }
}
Moreover, I have a SIGCHLD signal handler:
void
sigchld_handler(int sig) {
    pid_t pid;
    int status;
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        // Reap zombie processes
    }
}
As can be seen, waitpid() appears both in the main() function and in the sigchld_handler() function. I was wondering whether waitpid can be interrupted by SIGCHLD. If it can be interrupted by SIGCHLD, what will happen then?
Does anyone have any ideas about this?
The POSIX specification for waitpid() says in part:
If _POSIX_REALTIME_SIGNALS is defined, and the implementation queues the SIGCHLD signal, then if wait() or waitpid() returns because the status of a child process is available, any pending SIGCHLD signal associated with the process ID of the child process shall be discarded. Any other pending SIGCHLD signals shall remain pending.
Otherwise, if SIGCHLD is blocked, if wait() or waitpid() return because the status of a child process is available, any pending SIGCHLD signal shall be cleared unless the status of another child process is available.
For all other conditions, it is unspecified whether child status will be available when a SIGCHLD signal is delivered.
The third of the quoted paragraphs seems to imply that you're treading on thin ice. It doesn't mention 'implementation defined' or similar — unspecified means that the standard says nothing about what shall happen and you may or may not get any information from the implementation-specific documentation.
There is a lot of (very densely worded) information in the POSIX specification. There are also some examples, and a rationale that mentions sigwait() and sigwaitinfo(). It is worth reading the whole of the waitpid() page. You should probably also read about Signal concepts too, which is more dense reading. (One of these days, I'll do it, too, when I need to know about bits of signals that I haven't covered before.)
Why are you using WUNTRACED instead of 0 or WNOHANG? WUNTRACED is a very specialized condition — POSIX says:
WUNTRACED
The status of any child processes specified by pid that are stopped, and whose status has not yet been reported since they stopped, shall also be reported to the requesting process.
Similar comments apply to WCONTINUED. Those two flags are useful when you need them, but you very seldom need them.
I suggest you should normally use either 0 or WNOHANG in the third argument to waitpid().
Yes, in the sense that only one of them can succeed for a given child process; if the signal handler interrupts the one in main, then after the signal handler returns, the child will already have been reaped and the call in main will fail.
With that said, however, it's bad practice to write code like this. There should be a single place you handle reaping of a given child process, and usually a signal handler is a very bad choice because it's global and it would have to be aware of all possible child processes your program might have finishing, and have a way to communicate those results to the proper parts of your program.
Instead, it's generally better to monitor the termination of child processes via poll on a pipe to/from the child process, and only waitpid after you know it's terminated, or to perform blocking waitpid from a thread whose only job is to wait for the child.
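The pipe-based variant might look like the sketch below. The function name exit_code_via_pipe is made up, and the child exits immediately instead of doing real work; the point is that the read end reports end-of-file once the last copy of the write end is closed.

```c
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`, detect its termination by polling
 * a pipe, then reap it. Returns the exit code, or -1 on failure. */
int exit_code_via_pipe(int code)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        _exit(code);              /* exiting closes the child's write end */
    }

    close(fds[1]);                /* parent must not keep the write end open */

    struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
    int n;
    do {
        n = poll(&pfd, 1, -1);    /* wakes with POLLHUP at end-of-file */
    } while (n < 0 && errno == EINTR);
    close(fds[0]);

    /* The pipe can close slightly before the child becomes waitable,
     * so a blocking waitpid() on this specific PID is used here. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the descriptor is an ordinary pipe, it can sit in the same poll set as sockets and other handles. One caveat: if the child passes the write end on to its own descendants, end-of-file is delayed until they close it too.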
I am starting a process using execv and letting it write to a file. I start a thread simultaneously that monitors the file so that its size does not exceed a certain limit using stat.st_size. Now, when the limit is hit, I waitpid for the child process, but this throws an error and the process I start in the background becomes a zombie. When I do the stop using the same waitpid from the main thread, the process is killed without becoming a zombie. Any ideas?
Edit: The errno is 10 and waitpid returns -1. This is on a Linux platform.
This is difficult to debug without code, but errno 10 is ECHILD.
Per the man page, this is returned as follows:
ECHILD (for waitpid() or waitid()) The process specified by pid (waitpid()) or idtype and id (waitid()) does not exist or is not a child of the calling process. (This can happen for one's own child if the action for SIGCHLD is set to SIG_IGN. See also the Linux Notes section about threads.)
In short, the pid you are specifying is not a child of the process calling waitpid() (or is no longer, perhaps because it has terminated).
Note the parenthetical section:
"This can happen for one's own child if the action for SIGCHLD is set to SIG_IGN" - if you've set up a signal handler for SIGCHLD to be SIG_IGN, the wait is effectively done automatically, and therefore waitpid won't work as the child will have already terminated (will not go through zombie state).
"See also the Linux Notes section about threads." - In Linux, threads are essentially processes. Modern linux will allow one thread to wait for children of other threads (provided they are in the same thread group - broadly parent process). If you are using Linux prior to 2.4, this is not the case. See the documentation on __WNOTHREAD for details.
I'm guessing the thread thing is a red herring, and the problem is actually the signal handler, as this accords with your statement 'the process is killed without becoming a zombie.'