I'm implementing a Job Control Shell in C in Linux as a project for a Operating Systems-related subject. I have a main() function that does child process management, helped with a linked list as shown here in which background and suspended jobs information is stored:
typedef struct job_
{
pid_t pgid; /* group id = process lider id */
char * command; /* program name */
enum job_state state;
struct job_ *next; /* next job in the list */
} job;
Every time a child process exits or is stopped, a SIGCHLD is sent to parent process to be informed about that. Then, I have a signal handler as shown here that for each node of that job status linked list, checks if the process represented in that node has exited and if it did, that node is removed from the linked list.
Here is the code for the SIGCHLD handler, where 'job_list' is the linked list where the info is stored:
void mySIGCHLD_Handler(int signum) {
block_SIGCHLD();
if (signum == 17) {
job *current_node = job_list->next, *node_to_delete = NULL;
int process_status, process_id_deleted;
while (current_node) {
/* Wait for a child process to finish.
* - WNOHANG: return immediately if the process has not exited
*/
waitpid(current_node->pgid, &process_status, WNOHANG);
if (WIFEXITED(process_status) != 0) {
node_to_delete = current_node;
current_node = current_node->next;
process_id_deleted = node_to_delete->pgid;
if (delete_job(job_list, node_to_delete)) {
printf("Process #%d deleted from job list\n", process_id_deleted);
} else {
printf("Process #%d could not be deleted from job list\n", process_id_deleted);
}
} else {
current_node = current_node->next;
}
}
}
unblock_SIGCHLD();
}
The thing is, when the handler is called, some entries that should not be deleted because the process they represent are not exited, are deleted, when they shouldn't. Anyone would know why that happens?
Thank you and sorry for your lost time :(
I see many problems in this code, but the immediate issue is probably here:
waitpid(current_node->pgid, &process_status, WNOHANG);
if (WIFEXITED(process_status) != 0) {
When waitpid(pid, &status, WNOHANG) returns because the process has not exited, it does not write anything to status, so the subsequent if is branching on garbage. You need to check the actual return value of waitpid before assuming status is meaningful.
The most important other problems are:
The kernel is allowed to send only one SIGCHLD to tell you that several processes have exited. When you get a SIGCHLD, you need to call waitpid(0, &status, WNOHANG) in a loop until it tells you there are no more processes to wait for, and you need to process (no pun intended) all of the exited process IDs that it tells you about.
It is not safe to call printf or free from an asynchronous signal handler. Add terminated processes to a list of deferred tasks, instead. Make sure to block SIGCHLD in the main-loop code that consumes that list.
Don't block and unblock SIGCHLD yourself in the handler; that has an unavoidable race condition. Instead, let the kernel do it for you, atomically, by setting up your signal handler correctly: use sigaction and don't put SA_NODEFER in sa_flags. (Do put SA_RESTART in sa_flags, unless you have a very good reason not to.)
The literal number 17 should be the signal constant SIGCHLD instead. Some signal numbers have been stable across all Unixes throughout history, but SIGCHLD is not one of them.
Related
In this example from the CSAPP book chap.8:
\#include "csapp.h"
/* WARNING: This code is buggy! \*/
void handler1(int sig)
{
int olderrno = errno;
if ((waitpid(-1, NULL, 0)) < 0)
sio_error("waitpid error");
Sio_puts("Handler reaped child\n");
Sleep(1);
errno = olderrno;
}
int main()
{
int i, n;
char buf[MAXBUF];
if (signal(SIGCHLD, handler1) == SIG_ERR)
unix_error("signal error");
/* Parent creates children */
for (i = 0; i < 3; i++) {
if (Fork() == 0) {
printf("Hello from child %d\n", (int)getpid());
exit(0);
}
}
/* Parent waits for terminal input and then processes it */
if ((n = read(STDIN_FILENO, buf, sizeof(buf))) < 0)
unix_error("read");
printf("Parent processing input\n");
while (1)
;
exit(0);
}
It generates the following output:
......
Hello from child 14073
Hello from child 14074
Hello from child 14075
Handler reaped child
Handler reaped child //more than one child reaped
......
The if block used for waitpid() is used to generate a mistake that waitpid() is not able to reap all children. While I understand that waitpid() is to be put in a while() loop to ensure reaping all children, what I don't understand is that why only one waitpid() call is made, yet was able to reap more than one children(Note in the output more than one child is reaped by handler)? According to this answer: Why does waitpid in a signal handler need to loop?
waitpid() is only able to reap one child.
Thanks!
update:
this is irrelevant, but the handler is corrected in the following way(also taken from the CSAPP book):
void handler2(int sig)
{
int olderrno = errno;
while (waitpid(-1, NULL, 0) > 0) {
Sio_puts("Handler reaped child\n");
}
if (errno != ECHILD)
Sio_error("waitpid error");
Sleep(1);
errno = olderrno;
}
Running this code on my linux computer.
The signal handler you designated runs every time the signal you assigned to it (SIGCHLD in this case) is received. While it is true that waitpid is only executed once per signal receival, the handler still executes it multiple times because it gets called every time a child terminates.
Child n terminates (SIGCHLD), the handler springs into action and uses waitpid to "reap" the just exited child.
Child n+1 terminates and its behaviour follows the same as Child n. This goes on for every child there is.
There is no need to loop it as it gets called only when needed in the first place.
Edit: As pointed out below, the reason as to why the book later corrects it with the intended loop is because if multiple children send their termination signal at the same time, the handler may only end up getting one of them.
signal(7):
Standard signals do not queue. If multiple instances of a
standard signal are generated while that signal is blocked, then
only one instance of the signal is marked as pending (and the
signal will be delivered just once when it is unblocked).
Looping waitpid assures the reaping of all exited children and not just one of them as is the case right now.
Why is looping solving the issue of multiple signals?
Picture this: you are currently inside the handler, handling a SIGCHLD signal you have received and whilst you are doing that, you receive more signals from other children that have terminated in the meantime. These signals cannot queue up. By constantly looping waitpid, you are making sure that even if the handler itself can't deal with the multiple signals being sent, waitpid still picks them up as it's constantly running, rather than only running when the handler activates, which can or can't work as intended depending on whether signals have been merged or not.
waitpid still exits correctly once there are no more children to reap. It is important to understand that the loop is only there to catch signals that are sent when you are already in the signal handler and not during normal code execution as in that case the signal handler will take care of it as normal.
If you are still in doubt, try reading these two answers to your question.
How to make sure that `waitpid(-1, &stat, WNOHANG)` collect all children processes
Why does waitpid in a signal handler need to loop? (first two paragraphs)
The first one uses flags such as WNOHANG, but this only makes waitpid return immediately instead of waiting, if there is no child process ready to be reaped.
On a normal day , when a process is killed , all its child processes must be attached to 'init' process (Great Grand-Daddy of all processes). Strangely , the XV6 doesn't seem to do so. Below is the code of 'kill' function in proc.c file in XV6
int kill(int pid)
{
struct proc * p;
acquire(&ptable.lock);
for(p = ptable.proc; p < &ptable.proc[NPROC]; p++){
if(p->pid == pid){
p->killed = 1;
if(p->state == SLEEPING)
p->state = RUNNABLE;
release(&ptable.lock);
return 0;
}
}//end of for loop
release(&ptable.lock);
return -1;
}
Why isn't the killed process removed from the process table?
Why aren't its children adopted by 'init'?
And for God's sake, why is the killed process made RUNNABLE again ?
In POSIX, to receive and handle a signal (terminating the process is handling a signal) a process must be scheduled. Before that time the signal is considered pending and the process still exists.
The child processes are probably orphaned when the process is actually terminated and removed from the kernel data structures.
I would need some help with some C code.
Basically I have n processes which execute some code. Once they're almost done, I'd like the "Manager Process" (which is the main function) to send to each of the n processes an int variable, which may be different for every process.
My idea was to signal(handler_function, SIGALRM) once all processes started. When process is almost done, it uses kill(getpid(), SIGSTOP) in order to wait for the Manager Process.
After SIM_TIME seconds passed, handler_function sends int variable on a Message Queue then uses kill(process_pid, SIGCONT) in order to wake up waiting processes. Those processes, after being woken up should receive that int variable from Message Queue, print it and simply terminate, letting Manager Process take control again.
Here's some code:
/**
* Child Process creation using fork() system call
* Parent Process allocates and initializes necessary variables in shared memory
* Child Process executes Student Process code defined in childProcess function
*/
pid_t runChild(int index, int (*func)(int index))
{
pid_t pid;
pid = fork();
if (pid == -1)
{
printf(RED "Fork ERROR!\n" RESET);
exit(EXIT_FAILURE);
}
else if (pid == 0)
{
int res = func(index);
return getpid();
}
else
{
/*INSIGNIFICANT CODE*/
currentStudent = createStudent(pid);
currentStudent->status = FREE;
students[index] = *currentStudent;
currentGroup = createGroup(index);
addMember(currentStudent, currentGroup);
currentGroup->closed = FALSE;
groups[index] = *currentGroup;
return pid;
}
}
Code executed by each Process
/**
* Student Process Code
* Each Student executes this code
*/
int childProcess(int index)
{
/*NOTICE: showing only relevant part of code*/
printf("Process Index %d has almost done, waiting for manager!\n", index);
/* PROGRAM GETS STUCK HERE!*/
kill(getpid(), SIGSTOP);
/* mex variable is already defines, it's a struct implementing Message Queue message struct*/
receiveMessage(mexId, mex, getpid());
printf(GREEN "Student %d has received variable %d\n" RESET, getpid(), mex->variable);
}
Handler Function:
* Handler function
* Will be launched when SIM_TIME is reached
*/
void end_handler(int sig)
{
if (sig == SIGALRM)
{
usleep(150000);
printf(RED "Time's UP!\n" RESET);
printGroups();
for(int i = 0; i < POP_SIZE; i++){
mex->mtype = childPids[i];
mex->variable = generateInt(18, 30);
sendMessage(mexId, mex);
//childPids is an array containing PIDs of all previously launched processes
kill(childPids[i], SIGCONT);
}
}
I hope my code is understandable.
I have an issue though, Using provided code the entire program gets stuck at kill(getpid(), SIGSTOP) system call.
I also tried to launch ps in terminal and no active processes are detected.
I think handler_function doesn't send kill(childPids[i], SIGCONT) system call for some reason.
Any idea how to solve this problem?
Thank you
You might want to start by reading the manual page for mq_overview (man mq_overview). It provides a portable and flexible communication mechanism between processes which permits sync and async mechanisms to communicate.
In your approach, there is a general problem of “how does one process know if another is waiting”. If the process hasn’t stopped itself, the SIGCONT is ignored, and when it subsequently suspends itself, nobody will continue it.
In contrast, message-based communication between the two can be viewed as a little language. For simple exchanges (such as yours), the completeness of the grammar can be readily hand checked. For more complex ones, state machines or even nested state machines can be constructed to analyze their behaviour.
Assuming check_if_pid_exists(pid) returns true when a process with such a pid exists (but possibly hasn't been running yet) or false when there is no process with such pid, is there any chance in parent code for a race condition when the fork() returned the child pid, however the kernel hasn't had a chance to initialize the data structures so that check_if_pid_exists(child) returns false? Or perhaps after returning from fork() we have a guarantee that check_if_pid_exists(pid) returns true?
pid_t child = fork();
if (child == 0) {
/* here the child just busy waits */
for (;;)
;
}
if (child > 0) {
/* here the parent checks whether child PID already exists */
check_if_pid_exists(child);
}
No.
The fork() returns after the new task is created and visible as expected by the parent. Most everything wouldn't work otherwise.
Whether the process has had a chance to run at all or has started quite a bit, is not known at that point. Whether the child has completed is known as you receive the SIGCHLD signal once that happens.
Where you can have a race is with the SIGCHLD if not handled properly. That is, you are expected to ignore the SIGCHLD signal, call fork(), save the results appropriately so you have the PID of the child (i.e. allocate a struct to save the child pid_t value), then use one of the wait() functions to know whether the child died.
Assuming your fork()ed process is expected to run for a while, then the
check_if_pid_exists(child) == true
will likely be true 99.9999% of the time (actually, assuming the parent is in control of when the child is expected to exit, make it 100% of the time).
As mentioned by others, many things can prevent the new process from running:
Not enough memory
You already started too many children
The child tries to do something and encounters a fatal error and exits
Some third party thing prevents the fork()
...
Also, fork() may return -1 in case it fails to create the child.
However, if the question was about: how do I track the lifetime of a child? Then the correct answer is for the parent to check whether it died. You should not rely on a function such as check_if_pid_exists() searching for the process under /proc/... or similar implementation (see waitid), because such a function may determine that the process is still running, then the process dies, and yet the function still returns true...
// ignore SIGCHLD
struct sigaction sa, chld;
sa.sa_handler = SIG_IGN;
sa.sa_flags = 0;
__sigemptyset (&sa.sa_mask);
sigaction(SIGCHLD, &sa, NULL);
[...]
int child_is_running = 0;
[...]
pid_t child = fork();
if(child == 0) ...do stuff in the child...
if(child < 0) ...handle error...
int child_is_running = 1;
[...]
wait(...);
if(...child exited...) child_is_running = 0;
Now you can write a safe check_if_pid_exists():
int check_if_pid_exists()
{
return child_is_running;
}
In my example here, I only allow one child. If you need multiple, that's where you need a better scheme (probably a struct to save the child info and a table of some sort or linked list of all your children).
I read in an ebook that waitpid(-1, &status, WNOHANG) should be put under a while loop so that if multiple child process exits simultaniously , they are all get reaped.
I tried this concept by creating and terminating 2 child processes at the same time and reaping it by waitpid WITHOUT using loop. And the are all been reaped .
Question is , is it very necessary to put waitpid under a loop ?
#include<stdio.h>
#include<sys/wait.h>
#include<signal.h>
int func(int pid)
{
if(pid < 0)
return 0;
func(pid - 1);
}
void sighand(int sig)
{
int i=45;
int stat, pid;
printf("Signal caught\n");
//while( (
pid = waitpid(-1, &stat, WNOHANG);
//) > 0){
printf("Reaped process %d----%d\n", pid, stat);
func(pid);
}
int main()
{
int i;
signal(SIGCHLD, sighand);
pid_t child_id;
if( (child_id=fork()) == 0 ) //child process
{
printf("Child ID %d\n",getpid());
printf("child exiting ...\n");
}
else
{
if( (child_id=fork()) == 0 ) //child process
{
printf("Child ID %d\n",getpid());
printf("child exiting ...\n");
}
else
{
printf("------------Parent with ID %d \n",getpid());
printf("parent exiting ....\n");
sleep(10);
sleep(10);
}
}
}
Yes.
Okay, I'll elaborate.
Each call to waitpid reaps one, and only one, child. Since you put the call inside the signal handler, there is no guarantee that the second child will exit before you finish executing the first signal handler. For two processes that is okay (the pending signal will be handled when you finish), but for more, it might be that two children will finish while you're still handling another one. Since signals are not queued, you will miss a notification.
If that happens, you will not reap all children. To avoid that problem, the loop recommendation was introduced. If you want to see it happen, try running your test with more children. The more you run, the more likely you'll see the problem.
With that out of the way, let's talk about some other issues.
First, your signal handler calls printf. That is a major no-no. Very few functions are signal handler safe, and printf definitely isn't one. You can try and make your signal handler safer, but a much saner approach is to put in a signal handler that merely sets a flag, and then doing the actual wait call in your main program's flow.
Since your main flow is, typically, to call select/epoll, make sure to look up pselect and epoll_pwait, and to understand what they do and why they are needed.
Even better (but Linux specific), look up signalfd. You might not need the signal handler at all.
Edited to add:
The loop does not change the fact that two signal deliveries are merged into one handler call. What it does do is that this one call handles all pending events.
Of course, once that's the case, you must use WNOHANG. The same artifacts that cause signals to be merged might also cause you to handle an event for which a signal is yet to be delivered.
If that happens, then once your first signal handler exists, it will get called again. This time, however, there will be no pending events (as the events were already extracted by the loop). If you do not specify WNOHANG, your wait block, and the program will be stuck indefinitely.