Why does popen() invoke a shell to execute a process? - c

I'm currently reading up on and experimenting with the different possibilities of running programs from within C code on Linux. My use cases cover all possible scenarios, from simply running and forgetting about a process, reading from or writing to the process, to reading from and writing to it.
For the first two, popen() is very easy to use and works well. I understand that it uses some version of fork() and exec() internally, then invokes a shell to actually run the command.
For the third scenario, popen() is not an option, as it is unidirectional. Available options are:
Manually fork() and exec(), plus pipe() and dup2() for input/output
posix_spawn(), which internally uses the above as need be
What I noticed is that these can achieve the same as popen(), but without invoking an additional sh at all. This sounds desirable, as it seems less complex.
However, I noticed that even examples on posix_spawn() that I found on the Internet do invoke a shell, so it would seem there must be a benefit to it. If it is about parsing command line arguments, wordexp() seems to do an equally good job.
What is the benefit of invoking a shell to run the desired process instead of running it directly?
Edit: I realized that my wording of the question didn't precisely reflect my actual interest - I was more curious about the benefits of going through sh rather than the (historical) reason, though both are obviously connected, so answers for both variations are equally relevant.

Invoking a shell allows you to do all the things that you can do in a shell.
For example,
FILE *fp = popen("ls *", "r");
is possible with popen(): the shell expands * to all files in the current directory.
Compare it with:
execvp("/bin/ls", (char *[]){"/bin/ls", "*", NULL});
Here ls receives * as a literal argument, because exec(2) performs no wildcard expansion; expanding * is the shell's job.
Similarly, pipes (|), redirection (>, <, ...), etc., are possible with popen.
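If wildcard expansion is the only shell feature you need, it can be done in-process. Here is a minimal sketch, assuming glob(3) is an acceptable substitute for the shell's expansion (glob() is not mentioned above, and the helper name glob_argv is made up for illustration):

```c
#include <glob.h>
#include <stddef.h>

/* Hypothetical helper: expand `pattern` the way the shell would and lay the
 * result out as an argv-style array in g->gl_pathv:
 *     { prog, match1, match2, ..., NULL }
 * Returns 0 on success, or glob()'s nonzero code (e.g. GLOB_NOMATCH).
 * On success the caller releases the matches with globfree(g). */
int glob_argv(const char *prog, const char *pattern, glob_t *g)
{
    g->gl_offs = 1;                      /* reserve gl_pathv[0] for prog */
    int rc = glob(pattern, GLOB_DOOFFS, NULL, g);
    if (rc != 0)
        return rc;
    g->gl_pathv[0] = (char *)prog;       /* argv[0] of the program to run */
    return 0;
}
```

After a successful glob_argv("ls", "*", &g), a call such as execvp(g.gl_pathv[0], g.gl_pathv) would run ls on the expanded names without any shell involved.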
Otherwise, there's no reason to use popen() if you don't need a shell. You end up with an extra shell process, and everything that can go wrong in a shell can go wrong in your program (for example, the shell could interpret the command you pass in a way you did not intend, which is a common security issue). popen() is designed that way. A fork() + exec() solution is cleaner and avoids the issues associated with a shell.
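For the bidirectional case from the question, a fork() + exec() version might look roughly like the sketch below. The helper name run_filter and its buffer-based interface are my own invention. It assumes the input is small enough to fit in the pipe buffer, because it writes everything before it reads anything; larger transfers need poll() or similar. It also assumes the program exists, since a failed exec could leave the parent writing into a closed pipe and receiving SIGPIPE.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] directly (no shell), write `input` to its
 * stdin and collect up to out_cap-1 bytes of its stdout in `out`
 * (NUL-terminated). Returns the child's exit code, or -1 on failure. */
int run_filter(char *const argv[], const char *input, size_t in_len,
               char *out, size_t out_cap)
{
    int to_child[2], from_child[2];

    if (out_cap == 0 || pipe(to_child) == -1)
        return -1;
    if (pipe(from_child) == -1) {
        close(to_child[0]); close(to_child[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        close(to_child[0]); close(to_child[1]);
        close(from_child[0]); close(from_child[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: connect the pipe ends to stdin/stdout, then exec. */
        if (dup2(to_child[0], STDIN_FILENO) == -1 ||
            dup2(from_child[1], STDOUT_FILENO) == -1)
            _exit(127);
        close(to_child[0]); close(to_child[1]);
        close(from_child[0]); close(from_child[1]);
        execvp(argv[0], argv);
        _exit(127);                      /* only reached if exec failed */
    }

    /* Parent: keep only the ends it actually uses. */
    close(to_child[0]);
    close(from_child[1]);

    size_t off = 0;
    while (off < in_len) {
        ssize_t w = write(to_child[1], input + off, in_len - off);
        if (w <= 0)
            break;
        off += (size_t)w;
    }
    close(to_child[1]);                  /* child sees EOF on stdin */

    size_t used = 0;
    ssize_t r;
    while (used < out_cap - 1 &&
           (r = read(from_child[0], out + used, out_cap - 1 - used)) > 0)
        used += (size_t)r;
    out[used] = '\0';
    close(from_child[0]);

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Called with an argument vector such as { "tr", "a-z", "A-Z", NULL } and the input "hello\n", this passes each argument to tr exactly as written, with no quoting or expansion step in between.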

The glib answer is that the POSIX standard ( http://pubs.opengroup.org/onlinepubs/9699919799/functions/popen.html ) says so. Or rather, it says that it should behave as if the command argument is passed to /bin/sh for interpretation.
So I suppose a conforming implementation could, in principle, also have some internal library function that would interpret shell commands without having to fork and exec a separate shell process. I'm not actually aware of any such implementation, and I suspect getting all the corner cases correct would be pretty tricky.
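To make the "as if" concrete: POSIX describes the environment of the command as if the child had called execl(shell path, "sh", "-c", command, (char *)0). A rough sketch of a read-mode popen() along those lines follows. The names sketch_popen_read and sketch_pclose are hypothetical and the shell path /bin/sh is an assumption. This is not how any particular libc implements it.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for popen(command, "r"): returns a stream connected
 * to the command's stdout and stores the child's pid in *pid_out. */
FILE *sketch_popen_read(const char *command, pid_t *pid_out)
{
    int fds[2];
    if (pipe(fds) == -1)
        return NULL;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]); close(fds[1]);
        return NULL;
    }
    if (pid == 0) {
        /* Child: stdout goes into the pipe, then the shell takes over. */
        if (dup2(fds[1], STDOUT_FILENO) == -1)
            _exit(127);
        close(fds[0]); close(fds[1]);
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);                      /* only reached if exec failed */
    }

    close(fds[1]);
    FILE *fp = fdopen(fds[0], "r");
    if (fp == NULL) {
        close(fds[0]);
        waitpid(pid, NULL, 0);
        return NULL;
    }
    *pid_out = pid;
    return fp;
}

/* Counterpart of pclose(): close the stream, reap the child and return its
 * wait status (or -1 if waiting failed). */
int sketch_pclose(FILE *fp, pid_t pid)
{
    int status;
    fclose(fp);
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return status;
}
```

Because the string is handed to sh -c, something like "echo $((6 * 7))" is evaluated by the shell before echo ever runs, which is exactly the behaviour the standard asks for.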

The 2004 version of the POSIX system() documentation has a rationale that is likely applicable to popen() as well. Note the stated restrictions on system(), especially the one stating "that the process ID is different":
RATIONALE
...
There are three levels of specification for the system() function. The
ISO C standard gives the most basic. It requires that the function
exists, and defines a way for an application to query whether a
command language interpreter exists. It says nothing about the command
language or the environment in which the command is interpreted.
IEEE Std 1003.1-2001 places additional restrictions on system(). It
requires that if there is a command language interpreter, the
environment must be as specified by fork() and exec. This ensures, for
example, that close-on-exec works, that file locks are not inherited,
and that the process ID is different. It also specifies the return
value from system() when the command line can be run, thus giving the
application some information about the command's completion status.
Finally, IEEE Std 1003.1-2001 requires the command to be interpreted
as in the shell command language defined in the Shell and Utilities
volume of IEEE Std 1003.1-2001.
Note the multiple references to the "ISO C Standard". The latest version of the C standard requires that the command string be processed by the system's "command processor":
7.22.4.8 The system function
Synopsis
#include <stdlib.h>
int system(const char *string);
Description
If string is a null pointer, the system function determines
whether the host environment has a command processor. If string
is not a null pointer, the system function passes the string
pointed to by string to that command processor to be executed
in a manner which the implementation shall document; this might then
cause the program calling system to behave in a non-conforming
manner or to terminate.
Returns
If the argument is a null pointer, the system function
returns nonzero only if a command processor is available. If
the argument is not a null pointer, and the system function
does return, it returns an implementation-defined value.
Since the C standard requires that the system's "command processor" be used for the system() call, I suspect that:
Somewhere there's a requirement in POSIX that ties popen() to the system() implementation.
It's much easier to just reuse the "command processor" entirely since there's also a requirement to run as a separate process.
So this is the glib answer twice-removed.
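Both levels of specification quoted above can be exercised from code. This is a small sketch, assuming a POSIX system where system() returns a wait status; the helper names are made up:

```c
#include <stdlib.h>
#include <sys/wait.h>

/* ISO C level: system(NULL) reports whether a command processor exists. */
int have_command_processor(void)
{
    return system(NULL) != 0;
}

/* POSIX level: the return value is a wait status, so the command's exit
 * code can be recovered. Returns -1 if the command could not be run or
 * did not exit normally. */
int shell_exit_code(const char *command)
{
    int status = system(command);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

On a plain ISO C implementation only the first function is meaningful; the second relies on the POSIX guarantee about the return value.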

Related

What are common uses for the system (3) command?

I came across the command while reading the famous C Language Book (1988). Is the command commonly used today?
From the book (section 7.8.4):
The function system(char *s) executes the command contained in the
character string s, then resumes execution of the current program. The
contents of s depend strongly on the local operating system. As a
trivial example, on UNIX systems, the statement
system("date");
causes the program date to be run ...
I was under the impression that fork-and-exec is the main way to run another program from the current one...
system is the function from the standard C library that allows a C program to invoke an external (meaning OS-level) command.
(Almost) everything is in the above sentence: the function is standard C, meaning that it is supported by any conforming implementation. But what the OS does is, err... just OS-dependent.
It should be the preferred way for writing portable programs (because it is standard C), but unfortunately:
not all OS support same commands and/or same syntax
it is known to have some caveats on most systems
The latter part is related to security: many OS (at least all I know) have a configurable path where a command is searched, and in that case the system function does use that path. The problem is that by changing the path, the program can invoke in reality a command that is not the one intended by the programmer, if someone managed to install a different command with same name in a place they control, and also managed to change the path.
This is the reason why system is generally frowned upon, and careful programmers rely only on lower-level, system-dependent functions like fork+exec on Unix-like systems or CreateProcess on Windows, or alternatively use absolute paths for the commands called from system. But then you need a rather complex configuration mechanism to adapt that absolute path to various systems...
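On a Unix-like system the lower-level route could be sketched as follows. The helper name run_absolute and the fixed PATH value handed to the child are my assumptions, not anything prescribed:

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run the program at an absolute `path` with a fixed,
 * minimal environment, so the caller's PATH is never consulted.
 * Returns the child's exit code, or -1 on failure. */
int run_absolute(const char *path, char *const argv[])
{
    static char *const clean_env[] = { "PATH=/usr/bin:/bin", NULL };

    if (path == NULL || path[0] != '/')
        return -1;                       /* refuse anything PATH-dependent */

    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        execve(path, argv, clean_env);   /* no PATH search, no shell */
        _exit(127);                      /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The absolute path still has to come from somewhere, which is the configuration problem mentioned above; the sketch only removes the PATH lookup.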

Why does system() exist?

Many papers and such mention that calls to 'system()' are unsafe and unportable. I do not dispute their arguments.
I have noticed, though, that many Unix utilities have a C library equivalent. If not, the source is available for a wide variety of these tools.
While many papers and such recommend against goto, there are those who can make an argument for its use, and there are simple reasons why it's in C at all.
So, why do we need system()? How much existing code relies on it that can't easily be changed?
Sarcastic answer: because if it didn't exist, people would ask why that functionality didn't exist...
Better answer:
Much system-level functionality is not part of the C standard but is part of, say, the Linux specification, and Windows most likely has some equivalent. So if you're writing an app that will only be used in Linux environments, using these functions is not an issue, and they are actually useful. If you're writing an application that must run on both Linux and Windows (or others), these calls become problematic because they may not be portable between systems. The key (imo) is simply to be aware of the issues and program accordingly (e.g., use appropriate #ifdefs to protect the code).
The closest thing to an official "why" answer you're likely to find is the C89 Rationale. 4.10.4.5 The system function reads:
The system function allows a program to suspend its execution temporarily in order to run another program to completion.
Information may be passed to the called program in three ways: through command-line argument strings, through the environment, and (most portably) through data files. Before calling the system function, the calling program should close all such data files.
Information may be returned from the called program in two ways: through the implementation-defined return value (in many implementations, the termination status code which is the argument to the exit function is returned by the implementation to the caller as the value returned by the system function), and (most portably) through data files.
If the environment is interactive, information may also be exchanged with users of interactive devices.
Some implementations offer built-in programs called "commands" (for example, date) which may provide useful information to an application program via the system function. The Standard does not attempt to characterize such commands, and their use is not portable.
On the other hand, the use of the system function is portable, provided the implementation supports the capability. The Standard permits the application to ascertain this by calling the system function with a null pointer argument. Whether more levels of nesting are supported can also be ascertained this way; assuming more than one such level is obviously dangerous.
Aside from that, I would say mainly for historical reasons. In the early days of Unix and C, system was a convenient library function that fulfilled a need that several interactive programs had: as mentioned above, "suspend[ing] its execution temporarily in order to run another program". It's not well-designed or suitable for any serious tasks (the POSIX requirements for it make it fundamentally non-thread-safe, it doesn't allow the calling program to handle asynchronous events while the other program is running, etc.), and its use is error-prone (safe construction of the command string is difficult) and non-portable (because the particular form of command strings is implementation-defined, though POSIX defines this for POSIX-conforming implementations).
If C were being designed today, it almost certainly would not include system, and would either leave this type of functionality entirely to the implementation and its library extensions, or would specify something more akin to posix_spawn and related interfaces.
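For comparison, here is roughly what the posix_spawn style looks like for the same "run another program to completion" job. It is a sketch only; the wrapper name spawn_and_wait is made up, and it uses the posix_spawnp variant so that argv[0] is looked up on PATH:

```c
#include <spawn.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical wrapper: start argv[0] with the given argument vector
 * (no command string, no shell), wait for it, and return its exit code,
 * or -1 on failure. */
int spawn_and_wait(char *const argv[])
{
    pid_t pid;
    int err = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
    if (err != 0)
        return -1;                       /* err holds an errno-style code */

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The arguments are passed as an array, so there is no command string to quote or escape.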
Many interactive applications offer a way for users to execute shell commands. For instance, in vi you can do:
:!ls
and it will execute the ls command. system() is a function they can use to do this, rather than having to write their own fork() and exec() code.
Also, fork() and exec() aren't portable between operating systems; using system() makes code that executes shell commands more portable.

How to hide the system call passed in the system() function from htop

Consider this C snippet:
snprintf(buf, sizeof(buf), "<LONG PROCESS WITH PARAMETERS HAVING SENSITIVE INFO>");
system(buf);
Now on compiling and executing this, the "sensitive" parameters of the process can be seen on programs like htop
And I don't want that.
I would like to know if there's a way to hide everything passed in system() such that htop will only show the name of the compiled executable (i.e htop just displays a.out all the time)
In all the Unix-like systems I've used, including many Linux variants, it's possible for a program to overwrite its command-line arguments "from inside". So in C we might use, for example, strncpy() just to blank the values of argv[1], argv[2], etc. Of course, you need to have processed or copied these arguments first, and you need to be careful not to overwrite memory outside the specific limits of each argv.
I don't think anything about Unix guarantees the portability or continued applicability of this approach, but I have been using it for at least twenty years. It conceals the command from casual uses of ps, etc., and also from /proc/NN/cmdline, but it won't stop the shell storing the command line somewhere (e.g., in a shell history file). So it only prevents casual snooping.
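A minimal sketch of the overwriting trick, using memset() rather than strncpy() for brevity (the function name scrub_args is made up). It has to run inside the program that receives the sensitive arguments, after it has copied whatever it needs, and, as said above, nothing guarantees that every system reflects the change in ps or htop:

```c
#include <string.h>

/* Hypothetical helper: blank argv[1] .. argv[argc-1] in place.
 * Each argument is overwritten only up to its own length, so nothing
 * outside the original strings is touched. argv[0] is left alone. */
void scrub_args(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        memset(argv[i], 0, strlen(argv[i]));
}
```

A typical call would be scrub_args(argc, argv) near the top of main(), right after the arguments have been parsed.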
A better approach is not to get into the situation in the first place -- have the program take its input from files (which could be encrypted), or environment variables, or use certificates. Or almost anything, in fact, except the command line.
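One way to keep the sensitive value off the command line, assuming the called program can be made to read it from standard input (that is an assumption about your program, and send_on_stdin is a made-up name), is popen() in write mode:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/wait.h>

/* Hypothetical helper: run `command` (which contains nothing secret) and
 * pass `secret` to it as one line on its stdin.
 * Returns the command's exit code, or -1 on failure. */
int send_on_stdin(const char *command, const char *secret)
{
    FILE *fp = popen(command, "w");
    if (fp == NULL)
        return -1;

    int write_failed = (fputs(secret, fp) == EOF) ||
                       (fputc('\n', fp) == EOF);
    int status = pclose(fp);             /* flushes, then waits */

    if (write_failed || status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Only the command string shows up in the process list; the secret travels through the pipe.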

Linux shell pipe syntax

I am implementing a program that simulates the Linux shell and I need to implement expressions with multiple pipes - but I am not sure what's considered legal or how to handle a few things, for example:
Is a pipe as the last character in the command legal? When I try it in the Linux shell it displays really weird behavior: after pressing enter it shows a new line with > at the beginning. I am not sure what this means as to the legality of the command.
How to handle several consecutive pipes? For example ls -l ||||| grep 7
It seems the shell just works as usual and ignores the redundant pipes, but I am not sure. Would like some help.
There is no single Linux shell; there are several. The most common one is GNU bash, but you can use others such as zsh (which I use interactively) or fish, or even scsh or es, which have quite different syntax. They do not all share exactly the same syntax, and they do not report the same errors.
There is however a standard, POSIX, which defines the POSIX shell specification (as a technical document in English):
The format for a pipeline is:
[!] command1 [ | command2 ...]
The standard output of command1 shall be connected to the standard input of command2.
As you can see, you can't end your command with a |.
Your interactive bash shell shows a different prompt (the secondary prompt, PS2, which defaults to >) because the line you entered is incomplete and it is waiting for the rest of the command. It uses the GNU readline library for interactive editable input (and completion).
All the shells I know on Linux are free software, so you could study their source code. sash is a quite simple shell whose code is quite readable (but a bit buggy); it lacks most of the interactive facilities (notably auto-completion) of more sophisticated shells.
You'll need to understand most of Advanced Linux Programming before coding your own shell...
For a homework assignment, you can probably afford to stop with an error message at the first error encountered.
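For a homework shell that reads one line at a time, the pipeline format quoted above boils down to "every | must have a command on both sides". A simplified check might look like this. The function name is made up, and it deliberately ignores quoting, the leading ! and the || operator, and treats a blank line as an error:

```c
#include <ctype.h>

/* Hypothetical checker: returns the number of commands in `line` if it has
 * the form  command1 | command2 | ...  and -1 if any command is empty
 * (leading '|', trailing '|', consecutive '|', or a blank line). */
int count_pipeline_commands(const char *line)
{
    int commands = 0;
    int seen_word = 0;       /* non-blank text seen since the last '|' */

    for (const char *p = line; *p != '\0'; p++) {
        if (*p == '|') {
            if (!seen_word)
                return -1;   /* nothing between this '|' and the previous one */
            commands++;
            seen_word = 0;
        } else if (!isspace((unsigned char)*p)) {
            seen_word = 1;
        }
    }
    if (!seen_word)
        return -1;           /* blank line, or nothing after the last '|' */
    return commands + 1;
}
```

With this check, "ls -l | grep 7" counts as two commands, while "ls |" and "ls -l ||||| grep 7" are both rejected at the first empty command.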

Argument treatment at command line in C

I was asked whether integer/float arguments supplied at the command prompt are still treated as strings in C. I am not sure about that; can anyone help a little? Is this true in C, and why? And what about other languages like Java or Python?
It is true, independent of the language, that the command line arguments to programs on Unix are strings. If the program chooses to interpret them as numbers, that is fine, but it is because the program (or programmer) chose to do so. Similarly, the runtime support for a language might alter the arguments passed by the o/s into integer or float types, but the o/s passes strings to that runtime (I know of no language that does this, but I don't claim to know all languages).
To see this, note that the ways to execute a program are the exec*() family of functions, and each of those takes a string which is the name of the program to be executed, and an array of strings which are the arguments to be passed to the program. Functions such as system() and popen() are built atop the exec*() family of functions — they also use fork(), of course. Even the posix_spawn() function takes an array of pointers to strings for the arguments of the spawned program.
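So if a program wants a number, it has to convert the string itself. Here is a minimal sketch using strtol() (the helper name parse_int_arg is made up for illustration):

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: convert one command-line argument to an int.
 * Returns 0 and stores the value in *out on success, -1 otherwise. */
int parse_int_arg(const char *arg, int *out)
{
    char *end;

    errno = 0;
    long v = strtol(arg, &end, 10);
    if (end == arg || *end != '\0')
        return -1;           /* empty, or trailing junk such as "12abc" */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return -1;           /* does not fit in an int */
    *out = (int)v;
    return 0;
}
```

In main() this would be applied to argv[1], argv[2], and so on; until then the arguments are just character strings.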
It's not unlike mailing a letter without an envelope. We all agree to use the common enclosure known as an envelope. Operating systems pass parameters to programs using the common item known as a string of characters. It's beyond the scope of the operating system to understand what the program wants to do with the parameters.
There are some exceptions; one that comes to mind is the passing of parameters to a Linux kernel module. These can be passed as items other than strings.
Basically, this is an issue of creating an interface between the operating system and the program, any program. Remember that programs are not always written in C, and you don't even know whether the language has things like float or int.
You want to be able to pass several arguments (with natural delimiters), which may easily encode arbitrary information. In C, strings can be of arbitrary length, and the only constraint on them is that a zero byte signifies the end of the string. This is a highly flexible and natural way to pass arbitrary information to the program.
So you can never supply integer or float arguments directly to a program; the operating system (Unix / Linux / Windows / etc.) won't let you. You don't have any tool that gives you that interface, in the same way that you can't pass a mouse-click as an argument. All you supply is a sequence of characters.
Since Unix and C were designed together, it is also part of the C programming language, and from there it worked its way to C++, Java, Python and most other modern programming languages, and the same way into Linux, Windows and most other operating systems.

Resources