How argv[0] works - c

I know that argv[0] represents the executable file name, but I don't understand how it is implemented: how does it get the filename and options at the source code level? At first I thought it depended on built-in functions in Linux, but then I found out that Windows also supports it, which leads me to believe that it may be done by the compiler?

It's actually part of the C99 standard, hence the same implementation across compilers and operating systems. From 5.1.2.2.1 Program startup (page 12):
If the value of argc is greater than zero, the string pointed to by argv[0] represents the program name; argv[0][0] shall be the null character if the program name is not available from the host environment. If the value of argc is greater than one, the strings pointed to by argv[1] through argv[argc-1] represent the program parameters.
Edit: Following up on Waleed Khan's comment, you can retrieve these values via:
Linux - /proc/self/cmdline
OSX - _NSGetArgc / _NSGetArgv or [NSProcessInfo arguments]
Windows - GetCommandLine() with CommandLineToArgvW()

When the binary is executed, the execve system call copies argc and the argv strings onto the new process's stack. glibc's startup code then calls __libc_start_main, which picks them up and passes them to main.
The kernel populates the stack with argv for you, so if you're interested in modifying or understanding that part, you should look into the kernel execve code. If you follow it in lxr you'll get to this line, which I believe is what you are looking for:
http://lxr.linux.no/linux+v3.0/fs/exec.c#L1541

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=fs/exec.c#l1376
Search for sys_execve() and read the kernel code; you can find it there.

Related

Why does popen() invoke a shell to execute a process?

I'm currently reading up on and experimenting with the different possibilities of running programs from within C code on Linux. My use cases cover all possible scenarios, from simply running and forgetting about a process, reading from or writing to the process, to reading from and writing to it.
For the first two, popen() is very easy to use and works well. I understand that it uses some version of fork() and exec() internally, then invokes a shell to actually run the command.
For the third scenario, popen() is not an option, as it is unidirectional. Available options are:
Manually fork() and exec(), plus pipe() and dup2() for input/output
posix_spawn(), which internally uses the above as need be
What I noticed is that these can achieve the same that popen() does, but we can completely avoid the invoking of an additional sh. This sounds desirable, as it seems less complex.
However, I noticed that even examples on posix_spawn() that I found on the Internet do invoke a shell, so it would seem there must be a benefit to it. If it is about parsing command line arguments, wordexp() seems to do an equally good job.
What is the benefit of invoking a shell to run the desired process instead of running it directly?
Edit: I realized that my wording of the question didn't precisely reflect my actual interest - I was more curious about the benefits of going through sh rather than the (historical) reason, though both are obviously connected, so answers for both variations are equally relevant.
Invoking a shell allows you to do all the things that you can do in a shell.
For example,
FILE *fp = popen("ls *", "r");
is possible with popen() (expands all files in the current directory).
Compare it with:
execvp("/bin/ls", (char *[]){"/bin/ls", "*", NULL});
You can't exec ls with * as argument because exec(2) will interpret * literally.
Similarly, pipes (|), redirection (>, <, ...), etc., are possible with popen.
Otherwise, there's no reason to use popen() if you don't need a shell; it's unnecessary. You'll end up with an extra shell process, and all the things that can go wrong in a shell can go wrong in your program (e.g., the command you pass could be interpreted by the shell in a way you did not intend, which is a common security issue). popen() is designed that way. A fork() + exec() solution is cleaner and avoids the issues associated with a shell.
The glib answer is: because the POSIX standard ( http://pubs.opengroup.org/onlinepubs/9699919799/functions/popen.html ) says so. Or rather, it says that it should behave as if the command argument is passed to /bin/sh for interpretation.
So I suppose a conforming implementation could, in principle, also have some internal library function that would interpret shell commands without having to fork and exec a separate shell process. I'm not actually aware of any such implementation, and I suspect getting all the corner cases correct would be pretty tricky.
The 2004 version of the POSIX system() documentation has a rationale that is likely applicable to popen() as well. Note the stated restrictions on system(), especially the one stating "that the process ID is different":
RATIONALE
...
There are three levels of specification for the system() function. The
ISO C standard gives the most basic. It requires that the function
exists, and defines a way for an application to query whether a
command language interpreter exists. It says nothing about the command
language or the environment in which the command is interpreted.
IEEE Std 1003.1-2001 places additional restrictions on system(). It
requires that if there is a command language interpreter, the
environment must be as specified by fork() and exec. This ensures, for
example, that close-on-exec works, that file locks are not inherited,
and that the process ID is different. It also specifies the return
value from system() when the command line can be run, thus giving the
application some information about the command's completion status.
Finally, IEEE Std 1003.1-2001 requires the command to be interpreted
as in the shell command language defined in the Shell and Utilities
volume of IEEE Std 1003.1-2001.
Note the multiple references to the "ISO C Standard". The latest version of the C standard requires that the command string be processed by the system's "command processor":
7.22.4.8 The system function
Synopsis
#include <stdlib.h>
int system(const char *string);
Description
If string is a null pointer, the system function determines
whether the host environment has a command processor. If string
is not a null pointer, the system function passes the string
pointed to by string to that command processor to be executed
in a manner which the implementation shall document; this might then
cause the program calling system to behave in a non-conforming
manner or to terminate.
Returns
If the argument is a null pointer, the system function
returns nonzero only if a command processor is available. If
the argument is not a null pointer, and the system function
does return, it returns an implementation-defined value.
Since the C standard requires that the system's "command processor" be used for the system() call, I suspect that:
Somewhere there's a requirement in POSIX that ties popen() to the system() implementation.
It's much easier to just reuse the "command processor" entirely since there's also a requirement to run as a separate process.
So this is the glib answer twice-removed.

Is it legal to pass a null program argument vector to execve()?

Consider the following C code (x86_64):
#include <unistd.h>
int main()
{
execve("/bin/ls", 0, 0);
}
I compiled it with gcc a.c and ran it; I got a SIGABRT with the error
A NULL argv[0] was passed through an exec system call.
Aborted
Next, running it under gdb, I also got a SIGABRT at first, but on a second run it worked!
Starting program: /bin/ls
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Why?
I tested with /bin/sh and found it always worked with argv set to NULL ...
I then wrote some executables of my own (taking no parameters) to test, and found that all of them work.
So I guess only /bin/sh or other shells work with argv set to NULL, while other programs (like /bin/ls) fail or behave unexpectedly.
The man page for the execve() system call says
The argv and envp arrays must each include a null pointer at the end of the array.
If your program doesn't conform to those requirements, there's no certainty of how well things will go from that point on. If it "works" for some programs, that's just bad luck.
The man page also says
By convention, the first of these strings (i.e., argv[0]) should contain the filename associated with the file being executed.
That convention is rather strong (mandated by POSIX), so programs that fail to do so can be considered buggy. It's probably a good idea for your main() to test it's been called correctly if you're going to rely on argv[0] so you can fail with a nice error message rather than a fault, but not all programs do.
Linux specifically treats argv=NULL or envp=NULL as a valid empty list, instead of returning -EFAULT (see Note 1). This is documented in the man page:
https://man7.org/linux/man-pages/man2/execve.2.html#NOTES.
This is not guaranteed by POSIX; some other OSes work this way but some don't.
Note 1: (Or if going through the libc wrapper, returning -1 and setting errno = EFAULT).
This is common in shellcode because it's already non-portable, and it saves machine-code bytes in the payload. (Or in what you have to construct as part of a ROP attack). But otherwise it's a very bad idea. The Linux man page strongly recommends against it only for portability reasons; it is actually guaranteed to work on Linux. (The execve itself; the program being invoked may be unhappy to find argc=0 and argv[0] == NULL, as well as an empty environment.)
Actual shells like /bin/sh do work if invoked this way, which is the other part of why it's viable for shellcode not to construct a {"/bin/sh", NULL} array of char* and pass a pointer to that.
As Toby explained, it's not totally surprising for /bin/ls to bail out in that case. Even if you had used execve("/bin/sh", (char*[]){NULL}, (char*[]){NULL}); to use execve portably, you'd still be starting the program with argc=0 exactly the way you are under Linux.
TL:DR: don't do it unless it's part of some code-size or other hacky reason for not doing it the normal way.

Argument treatment at command line in C

I was asked whether integer or float arguments supplied at the command prompt are still treated as strings in C. I am not sure about that; can anyone help a little? Is this true in C, and why? And what about other languages like Java or Python?
It is true, independent of the language, that the command line arguments to programs on Unix are strings. If the program chooses to interpret them as numbers, that is fine, but it is because the program (or programmer) chose to do so. Similarly, the runtime support for a language might alter the arguments passed by the o/s into integer or float types, but the o/s passes strings to that runtime (I know of no language that does this, but I don't claim to know all languages).
To see this, note that the ways to execute a program are the exec*() family of functions, and each of those takes a string which is the name of the program to be executed, and an array of strings which are the arguments to be passed to the program. Functions such as system() and popen() are built atop the exec*() family of functions — they also use fork(), of course. Even the posix_spawn() function takes an array of pointers to strings for the arguments of the spawned program.
It's not unlike mailing a letter without an envelope. We all agree to use the common enclosure known as an envelope. Operating systems pass parameters to programs using the common item known as a string of characters. It's beyond the scope of the operating system to understand what the program wants to do with the parameters.
There are some exceptions, one which comes to mind is the passing of parameters to a Linux Kernel Module. These can be passed as items other than strings.
Basically, this is an issue of creating an interface between the operating system and the program, any program. Remember that programs are not always written in C, and you don't even know whether the language has things like float or int.
You want to be able to pass several arguments (with natural delimiters), which may easily encode arbitrary information. In C, strings can be of arbitrary length, and the only constraint on them is that a zero byte signifies the end of the string. This is a highly flexible and natural way to pass arbitrary information to the program.
So you can never supply arbitrary integer/float arguments directly to a program; the operating system (Unix / Linux / Windows / etc.) won't let you. You don't have any tool that gives you that interface, in the same way that you can't pass a mouse-click as an argument. All you supply is a sequence of characters.
Since Unix and C were designed together, it is also part of the C programming language, and from there it worked its way to C++, Java, Python and most other modern programming languages, and the same way into Linux, Windows and most other operating systems.

wmain vs main C runtime

I have read a few articles about the different Windows C entry points, wmain and WinMain.
So, if I am correct, these are added to C language compilers for Windows. But how are they implemented?
For example, wmain gets Unicode as argv[], but it's the OS that sends these arguments to the program, so is there any special field in the .exe file entry which tells Windows to pass the arguments as Unicode? Thanks.
Modern versions of Windows internally use UTF-16. Therefore, when you launch an executable, all command line arguments likely are passed as UTF-16 from the onset, and the runtime library linked into the launched application either passes the arguments through unscathed (if using wmain) or converts them to the local encoding automatically (if using main). (Specifically this would be done by wmainCRTStartup/mainCRTStartup which are the actual entry points used for console Windows applications.)
First: a pedantic rant: wmain certainly does not get Unicode arguments. Unicode is defined independently of any particular encoding. wmain gets arguments in a 16 bit character encoding of Unicode, UTF-16 at a guess. I've just checked Microsoft's documentation on wmain and the links from it and it is clear that Microsoft had no clue about what Unicode is when they wrote it.
Anyway, the entry point of a program is defined by the linker. A C program always has a certain amount of prologue code that runs before main/wmain/WinMain/wWinMain. If one of the wide versions of main is used, the prologue code converts the characters in the environment from whatever character set they are in to the wide character version.

Using '__progname' instead of argv[0]

In the C / Unix environment I work in, I see some developers using __progname instead of argv[0] for usage messages. Is there some advantage to this? What's the difference between __progname and argv[0]. Is it portable?
__progname isn't standard and therefore not portable; prefer argv[0]. I suppose __progname could look up a string resource to get the name, which isn't dependent on the filename you ran it as. But argv[0] will give you the name they actually ran it as, which I would find more useful.
Using __progname allows you to alter the contents of the argv[] array while still maintaining the program name. Some of the common tools such as getopt() modify argv[] as they process the arguments.
For portability, you can copy argv[0] (for example with strcpy()) into your own progname buffer when your program starts.
There is also a GNU extension for this, so that one can access the program invocation name from outside of main() without saving it manually. One might be better off doing it manually, however; thus making it portable as opposed to relying on the GNU extension. Nevertheless, I here provide an excerpt from the available documentation.
From the on-line GNU C Library manual (accessed today):
"Many programs that don't read input from the terminal are designed to exit if any system call fails. By convention, the error message from such a program should start with the program's name, sans directories. You can find that name in the variable program_invocation_short_name; the full file name is stored the variable program_invocation_name.
Variable: char * program_invocation_name
This variable's value is the name that was used to invoke the program running in the current process. It is the same as argv[0]. Note that this is not necessarily a useful file name; often it contains no directory names.
Variable: char * program_invocation_short_name
This variable's value is the name that was used to invoke the program running in the current process, with directory names removed. (That is to say, it is the same as program_invocation_name minus everything up to the last slash, if any.)
The library initialization code sets up both of these variables before calling main.
Portability Note: These two variables are GNU extensions. If you want your program to work with non-GNU libraries, you must save the value of argv[0] in main, and then strip off the directory names yourself. We added these extensions to make it possible to write self-contained error-reporting subroutines that require no explicit cooperation from main."
I see at least two potential problems with argv[0].
First, argv[0] or argv itself may be NULL if the execve() caller was evil or careless enough. Calling execve("foobar", NULL, NULL) is usually an easy and fun way to prove to an overconfident programmer that his code is not sig11-proof.
It must also be noted that argv will not be defined outside of main() while __progname is usually defined as a global variable you can use from within your usage() function or even before main() is called (like non standard GCC constructors).
It's a BSDism, and definitely not portable.
__progname is just argv[0], and examples in other replies here show the weaknesses of using it. Although not portable either, I'm using readlink on /proc/self/exe (Linux, Android), and reading the contents of /proc/self/exefile (QNX).
If your program was run using, for instance, a symbolic link, argv[0] will contain the name of that link.
I'm guessing that __progname will contain the name of the actual program file.
In any case, argv[0] is defined by the C standard. __progname is not.
