SVN File descriptor corruption - c

I was looking at the source code of SVN and found this comment in the source code.
/* The following makes sure that file descriptors 0 (stdin), 1
(stdout) and 2 (stderr) will not be "reused", because if
e.g. file descriptor 2 would be reused when opening a file, a
write to stderr would write to that file and most likely
corrupt it. */
followed by this code:
if ((fstat(0, &st) == -1 && open("/dev/null", O_RDONLY) == -1) ||
(fstat(1, &st) == -1 && open("/dev/null", O_WRONLY) == -1) ||
(fstat(2, &st) == -1 && open("/dev/null", O_WRONLY) == -1))
{
if (error_stream)
fprintf(error_stream, "%s: error: cannot open '/dev/null'\n",
progname);
return EXIT_FAILURE;
}
I don't understand how "reusing" a file descriptor can corrupt a file. And how can you open a file with file descriptor 2 (stderr) ? And how opening a file can effect write to stderr ?

I'm afraid that I don't know about SVN-specific problems this code is designed to solve (it was committed more than 18 years ago). But I think that the below summary describes the potential problem at high level. So I assume that the part of code in question prevents this from happening:
Longstanding Unix practice dictates that applications are started with
the standard input, output, and error I/O streams on file descriptors
0, 1, and 2, respectively. The assumption that these file descriptors
will be properly set up is so strong that most developers never think
to check them. So interesting things can happen if an application is
run with one or more of the standard file descriptors closed.
Consider, for example, running a program with file descriptor 2
closed. The next file the program opens will be assigned that
descriptor. If something then causes the program to write to (what it
thinks is) the standard error stream, that output will, instead, go to
the other file which had been opened, probably corrupting that file. A
malicious user can easily make messes this way; when setuid programs
are involved, the potential consequences are worse.
Taken from https://lwn.net/Articles/347815/
PS Check svn blame for this part of code. It has a log message that describes the purpose of these checks.

Related

Can a file descriptor be duplicated multiple times?

I've been looking for quite a while and cannot find the answer to my question.
I'm trying to reproduce a shell in C, with full redirections. In order to do this, I wanted to open the file before executing my command.
For example, in ls > file1 > file2, I use dup2(file1_fd, 1) and dup2(file2_fd, 1) and then I execute ls to fill the files, but it seems a standard output can only be open once so only file2 will be filled, because it was the last one to be duplicated.
Is there a way to redirect standard output to multiple file?
Is there something I am missing? Thanks!
Is there a way to redirect standard output to multiple files?
Many file descriptors cannot be made one file descriptor. You need to write into each file descriptor separately. This is what tee utility does for you.
What you are asking for is the exact reason why the tee command exists (you can take a look at its source code here).
You cannot duplicate a file descriptor using dup2() multiple times. As you already saw, the last one overwrites any previous duplication. Therefore you cannot redirect the output of a program to multiple files directly using dup2().
In order to do this, you really need multiple descriptors, and therefore you would have to open both files, launch the command using popen() and then read from the pipe and write to both files.
Here is a very simple example of how you could do it:
#include <stdio.h>
#include <stdlib.h>
#define N 4096
int main(int argc, const char *argv[]) {
FILE *fp1, *fp2, *pipe;
fp1 = fopen("out1.txt", "w");
if (fp1 == NULL) {
perror("fopen out1 failed");
return 1;
}
fp2 = fopen("out2.txt", "w");
if (fp2 == NULL) {
perror("fopen out2 failed");
return 1;
}
// Run `ls -l` just as an example.
pipe = popen("ls -l", "r");
if (pipe == NULL) {
perror("popen failed");
return 1;
}
size_t nread, nwrote;
char buf[N];
while ((nread = fread(buf, 1, N, pipe))) {
nwrote = 0;
while (nwrote < nread)
nwrote += fwrite(buf + nwrote, 1, nread - nwrote, fp1);
nwrote = 0;
while (nwrote < nread)
nwrote += fwrite(buf + nwrote, 1, nread - nwrote, fp2);
}
pclose(pipe);
fclose(fp2);
fclose(fp1);
return 0;
}
The above code is only to give a rough estimate on how the whole thing works, it doesn't check for some errors on fread, fwrite, etc: you should of course check for errors in your final program.
It's also easy to see how this could be extended to support an arbitrary number of output files (just using an array of FILE *).
Standard output is not different from any other open file, the only special characteristic is for it to be file descriptor 1 (so only one file descriptor with index 1 can be in your process) You can dup(2) file descriptor 1 to get, let´s say file descriptor 6. That's the mission of dup() just to get another file descriptor (with a different number) than the one you use as source, but for the same source. Dupped descriptors allow you to use any of the dupped descriptors indifferently to output, or to change open flags like close on exec flag or non block or append flag (not all are shared, I'm not sure which ones can be changed without affecting the others in a dup). They share the file pointer, so every write() you attempt to any of the file descriptors will be updated in the others.
But the idea of redirection is not that. A convention in unix says that every program will receive three descriptors already open from its parent process. So to use forking, first you need to consider how to write notation to express that a program will receive (already opened) more than one output stream (so you can redirect any of them properly, before calling the program) The same also applies for joining streams. Here, the problem is more complex, as you'll need to express how the data flows might be merged into one, and this makes the merging problem, problem dependant.
File dup()ping is not a way to make a file descriptor to write in two files... but the reverse, it is a way to make two different file descriptors to reference the same file.
The only way to do what you want is to duplicate write(2) calls on every file descriptor you are going to use.
As some answer has commented, tee(1) command allows you to fork the flow of data in a pipe, but not with file descriptors, tee(1) just opens a file, and write(2)s there all the input, in addition to write(2)`ing it to stdout also.
There's no provision to fork data flows in the shell, as there's no provision to join (in paralell) dataflows on input. I think this is some abandoned idea in the shell design by Steve Bourne, and you'll probably get to the same point.
BTW, just study the possibility of using the general dup2() operator, which is <n>&m>, but again, consider that, for the redirecting program, 2>&3 2>&4 2>&5 2>&6 mean that you have pre-opened 7 file descriptors, 0...6 in which stderr is an alias of descpritors 3 to 6 (so any data written to any of those descriptors will appear into what was stderr) or you can use 2<file_a 3<file_b 4<file_c meaning your program will be executed with file descriptor 2 (stderr) redirected from file_a, and file descriptors 3 and 4 already open from files file_b and file_c. Probably, some notation should be designed (and it doesn't come easily to my mind now, how to devise it) to allow for piping (with the pipe(2) system call) between different processes that have been launched to do some task, but you need to build a general graph to allow for generality.

Are stdin and stdout actually the same file?

I am completely confused, is it possible that stdin, stdout, and stderr point to the same filedescriptor internally?
Because it makes no difference in C if i want to read in a string from the console if I am using stdin as input or stdout.
read(1, buf, 200) works as read(0, buf, 200) how is this possible?
(0 == STDIN_FILENO == fileno(stdin),
1 == STDOUT_FILENO == fileno(stdout))
When the input comes from the console, and the output goes to the console, then all three indeed happen to refer to the same file. (But the console device has quite different implementations for reading and writing.)
Anyway, you should use stdin/stdout/stderr only for their intended purpose; otherwise, redirections like the following would not work:
<inputfile myprogram >outputfile
(Here, stdin and stdout refer to two different files, and stderr refers to the console.)
One thing that some people seem to be overlooking: read is the low-level system call. Its first argument is a Unix file descriptor, not a FILE* like stdin, stdout and stderr. You should be getting a compiler warning about this:
warning: passing argument 1 of ‘read’ makes integer from pointer without a cast [-Wint-conversion]
int r = read(stdout, buf, 200);
^~~~~~
On my system, it doesn't work with either stdin or stdout. read always returns -1, and errno is set to EBADF, which is "Bad file descriptor". It seems unlikely to me that those exact lines work on your system: the pointer would have to point to memory address 0, 1 or 2, which won't happen on a typical machine.
To use read, you need to pass it STDIN_FILENO, STDOUT_FILENO or STDERR_FILENO.
To use a FILE* like stdin, stdout or stderr, you need to use fread instead.
is it possible that stdin, stdout, and stderr point to the same filedescriptor internally?
A file descriptor is an index into the file descriptor table of your process (see also credentials(7)...). By definition STDIN_FILENO is 0, STDOUT_FILENO is 1, annd STDERR_FILENO is 2. Read about proc(5) to query information about some process (for example, try ls -l /proc/$$/fd in your interactive shell).
The program (usually, but not always, some shell) which has execve(2)-d your executable might have called dup2(2) to share (i.e. duplicate) some file descriptors.
See also fork(2), intro(2) and read some Linux programming book, such as the old ALP.
Notice that read(2) from STDOUT_FILENO could fail (e.g. with errno(3) being EBADF) in the (common) case where stdout is not readable (e.g. after redirection by the shell). If reading from the console, it could be readable. Read also the Tty Demystified.
There is nothing prohibiting any number of file-handles referring the same thing in the kernel.
And the default for a terminal-program is to have STDIN, STDOUT and STDERR refer to the same terminal.
So, it might look like it doesn't matter which you use, but it will all go wrong if the caller does any handle-redirection, which is quite common.
The most common is piping output from one program into the input of the next, but keeping stdout out of that.
An example for the shell:
source | filter | sink
Programs such as login and xterm typically open the tty device once when creating a new terminal session, and duplicate the file descriptor two or three times, arranging for file descriptors 0, 1 and 2 to be linked to the open file description of the opened tty device. They typically close all other file descriptors before exec-ing the shell. So if no further redirection is done by the shell or its child processes, the file descriptors, 0, 1 and 2, remain linked to the same file. Because the underlying tty device was opened in read-write mode, all three file descriptors have both read and write access.

How to reserve a file descriptor?

I'm writing a curses-based program. In order to make it simpler for me to find errors in this program, I would like to produce debug output. Due to the program already displaying a user interface on the terminal, I cannot put debugging output there.
Instead, I plan to write debugging output to file descriptor 3 unconditionally. You can invoke the program as program 3>/dev/ttyX with /dev/ttyX being a different teletype to see the debugging output. When file descriptor 3 is not opened, write calls fail with EBADF, which I ignore like all errors when writing debugging output.
A problem occurs when I open another file and no debugging output has been requested (i.e. file descriptor 3 has not been opened). In this case, the newly opened file might receive file descriptor 3, causing debugging output to randomly corrupt a file I just opened. This is a bad thing. How can I avoid this? Is there a portable way to mark a file descriptor as “reserved” or such?
Here are a couple of ideas I had and their problems:
I could open /dev/null or a temporary file to file descriptor 3 (e.g. by means of dup2()) before opening any other file. This works but I'm not sure if I can assume this to always succeed as opening /dev/null may not succeed.
I could test if file descriptor 3 is open and not write debugging output if it isn't. This is problematic when I'm attempting to restart the program by calling exec as a different file descriptor might have been opened (and not closed) prior to the exec call. I could intentionally close file descriptor 3 before calling exec when it has not been opened for debugging, but this feels really uggly.
Why use fd 3? Why not use fd 2 (stderr)? It already has a well-defined "I am logging of some sorts" meaning, is always (not true, but sufficiently true...) and you can redirect it before starting your binary, to get the logs where you want.
Another option would be to log messages to syslog, using the LOG_DEBUG level. This entails calling syslog() instead of a normal write function, but that's simply making the logging more explicit.
A simple way of checking if stderr has been redirected or is still pointing at the terminal is by using the isatty function (example code below):
#include <stdio.h>
#include <unistd.h>
int main(void) {
if (isatty(2)) {
printf("stderr is not redirected.\n");
} else {
printf("stderr seems to be redirected.\n");
}
}
In the very beginning of your program, open /dev/null and then assign it to file descriptor 3:
int fd = open ("/dev/null", O_WRONLY);
dup2(fd, 3);
This way, file descriptor 3 won't be taken.
Then, if needed, reuse dup2() to assign file descriptor 3 to your debugging output.
You claim you can't guarantee you can open /dev/null successfully, which is a little strange, but let's run with it. You should be able to use socketpair() to get a pair of FDs. You can then set the write end of the pair non-blocking, and dup2 it. You claim you are already ignoring errors on writes to this FD, so the data going in the bit-bucket won't bother you. You can of course close the other end of the socketpair.
Don't focus on a specific file descriptor value - you can't control it in a portable manner anyway. If you can control it at all. But you can use an environment variable to control debug output to a file:
int debugFD = getDebugFD();
...
int getDebugFD()
{
const char *debugFile = getenv( "DEBUG_FILE" );
if ( NULL == debugFile )
{
return( -1 );
}
int fd = open( debugFile, O_CREAT | O_APPEND | O_WRONLY, 0644 );
// error checking can be here
return( fd );
}
Now you can write your debug output to debugFD. I assume you know enough to make sure debugFD is visible where you need it, and also how to make sure it's initialized before trying to use it.
If you don't pass a DEBUG_FILE envval, you get an invalid file descriptor and your debug calls fail - presumably silently.

Redirecting of stdout in bash vs writing to file in c with fprintf (speed)

I am wondering which option is basically quicker.
What interests me the most is the mechanism of redirection. I suspect the file is opened at the start of the program ./program > file and is closed at the end. Hence every time a program outputs something it should be just written to a file, as simple as it sounds. Is it so? Then I guess both options should be comparable when it comes to speed.
Or maybe it is more complicated process since the operating system has to perform more operations?
There is no much difference between that options (except making file as a strict option reduces flexibility of your program).
To compare both approaches, let's check, what stays behind a magical entity FILE*:
So in both cases we have a FILE* object, a file descriptor fd - a gateway to an OS kernel and in-kernel infrastructure that provides access to files or user terminals, which should (unless libc has some special initializer for stdout or kernel specially handles files with fd = 1).
How does bash redirection work in compare with fopen()?
When bash redirects file:
fork() // new process is created
fd = open("file", ...) // open new file
close(1) // get rid of fd=1 pointing to /dev/pts device
dup2(fd, 1) // make fd=1 point to opened file
close(fd) // get rid of redundant fd
execve("a") // now "a" will have file as its stdout
// in a
stdout = fdopen(1, ...)
When you open file on your own:
fork() // new process is created
execve("a") // now "a" will have file as its stdout
stdout = fdopen(1, ...)
my_file = fopen("file", ...)
fd = open("file", ...)
my_file = fdopen(fd, ...)
So as you can see, the main bash difference is twiddling with file descriptors.
Yes, you are right. The speed will be identical. The only difference in the two cases is which program opens and closes the file. When you redirect it using shell, it is the shell that opens the file and makes the handle available as stdout to the program. When the program opens the file, well, the program opens the file. After that, the handle is a file handle in both the cases, so there should be absolutely no difference in speed.
As a side remark, the program which writes to stdout can be used in more general ways. You can for example say
./program | ssh remotehost bash -c "cat > file"
which will cause the output of the program to be written to file on remotehost. Of course in this case there is no comparison like one you are making in the question.
stdout is a FILE handle, fprintf writes to a file handle, so the speed will be very similar in both cases. In fact printf("Some string") is equivalent to fprintf(stdout, "Some string"). I will say no more :)

where stdin / stdout created

In c( ansi ) , we say input taken by (s/v/f)scanf and stored in stdin , same as we say
stdout . I wonder, in linux ( unix ) where are they reside, under which folder .
Or they ( stdin / stdout ) are arbitrary ( that is, no such things exist )
They are streams created for your process by the operating system. There is no named file object associated with them, and so they do not have a representation within the file system, although as unwind points out, they may be accessed via a pseudo file system if your UNIX variant supports such a thing.
stdin is a FILE * referring to the stdio (standard io) structure that is tied to the file descriptor 0. File descriptors are what Unix-like systems, such as Linux, use to talk with applications about particular file-like things. (Actually, I'm pretty sure that Windows does this as well).
File descriptor 0 may refer to any type of file, but to make sense it must be one that read can be called on (it must be a regular file, a steam socket, or a character device opened for reading or the read side of a pipe, as opposed to a directory file, data gram socket, or a block device).
Processes in Unix-like systems inherit their open file descriptors from their parent process in Unix-like systems. So to run a program with stdin set to something besides the parent's stdin you would do:
int new_stdin = open("new_stdin_file, O_RDONLY);
pid_t fk = fork();
if (!fk) { // in the child
dup2(new_stdin, 0);
close(new_stdin);
execl("program_name", "program_name", NULL);
_exit(127); // should not have gotten here, and calling exit (without _ ) can have
// side effects because it runs atexit registered functions, and we
// don't want that here
} else if (fk < 0) {
// in parent with error from fork
} else {
// in parent with no error so fk = pid of child
}
close(new_stdin); // we don't need this anymore
dup2 duplicates the first file descriptor argument as the second (closing the second before doing so if it were open for the current process).
fork creates a duplicate of the current process. execl is one of the exec family of functions, which use the execve system call to replace the current program with another program. The combination of fork and exec are how programs are generally run (even when hidden within other functions).
In the above example we could have run the new program with stdin set to the read end of a pipe, a tty (serial port / TeleTYpe), or several other things. Some of these have names present in the filesystem and others do not (like some pipes and sockets, though some do have names in the filesystem).
Linux makes /proc/self/fd/0 a symbolic link to the file opened as 0 in the current process. /proc/%i/fd/0, pid would represent the symbolic link to the same thing for an arbitrary pid (process ID) using the printf syntax. These symbolic links are often usable to find the real file in the filesystem (using the readlink system call), but if the file does not actually exist in the filesystem the link data (what would usually be a file name) instead is just a string that tells a little bit about the file.
I should point out here that a file that stdin (fd 0) refers to, even if it is in the filesystem, may not have just one name. It may have more than one hard link, so it would have more than one name -- and each of these would be just as much its name as any other hard link. Additionally it may have no name at all if all of its hard links have been unlinked since it was opened, though it's data would still live on the disk until all open file descriptors for it are closed.
If you don't actually need to know where it is in the filesystem, but just want some data about it you can use the fstat system call. This is like the stat system call and command line utility, except for already open files.
Everything I said here about stdin (fd 0) should be applicable to stdout (fd 1) and stderr (fd 2) except that they will both be writable rather than readable.
If you want to know more about any of the functions I mentioned be sure to look them up in the man pages by typing:
man fork
on the command line. Most functions I mentioned are in section 2 of the man pages, but one or two may be in section one, so man 2 fork will work too, and may be useful when a command line tool has the same name as a function.
In Linux, you can generally find stdin through the /proc file system in /proc/self/fd/0, and stdout is /proc/self/fd/1.
stdin is standard input - for example, keyboard input.
stdout is standard output - for example, monitor.
For more info, read this.
If you run:
./myprog < /etc/passwd
then stdin exists in the filesystem as /etc/passwd. If you just run
./myprog
interactively on a terminal, then stdin exists in the filesystem as whatever your terminal device is (probably /dev/pts/5 or something).
If you run
cat /etc/passwd | ./myprog
then stdin is an anonymous pipe and has no instantiation in the filesystem, but Linux allows you to get at it via /proc/12345/fd/0 where 12345 is the pid of myprog.

Resources