I was reading DJB's "Some thoughts on security after ten years of Qmail 1.0" and he listed this function for moving a file descriptor:
int fd_move(to,from)
int to;
int from;
{
if (to == from) return 0;
if (fd_copy(to,from) == -1) return -1;
close(from);
return 0;
}
It occurred to me that this code does not check the return value of close, so I read the man page for close(2), and it seems it can fail with EINTR, in which case the appropriate behavior would seem to be to call close again with the same argument.
Since this code was written by someone with far more experience than I in both C and UNIX, and additionally has stood unchanged in qmail for over a decade, I assume there must be some nuance that I'm missing that makes this code correct. Can anyone explain that nuance to me?
I've got two answers:
He was trying to make a point about factoring out common code and often such examples omit error checking for brevity and clarity.
close(2) may return EINTER, but does it in practice, and if so what would you reasonably do? Retry once? Retry until success? What if you get EIO? That could mean almost anything, so you really have no reasonable recourse except logging it and moving on. If you retry after an EIO, you might get EBADF, then what? Assume that the descriptor is closed and move on?
Every system call can return EINTR, escpecially one that blocks like read(2) waiting on a slow human. This is a more likely scenario and a good "get input from terminal" routine will indeed check for this. That also means that write(2) can fail, even when writing a log file. Do you try to log the error that the logger generated or should you just give up?
When a file descriptor is dup'd, as it is in the fd_copy or dup2 function, you will end up with more than one file descriptor referring to the same thing (i.e. the same struct file in the kernel). Closing one of them will simply decrement its reference count. No operation is performed on the underlying object unless it is the last close. As a result, conditions such as EINTR and EIO are not possible.
Another possibility is that his function is used only in an application (or a part of one) which has done something to ensure that the call will not be interrupted by a signal. If you aren't going to do anything important with signals, then you don't have to be responsive to them, and it might make sense to mask them all out, rather than wrap every single blocking system call in an EINTR retry. Except of course the ones that will kill you, so SIGKILL and frequently SIGPIPE if you handle it by quitting, along with SIGSEGV and similar fatal errors which will in any case never be delivered to a correct user-space app.
Anyway, if all he's talking about is security, then quite possibly he doesn't have to retry close. If close failed with EIO, then he would not be able to retry it, it would be a permanent failure. Therefore, it is not necessary for the correctness of his program that close succeeds. It may well be that it is not necessary for the correctness of his program that close be retried on EINTR, either.
Usually you want your program to make a best effort to succeed, and that means retrying on EINTR. But this is a separate concern from security. If your program is designed so that some function failing for any reason isn't a security flaw, then in particular the fact that it happens to have failed EINTR, rather than for a permanent reason, isn't a flaw. DJB has been known to be fairly opinionated, so I would not be at all surprised if he has proved some reason why he doesn't need to retry, and therefore doesn't bother, even if doing so would allow his program to succeed in flushing the handle in certain situations where maybe it currently fails (like being explicitly sent a harmless signal with kill by the user at a crucial moment).
Edit: it occurs to me that retrying on EINTR could potentially itself be a security flaw under certain conditions. It introduces a new behaviour to that section of code: it can loop indefinitely in response to a signal flood, where previously it would make one attempt to close and then return. I don't know for sure that this would cause qmail any problems (after all, close itself makes no guarantees how soon it will return). But if giving up after one attempt does make the code easier to analyse then it could plausibly be a smart move. Or not.
You might think that retrying prevents a DoS flaw, where a signal causes a spurious failure. But retrying allows another (more difficult) DoS flaw, where a signal flood causes an indefinite stall. In terms of binary "can this app be DoSed?", which is the kind of absolute security question DJB was interested in when he wrote qmail and djbdns, it makes no difference. If something can happen once, then normally that means it can happen many times.
Only broken unices ever return EINTR without you explicitly asking for it. The sane semantics for signal() enable restartable system calls ("BSD style"). When building a program on a system with the sysv semantics (interrupting signals), you should always replace calls to signal() with calls to bsd_signal(), which you can define yourself in terms of sigaction() if it doesn't exist.
It's further worth noting that no systems will return EINTR on signal receipt unless you have installed signal handlers. If the default action is left in place, or if the signal is set to no action, it's impossible for system calls to be interrupted.
Related
On my system (Ubuntu Linux, glibc), man page of a close call specifies several error return values it can return. It also says
Not checking the return value of close() is a common but nevertheless serious programming error.
and at the same time
Note that the return value should only be used for diagnostics. In particular close() should not be retried after an EINTR since this may cause a reused descriptor from another thread to be closed.
So I am not allowed to ignore the return value nor to retry the call.
Given that, how shall I handle the close() call failure?
If the error happened when I was writing something to the file, I am probably supposed to try to write the information somewhere else to avoid the data loss.
If I was only reading the file, can I just log the failure and continue the program pretending nothing happened? Are there any caveats, leak of file descriptors or whatever else?
In practice, close should never be retried on error, and the fd you passed to close is always invalid (closed) after close returns, regardless of whether an error occurred. In some cases, an error may indicate that data was lost (certain NFS setups) or unusual hardware conditions for devices (e.g. tape could not be rewound), so you may want to be cautious to avoid data loss, but you should never attempt to close the fd again.
In theory, POSIX was unclear in the past as to whether the fd remains open when close fails with EINTR, and systems disagreed. Since it's important to know the state (otherwise you have either fd leaks or double-close bugs which are extremely dangerous in multithreaded programs), the resolution to Austin Group issue #529 specified the behavior strictly for future versions of POSIX, that EINTR means the fd remains open. This is the right behavior consistent with the definition of EINTR elsewhere, but Linux refuses to accept it. (FWIW there's an easy workaround for this that's possible at the libc syscall wrapper level; see glibc PR #14627.) Fortunately it never arises in practice anyway.
Some related questions you might find informative:
What are the reasons to check for error on close()?
Trying to make close sleep on Linux
First of all: EINTR means exactly that: System call was interrupted, if this happens on a close() call, there is exactly nothing you can do.
Apart from maybe keeping track of the fact, that if the fd belonged to a file, this file is possibly corrupt, there is not much you can do about errors on close() at all - depending on the return value. AFAIK the only case, where a close can be retried is on EBUSY, but I have yet to see that.
So:
Not checking the result of close() might mean that you miss file corruption, especially truncation.
Depending on the error, most of the time you can do nothing - a failed close() just means something has gone awfully wrong outside the scope of your application.
I felt at peace with Posix after many years of experience.
Then I read this message from Linus Torvalds, circa 2002:
int ret;
do {
ret = close(fd);
} while(ret == -1 && errno != EBADF);
NO.
The above is
(a) not portable
(b) not current practice
The "not portable" part comes from the fact that (as somebody pointed
out), a threaded environment in which the kernel does close the FD
on errors, the FD may have been validly re-used (by the kernel) for
some other thread, and closing the FD a second time is a BUG.
Not only is looping until EBADF unportable, but any loop is, due to a race condition that I probably would have noticed if I hadn't "made peace" by taking such things for granted.
However, in the GCC C++ standard library implementation, basic_file_stdio.cc, we have
do
__err = fclose(_M_cfile);
while (__err && errno == EINTR);
The primary target for this library is Linux, but it seems not to be heeding Linus.
As far as I've come to understand, EINTR happens only after a system call blocks, which implies that the kernel received the request to free the descriptor before commencing whatever work got interrupted. So there's no need to loop. Indeed, the SA_RESTART signal behavior does not apply to close and generate such a loop by default, precisely because it is unsafe.
This is a standard library bug then, right? On every file ever closed by a C++ application.
EDIT: To avoid causing too much alarm before some guru comes along with an answer, I should note that close only seems to be allowed to block under specific circumstances, perhaps none of which ever apply to regular files. I'm not clear on all the details, but you should not see EINTR from close without opting into something by fcntl or setsockopt. Nevertheless the possibility makes generic library code more dangerous.
With respect to POSIX, R..'s answer to a related question is very clear and concise: close() is a non-restartable special case, and no loop should be used.
This was surprising to me, so I decided to describe my findings, followed by my conclusions and chosen solution at end.
This is not really an answer. Consider this more like the opinion of a fellow programmer, including the reasoning behind that opinion.
POSIX.1-2001 and POSIX.1-2008 describe three possible errno values that may occur: EBADF, EINTR, and EIO. The descriptor state after EINTR and EIO is "unspecified", which means it may or may not have been closed. EBADF indicates fd is not a valid descriptor. In other words, POSIX.1 clearly recommends using
if (close(fd) == -1) {
/* An error occurred, see 'errno'. */
}
without any retry looping to close file descriptors.
(Even the Austin Group defect #519 R.. mentioned, does not help with recovering from close() errors: it leaves it unspecified whether any I/O is possible after an EINTR error, even if the descriptor itself is left open.)
For Linux, the close() syscall is defined in fs/open.c, with __do_close() in fs/file.c managing the descriptor table locking, and filp_close() back in fs/open.c taking care of the details.
In summary, the descriptor entry is removed from the table unconditionally first, followed by filesystem-specific flushing (f_op->flush()), followed by notification (dnotify/fsnotify hook), and finally by removing any record or file locks. (Most local filesystems like ext2, ext3, ext4, xfs, bfs, tmpfs, and so on, do not have ->flush(), so given a valid descriptor, close() cannot fail. Only ecryptfs, exofs, fuse, cifs, and nfs have ->flush() handlers in Linux-3.13.6, as far as I can tell.)
This does mean that in Linux, if a write error occurs in the filesystem-specific ->flush() handler during close(), there is no way to retry; the file descriptor is always closed, just like Torvalds said.
The FreeBSD close() man page describes the exact same behaviour.
Neither the OpenBSD nor the Mac OS X close() man pages describe whether the descriptor is closed in case of errors, but I believe they share the FreeBSD behaviour.
It seems clear to me that no loop is necessary or required to close a file descriptor safely. However, close() may still return an error.
errno == EBADF indicates the file descriptor was already closed. If my code encounters this unexpectedly, to me it indicates there is a significant fault in the code logic, and the process should gracefully exit; I'd rather my processes die than produce garbage.
Any other errno values indicate an error in finalizing the file state. In Linux, it is definitely an error related to flushing any remaining data to the actual storage. In particular, I can imagine ENOMEM in case there is no room to buffer the data, EIO if the data could not be sent or written to the actual device or media, EPIPE if connection to the storage was lost, ENOSPC if the storage is already full with no reservation to the unflushed data, and so on. If the file is a log file, I'd have the process report the failure and exit gracefully. If the file contents are still available in memory, I would remove (unlink) the entire file, and retry. Otherwise I'd report the failure to the user.
(Remember that in Linux and FreeBSD, you do not "leak" file descriptors in the error case; they are guaranteed to be closed even if an error occurs. I am assuming all other operating systems I might use behave the same way.)
The helper function I'll use from now on will be something like
#include <unistd.h>
#include <errno.h>
/**
* closefd - close file descriptor and return error (errno) code
*
* #descriptor: file descriptor to close
*
* Actual errno will stay unmodified.
*/
static int closefd(const int descriptor)
{
int saved_errno, result;
if (descriptor == -1)
return EBADF;
saved_errno = errno;
result = close(descriptor);
if (result == -1)
result = errno;
errno = saved_errno;
return result;
}
I know the above is safe on Linux and FreeBSD, and I assume it is safe on all other POSIX-y systems. If I encounter one that is not, I can simply replace the above with a custom version, wrapping it in a suitable #ifdef for that OS. The reason this maintains errno unchanged is just a quirk of my coding style; it makes short-circuiting error paths shorter (less repeated code).
If I am closing a file that contains important user information, I will do a fsync() or fdatasync() on it prior to closing. This ensures the data hits the storage, but also causes a delay compared to normal operation; therefore I won't do it for ordinary data files.
Unless I will be unlink()ing the closed file, I will check closefd() return value, and act accordingly. If I can easily retry, I will, but at most once or twice. For log files and generated/streamed files, I only warn the user.
I want to remind anyone reading this far that we cannot make anything completely reliable; it is just not possible. What we can do, and in my opinion should do, is to detect when an error occurs, as reliably as we can. If we can easily and with neglible resource use retry, we should. In all cases, we should make sure the notification (about the error) is propagated to the actual human user. Let the human worry about whether some other action, possibly complex, needs to be done before the operation is retried. After all, a lot of tools are used only as a part of a larger task, and the best course of action usually depends on that larger task.
Obviously, it's good practice. That goes without saying. I see it every time in example code (like socket(), fork(), or malloc(), to name a few). I know to do it, I just don't understand the why of it so much. Are they prone to failing often? Is it because system calls are made in kernel mode? What's the reasoning behind it?
I presume you are asking why code that calls these routines checks the results to determine whether an error occurred.
Each of the routines you cite, socket, fork, and malloc, requires resources. Those resources may be unavailable either because the calling process has exceeded limits set by the system administrator or the user or because the system has exhausted the resources it has and cannot provide any more to processes. Therefore, it is possible, even if not frequent, that a call to one of these routines will return failure. So a calling process should check for failure.
Additionally, in some implementations, some system routines (such as read and write) can be interrupted if a signal is delivered to the process before the operation completed. (When a signal arrives, it is considered important, and it is desirable to deliver it to the process immediately rather than wait for a potentially long operation to complete. So the operation is interrupted, the signal is delivered, the process may handle the signal and return from the signal handler. Then control is returned to the code that called the original routine, and that code must be informed that the operation was interrupted.) This interruption results in returning failure with an error status indicating the operation was interrupted.
Always, if only..
Way back when as a C function could only return an integer, and exceptions were science fiction, they came up with the idea of returning either success or a code that provided a clue as to what had gone wrong. It became a convention.
Depends on what you call a failure.
Something like opening a file (given the developer can be bothered) are relatively easy to deal with, File not found for instance. Malloc, is a bit more difficult to take some remedial action.
The key point though is to check as near to the error as possible. If you don't, you find that the file you wanted to open and append to didn't exist 10,000 lines of code later, when you try and write the results of your extensive computation to it and get say an access violation.
Basically this stuff is the reason exceptions were invented. Checking the return value is "optional", swallowing an exception is explicit.
example:
FILE *fp;
fp = fopen("c:\\removedDirectory\nonexistingFile.txt", "r")//returns NULL
if(fp != NULL)
{
//stuff here will fail if fp == NULL
}
If you do not check output of fopen, (replace with any function that returns an error) and fp is NULL, the subsequent functions depending on a real file stream will not work.
To be clear, this is a design rather than an implementation question
I want to know the rationale behind why POSIX behaves this way. POSIX system calls when given an invalid memory location return EFAULT rather than crashing the userspace program (by sending a sigsegv), which makes their behavior inconsistent with userspace functions.
Why? Doesn't this just hide memory bugs? Is it a historical mistake or is there a good reason for it?
Because system calls are executed by the kernel, not by the user program --- when the system call occurs, the user process halts and waits for the kernel to finish.
The kernel itself, of course, isn't allowed to seg fault, so it has to manually check all the address areas the user process gives it. If one of these checks fails, the system call fails with EFAULT. So in this situation a segmentation fault hasn't actually happening --- it's been avoided by the kernel explicitly checking to make sure all the addresses are valid. Hence it makes sense that no signal is sent.
In addition, if a signal were sent, there'd be no way the kernel could attach a meaningful program counter to the signal, the user process isn't actually executing when the system call is running. This means there'd be no way for the user process to produce decent diagnostics, restart the failed instruction, etc.
To summarise: mostly historical, but there is actual logic to the reasoning. Like EINTR, this doesn't make it any less irritating to deal with.
Well, what would you want to happen. A system call is a request to the system. If you ask: "when does the ferry to Munchen leave?" would you like the program to crash, or to get return = -1 with errno = ENOHARBOR ? If you ask the sytem to put your car into your handbag, would you like to have your handbag destroyed, or a return of -1 with errno set to EBAGTOOSMALL ?
There is a technical detail: before or after syscalls,arguments to/from user/system -land have to be converted (copied) when entering/leaving the system call. Mostly for security reasons the system is very reluctant to write into user-space. (Linux has a copy_to_user_space function for this (and vice versa), which checks the credentials before doing the actual copying)
Why? Doesn't this just hide memory bugs?`
On the contrary. It allows your program to handle the error(impossible in this case), and terminate gracefully. But the program must check the return value from system calls and inspect errno. In the case of SIGSEGVE, there is very little for your program to do, so mapping EINVAL to SIGSEGVE would be a bad idea.
Systemcalls were designed to always return (or block indefinitely...), whether they succeed or fail.
And a technical aspect could be that {segmentation faults, buserror, floating point exception, ...} are (often) generated by hardware interrupts.
What is the behavior of the select(2) function when a file descriptor it is watching for reading is closed by another thread?
From some cursory testing, it does return right away. I suspect the outcome is either that (a) it still continues to wait for data, but if you actually tried to read from it you'd get EBADF (possibly -- there's a potential race) or (b) that it pretends as though the file descriptor were never passed in. If the latter case is true, passing in a single fd with no timeout would cause a deadlock if it were closed.
From some additional investigation, it appears that both dwc and bothie are right.
bothie's answer to the question boils down to: it's undefined behavior. That doesn't mean that it's unpredictable necessarily, but that different OSes do it differently. It would appear that systems like Solaris and HP-UX return from select(2) in this case, but Linux does not based on this post to the linux-kernel mailing list from 2001.
The argument on the linux-kernel mailing list is essentially that it is undefined (and broken) behavior to rely upon. In Linux's case, calling close(2) on the file descriptor effectively decrements a reference count on it. Since there is a select(2) call also with a reference to it, the fd will remain open and waiting for input until the select(2) returns. This is basically dwc's answer. You will get an event on the file descriptor and then it'll be closed. Trying to read from it will result in a EBADF, assuming the fd hasn't been recycled. (A concern that MarkR made in his answer, although I think it's probably avoidable in most cases with proper synchronization.)
So thank you all for the help.
I would expect that it would behave as if the end-of-file had been reached, that's to say, it would return with the file descriptor shown as ready but any attempt to read it subsequently would return "bad file descriptor".
Having said that, doing that is very bad practice anyway, as you'd always have potential race conditions as another file descriptor with the same number could be opened by yet another thread immediately after the other 2nd closed it, then the selecting thread would end up waiting on the wrong one.
As soon as you close a file, its number becomes available for reuse, and may get reused by the next call to open(), socket() etc, even if by another thread. Therefore you really, really need to avoid this kind of thing.
The select system call is a way to wait for file desctriptors to change state while the programs doesn't have anything else to do. The main use is for server applications, which open a bunch of file descriptors and then wait for anything to do on them (accept new connections, read requests or send the responses). Those file descriptors will be opened in non-blocking io mode such that the server process won't hang in a syscall at any times.
This additionally means, there is no need for separate threads, because all the work, that could be done in the thread can be done prior to the select call as well. And if the work takes long, than it can be interrupted, select being called with timeout={0,0}, the file descriptors get handled and afterwards the work is being resumed.
Now, you close a file descriptor in another thread. Why do you have that extra thread at all, and why shall it close the file descriptor?
The POSIX standard doesn't provide any hints, what happens in this case, so what you're doing is UNDEFINED BEHAVIOR. Expect that the result will be very different between different operating systems and even between version of the same OS.
Regards, Bodo
It's a little confusing what you're asking...
Select() should return upon an "interesting" change. If the close() merely decremented the reference count and the file was still open for writing somewhere then there's no reason for select() to wake up.
If the other thread did close() on the only open descriptor then it gets more interesting, but I'd need to see a simple version of the code to see if something's really wrong.