How to safely open regular files without denial-of-service vulnerability? - c

The flag O_DIRECTORY can be used with the syscalls open(2) and openat(2) to avoid denial-of-service vulnerabilities when opening directories. However: How can I avoid the same kind of race conditions for regular files?
Some background information: I am trying to develop some kind of backup tool. The program walks over a directory tree, reads all regular files and only stats other files. If I first call fstatat(2) for each directory entry, test the result for regular files and open them with openat(2), then there is a race condition between the syscalls. An attacker could replace the regular file with a FIFO, and my program would hang on the FIFO.
How can I avoid this race condition? For directories, there is O_DIRECTORY, for symbolic links, O_PATH can be used. However, I have found no solution for regular files. I only need a solution that works on recent Linux versions.

If your only concern is FIFOs, O_NONBLOCK will prevent blocking and allow you to open a FIFO even if it has no writers (see http://pubs.opengroup.org/onlinepubs/9699919799/functions/open.html for where this is specified). However, there are also a few other concerns:
Device nodes
Fake files in Linux /proc with bad properties
...
Since these normally can't be created in arbitrary locations by non-root users, O_NOFOLLOW should be sufficient to avoid following symlinks to them.
With that said, on modern Linux there is an even safer solution: perform the initial open with O_PATH|O_NOFOLLOW, then perform stat on /proc/self/fd/%d to check the file type. You can then open /proc/self/fd/%d and be completely certain it corresponds to the same file you just stat'd.
Note that on sufficiently new Linux, you don't need to use /proc/self/fd/%d to reach the file to which you obtained an inode handle with O_PATH. You can use fstat and openat on it directly to "stat" it and get a descriptor to a real open file description, respectively. However, O_PATH file descriptors had a lot of broken or unimplemented corner cases like this from late 2.6.x (when they were first added) to 3.8 or so, and I find the /proc method the most reliable. Of course, you could always try the direct method and fall back to /proc if it fails.
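A minimal sketch of the /proc-based method described above, assuming Linux with /proc mounted. The helper name open_regular_at and the choice of EINVAL for "not a regular file" are illustrative assumptions, not part of the original answer.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* needed for O_PATH in glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: open `name` (relative to `dirfd`) for reading only
 * if it is a regular file. Returns a descriptor, or -1 with errno set. */
int open_regular_at(int dirfd, const char *name)
{
    /* O_PATH only obtains a handle to the inode; nothing is really
     * opened yet, so a FIFO or device node cannot make this call hang. */
    int pfd = openat(dirfd, name, O_PATH | O_NOFOLLOW | O_CLOEXEC);
    if (pfd < 0)
        return -1;

    char proc[64];
    snprintf(proc, sizeof proc, "/proc/self/fd/%d", pfd);

    /* Check the type of the inode that pfd is pinned to. */
    struct stat st;
    if (stat(proc, &st) < 0) {
        int saved = errno;
        close(pfd);
        errno = saved;
        return -1;
    }
    if (!S_ISREG(st.st_mode)) {
        close(pfd);
        errno = EINVAL;        /* assumption: any error code would do here */
        return -1;
    }

    /* Reopen the same inode through /proc to get a readable descriptor. */
    int fd = open(proc, O_RDONLY | O_CLOEXEC);
    int saved = errno;
    close(pfd);
    errno = saved;
    return fd;
}
```

A call such as open_regular_at(AT_FDCWD, "somefile") would then either return a descriptor that is known to refer to a regular file, or fail without ever opening a FIFO or device.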

Open with O_RDONLY|O_NONBLOCK, check that the result isn't -1, then do an fstat() on the resulting file descriptor and compare st_mode (and possibly st_dev and st_ino) with what you expected.
Remember to set the AT_SYMLINK_NOFOLLOW flag on your fstatat.

Related

What is the fastest way to detect file size is not zero without knowing the file descriptor?

To explain shortly why I need this,
I am currently doing the detection with stat(2). I don't have control over the file descriptor (it may get used up by some other thread, as my code is being injected to replace syscalls), so I can't use fstat(2) (which is faster). I need to do this check many times, so is there a faster way to do the same thing?
I am checking the same file in different processes which do not have a parent child relation.
You should probably benchmark it for yourself.
I've measured
//Real-time System-time
272.58 ns(R) 170.11 ns(S) //lseek
366.44 ns(R) 366.28 ns(S) //fstat
812.77 ns(R) 711.69 ns(S) //stat("/etc/profile",&sb)
on my Linux laptop. It fluctuates a little between runs but lseek is usually a bunch of ns faster than fstat, but you also need a fd for it and opening is quite expensive at about 1.6µs, so stat is probably the best choice for your case.
As tom-karzes has noted, the cost of stat should depend on the number of directory components in the path. I tried it on a PATH_MAX-long "/foo/foo/.../foo" directory, and there I'm getting about 80µs.
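For reference, the stat-based check being measured is roughly the following. This is only a sketch, and the function name has_nonzero_size is hypothetical.

```c
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical helper: true if `path` can be stat'ed and its size is
 * greater than zero. A missing or inaccessible file yields false. */
bool has_nonzero_size(const char *path)
{
    struct stat sb;
    return stat(path, &sb) == 0 && sb.st_size > 0;
}
```

No file descriptor is involved, so nothing another thread does to the descriptor table can interfere with it.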
The most efficient approach, knowing the filesystem you are searching in, is to open the block device associated and search (block by block) the inode table, and check the actual size from the inodes there (open the block device, so you get the inodes from the in-memory images, and not from the disk). This allows you to get all the zero length inodes of a filesystem in a quick and dirty way. The drawback is that you first need to get the info of the filesystem, and then to access the block device directly, which is normally forbidden for a non-root process. After that, you have to search the filesystem to get the names of the files involved, just in case you need to do something on those files.
By the way, your assumption that you cannot use fstat(2) on a file descriptor shared with another thread is wrong: the fstat system call operates on an open file descriptor and doesn't do anything to the file (it's nonblocking), and the system guarantees that the inode is locked while the stat structure is being accessed.
The approach of using lseek(2) is not valid in this case, because it actually moves the file pointer to the end of file, and then back to the saved place, and this requires two system calls to do and undo the move, and there are many race scenarios that can happen if another thread uses another system call (does a write(2), between the two) while you have the file pointer at another place.
Unix (including all POSIX systems: Linux, BSD, etc.) guarantees that a nonblocking system call (as stat(2) is) is atomic in nature, locking the inode of the file while the process (or thread) is executing the system call. So no other thread can be using the file while your stat(2) system call is getting the data. Even for blocking calls, Unix guarantees that a different system call made on the same descriptor will be queued, and the process/thread will have to wait until the stat(2) syscall ends.
The problem with stat(2) is that it has to resolve all the path elements until it gets to the final inode of the file (this is where the length of the file is stored), and this is done one element at a time. Until it reaches the final inode, no lock is taken on it (indeed, it is unknown until we get to it, so we cannot lock it until the namei() resolution finishes), and from then on it proceeds as fstat(2) does.
CONCLUSION
Use fstat(2) on the other thread's file descriptor without fear of data corruption; it cannot happen. Don't hesitate to do this, as nothing is going to happen to the inode of the file while you are gathering the stat info.

Is there an equivalent of O_TMPFILE for directories?

Ideally, I want to have a directory that is not visible in the filesystem and that will be automatically removed when its last open file descriptor is closed. Its contents would only be accessible through openat(), fstatat(), etc.
For regular files, this behaviour is achieved by giving the O_TMPFILE flag to open(). However, mkdir() doesn't have a flags parameter.
Assuming I have the latest linux kernel available, is this possible?
I'm not aware of any way to do this, and don't expect it to be possible. Unlike files, which can have zero or more pathnames (due to hard links and unlinked files), directories have exactly one pathname, and it would probably break some valid application usage if the OS did not meet this expectation.

Validation of file path before actually opening the file

Assuming Linux, or more generally a sufficiently POSIX compliant system, is there a ready made method of checking if opening a file with a given name would succeed? Most optimistically I am seeking an implementation of a function with the same prototype as open(2)
int test_open(const char *pathname, int flags);
which would return result according to anticipated success or failure of open(2) system call with the same parameters, but without actually creating or opening any file. It should be suitably licensed (reusable in proprietary software project) open source.
The open(2) manual page lists many reasons for open(2) failing. One errno value can decode multiple reasons, and the errno is different between Linux and POSIX. But nevertheless roughly speaking:
I think in general the following cases, itemized by errno, are most relevant: EACCES, EEXIST, ENOENT, EISDIR, ENOTDIR (both POSIX and Linux).
Less important: ELOOP, EMFILE, ENFILE, ENAMETOOLONG, ENODEV, ENXIO, EOVERFLOW, EPERM, EROFS, ETXTBSY, EWOULDBLOCK (POSIX adds EAGAIN).
Irrelevant (more transient conditions): ENOMEM, EINTR, ENOSPC (POSIX adds EIO, ENOSR).
(I am now unable to quickly find online POSIX manual page for open(), I am personally referring to POSIX manual pages installed in my Linux machine - I will edit the question when I find online link.)
Background and Expectations: My application/system configuration architecture mandates that an input value need to be validated before storing it permanently. Only after the validation and storage steps are performed, is the file going to be used for writing. Accepting bad values would be huge inconvenience (also trying to actually change to use bad file path would disturb the operation). I cannot or do not want to make exception for this special case (it is just one of over a hundred of configuration values).
I would prefer to not introduce side effects for the validation by creating a file (the flags for open() include O_CREAT). It is obvious that the check I am seeking for cannot be implemented 100% reliably in the most general case, which is the underlying reason for my categorizing the possible error conditions into three groups. We could have a very educated guess by analyzing the directory permissions, existence of directories, and whether there is already something with the same name which hinders opening the file, and whether the file name makes sense (my group 1 conditions). (Group 2 checks for number of symbolic links, file descriptor limits, name length limit, O_NOATIME permission, writability of the file system, and maybe EWOULDBLOCK and POSIX EAGAIN cases could be done but they are more cumbersome and probably less portable to do, and are expected to be less likely to happen unless evil input, which were the reasons for categorizing them less important).
P.S. I added tag c which is my programming language now, but the language is not very relevant.
There is no fail-proof way to do that, because (as Jite commented) some other process could have changed the environment (e.g. removed the parent directory, or filled up the filesystem, exceeded the disk quota, ....) between your test_open and the further open or creat syscall. Or the disk (or the media containing the filesystem, e.g. some USB stick) could have burned or have been unplugged.
The good practice is to check the result of open and use errno when it has failed.
You could use access to check a few things before. But since there is no fail-proof way, why bother?
You might validate the directory part of your file path using the realpath(3) function .... But even that is useless: some other process could have created or deleted the directory between your test_open and the real open.
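The "check the result of open and use errno" practice can be sketched as follows. The wrapper name open_or_report and the message format are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical wrapper: perform the real open and, on failure, report
 * the reason taken from errno. Returns the descriptor or -1. */
int open_or_report(const char *pathname, int flags, mode_t mode)
{
    int fd = open(pathname, flags, mode);
    if (fd < 0) {
        int saved = errno;   /* fprintf may overwrite errno */
        fprintf(stderr, "cannot open %s: %s\n", pathname, strerror(saved));
        errno = saved;       /* leave errno intact for the caller */
    }
    return fd;
}
```

The caller can still branch on errno (for example ENOENT versus EACCES) after a failure, which is the information a separate pre-check would have tried to predict.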

Alternatives to using stat() to get file type?

Are there any alternatives to stat (which is found on most Unix systems) which can determine the file type? The manpage says that a call to stat is expensive, and I need to call it quite often in my app.
The alternative is fstat() if you already have the file open (so you have a file descriptor for it). Or lstat() if you want to find out about symbolic links rather than the file the symlink points to.
I think the man page is exaggerating the cost; it is not much worse than any other system call that has to resolve the name of the file into an inode. It is more costly than getpid(); it is less costly than open().
The "file type" that stat() gives you is whether the file is a regular file or something like a device file or directory, among other things like its size and inode number. If that's what you need to know, then you must use stat().
If what you actually need to know is the type of the file's contents -- e.g. text file, JPEG image, MP3 audio -- then you have two options. You can guess based on the filename extension (if it ends in ".mp3", the file probably contains MP3 audio), or you can use libmagic, which actually opens the file and reads some of its contents to figure out what it is. The libmagic approach is more expensive (if you're trying to avoid stat(), you probably want to avoid open() too), but less prone to error (in case that ".mp3" file is actually a JPEG image, for example).
Under Linux, with some filesystems, the file type (regular, char device, block device, directory, pipe, symlink, ...) is stored in the linux_dirent struct, which is the structure the kernel uses to hand directory entries to applications via the getdents system call. If the file type is the only thing you need from the stat structure, and you need it for all or many entries of a directory, you could use getdents directly (rather than readdir) and try to get the file type from there, calling stat only when linux_dirent reports an unknown file type. Depending on your application's filesystem usage pattern, this could be faster than using stat on Linux, but stat should be fast in many cases.
Stat's speed has mostly to do with locating the data that is being asked for on disk. If you are traversing a directory recursively stat-ing all of the files then each stat should end up being fairly quick overall because most of the work getting the data stat needs ends up cached before you ask the kernel for it by a previous call to stat. If on the other hand you stat the same number of files randomly distributed around the system then the kernel will likely have to read from disk several directories for each file you are going to call stat on.
fstat should always be very fast since the kernel should already have the data you're asking for in RAM, as it needs to access it for the file to be in the open state, and the kernel won't have to go through the trouble of traversing the path of the filename to see if each component is in RAM or on disk and possibly reading in a directory from disk (but likely not having to), only to discover that it has the data that you are asking for in RAM.
That being said, calling stat on an open file should be faster than calling it on an unopened file.
Are you aware of the "magic" file on *nix systems? By querying a file from the command line with something like file myfile.ext you can get the real file type.
This is done by reading the contents of the file rather than looking at its extension, and is widely used on *nix (Linux, Unix, ...) systems.
If your application is expected to run on Linux systems, why don't you try inotify(7)? It is definitely faster than stat-ing many files.

Open system call

I'm studying for my operating systems midterm and was wondering if I can get some help.
Can someone explain the checks and what the kernel does during the open() system call?
Thanks!
Very roughly, you can think of the following steps:
Translate the file name into an inode, which is the actual file system object describing the contents of the file, by traversing the filesystem data structures.
During this traversal, the kernel will check that you have sufficient access through the directory path to the file, and check access on the file itself. The precise checks depend on what modes were passed to open.
Create what's usually called an open file description within the kernel. There is one of these objects for each open of a file that the kernel has performed on behalf of any process.
Allocate an unused index in the per-process file descriptor table, and point it at the open file description.
Return this index from the system call as the file descriptor.
This description should be essentially correct for opening plain files and/or directories, but things are different for various sorts of special files, in particular for devices.
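The split between the per-process table index and the kernel-side object it points at can be observed from user space. The sketch below uses dup(2), which the answer does not mention, to create a second index for the same kernel object and shows that the file offset is shared. The function name offset_seen_by_dup is hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical demo: read up to `n` bytes through one descriptor and
 * return the file offset as seen through a dup()ed descriptor.
 * Returns -1 on error. */
off_t offset_seen_by_dup(const char *path, size_t n)
{
    int fd = open(path, O_RDONLY);   /* new kernel object + table index */
    if (fd < 0)
        return -1;

    int fd2 = dup(fd);               /* second index, same kernel object */
    if (fd2 < 0) {
        close(fd);
        return -1;
    }

    char buf[64];
    if (n > sizeof buf)
        n = sizeof buf;
    ssize_t got = read(fd, buf, n);  /* advances the shared offset */

    /* Query the current offset through the other descriptor. */
    off_t pos = (got < 0) ? (off_t)-1 : lseek(fd2, 0, SEEK_CUR);

    close(fd2);
    close(fd);
    return pos;
}
```

Calling open() on the same path twice would instead create two independent kernel objects, each with its own offset.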
I would go back to what the prof told you: there are a lot of things that happen during open(), depending on what you're opening (i.e. a device, a file, a directory), and unless you write what the professor's looking for, you'll lose points.
That being said, it mostly involves checks to see whether this open is valid (i.e. does this file exist, does the user have permission to read/write it, etc.); then an entry in the kernel handle table is allocated to keep track of the fd and its current file position (and, of course, some other things).
