Ideally, I want to have a directory that is not visible in the filesystem and that will be automatically removed when its last open file descriptor is closed. Its contents would only be accessible through openat(), fstatat(), etc.
For regular files, this behaviour is achieved by giving the O_TMPFILE flag to open(). However, mkdir() doesn't have a flags parameter.
Assuming I have the latest Linux kernel available, is this possible?
I'm not aware of any way to do this, and don't expect it to be possible. Unlike files, which can have zero or more pathnames (due to hard links and unlinked files), directories have exactly one pathname, and it would probably break some valid application usage if the OS did not meet this expectation.
Related
The flag O_DIRECTORY can be used with the syscalls open(2) and openat(2) to avoid denial-of-service vulnerabilities when opening directories. However: How can I avoid the same kind of race conditions for regular files?
Some background information: I am trying to develop some kind of backup tool. The program walks over a directory tree, reads all regular files and only stats other files. If I first call fstatat(2) for each directory entry, test the result for regular files and open them with openat(2), then there is a race condition between the syscalls. An attacker could replace the regular file with a FIFO, and my program would hang on the FIFO.
How can I avoid this race condition? For directories, there is O_DIRECTORY, for symbolic links, O_PATH can be used. However, I have found no solution for regular files. I only need a solution that works on recent Linux versions.
If your only concern is fifos, O_NONBLOCK will prevent blocking and allow you to open a fifo even if it has no writers (see http://pubs.opengroup.org/onlinepubs/9699919799/functions/open.html for where this is specified). However, there are also a few other concerns:
Device nodes
Fake files in Linux /proc with bad properties
...
Since these normally can't be created in arbitrary locations by non-root users, O_NOFOLLOW should be sufficient to avoid following symlinks to them.
With that said, on modern Linux there is an even safer solution: perform the initial open with O_PATH|O_NOFOLLOW, then perform stat on /proc/self/fd/%d to check the file type. You can then open /proc/self/fd/%d and be completely certain it corresponds to the same file you just stat'd.
Note that on sufficiently new Linux, you don't need to use /proc/self/fd/%d to reach the file to which you obtained an inode handle with O_PATH. You can use fstat and openat on it directly to "stat" it and get a descriptor to a real open file description, respectively. However, O_PATH file descriptors had a lot of broken/unimplemented corner cases like this in the range of late 2.6.x (when they were first added) to 3.8 or so, and I find the /proc method the most reliable. Of course you could always try the direct method and fall back to /proc if it fails.
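A minimal sketch of the /proc approach described above, assuming /proc is mounted and the kernel supports O_PATH. The helper name open_regular_via_proc is made up for illustration; it is not part of any library.

```c
#define _GNU_SOURCE /* for O_PATH */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns a read-only descriptor for `name` (relative to
 * `dirfd`) only if it is a regular file; returns -1 otherwise. */
int open_regular_via_proc(int dirfd, const char *name)
{
    /* O_PATH gives a handle without really opening the object, so this
     * cannot block on a FIFO. O_NOFOLLOW keeps a final symlink from being
     * followed. */
    int pathfd = openat(dirfd, name, O_PATH | O_NOFOLLOW | O_CLOEXEC);
    if (pathfd < 0)
        return -1;

    char procpath[64];
    int n = snprintf(procpath, sizeof procpath, "/proc/self/fd/%d", pathfd);
    assert(n > 0 && (size_t)n < sizeof procpath);

    struct stat st;
    int fd = -1;
    /* Both calls go through the same handle, so the object that gets opened
     * is the object whose type was just checked. */
    if (stat(procpath, &st) == 0 && S_ISREG(st.st_mode))
        fd = open(procpath, O_RDONLY | O_CLOEXEC);

    close(pathfd);
    return fd;
}
```

A caller walking a tree would pass the descriptor of the directory being scanned as dirfd and treat -1 as "skip this entry".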
Open with O_RDONLY|O_NONBLOCK, check that the result isn't -1, then do an fstat() on the resulting file descriptor and compare st_mode (and possibly st_dev and st_ino) with what you expected.
Remember to set the AT_SYMLINK_NOFOLLOW flag on your fstatat.
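Roughly, that open-then-fstat check could look like the sketch below. The helper name open_if_same_regular is hypothetical, and adding O_NOFOLLOW to the open is my own assumption (it mirrors the AT_SYMLINK_NOFOLLOW on the earlier fstatat).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: `expected` is the result of an earlier
 * fstatat(dirfd, name, &expected, AT_SYMLINK_NOFOLLOW). Returns a descriptor
 * only if the opened object is that same regular file, else -1. */
int open_if_same_regular(int dirfd, const char *name, const struct stat *expected)
{
    assert(name != NULL && expected != NULL);

    /* O_NONBLOCK keeps a swapped-in FIFO from blocking the open. */
    int fd = openat(dirfd, name, O_RDONLY | O_NONBLOCK | O_NOFOLLOW | O_CLOEXEC);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode) ||
        st.st_dev != expected->st_dev || st.st_ino != expected->st_ino) {
        close(fd); /* not the file that was examined earlier */
        return -1;
    }
    return fd;
}
```

O_NONBLOCK stays set on the returned descriptor; for regular files that normally makes no difference to read().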
Assuming Linux, or more generally a sufficiently POSIX compliant system, is there a ready made method of checking if opening a file with a given name would succeed? Most optimistically I am seeking an implementation of a function with the same prototype as open(2)
int test_open(const char *pathname, int flags);
which would return a result according to the anticipated success or failure of the open(2) system call with the same parameters, but without actually creating or opening any file. It should be suitably licensed (reusable in a proprietary software project) open source.
The open(2) manual page lists many reasons for open(2) failing. One errno value can decode multiple reasons, and the errno is different between Linux and POSIX. But nevertheless roughly speaking:
I think in general the following cases as itemized by errno are most relevant: EACCES, EEXIST, ENOENT, EISDIR, ENOTDIR (both POSIX and Linux).
Less important: ELOOP, EMFILE, ENFILE, ENAMETOOLONG, ENODEV, ENXIO, EOVERFLOW, EPERM, EROFS, ETXTBSY, EWOULDBLOCK (POSIX adds EAGAIN).
Irrelevant (more transient conditions): ENOMEM, EINTR, ENOSPC (POSIX adds EIO, ENOSR).
(I am now unable to quickly find online POSIX manual page for open(), I am personally referring to POSIX manual pages installed in my Linux machine - I will edit the question when I find online link.)
Background and Expectations: My application/system configuration architecture mandates that an input value needs to be validated before it is stored permanently. Only after the validation and storage steps are performed is the file going to be used for writing. Accepting bad values would be a huge inconvenience (and trying to actually switch to a bad file path would disturb the operation). I cannot or do not want to make an exception for this special case (it is just one of over a hundred configuration values).
I would prefer not to introduce side effects during validation by creating a file (the flags for open() include O_CREAT). Obviously the check I am looking for cannot be implemented 100% reliably in the most general case, which is why I categorized the possible error conditions into three groups. We could make a very educated guess by analyzing the directory permissions, the existence of directories, whether something with the same name already exists and hinders opening the file, and whether the file name makes sense (my group 1 conditions). (Group 2 checks, for the number of symbolic links, file descriptor limits, the name length limit, O_NOATIME permission, writability of the file system, and maybe the EWOULDBLOCK and POSIX EAGAIN cases, could be done too, but they are more cumbersome and probably less portable, and they are less likely to occur unless the input is malicious, which is why I categorized them as less important.)
P.S. I added tag c which is my programming language now, but the language is not very relevant.
There is no fail-proof way to do that, because (as Jite commented) some other process could have changed the environment (e.g. removed the parent directory, or filled up the filesystem, exceeded the disk quota, ....) between your test_open and the further open or creat syscall. Or the disk (or the media containing the filesystem, e.g. some USB stick) could have burned or have been unplugged.
The good practice is to check the result of open and use errno when it has failed.
You could use access to check a few things before. But since there is no fail-proof way, why bother?
You might validate the directory part of your file path using the realpath(3) function. But even that is useless: some other process could have created or deleted the directory between your test_open and the real open.
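For what it's worth, "just try the open and look at errno" can be as small as the sketch below. The helper name try_open_for_write and the 0644 mode are assumptions for illustration, and note that with O_CREAT this does have the side effect of creating the file.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: tries to open `path` for appending, creating it if
 * needed. Returns 0 on success, or the errno value explaining the failure. */
int try_open_for_write(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0)
        return errno; /* e.g. ENOENT, EACCES, EISDIR, ENOTDIR */

    close(fd);
    return 0;
}
```

The returned value can be passed to strerror() to produce a message for whoever entered the bad configuration value.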
Linux 2.6.39 introduced O_PATH open mode, which (roughly speaking) doesn't really open the file at all (i.e. doesn't create an open file description), but just gives a file descriptor that's a handle to the unopened target. Its main use is as an argument to the *at functions (openat, etc.), and it seems to be suitable as an implementation of the POSIX 2008 O_SEARCH functionality which Linux was previously missing. However, I've been unable to find any good documentation on the exact semantics of O_PATH. A couple specific questions I have are:
What operations are possible on Linux O_PATH file descriptors? (Only *at functions?)
Is O_PATH ever useful with non-directories?
How is the file descriptor bound to the underlying filesystem object, and what happens if it's moved, deleted, etc.? Does an O_PATH file descriptor count as a reference that prevents the object from being freed when the last link is unlinked? Etc.
File descriptors obtained using open(directory, O_PATH | O_DIRECTORY) are not only useful for ...at() functions, but for fchdir() (since kernel version 3.2.23, I believe).
There is also a recent patch for a new syscall, fbind(), that would allow very long Unix domain socket names. The socket file is first created using mknod(path, mode | S_IFSOCK, (dev_t)0), then opened using open(file, O_PATH). The file descriptor thus obtained, and a Unix domain socket descriptor, is passed to fbind(), to bind the socket to the pathname. Whether this will be included in the Linux kernel is yet to be seen -- although even if it is, it will be years before one can rely on it being universally available. (As a workaround for too-long Unix domain socket names it would be viable sooner, though.)
I'd say O_PATH is only useful for directories for now; file uses may be found in the future. Other than the possibility of a future fbind(), or similar future syscalls, I don't know of any use of file descriptors for files opened using O_PATH. Even fstatvfs() won't work, on a 3.5.0 kernel at least.
In Linux, inodes (file contents and metadata) are freed only when the last open file descriptor is closed. When removing (unlinking) a file, you only remove the file name associated with the inode. So, there are two separate filesystem objects associated with a file descriptor: the name used to open the object, and the underlying inode referred to. The name is only used for path resolution, i.e. when open() (or equivalent) is called. All data and metadata is in the inode.
File descriptors obtained using O_PATH behave (at least on kernel 3.5.0) just like normal file descriptors wrt. moving and renaming the name or name components used to open the descriptor. (The descriptor stays valid, as it refers to the inode, and the file name object was used only during path resolution. Holding the descriptor open will keep the inode resources allocated, even if the descriptor was opened O_PATH.)
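A small sketch that shows the descriptor outliving the name. It assumes a kernel where fstat() works on O_PATH descriptors (the other answers here suggest that was not reliable on older kernels), and the helper name nlink_after_unlink is made up.

```c
#define _GNU_SOURCE /* for O_PATH */
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical demo: opens `path` with O_PATH, unlinks it, and reports the
 * link count seen through the descriptor afterwards. Returns -1 on error. */
long nlink_after_unlink(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_PATH | O_CLOEXEC);
    if (fd < 0)
        return -1;

    struct stat st;
    long result = -1;
    /* The name goes away, but the descriptor still refers to the inode. */
    if (unlink(path) == 0 && fstat(fd, &st) == 0)
        result = (long)st.st_nlink;

    close(fd); /* with no names and no descriptors left, the inode can be freed */
    return result;
}
```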
Are there any alternatives to stat (which is found on most Unix systems) which can determine the file type? The manpage says that a call to stat is expensive, and I need to call it quite often in my app.
The alternative is fstat() if you already have the file open (so you have a file descriptor for it). Or lstat() if you want to find out about symbolic links rather than the file the symlink points to.
I think the man page is exaggerating the cost; it is not much worse than any other system call that has to resolve the name of the file into an inode. It is more costly than getpid(); it is less costly than open().
The "file type" that stat() gives you is whether the file is a regular file or something like a device file or directory, among other things like its size and inode number. If that's what you need to know, then you must use stat().
If what you actually need to know is the type of the file's contents -- e.g. text file, JPEG image, MP3 audio -- then you have two options. You can guess based on the filename extension (if it ends in ".mp3", the file probably contains MP3 audio), or you can use libmagic, which actually opens the file and reads some of its contents to figure out what it is. The libmagic approach is more expensive (if you're trying to avoid stat(), you probably want to avoid open() too), but less prone to error (in case that ".mp3" file is actually a JPEG image, for example).
Under Linux, with some filesystems, the file type (regular, char device, block device, directory, pipe, symlink, ...) is stored in the linux_dirent struct, which is the form in which the kernel hands directory entries to applications via the getdents system call. If the only thing in the stat structure you needed was the file type, and you needed it for all or many entries of a directory, you could use getdents directly (rather than readdir) and attempt to get the file type out of that, only using stat if you found an invalid file type in linux_dirent. Depending on your application's filesystem usage pattern this could be faster than using stat if you are using Linux, but stat should be fast in many cases.
Stat's speed has mostly to do with locating the data that is being asked for on disk. If you are traversing a directory recursively stat-ing all of the files then each stat should end up being fairly quick overall because most of the work getting the data stat needs ends up cached before you ask the kernel for it by a previous call to stat. If on the other hand you stat the same number of files randomly distributed around the system then the kernel will likely have to read from disk several directories for each file you are going to call stat on.
fstat should always be very fast since the kernel should already have the data you're asking for in RAM, as it needs to access it for the file to be in the open state, and the kernel won't have to go through the trouble of traversing the path of the filename to see if each component is in RAM or on disk and possibly reading in a directory from disk (but likely not having to), only to discover that it has the data that you are asking for in RAM.
That being said, calling stat on an open file should be faster than calling it on an unopened file.
Are you aware of the "magic" file on *nix systems? By querying a file from the command line with something like file myfile.ext you can get the real file type.
This is done by reading the contents of the file rather than looking at its extension, and is widely used on *nix (Linux, Unix, ...) systems.
If your application is expected to run on Linux systems, why don't you try inotify(7)? It is definitely faster than stat-ing many files.
I'm studying for my operating systems midterm and was wondering if I can get some help.
Can someone explain the checks and what the kernel does during the open() system call?
Thanks!
Very roughly, you can think of the following steps:
Translate the file name into an inode, which is the actual file system object describing the contents of the file, by traversing the filesystem data structures.
During this traversal, the kernel will check that you have sufficient access through the directory path to the file, and check access on the file itself. The precise checks depend on what modes were passed to open.
Create what POSIX calls an open file description within the kernel (sometimes loosely called an open file descriptor). There is one of these objects for each file the kernel has opened on behalf of any process.
Allocate an unused index in the per-process file descriptor table, and point it at the open file description.
Return this index from the system call as the file descriptor.
This description should be essentially correct for opening plain files and/or directories, but things are different for various sorts of special files, in particular for devices.
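The split between the kernel's per-open object and the per-process table index can be observed from user space. The sketch below (helper name offsets_after_one_read is made up) opens the same file twice and also dup()s the first descriptor: each open() gets its own kernel object with its own file position, while dup() only adds a second table index pointing at the same object.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical demo: reads one byte through a first open() of `path`, then
 * reports the file offset seen through a second open() and through a dup()
 * of the first descriptor. Returns 0 on success, -1 on error.
 * The file must contain at least one byte. */
int offsets_after_one_read(const char *path, off_t *via_second_open, off_t *via_dup)
{
    assert(path != NULL && via_second_open != NULL && via_dup != NULL);

    int rc = -1;
    int first = open(path, O_RDONLY);          /* new kernel object */
    int second = open(path, O_RDONLY);         /* another, independent one */
    int copy = (first >= 0) ? dup(first) : -1; /* new index, same object as first */
    char byte;

    if (first >= 0 && second >= 0 && copy >= 0 && read(first, &byte, 1) == 1) {
        *via_second_open = lseek(second, 0, SEEK_CUR); /* position untouched */
        *via_dup = lseek(copy, 0, SEEK_CUR);           /* position shared with first */
        rc = 0;
    }

    if (copy >= 0) close(copy);
    if (second >= 0) close(second);
    if (first >= 0) close(first);
    return rc;
}
```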
I would go back to what the prof told you - there are a lot of things that happen during open(), depending on what you're opening (i.e. a device, a file, a directory), and unless you write what the professor's looking for, you'll lose points.
That being said, it mostly involves the checks to see if this open is valid (i.e. does this file exist, does the user have permissions to read/write it, etc), then an entry in the kernel handle table is allocated to keep track of the fd and its current file position (and of course, some other things)