Semantics of Linux O_PATH file descriptors? - c

Linux 2.6.39 introduced O_PATH open mode, which (roughly speaking) doesn't really open the file at all (i.e. doesn't create an open file description), but just gives a file descriptor that's a handle to the unopened target. Its main use is as an argument to the *at functions (openat, etc.), and it seems to be suitable as an implementation of the POSIX 2008 O_SEARCH functionality which Linux was previously missing. However, I've been unable to find any good documentation on the exact semantics of O_PATH. A couple specific questions I have are:
What operations are possible on Linux O_PATH file descriptors? (Only *at functions?)
Is O_PATH ever useful with non-directories?
How is the file descriptor bound to the underlying filesystem object, and what happens if it's moved, deleted, etc.? Does an O_PATH file descriptor count as a reference that prevents the object from being freed when the last link is unlinked? Etc.

File descriptors obtained using open(directory, O_PATH | O_DIRECTORY) are not only useful for ...at() functions, but for fchdir() (since kernel version 3.2.23, I believe).
There is also a recent patch for a new syscall, fbind(), that would allow very long Unix domain socket names. The socket file is first created using mknod(path, mode | S_IFSOCK, (dev_t)0), then opened using open(file, O_PATH). The file descriptor thus obtained and a Unix domain socket descriptor are then passed to fbind(), to bind the socket to the pathname. Whether this will be included in the Linux kernel is yet to be seen -- although even if it is, it will be years before one can rely on it being universally available. (As a workaround for too-long Unix domain socket names it would be viable sooner, though.)
I'd say O_PATH is only useful for directories for now; file uses may be found in the future. Other than the possibility of a future fbind(), or similar future syscalls, I don't know of any use of file descriptors for files opened using O_PATH. Even fstatvfs() won't work, on a 3.5.0 kernel at least.
In Linux, an inode (file contents and metadata) is freed only once its last name has been unlinked and the last open file descriptor referring to it has been closed. When removing (unlinking) a file, you only remove the file name associated with the inode. So, there are two separate filesystem objects associated with a file descriptor: the name used to open the object, and the underlying inode referred to. The name is only used for path resolution, i.e. when open() (or equivalent) is called. All data and metadata is in the inode.
File descriptors obtained using O_PATH behave (at least on kernel 3.5.0) just like normal file descriptors wrt. moving and renaming the name or name components used to open the descriptor. (The descriptor stays valid, as it refers to the inode, and the file name object was used only during path resolution. Holding the descriptor open will keep the inode resources allocated, even if the descriptor was opened O_PATH.)
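A minimal sketch of the two points above, assuming a Linux kernel with O_PATH and glibc (which exposes the flag under _GNU_SOURCE). The helper name create_via_dir_handle is hypothetical. It takes an O_PATH handle on a directory, renames the directory, and then still creates a file relative to the handle with openat():

```c
#define _GNU_SOURCE   /* exposes O_PATH in <fcntl.h> on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>    /* rename(), snprintf() */
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: open `dir` as an O_PATH handle, rename it to
 * `new_dir`, then create `name` relative to the handle.
 * Returns 0 on success, -1 on failure. */
int create_via_dir_handle(const char *dir, const char *new_dir, const char *name)
{
    assert(dir && new_dir && name);

    int dfd = open(dir, O_PATH | O_DIRECTORY | O_CLOEXEC);
    if (dfd < 0)
        return -1;

    /* The handle refers to the directory inode, not to the name used to open it. */
    if (rename(dir, new_dir) < 0) {
        close(dfd);
        return -1;
    }

    /* Path resolution for `name` starts at the handle, so the old name of
     * the directory no longer matters. */
    int fd = openat(dfd, name, O_WRONLY | O_CREAT | O_EXCL | O_CLOEXEC, 0600);
    close(dfd);
    if (fd < 0)
        return -1;
    close(fd);
    return 0;
}
```

After a successful call, the new file is reachable as new_dir/name, and the original directory name no longer exists.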

Related

Is it possible to create an unlinked file on a selected filesystem?

Basically, the same result as creating a temporary file in the desired file system, opening it, and then unlinking it.
Even better, though unlikely, if this could be done without creating an inode that is visible to other processes.
The ability to do so is OS-specific, since the relevant POSIX function calls all result in a link being generated. Linux in particular has allowed, since version 3.11, the use of O_TMPFILE in the flags argument of open(2) in order to create an anonymous file in a given directory.
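As a rough sketch of the O_TMPFILE approach, assuming Linux 3.11 or later, glibc with _GNU_SOURCE, and a filesystem that supports the flag. The helper name open_unnamed_in and the fallback to mkstemp() plus unlink() are illustrative choices, not part of the answer above:

```c
#define _GNU_SOURCE   /* exposes O_TMPFILE in <fcntl.h> on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: return a read/write descriptor for a file in `dir`
 * that has no name. Tries O_TMPFILE first; if the kernel or filesystem
 * rejects it, falls back to mkstemp() followed by unlink(), which leaves a
 * brief window in which the name is visible. Returns -1 on failure. */
int open_unnamed_in(const char *dir)
{
    assert(dir);

#ifdef O_TMPFILE
    int fd = open(dir, O_TMPFILE | O_RDWR | O_CLOEXEC, 0600);
    if (fd >= 0)
        return fd;
#endif

    char tmpl[4096];
    int n = snprintf(tmpl, sizeof tmpl, "%s/unnamed-XXXXXX", dir);
    if (n < 0 || (size_t)n >= sizeof tmpl)
        return -1;

    int fd2 = mkstemp(tmpl);
    if (fd2 >= 0)
        unlink(tmpl);   /* drop the only name; the inode lives on via fd2 */
    return fd2;
}
```

The file's storage is released when the returned descriptor is closed, since no directory entry refers to it.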
There are several POSIX APIs at your disposal:
mkstemp - generates a unique temporary filename from template, creates and opens the file, and returns an open file descriptor for the file.
tmpfile - opens a unique temporary file in binary read/write (w+b) mode. The file will be automatically deleted when it is closed or the program terminates.
Both of these functions do create files on the filesystem. Creating an inode is unavoidable, if you want to use a real file.
The first provides you a file descriptor for making low-level system calls, like read and write. The second gives you a FILE* for all of the <stdio.h> APIs.
If you don't need/desire an actual file on disk, you should consider the memory stream APIs provided by POSIX.1-2008.
open_memstream() - opens a stream for writing to a buffer. The buffer is dynamically allocated (as with malloc(3)), and automatically grows as required.
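A short sketch of open_memstream() in use, assuming a POSIX.1-2008 C library. The helper name format_pair and its output format are made up for illustration:

```c
#define _POSIX_C_SOURCE 200809L   /* for open_memstream() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: build "<key>=<value>\n" in a heap buffer through a
 * memory stream. Returns a malloc()ed string the caller must free(),
 * or NULL on failure. */
char *format_pair(const char *key, int value)
{
    assert(key);

    char *buf = NULL;
    size_t len = 0;
    FILE *f = open_memstream(&buf, &len);
    if (!f)
        return NULL;

    fprintf(f, "%s=%d\n", key, value);

    /* buf and len are only guaranteed to be up to date after fflush()
     * or fclose(). */
    if (fclose(f) != 0) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

No file or inode is involved here; the data lives only in the process's heap.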
libtmpfilefd ("create a temporary unnamed file") seems to fulfill your requirements.
Looking at the source file, this function creates a temporary file with mkstemp and then unlinks the file right after.

mkstemp() - is it safe to close descriptor and reopen it again?

When generating a temporary file name using mkstemp(), is it safe to immediately call close() on the file descriptor returned by mkstemp(), store the file name generated by mkstemp() somewhere and use it (at a much later time) to open the file again for writing a temporary file? Or will this temporary file name become available again as soon as I call close() on it?
The reason why I'm asking is that I'm wondering why mkstemp() returns a file descriptor at all. If it is safe to close() the descriptor immediately, why does it return a descriptor at all? mkstemp() could close it then on its own and just give me a file name.
No. In between the time when you use mkstemp() to create the file and the time when you reopen it, your adversary may have removed the file you created and put a symlink in its place pointing to somewhere else altogether. This is a TOCTOU — Time of Check, Time of Use — vulnerability which the use of mkstemp() largely avoids, provided you keep the file descriptor open.
Once you close the file descriptor, all bets are off in a sufficiently hostile environment.
Note that even if you keep the file descriptor open, an adversary might remove the file, or rename it, and then create their own file (symlink, directory) in its place. The file descriptor remains valid. You could use stat() to get the name information and fstat() to get the file descriptor information, and if the two match (st_dev and st_ino fields), then you're probably still OK. If they differ, someone's got at the file; if you rename it, you may be renaming their file rather than the one you created.
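The stat()/fstat() comparison described above could look roughly like this. The helper name path_matches_fd is hypothetical, and the result is only a snapshot: the name can be replaced again right after the check.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if `path` currently names the same file
 * that `fd` refers to, 0 if it names something else, and -1 if either
 * lookup fails (for example because the name no longer exists). */
int path_matches_fd(const char *path, int fd)
{
    assert(path);

    struct stat by_name, by_fd;
    if (stat(path, &by_name) < 0 || fstat(fd, &by_fd) < 0)
        return -1;

    /* Same device and same inode number means same underlying file. */
    return by_name.st_dev == by_fd.st_dev && by_name.st_ino == by_fd.st_ino;
}
```

A caller would typically run this check just before acting on the name (for example before rename()) and treat anything other than 1 as a sign of tampering.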
While the file originally created by mkstemp() still exists, the name will not be regenerated. In general, successive calls to mkstemp() will create distinct names anyway, but the name is guaranteed to be unique at the moment of creation (see the O_EXCL flag for open()).
And just in case you're wondering, no: there isn't a way to associate a name with a file descriptor (there is no hypothetical int flink(int fd, const char *name) system call). There was a question about that on one of the Stack Exchange sites a while ago, and the answer was definitely negative, with references to the Linux Kernel mailing list and so on. One such question is "Is it possible to recreate a file from an opened file descriptor?", but I think there was a more thorough version of the question too.
The mkstemp function specifically uses descriptors instead of filenames to avoid race conditions that are commonly associated with its predecessors such as mktemp. In fact, the "s" in "mkstemp" means "secure", because the race condition can be a source of vulnerability (for example, if you use the temporary file to store JIT code, someone guessing or stomping on the file name before you open it could cause your application to load and run their code rather than the code that your program generates).
Once you close the descriptor, nothing prevents another application from writing a file with the same name, so please don't do that. You should retain the descriptor for as long as the temporary file is needed (and close the descriptor once the temporary file is no longer going to be used by your program).

How to safely open regular files without denial-of-service vulnerability?

The flag O_DIRECTORY can be used with the syscalls open(2) and openat(2) to avoid denial-of-service vulnerabilities when opening directories. However: How can I avoid the same kind of race conditions for regular files?
Some background information: I am trying to develop some kind of backup tool. The program walks over a directory tree, reads all regular files and only stats other files. If I first call fstatat(2) for each directory entry, test the result for regular files and open them with openat(2), then there is a race condition between the syscalls. An attacker could replace the regular file with a FIFO, and my program would hang on the FIFO.
How can I avoid this race condition? For directories, there is O_DIRECTORY, for symbolic links, O_PATH can be used. However, I have found no solution for regular files. I only need a solution that works on recent Linux versions.
If your only concern is fifos, O_NONBLOCK will prevent blocking and allow you to open a fifo even if it has no writers (see http://pubs.opengroup.org/onlinepubs/9699919799/functions/open.html for where this is specified). However, there are also a few other concerns:
Device nodes
Fake files in Linux /proc with bad properties
...
Since these normally can't be created in arbitrary locations by non-root users, O_NOFOLLOW should be sufficient to avoid following symlinks to them.
With that said, on modern Linux there is an even safer solution: perform the initial open with O_PATH|O_NOFOLLOW, then perform stat on /proc/self/fd/%d to check the file type. You can then open /proc/self/fd/%d and be completely certain it corresponds to the same file you just stat'd.
Note that on sufficiently new Linux, you don't need to use /proc/self/fd/%d to reach the file to which you obtained an inode handle with O_PATH. You can use fstat and openat on it directly to "stat" it and get a descriptor to a real open file description, respectively. However, O_PATH file descriptors had a lot of broken/unimplemented corner cases like this in the range of late 2.6.x (when they were first added) to 3.8 or so, and I find the /proc method the most reliable. Of course you could always try the direct method and fall back to /proc if it fails.
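A sketch of the O_PATH approach, assuming a reasonably recent Linux kernel, glibc with _GNU_SOURCE, and /proc mounted. The helper name open_regular_only is hypothetical. It uses fstat() on the O_PATH descriptor for the type check and the /proc/self/fd path for the reopen:

```c
#define _GNU_SOURCE   /* exposes O_PATH in <fcntl.h> on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: open `name` (relative to `dirfd`) for reading, but
 * only if it is a regular file. Returns a readable descriptor, or -1. */
int open_regular_only(int dirfd, const char *name)
{
    assert(name);

    /* Step 1: pin the object without opening it; never follow a symlink. */
    int pfd = openat(dirfd, name, O_PATH | O_NOFOLLOW | O_CLOEXEC);
    if (pfd < 0)
        return -1;

    /* Step 2: inspect the pinned object. On older kernels, stat() on the
     * /proc/self/fd path may be needed instead of fstat(). */
    struct stat st;
    if (fstat(pfd, &st) < 0 || !S_ISREG(st.st_mode)) {
        close(pfd);
        return -1;
    }

    /* Step 3: reopen the very same object through its /proc entry. */
    char procpath[64];
    snprintf(procpath, sizeof procpath, "/proc/self/fd/%d", pfd);
    int fd = open(procpath, O_RDONLY | O_CLOEXEC);
    close(pfd);
    return fd;
}
```

Because the type check and the reopen both go through the same pinned object, swapping the directory entry for a FIFO between the two steps has no effect on what gets opened.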
Open with O_RDONLY|O_NONBLOCK, check that the result isn't -1, then do an fstat() on the resulting file descriptor and compare st_mode (and possibly st_dev and st_ino) with what you expected.
Remember to set the AT_SYMLINK_NOFOLLOW flag on your fstatat.
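The O_RDONLY|O_NONBLOCK approach is more portable and might be sketched as follows. The helper name open_checked_nonblock is hypothetical, and the sketch only checks the file type (comparing st_dev and st_ino against an earlier fstatat() result is left out):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: open `path` without blocking on a FIFO, then verify
 * on the descriptor itself that a regular file was opened.
 * Returns the descriptor, or -1 if the open failed or the type is wrong. */
int open_checked_nonblock(const char *path)
{
    assert(path);

    /* O_NONBLOCK makes opening a FIFO with no writer return immediately. */
    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;

    /* fstat() describes what was actually opened, so there is no window
     * between the check and the open. */
    struct stat st;
    if (fstat(fd, &st) < 0 || !S_ISREG(st.st_mode)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Unlike the O_PATH variant, this really opens whatever the name refers to before checking it, which is why device nodes and similar special files remain a concern.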

Difference between creating a duplicate file descriptor using dup() and creating a hard link?

I just tried out this program where I use dup to duplicate the file descriptor of an opened file.
I had made a hard link to this same file and I opened the same file to read the contents of the file in the program.
My question is what is the difference?
I understand that dup gives me a run time abstraction to the file and that hard link refers more to the filesystem implementation but I do not understand the need for use of one over the other.
What are the advantages of using one over the other?
Why can't we explicitly refer to the hard link if we want to refer to the same file locations instead of creating a file descriptor and vice versa?
I am using Linux and the standard C library.
Hard links work on i-nodes, dup works on opened file descriptors. These are different animals.
A file is mostly an inode, with directory entries pointing to that inode (so some files can have more than one name through hard links, while other files can have no name at all: a temporary file that is still open but unlinked has an i-node referred to by an open file descriptor, but no longer has any name). I-nodes exist for the duration of the file and are written to disk.
A file descriptor only exists in processes (in kernel memory only, not on disk), so it can't be written to disk (you could only write its number, which usually doesn't make any sense).
A file descriptor knows (inside the kernel) its inode, but also some more state, notably the current offset.
You could have two file descriptors working on the same file (the same inode, perhaps by open-ing two different hardlinked or symlinked paths to it) but having different state (e.g. different file position or offset).
If using dup(2) syscall, the two file descriptors share the same state (just after the dup) in particular share the same file offset or position.
If using link(2) syscall, the two directory entries point to the same inode. They need to be on the same filesystem.
And a symlink(2) syscall creates a new inode (and a new file) which refers to the symbolic name. Read other man pages about path_resolution(7) and symlink(7).
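A small sketch that makes the difference observable, using only POSIX calls. Both helper names (offset_seen_by and same_inode) are hypothetical:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: read `n` bytes through `fd`, then report the file
 * offset as seen through `other`. Returns -1 on error. */
off_t offset_seen_by(int fd, int other, size_t n)
{
    char buf[64];
    assert(n <= sizeof buf);

    if (read(fd, buf, n) != (ssize_t)n)
        return -1;
    return lseek(other, 0, SEEK_CUR);   /* query only, does not move */
}

/* Hypothetical helper: 1 if both paths resolve to the same inode,
 * 0 if they do not, -1 on error. */
int same_inode(const char *a, const char *b)
{
    assert(a && b);

    struct stat sa, sb;
    if (stat(a, &sa) < 0 || stat(b, &sb) < 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

If `other` came from dup(fd), offset_seen_by() reports the advanced offset, because both descriptors share one state. If `other` came from a separate open() of the same path, its offset is unaffected. After link(a, b), same_inode(a, b) returns 1.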
A hard link is just a way to have the same file in two different directories. It is useful for saving some disk space.
Using dup lets you have two different file descriptors in your program that point to the same file. It is useful if you want to duplicate some kind of logical object that wraps a file descriptor.
The main difference is that a hard link is persistent and a duplicated file descriptor only lasts as long as the process. Plus the reasons already given.

Open system call

I'm studying for my operating systems midterm and was wondering if I can get some help.
Can someone explain the checks and what the kernel does during the open() system call?
Thanks!
Very roughly, you can think of the following steps:
Translate the file name into an inode, which is the actual file system object describing the contents of the file, by traversing the filesystem data structures.
During this traversal, the kernel will check that you have sufficient access through the directory path to the file, and check access on the file itself. The precise checks depend on what modes were passed to open.
Create what's called an open file description (sometimes loosely called an open file descriptor) within the kernel. There is one of these objects for each file the kernel has opened on behalf of any process.
Allocate an unused index in the per-process file descriptor table, and point it at the open file description.
Return this index from the system call as the file descriptor.
This description should be essentially correct for opening plain files and/or directories, but things are different for various sorts of special files, in particular for devices.
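From user space, two of these steps are directly observable: a failed lookup or access check shows up as an errno value, and the returned descriptor is the lowest unused index in the process's table. A minimal sketch, with a hypothetical helper name try_open:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: try to open `path` read-only. On success, return the
 * descriptor (a small integer index into this process's descriptor table).
 * On failure, return the negated errno, e.g. -ENOENT when path resolution
 * finds no such name, or -EACCES when a permission check fails. */
int try_open(const char *path)
{
    assert(path);

    int fd = open(path, O_RDONLY);
    return fd >= 0 ? fd : -errno;
}
```

In a single-threaded program, closing a descriptor and calling try_open() again yields the same number, since the freed index is once more the lowest unused one.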
I would go back to what the prof told you - there are a lot of things that happen during open(), depending on what you're opening (i.e. a device, a file, a directory), and unless you write what the professor's looking for, you'll lose points.
That being said, it mostly involves the checks to see if this open is valid (i.e. does this file exist, does the user have permissions to read/write it, etc), then an entry in the kernel handle table is allocated to keep track of the fd and its current file position (and of course, some other things).
