So, the normal POSIX way to safely, atomically replace the contents of a file is:
fopen(3) a temporary file on the same volume
fwrite(3) the new contents to the temporary file
fflush(3)/fsync(2) to ensure the contents are written to disk
fclose(3) the temporary file
rename(2) the temporary file to replace the target file
However, on my Linux system (Ubuntu 16.04 LTS), one consequence of this process is that the ownership and permissions of the target file change to those of the temporary file, which default to the uid/gid of the process and a mode filtered by the current umask.
I thought I would add code to stat(2) the target file before overwriting, and fchown(2)/fchmod(2) the temporary file before calling rename, but that can fail due to EPERM.
Is the only solution to ensure that the uid/gid of the file matches the current user and group of the process overwriting the file? Is there a safe way to fall back in this case, or do we necessarily lose the atomic guarantee?
Is the only solution to ensure that the uid/gid of the file matches the current user and group of the process overwriting the file?
No.
In Linux, a process with the CAP_LEASE capability can obtain an exclusive lease on the file, which blocks other processes from opening the file for up to /proc/sys/fs/lease-break-time seconds. This means that technically, you can take the exclusive lease, replace the file contents, and release the lease, to modify the file atomically (from the perspective of other processes).
Also, a process with the CAP_CHOWN capability can change the file ownership (user and group) arbitrarily.
Is there a safe way to [handle the case where the uid or gid does not match the current process], or do we necessarily lose the atomic guarantee?
Considering that, in general, files may have ACLs and xattrs, it might be useful to create a helper program that clones the ownership (including ACLs) and extended attributes from an existing file to a new file in the same directory (perhaps with a fixed name pattern, say .new-################, where each # is a random alphanumeric character), provided the real user (getuid(), getgid(), getgroups()) is allowed to modify the original file. This helper program would need at least the CAP_CHOWN capability, and would have to consider the various security aspects (especially the ways it could be exploited). (However, the caller can already overwrite the contents, and must have write access to the target directory in order to do the rename/hardlink replacement, so creating an empty clone file on their behalf ought to be safe. I would personally exclude target files owned by the root user or group, though.)
Essentially, the helper program would behave much like the mktemp command, except it would take the path to the existing target file as a parameter. It would then be relatively straightforward to wrap it into a library function, using e.g. fork()/exec() and pipes or sockets.
I personally avoid this problem by using group-based access controls: dedicated (local) group for each set. The file owner field is basically just an informational field then, indicating the user that last recreated (or was in charge of) said file, with access control entirely based on the group. This means that changing the mode and the group id to match the original file suffices. (Copying ACLs would be even better, though.) If the user is a member of the target group, they can do the fchown() to change the group of any file they own, as well as the fchmod() to set the mode, too.
I am by no means an expert in this area, but I don't think it's possible. This answer seems to back this up. There has to be a compromise.
Here are some possible solutions. Each one has advantages and disadvantages, which have to be weighed against the use case and scenario.
Use atomic rename.
Advantage: atomic operation
Disadvantage: the owner/permissions may not be preserved
Create a backup, then write the file in place.
This is what some text editors do.
Advantage: will keep owner/permissions
Disadvantage: no atomicity. Can corrupt file. Other application might get a "draft" version of the file.
Set up permissions to the folder such that creating a new file is possible with the original owner & attributes.
Advantages: atomicity & owner/permissions are kept
Disadvantages: Can be used only in certain specific scenarios (knowledge at the time of creation of the files that would be edited, the security model must allow and permit this). Can decrease security.
Create a daemon/service responsible for editing the files. This process would have the necessary permissions to create files with the respective owner & permissions. It would accept requests to edit files.
Advantages: atomicity & owner/permissions are kept. Higher and granular control to what and how can be edited.
Disadvantages: Possible only in specific scenarios. More complex to implement. Might require deployment and installation. Adds an attack surface and another source of possible (security) bugs. Possible performance impact due to the added intermediate layer.
Do you have to worry about the file that's named being a symlink to a file somewhere else in the file system?
Do you have to worry about the file that's named being one of multiple links to an inode (st_nlink > 1)?
Do you need to worry about extended attributes?
Do you need to worry about ACLs?
Does the user ID and group IDs of the current process permit the process to write in the directory where the file is stored?
Is there enough disk space available for both the old and the new files on the same file system?
Each of these issues complicates the operation.
Symlinks are relatively easy to deal with; you simply need to establish the realpath() to the actual file and do file creation operations in the directory containing the real path to the file. From here on, they're a non-issue.
In the simplest case, where the user (process) running the operation owns the file and the directory where the file is stored, can set the group on the file, the file has no hard links, ACLs or extended attributes, and there's enough space available, then you can get atomic operation with more or less the sequence outlined in the question — you'd do group and permission setting before executing the atomic rename() operation.
There is an outside risk of TOCTOU (time of check, time of use) problems with file attributes. If a link is added between the time when it is determined that there are no links and the rename operation, then the link is broken. If the owner or group or permissions on the file change between the time when they're checked and set on the new file, then the changes are lost. You could reduce the risk of that by breaking atomicity: rename the old file to a temporary name, rename the new file to the original name, and recheck the attributes on the renamed old file before deleting it. That is probably an unnecessary complication for most people, most of the time.
If the target file has multiple hard links to it and those links must be preserved, or if the file has ACLs or extended attributes and you don't wish to work out how to copy those to the new file, then you might consider something along the lines of:
1. write the output to a named temporary file in the same directory as the target file;
2. copy the old (target) file to another named temporary file in the same directory as the target;
3. if anything goes wrong during steps 1 or 2, abandon the operation with no damage done;
4. ignoring signals as much as possible, copy the new file over the old file;
5. if anything goes wrong during step 4, you can recover from the extra backup made in step 2;
6. if anything goes wrong in step 5, report the file names (new file, backup of original file, broken file) for the user to clean up;
7. clean up the temporary output file and the backup file.
Clearly, this loses all pretense at atomicity, but it does preserve links, owner, group, permissions, ACLs and extended attributes. It also requires more space: if the file doesn't change size significantly, it requires 3 times the space of the original file (formally, it needs size(old) + size(new) + max(size(old), size(new)) blocks). In its favour is that it is recoverable even if something goes wrong during the final copy (even a stray SIGKILL), as long as the temporary files have known names (the names can be determined).
Automatic recovery from SIGKILL probably isn't feasible. A SIGSTOP signal could be problematic too; a lot could happen while the process is stopped.
I hope it goes without saying that errors must be detected and handled carefully with all the system calls used.
If there isn't enough space on the target file system for all the copies of the files, or if the process cannot create files in the target directory (even though it can modify the original file), you have to consider what the alternatives are. Can you identify another file system with enough space? If there isn't enough space anywhere for both the old and the new file, you clearly have major issues — irresolvable ones for anything approaching atomicity.
The answer by Nominal Animal mentions Linux capabilities. Since the question is tagged POSIX and not Linux, it isn't clear whether those are applicable to you. However, if they can be used, then CAP_LEASE sounds useful.
How crucial is atomicity vs accuracy?
How crucial is POSIX compliance vs working on Linux (or any other specific POSIX implementation)?
I am trying to implement a filesystem using FUSE, and I want a file to be hidden temporarily when it is deleted. I tried storing all the files' names or inodes in an array and checking them whenever a system call like 'open', 'getattr' or 'readdir' is invoked. But this could hurt performance badly when the number of entries gets really large. So I wonder, is there a better way to do this? Thanks in advance!
There are two problems with your approach (and with the solution pointed out by Oren Kishon, and marked as selected):
First, a file has no name per se. The name of a file is not part of the file. The mapping from filenames to files (actually to inodes) is kept by the system for the convenience of the user, but the names are completely independent of the files they point to. This means it is easy to find out which inode a link points to, but very difficult to do the reverse mapping (finding the directory entry that points to an inode, knowing only the inode). Deleting a file is a two-phase process. First, the unlink(2) system call erases a link (a directory entry) from the directory it belongs to; then all the blocks belonging to the file are deallocated, but only if the reference count (which is stored in the inode itself) drops to zero. This is an easy process, as everything starts from the directory entry you want to delete. But if you don't erase it, searching for it later will be painful, as the second problem shows.
Second, if you do this with, say, six (hard) links to the same file, you will never know when the space actually needs to be reallocated to another file (because you have run out of unallocated space), since the link count on the inode is still six. Even worse, if you add a second reference count to the inode to track the number of links that have been erased but not yet deallocated, you have to search the whole filesystem, because you have no idea where the links should be. So you need to maintain a lot of information (in addition to the space the file occupies in the filesystem): first to gather all the links that once pointed to this file, and second to check whether this is indeed the file that has to be deallocated when more space is needed in the filesystem.
By the way, your problem has an easy solution in user space. Just modify the rm command so that it never erases a file completely (i.e. never unlinks the last link to a file), but instead moves that last link into a queue in some fixed directory on the same filesystem in which the file resided. This keeps the file allocated (although you lose the original name of the file unless you save it in an associated file). A monitor process can check the amount of free space and, when needed, select the first file in the queue (the one erased longest ago) and truly erase it. Beware that if large files have been erased, system load will grow at unpredictable times, whenever the queued files are actually deallocated.
There's another alternative: use ZFS as your filesystem. This requires a lot of memory and CPU, but it is a complete solution to the undeletion of files, because ZFS preserves the history of the filesystem in snapshots, so you can go back in time to a snapshot in which the file existed and make a copy of it, effectively recovering it. ZFS can be used on WORM (Write Once Read Many, like DVD) media, and this allows you to preserve the filesystem state over time (at the expense of never reusing the same storage again). But you will never lose a file.
Edit
There's one case in which a file is no longer available to any process other than the ones that have it open. In this scenario, a process opens a file, then deletes it (deletion involves just breaking the link that translates the name of the file to the inode in the system) but continues using the file until it finally closes it.
As you probably know, a file can be opened by several processes at the same time. Apart from the number of references recorded in the on-disk inode, there is a number of references to the inode in the inode table in kernel memory. This is the number of references in the on-disk inode (the number of directory entries that point to the file's inode) plus one reference for each open file entry.
When a file is unlinked (and should be deleted, because no more links reference the inode), the deallocation doesn't take place immediately if the file is still being used by processes. The file is alive, although it doesn't appear in the filesystem (there are no more references to it in any directory). Only when the last close(2) of the file takes place is the file deallocated.
But what happened to the directory entry that last referenced that file? It can be reused (as I told you in one of the comments) as soon as it has been freed, long before the file is deallocated. A new file (necessarily a different inode, as the old one is still in use) can be created with the same name as the original one, and there is no problem with this, except that you are now using a different file. The old file is still in use but has no name, and for this reason is invisible to all processes except the ones that have it open. This technique is frequently used for temporary files: you create a file with open(2) and immediately unlink(2) it. No other process can access that file by name, and it will be deallocated as soon as the last close(2) on it is called. No file of this kind can survive a reboot of the system (it cannot even survive the process that had it open).
As the question states:
Is it possible to temporarily hide a file from any system call in linux?
The file is hidden from all the system calls that require a name for the file (it has no name anymore) but not from other system calls (e.g. fstat(2) continues to work, while stat(2) can no longer be used on that file; the same goes for link(2), rename(2), open(2), etc.)
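The open-then-unlink technique can be sketched in a few lines of C. `open_anonymous` is just an illustrative name, and the caller is assumed to pass a path in a directory it can write to.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create `path`, then unlink it while it is still open.
 * Returns a descriptor for the now nameless file, or -1 on error. */
int open_anonymous(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;
    if (unlink(path) == -1) {       /* remove the only directory entry */
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;                      /* storage is freed on the last close() */
}
```

After this call, fstat(2), read(2) and write(2) on the descriptor keep working, while stat(2) on the path fails with ENOENT.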
If I understand, when unlink is called, you want the file to be marked as deleted rather than actually deleted.
You could implement this mark as an extended attribute, one with "system" namespace (see https://man7.org/linux/man-pages/man7/xattr.7.html) which is not listed as part of the file xattr list.
In your "unlink", do setxattr(system.markdelete). In all of your other calls that take a path argument, and in readdir, call getxattr and, if the attribute is set, treat the file as deleted.
I'm trying to implement an atomic version of copy on write. I have certain conditions if met that will make a copy of the original file.
I implemented something like this pseudo code.
// write operations
if (some condition) {
    // create a temp file
    rename(srcfile, copied-version)
    rename(tmpfile, srcfile)
}
Problem with this logic:
Hardlinks.
I want to transfer the Hardlink from copied version to new srcfile.
You can't.
Hardlinks are one directional pointers. So you can't modify or remove other hardlinks that you don't explicitly know about. All you can do is write to the same file data, and that's not atomic.
This rule applies uniformly to both hardlinks and file descriptors. What that means is that you can't modify the content pointed to by an unknown hardlink and not modify the content pointed to by another process with the same file open.
That effectively prevents you from atomically modifying the file an unknown hardlink points to.
If you have control over every process which might modify or access these files (if they are only modified by programs you've written), then you might be able to use flock() to signal to other processes that the file is in use. This won't work if the file is stored on an NFS remote file system, but should generally work otherwise.
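A minimal sketch of the flock() idea, assuming every cooperating program goes through the same helper (here called `open_locked`, a made-up name). The lock is advisory: a program that never calls flock() is not affected by it.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Open `path` and take an exclusive advisory lock without blocking.
 * Returns the locked descriptor, or -1 on error; errno is EWOULDBLOCK
 * when another open file description already holds the lock. */
int open_locked(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd == -1)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;      /* the lock is released when the descriptor is closed */
}
```

Readers that only need a consistent view could use LOCK_SH instead of LOCK_EX, so that they exclude writers but not each other.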
In some cases, file leases can be a solution to the underlying issue – ensuring atomic content updates – but only if each reader and writer opens and closes the file for each snapshot.
Because a similar limitation happens for the traditional copy–update–rename-over sequence, perhaps the file lease solution would also work for OP.
For details, see man 2 fcntl Leases and Managing signals sections. The process must either have the same owner as the file, or have the CAP_LEASE capability (usually granted to the process via filesystem capabilities). Superuser processes (running as root) have the capability by default.
The idea is that when the process wishes to make "atomic" changes to the file, it acquires a write lease on the file. This only succeeds if no other process has the file open. If another process tries to open the file, the lease holder receives a signal, and has up to lease-break-time (about a minute, typically) to downgrade the lease (or simply close the file); during that time, the opener will block.
Note that there is no way to divert the opener. The situation is that the opener already has a handle to the underlying inode (so access checks and filename resolution have already occurred); it is just that the kernel won't return it to the userspace process before the lease is released or broken.
Your lease owner can, however, create a copy of the current contents to a temporary file, acquiring a write lease on that as well, and then rename it over the target file name. This way, each (set of) opener(s) obtain a handle to the file contents as they were at the time of the opening; if they do any modifications, they will be "private", and not reflected on the original file. Since the underlying inode is no longer referred to by any filename, when they (the last process having it open) close it, the inode is deleted and the storage released back to the file system. The Linux page cache also caches such accesses very well, so in many cases the "temporary copy file" never even hits actual storage media (unless there is memory pressure, i.e. memory needed for non-pagecache purposes).
A pure "atomic modification" does not require any kind of copies or renames, only holding the lease for the duration of the set of writes that must appear atomic for the readers.
Note that taking a write lease only succeeds once no other process has the file open any longer, so the time at which such a lease-based atomic update can occur is restricted, and it is not guaranteed to be always available. (For example, you may have a lazy process that just keeps the file open and occasionally polls it. If you have such processes, this lease-based approach won't work, but neither would the copy–rename-over approach.)
Also, leases work only on local files.
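A bare-bones, Linux-only sketch of the "pure" lease-based update: take a write lease, write, release. `write_under_lease` is a hypothetical name. The sketch simply ignores the lease-break signal (SIGIO by default, whose default action would otherwise terminate the process) on the assumption that the update is far shorter than lease-break-time; a real program would probably install a handler instead.

```c
#define _GNU_SOURCE             /* F_SETLEASE is Linux-specific */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: overwrite the start of `path` with `len` bytes of
 * `data` while holding a write lease. Returns 0 on success, -1 on error
 * (for instance EAGAIN if some other process has the file open). */
int write_under_lease(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;

    /* A lease break is announced with SIGIO; ignore it for this short
     * update so the default action does not kill the process. */
    struct sigaction ign, old;
    memset(&ign, 0, sizeof ign);
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);
    sigaction(SIGIO, &ign, &old);

    int rc = -1;
    if (fcntl(fd, F_SETLEASE, F_WRLCK) == 0) {
        /* While the lease is held, other openers block in open(). */
        ssize_t n = write(fd, data, len);
        rc = (n == (ssize_t)len) ? 0 : -1;
        fcntl(fd, F_SETLEASE, F_UNLCK);     /* release the lease */
    }
    int saved = errno;
    close(fd);
    sigaction(SIGIO, &old, NULL);
    errno = saved;
    return rc;
}
```

Whether F_SETLEASE succeeds depends on the filesystem and on file ownership or CAP_LEASE, so a caller needs a fallback path for the failure case.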
If you need record-based atomicity, just use fcntl-based record locks, and have all readers take a read-lock for the region they want to access atomically, and all writers take a write-lock for the region to be updated, as record-locks are advisory (i.e., do not block reads or writes, only other record locks).
I'm developing a little software in C that reads and writes messages in a notice-board. Every message is a .txt named with a progressive number.
The software is multithreaded, with many users that can do concurrent operations.
The operations that a user can do are:
Read the whole notice-board (concatenation of all the .txt file contents)
Add a message (add a file named "id_max++.txt")
Remove a message. When a message is removed there will be a hole in that number (e.g, "1.txt", "2.txt", "4.txt") that will never be filled up.
Now, I'd like to know if there is some I/O problem (*) that I should manage (and how) or the OS (Unix-like) does it all by itself.
(*) such as 2 users that want to read and delete the same file
Since you are on a Unix-like system, the OS will take care of a file being deleted while it is open in another thread: the directory entry is removed immediately, and the file itself (the inode) is deleted on the last close.
The only problem I can see is between the directory scan and the open of a file: because of a race condition, the file may have been deleted in the meantime.
IMHO you must simply consider a "file does not exist" error to be normal, and go on to the next file.
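In code, that tolerance for vanished files could look something like the following sketch. `read_board` is a made-up function; it assumes the directory holds only message files, and it reads them in whatever order readdir() returns them rather than in numeric order.

```c
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Copy the contents of every entry in `dir` to `out`, silently skipping
 * files that disappear between readdir() and open().
 * Returns the number of files read, or -1 on error. */
int read_board(const char *dir, FILE *out)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;
    int count = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue;                   /* skip ".", ".." and hidden files */
        int fd = openat(dirfd(d), e->d_name, O_RDONLY);
        if (fd == -1) {
            if (errno == ENOENT)
                continue;               /* deleted since readdir(): normal */
            closedir(d);
            return -1;
        }
        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            fwrite(buf, 1, (size_t)n, out);
        close(fd);
        count++;
    }
    closedir(d);
    return count;
}
```

A message that is deleted after it has been opened is still read in full, because the inode stays alive until the descriptor is closed.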
What you describe is not really bad, since it is analogous to MH folders for mail, which can be accessed by many different processes, even if locking is involved. But depending on the load and on the size of the messages, you could consider using a database. Rule of thumb (my opinion):
few concurrent accesses and big files : keep on using file system
many accesses and small files (several KB at most) : use a database
Of course, you must use a mutex protected routine to find next number when creating a new message (credits should be attributed to #merlin2011 for noticing the problem).
You said in a comment that your specs do not allow a database. By analogy with mail handling, you could also use a single file (like the traditional mailbox format):
one single file
each message is preceded with a fixed size header saying whether it is active or deleted
read access need not be synchronized
write accesses must be synchronized
It would be a poor man's database where all synchronization is done by hand, but you have only one file descriptor per thread and save all the open and close operations. It makes sense where there are many reads and few writes or deletes.
A possible improvement would be (still like mail readers do) to build an index with the offset and status of each message. The index could be on disk or in memory depending on your requirements.
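To make the single-file layout concrete, here is one possible shape for it. Everything in this sketch is an assumption for illustration: the `msg_header` layout, the status values and the function names. The offsets returned by `append_message` are what an in-memory index would store.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

enum { MSG_DELETED = 0, MSG_ACTIVE = 1 };

/* Hypothetical fixed-size header stored in front of each message. */
struct msg_header {
    uint32_t status;    /* MSG_ACTIVE or MSG_DELETED */
    uint32_t length;    /* bytes of message text following the header */
};

/* Append a message and return the offset of its header, or -1 on error.
 * The caller must hold the write lock. */
off_t append_message(int fd, const char *text, uint32_t len)
{
    off_t off = lseek(fd, 0, SEEK_END);
    if (off == -1)
        return -1;
    struct msg_header h = { MSG_ACTIVE, len };
    if (pwrite(fd, &h, sizeof h, off) != (ssize_t)sizeof h)
        return -1;
    if (pwrite(fd, text, len, off + (off_t)sizeof h) != (ssize_t)len)
        return -1;
    return off;
}

/* Mark the message whose header is at `off` as deleted, in place.
 * The caller must hold the write lock. Returns 0 or -1. */
int delete_message(int fd, off_t off)
{
    uint32_t status = MSG_DELETED;  /* status is the first header field */
    ssize_t n = pwrite(fd, &status, sizeof status, off);
    return n == (ssize_t)sizeof status ? 0 : -1;
}
```

Deleting a message only flips one field, so the file never shrinks; reclaiming the space would need a separate compaction pass.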
The easier solution is to use a database like sqlite or MySQL, both of which provide transactions that you can use to achieve consistency. If you still want to go down this route, read on.
The issue is not an IO problem, it's a concurrency problem if you do not implement proper monitors. Consider the following scenario (it is not the only problematic one, but it is one example of one).
User 1 reads the maximum id and stores it in a local variable.
Meanwhile, User 2 reads the same maximum id and stores it in a local variable also.
User 1 writes first, and then User 2 overwrites what User 1 just wrote, because it had the same idea of what the maximum id was.
This particular scenario can be solved by keeping the current maximum id as a variable that is initialized when the program is initialized, and protecting the get_and_increment operation with a lock. However, this is not the only problematic scenario that you will need to reason through if you go with this approach.
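The locked get_and_increment could be as small as this sketch. The names `init_max_id` and `next_message_id` are illustrative, and the counter only protects threads inside one process.

```c
#include <pthread.h>

static pthread_mutex_t id_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long max_id;    /* highest message id handed out so far */

/* Call once at startup with the highest id found in the directory. */
void init_max_id(unsigned long current_max)
{
    pthread_mutex_lock(&id_lock);
    max_id = current_max;
    pthread_mutex_unlock(&id_lock);
}

/* Atomically reserve and return the next message id. */
unsigned long next_message_id(void)
{
    pthread_mutex_lock(&id_lock);
    unsigned long id = ++max_id;
    pthread_mutex_unlock(&id_lock);
    return id;
}
```

Each thread then writes its message to the file named after the id it was given, so no two threads ever pick the same file name.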
If the file already exists, I want to overwrite it. If it doesn't exist, I want to create it and write to it. I'd prefer to not have to use a 3rd party library like lockfile (which seems to handle all types of locking.)
My initial idea was to:
Write to a temporary file with a randomly generated large id to avoid conflict.
Rename the temp filename -> new path name.
os.Rename calls syscall.Rename, which on Linux and other Unix systems uses the rename syscall (which is atomic*). On Windows, syscall.Rename calls MoveFileW, which is atomic* assuming the source and destination are on the same device (which can be arranged) and the filesystem is NTFS (which is often the case).
I would take care to make sure the source and destination are on the same device so the Linux rename does not fail, and the Windows rename is actually atomic. As Dave C mentions above creating your temporary file (usually using ioutil.TempFile) in the same directory as existing file is the way to go; this is how I do my atomic renames.
This works for me in my use case which is:
One Go process gets updates and renames files to swap updates in.
Another Go process is watching for file updates with fsnotify and re-mmaps the file when it is updated.
In the above use case simply using os.Rename has worked perfectly well for me.
Some further reading:
Is rename() atomic? "Yes and no. rename() is atomic assuming the OS does not crash...."
Is an atomic file rename (with overwrite) possible on Windows?
*Note: I do want to point out that when people talk about atomic filesystem operations, from an application perspective, they usually mean that the operation either happens or does not happen (which journaling can help with) from the user's perspective. If you are using atomic in the sense of an atomic memory operation, very few filesystem operations (outside of direct I/O [O_DIRECT] one-block writes and reads with disk buffering disabled) can be considered truly atomic.
I'm making a program in C for linux that scans a directory every x seconds during a time period for modifications, but I'm having trouble finding out when a file or directory is created. Here are a few options I considered:
Using the stat struct, check if the last status change and data modification timestamps are the same. This brings up the problem that you can create a file, modify it before the program has a chance to check it, which changes the data modification timestamp and no longer sees it as a new file.
Keep a log of the name of every file/directory in the directory and check for new ones. This has the problem where you delete a file and then create a new one with the same name, it doesn't get interpreted as a new file.
Keep a count of the number of files/directories. Similar problem to the last idea.
With that said, does anyone have any idea on how I can uniquely identify the creation of a file/directory?
You cannot, at least not this way. POSIX has no provisions for storing the file creation time in the file system, like Windows and some other OSes do. It only keeps the status change, access and modification times. Most Unix filesystems do not store that information either.
One of the reasons for this is the existence of hard links, since file timestamps are stored in their inodes and not in the directory references. What would you consider the creation time to be for a file that was created at 10am and then hard-linked into another directory at 11am? What if a file is copied?
Your best, but unfortunately OS-specific, approach would be to use whatever framework is available in your platform to monitor filesystem events, e.g. inotify on Linux and kqueue on FreeBSD and MacOS X...
EDIT:
By the way, Ext4fs on Linux does store inode creation times (crtime). Unfortunately getting to that information from userspace is still at least a bit awkward.
Perhaps you should use inotify?
Check out inotify (Linux-specific).