Related
So, the normal POSIX way to safely, atomically replace the contents of a file is:
fopen(3) a temporary file on the same volume
fwrite(3) the new contents to the temporary file
fflush(3)/fsync(2) to ensure the contents are written to disk
fclose(3) the temporary file
rename(2) the temporary file to replace the target file
However, on my Linux system (Ubuntu 16.04 LTS), one consequence of this process is that the ownership and permissions of the target file change to the ownership and permissions of the temporary file, which default to uid/gid and current umask.
I thought I would add code to stat(2) the target file before overwriting, and fchown(2)/fchmod(2) the temporary file before calling rename, but that can fail due to EPERM.
Is the only solution to ensure that the uid/gid of the file matches the current user and group of the process overwriting the file? Is there a safe way to fall back in this case, or do we necessarily lose the atomic guarantee?
Is the only solution to ensure that the uid/gid of the file matches the current user and group of the process overwriting the file?
No.
In Linux, a process with the CAP_LEASE capability can obtain an exclusive lease on the file, which blocks other processes from opening the file for up to /proc/sys/fs/lease-break-time seconds. This means that technically, you can take the exclusive lease, replace the file contents, and release the lease, to modify the file atomically (from the perspective of other processes).
Also, a process with the CAP_CHOWN capability can change the file ownership (user and group) arbitrarily.
Is there a safe way to [handle the case where the uid or gid does not match the current process], or do we necessarily lose the atomic guarantee?
Considering that in general, files may have ACLs and xattrs, it might be useful to create a helper program, that clones the ownership including ACLs, and extended attributes, from an existing file to a new file in the same directory (perhaps with a fixed name pattern, say .new-################, where # indicate random alphanumeric characters), if the real user (getuid(), getgid(), getgroups()) is allowed to modify the original file. This helper program would have at least the CAP_CHOWN capability, and would have to consider the various security aspects (especially the ways it could be exploited). (However, if the caller can overwrite the contents, and create new files in the target directory -- the caller must have write access to the target directory, so that they can do the rename/hardlink replacement --, creating a clone file on their behalf with empty contents ought to be safe. I would personally exclude target files owned by root user or group, though.)
Essentially, the helper program would behave much like the mktemp command, except it would take the path to the existing target file as a parameter. It would then be relatively straightforward to wrap it into a library function, using e.g. fork()/exec() and pipes or sockets.
I personally avoid this problem by using group-based access controls: dedicated (local) group for each set. The file owner field is basically just an informational field then, indicating the user that last recreated (or was in charge of) said file, with access control entirely based on the group. This means that changing the mode and the group id to match the original file suffices. (Copying ACLs would be even better, though.) If the user is a member of the target group, they can do the fchown() to change the group of any file they own, as well as the fchmod() to set the mode, too.
I am by no means an expert in this area, but I don't think it's possible. This answer seems to back this up. There has to be a compromise.
Here are some possible solutions. Every one has advantages and disadvantages and weighted and chosen depending on the use case and scenario.
Use atomic rename.
Advantage: atomic operation
Disadvantage: possible to not keep owner/permissions
Create a backup. Write file in place
This is what some text editor do.
Advantage: will keep owner/permissions
Disadvantage: no atomicity. Can corrupt file. Other application might get a "draft" version of the file.
Set up permissions to the folder such that creating a new file is possible with the original owner & attributes.
Advantages: atomicity & owner/permissions are kept
Disadvantages: Can be used only in certain specific scenarios (knowledge at the time of creation of the files that would be edited, the security model must allow and permit this). Can decrease security.
Create a daemon/service responsible for editing the files. This process would have the necessary permissions to create files with the respective owner & permissions. It would accept requests to edit files.
Advantages: atomicity & owner/permissions are kept. Higher and granular control to what and how can be edited.
Disadvantages. Possible in only specific scenarios. More complex to implement. Might require deployment and installation. Adding an attack surface. Adding another source of possible (security) bugs. Possible performance impact due to the added intermediate layer.
Do you have to worry about the file that's named being a symlink to a file somewhere else in the file system?
Do you have to worry about the file that's named being one of multiple links to an inode (st_nlink > 1).
Do you need to worry about extended attributes?
Do you need to worry about ACLs?
Does the user ID and group IDs of the current process permit the process to write in the directory where the file is stored?
Is there enough disk space available for both the old and the new files on the same file system?
Each of these issues complicates the operation.
Symlinks are relatively easy to deal with; you simply need to establish the realpath() to the actual file and do file creation operations in the directory containing the real path to the file. From here on, they're a non-issue.
In the simplest case, where the user (process) running the operation owns the file and the directory where the file is stored, can set the group on the file, the file has no hard links, ACLs or extended attributes, and there's enough space available, then you can get atomic operation with more or less the sequence outlined in the question — you'd do group and permission setting before executing the atomic rename() operation.
There is an outside risk of TOCTOU — time of check, time of use — problems with file attributes. If a link is added between the time when it is determined that there are no links and the rename operation, then the link is broken. If the owner or group or permissions on the file change between the time when they're checked and set on the new file, then the changes are lost. You could reduce the risk of that by breaking atomicity but renaming the old file to a temporary name, renaming the new file to the original name, and rechecking the attributes on the renamed old file before deleting it. That is probably an unnecessary complication for most people, most of the time.
If the target file has multiple hard links to it and those links must be preserved, or if the file has ACLs or extended attributes and you don't wish to work out how to copy those to the new file, then you might consider something along the lines of:
write the output to a named temporary file in the same directory as the target file;
copy the old (target) file to another named temporary file in the same directory as the target;
if anything goes wrong during steps 1 or 2, abandon the operation with no damage done;
ignoring signals as much as possible, copy the new file over the old file;
if anything goes wrong during step 4, you can recover from the extra backup made in step 2;
if anything goes wrong in step 5, report the file names (new file, backup of original file, broken file) for the user to clean up;
clean up the temporary output file and the backup file.
Clearly, this loses all pretense at atomicity, but it does preserve links, owner, group, permissions, ACLS, extended attributes. It also requires more space — if the file doesn't change size significantly, it requires 3 times the space of the original file (formally, it needs size(old) + size(new) + max(size(old), size(new)) blocks). In its favour is that it is recoverable even if something goes wrong during the final copy — even a stray SIGKILL — as long as the temporary files have known names (the names can be determined).
Automatic recovery from SIGKILL probably isn't feasible. A SIGSTOP signal could be problematic too; a lot could happen while the process is stopped.
I hope it goes without saying that errors must be detected and handled carefully with all the system calls used.
If there isn't enough space on the target file system for all the copies of the files, or if the process cannot create files in the target directory (even though it can modify the original file), you have to consider what the alternatives are. Can you identify another file system with enough space? If there isn't enough space anywhere for both the old and the new file, you clearly have major issues — irresolvable ones for anything approaching atomicity.
The answer by Nominal Animal mentions Linux capabilities. Since the question is tagged POSIX and not Linux, it isn't clear whether those are applicable to you. However, if they can be used, then CAP_LEASE sounds useful.
How crucial is atomicity vs accuracy?
How crucial is POSIX compliance vs working on Linux (or any other specific POSIX implementation)?
I've seen a few questions here on Stack overflow dealing with the same issues, but no definite answer. I thought I'll ask again, with a bunch of questions of my own. All relate to the subject matter at hand.
So, do we know when the data transfer from host to the openCL device occurs? Can you tell me the exact memory transfer operation of the functions below (that is, what data is transferred or created, if any, when these functions are invoked?):
clCreateBuffer()
clSetKernelArg()
clEnqueueNDRangeKernel()
The first two don't even produce events, so we can't time them, but surely some data transferring is happening here.
Is there a way to transfer data to a device without first setting it as a kernel arg?
It appears (from preliminary testing of my own) that a mem object created with CL_MEM_USE_HOST_PTR gets directly manipulated by the device. Why would that not be desirable, since, that way, we could avoid further data transfer commands (and surely the driver implements this in the most efficient way)?
Does transferred data (say, as par of a kernel arg) stay at the device for further manipulation, after a kernel returns? If not is there a way to do just that?
Buffer copies are related to command queues. Command queues are synced with host using finish() as easiest way.
clCreateBuffer()
clEnqueueWriteBuffer() <-------- you can get event data from this
(set blocking parameter to false to queue everything quickly)
(set blockinig to true if you sync write here)
clSetKernelArg()
clEnqueueWriteBuffer() <----- it could be here too
clEnqueueNDRangeKernel()
clEnqueueWriteBuffer() <----- or here (too quickly re-set an array?)
clFinish() <--------- this ensures all queued commands are executed before this
now you can query data of that event to check when it started and when ended
to let a buffer stay in device, you should create it in device first then don't migrate it to another device. Using only CL_MEM_READ_WRITE flag in createBuffer() is enough to make it a real buffer on device-side until you release that buffer.
CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR uses host memory as device maps it to its cores. This is faster for streaming data in and out because of not-needing of extra data movements in host side. If you need to use device memory such as fast gddr5 or hbm always, then you should not use these flags.
Copy to device once, use as much as you want. If device has its own memory of course. For example, Intel HD Graphics 400 doesn't have its own memory and shares RAM so it is much faster to use CL_MEM_..._HOST_PTR flags and especially USE_HOST_PTR.
To check if device shares RAM with CPU, you query CL_DEVICE_HOST_UNIFIED_MEMORY property of device.
It appears (from preliminary testing of my own) that a mem object
created with CL_MEM_USE_HOST_PTR gets directly manipulated by the
device
Even without map/unmap commands pror to kernel execution, my computer is behaving same, but I'm using map/unmap just to be safe and it doesn't tax too many cycles.
Edit: if you want to make sure a command doesn't start before you want, you can add a user event in event list input parameter of bufferwrite command. Then you can trigger the user event to let writing start because commands wait for all events in the list to be fired+completed before continuing (if there are any specified in event list input parameter)
I'm developing a little software in C that reads and writes messages in a notice-board. Every message is a .txt named with a progressive number.
The software is multithreading, with many users that can do concurrent operations.
The operations that a user can do are:
Read the whole notice-board (concatenation of all the .txt file contents)
Add a message (add a file named "id_max++.txt")
Remove a message. When a message is removed there will be a hole in that number (e.g, "1.txt", "2.txt", "4.txt") that will never be filled up.
Now, I'd like to know if there is some I/O problem (*) that I should manage (and how) or the OS (Unix-like) does it all by itself.
(*) such as 2 users that want to read and delete the same file
As you have an Unix-like, OS will take care of deleting a file while it is open by another thread : the directory entry is immediately removed, and the file itself (inode) is deleted on last close.
The only problem I can see is between the directory scan and the open of a file : race conditions could make that the file has been deleted.
IMHO you simply must considere that an error file does not exist is normal, and simply go to next file.
What you describe is not really bad, since it is analog to MH folders for mails, and it can be accessed by many different processes, even if locking is involved. But depending on the load and on the size of the messages, you could considere using a database. Rule of thumb (my opinion) :
few concurrent accesses and big files : keep on using file system
many accesses and small files (several ko max.) : use a database
Of course, you must use a mutex protected routine to find next number when creating a new message (credits should be attributed to #merlin2011 for noticing the problem).
You said in a comment that your specs do not allow a database. On the analogy with mail handling, you could alse use a single file (like traditionnal mail format) :
one single file
each message is preceded with a fixed size header saying whether it is active or deleted
read access need not be synchronized
write accesses must be synchronized
It would be a poor man's database where all synchronization is done by hand, but you have only one file descriptor per thread and save all open and close operations. It makes sense where there are many reads and few writes or deletes
A possible improvement would be (still like mail readers do) to build an index with the offset and status of each message. The index could be on disk or in memory depending on your requirements.
The easier solution is to use a database like sqlite or MySQL, both of which provide transactions that you can use ot achieve consistency. If you still want to go down the route, read on.
The issue is not an IO problem, it's a concurrency problem if you do not implement proper monitors. Consider the following scenario (it is not the only problematic one, but it is one example of one).
User 1 reads the maximum id and stores it in a local variable.
Meanwhile, User 2 reads the same maximum id and stores it in a local variable also.
User 1 writes first, and then User 2 overwrites what User 1 just wrote, because it had the same idea of what the maximum id was.
This particular scenario can be solved by keeping the current maximum id as a variable that is initialized when the program is initialized, and protecting the get_and_increment operation with a lock. However, this is not the only problematic scenario that you will need to reason through if you go with this approach.
While wandering the web looking for explanations of how to read/write MFT I found the folowing section:(http://www.installsetupconfig.com/win32programming/1996%20AppE_apnilife.pdf)
If NtfsProtectSystemFiles is set to FALSE, then the special files can
be opened. There are, however, some drawbacks associated with
attempting to do this: Because many of the special files are opened in
a special way when mounting the volume, they are not prepared to
handle the IRP_MJ_READ requests resulting from a call to ZwReadFile,
and the system crashes if such a request is received. These special
files can be read by mapping the special file with ZwCreateSection and
ZwMapViewOfSection and then reading the mapped data. A further problem
is that a few of the special files are not prepared to handle the
IRP_MJ_CLEANUP request that is generated when the last handle to a
file object is closed, and the system crashes if such a request is
received. The only option is to duplicate the open handle to the
special file into a process that never terminates (such as the system
process).
What does it mean “they are not prepared to handle the IRP_MJ_READ requests” what kind of preparation is needed? What is IRP_MJ_READ?
“Mapping the special file with ZwCreateSection and ZwMapViewOfSection and then reading the mapped data” How does that solve the problem?
What does it means “files are not prepared to handle the IRP_MJ_CLEANUP request that is generated when the last handle to a file object is closed” again what is that preparation? What is IRP_MJ_CLEANUP?
“Duplicate the open handle to the special file into a process that never terminates” How does that solve the problem?
That's old data (from 1996). And more than a little incorrect. The world has moved on since then.
You might try opening \$MFT to read the MFT but getting the access bits right might be problematic. You can also write them but that's really playing with fire. The file system does not expect that it's data structures will be modified without it's involvement.
You are much better off opening the partition raw and walking the disk structures directly.
I need to write something like 64 kB of data atomically in the middle of an existing file. That is all, or nothing should be written. How to achieve that in Linux/C?
I don't think it's possible, or at least there's not any interface that guarantees as part of its contract that the write would be atomic. In other words, if there is a way that's atomic right now, that's an implementation detail, and it's not safe to rely on it remaining that way. You probably need to find another solution to your problem.
If however you only have one writing process, and your goal is that other processes either see the full write or no write at all, you can just make the changes in a temporary copy of the file and then use rename to atomically replace it. Any reader that already had a file descriptor open to the old file will see the old contents; any reader opening it newly by name will see the new contents. Partial updates will never be seen by any reader.
There are a few approaches to modify file contents "atomically". While technically the modification itself is never truly atomic, there are ways to make it seem atomic to all other processes.
My favourite method in Linux is to take a write lease using fcntl(fd, F_SETLEASE, F_WRLCK). It will only succeed if fd is the only open descriptor to the file; that is, nobody else (not even this process) has the file open. Also, the file must be owned by the user running the process, or the process must run as root, or the process must have the CAP_LEASE capability, for the kernel to grant the lease.
When successful, the lease owner process gets a signal (SIGIO by default) whenever another process is opening or truncating the file. The opener will be blocked by the kernel for up to /proc/sys/fs/lease-break-time seconds (45 by default), or until the lease owner releases or downgrades the lease or closes the file, whichever is shorter. Thus, the lease owner has dozens of seconds to complete the "atomic" operation, without any other process being able to see the file contents.
There are a couple of wrinkles one needs to be aware of. One is the privileges or ownership required for the kernel to allow the lease. Another is the fact that the other party opening or truncating the file will only be delayed; the lease owner cannot replace (hardlink or rename) the file. (Well, it can, but the opener will always open the original file.) Also, renaming, hardlinking, and unlinking/deleting the file does not affect the file contents, and therefore are not affected at all by file leases.
Remember also that you need to handle the signal generated. You can use fcntl(fd, F_SETSIG, signum) to change the signal. I personally use a trivial signal handler -- one with an empty body -- to catch the signal, but there are other ways too.
A portable method to achieve semi-atomicity is to use a memory map using mmap(). The idea is to use memmove() or similar to replace the contents as quickly as possible, then use msync() to flush the changes to the actual storage medium.
If the memory map offset in the file is a multiple of the page size, the mapped pages reflect the page cache. That is, any other process reading the file, in any way -- mmap() or read() or their derivatives -- will immediately see the changes made by the memmove(). The msync() is only needed to make sure the changes are also stored on disk, in case of a system crash -- it is basically equivalent to fsync().
To avoid preemption (kernel interrupting the action due to the current timeslice being up) and page faults, I'd first read the mapped data to make sure the pages are in memory, and then call sched_yield(), before the memmove(). Reading the mapped data should fault the pages into page cache, and sched_yield() releases the rest of the timeslice, making it extremely likely that the memmove() is not interrupted by the kernel in any way. (If you do not make sure the pages are already faulted in, the kernel will likely interrupt the memmove() for each page separately. You won't see that in the process, but other processes see the modifications to occur in page-sized chunks.)
This is not exactly atomic, but it is practical: it does not give you any guarantees, only makes the race window very very short; therefore I call this semi-atomic.
Note that this method is compatible with file leases. One could try to take a write lease on the file, but fall back to leaseless memory mapping if the lease is not granted within some acceptable time period, say a second or two. I'd use timer_create() and timer_settime() to create the timeout timer, and the same empty-body signal handler to catch the SIGALRM signal; that way the fcntl() is interrupted (returns -1 with errno == EINTR) when the timeout occurs -- with the timer interval set to some small value (say 25000000 nanoseconds, or 0.025 seconds) so it repeats very often after that, interrupting syscalls if the initial interrupt is missed for any reason.
Most userspace applications create a copy of the original file, modify the contents of the copy, then replace the original file with the copy.
Each process that opens the file will only see complete changes, never a mix of old and new contents. However, anyone keeping the file open, will only see their original contents, and not be aware of any changes (unless they check themselves). Most text editors do check, but daemons and other processes do not bother.
Remember that in Linux, the file name and its contents are two separate things. You can open a file, unlink/remove it, and still keep reading and modifying the contents for as long as you have the file open.
There are other approaches, too. I do not want to suggest any specific approach, because the optimal one depends heavily on the circumstances: Do the other processes keep the file open, or do they always (re)open it before reading the contents? Is atomicity preferred or absolutely required? Is the data plain text, structured like XML, or binary?
EDITED TO ADD:
Please note that there are no ways to guarantee beforehand that the file will be successfully modified atomically. Not in theory, and not in practice.
You might encounter a write error with the disk full, for example. Or the drive might hiccup at just the wrong moment. I'm only listing three practical ways to make it seem atomic in typical use cases.
The reason write leases are my favourite is that I can always use fcntl(fd,F_GETLEASE,&ptr) to check whether the lease is still valid or not. If not, then the write was not atomic.
High system load is unlikely to cause the lease to be broken for a 64k write, if the same data has been read just prior (so that it will likely be in page cache). If the process has superuser privileges, you can use setpriority(PRIO_PROCESS,getpid(),-20) to temporarily raise the process priority to maximum while taking the file lease and modifying the file. If the data to be overwritten has just been read, it is extremely unlikely to be moved to swap; thus swapping should not occur, either.
In other words, while it is quite possible for the lease method to fail, in practice it is almost always successful -- even without the extra tricks mentioned in this addendum.
Personally, I simply check if the modification was not atomic, using the fcntl() call after the modification, prior to msync()/fsync() (making sure the data hits the disk in case a power outage occurs); that gives me an absolutely reliable, trivial method to check whether the modification was atomic or not.
For configuration files and other sensitive data, I too recommend the rename method. (Actually, I prefer the hardlink approach used for NFS-safe file locking, which amounts to the same thing but uses a temporary name to detect naming races.) However, it has the problem that any process keeping the file open will have to check and reopen the file, voluntarily, to see the changed contents.
Disk writes cannot be atomic without a layer of abstraction. You should keep a journal and revert if a write is interrupted.
As far as I know a write below the size of PIPE_BUF is atomic. However I never rely on this. If the programs that access the file are written by you, you can use flock() to achieve exclusive access. This system call sets a lock on the file and allows other processes that know about the lock to get access or not.