A program Foo periodically updates a file and calls my C program Bar to process the file.
The issue is that Foo might update the file, call Bar to process it, and then, while Bar is still reading the file, update it again.
Is it possible for Bar to read the file in inconsistent state, e.g. read first half of the file as written by first Foo and the other half as written by the second Foo? If so, how would I prevent that, assuming I can modify only Bar's code?
Typically, Foo should not simply rewrite the contents of the file again and again, but create a new temporary file, and replace the old file with the temporary file when it is done (using rename(); link() cannot replace an existing name). In this case, simply opening the file (at any point in time) will give the reader a consistent snapshot of the contents, because of how typical POSIX filesystems work. (After opening the file, the file descriptor will refer to the same inode/contents, even if the file gets deleted or replaced; the disk space will be released only after the last open file descriptor of a deleted/replaced file is closed.)
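To illustrate, here is a minimal sketch of that writer-side pattern, in case Foo can be fixed (the function name and the fixed-size name buffer are just for illustration):

    /* Hedged sketch: write a new version of a file so that readers
     * always see either the old or the new contents, never a mix. */
    #include <stdio.h>
    #include <unistd.h>

    int replace_file(const char *path, const char *data, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.tmp.%ld", path, (long)getpid());

        FILE *out = fopen(tmp, "w");
        if (!out)
            return -1;
        if (fwrite(data, 1, len, out) != len || fflush(out) != 0) {
            fclose(out);
            unlink(tmp);
            return -1;
        }
        fsync(fileno(out));    /* data on disk before it becomes visible */
        fclose(out);

        /* rename() atomically replaces the old name; a reader that
         * already has the old file open keeps its consistent snapshot. */
        if (rename(tmp, path) != 0) {
            unlink(tmp);
            return -1;
        }
        return 0;
    }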
If Foo does rewrite the same file (without a temporary file) over and over, the recommended solution would be for both Foo and Bar to use fcntl()-based advisory locking. (However, using a temporary file and renaming it over the actual file when complete would be even better.)
(While flock()-based locking might seem easier, it is a bit of a guessing game whether it works on NFS mounts or not. fcntl() works, unless the NFS server is configured not to support locking, which is a bit of an issue on some commercial web hosts.)
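For reference, a minimal sketch of what the locking could look like on Bar's side (Foo would take an F_WRLCK lock around its writes the same way; the function name is illustrative):

    /* Hedged sketch: take a shared (read) lock on the whole file before
     * processing it. Advisory locks only help if Foo locks too. */
    #include <fcntl.h>
    #include <unistd.h>

    int process_locked(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd == -1)
            return -1;

        struct flock fl = {
            .l_type   = F_RDLCK,    /* shared lock */
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 0,          /* 0 = to end of file, i.e. whole file */
        };
        if (fcntl(fd, F_SETLKW, &fl) == -1) {   /* block until granted */
            close(fd);
            return -1;
        }

        /* ... read and process the file ... */

        close(fd);    /* closing the descriptor releases the lock */
        return 0;
    }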
If you cannot modify the behaviour of Foo, and it does not use advisory locking, there are still some options in Linux.
If Foo closes the file -- i.e., Bar is the only one to have the file open -- then taking an exclusive file lease (using fcntl(descriptor, F_SETLEASE, F_WRLCK)) is a workable solution. You can only get an exclusive file lease if descriptor is the only open descriptor on the file, and the owner user of the file is the same as the process UID (or the process has the CAP_LEASE capability). If any other process tries to open or truncate the file, the lease owner gets signaled (SIGIO by default), and has up to /proc/sys/fs/lease-break-time seconds to downgrade or release the lease. The opener is blocked for the duration, which allows Bar to either cancel the processing, or copy the file for later processing.
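A minimal sketch of the lease approach (Linux-specific; the signal handling is reduced to a bare flag for brevity):

    /* Hedged sketch: take an exclusive lease, process the file, and
     * release the lease. Works only if this is the sole open descriptor
     * and the process owns the file (or has CAP_LEASE). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <unistd.h>

    static volatile sig_atomic_t lease_broken = 0;
    static void on_sigio(int sig) { (void)sig; lease_broken = 1; }

    int process_with_lease(const char *path)
    {
        signal(SIGIO, on_sigio);    /* sent when someone opens/truncates */

        int fd = open(path, O_RDONLY);
        if (fd == -1)
            return -1;

        if (fcntl(fd, F_SETLEASE, F_WRLCK) == -1) {
            close(fd);    /* file already open elsewhere, or no permission */
            return -1;
        }

        /* ... process the file; if lease_broken becomes nonzero, finish
         * or copy the file within the lease-break-time grace period ... */

        fcntl(fd, F_SETLEASE, F_UNLCK);   /* release the lease */
        close(fd);
        return 0;
    }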
The other option for Bar is rather violent. It can monitor the file, say once per second, and when the file is old enough -- say, a few seconds -- pause Foo by sending it a SIGSTOP signal, check /proc/FOOPID/stat until Foo is actually stopped, re-check the file statistics to verify the file is still old, and then make a temporary copy of it (either in memory or on disk) for processing. After the file is read/copied, Bar can let Foo continue by sending it a SIGCONT signal.
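A rough sketch of that approach (the /proc polling and error handling are omitted; the three-second threshold is arbitrary):

    /* Hedged sketch: pause Foo, copy the file while Foo is stopped,
     * then let Foo continue. */
    #include <signal.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <time.h>

    void copy_when_idle(pid_t foo_pid, const char *path)
    {
        struct stat st;
        if (stat(path, &st) != 0 || time(NULL) - st.st_mtime < 3)
            return;                  /* file missing or too fresh; retry later */

        kill(foo_pid, SIGSTOP);      /* pause Foo */
        /* (a robust version would poll /proc/foo_pid/stat until the state
         * field shows 'T', and then re-check st_mtime before copying) */

        /* ... copy the file to a temporary location for processing ... */

        kill(foo_pid, SIGCONT);      /* let Foo continue */
    }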
Some filesystems may support file snapshots, but in my opinion, either of the above is much saner than relying on nonstandard filesystem support to function correctly. If Foo cannot be modified to co-operate, it is time to refactor it out of the picture. You do not want to be held hostage by a black box out of your control, so the sooner you replace it with something more user/administrator-friendly, the better off you'll be in the long term.
This is difficult to do robustly without Foo's cooperation.
Unixes have two main kinds of file locking:
range locking with fcntl(2)
always-whole-file locking with flock(2)
Ideally, you use either of these in cooperative mode (advisory locking), where all participants attempt to acquire the lock and only one will get it at a time.
Without the other program's cooperation, your only recourse, as far as I know, is mandatory locking, which you can have with fcntl if you allow it on the filesystem, but the manpage mentions that the Linux implementation is unreliable.
On all UN*X systems, what is guaranteed to happen atomically is the individual write(2) or read(2) system call. The kernel even locks the file's inode in memory, so the file cannot change while you are read(2)ing or write(2)ing it.
For atomicity over anything larger than a single call, you have to lock the file, or at least the regions you care about. The available file locking tools let you lock different regions of a file. Some locks are advisory (another process can simply skip over them) and others are mandatory (you are blocked until the other side unlocks the file region).
See fcntl(2) and the commands F_GETLK, F_SETLK and F_SETLKW to query an existing lock, set a lock without blocking, and set a lock and wait for it, respectively.
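For example, F_GETLK can be used to ask who would block a lock you would like to take (a small sketch; fd is assumed to be an already-open descriptor):

    /* Hedged sketch: check whether another process holds a lock that
     * would conflict with a write lock on the first 100 bytes. */
    #include <fcntl.h>
    #include <stdio.h>

    void report_lock(int fd)
    {
        struct flock fl = {
            .l_type   = F_WRLCK,    /* the lock we would like to take */
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 100,
        };
        if (fcntl(fd, F_GETLK, &fl) == -1)
            return;
        if (fl.l_type == F_UNLCK)
            printf("region is unlocked\n");
        else
            printf("conflicting lock held by pid %ld\n", (long)fl.l_pid);
    }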
Suppose two different processes open the same file independently, and so have different entries in the open file table (system-wide), but these refer to the same i-node entry.
As the file descriptors refer to different entries in the open file table (system-wide), they may have different file offsets. Will there be any chance of a race condition during write(), as the file offsets are different? And how does the kernel avoid it?
Book: The Linux Programming Interface; page 95; Chapter 5 (File I/O: Further Details); Section 5.4
Each write() operation is supposed to be fully atomic, assuming a POSIX system (presumed from the use of write()).
Per POSIX 7's 2.9.7 Thread Interactions with Regular File Operations:
All of the following functions shall be atomic with respect to each
other in the effects specified in POSIX.1-2017 when they operate on
regular files or symbolic links:
chmod()
chown()
close()
creat()
dup2()
fchmod()
fchmodat()
fchown()
fchownat()
fcntl()
fstat()
fstatat()
ftruncate()
lchown()
link()
linkat()
lseek()
lstat()
open()
openat()
pread()
read()
readlink()
readlinkat()
readv()
pwrite()
rename()
renameat()
stat()
symlink()
symlinkat()
truncate()
unlink()
unlinkat()
utime()
utimensat()
utimes()
write()
writev()
If two threads each call one of these functions, each call shall
either see all of the specified effects of the other call, or none of
them. The requirement on the close() function shall also apply
whenever a file descriptor is successfully closed, however caused (for
example, as a consequence of calling close(), calling dup2(), or of
process termination).
But pay particular attention to the word "attempt" in the specification for write():
The write() function shall attempt to write nbyte bytes ...
POSIX says that write() calls to a file shall be atomic. POSIX does not say that the write() calls will be complete. Here's a Linux bug report where a signal was interrupting a write() that was partially complete. Note the explanation:
Now this is perfectly valid behavior as far as spec (POSIX, SUS,...) is concerned (please correct me if I'm missing something). So I'd say the program is incorrect. But OTOH I agree that this was not possible before a50527b1 and we don't want to break userspace. I'd hate to revert that commit since it allows us to interrupt processes doing large writes (especially when something goes wrong) but if you explain to us why this behavior is a problem for you then I guess I'll have to revert it.
That's all but admitting that there's a POSIX requirement for write() calls to be atomic, if not complete, with an offer to revert to the earlier behavior where the write() calls apparently were all also complete in this same circumstance.
Note, though, that there are lots of file systems out there that don't conform to POSIX standards.
As the file descriptors refer to different entries in the open file table (system-wide), they may have different file offsets. Will there be any chance of a race condition during write(), as the file offsets are different?
Any write() in Linux can return a short count, for example due to a signal being delivered to a userspace handler. For simplicity, let's ignore that, and only consider what happens to the successfully written data.
There are two scenarios:
The regions written to do not overlap.
(For example, one process writes 100 bytes starting at offset 23, and another writes 50 bytes starting at offset 200.)
There is no race condition in this case.
The regions written to do overlap.
(For example, one process writes 100 bytes starting at offset 50, and another writes 10 bytes starting at offset 70.)
There is a race condition. It is impossible to predict (without advisory locks etc.) the order in which the data gets updated.
Depending on the target filesystem, and if the writes are large enough (so that paging effects can be observed), the two writes may even be "mixed" in page-sized chunks in Linux, on machines with more than one hardware thread, even though POSIX says this shouldn't happen.
Normally, writes go through the Linux page cache. It is possible for one of the processes to have opened the file with O_DIRECT | O_SYNC, bypassing the page cache. In that case, there are many additional corner cases that can occur. Specifically, even if you use a shared clock source, and can show that the normal/page-cached write completed before the direct write call was made, it may still be possible for the page-cached write to overwrite the direct write contents.
And how does the kernel avoid it?
It doesn't. Why should it? POSIX says each write is atomic, but there is no practical way to avoid a race condition relying on that alone (and get consistent and expected results).
Userspace programs have at least four different methods to avoid such races:
Advisory file locks on the entire open file using the flock() interface (see the sketch after this list).
Advisory file locks on the entire open file using the lockf() interface. In Linux, these are just shorthand for placing/removing fcntl() advisory locks on the entire file.
Advisory record locks on the file using the fcntl() interface. This works even across shared volumes, as long as the file server is configured to support file locking.
Obtaining an exclusive lease on the open file using the fcntl() interface.
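As a small illustration of the first method, a whole-file flock() around a write might look like this (a hedged sketch; every writer must follow the same convention for it to help):

    /* Hedged sketch: exclusive whole-file advisory lock around a write. */
    #include <sys/file.h>
    #include <unistd.h>

    int write_locked(int fd, const void *buf, size_t len)
    {
        if (flock(fd, LOCK_EX) == -1)    /* block until we own the lock */
            return -1;

        ssize_t n = write(fd, buf, len); /* protected region */

        flock(fd, LOCK_UN);              /* release */
        return (n == (ssize_t)len) ? 0 : -1;
    }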
Advisory file locks are like street lights: they are intended for co-operating processes to easily determine who gets to go when. However, they do not stop any other process from actually ignoring the "lock" and accessing the file.
File leases are a mechanism where one or more processes can get a read lease at the same time on the same file, but only one process can get a write lease, and only when that process is the only one having the file open. When granted, the write lease (or exclusive lease) means that if any other process tries to open the same file, the lease owner process is notified by a signal (which you can control using the fcntl() interface), and has a configured time (typically 45 seconds; see man 5 proc and /proc/sys/fs/lease-break-time, in seconds) to relinquish the lease. The opener is blocked in the kernel until the lease is downgraded or the lease break time passes, in which case the kernel breaks the lease.
This allows the lease holder to postpone the opening for a short while.
However, the lease holder cannot block the opening, and cannot e.g. replace the file with a decoy one; the opener already has a hold on the inode, and the lease break time is just a grace period for cleanup work.
Technically, a fifth method would be mandatory file locking, but aside from the kernel's use with respect to executed binaries, they're not used, and are actually buggy in Linux anyway. In Linux, inodes are only locked against modification when the inode is being executed as a binary by the kernel. (You can still rename or delete the original file, and create a new one, so that any subsequent execs will execute the modified/new data. Attempts to modify a file that is being executed as a binary will fail with error ETXTBSY.)
This is a design question more than a coding problem. I have a parent process that will fork many children. Each of the children is supposed to read and write on the same text file.
How can we achieve this safely?
My thoughts:
Create the file pointer in the parent, then create a binary semaphore on it. Processes will compete to obtain the file pointer and write to the file. In the read case, I don't need a semaphore.
Please tell me if I got it wrong.
I am using C under Linux.
Thank you.
POSIX systems have kernel-level file locks using fcntl and/or flock. Their history is a bit complicated and their use and semantics are not always obvious, but they do work, especially in simple cases. For locking an entire file, flock is easier to use IMO. If you need to lock only parts of a file, fcntl provides that ability.
As an aside, file locking over NFS is not safe on all (most?) platforms.
man 2 flock
man 2 fcntl
http://en.wikipedia.org/wiki/File_locking#In_Unix-like_systems
Also, keep in mind that file locks are "advisory" only. They don't actually prevent you from writing/reading/etc to a file if you bypass acquiring the lock.
If writers are appending data to the file, your approach seems fine (at least up until the file becomes too large for the file system).
If writers are doing file replacement, then I would approach it something like this:
The reading API would check the time of last modification (with stat() on the file name, since the writer replaces the file) against a cached value. If the time has changed, the file is re-opened, and the cached modification time updated, before the read is performed.
The writing API would acquire a lock, and write to a temporary file. Then, the actual data file is replaced by calling rename(), after which the lock is released.
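A minimal sketch of that reading side (the names are illustrative; note that it stat()s the path rather than fstat()ing the old descriptor, since rename() replaces the inode behind the name):

    /* Hedged sketch: re-open the data file whenever its modification
     * time changes, otherwise keep using the cached handle. */
    #include <stdio.h>
    #include <sys/stat.h>

    static FILE  *cached_fp = NULL;
    static time_t cached_mtime = 0;

    FILE *fresh_handle(const char *path)
    {
        struct stat st;
        if (stat(path, &st) != 0)
            return NULL;

        if (!cached_fp || st.st_mtime != cached_mtime) {
            if (cached_fp)
                fclose(cached_fp);
            cached_fp = fopen(path, "r");   /* picks up the renamed file */
            cached_mtime = st.st_mtime;
        }
        return cached_fp;
    }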
If writers can write anywhere in the file, then you probably want a more structured file than just plain text, similar to a database. In such a case, some kind of reader-writer lock should be used to manage data consistency and data integrity.
Is there any way in Linux (or more generally in a POSIX OS) to guarantee that during the execution of a program, no file descriptors will be reused, even if a file is closed and another opened? My understanding is that this situation would usually lead to the file descriptor for the closed file being reassigned to the newly opened file.
I'm working on an I/O tracing project and it would make life simpler if I could assume that after an open()/fopen() call, all subsequent I/O to that file descriptor is to the same file.
I'll take either a compile-time or run-time solution.
If it is not possible, I could do my own accounting when I process the trace file (noting the location of all open and close calls), but I'd prefer to squash the problem during execution of the traced program.
Note that POSIX requires:
The open() function shall return a file descriptor for the named file
that is the lowest file descriptor not currently open for that
process.
So in the strictest sense, your request will change the program's environment to be no longer POSIX compliant.
That said, I think your best bet is to use the LD_PRELOAD trick to intercept calls to close and ignore them.
You'd have to write a shared object containing a close(2) replacement that opens /dev/null onto old FDs, and then use $LD_PRELOAD to load it into the process space before starting the application.
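A hedged sketch of such a shared object (the file name noreuse.c is made up; compile with gcc -shared -fPIC -o noreuse.so noreuse.c -ldl and run the program with LD_PRELOAD=./noreuse.so; note that libc-internal close calls may bypass the interposer):

    /* Hedged sketch: replace close() so descriptors are parked on
     * /dev/null instead of being released, preventing their re-use. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <unistd.h>

    int close(int fd)
    {
        static int (*real_close)(int);
        if (!real_close)
            real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");

        int devnull = open("/dev/null", O_RDWR);
        if (devnull == -1)
            return -1;
        if (dup2(devnull, fd) == -1) {      /* fd now refers to /dev/null */
            real_close(devnull);
            return -1;
        }
        return real_close(devnull);         /* drop the temporary descriptor */
    }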
You must already be ptracing the application to intercept its file opening and closing operations.
It would appear trivial to prevent FD re-use by "injecting" dup2(X, Y); close(X); calls into the application, and adjusting Y to be anything you want.
However, the application itself could be using dup2 to force a re-use of a previously closed FD, and may not work if you prevent that, so I think you'll just have to deal with this in a post-processing step.
Also, it's quite easy to write an app that will run out of FDs if you disallow re-use.
I'm writing a web server.
Each connection is served by a separate thread, so I don't know in advance the number of threads.
There is also a group of text files (I don't know their number either), and each thread can read/write on each file.
A file can be written by just one thread at a time, but different threads can write to different files at the same time.
If a file is read by one or more threads (reads can be concurrent), no thread can write on THAT file.
Now, I noticed this solution (Thread safe multi-file writing), but I'd also like to use functions such as fgets(), for example.
So, can I flock() a file, and then use fgets() or another stdio read/write library function?
First of all, use fcntl, not flock. The latter is a non-standard, deprecated BSD function and does not work with NFS and possibly other filesystems. fcntl locking, on the other hand, is POSIX standard and is intended to work everywhere.
Now if you want to use file-level reader-writer locking mixed with stdio, it will work, but you have to take some care to ensure that buffering does not break your assumptions about locks. The method I'm about to explain is not the only one, but I believe it's the clearest/simplest:
When you want to operate on one of your files with stdio, obtaining the correct type of lock (read or write, aka shared or exclusive) should be the first thing you do after fopen. Use fileno to get the file descriptor number and apply the lock to it. After that, perform your entire read or write operation. Do not make any attempt to unlock the file; instead, call fclose to close the file and let the lock be released implicitly when the descriptor is closed. Otherwise you may release the lock while buffered data is still unwritten, or later read data that was buffered before the lock was released and is no longer valid afterwards.
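Putting it together, a hedged sketch of a reader following that recipe (the function name is illustrative):

    /* Hedged sketch: shared fcntl() lock taken right after fopen(),
     * held for the whole stdio read, released by fclose(). */
    #include <fcntl.h>
    #include <stdio.h>

    int print_file_locked(const char *path)
    {
        FILE *fp = fopen(path, "r");
        if (!fp)
            return -1;

        struct flock fl = {
            .l_type   = F_RDLCK,    /* shared lock for reading */
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 0,          /* whole file */
        };
        if (fcntl(fileno(fp), F_SETLKW, &fl) == -1) {
            fclose(fp);
            return -1;
        }

        char line[1024];
        while (fgets(line, sizeof line, fp))
            fputs(line, stdout);

        fclose(fp);    /* closes the descriptor, releasing the lock */
        return 0;
    }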
I am using fcntl locks in C on Linux and have the dilemma of trying to delete a file that may possibly be locked by other processes that also use the fcntl locking mechanism. What would be the preferred way of handling this file, which must be deleted? Should I simply delete the file without regard for other processes that may hold read locks, or is there a better way?
Any help would be much appreciated.
On UNIX systems, it is possible to unlink a file while it is still open; doing so decrements the reference count on the file, but the actual file and its inode remain around until the reference count goes to zero.
As others have noted, you are free to delete the file even while you hold the lock.
Now, a cautionary note: you didn't mention why processes are locking this file, but you should be aware that if you are using that file for interprocess synchronization, deleting it is a good way to introduce subtle race conditions into your system, basically because there's no way to atomically create AND lock the file in a single operation.
For example, process AA might create the file, with the intention of locking it immediately to do whatever updates it needs to do. However, there's nothing to prevent process BB from grabbing the lock on the file first, then deleting the file, leaving process AA with a handle to the now deleted file. Process AA will still be able to lock and update that file, but those updates will effectively be "lost" because the file's already been deleted.
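A common mitigation, sketched below, does not make create-and-lock atomic, but it does detect losing the race described above: after taking the lock, verify that the path still refers to the inode you locked, and retry if it does not (names and mode are illustrative):

    /* Hedged sketch: open-or-create, lock, then confirm the locked
     * inode is still what the path names; otherwise retry. */
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int open_and_lock(const char *path)
    {
        for (;;) {
            int fd = open(path, O_RDWR | O_CREAT, 0666);
            if (fd == -1)
                return -1;

            struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
            if (fcntl(fd, F_SETLKW, &fl) == -1) {
                close(fd);
                return -1;
            }

            struct stat by_fd, by_path;
            if (fstat(fd, &by_fd) == 0 && stat(path, &by_path) == 0 &&
                by_fd.st_ino == by_path.st_ino && by_fd.st_dev == by_path.st_dev)
                return fd;    /* we hold the lock on the live file */

            close(fd);        /* file was deleted/replaced meanwhile; retry */
        }
    }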
Moreover, locks on UNIX systems are advisory by default, not mandatory, so locking a file does not prevent it from being opened or unlinked, just from being locked again.