According to man ioctl, opening file descriptors with open may cause unwanted side effects. The manual also states that opening with O_NONBLOCK avoids those issues, but I can't seem to find the reason for that, nor what the actual side effects are. Can someone shed light on that? When using ioctl, is it always possible, and equivalent*, to open file descriptors with O_NONBLOCK?
NOTES (from man ioctl)
In order to use this call, one needs an open file descriptor. Often the open(2) call has unwanted side effects, that can be avoided
under Linux by giving it the O_NONBLOCK flag.
(* I am aware of what O_NONBLOCK implies, but I don't know if that affects ioctl calls the same way it affects other syscalls. My program, which uses ioctl to write and read from an SPI bus, works perfectly with that flag enabled.)
The obvious place to look for the answer would be the open(2) man page, which, under the heading for O_NONBLOCK, says:
When possible, the file is opened in nonblocking mode.
Neither the open() nor any subsequent operations on the file
descriptor which is returned will cause the calling process to
wait.
[...]
For the handling of FIFOs (named pipes), see also fifo(7).
For a discussion of the effect of O_NONBLOCK in conjunction
with mandatory file locks and with file leases, see fcntl(2).
OK, that wasn't very informative, but let's follow the links and see what the manual pages for fifo(7) and fcntl(2) say:
Normally, opening the FIFO blocks until the other end is opened also.
A process can open a FIFO in nonblocking mode. In this case, opening
for read-only will succeed even if no-one has opened on the write
side yet and opening for write-only will fail with ENXIO (no such
device or address) unless the other end has already been opened.
Under Linux, opening a FIFO for read and write will succeed both in
blocking and nonblocking mode. POSIX leaves this behavior undefined.
So here's at least one "unwanted side effect": even just trying to open a FIFO may block, unless you pass O_NONBLOCK to the open() call (and open it for reading).
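For example, a reader can avoid that blocking open like this (a minimal sketch; "/path/to/fifo" is just a placeholder):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Opening a FIFO read-only with O_NONBLOCK succeeds immediately,
       even if no writer has opened the other end yet.                */
    int fd = open("/path/to/fifo", O_RDONLY | O_NONBLOCK);
    if (fd == -1) {
        perror("open");
        return 1;
    }
    /* Reads return 0 while there is no writer, or fail with EAGAIN
       when a writer is connected but no data is available yet.       */
    close(fd);
    return 0;
}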
What about fcntl, then? As the open(2) man page notes, the sections to look under are those titled "Mandatory locking" and "File leases". It seems to me that mandatory locks are a red herring here, though: they only cause blocking when one tries to actually read from or write to the file:
If a process tries to perform an incompatible
access (e.g., read(2) or write(2)) on a file region that has an
incompatible mandatory lock, then the result depends upon whether the
O_NONBLOCK flag is enabled for its open file description. If the
O_NONBLOCK flag is not enabled, then the system call is blocked until
the lock is removed or converted to a mode that is compatible with
the access.
What about leases, then?
When a process (the "lease breaker") performs an open(2) or
truncate(2) that conflicts with a lease established via F_SETLEASE,
the system call is blocked by the kernel and the kernel notifies the
lease holder by sending it a signal (SIGIO by default). [...]
Once the lease has been voluntarily or forcibly removed or
downgraded, and assuming the lease breaker has not unblocked its
system call, the kernel permits the lease breaker's system call to
proceed.
[...] If
the lease breaker specifies the O_NONBLOCK flag when calling open(2),
then the call immediately fails with the error EWOULDBLOCK, but the
other steps still occur as described above.
OK, so that's another unwanted side effect: if the file you're trying to open has a lease on it, the open() call may block until the lease holder has released the lease.
In both cases, the "unwanted side effect" avoided by passing O_NONBLOCK to open() is, unsurprisingly, the open() call itself blocking until some other process has done something. If there are any other kinds of side effects that the man page you cite refers to, I'm not aware of them.
From Linux System Programming, 2nd Edition, by R. Love (emphasis mine):
O_NONBLOCK
If possible, the file will be opened in nonblocking mode. Neither the open() call, nor any other operation will cause the process to block (sleep) on the I/O. This behaviour may be defined only for FIFOs.
Sometimes, programmers do not want a call to read() to block when there is no data available. Instead, they prefer that the call return immediately, indicating that no data is available. This is called nonblocking I/O; it allows applications to perform I/O, potentially on multiple files, without ever blocking, and thus missing data available in another file.
Consequently, an additional errno value is worth checking: EAGAIN.
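In practice, handling that looks something like this (a minimal sketch that puts standard input into nonblocking mode with fcntl(); the same applies to a descriptor opened with O_NONBLOCK):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Switch stdin to nonblocking mode (could also have been done by
       passing O_NONBLOCK to open()).                                 */
    int flags = fcntl(STDIN_FILENO, F_GETFL);
    fcntl(STDIN_FILENO, F_SETFL, flags | O_NONBLOCK);

    char buf[4096];
    ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        puts("no data available right now");   /* not a real error */
    else if (n == -1)
        perror("read");
    else
        printf("read %zd bytes\n", n);
    return 0;
}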
See this thread on the kernel mailing list.
Related
Suppose two different processes open the same file independently, and so have different entries in the Open file table (system-wide). But they refer to the same i-node entry.
As the file descriptors refer to different entries in the Open file table (system-wide), they may have different file offsets. Will there be any chance of a race condition during write, as the file offsets are different? And how does the kernel avoid it?
Book: The Linux Programming Interface; Page no. 95; Chapter-5 (File I/O: Further details); Section 5.4
Each write() operation is supposed to be fully atomic, assuming a POSIX system (presumed from the use of write()).
Per POSIX 7's 2.9.7 Thread Interactions with Regular File Operations:
All of the following functions shall be atomic with respect to each
other in the effects specified in POSIX.1-2017 when they operate on
regular files or symbolic links:
chmod()
chown()
close()
creat()
dup2()
fchmod()
fchmodat()
fchown()
fchownat()
fcntl()
fstat()
fstatat()
ftruncate()
lchown()
link()
linkat()
lseek()
lstat()
open()
openat()
pread()
read()
readlink()
readlinkat()
readv()
pwrite()
rename()
renameat()
stat()
symlink()
symlinkat()
truncate()
unlink()
unlinkat()
utime()
utimensat()
utimes()
write()
writev()
If two threads each call one of these functions, each call shall
either see all of the specified effects of the other call, or none of
them. The requirement on the close() function shall also apply
whenever a file descriptor is successfully closed, however caused (for
example, as a consequence of calling close(), calling dup2(), or of
process termination).
But pay particular attention to the specification for write() (bolding mine):
The write() function shall attempt to write nbyte bytes ...
POSIX says that write() calls to a file shall be atomic. POSIX does not say that the write() calls will be complete. Here's a Linux bug report where a signal was interrupting a write() that was partially complete. Note the explanation:
Now this is perfectly valid behavior as far as spec (POSIX, SUS,...) is concerned (please correct me if I'm missing something). So I'd say the program is incorrect. But OTOH I agree that this was not possible before a50527b1 and we don't want to break userspace. I'd hate to revert that commit since it allows us to interrupt processes doing large writes (especially when something goes wrong) but if you explain to us why this behavior is a problem for you then I guess I'll have to revert it.
That's all but admitting that there's a POSIX requirement for write() calls to be atomic, if not complete, with an offer to revert back to earlier behavior where the write() calls apparently were all also complete in this same circumstance.
Note, though, there are lots of file systems out there that don't conform to POSIX standards.
As the file descriptors refer to different entries in the Open file table (system-wide), they may have different file offsets. Will there be any chance of a race condition during write, as the file offsets are different?
Any write() in Linux can return a short count, for example due to a signal being delivered to a userspace handler. For simplicity, let's ignore that, and only consider what happens to the successfully written data.
There are two scenarios:
The regions written to do not overlap.
(For example, one process writes 100 bytes starting at offset 23, and another writes 50 bytes starting at offset 200.)
There is no race condition in this case.
The regions written to do overlap.
(For example, one process writes 100 bytes starting at offset 50, and another writes 10 bytes starting at offset 70.)
There is a race condition. It is impossible to predict (without advisory locks etc.) the order in which the data gets updated.
Depending on the target filesystem, and if the writes are large enough (so that paging effects can be observed), the two writes may even be "mixed" (in page-sized chunks) in Linux on some filesystems on machines with more than one hardware thread, even though POSIX says this shouldn't happen.
Normally, writes go through the Linux page cache. It is possible for one of the processes to have opened the file with O_DIRECT | O_SYNC, bypassing the page cache. In that case, there are many additional corner cases that can occur. Specifically, even if you use a shared clock source, and can show that the normal/page-cached write completed before the direct write call was made, it may still be possible for the page-cached write to overwrite the direct write contents.
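Returning to the non-overlapping case: for illustration, each process could address its own region explicitly with pwrite(), which takes an offset argument and neither uses nor updates the shared file offset (a minimal sketch; the file name, offsets and sizes are just the example values from above):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("shared.dat", O_WRONLY);   /* hypothetical shared file */
    if (fd == -1) {
        perror("open");
        return 1;
    }

    char data[100];
    memset(data, 'A', sizeof data);

    /* This process writes 100 bytes at offset 23; another process can
       write 50 bytes at offset 200 concurrently, since the regions do
       not overlap and pwrite() ignores the shared file offset.        */
    ssize_t n = pwrite(fd, data, sizeof data, 23);
    if (n == -1)
        perror("pwrite");
    else if ((size_t)n < sizeof data)
        fprintf(stderr, "short write: %zd bytes\n", n);  /* e.g. interrupted by a signal */

    close(fd);
    return 0;
}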
And how does the kernel avoid it?
It doesn't. Why should it? POSIX says each write is atomic, but relying on that alone is no practical way to avoid a race condition (and still get consistent, expected results).
Userspace programs have at least four different methods to avoid such races:
Advisory file locks on the entire open file using the flock() interface.
Advisory file locks on the entire open file using the lockf() interface. In Linux, these are just shorthand for placing/removing fcntl() advisory locks on the entire file.
Advisory record locks on the file using the fcntl() interface. This works even across shared volumes, as long as the file server is configured to support file locking.
Obtaining an exclusive lease on the open file using the fcntl() interface.
Advisory file locks are like street lights: they are intended for co-operating processes to easily determine who gets to go when. However, they do not stop any other process from actually ignoring the "lock" and accessing the file.
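For example, placing an advisory write lock on the whole file around an update with the fcntl() interface might look like this (a minimal sketch; it only helps if every cooperating process does the same):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("shared.dat", O_WRONLY);     /* hypothetical shared file */
    if (fd == -1) {
        perror("open");
        return 1;
    }

    struct flock lk;
    memset(&lk, 0, sizeof lk);
    lk.l_type   = F_WRLCK;                     /* exclusive (write) lock   */
    lk.l_whence = SEEK_SET;
    lk.l_start  = 0;
    lk.l_len    = 0;                           /* 0 = lock the whole file  */

    if (fcntl(fd, F_SETLKW, &lk) == -1) {      /* block until we get it    */
        perror("fcntl(F_SETLKW)");
        close(fd);
        return 1;
    }

    /* ... write() here; cooperating processes wait for the lock ... */

    lk.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &lk);                   /* release the lock         */
    close(fd);
    return 0;
}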
File leases are a mechanism whereby one or more processes can get a read lease at the same time on the same file, but only one process can get a write lease, and only when that process is the only one having the file open. When granted, the write lease (or exclusive lease) means that if any other process tries to open the same file, the lease owner process is notified by a signal (which you can control using the fcntl() interface), and has a configured time (typically 45 seconds; see man 5 proc and /proc/sys/fs/lease-break-time, in seconds) to relinquish the lease. The opener is blocked in the kernel until the lease is downgraded or the lease break time passes, in which case the kernel breaks the lease.
This allows the lease holder to postpone the opening for a short while.
However, the lease holder cannot block the opening, and cannot e.g. replace the file with a decoy one; the opener already has a hold on the inode, and the lease break time is just a grace period for cleanup work.
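For illustration, taking such a write (exclusive) lease on Linux looks roughly like this (a sketch; F_SETLEASE and F_SETSIG are Linux-specific and need _GNU_SOURCE, and real code would also install a handler for the lease-break signal):

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("shared.dat", O_RDWR);       /* hypothetical file we own */
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* Optionally use a real-time signal instead of the default SIGIO. */
    fcntl(fd, F_SETSIG, SIGRTMIN);

    /* Request an exclusive (write) lease; this fails if some other
       process also has the file open.                                 */
    if (fcntl(fd, F_SETLEASE, F_WRLCK) == -1)
        perror("fcntl(F_SETLEASE)");

    /* On the lease-break signal, finish cleanup and then downgrade:
       fcntl(fd, F_SETLEASE, F_UNLCK);                                 */

    close(fd);
    return 0;
}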
Technically, a fifth method would be mandatory file locking, but aside from the kernel's use with respect to executed binaries, it is rarely used and is actually buggy in Linux anyway. In Linux, inodes are only locked against modification when that inode is being executed as a binary by the kernel. (You can still rename or delete the original file and create a new one, so that any subsequent execs will execute the modified/new data. Attempts to modify a file that is being executed as a binary will fail with error ETXTBSY.)
When using POSIX AIO with a file descriptor, does the file descriptor need to be opened with O_NONBLOCK in open()?
In APUE, I don't find it explicitly saying yes or no, but in one example I see such a file descriptor being opened without O_NONBLOCK in open().
Thanks.
Since you comment that you are unclear about the relation and distinction between non-blocking I/O and asynchronous I/O:
I/O operations on a file open in non-blocking mode do not block, even if no data can be transferred immediately. If they transfer less data than was requested (or none at all) then it is up to the caller to try again later if they wish. Nothing is queued for later action.
The POSIX AIO interface provides for I/O operations to be performed asynchronously with respect to the caller's thread. AIO calls return without waiting for the I/O, while that I/O is being attempted in a different execution context. The caller can arrange to be notified in various ways of the completion of the operation (or not). In the meantime, it can perform whatever other work it wants.
There is no particular relationship between these. Neither the POSIX specifications, such as those for aio_read(), nor the Linux manual for the POSIX AIO interfaces document any requirement that the file on which AIO is performed be in non-blocking mode, nor do they define any error condition for the case that it is in blocking mode. Non-blocking mode is not a requirement.
Indeed, although it is allowed, it would not even be particularly useful to perform AIO on a non-blocking file. If you can rely on your operation not to block, then what do you gain from performing it asynchronously? The main point of AIO is that the I/O is performed without the caller having to wait for it.
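For reference, a minimal POSIX AIO read might look like this (a sketch; note the file is opened in ordinary blocking mode, which is perfectly fine, and on glibc you may need to link with -lrt):

#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.txt", O_RDONLY);       /* hypothetical file, blocking mode */
    if (fd == -1) {
        perror("open");
        return 1;
    }

    static char buf[4096];
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    if (aio_read(&cb) == -1) {                 /* queue the read; returns at once */
        perror("aio_read");
        return 1;
    }

    /* ... do other useful work here while the read is in progress ... */

    const struct aiocb *list[1] = { &cb };
    aio_suspend(list, 1, NULL);                /* wait for completion */

    int err = aio_error(&cb);
    if (err == 0)
        printf("read %zd bytes\n", aio_return(&cb));
    else
        fprintf(stderr, "aio_read failed: %s\n", strerror(err));

    close(fd);
    return 0;
}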
So I've been studying the fclose manpage for quite a while, and my conclusion is that if fclose is interrupted by some signal, according to the manpage there is no way to recover...? Am I missing some point?
Usually, with unbuffered POSIX functions (open, close, write, etc...) there is ALWAYS a way to recover from signal interruption (EINTR) by restarting the call; in contrast, the documentation of buffered calls states that after a failed fclose attempt another try has undefined behavior... with no hint about HOW to recover instead. Am I just "unlucky" if a signal interrupts fclose? Data might be lost and I can't be sure whether the file descriptor is actually closed or not. I do know that the buffer is deallocated, but what about the file descriptor?
Think about large-scale applications that use lots of fds simultaneously and would run into problems if fds are not properly freed; I would assume there must be a CLEAN solution to this problem.
So let's assume I'm writing a library that is not allowed to use sigaction and SA_RESTART, and lots of signals are sent; how do I recover if fclose is interrupted?
Would it be a good idea to call close in a loop (instead of fclose) after fclose failed with EINTR? The documentation of fclose simply doesn't mention the state of the file descriptor; UNDEFINED is not very helpful... if the fd is already closed and I call close again, weird hard-to-debug side effects could occur, so naturally I would rather not risk doing the wrong thing... then again, the number of available file descriptors is not unlimited, and resource leakage is some sort of bug (at least to me).
Of course I could check one specific implementation of fclose but I can't believe someone designed stdio and didn't think about this problem? Is it just the documentation that is bad or the design of this function?
This corner case really bugs me :(
EINTR and close()
In fact, there are also problems with close(), not only with fclose().
POSIX states that close() returns EINTR, which usually means that the application may retry the call. However, things are more complicated in Linux. See this article on LWN and also this post.
[...] the POSIX EINTR semantics are not really possible on Linux. The file descriptor passed to close() is de-allocated early in the processing of the system call and the same descriptor could already have been handed out to another thread by the time close() returns.
This blog post and this answer explain why it's not a good idea to retry close() failed with EINTR. So in Linux, you can do nothing meaningful if close() failed with EINTR (or EINPROGRESS).
Also note that close() is asynchronous in Linux. E.g., sometimes umount may return EBUSY immediately after closing the last open descriptor on a filesystem, since it's not yet released in the kernel. See the interesting discussion here: page 1, page 2.
EINTR and fclose()
POSIX states for fclose():
After the call to fclose(), any use of stream results in undefined behavior.
Whether or not the call succeeds, the stream shall be disassociated from the file and any buffer set by the setbuf() or setvbuf() function shall be disassociated from the stream. If the associated buffer was automatically allocated, it shall be deallocated.
I believe it means that even if close() failed, fclose() should free all resources and produce no leaks. It's true at least for glibc and uclibc implementations.
Reliable error handling
Call fflush() before fclose().
Since you can't determine whether fclose() failed when it called fflush() or when it called close(), you have to explicitly call fflush() before fclose() to ensure that the userspace buffer was successfully sent to the kernel.
Don't retry after EINTR.
If fclose() failed with EINTR, you cannot retry close() and you also cannot retry fclose().
Call fsync() if you need to.
If you care about data integrity, you should call fsync() or fdatasync() before calling fclose() [1].
If you don't, just ignore EINTR from fclose().
Notes
If fflush() and fsync() succeeded and fclose() failed with EINTR, no data is lost and no leaks occur.
You should also ensure that the FILE object is not used between the fflush() and fclose() calls from another thread [2].
[1] See "Everything You Always Wanted to Know About Fsync()" article which explains why fsync() may also be an asynchronous operation.
[2] You can call flockfile() before calling fflush() and fclose(). It should work with fclose() correctly.
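Putting the above together, a careful close sequence might look like this (a sketch of the pattern described above, not the only possible one):

#include <stdio.h>
#include <unistd.h>

/* Flush the userspace buffer, push the data to stable storage, then
   close.  If fclose() itself fails (e.g. with EINTR) afterwards, no
   data is lost and no resources leak (at least with glibc/uclibc),
   so the failure is only reported, never retried.                   */
int careful_close(FILE *fp)
{
    int status = 0;

    if (fflush(fp) == EOF)          /* userspace buffer -> kernel  */
        status = -1;

    if (fsync(fileno(fp)) == -1)    /* kernel -> stable storage    */
        status = -1;

    if (fclose(fp) == EOF)          /* never retry, whatever errno */
        status = -1;

    return status;
}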
Think about large-scale applications that use lots of fds simultaneously and would run into problems if fds are not properly freed; I would assume there must be a CLEAN solution to this problem.
The possibility to retry fflush() and then close() on the underlying file descriptor was already mentioned in the comments. For a large-scale application, I would favour the pattern of using threads and having one dedicated signal-handling thread, while all other threads have signals blocked using pthread_sigmask(). Then, when fclose() fails, you have a real problem.
I would like to create blocking and non-blocking file in Unix's C. First, blocking:
fd = open("file.txt", O_CREAT | O_WRONLY | O_EXCL);
is that right? Shouldn't I add some mode options, like 0666 for example?
How about a non-blocking file? I have no idea how to do that.
I would like to achieve something like:
when I open it to write in it, and it's opened for writing, it's ok; if not it blocks.
when I open it to read from it, and it's opened for reading, it's ok; if not it blocks.
File descriptors are blocking or non-blocking; files are not. Add O_NONBLOCK to the options in the open() call if you want a non-blocking file descriptor.
Note that opening a FIFO for reading or writing will block unless there's a process with the FIFO open for the other operation, or you specify O_NONBLOCK. If you open it for read and write, the open() is non-blocking (will return promptly); I/O operations are still controlled by whether you set O_NONBLOCK or not.
The updated question is not clear. However, if you're looking for 'exclusive access to the file' (so that no one else has it open), then neither O_EXCL nor O_NONBLOCK is the answer. O_EXCL affects what happens when you create the file; the create will fail if the file already exists. O_NONBLOCK affects whether a read() operation will block when there's no data available to read. If you read the POSIX open() description, there is nothing there that allows you to request 'exclusive access' to a file.
To answer the question about file mode: if you include O_CREAT, you need the third argument to open(). If you omit O_CREAT, you don't need the third argument to open(). It is a varargs function:
int open(const char *filename, int options, ...);
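So the call from the question, corrected to supply a mode, would be:

fd = open("file.txt", O_CREAT | O_WRONLY | O_EXCL, 0666);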
I don't know what you are calling a blocking file (blocking IO in Unix means that the IO operations wait for data to be available or for a sure failure; it is opposed to non-blocking IO, which returns immediately if there is no available data).
You always need to specify a mode when opening with O_CREAT.
The open you show will fail if the file already exists (once fixed for the above point).
Unix has no standard way to lock a file for exclusive access, other than that. There are advisory locks (but all programs must respect the protocol). Some systems have a mandatory locking extension. The received wisdom is not to rely on either kind of locking when accessing a network file system.
Shouldn't I add some mode options?
You should, since the file is opened write-only and is to be created if nonexistent. In this case, open() expects a third argument as well, so omitting it results in undefined behavior.
Edit:
The updated question is even more confusing...
when I open it to write in it, and it's opened for writing, it's ok; if not it blocks.
Why would you need that? See, if you try to write to a file/file descriptor not opened for writing, write() will return -1 and you can check the error code stored in errno. Tell us what you're actually trying to achieve with this bizarre requirement instead of overcomplicating and messing up your code.
(Remarks in parentheses:
I would like to create blocking and non-blocking file
What's that?
in unix's C
Again, there's no such thing. There is the C language, which is platform-independent.)
What is the behavior of the select(2) function when a file descriptor it is watching for reading is closed by another thread?
From some cursory testing, it does return right away. I suspect the outcome is either that (a) it still continues to wait for data, but if you actually tried to read from it you'd get EBADF (possibly -- there's a potential race) or (b) that it pretends as though the file descriptor were never passed in. If the latter case is true, passing in a single fd with no timeout would cause a deadlock if it were closed.
From some additional investigation, it appears that both dwc and bothie are right.
bothie's answer to the question boils down to: it's undefined behavior. That doesn't necessarily mean that it's unpredictable, but that different OSes do it differently. It would appear that systems like Solaris and HP-UX return from select(2) in this case, but Linux does not, based on this post to the linux-kernel mailing list from 2001.
The argument on the linux-kernel mailing list is essentially that it is undefined (and broken) behavior to rely upon. In Linux's case, calling close(2) on the file descriptor effectively decrements a reference count on it. Since there is a select(2) call also holding a reference to it, the fd will remain open and waiting for input until the select(2) returns. This is basically dwc's answer. You will get an event on the file descriptor and then it'll be closed. Trying to read from it will result in EBADF, assuming the fd hasn't been recycled. (A concern that MarkR raised in his answer, although I think it's probably avoidable in most cases with proper synchronization.)
So thank you all for the help.
I would expect that it would behave as if the end-of-file had been reached, that's to say, it would return with the file descriptor shown as ready but any attempt to read it subsequently would return "bad file descriptor".
Having said that, doing that is very bad practice anyway, as you'd always have potential race conditions: another file descriptor with the same number could be opened by yet another thread immediately after the second thread closed it, and then the selecting thread would end up waiting on the wrong one.
As soon as you close a file, its number becomes available for reuse, and may get reused by the next call to open(), socket() etc, even if by another thread. Therefore you really, really need to avoid this kind of thing.
The select system call is a way to wait for file descriptors to change state while the program doesn't have anything else to do. The main use is for server applications, which open a bunch of file descriptors and then wait for anything to do on them (accept new connections, read requests or send responses). Those file descriptors will be opened in non-blocking IO mode so that the server process won't hang in a syscall at any time.
This additionally means there is no need for separate threads, because all the work that could be done in a thread can be done prior to the select call as well. And if the work takes long, it can be interrupted, select can be called with timeout={0,0}, the file descriptors get handled, and afterwards the work is resumed.
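For illustration, a single round of such a select() loop might look like this (a rough sketch; sock_fd stands for a hypothetical already-opened, non-blocking descriptor):

#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

/* Wait up to one second for sock_fd to become readable, then handle it. */
void wait_and_handle(int sock_fd)
{
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(sock_fd, &readfds);

    struct timeval tv = { 1, 0 };                    /* 1 second timeout */
    int ready = select(sock_fd + 1, &readfds, NULL, NULL, &tv);

    if (ready == -1)
        perror("select");                            /* e.g. EBADF for an invalid fd */
    else if (ready > 0 && FD_ISSET(sock_fd, &readfds)) {
        char buf[4096];
        ssize_t n = read(sock_fd, buf, sizeof buf);  /* won't block: fd is non-blocking */
        if (n == -1)
            perror("read");                          /* could be EBADF if closed elsewhere */
        /* ... handle the n bytes ... */
    }
    /* ready == 0: timed out; do other work and call select() again. */
}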
Now, you close a file descriptor in another thread. Why do you have that extra thread at all, and why shall it close the file descriptor?
The POSIX standard doesn't provide any hints as to what happens in this case, so what you're doing is UNDEFINED BEHAVIOR. Expect that the result will be very different between different operating systems and even between versions of the same OS.
It's a little confusing what you're asking...
Select() should return upon an "interesting" change. If the close() merely decremented the reference count and the file was still open for writing somewhere then there's no reason for select() to wake up.
If the other thread did close() on the only open descriptor then it gets more interesting, but I'd need to see a simple version of the code to see if something's really wrong.