I would like to know whether the open() system call in the latest Linux kernel blocks if the filesystem is mounted from a remote device, for example a Ceph filesystem or NFS, and there is a network failure of some sort.
Yes. How long depends on the speed (and state) of the uplink, but your process or thread will block until the remote operation finishes. NFS is a bit notorious for this, and some FUSE file systems handle the blocking for whatever has the file handle, but you will block on open(), read() and write(), often at the mercy of the network and the other system.
Don't use O_NONBLOCK to get around it, or you're potentially reading from or writing to a black hole (which would just block anyway).
Yes, an open() call can block when trying to open a file on a remote file system if there is a network failure of some sort.
Depending on how the remote file system is mounted, it may just take a long time (multiple seconds) to determine that the remote file system is unavailable and return unsuccessfully after what seems like an inordinate amount of time, or it may simply lock up indefinitely until the remote resource becomes available once more (or until the mapping is removed from the system).
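To make the "it may hang for an unknown time" point concrete, here is a minimal sketch of one way to keep the caller responsive: do the open() in a child process and give up waiting after a timeout. The helper name probe_open and the 10 ms polling interval are my own assumptions, not something from the answers above. On some mounts a child stuck inside the kernel may linger for a while even after SIGKILL, so treat this as a probe, not a guarantee.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if open() succeeded within timeout_ms,
 * 1 if open() completed but failed, -1 on timeout or internal error. */
int probe_open(const char *path, int timeout_ms)
{
    assert(path != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: this is the call that may block on a dead mount. */
        int fd = open(path, O_RDONLY);
        _exit(fd >= 0 ? 0 : 1);
    }

    /* Parent: poll for the child's exit without blocking. */
    for (int waited = 0; ; waited += 10) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        if (r < 0)
            return -1;
        if (waited >= timeout_ms)
            break;
        struct timespec ts = { 0, 10 * 1000 * 1000 };  /* 10 ms */
        nanosleep(&ts, NULL);
    }

    /* Timed out: ask the kernel to kill the child, but do not wait for it. */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, WNOHANG);
    return -1;
}
```

A caller would do something like probe_open("/mnt/remote/file", 2000) (the path is only an example) and skip the mount when the result is -1.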
Direct I/O is the most performant way to copy larger files, so I wanted to add that ability to a program.
Windows offers FILE_FLAG_WRITE_THROUGH and FILE_FLAG_NO_BUFFERING in Win32's CreateFileA(). Linux, since 2.4.10, has the O_DIRECT flag for open().
Is there a way to achieve the same result portably within POSIX? Like how the Win32 API here works from Windows XP to Windows 11, it would be nice to do direct IO across all UNIX-like systems in one reliably portable way.
No, there is no POSIX standard for direct IO.
There are at least two different APIs and behaviors that exist as of January 2023. Linux, FreeBSD, and apparently IBM's AIX use an O_DIRECT flag to open(), while Oracle's Solaris uses a directio() function on an already-opened file descriptor.
The Linux use of the O_DIRECT flag to the POSIX open() function is documented on the Linux open() man page (https://man7.org/linux/man-pages/man2/open.2.html):
O_DIRECT (since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this
file. In general this will degrade performance, but it is
useful in special situations, such as when applications do
their own caching. File I/O is done directly to/from
user-space buffers. The O_DIRECT flag on its own makes an
effort to transfer data synchronously, but does not give
the guarantees of the O_SYNC flag that data and necessary
metadata are transferred. To guarantee synchronous I/O,
O_SYNC must be used in addition to O_DIRECT. See NOTES
below for further discussion.
Linux does not clearly specify how direct IO interacts with other descriptors open on the same file, what happens when the file is mapped using mmap(), or what alignment or size restrictions apply to direct IO read or write operations. In my experience, these are all filesystem-specific and have been becoming less restrictive over time, but most Linux filesystems require page-aligned IO buffers, and many (perhaps most or all) required, and may still require, page-sized reads or writes.
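Since page alignment and page-sized transfers are the conservative choice described above, a small sketch of allocating such a buffer may help. The helper name alloc_aligned_io_buffer is hypothetical. Using the page size as the alignment is an assumption that follows this answer's experience, not a documented guarantee for every filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: allocate a page-aligned buffer whose size is rounded
 * up to a whole number of pages. Returns NULL on failure; on success the
 * rounded size is stored in *actual_size. Release the buffer with free(). */
void *alloc_aligned_io_buffer(size_t min_size, size_t *actual_size)
{
    assert(actual_size != NULL);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || min_size == 0)
        return NULL;

    size_t pg = (size_t)page;
    /* Round up to the next multiple of the page size (overflow not handled). */
    size_t rounded = (min_size + pg - 1) / pg * pg;

    void *buf = NULL;
    if (posix_memalign(&buf, pg, rounded) != 0)
        return NULL;

    *actual_size = rounded;
    return buf;
}
```

Reads and writes on an O_DIRECT descriptor would then use this buffer with lengths and file offsets that are multiples of the page size.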
FreeBSD follows the Linux model: passing an O_DIRECT flag to open():
O_DIRECT may be used to minimize or eliminate the cache effects
of reading and writing. The system will attempt to avoid caching the
data you read or write. If it cannot avoid caching the data, it will
minimize the impact the data has on the cache. Use of this flag can
drastically reduce performance if not used with care.
OpenBSD does not support direct IO. There's no mention of direct IO in either the OpenBSD open() or the OpenBSD fcntl() man pages.
IBM's AIX appears to support a Linux-type O_DIRECT flag to open(), but actual published IBM AIX man pages don't seem to be generally available.
SGI's Irix also supported the Linux-style O_DIRECT flag to open():
O_DIRECT
If set, all reads and writes on the resulting file descriptor will
be performed directly to or from the user program buffer, provided
appropriate size and alignment restrictions are met. Refer to the
F_SETFL and F_DIOINFO commands in the fcntl(2) manual entry for
information about how to determine the alignment constraints.
O_DIRECT is a Silicon Graphics extension and is only supported on
local EFS and XFS file systems, and remote BDS file systems.
Of interest, the XFS file system on Linux originated with SGI's Irix.
Solaris uses a completely different interface. Solaris uses a specific directio() function to set direct IO on a per-file basis:
Description
The directio() function provides advice to the system about the
expected behavior of the application when accessing the data in the
file associated with the open file descriptor fildes. The system
uses this information to help optimize accesses to the file's data.
The directio() function has no effect on the semantics of the other
operations on the data, though it may affect the performance of other
operations.
The advice argument is kept per file; the last caller of directio()
sets the advice for all applications using the file associated with
fildes.
Values for advice are defined in <sys/fcntl.h>.
DIRECTIO_OFF
Applications get the default system behavior when accessing file data.
When an application reads data from a file, the data is first cached
in system memory and then copied into the application's buffer (see
read(2)). If the system detects that the application is reading
sequentially from a file, the system will asynchronously "read ahead"
from the file into system memory so the data is immediately available
for the next read(2) operation.
When an application writes data into a file, the data is first cached
in system memory and is written to the device at a later time (see
write(2)). When possible, the system increases the performance of
write(2) operations by cacheing the data in memory pages. The data
is copied into system memory and the write(2) operation returns
immediately to the application. The data is later written
asynchronously to the device. When possible, the cached data is
"clustered" into large chunks and written to the device in a single
write operation.
The system behavior for DIRECTIO_OFF can change without notice.
DIRECTIO_ON
The system behaves as though the application is not going to reuse the
file data in the near future. In other words, the file data is not
cached in the system's memory pages.
When possible, data is read or written directly between the
application's memory and the device when the data is accessed with
read(2) and write(2) operations. When such transfers are not
possible, the system switches back to the default behavior, but just
for that operation. In general, the transfer is possible when the
application's buffer is aligned on a two-byte (short) boundary, the
offset into the file is on a device sector boundary, and the size of
the operation is a multiple of device sectors.
This advisory is ignored while the file associated with fildes is
mapped (see mmap(2)).
The system behavior for DIRECTIO_ON can change without notice.
Notice also the behavior on Solaris is different: if direct IO is enabled on a file by any process, all processes accessing that file will do so via direct IO (Solaris 10+ has no alignment or size restrictions on direct IO, so switching between direct IO and "normal" IO won't break anything.*). And if a file is mapped via mmap(), direct IO on that file is disabled entirely.
* - That's not quite true. If you're using a SAMFS or QFS (https://en.wikipedia.org/wiki/QFS) filesystem in shared mode and access data from the filesystem's active metadata controller (where the filesystem must, by design, be mounted with the Solaris forcedirectio mount option so that all access on that one system in the cluster is done via direct IO), then disabling direct IO for a file using directio( fd, DIRECTIO_OFF ) will corrupt the filesystem. Oracle's own top-end RAC database would do that if you did a database restore on the QFS metadata controller, and you'd wind up with a corrupt filesystem.
The short answer is no.
IEEE 1003.1-2017 (the current POSIX standard afaik) doesn't mention any directives for direct I/O like O_DIRECT. That being said, a cursory glance tells me that GNU/Linux and FreeBSD support the O_DIRECT flag, while OpenBSD doesn't.
Beyond that, it appears that not all filesystems support O_DIRECT so even on a GNU/Linux system where you know your implementation of open() will recognize that directive, there's still no guarantee that you can use it.
At the end of the day, the only way I can see portable, direct I/O is runtime checks for whether or not the platform your program is running on supports it; you could do compile time checks, but I don't recommend it since filesystems can change, or your destination may not be on the OS drive. You might get super lucky and find a project out there that's already started to do this, but I kind of doubt it exists.
My recommendation for you is to start by writing your program to check for direct I/O support for your platform and act accordingly, adding checks and support for kernels and file systems you know your program will run on.
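A rough sketch of what such a runtime check could look like: try O_DIRECT where the platform defines it, and fall back to a normal open() when the filesystem rejects the flag with EINVAL (the error the Linux man page documents for that case). The helper name open_prefer_direct is made up, and the DIRECTIO_ON branch for Solaris is an untested assumption based only on the man page quoted in the other answer.

```c
#define _GNU_SOURCE            /* exposes O_DIRECT on Linux */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>            /* close() for callers */

/* Hypothetical helper: open path read-only, asking for direct I/O when the
 * platform and filesystem allow it. *direct is set to 1 if direct I/O was
 * granted, 0 if the descriptor uses normal cached I/O. Returns the file
 * descriptor, or -1 with errno set. */
int open_prefer_direct(const char *path, int *direct)
{
    assert(path != NULL && direct != NULL);
    *direct = 0;

#ifdef O_DIRECT
    int dfd = open(path, O_RDONLY | O_DIRECT);
    if (dfd >= 0) {
        *direct = 1;
        return dfd;
    }
    if (errno != EINVAL)
        return -1;             /* a real failure, not "flag unsupported" */
#endif

    int fd = open(path, O_RDONLY);
#ifdef DIRECTIO_ON
    /* Solaris-style interface (assumption, not tested here). */
    if (fd >= 0 && directio(fd, DIRECTIO_ON) == 0)
        *direct = 1;
#endif
    return fd;
}
```

The caller still has to honor alignment rules when *direct comes back as 1, which is why the flag is reported instead of hidden.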
Wish I could be more help,
--K
I'm working on a Linux/C application with strict timing requirements. I want to open a directory for reading without blocking on I/O (i.e. succeed only if the information is immediately available in cache). If this request would block on I/O I would like to know so that I can abort and ignore this directory for now. I know that open() has a non-blocking option O_NONBLOCK. However, it has this caveat:
Note that this flag has no effect for regular files and
block devices; that is, I/O operations will (briefly)
block when device activity is required, regardless of
whether O_NONBLOCK is set.
I assume that a directory entry is treated like a regular file. I don't know of a good way to prove/disprove this. Is there a way to open a directory without any I/O blocking?
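One Linux-specific option worth mentioning here is openat2() with the RESOLVE_CACHED flag (added in Linux 5.12), which makes the open fail with EAGAIN instead of doing I/O when the path lookup cannot be satisfied from the kernel's cache. The sketch below is a hedged illustration: the wrapper name open_dir_if_cached is made up, it needs kernel headers that ship linux/openat2.h, and it only covers the open itself. Reading the directory entries afterwards may still block.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#if defined(__has_include)
#  if __has_include(<linux/openat2.h>)
#    include <linux/openat2.h>   /* struct open_how, RESOLVE_CACHED */
#  endif
#endif

/* Hypothetical wrapper: open a directory only if the path lookup can be
 * served from cache. Returns a descriptor, or -1 with errno set:
 * EAGAIN means "would need I/O", ENOSYS means "not available here". */
int open_dir_if_cached(const char *path)
{
    assert(path != NULL);
#if defined(SYS_openat2) && defined(RESOLVE_CACHED)
    struct open_how how;
    memset(&how, 0, sizeof how);
    how.flags = O_RDONLY | O_DIRECTORY | O_CLOEXEC;
    how.resolve = RESOLVE_CACHED;
    return (int)syscall(SYS_openat2, AT_FDCWD, path, &how, sizeof how);
#else
    errno = ENOSYS;              /* headers too old for this sketch */
    return -1;
#endif
}
```

On EAGAIN the application can skip the directory for now and retry later, which matches the "abort and ignore" behavior asked for.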
You could try using the coproc command in Linux to run a process in the background. Maybe that could work for you.
When writing a non-blocking program (handling multiple sockets) which at a certain point needs to open files using open(2), stat(2) files or open directories using opendir(2), how can I ensure that the system calls do not block?
To me it seems that there's no other alternative than using threads or fork(2).
As Mel Nicholson replied, for everything file-descriptor based you can use select/poll/epoll. For everything else you can have a proxy thread per item (or a thread pool), each with a small stack, that converts (by means of the kernel scheduler) any synchronous blocking wait into a select/poll/epoll-able asynchronous event using eventfd or a Unix pipe (where portability is required).
The proxy thread shall block till the operation completes and then write to the eventfd or to the pipe to wake up the select/poll/epoll.
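The proxy-thread idea can be sketched roughly as follows, using a pipe for the wake-up. The names open_job, open_worker and async_open are hypothetical, the 256 KiB stack size is an arbitrary assumption, and a real server would probably use a thread pool and carry a request identifier alongside the result.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct open_job {              /* hypothetical job record */
    char *path;
    int notify_fd;             /* write end of the notification pipe */
};

static void *open_worker(void *arg)
{
    struct open_job *job = arg;
    /* This call may block; only the proxy thread waits on it. */
    int result = open(job->path, O_RDONLY);
    if (result < 0)
        result = -errno;       /* report failures as negative errno */
    /* Small pipe writes are atomic, so the reader sees a whole int. */
    if (write(job->notify_fd, &result, sizeof result) != (ssize_t)sizeof result) {
        /* Nothing useful can be done here if the pipe is gone. */
    }
    free(job->path);
    free(job);
    return NULL;
}

/* Starts a detached proxy thread that opens path and writes the resulting
 * descriptor (or -errno) to notify_fd. Returns 0 if the thread was started. */
int async_open(const char *path, int notify_fd)
{
    assert(path != NULL);
    struct open_job *job = malloc(sizeof *job);
    if (job == NULL)
        return -1;
    job->path = strdup(path);
    if (job->path == NULL) {
        free(job);
        return -1;
    }
    job->notify_fd = notify_fd;

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    pthread_attr_setstacksize(&attr, 256 * 1024);   /* small stack */
    pthread_t tid;
    int rc = pthread_create(&tid, &attr, open_worker, job);
    pthread_attr_destroy(&attr);
    if (rc != 0) {
        free(job->path);
        free(job);
        return -1;
    }
    return 0;
}
```

The event loop adds the read end of the pipe to its poll() set next to the sockets; when it becomes readable, it reads one int and treats a non-negative value as the opened descriptor.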
Indeed there is no other method.
Actually there is another kind of blocking that can't be dealt with other than by threads, and that is page faults. Those may happen in program code, program data, memory allocation or data mapped from files. It's almost impossible to avoid them (actually you can lock some pages in memory, but it's a privileged operation and would probably backfire by making the kernel do a poor job of memory management somewhere else). So:
You can't really weed out every last chance of blocking for a particular client, so don't bother with the likes of open and stat. The network will probably add larger delays than these functions anyway.
For optimal performance you should have enough threads so some can be scheduled if the others are blocked on page fault or similar difficult blocking point.
Also if you need to read and process or process and write data during handling a network request, it's faster to access the file using memory-mapping, but that's blocking and can't be made non-blocking. So modern network servers tend to stick with the blocking calls for most stuff and simply have enough threads to keep the CPU busy while other threads are waiting for I/O.
The fact that most modern servers are multi-core is another reason why you need multiple threads anyway.
You can use the poll() system call to check any number of sockets for data using a single thread.
See here for linux details, or man poll for the details on your system.
open() and stat() will block the thread they are called from on all POSIX-compliant systems unless called via an asynchronous tactic (such as in a forked process).
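For the poll() part, a minimal sketch of waiting on several descriptors from one thread might look like this. The function name read_first_ready and the fixed limit of 64 descriptors are assumptions made to keep the example short.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for any of the descriptors to
 * become readable, then read from the first ready one. Returns the number of
 * bytes read (0 means end of file), -1 on timeout, -2 on error. On a
 * successful read, *which receives the index of the descriptor used. */
ssize_t read_first_ready(const int *fds, size_t count, int timeout_ms,
                         void *buf, size_t len, size_t *which)
{
    assert(fds != NULL && which != NULL && count > 0 && count <= 64);

    struct pollfd pfds[64];
    for (size_t i = 0; i < count; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    int n = poll(pfds, (nfds_t)count, timeout_ms);
    if (n < 0)
        return -2;
    if (n == 0)
        return -1;             /* nothing ready before the timeout */

    for (size_t i = 0; i < count; i++) {
        if (pfds[i].revents != 0) {
            ssize_t got = read(fds[i], buf, len);
            if (got < 0)
                return -2;
            *which = i;
            return got;
        }
    }
    return -1;
}
```

The same call works for sockets and pipes, but not for regular files, which poll() always reports as ready.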
I have remote disks mounted on my system using NFS and I am trying to write to the files on the mounted remote disks using pwrite() API.
It doesn't happen every time, but in some cases pwrite() fails during I/O and sets errno to EIO (Input/output error).
Can someone please explain why this error occurs in the first place, and is there any way I can correct it?
Thanks
From (bad) experience with reading and writing NFS-based files, I learned that you have a good chance of working around this EIO by simply retrying the failed I/O operation (read(), write()).
Also, on NFS one cannot assume that read()/write() transfer the amount of data specified, so it is a good idea to always check the return value of the function in question to see how many bytes were transferred.
I see the issue in the underlying NFS functionality or in the way the NFS driver's results are handled by the kernel, so I strongly assume pread()/pwrite() show the same effects as I witnessed when using read()/write().
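Putting both pieces of advice together (retry on EIO, and always check how many bytes were actually written), a hedged sketch for pwrite() could look like the following. The names pwrite_full and save_blob, the retry limit, and the one-second pause between retries are all assumptions. Note that EIO can also signal a real, persistent failure, so the retries are deliberately bounded.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write all len bytes at offset, continuing after short
 * writes and retrying EIO up to max_eio_retries times. Returns 0 once
 * everything is written, -1 with errno set otherwise. */
int pwrite_full(int fd, const void *buf, size_t len, off_t offset,
                int max_eio_retries)
{
    assert(buf != NULL || len == 0);
    const char *p = buf;
    int eio_retries = 0;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, offset);
        if (n < 0) {
            if (errno == EINTR)
                continue;                  /* interrupted, just retry */
            if (errno == EIO && eio_retries < max_eio_retries) {
                eio_retries++;
                sleep(1);                  /* crude pause before retrying */
                continue;
            }
            return -1;
        }
        if (n == 0) {                      /* no progress: avoid spinning */
            errno = EIO;
            return -1;
        }
        p += n;                            /* short write: advance and loop */
        len -= (size_t)n;
        offset += n;
    }
    return 0;
}

/* Hypothetical usage: write a whole buffer to a new file at offset 0. */
int save_blob(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    int rc = pwrite_full(fd, buf, len, 0, 3);
    if (close(fd) != 0)                    /* close() can report late write errors */
        rc = -1;
    return rc;
}
```

If the retries run out, the caller gets the original EIO back and can decide whether to give up or to reopen the file.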
To extend the title: I am wondering how the OS handles functions like fwrite, fread, fopen and fclose.
What is actually a stream?
Sorry if I was not clear enough.
BTW I am using GNU/Linux Ubuntu 11.04.
A bit better explanation of what I am trying to ask.
I want to know how files are written to the HDD, how they are read into memory, and how a handle to them is later created. Is the BIOS doing that through drivers?
The C library takes a function like fopen and converts that to the proper OS system call. On Linux that is the POSIX open function. You can see the definition for this in a Linux terminal with man 2 open. On Windows the call would be CreateFile which you can see in the MSDN documentation. On Windows NT, that function is in turn another translation of the actual NT kernel function NtCreateFile.
A stream in the C library is a collection of information stored in a FILE struct. This is usually a 'handle' to the operating system's idea of the file, an area of memory allocated as a 'buffer', and the current read and write positions.
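To see the "handle plus buffer" idea in action, here is a small sketch that writes through a stream and checks the file size on the underlying descriptor before and after fflush(). The function name stream_buffer_demo is made up, and the result assumes the message is shorter than the stream buffer (BUFSIZ here).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical demo: write msg through a fully buffered stream and report
 * the size of the underlying file before and after fflush().
 * Returns 0 on success, -1 on error. */
int stream_buffer_demo(const char *path, const char *msg,
                       long *size_before, long *size_after)
{
    assert(path && msg && size_before && size_after);

    FILE *fp = fopen(path, "w");           /* library call wrapping open() */
    if (fp == NULL)
        return -1;
    setvbuf(fp, NULL, _IOFBF, BUFSIZ);     /* full buffering, library-owned buffer */

    int fd = fileno(fp);                   /* the OS-level handle inside the FILE */
    struct stat st;
    int rc = -1;

    if (fputs(msg, fp) != EOF && fstat(fd, &st) == 0) {
        *size_before = (long)st.st_size;   /* data is still in the stream buffer */
        if (fflush(fp) == 0 && fstat(fd, &st) == 0) {
            *size_after = (long)st.st_size; /* now handed to the OS via write() */
            rc = 0;
        }
    }
    fclose(fp);
    return rc;
}
```

With a short message, the size before the flush is 0 because nothing has been passed to the operating system yet, and after the flush it equals the message length.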
I just noticed you tagged this with 'assembly'. You might then want to know about the really low level details. This seems like a good article.
Now you've changed the question to ask about even lower levels. Well, once the operating system gets a command to open a file, it passes that command to the VFS (Virtual File System). That piece of the operating system looks up the file name, including any directories needed, and does the necessary access checks. If this is in the RAM cache then no disk access is needed. If not, the VFS sends a read request to the specific file system, which is probably EXT4. Then the EXT4 file system driver will determine in which disk block that directory is located. It will then send a read command to the disk device driver.
Assuming that the disk driver is AHCI, it will convert a request to read a block into a series of register writes that will set up a DMA (Direct Memory Access) request. This looks like a good source for some details.
At that point the AHCI controller on the motherboard takes over. It will communicate with the hard disk controller to cooperate in reading the data and writing into the DMA memory location.
While this is going on the operating system puts the process on hold so it can continue with other work. The hardware is taking care of things and the CPU isn't required to pay attention. The disk request will take many milliseconds during which the CPU can run millions of instructions.
When the request is complete the AHCI controller will send an interrupt. One of the system CPUs will receive the interrupt, look in its IDT (Interrupt Descriptor Table) and jump to the machine code at that location: the interrupt handler.
The operating system interrupt handler will read some data, find out that it has been interrupted by the AHCI controller, then it will jump into the AHCI driver code. The AHCI driver will read the registers on the controller, determine that the read is complete, put a marker into its operations queue, tell the OS scheduler that it needs to run, then return. Nothing else happens at this point.
The operating system will note that it needs to run the AHCI driver's queue. When it decides to do that (it might have a real-time task running or it might be reading networking packets at the moment) it will then go read the data from the memory block marked for DMA and copy that data to the EXT4 file system driver. That EXT4 driver will then return the data to the VFS which will put it into cache. The VFS will return an operating system file handle to the open system call, which will return that to the fopen library call, which will put that into the FILE struct and return a pointer to that to the program.
fopen et al are usually implemented on top of OS-specific system calls. On Unix, this means the APIs for working with file descriptors: open, read, write, close, and a few others. On Windows, it's CreateFile, ReadFile, etc.