How does read(2) in Linux C work?

According to the man page, we can specify the number of bytes we want to read from a file descriptor.
But in read's implementation, how many read requests will be created to perform a read?
For example, if I want to read 4MB, will it create only one request for 4MB, or will it split it into multiple small requests, such as 4KB per request?

read(2) is a system call, so the C library wrapper dispatches it to the kernel with a dedicated instruction (syscall on x86-64; on 32-bit x86 the call goes through the vDSO's __kernel_vsyscall entry point). In very old times this was done with a software interrupt, but nowadays there are faster ways of dispatching system calls.
Inside the kernel the call is first handled by the VFS (virtual file system); the VFS provides a common interface for inodes (the structures that represent files) and a common way of interfacing with the underlying file system.
The VFS dispatches to the underlying file system (the mount(8) program will tell you which mount points exist and what file system is used at each). See here for more information: http://www.inf.fu-berlin.de/lehre/SS01/OS/Lectures/Lecture16.pdf
The file system can do its own caching, so the number of disk reads depends on what is present in the cache, how the file system allocates blocks for storage of a particular file, and how the file is divided into disk blocks. These are all questions for the particular file system.
If you want to do your own caching, open the file with the O_DIRECT flag; in this case there is an effort not to use the cache. However, the buffer address, the file offset and the size of every read then have to be multiples of 512 bytes, so that your buffer can be transferred via DMA to or from the backing store (see http://www.quora.com/Why-does-O_DIRECT-require-I-O-to-be-512-byte-aligned ). The exact alignment depends on the logical block size of the underlying device.
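The alignment rules for O_DIRECT can be sketched as follows. This is a minimal illustration, not a definitive recipe: alloc_direct_buffer and read_direct are hypothetical helper names, and the 4096-byte alignment is an assumption chosen because it is a multiple of the 512 bytes mentioned above.

```c
#define _GNU_SOURCE            /* exposes O_DIRECT in <fcntl.h> on Linux */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Assumed alignment: a multiple of the 512 bytes quoted above.
 * The real requirement depends on the device and file system. */
#define DIRECT_ALIGN 4096

/* Allocate a buffer whose address and size are multiples of DIRECT_ALIGN. */
static void *alloc_direct_buffer(size_t min_size, size_t *actual_size)
{
    void *buf = NULL;
    size_t size = (min_size + DIRECT_ALIGN - 1) / DIRECT_ALIGN * DIRECT_ALIGN;

    assert(actual_size != NULL);
    if (size == 0)
        size = DIRECT_ALIGN;
    if (posix_memalign(&buf, DIRECT_ALIGN, size) != 0)
        return NULL;
    *actual_size = size;
    return buf;
}

/* Read one aligned chunk from offset 0 of `path` with O_DIRECT.
 * Returns the number of bytes read, or -1 on error. Some file systems
 * reject O_DIRECT, in which case open() fails. */
static ssize_t read_direct(const char *path, void *buf, size_t size)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, buf, size, 0);
    close(fd);
    return n;
}
```

A caller would allocate the buffer once with alloc_direct_buffer and reuse it for every read_direct call, since both the address and the length already satisfy the alignment.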

It depends on how deep you go.
The C library just passes the size you gave it straight to the kernel in one read() system call, so at that level it's just one request.
Inside the kernel, for an ordinary file in standard buffered mode the 4MB you requested is going to be copied from multiple pagecache pages (4kB each) which are unlikely to be contiguous. Any of the file data which isn't actually already in the pagecache is going to have to be read from disk. The file might not be stored contiguously on disk, so that 4MB could result in multiple requests to the underlying block device.

If there is data available, read will return as much data as is immediately available and will fit in the buffer, without waiting. If there's no data available, it will wait until there is some and return what it can without waiting more.
How much that is depends on what the file descriptor refers to. If it refers to a socket, that will be whatever is in the socket buffer. If it is a file, that will be whatever is in the buffer cache.

When you call read, it makes just one request to fill the buffer. If it cannot fill the whole buffer (because there is no more data, or because the data has not arrived yet, as with sockets), it returns the number of bytes it actually wrote into your buffer.
As the manual says:
RETURN VALUE
Upon successful completion, these functions shall return a non-negative integer indicating the number of bytes actually read. Otherwise, the functions shall return −1 and set errno to indicate the
error.
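Because one call may return fewer bytes than requested, code that needs an exact amount usually loops on the return value. A minimal sketch, where read_full is a hypothetical helper name and not part of any library:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Keep calling read() until `count` bytes have arrived, end-of-file is
 * reached, or a real error occurs. Returns the number of bytes stored in
 * `buf` (less than `count` only at end-of-file), or -1 with errno set. */
static ssize_t read_full(int fd, void *buf, size_t count)
{
    char *p = buf;
    size_t done = 0;

    assert(buf != NULL);
    while (done < count) {
        ssize_t n = read(fd, p + done, count - done);
        if (n > 0)
            done += (size_t)n;      /* short read: ask for the rest */
        else if (n == 0)
            break;                  /* end-of-file */
        else if (errno == EINTR)
            continue;               /* interrupted by a signal: retry */
        else
            return -1;
    }
    return (ssize_t)done;
}
```

Each iteration is still a single request to the kernel; the loop only exists because the kernel is allowed to answer with less than was asked for.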

There's really no one right answer, other than however many are necessary at whatever layer the request winds up going to. Typically, a single request will be passed to the kernel. This may result in no further requests going to other layers because all the information is in memory. But if the data has to be read from, say, a software RAID, requests may have to be issued to multiple physical devices to satisfy the request.
I don't think you can really give a better answer than "whatever the implementer thought was the best way".

Related

Where and why do read(2) and write(2) system calls copy to and from userspace?

I was reading about sendfile(2) recently, and the man page states:
sendfile() copies data between one file descriptor and another.
Because this copying is done within the kernel, sendfile() is more
efficient than the combination of read(2) and write(2), which would
require transferring data to and from user space.
It got me thinking, why exactly is the combination of read()/write() slower? The man page focuses on extra copying that has to happen to and from userspace, not the total number of calls required. I took a short look at the kernel code for read and write but didn't see the copy.
Why does the copy exist in the first place? Couldn't the kernel just read from the passed buffer on a write() without first copying the whole thing into kernel space?
What about asynchronous IO interfaces like AIO and io_uring? Do they also copy?
why exactly is the combination of read()/write() slower?
The manual page is quite clear about this. Doing read() and then write() requires copying the data twice.
Why does the copy exist in the first place?
It should be quite obvious: since you invoke read, you want the data to be copied to the memory of your process, in the specified destination buffer. Same goes for write: you want the data to be copied from the memory of your process. The kernel doesn't really know that you just want to do a read + write, and that copying back and forth two times could be avoided.
When executing read, the data is copied by the kernel from the file descriptor to the process memory. When executing write the data is copied by the kernel from the process memory to the file descriptor.
Couldn't the kernel just read from the passed buffer on a write() without first copying the whole thing into kernel space?
The crucial point here is that when you read or write a file, the kernel first has to bring the file's data from disk into its own memory in order for it to be read or written. This kernel-side cache of file pages is called the page cache, and it's a huge factor in the performance of modern operating systems.
The file content is already present in kernel memory, mapped as a memory page (or more). In case of a read, the data needs to be copied from that file kernel memory page to the process memory, while in case of a write, the data needs to be copied from the process memory to the file kernel memory page. The kernel will then ensure that the data in the kernel memory page(s) corresponding to the file is correctly written back to disk when needed (if needed at all).
This "intermediate" kernel mapping can be avoided, and the file mapped directly into userspace memory, but then the application would have to manage it manually, which is complicated and easy to mess up. This is why, for normal file operations, files are mapped into kernel memory. The kernel provides high level APIs for userspace programs to interact with them, and the hard work is left to the kernel itself.
The sendfile syscall is much faster because you do not need to perform the copy two times, but only once. Assuming that you want to do a sendfile of file A to file B, then all the kernel needs to do is to copy the data from A to B. However, in the case of read + write, the kernel needs to first copy from A to your process, and then from your process to B. This double copy is of course slower, and if you don't really need to read or manipulate the data, then it's a complete waste of time.
FYI, sendfile itself is basically an easy-to-use wrapper around splice (as can be seen from the source code), which is a more generic syscall to perform zero-copy data transfer between file descriptors.
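For comparison with a read() + write() loop, here is a minimal sketch of a file-to-file copy using sendfile(2) on Linux. The helper name copy_with_sendfile is hypothetical, and the sketch assumes in_fd refers to a regular file whose size does not change during the copy.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy all of in_fd to out_fd without ever placing the data in a
 * userspace buffer. Returns the number of bytes copied, or -1 on error. */
static ssize_t copy_with_sendfile(int out_fd, int in_fd)
{
    struct stat st;
    off_t offset = 0;   /* sendfile() advances this, not in_fd's own offset */

    assert(out_fd >= 0 && in_fd >= 0);
    if (fstat(in_fd, &st) != 0)
        return -1;
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0)
            return -1;
        if (n == 0)
            break;      /* input ended early (file shrank) */
    }
    return (ssize_t)offset;
}
```

The loop is there because sendfile(), like read(), may transfer less than was asked for in one call.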
I took a short look at the kernel code for read and write but didn't see the copy.
In terms of kernel code, the whole process for reading a file is very complicated, but what the kernel ends up doing is a "special" version of memcpy(), called copy_to_user(), which copies the content of the file from the kernel memory to the userspace memory (doing the appropriate checks before performing the actual copy). More specifically, for files, the copyout() function is used, but the behavior is very similar, both end up calling raw_copy_to_user() (which is architecture-dependent).
What about asynchronous IO interfaces like AIO and io_uring? Do they also copy?
The aio_{read,write} libc functions defined by POSIX are just asynchronous wrappers around read and write (i.e. they still use read and write under the hood). These still copy data to/from userspace.
io_uring can provide zero-copy operations, when using the O_DIRECT flag of open (see the manual page):
O_DIRECT (since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this
file. In general this will degrade performance, but it is
useful in special situations, such as when applications do
their own caching. File I/O is done directly to/from user-
space buffers. The O_DIRECT flag on its own makes an effort
to transfer data synchronously, but does not give the
guarantees of the O_SYNC flag that data and necessary metadata
are transferred. To guarantee synchronous I/O, O_SYNC must be
used in addition to O_DIRECT. See NOTES below for further
discussion.
This should be done carefully though, as it could very well degrade performance in case the userspace application does not do the appropriate caching on its own (if needed).
See also this related detailed answer on asynchronous I/O, and this LWN article on io_uring.

What does opening a file actually do?

In all programming languages (that I use at least), you must open a file before you can read or write to it.
But what does this open operation actually do?
Manual pages for typical functions don't actually tell you anything other than it 'opens a file for reading/writing':
http://www.cplusplus.com/reference/cstdio/fopen/
https://docs.python.org/3/library/functions.html#open
Obviously, through usage of the function you can tell it involves creation of some kind of object which facilitates accessing a file.
Another way of putting this would be, if I were to implement an open function, what would it need to do on Linux?
In almost every high-level language, the function that opens a file is a wrapper around the corresponding kernel system call. It may do other fancy stuff as well, but in contemporary operating systems, opening a file must always go through the kernel.
This is why the arguments of the fopen library function, or Python's open closely resemble the arguments of the open(2) system call.
In addition to opening the file, these functions usually set up a buffer that will subsequently be used with the read/write operations. The purpose of this buffer is to ensure that whenever you want to read N bytes, the corresponding library call will return N bytes, regardless of whether the calls to the underlying system calls return fewer.
I am not actually interested in implementing my own function; just in understanding what the hell is going on...'beyond the language' if you like.
In Unix-like operating systems, a successful call to open returns a "file descriptor" which is merely an integer in the context of the user process. This descriptor is subsequently passed to any call that interacts with the opened file, and after calling close on it, the descriptor becomes invalid.
It is important to note that the call to open acts like a validation point at which various checks are made. If not all of the conditions are met, the call fails by returning -1 instead of the descriptor, and the kind of error is indicated in errno. The essential checks are:
Whether the file exists;
Whether the calling process is privileged to open this file in the specified mode. This is determined by matching the file permissions, owner ID and group ID to the respective IDs of the calling process.
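The failure path described above can be observed from user space. A minimal sketch, with open_error as a hypothetical helper name:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to open `path` read-only. Returns 0 if the open succeeded,
 * otherwise the errno value that open() reported (for example ENOENT
 * when the file does not exist, EACCES when permission is denied). */
static int open_error(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;   /* open() returned -1: the check failed */
    close(fd);          /* the descriptor is invalid after this */
    return 0;
}
```

strerror(3) turns the returned number into a readable message if you want to show it to a user.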
In the context of the kernel, there has to be some kind of mapping between the process' file descriptors and the physically opened files. The internal data structure that is mapped to the descriptor may contain yet another buffer that deals with block-based devices, or an internal pointer that points to the current read/write position.
I'd suggest you take a look at this guide through a simplified version of the open() system call. It uses the following code snippet, which is representative of what happens behind the scenes when you open a file.
int sys_open(const char *filename, int flags, int mode) {
    char *tmp = getname(filename);
    int fd = get_unused_fd();
    struct file *f = filp_open(tmp, flags, mode);
    fd_install(fd, f);
    putname(tmp);
    return fd;
}
Briefly, here's what that code does, line by line:
1. Allocate a block of kernel-controlled memory and copy the filename into it from user-controlled memory.
2. Pick an unused file descriptor, which you can think of as an integer index into a growable list of currently open files. Each process has its own such list, though it's maintained by the kernel; your code can't access it directly. An entry in the list contains whatever information the underlying filesystem will use to pull bytes off the disk, such as inode number, process permissions, open flags, and so on.
3. The filp_open function has the implementation
       struct file *filp_open(const char *filename, int flags, int mode) {
           struct nameidata nd;
           open_namei(filename, flags, mode, &nd);
           return dentry_open(nd.dentry, nd.mnt, flags);
       }
   which does two things:
   a. Use the filesystem to look up the inode (or more generally, whatever sort of internal identifier the filesystem uses) corresponding to the filename or path that was passed in.
   b. Create a struct file with the essential information about the inode and return it. This struct becomes the entry in that list of open files that I mentioned earlier.
4. Store ("install") the returned struct into the process's list of open files.
5. Free the allocated block of kernel-controlled memory.
6. Return the file descriptor, which can then be passed to file operation functions like read(), write(), and close(). Each of these will hand off control to the kernel, which can use the file descriptor to look up the corresponding file pointer in the process's list, and use the information in that file pointer to actually perform the reading, writing, or closing.
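The "unused file descriptor" step is visible from user space: POSIX specifies that open() returns the lowest-numbered descriptor not currently open in the process, so a slot freed by close() is handed out again. A small single-threaded sketch (reopened_descriptor_matches is a hypothetical name):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Open `path`, close it, and open it again. Returns 1 if both opens
 * yielded the same descriptor number, 0 if not, -1 on error. In a
 * single-threaded program the freed slot is reused, so 1 is expected. */
static int reopened_descriptor_matches(const char *path)
{
    assert(path != NULL);
    int first = open(path, O_RDONLY);
    if (first < 0)
        return -1;
    close(first);                       /* frees the slot in the table */
    int second = open(path, O_RDONLY);  /* lowest free slot is taken again */
    if (second < 0)
        return -1;
    close(second);
    return first == second;
}
```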
If you're feeling ambitious, you can compare this simplified example to the implementation of the open() system call in the Linux kernel, a function called do_sys_open(). You shouldn't have any trouble finding the similarities.
Of course, this is only the "top layer" of what happens when you call open() - or more precisely, it's the highest-level piece of kernel code that gets invoked in the process of opening a file. A high-level programming language might add additional layers on top of this. There's a lot that goes on at lower levels. (Thanks to Ruslan and pjc50 for explaining.) Roughly, from top to bottom:
open_namei() and dentry_open() invoke filesystem code, which is also part of the kernel, to access metadata and content for files and directories. The filesystem reads raw bytes from the disk and interprets those byte patterns as a tree of files and directories.
The filesystem uses the block device layer, again part of the kernel, to obtain those raw bytes from the drive. (Fun fact: Linux lets you access raw data from the block device layer using /dev/sda and the like.)
The block device layer invokes a storage device driver, which is also kernel code, to translate from a medium-level instruction like "read sector X" to individual input/output instructions in machine code. There are several types of storage device drivers, including IDE, (S)ATA, SCSI, Firewire, and so on, corresponding to the different communication standards that a drive could use. (Note that the naming is a mess.)
The I/O instructions use the built-in capabilities of the processor chip and the motherboard controller to send and receive electrical signals on the wire going to the physical drive. This is hardware, not software.
On the other end of the wire, the disk's firmware (embedded control code) interprets the electrical signals to spin the platters and move the heads (HDD), or read a flash ROM cell (SSD), or whatever is necessary to access data on that type of storage device.
This may also be somewhat incorrect due to caching. :-P Seriously though, there are many details that I've left out - a person (not me) could write multiple books describing how this whole process works. But that should give you an idea.
Any file system or operating system you want to talk about is fine by me. Nice!
On a ZX Spectrum, initializing a LOAD command will put the system into a tight loop, reading the Audio In line.
Start-of-data is indicated by a constant tone, and after that a sequence of long/short pulses follow, where a short pulse is for a binary 0 and a longer one for a binary 1 (https://en.wikipedia.org/wiki/ZX_Spectrum_software). The tight load loop gathers bits until it fills a byte (8 bits), stores this into memory, increases the memory pointer, then loops back to scan for more bits.
Typically, the first thing a loader would read is a short, fixed format header, indicating at least the number of bytes to expect, and possibly additional information such as file name, file type and loading address. After reading this short header, the program could decide whether to continue loading the main bulk of the data, or exit the loading routine and display an appropriate message for the user.
An End-of-file state could be recognized by receiving as many bytes as expected (either a fixed number of bytes, hardwired in the software, or a variable number such as indicated in a header). An error was thrown if the loading loop did not receive a pulse in the expected frequency range for a certain amount of time.
A little background on this answer
The procedure described loads data from a regular audio tape - hence the need to scan Audio In (it connected with a standard plug to tape recorders). A LOAD command is technically the same as opening a file - but it's physically tied to actually loading the file. This is because the tape recorder is not controlled by the computer, and you cannot (successfully) open a file but not load it.
The "tight loop" is mentioned because (1) the CPU, a Z80-A (if memory serves), was really slow: 3.5 MHz, and (2) the Spectrum had no internal clock! That means that it had to accurately keep count of the T-states (instruction times) for every. single. instruction. inside that loop, just to maintain the accurate beep timing.
Fortunately, that low CPU speed had the distinct advantage that you could calculate the number of cycles on a piece of paper, and thus the real world time that they would take.
It depends on the operating system what exactly happens when you open a file. Below I describe what happens in Linux as it gives you an idea what happens when you open a file and you could check the source code if you are interested in more detail. I am not covering permissions as it would make this answer too long.
In Linux every file is represented by a structure called an inode. Each inode has a unique number (within its file system), and every file has exactly one inode. This structure stores metadata for a file, for example file size, file permissions, time stamps and pointers to disk blocks, but not the file name itself. Names live in directories: each directory entry pairs a file name with an inode number for lookup. When you open a file, assuming you have the relevant permissions, a file descriptor is created that refers to the inode found through the file name. As many names can point to the same file, the inode has a link field that maintains the total count of hard links to the file. If a file is present in one directory, its link count is one; if it has an additional hard link, its link count is two. Opening a file does not change this link count; the kernel instead keeps a separate in-memory reference count for inodes that are in use, which is why an open file's data survives until the last descriptor is closed even if every name has been removed.
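The link count stored in the inode can be inspected from user space with stat(2), which reports it in st_nlink. A minimal sketch, where link_count is a hypothetical helper name; note that st_nlink counts directory entries (hard links) only:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return the number of hard links to `path` as recorded in its inode,
 * or -1 if the file cannot be examined. */
static long link_count(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_nlink;
}
```

Calling link(2) on the file and then link_count again should show the number go from one to two, and unlink(2) on the extra name should bring it back to one.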
Bookkeeping, mostly. This includes various checks like "Does the file exist?" and "Do I have the permissions to open this file for writing?".
But that's all kernel stuff - unless you're implementing your own toy OS, there isn't much to delve into (if you are, have fun - it's a great learning experience). Of course, you should still learn all the possible error codes you can receive while opening a file, so that you can handle them properly - but those are usually nice little abstractions.
The most important part on the code level is that it gives you a handle to the open file, which you use for all of the other operations you do with a file. Couldn't you use the filename instead of this arbitrary handle? Well, sure - but using a handle gives you some advantages:
The system can keep track of all the files that are currently open, and prevent them from being deleted (for example).
Modern OSs are built around handles - there's tons of useful things you can do with handles, and all the different kinds of handles behave almost identically. For example, when an asynchronous I/O operation completes on a Windows file handle, the handle is signalled - this allows you to block on the handle until it's signalled, or to complete the operation entirely asynchronously. Waiting on a file handle is exactly the same as waiting on a thread handle (signalled e.g. when the thread ends), a process handle (again, signalled when the process ends), or a socket (when some asynchronous operation completes). Just as importantly, handles are owned by their respective processes, so when a process is terminated unexpectedly (or the application is poorly written), the OS knows what handles it can release.
Most operations are positional - you read from the last position in your file. By using a handle to identify a particular "opening" of a file, you can have multiple concurrent handles to the same file, each reading from their own places. In a way, the handle acts as a moveable window into the file (and a way to issue asynchronous I/O requests, which are very handy).
Handles are much smaller than file names. A handle is usually the size of a pointer, typically 4 or 8 bytes. On the other hand, filenames can have hundreds of bytes.
Handles allow the OS to move the file, even though applications have it open - the handle is still valid, and it still points to the same file, even though the file name has changed.
There's also some other tricks you can do (for example, share handles between processes to have a communication channel without using a physical file; on unix systems, files are also used for devices and various other virtual channels, so this isn't strictly necessary), but they aren't really tied to the open operation itself, so I'm not going to delve into that.
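The point about positional handles can be demonstrated on a POSIX system. In this sketch (offsets_demo is a hypothetical name) two separate open() calls get independent read positions, while a descriptor made with dup() shares the position of the one it was copied from:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read one byte through each of three descriptors for `path`:
 *   out[0] via fd1 (first open),
 *   out[1] via fd2 (second open, its own position),
 *   out[2] via fd3 (dup of fd1, shares fd1's position).
 * Returns 0 on success, -1 on any failure. */
static int offsets_demo(const char *path, char out[3])
{
    assert(path != NULL && out != NULL);
    int fd1 = open(path, O_RDONLY);
    int fd2 = open(path, O_RDONLY);
    int fd3 = (fd1 >= 0) ? dup(fd1) : -1;

    int ok = fd1 >= 0 && fd2 >= 0 && fd3 >= 0
          && read(fd1, &out[0], 1) == 1
          && read(fd2, &out[1], 1) == 1
          && read(fd3, &out[2], 1) == 1;

    if (fd1 >= 0) close(fd1);
    if (fd2 >= 0) close(fd2);
    if (fd3 >= 0) close(fd3);
    return ok ? 0 : -1;
}
```

For a file containing "abc", fd1 and fd2 both start at the first byte, and fd3 continues where fd1 left off.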
At the core of it when opening for reading nothing fancy actually needs to happen. All it needs to do is check the file exists and the application has enough privileges to read it and create a handle on which you can issue read commands to the file.
It's on those commands that actual reading will get dispatched.
The OS will often get a head start on reading by starting a read operation to fill the buffer associated with the handle. Then when you actually do the read it can return the contents of the buffer immediately rather than needing to wait on disk IO.
For opening a new file for write the OS will need to add an entry in the directory for the new (currently empty) file. And again a handle is created on which you can issue the write commands.
Basically, a call to open needs to find the file, and then record whatever it needs to so that later I/O operations can find it again. That's quite vague, but it will be true on all the operating systems I can immediately think of. The specifics vary from platform to platform. Many answers already on here talk about modern-day desktop operating systems. I've done a little programming on CP/M, so I will offer my knowledge about how it works on CP/M (MS-DOS probably works in the same way, but for security reasons, it is not normally done like this today).
On CP/M you have a thing called the FCB (as you mentioned C, you could call it a struct; it really is a 35-byte contiguous area in RAM containing various fields). The FCB has fields to write the file-name and a (4-bit) integer identifying the disk drive. Then, when you call the kernel's Open File, you pass a pointer to this struct by placing it in one of the CPU's registers. Some time later, the operating system returns with the struct slightly changed. Whatever I/O you do to this file, you pass a pointer to this struct to the system call.
What does CP/M do with this FCB? It reserves certain fields for its own use, and uses these to keep track of the file, so you had better not ever touch them from inside your program. The Open File operation searches through the table at the start of the disk for a file with the same name as what's in the FCB (the '?' wildcard character matches any character). If it finds a file, it copies some information into the FCB, including the file's physical location(s) on the disk, so that subsequent I/O calls ultimately call the BIOS which may pass these locations to the disk driver. At this level, specifics vary.
In simple terms, when you open a file you are asking the operating system to make the desired file available for processing. The contents are not copied from secondary storage to RAM by the open itself; that happens when you read, because you cannot process the file directly on the hard disk, which is extremely slow compared to RAM.
The open command generates a system call that locates the file and returns a handle; later read calls copy the requested contents from secondary storage (hard disk) to primary storage (RAM).
And we 'close' a file so that any modified contents still held in buffers are handed over to be written back to the original file on the hard disk, and the handle is released. :)
Hope that helps.

When does the actual write() take place in C?

What really happens when write() system call is executed?
Let's say I have a program which writes certain data into a file using the write() function call. Now the C library has its own internal buffer and the OS too has its own buffer.
What interaction takes place between these buffers ?
Is it like when C library buffer gets filled completely, it writes to OS buffer and when OS buffer gets filled completely, then the actual write is done on the file?
I am looking for some detailed answers, useful links would also help. Consider this question for a UNIX system.
The write() system call (in fact, every system call) is nothing more than a contract between the application program and the OS.
for "normal" files, the write() only puts the data on a buffer, and marks that buffer as "dirty"
at some time in the future, these dirty buffers will be collected and actually written to disk. This can be forced by fsync()
this is done by the .write() "method" in the mounted-filesystem-table
and this will invoke the hardware's .write() method. (which could involve another level of buffering, such as DMA)
modern hard disks have their own buffers, so the data may or may not have actually been written to the physical disk, even if the OS (via the controller) told them to write it.
Now, some (abnormal) files don't have a write() method to support them. Imagine open()ing "/dev/null", and write()ing a buffer to it. The system could choose not to buffer it, since it will never be written anyway.
Also note that the behaviour of write() does depend on the nature of the file; for network sockets the write(fd,buff,size) can return before size bytes have been sent(write will return the number of characters sent). But it is impossible to find out where they are once they have been sent. They could still be in a network buffer (eg waiting for Nagle ...), or a buffer inside the network interface, or a buffer in a router or switch somewhere on the wire.
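The write-then-fsync sequence described above can be sketched like this. write_durably is a hypothetical helper name; as noted, even a successful fsync() only tells you the OS has pushed the data to the device, not what the drive's own buffer has done with it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Write all `len` bytes to fd (looping over short writes), then ask the
 * kernel to flush its dirty buffers for this file to the device.
 * Returns 0 on success, -1 on error with errno set. */
static int write_durably(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* normally just marks pages dirty */
        if (n < 0) {
            if (errno == EINTR)
                continue;                /* interrupted: retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return fsync(fd);                    /* force the dirty buffers out */
}
```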
As far as I know...
The write() function is a lower level thing where the library doesn't buffer data (unlike fwrite() where the library does/may buffer data).
Despite that, the only guarantee is that the OS transfers the data to disk drive before the next fsync() completes. However, hard disk drives usually have their own internal buffers that are (sometimes) beyond the OS's control, so even if a subsequent fsync() has completed it's possible for a power failure or something to occur before the data is actually written from the disk drive's internal buffer to the disk's physical media.
Essentially, if you really must make sure that your data is actually written to the disk's physical media; then you need to redesign your code to avoid this requirement, or accept a (small) risk of failure, or ensure the hardware is capable of it (e.g. get a UPS).
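The difference between write() and fwrite() mentioned above can be made visible by checking the file size before and after fflush(). A minimal sketch, assuming a message shorter than the 4096-byte buffer it installs; stdio_buffering_demo is a hypothetical name:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* fwrite() `len` bytes of `msg` to `path` and record the file size as
 * seen by the kernel before and after fflush(). Assumes len < 4096 so
 * the data fits in the library buffer. Returns 0 on success, -1 on error. */
static int stdio_buffering_demo(const char *path, const char *msg, size_t len,
                                long *before, long *after)
{
    static char iobuf[4096];
    struct stat st;

    assert(path != NULL && msg != NULL && before != NULL && after != NULL);
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    setvbuf(fp, iobuf, _IOFBF, sizeof iobuf);   /* fully buffered stream */

    if (fwrite(msg, 1, len, fp) != len || stat(path, &st) != 0) {
        fclose(fp);
        return -1;
    }
    *before = (long)st.st_size;   /* data is still in the library buffer */

    if (fflush(fp) != 0 || stat(path, &st) != 0) {
        fclose(fp);
        return -1;
    }
    *after = (long)st.st_size;    /* fflush() called write() underneath */

    fclose(fp);
    return 0;
}
```

After fflush() the data is in the OS buffers, which is the same state a direct write() would have left it in; getting it to the device is still a matter for fsync().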
write() hands the data to the operating system, making it visible to all processes (if it is something which can be read by other processes). How the operating system buffers it, and when it gets written permanently to disk, is very library, OS, system configuration and file system specific. However, sync() can be used to force buffers to be flushed.
What is guaranteed is that POSIX requires that, on a POSIX-compliant file system, a read() which can be proved to occur after a write() has returned must return the written data.
OS dependent, see man 2 sync and (on Linux) the discussion in man 8 sync.
Years ago operating systems were supposed to implement an 'elevator algorithm' to schedule writes to disk. The idea would be to minimize the disk writing head movement, which would allow a good throughput for several processes accessing the disk at the same time.
Since you're asking about UNIX, you must keep in mind that a file might actually be on, for example, an FTP server that you have mounted. Likewise, the files under /dev and /proc are not files on the HDD either.
Also, on Linux data is not written to the hard drive directly; instead there is a background flusher that writes out all pending writes every so often.
But again, those are implementation details, that really don't affect anything from the point of view of your program.

Working of Open System Call

I am reading about memory mapped files. The source says they are faster than the traditional methods of opening or reading a file, such as the open and read system calls respectively, but it does not describe how the open or read system call works.
So here's my question: how does the open system call work?
As far as I know it will load the file into memory, whereas with a mapped file only the addresses will be saved in memory and, when needed, the requested page may be brought into memory.
I expect clarification of my understanding so far.
EDIT
My previous understanding written above is almost entirely wrong; for the correct explanation refer to the accepted answer by Pawel.
Since you gave no details I'm assuming you are interested in behavior of Unix-like systems.
Actually open() system call only creates a file descriptor which then may be used by either mmap() or read().
Both memory mapped I/O and standard I/O internally access files on disk through page cache, a buffer in which files are cached in order to reduce number of I/O operations.
Standard I/O approach (using write() and read()) involves performing a system call which then copies data from (or to if you are writing) page cache to a buffer chosen by application. In addition to that non-sequential access requires another system call lseek(). System calls are expensive and so is copying data.
When a file is memory mapped usually a memory region in process address space is mapped directly to page cache, so that all reads and writes of already loaded data can be performed without any additional delay (no system calls, no data copying). Only when an application attempts to access file region that is not already loaded a page fault occurs and the kernel loads required data (whole page) from disk.
EDIT:
I see that I also have to explain memory paging. On most modern architectures there is physical memory, which is a real piece of hardware, and virtual memory, which creates address spaces for processes. The kernel decides how addresses in virtual memory are mapped to addresses in physical memory. The smallest unit is a memory page (usually, but not always, 4K). The mapping does not have to be 1:1; for example, several virtual memory pages may be mapped to the same physical page.
In memory-mapped I/O, part of the application's address space and the kernel's page cache are mapped to the same physical memory region, hence the program is able to access the page cache directly.
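A small experiment along these lines: write into a file through a MAP_SHARED mapping and read it back with pread() without calling msync(). On Linux the two paths go through the same page cache, so the data is expected to be visible right away; POSIX itself does not promise this without msync(), so treat the sketch as Linux-specific. The function name is invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo: fd must be open for reading and writing.
 * Returns 0 if pread() sees the bytes stored through the mapping,
 * 1 if it does not, and -1 on error. */
int mapped_write_visible_to_pread(int fd, const char *msg, size_t len)
{
    char copy[64];
    assert(msg != NULL);
    if (len == 0 || len > sizeof copy)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0)      /* make the file large enough to map */
        return -1;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    memcpy(p, msg, len);                    /* plain stores, no write() call */
    ssize_t n = pread(fd, copy, len, 0);    /* system call path, no msync() before it */
    int same = (n == (ssize_t)len) && memcmp(copy, msg, len) == 0;

    munmap(p, len);
    return same ? 0 : 1;
}
```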
Pawel has beautifully explained how reads and writes are performed. Let me explain the original question: how does fopen(3) work?
When a user-space process calls fopen() (defined in libc or another user-space library), the library translates it into an open(2) system call. First, it takes the arguments from fopen() and writes them into architecture-specific registers along with the open() syscall number. This number tells the kernel which system call the user-space program wants to run. After loading these registers, the user-space process traps into the kernel (via a software interrupt, traditionally INT 80H on x86; newer CPUs provide dedicated instructions such as SYSENTER or SYSCALL) and blocks.
The kernel verifies the arguments provided, access permissions, etc., and then either returns an error or invokes the actual system call implementation, which is vfs_open() in this case. vfs_open() checks for an available file descriptor in the fd array and allocates a struct file. The reference count of the accessed file is incremented and the fd is returned to the user program. That completes the work of open(), and most system calls follow the same general pattern.
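The "syscall number plus arguments" part can be seen from user space with glibc's generic syscall() wrapper. The sketch below is only an illustration: the helper name raw_open_readonly is made up, and it uses SYS_openat rather than SYS_open because not every architecture defines a plain open system call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: opens path read-only by passing the system call
 * number and its arguments to syscall(). Returns a file descriptor,
 * or -1 with errno set (for example ENOENT) on failure. */
int raw_open_readonly(const char *path)
{
    assert(path != NULL);
    /* SYS_openat is the number that selects the system call in the kernel;
     * AT_FDCWD makes a relative path resolve against the current directory. */
    return (int)syscall(SYS_openat, AT_FDCWD, path, O_RDONLY);
}
```

The returned descriptor behaves like one obtained from open(); the wrapper just makes the number and argument passing visible.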
open() together with read()/write(), followed by close(), is undoubtedly a much lengthier process than accessing a memory-mapped file through the buffer cache.
For a lucid explanation of how open and read work on Linux, you can read this. The code snippets are from an older version of the kernel but the theory still holds.
You would still need to use the open() system call to get a valid file descriptor, which you would pass on to mmap(). As to why mmap()ed I/O is faster: there is no copying of data between user-space and kernel-space buffers, which is what happens with the read and write system calls.
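A minimal sketch of that sequence, with a made-up function name: the descriptor is needed only to set up the mapping, and it can be closed right after mmap() returns because the mapping stays valid until munmap().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: sums all bytes of a file through a mapping.
 * Returns the sum, or -1 on error or for an empty file. */
long sum_bytes_after_close(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);          /* mmap() needs a valid descriptor */
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;
    const unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                              /* the mapping survives the close */
    if (p == MAP_FAILED)
        return -1;

    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];                        /* memory loads only, no read() calls */

    munmap((void *)p, len);
    return sum;
}
```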

Concurrent writes to a file using multiple threads

I have a user-level program which opens a file using the flags O_WRONLY|O_SYNC. The program creates 256 threads, each of which attempts to write 256 or more bytes of data to the file. I want to have a total of 1280000 requests, making it a total of about 300 MB of data. The program ends once 1280000 requests have been completed.
I use pthread_spin_trylock() to increment a variable which keeps track of the number of requests that have been completed. To ensure that each thread writes to a unique offset, I use pwrite() and calculate the offset as a function of the number of requests that have been written already. Hence, I don't use any mutex when actually writing to the file (does this approach ensure data integrity?)
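For reference, the scheme described above could be sketched roughly as follows. This is not the asker's code: it swaps the spinlock-protected counter for a C11 atomic_fetch_add, uses a fixed 256-byte record, and every name in it (writer_ctx, writer_thread, run_writers, RECORD_SIZE) is invented for the illustration. Each thread claims a unique request index and derives its file offset from it, so no two pwrite() calls target overlapping ranges.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdatomic.h>
#include <string.h>
#include <unistd.h>

enum { RECORD_SIZE = 256, MAX_THREADS = 256 };

struct writer_ctx {
    int fd;                 /* file opened for writing */
    long total;             /* total number of requests to issue */
    atomic_long next;       /* next unclaimed request index */
    atomic_long failed;     /* count of failed or short writes */
};

static void *writer_thread(void *arg)
{
    struct writer_ctx *ctx = arg;
    char record[RECORD_SIZE];

    for (;;) {
        /* Atomically claim one request index; no two threads get the same one. */
        long i = atomic_fetch_add(&ctx->next, 1);
        if (i >= ctx->total)
            break;
        memset(record, 'A' + (int)(i % 26), sizeof record);
        /* The offset depends only on the claimed index, so ranges never overlap. */
        ssize_t n = pwrite(ctx->fd, record, sizeof record, (off_t)i * RECORD_SIZE);
        if (n != (ssize_t)sizeof record)
            atomic_fetch_add(&ctx->failed, 1);
    }
    return NULL;
}

/* Returns the number of failed writes, or -1 if a thread could not be created. */
long run_writers(int fd, int nthreads, long total)
{
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    struct writer_ctx ctx;
    ctx.fd = fd;
    ctx.total = total;
    atomic_init(&ctx.next, 0);
    atomic_init(&ctx.failed, 0);

    pthread_t tids[MAX_THREADS];
    int started = 0;
    while (started < nthreads &&
           pthread_create(&tids[started], NULL, writer_thread, &ctx) == 0)
        started++;
    for (int t = 0; t < started; t++)
        pthread_join(tids[t], NULL);

    return started < nthreads ? -1 : atomic_load(&ctx.failed);
}
```

With the numbers from the question this would be called as run_writers(fd, 256, 1280000) on a descriptor opened with O_WRONLY|O_SYNC; the sketch itself does not depend on O_SYNC.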
When I check the average time for which the pwrite() call was blocked and the corresponding numbers (i.e., the average Q2C times -- which is the measure of the times for the complete life cycle of BIOs) as found using blktrace, I find that there is a significant difference. In fact, the average completion time for a given BIO is much greater than the average latency of a pwrite() call. What is the reason behind this discrepancy? Shouldn't these numbers be similar since O_SYNC ensures that the data is actually written to the physical medium before returning?
pwrite() is supposed to be atomic, so you should be safe there ...
In regards to the difference in latency between your syscall and the actual BIO, according to this information on the man-pages at kernel.org for open(2):
POSIX provides for three different variants of synchronized I/O, corresponding to the flags O_SYNC, O_DSYNC, and O_RSYNC. Currently (2.6.31), Linux only implements O_SYNC, but glibc maps O_DSYNC and O_RSYNC to the same numerical value as O_SYNC. Most Linux file systems don't actually implement the POSIX O_SYNC semantics, which require all metadata updates of a write to be on disk on returning to userspace, but only the O_DSYNC semantics, which require only actual file data and metadata necessary to retrieve it to be on disk by the time the system call returns.
So this basically implies that with the O_SYNC flag the entirety of the data you're attempting to write does not need to be flushed to disk before a syscall returns, but rather just enough information to be capable of retrieving it from disk ... depending on what you're writing, that could be quite a bit less than the entire buffer of data you were intending to write to disk, and therefore the actual writing of all the data will take place at a later time, after the syscall has been completed and the process has moved on to something else.
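For comparison with the flags quoted above: current versions of the open(2) man page describe O_SYNC as behaving as though each write(2) were followed by fsync(2), and O_DSYNC as though it were followed by fdatasync(2). The sketch below spells out that equivalent by hand; the name write_and_flush and its data_only parameter are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: writes len bytes, then flushes them.
 * data_only != 0 uses fdatasync() (comparable to O_DSYNC),
 * data_only == 0 uses fsync() (comparable to O_SYNC).
 * Returns 0 on success, -1 on error. */
int write_and_flush(int fd, const void *buf, size_t len, int data_only)
{
    assert(buf != NULL);
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted before writing, retry */
            return -1;
        }
        p += n;                         /* handle short writes */
        len -= (size_t)n;
    }
    return data_only ? fdatasync(fd) : fsync(fd);
}
```

Timing this helper on a descriptor opened without O_SYNC is one way to compare the two flush variants on a given file system.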

Resources