pread/pwrite, buffers and disk cache - c

If my code does something like fd = open("/dev/sdXY", ...) and pwrite(fd, ...)/pread(fd, ...), do the I/O operations skip the buffers or disk cache? Suppose /dev/sdXY is an unmounted, formatted disk partition (ext4, ufs, etc.).
I ask because my application needs to guarantee contiguous file storage, and I read that the only way to achieve that is to do something like what I described. However, I may drop the requirement for contiguous storage if it would mean losing buffering, the disk cache, or some other useful feature.
I'm also confused about whether I would need to re-implement low-level machinery, since the partition would already be formatted with a file system; I read that would be the case for raw disks/partitions. I already know I will need to track which blocks are free or in use, file and folder structures, etc., and I'm already working on that.
Another question: I have only seen buffers mentioned when reading about fopen()/fread()/fwrite() and C++'s file streams. Is it right that only these streams and the f* family of functions have some kind of buffer, unlike open/write/read/pwrite/pread/etc.? Is this buffer the same as the disk cache, or something different?
A last one: is the HDD cache handled by the drive itself or by the file system (ext4, ufs, etc.)?

The simple answer is 'it depends'. What's hard is characterizing what it depends on.
Simply using open() doesn't avoid the kernel disk buffer pool. To do that, you need special options (O_DIRECT on Linux). However, using open() does avoid hidden application buffers; you get to choose where the data is read from or written to without any intermediate copies. By contrast, the f* family of functions do have a 'hidden' application buffer: the data is frequently read into an I/O buffer associated with the FILE * file stream, and then copied into your application buffers.
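For illustration, here is a minimal sketch of bypassing the page cache with O_DIRECT on Linux. The device path and the 4096-byte alignment are assumptions for the example; the alignment actually required depends on the device and kernel version (see the NOTES in open(2)).

/* Minimal sketch: O_DIRECT read from a block device, bypassing the
   kernel page cache. Path and alignment are illustrative assumptions. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sdXY", O_RDONLY | O_DIRECT);
    if (fd == -1) { perror("open"); return 1; }

    /* O_DIRECT requires the buffer, offset and length to be aligned. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }

    ssize_t n = pread(fd, buf, 4096, 0);
    if (n == -1) perror("pread");
    else printf("read %zd bytes directly from the device\n", n);

    free(buf);
    close(fd);
    return 0;
}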
If your /dev/sdXY device is already formatted with a file system but you want to ensure contiguous storage for a file, you are going to have to replicate a significant portion of the file system driver to ensure you allocate the space correctly. It is unlikely to be a sensible use of your time or energy. Yes, you would need to reimplement all sorts of low-level disk space management — it would be entirely non-trivial. Further, the implementation for ext4 would be quite different from the implementation for ufs, etc. — so you'd really have your work cut out for you.

Related

Where and why do read(2) and write(2) system calls copy to and from userspace?

I was reading about sendfile(2) recently, and the man page states:
sendfile() copies data between one file descriptor and another.
Because this copying is done within the kernel, sendfile() is more
efficient than the combination of read(2) and write(2), which would
require transferring data to and from user space.
It got me thinking, why exactly is the combination of read()/write() slower? The man page focuses on extra copying that has to happen to and from userspace, not the total number of calls required. I took a short look at the kernel code for read and write but didn't see the copy.
Why does the copy exist in the first place? Couldn't the kernel just read from the passed buffer on a write() without first copying the whole thing into kernel space?
What about asynchronous IO interfaces like AIO and io_uring? Do they also copy?
why exactly is the combination of read()/write() slower?
The manual page is quite clear about this: doing read() and then write() requires copying the data twice.
Why does the copy exist in the first place?
It should be quite obvious: since you invoke read, you want the data copied into your process's memory, into the specified destination buffer. The same goes for write: you want the data copied out of your process's memory. The kernel doesn't know that you merely want to do a read followed by a write, and that copying back and forth twice could be avoided.
When executing read, the data is copied by the kernel from the file descriptor to the process memory. When executing write the data is copied by the kernel from the process memory to the file descriptor.
Couldn't the kernel just read from the passed buffer on a write() without first copying the whole thing into kernel space?
The crucial point here is that when you read or write a file, the kernel first has to map the file's contents from disk into memory (the page cache) in order for them to be read or written. This in-kernel caching of file pages is a huge factor in the performance of modern operating systems.
The file content is already present in kernel memory, mapped as a memory page (or more). In case of a read, the data needs to be copied from that file kernel memory page to the process memory, while in case of a write, the data needs to be copied from the process memory to the file kernel memory page. The kernel will then ensure that the data in the kernel memory page(s) corresponding to the file is correctly written back to disk when needed (if needed at all).
This "intermediate" kernel mapping can be avoided, and the file mapped directly into userspace memory, but then the application would have to manage it manually, which is complicated and easy to mess up. This is why, for normal file operations, files are mapped into kernel memory. The kernel provides high level APIs for userspace programs to interact with them, and the hard work is left to the kernel itself.
The sendfile syscall is much faster because you do not need to perform the copy two times, but only once. Assuming that you want to do a sendfile of file A to file B, then all the kernel needs to do is to copy the data from A to B. However, in the case of read + write, the kernel needs to first copy from A to your process, and then from your process to B. This double copy is of course slower, and if you don't really need to read or manipulate the data, then it's a complete waste of time.
FYI, sendfile itself is basically an easy-to-use wrapper around splice (as can be seen from the source code), which is a more generic syscall to perform zero-copy data transfer between file descriptors.
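As a concrete illustration, here is a minimal sketch of copying file A to file B in-kernel with sendfile(2). The file names are placeholders, and note that on Linux before 2.6.33 the output descriptor had to be a socket.

/* Minimal sketch: in-kernel copy with sendfile(2) on Linux.
   "A" and "B" are placeholder file names. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int in = open("A", O_RDONLY);
    int out = open("B", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in == -1 || out == -1) { perror("open"); return 1; }

    struct stat st;
    if (fstat(in, &st) == -1) { perror("fstat"); return 1; }

    off_t off = 0;                         /* sendfile advances this */
    while (off < st.st_size) {
        ssize_t n = sendfile(out, in, &off, (size_t)(st.st_size - off));
        if (n == -1) { perror("sendfile"); return 1; }
    }

    close(in);
    close(out);
    return 0;
}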
I took a short look at the kernel code for read and write but didn't see the copy.
In terms of kernel code, the whole process for reading a file is very complicated, but what the kernel ends up doing is a "special" version of memcpy(), called copy_to_user(), which copies the content of the file from kernel memory to userspace memory (doing the appropriate checks before performing the actual copy). More specifically, for files, the copyout() function is used, but the behavior is very similar; both end up calling raw_copy_to_user() (which is architecture-dependent).
What about asynchronous IO interfaces like AIO and io_uring? Do they also copy?
The aio_{read,write} libc functions defined by POSIX are just asynchronous wrappers around read and write (i.e. they still use read and write under the hood). These still copy data to/from userspace.
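For reference, here is a minimal sketch of the POSIX AIO interface (link with -lrt on glibc); note the data still arrives in a buffer you supply, so the userspace copy is not avoided. The file name is a placeholder, and the busy-wait is only for brevity.

/* Minimal sketch: asynchronous read via POSIX aio_read(3).
   "data.bin" is an illustrative file name. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    char buf[4096];
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    if (aio_read(&cb) == -1) { perror("aio_read"); return 1; }

    while (aio_error(&cb) == EINPROGRESS)
        ;                                   /* do other work here */

    ssize_t n = aio_return(&cb);            /* bytes read, as with read() */
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}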
io_uring can provide zero-copy operations, when using the O_DIRECT flag of open (see the manual page):
O_DIRECT (since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this
file. In general this will degrade performance, but it is
useful in special situations, such as when applications do
their own caching. File I/O is done directly to/from user-
space buffers. The O_DIRECT flag on its own makes an effort
to transfer data synchronously, but does not give the
guarantees of the O_SYNC flag that data and necessary metadata
are transferred. To guarantee synchronous I/O, O_SYNC must be
used in addition to O_DIRECT. See NOTES below for further
discussion.
This should be done carefully though, as it could very well degrade performance in case the userspace application does not do the appropriate caching on its own (if needed).
See also this related detailed answer on asynchronous I/O, and this LWN article on io_uring.

Reading a file of arbitrary length in C

What's the most idiomatic/efficient way to read a file of arbitrary length in C?
Get the filesize of the file in bytes and issue a single fread()
Keep fread()ing a constant size buffer until getting EOF
Anything else?
Avoid any technique which requires knowing the size of the file in advance. That leaves exactly one technique: read the file a bit at a time, in blocks of a convenient size (a sketch follows the list of reasons below).
Here's why you don't want to try to find the filesize in advance:
If it is not a regular file, there may not be any way to tell. For example, you might be reading directly from a console, or taking piped input from a previous data generator. If your program requires the filesize to be knowable, these useful input mechanisms will not be available to your users, who will complain or choose a different tool.
Even if you can figure out the filesize, you have no way of preventing it from changing while you are reading the file. If you are not careful about how you read the file, you might open a vulnerability which could be exploited by adversarial programs.
For example, if you allocate a buffer of the "correct" size and then read until you get an end-of-file condition, you may end up overwriting random memory. (Multiple reads may be necessary if you use an interface like read() which might read less data than requested.) Or you might find that the file has been truncated; if you don't check the amount of data read, you might end up processing uninitialised memory leading to information leakage.
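Here, then, is a minimal sketch of the block-at-a-time technique. The 64 KiB buffer size is an arbitrary convenient choice, and the sketch just echoes its input.

/* Minimal sketch: process a stream block by block, never asking for the
   file size up front. Works for pipes and consoles as well as files. */
#include <stdio.h>

int main(void)
{
    char buf[65536];                       /* arbitrary convenient size */
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, stdin)) > 0) {
        /* process buf[0..n-1] here; this sketch just echoes it */
        fwrite(buf, 1, n, stdout);
    }
    if (ferror(stdin)) { perror("fread"); return 1; }
    return 0;
}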
In practice, you usually don't need to keep the entire file content in memory. You'll often parse the file (notably if it is textual), or at least read the file in smaller pieces, and for that you don't need it entirely in memory. For a textual file, reading it line-by-line (perhaps with some state inside your parser) is often enough (using fgets or getline).
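For the line-by-line case, here is a minimal sketch using POSIX getline(3), which grows its buffer as needed so arbitrarily long lines are handled:

/* Minimal sketch: line-by-line reading with getline(3); the buffer is
   (re)allocated by getline itself and freed once at the end. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(void)
{
    char *line = NULL;
    size_t cap = 0;
    ssize_t len;

    while ((len = getline(&line, &cap, stdin)) != -1) {
        /* parse line[0..len-1] here (it usually ends with '\n') */
    }

    free(line);
    return 0;
}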
Files exist (notably on disks or SSDs) precisely because they can be much bigger than your computer's RAM; indeed, files were invented (more than 50 years ago) to deal with data larger than memory. Distributed file systems can also be very big (and accessed remotely, even from a laptop, e.g. by NFS, CIFS, etc.).
Some file systems are capable of storing petabytes of data (on supercomputers), with individual files of many terabytes (much larger than available RAM).
You're also likely to use some databases; these routinely hold terabytes of data. See also this answer (about the realistic size of sqlite databases).
If you really want to read a file entirely into memory using stdio, you could indeed loop on fread (or fscanf, or even fgetc) until end-of-file. You should generally avoid that, though: you usually want your program to handle more data than fits in RAM, so reading the entire file into memory is often a design error. Notice that feof is useful only after some input operation.
On current laptop or desktop computers, you might prefer (for efficiency) buffers of a few megabytes; that way you can deal with big files of several hundred gigabytes (much larger than your RAM).
On POSIX file systems, you might do memory mapped IO with e.g. mmap(2) - but that might not be faster than read(2) with large buffers (of a few megabytes). You could use readahead(2) (Linux specific) and posix_fadvise(2) (or madvise(2) if using mmap) to tune performance by hinting your OS kernel.
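As a small sketch of such a hint, assuming a file descriptor that will be read sequentially (posix_fadvise(2) is advisory only, so the kernel may ignore it):

/* Minimal sketch: advise the kernel of sequential access so it can
   read ahead more aggressively. Offset 0, length 0 means "whole file". */
#include <fcntl.h>

void hint_sequential(int fd)
{
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}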
If you have to code for Microsoft Windows, you could study its WinAPI and find some way to do memory mapped IO.
In practice, file data (notably if it was accessed recently) often stays in the page cache, which is of paramount importance for performance. When that is not the case, your hardware (disk, controller, ...) becomes the bottleneck and your program becomes I/O bound (and in that case, no software trick can significantly improve performance).

What should I use: mmap, malloc or file I/O?

Background: Our kernel-level program invokes a process in user space to make some decisions based on values in a file. The user-space program is a short-lived process that compares a value passed by the kernel with the file contents. Usually many instances of the user-space program can be invoked at a time. The file has fewer than one thousand lines.
Question: What is the preferred way to read a small file that is shared among many short-lived processes? Currently we are using file I/O (fopen, fread).
Note: The question When should I use mmap for file access? discusses this very nicely, but there is no discussion of the case of many short-lived processes.
What is the preferred way to read a small file that is shared among many short-lived processes?
getline() or fread() using standard POSIX I/O from <stdio.h>, or low-level <unistd.h> open() and read() to a sufficiently large buffer (with sufficiently aggressive growth policy); depending on how the read data is parsed/interpreted.
You don't use memory mapping for reading a file once; it is just not as efficient as read()/fread(), due to the mapping overhead.
Note that if the file contains many numbers, the actual bottleneck is the string-to-integer and string-to-floating-point conversions (strtol(), strtod(), sscanf(), etc.), because if accessed often enough the file contents will stay in the page cache. The standard implementations of string conversion functions are designed for correctness, not for efficiency.
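As a sketch of the low-level approach with a growth policy (the doubling policy, the initial 4 KiB size, and the slurp helper name are illustrative assumptions):

/* Minimal sketch: slurp a small file with open()/read() into a buffer
   that doubles whenever it fills. Caller frees the returned buffer. */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

static char *slurp(const char *path, size_t *lenp)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1) return NULL;

    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (!buf) { close(fd); return NULL; }

    for (;;) {
        if (len == cap) {                   /* buffer full: double it */
            char *tmp = realloc(buf, cap *= 2);
            if (!tmp) { free(buf); close(fd); return NULL; }
            buf = tmp;
        }
        ssize_t n = read(fd, buf + len, cap - len);
        if (n == 0) break;                  /* end of file */
        if (n == -1) { free(buf); close(fd); return NULL; }
        len += (size_t)n;
    }

    close(fd);
    *lenp = len;
    return buf;
}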
Our kernel level program invokes a process in user space for making some decisions on the basis of values in a file.
Seems very inefficient to me. Personally, I'd keep the "file" in-kernel, as a structure, and only expose a userspace interface, probably a character device, to modify its contents.
That way you only incur a context switch whenever the "file" is changed by a userspace process, and kernel-space code can simply examine the contents of the structure directly, in native format, with no overhead.
This is what e.g. netfilter (built-in firewall) and other existing stuff do.

Understanding low level file routines

I am going through Mark Burgess's "The GNU C Programming Tutorial". I have come across the following information:
Even though low-level file routines do not use buffering, and once you call write your data can be read from the file immediately, it may take up to a minute before your data is physically written to disk. (Page 142)
Firstly, is it true that "it may take up to a minute (some time) before your data is written to disk"?
Secondly, if the low-level file routines do not use buffering, why does the delay take place?
There are two places where I/O buffering can occur (at least — it could be more than just two).
One is in the application; the standard I/O functions using FILE * use buffered I/O unless you use setvbuf() to prevent it.
The other is in the kernel. Disk I/O normally goes into the kernel buffer pool, and eventually gets written by the kernel to disk. There are ways around that (O_DIRECT on Linux; raw devices on classic Unix; etc.). The key point is that the write() system call normally writes to the kernel buffer pool. The kernel takes responsibility for ensuring that the data is written to disk safely and correctly (journalling, …).
The kernel doesn't write everything to disk immediately because (a) you may add more changes to the data, (b) other people may need to read or write the data, (c) the disk drive may be busy writing something else at the other end of its 1 TiB of storage and it will take time to get the write head in position to take your data, and it would be better for the overall performance of the system if it scheduled other work before writing your changed buffer to disk. It will get written to disk. It is just not defined when, and it could be fractions of a second or multiple seconds or longer, though most often it will not take minutes for the data to be written to disk.
These days, there could also be buffering in the RAID controllers, and maybe in the individual disks inside the RAID setup, and maybe there's network buffering too if it is a remotely-mounted file system. Those add extra levels of buffering.
The read() and write() and related low-level I/O functions do not have any client-side (application) buffering — unlike the standard C I/O functions.
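A minimal sketch showing both layers: setvbuf() turning off the stdio (application) buffer, and fsync() asking the kernel to flush its own buffers for the file (out.txt is a placeholder name):

/* Minimal sketch: setvbuf() removes the application-side buffer, so
   fputs() goes straight to write(2); fsync() then asks the kernel to
   push its buffered pages for this file out to disk. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    FILE *fp = fopen("out.txt", "w");
    if (!fp) { perror("fopen"); return 1; }

    setvbuf(fp, NULL, _IONBF, 0);   /* unbuffered stdio stream */
    fputs("hello\n", fp);           /* issued as a write(2) immediately */

    fsync(fileno(fp));              /* flush the kernel buffer pool */
    fclose(fp);
    return 0;
}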
A file is said to be buffered when its contents are not read or written directly; instead, the file's bytes are written to a temporary buffer in memory.
For example, if you are reading from a file, you are reading from the buffer. Once you have read all the characters in the buffer, it is replenished with new bytes from the file. The reason for this indirection is that a memory read is much faster than a hard disk read.
The calls read and write are low-level and do not perform buffering. The stdio.h calls like getc and putc do use buffering. These higher-level APIs only call the low-level ones when the buffer must be replenished.
Writing to the hard drive is much slower than writing to RAM. When you write to a drive, the data goes to memory first and isn't always written to the disk immediately; it might not be written until that part of memory needs to be overwritten to make room for something else. This is called a write-back cache.

mmap() vs read()

I'm writing a bulk ID3 tag editor in C. ID3 tags are usually at the beginning of an mp3 encoded file, although older (version 1) tags are at the end. The app is designed to accept a directory and frame ID list from the command line, then recurse the directory structure updating all the ID3 tags it finds. The user may additionally choose to remove all older (version 1) tags. Another option is to simply display the current tags, without performing an update.
The directory might contain 2 files or 2 million. If the user means to update the files, I was planning to load the entire file into memory, perform the updates, then save it (the file may be renamed as well). However, if the user only means to print the current ID3 tags, then loading the entire file seems excessive. After all, the file could be 200 MB.
I've read through this thread, which was insightful - mmap() vs. reading blocks
So my question is, what is the most efficient way to go about this -- read(), mmap() or some combination? Design ideas welcome.
Edit: It's my understanding that mmap essentially delegates loading a file into memory to the virtual memory subsystem. It seems to me the VMM would be highly optimized on most systems, as it's critical for system performance.
It really depends on what you're trying to do. If all you need to do is hop to a known offset and read out a small tag, read() may be faster (mmap() has to do some rather complex internal accounting). If you are planning on copying out all 200 MB of the MP3, however, or scanning it for some tag that may appear at an unknown offset, then mmap() is likely a faster approach.
For example, if you need to shift the entire file down a few hundred bytes in order to insert an ID3 tag, one simple approach would be to expand the file with ftruncate(), mmap the file, then memmove() the contents down a bit. This, however, will destroy the file if your program crashes while it's running. You could also copy the contents of the file into a new file - this is another place where mmap() really shines; you can simply mmap() the old file, then copy all of its data into the new file with a single write().
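A minimal sketch of that ftruncate()/mmap()/memmove() approach, with the caveat from above that a crash mid-way corrupts the file (shift_file is a hypothetical helper name):

/* Minimal sketch: grow the file by `shift` bytes with ftruncate(), then
   slide its old contents down in the mapping. NOT crash-safe. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int shift_file(const char *path, size_t shift)
{
    int fd = open(path, O_RDWR);
    if (fd == -1) return -1;

    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return -1; }
    size_t oldsize = (size_t)st.st_size;

    if (ftruncate(fd, (off_t)(oldsize + shift)) == -1) { close(fd); return -1; }

    char *p = mmap(NULL, oldsize + shift, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    memmove(p + shift, p, oldsize);    /* slide contents down */
    memset(p, 0, shift);               /* room for the new tag */

    munmap(p, oldsize + shift);
    close(fd);
    return 0;
}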
In short, mmap() is great if you're doing a large amount of IO in terms of total bytes transferred; this is because it reduces the number of copies needed, and can significantly reduce the number of kernel entries needed for reading cached data. However mmap() requires a minimum of two trips into the kernel (three if you clean up the mapping when you're done!) and does some complex internal kernel accounting, and so the fixed overhead can be high.
read() on the other hand involves an extra memory-to-memory copy, and can thus be inefficient for large I/O operations, but is simple, and so the fixed overhead is relatively low. In short, use mmap() for large bulk I/O, and read() or pread() for one-off, small I/Os.
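For the one-off small read case, here is a sketch using pread(2) to fetch the 128-byte ID3v1 tag from the end of a file (read_id3v1 is a hypothetical helper; ID3v1 tags start with the bytes "TAG"):

/* Minimal sketch: a single small pread(2) of the ID3v1 tag, which the
   format places in the last 128 bytes of the file. */
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int read_id3v1(const char *path, unsigned char tag[128])
{
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;

    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size < 128) { close(fd); return -1; }

    ssize_t n = pread(fd, tag, 128, st.st_size - 128);
    close(fd);

    if (n != 128 || memcmp(tag, "TAG", 3) != 0) return -1;
    return 0;
}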
Don't bother with mmap unless your code is CPU bound, specifically due to lots of small reads and writes. mmap may sound nice, but it isn't the awesome "why isn't everyone using this?" alternative it looks like.
Given that you're recursing through potentially large directory structures, your bottleneck will be directory IO and concurrency. mmap is not going to help.
Update0
Reading the linked question turns up this answer, which supports my experience:
mmap() vs. reading blocks
If you're not normally going to be streaming the file in and then processing it, but rather hopping around (like reading the tags at the front and then jumping to the end, etc.), then I would use mmap, simply because your code will be cleaner and easier to maintain when it treats the file as a large buffer, without having to actually manage the buffering and paging yourself.
As has been mentioned, if you're processing a lot of data, disk I/O is likely going to dominate your processing anyway. mmap may be faster than read, but for reasonable implementations it's likely not THAT much faster, especially on today's hardware, which has continually gotten faster while disk drives have been stuck at 7,200 and 10,000 RPM for years and years.
So, go with mmap and make your code easy and neat.
I don't know whether the standard POSIX functions are within what you are allowed or willing to use for this development, but consider these two functions:
int ftruncate(int fildes, off_t length);
int truncate(const char *path, off_t length);
defined in unistd.h, which can be used to truncate a file to a specified length. In this way you could easily:
find where the ID3 tag frame begins (I don't know if you can compute it easily just by reading the header of the MP3 file, but I guess so)
save the offset
close the file
truncate the file with the provided function
open the file in append binary mode and write new tags
I'm not sure about the performance (you should test this method), but it should load much less into RAM while providing a sensible way of doing it.
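A minimal sketch of those steps for the simple ID3v1 case, where the old tag is assumed to be the last 128 bytes of the file (replace_id3v1 is a hypothetical helper name):

/* Minimal sketch: drop the old 128-byte ID3v1 tag with truncate(), then
   reopen in append binary mode and write the new tag. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int replace_id3v1(const char *path, const unsigned char newtag[128])
{
    struct stat st;
    if (stat(path, &st) == -1 || st.st_size < 128) return -1;

    /* steps 1-4: save the offset and truncate away the old tag */
    if (truncate(path, st.st_size - 128) == -1) return -1;

    /* step 5: open in append binary mode and write the new tag */
    FILE *fp = fopen(path, "ab");
    if (!fp) return -1;
    size_t n = fwrite(newtag, 1, 128, fp);
    fclose(fp);
    return n == 128 ? 0 : -1;
}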
