read() on Linux and page-aligned buffers

read() on Linux and page-aligned buffers - c

I was implementing an efficient text file loader and found some good advice from the author of GNU grep in this post:
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
One of things he suggests is to do read() calls of page aligned blocks of data into page aligned buffers. Apparently this allows the kernel to avoid some extra buffering.
I've been searching and I haven't heard anyone else back up this claim. Is it true that calling read() into a page aligned buffer (perhaps allocated with mmap/posix_memalign etc..) is actually more efficient? If its not true, is it something that used to be true? Does it heavily depend on the underlying file system or other factors like that?
Thanks!

Normally, read() will read into a kernel buffer, then copy it to user space. This extra copy is what is being discussed.
Linux supports "direct I/O" via the O_DIRECT flag to open(). This will skip kernel buffering and read directly into the userspace buffer. However, this direct I/O requires aligned accesses and buffers. So I don't think the author of that post meant that magic happens when you're aligned, but rather that if you align carefully, you can use "closer-to-the-metal" techniques to extract more performance.
mmap() is a much easier way to get the same effect. When the mapping is first set up, no I/O happens. When the user first accesses a page in the mapping, a page fault is triggered, which the kernel handles by allocating the user's page and performing the I/O to fill it. No copy. But again, the I/O happens in page-sized chunks, on page-aligned boundaries.
Whether this is a big deal or not depends on how fast memory copies happen relative to the I/O, and what proportion of CPU time is spent copying rather than doing real work. A web server, for instance, often doesn't even have to look at what it's reading: it just writes it out again out a socket (which incurs another copy). That's why a bunch of work has gone into "zerocopy" techniques like system calls sendfile() and splice(). These are specialized workloads. Normally, the buffering is too small an effect to worry about.

Related

why mmap is faster than traditional file io [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
mmap() vs. reading blocks
I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If yes then why it is faster?
mmap() is not reading sequentially.
mmap() has to fetch from the disk itself same as read() does
The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?

I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If yes then why it is faster?
It can be - there are pros and cons, listed below. When you really have reason to care, always benchmark both.
Quite apart from the actual IO efficiency, there are implications for the way the application code tracks when it needs to do the I/O, and does data processing/generation, that can sometimes impact performance quite dramatically.
mmap() is not reading sequentially.
2) mmap() has to fetch from the disk itself same as read() does
3) The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?
is wrong... mmap() assigns a region of virtual address space corresponding to file content... whenever a page in that address space is accessed, physical RAM is found to back the virtual addresses and the corresponding disk content is faulted into that RAM. So, the order in which reads are done from the disk matches the order of access. It's a "lazy" I/O mechanism. If, for example, you needed to index into a huge hash table that was to be read from disk, then mmaping the file and starting to do access means the disk I/O is not done sequentially and may therefore result in longer elapsed time until the entire file is read into memory, but while that's happening lookups are succeeding and dependent work can be undertaken, and if parts of the file are never actually needed they're not read (allow for the granularity of disk and memory pages, and that even when using memory mapping many OSes allow you to specify some performance-enhancing / memory-efficiency tips about your planned access patterns so they can proactively read ahead or release memory more aggressively knowing you're unlikely to return to it).
absolutely true
"The mapped area is not sequential" is vague. Memory mapped regions are "contiguous" (sequential) in virtual address space. We've discussed disk I/O being sequential above. Or, are you thinking of something else? Anyway, while pages are being faulted in, they may indeed be transferred using DMA.
Further, there are other reasons why memory mapping may outperform usual I/O:
there's less copying:
often OS & library level routines pass data through one or more buffers before it reaches an application-specified buffer, the application then dynamically allocates storage, then copies from the I/O buffer to that storage so the data's usable after the file reading completes
memory mapping allows (but doesn't force) in-place usage (you can just record a pointer and possibly length)
continuing to access data in-place risks increased cache misses and/or swapping later: the file/memory-map could be more verbose than data structures into which it could be parsed, so access patterns on data therein could have more delays to fault in more memory pages
memory mapping can simplify the application's parsing job by letting the application treat the entire file content as accessible, rather than worrying about when to read another buffer full
the application defers more to the OS's wisdom re number of pages that are in physical RAM at any single point in time, effectively sharing a direct-access disk cache with the application
as well-wisher comments below, "using memory mapping you typically use less system calls"
if multiple processes are accessing the same file, they should be able to share the physical backing pages
The are also reasons why mmap may be slower - do read Linus Torvald's post here which says of mmap:
...page table games along with the fault (and even just TLB miss)
overhead is easily more than the cost of copying a page in a nice
streaming manner...
And from another of his posts:
quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's The TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Linux does have "hugepages" (so one TLB entry per 2MB, instead of per 4kb) and even Transparent Huge Pages, where the OS attempts to use them even if the application code wasn't written to explicitly utilise them.
FWIW, the last time this arose for me at work, memory mapped input was 80% faster than fread et al for reading binary database records into a proprietary database, on 64 bit Linux with ~170GB files.

mmap() can share between process.
DMA will be used whenever possible. DMA does not require contiguous memory -- many high end cards support scatter-gather DMA.
The memory area may be shared with kernel block cache if possible. So there is lessor copying.
Memory for mmap is allocated by kernel, it is always aligned.

"Faster" in absolute terms doesn't exist. You'd have to specify constraints and circumstances.
mmap() is not reading sequentially.
what makes you think that? If you really access the mapped memory sequentially, the system will usually fetch the pages in that order.
mmap() has to fetch from the disk itself same as read() does
sure, but the OS determines the time and buffer size
The mapped area is not sequential - so no DMA (?).
see above
What mmap helps with is that there is no extra user space buffer involved, the "read" takes place there where the OS kernel sees fit and in chunks that can be optimized. This may be an advantage in speed, but first of all this is just an interface that is easier to use.
If you want to know about speed for a particular setup (hardware, OS, use pattern) you'd have to measure.

Where and why do read(2) and write(2) system calls copy to and from userspace?

I was reading about sendfile(2) recently, and the man page states:
sendfile() copies data between one file descriptor and another.
Because this copying is done within the kernel, sendfile() is more
efficient than the combination of read(2) and write(2), which would
require transferring data to and from user space.
It got me thinking, why exactly is the combination of read()/write() slower? The man page focuses on extra copying that has to happen to and from userspace, not the total number of calls required. I took a short look at the kernel code for read and write but didn't see the copy.
Why does the copy exist in the first place? Couldn't the kernel just read from the passed buffer on a write() without first copying the whole thing into kernel space?
What about asynchronous IO interfaces like AIO and io_uring? Do they also copy?

why exactly is the combination of read()/write() slower?
The manual page is quite clear about this. Doing read() and then write() requires to copy the data two times.
Why does the copy exist in the first place?
It should be quite obvious: since you invoke read, you want the data to be copied to the memory of your process, in the specified destination buffer. Same goes for write: you want the data to be copied from the memory of your process. The kernel doesn't really know that you just want to do a read + write, and that copying back and forth two times could be avoided.
When executing read, the data is copied by the kernel from the file descriptor to the process memory. When executing write the data is copied by the kernel from the process memory to the file descriptor.
Couldn't the kernel just read from the passed buffer on a write() without first copying the whole thing into kernel space?
The crucial point here is that when you read or write a file, the file has to be mapped from disk to memory by the kernel in order for it to be read or written. This is called memory-mapped file I/O, and it's a huge factor in the performance of modern operating systems.
The file content is already present in kernel memory, mapped as a memory page (or more). In case of a read, the data needs to be copied from that file kernel memory page to the process memory, while in case of a write, the data needs to be copied from the process memory to the file kernel memory page. The kernel will then ensure that the data in the kernel memory page(s) corresponding to the file is correctly written back to disk when needed (if needed at all).
This "intermediate" kernel mapping can be avoided, and the file mapped directly into userspace memory, but then the application would have to manage it manually, which is complicated and easy to mess up. This is why, for normal file operations, files are mapped into kernel memory. The kernel provides high level APIs for userspace programs to interact with them, and the hard work is left to the kernel itself.
The sendfile syscall is much faster because you do not need to perform the copy two times, but only once. Assuming that you want to do a sendfile of file A to file B, then all the kernel needs to do is to copy the data from A to B. However, in the case of read + write, the kernel needs to first copy from A to your process, and then from your process to B. This double copy is of course slower, and if you don't really need to read or manipulate the data, then it's a complete waste of time.
FYI, sendfile itself is basically an easy-to-use wrapper around splice (as can bee seen from the source code), which is a more generic syscall to perform zero-copy data transfer between file descriptors.
I took a short look at the kernel code for read and write but didn't see the copy.
In terms of kernel code, the whole process for reading a file is very complicated, but what the kernel ends up doing is a "special" version of memcpy(), called copy_to_user(), which copies the content of the file from the kernel memory to the userspace memory (doing the appropriate checks before performing the actual copy). More specifically, for files, the copyout() function is used, but the behavior is very similar, both end up calling raw_copy_to_user() (which is architecture-dependent).
What about asynchronous IO interfaces like AIO and io_uring? Do they also copy?
The aio_{read,write} libc functions defined by POSIX are just asynchronous wrappers around read and write (i.e. they still use read and write under the hood). These still copy data to/from userspace.
io_uring can provide zero-copy operations, when using the O_DIRECT flag of open (see the manual page):
O_DIRECT (since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this
file. In general this will degrade performance, but it is
useful in special situations, such as when applications do
their own caching. File I/O is done directly to/from user-
space buffers. The O_DIRECT flag on its own makes an effort
to transfer data synchronously, but does not give the
guarantees of the O_SYNC flag that data and necessary metadata
are transferred. To guarantee synchronous I/O, O_SYNC must be
used in addition to O_DIRECT. See NOTES below for further
discussion.
This should be done carefully though, as it could very well degrade performance in case the userspace application does not do the appropriate caching on its own (if needed).
See also this related detailed answer on asynchronous I/O, and this LWN article on io_uring.

`mmap()` manual concurrent prefaulting / paging

I'm trying to fine tune mmap() to perform fast writes or reads (generally not both) of a potentially very large file. The writes and reads will be mostly sequential on one pass and then likely very sparse on future passes. No region of memory needs to be accessed more than once.
In other words, think of it as a file transfer with some lossiness that gets fixed asynchronously.
It appears, as expected, that the main limitation of mmap()'s performance seems to be the number of minor page faults it generates on large files. Furthermore, I suspect the laziness of the Linux kernel's page-to-disk is causing some performance issues. Namely, any test programs that end up performing huge writes to mmaped memory seem to take a long time after performing all writes to terminate/munmap memory.
I was hoping to offset the cost of these faults by concurrently prefaulting pages while performing the almost-sequential access and paging out pages that I won't need again. But I have three main questions regarding this approach and my understanding of the problem:
Is there a straightforward (preferably POSIX [or at least OSX] compatible) way of performing a partial prefault? I am aware of the MAP_POPULATE flag, but this seems to attempt loading the entire file into memory, which is intolerable in many cases. Also, this seems to cause the mmap() call to block until prefaulting is complete, which is also intolerable. My idea for a manual alternative was to spawn a thread simply to try reading the next N pages in memory to force a prefetch. But it might be that madvise with MADV_SEQUENTIAL already does this, in effect.
msync() can be used to flush changes to the disk. However, is it actually useful to do this periodically? My idea is that it might be useful if the program is frequently in an "Idle" state of disk IO and can afford to squeeze in some disk writebacks. Then again, the kernel might very well be handling this itself better than the ever application could.
Is my understanding of disk IO accurate? My assumption is that prefaulting and reading/writing pages can be done concurrently by different threads or processes; if I am wrong about this, then manual prefaulting would not be useful at all. Similarly, if an msync() call blocks all disk IO, both to the filesystem cache and to the raw filesystem, then there also isn't as much of an incentive to use it over flushing the entire disk cache at the program's termination.

It appears, as expected, that the main limitation of mmap()'s performance seems to be the number of minor page faults it generates on large files.
That's not particularly surprising, I agree. But this is a cost that cannot be avoided, at least for the pages corresponding to regions of the mapped file that you actually access.
Furthermore, I suspect the laziness of the Linux kernel's page-to-disk is causing some performance issues. Namely, any test programs that end up performing huge writes to mmaped memory seem to take a long time after performing all writes to terminate/munmap memory.
That's plausible. Again, this is an unavoidable cost, at least for dirty pages, but you can exercise some influence over when those costs are incurred.
I was hoping to offset the cost of these faults by concurrently
prefaulting pages while performing the almost-sequential access and
paging out pages that I won't need again. But I have three main
questions regarding this approach and my understanding of the problem:
Is there a straightforward (preferably POSIX [or at least OSX] compatible) way of performing a partial prefault? I am aware of the
MAP_POPULATE flag, but this seems to attempt loading the entire file
into memory,
Yes, that's consistent with its documentation.
which is intolerable in many cases. Also, this seems to
cause the mmap() call to block until prefaulting is complete,
That's also as documented.
which
is also intolerable. My idea for a manual alternative was to spawn a
thread simply to try reading the next N pages in memory to force a
prefetch.
Unless there's a delay between when you initially mmap() the file and when you want to start accessing the mapping, it's not clear to me why you would expect that to provide any improvement.
But it might be that madvise with MADV_SEQUENTIAL already
does this, in effect.
If you want POSIX compatibility, then you're looking for posix_madvise(). I would indeed recommend using this function instead of trying to roll your own userspace alternative. In particular, if you use posix_madvise() to assert POSIX_MADV_SEQUENTIAL on some or all of the mapped region, then it is reasonable to hope that the kernel will read ahead to load pages before they are needed. Additionally, if you advise with POSIX_MADV_DONTNEED then you might, at the kernel's discretion, get earlier sync to disk and overall less memory use. There is other advice you can pass by this mechanism, too, if it is useful.
msync() can be used to flush changes to the disk. However, is it actually useful to do this periodically? My idea is that it might
be useful if the program is frequently in an "Idle" state of disk IO
and can afford to squeeze in some disk writebacks. Then again, the
kernel might very well be handling this itself better than the ever
application could.
This is something to test. Note that msync() supports asynchronous syncing, however, so you don't need I/O idleness. Thus, when you're sure you're done with a given page you could consider msync()ing it with flag MS_ASYNC to request that the kernel schedule a sync. This might reduce the delay incurred when you unmap the file. You'll have to experiment with combining it with posix_madvise(..., ..., POSIX_MADV_DONTNEED); they might or might not complement each other.
Is my understanding of disk IO accurate? My assumption is that prefaulting and reading/writing pages can be done concurrently by
different threads or processes; if I am wrong about this, then manual
prefaulting would not be useful at all.
It should be possible for one thread to prefault pages (by accessing them), while another reads or writes others that have already been faulted in, but it's unclear to me why you expect such a prefaulting thread to be able to run ahead of the one(s) doing the reads and writes. If it has any effect at all (i.e. if the kernel does not prefault on its own) then I would expect prefaulting a page to be more expensive than reading or writing each byte in it once.
Similarly, if an msync()
call blocks all disk IO, both to the filesystem cache and to the raw
filesystem, then there also isn't as much of an incentive to use it
over flushing the entire disk cache at the program's termination.
There is a minimum number of disk reads and writes that will need to be performed on behalf of your program. For any given mmapped file, they will all be performed on the same I/O device, and therefore they will all be serialized with respect to one another. If you are I/O bound then to a first approximation, the order in which those I/O operations are performed does not matter for overall runtime.
Thus, if the runtime is what you're concerned with, then probably neither posix_madvise() nor msync() will be of much help unless your program spends a significant fraction of its runtime on tasks that are independent of accessing the mmapped file. If you do find yourself not wholly I/O bound then my suggestion would be to see first what posix_madvise() can do for you, and to try asynchronous msync() if you need more. I'm inclined to doubt that userspace prefaulting or synchronous msync() would provide a win, but in optimization, it's always better to test than to (only) predict.

When does actual write() takes place in C?

What really happens when write() system call is executed?
Lets say I have a program which writes certain data into a file using write() function call. Now C library has its own internal buffer and OS too has its own buffer.
What interaction takes place between these buffers ?
Is it like when C library buffer gets filled completely, it writes to OS buffer and when OS buffer gets filled completely, then the actual write is done on the file?
I am looking for some detailed answers, useful links would also help. Consider this question for a UNIX system.

The write() system call (in fact all system calls) are nothing more that a contract between the application program and the OS.
for "normal" files, the write() only puts the data on a buffer, and marks that buffer as "dirty"
at some time in the future, these dirty buffers will be collected and actually written to disk. This can be forced by fsync()
this is done by the .write() "method" in the mounted-filesystem-table
and this will invoke the hardware's .write() method. (which could involve another level of buffering, such as DMA)
modern hard disks have there own buffers, which may or may not have actually been written to the physical disk, even if the OS->controller told them to.
Now, some (abnormal) files don't have a write() method to support them. Imagine open()ing "/dev/null", and write()ing a buffer to it. The system could choose not to buffer it, since it will never be written anyway.
Also note that the behaviour of write() does depend on the nature of the file; for network sockets the write(fd,buff,size) can return before size bytes have been sent(write will return the number of characters sent). But it is impossible to find out where they are once they have been sent. They could still be in a network buffer (eg waiting for Nagle ...), or a buffer inside the network interface, or a buffer in a router or switch somewhere on the wire.

As far as I know...
The write() function is a lower level thing where the library doesn't buffer data (unlike fwrite() where the library does/may buffer data).
Despite that, the only guarantee is that the OS transfers the data to disk drive before the next fsync() completes. However, hard disk drives usually have their own internal buffers that are (sometimes) beyond the OS's control, so even if a subsequent fsync() has completed it's possible for a power failure or something to occur before the data is actually written from the disk drive's internal buffer to the disk's physical media.
Essentially, if you really must make sure that your data is actually written to the disk's physical media; then you need to redesign your code to avoid this requirement, or accept a (small) risk of failure, or ensure the hardware is capable of it (e.g. get a UPS).

write() writes data to operating system, making it visible for all processes (if it is something which can be read by other processes). How operating system buffers it, or when it gets written permanently to disk, that is very library, OS, system configuration and file system specific. However, sync() can be used to force buffers to be flushed.
What is quaranteed, is that POSIX requires that, on a POSIX-compliant file system, a read() which can be proved to occur after a write() has returned must return the written data.

OS dependant, see man 2 sync and (on Linux) the discussion in man 8 sync.

Years ago operating systems were supposed to implement an 'elevator algorithm' to schedule writes to disk. The idea would be to minimize the disk writing head movement, which would allow a good throughput for several processes accessing the disk at the same time.

Since you're asking for UNIX, you must keep in mind that a file might actually be on an FTP server, which you have mounted, as an example. For example files /dev and /proc are not files on the HDD, as well.
Also, on Linux data is not written to the hard drive directly, instead there is a polling process, that flushes all pending writes every so often.
But again, those are implementation details, that really don't affect anything from the point of view of your program.

what's the proper buffer size for 'write' function?

I am using the low-level I/O function 'write' to write some data to disk in my code (C language on Linux). First, I accumulate the data in a memory buffer, and then I use 'write' to write the data to disk when the buffer is full. So what's the best buffer size for 'write'? According to my tests it isn't the bigger the faster, so I am here to look for the answer.

There is probably some advantage in doing writes which are multiples of the filesystem block size, especially if you are updating a file in place. If you write less than a partial block to a file, the OS has to read the old block, combine in the new contents and then write it out. This doesn't necessarily happen if you rapidly write small pieces in sequence because the updates will be done on buffers in memory which are flushed later. Still, once in a while you could be triggering some inefficiency if you are not filling a block (and a properly aligned one: multiple of block size at an offset which is a multiple of the block size) with each write operation.
This issue of transfer size does not necessarily go away with mmap. If you map a file, and then memcpy some data into the map, you are making a page dirty. That page has to be flushed at some later time: it is indeterminate when. If you make another memcpy which touches the same page, that page could be clean now and you're making it dirty again. So it gets written twice. Page-aligned copies of multiples-of a page size will be the way to go.

You'll want it to be a multiple of the CPU page size, in order to use memory as efficiently as possible.
But ideally you want to use mmap instead, so that you never have to deal with buffers yourself.

You could use BUFSIZ defined in <stdio.h>
Otherwise, use a small multiple of the page size sysconf(_SC_PAGESIZE) (e.g. twice that value). Most Linux systems have 4Kbytes pages (which is often the same as or a small multiple of the filesystem block size).
As other replied, using the mmap(2) system call could help. GNU systems (e.g. Linux) have an extension: the second mode string of fopen may contain the latter m and when that happens, the GNU libc try to mmap.
If you deal with data nearly as large as your RAM (or half of it), you might want to also use madvise(2) to fine-tune performance of mmap.
See also this answer to a question quite similar to yours. (You could use 64Kbytes as a reasonable buffer size).

The "best" size depends a great deal on the underlying file system.
The stat and fstat calls fill in a data structure, struct stat, that includes the following field:
blksize_t st_blksize; /* blocksize for file system I/O */
The OS is responsible for filling this field with a "good size" for write() blocks. However, it's also important to call write() with memory that is "well aligned" (e.g., the result of malloc calls). The easiest way to get this to happen is to use the provided <stdio.h> stream interface (with FILE * objects).
Using mmap, as in other answers here, can also be very fast for many cases. Note that it's not well suited to some kinds of streams (e.g., sockets and pipes) though.

It depends on the amount of RAM, VM, etc. as well as the amount of data being written. The more general answer is to benchmark what buffer works best for the load you're dealing with, and use what works the best.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight