Handling mmap and fread for big files between processes - c

I have two processes:
Process A is mapping large file (~170 GB - content constantly changes) into memory for writing with the flags MAP_NONBLOCK and MAP_SHARED:
MyDataType *myDataType = (MyDataType*)mmap(NULL, sizeof(MyDataType), PROT_WRITE, MAP_NONBLOCK | MAP_SHARED, fileDescriptor, 0);
and every second I call msync:
msync((void *)myDataType, sizeof(MyDataType), MS_ASYNC);
This section works fine.
The problem occurs when process B tries to read from the same file that process A has mapped: process A does not respond for ~20 seconds.
Process B reads from the file roughly 1000 times using fseek() and fread(), in small blocks (~4 bytes each time).
Most of the locations it reads are close to each other.
What is the cause for this problem? Is it related to pages allocation? How can I solve it?
BTW, the same problem occurs when I use mmap() in process B instead of plain fread().

msync() is likely the problem. It forces the system to write to disk, blocking the kernel in a write frenzy.
In general on Linux (it's the same on Solaris BTW), it is a bad idea to use msync() too often. There is no need to call msync() for the synchronization of data between the memory map and the read()/write() I/O operations, this is a misconception that comes from obsolete HOWTOs. In reality, mmap() makes only the file system cache "visible" for a process. This means that the memory blocks the process changes are still under kernel control. Even if your process crashed, the changes would land on the disk eventually. Other processes would also still be serviced by the same buffer.
Here is another answer on the subject: mmap, msync and linux process termination
The interesting part is the link to a discussion on realworldtech where Linus Torvalds himself explains how buffer cache and memory mapping work.
PS: the fseek()/fread() pair is also probably better replaced by pread(): one system call instead of two. Also, stdio reads a whole buffer at a time (typically 4K or more) and copies from it, so several small fread() calls without an intervening fseek() may be served from the process-local buffer and miss updates made by process A.
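The two points above (no msync() needed for visibility, and pread() instead of fseek()/fread()) can be illustrated with a minimal sketch. It assumes Linux, where a MAP_SHARED mapping and read()-style I/O are served from the same page cache; the function name write_mapped_then_pread is hypothetical, and both "processes" are played by one function for brevity.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes `value` through a MAP_SHARED mapping of `path` at byte `offset`,
   then reads it back with a single pread() call and no msync().
   Returns 0 on success (value read is stored in *out), -1 on error.
   The file must already be at least offset + sizeof(int) bytes long. */
int write_mapped_then_pread(const char *path, off_t offset, int value, int *out)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    long page = sysconf(_SC_PAGESIZE);
    off_t map_start = offset - (offset % page);   /* mmap offsets must be page-aligned */
    size_t map_len = (size_t)(offset - map_start) + sizeof(int);

    char *map = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, map_start);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* Role of process A: modify the mapping. This dirties a page cache page. */
    memcpy(map + (offset - map_start), &value, sizeof value);

    /* Role of process B: one system call, and no stdio buffer that could
       hold stale data. */
    ssize_t n = pread(fd, out, sizeof *out, offset);

    munmap(map, map_len);
    close(fd);
    return n == (ssize_t)sizeof *out ? 0 : -1;
}
```

In a real setup the write and the pread() would live in different processes; the sketch only shows that the reader sees the new value without any msync() in between.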

It sounds like you are suffering from I/O starvation, which has nothing to do with the method (mmap or fread) you choose. You will have to improve your (pre-)caching strategy and/or try another I/O scheduler (cfq being the default; maybe deadline delivers better overall results for you).
You can change the scheduler by writing to /sys:
echo deadline > /sys/block/<device>/queue/scheduler

Maybe you should try profiling or even using strace to figure out for sure where the process is spending its time. 20 s seems like an awfully long time to be explained by io in msync().
When you say A doesn't respond, what exactly do you mean?


Where and why do read(2) and write(2) system calls copy to and from userspace?

I was reading about sendfile(2) recently, and the man page states:
sendfile() copies data between one file descriptor and another.
Because this copying is done within the kernel, sendfile() is more
efficient than the combination of read(2) and write(2), which would
require transferring data to and from user space.
It got me thinking, why exactly is the combination of read()/write() slower? The man page focuses on extra copying that has to happen to and from userspace, not the total number of calls required. I took a short look at the kernel code for read and write but didn't see the copy.
Why does the copy exist in the first place? Couldn't the kernel just read from the passed buffer on a write() without first copying the whole thing into kernel space?
What about asynchronous IO interfaces like AIO and io_uring? Do they also copy?
why exactly is the combination of read()/write() slower?
The manual page is quite clear about this. Doing read() and then write() requires copying the data twice.
Why does the copy exist in the first place?
It should be quite obvious: since you invoke read, you want the data to be copied to the memory of your process, in the specified destination buffer. Same goes for write: you want the data to be copied from the memory of your process. The kernel doesn't really know that you just want to do a read + write, and that copying back and forth two times could be avoided.
When executing read, the data is copied by the kernel from the file descriptor to the process memory. When executing write the data is copied by the kernel from the process memory to the file descriptor.
Couldn't the kernel just read from the passed buffer on a write() without first copying the whole thing into kernel space?
The crucial point here is that when you read or write a file, the kernel first has to bring the file's contents from disk into its own memory (the page cache) before they can be read or written. This caching is a huge factor in the performance of modern operating systems.
The file content is already present in kernel memory, mapped as a memory page (or more). In case of a read, the data needs to be copied from that file kernel memory page to the process memory, while in case of a write, the data needs to be copied from the process memory to the file kernel memory page. The kernel will then ensure that the data in the kernel memory page(s) corresponding to the file is correctly written back to disk when needed (if needed at all).
This "intermediate" kernel mapping can be avoided, and the file mapped directly into userspace memory, but then the application would have to manage it manually, which is complicated and easy to mess up. This is why, for normal file operations, files are mapped into kernel memory. The kernel provides high level APIs for userspace programs to interact with them, and the hard work is left to the kernel itself.
The sendfile syscall is much faster because you do not need to perform the copy two times, but only once. Assuming that you want to do a sendfile of file A to file B, then all the kernel needs to do is to copy the data from A to B. However, in the case of read + write, the kernel needs to first copy from A to your process, and then from your process to B. This double copy is of course slower, and if you don't really need to read or manipulate the data, then it's a complete waste of time.
FYI, sendfile itself is basically an easy-to-use wrapper around splice (as can be seen from the source code), which is a more generic syscall to perform zero-copy data transfer between file descriptors.
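As a rough illustration of the single-copy path, here is a minimal sketch of a file-to-file copy with sendfile(). It assumes Linux 2.6.33 or later (where the output descriptor may be a regular file); the function name copy_with_sendfile is hypothetical.

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Copies all of `src` to `dst` with sendfile(), so the data never passes
   through a userspace buffer. Returns the number of bytes copied, or -1. */
off_t copy_with_sendfile(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    struct stat st;
    off_t copied = -1;
    if (fstat(in, &st) == 0) {
        copied = 0;
        while (copied < st.st_size) {
            /* NULL offset: the kernel advances the input file position itself. */
            ssize_t n = sendfile(out, in, NULL, (size_t)(st.st_size - copied));
            if (n <= 0) {          /* error or unexpected end of input */
                copied = -1;
                break;
            }
            copied += n;
        }
    }
    close(in);
    close(out);
    return copied;
}
```

The equivalent read() + write() loop would need a userspace buffer and two copies per chunk; here the loop only exists because sendfile() may transfer fewer bytes than requested.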
I took a short look at the kernel code for read and write but didn't see the copy.
In terms of kernel code, the whole process for reading a file is very complicated, but what the kernel ends up doing is a "special" version of memcpy(), called copy_to_user(), which copies the content of the file from the kernel memory to the userspace memory (doing the appropriate checks before performing the actual copy). More specifically, for files, the copyout() function is used, but the behavior is very similar, both end up calling raw_copy_to_user() (which is architecture-dependent).
What about asynchronous IO interfaces like AIO and io_uring? Do they also copy?
The aio_{read,write} libc functions defined by POSIX are just asynchronous wrappers around read and write (i.e. they still use read and write under the hood). These still copy data to/from userspace.
io_uring can provide zero-copy operations, when using the O_DIRECT flag of open (see the manual page):
O_DIRECT (since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this
file. In general this will degrade performance, but it is
useful in special situations, such as when applications do
their own caching. File I/O is done directly to/from user-
space buffers. The O_DIRECT flag on its own makes an effort
to transfer data synchronously, but does not give the
guarantees of the O_SYNC flag that data and necessary metadata
are transferred. To guarantee synchronous I/O, O_SYNC must be
used in addition to O_DIRECT. See NOTES below for further
discussion.
This should be done carefully though, as it could very well degrade performance in case the userspace application does not do the appropriate caching on its own (if needed).
See also this related detailed answer on asynchronous I/O, and this LWN article on io_uring.

`mmap()` manual concurrent prefaulting / paging

I'm trying to fine tune mmap() to perform fast writes or reads (generally not both) of a potentially very large file. The writes and reads will be mostly sequential on one pass and then likely very sparse on future passes. No region of memory needs to be accessed more than once.
In other words, think of it as a file transfer with some lossiness that gets fixed asynchronously.
It appears, as expected, that the main limitation of mmap()'s performance seems to be the number of minor page faults it generates on large files. Furthermore, I suspect the laziness of the Linux kernel's page-to-disk is causing some performance issues. Namely, any test programs that end up performing huge writes to mmaped memory seem to take a long time after performing all writes to terminate/munmap memory.
I was hoping to offset the cost of these faults by concurrently prefaulting pages while performing the almost-sequential access and paging out pages that I won't need again. But I have three main questions regarding this approach and my understanding of the problem:
Is there a straightforward (preferably POSIX [or at least OSX] compatible) way of performing a partial prefault? I am aware of the MAP_POPULATE flag, but this seems to attempt loading the entire file into memory, which is intolerable in many cases. Also, this seems to cause the mmap() call to block until prefaulting is complete, which is also intolerable. My idea for a manual alternative was to spawn a thread simply to try reading the next N pages in memory to force a prefetch. But it might be that madvise with MADV_SEQUENTIAL already does this, in effect.
msync() can be used to flush changes to the disk. However, is it actually useful to do this periodically? My idea is that it might be useful if the program is frequently in an "Idle" state of disk IO and can afford to squeeze in some disk writebacks. Then again, the kernel might very well be handling this itself better than the application ever could.
Is my understanding of disk IO accurate? My assumption is that prefaulting and reading/writing pages can be done concurrently by different threads or processes; if I am wrong about this, then manual prefaulting would not be useful at all. Similarly, if an msync() call blocks all disk IO, both to the filesystem cache and to the raw filesystem, then there also isn't as much of an incentive to use it over flushing the entire disk cache at the program's termination.
It appears, as expected, that the main limitation of mmap()'s performance seems to be the number of minor page faults it generates on large files.
That's not particularly surprising, I agree. But this is a cost that cannot be avoided, at least for the pages corresponding to regions of the mapped file that you actually access.
Furthermore, I suspect the laziness of the Linux kernel's page-to-disk is causing some performance issues. Namely, any test programs that end up performing huge writes to mmaped memory seem to take a long time after performing all writes to terminate/munmap memory.
That's plausible. Again, this is an unavoidable cost, at least for dirty pages, but you can exercise some influence over when those costs are incurred.
I was hoping to offset the cost of these faults by concurrently
prefaulting pages while performing the almost-sequential access and
paging out pages that I won't need again. But I have three main
questions regarding this approach and my understanding of the problem:
Is there a straightforward (preferably POSIX [or at least OSX] compatible) way of performing a partial prefault? I am aware of the
MAP_POPULATE flag, but this seems to attempt loading the entire file
into memory,
Yes, that's consistent with its documentation.
which is intolerable in many cases. Also, this seems to
cause the mmap() call to block until prefaulting is complete,
That's also as documented.
which
is also intolerable. My idea for a manual alternative was to spawn a
thread simply to try reading the next N pages in memory to force a
prefetch.
Unless there's a delay between when you initially mmap() the file and when you want to start accessing the mapping, it's not clear to me why you would expect that to provide any improvement.
But it might be that madvise with MADV_SEQUENTIAL already
does this, in effect.
If you want POSIX compatibility, then you're looking for posix_madvise(). I would indeed recommend using this function instead of trying to roll your own userspace alternative. In particular, if you use posix_madvise() to assert POSIX_MADV_SEQUENTIAL on some or all of the mapped region, then it is reasonable to hope that the kernel will read ahead to load pages before they are needed. Additionally, if you advise with POSIX_MADV_DONTNEED then you might, at the kernel's discretion, get earlier sync to disk and overall less memory use. There is other advice you can pass by this mechanism, too, if it is useful.
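A minimal sketch of how this advice might be applied to a mostly sequential read pass. The window size, the function name sum_mapped_file, and the use of POSIX_MADV_WILLNEED for the next window are assumptions made for illustration; all advice calls are hints, so their return values are ignored here.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define WINDOW ((size_t)64 * 4096)   /* hypothetical window size, a multiple of common page sizes */

/* Maps `path`, advises the kernel about sequential access, and sums all
   bytes window by window. Returns the sum, or -1 on error. */
long long sum_mapped_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;
    unsigned char *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                        /* the mapping stays valid after close() */
    if (map == MAP_FAILED)
        return -1;

    posix_madvise(map, len, POSIX_MADV_SEQUENTIAL);

    long long sum = 0;
    for (size_t off = 0; off < len; off += WINDOW) {
        size_t cur = len - off < WINDOW ? len - off : WINDOW;
        size_t next = off + WINDOW;
        if (next < len) {
            size_t ahead = len - next < WINDOW ? len - next : WINDOW;
            posix_madvise(map + next, ahead, POSIX_MADV_WILLNEED);  /* ask for the next window */
        }
        for (size_t i = 0; i < cur; i++)
            sum += map[off + i];
        posix_madvise(map + off, cur, POSIX_MADV_DONTNEED);         /* finished with this window */
    }
    munmap(map, len);
    return sum;
}
```

Whether any of these hints changes the timing is implementation-dependent, so this is something to measure rather than assume.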
msync() can be used to flush changes to the disk. However, is it actually useful to do this periodically? My idea is that it might
be useful if the program is frequently in an "Idle" state of disk IO
and can afford to squeeze in some disk writebacks. Then again, the
kernel might very well be handling this itself better than the application
ever could.
This is something to test. Note that msync() supports asynchronous syncing, however, so you don't need I/O idleness. Thus, when you're sure you're done with a given page you could consider msync()ing it with flag MS_ASYNC to request that the kernel schedule a sync. This might reduce the delay incurred when you unmap the file. You'll have to experiment with combining it with posix_madvise(..., ..., POSIX_MADV_DONTNEED); they might or might not complement each other.
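The write-side idea can be sketched as follows: fill the mapping chunk by chunk, and after each chunk request asynchronous writeback and tell the kernel the chunk is no longer needed. The chunk size and the name fill_mapped_file are hypothetical, and the sketch assumes the whole file fits in the address space.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK ((size_t)64 * 4096)   /* hypothetical chunk size, a multiple of common page sizes */

/* Creates `path` with `len` bytes and fills it with `fill` through a
   MAP_SHARED mapping, scheduling writeback per chunk. Returns 0 or -1. */
int fill_mapped_file(const char *path, size_t len, unsigned char fill)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (len == 0 || ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    unsigned char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;

    int rc = 0;
    for (size_t off = 0; off < len; off += CHUNK) {
        size_t cur = len - off < CHUNK ? len - off : CHUNK;
        memset(map + off, fill, cur);                 /* the actual "work" on this chunk */
        if (msync(map + off, cur, MS_ASYNC) != 0) {   /* request writeback, do not wait */
            rc = -1;
            break;
        }
        posix_madvise(map + off, cur, POSIX_MADV_DONTNEED);  /* hint only, result ignored */
    }
    if (munmap(map, len) != 0)
        rc = -1;
    return rc;
}
```

Note that the Linux msync(2) man page describes MS_ASYNC as a no-op since 2.6.19 because the kernel tracks dirty pages itself, so on Linux the per-chunk msync() may make no measurable difference; on other systems it might.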
Is my understanding of disk IO accurate? My assumption is that prefaulting and reading/writing pages can be done concurrently by
different threads or processes; if I am wrong about this, then manual
prefaulting would not be useful at all.
It should be possible for one thread to prefault pages (by accessing them), while another reads or writes others that have already been faulted in, but it's unclear to me why you expect such a prefaulting thread to be able to run ahead of the one(s) doing the reads and writes. If it has any effect at all (i.e. if the kernel does not prefault on its own) then I would expect prefaulting a page to be more expensive than reading or writing each byte in it once.
Similarly, if an msync()
call blocks all disk IO, both to the filesystem cache and to the raw
filesystem, then there also isn't as much of an incentive to use it
over flushing the entire disk cache at the program's termination.
There is a minimum number of disk reads and writes that will need to be performed on behalf of your program. For any given mmapped file, they will all be performed on the same I/O device, and therefore they will all be serialized with respect to one another. If you are I/O bound then to a first approximation, the order in which those I/O operations are performed does not matter for overall runtime.
Thus, if the runtime is what you're concerned with, then probably neither posix_madvise() nor msync() will be of much help unless your program spends a significant fraction of its runtime on tasks that are independent of accessing the mmapped file. If you do find yourself not wholly I/O bound then my suggestion would be to see first what posix_madvise() can do for you, and to try asynchronous msync() if you need more. I'm inclined to doubt that userspace prefaulting or synchronous msync() would provide a win, but in optimization, it's always better to test than to (only) predict.

Executing an external program when forking is not advisable

I have a big piece of server software that can hog 4-8GB of memory.
This makes fork-exec cumbersome, as the fork itself can take significant time, plus the default behavior seems to be that fork will fail unless there is enough memory for a copy of the entire resident memory.
Since this is starting to show as the hottest spot (60% of time spent in fork) when profiling I need to address it.
What would be the easiest way to avoid fork-exec routine?
You basically cannot avoid fork(2) (or the equivalent clone(2) syscall..., or the obsolete vfork which I don't recommend using) + execve(2) to start an external command (à la system(3), or à la posix_spawn) on Linux and (probably) MacOSX and most other Unix or POSIX systems.
What makes you think that it is becoming an issue? An 8GB process virtual address space is not a big deal today (at least on machines with 8GB or 16GB of RAM, like my desktop). In practice you don't need twice as much RAM (but you do need swap space), thanks to the lazy copy-on-write techniques used by all recent Unixes and Linux.
Perhaps you might believe that swap space could be an issue. On Linux, you could add swap space, perhaps by swapping on a file; just run as root:
dd if=/dev/zero of=/var/tmp/myswap bs=1M count=32768
mkswap /var/tmp/myswap
swapon /var/tmp/myswap
(of course, be sure that /var/tmp/ is not a tmpfs mounted filesystem, but sits on some disk, perhaps an SSD one....)
When you no longer need a lot of swap space, run swapoff /var/tmp/myswap....
You could also start some external shell process near the beginning of your program (à la popen) and later you might send shell commands to it. Look at my execicar.c program for inspiration, or use it if it fits (I wrote it 10 years ago for similar purposes, but I forgot the details)
Alternatively fork at the beginning of your program some interpreter (Lua, Guile...) and send some commands to it.
Running more than a few dozen commands per second (starting any external program) is not reasonable, and should be considered a design mistake, IMHO. Perhaps the commands that you are running could be replaced by in-process functions (e.g. /bin/ls can be done with stat, readdir, glob functions ...). Perhaps you might consider adding some plugin ability (with dlopen(3) & dlsym) to your code (and run functions from plugins instead of starting the same programs very often). Or perhaps embed an interpreter (Lua, Guile, ...) inside your code.
As an example, for web servers, look for old CGI vs FastCGI or HTTP forwarding (e.g. URL redirection) or embedded PHP or HOP or Ocsigen
This makes fork-exec cumbersome, as the fork itself can take
significant time
This is only half true. You didn't specify the OS, but fork(2) is pretty optimized in Linux (and I believe in other UNIX variants) by using copy-on-write. Copy-on-write means that the operating system will not copy the entire parent memory address space until the child (or the parent) writes to memory. So you can rest assured that if you have a parent process using 8 GB of memory and then you fork, you won't be using 16 GB of memory - especially if the child execs() something immediately.
fork will fail unless there is enough memory for a copy of the entire
resident memory.
No. The only overhead incurred by fork(2) is the copy and allocation of a task structure for the child, the allocation of a PID, and copying the parent's page tables. fork(2) will not fail if there isn't enough memory to copy the entire parent's address space, it will fail if there isn't enough memory to allocate a new task structure and the page tables. It may also fail if the maximum number of processes for the user has been reached. You can confirm this in man 2 fork (NOTE: See comments below).
If you still don't want to use fork(2), you can use vfork(2), which does no copying at all - it doesn't even copy the page tables - everything is shared with the parent. You can use that to create a new child process with a negligible overhead and then exec() something. Be aware that vfork(2) blocks the calling thread until the child either exits or calls one of the seven exec() functions. You also shouldn't modify the memory inside the child process before calling any of the exec() functions.
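A minimal sketch of the vfork()+exec pattern described above, assuming a POSIX shell at /bin/sh; the function name run_with_vfork is hypothetical. The child does nothing except exec or _exit, since it runs in the parent's address space.

```c
#define _DEFAULT_SOURCE        /* vfork() is not part of POSIX.1-2008 */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs `sh -c command` through vfork()+exec.
   Returns the command's exit status, or -1 on error or abnormal exit. */
int run_with_vfork(const char *command)
{
    pid_t pid = vfork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: shares memory with the suspended parent, so only exec or _exit. */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);            /* exec failed; _exit() leaves the parent's stdio buffers alone */
    }
    /* Parent: resumes only after the child has exec'ed or exited. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, run_with_vfork("exit 7") returns 7. Where available, posix_spawn() offers a similar effect behind a safer interface.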
You mentioned that you can fork+exec 10k times per second. That sounds very excessive. Have you considered making the things you're execing into a daemon? Or maybe implement those external programs inside your application? It sounds very dodgy to have to fork that much.
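fork most likely starts failing for you despite having the memory to back it because you're on a flavor of Linux that has disabled (or put a limit on) memory overcommit. Check the file /proc/sys/vm/overcommit_memory. If it's 1 (always overcommit) then my guess is wrong and there's something else weird going on. If it's 0 (the default heuristic mode), the kernel still refuses allocations it considers obviously excessive, which can make a fork of a very large process fail. If it's 2, overcommit is disabled and the commit limit is derived from swap plus a configurable share of RAM; you need to read the documentation for how exactly this gets configured.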
One solution mentioned above is just adding swap (that will never get used).
Another solution is to implement a small daemon that will take commands and execute those forks and execs for you piping back whatever output you need.
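A rough sketch of such a helper, under several assumptions: it is forked once at startup while the server is still small, it receives one shell command per line over a pipe, and it reports only the exit status back (no output piping). The names struct helper, helper_start, helper_run and helper_stop are all hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct helper {
    pid_t pid;
    int cmd_fd;      /* parent writes one shell command per line here */
    int status_fd;   /* parent reads the exit status as a decimal line here */
};

/* Forks the helper. Call this early, before the big allocations. */
int helper_start(struct helper *h)
{
    int to_child[2], from_child[2];
    if (pipe(to_child) != 0)
        return -1;
    if (pipe(from_child) != 0) {
        close(to_child[0]); close(to_child[1]);
        return -1;
    }
    pid_t pid = fork();
    if (pid < 0) {
        close(to_child[0]); close(to_child[1]);
        close(from_child[0]); close(from_child[1]);
        return -1;
    }
    if (pid == 0) {                                /* helper process */
        close(to_child[1]);
        close(from_child[0]);
        FILE *in = fdopen(to_child[0], "r");
        char line[1024];
        while (in && fgets(line, sizeof line, in)) {
            int rc = system(line);                 /* fork+exec happens in the small helper */
            int code = (rc != -1 && WIFEXITED(rc)) ? WEXITSTATUS(rc) : -1;
            char reply[32];
            int n = snprintf(reply, sizeof reply, "%d\n", code);
            if (write(from_child[1], reply, (size_t)n) != n)
                break;
        }
        _exit(0);
    }
    close(to_child[0]);
    close(from_child[1]);
    h->pid = pid;
    h->cmd_fd = to_child[1];
    h->status_fd = from_child[0];
    return 0;
}

/* Sends one command (without newline) and waits for its exit status. */
int helper_run(struct helper *h, const char *command)
{
    size_t len = strlen(command);
    if (write(h->cmd_fd, command, len) != (ssize_t)len ||
        write(h->cmd_fd, "\n", 1) != 1)
        return -1;
    char reply[32];
    size_t used = 0;
    while (used < sizeof reply - 1) {
        ssize_t n = read(h->status_fd, reply + used, 1);
        if (n <= 0)
            return -1;
        if (reply[used] == '\n')
            break;
        used++;
    }
    reply[used] = '\0';
    return atoi(reply);
}

void helper_stop(struct helper *h)
{
    close(h->cmd_fd);          /* EOF ends the helper's read loop */
    close(h->status_fd);
    waitpid(h->pid, NULL, 0);
}
```

The large server then only pays for two small pipe operations per command; the cost of fork() is paid by the helper, whose address space stays small.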
N.B. fork of a large process can in theory be as fast as fork of a small process. The performance of fork is determined by how many memory mappings you have rather than how much memory they cover. Setting up copy-on-write is done per mapping. The exception is that on certain operating systems setting up COW of anonymous mappings is linear in the amount of memory in those mappings, but I don't know what Linux does here; the last time I studied the VM system in Linux was over 15 years ago.

Creating unflushed file output buffers

I am trying to clear up an issue that occurs with unflushed file I/O buffers in a couple of programs, in different languages, running on Linux. The solution of flushing buffers is easy enough, but this issue of unflushed buffers happens quite randomly. Rather than seek help on what may cause it, I am interested in how to create (reproduce) and diagnose this kind of situation.
This leads to a two-part question:
Is it feasible to artificially and easily construct instances where, for a given period of time, one can have output buffers that are known to be unflushed? My searches are turning up empty. A trivial baseline is to hammer the hard drive (e.g. swapping) in one process while trying to write a large amount of data from another process. While this "works", it makes the system practically unusable: I can't poke around and see what's going on.
Are there commands from within Linux that can identify that a given process has unflushed file output buffers? Is this something that can be run at the command line, or is it necessary to query the kernel directly? I have been looking at fsync, sync, ioctl, flush, bdflush, and others. However, lacking a method for creating unflushed buffers, it's not clear what these may reveal.
In order to reproduce for others, an example for #1 in C would be excellent, but the question is truly language agnostic - just knowing an approach to create this situation would help in the other languages I'm working in.
Update 1: My apologies for any confusion. As several people have pointed out, buffers can be in the kernel space or the user space. This helped pinpoint the problems: we're creating big dirty kernel buffers. This distinction and the answers completely resolve #1: it now seems clear how to re-create unflushed buffers in either user space or kernel space. Identifying which process ID has dirty kernel buffers is not yet clear, though.
If you are interested in the kernel-buffered data, then you can tune the VM writeback through the sysctls in /proc/sys/vm/dirty_*. In particular, dirty_expire_centisecs is the age, in hundredths of a second, at which dirty data becomes eligible for writeback. Increasing this value will give you a larger window of time in which to do your investigation. You can also increase dirty_ratio and dirty_background_ratio (which are percentages of system memory, defining the point at which synchronous and asynchronous writeback start respectively).
Actually creating dirty pages is easy - just write(2) to a file and exit without syncing, or dirty some pages in a MAP_SHARED mapping of a file.
A simple program that would have an unflushed buffer would be:
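#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("moo");  /* no newline, so "moo" stays in the stdio buffer */
    pause();        /* sleep until a signal arrives, never flushing */
    return 0;
}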
When stdout is connected to a terminal, stdio by default flushes it only on newlines; otherwise it is fully buffered.
It is very easy to cause unflushed buffers by controlling the receiving side. The beauty of *nix systems is that everything looks like a file, so you can use special files to do what you want. The easiest option is a pipe. If you just want to control stdout, this is the simplest option: unflushed_program | slow_consumer. Otherwise, you can use named pipes:
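To observe a user-space buffer in a reproducible way, one option is to compare what the kernel reports for the file before and after an explicit fflush(). This is a small sketch with a hypothetical function name, show_unflushed, and an assumed 4096-byte buffer size.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Writes "moo" to `path` through a fully buffered stdio stream and reports
   the file size the kernel sees before and after fflush().
   Returns 0 on success, -1 on error. */
int show_unflushed(const char *path, long *before, long *after)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    setvbuf(f, NULL, _IOFBF, 4096);   /* full buffering, regardless of file type */
    fputs("moo", f);                  /* stays in the stdio buffer, no write(2) yet */

    struct stat st;
    if (stat(path, &st) != 0) {
        fclose(f);
        return -1;
    }
    *before = (long)st.st_size;       /* the kernel has not seen the data */

    fflush(f);                        /* hands the bytes to the kernel via write(2) */
    if (stat(path, &st) != 0) {
        fclose(f);
        return -1;
    }
    *after = (long)st.st_size;

    fclose(f);
    return 0;
}
```

After fflush() the data is in the kernel's page cache, which is the second kind of "unflushed" buffer discussed above; it only reaches the disk after writeback or an explicit fsync().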
mkfifo pipe_file
unflushed_program --output pipe_file
slow_consumer --input pipe_file
slow_consumer is most likely a program you design to read data slowly, or just read X bytes and stop.

What posix_fadvise() args for sequential file write?

I am working on an application which sequentially writes a large file (and does not read at all), and I would like to use posix_fadvise() to optimize the filesystem behavior.
The function description in the manpage suggests that the most appropriate strategy would be POSIX_FADV_SEQUENTIAL. However, the description of the Linux implementation casts doubt on that:
Under Linux, POSIX_FADV_NORMAL sets the readahead window to the default size for the backing device; POSIX_FADV_SEQUENTIAL doubles this size, and POSIX_FADV_RANDOM disables file readahead entirely.
As I'm only writing data (overwriting files possibly too), I don't expect any readahead. Should I then stick with my POSIX_FADV_SEQUENTIAL or rather use POSIX_FADV_RANDOM to disable it?
How about other options, such as POSIX_FADV_NOREUSE? Or maybe do not use posix_fadvise() for writing at all?
Most of the posix_fadvise() flags (eg POSIX_FADV_SEQUENTIAL and POSIX_FADV_RANDOM) are hints about readahead rather than writing.
There's some advice from Linus here and here about getting good sequential write performance. The idea is to break the file into large-ish (8MB) windows, then loop around doing:
Write out window N with write();
Request asynchronous write-out of window N with sync_file_range(..., SYNC_FILE_RANGE_WRITE)
Wait for the write-out of window N-1 to complete with sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER)
Drop window N-1 from the pagecache with posix_fadvise(..., POSIX_FADV_DONTNEED)
This way you never have more than two windows worth of data in the page cache, but you still get the kernel writing out part of the pagecache to disk while you fill the next part.
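The loop above might look roughly like this in C. It is a sketch under assumptions: Linux only (sync_file_range() is Linux-specific), a constant fill byte standing in for real data, and a hypothetical function name write_windowed. The sync and advice calls are treated as optimizations, so their return values are ignored.

```c
#define _GNU_SOURCE                 /* for sync_file_range() */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define WINDOW (8u * 1024 * 1024)   /* 8MB windows, as suggested above */

/* Writes `nwindows` windows of `fill` bytes to `path`. Returns 0 or -1. */
int write_windowed(const char *path, unsigned nwindows, unsigned char fill)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    char *buf = malloc(WINDOW);
    if (!buf) {
        close(fd);
        return -1;
    }
    memset(buf, fill, WINDOW);

    int rc = 0;
    for (unsigned n = 0; n < nwindows && rc == 0; n++) {
        off_t start = (off_t)n * WINDOW;

        /* 1. Write out window N (write() may be partial, so loop). */
        size_t done = 0;
        while (done < WINDOW) {
            ssize_t w = write(fd, buf + done, WINDOW - done);
            if (w < 0) {
                rc = -1;
                break;
            }
            done += (size_t)w;
        }
        if (rc != 0)
            break;

        /* 2. Request asynchronous write-out of window N. */
        sync_file_range(fd, start, WINDOW, SYNC_FILE_RANGE_WRITE);

        if (n > 0) {
            off_t prev = start - WINDOW;
            /* 3. Wait for the write-out of window N-1 to complete. */
            sync_file_range(fd, prev, WINDOW,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER);
            /* 4. Drop window N-1 from the page cache. */
            posix_fadvise(fd, prev, WINDOW, POSIX_FADV_DONTNEED);
        }
    }
    free(buf);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

sync_file_range() does not write file metadata, so if durability matters an fsync() before close() would still be needed; the sketch leaves that out.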
It all depends on the temporal locality of your data. If your application won't need the data soon after it was written, then you can try POSIX_FADV_NOREUSE to discourage caching of the data (similar in spirit to the O_DIRECT flag of open()). Check your platform first, though: the Linux man page has long documented POSIX_FADV_NOREUSE as a no-op (since kernel 2.6.18).
As far as writes go I think that you can just rely on the OS's disk I/O scheduler to do the right thing.
You should keep in mind that while posix_fadvise is there specifically to give the kernel hints about future file usage patterns the kernel also has other data to help it out.
If you don't open the file for reading then it would only need to read blocks in when they were partially written. If you were to truncate the file to 0 then it doesn't even have to do that (you said that you were overwriting).
