Does MADV_REMOVE result in a TLB shootdown? - c

I have a tmpfs with a number of large file-backed shared memory mappings. I'd like to be able to punch out any unused pages in the process that writes the files without causing major detriment to other concurrent processes. I understand that MADV_FREE and MADV_DONTNEED can/will result in a TLB shootdown, but I cannot find anything describing any potential ill effects of MADV_REMOVE.

Related

why mmap is faster than traditional file io [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
mmap() vs. reading blocks
I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If yes then why it is faster?
mmap() is not reading sequentially.
mmap() has to fetch from the disk itself same as read() does
The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?
I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If yes then why it is faster?
It can be - there are pros and cons, listed below. When you really have reason to care, always benchmark both.
Quite apart from the actual IO efficiency, there are implications for the way the application code tracks when it needs to do the I/O, and does data processing/generation, that can sometimes impact performance quite dramatically.
mmap() is not reading sequentially.
2) mmap() has to fetch from the disk itself same as read() does
3) The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?
is wrong... mmap() assigns a region of virtual address space corresponding to file content... whenever a page in that address space is accessed, physical RAM is found to back the virtual addresses and the corresponding disk content is faulted into that RAM. So, the order in which reads are done from the disk matches the order of access. It's a "lazy" I/O mechanism. If, for example, you needed to index into a huge hash table that was to be read from disk, then mmaping the file and starting to do access means the disk I/O is not done sequentially and may therefore result in longer elapsed time until the entire file is read into memory, but while that's happening lookups are succeeding and dependent work can be undertaken, and if parts of the file are never actually needed they're not read (allow for the granularity of disk and memory pages, and that even when using memory mapping many OSes allow you to specify some performance-enhancing / memory-efficiency tips about your planned access patterns so they can proactively read ahead or release memory more aggressively knowing you're unlikely to return to it).
absolutely true
"The mapped area is not sequential" is vague. Memory mapped regions are "contiguous" (sequential) in virtual address space. We've discussed disk I/O being sequential above. Or, are you thinking of something else? Anyway, while pages are being faulted in, they may indeed be transferred using DMA.
Further, there are other reasons why memory mapping may outperform usual I/O:
there's less copying:
often OS & library level routines pass data through one or more buffers before it reaches an application-specified buffer, the application then dynamically allocates storage, then copies from the I/O buffer to that storage so the data's usable after the file reading completes
memory mapping allows (but doesn't force) in-place usage (you can just record a pointer and possibly length)
continuing to access data in-place risks increased cache misses and/or swapping later: the file/memory-map could be more verbose than data structures into which it could be parsed, so access patterns on data therein could have more delays to fault in more memory pages
memory mapping can simplify the application's parsing job by letting the application treat the entire file content as accessible, rather than worrying about when to read another buffer full
the application defers more to the OS's wisdom re number of pages that are in physical RAM at any single point in time, effectively sharing a direct-access disk cache with the application
as well-wisher comments below, "using memory mapping you typically use less system calls"
if multiple processes are accessing the same file, they should be able to share the physical backing pages
The are also reasons why mmap may be slower - do read Linus Torvald's post here which says of mmap:
...page table games along with the fault (and even just TLB miss)
overhead is easily more than the cost of copying a page in a nice
streaming manner...
And from another of his posts:
quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's The TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Linux does have "hugepages" (so one TLB entry per 2MB, instead of per 4kb) and even Transparent Huge Pages, where the OS attempts to use them even if the application code wasn't written to explicitly utilise them.
FWIW, the last time this arose for me at work, memory mapped input was 80% faster than fread et al for reading binary database records into a proprietary database, on 64 bit Linux with ~170GB files.
mmap() can share between process.
DMA will be used whenever possible. DMA does not require contiguous memory -- many high end cards support scatter-gather DMA.
The memory area may be shared with kernel block cache if possible. So there is lessor copying.
Memory for mmap is allocated by kernel, it is always aligned.
"Faster" in absolute terms doesn't exist. You'd have to specify constraints and circumstances.
mmap() is not reading sequentially.
what makes you think that? If you really access the mapped memory sequentially, the system will usually fetch the pages in that order.
mmap() has to fetch from the disk itself same as read() does
sure, but the OS determines the time and buffer size
The mapped area is not sequential - so no DMA (?).
see above
What mmap helps with is that there is no extra user space buffer involved, the "read" takes place there where the OS kernel sees fit and in chunks that can be optimized. This may be an advantage in speed, but first of all this is just an interface that is easier to use.
If you want to know about speed for a particular setup (hardware, OS, use pattern) you'd have to measure.

Window control for mmapped large file(linux, mmap)

How can we control the window in RSS when mapping a large file? Now let me explain what i mean.
For example, we have a large file that exceeds RAM by several times, we do shared memory mmaping for several processes, if we access some object whose virtual address is located in this mapped memory and catch a page fault, then reading from disk, the sub-question is, will the opposite happen if we no longer use the given object? If this happens like an LRU, then what is the size of the LRU and how to control it? How is page cache involved in this case?
RSS graph
This is the RSS graph on testing instance(2 thread, 8 GB RAM) for 80 GB tar file. Where does this value of 3800 MB come from and stay stable when I run through the file after it has been mapped? How can I control it (or advise the kernel to control it)?
As long as you're not taking explicit action to lock the pages in memory, they should eventually be swapped back out automatically. The kernel basically uses a memory pressure heuristic to decide how much of physical memory to devote to swapped-in pages, and frequently rebalances as needed.
If you want to take a more active role in controlling this process, have a look at the madvise() system call.
This allows you to tweak the paging algorithm for your mmap, with actions like:
MADV_FREE (since Linux 4.5)
The application no longer requires the pages in the range specified by addr and len. The kernel can thus free these pages, but the freeing could be delayed until memory pressure occurs. ...
MADV_COLD (since Linux 5.4)
Deactivate a given range of pages. This will make the pages a more probable reclaim target should there be a memory pressure.
MADV_SEQUENTIAL
Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)
MADV_WILLNEED
Expect access in the near future. (Hence, it might be a good idea to read some pages ahead.)
MADV_DONTNEED
Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) ...
Issuing an madvise(MADV_SEQUENTIAL) after creating the mmap might be sufficient to get acceptable behavior. If not, you could also intersperse some MADV_WILLNEED/MADV_DONTNEED access hints (and/or MADV_FREE/MADV_COLD) during the traversal as you pass groups of pages.

`mmap()` manual concurrent prefaulting / paging

I'm trying to fine tune mmap() to perform fast writes or reads (generally not both) of a potentially very large file. The writes and reads will be mostly sequential on one pass and then likely very sparse on future passes. No region of memory needs to be accessed more than once.
In other words, think of it as a file transfer with some lossiness that gets fixed asynchronously.
It appears, as expected, that the main limitation of mmap()'s performance seems to be the number of minor page faults it generates on large files. Furthermore, I suspect the laziness of the Linux kernel's page-to-disk is causing some performance issues. Namely, any test programs that end up performing huge writes to mmaped memory seem to take a long time after performing all writes to terminate/munmap memory.
I was hoping to offset the cost of these faults by concurrently prefaulting pages while performing the almost-sequential access and paging out pages that I won't need again. But I have three main questions regarding this approach and my understanding of the problem:
Is there a straightforward (preferably POSIX [or at least OSX] compatible) way of performing a partial prefault? I am aware of the MAP_POPULATE flag, but this seems to attempt loading the entire file into memory, which is intolerable in many cases. Also, this seems to cause the mmap() call to block until prefaulting is complete, which is also intolerable. My idea for a manual alternative was to spawn a thread simply to try reading the next N pages in memory to force a prefetch. But it might be that madvise with MADV_SEQUENTIAL already does this, in effect.
msync() can be used to flush changes to the disk. However, is it actually useful to do this periodically? My idea is that it might be useful if the program is frequently in an "Idle" state of disk IO and can afford to squeeze in some disk writebacks. Then again, the kernel might very well be handling this itself better than the ever application could.
Is my understanding of disk IO accurate? My assumption is that prefaulting and reading/writing pages can be done concurrently by different threads or processes; if I am wrong about this, then manual prefaulting would not be useful at all. Similarly, if an msync() call blocks all disk IO, both to the filesystem cache and to the raw filesystem, then there also isn't as much of an incentive to use it over flushing the entire disk cache at the program's termination.
It appears, as expected, that the main limitation of mmap()'s performance seems to be the number of minor page faults it generates on large files.
That's not particularly surprising, I agree. But this is a cost that cannot be avoided, at least for the pages corresponding to regions of the mapped file that you actually access.
Furthermore, I suspect the laziness of the Linux kernel's page-to-disk is causing some performance issues. Namely, any test programs that end up performing huge writes to mmaped memory seem to take a long time after performing all writes to terminate/munmap memory.
That's plausible. Again, this is an unavoidable cost, at least for dirty pages, but you can exercise some influence over when those costs are incurred.
I was hoping to offset the cost of these faults by concurrently
prefaulting pages while performing the almost-sequential access and
paging out pages that I won't need again. But I have three main
questions regarding this approach and my understanding of the problem:
Is there a straightforward (preferably POSIX [or at least OSX] compatible) way of performing a partial prefault? I am aware of the
MAP_POPULATE flag, but this seems to attempt loading the entire file
into memory,
Yes, that's consistent with its documentation.
which is intolerable in many cases. Also, this seems to
cause the mmap() call to block until prefaulting is complete,
That's also as documented.
which
is also intolerable. My idea for a manual alternative was to spawn a
thread simply to try reading the next N pages in memory to force a
prefetch.
Unless there's a delay between when you initially mmap() the file and when you want to start accessing the mapping, it's not clear to me why you would expect that to provide any improvement.
But it might be that madvise with MADV_SEQUENTIAL already
does this, in effect.
If you want POSIX compatibility, then you're looking for posix_madvise(). I would indeed recommend using this function instead of trying to roll your own userspace alternative. In particular, if you use posix_madvise() to assert POSIX_MADV_SEQUENTIAL on some or all of the mapped region, then it is reasonable to hope that the kernel will read ahead to load pages before they are needed. Additionally, if you advise with POSIX_MADV_DONTNEED then you might, at the kernel's discretion, get earlier sync to disk and overall less memory use. There is other advice you can pass by this mechanism, too, if it is useful.
msync() can be used to flush changes to the disk. However, is it actually useful to do this periodically? My idea is that it might
be useful if the program is frequently in an "Idle" state of disk IO
and can afford to squeeze in some disk writebacks. Then again, the
kernel might very well be handling this itself better than the ever
application could.
This is something to test. Note that msync() supports asynchronous syncing, however, so you don't need I/O idleness. Thus, when you're sure you're done with a given page you could consider msync()ing it with flag MS_ASYNC to request that the kernel schedule a sync. This might reduce the delay incurred when you unmap the file. You'll have to experiment with combining it with posix_madvise(..., ..., POSIX_MADV_DONTNEED); they might or might not complement each other.
Is my understanding of disk IO accurate? My assumption is that prefaulting and reading/writing pages can be done concurrently by
different threads or processes; if I am wrong about this, then manual
prefaulting would not be useful at all.
It should be possible for one thread to prefault pages (by accessing them), while another reads or writes others that have already been faulted in, but it's unclear to me why you expect such a prefaulting thread to be able to run ahead of the one(s) doing the reads and writes. If it has any effect at all (i.e. if the kernel does not prefault on its own) then I would expect prefaulting a page to be more expensive than reading or writing each byte in it once.
Similarly, if an msync()
call blocks all disk IO, both to the filesystem cache and to the raw
filesystem, then there also isn't as much of an incentive to use it
over flushing the entire disk cache at the program's termination.
There is a minimum number of disk reads and writes that will need to be performed on behalf of your program. For any given mmapped file, they will all be performed on the same I/O device, and therefore they will all be serialized with respect to one another. If you are I/O bound then to a first approximation, the order in which those I/O operations are performed does not matter for overall runtime.
Thus, if the runtime is what you're concerned with, then probably neither posix_madvise() nor msync() will be of much help unless your program spends a significant fraction of its runtime on tasks that are independent of accessing the mmapped file. If you do find yourself not wholly I/O bound then my suggestion would be to see first what posix_madvise() can do for you, and to try asynchronous msync() if you need more. I'm inclined to doubt that userspace prefaulting or synchronous msync() would provide a win, but in optimization, it's always better to test than to (only) predict.

madvise managing memory mapped files

I have a program that processes a large dataset consisting of a large number (300+) of sizable memory (40MB+) mapped files. All the files are needed together though they are accessed in a sequential way. at the moment I am memory mapping the files and then using madvise with MADV_SEQUENTIAL since I don't want the thing to be any more of a memory hog than it needs to be (without any madvise the consumption becomes a problem). The problem it that the program runs much slower (like 50x slower) than the diskio of the system would indicate it should, and becomes worse faster than linearly. as the number of files are involved are increased. Processing 100 files is more than 10x faster than 300 files despite being only 3x the data. I suspect that the memory mapped files are generating a page fault every time a 4kb page is crossed, net result disk seek time is greater than disk transfer time.
Can anyone think of a better way than using madvise with MADV_WILLNEED and MADV_DONTNEED every so often, and if this is the best way, any ideas as to how far to look ahead?

real-time writes to disk

I have a thread that needs to write data from an in-memory buffer to a disk thousands of times. I have some requirements of how long each write takes because the buffer needs to be cleared for a separate thread to write to it again.
I have tested the disk with dd. I'm not using any filesystem on it and writing directly to the disk (opening it with the direct flag). I am able to get about 100 MB/s with a 32K block size.
In my application, I noticed I wasn't able to write data to the disk at nearly this speed. So I looked into what was happening and I find that some writes are taking very long. My block of code looks like (this is in C by the way):
last = get_timestamp();
write();
now = get_timestamp();
if (longest_write < now - last)
longest_write = now - last;
And at the end I print out the longest write. I found that for a 32K buffer, I am seeing a longest write speed of about 47ms. This is way too long to meet the requirements of my application. I don't think this can be solely attributed to rotational latency of the disk. Any ideas what is going on and what I can do to get more stable write speeds? Thanks
Edit:
I am in fact using multiple buffers of the type I declare above and striping between them to multiple disks. One solution to my problem would be to just increase the number of buffers to amortize the cost of long writes. However I would like to keep the amount of memory being used for buffering as small as possible to avoid dirtying the cache of the thread that is producing the data written into the buffer. My question should be constrained to dealing with variance in the latency of writing a small block to disk and how to reduce it.
I'm assuming that you are using an ATA or SATA drive connected to the built-in disk controller in a standard computer. Is this a valid assumption, or are you using anything out of the ordinary (hardware RAID controller, SCSI drives, external drive, etc)?
As an engineer who does a lot of disk I/O performance testing at work, I would say that this sounds a lot like your writes are being cached somewhere. Your "high latency" I/O is a result of that cache finally being flushed. Even without a filesystem, I/O operations can be cached in the I/O controller or in the disk itself.
To get a better view of what is going on, record not just your max latency, but your average latency as well. Consider recording your max 10-15 latency samples so you can get a better picture of how (in-)frequent these high-latency samples are. Also, throw out the data recorded in the first two or three seconds of your test and start your data logging after that. There can be high-latency I/O operations seen at the start of a disk test that aren't indicative of the disk's true performance (can be caused by things like the disk having to rev up to full speed, the head having to do a large initial seek, disk write cache being flushed, etc).
If you are wanting to benchmark disk I/O performance, I would recommend using a tool like IOMeter instead of using dd or rolling your own. IOMeter makes it easy to see what kind of a difference it makes to change the I/O size, alignment, etc, plus it keeps track of a number of useful statistics.
Requiring an I/O operation to happen within a certain amount of time is a risky thing to do. For one, other applications on the system can compete with you for disk access or CPU time and it is nearly impossible to predict their exact effect on your I/O speeds. Your disk might encounter a bad block, in which case it has to do some extra work to remap the affected sectors before processing your I/O. This introduces an unpredictable delay. You also can't control what the OS, driver, and disk controller are doing. Your I/O request may get backed up in one of those layers for any number of unforseeable reasons.
If the only reason you have a hard limit on I/O time is because your buffer is being re-used, consider changing your algorithm instead. Try using a circular buffer so that you can flush data out of it while writing into it. If you see that you are filling it faster than flushing it, you can throttle back your buffer usage. Alternatively, you can also create multiple buffers and cycle through them. When one buffer fills up, write that buffer to disk and switch to the next one. You can be writing to the new buffer even if the first is still being written.
Response to comment:
You can't really "get the kernel out of the way", it's the lowest level in the system and you have to go through it to one degree or another. You might be able to build a custom version of the driver for your disk controller (provided it's open source) and build in a "high-priority" I/O path for your application to use. You are still at the mercy of the disk controller's firmware and the firmware/hardware of the drive itself, which you can't necessarily predict or do anything about.
Hard drives traditionally perform best when doing large, sequential I/O operations. Drivers, device firmware, and OS I/O subsystems take this into account and try to group smaller I/O requests together so that they only have to generate a single, large I/O request to the drive. If you are only flushing 32K at a time, then your writes are probably being cached at some level, coalesced, and sent to the drive all at once. By defeating this coalescing, you should reduce the number of I/O latency "spikes" and see more uniform disk access times. However, these access times will be much closer to the large times seen in your "spikes" than the moderate times that you are normally seeing. The latency spike corresponds to an I/O request that didn't get coalesced with any others and thus had to absorb the entire overhead of a disk seek. Request coalescing is done for a reason; by bundling requests you are amortizing the overhead of a drive seek operation over multiple commands. Defeating coalescing leads to doing more seek operations than you would normally, giving you overall slower I/O speeds. It's a trade-off: you reduce your average I/O latency at the expense of occasionally having an abnormal, high-latency operation. It is a beneficial trade-off, however, because the increase in average latency associated with disabling coalescing is nearly always more of a disadvantage than having a more consistent access time is an advantage.
I'm also assuming that you have already tried adjusting thread priorities, and that this isn't a case of your high-bandwidth producer thread starving out the buffer-flushing thread for CPU time. Have you confirmed this?
You say that you do not want to disturb the high-bandwidth thread that is also running on the system. Have you actually tested various output buffer sizes/quantities and measured their impact on the other thread? If so, please share some of the results you measured so that we have more information to use when brainstorming.
Given the amount of memory that most machines have, moving from a 32K buffer to a system that rotates through 4 32K buffers is a rather inconsequential jump in memory usage. On a system with 1GB of memory, the increase in buffer size represents only 0.0092% of the system's memory. Try moving to a system of alternating/rotating buffers (to keep it simple, start with 2) and measure the impact on your high-bandwidth thread. I'm betting that the extra 32K of memory isn't going to have any sort of noticeable impact on the other thread. This shouldn't be "dirtying the cache" of the producer thread. If you are constantly using these memory regions, they should always be marked as "in use" and should never get swapped out of physical memory. The buffer being flushed must stay in physical memory for DMA to work, and the second buffer will be in memory because your producer thread is currently writing to it. It is true that using an additional buffer will reduce the total amount of physical memory available to the producer thread (albeit only very slightly), but if you are running an application that requires high bandwidth and low latency then you would have designed your system such that it has quite a lot more than 32K of memory to spare.
Instead of solving the problem by trying to force the hardware and low-level software to perform to specific performance measurements, the easier solution is to adjust your software to fit the hardware. If you measure your max write latency to be 1 second (for the sake of nice round numbers), write your program such that a buffer that is flushed to disk will not need to be re-used for at least 2.5-3 seconds. That way you cover your worst-case scenario, plus provide a safety margin in case something really unexpected happens. If you use a system where you rotate through 3-4 output buffers, you shouldn't have to worry about re-using a buffer before it gets flushed. You aren't going to be able to control the hardware too closely, and if you are already writing to a raw volume (no filesystem) then there's not much between you and the hardware that you can manipulate or eliminate. If your program design is inflexible and you are seeing unacceptable latency spikes, you can always try a faster drive. Solid-state drives don't have to "seek" to do I/O operations, so you should see a fairly uniform hardware I/O latency.
As long as you are using O_DIRECT | O_SYNC, you can use ioprio_set() to set the IO scheduling priority of your process/thread (although the man page says "process", I believe you can pass a TID as given by gettid()).
If you set a real-time IO class, then your IO will always be given first access to the disk - it sounds like this is what you want.
I have a thread that needs to write data from an in-memory buffer to a disk thousands of times.
I have tested the disk with dd. I'm not using any filesystem on it and writing directly to the disk (opening it with the direct flag). I am able to get about 100 MB/s with a 32K block size.
The dd's block size is aligned with file system block size. I guess your log file isn't.
Plus probably your application writes not only the log file, but also does some other file operations. Or your application isn't alone using the disk.
Generally, disk I/O isn't optimized for latencies, it is optimized for the throughput. High latencies are normal - and networked file systems have them even higher.
In my application, I noticed I wasn't able to write data to the disk at nearly this speed. So I looked into what was happening and I find that some writes are taking very long.
Some writes take longer time because after some point of time you saturate the write queue and OS finally decides to actually flush the data to disk. The I/O queues by default configured pretty short: to avoid excessive buffering and information loss due to a crash.
N.B. If you want to see the real speed, try setting the O_DSYNC flag when opening the file.
If your blocks are really aligned you might try using the O_DIRECT flag, since that would remove contentions (with other applications) on the Linux disk cache level. The writes would work at the real speed of the disk.
100MB/s with dd - without any syncing - is a highly synthetic benchmark, as you never know that data have really hit the disk. Try adding conv=dsync to the dd's command line.
Also trying using larger block size. 32K is still small. IIRC 128K size was the optimal when I was testing sequential vs. random I/O few years ago.
I am seeing a longest write speed of about 47ms.
"Real time" != "fast". If I define max response time of 50ms, and your app consistently responds within the 50ms (47 < 50) then your app would classify as real-time.
I don't think this can be solely attributed to rotational latency of the disk. Any ideas what is going on and what I can do to get more stable write speeds?
I do not think you can avoid the write() delays. Latencies are the inherit property of the disk I/O. You can't avoid them - you have to expect and handle them.
I can think only of the following option: use two buffers. First would be used by write(), second - used for storing new incoming data from threads. When write() finishes, switch the buffers and if there is something to write, start writing it. That way there is always a buffer for threads to put the information into. Overflow might still happen if threads generate information faster than the write() can write. Dynamically adding more buffers (up to some limit) might help in the case.
Otherwise, you can achieve some sort of real-time-ness for (rotational) disk I/O only if your application is the sole user of the disk. (Old rule of real time applications applies: there can be only one.) O_DIRECT helps somehow to remove the influence of the OS itself from the equation. (Though you would still have the overhead of file system in form of occasional delays due to block allocation for the file extension. Under Linux that works pretty fast, but still can be avoided by preallocating the whole file in advance, e.g. by writing zeros.) If the timing is really important, consider buying dedicated disk for the job. SSDs have excellent throughput and do not suffer from the seeking.
Are you writing to a new file or overwriting the same file?
The big difference with dd is likely to be seek time, dd is streaming to a contigous (mostly) list of blocks, if you are writing lots of small files the head may be seeking all over the drive to allocate them.
The best way of solving the problem is likely to be removing the requirement for the log to be written in a specific time. Can you use a set of buffers so that one is being written (or at least sent to the drives's buffer) while new log data is arriving into another one?
linux does not write anything directly to the disk it will use the virtual memory and then, a kernel thread call pdflush will write these datas to the disk , the behavior of pdflush could be controlled through sysctl -w ""

Resources