POSIX environments provide at least two ways of accessing files. There's the standard system calls open(), read(), write(), and friends, but there's also the option of using mmap() to map the file into virtual memory.
When is it preferable to use one over the other? What're their individual advantages that merit including two interfaces?
mmap is great if you have multiple processes accessing data in a read only fashion from the same file, which is common in the kind of server systems I write. mmap allows all those processes to share the same physical memory pages, saving a lot of memory.
mmap also allows the operating system to optimize paging operations. For example, consider two programs; program A which reads in a 1MB file into a buffer creating with malloc, and program B which mmaps the 1MB file into memory. If the operating system has to swap part of A's memory out, it must write the contents of the buffer to swap before it can reuse the memory. In B's case any unmodified mmap'd pages can be reused immediately because the OS knows how to restore them from the existing file they were mmap'd from. (The OS can detect which pages are unmodified by initially marking writable mmap'd pages as read only and catching seg faults, similar to Copy on Write strategy).
mmap is also useful for inter process communication. You can mmap a file as read / write in the processes that need to communicate and then use synchronization primitives in the mmap'd region (this is what the MAP_HASSEMAPHORE flag is for).
One place mmap can be awkward is if you need to work with very large files on a 32 bit machine. This is because mmap has to find a contiguous block of addresses in your process's address space that is large enough to fit the entire range of the file being mapped. This can become a problem if your address space becomes fragmented, where you might have 2 GB of address space free, but no individual range of it can fit a 1 GB file mapping. In this case you may have to map the file in smaller chunks than you would like to make it fit.
Another potential awkwardness with mmap as a replacement for read / write is that you have to start your mapping on offsets of the page size. If you just want to get some data at offset X you will need to fixup that offset so it's compatible with mmap.
And finally, read / write are the only way you can work with some types of files. mmap can't be used on things like pipes and ttys.
One area where I found mmap() to not be an advantage was when reading small files (under 16K). The overhead of page faulting to read the whole file was very high compared with just doing a single read() system call. This is because the kernel can sometimes satisify a read entirely in your time slice, meaning your code doesn't switch away. With a page fault, it seemed more likely that another program would be scheduled, making the file operation have a higher latency.
mmap has the advantage when you have random access on big files. Another advantage is that you access it with memory operations (memcpy, pointer arithmetic), without bothering with the buffering. Normal I/O can sometimes be quite difficult when using buffers when you have structures bigger than your buffer. The code to handle that is often difficult to get right, mmap is generally easier. This said, there are certain traps when working with mmap.
As people have already mentioned, mmap is quite costly to set up, so it is worth using only for a given size (varying from machine to machine).
For pure sequential accesses to the file, it is also not always the better solution, though an appropriate call to madvise can mitigate the problem.
You have to be careful with alignment restrictions of your architecture(SPARC, itanium), with read/write IO the buffers are often properly aligned and do not trap when dereferencing a casted pointer.
You also have to be careful that you do not access outside of the map. It can easily happen if you use string functions on your map, and your file does not contain a \0 at the end. It will work most of the time when your file size is not a multiple of the page size as the last page is filled with 0 (the mapped area is always in the size of a multiple of your page size).
In addition to other nice answers, a quote from Linux system programming written by Google's expert Robert Love:
Advantages of mmap( )
Manipulating files via mmap( ) has a handful of advantages over the
standard read( ) and write( ) system calls. Among them are:
Reading from and writing to a memory-mapped file avoids the
extraneous copy that occurs when using the read( ) or write( ) system
calls, where the data must be copied to and from a user-space buffer.
Aside from any potential page faults, reading from and writing to a memory-mapped file does not incur any system call or context switch
overhead. It is as simple as accessing memory.
When multiple processes map the same object into memory, the data is shared among all the processes. Read-only and shared writable
mappings are shared in their entirety; private writable mappings have
their not-yet-COW (copy-on-write) pages shared.
Seeking around the mapping involves trivial pointer manipulations. There is no need for the lseek( ) system call.
For these reasons, mmap( ) is a smart choice for many applications.
Disadvantages of mmap( )
There are a few points to keep in mind when using mmap( ):
Memory mappings are always an integer number of pages in size. Thus, the difference between the size of the backing file and an
integer number of pages is "wasted" as slack space. For small files, a
significant percentage of the mapping may be wasted. For example, with
4 KB pages, a 7 byte mapping wastes 4,089 bytes.
The memory mappings must fit into the process' address space. With a 32-bit address space, a very large number of various-sized mappings
can result in fragmentation of the address space, making it hard to
find large free contiguous regions. This problem, of course, is much
less apparent with a 64-bit address space.
There is overhead in creating and maintaining the memory mappings and associated data structures inside the kernel. This overhead is
generally obviated by the elimination of the double copy mentioned in
the previous section, particularly for larger and frequently accessed
files.
For these reasons, the benefits of mmap( ) are most greatly realized
when the mapped file is large (and thus any wasted space is a small
percentage of the total mapping), or when the total size of the mapped
file is evenly divisible by the page size (and thus there is no wasted
space).
Memory mapping has a potential for a huge speed advantage compared to traditional IO. It lets the operating system read the data from the source file as the pages in the memory mapped file are touched. This works by creating faulting pages, which the OS detects and then the OS loads the corresponding data from the file automatically.
This works the same way as the paging mechanism and is usually optimized for high speed I/O by reading data on system page boundaries and sizes (usually 4K) - a size for which most file system caches are optimized to.
An advantage that isn't listed yet is the ability of mmap() to keep a read-only mapping as clean pages. If one allocates a buffer in the process's address space, then uses read() to fill the buffer from a file, the memory pages corresponding to that buffer are now dirty since they have been written to.
Dirty pages can not be dropped from RAM by the kernel. If there is swap space, then they can be paged out to swap. But this is costly and on some systems, such as small embedded devices with only flash memory, there is no swap at all. In that case, the buffer will be stuck in RAM until the process exits, or perhaps gives it back withmadvise().
Non written to mmap() pages are clean. If the kernel needs RAM, it can simply drop them and use the RAM the pages were in. If the process that had the mapping accesses it again, it cause a page fault the kernel re-loads the pages from the file they came from originally. The same way they were populated in the first place.
This doesn't require more than one process using the mapped file to be an advantage.
Related
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
mmap() vs. reading blocks
I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If yes then why it is faster?
mmap() is not reading sequentially.
mmap() has to fetch from the disk itself same as read() does
The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?
I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If yes then why it is faster?
It can be - there are pros and cons, listed below. When you really have reason to care, always benchmark both.
Quite apart from the actual IO efficiency, there are implications for the way the application code tracks when it needs to do the I/O, and does data processing/generation, that can sometimes impact performance quite dramatically.
mmap() is not reading sequentially.
2) mmap() has to fetch from the disk itself same as read() does
3) The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?
is wrong... mmap() assigns a region of virtual address space corresponding to file content... whenever a page in that address space is accessed, physical RAM is found to back the virtual addresses and the corresponding disk content is faulted into that RAM. So, the order in which reads are done from the disk matches the order of access. It's a "lazy" I/O mechanism. If, for example, you needed to index into a huge hash table that was to be read from disk, then mmaping the file and starting to do access means the disk I/O is not done sequentially and may therefore result in longer elapsed time until the entire file is read into memory, but while that's happening lookups are succeeding and dependent work can be undertaken, and if parts of the file are never actually needed they're not read (allow for the granularity of disk and memory pages, and that even when using memory mapping many OSes allow you to specify some performance-enhancing / memory-efficiency tips about your planned access patterns so they can proactively read ahead or release memory more aggressively knowing you're unlikely to return to it).
absolutely true
"The mapped area is not sequential" is vague. Memory mapped regions are "contiguous" (sequential) in virtual address space. We've discussed disk I/O being sequential above. Or, are you thinking of something else? Anyway, while pages are being faulted in, they may indeed be transferred using DMA.
Further, there are other reasons why memory mapping may outperform usual I/O:
there's less copying:
often OS & library level routines pass data through one or more buffers before it reaches an application-specified buffer, the application then dynamically allocates storage, then copies from the I/O buffer to that storage so the data's usable after the file reading completes
memory mapping allows (but doesn't force) in-place usage (you can just record a pointer and possibly length)
continuing to access data in-place risks increased cache misses and/or swapping later: the file/memory-map could be more verbose than data structures into which it could be parsed, so access patterns on data therein could have more delays to fault in more memory pages
memory mapping can simplify the application's parsing job by letting the application treat the entire file content as accessible, rather than worrying about when to read another buffer full
the application defers more to the OS's wisdom re number of pages that are in physical RAM at any single point in time, effectively sharing a direct-access disk cache with the application
as well-wisher comments below, "using memory mapping you typically use less system calls"
if multiple processes are accessing the same file, they should be able to share the physical backing pages
The are also reasons why mmap may be slower - do read Linus Torvald's post here which says of mmap:
...page table games along with the fault (and even just TLB miss)
overhead is easily more than the cost of copying a page in a nice
streaming manner...
And from another of his posts:
quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's The TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Linux does have "hugepages" (so one TLB entry per 2MB, instead of per 4kb) and even Transparent Huge Pages, where the OS attempts to use them even if the application code wasn't written to explicitly utilise them.
FWIW, the last time this arose for me at work, memory mapped input was 80% faster than fread et al for reading binary database records into a proprietary database, on 64 bit Linux with ~170GB files.
mmap() can share between process.
DMA will be used whenever possible. DMA does not require contiguous memory -- many high end cards support scatter-gather DMA.
The memory area may be shared with kernel block cache if possible. So there is lessor copying.
Memory for mmap is allocated by kernel, it is always aligned.
"Faster" in absolute terms doesn't exist. You'd have to specify constraints and circumstances.
mmap() is not reading sequentially.
what makes you think that? If you really access the mapped memory sequentially, the system will usually fetch the pages in that order.
mmap() has to fetch from the disk itself same as read() does
sure, but the OS determines the time and buffer size
The mapped area is not sequential - so no DMA (?).
see above
What mmap helps with is that there is no extra user space buffer involved, the "read" takes place there where the OS kernel sees fit and in chunks that can be optimized. This may be an advantage in speed, but first of all this is just an interface that is easier to use.
If you want to know about speed for a particular setup (hardware, OS, use pattern) you'd have to measure.
I want to work on a file which is composed of 4Kb blocks.
As things happen, I will write more data and map new parts, unmap parts that I do not need anymore.
Is a map() of just 4Kb too small when the total amount of file data to map is around 4Gb total? (i.e. some 1,048,576 individually mapped blocks).
I'm worried that making so many small mmap() calls is not going to be efficient in the end, even if they are very well directed to the exact blocks I want to use. At the same time, it may still be better than reading and writing these blocks with read()/write() each time I change one byte.
As far as I understand it, even a single mmap() that covers several contiguous 4kb pages will require the kernel (and the TLB, MMU...) to deal with as many virtual/physical associations as the number of these pages (this is the purpose of memory pages; contiguous virtual pages can be mapped to non-contiguous physical pages).
So, considering the usage of these mapped pages, once set up by a unique or by many mmap() calls, there should not be any difference in performances.
But each single call to mmap() probably requires some overhead in order to choose the part of virtual address space to use; a single mmap() call will just have to choose once a big enough virtual location (it should not be too difficult on a 64-bit system, as stated in other answers) but repeated calls will imply this overhead many times.
So, if I had to deal with this situation on a 64-bit system, I would mmap() the entire file at once, using huge-pages in order to reduce the pressure on TLB.
Note that mapping the entire file at once does not imply using the same amount of physical memory right at this moment; virtual/physical memory association will only occur for each single page when it is accessed for the first time.
There is no shortage of address space on 64-bit architectures. Unless your code has to work in 32-bit architectures too (rare these days), map the whole file once and avoid the overhead of multiple mmap calls and thousands of extra kernel objects. With reading and writing changes, it depends on your desired semantics. See this answer.
On 64-bit systems you should pretty much map the entire file or at least the entire range in one go and let the operating system handle the paging in and out for you. The mmap calls do have some overhead themselves. In practice the user address space on x86-64 is something like 128 TiB so you should be able to map say 1 TiB files/ranges without any problems.
I am trying to understand how mmap works while looking at man mmap.
As I understand it, it adds a mapping to the page table that maps between the file and the virtual address (which is the address that is given void *addr)
So, what happens when 2 programs map the same file?
Are there 2 entries in the page table, one for each program?
So, what happens when 2 programs map the same file? Are there 2 entries in the page table, one for each program?
In modern operating systems, each process has its own page table for its memory, that may point to pages of physical memory shared with other user and kernel processes.
With MAP_SHARED, this mapping is shared: updates to the mapping are visible to other processes that map this file, and are carried through to the underlying file. The file may not actually be updated until msync(2) or munmap() is called.
This seems very interesting, but there are numerous caveats:
The actual pages mmapped by both processes for the same file may reside at the same address or at a different address in each process, storing pointers into this shared memory may not allow the other process to use them as they might point to inconsistent addresses.
The implementation may use the same physical memory pages for both mappings or not: for subtile reasons (cache strategies, out of sync reading...), even if it is the same physical memory, modifications done by one process to its memory may not be immediately reflected in the memory of the other process.
So the modification may or may not be visible to the other processes mmapping the file nor reading it via read or the FILE* stream API.
If one of the processes calls msync(), the modifications should be visible in all maps and for all yet unread portions of the file, bearing in mind that the FILE* streaming APIs may have buffered some data in internal unshared buffers: modifications in this area will not be reflected.
Conclusion: it is risky and unreliable to use these mechanisms to implement inter process communication. The behavior may depend on system specific characteristics such as the OS strategies, the CPU and cache architectures, the type of RAM in use, the clock speed, and who knows what else. It is safer to rely on proven APIs that may indeed be implemented using mmapped memory, but only if it is know to provide the correct semantics.
The actual system implementation is different. At the risk of over simplification (and omitting paging here):
A mmap will map physical page frames to a file.
So, what happens when 2 programs map the same file? Are there 2 entries in the page table, one for each program?
If two processes (P and Q) map to the same file, then P and Q will each have there own page table; each page table will have entry mapping to the same physical page frame (which could be mapped to different addresses within P and Q).
I realize that most CPUs are better at reading data at an aligned memory address, that is at memory address that is a multiple of CPU word. However, in many places I read about allocating a page-aligned memory. Why might someone want to get a page-aligned memory address? Is it only for even bigger performance?
The "traditional" way to allocate memory is to have it in a contiguous address space (the "heap", growing upwards by calls to sbrk()). Each time you hit a page boundary, there will be a page fault and you get mapped a new page. There are two consequences of this strategy:
pages can only be freed when all allocations inside that page are freed AND when all other allocations are mapped to lower addresses. (the typical effect of heap fragmentation).
larger allocations might occupy one page more than strictly needed (if they start somewhere in the middle of a page).
So this strategy is only suitable for smaller blocks of memory where you don't want to "waste" a whole page for each allocation.
For bigger chunks, it's better to use mmap() which maps you new pages somewhere directly, so you get "page aligned memory". Using this, your allocation doesn't share pages with other allocations. As soon as you don't need the memory any more, you can give it back to the OS. Note that many malloc()implementations choose automatically whether to allocate using sbrk() or mmap(), depending on the size of the desired allocation.
Alignment restrictions are usually associated with direct IO - which bypasses the page cache, copying data to/from disk directly into or from the address space of a process. This can provide significant performance improvements in cases where the page cache is not needed - such as streaming multiple gigabytes of data, especially when doing IO to/from extremely fast disk systems.
Note that only some file systems support direct IO.
On Linux, RedHat's documentation is, in part:
Direct I/O best practices
Users must always take care to use properly aligned and sized IO.
This is especially important for Direct I/O
access. Direct I/O should be aligned on a 'logical_block_size'
boundary and in multiples of the 'logical_block_size'. With native 4K
devices (logical_block_size is 4K) it is now critical that
applications perform Direct I/O that is a multiple of the device's
'logical_block_size'. This means that applications that do not
perform 4K aligned I/O, but 512-byte aligned I/O, will break with
native 4K devices. Applications may consult a device's "I/O Limits"
to ensure they are using properly aligned and sized I/O. The "I/O
Limits" are exposed through both sysfs and block device ioctl
interfaces (also see: libblkid).
sysfs interface
/sys/block//alignment_offset
/sys/block///alignment_offset
/sys/block//queue/physical_block_size
/sys/block//queue/logical_block_size
/sys/block//queue/minimum_io_size
/sys/block//queue/optimal_io_size
Note that the use of direct IO can be limited by actual hardware, as well as software. As noted in the RedHat documentation, physical device limitations matter.
To use direct IO, on Linux the file needs to be opened with the O_DIRECT flag:
int fd = open( filename, O_RDONLY | O_DIRECT );
In my experience, direct IO can result in 20-30% gains in IO performance under certain circumstances. Those circumstances usually involve streaming large amounts of data to/from a file on a very fast file system with the application performing no or very few seek() calls.
Alignment is something that always makes some performance issues. when you write(2) or read(2) a file, it's best if you can adjust the limits of your reading to block aligments, because you make kernel do two block reads instead of one. The worst case being just reading two bytes on a block boundary. Suppose you have a block size of 1024bytes, this code:
char var[2];
int fd;
fd = open("/etc/passwd", O_RDONLY);
lseek(fd, 1023UL, SEEK_SET);
read(fd, &var, sizeof var);
Will make the kernel to force two block reads (at most, as the blocks could be already cached before) for only a two bytes read(2) call.
In the case of memory, all this stuff is normally managed by malloc(3), and, as you don't fail on page faults, you don't get any performance penalties (that's the reason you don't have any standar library function to get aligned memory, even in demand paged virtual systems) as far as you consume memory, the kernel allocs it in pages for you. The processor virtual memory system makes page alignment almost transparent. Only in case you have an unaligned memory access (suppose you access a 32bit integer access misaligned ---unprobable--- to two pages, and those two pages have been swapped out by the kernel, you'll have to wait for the kernel to swap in two pages of memory instead of one ---but that's far improbable thing to occur, the compiler normally forces inner loops to not fail between page boundaries to minimize the probability of this to happen, and you have also the instruction cache to cope with these things)
Said this, there are some places where you do get performance improvements if you somewhat align memory. I'll try to show you a scenario of this:
Suppose you need to dynamically manage a lot of small structures (let's say 16bytes long) and you plan to manage them with malloc(). malloc(3) manages memory including a header in each chunk of memory allocated (let's say this header is 8 bytes long) making the overhead of memory 50% percent more than the ideal. If you arrange to get memory in chunks of (let's say) 64 structures you'll get just one of those headers (8bytes) for each 64*16 = 1024 bytes (amounting for just roughly an 8%)
To manage this, you have to consider knowing to which chunk all of this structures belong (so you can free(3) the chunk when not in use), and you can do this in two ways: 1.- Using a pointer (adding 4 bytes to each structure size --this is pointless as you'll add 4 bytes to each structure, lossing a 25% of memory again) to point to the chunck, or 2.- *forcing the chunck to be aligned, so the chunk address can be easily calculated from the struct address (you only need to substract the rest of the division mod chunksize to the struct address) to get the chunk address. This last method doesn't impose any overhead to locate the chunck, but imposes the practice of all chunks to be chunk aligned (not page aligned).
In this way, you improve performance too much, as you reduce considerably the amount of malloc(3) calls and the waste of memory imposed by allocating small amounts of memory.
By the way, malloc doesn't ask the operating system for the memory you ask it at each call. It allocates memory in chunks, in a manner similar as has been explained here, and normal implementations don't even manage to return the allocated memory to the system again (reusing the freed memory before allocating new one) It controls the calls to sbrk(2) system call, what means that you are going to interfere with malloc in case you use this system call.
Linux/unix will give you page aligned memory when you use shmat(2) system call. Try reading this and related documents.
I'm implementing persistent large constant arrays via mmap. Is there any tips and tricks or gotchas one should be aware when using mmap?
All pointers that are stored inside the mmap'd region should be done as offsets from the base of the mmap'd region, not as real pointers! You won't necessarily be getting the same base address when you mmap the region on the next run of the program. (I have had to clean up code that made incorrect assumptions about mmap region base address constancy).
This is the most straight forward use case for mmap() so there shouldn't be much to trip you up.
You are effectively just loading a large constant array. Being constants you shouldn't need to worry about synchronization. It would be advisable to make sure the prot parameter is set to PROT_READ only since you won't be writing.
If one or more programs using the constants are going to be continually run, it might be worthwhile to have a separate program that loads the data and keeps it resident. Runs of the other programs then essentially are just doing an shared memory attach rather than continually reading the file into memory.
Make sure you check for restrictions on open file size or memory usage. On Linux there is a built in shell command ulimit. Run as ulimit -a to see the current settings.
Flush writes to the in-memory array to the file with the msync(2) syscall or else they may stay in memory until munmap(2) and there may be a power outage or something before then!
If multiple processes are mmap'ing the same memory region shared with read and write privileges, make sure that only one is writing to it at a time to avoid corrupting your data. Or use file locking or some other means of synchronization.