How does mmap work when 2 programs map the same file - c

I am trying to understand how mmap works while looking at man mmap.
As I understand it, mmap adds a mapping to the page table that maps between the file and a virtual address (the address given as void *addr).
So, what happens when 2 programs map the same file?
Are there 2 entries in the page table, one for each program?

So, what happens when 2 programs map the same file? Are there 2 entries in the page table, one for each program?
In modern operating systems, each process has its own page table for its memory, which may point to pages of physical memory shared with other user and kernel processes.
With MAP_SHARED, this mapping is shared: updates to the mapping are visible to other processes that map this file, and are carried through to the underlying file. The file may not actually be updated until msync(2) or munmap() is called.
This seems very interesting, but there are numerous caveats:
The actual pages mmapped by both processes for the same file may reside at the same address or at different addresses in each process; storing pointers into this shared memory may not allow the other process to use them, as they might point to inconsistent addresses.
The implementation may or may not use the same physical memory pages for both mappings: for subtle reasons (cache strategies, out-of-sync reading...), even if it is the same physical memory, modifications made by one process to its memory may not be immediately reflected in the memory of the other process.
So a modification may or may not be visible to the other processes mmapping the file, or to processes reading it via read or the FILE* stream API.
If one of the processes calls msync(), the modifications should become visible in all maps and for all yet-unread portions of the file, bearing in mind that the FILE* stream APIs may have buffered some data in internal, unshared buffers: modifications in that area will not be reflected.
Conclusion: it is risky and unreliable to use these mechanisms to implement inter-process communication. The behavior may depend on system-specific characteristics such as the OS strategies, the CPU and cache architectures, the type of RAM in use, the clock speed, and who knows what else. It is safer to rely on proven APIs that may indeed be implemented using mmapped memory, but only where they are known to provide the correct semantics.

The actual system implementation is different. At the risk of oversimplification (and omitting paging here):
mmap maps physical page frames to a file.
So, what happens when 2 programs map the same file? Are there 2 entries in the page table, one for each program?
If two processes (P and Q) map the same file, then P and Q will each have their own page table; each page table will have an entry mapping to the same physical page frame (which could be mapped to different virtual addresses within P and Q).
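For illustration, here is a minimal sketch of that sharing, using fork() for brevity (two unrelated programs opening the same path behave the same way for a MAP_SHARED mapping); the file name shared.dat is made up:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = open("shared.dat", O_RDWR | O_CREAT, 0644);
    if (fd == -1) { perror("open"); return 1; }
    if (ftruncate(fd, 4096) == -1) { perror("ftruncate"); return 1; }

    /* MAP_SHARED: writes through this mapping are visible to every other
       process that maps the same file, and reach the file itself. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    if (fork() == 0) {               /* child: write through its own mapping */
        strcpy(p, "hello from the child");
        _exit(0);
    }
    wait(NULL);
    printf("parent sees: %s\n", p);  /* same frame, separate page tables */
    /* msync(p, 4096, MS_SYNC);         force write-back to the file now */
    munmap(p, 4096);
    close(fd);
    return 0;
}
```

Each process ends up with its own page-table entry for the page, but both entries refer to the same physical frame, which is why the parent sees the child's write.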


What type of memory objects does `shm_open` use?

Usually, shared memory is implemented using portions of on-disk files mapped into processes' address spaces. Whenever a memory access occurs on the shared region, the filesystem is involved to write changes to the disk, which is a great overhead. Typically, a call to open() returns a file descriptor which is passed to mmap() to create the file's memory map. shm_open, apparently, works in the same way. It returns a file descriptor which can even be used with regular file operations (e.g. ftruncate, ftell, fseek, etc.). We do specify a string as a parameter to shm_open but, unlike open(), it is not the name of a real file on the visible filesystem (mounted HDD, flash drive, SSD, etc.). The same string name can be used by totally unrelated processes to map the same region into their address spaces.
So, what is the string parameter passed to shm_open, and what does shm_open create/open? Is it a file on some temporary filesystem (/tmp) which is eventually used by many processes to create the shared region? (Well, I think it has to be some kind of file, since it returns a file descriptor.) Or is it some kind of mysterious and hidden filesystem backed by the kernel?
People say shm_open is faster than opening a regular file because no disk operations are involved, so the theory I suggest is that the kernel uses an invisible RAM-based filesystem to implement shared memory with shm_open!
Usually, shared memory is implemented using portions of On-Disk files mapped to processes address spaces.
This is generally false, at least on a desktop or laptop running a recent Linux distribution, with some reasonable amount of RAM (e.g. 8Gbytes at least).
So, the disk is not relevant. You could use shm_open without any swap. See shm_overview(7), and notice that /dev/shm/ is generally a mounted tmpfs filesystem, so it doesn't use any disk. See tmpfs(5). And tmpfs doesn't use the disk (unless you reach thrashing conditions, which is unlikely) since it works in virtual memory.
the filesystem is involved to write changes on the disk which is a great overhead.
This is usually wrong. On most systems, recently written files are in the page cache, which does not reach the disk quickly (BTW, that is why the shutdown procedure needs to call sync(2) which is rarely used otherwise...).
BTW, on most desktops and laptops, it is easy to observe. The hard disk has some LED, and you won't see it blinking when using shm_open and related calls. You could also use proc(5) (notably /proc/diskstats, etc.) to query the kernel about its disk activity.
Usually, shared memory is implemented using portions of On-Disk files mapped to processes address spaces. Whenever a memory access occurs on the shared region, the filesystem is involved to write changes on the disk which is a great overhead.
That seems rather presumptuous, and not entirely correct. Substantially all machines that implement shared memory regions (in the IPC sense) have virtual memory units by which they support the feature. There may or may not be any persistent storage backing any particular shared memory segment, or any part of it. Only the part, if any, that is paged out needs to be backed by such storage.
shm_open, apparently, works in the same way. It returns a file descriptor which can even be used with regular file operations (e.g ftruncate, ftell, fseek ...etc).
That shm_open() has an interface modeled on that of open(), and that it returns a file descriptor that can meaningfully be used with certain general-purpose I/O functions, do not imply that shm_open() "works in the same way" in any broader sense. Pretty much all system resources are represented to processes as files. This affords a simpler overall system interface, but it does not imply any commonality of the underlying resources beyond the fact that they can be manipulated via the same functions -- to the extent that indeed they can be.
So, what is the string parameter passed to shm_open & what does shm_open creates/opens ?
The parameter is a string identifying the shared memory segment. You already knew that, but you seem to think there's more to it than that. There isn't, at least not at the level (POSIX) at which the shm_open interface is specified. The identifier is meaningful primarily to the kernel. Different implementations handle the details differently.
Is it a file on some temporary filesystem (/tmp) which is eventually used by many processes to create the shared region
Could be, but probably isn't. Any filesystem interface provided for it is likely (but not certain) to be a virtual filesystem, not actual, accessible files on disk. Persistent storage, if used, is likely to be provided out of the system's swap space.
(Well, I think it has to be some kind of file since it returns a file descriptor)?
Such a conclusion is unwarranted. Sockets and pipes are represented via file descriptors, too, but they don't have corresponding accessible files.
Or is it some kind of a mysterious and hidden filesystem backed by the kernel ?
That's probably a better conception, though again, there might not be any persistent storage at all. To the extent that there is any, however, it is likely to be part of the system's swap space, which is not all that mysterious.
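To make the discussion concrete, here is a hedged sketch of the creator side; the segment name "/demo_shm" is illustrative, and on Linux it would typically show up under /dev/shm (a tmpfs), not on disk. Older glibc may require linking with -lrt:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Creates (or opens) a named segment; no regular on-disk file appears. */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    if (fd == -1) { perror("shm_open"); return 1; }
    if (ftruncate(fd, 4096) == -1) { perror("ftruncate"); return 1; }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "visible to any process that opens the same name");

    munmap(p, 4096);
    close(fd);
    /* shm_unlink("/demo_shm") would remove the name when done. */
    return 0;
}
```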

Is it possible to control page-out and page-in by user programming? If yes then how?

My questions are as follows:
I mmap (memory map) a file into the virtual memory space.
When I access the first byte of the file through a pointer for the first time, the OS will try to access the data in memory, but it fails and raises a page fault, because the data isn't present in memory yet. So the OS will swap the data from disk into memory. Finally my access will succeed.
(question is coming)
When I modify the data (in memory) and write it back to the disk file, how can I free the physical memory for other uses, while keeping the virtual mapping so the data can be fetched back into memory as needed?
This sounds like the page-out and page-in behavior where the OS, when memory is exhausted, swaps the LRU (or similar) memory pages to disk (swap files), frees the physical memory for other processes, and fetches the evicted data back into memory as needed. But this mechanism is controlled by the OS.
For certain reasons, I need to control the page-out and page-in behavior myself. So what should I do? Hack the kernel?
You can use the madvise system call. Its behaviour is affected by the advice argument; there are many choices for advice and the optimal one should be picked based on the specifics of your application.
The flag MADV_DONTNEED means that the given range of physical backing frames should be unconditionally freed (i.e. paged out). Also:
After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are changed: subsequent accesses of pages in the range will succeed, but will result in either repopulating the memory contents from the up-to-date contents of the underlying mapped file (for shared file mappings, shared anonymous mappings, and shmem-based techniques such as System V shared memory segments) or zero-fill-on-demand pages for anonymous private mappings.
This could be useful if you're absolutely certain that it will be very long until you access the same position again.
However, it might not be necessary to force the kernel to actually page out; another possibility, if you're accessing the mapping sequentially, is to use madvise with MADV_SEQUENTIAL to tell the kernel that you'll access your memory mapping mostly sequentially:
Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)
or MADV_RANDOM
Expect page references in random order. (Hence, read ahead may be less useful than normally.)
These are not as aggressive as explicitly calling MADV_DONTNEED to page out. (Of course you can combine these with MADV_DONTNEED as well)
In recent kernel versions there is also the MADV_FREE flag which will lazily free the page frames; they will stay mapped in if enough memory is available, but are reclaimed by the kernel if the memory pressure grows.
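Putting those pieces together, here is a sketch of how these calls might be combined when streaming through a large mapping; the file name big.dat and the 1 MiB chunk size are illustrative:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big.dat", O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    madvise(p, st.st_size, MADV_SEQUENTIAL);   /* hint: aggressive read-ahead */

    const off_t chunk = 1 << 20;               /* 1 MiB; stays page-aligned */
    for (off_t off = 0; off < st.st_size; off += chunk) {
        size_t len = (size_t)(st.st_size - off < chunk ? st.st_size - off
                                                       : chunk);
        /* ... process p[off] .. p[off + len - 1] here ... */
        madvise(p + off, len, MADV_DONTNEED);  /* frames may be freed now */
    }
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```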
You can check out mlock + munlock to lock/unlock pages. This will give you control over pages being swapped out.
You may need the CAP_IPC_LOCK capability (or a sufficient RLIMIT_MEMLOCK limit) to perform this operation, though.
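A minimal sketch of that, assuming the process is allowed to lock this much memory:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1 << 20;                 /* 1 MiB, illustrative */
    void *buf = malloc(len);
    if (buf == NULL) return 1;

    if (mlock(buf, len) == -1) {          /* pin: these pages cannot be
                                             paged out while locked */
        perror("mlock");
        return 1;
    }
    /* ... use buf, guaranteed resident ... */
    munlock(buf, len);                    /* allow paging again */
    free(buf);
    return 0;
}
```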

what is a named memory block

I know that, in general, a named memory block is shared memory which you can assign and access by name.
What I want to know is: what are the advantages of using a named block of memory, and when should it be used in terms of memory management?
What you are describing has different names depending upon the operating system.
It is a range of pages that can be mapped to the address space of multiple processes. It really has two components:
1) The storage in the page file
2) The physical memory--with paging, there might not be physical memory associated with it all the time.
The name serves as the way of identifying the shared memory so that it can be mapped to the process address space.
It is used for sharing data between processes. Shared memory blocks were very commonly used with database systems. They are the fastest method of interprocess communication, but they require some kind of locking mechanism that the application has to implement. Often they are used with one writer and multiple readers.
If processes A and B map the same shared memory block, and process A writes to the block, B immediately sees the change.
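As a sketch of the reader side, using POSIX naming via shm_open (the name "/demo_shm" is illustrative and must match whatever the writer created):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Attach by name; this process need not be related to the writer. */
    int fd = shm_open("/demo_shm", O_RDONLY, 0);
    if (fd == -1) { perror("shm_open"); return 1; }

    const char *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("reader sees: %s\n", p);   /* whatever the writer stored */
    munmap((void *)p, 4096);
    close(fd);
    return 0;
}
```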

Sharing memory across multiple computers?

I'd like to share certain memory areas across multiple computers, that is, for a C/C++ project. When something on computer B accesses a memory area which currently lives on computer A, that area has to be locked on A and sent to B. I'm fine with it being Linux-only.
Thanks in advance :D
You cannot do this for a simple C/C++ project.
Common computer hardware does not have the physical properties that support this directly: Memory on one system cannot be read by another system.
In order to make it appear to C/C++ programs on different machines that they are sharing memory, you have to write software that provides this function. Typically, you would need to do something like this:
Allocate some pages in the virtual memory address space (of each process).
Mark those pages read-only.
Set a handler to receive the exception that occurs when the process attempts to write to the read-only memory. (This handler might be in the operating system, as some sort of kernel extension, or it might be a signal handler in your process; see the sketch after this list.)
When the exception is received, determine what the process was attempting to write to memory. Write that to the page (perhaps by writing it through a separate mapping in virtual memory to the same physical memory, with this extra mapping marked writeable).
Send a message by network communications to the other machine telling it that memory has changed.
Resume execution in the process after the instruction that wrote to memory.
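Here is a minimal single-process sketch of the fault-handler step above, assuming Linux and the signal-handler approach rather than a kernel extension; a real distributed-shared-memory layer would also record the page contents and ship them over the network. Note that calling mprotect() from a signal handler is not formally async-signal-safe, though the pattern is widely used:

```c
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static char *page;
static long  pagesize;

/* Fires on the write to read-only memory. A real DSM layer would record
   si->si_addr as dirty and queue a network update; here we just re-enable
   writes so the faulting store can retry. */
static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)si; (void)ctx;
    mprotect(page, (size_t)pagesize, PROT_READ | PROT_WRITE);
}

int main(void)
{
    pagesize = sysconf(_SC_PAGESIZE);

    /* One page, mapped read-only so any store faults. */
    page = mmap(NULL, (size_t)pagesize, PROT_READ,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) { perror("mmap"); return 1; }

    struct sigaction sa;
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    page[0] = 'x';     /* faults once; handler unprotects; store retries */
    printf("wrote: %c\n", page[0]);
    return 0;
}
```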
Additionally, you need to determine what to do about memory coherence: If two processes write to the same address in memory at nearly the same time, what happens? If process A writes to location X and then reads location Y while, at nearly the same time, process B writes to location Y and reads X, what do they see? Is it okay if the two processes see data that cannot possibly be the result of a single time sequence of writes to memory?
On top of all that, this is hugely expensive in time: Stores to memory that require exception handling and network operations take many thousands, likely hundreds of thousands, times as long as normal stores to memory. Your processes will execute excruciatingly slowly whenever they write to this shared memory.
There are software solutions, as noted in the comments. These use the paging hardware in the processors on a node to detect access, and use your local network fabric to disseminate the changes to the memory. One hardware alternative is reflective memory - you can read more about it here:
https://en.wikipedia.org/wiki/Reflective_memory
http://www.ecrin.com/embedded/downloads/reflectiveMemory.pdf
An older page on the topic is now broken: http://www.dolphinics.com/solutions/embedded-system-reflective-memory.html
Reflective memory provides low latency (about one microsecond per hop) in either a ring or tree configuration.

When should I use mmap for file access?

POSIX environments provide at least two ways of accessing files. There's the standard system calls open(), read(), write(), and friends, but there's also the option of using mmap() to map the file into virtual memory.
When is it preferable to use one over the other? What are their individual advantages that merit including two interfaces?
mmap is great if you have multiple processes accessing data in a read only fashion from the same file, which is common in the kind of server systems I write. mmap allows all those processes to share the same physical memory pages, saving a lot of memory.
mmap also allows the operating system to optimize paging operations. For example, consider two programs: program A, which reads a 1MB file into a buffer created with malloc, and program B, which mmaps the 1MB file into memory. If the operating system has to swap part of A's memory out, it must write the contents of the buffer to swap before it can reuse the memory. In B's case, any unmodified mmap'd pages can be reused immediately because the OS knows how to restore them from the existing file they were mmap'd from. (The OS can detect which pages are unmodified by initially marking writable mmap'd pages as read-only and catching seg faults, similar to a copy-on-write strategy.)
mmap is also useful for inter process communication. You can mmap a file as read / write in the processes that need to communicate and then use synchronization primitives in the mmap'd region (this is what the MAP_HASSEMAPHORE flag is for).
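As a sketch of what that can look like with POSIX threads primitives (a process-shared mutex placed inside the shared mapping; the struct layout is illustrative, and the program must be linked with -pthread):

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;   /* must be initialized as PTHREAD_PROCESS_SHARED */
    int counter;            /* illustrative shared payload */
} shared_block;

/* Run once, by whichever process creates the MAP_SHARED mapping. */
static void init_shared(shared_block *b)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&b->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    b->counter = 0;
}

/* Any process that mapped the same region can then do: */
static void bump(shared_block *b)
{
    pthread_mutex_lock(&b->lock);
    b->counter++;
    pthread_mutex_unlock(&b->lock);
}
```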
One place mmap can be awkward is if you need to work with very large files on a 32 bit machine. This is because mmap has to find a contiguous block of addresses in your process's address space that is large enough to fit the entire range of the file being mapped. This can become a problem if your address space becomes fragmented, where you might have 2 GB of address space free, but no individual range of it can fit a 1 GB file mapping. In this case you may have to map the file in smaller chunks than you would like to make it fit.
Another potential awkwardness with mmap as a replacement for read/write is that you have to start your mapping on offsets that are multiples of the page size. If you just want some data at offset X, you need to fix up that offset so it's compatible with mmap; a sketch of this fixup follows.
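A sketch of that fixup, with a hypothetical helper map_at_offset() that rounds the offset down to a page boundary and compensates in the returned pointer:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map len bytes starting at arbitrary file offset x. Returns a pointer to
   the data at x, and reports the real mapping via *base and *maplen so the
   caller can later do munmap(*base, *maplen). */
void *map_at_offset(int fd, off_t x, size_t len, void **base, size_t *maplen)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    off_t aligned = x & ~((off_t)pagesize - 1);   /* round down to a page */
    size_t delta = (size_t)(x - aligned);

    *maplen = len + delta;
    *base = mmap(NULL, *maplen, PROT_READ, MAP_SHARED, fd, aligned);
    if (*base == MAP_FAILED)
        return NULL;
    return (char *)*base + delta;                 /* points at offset x */
}
```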
And finally, read / write are the only way you can work with some types of files. mmap can't be used on things like pipes and ttys.
One area where I found mmap() to not be an advantage was when reading small files (under 16K). The overhead of page faulting to read the whole file was very high compared with just doing a single read() system call. This is because the kernel can sometimes satisfy a read entirely within your time slice, meaning your code doesn't switch away. With a page fault, it seemed more likely that another program would be scheduled, making the file operation have a higher latency.
mmap has the advantage when you have random access on big files. Another advantage is that you access it with memory operations (memcpy, pointer arithmetic), without bothering with the buffering. Normal I/O can sometimes be quite difficult when using buffers when you have structures bigger than your buffer. The code to handle that is often difficult to get right, mmap is generally easier. This said, there are certain traps when working with mmap.
As people have already mentioned, mmap is quite costly to set up, so it is worth using only for a given size (varying from machine to machine).
For pure sequential accesses to the file, it is also not always the better solution, though an appropriate call to madvise can mitigate the problem.
You have to be careful with the alignment restrictions of your architecture (SPARC, Itanium); with read/write I/O the buffers are often properly aligned and do not trap when dereferencing a cast pointer.
You also have to be careful not to access outside of the map. This can easily happen if you use string functions on your map and your file does not contain a \0 at the end. It will work most of the time, when your file size is not a multiple of the page size, as the last page is filled with zeros (the mapped area is always a multiple of your page size in size).
In addition to the other nice answers, a quote from Linux System Programming, written by Google's Robert Love:
Advantages of mmap()
Manipulating files via mmap() has a handful of advantages over the standard read() and write() system calls. Among them are:
Reading from and writing to a memory-mapped file avoids the extraneous copy that occurs when using the read() or write() system calls, where the data must be copied to and from a user-space buffer.
Aside from any potential page faults, reading from and writing to a memory-mapped file does not incur any system call or context switch overhead. It is as simple as accessing memory.
When multiple processes map the same object into memory, the data is shared among all the processes. Read-only and shared writable mappings are shared in their entirety; private writable mappings have their not-yet-COW (copy-on-write) pages shared.
Seeking around the mapping involves trivial pointer manipulations. There is no need for the lseek() system call.
For these reasons, mmap() is a smart choice for many applications.
Disadvantages of mmap()
There are a few points to keep in mind when using mmap():
Memory mappings are always an integer number of pages in size. Thus, the difference between the size of the backing file and an integer number of pages is "wasted" as slack space. For small files, a significant percentage of the mapping may be wasted. For example, with 4 KB pages, a 7 byte mapping wastes 4,089 bytes.
The memory mappings must fit into the process's address space. With a 32-bit address space, a very large number of various-sized mappings can result in fragmentation of the address space, making it hard to find large free contiguous regions. This problem, of course, is much less apparent with a 64-bit address space.
There is overhead in creating and maintaining the memory mappings and associated data structures inside the kernel. This overhead is generally obviated by the elimination of the double copy mentioned in the previous section, particularly for larger and frequently accessed files.
For these reasons, the benefits of mmap() are most greatly realized when the mapped file is large (and thus any wasted space is a small percentage of the total mapping), or when the total size of the mapped file is evenly divisible by the page size (and thus there is no wasted space).
Memory mapping has a potential for a huge speed advantage compared to traditional IO. It lets the operating system read the data from the source file as the pages in the memory mapped file are touched. This works by creating faulting pages, which the OS detects and then the OS loads the corresponding data from the file automatically.
This works the same way as the paging mechanism and is usually optimized for high-speed I/O by reading data on system page boundaries and sizes (usually 4K), a size for which most filesystem caches are optimized.
An advantage that isn't listed yet is the ability of mmap() to keep a read-only mapping as clean pages. If one allocates a buffer in the process's address space, then uses read() to fill the buffer from a file, the memory pages corresponding to that buffer are now dirty since they have been written to.
Dirty pages cannot be dropped from RAM by the kernel. If there is swap space, they can be paged out to swap. But this is costly, and on some systems, such as small embedded devices with only flash memory, there is no swap at all. In that case, the buffer will be stuck in RAM until the process exits, or perhaps gives it back with madvise().
mmap() pages that have not been written to are clean. If the kernel needs RAM, it can simply drop them and reuse the RAM the pages were in. If the process that had the mapping accesses it again, it causes a page fault and the kernel reloads the pages from the file they came from originally, the same way they were populated in the first place.
This doesn't require more than one process using the mapped file to be an advantage.
