simulating memfd_create on Linux 2.6 - c

I'm backporting a piece of code that uses a virtual memory trick involving a file descriptor that gets passed to mmap but isn't backed by a file on any mounted filesystem. Having a physical file would be unnecessary overhead in this application. The original code uses memfd_create, which is great.
Since Linux 2.6 doesn't have memfd_create or the O_TMPFILE flag for open, I'm currently creating a file with mkstemp and then unlinking it without closing it first. This works, but it doesn't please me at all.
Is there a better way to get a file descriptor for mmap purposes without ever touching the file system in 2.6?
Before somebody says "XY problem," what I really need is two different virtual memory addresses to the same data in memory. This is implemented by mmap'ing the same anonymous file to two different addresses. Any other "Y" to my "X" also welcome.
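A simplified sketch of what I'm doing now (illustrative, not my exact code; error handling omitted):

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    char path[] = "/tmp/anonXXXXXX";   /* template for mkstemp */
    size_t len = 1 << 20;

    int fd = mkstemp(path);            /* creates and opens the temporary file */
    unlink(path);                      /* drop the name; fd stays valid */
    ftruncate(fd, len);

    /* two virtual addresses for the same data */
    void *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* ... use a and b ... */
    munmap(a, len);
    munmap(b, len);
    close(fd);
    return 0;
}
```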
Thanks

I considered two approaches:
Creating my temporary file under /dev/shm/ rather than /tmp/
Using shm_open to get a file descriptor.
Although it doesn't matter for the specific problem at hand, /dev/shm/ is not guaranteed to exist on all distributions, so #2 felt more correct to me.
To avoid having to worry about unique names for the shared memory objects, I just generate UUIDs.
I think I'm happy with this.
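A minimal sketch of approach #2 (the name and size are illustrative; in practice I use a generated UUID, error handling is trimmed, and you may need -lrt on older glibc):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *name = "/replace-with-a-uuid";   /* placeholder for a generated UUID */
    size_t len = 1 << 20;

    int fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    shm_unlink(name);      /* the object persists until the descriptor is closed */
    ftruncate(fd, len);

    /* two virtual addresses backed by the same memory */
    void *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    ((char *)a)[0] = 'x';  /* visible through b as well */

    munmap(a, len);
    munmap(b, len);
    close(fd);
    return 0;
}
```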
Shout out to #NominalAnimal.

Related

windows - plain shared memory between 2 processes (no file mapping, no pipe, no other extra)

How can I have an isolated part of memory that is NOT backed by any file or extra management layer such as piping, and that can be shared between two dedicated processes on the same Windows machine?
The majority of articles point me in the direction of CreateFileMapping. Let's start from there:
How does CreateFileMapping with hFile=INVALID_HANDLE_VALUE actually work?
According to https://msdn.microsoft.com/en-us/library/windows/desktop/aa366537(v=vs.85).aspx it
"...creates a file mapping object of a specified size that is backed by the system paging file instead of by a file in the file system..."
Assume I write something into the memory which is mapped by CreateFileMapping with hFile=INVALID_HANDLE_VALUE. Under which conditions will this content be written to the page file on disk?
Also, my understanding is that the motivation for using shared memory is to keep performance up and optimized. Why is the article "Creating Named Shared Memory" (https://msdn.microsoft.com/de-de/library/windows/desktop/aa366551(v=vs.85).aspx) referring to CreateFileMapping if there is not a single attribute combination that would prevent writing to files, e.g. the page file?
Going back to the original question: I am afraid that CreateFileMapping is not good enough... So what would work?
You misunderstand what it means for memory to be "backed" by the system paging file. (Don't feel bad; Raymond Chen has described the text you quoted from MSDN as "one of the most misunderstood sentences in the Win32 documentation.") Almost all of the computer's memory is "backed" by something on disk; only the "non-paged pool", used exclusively by the kernel and as little as possible, isn't. If a page isn't backed by an ordinary named file, then it's backed by the system paging file. The operating system won't write pages out to the system paging file unless it needs to, but it can if it does need to.
This architecture is intended to ensure that processes can be completely "paged out" of RAM when they have nothing to do. This used to be much more important than it is nowadays, but it's still valuable; a typical Windows desktop will have dozens of processes "idle" waiting for events (e.g. needing to spool a print job) that may never happen. Those processes can get paged out and the memory can be put to more constructive use.
CreateFileMapping with hFile=INVALID_HANDLE_VALUE is, in fact, what you want. As long as the processes sharing the memory are actively doing stuff with it, it will remain resident in RAM and there will be no performance problem. If they go idle, yeah, it may get paged out, but that's fine because they're not doing anything with it.
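A minimal sketch of pagefile-backed named shared memory along these lines (the object name and size are illustrative; error handling is reduced to early returns):

```c
#include <windows.h>

int main(void)
{
    const DWORD size = 1 << 20;

    HANDLE hMap = CreateFileMappingA(
        INVALID_HANDLE_VALUE,       /* backed by the paging file, not a named file */
        NULL, PAGE_READWRITE,
        0, size,                    /* high and low parts of the maximum size */
        "Local\\MySharedRegion");   /* illustrative name, shared with the peer process */
    if (hMap == NULL) return 1;

    void *view = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, size);
    if (view == NULL) { CloseHandle(hMap); return 1; }

    /* ... read/write the region; the second process opens it with
       OpenFileMappingA(FILE_MAP_ALL_ACCESS, FALSE, "Local\\MySharedRegion") ... */

    UnmapViewOfFile(view);
    CloseHandle(hMap);
    return 0;
}
```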
You can direct the system not to page out a chunk of memory; that's what VirtualLock is for. But it's meant to be used for small chunks of memory containing secret information, where writing it to the page file could conceivably leak the secret. The MSDN page warns you that "Each version of Windows has a limit on the maximum number of pages a process can lock. This limit is intentionally small to avoid severe performance degradation."

What type of memory objects does `shm_open` use?

Usually, shared memory is implemented using portions of on-disk files mapped into processes' address spaces. Whenever a memory access occurs on the shared region, the filesystem is involved to write changes to the disk, which is a great overhead. Typically, a call to open() returns a file descriptor which is passed to mmap() to create the file's memory map. shm_open, apparently, works in the same way. It returns a file descriptor which can even be used with regular file operations (e.g. ftruncate, lseek, etc.). We do specify a string as a parameter to shm_open but, unlike open(), it is not the name of a real file on the visible filesystem (mounted HDD, flash drive, SSD, etc.). The same string name can be used by totally unrelated processes to map the same region into their address spaces.
So, what is the string parameter passed to shm_open, and what does shm_open create/open? Is it a file on some temporary filesystem (/tmp) which is eventually used by many processes to create the shared region (well, I think it has to be some kind of file since it returns a file descriptor)? Or is it some kind of mysterious, hidden filesystem backed by the kernel?
People say shm_open is faster than open() because no disk operations are involved, so the theory I suggest is that the kernel uses an invisible RAM-based filesystem to implement shared memory with shm_open!
Usually, shared memory is implemented using portions of on-disk files mapped into processes' address spaces.
This is generally false, at least on a desktop or laptop running a recent Linux distribution with a reasonable amount of RAM (e.g. at least 8 GB).
So the disk is not relevant. You could use shm_open without any swap. See shm_overview(7), and notice that /dev/shm/ is generally a tmpfs-mounted filesystem, so it doesn't use any disk. See tmpfs(5). tmpfs doesn't use the disk (unless you reach thrashing conditions, which is unlikely) since it works in virtual memory.
the filesystem is involved to write changes to the disk, which is a great overhead.
This is usually wrong. On most systems, recently written files are in the page cache, which does not reach the disk quickly (BTW, that is why the shutdown procedure needs to call sync(2), which is rarely used otherwise...).
On most desktops and laptops this is easy to observe: the hard disk has an LED, and you won't see it blinking when using shm_open and related calls. You could also use proc(5) (notably /proc/diskstats) to query the kernel about its disk activity.
Usually, shared memory is implemented using portions of on-disk files mapped into processes' address spaces. Whenever a memory access occurs on the shared region, the filesystem is involved to write changes to the disk, which is a great overhead.
That seems rather presumptuous, and not entirely correct. Substantially all machines that implement shared memory regions (in the IPC sense) have virtual memory units by which they support the feature. There may or may not be any persistent storage backing any particular shared memory segment, or any part of it. Only the part, if any, that is paged out needs to be backed by such storage.
shm_open, apparently, works in the same way. It returns a file descriptor which can even be used with regular file operations (e.g. ftruncate, lseek, etc.).
That shm_open() has an interface modeled on that of open(), and that it returns a file descriptor that can meaningfully be used with certain general-purpose I/O functions, do not imply that shm_open() "works in the same way" in any broader sense. Pretty much all system resources are represented to processes as files. This affords a simpler overall system interface, but it does not imply any commonality of the underlying resources beyond the fact that they can be manipulated via the same functions, to the extent that indeed they can.
So, what is the string parameter passed to shm_open, and what does shm_open create/open?
The parameter is a string identifying the shared memory segment. You already knew that, but you seem to think there's more to it than that. There isn't, at least not at the level (POSIX) at which the shm_open interface is specified. The identifier is meaningful primarily to the kernel. Different implementations handle the details differently.
Is it a file on some temporary filesystem (/tmp) which is eventually used by many processes to create the shared region
Could be, but probably isn't. Any filesystem interface provided for it is likely (but not certain) to be a virtual filesystem, not actual, accessible files on disk. Persistent storage, if used, is likely to be provided out of the system's swap space.
(well, I think it has to be some kind of file since it returns a file descriptor)?
Such a conclusion is unwarranted. Sockets and pipes are represented via file descriptors, too, but they don't have corresponding accessible files.
Or is it some kind of mysterious, hidden filesystem backed by the kernel?
That's probably a better conception, though again, there might not be any persistent storage at all. To the extent that there is any, however, it is likely to be part of the system's swap space, which is not all that mysterious.

What resources does the operating system associate with a file descriptor?

I know that I should close opened files. I know that if I don't, the file descriptors will leak. I also know that a file descriptor is just an integer, and that the OS associates some resources with that integer. And here is the question: what are those resources? What makes it difficult to create infinitely many (a lot of) file descriptors? Why can't the OS detect those leaks? And why doesn't the OS give the same file descriptor for the same file opening?
What are those resources?
The link posted by codeforester contains some material about this.
Anyway, those file descriptors are simply handles to complex data the kernel holds for a program. They could have been opaque pointers, but using simple numbers has its advantages (stdin, stdout, and stderr have well-known numbers, for example). What kind and amount of data is a kernel matter, which a program should not, and does not need to, know; so neither do you and I. But, just as an example, some buffer is needed. And the kernel must know at any moment which files are open, otherwise you could, for example, unmount a filesystem with open files and leave programs dangling.
What makes it difficult to create infinite (a lot) file descriptors?
Because file descriptors cost RAM (and CPU too), which are finite resources, and nobody wants a kernel crash because some (careless) programmer wastes file descriptors... :-). So the kernel reserves a finite amount of resources for file descriptors (which do not always refer to simple files). Kernels are not all equal; each can have its own policy, and there is often some way for users to manage the relevant settings.
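For example, the per-process cap on open descriptors is visible (and adjustable within the hard limit) through getrlimit/setrlimit; the exact values are system-dependent:

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* RLIMIT_NOFILE: the largest number of file descriptors this process may hold open */
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        printf("fd soft limit: %llu, hard limit: %llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
    return 0;
}
```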
Why can't os detect those leakages?
Because it cannot. The kernel cannot tell the difference between a poorly written program, which leaks resources, and a program which legitimately allocates many resources. Moreover, it is not the duty of a kernel to try to distinguish good programs from bad ones. A kernel must supply services, fast and efficiently; all the rest is the responsibility of programmers.
And why os doesn't give the same file descriptors for the same file opening?
Because it is legitimate to open the same file two or more times. Two programs can open the same file, two threads can, and even a single thread can. And the kernel must always honor the "contract" its API promises, always in the same manner: again, it is the programmer who must know what they are doing.
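As a small illustration (the path is arbitrary), two opens of the same file yield distinct descriptors with independent state such as the file offset, which is exactly why handing back the same descriptor would break that contract:

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd1 = open("/etc/hostname", O_RDONLY);
    int fd2 = open("/etc/hostname", O_RDONLY);

    char c;
    read(fd1, &c, 1);   /* advances fd1's offset only */

    printf("fd1=%d fd2=%d, offsets: %ld vs %ld\n",
           fd1, fd2,
           (long)lseek(fd1, 0, SEEK_CUR),
           (long)lseek(fd2, 0, SEEK_CUR));

    close(fd1);
    close(fd2);
    return 0;
}
```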

How to close a memory mapped file in C that I did not explicitly open?

How can I close a memory-mapped file that I did not explicitly open using mmap in C? I'm able to see the name of the memory-mapped file using lsof, and I'd like to somehow get the address and size of the mapping so that I may call munmap. I know that I should be able to access some memory info via /proc/self/maps, but I'm not entirely sure how to proceed from there and whether this would be the safe/correct way of doing so. Thanks.
EDIT
I'm using a library that writes to a device file. What I'm trying to do is simulate this device becoming temporarily physically unavailable and having the program recover from this. Part of the device becoming available again is having its driver reinitialized; this cannot happen while my process still maintains a reference, since the module count won't be zero and the kernel will therefore not allow unloading. I was able to identify all the file descriptors and close them, but lsof points to a shared memory location that still references the device. Since I did not explicitly open this, and the code that did so is not accessible to me, I was hoping for a way to still be able to close it.
I would suggest the most likely safe solution is to use the exec() system call to return to a known state.
What you are asking for is how to yank the memory mapping out from under some library function, which sets you up for undefined behavior in the not-too-distant future, probably involving the heap manager. You don't want to debug that.
OP = Heisenbug. Well that's singularly fitting.
As for the question that now appears after the edit: we have here an XY problem. What would happen on device failure is not freeing the memory mapping, but most likely changing it so that all access to the region yields SIGBUS, which is a lot harder to simulate.
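A minimal sketch of the re-exec idea suggested above (assuming the program can restart from scratch; /proc/self/exe is Linux-specific, and saved_argv is a hypothetical copy of the original argv):

```c
#include <unistd.h>

extern char **saved_argv;   /* assumed to have been stashed in main() */

void recover_by_reexec(void)
{
    /* Replace the process image: the kernel tears down every mapping
       and descriptor, including the one the library created. */
    execv("/proc/self/exe", saved_argv);
    _exit(127);             /* only reached if execv failed */
}
```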

How to use Readlink

How do I use readlink for fetching the values?
The answer is:
Don't do it
At least not in the way you're proposing.
You specified a solution here without specifying what you really want to do [and why?]. That is, what are your needs/requirements? Assuming you get it, what do you want to do with the filename? You posted a bare fragment of your userspace application but didn't post any of your kernel code.
As a long time kernel programmer, I can tell you that this won't work, can't work, and is a terrible hack. There is a vast difference in methods to use inside the kernel vs. userspace.
/proc is strictly for userspace applications to snoop on kernel data. The /proc filesystem drivers assume userspace, so they always do copy_to_user. Data will be written to user address space, and not kernel address space, so this will never work from within the kernel.
Even if you could use /proc from within the kernel, it is a genuinely awful way to do it.
You can get the equivalent data, but it's a bit more complicated than that. If you're intercepting the read syscall inside the kernel, you [already] have access to the current task struct and the fd number used in the call. From this, you can locate the struct for the given open file, and get whatever you want, directly, without involving /proc at all. Use this as a starting point.
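As an illustrative and deliberately simplified sketch of that starting point, assuming process context and a mainline kernel; exact APIs and locking requirements vary by version, so treat this as an assumption-laden outline rather than working driver code:

```c
#include <linux/fs.h>
#include <linux/file.h>
#include <linux/dcache.h>
#include <linux/err.h>

/* Resolve an fd of the current task to a path string, without /proc. */
static char *fd_to_path(unsigned int fd, char *buf, int buflen)
{
    struct file *filp = fget(fd);       /* takes a reference on the open file */
    char *name = ERR_PTR(-EBADF);

    if (filp) {
        name = d_path(&filp->f_path, buf, buflen);  /* may return an ERR_PTR */
        fput(filp);                     /* drop the reference */
    }
    return name;
}
```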
Note that doing this will require you to read kernel documentation, the sources for filesystem drivers, syscalls, etc.: how to lock data structures and lists with the various locking methods (e.g. RCU, rwlocks, spinlocks); per-CPU variables; kernel thread preemption; and how to properly traverse the necessary filesystem-related lists and structs to get the information you want. All this without causing lockups, panics, segfaults, deadlocks, or UB based on stale or inconsistent/dynamically changing data.
You'll need to study all this to become familiar with the way the kernel does things internally, and understand it, before you try doing something like this. If you had, you would have read the source code for the /proc drivers and already known why things were failing.
As a suggestion, forget anything that you've learned about how a userspace application does things. It won't apply here. Internally, the kernel is organized in a completely different way than what you've been used to.
You have no need to use readlink inside the kernel in this instance. That's the way a userspace application would have to do it, but in the kernel it's like driving 100 miles out of your way to get data you already have nearby, and, as I mentioned previously, won't even work.
