What is the difference between vm_insert_page() and remap_pfn_range()?

I want to map device memory (a NIC) into the kernel address space using ioremap_wc(), and then remap that region from kernel space to user space. Two functions can do this: vm_insert_page() and remap_pfn_range().
POSIX mmap(3) usually uses the second: remap_pfn_range().
What is the difference between vm_insert_page() and remap_pfn_range(), and when do I need to use vm_insert_page() instead of remap_pfn_range()?

As their names suggest, vm_insert_page() maps a single page, while remap_pfn_range() maps a contiguous range of page frames. Check the prototypes and comments of vm_insert_page and remap_pfn_range in the kernel source.
For example, you can use vm_insert_page to map a vmalloc area page by page:
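do {
    page = vmalloc_to_page(vaddr);      /* struct page backing this vmalloc address */
    vm_insert_page(vma, uaddr, page);   /* map it at the current user address */
    vaddr += PAGE_SIZE;
    uaddr += PAGE_SIZE;                 /* advance the user address as well */
} while (/* there is something to map */);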
This is not possible with remap_pfn_range because it maps only a physically contiguous block, and a vmalloc area is contiguous only in virtual address space.
Another difference is that with remap_pfn_range you can map not only RAM buffers but also other ranges, such as device memory. With vm_insert_page you can map only RAM buffers.
An explanation from Linus

vm_insert_page() allows drivers to insert individual pages they've allocated into a user vma. The page has to be allocated within the kernel independently. It requires the page be an order-zero allocation obtained for this purpose. It does not put out warnings, and does not require that PG_reserved be set.
Traditionally, this was done with remap_pfn_range() which took an arbitrary page protection parameter. vm_insert_page() doesn't allow that. Your vma protection will have to be set up correctly, which means that if you want a shared writeable mapping, you'd better ask for a shared writeable mapping!
remap_pfn_range() is used for mapping or remapping a group of pages into the memory.

In short,
Use remap_pfn_range if you don't need a struct page for the physical page frames.
Use vm_insert_page for memory whose physical page frames need a struct page.
(Note that there is also vm_insert_pages, which can insert multiple pages into a vma.)
If you want to do direct I/O from this region, you need to use vm_insert_page/vm_insert_pages, because direct I/O calls get_user_pages to obtain the struct page, and the ext4/scatterlist code that follows also needs a struct page.

Related

Is it possible to control page-out and page-in by user programming? If yes then how?

My questions are as follows:
I mmap (memory-map) a file into the virtual address space.
When I access the first byte of the file through a pointer for the first time, the access fails and raises a page fault, because the data is not present in memory yet. The OS then reads the data from disk into memory, and finally my access succeeds.
(Here comes the question.)
When I modify the in-memory data and write it back to the file on disk, how can I free just the physical memory for other uses, but keep the virtual mapping so the data can be fetched back into memory as needed?
It sounds like page-out and page-in: when the OS knows memory is exhausted, it swaps LRU (or similar) pages to disk (swap files), frees the physical memory for other processes, and fetches the evicted data back into memory as needed. But this mechanism is controlled by the OS.
For some reasons, I need to control the page-out and page-in behavior myself. How should I do that? Hack the kernel?
You can use the madvise system call. Its behaviour is affected by the advice argument; there are many choices for advice and the optimal one should be picked based on the specifics of your application.
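The flag MADV_DONTNEED means that the physical frames backing the given range are released right away. Note that this is not a page-out in the swap sense: the contents of anonymous private pages are discarded rather than saved. Also: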
After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are changed: subsequent accesses of pages in the range will succeed, but will result in either repopulating the memory contents from the up-to-date contents of the underlying mapped file (for shared file mappings, shared anonymous mappings, and shmem-based techniques such as System V shared memory segments) or zero-fill-on-demand pages for anonymous private mappings.
This could be useful if you're absolutely certain that it will be a long time before you access the same position again.
However, it might not be necessary to force the kernel to page out. If you're accessing the mapping sequentially, another possibility is to use madvise with MADV_SEQUENTIAL to tell the kernel that you will access the mapping mostly sequentially:
Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)
or MADV_RANDOM
Expect page references in random order. (Hence, read ahead may be less useful than normally.)
These are not as aggressive as explicitly calling MADV_DONTNEED to page out. (Of course you can combine these with MADV_DONTNEED as well.)
Since Linux 4.5 there is also the MADV_FREE flag, which frees the page frames lazily: they stay mapped while enough memory is available, but are reclaimed by the kernel when memory pressure grows. It applies only to private anonymous mappings.
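The quoted MADV_DONTNEED semantics can be checked from user space. The following is a minimal sketch, assuming Linux and a writable /tmp; the function names and the scratch file name are made up for illustration. The first function writes through a shared file mapping, drops the pages, and reads them back; the second does the same with an anonymous private mapping, where the data is expected to come back as zeros.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if a shared file mapping still shows its data after
 * MADV_DONTNEED, 1 if it does not, -1 on setup failure. */
int dontneed_keeps_file_data(void)
{
    char path[] = "/tmp/madvise_demo_XXXXXX";   /* hypothetical scratch file */
    long page = sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                               /* file lives on while it is open or mapped */
    if (ftruncate(fd, page) != 0) {
        close(fd);
        return -1;
    }
    char *map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                  /* the mapping keeps its own reference */
    if (map == MAP_FAILED)
        return -1;

    strcpy(map, "dirty data");
    /* Write back first, then tell the kernel the frames are not needed. */
    if (msync(map, page, MS_SYNC) != 0 || madvise(map, page, MADV_DONTNEED) != 0) {
        munmap(map, page);
        return -1;
    }
    /* This access faults the page back in from the file. */
    int same = strcmp(map, "dirty data") == 0;
    munmap(map, page);
    return same ? 0 : 1;
}

/* Returns 0 if an anonymous private page reads as zero after
 * MADV_DONTNEED, 1 if it does not, -1 on setup failure. */
int dontneed_zeroes_private_anon(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *map = mmap(NULL, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (map == MAP_FAILED)
        return -1;
    map[0] = 'x';
    if (madvise(map, page, MADV_DONTNEED) != 0) {
        munmap(map, page);
        return -1;
    }
    int zero = (map[0] == 0);                   /* zero-fill-on-demand page */
    munmap(map, page);
    return zero ? 0 : 1;
}
```

Calling both functions from a small main() and checking for a return value of 0 is enough to see the difference between the two kinds of mapping.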
You can check out mlock and munlock to lock and unlock pages in memory. This gives you control over whether pages can be swapped out.
An unprivileged process can lock memory only up to the RLIMIT_MEMLOCK limit; beyond that you need the CAP_IPC_LOCK capability.
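As a rough illustration of the mlock/munlock suggestion, the sketch below locks a single anonymous page and asks mincore() whether it is resident. The function name is hypothetical, and the treatment of ENOMEM, EPERM and EAGAIN as "lock refused" is an assumption about how a caller might want to handle limits.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page was locked and reported resident, 0 if the
 * kernel refused the lock (limit or privilege), -1 on other errors. */
int lock_one_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    int result;
    if (mlock(buf, page) != 0) {
        /* Typically RLIMIT_MEMLOCK exceeded or missing CAP_IPC_LOCK. */
        result = (errno == ENOMEM || errno == EPERM || errno == EAGAIN) ? 0 : -1;
    } else {
        unsigned char vec = 0;
        /* mlock faults the page in, so mincore should report it resident. */
        result = (mincore(buf, page, &vec) == 0 && (vec & 1)) ? 1 : -1;
        munlock(buf, page);                     /* page may be swapped out again */
    }
    munmap(buf, page);
    return result;
}
```

Note that locking only prevents page-out; it does not by itself give a way to force pages out, which is what madvise is for.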

How is the anatomy of memory mapped Kernel space

I am trying to understand the Linux mechanism for mapping kernel-mode memory into user-mode space using mmap.
First, I have a loadable kernel module (LKM) that provides a character device with mmap functionality. A user-space application then opens the device and calls mmap, and the LKM allocates memory on its heap inside the kernel-mode space (a high virtual address). On the user-space side, the data pointer points to a low virtual address.
The following picture shows how I imagine the memory layout. Is this right?
Please let me know if the question is not clear; I will try to add more details.
Edit: The picture was edited following Gil Hamilton's comment. The black arrow now points to a physical address.
The drawing leaves out a few important underlying assumptions.
The kernel does not need to mmap() to access user space memory. If a user process has the memory, it's already mapped in the address space by definition. In that sense, the memory is already shared between user and kernel.
mmap() creates a new region in user's virtual address space, so that the address region can be populated by physical memory if later accessed. The actual allocation of memory and modifying the page table entry is done by the kernel.
mmap() only makes sense for managing user-half of the virtual address space. Kernel-half of the address space is managed completely differently.
Also, the kernel-half is shared by all processes in the system. Each process has its dedicated virtual address space, but the page tables are programmed in such a way that the page table entries for the kernel-half are set exactly the same for all processes.
Again, the kernel does not mmap() in order to access user space memory. mmap() is rather a service provided by kernel to user to modify the current mapping in user's virtual address space.
BTW, the kernel actually has a few ways to access user memory if it wants to.
First of all, the kernel has a dedicated region of kernel address space (as part of its kernel space) which maps the entirety of the physical memory present in a consecutive fashion. (This is true on all 64-bit systems. On 32-bit systems the kernel has to 'remap' on the fly to achieve this.)
Second, if the kernel is entered via a system call or exception rather than a hardware interrupt, there is a valid process context, so the kernel can directly "dereference" a user-space pointer to get the correct value (in practice through helpers such as copy_from_user()).
Third, if the kernel wants to dereference a user-space pointer of a process while executing in a borrowed context, such as an interrupt handler, it can resolve the process's virtual address by searching the vm_area_struct tree for permissions and walking the page table to find the actual physical page frame.
You can check the memory regions by iterating through the process's VMAs (struct vm_area_struct), reachable through current.
If you walk the page tables and derive the physical addresses mapped for virtual addresses that are not part of user space, the memory layout will become clearer.
Apart from this, one minor correction to the figure:
BSS is not a segment but a section that is embedded in the data segment; refer to the ELF specification and the linker script for more details.

mmap() in linux kernel to access unmapped memory

I am trying to use the mmap() functionality provided by the Linux kernel.
When we call mmap() in user space, we map a virtual memory area of the user-space process to memory in kernel space.
The mmap() handler is defined in my kernel module, which allocates some memory in pages and maps it during the mmap system call. The contents of this kernel-space memory can be filled in by the module.
My question: after the mapping is set up, the user-space process can access the mapped memory directly, without extra kernel overhead such as a read() system call. But if the mapped memory (allocated in kernel space) contains pointers to other kernel-space memory that is not mapped, can the user-space process access that unmapped memory by following those pointers?
No, userspace can't chase pointers in mapped memory that point to unmapped kernel memory.
No, a user-space process cannot access the unmapped memory. The kernel won't allow it.
You can access only the portion of memory that is mapped via mmap.
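The effect can be imitated entirely in user space. In the sketch below (function names are hypothetical), a pointer is stored inside a mapped page and then followed in a child process, once while its target is mapped and once after the target has been unmapped. The unmapped user page only stands in for kernel memory that was never mapped into the process; the assumption is that both cases end the same way, with SIGSEGV.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read one byte from `p` in a child process. Returns 0 if the read
 * worked, the number of the signal that killed the child otherwise,
 * or -1 if fork/waitpid failed. */
int try_read_in_child(const void *p)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        volatile char c = *(const volatile char *)p;   /* may fault */
        (void)c;
        _exit(0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Returns 0 if a pointer to mapped memory can be followed and a
 * pointer to unmapped memory raises SIGSEGV, 1 otherwise, -1 on
 * setup failure. */
int chase_pointers_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void **mapped = mmap(NULL, page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *other = mmap(NULL, page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mapped == MAP_FAILED || other == MAP_FAILED)
        return -1;

    mapped[0] = other;                      /* pointer stored in mapped memory */
    int before = try_read_in_child(mapped[0]);
    munmap(other, page);                    /* the target is no longer mapped */
    int after = try_read_in_child(mapped[0]);
    munmap(mapped, page);
    return (before == 0 && after == SIGSEGV) ? 0 : 1;
}
```

The pointer value itself is always readable because it lives in the mapped page; only following it depends on whether the target is mapped in the process's address space.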
I think you can use the remap_pfn_range function explicitly to remap the region.
From the Linux mmap man page:
The effect of changing the size of the underlying file of a mapping on the pages that correspond to added or removed regions of the file is unspecified.
No, you can't.
However, if your purpose is to change your mmapped area on the fly, here are some options:
A. In user space, you can use mremap, which expands (or shrinks) an existing memory mapping.
B. In kernel space, in your driver, you need to implement the nopage() method (the fault() handler in current kernels) or use remap_pfn_range. However, remap_pfn_range has a limitation: Linux only lets it map reserved pages, so you cannot even remap normal memory such as pages allocated by get_free_page().

Manage virtual memory from userspace

What I actually want to do is to redirect writes in a certain memory area to a separate memory area which is shared between two processes. Can this be done at user level? For example, for some page X. What I want to do is to change its (virtual to physical) mapping to some shared mapping when it's written. Is this achievable? I need to do it transparently too, that is the program still uses the variables in page X by their names or pointers, but behind the scenes, we are using a different page.
Yes, it is possible to replace memory mappings in Linux, though it is not advisable to do it since it is highly non-portable.
First, you should find out which page the X variable is located in by taking its address and masking out the low-order bits; query the system page size with sysconf(_SC_PAGE_SIZE) to know how many bits to mask out. Then you can create a shared memory mapping that overlaps this page by passing the MAP_FIXED | MAP_SHARED flags to mmap(2) or mmap2(2). You should copy the initial content of the page beforehand and restore it after creating the new mapping. Since other variables may reside in the same page, you should be very careful about memory layout, and it is better to use a dedicated shared memory object.
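The steps in this answer can be sketched roughly as follows. This is an illustration under assumptions, not a drop-in solution: the function names are made up, an unlinked temporary file in /tmp plays the role of the shared memory object, and "page X" is a dedicated anonymous page so that no unrelated variables (including the malloc heap used for the copy) live in it.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Replace the page containing `addr` with a MAP_SHARED mapping of
 * `fd` at offset 0, keeping the page's current contents. Returns the
 * page base address, or NULL on failure. */
void *replace_page_with_shared(void *addr, int fd)
{
    long page_size = sysconf(_SC_PAGE_SIZE);
    if (page_size <= 0)
        return NULL;
    /* Mask out the low-order bits to get the page base. */
    void *page = (void *)((uintptr_t)addr & ~((uintptr_t)page_size - 1));

    char *saved = malloc(page_size);
    if (!saved)
        return NULL;
    memcpy(saved, page, page_size);             /* copy the initial content */

    void *p = mmap(page, page_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, fd, 0);
    if (p != MAP_FAILED)
        memcpy(p, saved, page_size);            /* restore it in the new mapping */
    free(saved);
    return p == MAP_FAILED ? NULL : p;
}

/* Returns 0 if writes through the old pointer become visible through
 * a second mapping of the same object, 1 if not, -1 on setup failure. */
int shared_remap_demo(void)
{
    long page_size = sysconf(_SC_PAGE_SIZE);
    char path[] = "/tmp/shared_page_XXXXXX";    /* hypothetical backing object */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    char *x = MAP_FAILED, *view = MAP_FAILED;
    int result = -1;

    if (ftruncate(fd, page_size) != 0)
        goto out;
    x = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);       /* "page X" */
    if (x == MAP_FAILED)
        goto out;
    strcpy(x, "before");
    if (!replace_page_with_shared(x, fd))
        goto out;
    /* A second view of the object, as another process would have. */
    view = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
    if (view == MAP_FAILED)
        goto out;
    strcpy(x + 16, "after");                    /* same pointer, now shared */
    result = (strcmp(view, "before") == 0 &&
              strcmp(view + 16, "after") == 0) ? 0 : 1;
out:
    if (view != MAP_FAILED)
        munmap(view, page_size);
    if (x != MAP_FAILED)
        munmap(x, page_size);
    close(fd);
    return result;
}
```

The program keeps using the same pointer throughout, which is the transparency the question asks for. A second process would get its view by mapping the same object, for example through an inherited or passed file descriptor.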
What you're trying to do isn't entirely possible, because, at least on x86, memory cannot be remapped on that fine-grained a scale. The smallest unit in which you can remap memory is a 4k page, and the page containing any given variable (e.g., X) is likely to contain other variables or program data.
That being said, you can share memory between processes using the mmap() system call.

How does mprotect() work?

I was stracing some common Linux commands and saw that mprotect() was used many times. I'm just wondering, what is the deciding factor that mprotect() uses to find out that the memory address it is setting a protection value for is in its own address space?
On architectures with an MMU[1], the address that mprotect() takes as an argument is a virtual address. Each process has its own independent virtual address space, so there are only two possibilities:
The requested address is within the process's own address range; or
The requested address is within the kernel's address range (which is mapped into every process).
mprotect() works internally by altering the flags attached to a VMA[2]. The first thing it must do is look up the VMA corresponding to the address that was passed. If the passed address was within the kernel's address range, then there is no VMA, and so this search will fail. This is exactly the same thing that happens if you try to change the protections on an area of the address space that is not mapped.
You can see a representation of the VMAs in a process's address space by examining /proc/<pid>/smaps or /proc/<pid>/maps.
1. Memory Management Unit
2. Virtual Memory Area, a kernel data structure describing a contiguous section of a process's memory.
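To see the VMA flags change, one can read /proc/self/maps before and after an mprotect() call. The sketch below assumes Linux with /proc mounted; the helper names are hypothetical, and the parser only looks at the address range and permission columns of each line.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copy the permission string ("rw-p", "r--p", ...) of the VMA that
 * contains `addr` into `perms`. Returns 0 if found, -1 otherwise. */
int vma_perms(const void *addr, char perms[5])
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    char line[512];
    unsigned long start, end, target = (unsigned long)addr;
    int found = -1;
    while (fgets(line, sizeof line, f)) {
        char p[5];
        /* Each line starts with "start-end perms ..." in hexadecimal. */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, p) == 3 &&
            target >= start && target < end) {
            memcpy(perms, p, 5);
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}

/* Returns 0 if the VMA of a fresh page changes from "rw-p" to "r--p"
 * after mprotect(PROT_READ), 1 if not, -1 on setup failure. */
int mprotect_changes_vma(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char before[5], after[5];
    void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int ok = vma_perms(p, before) == 0 &&
             mprotect(p, page, PROT_READ) == 0 &&
             vma_perms(p, after) == 0 &&
             strcmp(before, "rw-p") == 0 &&
             strcmp(after, "r--p") == 0;
    munmap(p, page);
    return ok ? 0 : 1;
}
```

If the page was part of a larger VMA, the kernel splits that VMA so the new protection covers only the requested range, which also shows up as extra lines in the maps file.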
This is about virtual memory, and about the dynamic linker/loader. Most mprotect(2) syscalls you see in the trace are probably related to bringing in library dependencies, though a malloc(3) implementation might call it too.
Edit:
To answer your question in comments - the MMU and the code inside the kernel protect one process from the other. Each process has an illusion of a full 32-bit or 64-bit address space. The addresses you operate on are virtual and belong to a given process. Kernel, with the help of the hardware, maps those to physical memory pages. These pages could be shared between processes implicitly as code, or explicitly for interprocess communications.
The kernel looks up the address you pass mprotect in the current process's page table. If it is not in there then it fails. If it is in there the kernel may attempt to mark the page with new access rights. I'm not sure, but it may still be possible that the kernel would return an error here if there were some special reason that the access could not be granted (such as trying to change the permissions of a memory mapped shared file area to writable when the file was actually read only).
Keep in mind that the page table that the processor uses to determine if an area of memory is accessible is not the one that the kernel used to look up that address. The processor's table may have holes in it for things like pages that are swapped out to disk. The tables are related, but not the same.
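Both failure cases mentioned in these answers can be reproduced from user space. The sketch below uses hypothetical function names and a scratch file in /tmp; per the mprotect(2) man page, the expected errors on Linux are ENOMEM for an unmapped range and EACCES for a shared mapping of a file opened read-only.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* mprotect() on an unmapped range. Returns the errno of the failed
 * call, 0 if it unexpectedly succeeded, -1 on setup failure. */
int mprotect_unmapped_errno(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, page, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    munmap(p, page);                        /* the address is now a hole */
    return mprotect(p, page, PROT_READ) == 0 ? 0 : errno;
}

/* Ask for PROT_WRITE on a shared mapping of a file opened read-only.
 * Returns the errno of the failed call, 0 if it unexpectedly
 * succeeded, -1 on setup failure. */
int mprotect_readonly_file_errno(void)
{
    char path[] = "/tmp/mprotect_demo_XXXXXX";  /* hypothetical scratch file */
    long page = sysconf(_SC_PAGESIZE);
    int wfd = mkstemp(path);
    if (wfd < 0)
        return -1;
    int rfd = open(path, O_RDONLY);         /* second, read-only descriptor */
    unlink(path);
    if (rfd < 0 || ftruncate(wfd, page) != 0) {
        close(wfd);
        if (rfd >= 0)
            close(rfd);
        return -1;
    }
    close(wfd);

    void *p = mmap(NULL, page, PROT_READ, MAP_SHARED, rfd, 0);
    close(rfd);
    if (p == MAP_FAILED)
        return -1;
    /* Save errno before cleanup so munmap cannot disturb it. */
    int err = mprotect(p, page, PROT_READ | PROT_WRITE) == 0 ? 0 : errno;
    munmap(p, page);
    return err;
}
```

In both cases the kernel decides from its own bookkeeping for the address range, before any hardware page table entry is touched.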

Resources