copy_from_user and segmentation - c

I was reading a paragraph from "The Linux Kernel Module Programming Guide" and I have a couple of doubts about the following paragraph.
The reason for copy_from_user or get_user is that Linux memory (on
Intel architecture, it may be different under some other processors)
is segmented. This means that a pointer, by itself, does not reference
a unique location in memory, only a location in a memory segment, and
you need to know which memory segment it is to be able to use it.
There is one memory segment for the kernel, and one for each of the
processes.
However, it is my understanding that Linux uses paging instead of segmentation, and that virtual addresses at and above 0xc0000000 have the kernel mapped in.
Do we use copy_from_user in order to accommodate older kernels?
Do current Linux kernels use segmentation in any way at all? If so, how?
If (1) is not true, are there any other advantages to using copy_from_user?

Yeah. I don't like that explanation either. The details are essentially correct in a technical sense (see also Why does Linux on x86 use different segments for user processes and the kernel?), but as you say, Linux typically maps the memory so that kernel code can access it directly, so I don't think it's a good explanation for why copy_from_user, etc. actually exist.
IMO, the primary reason for using copy_from_user / copy_to_user (and friends) is simply that there are a number of things to be checked (dangers to be guarded against), and it makes sense to put all of those checks in one place. You wouldn't want every place that needs to copy data in and out from user-space to have to re-implement all those checks. Especially when the details may vary from one architecture to the next.
For example, it's possible that a user-space page is actually not present when you need to copy to or from that memory, and hence it's important that the call be made from a context that can accommodate a page fault (and hence may sleep).
Also, user-space data pointers need to be checked carefully to ensure that they actually point to user-space and that they point to data regions, and that the copy length doesn't wrap beyond the end of the valid regions, and so forth.
Finally, it's possible that user-space actually doesn't share the same page mappings with the kernel. There used to be a Linux patch for 32-bit x86 that made the complete 4G of virtual address space available to user-space processes. In that case, kernel code could not make the assumption that a user-space pointer was directly accessible, and those functions might need to map individual user-space pages one at a time in order to access them. (See 4GB/4GB Kernel VM Split)
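To make the usual calling pattern concrete, here is a minimal sketch (not from the guide) of a driver write() handler that relies on copy_from_user() for all of those checks; my_write and my_buf are made-up names.

/*
 * Minimal sketch, for illustration only: a driver write() handler that
 * lets copy_from_user() do the validation and fault handling described
 * above.  my_write and my_buf are made-up names.
 */
#include <linux/fs.h>
#include <linux/uaccess.h>

static char my_buf[128];

static ssize_t my_write(struct file *filp, const char __user *ubuf,
                        size_t count, loff_t *ppos)
{
        if (count > sizeof(my_buf))
                count = sizeof(my_buf);

        /*
         * copy_from_user() checks that ubuf really lies in user space,
         * may sleep while servicing page faults, and returns the number
         * of bytes it could NOT copy.
         */
        if (copy_from_user(my_buf, ubuf, count))
                return -EFAULT;

        return count;
}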

How to create vm_area mapping if using __get_free_pages() with order greater than 1?

I am re-implementing mmap in a device driver for DMA.
I saw this question: Linux Driver: mmap() kernel buffer to userspace without using nopage, which has an answer that uses vm_insert_page() to map one page at a time; hence, for multiple pages, it has to be called in a loop. Is there another API that handles this?
Previously I used dma_alloc_coherent to allocate a chunk of memory for DMA and used remap_pfn_range to build a page table that associates the process's virtual memory with the physical memory.
Now I would like to allocate a much larger chunk of memory using __get_free_pages with order greater than 1. I am not sure how to build the page table in that case. The reason is as follows:
I checked the book Linux Device Drivers and noticed the following:
Background:
When a user-space process calls mmap to map device memory into its address space, the system responds by creating a new VMA to represent that mapping. A driver that supports mmap (and, thus, that implements the mmap method) needs to help that process by completing the initialization of that VMA.
Problem with remap_pfn_range:
remap_pfn_range won’t allow you to remap conventional addresses, which include the ones you obtain by calling get_free_page. Instead, it maps in the zero page. Everything appears to work, with the exception that the process sees private, zero-filled pages rather than the remapped RAM that it was hoping for.
The corresponding implementation in the scullp device driver uses get_free_pages with order 0, i.e. only one page:
The mmap method is disabled for a scullp device if the allocation order is greater than zero, because nopage deals with single pages rather than clusters of pages. scullp simply does not know how to properly manage reference counts for pages that are part of higher-order allocations.
May I know if there is a way to create VMA for pages obtained using __get_free_pages with order greater than 1?
I checked Linux source code and noticed there are some drivers re-implementing struct dma_map_ops->alloc() and struct dma_map_ops->map_page(). May I know if this is the correct way to do it?
I think I got the answer to my question. Feel free to correct me if I am wrong.
I happened to see this patch: mm: Introduce new vm_map_pages() and vm_map_pages_zero() API while I was googling for vm_insert_page.
Previously, drivers had their own way of mapping a range of kernel pages/memory into a user vma, and this was done by invoking vm_insert_page() within a loop.
As this pattern is common across different drivers, it can be generalized by creating new functions and using them across the drivers.
vm_map_pages() is the API which can be used to map kernel memory/pages in drivers which have taken vm_pgoff into account.
After reading it, I knew I had found what I wanted.
That function can also be found in the Linux Kernel Core API documentation.
As for the difference between remap_pfn_range() and vm_insert_page() (which requires a loop for a list of contiguous pages), I found this answer to this question extremely helpful; it includes a link to an explanation by Linus.
As a side note, this patch mm: Introduce new vm_insert_range and vm_insert_range_buggy API indicates that the earlier version of vm_map_pages() was vm_insert_range(), but we should stick to vm_map_pages(), since under the hood vm_map_pages() calls vm_insert_range().
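To make this concrete, here is a hedged sketch (not taken from the patch) of what an mmap handler might look like for an order-N allocation from __get_free_pages(), using vm_map_pages(); my_mmap, my_buf and MY_ORDER are illustrative names, and the allocation and error handling are omitted for brevity.

/*
 * Hedged sketch, not from the patch: an mmap handler that maps an
 * order-MY_ORDER allocation from __get_free_pages() into user space
 * with vm_map_pages().
 */
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/mm.h>

#define MY_ORDER 4                      /* 16 pages, for illustration */

static unsigned long my_buf;            /* from __get_free_pages(GFP_KERNEL, MY_ORDER) */

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
        struct page *pages[1 << MY_ORDER];
        int i;

        for (i = 0; i < (1 << MY_ORDER); i++)
                pages[i] = virt_to_page(my_buf + i * PAGE_SIZE);

        /*
         * vm_map_pages() honours vma->vm_pgoff and checks that the
         * requested range fits inside the supplied page array, which is
         * what the per-driver vm_insert_page() loops used to do by hand.
         */
        return vm_map_pages(vma, pages, 1 << MY_ORDER);
}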

Reading/writing in Linux kernel space

I want to add functions in the Linux kernel to write and read data. But I don't know how/where to store it so other programs can read/overwrite/delete it.
Program A calls uf_obj_add(param, param, param), which stores information in memory.
Program B does the same.
Program C calls uf_obj_get(param); the kernel checks whether the operation is allowed and, if it is, returns the data.
Do I just need to malloc() memory, or is it more difficult?
And how can uf_obj_get() access the memory where uf_obj_add() writes?
Where should I store the memory location information so that both functions can access the same data?
As pointed out by commenters on your question, achieving this in userspace would probably be much safer. However, if you insist on achieving it by modifying kernel code, one way to go is to implement a new device driver whose read and write functions you implement according to your needs, so that your processes can access some shared memory space. Your processes can then work, as you described, by reading from and writing to the same space, more or less as if they were reading from/writing to a regular file.
I would recommend reading quite a bit of material before diving into kernel code, though. A good resource on device drivers is Linux Device Drivers. Even though a significant portion of its information may not be up to date, you may find here a version of the source code used in the book, ported to Linux 3.x. You may find what you are looking for under the directory scull.
Again, as pointed out by commenters on your question, I do not think you should jump right into modifying kernel-space code. However, for educational purposes scull may serve as a good starting point for reading kernel code and seeing how to achieve results similar to what you described.
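Purely for illustration, here is a hedged sketch of that device-driver idea: one in-kernel buffer that every process opening the device reads and writes through the usual file operations. The names (shared_buf, my_read, my_write, my_fops) are made up, and a real driver would still need device registration, permission checks and proper sizing.

/*
 * Hedged sketch of the device-driver approach: a single kernel buffer
 * shared by every process that opens the device.
 */
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/uaccess.h>

static char shared_buf[PAGE_SIZE];
static size_t shared_len;
static DEFINE_MUTEX(shared_lock);

static ssize_t my_read(struct file *f, char __user *ubuf,
                       size_t count, loff_t *ppos)
{
        ssize_t ret;

        mutex_lock(&shared_lock);
        ret = simple_read_from_buffer(ubuf, count, ppos, shared_buf, shared_len);
        mutex_unlock(&shared_lock);
        return ret;
}

static ssize_t my_write(struct file *f, const char __user *ubuf,
                        size_t count, loff_t *ppos)
{
        ssize_t ret;

        mutex_lock(&shared_lock);
        ret = simple_write_to_buffer(shared_buf, sizeof(shared_buf),
                                     ppos, ubuf, count);
        if (ret > 0 && (size_t)*ppos > shared_len)
                shared_len = *ppos;
        mutex_unlock(&shared_lock);
        return ret;
}

static const struct file_operations my_fops = {
        .owner = THIS_MODULE,
        .read  = my_read,
        .write = my_write,
};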

C write/read detection on memory block

I'd like to ask if someone has any idea how to detect a write to an allocated memory address.
At first I used mprotect along with sigaction to force a segmentation fault when a write/read operation was made.
Two negative factors with this approach, among several:
it is difficult to continue past a segmentation fault
the memory address passed to mprotect must be aligned to a page boundary, i.e. it is not possible to handle a memory address obtained with a simple malloc.
To clarify the problem:
I am building an app in C for a cluster environment. At some point I allocate memory, which I call a buffer, on the local host and assign some data to it. This buffer is sent to a remote node, where the same procedure takes place. At some point this buffer will be written/read on the remote node, but I don't know when (DMA will be used to write/read the buffer), and the local host must be notified about the buffer modification. As I said above, I have already tried some mechanisms, but none of them handles this well. For now I just want some ideas.
Any idea is welcome.
Thanks
You could use hardware breakpoints. The downsides are that this is hardware specific and only a limited number of breakpoints can be set. Also, most of the time such facilities are not task specific, so if you run multiple instances of the program they'll share the number of available 'slots'.
The x86 architecture has debug registers which can be used to set hardware memory breakpoints (see: http://en.wikipedia.org/wiki/X86_debug_register).
If you want to test this you could use GDB to set hardware breakpoints. You can use the 'watch' command of GDB to place a hardware memory breakpoint on a variable.
Note that debug registers and mprotect() are just methods to get the job you're asking about done; I don't think they are sound engineering practices for memory management (which is probably what you are trying to do here). Maybe you can explain a bit more about what you are trying to do at a higher level: http://catb.org/esr/faqs/smart-questions.html#goal
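For completeness, here is a hedged user-space sketch of the mprotect()/sigaction() mechanics mentioned in the question. Calling mprotect() from a signal handler is not strictly async-signal-safe, so treat this as a demonstration rather than a pattern to copy.

/*
 * Hedged sketch: the buffer is made read-only and the first write
 * triggers a fault that the handler records before restoring write
 * access, after which the faulting instruction is retried.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static char *buf;
static size_t buflen;
static volatile sig_atomic_t write_detected;

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
        (void)sig; (void)ctx;
        /* Only handle faults inside the watched buffer. */
        if ((char *)info->si_addr >= buf && (char *)info->si_addr < buf + buflen) {
                write_detected = 1;
                mprotect(buf, buflen, PROT_READ | PROT_WRITE);  /* let the write proceed */
        } else {
                abort();
        }
}

int main(void)
{
        struct sigaction sa = { 0 };

        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        buflen = sysconf(_SC_PAGESIZE);
        /* mmap() returns page-aligned memory, avoiding the malloc() alignment issue. */
        buf = mmap(NULL, buflen, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        mprotect(buf, buflen, PROT_READ);       /* "arm" the watchpoint */
        buf[0] = 42;                            /* faults; handler fires and unprotects */
        printf("write detected: %d\n", (int)write_detected);
        return 0;
}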

When a binary file runs, does it copy its entire binary data into memory at once? Could I change that?

Does it copy the entire binary into memory before it executes? I am interested in this question and want to change that behaviour in some way. I mean, if the binary is 100M big (which seems impossible), could I run it while I am still copying it into memory? Would that be possible?
Or could you tell me how to see the way it runs? Which tools do I need?
The theoretical model for an application-level programmer makes it appear that this is so. In point of fact, the normal startup process (at least in Linux 1.x, I believe 2.x and 3.x are optimized but similar) is:
The kernel creates a process context (more-or-less, virtual machine)
Into that process context, it defines a virtual memory mapping that maps a range of virtual addresses onto your executable file
Assuming that you're dynamically linked (the default/usual), the ld.so program (e.g. /lib/ld-linux.so.2) defined in your program's headers sets up memory mappings for the shared libraries
The kernel does a jmp into the startup routine of your program (for a C program, that's something like _start in the C runtime startup code, which eventually calls main). Since it has only set up the mapping, and not actually loaded any pages(*), this causes a Page Fault from the CPU's Memory Management Unit, which is an interrupt (exception, signal) to the kernel.
The kernel's Page Fault handler loads some section of your program, including the part that caused the page fault, into RAM.
As your program runs, if it accesses a virtual address that doesn't have RAM backing it up right now, Page Faults will occur and cause the kernel to suspend the program briefly, load the page from disc, and then return control to the program. This all happens "between instructions" and is normally undetectable.
As you use malloc/new, the kernel creates read-write pages of RAM (without disc backing files) and adds them to your virtual address space.
If you throw a Page Fault by trying to access a memory location that isn't set up in the virtual memory mappings, you get a Segmentation Violation Signal (SIGSEGV), which is normally fatal.
As the system runs out of physical RAM, pages of RAM get removed; if they are read-only copies of something already on disc (like an executable, or a shared object file), they just get de-allocated and are reloaded from their source; if they're read-write (like memory you "created" using malloc), they get written out to the ( page file = swap file = swap partition = on-disc virtual memory ). Accessing these "freed" pages causes another Page Fault, and they're re-loaded.
Generally, though, until your process is bigger than available RAM — and data is almost always significantly larger than the executable — you can safely pretend that you're alone in the world and none of this demand paging stuff is happening.
So: effectively, the kernel already is running your program while it's being loaded (and might never even load some pages, if you never jump into that code / refer to that data).
If your startup is particularly sluggish, you could look at the prelink system to optimize shared library loads. This reduces the amount of work that ld.so has to do at startup (between the exec of your program and main getting called, as well as when you first call library routines).
Sometimes, linking statically can improve performance of a program, but at a major expense of RAM — since your libraries aren't shared, you're duplicating "your libc" in addition to the shared libc that every other program is using, for example. That's generally only useful in embedded systems where your program is running more-or-less alone on the machine.
(*) In point of fact, the kernel is a bit smarter, and will generally preload some pages to reduce the number of page faults, but the theory is the same, regardless of the optimizations.
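If you want to watch demand paging happen from user space, here is a hedged sketch (not part of the answer above) that maps a file and uses mincore() to count how many of its pages are resident before and after touching them; the "before" count depends on what is already in the page cache.

/*
 * Hedged sketch: observe demand paging by mapping a file and counting
 * resident pages with mincore() before and after touching the mapping.
 */
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static size_t resident_pages(void *addr, size_t len)
{
        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (len + page - 1) / page, resident = 0;
        unsigned char *vec = malloc(npages);

        mincore(addr, len, vec);
        for (size_t i = 0; i < npages; i++)
                resident += vec[i] & 1;
        free(vec);
        return resident;
}

int main(void)
{
        struct stat st;
        int fd = open("/bin/ls", O_RDONLY);       /* any reasonably large file works */

        fstat(fd, &st);
        char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

        printf("resident before touching: %zu pages\n",
               resident_pages(map, st.st_size));

        volatile char sink = 0;
        for (off_t i = 0; i < st.st_size; i += sysconf(_SC_PAGESIZE))
                sink += map[i];                    /* fault each page in on demand */
        (void)sink;

        printf("resident after touching:  %zu pages\n",
               resident_pages(map, st.st_size));
        return 0;
}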
No, it only loads the necessary pages into memory. This is demand paging.
I don't know of a tool which can really show that in real time, but you can have a look at /proc/xxx/maps, where xxx is the PID of your process.
While you ask a valid question, I don't think it's something you need to worry about. First off, a binary of 100M is not impossible. Second, the system loader will load the pages it needs from the ELF (Executable and Linkable Format) into memory, and perform various relocations, etc. that will make it work, if necessary. It will also load all of its requisite shared library dependencies in the same way. However, this is not an incredibly time-consuming process, and one that doesn't really need to be optimized. Arguably, any "optimization" would have a significant overhead to make sure it's not trying to use something that hasn't been loaded in its due course, and would possibly be less efficient.
If you're curious what gets mapped, as fge says, you can check /proc/pid/maps. If you'd like to see how a program loads, you can try running a program with strace, like:
strace ls
It's quite verbose, but it should give you some idea of the mmap() calls, etc.

How does mprotect() work?

I was stracing some of the common Linux commands, and saw that mprotect() was used many times. I'm just wondering, what is the deciding factor that mprotect() uses to find out that the memory address it is setting a protection value for is in its own address space?
On architectures with an MMU1, the address that mprotect() takes as an argument is a virtual address. Each process has its own independent virtual address space, so there are only two possibilities:
The requested address is within the process's own address range; or
The requested address is within the kernel's address range (which is mapped into every process).
mprotect() works internally by altering the flags attached to a VMA2. The first thing it must do is look up the VMA corresponding to the address that was passed - if the passed address is within the kernel's address range, then there is no VMA, and so this search will fail. This is exactly the same thing that happens if you try to change the protections on an area of the address space that is not mapped.
You can see a representation of the VMAs in a process's address space by examining /proc/<pid>/smaps or /proc/<pid>/maps.
1. Memory Management Unit
2. Virtual Memory Area, a kernel data structure describing a contiguous section of a process's memory.
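As a hedged illustration (not part of the original answer), the following sketch prints the /proc/self/maps line for a private anonymous mapping before and after mprotect(), and then shows the call failing on an address that has no VMA.

/*
 * Hedged sketch: watch mprotect() change the flags of a VMA by dumping
 * the matching /proc/self/maps line before and after the call.
 */
#define _GNU_SOURCE
#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Print the /proc/self/maps line describing the VMA that contains addr. */
static void show_vma(void *addr)
{
        FILE *f = fopen("/proc/self/maps", "r");
        char line[256];
        uintptr_t a = (uintptr_t)addr, lo, hi;

        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR, &lo, &hi) == 2 &&
                    a >= lo && a < hi) {
                        fputs(line, stdout);
                        break;
                }
        }
        fclose(f);
}

int main(void)
{
        size_t len = sysconf(_SC_PAGESIZE);
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        show_vma(p);                              /* flags show rw-p */
        mprotect(p, len, PROT_READ);              /* kernel updates the VMA's flags */
        show_vma(p);                              /* flags now show r--p */

        /* An address with no VMA (nothing mapped there) makes the lookup fail: */
        if (mprotect((void *)0x1000, len, PROT_READ) == -1)
                perror("mprotect on unmapped address");
        return 0;
}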
This is about virtual memory. And about dynamic linker/loader. Most mprotect(2) syscalls you see in the trace are probably related to bringing in library dependencies, though malloc(3) implementation might call it too.
Edit:
To answer your question in comments - the MMU and the code inside the kernel protect one process from the other. Each process has an illusion of a full 32-bit or 64-bit address space. The addresses you operate on are virtual and belong to a given process. Kernel, with the help of the hardware, maps those to physical memory pages. These pages could be shared between processes implicitly as code, or explicitly for interprocess communications.
The kernel looks up the address you pass to mprotect in the current process's page table. If it is not in there, the call fails. If it is in there, the kernel may attempt to mark the page with the new access rights. I'm not sure, but it may still be possible that the kernel would return an error here if there were some special reason that the access could not be granted (such as trying to change the permissions of a memory-mapped shared file area to writable when the file was actually read-only).
Keep in mind that the page table that the processor uses to determine if an area of memory is accessible is not the one that the kernel used to look up that address. The processor's table may have holes in it for things like pages that are swapped out to disk. The tables are related, but not the same.
