Behavior of mprotect with multiple threads - c

For the purpose of concurrent/parallel GC,
I'm interested in what memory-ordering guarantee the mprotect syscall provides (i.e. the behavior of mprotect with multiple threads, or the memory model of mprotect). My questions are (assuming no compiler reordering, or a sufficient compiler barrier):
If thread 1 triggers a segfault on an address due to an mprotect on
thread 2, can I be sure that everything that happened on thread 2 before
the syscall can be observed in thread 1 in the signal handler of the
segfault? What if a full memory barrier is placed in the signal
handler before performing the load on thread 1?
If thread 1 does a volatile load on an address that is set to
PROT_NONE by thread 2 and doesn't trigger a segfault, is this enough to
establish a happens-before relation between the two? In other words, if the
two threads do (*ga starts as 0, p is a page-aligned address that starts read-only)
// thread 1
*ga = 1;
*(volatile int*)p; // no segfault happens
// thread 2
mprotect(p, 4096, PROT_NONE); // Or replace 4096 by the real userspace-visible page size
a = *ga;
is there a guarantee that a on thread 2 will be 1? (assuming no
segfault observed on thread 1 and no other code modifies *ga)
I'm mostly interested in Linux behavior, particularly on x86(_64), arm/aarch64 and ppc, though information about other archs/OSes is welcome too (for Windows, replace mprotect by VirtualProtect or whatever it is called...). So far my tests on x64 and aarch64 Linux suggest no violations of these, though I'm not sure if my tests are conclusive or if the behavior can be relied on in the long term.
Some searching suggests that mprotect may issue a TLB shootdown on all threads with the address mapped when permission is removed, which might provide the guarantee stated here (in other words, providing this guarantee seems to be the goal of such an operation), though it's unclear to me whether future optimization of the kernel code could break this guarantee.
Ref LKML post where I asked about this a week ago with no reply yet...
Edit: clarification about the question. I was aware that a TLB shootdown should provide the guarantee I'm looking for, but I'd like to know if such behavior can be relied on. In other words, what is the reason such requests are issued by the kernel, since they shouldn't be needed if not for providing some kind of ordering guarantee?

So I asked this on the mechanical-sympathy group a day after posting here and got an answer from Gil Tene. With his permission here's my summary of his answers. The full thread is available here in case there's anything I didn't include that isn't clear.
For the overall behavior one can expect from the OS
(as in "would be surprising for an OS to not meet"):
A call to mprotect() is fully ordered with respect to loads and stores that happen before and after the call. This tends to be trivially achieved at the CPU and OS level because mprotect is a system call, which involves a trap, which in turn involves full ordering. [In strange no-ring-transition-implementations (e.g. in-kernel execution, etc.) the protect call would be presumably responsible for emulating this ordering assumption].
A call to mprotect will not return before the protection request semantically takes hold everywhere within the process. If the mprotect() call sets a protection that would cause a fault, any operation on any thread that happens after this mprotect() call is required to fault. Similarly, if the mprotect() call sets a protection that would prevent a fault, any operation on any thread that happens after this mprotect() call is required to NOT fault.
This essentially means that memory operations on the affected pages on other threads are synchronized with the thread calling mprotect. More specifically, one can expect both of the cases mentioned in the original question to be guaranteed. I.e.
If a load on one thread in the affected page is observed to fault because of the mprotect call, the fault happens after the mprotect() call and is therefore able to observe all memory operations that happened before mprotect.
If a load on one thread in the affected page is observed not to fault despite the mprotect call, the load happens before the mprotect call, so the mprotect call and any code after it come after the load and will be able to observe any memory operations that happened before the load.
It was also pointed out that transitivity may not hold, i.e. a faulting load on one thread may not be after a non-faulting load on another thread. This can (effectively) be caused by the non-atomicity of the TLB flush, which lets different threads/CPUs observe the change in access permission at different times.

Related

What happens at CPU-Level if you dereference a null pointer?

Suppose I have following program:
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
static void myHandler(int sig){
    abort();
}
int main(void){
    signal(SIGSEGV,myHandler);
    char* ptr=NULL;
    *ptr='a';
    return 0;
}
As you can see, I register a signal handler and, a few lines further on, I dereference a null pointer ==> SIGSEGV is triggered.
But how is it triggered?
If I run it using strace (Output stripped):
//Set signal handler (In glibc signal simply wraps a call to sigaction)
rt_sigaction(SIGSEGV, {sa_handler=0x563b125e1060, sa_mask=[SEGV], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7ffbe4fe0d30}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
//SIGSEGV is raised
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [SEGV], 8) = 0
But something is missing, how does a signal go from the CPU to the program?
My understanding:
[Dereferences null pointer] -> [CPU raises an exception] -> [??? (How does it go from the CPU to the kernel?) ] -> [The kernel is notified, and sends the signal to the process] -> [??? (How does the process know, that a signal is raised?)] -> [The matching signal handler is called].
What happens at these two places marked with ????
A NULL pointer in most (but not all) C implementations is address 0. Normally this address is not in a valid (mapped) page.
Any access to a virtual page that's not mapped by the HW page tables results in a page-fault exception. e.g. on x86, #PF.
This invokes the OS's page-fault exception handler to resolve the situation. On x86-64 for example, the CPU pushes exception-return info on the kernel stack and loads a CS:RIP from the IDT (Interrupt Descriptor Table) entry that corresponds to that exception number. Just like any other exception triggered by user-space, e.g. integer divide by zero (#DE), or a General Protection fault #GP (trying to run a privileged instruction in user-space, or a misaligned SIMD instruction that required alignment, or many other possible things).
The page-fault handler can find out what address user-space tried to access. e.g. on x86, there's a control register (CR2) that holds the linear (virtual) address that caused the fault. The OS can get a copy of that into a general-purpose register with mov rax, cr2.
Other ISAs have other mechanisms for the OS to tell the CPU where its page-fault handler is, and for that handler to find out what address user-space was trying to access. But it's pretty universal for systems with virtual memory to have essentially equivalent mechanisms.
The access is not yet known to be invalid. There are several reasons why an OS might not have bothered to "wire" a process's allocated memory into the hardware page tables. This is what paging is all about: letting the OS correct the situation, like copy-on-write, lazy allocation, or bringing a page back in from swap space.
Page faults come in three categories: (copied from my answer on another question). Wikipedia's page-fault article says similar things.
valid (the process logically has the memory mapped, but the OS was lazy or playing tricks like copy-on-write):
hard: the page needs to be paged in from disk, either from swap space or from a disk file (e.g. a memory mapped file, like a page of an executable or shared library). Usually the OS will schedule another task while waiting for I/O: this is the key difference between hard (major) and soft (minor).
soft: No disk access required, just for example allocating + zeroing a new physical page to back a virtual page that user-space just tried to write. Or copy-on-write of a writeable page that multiple processes had mapped, but where changes by one shouldn't be visible to the other (like mmap(MAP_PRIVATE)). This turns a shared page into a private dirty page.
invalid: There wasn't even a logical mapping for that page. A POSIX OS like Linux will deliver SIGSEGV signal to the offending process/thread.
So only after the OS consults its own data structures to see which virtual addresses a process is supposed to own can it be sure that the memory access was invalid.
Deciding whether a page fault is invalid or not is completely up to software. As I wrote on Why page faults are usually handled by the OS, not hardware? - if the HW could figure everything out, it wouldn't need to trap to the OS.
Fun fact: on Linux it's possible to configure the system so virtual address 0 is (or can be) valid. Setting mmap_min_addr = 0 allows processes to mmap there. e.g. WINE needs this for emulating a 16-bit Windows memory layout.
Since that wouldn't change the internal object-representation of a NULL pointer to be other than 0, doing that would mean that NULL dereference would no longer fault. That makes debugging harder, which is why the default for mmap_min_addr is 64k.
On a simpler system without virtual memory, the OS might still be able to configure an MMU to trap on memory access to certain regions of address space. The OS's trap handler doesn't have to check anything, it knows any access that triggered it was invalid. (Unless it's also emulating something for some regions of address space...)
Delivering a signal to user-space
This part is pure software. Delivering SIGSEGV is no different than delivering SIGALRM or SIGTERM sent by another process.
Of course, a user-space process that just returns from a SIGSEGV handler without fixing the problem will make the main thread re-run the same faulting instruction again. (The OS would return to the instruction that raised the page-fault exception.)
This is why the default action for SIGSEGV is to terminate, and why it doesn't make sense to set the behaviour to "ignore".
Typically what happens is that when the CPU's Memory Management Unit finds that the virtual address the program is trying to access is not in any of the mappings to physical memory, it raises an interrupt. The OS will have set up an Interrupt Service Routine just in case this happens. That routine will do whatever is necessary inside the OS to signal the process with SEGV. On return from the ISR, the offending instruction has still not been completed.
What happens then depends on whether there’s a handler installed or not for SEGV. The language’s runtime may have installed one that raises it as an exception. Almost always the process is terminated, as it is beyond recovery. Something like valgrind would do something useful with the signal, eg telling you exactly where in the code the program had got to.
Where it gets interesting is when you look at the memory allocation strategies used by C runtime libraries like glibc. A NULL pointer dereference is a bit of an obvious one, but what about accessing beyond the end of an array? Often, calls to malloc() or new will result in the library asking for more memory than has been asked for. The bet is that it can use that memory to satisfy further requests for memory without troubling the OS - which is nice and fast. However, the CPU’s MMU has no idea that that’s happened. So if you do access beyond the end of the array, you’re still accessing memory that the MMU can see is mapped to your process, but in reality you’re beginning to trample where one shouldn’t. Some very defensive OSes don’t do this, specifically so that the MMU does catch out of bounds accesses.
This leads to interesting results. I’ve come across software that builds and runs just fine on Linux which, compiled for FreeBSD, starts throwing SEGVs. GNURadio is one such piece of software (it was a complex flow graph). Which is interesting because it makes heavy use of boost / c++11 smart pointers specifically to help avoid memory misuse. I’ve not yet been able to identify where the fault is to submit a bug report for that one...

Can a process somehow continue without crashing after receiving SIGSEGV or SIGBUS?

I am working on a project that deals with multiple processes and threads affecting the same data. I have a line of code which can result in a segmentation fault because the data can be updated from anywhere.
For that particular line, if it causes segmentation fault, I somehow want to handle it instead of letting the program crash.
Like I can simply update the memory location if the previous one was causing a segmentation fault.
Is there any possible way to do that?
UPDATE(A short summary of my case):
I want extremely speedy access to a file.
For that purpose, I am calling mmap(2) to map that file into all processes accessing it. The data I am writing to the file is in form of a particular data structure and it consumes lots of memory. So if a point comes that the size I mapped is not enough, I need to increase file size and mmap(2) that file again with the new size. For increasing the size I call ftruncate(2). ftruncate(2) may get called from any process so it may end up shrinking the file instead. So I need to check if the memory I am accessing doesn’t lead to seg faults.
I am working on macOS.
This can be made to work, but by bringing signal handlers into the picture you make your inter-process and inter-thread locking problems much more complicated. I would like to suggest an alternative approach: Reserve a field in the first page of the mmapped file to indicate the expected size of the data structure. Use fcntl file locks to mediate access to this field.
When any process wants to update the size, it takes a write lock, reads the current value, increases it, msyncs the page (using MS_ASYNC|MS_INVALIDATE should be enough), then uses ftruncate to enlarge the file, then enlarges its mapping of the file, and only then releases the write lock. If, after taking the write lock, you find that the file is already larger than the size you wanted, just enlarge your mapping and drop the lock, don't call ftruncate or change the field.
This ensures cooperating processes will never make the file smaller, and the region of memory each process has mapped is always backed by allocated storage, so you shouldn't ever get any SIGBUSes. Note that the size of the file on disk will only increase when you actually write to newly allocated space, thanks to the magic of sparse files.
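A simplified sketch of the locking part of that protocol follows. It is an assumption-laden reduction: grow_file_locked is a hypothetical helper, it uses fstat to learn the current size instead of a size field stored in the first page, and it leaves the msync and the remapping (which the description above does before dropping the lock) to the caller.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Grow the file behind fd to at least `wanted` bytes while holding an
   exclusive fcntl lock on its first byte. Never shrinks the file.
   Returns the resulting file size, or -1 on error. */
static off_t grow_file_locked(int fd, off_t wanted) {
    struct flock lk;
    struct stat st;
    off_t result = -1;

    memset(&lk, 0, sizeof lk);
    lk.l_type = F_WRLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 1;
    if (fcntl(fd, F_SETLKW, &lk) == -1)   /* blocks until the lock is ours */
        return -1;

    if (fstat(fd, &st) == 0) {
        if (st.st_size >= wanted)
            result = st.st_size;          /* someone already grew it */
        else if (ftruncate(fd, wanted) == 0)
            result = wanted;
    }

    lk.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &lk);
    return result;
}
```

Because every size change goes through this one function and it only ever enlarges the file, a process that maps no more than the size it last got back is not left with pages beyond end-of-file.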
Yes, you can make this work with a signal handler that catches the SIGSEGV or SIGBUS, adjusts the mmap and returns. When a signal handler returns it will resume where the signal occurred, which means for a synchronous signal like SIGSEGV or SIGBUS, it will rerun the faulting instruction.
You can see this at work in my shared memory malloc implementation -- search for shm_segv in malloc.c to see the signal handler; it's pretty simple. I've never tried this code on macOS, but I would think it would work there, as it works on all the other BSD-derived UNIXes I've tried it on. There's an issue in that, according to the POSIX spec, mmap is not async-signal-safe, so it cannot be called from a signal handler, but on all systems that actually support real memory mapping (rather than emulating it with malloc+read) it should be fine.

Modify read-only memory at low overhead

Assume that I have a page of memory that is read-only (e.g., set through mmap/mprotect). How do I modify one word (8 bytes) on this page at the lowest possible overhead?
Some context: I assume x86-64, Linux as my runtime environment. The modifications happen rarely but frequently enough so that I have to worry about overhead. The page is read only to protect some important data that must be read by the program frequently against rogue/illegal modifications. There are only few places that are allowed to modify the data on the page and I know all the locations of these places and the address of the page statically. The problem I'm trying to solve is protecting some data against memory safety bugs in the program with a few authorized places where I need to make modifications to the data. The modifications are not frequent but frequent enough so that several kernel-roundtrips (through system calls) are too costly.
So far, I thought of the following solutions:
mprotect
ptrace
shared memory
new system call
mprotect
mprotect(addr, 4096, PROT_WRITE | PROT_READ);
addr[12] = 0xc0fec0fe;
mprotect(addr, 4096, PROT_READ);
The mprotect solution is clean, simple, and straight-forward. Unfortunately, it involves two round trips into the kernel and will result in some overhead. In addition, the whole page will be writable during that time frame, allowing for some other thread to modify that memory area concurrently.
ptrace
Unfortunately, ptracing yourself is no longer possible (as a ptraced process needs to be stopped). So the solution is to fork, ptrace the child process, then use PTRACE_POKETEXT to write to the child process's memory.
This option has the drawback of spawning a parent process and will result in problems if the tracee uses multiple processes. The overhead per write is at least one system call for PTRACE plus the required synchronization between the processes.
shared memory
Shared memory is similar to the ptrace solution except that it reduces the system call. Both processes set up shared memory with different permissions (RW in the child, R in the parent). The two processes still need to synchronize on each write that is then carried out by the parent. Shared memory has similar drawbacks in complexity as the ptrace solution and incompatibilities with multiple communicating processes.
new system call
Adding a new system call to the kernel would solve the problem and would only require a single system call to modify one word in the process without having to change the page tables or the requirement to set up multiple communicating processes.
Is there anything that is faster than the 4 discussed/sketched solutions? Could I rely on any debug features? Are there any other neat low-level systems tricks?

get_user_pages_fast() from kernel thread

I need to call get_user_pages_fast() from a kernel thread. But get_user_pages_fast() uses current->mm internally, which is set to NULL for a kernel thread. Is there any way to get around this? The kernel thread in question is working on behalf of another process, say x; would it be fine to just set current->mm to x->mm and invoke get_user_pages_fast()?
[EDIT 1]: I verified this and it seems to be working. I am still concerned if it could break in some cases. Any insight is welcome. Thanks.
Your "hack" will indeed work, but let's take a step back and understand what the idea of it is:
When you are in a kernel thread, (And I am talking about a pure kernel thread (child of kthreadd), not a user thread executing in kernel mode, as would be the case of servicing a syscall), there is no user memory to speak of. This is why current->mm is null: There is no "current" user space memory.
When you assign current->mm to x->mm you are "cheating" by annexing the process memory space of the innocent x to be your own. As a consequence, any allocation you perform will be charged to x, and will be visible by x (it is, after all, part of its memory space). Also, there might be internal kernel checks on current->mm which might be tricked, leading to your kernel mode thread to be treated by the kernel as if it were a user mode thread (though arguably other checks rely on KERNEL_DS/USER_DS, which you're not modifying). Still, a concern. This will break if x ever dies (hey - nobody's immortal), and will likely cause an oops, if not a panic altogether.
You haven't said WHY you need to get user pages - if the case is that you know x is alive and you are doing this as part of, say, IPC/shmem, I can see a reason for that. If that is the case, you might want to provide some API for the process in question to "register" with the kernel thread. Otherwise, your solution works, but is.. well, not as neat as it could be.
I'm not convinced this is totally safe. The _fast part of get_user_pages_fast means that acquiring mm->mmap_sem is not required, and part of the reason that works is because it is assumed that we are running within the process itself (so eg the current->mm can't go away completely). Since you're running in another thread, you're susceptible to races if the real process ever does something that changes its mapping.
I guess the question is why can't you just use get_user_pages instead?

Too many calls to mprotect

I am working on a parallel app (C, pthread). I traced the system calls because at some point I get bad parallel performance. My traces showed that my program calls mprotect() many, many times ... enough to significantly slow down my program.
I do allocate a lot of memory (with malloc()) but there is only a reasonable number of calls to brk() to increase the heap size. So why so many calls to mprotect() ?!
Are you creating and destroying lots of threads?
Most pthread implementations will add a "guard page" when allocating a thread's stack. It's an access-protected memory page used to detect stack overflows. I'd expect at least one call to mprotect each time a thread is created or terminated to (un)protect the guard page. If this is the case, there are several obvious strategies:
Set the guard page size to zero using pthread_attr_setguardsize() before creating threads.
Use a thread-pool (of as many threads as processors say). Once a thread is done with a task, return it to the pool to get a new task rather than terminate and create a new thread.
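The first strategy can be sketched like this (run_thread_without_guard and noop_worker are names made up for the example). Whether it actually removes the mprotect calls depends on the pthread implementation, so it is worth re-checking with strace afterwards.

```c
#include <pthread.h>
#include <stddef.h>

static void *noop_worker(void *arg) {
    return arg;   /* real work would go here */
}

/* Creates and joins one thread whose stack is requested without a guard
   area. Returns 0 on success or a pthread error number. */
static int run_thread_without_guard(void) {
    pthread_attr_t attr;
    pthread_t t;
    int rc;

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_attr_setguardsize(&attr, 0);   /* no guard page */
    if (rc == 0)
        rc = pthread_create(&t, &attr, noop_worker, NULL);
    pthread_attr_destroy(&attr);
    if (rc != 0)
        return rc;

    return pthread_join(t, NULL);
}
```

The trade-off is that a stack overflow in such a thread is no longer caught by a fault on the guard page and can silently corrupt adjacent memory.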
Another explanation might be that you're on a platform where a thread's stack will be grown if overflow is detected. I don't think this is implemented on Linux with GCC/glibc as yet, but there have been some proposals along these lines recently. If you use a lot of stack space whilst processing, you might explicitly increase the initial/minimum stack size using pthread_attr_setstacksize.
Or it might be something else entirely!
If you can, run your program under a debug libc and break on mprotect(). Look at the call stack, see what your code is doing that's leading to the mprotect() calls.
glibc's malloc (ptmalloc2) uses mprotect() internally to manage the heaps of threads other than the main thread (for the main thread, sbrk() is used instead). When a heap area appears to be contended, malloc() first allocates a large chunk of memory with mmap() for the thread, and then uses mprotect() to change the protection bits of the portion it does not yet need so that it is inaccessible. Later, when it needs to grow the heap, it changes the protection to readable/writable with mprotect() again. Those mprotect() calls are for heap growth and shrinkage in multithreaded applications.
http://www.blackhat.com/presentations/bh-usa-07/Ferguson/Whitepaper/bh-usa-07-ferguson-WP.pdf
explains this in a bit more detailed way.
The 'valgrind' suite has a tool called 'callgrind' that will tell you what is calling what. If you run the application under 'callgrind', you can then view the resulting profile data with 'kcachegrind' (it can analyze profiles made by 'cachegrind' or 'callgrind'). Then just double-click on 'mprotect' in the left pane and it will show you what code is calling it and how many times.

Resources