The man page for mlockall on my kernel 3.0 says
mlockall() locks all pages mapped into the address space of the
calling process. This includes the pages of the code, data and stack
segment, as well as shared libraries, user space kernel data, shared
memory, and memory-mapped files. All mapped pages are guaranteed
to be resident in RAM when the call returns successfully; the pages
are guaranteed to stay in RAM until later unlocked.
and later says
Real-time processes that are using mlockall() to prevent delays on
page faults should reserve enough locked stack pages before entering
the time-critical section, so that no page fault can be caused by
function
calls. This can be achieved by calling a function that allocates a sufficiently large automatic variable (an array) and
writes to the memory occupied by this array in order to touch these
stack pages. This way,
enough pages will be mapped for the stack and can be locked into RAM. The dummy writes ensure that not even copy-on-write page
faults can occur in the critical section.
I understand that this system call can't guess the maximum stack size that will be reached and thus is unable to lock pages for the whole future stack. But why does the first part of the man page quoted above say that locking is also done for the stack? Is there an error in this man page, or does it just mean that the locking is done for the initial stack size?
Yes, locking is done for the current stack pages, but not for all possible future stack pages.
It's explained by that first sentence:
mlockall() locks all pages mapped into the address space of the calling process.
So if a page is mapped, it will be locked. If not, it won't.
It just mentions the stack in the original sentence because the stack memory is mapped separately from the heap memory. There's no special treatment for the stack: if it's mapped it'll be locked, otherwise it won't be. So, as the second section you quote says, it's important to grow the stack to the maximum size it will reach while your code is running before you call mlockall.
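A minimal sketch of that technique, assuming a hypothetical worst-case stack bound MAX_STACK_USE and a 4 KiB page size: grow and touch the stack first, then lock everything with mlockall().

#include <stddef.h>
#include <sys/mman.h>

#define MAX_STACK_USE (512 * 1024)     /* assumed worst-case stack usage */

static void prefault_stack(void)
{
    volatile unsigned char dummy[MAX_STACK_USE];

    /* Write to every page so the stack mapping actually grows and the
     * copy-on-write faults happen now, not in the time-critical section. */
    for (size_t i = 0; i < sizeof(dummy); i += 4096)
        dummy[i] = 1;
}

int main(void)
{
    prefault_stack();                    /* map the stack pages first ...  */
    mlockall(MCL_CURRENT | MCL_FUTURE);  /* ... then lock them into RAM    */

    /* time-critical code would run here without stack page faults */
    return 0;
}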
Actually, from a quick reading of the mm/mlock.c source code, I'd say it simply locks everything: all currently mapped pages.
static int do_mlockall(int flags)
{
    struct vm_area_struct * vma, * prev = NULL;
    unsigned int def_flags = 0;

    if (flags & MCL_FUTURE)
        def_flags = VM_LOCKED;
    current->mm->def_flags = def_flags;
    if (flags == MCL_FUTURE)
        goto out;

    for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
        vm_flags_t newflags;

        newflags = vma->vm_flags | VM_LOCKED;
        if (!(flags & MCL_CURRENT))
            newflags &= ~VM_LOCKED;

        /* Ignore errors */
        mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
    }
out:
    return 0;
}
Despite what larsmans said, I do think it also applies to all future pages if MCL_FUTURE is also specified: in that case current->mm->def_flags is updated to include VM_LOCKED.
In Linux, I could use mmap with the MAP_GROWSDOWN flag to allocate memory for an automatically-growing stack. To quote the manpage,
MAP_GROWSDOWN
This flag is used for stacks. It indicates to the kernel
virtual memory system that the mapping should extend
downward in memory. The return address is one page lower
than the memory area that is actually created in the
process's virtual address space. Touching an address in
the "guard" page below the mapping will cause the mapping
to grow by a page. This growth can be repeated until the
mapping grows to within a page of the high end of the next
lower mapping, at which point touching the "guard" page
will result in a SIGSEGV signal.
Is there some equivalent technique in Windows? Even something ugly like asking the OS notify you about page faults so you can allocate a new page underneath (and make it look contiguous by asking the OS to fiddle around with page tables)?
With VirtualAlloc you can reserve a block of memory, commit the top two pages, and set the lower of the two pages with PAGE_GUARD. When the stack grows down and accesses the guard page, a structured exception is thrown that you can handle to commit the next page down and set PAGE_GUARD on it.
The above is similar to how the stack is handled in Windows processes. Inspecting a thread stack with Sysinternals VMMAP.EXE shows, for example, a 256KB stack with 32K committed, 12K of guard pages, and 212K of reserved stack remaining.
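A minimal sketch of that approach, assuming a 256 KB reservation and using a vectored exception handler rather than __try/__except; names, sizes and the missing error handling are illustrative, not production code:

#include <windows.h>
#include <stdio.h>

#define RESERVE_SIZE (256 * 1024)      /* total reserved region */

static char  *g_base;                  /* lowest address of the reservation */
static char  *g_guard;                 /* current guard page */
static SIZE_T g_page;

static LONG CALLBACK GrowStack(EXCEPTION_POINTERS *info)
{
    if (info->ExceptionRecord->ExceptionCode == EXCEPTION_GUARD_PAGE) {
        /* The guard page was touched; it is now an ordinary committed page.
         * Commit the next page down and mark it as the new guard page. */
        g_guard -= g_page;
        if (g_guard >= g_base) {
            VirtualAlloc(g_guard, g_page, MEM_COMMIT, PAGE_READWRITE | PAGE_GUARD);
            return EXCEPTION_CONTINUE_EXECUTION;
        }
    }
    return EXCEPTION_CONTINUE_SEARCH;
}

int main(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    g_page = si.dwPageSize;

    /* Reserve the whole region, commit the top page, and place a guard
     * page immediately below it. */
    g_base = VirtualAlloc(NULL, RESERVE_SIZE, MEM_RESERVE, PAGE_NOACCESS);
    char *top = g_base + RESERVE_SIZE - g_page;
    g_guard = top - g_page;
    VirtualAlloc(top,     g_page, MEM_COMMIT, PAGE_READWRITE);
    VirtualAlloc(g_guard, g_page, MEM_COMMIT, PAGE_READWRITE | PAGE_GUARD);

    AddVectoredExceptionHandler(1, GrowStack);

    /* Walk downward through the region; each touch of the guard page
     * triggers GrowStack, which commits one more page below. */
    for (char *p = top; p >= g_base + g_page; p -= g_page)
        *p = 1;

    printf("grew the region down to %p\n", (void *)(g_base + g_page));
    return 0;
}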
I want to lock the memory to physical RAM in C with mlock and munlock, but I'm unsure about the correct usage.
Allow me to explain in a step by step scenario:
Let's assume that I dynamically allocate a pointer using calloc:
char * data = (char *)calloc(12, sizeof(char*));
Should I do mlock right after that?
Let's also assume that I later attempt to resize the memory block with realloc:
(char *)realloc(data, 100 * sizeof(char*));
Note that the increase amount above (100) is arbitrary, and sometimes I will decrease the memory block instead.
Should I first do munlock and then mlock again to address the changes made?
Also when I want to free the pointer data later, should I munlock first?
I hope someone can please explain the correct steps to me so I can understand better.
From the POSIX specification of mlock() and munlock():
The mlock() function shall cause those whole pages containing any part
of the address space of the process starting at address addr and
continuing for len bytes to be memory-resident until unlocked or until
the process exits or execs another process image. The implementation
may require that addr be a multiple of {PAGESIZE}.
The munlock() function shall unlock those whole pages containing any
part of the address space of the process starting at address addr and
continuing for len bytes, regardless of how many times mlock() has
been called by the process for any of the pages in the specified
range. The implementation may require that addr be a multiple of
{PAGESIZE}.
Note that:
Both functions work on whole pages
Both functions might require addr to be a multiple of page size
munlock doesn't use any reference counting to track lock lifetime
This makes it almost impossible to use them with pointers returned by malloc/calloc/realloc:
You might accidentally lock/unlock nearby pages (and so unlock pages that must stay memory-resident)
The allocator might return pointers that are not suitably aligned for those functions
You should use mmap or another OS-specific mechanism instead. For example, Linux has mremap, which allows you to "reallocate" memory. Whatever you use, make sure mlock behavior is well-defined for it. From the Linux man pages:
If the memory segment specified by old_address and old_size is locked
(using mlock(2) or similar), then this lock is maintained when the
segment is resized and/or relocated. As a consequence, the amount of
memory locked by the process may change.
Note Nate Eldredge's comment below:
Another problem with using realloc with locked memory is that the data
will be copied to the new location before you have a chance to find
out where it is and lock it. If your purpose in using mlock is to
ensure that sensitive data never gets written out to swap, this
creates a window of time where that might happen.
TL;DR
Memory locking doesn't mix with general-purpose memory allocation using the C language runtime.
Memory locking does mix with page-oriented virtual memory mapping OS-level APIs.
The above hold unless special circumstances arise (that's my way out of this :)
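A minimal sketch of the page-oriented approach, assuming Linux (error handling abbreviated): allocate with mmap so that mlock/munlock operate on whole pages that belong only to this allocation.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);   /* one page; mmap returns page-aligned memory */

    char *data = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Lock the pages of this mapping (and only this mapping) into RAM. */
    if (mlock(data, len) != 0) { perror("mlock"); return 1; }

    strcpy(data, "sensitive");
    /* ... use the locked buffer ... */

    /* Unlock before unmapping; munmap drops the lock anyway, but being
     * explicit keeps the lifetime obvious. */
    munlock(data, len);
    munmap(data, len);
    return 0;
}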
WARNING: This is long but I hope it can be useful for people like me in the future.
I think I know what the program counter is, how lazy memory allocation works, what the MMU does, how a virtual memory address is mapped to a physical address, and the purpose of the L1 and L2 caches. What I really have trouble with is how they all fit together at a high level when we run a piece of C code.
Suppose I have this C code:
#include <stdio.h>
#include <stdlib.h>
int main()
{
int* ptr;
int n = 1000000, i = 0;
// Dynamically allocate memory using malloc()
ptr = (int*)malloc(n * sizeof(int));
ptr[0] = 99;
i += 100;
printf("%d\n", ptr[0]);
free(ptr);
return 0;
}
So here is my attempt to put everything together:
After execve() is called, part of the executable is loaded into memory, e.g. the text and data segments, but most of the code is not; it is loaded on demand (demand paging).
The address of the first instruction is in the process table's program counter (PC) field as well as physically in the PC register, ready to be used.
As the CPU executes instructions, the PC is updated (usually advancing to the next instruction, but a jump can move it to a different address).
Enter the main function: ptr, n, and i are on the stack.
Next, when we call malloc, the C library will ask the OS (I think via sbrk() sys call, or is it mmap()?) to allocate some memory on the heap.
malloc succeeds in this case, returning a virtual memory address (VMA), but the physical memory may not have been allocated yet. The page table doesn't contain a mapping for that VMA, so when the CPU tries to access it, a page fault will be generated.
In our case, when we do ptr[0] = 99, CPU raises a page fault. I am not sure if the entire array is allocated or just the first page (4k size) though.
But now I don't know how to put cache access into the picture. How does i get into the L1 cache? How does that relate to its virtual address?
Sorry if this is confusing. I just hope someone could help walk me through the entire process...
Before the program runs, the operating system and the C runtime setup the necessary values in the CPU registers.
As you've already noted, the intended PC value is set by the operating system (e.g. by the loader) and then the CPU's PC (aka IP) register is set, probably with a "return from interrupt" instruction that both switches to user mode (activating the virtual memory map for that process) along with loading the CPU with the proper PC value (a virtual address).
In addition, the SP register is set somehow: in some systems this will be done similar to the PC during the "return from interrupt", but in other (older) systems the user code sets the SP to a prearranged location. In either case the SP also holds a virtual memory address.
Usually the first instruction that runs in the user process is in a routine traditionally called _start, in a library called crt0 (C RunTime 0, aka startup). _start is usually written in assembly and handles the transition from the operating system to user mode. As needed, _start will establish anything else necessary for C code to be called, and then call main. If main returns to _start, it will do an exit syscall.
The CPU caches (and probably TLBs) will be cold when _start's first instruction gets control. All addresses in user mode are virtual memory addresses that designate memory within the (virtual) address space of the process. The processor is running in user mode. Probably the operating system has preloaded the page holding _start (or a least the start of _start). So when the processor performs an instruction fetch from _start, it will probably TLB miss, but not page fault, and then cache miss.
The TLB is a set of registers forming a cache in the CPU that supports virtual-to-physical address translations/mappings. The TLB, when it misses, will be loaded from a structure in the virtual memory mapping for the process, such as the page tables. Since that first page is preloaded, the attempt to map will succeed, and the TLB will then be filled with the proper mapping from the virtual PC page to the physical page. However, the L1/L2, etc. caches are also cold, so the access then causes a cache miss. The memory system will satisfy the cache miss by filling a cache line at each level. Finally an instruction word or group of words is provided to the processor, and it begins executing instructions.
If a virtual address for code (by way of the PC) or data (by some dereference) is not present in the TLB, then the processor will consult the page tables, and a miss there can cause a recoverable or non-recoverable page fault. Recoverable page faults are virtual to physical mappings that are not present in the page tables, because the data is on disc and operating system intervention is required; whereas non-recoverable faults are accesses to virtual memory that are in error, i.e. not allowed as they refer to virtual memory that has not been allocated/authorized by the operating system.
Variable i is known to main as a stack-relative location. So, when main wants to write to i it will write to memory at an offset from SP, e.g. SP+8 (i could also be a register variable, but I digress). Since the SP is a pointer holding a virtual memory address, i then has a virtual address. That virtual address goes through the above-described steps: TLB mapping from virtual page to physical page, possible page faulting, and then possible cache miss. Subsequent accesses will yield TLB hits and cache hits, so as to run at full speed. (The operating system will probably also preload some but not all stack pages before running the process.)
A malloc operation will use some system calls that ultimately cause additional virtual memory to be added to the process. (Though as you also note, malloc gets more than enough for the current request, so the system calls are not done on every malloc.) malloc will return a virtual memory address, i.e. a pointer in the user-mode virtual address space. For memory just obtained by a system call, the TLB and caches are also probably cold, and it is possible that the page is not even present yet. In the latter case, a recoverable page fault will happen and the OS will allocate a physical page to use. If the OS is smart it will know that this is a new data page, and so can fill it with zeros instead of loading it from the paging file. Then it will set up the page table entries for the proper mapping and resume the user process, which will probably then TLB miss, fill a TLB entry from the page tables, then cache miss, and fill cache lines from the physical page.
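A small experiment to observe those recoverable faults, assuming Linux/glibc (getrusage reports the minor-fault count; exact numbers will vary):

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    size_t n = 16 * 1024 * 1024;          /* 16 MiB */
    char *p = malloc(n);
    if (p == NULL)
        return 1;

    long before = minor_faults();
    for (size_t i = 0; i < n; i += 4096)  /* touch each page once */
        p[i] = 1;
    long after = minor_faults();

    printf("minor faults while touching the allocation: %ld\n", after - before);
    free(p);
    return 0;
}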
On Mac, my use of munmap results in higher page reclaims.
The return value of my munmap is 0, which indicates that the requested pages were successfully unmapped.
Why do I see higher page reclaims when I test programs using memory I have mapped and unmapped in this way?
Is there a way to debug munmap and check whether my calls to that function are actually doing anything to the mapped memory passed to it?
I used "/usr/bin/time -l" to see the number of page reclaims I get from running my program. Whenever I use munmap, my page reclaims get higher than when I don't.
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int i = 0;
    char *addr;

    while (i < 1024)
    {
        addr = mmap(0, getpagesize(), PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        addr[0] = 23;
        if (!munmap(addr, getpagesize()))
            printf("Success\n");
        i++;
    }
    return (0);
}
when I call munmap:
I pass it the same pointer it gave me.
I check the return value and check if it is 0 <-- this is what I get most of the time.
I made a test program where I call mmap 1024 times and munmap that number of times too.
When I don't call munmap, the number of page reclaims is around 1478, and the value is the same when I do call munmap.
How can I check if my use of that memory is correct?
The important thing to remember about mmap is that MAP_ANONYMOUS memory must be zeroed. So what usually happens is that the kernel initially maps a page frame containing only zeroes, and only when a write hits the page is a read-write zeroed page mapped in its place.
However, this is also why the kernel cannot reuse the originally mapped page right away: it does not know that only the first byte of the page is dirty, so it must zero all 4 KiB of that page before it can be given back to the process in a new anonymous mapping. Hence in both cases at least 1024 page faults occur.
If the memory did not need to be zeroed, things would be cheaper: Linux, for example, has an extra flag called MAP_UNINITIALIZED that tells the kernel the pages need not be zeroed, but it is only available on embedded devices:
MAP_UNINITIALIZED (since Linux 2.6.33)
Don't clear anonymous pages. This flag is intended to improve
performance on embedded devices. This flag is honored only if
the kernel was configured with the
CONFIG_MMAP_ALLOW_UNINITIALIZED
option. Because of the security implications, that option
is normally enabled only on embedded devices (i.e., devices
where one has complete control of the contents of user memory).
I guess the reason for its non-availability in generic Linux kernels is that the kernel does not keep track of which process previously had the page frame mapped, so the page could leak information from a sensitive process.
Zeroing the page yourself with bzero before unmapping would not help performance either: the kernel has no way to know the page was zeroed (no architecture supports tracking that in hardware), and it is cheaper to write zeroes over the page than to check whether it is already all zeroes and then, in the overwhelming majority of cases, write zeroes over it anyway.
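If the goal is just to avoid those repeated zero-fill faults, a sketch of the obvious alternative (same flags as the test program above): map the page once and reuse it, so only the first write faults.

#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t page = (size_t)getpagesize();
    char *addr = mmap(0, page, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (addr == MAP_FAILED)
        return 1;

    for (int i = 0; i < 1024; i++)
        addr[0] = 23;                 /* the zero-fill fault happens only once */

    munmap(addr, page);
    return 0;
}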
Having a static array in a C program:
#define MAXN (1<<13)
void f() {
    static int X[MAXN];
    //...
}
Can the Linux kernel choose not to map the addresses to physical memory until each page is actually used? How can X be full of 0s then? Is the memory zeroed when each page is accessed? How does that not impact the performance of the program?
Can the Linux kernel choose not to map the addresses to physical memory until each page is actually used?
Yes, it does this for all memory (except special memory used by drivers and the kernel itself).
How can X be full of 0s then? Is the memory zeroed when each page is accessed?
You're supposed to ignore this detail. As long as the memory is full of zeroes when you access it, we say it's full of zeroes.
How does that not impact the performance of the program?
It does.
Can the Linux kernel choose not to map the addresses to physical memory until each page is actually used?
Yes, with userspace memory it is always done.
How can X be full of 0s then? Is the memory zeroed when each page is accessed?
The kernel maintains a page full of 0s, when the user asks for a new page of the static array (static thus full of 0s before first use), the kernel provides the zeroed page, without permissions for the program to write. Writing to the array causes the copy-on-write mechanism to trigger: a page fault occurs, the kernel then allocates a writable page, maps it and resumes the program from the last instruction (the one that couldn't complete because of the page fault). Note that prezeroing optimizations change the implementation details here, but the theory's the same.
How does that not impact the performance of the program?
The program doesn't have to zero a (potentially) large number of pages on start, and the kernel doesn't actually have to have the memory (you can ask for more memory than the system has, as long as you don't use it all). Page faults will be generated during the execution of the program, but they can be minimized; see mmap() and madvise() with MADV_SEQUENTIAL. Remember that the Translation Lookaside Buffer is not infinite; it can only hold so many entries.
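A rough way to observe the zero-page / copy-on-write behavior described above, assuming Linux (fault counts are illustrative and may be lowered by optimizations such as fault-around): count minor faults while first reading and then writing the pages of a large static array.

#include <stdio.h>
#include <sys/resource.h>

#define N (16 * 1024 * 1024)              /* 16 MiB of static (BSS) data */
static char X[N];

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    long f0 = minor_faults();

    volatile char sink = 0;
    for (long i = 0; i < N; i += 4096)    /* reads map the shared zero page */
        sink += X[i];
    long f1 = minor_faults();

    for (long i = 0; i < N; i += 4096)    /* writes trigger copy-on-write */
        X[i] = 1;
    long f2 = minor_faults();

    printf("minor faults during reads:  %ld\n", f1 - f0);
    printf("minor faults during writes: %ld\n", f2 - f1);
    return 0;
}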
Sources: A linux memory FAQ, Introduction to Memory Management in Linux by Alan Ott