I'm trying to understand how the stack works in Linux. I read the AMD64 ABI sections about the stack and process initialization, and it is not clear how the stack should be mapped. Here is the relevant quote (3.4.1):
Stack State
This section describes the machine state that exec (BA_OS) creates for
new processes.
and
It is unspecified whether the data and stack segments are initially
mapped with execute permissions or not. Applications which need to
execute code on the stack or data segments should take proper
precautions, e.g., by calling mprotect().
So I can deduce from the quotes above that the stack is mapped (it is unspecified if PROT_EXEC is used to create the mapping). Also the mapping is created by exec.
The question is whether the "main thread"'s stack uses MAP_GROWSDOWN | MAP_STACK mapping or maybe even via sbrk?
Looking at pmap -x <pid> the stack is marked with [stack] as
00007ffc04c78000 132 12 12 rw--- [ stack ]
Creating a mapping as
mmap(NULL, 4096,
PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE | MAP_STACK,
-1, 0);
simply creates an anonymous mapping, which pmap -x <pid> shows as
00007fb6e42fa000 4 0 0 rw--- [ anon ]
I can deduce from the quotes above that the stack is mapped
That literally just means that memory is allocated, i.e. that there is a logical mapping from those virtual addresses to physical pages. We know this because you can use a push or call instruction in _start without first making a system call from user space to allocate a stack.
In fact the x86-64 System V ABI specifies that argc, argv, and envp are on the stack at process startup.
The question is whether the "main thread"'s stack uses MAP_GROWSDOWN | MAP_STACK mapping or maybe even via sbrk?
The ELF binary loader sets the _GROWSDOWN flag for the main thread's stack, but not the MAP_STACK flag. This is code inside the kernel, and it does not go through the regular mmap system call interface.
(Nothing in user-space uses mmap(MAP_GROWSDOWN), so normally the main thread's stack is the only mapping that has the VM_GROWSDOWN flag inside the kernel.)
Inside the kernel, the flag used for the virtual memory area (VMA) of the stack is called VM_GROWSDOWN. In case you're interested, here are all the flags that are used for the main thread's stack: VM_GROWSDOWN, VM_READ, VM_WRITE, VM_MAYREAD, VM_MAYWRITE, and VM_MAYEXEC. In addition, if the ELF binary is specified to have an executable stack (e.g., by compiling with gcc -z execstack), the VM_EXEC flag is also used. Note that on architectures that support stacks that grow upwards, VM_GROWSUP is used instead of VM_GROWSDOWN if the kernel was compiled with CONFIG_STACK_GROWSUP defined. The line of code where these flags are specified in the Linux kernel can be found here.
/proc/.../maps and pmap don't use the VM_GROWSDOWN flag; they rely on address comparison instead. Therefore they may not be able to determine the exact range of the virtual address space that the main thread's stack occupies (see an example). On the other hand, /proc/.../smaps looks for the VM_GROWSDOWN flag and marks each memory region that has this flag as gd. (Although it seems to ignore VM_GROWSUP.)
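As a rough illustration of the smaps behaviour described above, here is a minimal C sketch that scans /proc/self/smaps for VmFlags lines containing gd. The helper names line_has_vmflag and count_growsdown_mappings are made up for this example, and it assumes a kernel that emits VmFlags lines in smaps at all (older kernels do not).

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `line` is a "VmFlags:" line that lists `flag` as one of its
 * space-separated codes, 0 otherwise. (Hypothetical helper.) */
int line_has_vmflag(const char *line, const char *flag)
{
    if (strncmp(line, "VmFlags:", 8) != 0)
        return 0;

    size_t flen = strlen(flag);
    const char *p = line + 8;

    while (*p) {
        while (*p == ' ' || *p == '\n')       /* skip separators */
            p++;
        const char *start = p;
        while (*p && *p != ' ' && *p != '\n') /* walk over one token */
            p++;
        if (flen > 0 && (size_t)(p - start) == flen &&
            strncmp(start, flag, flen) == 0)
            return 1;
    }
    return 0;
}

/* Count the mappings of the calling process that smaps marks as "gd".
 * Returns -1 if /proc/self/smaps cannot be opened. (Hypothetical helper.) */
int count_growsdown_mappings(void)
{
    FILE *f = fopen("/proc/self/smaps", "r");
    if (!f)
        return -1;

    char line[512];
    int count = 0;
    while (fgets(line, sizeof line, f))
        count += line_has_vmflag(line, "gd");

    fclose(f);
    return count;
}
```

In an ordinary process I would expect count_growsdown_mappings() to return 1, for the main thread's stack, but that depends on the kernel version and on whether anything else in the process created a grows-down mapping.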
All of these tools/files ignore the MAP_STACK flag. In fact, the whole Linux kernel ignores this flag (which is probably why the program loader doesn't set it.) User-space only passes it for future-proofing in case the kernel does want to start treating thread-stack allocations specially.
sbrk makes no sense here; the stack isn't contiguous with the "break", and the brk heap grows upward toward the stack anyway. Linux puts the stack very near the top of virtual address space. So of course the primary stack couldn't be allocated with (the in-kernel equivalent of) sbrk.
And no, nothing uses MAP_GROWSDOWN, not even secondary thread stacks, because it can't in general be used safely.
The mmap(2) man page which says MAP_GROWSDOWN is "used for stacks" is laughably out of date and misleading. See How to mmap the stack for the clone() system call on linux?. As Ulrich Drepper explained in 2008, code using MAP_GROWSDOWN is typically broken, and proposed removing the flag from Linux mmap and from glibc headers. (This obviously didn't happen, but pthreads hasn't used it since well before then, if ever.)
MAP_GROWSDOWN sets the VM_GROWSDOWN flag for the mapping inside the kernel. The main thread also uses that flag to enable the growth mechanism, so a thread stack may be able to grow the same way the main stack does: arbitrarily far (up to ulimit -s?) if the stack pointer is below the page fault location. (Linux does not require "stack probes" to touch every page of a large multi-page stack array or alloca.)
(Thread stacks are fully allocated up front; only normal lazy allocation of physical pages to back that virtual allocation avoids wasting space for thread stacks.)
A MAP_GROWSDOWN mapping can also grow the way the mmap man page describes: access to the "guard page" below the lowest mapped page will also trigger growth, even if that's below the bottom of the red zone.
But the main thread's stack has magic you don't get with mmap(MAP_GROWSDOWN). It reserves the growth space up to ulimit -s to prevent random choice of mmap address from creating a roadblock to stack growth. That magic is only available to the in-kernel program-loader which maps the main thread's stack during execve(), making it safe from an mmap(NULL, ...) randomly blocking future stack growth.
mmap(MAP_FIXED) could still create a roadblock for the main stack, but if you use MAP_FIXED you're 100% responsible for not breaking anything. (Can an unlimited stack not grow beyond the initial 132KiB if MAP_FIXED is involved?) MAP_FIXED will replace existing mappings and reservations, but anything else will treat the main thread's stack-growth space as reserved. (I think that's true; worth trying with MAP_FIXED_NOREPLACE or just a non-NULL hint address.)
See
How is Stack memory allocated when using 'push' or 'sub' x86 instructions?
Why does this code crash with address randomization on?
pthread_create doesn't use MAP_GROWSDOWN for thread stacks, and neither should anyone else; as a general rule, do not use it. Linux pthreads by default allocates the full size for a thread stack. This costs virtual address space but (until it's actually touched) not physical pages.
The inconsistent results in comments on Why is MAP_GROWSDOWN mapping does not grow? (some people finding it works, some finding it still segfaults when touching the return value and the page below) sound like https://bugs.centos.org/view.php?id=4767 - MAP_GROWSDOWN may even be buggy outside of the way the standard main-stack VM_GROWSDOWN mapping is used.
Related
I've read the linux manual about sbrk() thoroughly:
sbrk() changes the location of the program break, which defines the end
of the process's data segment (i.e., the program break is the first
location after the end of the uninitialized data segment).
And I do know that user space memory's organization is like the following:
The problem is:
When I call sbrk(1), why does it say I am increasing the size of heap? As the manual says, I am changing the end position of "data segment & bss". So, what increases should be the size of data segment & bss, right?
The data and bss segments are a fixed size. The space allocated to the process after the end of those segments is therefore not a part of those segments; it is merely contiguous with them. And that space is called the heap space and is used for dynamic memory allocation.
If you want to regard it as 'extending the data/bss segment', that's fine too. It won't make any difference to the behaviour of the program, or the space that's allocated, or anything.
The manual page on Mac OS X indicates you really shouldn't be using them very much:
The brk and sbrk functions are historical curiosities left over from earlier days before the advent of virtual memory management. The brk() function sets the break or lowest address of a process's data segment (uninitialized data) to addr (immediately above bss). Data addressing is restricted between addr and the lowest stack pointer to the stack segment. Memory is allocated by brk in page size pieces; if addr is not evenly divisible by the system page size, it is increased to the next page boundary.
The current value of the program break is reliably returned by sbrk(0) (see also end(3)). The getrlimit(2) system call may be used to determine the maximum permissible size of the data segment; it will not be possible to set the break beyond the rlim_max value returned from a call to getrlimit, e.g. etext + rlp->rlim_max (see end(3) for the definition of etext).
It is mildly exasperating that I can't find a manual page for end(3), despite the pointers to look at it. Even this (slightly old) manual page for sbrk() does not have a link for it.
Notice that today sbrk(2) is rarely used. Most malloc implementations use mmap(2), at least for large allocations, to acquire a memory segment (and munmap to release it). Quite often, free simply marks a memory zone as reusable by some future malloc (and does not release any memory to the Linux kernel).
(So in practice, the heap of a modern Linux process is made of several segments and is more subtle than your picture; and multi-threaded processes have one stack per thread.)
Use proc(5), notably /proc/self/maps and /proc/$pid/maps, to understand the virtual address space of some process. Try first to understand the output of cat /proc/self/maps (showing the address space of that cat command) and of cat /proc/$$/maps (showing the address space of your shell). Try also to look at the maps pseudo-file for your web browser (e.g. cat /proc/$(pidof firefox)/maps or cat /proc/$(pidof iceweasel)/maps etc...); I have more than a thousand lines (so process segments) in it.
Use strace(1) to understand the system calls done by a given command or process.
Take advantage that on Linux most (and probably all) C standard library implementations are free software, so you can study their source code. The source code of musl-libc is quite easy to read.
Read also about ELF, ASLR, dynamic linking & ld-linux(8), and the Advanced Linux Programming book then syscalls(2)
For a single-threaded program, I want to check whether or not a given virtual address is in the process's stack. I want to do that inside the process, which is written in C.
I am thinking of reading /proc/self/maps to find the line labelled [stack] to get start and end address for my process's stack. Thinking about this solution led me to the following questions:
/proc/self/maps shows a stack of 132 KiB for my particular process, and the maximum size for the stack (ulimit -s) is 8 MiB on my system. How does Linux know that a given page fault, occurring because we have gone past the currently mapped stack, belongs to the stack (and that the stack must be made larger) rather than being an access to some other memory area of the process?
Does Linux shrink back the stack ? In other words, when returning from deep function calls for example, does the OS reduce the virtual memory area corresponding to the stack ?
How much virtual space is initially allocated for the stack by the OS ?
Is my solution correct and is there any other cleaner way to do that ?
Lots of the stack setup details depend on which architecture you're running on, executable format, and various kernel configuration options (stack pointer randomization, 4GB address space for i386, etc).
At the time the process is exec'd, the kernel picks a default stack top (for example, on the traditional i386 arch it's 0xc0000000, i.e. the end of the user-mode area of the virtual address space).
The type of executable format (ELF vs a.out, etc) can in theory change the initial stack top. Any additional stack randomization and any other fixups are then done (for example, the vdso [system call springboard] area generally is put here, when used). Now you have an actual initial top of stack.
The kernel now allocates whatever space is needed to construct argument and environment vectors and so forth for the process, initializes the stack pointer, creates initial register values, and initiates the process. I believe this provides the answer for (3): i.e. the kernel allocates only enough space to contain the argument and environment vectors, other pages are allocated on demand.
Other answers, as best as I can tell:
(1) When a process attempts to store data in the area below the current bottom of the stack region, a page fault is generated. The kernel fault handler determines where the next populated virtual memory region within the process' virtual address space begins. It then looks at what type of area that is. If it's a "grows down" area (at least on x86, all stack regions should be marked grows-down), and if the process' stack pointer (ESP/RSP) value at the time of the fault is less than the bottom of that region and if the process hasn't exceeded the ulimit -s setting, and the new size of the region wouldn't collide with another region, then it's assumed to be a valid attempt to grow the stack and additional pages are allocated to satisfy the process.
(2) Not 100% sure, but I don't think there's any attempt to shrink stack areas. Presumably normal LRU page sweeping would be performed making now-unused areas candidates for paging out to the swap area if they're really not being re-used.
(4) Your plan seems reasonable to me: the /proc/NN/maps should get start and end addresses for the stack region as a whole. This would be the largest your stack has ever been, I think. The current actual working stack area OTOH should reside between your current stack pointer and the end of the region (ordinarily nothing should be using the area of the stack below the stack pointer).
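The plan discussed in (4) could look roughly like the following C sketch. The function names main_stack_range and in_main_stack are invented for this example, and it assumes the kernel labels the main thread's stack as [stack] in /proc/self/maps. The range it reports is the currently mapped stack region, so an address the stack has not yet grown into will be reported as outside it.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Look up the line labelled [stack] in /proc/self/maps and store its
 * address range as [*start, *end). Returns 0 on success, -1 if the file
 * cannot be read or no such line exists. (Hypothetical helper.) */
int main_stack_range(uintptr_t *start, uintptr_t *end)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    char line[512];
    int result = -1;
    while (fgets(line, sizeof line, f)) {
        unsigned long lo, hi;
        /* Each line starts with "<start>-<end>" in hexadecimal. */
        if (strstr(line, "[stack]") &&
            sscanf(line, "%lx-%lx", &lo, &hi) == 2) {
            *start = (uintptr_t)lo;
            *end = (uintptr_t)hi;
            result = 0;
            break;
        }
    }
    fclose(f);
    return result;
}

/* Returns 1 if p lies inside the mapped main-thread stack, 0 if it does
 * not, and -1 if the stack range could not be determined. */
int in_main_stack(const void *p)
{
    uintptr_t lo, hi;
    if (main_stack_range(&lo, &hi) != 0)
        return -1;

    uintptr_t addr = (uintptr_t)p;
    return addr >= lo && addr < hi;
}
```

Called from the main thread, in_main_stack(&some_local) should normally return 1 and in_main_stack(&some_global) should return 0. Re-reading the file on every call is slow but keeps the answer current if the stack has grown in the meantime.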
My answer is for Linux on x64 with kernel 3.12.23 only. It might or might not apply to other versions or architectures.
(1)+(2) I'm not sure here, but I believe it is as Gil Hamilton said before.
(3) You can see the amount in /proc/pid/maps (or /proc/self/maps if you target the calling process). However, not all of that is actually usable as stack for your application. The argument (argv[]) and environment (__environ[]) vectors usually consume quite a bit of space at the bottom (highest address) of that area.
To actually find the area the kernel designated as "stack" for your application, you can have a look at /proc/self/stat. Its values are documented here. As you can see, there is a field for "startstack". Together with the size of the mapped area, you can compute the current amount of stack reserved. Along with "kstkesp", you could determine the amount of free stack space or actually used stack space (keep in mind that any operation done by your thread most likely will change those values).
Also note that this works only for the process's main thread! Other threads won't get a labelled "[stack]" mapping; they either use anonymous mappings or might even end up on the heap. (Use the pthreads API to find those values, or remember the stack start in the thread's main function.)
(4) As explained in (3), your solution is mostly OK, but not entirely accurate.
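To make the /proc/self/stat suggestion from (3) a bit more concrete, here is a small C sketch that extracts the startstack value. It assumes the field numbering from proc(5), where startstack is field 28; the helper names stat_field and self_startstack are invented for this example. The parser only handles the unsigned numeric fields from field 4 onward.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Extract field number `field` (1-based, numbered as in proc(5)) from a
 * /proc/<pid>/stat line. Only numeric fields >= 4 are supported.
 * Returns 0 on success, -1 on error. (Hypothetical helper.) */
int stat_field(const char *stat_line, int field, unsigned long long *out)
{
    /* Field 2 (comm) is parenthesised and may itself contain spaces or
     * ')', so start counting after the LAST ')'. */
    const char *p = strrchr(stat_line, ')');
    if (!p || field < 4)
        return -1;
    p++;

    /* The first token after ')' is field 3 (state). */
    for (int cur = 3; ; cur++) {
        while (*p == ' ')
            p++;
        if (*p == '\0' || *p == '\n')
            return -1;                 /* ran out of fields */
        if (cur == field) {
            char *endp;
            *out = strtoull(p, &endp, 10);
            return endp == p ? -1 : 0;
        }
        while (*p && *p != ' ' && *p != '\n')
            p++;                       /* skip this token */
    }
}

/* Read startstack (field 28) for the calling process. */
int self_startstack(unsigned long long *out)
{
    char buf[1024];
    FILE *f = fopen("/proc/self/stat", "r");
    if (!f)
        return -1;
    char *ok = fgets(buf, sizeof buf, f);
    fclose(f);
    return ok ? stat_field(buf, 28, out) : -1;
}
```

The same stat_field helper could be pointed at field 29 (kstkesp), but as noted above that value changes constantly, and depending on the kernel version it may simply be reported as 0.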
int brk(void *end_data_segment);
void *sbrk(intptr_t increment);
Calling sbrk() with an increment of 0 can be used to find the current location of the program break.
What is the program break? Where does it start from, 0x00?
Oversimplifying:
A process has several segments of memory:
Code (text) segment, which contains the code to be executed.
Data segment, which contains data the compiler knows about (globals and statics).
Stack segment, which contains (drumroll...) the stack.
(Of course, nowadays it's much more complex. There is a rodata segment, an uninitialized data segment, mappings allocated via mmap, a vdso, ...)
One traditional way a program can request more memory in a Unix-like OS is to increment the size of the data segment, and use a memory allocator (i.e. malloc() implementation) to manage the resulting space. This is done via the brk() system call, which changes the point where the data segment "breaks"/ends.
The program break is the end of the process's data segment. AKA...
the program break is the first
location after the end of the
uninitialized data segment
As to where it starts from, it's system dependent but probably not 0x00.
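To see where the break actually is on a given system, a sketch along these lines may help. It uses sbrk(0) to read the break before and after moving it. The function name grow_break is invented for this example, and the code assumes nothing else in the process (another thread, or malloc) moves the break in between the calls.

```c
#define _DEFAULT_SOURCE   /* for the sbrk() declaration on glibc */
#include <stdint.h>
#include <unistd.h>

/* Move the program break up by `bytes` and return how far it actually
 * moved, or -1 if sbrk() failed. (Hypothetical helper.) */
intptr_t grow_break(intptr_t bytes)
{
    char *before = sbrk(0);            /* current break, no change */

    if (sbrk(bytes) == (void *)-1)     /* ask the kernel to move it */
        return -1;

    char *after = sbrk(0);             /* break after the move */
    return (intptr_t)(after - before);
}
```

Printing the pointer returned by sbrk(0) with %p shows that the break sits just above the data/bss area (possibly with a random offset when ASLR is on), far away from address 0.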
These days, sbrk(2) (and brk) are nearly obsolete system calls (and you can nearly forget about them and ignore the old notion of break; focus on understanding mmap(2)). Notice that the sbrk(2) man page says in its NOTES :
Avoid using brk() and sbrk(): the malloc(3) memory allocation package
is the portable and comfortable way of allocating memory.
(emphasis mine)
Most implementations of malloc(3) (notably the one in musl-libc) rather use mmap(2) to request memory from the kernel, which increases their virtual address space (look at that virtual address space wikipage, it has a nice picture). Some malloc-s use sbrk for small allocations and mmap for large ones.
Use strace(1) to find out the system calls (listed in syscalls(2)) done by some given process or command. BTW you'll then find that bash and ls (and probably many other programs) don't make a single call to sbrk.
Explore the virtual address space of some process by using proc(5). Try cat /proc/$$/maps and cat /proc/self/maps and even cat /proc/$$/smaps and read a bit to understand the output.
Be aware of ASLR & vdso(7).
And sbrk is not very thread friendly.
(my answer focuses on Linux)
You are saying that sbrk() is an obsolete system call and that we should use malloc(), but according to its documentation, malloc() itself uses sbrk() when allocating less than 128 KiB (32 pages of 4 KiB, the default threshold in glibc). So we shouldn't call sbrk() directly, but malloc() does; for larger requests malloc() uses mmap(), which maps private pages into the process's address space.
Finally, it's a good idea to understand sbrk(), at least for understanding the "program break" concept.
Based on the following widely used diagram:
program break, which is also known as brk in many articles, points to the address of heap segment's end.
When you call malloc, it may move the program break if the allocator needs more heap space from the kernel.
I want to know the full detail of the address space layout of a multithreaded Linux Process for both 64 bit and 32 bit. Link to any article that describes it will be appreciated. And note that I need to know full details, not just an overview, because I will be directly dealing with it. So I need to know for example, where are the thread stacks located, the heap, thread private data etc...
Thread stacks are allocated with mmap at thread start (or even before: you can set the stack space in pthread_attr_t). TLS data is stored at the beginning of the thread's stack. The size of a thread's stack is fixed, typically from 2 to 8 MB, and cannot be changed while the thread is alive. (The first thread, the one running main, still uses the main stack near the end of the address space, and this stack may grow and shrink.) Heap and code are shared between all threads. Mutexes can be anywhere in the data section; a mutex is just a struct.
The mmap of thread's stack is not fixed at any address:
Glibc sources
mem = mmap (NULL, size, prot,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
PS: modern GCC allows thread stacks to be effectively unlimited with its split stacks feature (-fsplit-stack).
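For anyone who wants to experiment with the same kind of mapping outside glibc, here is a minimal sketch modelled on the mmap call quoted above. The helper names alloc_thread_stack and free_thread_stack are made up, the protection is hard-coded to read/write (glibc passes its own prot value), and there is no guard page, which a real thread library would add.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS / MAP_STACK on glibc */
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_STACK
#define MAP_STACK 0       /* only a hint; fall back to 0 if the header lacks it */
#endif

/* Reserve `size` bytes of private anonymous memory to be used as a
 * fixed-size thread stack. The kernel picks the address (first argument
 * is NULL). Returns NULL on failure. (Hypothetical helper.) */
void *alloc_thread_stack(size_t size)
{
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    return mem == MAP_FAILED ? NULL : mem;
}

/* Release a stack obtained from alloc_thread_stack().
 * Returns 0 on success, -1 on error. */
int free_thread_stack(void *base, size_t size)
{
    return munmap(base, size);
}
```

On architectures where the stack grows down, the initial stack pointer for the new thread would be base + size, not base. Physical pages are only committed when the memory is first touched, so a large size mostly costs address space.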
I know this is an OS function. But is there a way of increasing your overall stack or heap size through C? sbrk() is used to change the size of data segment right? Does this imply both stack and heap?
You mentioned sbrk(), so I'm going to assume you mean Unix/Linux. sbrk() will change the size of the data segment, but usually the stack is in a different memory space than the data segment, to keep people from overwriting the stack and doing evil things. Typically you'll set your stack size before you start the program running by using ulimit from your shell.
Note: sbrk() is deprecated in favor of malloc().
The Open Unix specification defines (and Linux implements) the getrlimit() and setrlimit() functions, which also allow you to play with system limits.
Virtual memory OSes (when using a CPU with MMU) automatically grow the data/stack segment when needed, up to a maximum. On POSIX systems the maximums can be configured using setrlimit(), as W. Craig Trader said. POSIX defines RLIMIT_DATA, RLIMIT_STACK and RLIMIT_AS for the limits.
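As a small illustration of the getrlimit()/setrlimit() calls mentioned above, the following sketch reads the stack limit and changes the soft limit. The function names get_stack_limit and set_soft_stack_limit are invented for this example. An unprivileged process can raise its soft limit only up to the hard limit.

```c
#include <sys/resource.h>

/* Fetch the soft and hard RLIMIT_STACK values in bytes. Either may be
 * RLIM_INFINITY. Returns 0 on success, -1 on error. (Hypothetical helper.) */
int get_stack_limit(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;
    *hard = rl.rlim_max;
    return 0;
}

/* Set the soft stack limit to `bytes`, leaving the hard limit untouched.
 * Refuses values above the hard limit. Returns 0 on success, -1 on error. */
int set_soft_stack_limit(rlim_t bytes)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    if (rl.rlim_max != RLIM_INFINITY && bytes > rl.rlim_max)
        return -1;
    rl.rlim_cur = bytes;
    return setrlimit(RLIMIT_STACK, &rl);
}
```

The soft value reported here is the same number that ulimit -s shows in the shell, except that ulimit reports it in KiB rather than bytes.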
malloc() internally uses brk() to grow/shrink the data segment, or mmap()/munmap(), to request/release memory mappings. The stack is grown when the CPU tries to access memory below the allocated stack.
On systems with no MMU (e.g. uClinux), the executable file format usually has a field for the stack size (take a look at the BFLT file format, for instance).