I want to know the full details of the address space layout of a multithreaded Linux process, for both 64-bit and 32-bit. A link to any article that describes it would be appreciated. And note that I need full details, not just an overview, because I will be dealing with it directly. So I need to know, for example: where are the thread stacks located, the heap, thread-private data, etc.?
Thread stacks are allocated with mmap at thread start (or even before: you can supply the stack space in the pthread attributes). TLS data is stored at the beginning of the thread's stack. The size of a thread's stack is fixed, typically 2 to 8 MiB, and cannot be changed while the thread is alive. (The first thread, running main, still uses the main stack at the top of the address space, and that stack may grow and shrink.) The heap and code are shared between all threads. Mutexes can be anywhere in the data section; a mutex is just a struct.
The mmap of a thread's stack is not fixed at any address:
Glibc sources
mem = mmap (NULL, size, prot,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
P.S. Modern GCC can make thread stacks effectively unbounded with the split-stack feature (-fsplit-stack).
Related
When I create a new thread and, inside the thread function, allocate some memory on the heap using malloc, the process memory increases by 64 MB. Before creating the thread I tried to set the stack size to 64 KB using pthread_attr_setstacksize, but there was no impact on process memory. If I create 2 threads, process memory increases by 64*2 = 128 MB.
Example code https://www.ibm.com/docs/en/zos/2.2.0?topic=functions-pthread-attr-setstacksize-set-stacksize-attribute-object
Is there a solution to avoid the extra 64 MB of memory for each thread?
Before creating the thread I tried to set the stack size to 64 KB using pthread_attr_setstacksize
That is the correct way to set the stack size.
but there is no impact on process memory.
It doesn't sound like your memory consumption is coming from the thread stack.
It sounds like your malloc implementation reserves memory in 64MiB chunks (probably to manage thread-local allocation arenas).
Accurately measuring actual memory usage on modern systems is surprisingly non-trivial. If you are looking at virtual size (VSZ in ps output), you are doing it wrong. If you are looking at RSS, that's closer, but still only an approximation.
Is there a solution to avoid the extra 64 MB of memory for each thread?
There might be environment variables or functions you can call to tune your malloc implementation.
But you haven't told us which malloc implementation (or OS) you are using, and without that we can't help you.
Also note that "avoiding extra 64MiB" is likely not a goal you should pursue (see http://xyproblem.info).
I'm trying to understand how the stack works in Linux. I read the AMD64 ABI's sections about the stack and process initialization, and it is not clear how the stack should be mapped. Here is the relevant quote (3.4.1):
Stack State
This section describes the machine state that exec (BA_OS) creates for
new processes.
and
It is unspecified whether the data and stack segments are initially
mapped with execute permissions or not. Applications which need to
execute code on the stack or data segments should take proper
precautions, e.g., by calling mprotect().
So I can deduce from the quotes above that the stack is mapped (it is unspecified if PROT_EXEC is used to create the mapping). Also the mapping is created by exec.
The question is whether the "main thread"'s stack uses a MAP_GROWSDOWN | MAP_STACK mapping, or is maybe even created via sbrk?
Looking at pmap -x <pid>, the stack is marked [stack]:
00007ffc04c78000 132 12 12 rw--- [ stack ]
Creating a mapping as
mmap(NULL, 4096,
PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE | MAP_STACK,
-1, 0);
simply creates an anonymous mapping, which pmap -x <pid> shows as
00007fb6e42fa000 4 0 0 rw--- [ anon ]
I can deduce from the quotes above that the stack is mapped
That literally just means that memory is allocated, i.e. that there is a logical mapping from those virtual addresses to physical pages. We know this because you can use a push or call instruction in _start without making a system call from user space to allocate a stack.
In fact the x86-64 System V ABI specifies that argc, argv, and envp are on the stack at process startup.
The question is whether the "main thread"'s stack uses a MAP_GROWSDOWN | MAP_STACK mapping, or is maybe even created via sbrk?
The ELF binary loader sets the VM_GROWSDOWN flag for the main thread's stack, but not the MAP_STACK flag. This is code inside the kernel, and it does not go through the regular mmap system call interface.
(Nothing in user space uses mmap(MAP_GROWSDOWN), so normally the main thread's stack is the only mapping that has the VM_GROWSDOWN flag inside the kernel.)
The internal name of the flag used for the virtual memory area (VMA) of the stack is VM_GROWSDOWN. In case you're interested, here are all the flags that are used for the main thread's stack: VM_GROWSDOWN, VM_READ, VM_WRITE, VM_MAYREAD, VM_MAYWRITE, and VM_MAYEXEC. In addition, if the ELF binary is specified to have an executable stack (e.g., by compiling with gcc -z execstack), the VM_EXEC flag is also used. Note that on architectures that support stacks that grow upwards, VM_GROWSUP is used instead of VM_GROWSDOWN if the kernel was compiled with CONFIG_STACK_GROWSUP defined. The line of code where these flags are specified in the Linux kernel can be found here.
/proc/.../maps and pmap don't use the VM_GROWSDOWN flag; they rely on address comparison instead. Therefore they may not be able to determine the exact range of virtual address space that the main thread's stack occupies (see an example). On the other hand, /proc/.../smaps looks for the VM_GROWSDOWN flag and marks each memory region that has it as gd. (Although it seems to ignore VM_GROWSUP.)
All of these tools/files ignore the MAP_STACK flag. In fact, the whole Linux kernel ignores this flag (which is probably why the program loader doesn't set it.) User-space only passes it for future-proofing in case the kernel does want to start treating thread-stack allocations specially.
sbrk makes no sense here; the stack isn't contiguous with the "break", and the brk heap grows upward toward the stack anyway. Linux puts the stack very near the top of virtual address space. So of course the primary stack couldn't be allocated with (the in-kernel equivalent of) sbrk.
And no, nothing uses MAP_GROWSDOWN, not even secondary thread stacks, because it can't in general be used safely.
The mmap(2) man page which says MAP_GROWSDOWN is "used for stacks" is laughably out of date and misleading. See How to mmap the stack for the clone() system call on linux?. As Ulrich Drepper explained in 2008, code using MAP_GROWSDOWN is typically broken, and proposed removing the flag from Linux mmap and from glibc headers. (This obviously didn't happen, but pthreads hasn't used it since well before then, if ever.)
MAP_GROWSDOWN sets the VM_GROWSDOWN flag for the mapping inside the kernel. The main thread also uses that flag to enable the growth mechanism, so a thread stack may be able to grow the same way the main stack does: arbitrarily far (up to ulimit -s?) if the stack pointer is below the page fault location. (Linux does not require "stack probes" to touch every page of a large multi-page stack array or alloca.)
(Thread stacks are fully allocated up front; only normal lazy allocation of physical pages to back that virtual allocation avoids wasting space for thread stacks.)
A MAP_GROWSDOWN mapping can also grow the way the mmap man page describes: an access to the "guard page" below the lowest mapped page will also trigger growth, even if that access is below the bottom of the red zone.
But the main thread's stack has magic you don't get with mmap(MAP_GROWSDOWN): the kernel reserves the growth space, up to ulimit -s, to prevent a random choice of mmap address from creating a roadblock to stack growth. That magic is only available to the in-kernel program loader, which maps the main thread's stack during execve(), making it safe from an mmap(NULL, ...) randomly blocking future stack growth.
mmap(MAP_FIXED) could still create a roadblock for the main stack, but if you use MAP_FIXED you're 100% responsible for not breaking anything. (Unlimited stack cannot grow beyond the initial 132KiB if MAP_FIXED involved?) MAP_FIXED will replace existing mappings and reservations, but anything else will treat the main thread's stack-growth space as reserved. (I think that's true; worth trying with MAP_FIXED_NOREPLACE or just a non-NULL hint address.)
See
How is Stack memory allocated when using 'push' or 'sub' x86 instructions?
Why does this code crash with address randomization on?
pthread_create doesn't use MAP_GROWSDOWN for thread stacks, and neither should anyone else; generally, do not use it. Linux pthreads by default allocates the full size for a thread stack, which costs virtual address space but (until it's actually touched) not physical pages.
The inconsistent results in the comments on Why is MAP_GROWSDOWN mapping does not grow? (some people finding it works, some finding it still segfaults when touching the return value and the page below) sound like https://bugs.centos.org/view.php?id=4767 : MAP_GROWSDOWN may be buggy even outside of the way the standard main-stack VM_GROWSDOWN mapping is used.
I know that threads share code and global data but have separate stacks; each thread has its own stack. I believe there is one virtual address space per process, which means every thread uses that single virtual address space.
I want to know how the stack and heap grow when there are multiple threads in the virtual address space. How does the OS manage things when the stack space of one thread is full?
In Linux, each thread's stack is followed by a guard region of guardsize bytes; when the stack grows past its allocation into that guard region, a stack overflow occurs.
It is the programmer's responsibility to guard against stack overflow. The default guardsize value is equal to the page size defined on the system.
Indeed, the memory manager of your operating system creates a virtual memory space for each process (processes have different memory spaces; threads share the same memory space within a process).
Within the process's memory space, each thread has its own stack. However, they all share the same heap, and clever memory-management techniques are used to optimize the shared usage of the heap (see Memory Allocation/Deallocation Bottleneck? as a starting point).
How does OS manages if stack space is full for one thread?
The compiler manages the layout of each stack frame, and its generated code allocates and releases stack memory as functions are entered and left, so it knows at any time how much stack its code uses. The stack regions themselves, however, come from the OS: the main thread's stack is a kernel-managed mapping that can grow on page faults, while each additional thread gets its own fixed-size mmap'ed region allocated by the threading library. The address space therefore effectively contains one "sub-stack" per thread.
I've read the Linux manual page for sbrk() thoroughly:
sbrk() changes the location of the program break, which defines the end
of the process's data segment (i.e., the program break is the first
location after the end of the uninitialized data segment).
And I do know that user-space memory is organized like the following:
The problem is:
When I call sbrk(1), why is it said that I am increasing the size of the heap? As the manual says, I am changing the end position of the data segment and bss, so what increases should be the size of the data segment and bss, right?
The data and bss segments are a fixed size. The space allocated to the process after the end of those segments is therefore not a part of those segments; it is merely contiguous with them. And that space is called the heap space and is used for dynamic memory allocation.
If you want to regard it as 'extending the data/bss segment', that's fine too. It won't make any difference to the behaviour of the program, or the space that's allocated, or anything.
The manual page on Mac OS X indicates you really shouldn't be using them very much:
The brk and sbrk functions are historical curiosities left over from earlier days before the advent of virtual memory management. The brk() function sets the break or lowest address of a process's data segment (uninitialized data) to addr (immediately above bss). Data addressing is restricted between addr and the lowest stack pointer to the stack segment. Memory is allocated by brk in page size pieces; if addr is not evenly divisible by the system page size, it is increased to the next page boundary.
The current value of the program break is reliably returned by sbrk(0) (see also end(3)). The getrlimit(2) system call may be used to determine the maximum permissible size of the data segment; it will not be possible to set the break beyond the rlim_max value returned from a call to getrlimit, e.g. etext + rlp->rlim_max (see end(3) for the definition of etext).
It is mildly exasperating that I can't find a manual page for end(3), despite the pointers to look at it. Even this (slightly old) manual page for sbrk() does not have a link for it.
Notice that today sbrk(2) is rarely used. Most malloc implementations use mmap(2), at least for large allocations, to acquire memory (and munmap to release it). Quite often, free simply marks a memory zone as reusable by some future malloc and does not return any memory to the Linux kernel.
(So in practice the heap of a modern Linux process is made of several segments and is more subtle than your picture; and multi-threaded processes have one stack per thread.)
Use proc(5), notably /proc/self/maps and /proc/$pid/maps, to understand the virtual address space of a process. Try first to understand the output of cat /proc/self/maps (showing the address space of that cat command) and of cat /proc/$$/maps (showing the address space of your shell). Try also to look at the maps pseudo-file for your web browser (e.g. cat /proc/$(pidof firefox)/maps or cat /proc/$(pidof iceweasel)/maps etc.); I have more than a thousand lines (i.e. process segments) in it.
Use strace(1) to understand the system calls done by a given command or process.
Take advantage that on Linux most (and probably all) C standard library implementations are free software, so you can study their source code. The source code of musl-libc is quite easy to read.
Read also about ELF, ASLR, dynamic linking and ld-linux(8), and the Advanced Linux Programming book; then see syscalls(2).
I have a SH4 board, here are the specs...
uname -a
Linux LINUX7109 2.6.23.17_stm23_A18B-HMP_7109-STSDK #1 PREEMPT Fri Aug 6 16:08:19 ART 2010
sh4 unknown
and suppose I have eaten pretty much all the memory, and have only 9 MB left.
free
total used free shared buffers cached
Mem: 48072 42276 5796 0 172 3264
-/+ buffers/cache: 38840 9232
Swap: 0 0 0
Now, when I try to launch a single thread with the default stack size (8 MB), pthread_create fails with ENOMEM. If I strace my test code, I can see that the call that fails is mmap:
old_mmap(NULL, 8388608, PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
However, when I set the stack size to a lower value using ulimit -s:
ulimit -s 7500
I can now launch 10 threads. Each thread does not allocate anything, so it only consumes the minimum overhead (approx. 8 KB per thread, right?).
So, my question is:
So, my question is:
Knowing that mmap doesn't actually consume physical memory, why does pthread_create() (or mmap) fail when the available memory is below the thread stack size?
The VM setting /proc/sys/vm/overcommit_memory (aka. sysctl vm.overcommit_memory) controls whether Linux is willing to hand out more address space than the combined RAM+swap of the machine. (Of course, if you actually try to access that much memory, something will crash. Try a search on "linux oom-killer"...)
The default for this setting is 0. I am going to speculate that someone set it to something else on your system.
Under glibc, the default stack size for threads is 2-10 megabytes (often 8). You should use pthread_attr_setstacksize and pass the resulting attributes object to pthread_create to request a thread with a smaller stack.
mmap consumes address space.
Pointers have to uniquely identify each piece of "memory" (including mmap'ed files) in the address space.
A 32-bit pointer can only address 2-3 GB of memory (32 bits = 2^32 = 4 GB, but some of the address space is reserved by the kernel). That address space is limited.
All threads in a process share the same address space, but different processes have separate address spaces.
This is the operating system's only chance to fail the operation gracefully. If the implementation allowed this operation to succeed, it could run out of memory during an operation for which it cannot return a failure code, such as the stack growing. The operating system prefers to let an operation fail gracefully rather than risk having to kill a completely innocent process.