For a single-threaded program, I want to check whether or not a given virtual address is in the process's stack. I want to do that from inside the process, which is written in C.
I am thinking of reading /proc/self/maps to find the line labelled [stack] and get the start and end addresses of my process's stack. Thinking about this solution led me to the following questions:
(1) /proc/self/maps shows a stack of 132 kB for my particular process, while the maximum stack size (ulimit -s) is 8 MiB on my system. How does Linux know that a page fault occurring beyond the current stack area belongs to the stack (and that the stack must be grown) rather than that we are reaching another memory area of the process?
(2) Does Linux shrink the stack back? In other words, when returning from deep function calls for example, does the OS reduce the virtual memory area corresponding to the stack?
(3) How much virtual space is initially allocated for the stack by the OS?
(4) Is my solution correct, and is there any other, cleaner way to do this?
Lots of the stack setup details depend on which architecture you're running on, executable format, and various kernel configuration options (stack pointer randomization, 4GB address space for i386, etc).
At the time the process is exec'd, the kernel picks a default stack top (for example, on the traditional i386 arch it's 0xc0000000, i.e. the end of the user-mode area of the virtual address space).
The type of executable format (ELF vs a.out, etc) can in theory change the initial stack top. Any additional stack randomization and any other fixups are then done (for example, the vdso [system call springboard] area generally is put here, when used). Now you have an actual initial top of stack.
The kernel now allocates whatever space is needed to construct argument and environment vectors and so forth for the process, initializes the stack pointer, creates initial register values, and initiates the process. I believe this provides the answer for (3): i.e. the kernel allocates only enough space to contain the argument and environment vectors, other pages are allocated on demand.
Other answers, as best as I can tell:
(1) When a process attempts to store data in the area below the current bottom of the stack region, a page fault is generated. The kernel fault handler determines where the next populated virtual memory region within the process' virtual address space begins. It then looks at what type of area that is. If it's a "grows down" area (at least on x86, all stack regions should be marked grows-down), and if the process' stack pointer (ESP/RSP) value at the time of the fault is less than the bottom of that region and if the process hasn't exceeded the ulimit -s setting, and the new size of the region wouldn't collide with another region, then it's assumed to be a valid attempt to grow the stack and additional pages are allocated to satisfy the process.
(2) Not 100% sure, but I don't think there's any attempt to shrink stack areas. Presumably normal LRU page sweeping would be performed making now-unused areas candidates for paging out to the swap area if they're really not being re-used.
(4) Your plan seems reasonable to me: reading /proc/NN/maps should give you the start and end addresses of the stack region as a whole. This would be the largest your stack has ever been, I think. The current actual working stack area, OTOH, should reside between your current stack pointer and the end of the region (ordinarily nothing should be using the area of the stack below the stack pointer).
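Here is a minimal sketch of that approach (the helper name is made up and error handling is kept short): it scans /proc/self/maps for the line labelled [stack] and checks whether a given address falls inside that range.

#include <stdio.h>
#include <string.h>
#include <inttypes.h>

static int addr_in_main_stack(const void *addr)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    int found = 0;

    if (!f)
        return -1;

    while (fgets(line, sizeof line, f)) {
        if (strstr(line, "[stack]")) {
            uintptr_t start, end;
            /* a maps line starts with "start-end", both in hex */
            if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR, &start, &end) == 2)
                found = (uintptr_t)addr >= start && (uintptr_t)addr < end;
            break;
        }
    }
    fclose(f);
    return found;               /* 1 = in [stack], 0 = not, -1 = error */
}

int main(void)
{
    int local;
    printf("&local in [stack]? %d\n", addr_in_main_stack(&local));
    return 0;
}

Keep in mind the caveat above: the reported range is how large the stack mapping currently is, not how much of it is in active use.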
My answer is for Linux on x86-64 with kernel 3.12.23 only. It might or might not apply to other versions or architectures.
(1)+(2) I'm not sure here, but I believe it is as Gil Hamilton said before.
(3) You can see the amount in /proc/pid/maps (or /proc/self/maps if you target the calling process). However, not all of that is actually usable as stack by your application. The argument (argv[]) and environment (__environ[]) vectors usually consume quite a bit of space at the bottom (highest address) of that area.
To actually find the area the kernel designated as "stack" for your application, you can have a look at /proc/self/stat. Its values are documented here. As you can see, there is a field for "startstack". Together with the size of the mapped area, you can compute the current amount of stack reserved. Along with "kstkesp", you could determine the amount of free stack space or actually used stack space (keep in mind that any operation done by your thread most likely will change those values).
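A rough sketch of reading those two fields (field numbering per proc(5); note that recent kernels often report kstkesp as 0 for security reasons, so treat it as best-effort):

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(void)
{
    char buf[4096];
    FILE *f = fopen("/proc/self/stat", "r");
    if (!f || !fgets(buf, sizeof buf, f)) {
        perror("stat");
        return 1;
    }
    fclose(f);

    /* comm (field 2) may contain spaces, so resume parsing after the last ')' */
    char *p = strrchr(buf, ')');
    if (!p)
        return 1;
    p += 2;                       /* skip ") " -> now at field 3 (state) */

    unsigned long startstack = 0, kstkesp = 0;
    int field = 3;
    for (char *tok = strtok(p, " "); tok; tok = strtok(NULL, " "), field++) {
        if (field == 28) startstack = strtoul(tok, NULL, 10);   /* startstack */
        if (field == 29) { kstkesp = strtoul(tok, NULL, 10); break; }  /* kstkesp */
    }

    printf("startstack = 0x%lx\n", startstack);
    printf("kstkesp    = 0x%lx (may be 0 on recent kernels)\n", kstkesp);
    return 0;
}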
Also note that this works only for the process's main thread! Other threads won't get a mapping labeled "[stack]", but either use anonymous mappings or might even end up on the heap. (Use the pthreads API to find those values, or remember the stack start in the thread's main function.)
(4) As explained in (3), your solution is mostly OK, but not entirely accurate.
Related
Hey,
I have a question about the location of threads in memory.
Where is a thread's stack located? And is there a way to display it (using gdb, readelf, or something similar)?
is there a way to display it...using gdb...?
Sure, GDB can show you the stack of any thread. I don't remember the commands, but they're right there in the manual. ISTR there's one command that will list all of the threads, and another that you use to tell it which thread you want to look at. Everything else works just like how it works for a single-threaded program.
I think there's also a way you can tell GDB to iterate over the threads (i.e., perform a single command, such as dump the stack, once for each thread in the program.)
Where is a thread's stack located?
Um, it's located in memory.
Seriously. Why do you want to know? In most programming environments that I have ever heard of, the entire stack for a thread gets allocated all at once, and it can not grow. There's usually some way for the program to say how big it needs the stack of a new thread to be if the default size is not big enough.
In Linux, the program typically would obtain space for a new thread's stack by calling mmap(...) with arguments that allow the OS to choose the virtual address. But there's no reason why it has to work that way. The program could allocate the stack from the heap, if that made any sense.
In other operating systems, there's probably some mechanism similar to mmap that lets the OS choose the address.
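For illustration, here is roughly what such an allocation looks like with the pthreads API; the 256 KiB size and the absence of an explicit guard page are simplifications of what a real threading library does, and worker() is just a placeholder. Build with -pthread.

#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>
#include <stdio.h>

#define STACK_SIZE (256 * 1024)   /* arbitrary example size */

static void *worker(void *arg)
{
    (void)arg;
    printf("hello from a thread with an mmap'ed stack\n");
    return NULL;
}

int main(void)
{
    /* let the kernel pick the address, just as a threading library would */
    void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstack(&attr, stack, STACK_SIZE);  /* stack must outlive the thread */

    pthread_t t;
    pthread_create(&t, &attr, worker, NULL);
    pthread_join(t, NULL);

    pthread_attr_destroy(&attr);
    munmap(stack, STACK_SIZE);
    return 0;
}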
If you want an exact answer: it is in the memory between the heap and the stack.
If you go back thirty or more years, a process in a Unix-like OS would be given one contiguous block of virtual memory, typically starting at page one (page zero would be unallocated because that's how you get a segfault if the program follows a NULL pointer). The lowest addresses would contain the program's "text" segment (e.g., its code and immutable strings), then the "data" segment (initialized static variables), then the "bss" segment (uninitialized static variables).
Everything from the top of the BSS to the top of the given VM region was "wilderness" (i.e., untouched). The program's heap would grow up into the wilderness from the bottom, and its one and only call stack would grow down into the wilderness from the top. If the heap and the stack ever met, then you'd get a "stack overflow" or a malloc() error.
Things are more complicated these days, when a program can have dozens or even hundreds of call stacks. Instead of that "wilderness", Linux programs today can use mmap(...) to create additional VM regions, either for a new thread's stack, or to add to the heap, or to map a file into memory for random access.
I'm trying to understand how the stack works in Linux. I read the AMD64 ABI sections about the stack and process initialization, and it is not clear how the stack should be mapped. Here is the relevant quote (3.4.1):
Stack State
This section describes the machine state that exec (BA_OS) creates for
new processes.
and
It is unspecified whether the data and stack segments are initially
mapped with execute permissions or not. Applications which need to
execute code on the stack or data segments should take proper
precautions, e.g., by calling mprotect().
So I can deduce from the quotes above that the stack is mapped (it is unspecified if PROT_EXEC is used to create the mapping). Also the mapping is created by exec.
The question is whether the "main thread"'s stack uses MAP_GROWSDOWN | MAP_STACK mapping or maybe even via sbrk?
Looking at pmap -x <pid> the stack is marked with [stack] as
00007ffc04c78000 132 12 12 rw--- [ stack ]
Creating a mapping as
mmap(NULL, 4096,
PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE | MAP_STACK,
-1, 0);
simply creates an anonymous mapping, which is shown in pmap -x <pid> as
00007fb6e42fa000 4 0 0 rw--- [ anon ]
I can deduce from the quotes above that the stack is mapped
That literally just means that memory is allocated, i.e., that there is a logical mapping from those virtual addresses to physical pages. We know this because you can use a push or call instruction in _start without making a system call from user-space to allocate a stack.
In fact the x86-64 System V ABI specifies that argc, argv, and envp are on the stack at process startup.
The question is whether the "main thread"'s stack uses MAP_GROWSDOWN | MAP_STACK mapping or maybe even via sbrk?
The ELF binary loader sets the _GROWSDOWN flag for the main thread's stack, but not the MAP_STACK flag. This is code inside the kernel, and it does not go through the regular mmap system call interface.
(Nothing in user-space uses mmap(MAP_GROWSDOWN), so normally the main thread stack is the only mapping that has the VM_GROWSDOWN flag inside the kernel.)
The internal name of the flag that is used for the virtual memory area (VMA) of the stack is VM_GROWSDOWN. In case you're interested, here are all the flags that are used for the main thread's stack: VM_GROWSDOWN, VM_READ, VM_WRITE, VM_MAYREAD, VM_MAYWRITE, and VM_MAYEXEC. In addition, if the ELF binary is specified to have an executable stack (e.g., by compiling with gcc -z execstack), the VM_EXEC flag is also used. Note that on architectures that support stacks that grow upwards, VM_GROWSUP is used instead of VM_GROWSDOWN if the kernel was compiled with CONFIG_STACK_GROWSUP defined. The line of code where these flags are specified in the Linux kernel can be found here.
/proc/.../maps and pmap don't use the VM_GROWSDOWN flag; they rely on address comparison instead. Therefore they may not be able to determine the exact range of virtual address space that the main thread's stack occupies (see an example). On the other hand, /proc/.../smaps looks for the VM_GROWSDOWN flag and marks each memory region that has this flag as gd. (Although it seems to ignore VM_GROWSUP.)
All of these tools/files ignore the MAP_STACK flag. In fact, the whole Linux kernel ignores this flag (which is probably why the program loader doesn't set it.) User-space only passes it for future-proofing in case the kernel does want to start treating thread-stack allocations specially.
sbrk makes no sense here; the stack isn't contiguous with the "break", and the brk heap grows upward toward the stack anyway. Linux puts the stack very near the top of virtual address space. So of course the primary stack couldn't be allocated with (the in-kernel equivalent of) sbrk.
And no, nothing uses MAP_GROWSDOWN, not even secondary thread stacks, because it can't in general be used safely.
The mmap(2) man page which says MAP_GROWSDOWN is "used for stacks" is laughably out of date and misleading. See How to mmap the stack for the clone() system call on linux?. As Ulrich Drepper explained in 2008, code using MAP_GROWSDOWN is typically broken, and proposed removing the flag from Linux mmap and from glibc headers. (This obviously didn't happen, but pthreads hasn't used it since well before then, if ever.)
MAP_GROWSDOWN sets the VM_GROWSDOWN flag for the mapping inside the kernel. The main thread also uses that flag to enable the growth mechanism, so a thread stack may be able to grow the same way the main stack does: arbitrarily far (up to ulimit -s?) if the stack pointer is below the page fault location. (Linux does not require "stack probes" to touch every page of a large multi-page stack array or alloca.)
(Thread stacks are fully allocated up front; only normal lazy allocation of physical pages to back that virtual allocation avoids wasting space for thread stacks.)
A MAP_GROWSDOWN mapping can also grow the way the mmap man page describes: access to the "guard page" below the lowest mapped page will also trigger growth, even if that's below the bottom of the red zone.
But the main thread's stack has magic you don't get with mmap(MAP_GROWSDOWN). It reserves the growth space up to ulimit -s to prevent random choice of mmap address from creating a roadblock to stack growth. That magic is only available to the in-kernel program-loader which maps the main thread's stack during execve(), making it safe from an mmap(NULL, ...) randomly blocking future stack growth.
mmap(MAP_FIXED) could still create a roadblock for the main stack, but if you use MAP_FIXED you're 100% responsible for not breaking anything. (Unlimited stack cannot grow beyond the initial 132KiB if MAP_FIXED involved?). MAP_FIXED will replace existing mappings and reservations, but anything else will treat the main thread's stack-growth space as reserved. (I think that's true; worth trying with MAP_FIXED_NOREPLACE or just a non-NULL hint address.)
See
How is Stack memory allocated when using 'push' or 'sub' x86 instructions?
Why does this code crash with address randomization on?
pthread_create doesn't use MAP_GROWSDOWN for thread stacks, and neither should anyone else. Generally do not use. Linux pthreads by default allocates the full size for a thread stack. This costs virtual address space but (until it's actually touched) not physical pages.
The inconsistent results in comments on Why is MAP_GROWSDOWN mapping does not grow? (some people finding it works, some finding it still segfaults when touching the return value and the page below) sound like https://bugs.centos.org/view.php?id=4767 - MAP_GROWSDOWN may even be buggy outside of the way the standard main-stack VM_GROWSDOWN mapping is used.
I've read the Linux manual page about sbrk() thoroughly:
sbrk() changes the location of the program break, which defines the end
of the process's data segment (i.e., the program break is the first
location after the end of the uninitialized data segment).
And I do know that user space memory's organization is like the following:
The problem is:
When I call sbrk(1), why does it say I am increasing the size of heap? As the manual says, I am changing the end position of "data segment & bss". So, what increases should be the size of data segment & bss, right?
The data and bss segments are a fixed size. The space allocated to the process after the end of those segments is therefore not a part of those segments; it is merely contiguous with them. And that space is called the heap space and is used for dynamic memory allocation.
If you want to regard it as 'extending the data/bss segment', that's fine too. It won't make any difference to the behaviour of the program, or the space that's allocated, or anything.
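A tiny experiment makes this concrete (sbrk is non-standard, so this sketch assumes glibc on Linux): move the break up by one page and watch the address change. The new page shows up inside the region labelled [heap] in /proc/self/maps, which is why the growth is conventionally described as growing the heap.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    void *before = sbrk(0);          /* current program break */
    if (sbrk(4096) == (void *)-1) {  /* grow the break by one page */
        perror("sbrk");
        return 1;
    }
    void *after = sbrk(0);

    printf("break before: %p\n", before);
    printf("break after : %p\n", after);   /* 4096 bytes higher */
    return 0;
}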
The manual page on Mac OS X indicates you really shouldn't be using them very much:
The brk and sbrk functions are historical curiosities left over from earlier days before the advent of virtual memory management. The brk() function sets the break or lowest address of a process's data segment (uninitialized data) to addr (immediately above bss). Data addressing is restricted between addr and the lowest stack pointer to the stack segment. Memory is allocated by brk in page size pieces; if addr is not evenly divisible by the system page size, it is increased to the next page boundary.
The current value of the program break is reliably returned by sbrk(0) (see also end(3)). The getrlimit(2) system call may be used to determine the maximum permissible size of the data segment; it will not be possible to set the break beyond the rlim_max value returned from a call to getrlimit, e.g. etext + rlp->rlim_max (see end(3) for the definition of etext).
It is mildly exasperating that I can't find a manual page for end(3), despite the pointers to look at it. Even this (slightly old) manual page for sbrk() does not have a link for it.
Notice that today sbrk(2) is rarely used. Most malloc implementations use mmap(2) (at least for large allocations) to acquire a memory segment (and munmap to release it). Quite often, free simply marks a memory zone as reusable by some future malloc (and does not release any memory to the Linux kernel).
(So practically, the heap of a modern Linux process is made of several segments and is more subtle than your picture; and multi-threaded processes have one stack per thread.)
Use proc(5), notably /proc/self/maps and /proc/$pid/maps, to understand the virtual address space of some process. Try first to understand the output of cat /proc/self/maps (showing the address space of that cat command) and of cat /proc/$$/maps (showing the address space of your shell). Try also to look at the maps pseudo-file for your web browser (e.g. cat /proc/$(pidof firefox)/maps or cat /proc/$(pidof iceweasel)/maps etc...); I have more than a thousand lines (so more than a thousand process segments) in it.
Use strace(1) to understand the system calls done by a given command or process.
Take advantage of the fact that on Linux most (and probably all) C standard library implementations are free software, so you can study their source code. The source code of musl-libc is quite easy to read.
Read also about ELF, ASLR, dynamic linking & ld-linux(8), and the Advanced Linux Programming book, then syscalls(2).
I'm using a library which creates a pthread using the default stack size of 8MB. Is it possible to programmatically reduce the stack size of the thread the library creates? I tried using setrlimit(RLIMIT_STACK...) inside my main() function, but that doesn't seem to have any effect. ulimit -s seems to do the job, but I don't want to set the stack size before my program is executed.
Any ideas what I can do? Thanks
Update 1:
Seems like I'm going to give up on being able to set the stack size using setrlimit(RLIMIT_STACK,...). I checked the resident memory and found it's a lot less than the virtual memory. That's a good enough reason for me to give up on trying to limit the stack size.
I think you are out of luck. If the library you are using does not provide a way to set the stack limit, then you can't change it after the thread has been created. setrlimit and shell limits affect the main thread's stack.
Threads are created within the process's memory space, so their stacks are allocated when the threads are created. On Unix I believe the stack will be mapped to RAM on demand, so you may not actually use 8 MB of RAM if you don't need it (virtual vs. resident memory).
There are a couple aspects to answering this question.
First, as stated in the comments, pthread_attr_setstacksize is The Right Way to do this. If the library calling pthread_create doesn't have a way to let you do this, fixing the library would be the ideal solution. If the thread is purely internal to the library (not calling code from the calling application) it really should set its own preference for the stack size based on something like PTHREAD_STACK_MIN + ITS_OWN_NEEDS. If it's calling back to your code, it should let you request how much stack space you need.
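For reference, the pthread_attr_setstacksize route looks roughly like this; it only helps if you control the pthread_create call (or can patch the library), and worker() is just a placeholder. Build with -pthread.

#include <pthread.h>
#include <limits.h>

static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    size_t size = 256 * 1024;            /* example: 256 KiB instead of 8 MiB */

    if (size < PTHREAD_STACK_MIN)        /* never go below the platform minimum */
        size = PTHREAD_STACK_MIN;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, size);

    pthread_t t;
    pthread_create(&t, &attr, worker, NULL);
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}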
Second, as an implementation detail, glibc uses the stack limit from setrlimit/ulimit to derive the stack size for threads created by pthread_create. You can perhaps influence the size this way, but it's not portable, and as you've found, not reliable even there (it's not working when you call setrlimit from within the process itself). It's possible that glibc only probes the limit once when the relevant code is first initialized, so I would try moving the setrlimit call as early as possible in main to see if this helps.
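If you want to try that experiment, something like the following, placed at the very top of main(), is the most you can do from inside the process; as noted, glibc may have already cached the limit, so it may have no effect.

#include <sys/resource.h>
#include <stdio.h>

static void shrink_stack_limit(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) == 0) {
        rl.rlim_cur = 1024 * 1024;          /* example: 1 MiB soft limit */
        if (setrlimit(RLIMIT_STACK, &rl) != 0)
            perror("setrlimit");
    }
}

int main(void)
{
    shrink_stack_limit();
    /* ... initialize the library that creates its own threads here ... */
    return 0;
}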
Finally, the stack size for threads may not even be relevant to your application. Even if the stack size is 8MB, only the pages which have actually been modified (probably 4k or at most 8k unless you have big arrays on the stack) are actually using physical memory. The rest is just tying up virtual address space (of which you always have at least 2-3 GB) and possibly commit charge. By default, Linux enables overcommit, so commit charge will not be strictly enforced, and therefore the fact that glibc is requesting too much may not even matter. You could make the overcommit checking even less strict by writing a 1 to /proc/sys/vm/overcommit_memory, but this will cause you to lose information about when you're "running out of memory" and make your program crash instead. On such a constrained system you may prefer even stricter overcommit accounting, but then you have to fix the thread stack size problem...
I need a reliable measurement of allocated memory in a Linux process. I've been looking into mallinfo, but I've read that it is deprecated. What is the state-of-the-art alternative for this sort of statistics?
Basically, I'm interested in at least two numbers:
the number (and size) of memory blocks/pages allocated from the kernel by malloc or whatever implementation the C library of choice uses
(optional but still important) the amount of memory allocated by userspace code (via malloc, new, etc.) minus the memory it has deallocated (via free, delete, etc.)
One possibility I have is to override malloc calls with LD_PRELOAD, but it might introduce unwanted overhead at runtime, and it also might not interact properly with other libraries I'm using that also rely on LD_PRELOAD interposition.
Another possibility I've read about is rusage.
To be clear, this is NOT for debugging purposes; the memory usage is an intrinsic feature of the application (similar to Mathematica or MATLAB, which display the amount of memory used, only more precise, at the block level).
For this purpose - a "memory usage" introspection feature within an application - the most appropriate interface is malloc_hook(3). These are GNU extensions that allow you to hook every malloc(), realloc() and free() call, maintaining your statistics.
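As a sketch only: the hook variables are deprecated and were removed in glibc 2.34, and the naive version below is not thread-safe, but it shows the shape of the interface the answer names.

#include <malloc.h>
#include <stdio.h>

static size_t bytes_allocated;                 /* our running statistic */
static void *(*old_malloc_hook)(size_t, const void *);

static void *counting_malloc_hook(size_t size, const void *caller)
{
    void *result;
    __malloc_hook = old_malloc_hook;           /* restore to avoid recursing into ourselves */
    result = malloc(size);
    bytes_allocated += size;
    old_malloc_hook = __malloc_hook;
    __malloc_hook = counting_malloc_hook;      /* re-arm the hook */
    (void)caller;
    return result;
}

static void install_hook(void)
{
    old_malloc_hook = __malloc_hook;
    __malloc_hook = counting_malloc_hook;
}

int main(void)
{
    install_hook();
    void *p = malloc(1000);
    printf("bytes allocated so far: %zu\n", bytes_allocated);
    free(p);
    return 0;
}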
To see how much memory is mapped by your application from the kernel's point of view, you can read and collate the information in the /proc/self/smaps pseudofile. This also lets you see how much of each allocation is resident, swapped, shared/private, clean/dirty etc.
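For example, a crude way to total the resident pages across all mappings is to sum the per-mapping Rss: lines (field names as documented in proc(5)):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/self/smaps", "r");
    char line[256];
    unsigned long rss_kb = 0, value;

    if (!f) {
        perror("smaps");
        return 1;
    }
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "Rss: %lu kB", &value) == 1)   /* one Rss: line per mapping */
            rss_kb += value;
    fclose(f);

    printf("total resident: %lu kB\n", rss_kb);
    return 0;
}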
/proc/PID/status contains a few useful pieces of information (try running cat /proc/$$/status for example).
VmPeak is the largest your process's virtual memory space ever became during its execution. This includes all pages mapped into your process, including executable pages, mmap'ed files, stack, and heap.
VmSize is the current size of your process's virtual memory space.
VmRSS is the Resident Set Size of your process; i.e., how much of it is taking up physical RAM right now. (A typical process will have lots of stuff mapped that it never uses, like most of the C library. If no processes need a page, eventually it will be evicted and become non-resident. RSS measures the pages that remain resident and are mapped into your process.)
VmHWM is the High Water Mark of VmRSS; i.e. the highest that number has been during the lifetime of the process.
VmData is the size of your process's "data" segment; i.e., roughly its heap usage. Note that small blocks on which you have done malloc and then free will still be in use from the kernel's point of view; large blocks will actually be returned to the kernel when freed. (If memory serves, "large" means greater than 128k for current glibc.) This is probably the closest to what you are looking for.
These measurements are probably better than trying to track malloc and free, since they indicate what is "really going on" from a system-wide point of view. Just because you have called free() on some memory, that does not mean it has been returned to the system for other processes to use.
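If you want to read those fields programmatically, a simple sketch is to scan /proc/self/status for the VmSize, VmRSS and VmData lines (values are reported in kB):

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];

    if (!f) {
        perror("status");
        return 1;
    }
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmSize:", 7) == 0 ||
            strncmp(line, "VmRSS:", 6) == 0 ||
            strncmp(line, "VmData:", 7) == 0)
            fputs(line, stdout);     /* e.g. "VmRSS:     1234 kB" */
    }
    fclose(f);
    return 0;
}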