How does the stack size of a pthread affect memory usage? - c

I have a question about setting the stack size of a pthread using pthread_attr_setstacksize():
From my understanding, a pthread's stack lies in an anonymous mmapped region of its creating process. When I set the thread's stack size to 5M and 8M respectively, I see that it does affect the size of the mmapped region, but both cases use (almost) the same amount of physical memory:
Partial result of the pmap command [stack with size 5M]:
00007f97f8b52000 7172K rw--- [ anon ]
Partial result of the pmap command [stack with size 8M]:
00007f8784606000 10244K rw--- [ anon ]
Partial result of the top command [stack with size 5M]:
VIRT RES SWAP USED
25160 7236 0 7236
Partial result of the top command [stack with size 8M]:
VIRT RES SWAP USED
22088 7196 0 7196
In my program, I want to use a larger stack size to prevent a stack overflow; what I want to confirm here is that by using a large stack size, I will not consume more physical memory but only a larger virtual address range. Is this correct?

If you need a larger stack size to prevent overflow, that implies at some point you'll actually be using the larger size (i.e., your stack will be deeper than the default would allow).
In that case, there's some point where your program would have crashed with the default stack size, where it instead has another page allocated to its address space. So, in some sense, it could use more physical memory.
How many of the pages allocated to your process actually reside in memory at one time, however, depends on your OS, memory pressure, other processes, etc.
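For reference, a minimal sketch (not part of the original question) of requesting a larger per-thread stack with pthread_attr_setstacksize(); the full size is reserved as virtual address space, but pages only become resident as the thread actually touches them:

#include <pthread.h>
#include <stdio.h>
#include <string.h>

static void *worker(void *arg)
{
    (void)arg;
    char buf[1 << 20];              /* use ~1 MiB of the 8 MiB stack */
    memset(buf, 0, sizeof buf);     /* only the touched pages become resident */
    printf("worker done\n");
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 8UL * 1024 * 1024);   /* 8 MiB stack */
    if (pthread_create(&tid, &attr, worker, NULL) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}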

Related

Analyzing memory mapping of a process with pmap. [stack]

I'm trying to understand how the stack works in Linux. I read the AMD64 ABI sections about the stack and process initialization, and it is not clear how the stack should be mapped. Here is the relevant quote (3.4.1):
Stack State
This section describes the machine state that exec (BA_OS) creates for
new processes.
and
It is unspecified whether the data and stack segments are initially
mapped with execute permissions or not. Applications which need to
execute code on the stack or data segments should take proper
precautions, e.g., by calling mprotect().
So I can deduce from the quotes above that the stack is mapped (it is unspecified if PROT_EXEC is used to create the mapping). Also the mapping is created by exec.
The question is whether the "main thread"'s stack uses MAP_GROWSDOWN | MAP_STACK mapping or maybe even via sbrk?
Looking at pmap -x <pid> the stack is marked with [stack] as
00007ffc04c78000 132 12 12 rw--- [ stack ]
Creating a mapping as
mmap(NULL, 4096,
     PROT_READ | PROT_WRITE,
     MAP_ANONYMOUS | MAP_PRIVATE | MAP_STACK,
     -1, 0);
simply creates an anonymous mapping, which is shown in pmap -x <pid> as
00007fb6e42fa000 4 0 0 rw--- [ anon ]
I can deduce from the quotes above that the stack is mapped
That literally just means that memory is allocated. i.e. that there is a logical mapping from those virtual addresses to physical pages. We know this because you can use a push or call instruction in _start without making a system call from user-space to allocate a stack.
In fact the x86-64 System V ABI specifies that argc, argv, and envp are on the stack at process startup.
The question is whether the "main thread"'s stack uses MAP_GROWSDOWN | MAP_STACK mapping or maybe even via sbrk?
The ELF binary loader sets the VM_GROWSDOWN flag for the main thread's stack, but not the MAP_STACK flag. This is code inside the kernel, and it does not go through the regular mmap system call interface.
(Nothing in user-space uses mmap(MAP_GROWSDOWN), so normally the main thread's stack is the only mapping that has the VM_GROWSDOWN flag inside the kernel.)
The internal name of the flag used for the virtual memory area (VMA) of the stack is VM_GROWSDOWN. In case you're interested, here are all the flags that are used for the main thread's stack: VM_GROWSDOWN, VM_READ, VM_WRITE, VM_MAYREAD, VM_MAYWRITE, and VM_MAYEXEC. In addition, if the ELF binary is specified to have an executable stack (e.g., by compiling with gcc -z execstack), the VM_EXEC flag is also used. Note that on architectures that support stacks that grow upwards, VM_GROWSUP is used instead of VM_GROWSDOWN if the kernel was compiled with CONFIG_STACK_GROWSUP defined. The line of code where these flags are specified in the Linux kernel can be found here.
/proc/.../maps and pmap don't use VM_GROWSDOWN; they rely on address comparison instead. Therefore they may not be able to determine the exact range of the virtual address space that the main thread's stack occupies (see an example). On the other hand, /proc/.../smaps looks for the VM_GROWSDOWN flag and marks each memory region that has this flag as gd. (Although it seems to ignore VM_GROWSUP.)
All of these tools/files ignore the MAP_STACK flag. In fact, the whole Linux kernel ignores this flag (which is probably why the program loader doesn't set it.) User-space only passes it for future-proofing in case the kernel does want to start treating thread-stack allocations specially.
sbrk makes no sense here; the stack isn't contiguous with the "break", and the brk heap grows upward toward the stack anyway. Linux puts the stack very near the top of virtual address space. So of course the primary stack couldn't be allocated with (the in-kernel equivalent of) sbrk.
And no, nothing uses MAP_GROWSDOWN, not even secondary thread stacks, because it can't in general be used safely.
The mmap(2) man page which says MAP_GROWSDOWN is "used for stacks" is laughably out of date and misleading. See How to mmap the stack for the clone() system call on linux?. As Ulrich Drepper explained in 2008, code using MAP_GROWSDOWN is typically broken, and proposed removing the flag from Linux mmap and from glibc headers. (This obviously didn't happen, but pthreads hasn't used it since well before then, if ever.)
MAP_GROWSDOWN sets the VM_GROWSDOWN flag for the mapping inside the kernel. The main thread also uses that flag to enable the growth mechanism, so a thread stack may be able to grow the same way the main stack does: arbitrarily far (up to ulimit -s?) if the stack pointer is below the page fault location. (Linux does not require "stack probes" to touch every page of a large multi-page stack array or alloca.)
(Thread stacks are fully allocated up front; only normal lazy allocation of physical pages to back that virtual allocation avoids wasting space for thread stacks.)
A MAP_GROWSDOWN mapping can also grow the way the mmap man page describes: an access to the "guard page" below the lowest mapped page will also trigger growth, even if that's below the bottom of the red zone.
But the main thread's stack has magic you don't get with mmap(MAP_GROWSDOWN). It reserves the growth space up to ulimit -s to prevent random choice of mmap address from creating a roadblock to stack growth. That magic is only available to the in-kernel program-loader which maps the main thread's stack during execve(), making it safe from an mmap(NULL, ...) randomly blocking future stack growth.
mmap(MAP_FIXED) could still create a roadblock for the main stack, but if you use MAP_FIXED you're 100% responsible for not breaking anything. (Unlimited stack cannot grow beyond the initial 132KiB if MAP_FIXED involved?). MAP_FIXED will replace existing mappings and reservations, but anything else will treat the main thread's stack-growth space as reserved. (I think that's true; worth trying with MAP_FIXED_NOREPLACE or just a non-NULL hint address.)
See
How is Stack memory allocated when using 'push' or 'sub' x86 instructions?
Why does this code crash with address randomization on?
pthread_create doesn't use MAP_GROWSDOWN for thread stacks, and neither should anyone else. Generally do not use. Linux pthreads by default allocates the full size for a thread stack. This costs virtual address space but (until it's actually touched) not physical pages.
The inconsistent results in comments on Why is MAP_GROWSDOWN mapping does not grow? (some people finding it works, some finding it still segfaults when touching the return value and the page below) sound like https://bugs.centos.org/view.php?id=4767 - MAP_GROWSDOWN may even be buggy outside of the way the standard main-stack VM_GROWSDOWN mapping is used.

Malloc is using 10x the amount of memory necessary

I have a network application which allocates predictable 65k chunks as part of the IO subsystem. The memory usage is tracked atomically within the system so I know how much memory I'm actually using. This number can also be checked against malloc_stats().
Result of malloc_stats()
Arena 0:
system bytes = 1617920
in use bytes = 1007840
Arena 1:
system bytes = 2391826432
in use bytes = 247265696
Arena 2:
system bytes = 2696175616
in use bytes = 279997648
Arena 3:
system bytes = 6180864
in use bytes = 6113920
Arena 4:
system bytes = 16199680
in use bytes = 699552
Arena 5:
system bytes = 22151168
in use bytes = 899440
Arena 6:
system bytes = 8765440
in use bytes = 910736
Arena 7:
system bytes = 16445440
in use bytes = 11785872
Total (incl. mmap):
system bytes = 935473152
in use bytes = 619758592
max mmap regions = 32
max mmap bytes = 72957952
Items to note:
The total in use bytes is the correct number according to my internal counter. However, the application has a RES (from top/htop) of 5.2GB. The allocations are almost always 65k; I don't understand the huge amount of fragmentation/waste I am seeing, even more so when mmap comes into play.
The total system bytes does not equal the sum of the system bytes in each arena.
I'm on Ubuntu 16.04 using glibc 2.23-0ubuntu3
Arena 1 and 2 account for the large RES value the kernel is reporting.
Arena 1 and 2 are holding on to 10x the amount of memory that is used.
The vast majority of allocations are ALWAYS 65k (an explicit multiple of the page size).
How do I keep malloc from allocating an absurd amount of memory?
I think this version of malloc has a huge bug. Eventually (after an hour) a little more than half of the memory will be released. This isn't a fatal bug but it is definitely a problem.
UPDATE - I added mallinfo and re-ran the test - the app is no longer processing anything at the time this was captured. No network connections are attached. It is idle.
Arena 2:
system bytes = 2548473856
in use bytes = 3088112
Arena 3:
system bytes = 3288600576
in use bytes = 6706544
Arena 4:
system bytes = 16183296
in use bytes = 914672
Arena 5:
system bytes = 24027136
in use bytes = 911760
Arena 6:
system bytes = 15110144
in use bytes = 643168
Arena 7:
system bytes = 16621568
in use bytes = 11968016
Total (incl. mmap):
system bytes = 1688858624
in use bytes = 98154448
max mmap regions = 32
max mmap bytes = 73338880
arena (total amount of memory allocated other than mmap) = 1617780736
ordblks (number of ordinary non-fastbin free blocks) = 1854
smblks (number of fastbin free blocks) = 21
hblks (number of blocks currently allocated using mmap) = 31
hblkhd (number of bytes in blocks currently allocated using mmap) = 71077888
usmblks (highwater mark for allocated space) = 0
fsmblks (total number of bytes in fastbin free blocks) = 1280
uordblks (total number of bytes used by in-use allocations) = 27076560
fordblks (total number of bytes in free blocks) = 1590704176
keepcost (total amount of releasable free space at the top of the heap) = 439216
My hypothesis is as follows: the total system bytes reported by malloc is much less than the sum reported for each arena (1.6GB vs 6.1GB). This could mean that (A) malloc is actually releasing blocks but the arena doesn't, or (B) that malloc is not compacting memory allocations at all and it is creating a huge amount of fragmentation.
UPDATE Ubuntu released a kernel update which basically fixed everything as described in this post. That said, there is a lot of good information in here on how malloc works with the kernel.
The full details can be a bit complex, so I'll try to simplify things as much as I can. Also, this is a rough outline and may be slightly inaccurate in places.
Requesting memory from the kernel
malloc uses either sbrk or anonymous mmap to request a contiguous memory area from the kernel. Each area will be a multiple of the machine's page size, typically 4096 bytes. Such a memory area is called an arena in malloc terminology. More on that below.
Any pages so mapped become part of the process's virtual address space. However, even though they have been mapped in, they may not be backed by a physical RAM page [yet]. They are mapped [many-to-one] to the single "zero" page in R/O mode.
When the process tries to write to such a page, it incurs a protection fault, the kernel breaks the mapping to the zero page, allocates a real physical page, remaps to it, and the process is restarted at the fault point. This time the write succeeds. This is similar to demand paging to/from the paging disk.
In other words, page mapping in a process's virtual address space is different than page residency in a physical RAM page/slot. More on this later.
RSS (resident set size)
RSS is not really a measure of how much memory a process allocates or frees, but how many pages in its virtual address space have a physical page in RAM at the present time.
If the system has a paging disk of 128GB, but only has (e.g.) 4GB of RAM, a process's RSS could never exceed 4GB. The process's RSS goes up/down based upon paging in or paging out pages of its virtual address space.
So, because of the zero page mapping at start, a process RSS may be much lower than the amount of virtual memory it has requested from the system. Also, if another process B "steals" a page slot from a given process A, the RSS for A goes down and goes up for B.
The process "working set" is the minimum number of pages the kernel must keep resident for the process to prevent the process from excessively page faulting to get a physical memory page, based on some measure of "excessively". Each OS has its own ideas about this and it's usually a tunable parameter on a system-wide or per-process basis.
If a process allocates a 3GB array, but only accesses the first 10MB of it, it will have a lower working set than if it randomly/scattershot accessed all parts of the array.
That is, if the RSS is higher [or can be higher] than the working set, the process will run well. If the RSS is below the working set, the process will page fault excessively. This can be either because it has poor "locality of reference" or because other events in the system conspire to "steal" the process's page slots.
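A small demonstration of mapping vs. residency (my own sketch, assuming Linux and /proc): map a large anonymous region, then watch /proc/self/statm before and after touching it.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void show_statm(const char *label)
{
    /* /proc/self/statm: first two fields are total and resident pages. */
    FILE *f = fopen("/proc/self/statm", "r");
    long total = 0, resident = 0;
    if (f && fscanf(f, "%ld %ld", &total, &resident) == 2)
        printf("%-15s total=%ld pages, resident=%ld pages\n", label, total, resident);
    if (f)
        fclose(f);
}

int main(void)
{
    size_t len = 64UL * 1024 * 1024;            /* 64 MiB anonymous mapping */
    show_statm("start");
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    show_statm("after mmap");        /* virtual size jumps, RSS barely moves */
    memset(p, 1, len);               /* write-fault every page */
    show_statm("after touching");    /* now RSS jumps too */
    munmap(p, len);
    return 0;
}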
malloc and arenas
To cut down on fragmentation, malloc uses multiple arenas. Each arena has a "preferred" allocation size (aka "chunk" size). That is, smaller requests like malloc(32) come from (e.g.) arena A, but larger requests like malloc(1024 * 1024) come from a different arena (e.g.) arena B.
This prevents a small allocation from "burning" the first 32 bytes of the last available chunk in arena B, making it too short to satisfy the next malloc(1M)
Of course, we can't have a separate arena for each requested size, so the "preferred" chunk sizes are typically some power of 2.
When creating a new arena for a given chunk size, malloc doesn't just request an area of the chunk size, but some multiple of it. It does this so it can quickly satisfy subsequent requests of the same size without having to do an mmap for each one. Since the minimum size is 4096, arena A will have 4096/32 chunks or 128 chunks available.
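Not part of this explanation, but worth knowing for the problem above: glibc lets you cap how many arenas it creates, either with mallopt(M_ARENA_MAX, n) early in main() or with the MALLOC_ARENA_MAX environment variable. A hedged, glibc-specific sketch:

#include <malloc.h>
#include <stdlib.h>

int main(void)
{
    /* Must run before the extra arenas are created, i.e. before the
     * threads start allocating.  glibc-specific; other allocators
     * don't have M_ARENA_MAX. */
    mallopt(M_ARENA_MAX, 2);

    void *p = malloc(65536);    /* allocations now share fewer arenas */
    free(p);
    return 0;
}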
free and munmap
When an application does a free(ptr) [ptr represents a chunk], the chunk is marked as available. free could choose to combine contiguous chunks that are free/available at that time or not.
If the chunk is small enough, it does nothing more (i.e. the chunk is available for reallocation), but free does not try to release the chunk back to the kernel. For larger allocations, free will [try to] do munmap immediately.
munmap can unmap a single page [or even a small number of bytes], even if it comes in the middle of an area that was multiple pages long. If so, the application now has a "hole" in the mapping.
malloc_trim and madvise
If free is called, it probably calls munmap. If an entire page has been unmapped, the RSS of the process (e.g. A) goes down.
But, consider chunks that are still allocated, or chunks that were marked as free/available but were not unmapped.
They are still part of the process A's RSS. If another process (e.g. B) starts doing lots of allocations, the system may have to page out some of process A's slots to the paging disk [reducing A's RSS] to make room for B [whose RSS goes up].
But, if there is no process B to steal A's page slots, process A's RSS can remain high. Say process A allocated 100MB, used it a while back, but is only actively using 1MB now, the RSS will remain at 100MB.
That's because without the "interference" from process B, the kernel had no reason to steal any page slots from A, so they "remain on the books" in the RSS.
To tell the kernel that a memory area's contents are no longer needed, we use the madvise syscall with MADV_DONTNEED. This tells the kernel that it can reclaim the physical pages backing that area right away, thereby reducing the process's RSS.
The pages remain mapped in the process's virtual address space, but their physical backing is dropped (anonymous pages are simply discarded rather than written to the paging disk). Remember, page mapping is different from page residency.
If the process accesses such a page again, it incurs a page fault and the kernel supplies a physical page once more (a fresh zero-filled page for anonymous memory, or data re-read from its backing store for file-backed memory). The RSS goes back up. Classical demand paging.
madvise is what malloc_trim uses to reduce the RSS of the process.
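A minimal sketch of that mechanism (my own example, not from the answer): fault in an anonymous region, then hand the physical pages back with madvise(MADV_DONTNEED) while keeping the mapping.

#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16UL * 1024 * 1024;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    memset(p, 1, len);               /* fault every page in: RSS goes up        */
    madvise(p, len, MADV_DONTNEED);  /* physical pages are released: RSS drops  */

    /* The region is still mapped; the next access demand-faults a fresh
     * zero-filled page (for private anonymous memory). */
    p[0] = 2;
    munmap(p, len);
    return 0;
}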
free does not promise to return the freed memory to the OS.
What you observe is the freed memory is kept in the process for possible reuse. More than that, free releasing memory to the OS can pose a performance problem when allocation and deallocation of large chunks happen frequently. This is why there is an option to return the memory to the OS explicitly with malloc_trim.
Try malloc_trim(0) and see if that reduces the RSS. This function is non-standard, so its behaviour is implementation specific, it might not do anything at all. You mentioned in the comments that calling it did reduce RSS.
You may like to make sure that there are no memory leaks and memory corruption before you start digging deeper.
With regards to keepcost member, see man mallinfo:
BUGS
Information is returned for only the main memory allocation area.
Allocations in other arenas are excluded. See malloc_stats(3) and malloc_info(3) for alternatives that include information about other arenas.
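A minimal sketch of trying malloc_trim(0) (my own example, glibc-specific): allocate and free a burst of 65k blocks, then compare the allocator statistics before and after trimming.

#include <malloc.h>
#include <stdlib.h>

int main(void)
{
    enum { N = 1000, SZ = 65536 };
    void *blocks[N];

    for (int i = 0; i < N; i++)
        blocks[i] = malloc(SZ);
    for (int i = 0; i < N; i++)
        free(blocks[i]);

    malloc_stats();    /* "in use bytes" drops, "system bytes" may not        */
    malloc_trim(0);    /* ask glibc to return whatever it can to the kernel   */
    malloc_stats();    /* system bytes / RSS should now be lower              */
    return 0;
}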

Red Hat 7.1 kernel process stack size from 8KB to 16KB

Reading the release note of Redhat 7.1, I read this:
Process Stack Size Increased from 8KB to 16KB
Since Red Hat Enterprise Linux 7.1, the kernel process stack size has been increased from 8KB to 16KB to help large processes that use stack space.
I know the kernel process stack is resident memory, the allocation is made when a process is created, and that memory needs to be physically contiguous. On x86_64 with a page size of 4096 bytes, the kernel will need to find 4 contiguous pages instead of 2 pages for the process stack.
Can this feature be a problem when kernel memory is fragmented?
With the larger kernel stack size, will it be easier to run into problems creating processes when memory is fragmented?
The kernel often needs to allocate a set of one or more pages which are physically contiguous. This may be necessary while allocating buffers (required by drivers for data transfers such as DMA) or when creating a process stack.
Typically, to meet such requirements, the kernel attempts to avoid fragmentation: it allocates pages which are physically contiguous, and freed pages are merged/grouped into larger physically contiguous groups of pages (if available). This is handled by the memory management subsystem and the buddy allocator. The kernel stack (8k before, 16k in RHEL 7) is created when the program starts executing.
If the kernel is unable to allocate the requested set of physically contiguous pages (either 2 for an 8k stack or 4 for a 16k stack, assuming a 4k page size), then this could lead to page allocation failures of order 2 (i.e. 2^2 = 4 pages * 4K). The order depends on the size of the physically contiguous allocation requested. Observing /proc/buddyinfo when the page allocation failures occur can show signs of physical memory being fragmented.
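A rough sketch (my own, assuming the usual /proc/buddyinfo layout) that sums, per zone, the free blocks of order >= 2, i.e. blocks of at least 4 contiguous pages, which is what a 16k kernel stack needs:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/buddyinfo", "r");
    char line[512];

    if (!f) { perror("/proc/buddyinfo"); return 1; }
    while (fgets(line, sizeof line, f)) {
        char *zone = strstr(line, "zone");
        if (!zone)
            continue;
        /* Skip "zone" and isolate the zone name, then walk the per-order counts. */
        char *p = zone + 4;
        while (*p == ' ')
            p++;
        char *name = p;
        while (*p && *p != ' ')
            p++;
        if (*p)
            *p++ = '\0';
        long order = 0, free_ge2 = 0;
        while (*p) {
            char *end;
            long count = strtol(p, &end, 10);
            if (end == p)
                break;
            if (order >= 2)             /* blocks of 2^order pages */
                free_ge2 += count;
            order++;
            p = end;
        }
        printf("zone %-8s free blocks of order>=2: %ld\n", name, free_ge2);
    }
    fclose(f);
    return 0;
}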
Yes.
When memory is fragmented, finding stack space can be a problem.

Maximum limit of Linkedlist

I am new to the CS world. While reading some books I learned that dynamic memory allocation allocates memory while the program runs, and that memory is called the heap. So does that mean that whenever I want to create a new node in a linked list, it gets stored on the heap? Or is it stored somewhere else in memory and accessed at runtime?
I also read that whenever a program runs, the OS creates a PCB for it, including at minimum the following 4 parts:
Heap Segment
Stack segment
Data segment
Code segment
The heap segment and the stack segment grow dynamically depending on the code (upwards or downwards depending on the system).
So my basic question is:
Can we keep adding elements to a linked list until system memory is exhausted, or only until heap memory is exhausted?
I read that it's until system memory is exhausted, but I wondered how?
Best explanation I've read yet:
Because the heap grows up and the stack grows down, they basically limit each other. Also, because both types of segments are writable, it wasn't always a violation for one of them to cross the boundary, so you could have a buffer or stack overflow. Now there are mechanisms to stop that from happening.
There is a set limit for the heap (and the stack) for each process to start with. This limit can be changed at runtime (using brk()/sbrk()). Basically what happens is that when the process needs more heap space and has run out of allocated space, the standard library will issue the call to the OS. The OS will allocate a page, which will usually be managed by the user-space library for the program to use. I.e. if the program wants 1 KiB, the OS will give an additional 4 KiB, and the library will give 1 KiB to the program and keep 3 KiB for use when the program asks for more next time.
Excerpt from here.
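A small sketch of that behaviour (my own example, Linux/glibc assumptions): compare the program break before and after a 1 KiB malloc; the break moves by much more than 1 KiB because the allocator grabs space in bigger steps.

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *before = sbrk(0);            /* current program break */
    void *p = malloc(1024);            /* small request */
    void *after = sbrk(0);

    printf("break moved by %ld bytes for a 1 KiB request\n",
           (long)((char *)after - (char *)before));
    free(p);
    return 0;
}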
If you dynamically allocate memory you get it from the heap.
From the statement above you can directly conclude that if you allocate a list's nodes dynamically, the maximum number of nodes is limited by the heap's size, that is, by the amount of heap memory.
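An illustrative sketch (my own): allocate list nodes until malloc fails. Note that on Linux with memory overcommit enabled, the process may be killed by the OOM killer before malloc ever returns NULL, so treat the printed count as an approximation of "heap exhausted".

#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

int main(void)
{
    struct node *head = NULL;
    unsigned long count = 0;

    for (;;) {
        struct node *n = malloc(sizeof *n);
        if (n == NULL)              /* heap (or address space) exhausted */
            break;
        n->value = 0;
        n->next = head;
        head = n;
        count++;
    }
    printf("allocated %lu nodes before malloc failed\n", count);

    while (head) {                  /* release everything again */
        struct node *next = head->next;
        free(head);
        head = next;
    }
    return 0;
}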

How programmatically get Linux process's stack start and end address?

For a single-threaded program, I want to check whether or not a given virtual address is in the process's stack. I want to do that from inside the process, which is written in C.
I am thinking of reading /proc/self/maps to find the line labelled [stack] to get start and end address for my process's stack. Thinking about this solution led me to the following questions:
/proc/self/maps shows a stack of 132k for my particular process and the maximum size for the stack (ulimit -s) is 8 MB on my system. How does Linux know that a given page fault, occurring because we go beyond the current stack mapping, belongs to the stack (and that the stack must be made larger) rather than that we are reaching another memory area of the process?
Does Linux shrink back the stack ? In other words, when returning from deep function calls for example, does the OS reduce the virtual memory area corresponding to the stack ?
How much virtual space is initially allocated for the stack by the OS ?
Is my solution correct and is there any other cleaner way to do that ?
Lots of the stack setup details depend on which architecture you're running on, executable format, and various kernel configuration options (stack pointer randomization, 4GB address space for i386, etc).
At the time the process is exec'd, the kernel picks a default stack top (for example, on the traditional i386 arch it's 0xc0000000, i.e. the end of the user-mode area of the virtual address space).
The type of executable format (ELF vs a.out, etc) can in theory change the initial stack top. Any additional stack randomization and any other fixups are then done (for example, the vdso [system call springboard] area generally is put here, when used). Now you have an actual initial top of stack.
The kernel now allocates whatever space is needed to construct argument and environment vectors and so forth for the process, initializes the stack pointer, creates initial register values, and initiates the process. I believe this provides the answer for (3): i.e. the kernel allocates only enough space to contain the argument and environment vectors, other pages are allocated on demand.
Other answers, as best as I can tell:
(1) When a process attempts to store data in the area below the current bottom of the stack region, a page fault is generated. The kernel fault handler determines where the next populated virtual memory region within the process' virtual address space begins. It then looks at what type of area that is. If it's a "grows down" area (at least on x86, all stack regions should be marked grows-down), and if the process' stack pointer (ESP/RSP) value at the time of the fault is less than the bottom of that region and if the process hasn't exceeded the ulimit -s setting, and the new size of the region wouldn't collide with another region, then it's assumed to be a valid attempt to grow the stack and additional pages are allocated to satisfy the process.
(2) Not 100% sure, but I don't think there's any attempt to shrink stack areas. Presumably normal LRU page sweeping would be performed making now-unused areas candidates for paging out to the swap area if they're really not being re-used.
(4) Your plan seems reasonable to me: /proc/NN/maps should give you the start and end addresses for the stack region as a whole. This would be the largest your stack has ever been, I think. The current actual working stack area, OTOH, should reside between your current stack pointer and the end of the region (ordinarily nothing should be using the area of the stack below the stack pointer).
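A sketch of that plan (my own code, main thread only): scan /proc/self/maps for the line tagged [stack] and test whether a given address falls inside that range.

#include <stdio.h>
#include <string.h>

/* Returns 1 if addr is inside the [stack] mapping, 0 if not, -1 on error.
 * Assumes an LP64 platform where addresses fit in unsigned long. */
static int addr_in_main_stack(const void *addr)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    int found = 0;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (strstr(line, "[stack]")) {
            unsigned long start, end;
            if (sscanf(line, "%lx-%lx", &start, &end) == 2)
                found = ((unsigned long)addr >= start &&
                         (unsigned long)addr < end);
            break;
        }
    }
    fclose(f);
    return found;
}

int main(void)
{
    int local = 0;
    printf("&local on the main stack? %d\n", addr_in_main_stack(&local));
    return 0;
}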
My answer is for Linux on x86-64 with kernel 3.12.23 only. It might or might not apply to other versions or architectures.
(1)+(2) I'm not sure here, but I believe it is as Gil Hamilton said before.
(3) You can see the amount in /proc/pid/maps (or /proc/self/maps if you target the calling process). However, not all of that is actually usable as stack for your application. Argument (argv[]) and environment vectors (__environ[]) usually consume quite a bit of space at the bottom (highest address) of that area.
To actually find the area the kernel designated as "stack" for your application, you can have a look at /proc/self/stat. Its values are documented here. As you can see, there is a field for "startstack". Together with the size of the mapped area, you can compute the current amount of stack reserved. Along with "kstkesp", you could determine the amount of free stack space or actually used stack space (keep in mind that any operation done by your thread most likely will change those values).
Also note that this works only for the process's main thread! Other threads won't get a labeled "[stack]" mapping, but either use anonymous mappings or might even end up on the heap. (Use the pthreads API to find those values, or remember the stack start in the thread's main function.)
(4) As explained in (3), your solution is mostly OK, but not entirely accurate.
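For those non-main threads, a sketch using the glibc extensions mentioned above (pthread_getattr_np() together with pthread_attr_getstack()) to obtain the calling thread's stack base and size:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    pthread_attr_t attr;
    void *stack_addr;
    size_t stack_size;

    if (pthread_getattr_np(pthread_self(), &attr) == 0) {
        if (pthread_attr_getstack(&attr, &stack_addr, &stack_size) == 0)
            printf("this thread's stack: [%p, %p), %zu bytes\n",
                   stack_addr, (char *)stack_addr + stack_size, stack_size);
        pthread_attr_destroy(&attr);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}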
