Do threads have a distinct heap? - c

As far as I know, each thread gets a distinct stack when it is created by the operating system. I wonder if each thread also has a distinct heap of its own?

No. All threads share a common heap.
Each thread has a private stack, which it can quickly add and remove items from. This makes stack-based memory fast, but if you use too much stack memory, as occurs in infinite recursion, you will get a stack overflow.
Since all threads share the same heap, access to the allocator/deallocator must be synchronized. There are various methods and libraries for avoiding allocator contention.
Some languages allow you to create private pools of memory, or individual heaps, which you can assign to a single thread.
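For illustration, here is a minimal sketch of such a private pool in C: a bump allocator that a single thread owns exclusively, so no locking is needed. All names are illustrative, not from any particular library.
#include <stdlib.h>

/* A private per-thread pool: one thread owns it, so no locks. */
typedef struct {
    char  *base;   /* start of the pool's backing memory */
    size_t size;   /* total capacity in bytes */
    size_t used;   /* bytes handed out so far */
} pool_t;

int pool_init(pool_t *p, size_t size)
{
    p->base = malloc(size);   /* one (possibly locked) allocation up front */
    p->size = size;
    p->used = 0;
    return p->base != NULL;
}

void *pool_alloc(pool_t *p, size_t n)
{
    n = (n + 15) & ~(size_t)15;       /* keep 16-byte alignment */
    if (p->used + n > p->size)
        return NULL;                  /* pool exhausted */
    void *out = p->base + p->used;
    p->used += n;
    return out;
}

void pool_destroy(pool_t *p)
{
    free(p->base);   /* releases everything the pool handed out, at once */
}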

By default, C has only a single heap.
That said, some allocators that are thread-aware will partition the heap so that each thread has its own area to allocate from. The idea is that this should make the heap scale better.
One example of such a heap is Hoard.

Depends on the OS. The standard C runtime on Windows and Unix-like systems uses a shared heap across threads. This means locking every malloc/free.
On Symbian, for example, each thread comes with its own heap, although threads can share pointers to data allocated in any heap. Symbian's design is better in my opinion, since it not only eliminates the need for locking during alloc/free, but also encourages clean specification of data ownership among threads. Also, when a thread dies, it takes all the objects it allocated along with it; that is, it cannot leak objects it has allocated, which is an important property to have in mobile devices with constrained memory.
Erlang also follows a similar design where a "process" acts as a unit of garbage collection. All data is communicated between processes by copying, except for binary blobs which are reference counted (I think).

Each thread has its own stack (its call stack).
All threads share the same heap.

It depends on what exactly you mean when saying "heap".
All threads share the address space, so heap-allocated objects are accessible from all threads. Technically, stacks are shared as well in this sense, i.e. nothing prevents you from accessing another thread's stack (though it would almost never make sense to do so).
On the other hand, there are heap structures used to allocate memory; that is where all the bookkeeping for heap memory allocation is done. These structures are carefully organized to minimize contention between threads, so some threads might share a heap structure (an arena), while others use distinct arenas.
See the following thread for an excellent explanation of the details: How does malloc work in a multithreaded environment?

Typically, threads share the heap and other resources; however, there are thread-like constructions that don't. Among these thread-like constructions are Erlang's lightweight processes and UNIX's full-on processes (created with a call to fork()). You might also be working on multi-machine concurrency, in which case your inter-thread communication options are considerably more limited.

Generally speaking, all threads use the same address space and therefore usually have just one heap.
However, it can be a bit more complicated. You might be looking for Thread Local Storage (TLS), but it stores single values only.
Windows-Specific:
TLS space can be allocated using TlsAlloc and freed using TlsFree (overview here). Again, it's not a heap; each slot holds just a single pointer-sized value.
Strangely, Windows supports multiple heaps per process. One can store a heap's handle in TLS; then you would have something like a "thread-local heap". However, only the handle is thread-local: other threads can still access the memory through ordinary pointers, as it's still the same address space.
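For illustration, a minimal sketch of that pattern (helper names are made up; error handling omitted): each thread creates its own heap with HeapCreate() and stashes the handle in a TLS slot, so its allocations never contend with other threads'.
#include <windows.h>
#include <stdio.h>

static DWORD g_tlsHeap;   /* TLS slot holding this thread's heap handle */

static void thread_heap_attach(void)
{
    /* HEAP_NO_SERIALIZE is safe because only this thread uses the heap */
    TlsSetValue(g_tlsHeap, HeapCreate(HEAP_NO_SERIALIZE, 0, 0));
}

static void *thread_heap_alloc(SIZE_T size)
{
    return HeapAlloc((HANDLE)TlsGetValue(g_tlsHeap), 0, size);
}

static void thread_heap_detach(void)
{
    /* Destroying the heap frees every block allocated from it at once */
    HeapDestroy((HANDLE)TlsGetValue(g_tlsHeap));
}

int main(void)
{
    g_tlsHeap = TlsAlloc();
    thread_heap_attach();
    void *p = thread_heap_alloc(128);
    printf("allocated at %p\n", p);
    thread_heap_detach();
    TlsFree(g_tlsHeap);
    return 0;
}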
EDIT: Some memory allocators (specifically jemalloc on FreeBSD) use TLS to assign "arenas" to threads. This is done to optimize allocation for multiple cores by reducing synchronization overhead.

On the FreeRTOS operating system, tasks (threads) share the same heap, but each one of them has its own stack. This comes in very handy on low-power, low-RAM architectures, because the same pool of memory can be accessed and shared by several threads. The small catch is that the developer must keep in mind that a mechanism for synchronizing malloc and free is needed, which is why allocating or freeing memory on the heap should be protected by some type of lock, for example a semaphore or a mutex.
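As a sketch of that locking pattern (wrapper names are illustrative; note that the stock heap_3/heap_4 allocators already guard pvPortMalloc()/vPortFree() by suspending the scheduler, so an explicit mutex like this matters mostly for custom allocators):
#include "FreeRTOS.h"
#include "semphr.h"

static SemaphoreHandle_t heap_mutex;

void heap_lock_init(void)
{
    heap_mutex = xSemaphoreCreateMutex();
}

void *safe_malloc(size_t size)
{
    void *p = NULL;
    if (xSemaphoreTake(heap_mutex, portMAX_DELAY) == pdTRUE) {
        p = pvPortMalloc(size);         /* allocate from the shared heap */
        xSemaphoreGive(heap_mutex);
    }
    return p;
}

void safe_free(void *p)
{
    if (xSemaphoreTake(heap_mutex, portMAX_DELAY) == pdTRUE) {
        vPortFree(p);                   /* return block under the lock */
        xSemaphoreGive(heap_mutex);
    }
}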

Related

How is memory layout shared with other processes/threads?

I'm currently learning about memory layout in C. So far I know there are several sections in a C program's memory: text, data, bss, heap and stack. They also say the heap is shared with other things beyond the program.
My questions are these.
What exactly is the heap shared with? One source states that the heap must always be freed in order to make it available for other processes, whereas another says the heap area is shared by all threads, shared libraries, and dynamically loaded modules in a process. If it is not shared with other processes, do I really have to free it while my program is running (rather than at the end of it)?
Some sources also single out the high addresses (a sixth section) for command-line arguments and environment variables. Should this be considered another section and a part of program memory?
Are the other sections shared with anything else beyond a program?
The heap is per-process memory: each process has its own heap, which is shared only within that process (for example, between the process's threads, as you said). Why should you free it? Not so much to give space to other processes (at least on modern OSes, where a process's memory is reclaimed by the OS when the process dies), but to prevent heap exhaustion within your own process: in C, if you don't deallocate the heap regions you used, they will always be considered busy, even when they are no longer used. Thus, to prevent undesired errors, it's good practice to free memory on the heap as soon as you no longer need it.
In a C program, the command-line arguments are stored on the stack, as they are function arguments of main. What happens is that the stack is usually allocated in the highest portion of a process's memory, which is mapped to the high addresses (this is probably why some sources point out what you wrote). But, generally speaking, there isn't any sixth memory area.
As the others said, the text area can be shared between processes. This area usually contains the binary code, which is the same for different processes running the same binary. For performance reasons, the OS may share this memory area (think, for example, of when you fork a child process).
The heap is shared with other processes in the sense that all processes use RAM: the more of it you use, the less is available to other programs. Heap sharing with the other threads in your own program means that all your threads actually see and access the same heap (same virtual address space, same actual RAM and, with some luck, the same cache).
No.
text can be shared with other processes. These days it is marked read-only, so having several processes share the text makes sense. In practice this means that if you are already running top and you start another instance, it makes no sense to load the text part again; that would waste time and physical RAM. If the OS is smart enough, it can map those RAM pages into the virtual address spaces of both top instances, saving time and space.
On the formal side:
The terms thread, process, text section, data section, bss, heap and stack are not even defined by the C language standard, and every platform is free to implement these components however it likes.
Threads and processes are typically implemented at the operating-system layer, while all the different memory sections are typically implemented at the compiler layer.
On the practical side:
For every given process, all these memory sections (text section, data section, bss, heap and stack) are shared by all the threads of that process.
Hence, it is the programmer's responsibility to ensure mutual exclusion when accessing these memory sections from different threads.
Typically, this is achieved via synchronization utilities such as semaphores, mutexes and message queues.
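As a minimal illustration of that responsibility, here is a sketch in C where two threads update the same heap-allocated counter, serialized by a mutex:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long *shared_counter;   /* lives on the shared heap */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        (*shared_counter)++;   /* safe: one thread at a time */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    shared_counter = calloc(1, sizeof *shared_counter);
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", *shared_counter);   /* always 200000 */
    free(shared_counter);
    return 0;
}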
Between processes, it is the operating system's responsibility to ensure mutual exclusion.
Typically, this is achieved via the virtual-memory abstraction, where each process runs inside its own logical address space, and each logical address space is mapped to a different physical address space.
Disclaimer: some would claim that each thread has its own stack, but technically speaking those stacks are usually just separate regions carved out of the same process address space, and there is usually nothing to prevent a thread from accessing the stacks of other threads, whether intentionally or by mistake (for example, via a stack overflow).

heap overflow affecting other programs

I was trying to create the condition for malloc to return a NULL pointer. In the program below, though I can see malloc returning NULL, once the program is forcibly terminated, all other programs become slow and finally I had to reboot the system. So my question is whether the memory for the heap is shared with other programs. If not, other programs should not have been affected. Is the OS not allocating a certain amount of memory at the time of execution? I am using Windows 10 and MinGW.
#include <stdio.h>
#include <stdlib.h>

void mallocInFunction(void)
{
    int *ptr = malloc(500);   /* never freed: leaks on every call */
    if (ptr == NULL)
    {
        printf("Memory could not be allocated\n");
    }
    else
    {
        printf("Allocated memory successfully\n");
    }
}

int main(void)
{
    while (1)   /* keep allocating until malloc fails */
    {
        mallocInFunction();
    }
    return 0;
}
So my question is whether the memory for the heap is shared with other programs?
Physical memory (RAM) is a resource that is shared by all processes. The operating system makes decisions about how much RAM to allocate to each process and adjusts that over time.
If not, other programs should not have been affected. Is the OS not allocating a certain amount of memory at the time of execution?
At the time the program starts executing, the operating system has no idea how much memory the program will want or need. Instead, it deals with allocations as they happen. Unless configured otherwise, it will typically do everything it possibly can to allow the program's allocation to succeed because presumably there's a reason the program is doing what it's doing and the operating system won't try to second guess it.
... whether the memory for the heap is shared with other programs?
Well, the C standard doesn't exactly require a heap, but in the context of a task-switching, multi-user and multi-threaded OS, of course memory is shared between processes! The C standard doesn't require any of this, but this is all pretty common stuff:
CPU cache memory tends to be preferred for code that's executed often, though this might get swapped around quite a bit; that may or may not be swapped to a heap.
Task switching causes registers to be swapped to other forms of memory; that may or may not be swapped to a heap.
Entire pages are swapped to and from disk so that other programs can make use of them when the OS switches execution away from your program to other programs, and back when it's your program's turn to execute again, among other reasons. This may or may not involve manipulating the heap.
FWIW, you're referring to memory that has allocated storage duration. It's best to avoid using terms like heap and stack, as they're virtually meaningless. The memory you're referring to is on a silicon chip, regardless of whether it uses a heap or a stack.
... Is the OS not allocating a certain amount of memory at the time of execution?
Speaking of silicon chips and execution, your OS likely only has control of one processor (a silicon chip which contains some logic circuits and memory, among other things I'm sure) with which to execute many programs! To summarise this post, yes, your program is most likely sharing those silicon chips with other programs!
On a tangential note, I don't think heap overflow means what you think it means.
Your question cannot be answered in the context of C, the language. For C, there's no such thing as a heap, a process, ...
But it can be answered in the context of operating systems, even somewhat generically, because many modern multitasking OSes do similar things.
Given a modern multitasking OS, it will use virtual address spaces for each process. The OS manages a fixed amount of physical RAM and divides it into pages; when a process needs memory, such pages are mapped into the process's virtual address space (typically at a different virtual address than the physical one). When all memory pages are claimed by the OS itself and by the running processes, the OS will typically save some pages that are not in active use to disk, in a swap area, in order to serve a fresh page to the next process requesting one. But when the original page is touched (and this is typically the case with free(), see below), it must first be loaded from disk again; and to have a free page for this, another page must first be saved to swap space.
This is, like all disk I/O, slow, and it's probably what you see happening here.
Now, to fully understand this: what does malloc() do? It typically asks the operating system to increase the process's own memory (if necessary, the OS does this by mapping in another page), and it uses this new memory to write some bookkeeping information about the block requested (so free() can work correctly later), ultimately returning a pointer to a block that's free for the program to use. free() uses the information written by malloc() and modifies it to indicate that the block is free again. It typically can't give any memory back to the OS, because there are other malloc()'d blocks in the same page. It will give memory back when possible, but that's the exception in a typical scenario where dynamic allocations are heavily used.
So, the answer to your question is: Yes, the RAM is shared because there is only one set of physical RAM. The OS does the best it can to hide that fact and virtualize RAM, but if a process consumes all that is there, this will have visible effects.
malloc() is not a system call but a libc library function. When a program asks for memory via malloc(), the library uses the system calls brk()/sbrk() or mmap() to allocate page(s); more details here.
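As an illustration of what such an allocator does under the hood, here is a minimal sketch (Linux-specific) that requests an anonymous page directly from the kernel with mmap(), the same system call malloc() falls back to for large blocks:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;   /* one page on most systems */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    strcpy(p, "backed by a fresh kernel page");
    puts(p);
    munmap(p, len);   /* hand the page back to the kernel */
    return 0;
}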
Please keep in mind that the memory you get is all virtual in nature: even if you have 3GB of physical RAM, you can allocate far more than that. This happens via a concept called 'paging', where the system moves data between secondary storage (HDD/SSD) and main memory (RAM); more details here.
With this machinery, running out of memory is usually quite rare, but for a program like the one above, which deliberately probes the system's limits, it can happen. This is nicely explained here.
Now, why do the other programs hang or slow down? Because they all share the same operating system, and the system is starving for resources. In fact, at some point the system may crash and reboot.
Hope this helps.

jemalloc, mmap and shared memory?

Can jemalloc be modified to allocate from shared memory? The FreeBSD function dallocx() implies you can provide a pointer to use for allocation, but I don't see an obvious way to tell jemalloc to restrict all allocations to that memory (nor to set a size, etc.).
The dallocx() function causes the memory referenced by ptr to be made available for future allocations.
If not, what is the level of effort for such a feature? I'm struggling to find an off-the-shelf allocation scheme that can allocate from a shared memory section that I provided.
Similarly, can jemalloc be configured to allocate from a locked region of memory to prevent swapping?
Feel free to point me to relevant code sections that require modification and provide any ideas or suggestions.
The idea I am exploring is this: since you can create arenas/heaps for allocating in a threaded environment, as jemalloc does to minimize contention, the concept seems to scale to allocating regions of shared memory in a multiprocessing environment. That is, I create N regions of shared memory using mmap(), and I want to leverage the power of jemalloc (or any allocation scheme) to allocate from one of those shared regions as efficiently as possible, with minimum contention. If threads/processes are not accessing the same shared regions and arenas, the chance of contention is minimal and the malloc operation is faster.
This is different from a global pool allocator with a malloc()-style API, since those usually require a global lock that effectively serializes user space. I'd like to avoid this.
edit 2:
Ideally an api like this:
// init the alloc context to two shmem pools
ctx1 = alloc_init(shm_region1_ptr);
ctx2 = alloc_init(shm_region2_ptr);
(... bunch of code determines pool 2 should be used, based on some method
of pool selection which can minimize possibility of lock contention
with other processes allocating shmem buffers)
// allocate from pool2
ptr = malloc(ctx2, size)
Yes. But this was not true when you asked the question.
Jemalloc 4 (released in August of 2015) has a couple of mallctl namespaces that would be useful for this purpose; they allow you to specify per-arena, application-specific chunk allocation hooks. In particular, the arena.<i>.chunk_hooks namespace and the arenas.extend mallctl options are of use. An integration test exists that demonstrates how to consume this API.
Regarding the rationale, I would expect that the effective "messaging" overhead required to understand where contention on any particular memory segment lies would be similar to the overhead of just contending, since you're going to degrade into contending on a cache line to accurately update the "contention" value of a particular arena.
Since jemalloc already employs a number of techniques to reduce contention, you could get similar behavior in a highly threaded environment by creating additional arenas with opt.narenas. This reduces contention because fewer threads map to each arena, but since threads are effectively round-robined across arenas, you may still end up with hot-spots.
To get around this, you could do your contention counting and hotspot detection, and simply use the thread.arena mallctl interface to switch a thread onto an arena with less contention.
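For illustration, a minimal sketch of those mallctl calls (this assumes jemalloc 4+ built without a symbol prefix; exact option names and types can vary between versions; link with -ljemalloc):
#include <stdio.h>
#include <stdlib.h>
#include <jemalloc/jemalloc.h>

int main(void)
{
    unsigned narenas;
    size_t sz = sizeof(narenas);

    /* Ask how many arenas jemalloc created. */
    if (mallctl("arenas.narenas", &narenas, &sz, NULL, 0) == 0)
        printf("arenas: %u\n", narenas);

    /* Pin the calling thread to arena 0, e.g. after detecting that its
     * current arena is contended. */
    unsigned arena = 0;
    if (mallctl("thread.arena", NULL, NULL, &arena, sizeof(arena)) != 0)
        fprintf(stderr, "could not switch arena\n");

    void *p = malloc(64);   /* now served from the chosen arena */
    free(p);
    return 0;
}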

process allocated memory blocks from kernel

I need a reliable measurement of the allocated memory in a Linux process. I've been looking into mallinfo, but I've read that it is deprecated. What is the state-of-the-art alternative for this sort of statistic?
Basically, I'm interested in at least two numbers:
the number (and size) of memory blocks/pages allocated from the kernel by malloc or whatever implementation the C library of choice uses
(optional but still important) the amount of memory allocated by userspace code (via malloc, new, etc.) minus the memory deallocated by it (via free, delete, etc.)
One possibility is to override malloc calls with LD_PRELOAD, but that might introduce unwanted overhead at runtime, and it might not interact properly with other libraries I'm using that also rely on LD_PRELOAD tricks.
Another possibility I've read about is rusage.
To be clear, this is NOT for debugging purposes; the memory usage is an intrinsic feature of the application (similar to Mathematica or Matlab, which display the amount of memory used, only more precise, at the block level).
For this purpose - a "memory usage" introspection feature within an application - the most appropriate interface is malloc_hook(3). These are GNU extensions that allow you to hook every malloc(), realloc() and free() call so you can maintain your own statistics.
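For illustration, here is a minimal sketch in the style of the glibc manual's hook example. Note these hooks are deprecated and were removed in glibc 2.34, and this naive version is not thread-safe:
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>

static void *(*old_malloc_hook)(size_t, const void *);
static size_t alloc_count;
static size_t alloc_bytes;

static void *counting_malloc_hook(size_t size, const void *caller)
{
    void *result;
    (void)caller;
    __malloc_hook = old_malloc_hook;       /* uninstall to avoid recursion */
    result = malloc(size);
    alloc_count++;
    alloc_bytes += size;
    old_malloc_hook = __malloc_hook;       /* malloc may have changed it */
    __malloc_hook = counting_malloc_hook;  /* reinstall our hook */
    return result;
}

int main(void)
{
    old_malloc_hook = __malloc_hook;
    __malloc_hook = counting_malloc_hook;
    free(malloc(100));                     /* a tracked allocation */
    __malloc_hook = old_malloc_hook;
    printf("%zu allocations, %zu bytes\n", alloc_count, alloc_bytes);
    return 0;
}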
To see how much memory is mapped by your application from the kernel's point of view, you can read and collate the information in the /proc/self/smaps pseudofile. This also lets you see how much of each allocation is resident, swapped, shared/private, clean/dirty etc.
/proc/PID/status contains a few useful pieces of information (try running cat /proc/$$/status for example).
VmPeak is the largest your process's virtual memory space ever became during its execution. This includes all pages mapped into your process, including executable pages, mmap'ed files, stack, and heap.
VmSize is the current size of your process's virtual memory space.
VmRSS is the Resident Set Size of your process; i.e., how much of it is taking up physical RAM right now. (A typical process will have lots of stuff mapped that it never uses, like most of the C library. If no processes need a page, eventually it will be evicted and become non-resident. RSS measures the pages that remain resident and are mapped into your process.)
VmHWM is the High Water Mark of VmRSS; i.e. the highest that number has been during the lifetime of the process.
VmData is the size of your process's "data" segment; i.e., roughly its heap usage. Note that small blocks that you have malloc'd and then freed will still be in use from the kernel's point of view; large blocks will actually be returned to the kernel when freed. (If memory serves, "large" means greater than 128k for current glibc.) This is probably the closest to what you are looking for.
These measurements are probably better than trying to track malloc and free, since they indicate what is "really going on" from a system-wide point of view. Just because you have called free() on some memory, that does not mean it has been returned to the system for other processes to use.
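If you want to read those numbers programmatically, a minimal sketch (Linux-specific) is to scan /proc/self/status for the fields described above:
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0 ||
            strncmp(line, "VmData:", 7) == 0)
            fputs(line, stdout);   /* e.g. "VmRSS:  1234 kB" */
    }
    fclose(f);
    return 0;
}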

Too many calls to mprotect

I am working on a parallel app (C, pthreads). I traced the system calls because at some point I was seeing bad parallel performance. The traces showed that my program calls mprotect() many, many times... enough to significantly slow it down.
I do allocate a lot of memory (with malloc()), but there is only a reasonable number of calls to brk() to increase the heap size. So why so many calls to mprotect()?!
Are you creating and destroying lots of threads?
Most pthread implementations add a "guard page" when allocating a thread's stack. It's an access-protected memory page used to detect stack overflows. I'd expect at least one call to mprotect() each time a thread is created or terminated, to (un)protect the guard page. If this is the case, there are several obvious strategies:
Set the guard page size to zero using pthread_attr_setguardsize() before creating threads (see the sketch below).
Use a thread pool (of, say, as many threads as processors). Once a thread is done with a task, return it to the pool to get a new task rather than terminating it and creating a new one.
Another explanation might be that you're on a platform where a thread's stack is grown on demand when overflow is detected. I don't think this is implemented on Linux with GCC/glibc yet, but there have been some proposals along these lines recently. If you use a lot of stack space while processing, you might explicitly increase the initial/minimum stack size using pthread_attr_setstacksize().
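For illustration, a minimal sketch combining the two attribute calls mentioned above (use with care: without a guard page, a stack overflow goes undetected):
#include <pthread.h>
#include <stdio.h>

static void *task(void *arg)
{
    (void)arg;
    return NULL;   /* real work would go here */
}

int main(void)
{
    pthread_t t;
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    pthread_attr_setguardsize(&attr, 0);             /* no guard page */
    pthread_attr_setstacksize(&attr, 1024 * 1024);   /* 1 MiB stack */

    if (pthread_create(&t, &attr, task, NULL) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}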
Or it might be something else entirely!
If you can, run your program under a debug libc and break on mprotect(). Look at the call stack, see what your code is doing that's leading to the mprotect() calls.
The glibc library, whose malloc() is based on ptmalloc2, uses mprotect() internally to micromanage the heap for threads other than the main thread (the main thread's heap grows via sbrk() instead). If a heap area seems contended, malloc() first allocates a large chunk of memory with mmap() for the thread, then uses mprotect() to change the protection bits so that only the portion actually needed is accessible. Later, when it needs to grow the heap, it changes the protection of more of the region to read/write with mprotect() again. Those mprotect() calls account for heap growth and shrinkage in multithreaded applications.
http://www.blackhat.com/presentations/bh-usa-07/Ferguson/Whitepaper/bh-usa-07-ferguson-WP.pdf
explains this in more detail.
The 'valgrind' suite has a tool called 'callgrind' that will tell you what is calling what. If you run the application under 'callgrind', you can then view the resulting profile data with 'kcachegrind' (it can analyze profiles made by 'cachegrind' or 'callgrind'). Then just double-click on 'mprotect' in the left pane and it will show you what code is calling it and how many times.
