How does malloc work in a multithreaded environment?

Does the typical malloc (for x86-64 platform and Linux OS) naively lock a mutex at the beginning and release it when done, or does it lock a mutex in a more clever way at a finer level, so that lock contention is reduced for concurrent calls? If it indeed does it the second way, how does it do it?

glibc 2.15 operates multiple allocation arenas. Each arena has its own lock. When a thread needs to allocate memory, malloc() picks an arena, locks it, and allocates memory from it.
The mechanism for choosing an arena is somewhat elaborate and is aimed at reducing lock contention:
/* arena_get() acquires an arena and locks the corresponding mutex.
First, try the one last locked successfully by this thread. (This
is the common case and handled with a macro for speed.) Then, loop
once over the circularly linked list of arenas. If no arena is
readily available, create a new one. In this latter case, `size'
is just a hint as to how much memory will be required immediately
in the new arena. */
With this in mind, malloc() basically looks like this (edited for brevity):
mstate ar_ptr;
void *victim;

arena_lookup(ar_ptr);
arena_lock(ar_ptr, bytes);
if (!ar_ptr)
    return 0;

victim = _int_malloc(ar_ptr, bytes);
if (!victim) {
    /* Maybe the failure is due to running out of mmapped areas. */
    if (ar_ptr != &main_arena) {
        (void)mutex_unlock(&ar_ptr->mutex);
        ar_ptr = &main_arena;
        (void)mutex_lock(&ar_ptr->mutex);
        victim = _int_malloc(ar_ptr, bytes);
        (void)mutex_unlock(&ar_ptr->mutex);
    } else {
        /* ... or sbrk() has failed and there is still a chance to mmap() */
        ar_ptr = arena_get2(ar_ptr->next ? ar_ptr : 0, bytes);
        (void)mutex_unlock(&main_arena.mutex);
        if (ar_ptr) {
            victim = _int_malloc(ar_ptr, bytes);
            (void)mutex_unlock(&ar_ptr->mutex);
        }
    }
} else
    (void)mutex_unlock(&ar_ptr->mutex);
return victim;
This allocator is called ptmalloc. It is based on earlier work by Doug Lea, and is maintained by Wolfram Gloger.

Doug Lea's malloc used coarse locking (or no locking, depending on the configuration settings), where every call to malloc/realloc/free is protected by a global mutex. This is safe but can be inefficient in highly multithreaded environments.
ptmalloc2, on which the default malloc implementation in the GNU C library (glibc) used on most Linux systems is based, has a more fine-grained strategy, as described in aix's answer, which allows multiple threads to concurrently allocate memory safely.
nedmalloc is another independent implementation that claims even better multithreaded performance than ptmalloc and various other allocators. There doesn't seem to be any obvious documentation of its design, so you'll have to check the source code to see how it works.

In addition to ptmalloc mentioned by @NPE, there is also tcmalloc, which is offered by Google and, under certain scenarios, has been measured to run slightly faster than ptmalloc (e.g. performing malloc() 10^6 times and then freeing the results).
It uses a global heap plus per-thread heaps, so the global heap can feed every thread. Only the global heap is protected by a mutex; the per-thread heaps are not. Each thread pulls storage from the global heap, and when it frees that storage, it goes to the thread's own per-thread heap.
Each per-thread heap is owned by an individual thread. Only when things become imbalanced does a thread move storage back to the global heap.
The code is here: https://github.com/google/tcmalloc
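To make the idea concrete, here is a minimal sketch of that scheme (not tcmalloc's actual code; the names and the fixed block size are made up for illustration): freed blocks go to a per-thread cache that needs no lock, and only the shared global pool takes a mutex.

#include <pthread.h>
#include <stdlib.h>

/* Illustrative per-thread cache in front of a mutex-protected global pool.
   A single fixed block size keeps the sketch short. */
enum { BLOCK_SIZE = 64 };

struct block { struct block *next; };

static struct block *global_pool;                  /* shared free list */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread struct block *thread_cache;        /* per-thread, lock-free */

void *cache_alloc(void) {
    if (thread_cache) {                            /* fast path: no lock */
        struct block *b = thread_cache;
        thread_cache = b->next;
        return b;
    }
    pthread_mutex_lock(&global_lock);              /* slow path: global pool */
    struct block *b = global_pool;
    if (b)
        global_pool = b->next;
    pthread_mutex_unlock(&global_lock);
    return b ? (void *)b : malloc(BLOCK_SIZE);     /* last resort: real malloc */
}

void cache_free(void *p) {
    struct block *b = p;                           /* freed memory holds the link */
    b->next = thread_cache;
    thread_cache = b;
}

A real implementation like tcmalloc also moves batches of blocks between the thread caches and the global pool when a cache grows too large, which is the "imbalanced" case mentioned above.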

Related

Does terminating a program reclaim memory in the same way as free()?

I saw this answer to a Stack Overflow question that says that freeing memory at the very end of a C program is actually harmful because it moves variables that wouldn't be used again into system memory.
I'm confused why the free() method in C would do anything different than the operating system reclaiming the heap at the end of the program.
Does anyone know if there is a real difference between free() and termination in terms of memory management and if so how the operating system may treat these two differently?
e.g.
would anything different happen between these two short programs?
#include <stdlib.h>

int main(void) {
    int *mem = malloc(1);
    return 0;
}

#include <stdlib.h>

int main(void) {
    int *mem = malloc(1);
    free(mem);
    return 0;
}
No, terminating a program, as with exit or abort, does not reclaim memory in the same way as free. Using free causes some activity that ultimately has no effect when the operating system discards the data maintained by malloc and free.
exit has some complications, as it does not immediately terminate the program. For now, let’s just consider the effect of immediately terminating the program and consider the complications later.
In a general-purpose multi-user operating system, when a process is terminated, the operating system releases the memory it was using to make it available for other purposes.1 In large part, this simply means the operating system does some accounting operations.
In contrast, when you call free, software inside the program runs, and it has to look up the size of the memory you are freeing and then insert information about that memory into the pool of memory it is maintaining. There could be thousands or tens of thousands (or more) of such allocations. A program that frees all its data may have to execute many thousands of calls to free. Yet, in the end, when the program exits, all of the changes produced by free will vanish, as the operating system will discard all the data about that pool of memory—all of the data is in memory pages the operating system does not preserve.
So, in this regard, the answer you link to is correct: calling free is a waste. And, as it points out, the necessity of going through all the data structures in the program to fetch the pointers in them so the memory they point to can be freed causes all those data structures to be read into memory if they had been swapped out to disk. For large programs, that can take a considerable amount of time and other resources.
On the other hand, it is not clear it is easy to avoid many calls to free. This is because releasing memory is not the only thing a terminating program has to clean up. A program may want to write final data to files or send final messages to network connections. Furthermore, a program may not have established all of this context directly. Most large programs rely on layers of software, and each software package may have set up its own context, and often no way is provided to tell other software “I want to exit now. Finish the valuable work, but skip all the freeing of memory.” So all the desired clean-up tasks may be intertwined with the free-memory tasks, and there may be no good way to untangle them.
Software should generally be written so that nothing terrible happens if a program is suddenly aborted (since this can happen from a loss of power, not just deliberate user action). But even though a program might be able to tolerate an abort, there can still be value in a graceful exit.
Getting back to exit, calling the C exit routine does not exit the program immediately. Exit handlers (registered with atexit) are called, stream buffers are flushed, and streams are closed. Any software libraries you called may have set up their own exit handlers so that they can finish up when the program is exiting. So, if you want to be sure libraries you have used in your program are not calling free when you end the program, you have to call abort, not exit. But it is generally preferred to end a program gracefully, not by aborting. Calling abort will not call exit handlers, flush streams, close streams, or perform other wind-down code that exit does—data can be lost when a program calls abort.
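A minimal illustration of that difference (the handler and the message are made up for the example):

#include <stdio.h>
#include <stdlib.h>

/* exit() runs this handler and flushes stdio; abort() would skip both,
   so the buffered output could be lost. */
static void cleanup(void) {
    fputs("exit handler ran\n", stderr);
}

int main(void) {
    atexit(cleanup);
    printf("buffered output");  /* no newline: sits in the stdio buffer */
    exit(0);                    /* runs the handler and flushes the buffer */
    /* abort(); */              /* would terminate without doing either */
}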
Footnote
1 Releasing memory does not mean it is immediately available for other purposes. The specific result of this depends on each page of memory. For example:
If the memory is shared with other processes, it is still needed for them, so releasing it from use by this process only decrements the number of processes using the memory. It is not immediately available for any other use.
If the memory is not in use by any other processes but contains data mapped from a file on disk, the operating system might mark it as available when needed but leave it alone for the moment. This is because you might run the same program again, and it would be nice if the data were still in memory, so why not just leave it in place just in case? The data might even be used by a different program that uses the same file. (For example, many programs might use the same shared library.)
If the memory is not in use by any other processes and was just used by the program as a work area, not mapped from a file, then the system may mark it as immediately available and not containing anything useful.
would anything different happen between these two short programs?
The simple answer is: it makes no difference; the memory is released to the system in both cases. Calling free() is not strictly necessary and incurs an infinitesimal overhead, but it may prove useful when trying to track memory leaks in more complex programs.
Does terminating a program reclaim memory in the same way as free?
Not exactly:
Terminating a program releases the memory used by the program, be it for the program code, data, stack or heap. It also releases some other resources such as file handles, device handles, network sockets... All this is done efficiently, no matter how many blocks of memory have been allocated with malloc().
Conversely, free() makes the block of memory available for further use by the program for later calls to malloc() or realloc(). Depending on its size and the implementation of the heap, this freed block may or may not be returned to the OS for use by other programs. Also worth noting is the fragmentation problem, where small blocks of freed memory may not be usable for a larger allocation because they are surrounded by allocated blocks. The C heap does not perform packing or de-fragmentation; it merely coalesces adjacent free blocks. Freeing all allocated blocks before leaving the program may be useful for debugging purposes, but may be complicated and time consuming, while not necessary for the memory to be reused by the system after the program terminates.
free() is a user-level memory management function, and its behavior depends on the malloc implementation you are currently using. The user-level allocator might maintain a linked list of memory chunks, where malloc takes a chunk of the appropriate size off the list and free puts it back.
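For instance, a toy allocator along those lines might look like this (a sketch only, not any real libc's implementation; it never returns memory to the OS and does no coalescing):

#include <stddef.h>
#include <unistd.h>

/* Each chunk has a size header; free chunks are kept on a singly linked
   list, with the link stored in the freed memory itself. */
struct chunk {
    size_t size;
    struct chunk *next;           /* valid only while on the free list */
};

static struct chunk *free_list;

void *toy_malloc(size_t size) {
    struct chunk **pp;
    /* First fit: reuse the first free chunk that is large enough. */
    for (pp = &free_list; *pp; pp = &(*pp)->next) {
        if ((*pp)->size >= size) {
            struct chunk *c = *pp;
            *pp = c->next;
            return c + 1;         /* user memory starts after the header */
        }
    }
    /* Nothing suitable: grow the heap. */
    struct chunk *c = sbrk(sizeof *c + size);
    if (c == (void *)-1)
        return NULL;
    c->size = size;
    return c + 1;
}

void toy_free(void *p) {
    if (!p)
        return;
    struct chunk *c = (struct chunk *)p - 1;
    c->next = free_list;          /* push the chunk back onto the free list */
    free_list = c;
}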
exit() destroys the address space and all its regions.
This applies to the malloc'ed heap as well as to other regions and the in-kernel data structures used for managing the address space of the process:
Each address space consists of a number of page-aligned regions
of memory that are in use. They never overlap and represent a set
of addresses which contain pages that are related to each other in
terms of protection and purpose. These regions are represented by
a struct vm_area_struct and are roughly analogous to the
vm_map_entry struct in BSD. For clarity, a region may represent the
process heap for use with malloc(), a memory mapped file such as
a shared library or a block of anonymous memory allocated with
mmap(). The pages for this region may still have to be allocated, be
active and resident, or have been paged out.
Reference: https://www.kernel.org/doc/gorman/html/understand/understand007.html
The reason well-designed programs free memory at exit is to check for memory leaks. If your application-level memory allocation does not go to zero after your last deallocation, you know that you have memory that is not being managed properly, and probably have a memory leak in your code.
would anything different happen between these two short programs?
YES
I'm confused why the free() method in C would do anything different than the operating system reclaiming the heap at the end of the program.
The operating system allocates memory in pages. Heap managers (such as malloc/free implementations) allocate pages from the operating system and subdivide the pages into smaller allocations. Calls to free() normally return memory to the heap. They do not return the pages to the operating system.
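On glibc you can also explicitly ask the allocator to give free pages back to the OS (this is glibc-specific, not portable):

#include <malloc.h>   /* glibc-specific header */
#include <stdlib.h>

int main(void) {
    void *p = malloc(1000);
    free(p);          /* returns the block to the heap, not to the OS */
    malloc_trim(0);   /* glibc: releases free memory at the top of the heap
                         (and whole unused pages elsewhere) back to the OS */
    return 0;
}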

How can I get a guarantee that when memory is freed, the OS will reclaim that memory for its use?

I noticed that this program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main() {
    const size_t alloc_size = 1*1024*1024;
    for (size_t i = 0; i < 3; i++) {
        printf("1\n");
        usleep(1000*1000);
        void *p[3];
        for (size_t j = 3; j--; )
            memset(p[j] = malloc(alloc_size), 0, alloc_size); // memset to de-virtualize the memory
        usleep(1000*1000);
        printf("2\n");
        free(p[i]);
        p[i] = NULL;
        usleep(1000*1000*4);
        printf("3\n");
        for (size_t j = 3; j--; )
            free(p[j]);
    }
}
which allocates three blocks of memory, three times over, and each time frees a different block, releases the memory according to watch free -m. That means the OS reclaimed the memory on every free, regardless of the block's position inside the program's address space. Can I somehow get a guarantee of this effect? Or is there already something like that (like a rule for >64KB allocations)?
The short answer is: In general, you cannot guarantee that the OS will reclaim the freed memory, but there may be an OS specific way to do it or a better way to ensure such behavior.
The long answer:
Your code has undefined behavior: there is an extra free(p[i]); after the printf("2\n"); which accesses beyond the end of the p array.
You allocate large blocks (1 MB) for which your library makes individual system calls (for example mmap in linux systems), and free releases these blocks to the OS, hence the observed behavior.
Various OSes are likely to implement such behavior above some system-specific threshold (typically 128 KB), but the C standard gives no guarantee about this, so relying on such behavior is system specific.
Read the manual page for malloc() on your system to see if this behavior can be controlled. For example, the GNU C library on Linux honors the MALLOC_MMAP_THRESHOLD_ environment variable (and mallopt(M_MMAP_THRESHOLD, ...)) to override the default setting for this threshold.
If you program to a POSIX target, you might want to use mmap() directly instead of malloc() to guarantee that the memory is returned to the system once deallocated with munmap(). Note that the block returned by mmap() will have been initialized to all bits zero before the first access, so you may skip explicit initialization to take advantage of on-demand paging, or perform explicit initialization to ensure the memory is mapped, to try and minimize latency in later operations.
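A minimal sketch of that approach (MAP_ANONYMOUS is technically an extension to POSIX, but is available on Linux and the BSDs):

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1024 * 1024;
    /* Anonymous private mapping: pages read as zero on first access. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* ... use the memory ... */
    /* munmap() returns the pages to the OS immediately; note that you must
       remember the length as well as the pointer. */
    munmap(p, len);
    return 0;
}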
On the OSes I know, and especially on linux:
no, you cannot guarantee reuse. Why would you want that? Reuse only happens when someone needs more memory pages, and Linux will then have to pick pages that aren't currently mapped to a process; if these run out, you'll get into swapping. And: you can't make your OS do something that is none of your processes' business. How it internally manages memory allocations is none of the freeing process' business. In fact, that's security-wise a good thing.
What you can do is not only freeing the memory (which might leave it allocated to your process, handled by your libc, for later mallocs), but actually giving it back (man sbrk, man munmap, have fun). That's not something you'd usually do.
Also: this is yet another instance of "help, Linux ate my RAM"... you're misinterpreting what free(1) tells you.
For glibc malloc(), read the man 3 malloc man page.
In short, smaller allocations use memory provided by sbrk() to extend the data segment; this is not returned to the OS. Larger allocations (typically 128 KiB or more; on glibc you can change the limit via M_MMAP_THRESHOLD) use mmap() to allocate anonymous memory pages (which also hold the memory-allocation bookkeeping for those blocks), and when freed, these are usually returned to the OS immediately.
The only case where you should worry about the process returning memory to the OS in a timely manner is a long-running process that temporarily makes a very large allocation, running on an embedded or otherwise memory-constrained device. Why? Because this stuff has been done in C successfully for decades, and the C library and the OS kernel handle these cases just fine. It just isn't a practical problem in normal circumstances; you only need to worry about it in very specific circumstances.
I personally do routinely use mmap(2) in Linux to map pages for huge data sets. Here, "huge" means "too large to fit in RAM and swap".
The most common case is when I have a truly huge binary data set. Then I create a (sparse) backing file of suitable size and memory-map that file. Years ago, in another forum, I showed an example of how to do this with a terabyte data set -- yes, 1,099,511,627,776 bytes -- of which only 250 megabytes or so was actually manipulated in that example, to keep the data file small. The key in this approach is to use MAP_SHARED | MAP_NORESERVE to ensure the kernel does not use swap memory for this dataset (because it would be insufficient, and fail), but uses the file backing directly. We can use madvise() to inform the kernel of our probable access patterns as an optimization, but in most cases it does not have that big an effect (as the kernel heuristics do a pretty good job anyway). We can also use msync() to ensure certain parts are written to storage. (This has certain effects on other processes that read the file backing the mapping, especially depending on whether they read it normally or use options like O_DIRECT, and, if shared over NFS or similar, on processes reading the file remotely. It all gets quite complicated very quickly.)
If you do decide to use mmap() to acquire anonymous memory pages, do note that you need to keep track of both the pointer and the length (length being a multiple of page size, sysconf(_SC_PAGESIZE)), so that you can release the mapping later using munmap(). Obviously, this is then completely separate from normal memory allocation (malloc(), calloc(), free()); but unless you try to use specific addresses, the two will not interfere with each other.
If you want memory to be reclaimed by the operating system, you need to use operating system services to allocate the memory (which will be allocated in pages). To deallocate it, you need to call the operating system services that remove pages from your process.
Unless you write your own malloc/free that does this, you are never going to be able to accomplish your goal with off-the-shelf library functions.

Malloc fails to allocate memory on atmega2561 and freeRTOS

I am trying to use the malloc() function to create nodes for a Linked List. The function in my case returns NULL on the following dummy segment of code.
I am running FreeRTOS on an ATmega2561.
if (!malloc(sizeof(struct Event)))
{
    // The code gets inside here
}
else
{
    // ...
}
The struct of the nodes is the following:
struct Event
{
    uint8_t shouldCarBrake;
    uint16_t tachoPoint;
    struct Event *next;
};
If the project is set up to use one of the example heap memory management files that come with FreeRTOS other than heap_3 (four of the five), then it might be that the heap supplied by the C libraries has a size of zero; of the five, only heap_3 uses malloc(). See http://www.freertos.org/a00111.html for more information.
The standard library malloc() is not normally thread-safe, so it should probably not be used in this instance. Some libraries provide stubs for mutex locks for the standard library, but if that is not the case, or you choose not to implement them, then malloc() itself should be wrapped in a function with mutex locks, or an alternative thread-safe heap management implementation should be used. FreeRTOS provides both of these options as described here.
The standard malloc() implementation is also non-deterministic - another reason not to use it in real-time critical code; you cannot determine how long the mutex will be locked for, so not only will the allocating thread be non-deterministic, but so will any threads waiting to allocate.
Even without FreeRTOS's support for various allocators, it is easy to implement a deterministic fixed-block memory allocator in a manner that is easily portable to any RTOS. You simply pre-allocate a pool (or pools) of memory blocks (either statically or from the standard heap before real-time scheduling starts), and stuff an RTOS queue with pointers to the blocks. Allocation is then simply a case taking a pointer from the queue, and deallocation by returning the pointer back to the queue.
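A sketch of that scheme using FreeRTOS queue primitives (the pool geometry and the function names are made up for illustration):

#include <stdint.h>
#include "FreeRTOS.h"
#include "queue.h"

#define BLOCK_SIZE   32
#define BLOCK_COUNT  16

static uint8_t pool[BLOCK_COUNT][BLOCK_SIZE];  /* statically allocated pool */
static QueueHandle_t free_blocks;              /* queue of free-block pointers */

void pool_init(void)                           /* call before the scheduler starts */
{
    free_blocks = xQueueCreate(BLOCK_COUNT, sizeof(void *));
    for (int i = 0; i < BLOCK_COUNT; i++) {
        void *p = pool[i];
        xQueueSend(free_blocks, &p, 0);
    }
}

void *pool_alloc(TickType_t timeout)
{
    void *p = NULL;
    /* Deterministic: waits at most `timeout` ticks if the pool is empty. */
    xQueueReceive(free_blocks, &p, timeout);
    return p;                                  /* NULL on timeout */
}

void pool_free(void *p)
{
    /* Never blocks: the queue can hold at most BLOCK_COUNT pointers. */
    xQueueSend(free_blocks, &p, 0);
}

Because the queue is the RTOS's own thread-safe primitive, allocation and deallocation are safe from any task, and the worst-case time is bounded.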

jemalloc, mmap and shared memory?

Can jemalloc be modified to allocate from shared memory? The FreeBSD function dallocx() implies you can provide a pointer to use for allocation, but I don't see an obvious way to tell jemalloc to restrict all allocations from that memory (nor set a size, etc).
The dallocx() function causes the memory referenced by ptr to be made available for future allocations.
If not, what is the level of effort for such a feature? I'm struggling to find an off-the-shelf allocation scheme that can allocate from a shared memory section that I provided.
Similarly, can jemalloc be configured to allocate from a locked region of memory to prevent swapping?
Feel free to point me to relevant code sections that require modification and provide any ideas or suggestions.
The idea I am exploring is this: since you can create arenas/heaps for allocating in a threaded environment, as jemalloc does to minimize contention, the concept seems scalable to allocating regions of shared memory in a multiprocessing environment. That is, I create N regions of shared memory using mmap(), and I want to leverage the power of jemalloc (or any allocation scheme) to allocate as efficiently as possible, with minimum contention, from one of those shared regions: if threads/processes are not accessing the same shared regions and arenas, the chance of contention is minimal and the speed of the malloc operation increases.
This is different from a global pool allocator behind the malloc() API, since those usually require a global lock, effectively serializing user space. I'd like to avoid this.
edit 2:
Ideally an api like this:
// init the alloc context to two shmem pools
ctx1 = alloc_init(shm_region1_ptr);
ctx2 = alloc_init(shm_region2_ptr);

// (... bunch of code determines pool 2 should be used, based on some
// method of pool selection which can minimize the possibility of lock
// contention with other processes allocating shmem buffers ...)

// allocate from pool 2
ptr = malloc(ctx2, size);
Yes. But this was not true when you asked the question.
Jemalloc 4 (released in August of 2015) has a couple of mallctl namespaces that would be useful for this purpose; they allow you to specify per-arena, application-specific chunk allocation hooks. In particular, the arena.<i>.chunk_hooks namespace and the arenas.extend mallctl options are of use. An integration test exists that demonstrates how to consume this API.
Regarding the rationale, I would expect that the effective "messaging" overhead required to understand where contention on any particular memory segment lies would be similar to the overhead of just contending, since you're going to degrade into contending on a cache line to accurately update the "contention" value of a particular arena.
Since jemalloc already employs a number of techniques to reduce contention, you could get similar behavior in a highly threaded environment by creating additional arenas with opt.narenas. This would reduce contention, as fewer threads would be mapped to each arena, but since threads are effectively round-robined onto arenas, it's possible you get hot-spots anyway.
To get around this, you could do your contention counting and hotspot detection, and simply use the thread.arena mallctl interface to switch a thread onto an arena with less contention.
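A sketch of that switching with jemalloc's mallctl() API (the target arena index is illustrative; error handling mostly omitted):

#include <stdio.h>
#include <jemalloc/jemalloc.h>

int main(void) {
    unsigned arena, target = 0;   /* in practice, pick a less-contended arena */
    size_t sz = sizeof(arena);

    /* Read the arena currently assigned to this thread. */
    if (mallctl("thread.arena", &arena, &sz, NULL, 0) == 0)
        printf("current arena: %u\n", arena);

    /* Rebind the calling thread to the target arena. */
    mallctl("thread.arena", NULL, NULL, &target, sizeof(target));
    return 0;
}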

Do threads have a distinct heap?

As far as I know each thread gets a distinct stack when the thread is created by the operating system. I wonder if each thread has a heap distinct to itself also?
No. All threads share a common heap.
Each thread has a private stack, which it can quickly add and remove items from. This makes stack based memory fast, but if you use too much stack memory, as occurs in infinite recursion, you will get a stack overflow.
Since all threads share the same heap, access to the allocator/deallocator must be synchronized. There are various methods and libraries for avoiding allocator contention.
Some languages allow you to create private pools of memory, or individual heaps, which you can assign to a single thread.
By default, C has only a single heap.
That said, some allocators that are thread-aware will partition the heap so that each thread has its own area to allocate from. The idea is that this should make the heap scale better.
One example of such a heap is Hoard.
Depends on the OS. The standard C runtime on Windows and Unices uses a shared heap across threads. This means locking every malloc/free.
On Symbian, for example, each thread comes with its own heap, although threads can share pointers to data allocated in any heap. Symbian's design is better in my opinion since it not only eliminates the need for locking during alloc/free, but also encourages clean specification of data ownership among threads. Also in that case when a thread dies, it takes all the objects it allocated along with it - i.e. it cannot leak objects that it has allocated, which is an important property to have in mobile devices with constrained memory.
Erlang also follows a similar design where a "process" acts as a unit of garbage collection. All data is communicated between processes by copying, except for binary blobs which are reference counted (I think).
Each thread has its own stack (which serves as its call stack).
All threads share the same heap.
It depends on what exactly you mean when saying "heap".
All threads share the address space, so heap-allocated objects are accessible from all threads. Technically, stacks are shared as well in this sense, i.e. nothing prevents you from accessing another thread's stack (though it would almost never make any sense to do so).
On the other hand, there are heap structures used to allocate memory. That is where all the bookkeeping for heap memory allocation is done. These structures are sophisticatedly organized to minimize contention between the threads - so some threads might share a heap structure (an arena), and some might use distinct arenas.
See the following thread for an excellent explanation of the details: How does malloc work in a multithreaded environment?
Typically, threads share the heap and other resources, however there are thread-like constructions that don't. Among these thread-like constructions are Erlang's lightweight processes, and UNIX's full-on processes (created with a call to fork()). You might also be working on multi-machine concurrency, in which case your inter-thread communication options are considerably more limited.
Generally speaking, all threads use the same address space and therefore usually have just one heap.
However, it can be a bit more complicated. You might be looking for Thread Local Storage (TLS), but it stores single values only.
Windows-Specific:
TLS-space can be allocated using TlsAlloc and freed using TlsFree (Overview here). Again, it's not a heap, just DWORDs.
Strangely, Windows supports multiple heaps per process. One can store a heap's handle in TLS and get something like a "thread-local heap". However, even though the handle is not known to the other threads, they can still access its memory using pointers, as it is all the same address space.
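A minimal sketch of that idea with the Win32 heap API (the function names are made up for illustration):

#include <windows.h>

static DWORD heap_tls;                 /* TLS slot holding each thread's heap handle */

void heaps_init(void)                  /* call once, before threads start */
{
    heap_tls = TlsAlloc();
}

void thread_attach(void)               /* call at the start of each thread */
{
    TlsSetValue(heap_tls, HeapCreate(0, 0, 0));   /* growable private heap */
}

void *thread_alloc(SIZE_T size)        /* allocates from the caller's heap */
{
    HANDLE h = (HANDLE)TlsGetValue(heap_tls);
    return HeapAlloc(h, 0, size);
}

void thread_free(void *p)              /* must run on the owning thread */
{
    HANDLE h = (HANDLE)TlsGetValue(heap_tls);
    HeapFree(h, 0, p);
}

Note the catch described above: a pointer handed to another thread still dereferences fine, but that thread cannot free it without access to the owning thread's heap handle.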
EDIT: Some memory allocators (specifically jemalloc on FreeBSD) use TLS to assign "arenas" to threads. This is done to optimize allocation for multiple cores by reducing synchronization overhead.
On the FreeRTOS operating system, tasks (threads) share the same heap, but each of them has its own stack. This comes in very handy when dealing with low-power, low-RAM architectures, because the same pool of memory can be accessed/shared by several threads. But it comes with a small catch: the developer needs to keep in mind that a mechanism for synchronizing malloc and free is needed, so some type of synchronization/lock (for example a semaphore or a mutex) must be used when allocating or freeing memory on the heap.
