jemalloc, mmap and shared memory? - c

Can jemalloc be modified to allocate from shared memory? The FreeBSD function dallocx() implies you can provide a pointer to use for allocation, but I don't see an obvious way to tell jemalloc to restrict all allocations to that memory (nor to set a size, etc.).
The dallocx() function causes the memory referenced by ptr to be made available for future allocations.
If not, what is the level of effort for such a feature? I'm struggling to find an off-the-shelf allocation scheme that can allocate from a shared memory section that I provided.
Similarly, can jemalloc be configured to allocate from a locked region of memory to prevent swapping?
Feel free to point me to relevant code sections that require modification and provide any ideas or suggestions.
The idea I am exploring is this: since you can create arenas/heaps for allocating in a threaded environment, as jemalloc does to minimize contention, the concept seems to scale to allocating regions of shared memory in a multiprocessing environment. That is, I create N regions of shared memory using mmap(), and I want to leverage the power of jemalloc (or any allocation scheme) to allocate as efficiently as possible, with minimum contention, from one of those shared regions; if threads/processes are not accessing the same shared regions and arenas, the chance of contention is minimal and the malloc operation is faster.
This is different from a global pool allocator behind the malloc() API, since such pools usually require a global lock that effectively serializes user space. I'd like to avoid this.
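For reference, the shared regions themselves would be created along these lines (a sketch using POSIX shm_open()/mmap(); the region names and sizes are only illustrative):

#include <stddef.h>
#include <sys/types.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create one named shared-memory region of `size` bytes and map it in. */
static void *create_shm_region(const char *name, size_t size)
{
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid after close() */
    return p == MAP_FAILED ? NULL : p;
}

/* usage (illustrative):
 *   void *region1 = create_shm_region("/shm_region1", 64u << 20);
 *   void *region2 = create_shm_region("/shm_region2", 64u << 20);
 */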
edit 2:
Ideally, an API like this:
// init the alloc context to two shmem pools
ctx1 = alloc_init(shm_region1_ptr);
ctx2 = alloc_init(shm_region2_ptr);
(... bunch of code determines pool 2 should be used, based on some method
of pool selection which can minimize possibility of lock contention
with other processes allocating shmem buffers)
// allocate from pool2
ptr = malloc(ctx2, size)

Yes. But this was not true when you asked the question.
Jemalloc 4 (released in August of 2015) has a couple of mallctl namespaces that would be useful for this purpose; they allow you to specify per-arena, application-specific chunk allocation hooks. In particular, the arena.<i>.chunk_hooks namespace and the arenas.extend mallctl options are of use. An integration test exists that demonstrates how to consume this API.
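For illustration, here is a rough sketch of gluing these pieces together against the jemalloc 4.x API. The shm_base/shm_size/shm_used globals, shm_chunk_alloc and make_shm_arena are hypothetical, and depending on how jemalloc was built the public functions may carry a je_ prefix:

#include <stdbool.h>
#include <stdio.h>
#include <jemalloc/jemalloc.h>

/* Hypothetical: a shared-memory region mapped elsewhere (e.g. with mmap()),
   assumed to be suitably aligned for jemalloc's chunk size. */
static char  *shm_base;
static size_t shm_size;
static size_t shm_used;   /* simple bump pointer; chunks are never recycled here */

static void *
shm_chunk_alloc(void *new_addr, size_t size, size_t alignment,
                bool *zero, bool *commit, unsigned arena_ind)
{
    size_t off = (shm_used + alignment - 1) & ~(alignment - 1);
    if (new_addr != NULL || off + size > shm_size)
        return NULL;               /* signal failure to jemalloc */
    shm_used = off + size;
    *zero = true;                  /* fresh MAP_SHARED pages read as zero */
    *commit = true;
    return shm_base + off;
}

unsigned make_shm_arena(void)
{
    unsigned arena_ind;
    size_t sz = sizeof(arena_ind);
    chunk_hooks_t hooks;
    size_t hsz = sizeof(hooks);
    char cmd[64];

    mallctl("arenas.extend", &arena_ind, &sz, NULL, 0);  /* create a new arena */
    snprintf(cmd, sizeof(cmd), "arena.%u.chunk_hooks", arena_ind);
    mallctl(cmd, &hooks, &hsz, NULL, 0);   /* read the default hooks */
    hooks.alloc = shm_chunk_alloc;         /* override only chunk allocation */
    mallctl(cmd, NULL, NULL, &hooks, sizeof(hooks));
    return arena_ind;
}

/* Allocations explicitly routed to that arena (bypassing the thread cache so
 * they really come from the shared region):
 *   void *p = mallocx(size, MALLOCX_ARENA(arena_ind) | MALLOCX_TCACHE_NONE);
 */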
Regarding the rationale, I would expect that the effective "messaging" overhead required to understand where contention on any particular memory segment lies would be similar to the overhead of just contending, since you're going to degrade into contending on a cache line to accurately update the "contention" value of a particular arena.
Since jemalloc already employs a number of techniques to reduce contention, you could get similar behavior in a highly threaded environment by creating additional arenas with opt.narenas. This would reduce contention because fewer threads would be mapped to each arena, but since threads are effectively round-robined onto arenas, you can still end up with hot spots.
To get around this, you could do your contention counting and hotspot detection, and simply use the thread.arena mallctl interface to switch a thread onto an arena with less contention.
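For example (a sketch; pick_least_contended_arena() stands in for whatever contention-counting heuristic you implement):

unsigned arena_ind = pick_least_contended_arena();   /* hypothetical heuristic */
/* Rebind the calling thread to that arena for its subsequent allocations. */
mallctl("thread.arena", NULL, NULL, &arena_ind, sizeof(arena_ind));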

Related

Looking for a custom memory allocator which allocates from within a large pre-allocated block of memory

I have a memory-heavy application which is supposed to run with low latency and at constant speed, but in practice it has poor performance during the first few seconds of startup. This appears to be because the initial memory accesses trigger page faults, which have significant performance implications.
I would like to try preallocating a single large block of memory, paging it all in (via mlock() or just by touching each byte), and then using a custom malloc()/free() implementation to ensure that all further allocations are done from within this block.
I am aware of numerous custom memory allocators (TCMalloc, Hoard, jemalloc, etc) but it is not clear to me whether they can be backed by user-provided memory, or whether they always perform their internal allocations from the OS. Does anyone have any insight or recommendations here?
To be clear, I am not looking for a memory pooling system (which would be for reusing small objects). The custom implementation of malloc()/free() should be able to perform any size allocation while limiting fragmentation of its backing store and following other best practices.
Edit based on comments: I do not expect to make the system faster - I just want to move the slow part (allocation, initial page faults) to the start of the process, and then do the real computation work once the system is 'primed'.
Thanks!
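For concreteness, the priming step I have in mind would look roughly like this (just a sketch; the 1 GiB size is illustrative and mlock() may require raising RLIMIT_MEMLOCK):

#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define POOL_SIZE ((size_t)1 << 30)     /* 1 GiB, illustrative */

/* Reserve, lock and touch the whole pool once, during startup. */
static void *prime_pool(void)
{
    void *pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (pool == MAP_FAILED)
        return NULL;
    if (mlock(pool, POOL_SIZE) != 0) {  /* mlock also faults the pages in */
        munmap(pool, POOL_SIZE);
        return NULL;
    }
    memset(pool, 0, POOL_SIZE);         /* belt and braces: touch every byte */
    return pool;
}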
A bit late to the party.
dlmalloc is one choice that can be backed by pre-allocated memory. You can find it here. You may just need to add some extra definitions at the beginning to force it to use your pre-allocated memory rather than call the system's mmap; refer to the nice documentation at the beginning of the file.
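Concretely, dlmalloc's mspace API can sit on top of a caller-supplied block. A minimal sketch (the backing array, its size, and the exact compile-time defines are illustrative; the header comment in malloc.c is the authoritative reference):

/* Compile dlmalloc's malloc.c with something like
 *   -DONLY_MSPACES=1 -DHAVE_MMAP=0 -DHAVE_MORECORE=0
 * so the allocator never falls back to the system for more memory. */
#include "malloc.h"                      /* dlmalloc's mspace declarations */

static char backing[64 << 20];           /* illustrative 64 MiB pre-allocated block */

int main(void)
{
    /* Build a private malloc space inside the pre-allocated block.
       The last argument enables internal locking; 0 is fine single-threaded. */
    mspace msp = create_mspace_with_base(backing, sizeof(backing), 0);

    void *p = mspace_malloc(msp, 4096);  /* ordinary variable-size allocations */
    mspace_free(msp, p);

    destroy_mspace(msp);
    return 0;
}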

How to constrain malloc to a specific region of memory

Is there anything that:
Allocates/deallocates/reallocates many variable sized objects without fragmentation (basically what malloc does) and
keeps track of all memory pages used for these allocations so that I can
later mprotect all of these pages to make them read-only
preferably without any locking - all access will be single-threaded
that works on Linux and OS X, preferably with something equivalent on Windows?
I can't think of a way of doing this with standard memory allocation functions. The only strategy that comes to mind is using a custom memory pool instead of malloc. So my question is: is there a way to do this without a custom malloc or (if there isn't) suggestions on what to use?
I could wrap malloc and keep track of all pages it has used pretty easily, but how do I guarantee that, once I have called mprotect on these pages, malloc doesn't try to use memory that is "caught" either before the start or after the end of an allocated block within one of the affected pages?
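To illustrate the page-ownership property I am after, even a trivial single-threaded pool like the sketch below would satisfy the tracking and mprotect requirements, though not the fragmentation one (the names and the fixed alignment are illustrative):

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* A bump pool owning whole pages: everything it hands out lives in pages it
   owns, so protecting [base, base + size) later cannot affect anything else. */
typedef struct {
    uint8_t *base;
    size_t   size;
    size_t   used;
} pool_t;

static int pool_init(pool_t *p, size_t size)
{
    p->base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p->base == MAP_FAILED)
        return -1;
    p->size = size;
    p->used = 0;
    return 0;
}

static void *pool_alloc(pool_t *p, size_t n)
{
    size_t off = (p->used + 15) & ~(size_t)15;   /* 16-byte alignment */
    if (off + n > p->size)
        return NULL;
    p->used = off + n;
    return p->base + off;
}

/* Once the data structure is fully built: */
static int pool_freeze(pool_t *p)
{
    return mprotect(p->base, p->size, PROT_READ);
}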
The open source Memory Pool System will allocate memory in operating-system page-sized chunks which the MPS does not itself touch. You can mprotect these pages if you like and be certain that they won't be touched by the allocator itself (which keeps all its data structures elsewhere) or by any other memory pool. If you use the MVT pool class you can also take advantage of inline lockless allocation. Linux, Mac OS X, and Windows are supported.
Disclaimer: I'm the architect of the MPS.

How does malloc work in a multithreaded environment?

Does the typical malloc (for x86-64 platform and Linux OS) naively lock a mutex at the beginning and release it when done, or does it lock a mutex in a more clever way at a finer level, so that lock contention is reduced for concurrent calls? If it indeed does it the second way, how does it do it?
glibc 2.15 operates multiple allocation arenas. Each arena has its own lock. When a thread needs to allocate memory, malloc() picks an arena, locks it, and allocates memory from it.
The mechanism for choosing an arena is somewhat elaborate and is aimed at reducing lock contention:
/* arena_get() acquires an arena and locks the corresponding mutex.
First, try the one last locked successfully by this thread. (This
is the common case and handled with a macro for speed.) Then, loop
once over the circularly linked list of arenas. If no arena is
readily available, create a new one. In this latter case, `size'
is just a hint as to how much memory will be required immediately
in the new arena. */
With this in mind, malloc() basically looks like this (edited for brevity):
  mstate ar_ptr;
  void *victim;

  arena_lookup(ar_ptr);
  arena_lock(ar_ptr, bytes);
  if(!ar_ptr)
    return 0;

  victim = _int_malloc(ar_ptr, bytes);
  if(!victim) {
    /* Maybe the failure is due to running out of mmapped areas. */
    if(ar_ptr != &main_arena) {
      (void)mutex_unlock(&ar_ptr->mutex);
      ar_ptr = &main_arena;
      (void)mutex_lock(&ar_ptr->mutex);
      victim = _int_malloc(ar_ptr, bytes);
      (void)mutex_unlock(&ar_ptr->mutex);
    } else {
      /* ... or sbrk() has failed and there is still a chance to mmap() */
      ar_ptr = arena_get2(ar_ptr->next ? ar_ptr : 0, bytes);
      (void)mutex_unlock(&main_arena.mutex);
      if(ar_ptr) {
        victim = _int_malloc(ar_ptr, bytes);
        (void)mutex_unlock(&ar_ptr->mutex);
      }
    }
  } else
    (void)mutex_unlock(&ar_ptr->mutex);

  return victim;
This allocator is called ptmalloc. It is based on earlier work by Doug Lea, and is maintained by Wolfram Gloger.
Doug Lea's malloc used coarse locking (or no locking, depending on the configuration settings), where every call to malloc/realloc/free is protected by a global mutex. This is safe but can be inefficient in highly multithreaded environments.
ptmalloc2, which is the default malloc implementation in the GNU C library (glibc) used on most Linux systems these days, has a more fine-grained locking strategy, as described in aix's answer, which allows multiple threads to concurrently allocate memory safely.
nedmalloc is another independent implementation which claims even better multithreaded performance than ptmalloc3 and various other allocators. I don't know how it works, and there doesn't seem to be any obvious documentation, so you'll have to check the source code.
In addition to ptmalloc mentioned by @NPE, there is also tcmalloc, which is offered by Google and which, in certain scenarios, has been measured to run slightly faster than ptmalloc (e.g. performing 10^6 malloc() calls and then freeing them).
It uses a global heap plus per-thread heaps. The global heap can be used by every thread and is the only one protected by a mutex; the per-thread heaps need no locking. Each thread pulls storage from the global heap as needed, and when it frees that storage it goes into its per-thread heap.
Each per-thread heap is owned by an individual thread; only if things become imbalanced does a thread move storage back to the global heap.
The code is here: https://github.com/google/tcmalloc

available memory in kernel

Is there a kernel function which returns the amount of kernel memory available (not vmalloc-related)?
First, let me say that if you're going to make any policy decisions (should I proceed with this operation?) based on this information, STOP. As WGW pointed out, there are unavoidable races here; memory can be used up between when you check and when you use it. Just test for errors on your memory allocations and have an appropriate failure path. Moreover, if you request memory when there isn't enough free memory, often the kernel can obtain more free memory by cleaning up various cache memory, swapping to disk, freeing slabs, etc. And kernel memory fragmentation can fail large (multiple page) allocations when not made through vmalloc even with plenty of memory free.
That said, there are APIs for querying kernel memory availability. You should note that the kernel has multiple memory pools, so even if one of these APIs says you have no free RAM, it could be that memory is available in the memory pool you are interested in.
First, we have si_meminfo. This is the call that provides availability data for /proc/meminfo, among other things, and reports on the current state of the buddy page allocator. Note that cached and buffer ram can be converted to free ram very quickly.
global_page_state(NR_SLAB_RECLAIMABLE) can also be used to get counts of how much slab memory can be quickly reclaimed. If you request an allocation, this memory can and will be freed on demand.
The SLUB allocator (used for kmalloc() and the like, among others) also provides statistics for its internal memory pools that can also reflect free memory within each memory pool. This may not be available with the same API depending on which allocator is selected in your configuration - please do not use this data except for debugging. The relevant code (implementing /proc/slabinfo) can be found in mm/slub.c.
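A minimal sketch of reading these counters from inside a module (symbol names follow the kernel versions current at the time; newer kernels renamed global_page_state() to global_zone_page_state()/global_node_page_state()):

#include <linux/module.h>
#include <linux/mm.h>

static int __init meminfo_demo_init(void)
{
    struct sysinfo si;

    si_meminfo(&si);   /* same snapshot that backs /proc/meminfo */
    pr_info("free: %lu pages, buffers: %lu pages, slab reclaimable: %lu pages\n",
            si.freeram, si.bufferram,
            global_page_state(NR_SLAB_RECLAIMABLE));
    return 0;
}

static void __exit meminfo_demo_exit(void) { }

module_init(meminfo_demo_init);
module_exit(meminfo_demo_exit);
MODULE_LICENSE("GPL");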
What do you want to use the available-memory figure for? In the worst case you run into a race condition when checking available memory:
You query the available memory. It's enough.
The kernel's scheduler preempts your process and runs another one, which allocates a large chunk of the available memory.
The scheduler resumes your process.
Your allocation fails even though step 1 showed enough available memory.

Do threads have a distinct heap?

As far as I know each thread gets a distinct stack when the thread is created by the operating system. I wonder if each thread has a heap distinct to itself also?
No. All threads share a common heap.
Each thread has a private stack, which it can quickly add and remove items from. This makes stack based memory fast, but if you use too much stack memory, as occurs in infinite recursion, you will get a stack overflow.
Since all threads share the same heap, access to the allocator/deallocator must be synchronized. There are various methods and libraries for avoiding allocator contention.
Some languages allow you to create private pools of memory, or individual heaps, which you can assign to a single thread.
By default, C has only a single heap.
That said, some allocators that are thread-aware will partition the heap so that each thread has its own area to allocate from. The idea is that this should make the heap scale better.
One example of such a heap is Hoard.
Depends on the OS. The standard C runtime on Windows and Unices uses a heap shared across threads. This means locking every malloc/free.
On Symbian, for example, each thread comes with its own heap, although threads can share pointers to data allocated in any heap. Symbian's design is better in my opinion since it not only eliminates the need for locking during alloc/free, but also encourages clean specification of data ownership among threads. Also in that case when a thread dies, it takes all the objects it allocated along with it - i.e. it cannot leak objects that it has allocated, which is an important property to have in mobile devices with constrained memory.
Erlang also follows a similar design where a "process" acts as a unit of garbage collection. All data is communicated between processes by copying, except for binary blobs which are reference counted (I think).
Each thread has its own stack and call stack.
Each thread shares the same heap.
It depends on what exactly you mean when saying "heap".
All threads share the address space, so heap-allocated objects are accessible from all threads. Technically, stacks are shared as well in this sense, i.e. nothing prevents you from accessing other thread's stack (though it would almost never make any sense to do so).
On the other hand, there are heap structures used to allocate memory. That is where all the bookkeeping for heap memory allocation is done. These structures are sophisticatedly organized to minimize contention between the threads - so some threads might share a heap structure (an arena), and some might use distinct arenas.
See the following thread for an excellent explanation of the details: How does malloc work in a multithreaded environment?
Typically, threads share the heap and other resources, however there are thread-like constructions that don't. Among these thread-like constructions are Erlang's lightweight processes, and UNIX's full-on processes (created with a call to fork()). You might also be working on multi-machine concurrency, in which case your inter-thread communication options are considerably more limited.
Generally speaking, all threads use the same address space and therefore usually have just one heap.
However, it can be a bit more complicated. You might be looking for Thread Local Storage (TLS), but it stores single values only.
Windows-Specific:
TLS-space can be allocated using TlsAlloc and freed using TlsFree (Overview here). Again, it's not a heap, just DWORDs.
Strangely, Windows supports multiple heaps per process. One can store a heap's handle in TLS. Then you would have something like a "thread-local heap". However, only the handle is unknown to the other threads; they can still access its memory using pointers, as it's still the same address space.
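For illustration, the "thread-local heap" idea could be sketched like this (the function names are made up and error handling is omitted):

#include <windows.h>

static DWORD tls_heap_slot;              /* one TLS slot shared by all threads */

void heaps_init(void)                    /* call once, before creating threads */
{
    tls_heap_slot = TlsAlloc();
}

void thread_attach(void)                 /* call at the start of each thread */
{
    /* HEAP_NO_SERIALIZE is safe because only this thread uses this heap. */
    HANDLE h = HeapCreate(HEAP_NO_SERIALIZE, 0, 0);
    TlsSetValue(tls_heap_slot, h);
}

void *my_alloc(SIZE_T size)
{
    return HeapAlloc((HANDLE)TlsGetValue(tls_heap_slot), 0, size);
}

/* Note: p must have been allocated by the calling thread's own heap. */
void my_free(void *p)
{
    HeapFree((HANDLE)TlsGetValue(tls_heap_slot), 0, p);
}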
EDIT: Some memory allocators (specifically jemalloc on FreeBSD) use TLS to assign "arenas" to threads. This is done to optimize allocation for multiple cores by reducing synchronization overhead.
On the FreeRTOS operating system, tasks (threads) share the same heap, but each of them has its own stack. This comes in very handy when dealing with low-power, low-RAM architectures, because the same pool of memory can be accessed/shared by several threads. But this comes with a small catch: the developer needs to keep in mind that a mechanism for synchronizing malloc and free is needed, which is why some type of synchronization/lock, for example a semaphore or a mutex, is necessary when allocating or freeing memory on the heap.
