Handle memory properly with a pool of structs

Handle memory properly with a pool of structs - c

I have a program with three pools of structs. For each of them I use a list a of used structs and another one for the unused structs. During the execution the program consumes structs, and returns them back to the pool on demand. Also, there is a garbage collector to clean the "zombie" structs and return them to the pool.
At the beginning of the execution, the virtual memory, as expected, shows around 10GB* of memory allocated, and as the program uses the pool, the RSS memory increases.
Although the used nodes are back in the pool, marked as unused nodes, the RSS memory do not decreases. I expect this, because the OS doesn't know about what I'm doing with the memory, is not able to notice if I'm doing a real use of them or managing a pool.
What I would like to do is to force the unused memory to go back to virtual memory whenever I want, for example, when the RSS memory increases above X GB.
Is there any way to mark, given the memory pointer, a memory area to put it in virtual memory? I know this is the Operating System responsability but maybe there is a way to force it.
Maybe I shouldn't care about this, what do you think?
Thanks in advance.
Note 1: This program is used in High Performance Computing, that's why it's using this amount of memory.
I provide a picture of the pool usage vs the memory usage, for a few files. As you can see, the sudden drops in the pool usage are due to the garbage collector, what I would like to see, is this drop reflected in the memory usage.

You can do this as long as you are allocating your memory via mmap and not via malloc. You want to use the madvise function with the POSIX_MADV_DONTNEED argument.
Just remember to run madvise with POSIX_MADV_WILLNEED before using them again to ensure there is actually memory behind them.
This does not actually guarantee the pages will be swapped out but gives the kernel a strong hint to do so when it has time.

Git 2.19 (Q3 2018) offers an example of memory pool of struct, using mmap, not malloc.
For a large tree, the index needs to hold many cache entries allocated on heap.
These cache entries are now allocated out of a dedicated memory pool to amortize malloc(3) overhead.
See commit 8616a2d, commit 8e72d67, commit 0e58301, commit 158dfef, commit 8fb8e3f, commit a849735, commit 825ed4d, commit 768d796 (02 Jul 2018) by Jameson Miller (jamill).
(Merged by Junio C Hamano -- gitster -- in commit ae533c4, 02 Aug 2018)
block alloc: allocate cache entries from mem_pool
When reading large indexes from disk, a portion of the time is dominated in malloc() calls.
This can be mitigated by allocating a large block of memory and manage it ourselves via memory pools.
This change moves the cache entry allocation to be on top of memory pools.
Design:
The index_state struct will gain a notion of an associated memory_pool from which cache_entries will be allocated from.
When reading in the index from disk, we have information on the number of entries and their size, which can guide us in deciding how large our initial memory allocation should be.
When an index is discarded, the associated memory_pool will be discarded as well - so the lifetime of a cache_entry is tied to the lifetime of the index_state that it was allocated for.
In the case of a Split Index, the following rules are followed.
1st, some terminology is defined:
Terminology:
'the_index': represents the logical view of the index
'split_index': represents the "base" cache entries.
Read from the split index file.
'the_index' can reference a single split_index, as well as cache_entries from the split_index. the_index will be discarded before the split_index is.
This means that when we are allocating cache_entries in the presence of a split index, we need to allocate the entries from the split_index's memory pool.
This allows us to follow the pattern that the_index can reference cache_entries from the split_index, and that the cache_entries will not be freed while they are still being referenced.
Managing transient cache_entry structs:
Cache entries are usually allocated for an index, but this is not always the case. Cache entries are sometimes allocated because this is the type that the existing checkout_entry function works with.
Because of this, the existing code needs to handle cache entries associated with an index / memory pool, and those that only exist transiently.
Several strategies were contemplated around how to handle this.
Chosen approach:
An extra field was added to the cache_entry type to track whether the cache_entry was allocated from a memory pool or not.
This is currently an int field, as there are no more available bits in the existing
ce_flags bit field.
If / when more bits are needed, this new field can be turned into a proper bit field.
We decided tracking and iterating over known memory pool regions was
less desirable than adding an extra field to track this state.

Related

Large memory usage of aio [duplicate]

Here's my question: Does calling free or delete ever release memory back to the "system". By system I mean, does it ever reduce the data segment of the process?
Let's consider the memory allocator on Linux, i.e ptmalloc.
From what I know (please correct me if I am wrong), ptmalloc maintains a free list of memory blocks and when a request for memory allocation comes, it tries to allocate a memory block from this free list (I know, the allocator is much more complex than that but I am just putting it in simple words). If, however, it fails, it gets the memory from the system using say sbrk or brk system calls. When a memory is free'd, that block is placed in the free list.
Now consider this scenario, on peak load, a lot of objects have been allocated on heap. Now when the load decreases, the objects are free'd. So my question is: Once the object is free'd will the allocator do some calculations to find whether it should just keep this object in the free list or depending upon the current size of the free list it may decide to give that memory back to the system i.e decrease the data segment of the process using sbrk or brk?
Documentation of glibc tells me that if the allocation request is much larger than page size, it will be allocated using mmap and will be directly released back to the system once free'd. Cool. But let's say I never ask for allocation of size greater than say 50 bytes and I ask a lot of such 50 byte objects on peak load on the system. Then what?
From what I know (correct me please), a memory allocated with malloc will never be released back to the system ever until the process ends i.e. the allocator will simply keep it in the free list if I free it. But the question that is troubling me is then, if I use a tool to see the memory usage of my process (I am using pmap on Linux, what do you guys use?), it should always show the memory used at peak load (as the memory is never given back to the system, except when allocated using mmap)? That is memory used by the process should never ever decrease(except the stack memory)? Is it?
I know I am missing something, so please shed some light on all this.
Experts, please clear my concepts regarding this. I will be grateful. I hope I was able to explain my question.

There isn't much overhead for malloc, so you are unlikely to achieve any run-time savings. There is, however, a good reason to implement an allocator on top of malloc, and that is to be able to trace memory leaks. For example, you can free all memory allocated by the program when it exits, and then check to see if your memory allocator calls balance (i.e. same number of calls to allocate/deallocate).
For your specific implementation, there is no reason to free() since the malloc won't release to system memory and so it will only release memory back to your own allocator.
Another reason for using a custom allocator is that you may be allocating many objects of the same size (i.e you have some data structure that you are allocating a lot). You may want to maintain a separate free list for this type of object, and free/allocate only from this special list. The advantage of this is that it will avoid memory fragmentation.

No.
It's actually a bad strategy for a number of reasons, so it doesn't happen --except-- as you note, there can be an exception for large allocations that can be directly made in pages.
It increases internal fragmentation and therefore can actually waste memory. (You can only return aligned pages to the OS, so pulling aligned pages out of a block will usually create two guaranteed-to-be-small blocks --smaller than a page, anyway-- to either side of the block. If this happens a lot you end up with the same total amount of usefully-allocated memory plus lots of useless small blocks.)
A kernel call is required, and kernel calls are slow, so it would slow down the program. It's much faster to just throw the block back into the heap.
Almost every program will either converge on a steady-state memory footprint or it will have an increasing footprint until exit. (Or, until near-exit.) Therefore, all the extra processing needed by a page-return mechanism would be completely wasted.

It is entirely implementation dependent. On Windows VC++ programs can return memory back to the system if the corresponding memory pages contain only free'd blocks.

I think that you have all the information you need to answer your own question. pmap shows the memory that is currenly being used by the process. So, if you call pmap before the process achieves peak memory, then no it will not show peak memory. if you call pmap just before the process exits, then it will show peak memory for a process that does not use mmap. If the process uses mmap, then if you call pmap at the point where maximum memory is being used, it will show peak memory usage, but this point may not be at the end of the process (it could occur anywhere).
This applies only to your current system (i.e. based on the documentation you have provided for free and mmap and malloc) but as the previous poster has stated, behavior of these is implmentation dependent.

This varies a bit from implementation to implementation.
Think of your memory as a massive long block, when you allocate to it you take a bit out of your memory (labeled '1' below):
111
If I allocate more more memory with malloc it gets some from the system:
1112222
If I now free '1':
___2222
It won't be returned to the system, because two is in front of it (and memory is given as a continous block). However if the end of the memory is freed, then that memory is returned to the system. If I freed '2' instead of '1'. I would get:
111
the bit where '2' was would be returned to the system.
The main benefit of freeing memory is that that bit can then be reallocated, as opposed to getting more memory from the system. e.g:
33_2222

I believe that the memory allocator in glibc can return memory back to the system, but whether it will or not depends on your memory allocation patterns.
Let's say you do something like this:
void *pointers[10000];
for(i = 0; i < 10000; i++)
pointers[i] = malloc(1024);
for(i = 0; i < 9999; i++)
free(pointers[i]);
The only part of the heap that can be safely returned to the system is the "wilderness chunk", which is at the end of the heap. This can be returned to the system using another sbrk system call, and the glibc memory allocator will do that when the size of this last chunk exceeds some threshold.
The above program would make 10000 small allocations, but only free the first 9999 of them. The last one should (assuming nothing else has called malloc, which is unlikely) be sitting right at the end of the heap. This would prevent the allocator from returning any memory to the system at all.
If you were to free the remaining allocation, glibc's malloc implementation should be able to return most of the pages allocated back to the system.
If you're allocating and freeing small chunks of memory, a few of which are long-lived, you could end up in a situation where you have a large chunk of memory allocated from the system, but you're only using a tiny fraction of it.

Here are some "advantages" to never releasing memory back to the system:
Having already used a lot of memory makes it very likely you will do so again, and
when you release memory the OS has to do quite a bit of paperwork
when you need it again, your memory allocator has to re-initialise all its data structures in the region it just received
Freed memory that isn't needed gets paged out to disk where it doesn't actually make that much difference
Often, even if you free 90% of your memory, fragmentation means that very few pages can actually be released, so the effort required to look for empty pages isn't terribly well spent

Many memory managers can perform TRIM operations where they return entirely unused blocks of memory to the OS. However, as several posts here have mentioned, it's entirely implementation dependent.
But lets say I never ask for allocation of size greater than say 50 bytes and I ask a lot of such 50 byte objects on peak load on the system. Then what ?
This depends on your allocation pattern. Do you free ALL of the small allocations? If so and if the memory manager has handling for a small block allocations, then this may be possible. However, if you allocate many small items and then only free all but a few scattered items, you may fragment memory and make it impossible to TRIM blocks since each block will have only a few straggling allocations. In this case, you may want to use a different allocation scheme for the temporary allocations and the persistant ones so you can return the temporary allocations back to the OS.

Growing (and shrinking) memory pool

Let's say, for the purpose of the question, we have a memory pool that has n blocks allocated initially. However, when the capacity is reached, the pool wants to grow and become twice the size it was (2n).
Now this resize operation can be done with realloc in C, however the function itself may return a pointer to a different memory (with the old data copied in).
This means the pointers returned by the memory pool allocator may no longer be valid (since the memory may have been moved).
What would be a nice way to overcome this problem? Or is it even possible at all?

Allocate out of multiple non-contiguous pools of memory. When one pool is full, allocate a second pool, allowing it to be someplace else in your virtual address space.
Then the problem is one of keeping track of where your pools are. Typically you'd use some of the space in each pool for bookkeeping. For example, you might reserve one pointer's worth of space to keep a simple linear linked list of all the pools. More sophisticated allocators tend to require more bookkeeping overhead.

Instead of using realloc, malloc a new/additional chunk of blocks (assuming there's no reason why the blocks, which are managed and returned and returned by your pool allocator, need to be in a single contiguous chunk of memory).

process allocated memory blocks from kernel

i need to have reliable measurement of allocated memory in a linux process. I've been looking into mallinfo but i've read that it is deprecated. What is the state of the art alternative for this sort of statistics?
basically i'm interested in at least two numbers:
number (and size) of allocated memory blocks/pages from the kernel by any malloc or whatever implementation uses the C library of choice
(optional but still important) number of allocated memory by userspace code (via malloc, new, etc.) minus the deallocated memory by it (via free, delete, etc.)
one possibility i have is to override malloc calls with LD_PRELOAD, but it might introduce an unwanted overhead at runtime, also it might not interact properly with other libraries i'm using that also rely on LD_PRELOAD aop-ness.
another possibility i've read is with rusage.
To be clear, this is NOT for debugging purposes, the memory usage is intrinsic feature of the application (similar to Mathematica or Matlab that display the amount of memory used, only that more precise at the block-level)

For this purpose - a "memory usage" introspection feature within an application - the most appropriate interface is malloc_hook(3). These are GNU extensions that allow you to hook every malloc(), realloc() and free() call, maintaining your statistics.
To see how much memory is mapped by your application from the kernel's point of view, you can read and collate the information in the /proc/self/smaps pseudofile. This also lets you see how much of each allocation is resident, swapped, shared/private, clean/dirty etc.

/proc/PID/status contains a few useful pieces of information (try running cat /proc/$$/status for example).
VmPeak is the largest your process's virtual memory space ever became during its execution. This includes all pages mapped into your process, including executable pages, mmap'ed files, stack, and heap.
VmSize is the current size of your process's virtual memory space.
VmRSS is the Resident Set Size of your process; i.e., how much of it is taking up physical RAM right now. (A typical process will have lots of stuff mapped that it never uses, like most of the C library. If no processes need a page, eventually it will be evicted and become non-resident. RSS measures the pages that remain resident and are mapped into your process.)
VmHWM is the High Water Mark of VmRSS; i.e. the highest that number has been during the lifetime of the process.
VmData is the size of your process's "data" segment; i.e., roughly its heap usage. Note that small blocks on which you have done malloc and then free will still be in use from the kernel's point of view; large blocks will actually be returned to the kernel when freed. (If memory serves, "large" means greater than 128k for current glibc.) This is probably the closest to what you are looking for.
These measurements are probably better than trying to track malloc and free, since they indicate what is "really going on" from a system-wide point of view. Just because you have called free() on some memory, that does not mean it has been returned to the system for other processes to use.

available memory in kernel

Is there a kernel function which returns amount of kernel memory available(Not vmalloc related).

First, let me say that if you're going to make any policy decisions (should I proceed with this operation?) based on this information, STOP. As WGW pointed out, there are unavoidable races here; memory can be used up between when you check and when you use it. Just test for errors on your memory allocations and have an appropriate failure path. Moreover, if you request memory when there isn't enough free memory, often the kernel can obtain more free memory by cleaning up various cache memory, swapping to disk, freeing slabs, etc. And kernel memory fragmentation can fail large (multiple page) allocations when not made through vmalloc even with plenty of memory free.
That said, there are APIs for querying kernel memory availability. You should note that the kernel has multiple memory pools, so even if one of these API says you have no free RAM, it could be that it's available in the memory pool you are interested in.
First, we have si_meminfo. This is the call that provides availability data for /proc/meminfo, among other things, and reports on the current state of the buddy page allocator. Note that cached and buffer ram can be converted to free ram very quickly.
global_page_state(NR_SLAB_RECLAIMABLE) can also be used to get counts of how much slab memory can be quickly reclaimed. If you request an allocation, this memory can and will be freed on demand.
The SLUB allocator (used for kalloc() and the like, among others) also provides statistics for its internal memory pools that can also reflect free memory within each memory pool. This may not be available with the same API depending on which allocator is selected in your configuration - please do not use this data except for debugging. The relevant code (implementing /proc/slabinfo) can be found in mm/slub.c

What kind of use is the available memory for you? Worst case you run in a race condition with checking available memory:
You get the available memory. It`s enough.
Multitasking, a.k.a. the scheduler of the kernel, stops your process and continues with another one which allocates a bunch of the available memory.
The scheduler continues with your process.
Your allocations fails though step 1 showed enough available memory.

overhead for an empty heap arena

My tools are Linux, gcc and pthreads. When my program calls new/delete from several threads, and when there is contention for the heap, 'arena's are created (see the following link for reference http://www.bozemanpass.com/info/linux/malloc/Linux_Heap_Contention.html). My program runs 24x7, and arenas are still occasionally being created after 2 weeks. I think there may eventually be as many arenas as threads. ps(1) shows alarming memory consumption, but I suspect that only a small portion of it is actually mapped.
What is the 'overhead' for an empty arena? (How much more memory per arena is used than if all allocation was confined to the traditional heap? )
Is there any way to force the creation in advance of n arenas? Is there any way to force the destruction of empty arenas?

struct malloc_state (aka mstate, aka arena descriptor) have size
glibc-2.2
(256+18)*4 bytes =~ 1 KB for 32 bit mode and ~2 KB for 64 bit mode.
glibc-2.3
(256+256/32+11+NFASTBINS)*4 =~ 1.1-1.2 KB in 32bit and 2.4-2.5 KB for 64bit
See glibc-x.x.x/malloc/malloc.c file, struct malloc_state

Destruction of arenas... I don't know yet, but there is such text (briefly - it says NO to the possibility of destruction/trimming memory ) from analysis http://www.citi.umich.edu/techreports/reports/citi-tr-00-5.pdf from 2000 (*a bit outdated). Please name your glibc version.
Ptmalloc maintains a linked list of subheaps. To re-
duce lock contention, ptmalloc searchs for the first
unlocked subheap and grabs memory from it to fulfill
a malloc() request. If ptmalloc doesn’t find an
unlocked heap, it creates a new one. This is a simple
way to grow the number of subheaps as appropriate
without adding complicated schemes for hashing on
thread or processor ID, or maintaining workload sta-
tistics. However, there is no facility to shrink the sub-
heap list and nothing stops the heap list from growing
without bound.

from malloc.c (glibc 2.3.5) line 1546
/*
-------------------- Internal data structures --------------------
All internal state is held in an instance of malloc_state defined
below.
...
Beware of lots of tricks that minimize the total bookkeeping space
requirements. **The result is a little over 1K bytes** (for 4byte
pointers and size_t.)
*/
The same result as I got for 32-bit mode. The result is a little over 1K bytes

Consider using of TCmalloc form google-perftools. It just better suited for threaded and long-living applications. And it is very FAST.
Take a look on http://goog-perftools.sourceforge.net/doc/tcmalloc.html especially on graphics (higher is better). Tcmalloc is twice better than ptmalloc.

In our application the main cost of multiple arenas has been "dark" memory. Memory allocated by the OS, which we don't have any references to.
The pattern you can see is
Thread X goes goes to alloc, hits a collision, creates a new arena.
Thread X makes some large allocations.
Thread X makes some small allocation(s).
Thread X stops allocating.
Large allocations are freed. But the whole arena at the high water mark of the last currently active allocation is still using up VMEM, and other threads won't use this arena unless they hit contention in the main arena.
Basically it's a contributor to "memory fragmentation", since there are multiple places memory can be available, but needing to grow an arena is not a reason to look in other arenas. At least I think that's the cause, the point is your application can end up with a bigger VM footprint than you think it should have. This mostly hits you if you have limited swap, since as you say most of this ends up paged out.
Our (memory hungry) application can have 10s of percent of memory "wasted" in this way, and it can really bite in some situations.
I'm not sure why you would want to create empty arenas. If allocations and frees are in the same thread as each other, then I think over time you will tend to all of them being in the same thread-specific arena with no contention. You may have some small blips while you get there, so maybe that's a reason.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight