Linux Heap Fragmentation - c

I have a question that keeps bothering me for the last week.
In Windows debugger there is the !heap -s command that outputs the virtual memory's heap status and calculates the external fragmentation using the formula :
External fragmentation = 1 - (larget free block / total free size)
Is there a similar method in linux, that outputs those statistics needed to calculate the effect?
Long story now:
I have a C application that keeps allocating and deallocating space of different sizes, using malloc and free, each allocation has different life span.
The platform I am using is Lubuntu, so ptmalloc2 algorithm is the default.
I am aware that those allocations are served in the virtual user space heap(except those that are larger than 128Kb, where the allocator uses mmap), and are mapped to physical pages when actually accessed .
The majority of the allocations is of size < 80 bytes, so they are served from FastBins.
Using Valgrind and Massif I can get the internal fragmentation, since it reports the extra bytes used for each allocation.
However, my main concern is how to figure out the external fragmentation.
I am aware of the /proc/[pid]/smaps heap size and the pmap-d[pid] anon statistics, but I find it difficult to interpret them in terms of external fragmentation.
I am also aware of LD_PRELOAD , and I can dynamically connect the /lib/i386-linux-gnu/libmemusage.so. This library outputs the heap total, peak and the distribution of the requested allocation sizes.
I know that __malloc__hook is deprecated now, and I do not really want to rely on implementation specific statistics like malloc_stats() and mallinfo() . However, If you have any suggestions using those two please let me know.
I can tell that the external fragmentation problem, is when a request can not be satisfied, because there is no contiguous space in the heap, but the requested total size is scattered all around that area.
I still have not figured out, how to get the statistics needed so I can calculate this effect. For example different formulas stating that I have to capture the live_memory or get the total_free_pages, or get the size of the largest_free_block.
How can I have a function to "traverse" through the heap and gather those statistics?
Thank you everyone in advance.

I believe that this will depend on the allocator you are using. That is, you would probably need a different strategy for whichever malloc (et al) and free implementation you are using. If the implementation doesn't offer the information you seek as an extension, you will probably have to read its source-code and type your own logic to examine the state of allocations.
I believe that the mapping of pages to swap-space and physical RAM is at a lower level and so won't particularly help you in your goal. The malloc (et al) and free implementation might or might not care about those lower-level details.
If you are certain that you are using ptmalloc2, are you able to find its source-code?

Related

How to get heap memory usage, FreeRTOS [duplicate]

I'm creating a list of elements inside a task in the following way:
l = (dllist*)pvPortMalloc(sizeof(dllist));
dllist is 32 byte big.
My embedded system has 60kB SRAM so I expected my 200 element list can be handled easily by the system. I found out that after allocating space for 8 elements the system is crashing on the 9th malloc function call (256byte+).
If possible, where can I change the heap size inside freeRTOS?
Can I somehow request the current status of heap size?
I couldn't find this information in the documentation so I hope somebody can provide some insight in this matter.
Thanks in advance!
(Yes - FreeRTOS pvPortMalloc() returns void*.)
If you have 60K of SRAM, and configTOTAL_HEAP_SIZE is large, then it is unlikely you are going to run out of heap after allocating 256 bytes unless you had hardly any heap remaining before hand. Many FreeRTOS demos will just keep creating objects until all the heap is used, so if your application is based on one of those, then you would be low on heap before your code executed. You may have also done something like use up loads of heap space by creating tasks with huge stacks.
heap_4 and heap_5 will combine adjacent blocks, which will minimise fragmentation as far as practical, but I don't think that will be your problem - especially as you don't mention freeing anything anywhere.
Unless you are using heap_3.c (which just makes the standard C library malloc and free thread safe) you can call xPortGetFreeHeapSize() to see how much free heap you have. You may also have xPortGetMinimumEverFreeHeapSize() available to query how close you have ever come to running out of heap. More information: http://www.freertos.org/a00111.html
You could also define a malloc() failed hook (http://www.freertos.org/a00016.html) to get instant notification of pvPortMalloc() returning NULL.
For the standard allocators you will find a config option in FreeRTOSConfig.h .
However:
It is very well possible you run out of memory already, depending on the allocator used. IIRC there is one that does not free() any blocks (free() is just a dummy). So any block returned will be lost. This is still useful if you only allocate memory e.g. at startup, but then work with what you've got.
Other allocators might just not merge adjacent blocks once returned, increasing fragmentation much faster than a full-grown allocator.
Also, you might loose memory to fragmentation. Depending on your alloc/free pattern, you quickly might end up with a heap looking like swiss cheese: Many holes between allocated blocks. So while there is still enough free memory, no single block is big enough for the size required.
If you only allocate blocks that size there, you might be better of using your own allocator or a pool (blocks of fixed size). Thaqt would be statically allocated (e.g. array) and chained as a linked list during startup. Alloc/free would then just be push/pop on a stack (or put/get on a queue). That would also be very fast and have complexity O(1) (interrupt-safe if properly written).
Note that normal malloc()/free() are not interrupt-safe.
Finally: Do not cast void *. (Well, that's actually what standard malloc() returns and I expect that FreeRTOS-variant does the same).

Pb in using nedflush for memory leaks

I try to use the debug functionalities of nedmalloc to find potential memory leaks in my code. So I activate the flags ENABLE_LOGGING and NEDMALLOC_TESTLOGENTRY.
in my program, I only use the system memory pool. At the very end of my program, I call the function neddestroysyspool in order to flush all memory events.
First of all, I don't manage to activate the stack trace functionality. When I change this depth, the program crashes after a few allocations. In order to compile Under VS2010, I had to define DeinitSym myself with a call to CloseHandle; I hope I'm doing right ... but it does not work properly. So I don't use it.
So I just parse the file nedmalloc.csv: I sort it thanks to addresses, sum allocated sizes and substract freed ones wrt the address. For an unknown reason, for several big chunks (size>400kb), the size given at allocation is right but the size given at free is different, above the allocated size. For example, I allocated a block of 840352 bytes, but when freed, the recorded size was 851932 bytes. Is is normal?
Does anyone has some answer(s) or hint(s) for this problem?
Firstly, I really wouldn't use nedmalloc's logging facilities for memory leak detection. valgrind is a vastly superior way of finding resource leaks. I even have a hack there in nedmalloc to have it use the system allocator instead of dlmalloc precisely in order to avail of valgrind.
Secondly, yes the stack backtracing code can be a little brittle. If I remember rightly I did something naughty to get the performance considerably higher at the expense of it not quite working well. I'll be blunt in saying almost no one but me ever used that code path, so I never had much reason to debug it properly. It worked well enough for my purposes.
Thirdly for larger blocks, dlmalloc will round up allocations including their bookkeeping to a 64Kb multiple, so it is expected that a 12 chunk allocation would be turned into a 13 chunk allocation.
Lastly, I as the author of nedmalloc would recommend you not use nedmalloc on Windows 7 or better, or any Linux with a 3.x kernel, or any Mac OS X produced in the past three years. Why? The system allocator is probably good enough and as I personally haven't needed nedmalloc for some years, I'll freely admit the code is bit rotting.
Hope that helps.
Niall

Fragmentation-resistant Microcontroller Heap Algorithm

I am looking to implement a heap allocation algorithm in C for a memory-constrained microcontroller. I have narrowed my search down to 2 options I'm aware of, however I am very open to suggestions, and I am looking for advice or comments from anyone with experience in this.
My Requirements:
-Speed definitely counts, but is a secondary concern.
-Timing determinism is not important - any part of the code requiring deterministic worst-case timing has its own allocation method.
-The MAIN requirement is fragmentation immunity. The device is running a lua script engine, which will require a range of allocation sizes (heavy on the 32 byte blocks). The main requirement is for this device to run for a long time without churning its heap into an unusable state.
Also Note:
-For reference, we are talking about a Cortex-M and PIC32 parts, with memory ranging from 128K and 16MB or memory (with a focus on the lower end).
-I don't want to use the compiler's heap because 1) I want consistent performance across all compilers and 2) their implementations are generally very simple and are the same or worse for fragmentation.
-double indirect options are out because of the huge Lua code base that I don't want to fundamtnetally change and revalidate.
My Favored Approaches Thus Far:
1) Have a binary buddy allocator, and sacrifice memory usage efficiency (rounding up to a power of 2 size).
-this would (as I understand) require a binary tree for each order/bin to store free nodes sorted by memory address for fast buddy-block lookup for rechaining.
2) Have two binary trees for free blocks, one sorted by size and one sorted by memory address. (all binary tree links are stored in the block itself)
-allocation would be best-fit using a lookup on the table by size, and then remove that block from the other tree by address
-deallocation would lookup adjacent blocks by address for rechaining
-Both algorithms would also require storing an allocation size before the start of the allocated block, and have blocks go out as a power of 2 minus 4 (or 8 depending on alignment). (Unless they store a binary tree elsewhere to track allocations sorted by memory address, which I don't consider a good option)
-Both algorithms require height-balanced binary tree code.
-Algorithm 2 does not have the requirement of wasting memory by rounding up to a power of two.
-In either case, I will probably have a fixed bank of 32-byte blocks allocated by nested bit fields to off-load blocks this size or smaller, which would be immune to external fragmentation.
My Questions:
-Is there any reason why approach 1 would be more immune to fragmentation than approach 2?
-Are there any alternatives that I am missing that might fit the requirements?
If block sizes are not rounded up to powers of two or some equivalent(*), certain sequences of allocation and deallocation will generate an essentially-unbounded amount of fragmentation even if the number of non-permanent small objects that exist at any given time is limited. A binary-buddy allocator will, of course, avoid that particular issue. Otherwise, if one is using a limited number of nicely-related object sizes but not using a "binary buddy" system, one may still have to use some judgment in deciding where to allocate new blocks.
Another approach to consider is having different allocation methods for things that are expected to be permanent, temporary, or semi-persistent. Fragmentation often causes the most trouble when temporary and permanent things get interleaved on the heap. Avoiding such interleaving may minimize fragmentation.
Finally, I know you don't really want to use double-indirect pointers, but allowing object relocation can greatly reduce fragmentation-related issues. Many Microsoft-derived microcomputer BASICs used a garbage-collected string heap; Microsoft's garbage collector was really horrible, but its string-heap approach can be used with a good one.
You can pick up a (never used for real) Buddy system allocator at http://www.mcdowella.demon.co.uk/buddy.html, with my blessing for any purpose you like. But I don't think you have a problem that is easily solved just by plugging in a memory allocator. The long-running high integrity systems I am familiar with have predictable resource usage, described in 30+ page documents for each resource (mostly cpu and I/O bus bandwidth - memory is easy because they tend to allocate the same amount at startup every time and then never again).
In your case none of the usual tricks - static allocation, free lists, allocation on the stack, can be shown to work because - at least as described to us - you have a Lua interpreted hovering in the background ready to do who knows what at run time - what if it just gets into a loop allocating memory until it runs out?
Could you separate the memory use into two sections - traditional code allocating almost all of what it needs on startup, and never again, and expendable code (e.g. Lua) allowed to allocate whatever it needs when it needs it, from whatever is left over after static allocation? Could you then trigger a restart or some sort of cleanup of the expendable code if it manages to use all of its area of memory, or fragments it, without bothering the traditional code?

Can I write a C application without using the heap?

I'm experiencing what appears to be a stack/heap collision in an embedded environment (see this question for some background).
I'd like to try rewriting the code so that it doesn't allocate memory on the heap.
Can I write an application without using the heap in C? For example, how would I use the stack only if I have a need for dynamic memory allocation?
I did it once in an embedded environment where we were writing "super safe" code for biomedical machines.
Malloc()s were explicitly forbidden, partly for the resources limits and for the unexpected behavior you can get from dynamic memory (look for malloc(), VxWorks/Tornado and fragmentation and you'll have a good example).
Anyway, the solution was to plan in advance the needed resources and statically allocate the "dynamic" ones in a vector contained in a separate module, having some kind of special purpose allocator give and take back pointers. This approach avoided fragmentation issues altogether and helped getting finer grained error info, if a resource was exhausted.
This may sound silly on big iron, but on embedded systems, and particularly on safety critical ones, it's better to have a very good understanding of which -time and space- resources are needed beforehand, if only for the purpose of sizing the hardware.
Funnily enough, I once saw a database application which completly relied on static allocated memory. This application had a strong restriction on field and record lengths. Even the embedded text editor (I still shiver calling it that) was unable to create texts with more than 250 lines of text. That solved some question I had at this time: why are only 40 records allowed per client?
In serious applications you can not calculate in advance the memory requirements of your running system. Therefore it is a good idea to allocate memory dynamically as you need it. Nevertheless it is common case in embedded systems to preallocate memory you really need to prevent unexpected failures due to memory shortage.
You might allocate dynamic memory on the stack using the alloca() library calls. But this memory is tight to the execution context of the application and it is a bad idea to return memory of this type the caller, because it will be overwritten by later subroutine calls.
So I might answer your question with a crisp and clear "it depends"...
You can use alloca() function that allocates memory on the stack - this memory will be freed automatically when you exit the function. alloca() is GNU-specific, you use GCC so it must be available.
See man alloca.
Another option is to use variable-length arrays, but you need to use C99 mode.
It's possible to allocate a large amount of memory from the stack in main() and have your code sub-allocate it later on. It's a silly thing to do since it means your program is taking up memory that it doesn't actually need.
I can think of no reason (save some kind of silly programming challenge or learning exercise) for wanting to avoid the heap. If you've "heard" that heap allocation is slow and stack allocation is fast, it's simply because the heap involves dynamic allocation. If you were to dynamically allocate memory from a reserved block within the stack, it would be just as slow.
Stack allocation is easy and fast because you may only deallocate the "youngest" item on the stack. It works for local variables. It doesn't work for dynamic data structures.
Edit: Having seen the motivation for the question...
Firstly, the heap and the stack have to compete for the same amount of available space. Generally, they grow towards each other. This means that if you move all your heap usage into the stack somehow, then rather than stack colliding with heap, the stack size will just exceed the amount of RAM you have available.
I think you just need to watch your heap and stack usage (you can grab pointers to local variables to get an idea of where the stack is at the moment) and if it's too high, reduce it. If you have lots of small dynamically-allocated objects, remember that each allocation has some memory overhead, so sub-allocating them from a pool can help cut down on memory requirements. If you use recursion anywhere think about replacing it with an array-based solution.
You can't do dynamic memory allocation in C without using heap memory. It would be pretty hard to write a real world application without using Heap. At least, I can't think of a way to do this.
BTW, Why do you want to avoid heap? What's so wrong with it?
1: Yes you can - if you don't need dynamic memory allocation, but it could have a horrible performance, depending on your app. (i.e. not using the heap won't give you better apps)
2: No I don't think you can allocate memory dynamically on the stack, since that part is managed by the compiler.
Yes, it's doable. Shift your dynamic needs out of memory and onto disk (or whatever mass storage you have available) -- and suffer the consequent performance penalty.
E.g., You need to build and reference a binary tree of unknown size. Specify a record layout describing a node of the tree, where pointers to other nodes are actually record numbers in your tree file. Write routines that let you add to the tree by writing an additional record to file, and walk the tree by reading a record, finding its child as another record number, reading that record, etc.
This technique allocates space dynamically, but it's disk space, not RAM space. All the routines involved can be written using statically allocated space -- on the stack.
Embedded applications need to be careful with memory allocations but I don't think using the stack or your own pre-allocated heap is the answer. If possible, allocate all required memory (usually buffers and large data structures) at initialization time from a heap. This requires a different style of program than most of us are used to now but it's the best way to get close to deterministic behavior.
A large heap that is sub-allocated later would still be subject to running out of memory and the only thing to do then is have a watchdog kick in (or similar action). Using the stack sounds appealing but if you're going to allocate large buffers/data structures on the stack you have to be sure that the stack is large enough to handle all possible code paths that your program could execute. This is not easy and in the end is similar to a sub-allocated heap.
My foremost concern is, does abolishing the heap really helps?
Since your wish of not using heap stems from stack/heap collision, assuming the start of stack and start of heap are set properly (e.g. in the same setting, small sample programs have no such collision problem), then the collision means the hardware has not enough memory for your program.
Not using heap, one may indeed save some waste space from heap fragmentation; but if your program does not use the heap for a bunch of irregular large size allocation, the waste there are probably not much. I will see your collision problem more of an out of memory problem, something not fixable by merely avoiding heap.
My advices on tackling this case:
Calculate the total potential memory usage of your program. If it is too close to but not yet exceeding the amount of memory you prepared for the hardware, then you may
Try using less memory (improve the algorithms) or using the memory more efficiently (e.g. smaller and more-regular-sized malloc() to reduce heap fragmentation); or
Simply buy more memory for the hardware
Of course you may try pushing everything into pre-defined static memory space, but it is very probable that it will be stack overwriting into static memory this time. So improve the algorithm to be less memory-consuming first and buy more memory the second.
I'd attack this problem in a different way - if you think the the stack and heap are colliding, then test this by guarding against it.
For example (assuming a *ix system) try mprotect()ing the last stack page (assuming a fixed size stack) so it is not accessible. Or - if your stack grows - then mmap a page in the middle of the stack and heap. If you get a segv on your guard page you know you've run off the end of the stack or heap; and by looking at the address of the seg fault you can see which of the stack & heap collided.
It is often possible to write your embedded application without using dynamic memory allocation. In many embedded applications the use of dynamic allocation is deprecated because of the problems that can arise due to heap fragmentation. Over time it becomes highly likely that there will not be a suitably sized region of free heap space to allow the memory to be allocated and unless there is a scheme in place to handle this error the application will crash. There are various schemes to get around this, one being to always allocate fixed size objects on the heap so that a new allocation will always fit into a freed memory area. Another to detect the allocation failure and to perform a defragmentation process on all of the objects on the heap (left as an exercise for the reader!)
You do not say what processor or toolset you are using but in many the static, heap and stack are allocated to separate defined segments in the linker. If this is the case then it must be that your stack is growing outside the memory space that you have defined for it. The solution that you require is to reduce the heap and/or static variable size (assuming that these two are contiguous) so that there is more available for the stack. It may be possible to reduce the heap unilaterally although this can increase the probability of fragmentation problems. Ensuring that there are no unnecessary static variables will free some space at the cost of possibly increasing the stack usage if the variable is made auto.

What's a good C memory allocator for embedded systems? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Improve this question
I have an single threaded, embedded application that allocates and deallocates lots and lots of small blocks (32-64b). The perfect scenario for a cache based allocator. And although I could TRY to write one it'll likely be a waste of time, and not as well tested and tuned as some solution that's already been on the front lines.
So what would be the best allocator I could use for this scenario?
Note: I'm using a Lua Virtual Machine in the system (which is the culprit of 80+% of the allocations), so I can't trivially refactor my code to use stack allocations to increase allocation performance.
I'm a bit late to the party, but I just want to share very efficient memory allocator for embedded systems I've recently found and tested: https://github.com/dimonomid/umm_malloc
This is a memory management library specifically designed to work with the ARM7, personally I use it on PIC32 device, but it should work on any 16- and 8-bit device (I have plans to test in on 16-bit PIC24, but I haven't tested it yet)
I was seriously beaten by fragmentation with default allocator: my project often allocates blocks of various size, from several bytes to several hundreds of bytes, and sometimes I faced 'out of memory' error. My PIC32 device has total 32K of RAM, and 8192 bytes is used for heap. At the particular moment there is more than 5K of free memory, but default allocator has maximum non-fragmented memory block just of about 700 bytes, because of fragmentation. This is too bad, so I decided to look for more efficient solution.
I already was aware of some allocators, but all of them has some limitations (such as block size should be a power or 2, and starting not from 2 but from, say, 128 bytes), or was just buggy. Every time before, I had to switch back to the default allocator.
But this time, I'm lucky: I've found this one: http://hempeldesigngroup.com/embedded/stories/memorymanager/
When I tried this memory allocator, in exactly the same situation with 5K of free memory, it has more than 3800 bytes block! It was so unbelievable to me (comparing to 700 bytes), and I performed hard test: device worked heavily more than 30 hours. No memory leaks, everything works as it should work.
I also found this allocator in the FreeRTOS repository: http://svnmios.midibox.org/listing.php?repname=svn.mios32&path=%2Ftrunk%2FFreeRTOS%2FSource%2Fportable%2FMemMang%2F&rev=1041&peg=1041# , and this fact is an additional evidence of stability of umm_malloc.
So I completely switched to umm_malloc, and I'm quite happy with it.
I just had to change it a bit: configuration was a bit buggy when macro UMM_TEST_MAIN is not defined, so, I've created the github repository (the link is at the top of this post). Now, user dependent configuration is stored in separate file umm_malloc_cfg.h
I haven't got deeply yet in the algorithms applied in this allocator, but it has very detailed explanation of algorithms, so anyone who is interested can look at the top of the file umm_malloc.c . At least, "binning" approach should give huge benefit in less-fragmentation: http://g.oswego.edu/dl/html/malloc.html
I believe that anyone who needs for efficient memory allocator for microcontrollers, should at least try this one.
In a past project in C I worked on, we went down the road of implementing our own memory management routines for a library ran on a wide range of platforms including embedded systems. The library also allocated and freed a large number of small buffers. It ran relatively well and didn't take a large amount of code to implement. I can give you a bit of background on that implementation in case you want to develop something yourself.
The basic implementation included a set of routines that managed buffers of a set size. The routines were used as wrappers around malloc() and free(). We used these routines to manage allocation of structures that we frequently used and also to manage generic buffers of set sizes. A structure was used to describe each type of buffer being managed. When a buffer of a specific type was allocated, we'd malloc() the memory in blocks (if a list of free buffers was empty). IE, if we were managing 10 byte buffers, we might make a single malloc() that contained space for 100 of these buffers to reduce fragmentation and the number of underlying mallocs needed.
At the front of each buffer would be a pointer that would be used to chain the buffers in a free list. When the 100 buffers were allocated, each buffer would be chained together in the free list. When the buffer was in use, the pointer would be set to null. We also maintained a list of the "blocks" of buffers, so that we could do a simple cleanup by calling free() on each of the actual malloc'd buffers.
For management of dynamic buffer sizes, we also added a size_t variable at the beginning of each buffer telling the size of the buffer. This was then used to identify which buffer block to put the buffer back into when it was freed. We had replacement routines for malloc() and free() that did pointer arithmetic to get the buffer size and then to put the buffer into the free list. We also had a limit on how large of buffers we managed. Buffers larger than this limit were simply malloc'd and passed to the user. For structures that we managed, we created wrapper routines for allocation and freeing of the specific structures.
Eventually we also evolved the system to include garbage collection when requested by the user to clean up unused memory. Since we had control over the whole system, there were various optimizations we were able to make over time to increase performance of the system. As I mentioned, it did work quite well.
I did some research on this very topic recently, as we had an issue with memory fragmentation. In the end we decided to stay with GNU libc's implementation, and add some application-level memory pools where necessary. There were other allocators which had better fragmentation behavior, but we weren't comfortable enough with them replace malloc globally. GNU's has the benefit of a long history behind it.
In your case it seems justified; assuming you can't fix the VM, those tiny allocations are very wasteful. I don't know what your whole environment is, but you might consider wrapping the calls to malloc/realloc/free on just the VM so that you can pass it off to a handler designed for small pools.
Although its been some time since I asked this, my final solution was to use LoKi's SmallObjectAllocator it work great. Got rid off all the OS calls and improved the performance of my Lua engine for embedded devices. Very nice and simple, and just about 5 minutes worth of work!
Since version 5.1, Lua has allowed a custom allocator to be set when creating new states.
I'd just also like to add to this even though it's an old thread. In an embedded application if you can analyze your memory usage for your application and come up with a max number of memory allocation of the varying sizes usually the fastest type of allocator is one using memory pools. In our embedded apps we can determine all allocation sizes that will ever be needed during run time. If you can do this you can completely eliminate heap fragmentation and have very fast allocations. Most these implementations have an overflow pool which will do a regular malloc for the special cases which will hopefully be far and few between if you did your analysis right.
I have used the 'binary buddy' system to good effect under vxworks. Basically, you portion out your heap by cutting blocks in half to get the smallest power of two sized block to hold your request, and when blocks are freed, you can make a pass up the tree to merge blocks back together to mitigate fragmentation. A google search should turn up all the info you need.
I am writing a C memory allocator called tinymem that is intended to be able to defragment the heap, and re-use memory. Check it out:
https://github.com/vitiral/tinymem
Note: this project has been discontinued to work on the rust implementation:
https://github.com/vitiral/defrag-rs
Also, I had not heard of umm_malloc before. Unfortunately, it doesn't seem to be able to deal with fragmentation, but it definitely looks useful. I will have to check it out.

Resources