Fragmentation-resistant Microcontroller Heap Algorithm - c

I am looking to implement a heap allocation algorithm in C for a memory-constrained microcontroller. I have narrowed my search down to two options I'm aware of, but I am very open to suggestions, and I am looking for advice or comments from anyone with experience in this area.
My Requirements:
-Speed definitely counts, but is a secondary concern.
-Timing determinism is not important - any part of the code requiring deterministic worst-case timing has its own allocation method.
-The MAIN requirement is fragmentation immunity. The device is running a Lua script engine, which will require a range of allocation sizes (heavy on the 32-byte blocks). The main requirement is for this device to run for a long time without churning its heap into an unusable state.
Also Note:
-For reference, we are talking about Cortex-M and PIC32 parts, with memory ranging from 128K to 16MB (with a focus on the lower end).
-I don't want to use the compiler's heap because 1) I want consistent performance across all compilers and 2) their implementations are generally very simple and are the same or worse for fragmentation.
-Double-indirect options are out because of the huge Lua code base that I don't want to fundamentally change and revalidate.
My Favored Approaches Thus Far:
1) Have a binary buddy allocator, and sacrifice memory usage efficiency (rounding up to a power of 2 size).
-this would (as I understand it) require a binary tree for each order/bin to store free nodes sorted by memory address, for fast buddy-block lookup for rechaining.
2) Have two binary trees for free blocks, one sorted by size and one sorted by memory address. (all binary tree links are stored in the block itself)
-allocation would be best-fit using a lookup on the table by size, and then remove that block from the other tree by address
-deallocation would lookup adjacent blocks by address for rechaining
-Both algorithms would also require storing an allocation size before the start of the allocated block, and have blocks go out as a power of 2 minus 4 (or 8 depending on alignment). (Unless they store a binary tree elsewhere to track allocations sorted by memory address, which I don't consider a good option)
-Both algorithms require height-balanced binary tree code.
-Algorithm 2 does not have the requirement of wasting memory by rounding up to a power of two.
-In either case, I will probably have a fixed bank of 32-byte blocks allocated by nested bit fields to off-load blocks this size or smaller, which would be immune to external fragmentation.
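For concreteness, here is roughly what I have in mind for that 32-byte bank - a minimal sketch only, with illustrative sizes and a GCC/Clang-specific bit-scan intrinsic:

    /* Illustrative sketch of a fixed bank of 32-byte blocks tracked by a bitmap.
       BLOCK_COUNT and the linear word scan are placeholders, not a tuned design. */
    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK_SIZE   32
    #define BLOCK_COUNT  256                              /* 8 KB bank */

    static uint8_t  pool[BLOCK_COUNT][BLOCK_SIZE];
    static uint32_t used[BLOCK_COUNT / 32];               /* one bit per block */

    void *pool_alloc(void)
    {
        for (size_t w = 0; w < BLOCK_COUNT / 32; w++) {
            if (used[w] != 0xFFFFFFFFu) {                 /* this word has a free slot */
                int bit = __builtin_ctz(~used[w]);        /* first clear bit (GCC/Clang) */
                used[w] |= 1u << bit;
                return pool[w * 32 + bit];
            }
        }
        return NULL;                                      /* bank exhausted */
    }

    void pool_free(void *p)
    {
        size_t idx = (size_t)((uint8_t *)p - &pool[0][0]) / BLOCK_SIZE;
        used[idx / 32] &= ~(1u << (idx % 32));
    }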
My Questions:
-Is there any reason why approach 1 would be more immune to fragmentation than approach 2?
-Are there any alternatives that I am missing that might fit the requirements?

If block sizes are not rounded up to powers of two or some equivalent(*), certain sequences of allocation and deallocation will generate an essentially-unbounded amount of fragmentation even if the number of non-permanent small objects that exist at any given time is limited. A binary-buddy allocator will, of course, avoid that particular issue. Otherwise, if one is using a limited number of nicely-related object sizes but not using a "binary buddy" system, one may still have to use some judgment in deciding where to allocate new blocks.
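To make the rounding concrete, a minimal sketch of one way to quantize request sizes so that block sizes stay "nicely related" (the minimum block size is illustrative):

    /* Sketch: round each request up to the next power of two so that every
       freed block can later satisfy any request of its class or smaller. */
    #include <stddef.h>

    static size_t round_up_pow2(size_t n)
    {
        size_t p = 16;            /* illustrative minimum block size */
        while (p < n)
            p <<= 1;
        return p;
    }
    /* round_up_pow2(33) == 64, round_up_pow2(100) == 128, ... */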
Another approach to consider is having different allocation methods for things that are expected to be permanent, temporary, or semi-persistent. Fragmentation often causes the most trouble when temporary and permanent things get interleaved on the heap. Avoiding such interleaving may minimize fragmentation.
Finally, I know you don't really want to use double-indirect pointers, but allowing object relocation can greatly reduce fragmentation-related issues. Many Microsoft-derived microcomputer BASICs used a garbage-collected string heap; Microsoft's garbage collector was really horrible, but its string-heap approach can be used with a good one.
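As a rough illustration of what double indirection amounts to (the names here are invented for the example, and a real compacting collector is much more involved):

    /* Sketch of handle-based (double-indirect) access: callers keep a handle,
       only the handle table holds the real address, so a compactor is free to
       move blocks and update the table. */
    #include <stddef.h>

    typedef struct {
        void  *ptr;     /* current location of the object; updated on relocation */
        size_t size;
    } handle_t;

    static handle_t handle_table[64];          /* illustrative fixed-size table */

    /* Dereference through the table on every use instead of caching the raw pointer. */
    #define HANDLE_PTR(h) (handle_table[(h)].ptr)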

You can pick up a (never used for real) Buddy system allocator at http://www.mcdowella.demon.co.uk/buddy.html, with my blessing for any purpose you like. But I don't think you have a problem that is easily solved just by plugging in a memory allocator. The long-running high integrity systems I am familiar with have predictable resource usage, described in 30+ page documents for each resource (mostly cpu and I/O bus bandwidth - memory is easy because they tend to allocate the same amount at startup every time and then never again).
In your case none of the usual tricks - static allocation, free lists, allocation on the stack - can be shown to work because, at least as described to us, you have a Lua interpreter hovering in the background ready to do who knows what at run time. What if it just gets into a loop allocating memory until it runs out?
Could you separate the memory use into two sections - traditional code allocating almost all of what it needs on startup, and never again, and expendable code (e.g. Lua) allowed to allocate whatever it needs when it needs it, from whatever is left over after static allocation? Could you then trigger a restart or some sort of cleanup of the expendable code if it manages to use all of its area of memory, or fragments it, without bothering the traditional code?
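Lua makes this split fairly natural, because lua_newstate() accepts a custom allocator. A minimal sketch, assuming region_realloc()/region_free() stand in for whatever allocator you choose for the expendable area (they are placeholders, not a real library):

    /* Route all of Lua's allocations through a dedicated arena so a runaway
       script can only exhaust or fragment that arena, never the "traditional"
       memory allocated at startup. */
    #include <stddef.h>
    #include <lua.h>

    extern void *region_realloc(void *ptr, size_t nsize);   /* placeholder arena API */
    extern void  region_free(void *ptr);

    static void *lua_region_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
    {
        (void)ud; (void)osize;
        if (nsize == 0) {                 /* Lua is asking us to free */
            region_free(ptr);
            return NULL;
        }
        /* Returning NULL makes the failing Lua operation raise a memory error
           instead of taking the rest of the device down. */
        return region_realloc(ptr, nsize);
    }

    lua_State *start_script_engine(void)
    {
        return lua_newstate(lua_region_alloc, NULL);
        /* If the arena is exhausted or churned: lua_close() and start over. */
    }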

Related

Are stack float array ops faster than heap float ops on modern x86 systems?

For C float (or double) arrays small enough to fit in L1 or L2 cache (about 16k), and whose size I know at compile time, is there generally a speed benefit to defining them within the function they are used in, so they are stack variables? If so, is it a large difference? I know that in the old days heap variables were much slower than stack ones, but nowadays, with the far more complicated structure of CPU addressing and caches, I don't know if this is still true.
I need to do repeated runs of floating point math over these arrays in 'chunks', over and over again on the same arrays (about 1000 times), and I wonder if I should define them locally. I imagine keeping them in the closest/fastest locations will let me iterate over them repeatedly much faster, but I don't understand the implications of caching in this scenario. Perhaps the compiler or CPU is clever enough to realize what I am doing and make these data arrays highly local on the hardware during the inner processing loops without my intervention, and perhaps it does a better job than I can.
Maybe I risk running out of stack space if I allocate large arrays this way? Or is stack space not hugely limited on modern systems? The array size can be defined at compile time, I only need one array, and I only need one CPU since I have to stick to a single thread for this work.
It is the allocation and deallocation speed that may make a difference.
Allocating on the stack is just subtracting the required size from the stack pointer, which is normally done for all local variables once upon function entry anyway, so it is essentially free (unless alloca is used). Same applies to deallocating memory on the stack.
Allocating on the heap requires calling malloc or new which ends up executing an order of magnitude more instructions. Same applies to free and delete.
There should be no difference in the speed of access to the arrays once they are allocated. However, the stack is more likely to be in the CPU cache already, because previous function calls have already used the same stack memory region for local variables.
If your architecture employs Non-Uniform Memory Access (NUMA), there can be a difference in access speed to different memory regions when your thread gets rescheduled to run on a different CPU from the one that originally allocated the memory.
For in-depth treatment of the subject have a read of What Every Programmer Should Know About Memory.
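To make the cost difference concrete, a minimal sketch of the two styles being compared (the size is illustrative):

    #include <stdlib.h>

    #define N 4096                  /* ~16 KB of floats, size known at compile time */

    void use_stack(void)
    {
        float a[N];                 /* covered by the same stack-pointer adjustment
                                       as the rest of the locals */
        a[0] = 1.0f;
        /* ... number crunching over a ... */
    }                               /* "freed" simply by popping the frame */

    void use_heap(void)
    {
        float *a = malloc(N * sizeof *a);   /* library call: far more instructions */
        if (!a)
            return;
        a[0] = 1.0f;
        /* ... number crunching over a ... */
        free(a);                            /* another library call */
    }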
The answer is: probably not.
On a modern processor such as the i7 the L1/L2/L3 cache sizes are 64K/1MB/8MB, shared across 4x2 cores. Your numbers are a bit off.
The biggest thing to worry about is parallelism. If you can get all 8 cores running 100% that's a good start.
There is no difference between heap and stack memory, it's just memory. Heap allocation is way slower than stack allocation, but hopefully you don't do much of that.
Cache coherency matters. Cache prefetching matters. The order of accessing things in memory matters. Good reading here: https://software.intel.com/en-us/blogs/2009/08/24/what-you-need-to-know-about-prefetching.
But all this is useless until you can benchmark. You can't improve what you can't measure.
Re comment: there is nothing special about stack memory. The thing that usually does matter is keeping all your data close together. If you access local variables a lot then allocating arrays next to them on the stack might work. If you have several blocks of heap memory then subdividing a single allocation might work better than separate allocations. You'll only know if you read the generated code and benchmark.
They are the same speed on average, assuming the cache lines that the array occupies have not been touched by other code.
One thing to make sure of is that the array memory alignment is at least 32-bit or 64-bit (for float and double respectively) so an array element will not cross cache line boundaries. Cache lines are 64 bytes on x86.
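Assuming a C11 compiler, a sketch of two ways to get that alignment; aligning the whole array to the 64-byte cache line also keeps individual elements from straddling a line:

    #include <stdalign.h>
    #include <stdlib.h>

    /* Statically allocated, 64-byte (cache line) aligned array. */
    alignas(64) static float samples[4096];

    /* Heap equivalent: aligned_alloc requires the size to be a multiple of the alignment. */
    float *make_aligned_array(size_t n)
    {
        size_t bytes = ((n * sizeof(float) + 63) / 64) * 64;
        return aligned_alloc(64, bytes);
    }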
Another important element is to make sure the compiler is using SSE instructions for scalar floating point operations. This should be the default for modern compilers. The legacy floating point path (a.k.a. 387, with its 80-bit register stack) is much slower and harder to optimize.
If this memory is frequently allocated and released, try to reduce calls to malloc/free by allocating it in a pool, globally or on the stack.

What is the optimal amount to malloc at a given time when the total needed is not known?

I've implemented a multi-level cache simulator that needs to store the values currently in the simulator. With current configurations, the maximum size of all values being stored could reach 2G. Obviously I'm not going to assume this worst case scenario and allocate all of that memory up-front. Instead, I have the program set to allocate memory as needed in chunks. The expense of this allocation is exacerbated by the fact that I'm callocing in order to provide 0 values when no write has occurred previously at the specified location.
My question is, is there a good heuristic for how much memory should be allocated each time more is needed? Currently I'm using an arbitrary value, and I considered some solution that would use some ratio of the total system memory (I presume it's possible to detect this at compile time and/or at run time), but even with the latter I'm using an arbitrary ratio, which still doesn't sit well with me.
Any insight into best practices for this kind of situation would be appreciated!
A common rule of thumb is to grow geometrically, for example by doubling, on each reallocation.
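A minimal sketch of that policy, keeping the calloc-like zeroing the question relies on (the starting capacity and growth factor are the tunables):

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        unsigned char *data;
        size_t         size;       /* bytes in use */
        size_t         capacity;   /* bytes allocated */
    } buffer_t;

    static int buffer_reserve(buffer_t *b, size_t needed)
    {
        if (needed <= b->capacity)
            return 0;
        size_t new_cap = b->capacity ? b->capacity : 4096;
        while (new_cap < needed)
            new_cap *= 2;                                   /* geometric (doubling) growth */
        unsigned char *p = realloc(b->data, new_cap);
        if (!p)
            return -1;
        memset(p + b->capacity, 0, new_cap - b->capacity);  /* zero only the new region */
        b->data = p;
        b->capacity = new_cap;
        return 0;
    }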
It's best to understand allocation patterns of your program, if this is a problem you need to optimize for. This comes by understanding the program's implementation, the architecture(s) it runs within, and by observation (e.g. time and memory profiling).
The truth is, you can optimize from many perspectives, but things change over time (inputs change, environments change). In user land, your memory usage is already being second-guessed by the operating system underneath you.
Given your allocation sizes, I assume you are already depending on a system which will default to a backing store as needed. As such, you don't have much control over what is paged or when. Peeking at available physical memory is not worth consideration in this case, and you will have to work hard to do better than the system's existing virtual memory implementation. Several of these systems try to use all available memory (e.g. "Unused RAM is wasted RAM").
Having said that and if those assumptions are correct: It's often better to just reduce your allocation sizes and working sets and do I/O yourself as needed.
Your OS probably uses disk caching as well; reads and writes of large blocks of memory are probably faster than you suspect.
Even deeper: Use virtual memory or memory mapped files for these large data sets. Your kernel will likely handle these cases very well.
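For example, on a POSIX system the value store could be backed by a memory-mapped file and left to the kernel's paging; a sketch with minimal error handling, where map_store is an invented name:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void *map_store(const char *path, size_t bytes)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            return NULL;
        if (ftruncate(fd, (off_t)bytes) != 0) {   /* size the backing file */
            close(fd);
            return NULL;
        }
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                                /* the mapping stays valid */
        return p == MAP_FAILED ? NULL : p;
    }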
Obviously I'm not going to assume this worst case scenario and allocate all of that memory up-front.
Then you will likely be surprised to learn that a 2 GB calloc alone may, in some environments, be better than the alternatives people come up with, because a large calloc can simply reserve a range of virtual address space, loading/initializing pages only when you access them. Depending on your usage, this approach can be much better than some of the alternatives you may be offered.
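For instance, a sketch that assumes an OS which overcommits (e.g. a default Linux configuration): the whole worst case is reserved up front and only paid for as it is touched.

    #include <stdlib.h>

    #define MAX_ENTRIES (256u * 1024 * 1024)        /* 2 GB worth of doubles */

    static double *values;

    int init_store(void)
    {
        /* Reserves zeroed address space; physical pages are typically committed
           only as individual entries are first read or written. */
        values = calloc(MAX_ENTRIES, sizeof *values);
        return values ? 0 : -1;
    }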
A good starting point for many problems when understanding a program or input's allocation patterns is to start out conservative, and then make the most beneficial adjustments based on observation. In many cases, you will need little more information than a) accurately determining how much to resize by when resizing is necessary b) reusing allocations where appropriate c) designing your data well for the problem at hand.

dealing with memory fragmentation for a simulation of dynamic memory allocation

I am working on a dynamic memory allocation simulation using a fixed-size array in C, and I would like to know the best way to deal with fragmentation. My plan is to split the array into two parts, the left part reserved for small blocks and the right part reserved for big blocks. I would then use the best-fit approach to find the smallest/largest memory block available to use. Is there another, better approach to avoid fragmentation (where you have a bunch of blocks available throughout the array but no single one meets the space needed)?
The best approach depends on the modus operandi of your program (the user of your memory manager). If the usage pattern is to allocate many small fragments and delete them frequently, you don't need to be overly aggressive with defragmentation; in that case the rare large-block users will pay for the defragmentation operation. Similarly, if large block allocations are frequent, it might make sense to defragment more often. But the best strategy (assuming you still want to roll your own) is to program it in a general, tunable way and then measure the performance impact (in fragmentation ops or otherwise) based on real program runs.

Best fit vs segregated fit vs buddy system for least fragmentation

I am working on a dynamic memory allocation simulation (malloc() and free()) using a fixed-size array in C, and I would like to know which of these will give the least amount of fragmentation (internal and external). I do not care about speed; being able to reduce fragmentation is the issue I want to solve.
For the buddy system, I've read that it has high internal fragmentation because most requested sizes are not powers of 2.
For best fit, the only negative I've heard of is that the search is sequential, so it takes a long time (not a problem in my case). In any case, from my understanding, I could use a binary tree to search instead. I could also split the fixed-size array into two, where the left side is for smaller blocks and the right side is for bigger blocks. I don't know of any other negatives.
For segregated fit, it is similar to the buddy system, but I'm not sure whether it has the same fragmentation problems since it does not split by powers of 2.
Does anyone have some statistics or know the answer to my question from having used these before?
Controlling fragmentation is use-case dependent. The only scenario where fragmentation will never occur is when your malloc() function returns a fixed-size memory chunk whenever you call it. This is closer to memory pooling. Some high-end servers do this in an attempt to boost performance when they know what they are going to allocate and how long they will keep that allocation.
The moment you start thinking of reducing fragmentation in a general-purpose memory manager, the simple algorithms you mentioned will do no good. There are complex memory management algorithms like jemalloc, tcmalloc and Poul-Henning Kamp's malloc that deal with this kind of problem. There are many whitepapers published on the topic; ACM and IEEE have a plethora of literature around this.

Optimization of C program with SLAB-like technologies

I have a programming project with highly intensive use of malloc/free functions.
It has three types of structures that are created and destroyed very dynamically and in large numbers. As a result, malloc and free are heavily used, called thousands of times per second. Can replacing the standard memory allocator with a user-space version of SLAB solve this problem? Is there any implementation of such algorithms?
P.S.
System is Linux-oriented.
The structures are all smaller than 100 bytes.
Finally, I'd prefer to use a ready-made implementation, because memory management is a really hard topic.
If you only have three different structure sizes, then you would gain greatly by using a pool allocator (either custom made or something like boost::pool, but for C). Doug Lea's binning-based malloc would serve as a very good base for a pool allocator (it's used in glibc).
However, you also need to take into account other factors, such as multi-threading and memory reuse (will objects be allocated, freed, then reallocated, or just allocated and then freed?). From this angle you can look into tcmalloc (which is designed for extreme allocation loads, both in quantity and memory usage), nedmalloc or Hoard. All of these allocators are open source and thus can be easily altered to suit the sizes of the objects you allocate.
Without knowing more it's impossible to give you a good answer, but yes, managing your own memory (often by allocating a large block and then doing your own allocations within that large block) can avoid the high cost associated with general-purpose memory managers. For example, on Windows many small allocations will bring performance to its knees. Existing implementations exist for almost every type of memory manager, but I'm not sure what kind you're asking for exactly...
When programming on Windows I find calling malloc/free is like death for performance -- almost any in-app scheme that amortizes memory allocations by batching them will save you gobs of processor time when allocating/freeing, so it may not be so important which approach you use, as long as you're not calling the default allocator.
That being said, here's some simplistic multithreading-naive ideas:
This isn't strictly a slab manager, but it seems to achieve a good balance and is commonly used.
I personally find I often end up using a fairly simple-to-implement memory-reusing manager for memory blocks of the same sizes -- it maintains a linked list of unused memory of a fixed size and allocates a new block of memory when it needs to. The trick here is to store the pointers for the linked list in the unused memory blocks -- that way there's a very tiny overhead of four bytes. The entire process is O(1) whenever it's reusing memory. When it has to allocate memory it calls a slab allocator (which itself is trivial.)
For a pure allocate-only slab allocator you just ask the system (nicely) to give you a large chunk of memory and keep track of what space you haven't used yet (just maintain a pointer to the start of the unused area and a pointer to the end). When you don't have enough space to allocate the requested size, allocate a new slab. (For large chunks, just pass through to the system allocator.)
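A minimal sketch of those two pieces together - a bump-pointer slab feeding a single-size free list whose links live inside the freed blocks. Sizes and the system-allocation call are illustrative, and the block size must be at least sizeof(void *) and suitably aligned:

    #include <stdlib.h>

    #define SLAB_BYTES (64 * 1024)

    static unsigned char *slab_cur, *slab_end;   /* unused tail of the current slab */
    static void          *free_list;             /* recycled blocks, one fixed size */

    static void *slab_alloc(size_t size)         /* trivial allocate-only slab */
    {
        if (slab_cur == NULL || (size_t)(slab_end - slab_cur) < size) {
            unsigned char *s = malloc(SLAB_BYTES);    /* ask the system for a new slab */
            if (!s)
                return NULL;
            slab_cur = s;
            slab_end = s + SLAB_BYTES;
        }
        void *p = slab_cur;
        slab_cur += size;                             /* bump the pointer */
        return p;
    }

    void *block_alloc(size_t size)               /* size is the pool's one fixed size */
    {
        if (free_list) {                         /* O(1) reuse path */
            void *p = free_list;
            free_list = *(void **)p;             /* next link stored in the free block */
            return p;
        }
        return slab_alloc(size);
    }

    void block_free(void *p)
    {
        *(void **)p = free_list;                 /* thread the link through the block */
        free_list = p;
    }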
The problem with chaining these approaches? Your application will never free any memory, but performance-critical applications often are either one-shot processing applications or create many objects of the same sizes and then stop using them.
If you're careful, the above approach isn't too hard to make multithread friendly, even with just atomic operations.
I recently implemented my own userspace slab allocator, and it proved to be much more efficient (speed-wise and memory-wise) than malloc/free for a large number of fixed-size allocations. You can find it here.
Allocation and freeing work in O(1) time and are sped up because bitvectors are used to represent empty/full slots. When allocating, the __builtin_ctzll GCC intrinsic is used to locate the first set bit in the bitvector (representing an empty slot), which should translate to a single instruction on modern hardware. When freeing, some clever bitwise arithmetic is performed on the pointer itself in order to locate the header of the corresponding slab and mark the corresponding slot as free.
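Not the linked implementation itself, but a sketch of the same idea under the stated assumptions (GCC/Clang for __builtin_ctzll, and slabs allocated aligned to their own size so the header can be recovered by masking the object pointer):

    #include <stdint.h>
    #include <stddef.h>

    #define SLAB_SIZE  4096u                /* each slab is SLAB_SIZE-aligned */
    #define OBJ_SIZE   48u                  /* illustrative fixed object size */

    typedef struct slab {
        uint64_t free_slots;                /* bit i set => slot i is empty */
        unsigned char objects[64 * OBJ_SIZE];
    } slab_t;

    void *slab_get(slab_t *s)
    {
        if (s->free_slots == 0)
            return NULL;                                 /* slab is full */
        int slot = __builtin_ctzll(s->free_slots);       /* first set bit: ~one instruction */
        s->free_slots &= ~(1ull << slot);
        return s->objects + (size_t)slot * OBJ_SIZE;
    }

    void slab_put(void *p)
    {
        /* Mask the pointer down to the slab header; this is the "clever bitwise
           arithmetic", and it only works because slabs are SLAB_SIZE-aligned. */
        slab_t *s = (slab_t *)((uintptr_t)p & ~(uintptr_t)(SLAB_SIZE - 1));
        size_t slot = (size_t)((unsigned char *)p - s->objects) / OBJ_SIZE;
        s->free_slots |= 1ull << slot;
    }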
