Resources for memory management in embedded application

Resources for memory management in embedded application - c

How should I manage memory in my mission critical embedded application?
I found some articles with google, but couldn't pinpoint a really useful practical guide.
The DO-178b forbids dynamic memory allocations, but how will you manage the memory then? Preallocate everything in advance and send a pointer to each function that needs allocation? Allocate it on the stack? Use a global static allocator (but then it's very similar to dynamic allocation)?
Answers can be of the form of regular answer, reference to a resource, or reference to good opensource embedded system for example.
clarification: The issue here is not whether or not memory management is availible for the embedded system. But what is a good design for an embedded system, to maximize reliability.
I don't understand why statically preallocating a buffer pool, and dynamically getting and dropping it, is different from dynamically allocating memory.

As someone who has dealt with embedded systems, though not to such rigor so far (I have read DO-178B, though):
If you look at the u-boot bootloader, a lot is done with a globally placed structure. Depending on your exact application, you may be able to get away with a global structure and stack. Of course, there are re-entrancy and related issues there that don't really apply to a bootloader but might for you.
Preallocate, preallocate, preallocate. If you can at design-time bind the size of an array/list structure/etc, declare it as a global (or static global -- look Ma, encapsulation).
The stack is very useful, use it where needed -- but be careful, as it can be easy to keep allocating off of it until you have no stack space left. Some code I once found myself debugging would allocate 1k buffers for string management in multiple functions...occasionally, the usage of the buffers would hit another program's stack space, as the default stack size was 4k.
The buffer pool case may depend on exactly how it's implemented. If you know you need to pass around fixed-size buffers of a size known at compile time, dealing with a buffer pool is likely more easy to demonstrate correctness than a complete dynamic allocator. You just need to verify buffers cannot be lost, and validate your handling won't fail. There seem to be some good tips here: http://www.cotsjournalonline.com/articles/view/101217
Really, though, I think your answers might be found in joining http://www.do178site.com/

I've worked in a DO-178B environment (systems for airplanes). What I have understood, is that the main reason for not allowing dynamic allocation is mainly certification. Certification is done through tests (unitary, coverage, integration, ...). With those tests you have to prove that you the behavior of your program is 100% predictable, nearly to the point that the memory footprint of your process is the same from one execution to the next. As dynamic allocation is done on the heap (and can fail) you can not easily prove that (I imagine it should be possible if you master all the tools from the hardware to any piece of code written, but ...). You have not this problem with static allocation. That also why C++ was not used at this time in such environments. (it was about 15 years ago, that might have changed ...)
Practically, you have to write a lot of struct pools and allocation functions that guarantee that you have something deterministic. You can imagine a lot of solutions. The key is that you have to prove (with TONS of tests) a high level of deterministic behavior. It's easier to prove that your hand crafted developpement work deterministically that to prove that linux + gcc is deterministic in allocating memory.
Just my 2 cents. It was a long time ago, things might have changed, but concerning certification like DO-178B, the point is to prove your app will work the same any time in any context.

Disclaimer: I've not worked specifically with DO-178b, but I have written software for certified systems.
On the certified systems for which I have been a developer, ...
Dynamic memory allocation was
acceptable ONLY during the
initialization phase.
Dynamic memory de-allocation was NEVER acceptable.
This left us with the following options ...
Use statically allocated structures.
Create a pool of structures and then get/release them from/back to the pool.
For flexibility, we could dynamically allocate the size of the pools or number of structures during the initialization phase. However, once past that init phase, we were stuck with what we had.
Our company found that pools of structures and then get/releasing from/back into the pool was most useful. We were able to keep to the model, and keep things deterministic with minimal problems.
Hope that helps.

Real-time, long running, mission critical systems should not dynamically allocate and free memory from heap. If you need and cannot design around it to then write your own allocated and fixed pool management scheme. Yes, allocated fixed ahead of time whenever possible. Anything else is asking for eventual trouble.

Allocating everything from stack is commonly done in embedded systems or elsewhere where the possibility of an allocation failing is unacceptable. I don't know what DO-178b is, but if the problem is that malloc is not available on your platform, you can also implement it yourself (implementing your own heap), but this still may lead to an allocation failing when you run out of space, of course.

There's no way to be 100% sure.
You may look at FreeRTOS' memory allocators examples. Those use static pool, if i'm not mistaken.

You might find this question interesting as well, dynamic allocation is often prohibited in space hardened settings (actually, core memory is still useful there).
Typically, when malloc() is not available, I just use the stack. As Tronic said, the whole reason behind not using malloc() is that it can fail. If you are using a global static pool, it is conceivable that your internal malloc() implementation could be made fail proof.
It really, really, really depends on the task at hand and what the board is going to be exposed to.

Related

memory allocation/deallocation for embedded devices

Currently we use malloc/free Linux commands for memory allocation/de-allocation in our C based embedded application. I heard that this would cause memory fragmentation as the heap size increases/decreases because of memory allocation/de-allocation which would result in performance degradation. Other programming languages with efficient Garbage Collection solves this issue by freeing the memory when not in use.
Are there any alternate approaches which would solve this issue in C based embedded programs ?

You may take a look at a solution called memory pool allocation.
See: Memory pools implementation in C

Yes, there's an easy solution: don't use dynamic memory allocation outside of initialization.
It is common (in my experience) in embedded systems to only allow calls to malloc when a program starts (this is usually done by convention, there's nothing in C to enforce this. Although you can create your own wrapper for malloc to do this). This requires more work to analyze what memory your program could possibly use since you have to allocate it all at once. The benefit you get, however, is a complete understanding of what memory your program uses.
In some cases this is fairly straightforward, in particular if your system has enough memory to allocate everything it could possibly need all at once. In severely memory-limited systems, however, you're left with the managing the memory yourself. I've seen this done by writing "custom allocators" which you allocate and free memory from. I'll provide an example.
Let's say you're implementing some mathematical program that needs lots of big matrices (not terribly huge, but for example 1000x1000 floats). Your system may not have the memory to allocate many of these matrices, but if you can allocate at least one of them, you could create a pool of memory used for matrix objects, and every time you need a matrix you grab memory from that pool, and when you're done with it you return it to the pool. This is easy if you can return them in the same order you got them in, meaning the memory pool works just like a stack. If this isn't the case, perhaps you could just clear the entire pool at the end of each "iteration" (assuming this math system is periodic).
With more detail about what exactly you're trying to implement I could provide more relevant/specific examples.
Edit: See sg7's answer as well: that user provides a link to well-established frameworks which implement what I describe here.

What are the good implementation practices to minimize RAM consumption

I run a C code on an arm based Linux device that has a very small RAM space (16MB). My code is often killed (SIGKILL) by the kernel with 'out of memory' message. I run the program with Valgrind, and it does not look like there is a memory leak. I run the code with gdb as well but could not identify any mistake on the code. I will try to optimize my code going it through some many times.
In general, what would be the good implementation practices on a code to minimize the memory usage?
one might be to use functions as much as possible(?), but I guess gcc already optimizes the code to decrease the source usage.
to avoid dynamic memory allocations
what else?

Be careful about scope of objects. Make sure you are handling the memory deallocation after an object is no longer needed. I'm not sure I understand your use functions as much as possible(?). Functions require overhead, every call causes a little bit of extra memory to be taken up because it has to store a few pointers and a little bit of information about the method on the call stack. So, while that may help keep your source code clean - it won't lower your memory usage (it'll probably increase it). One way to get the best of both worlds in C is to use inline functions - which suggests to the compiler that it should not create an actual function, but rather just insert that block of code wherever it is used. Keep in mind that efficient code usually has a more machine level look to it (meaning repetition, pointers, and often developer-managed array indices) rather than taking advantage of broad purpose, function abundant objects. But, thank goodness for smart compilers so you don't have to know every optimization. However, in a lower level language like c, since it gives you so much ability to manipulate everything, you need to be careful that you don't make costly mistakes.

If you have this kind of problem on Linux you can disable overcommit memory. It will make sure that all the memory allocated has physical memory. The kernel will be less likely to kill your program. Then be sure to test the result of all mallocs because they will fail at some point when you don't have memory anymore. You can find more information here : http://www.etalabs.net/overcommit.html
You can also disable some programs on your embedded system to free memory. May be you don't use cron or don't need six TTY at startup.

C - Design your own free( ) function

Today, I appeared for an interview and the interviewer asked me this,
Tell me the steps how will you design your own free( ) function for
deallocate the allocated memory.
How can it be more efficient than C's default free() function ? What can you conclude ?
I was confused, couldn't think of the way to design.
What do you think guys ?
EDIT : Since we need to know about how malloc() works, can you tell me the steps to write our own malloc() function

That's actually a pretty vague question, and that's probably why you got confused. Does he mean, given an existing malloc implementation, how would you go about trying to develop a more efficient way to free the underlying memory? Or was he expecting you to start discussing different kinds of malloc implementations and their benefits and problems? Did he expect you to know how virtual memory functions on the x86 architecture?
Also, by more efficient, does he mean more space efficient or more time efficient? Does free() have to be deterministic? Does it have to return as much memory to the OS as possible because it's in a low-memory, multi-tasking environment? What's our criteria here?
It's hard to say where to start with a vague question like that, other than to start asking your own questions to get clarification. After all, in order to design your own free function, you first have to know how malloc is implemented. So chances are, the question was really about whether or not you knew anything about how malloc can be implemented.
If you're not familiar with the internals of memory management, the easiest way to get started with understanding how malloc is implemented is to first write your own.
Check out this IBM DeveloperWorks article called "Inside Memory Management" for starters.
But before you can write your own malloc/free, you first need memory to allocate/free. Unfortunately, in a protected mode OS, you can't directly address the memory on the machine. So how do you get it?
You ask the OS for it. With the virtual memory features of the x86, any piece of RAM or swap memory can be mapped to a memory address by the OS. What your program sees as memory could be physically fragmented throughout the entire system, but thanks to the kernel's virtual memory manager, it all looks the same.
The kernel usually provides system calls that allow you to map in additional memory for your process. On older UNIX OS's this was usually brk/sbrk to grow heap memory onto the edge of your process or shrink it off, but a lot of systems also provide mmap/munmap to simply map a large block of heap memory in. It's only once you have access to a large, contiguous looking block of memory that you need malloc/free to manage it.
Once your process has some heap memory available to it, it's all about splitting it into chunks, with each chunk containing its own meta information about its size and position and whether or not it's allocated, and then managing those chunks. A simple list of structs, each containing some fields for meta information and a large array of bytes, could work, in which case malloc has to run through the list until if finds a large enough unallocated chunk (or chunks it can combine), and then map in more memory if it can't find a big enough chunk. Once you find a chunk, you just return a pointer to the data. free() can then use that pointer to reverse back a few bytes to the member fields that exist in the structure, which it can then modify (i.e. marking chunk.allocated = false;). If there's enough unallocated chunks at the end of your list, you can even remove them from the list and unmap or shrink that memory off your process's heap.
That's a real simple method of implementing malloc though. As you can imagine, there's a lot of possible ways of splitting your memory into chunks and then managing those chunks. There's as many ways as there are data structures and algorithms. They're all designed for different purposes too, like limiting fragmentation due to small, allocated chunks mixed with small, unallocated chunks, or ensuring that malloc and free run fast (or sometimes even more slowly, but predictably slowly). There's dlmalloc, ptmalloc, jemalloc, Hoard's malloc, and many more out there, and many of them are quite small and succinct, so don't be afraid to read them. If I remember correctly, "The C Programming Language" by Kernighan and Ritchie even uses a simple malloc implementation as one of their examples.

You can't blindly design free() without knowing how malloc() works under the hood because your implementation of free() would need to know how to manipulate the bookkeeping data and that's impossible without knowing how malloc() is implemented.
So an unswerable question could be how you would design malloc() and free() instead which is not a trivial question but you could answer it partially for example by proposing some very simple implementation of a memory pool that would not be equivalent to malloc() of course but would indicate your presence of knowledge.

One common approach when you only have access to user space (generally known as memory pool) is to get a large chunk of memory from the OS on application start-up. Your malloc needs to check which areas of the right size of that pool are still free (through some data structure) and hand out pointers to that memory. Your free needs to mark the memory as free again in the data structure and possibly needs to check for fragmentation of the pool.
The benefits are that you can do allocation in nearly constant time, the drawback is that your application consumes more memory than actually is needed.

Tell me the steps how will you design your own free( ) function for deallocate the allocated memory.
#include <stdlib.h>
#undef free
#define free(X) my_free(X)
inline void my_free(void *ptr) { }
How can it be more efficient than C's default free() function ?
It is extremely fast, requiring zero machine cycles. It also makes use-after-free bugs go away. It's a very useful free function for use in programs which are instantiated as short-lived batch processes; it can usefully be deployed in some production situations.
What can you conclude ?
I really want this job, but in another company.

Memory usage patterns could be a factor. A default implementation of free can't assume anything about how often you allocate/deallocate and what sizes you allocate when you do.
For example, if you frequently allocate and deallocate objects that are of similar size, you could gain speed, memory efficiency, and reduced fragmentation by using a memory pool.
EDIT: as sharptooth noted, only makes sense to design free and malloc together. So the first thing would be to figure out how malloc is implemented.

malloc and free only have a meaning if your app is to work on top of an OS. If you would like to write your own memory management functions you would have to know how to request the memory from that specific OS or you could reserve the heap memory right away using existing malloc and then use your own functions to distribute/redistribute the allocated memory through out your app

There is an architecture that malloc and free are supposed to adhere to -- essentially a class architecture permitting different strategies to coexist. Then the version of free that is executed corresponds to the version of malloc used.
However, I'm not sure how often this architecture is observed.

The knowledge of working of malloc() is necessary to implement free(). You can find a implementation of malloc() and free() using the sbrk() system call in K&R The C Programming Language Chapter 8, Section 8.7 "Example--A Storage Allocator" pp.185-189.

What are alternatives to malloc() in C?

I am writing C for an MPC 555 board and need to figure out how to allocate dynamic memory without using malloc.

Typically malloc() is implemented on Unix using sbrk() or mmap(). (If you use the latter, you want to use the MAP_ANON flag.)
If you're targetting Windows, VirtualAlloc may help. (More or less functionally equivalent to anonymous mmap().)
Update: Didn't realize you weren't running under a full OS, I somehow got the impression instead that this might be a homework assignment running on top of a Unix system or something...
If you are doing embedded work and you don't have a malloc(), I think you should find some memory range that it's OK for you to write on, and write your own malloc(). Or take someone else's.
Pretty much the standard one that everybody borrows from was written by Doug Lea at SUNY Oswego. For example glibc's malloc is based on this. See: malloc.c, malloc.h.

You might want to check out Ralph Hempel's Embedded Memory Manager.

If your runtime doesn't support malloc, you can find an open source malloc and tweak it to manage a chunk of memory yourself.

malloc() is an abstraction that is use to allow C programs to allocate memory without having to understand details about how memory is actually allocated from the operating system. If you can't use malloc, then you have no choice other than to use whatever facilities for memory allocation that are provided by your operating system.
If you have no operating system, then you must have full control over the layout of memory. At that point for simple systems the easiest solution is to just make everything static and/or global, for more complex systems, you will want to reserve some portion of memory for a heap allocator and then write (or borrow) some code that use that memory to implement malloc.

An answer really depends on why you might need to dynamically allocate memory. What is the system doing that it needs to allocate memory yet cannot use a static buffer? The answer to that question will guide your requirements in managing memory. From there, you can determine which data structure you want to use to manage your memory.
For example, a friend of mine wrote a thing like a video game, which rendered video in scan-lines to the screen. That team determined that memory would be allocated for each scan-line, but there was a specific limit to how many bytes that could be for any given scene. After rendering each scan-line, all the temporary objects allocated during that rendering were freed.
To avoid the possibility of memory leaks and for performance reasons (this was in the 90's and computers were slower then), they took the following approach: They pre-allocated a buffer which was large enough to satisfy all the allocations for a scan-line, according to the scene parameters which determined the maximum size needed. At the beginning of each scan-line, a global pointer was set to the beginning of the scan line. As each object was allocated from this buffer, the global pointer value was returned, and the pointer was advanced to the next machine-word-aligned position following the allocated amount of bytes. (This alignment padding was including in the original calculation of buffer size, and in the 90's was four bytes but should now be 16 bytes on some machinery.) At the end of each scan-line, the global pointer was reset to the beginning of the buffer.
In "debug" builds, there were two scan buffers, which were protected using virtual memory protection during alternating scan lines. This method detects stale pointers being used from one scan-line to the next.
The buffer of scan-line memory may be called a "pool" or "arena" depending on whome you ask. The relevant detail is that this is a very simple data structure which manages memory for a certain task. It is not a general memory manager (or, properly, "free store implementation") such as malloc, which might be what you are asking for.
Your application may require a different data structure to keep track of your free storage. What is your application?

You should explain why you can't use malloc(), as there might be different solutions for different reasons, and there are several reasons why it might be forbidden or unavailable on small/embedded systems:
concern over memory fragmentation. In this case a set of routines that allocate fixed size memory blocks for one or more pools of memory might be the solution.
the runtime doesn't provide a malloc() - I think most modern toolsets for embedded systems do provide some way to link in a malloc() implementation, but maybe you're using one that doesn't for whatever reason. In that case, using Doug Lea's public domain malloc might be a good choice, but it might be too large for your system (I'm not familiar with the MPC 555 off the top of my head). If that's the case, a very simple, custom malloc() facility might be in order. It's not too hard to write, but make sure you unit test the hell out of uit because it's also easy to get details wrong. For example, I have a set of very small routines that use a brain dead memory allocation strategy using blocks on a free list (the allocator can be compile-time configured for first, best or last fit). I give it an array of char at initialization, and subsequent allocation calls will split free blocks as necessary. It's nowhere near as sophisticated as Lea's malloc(), but it's pretty dang small so for simple uses it'll do the trick.
many embedded projects forbid the use of dynamic memory allocation - in this case, you have to live with statically allocated structures

Write your own. Since your allocator will probably be specialized to a few types of objects, I recommend the Quick Fit scheme developed by Bill Wulf and Charles Weinstock. (I have not been able to find a free copy of this paper, but many people have access to the ACM digital library.) The paper is short, easy to read, and well suited to your problem.
If you turn out to need a more general allocator, the best guide I have found on the topic of programming on machines with fixed memory is Donald Knuth's book The Art of Computer Programming, Volume 1. If you want examples, you can find good ones in Don's epic book-length treatment of the source code of TeX, TeX: The Program.
Finally, the undergraduate textbook by Bryant and O'Hallaron is rather expensive, but it goes through the implementation of malloc in excruciating detail.

Write your own. Preallocate a big chunk of static RAM, then write some functions to grab and release chunks of it. That's the spirit of what malloc() does, except that it asks the OS to allocate and deallocate memory pages dynamically.
There are a multitude of ways of keeping track of what is allocated and what is not (bitmaps, used/free linked lists, binary trees, etc.). You should be able to find many references with a few choice Google searches.

malloc() and its related functions are the only game in town. You can, of course, roll your own memory management system in whatever way you choose.

If there are issues allocating dynamic memory from the heap, you can try allocating memory from the stack using alloca(). The usual caveats apply:
The memory is gone when you return.
The amount of memory you can allocate is dependent on the maximum size of your stack.

You might be interested in: liballoc
It's a simple, easy-to-implement malloc/free/calloc/realloc replacement which works.
If you know beforehand or can figure out the available memory regions on your device, you can also use their libbmmm to manage these large memory blocks and provide a backing-store for liballoc. They are BSD licensed and free.

FreeRTOS contains 3 examples implementations of memory allocation (including malloc()) to achieve different optimizations and use cases appropriate for small embedded systems (AVR, ARM, etc). See the FreeRTOS manual for more information.
I don't see a port for the MPC555, but it shouldn't be difficult to adapt the code to your needs.

If the library supplied with your compiler does not provide malloc, then it probably has no concept of a heap.
A heap (at least in an OS-less system) is simply an area of memory reserved for dynamic memory allocation. You can reserve such an area simply by creating a suitably sized statically allocated array and then providing an interface to provide contiguous chunks of this array on demand and to manage chunks in use and returned to the heap.
A somewhat neater method is to have the linker allocate the heap from whatever memory remains after stack and static memory allocation. That way the heap is always automatically as large as it possibly can be, allowing you to use all available memory simply. This will require modification of the application's linker script. Linker scripts are specific to the particular toolchain, and invariable somewhat arcane.
K&R included a simple implementation of malloc for example.

What's a good C memory allocator for embedded systems? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Improve this question
I have an single threaded, embedded application that allocates and deallocates lots and lots of small blocks (32-64b). The perfect scenario for a cache based allocator. And although I could TRY to write one it'll likely be a waste of time, and not as well tested and tuned as some solution that's already been on the front lines.
So what would be the best allocator I could use for this scenario?
Note: I'm using a Lua Virtual Machine in the system (which is the culprit of 80+% of the allocations), so I can't trivially refactor my code to use stack allocations to increase allocation performance.

I'm a bit late to the party, but I just want to share very efficient memory allocator for embedded systems I've recently found and tested: https://github.com/dimonomid/umm_malloc
This is a memory management library specifically designed to work with the ARM7, personally I use it on PIC32 device, but it should work on any 16- and 8-bit device (I have plans to test in on 16-bit PIC24, but I haven't tested it yet)
I was seriously beaten by fragmentation with default allocator: my project often allocates blocks of various size, from several bytes to several hundreds of bytes, and sometimes I faced 'out of memory' error. My PIC32 device has total 32K of RAM, and 8192 bytes is used for heap. At the particular moment there is more than 5K of free memory, but default allocator has maximum non-fragmented memory block just of about 700 bytes, because of fragmentation. This is too bad, so I decided to look for more efficient solution.
I already was aware of some allocators, but all of them has some limitations (such as block size should be a power or 2, and starting not from 2 but from, say, 128 bytes), or was just buggy. Every time before, I had to switch back to the default allocator.
But this time, I'm lucky: I've found this one: http://hempeldesigngroup.com/embedded/stories/memorymanager/
When I tried this memory allocator, in exactly the same situation with 5K of free memory, it has more than 3800 bytes block! It was so unbelievable to me (comparing to 700 bytes), and I performed hard test: device worked heavily more than 30 hours. No memory leaks, everything works as it should work.
I also found this allocator in the FreeRTOS repository: http://svnmios.midibox.org/listing.php?repname=svn.mios32&path=%2Ftrunk%2FFreeRTOS%2FSource%2Fportable%2FMemMang%2F&rev=1041&peg=1041# , and this fact is an additional evidence of stability of umm_malloc.
So I completely switched to umm_malloc, and I'm quite happy with it.
I just had to change it a bit: configuration was a bit buggy when macro UMM_TEST_MAIN is not defined, so, I've created the github repository (the link is at the top of this post). Now, user dependent configuration is stored in separate file umm_malloc_cfg.h
I haven't got deeply yet in the algorithms applied in this allocator, but it has very detailed explanation of algorithms, so anyone who is interested can look at the top of the file umm_malloc.c . At least, "binning" approach should give huge benefit in less-fragmentation: http://g.oswego.edu/dl/html/malloc.html
I believe that anyone who needs for efficient memory allocator for microcontrollers, should at least try this one.

In a past project in C I worked on, we went down the road of implementing our own memory management routines for a library ran on a wide range of platforms including embedded systems. The library also allocated and freed a large number of small buffers. It ran relatively well and didn't take a large amount of code to implement. I can give you a bit of background on that implementation in case you want to develop something yourself.
The basic implementation included a set of routines that managed buffers of a set size. The routines were used as wrappers around malloc() and free(). We used these routines to manage allocation of structures that we frequently used and also to manage generic buffers of set sizes. A structure was used to describe each type of buffer being managed. When a buffer of a specific type was allocated, we'd malloc() the memory in blocks (if a list of free buffers was empty). IE, if we were managing 10 byte buffers, we might make a single malloc() that contained space for 100 of these buffers to reduce fragmentation and the number of underlying mallocs needed.
At the front of each buffer would be a pointer that would be used to chain the buffers in a free list. When the 100 buffers were allocated, each buffer would be chained together in the free list. When the buffer was in use, the pointer would be set to null. We also maintained a list of the "blocks" of buffers, so that we could do a simple cleanup by calling free() on each of the actual malloc'd buffers.
For management of dynamic buffer sizes, we also added a size_t variable at the beginning of each buffer telling the size of the buffer. This was then used to identify which buffer block to put the buffer back into when it was freed. We had replacement routines for malloc() and free() that did pointer arithmetic to get the buffer size and then to put the buffer into the free list. We also had a limit on how large of buffers we managed. Buffers larger than this limit were simply malloc'd and passed to the user. For structures that we managed, we created wrapper routines for allocation and freeing of the specific structures.
Eventually we also evolved the system to include garbage collection when requested by the user to clean up unused memory. Since we had control over the whole system, there were various optimizations we were able to make over time to increase performance of the system. As I mentioned, it did work quite well.

I did some research on this very topic recently, as we had an issue with memory fragmentation. In the end we decided to stay with GNU libc's implementation, and add some application-level memory pools where necessary. There were other allocators which had better fragmentation behavior, but we weren't comfortable enough with them replace malloc globally. GNU's has the benefit of a long history behind it.
In your case it seems justified; assuming you can't fix the VM, those tiny allocations are very wasteful. I don't know what your whole environment is, but you might consider wrapping the calls to malloc/realloc/free on just the VM so that you can pass it off to a handler designed for small pools.

Although its been some time since I asked this, my final solution was to use LoKi's SmallObjectAllocator it work great. Got rid off all the OS calls and improved the performance of my Lua engine for embedded devices. Very nice and simple, and just about 5 minutes worth of work!

Since version 5.1, Lua has allowed a custom allocator to be set when creating new states.

I'd just also like to add to this even though it's an old thread. In an embedded application if you can analyze your memory usage for your application and come up with a max number of memory allocation of the varying sizes usually the fastest type of allocator is one using memory pools. In our embedded apps we can determine all allocation sizes that will ever be needed during run time. If you can do this you can completely eliminate heap fragmentation and have very fast allocations. Most these implementations have an overflow pool which will do a regular malloc for the special cases which will hopefully be far and few between if you did your analysis right.

I have used the 'binary buddy' system to good effect under vxworks. Basically, you portion out your heap by cutting blocks in half to get the smallest power of two sized block to hold your request, and when blocks are freed, you can make a pass up the tree to merge blocks back together to mitigate fragmentation. A google search should turn up all the info you need.

I am writing a C memory allocator called tinymem that is intended to be able to defragment the heap, and re-use memory. Check it out:
https://github.com/vitiral/tinymem
Note: this project has been discontinued to work on the rust implementation:
https://github.com/vitiral/defrag-rs
Also, I had not heard of umm_malloc before. Unfortunately, it doesn't seem to be able to deal with fragmentation, but it definitely looks useful. I will have to check it out.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight