I had a bug which I have now fixed but which I need to explain in a report.
I am working on an embedded device running FreeRTOS, which does its own heap memory management. FreeRTOS has its own version of malloc(), pvPortMalloc(), which I was unaware of, and using it fixed the memory issues I was having.
My question relates to the size of the memory overflow caused by malloc(): the data was only 8 bytes in size, yet the overflow was significant, kilobytes if not larger. My guess is that the first and only use of malloc() in this application set up a second heap, in competition with FreeRTOS's heap, of at least several KB in size.
Can anyone confirm this or give a better explanation? Pointers to more info or references would be greatly appreciated.
It is a common trait of many malloc implementations to request a larger chunk of memory from the system than is needed for a single request. For example, glibc's ptmalloc has this:
#define MINIMUM_MORECORE_SIZE (64 * 1024)
This serves as a lower bound on the amount of memory (in bytes) to request from the OS (via sbrk()) at a single time. So you would expect to see a single tiny allocation result in 64 KB "used."
One reason to do this sort of thing is to reduce system calls; another might be to reduce fragmentation.
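You can see this behaviour for yourself on a hosted Linux system. The following is a minimal sketch (glibc assumed; the exact growth depends on the implementation and version) that watches the program break around one tiny allocation:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *before = sbrk(0);   /* current program break */
    void *p = malloc(8);      /* a single tiny allocation */
    void *after = sbrk(0);    /* program break after the allocation */

    printf("8-byte malloc at %p\n", p);
    printf("program break grew by %ld bytes\n",
           (long)((char *)after - (char *)before));

    free(p);
    return 0;
}
On a typical glibc system the reported growth is tens of kilobytes or more, even though only 8 bytes were requested.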
I'm creating a list of elements inside a task in the following way:
l = (dllist*)pvPortMalloc(sizeof(dllist));
dllist is 32 bytes in size.
My embedded system has 60 kB of SRAM, so I expected that my 200-element list could be handled easily by the system. I found that after allocating space for 8 elements, the system crashes on the 9th allocation call (at 256+ bytes).
If possible, where can I change the heap size inside FreeRTOS?
Can I somehow query the current status of the heap?
I couldn't find this information in the documentation so I hope somebody can provide some insight in this matter.
Thanks in advance!
(Yes - FreeRTOS pvPortMalloc() returns void*.)
If you have 60K of SRAM and configTOTAL_HEAP_SIZE is large, then it is unlikely you are going to run out of heap after allocating 256 bytes, unless you had hardly any heap remaining beforehand. Many FreeRTOS demos will just keep creating objects until all the heap is used, so if your application is based on one of those, then you would be low on heap before your code executed. You may also have done something like use up loads of heap space by creating tasks with huge stacks.
heap_4 and heap_5 will combine adjacent blocks, which will minimise fragmentation as far as practical, but I don't think that will be your problem - especially as you don't mention freeing anything anywhere.
Unless you are using heap_3.c (which just makes the standard C library malloc and free thread safe) you can call xPortGetFreeHeapSize() to see how much free heap you have. You may also have xPortGetMinimumEverFreeHeapSize() available to query how close you have ever come to running out of heap. More information: http://www.freertos.org/a00111.html
You could also define a malloc() failed hook (http://www.freertos.org/a00016.html) to get instant notification of pvPortMalloc() returning NULL.
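As a sketch of how those pieces fit together (assuming configUSE_MALLOC_FAILED_HOOK is set to 1 in FreeRTOSConfig.h, and heap_4 or heap_5 for the minimum-ever query):
#include "FreeRTOS.h"
#include "task.h"

/* Called by the kernel whenever pvPortMalloc() is about to return NULL. */
void vApplicationMallocFailedHook(void)
{
    taskDISABLE_INTERRUPTS();
    for (;;) { }   /* park here so the debugger shows the failing allocation */
}

/* Example of querying heap usage at runtime. */
void vCheckHeap(void)
{
    size_t xFreeNow = xPortGetFreeHeapSize();
    size_t xFreeMin = xPortGetMinimumEverFreeHeapSize(); /* heap_4/heap_5 */

    (void)xFreeNow;   /* log these, or assert if they drop below a threshold */
    (void)xFreeMin;
}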
For the standard allocators you will find the config option (configTOTAL_HEAP_SIZE) in FreeRTOSConfig.h.
However:
It is quite possible you are running out of memory already, depending on the allocator used. IIRC there is one that does not free() any blocks (free() is just a dummy), so any block returned is lost. This is still useful if you only allocate memory, e.g. at startup, and then work with what you've got.
Other allocators might simply not merge adjacent blocks once they are returned, increasing fragmentation much faster than a fully-fledged allocator would.
Also, you might lose memory to fragmentation. Depending on your alloc/free pattern, you might quickly end up with a heap that looks like Swiss cheese: many holes between allocated blocks. So while there is still enough free memory overall, no single block is big enough for the size required.
If you only allocate blocks of that size there, you might be better off using your own allocator or a pool (blocks of fixed size). That would be statically allocated (e.g. an array) and chained into a linked list during startup. Alloc/free would then just be a push/pop on a stack (or put/get on a queue). That would also be very fast, with complexity O(1) (and interrupt-safe if properly written).
Note that normal malloc()/free() are not interrupt-safe.
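For illustration, a minimal fixed-size pool along those lines could look like the sketch below (names and sizes are invented here; wrap the list operations in a critical section or disable interrupts if they must be interrupt-safe):
#include <stddef.h>

#define POOL_BLOCK_SIZE   32u
#define POOL_BLOCK_COUNT  200u

typedef union pool_block {
    union pool_block *next;                  /* valid while the block is free   */
    unsigned char     data[POOL_BLOCK_SIZE]; /* valid while the block is in use */
} pool_block_t;

static pool_block_t  pool[POOL_BLOCK_COUNT]; /* statically allocated storage */
static pool_block_t *free_list;

void pool_init(void)                         /* chain all blocks at startup */
{
    for (size_t i = 0; i < POOL_BLOCK_COUNT; i++) {
        pool[i].next = free_list;
        free_list = &pool[i];
    }
}

void *pool_alloc(void)                       /* O(1): pop from the free list */
{
    pool_block_t *b = free_list;
    if (b != NULL)
        free_list = b->next;
    return b;
}

void pool_free(void *p)                      /* O(1): push back on the free list */
{
    pool_block_t *b = p;
    b->next = free_list;
    free_list = b;
}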
Finally: Do not cast the void * return value. (Well, that's actually what standard malloc() returns, and I expect the FreeRTOS variant does the same.)
I have tried on my machine using sbrk(1) and then deliberately writing out of bounds to test the page size, which is 4096 bytes. But when I call malloc(1), I get a SEGV only after accessing 135152 bytes, which is way more than one page. I know that malloc is a library function and is implementation dependent, but considering that it eventually calls sbrk, why does it give more than one page? Can anyone tell me about its internal workings?
My operating system is Ubuntu 14.04 and my architecture is x86.
Update: Now I am wondering if it's because malloc returns an address from a free-list block that is large enough to hold my data. But that address may be in the middle of the heap, so I can keep writing until the upper limit of the heap is reached.
Older UNIX malloc() implementations used the sbrk()/brk() system calls. These days, implementations use both mmap() and sbrk(). The malloc() implementation in glibc (which is probably the one you use on your Ubuntu 14.04) uses both sbrk() and mmap(), and the choice of which one to use for a given request typically depends on the size of the allocation, a decision glibc makes dynamically.
For small allocations, glibc uses sbrk(), and for larger allocations it uses mmap(). The parameter M_MMAP_THRESHOLD is used to decide this; currently, its default value is 128 KB. This explains why your code managed to access 135152 bytes, as that is roughly ~128 KB. Even though you requested only 1 byte, the implementation extends the heap by about 128 KB for efficient memory allocation, so the segfault didn't occur until you crossed that limit.
You can play with M_MMAP_THRESHOLD by using mallopt() to change the default parameter.
M_MMAP_THRESHOLD
For allocations greater than or equal to the limit specified (in bytes) by M_MMAP_THRESHOLD that can't be satisfied from the free list, the memory-allocation functions employ mmap(2) instead of increasing the program break using sbrk(2).
Allocating memory using mmap(2) has the significant advantage that the allocated memory blocks can always be independently released back to the system. (By contrast, the heap can be trimmed only if memory is freed at the top end.) On the other hand, there are some disadvantages to the use of mmap(2): deallocated space is not placed on the free list for reuse by later allocations; memory may be wasted because mmap(2) allocations must be page-aligned; and the kernel must perform the expensive task of zeroing out memory allocated via mmap(2). Balancing these factors leads to a default setting of 128*1024 for the M_MMAP_THRESHOLD parameter.
The lower limit for this parameter is 0. The upper limit is DEFAULT_MMAP_THRESHOLD_MAX: 512*1024 on 32-bit systems or 4*1024*1024*sizeof(long) on 64-bit systems.
Note: Nowadays, glibc uses a dynamic mmap threshold by default. The initial value of the threshold is 128*1024, but when blocks larger than the current threshold and less than or equal to DEFAULT_MMAP_THRESHOLD_MAX are freed, the threshold is adjusted upward to the size of the freed block. When dynamic mmap thresholding is in effect, the threshold for trimming the heap is also dynamically adjusted to be twice the dynamic mmap threshold. Dynamic adjustment of the mmap threshold is disabled if any of the M_TRIM_THRESHOLD, M_TOP_PAD, M_MMAP_THRESHOLD, or M_MMAP_MAX parameters is set.
For example, if you do:
#include <malloc.h>

/* Force mmap(2) for every allocation by lowering the threshold to 0. */
mallopt(M_MMAP_THRESHOLD, 0);
before calling malloc(), you'll likely see a different limit. Most of these are implementation details, and the C standard says it's undefined behaviour to write into memory that your process doesn't own. So do it at your own risk -- otherwise, demons may fly out of your nose ;-)
malloc allocates memory in large blocks for performance reasons. Subsequent calls to malloc can give you memory from the large block instead of having to ask the operating system for a lot of small blocks. This cuts down on the number of system calls needed.
From this article:
When a process needs memory, some room is created by moving the upper bound of the heap forward, using the brk() or sbrk() system calls. Because a system call is expensive in terms of CPU usage, a better strategy is to call brk() to grab a large chunk of memory and then split it as needed to get smaller chunks. This is exactly what malloc() does. It aggregates a lot of smaller malloc() requests into fewer large brk() calls. Doing so yields a significant performance improvement.
Note that some modern implementations of malloc use mmap instead of brk/sbrk to allocate memory, but otherwise the above is still true.
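A small experiment on Linux makes this visible (glibc assumed; the exact numbers are implementation details): the program break jumps once on the first small malloc, then stays put while later requests are carved out of the same chunk.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *brk0 = sbrk(0);
    void *a = malloc(16);     /* first call: the allocator grabs a large chunk */
    void *brk1 = sbrk(0);
    void *b = malloc(16);     /* second call: served from the same chunk */
    void *brk2 = sbrk(0);

    printf("break grew %ld bytes after the 1st malloc\n",
           (long)((char *)brk1 - (char *)brk0));
    printf("break grew %ld bytes after the 2nd malloc\n",
           (long)((char *)brk2 - (char *)brk1));

    free(a);
    free(b);
    return 0;
}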
I have a question that has been bothering me for the last week.
In the Windows debugger there is the !heap -s command, which outputs the heap status of the virtual memory and calculates external fragmentation using the formula:
External fragmentation = 1 - (largest free block / total free size)
Is there a similar method on Linux that outputs the statistics needed to calculate this effect?
Long story now:
I have a C application that keeps allocating and deallocating space of different sizes, using malloc and free; each allocation has a different lifespan.
The platform I am using is Lubuntu, so the ptmalloc2 algorithm is the default.
I am aware that those allocations are served from the virtual user-space heap (except those larger than 128 KB, for which the allocator uses mmap), and are mapped to physical pages when actually accessed.
The majority of the allocations are of size < 80 bytes, so they are served from fastbins.
Using Valgrind and Massif I can get the internal fragmentation, since it reports the extra bytes used for each allocation.
However, my main concern is how to figure out the external fragmentation.
I am aware of the /proc/[pid]/smaps heap size and the pmap -d [pid] anon statistics, but I find it difficult to interpret them in terms of external fragmentation.
I am also aware of LD_PRELOAD, and I can dynamically preload /lib/i386-linux-gnu/libmemusage.so. This library outputs the heap total, the peak, and the distribution of the requested allocation sizes.
I know that __malloc_hook is deprecated now, and I do not really want to rely on implementation-specific statistics like malloc_stats() and mallinfo(). However, if you have any suggestions using those two, please let me know.
I understand that the external fragmentation problem is when a request cannot be satisfied because there is no contiguous free space in the heap, even though the requested total size is scattered all around that area.
I still have not figured out how to get the statistics needed to calculate this effect; for example, different formulas state that I have to capture the live_memory, get the total_free_pages, or get the size of the largest_free_block.
How can I write a function to "traverse" the heap and gather those statistics?
Thank you everyone in advance.
I believe that this will depend on the allocator you are using. That is, you would probably need a different strategy for whichever malloc (et al.) and free implementation you are using. If the implementation doesn't offer the information you seek as an extension, you will probably have to read its source code and write your own logic to examine the state of allocations.
I believe that the mapping of pages to swap-space and physical RAM is at a lower level and so won't particularly help you in your goal. The malloc (et al) and free implementation might or might not care about those lower-level details.
If you are certain that you are using ptmalloc2, are you able to find its source-code?
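If you do decide to lean on glibc's own introspection despite the caveats you mention, a rough sketch looks like this. mallinfo() only reports totals, while malloc_info() writes an XML snapshot of every arena that you can parse offline to find the largest free chunk:
#include <malloc.h>
#include <stdio.h>

void dump_heap_state(void)
{
    struct mallinfo mi = mallinfo();

    /* Totals only: fordblks is the total free space in the arena and
       keepcost the releasable space at the top of the heap. Neither is
       the largest free block, so these alone cannot give the !heap -s
       style fragmentation ratio. */
    printf("total free bytes:          %d\n", mi.fordblks);
    printf("top-most releasable bytes: %d\n", mi.keepcost);

    /* XML snapshot of all arenas, including free chunks per size class;
       parse it to recover the largest free block. */
    malloc_info(0, stdout);
}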
I have an image compression application that now has two different memory allocation systems. In the original one, malloc is used everywhere, and in the second one, I implemented a simple pool allocator that just allocates a chunk of memory and returns parts of that memory to myalloc() calls.
We've been noticing a huge memory overhead when malloc is used: At the height of its memory usage, the malloc() code requires about 170 megabytes of memory for a 1920x1080x16bpp image, while the pool allocator allocates just 48 megabytes, of which 47 are used by the program.
In terms of memory allocation patterns, the program allocates a lot of 8-byte (most), 32-byte (many) and 1080-byte (some) blocks with the test image. Apart from these, there are no dynamic memory allocations in the code.
The OS of the testing system is Windows 7 (64 Bit).
How did we test memory usage?
With the custom allocator, we could see how much memory is used because all allocation calls are routed through the allocator. With malloc(), in Debug mode we just stepped through the code and watched the memory usage in the Task Manager. In Release mode we did the same, but less fine-grained, because the compiler optimizes a lot of stuff away so we couldn't step through the code piece by piece (the memory difference between Release and Debug was about 20 MB, which I would attribute to optimization and lack of debug information in Release mode).
Could malloc alone be the cause of such a huge overhead? If so, what exactly causes this overhead inside malloc?
On Windows 7 you will always get the low-fragmentation heap allocator, without having to call HeapSetInformation() explicitly to ask for it. That allocator sacrifices virtual memory space to reduce fragmentation. Your program is not actually using 170 megabytes; you are just seeing a bunch of free blocks lying around, waiting for an allocation of a similar size.
This algorithm is very easy to beat with a custom allocator that doesn't do anything to reduce fragmentation, which may well work out for you, although you won't see the side effects of that until you keep the program running longer than a single debug session. You do need to make sure it stays stable for days or weeks if that is the expected usage pattern.
The best thing to do is just not fret about it; 170 MB is rather small potatoes. And do keep in mind that this is virtual memory; it doesn't cost anything.
First of all, malloc aligns the returned pointers to 16-byte boundaries. Furthermore, it stores at least one pointer (or the allocated length) in the addresses preceding the returned value. It probably also adds a magic value or release counter to indicate that the linked list is not broken or that the memory block has not been released twice (free asserts on double frees).
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    /* Two tiny allocations; the distance between them shows the
       per-block bookkeeping and alignment overhead. */
    int *foo = malloc(4);
    int *bar = malloc(4);
    printf("%td\n", (char *)bar - (char *)foo);
    return 0;
}
Return: 32
Caution: When you run your program in Visual Studio or with any debugger attached, the malloc behaviour changes a lot by default: the Low Fragmentation Heap is not used, and the memory overhead may not be representative of real usage (see also https://stackoverflow.com/a/3768820/16673). You need to set the environment variable _NO_DEBUG_HEAP=1 to avoid being hit by this, or measure the memory usage when not running under a debugger.
I have a single-threaded, embedded application that allocates and deallocates lots and lots of small blocks (32-64 bytes). The perfect scenario for a cache-based allocator. And although I could TRY to write one, it would likely be a waste of time, and not as well tested and tuned as a solution that's already been on the front lines.
So what would be the best allocator I could use for this scenario?
Note: I'm using a Lua Virtual Machine in the system (which is the culprit of 80+% of the allocations), so I can't trivially refactor my code to use stack allocations to increase allocation performance.
I'm a bit late to the party, but I just want to share a very efficient memory allocator for embedded systems that I've recently found and tested: https://github.com/dimonomid/umm_malloc
This is a memory management library specifically designed to work with the ARM7; personally I use it on a PIC32 device, but it should work on any 16- or 8-bit device (I have plans to test it on a 16-bit PIC24, but I haven't tested it yet).
I was seriously hurt by fragmentation with the default allocator: my project often allocates blocks of various sizes, from several bytes to several hundred bytes, and sometimes I faced an 'out of memory' error. My PIC32 device has 32 KB of RAM in total, and 8192 bytes are used for the heap. At a particular moment there was more than 5 KB of free memory, but because of fragmentation the default allocator's largest contiguous free block was only about 700 bytes. This is too bad, so I decided to look for a more efficient solution.
I was already aware of some allocators, but all of them had some limitation (such as block sizes having to be a power of 2, starting not from 2 but from, say, 128 bytes), or were just buggy. Every time before, I had to switch back to the default allocator.
But this time I was lucky: I found this one: http://hempeldesigngroup.com/embedded/stories/memorymanager/
When I tried this memory allocator, in exactly the same situation with 5 KB of free memory, the largest free block was more than 3800 bytes! It was so unbelievable to me (compared to 700 bytes) that I performed a hard test: the device worked under heavy load for more than 30 hours. No memory leaks; everything worked as it should.
I also found this allocator in the FreeRTOS repository: http://svnmios.midibox.org/listing.php?repname=svn.mios32&path=%2Ftrunk%2FFreeRTOS%2FSource%2Fportable%2FMemMang%2F&rev=1041&peg=1041# , and this fact is additional evidence of the stability of umm_malloc.
So I completely switched to umm_malloc, and I'm quite happy with it.
I just had to change it a bit: the configuration was a bit buggy when the macro UMM_TEST_MAIN is not defined, so I created the GitHub repository (the link is at the top of this post). Now, user-dependent configuration is stored in a separate file, umm_malloc_cfg.h.
I haven't looked deeply into the algorithms used in this allocator yet, but it has a very detailed explanation of them, so anyone who is interested can look at the top of the file umm_malloc.c. At the very least, the "binning" approach should give a huge benefit in terms of reduced fragmentation: http://g.oswego.edu/dl/html/malloc.html
I believe that anyone who needs an efficient memory allocator for microcontrollers should at least try this one.
In a past project in C I worked on, we went down the road of implementing our own memory management routines for a library that ran on a wide range of platforms, including embedded systems. The library also allocated and freed a large number of small buffers. It ran relatively well and didn't take a large amount of code to implement. I can give you a bit of background on that implementation in case you want to develop something yourself.
The basic implementation included a set of routines that managed buffers of a set size. The routines were used as wrappers around malloc() and free(). We used these routines to manage allocation of structures that we frequently used and also to manage generic buffers of set sizes. A structure was used to describe each type of buffer being managed. When a buffer of a specific type was allocated, we'd malloc() the memory in blocks (if the list of free buffers was empty). I.e., if we were managing 10-byte buffers, we might make a single malloc() that contained space for 100 of these buffers, to reduce fragmentation and the number of underlying mallocs needed.
At the front of each buffer would be a pointer that would be used to chain the buffers in a free list. When the 100 buffers were allocated, each buffer would be chained together in the free list. When the buffer was in use, the pointer would be set to null. We also maintained a list of the "blocks" of buffers, so that we could do a simple cleanup by calling free() on each of the actual malloc'd buffers.
For management of dynamic buffer sizes, we also added a size_t variable at the beginning of each buffer telling the size of the buffer. This was then used to identify which buffer block to put the buffer back into when it was freed. We had replacement routines for malloc() and free() that did pointer arithmetic to get the buffer size and then to put the buffer into the free list. We also had a limit on how large of buffers we managed. Buffers larger than this limit were simply malloc'd and passed to the user. For structures that we managed, we created wrapper routines for allocation and freeing of the specific structures.
Eventually we also evolved the system to include garbage collection when requested by the user to clean up unused memory. Since we had control over the whole system, there were various optimizations we were able to make over time to increase performance of the system. As I mentioned, it did work quite well.
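A condensed sketch of that scheme (all names invented here; the block-of-100 pre-allocation and the cleanup list are omitted for brevity): each buffer is preceded by a size_t header recording its size class, and a freed buffer's first bytes are reused as the free-list link.
#include <stdlib.h>

#define NUM_CLASSES 3
static const size_t class_size[NUM_CLASSES] = { 16, 32, 64 };
static void *free_head[NUM_CLASSES];   /* one free list per size class */

void *buf_alloc(size_t n)
{
    for (size_t c = 0; c < NUM_CLASSES; c++) {
        if (n > class_size[c])
            continue;
        void *buf = free_head[c];
        if (buf != NULL) {
            free_head[c] = *(void **)buf;             /* pop a cached buffer */
        } else {
            size_t *hdr = malloc(sizeof(size_t) + class_size[c]);
            if (hdr == NULL)
                return NULL;
            *hdr = c;                                 /* remember the size class */
            buf = hdr + 1;
        }
        return buf;
    }
    return NULL;   /* larger than the managed sizes; a real version would
                      fall back to plain malloc() and tag the buffer */
}

void buf_free(void *buf)
{
    size_t c = *((size_t *)buf - 1);                  /* read the class header */
    *(void **)buf = free_head[c];                     /* push onto the free list */
    free_head[c] = buf;
}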
I did some research on this very topic recently, as we had an issue with memory fragmentation. In the end we decided to stay with GNU libc's implementation and add some application-level memory pools where necessary. There were other allocators which had better fragmentation behavior, but we weren't comfortable enough with them to replace malloc globally. GNU's has the benefit of a long history behind it.
In your case it seems justified; assuming you can't fix the VM, those tiny allocations are very wasteful. I don't know what your whole environment is, but you might consider wrapping the calls to malloc/realloc/free on just the VM so that you can pass it off to a handler designed for small pools.
Although it's been some time since I asked this, my final solution was to use Loki's SmallObjectAllocator, and it works great. It got rid of all the OS calls and improved the performance of my Lua engine for embedded devices. Very nice and simple, and just about 5 minutes' worth of work!
Since version 5.1, Lua has allowed a custom allocator to be set when creating new states.
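For reference, hooking a custom allocator into Lua looks roughly like this; the lua_Alloc contract is from the Lua reference manual, and swapping the realloc()/free() calls for a pool-backed routine is the obvious next step:
#include "lua.h"
#include "lualib.h"
#include "lauxlib.h"
#include <stdlib.h>

/* Lua routes every allocation, resize and free through this one function. */
static void *l_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
{
    (void)ud; (void)osize;
    if (nsize == 0) {             /* Lua asks us to free the block */
        free(ptr);
        return NULL;
    }
    return realloc(ptr, nsize);   /* replace with your small-object allocator */
}

int start_vm(void)
{
    lua_State *L = lua_newstate(l_alloc, NULL);
    if (L == NULL)
        return -1;
    luaL_openlibs(L);
    /* ... run scripts ... */
    lua_close(L);
    return 0;
}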
I'd also like to add to this even though it's an old thread. In an embedded application, if you can analyze your application's memory usage and come up with a maximum number of allocations of each of the varying sizes, usually the fastest type of allocator is one using memory pools. In our embedded apps we can determine all allocation sizes that will ever be needed at run time. If you can do this, you can completely eliminate heap fragmentation and have very fast allocations. Most of these implementations have an overflow pool which will do a regular malloc for the special cases, which will hopefully be few and far between if you did your analysis right.
I have used the 'binary buddy' system to good effect under VxWorks. Basically, you portion out your heap by cutting blocks in half to get the smallest power-of-two-sized block that will hold your request, and when blocks are freed, you can make a pass up the tree to merge blocks back together to mitigate fragmentation. A Google search should turn up all the info you need.
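The core trick is small: because every block's offset from the start of the managed region is a multiple of its power-of-two size, the buddy to merge with is found by XOR-ing offset and size. An illustrative fragment (not a full allocator):
#include <stdint.h>

/* Offset of the sibling block created when a block of 'block_size'
   (a power of two) at 'offset' was split; check whether it is free
   before merging. */
static inline uintptr_t buddy_of(uintptr_t offset, uintptr_t block_size)
{
    return offset ^ block_size;
}

/* Offset of the merged (parent) block once both buddies are free. */
static inline uintptr_t parent_of(uintptr_t offset, uintptr_t block_size)
{
    return offset & ~block_size;
}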
I am writing a C memory allocator called tinymem that is intended to be able to defragment the heap, and re-use memory. Check it out:
https://github.com/vitiral/tinymem
Note: this project has been discontinued in favour of work on the Rust implementation:
https://github.com/vitiral/defrag-rs
Also, I had not heard of umm_malloc before. Unfortunately, it doesn't seem to be able to deal with fragmentation, but it definitely looks useful. I will have to check it out.