How to get heap memory usage, FreeRTOS [duplicate]

How to get heap memory usage, FreeRTOS [duplicate] - c

I'm creating a list of elements inside a task in the following way:
l = (dllist*)pvPortMalloc(sizeof(dllist));
dllist is 32 byte big.
My embedded system has 60kB SRAM so I expected my 200 element list can be handled easily by the system. I found out that after allocating space for 8 elements the system is crashing on the 9th malloc function call (256byte+).
If possible, where can I change the heap size inside freeRTOS?
Can I somehow request the current status of heap size?
I couldn't find this information in the documentation so I hope somebody can provide some insight in this matter.
Thanks in advance!

(Yes - FreeRTOS pvPortMalloc() returns void*.)
If you have 60K of SRAM, and configTOTAL_HEAP_SIZE is large, then it is unlikely you are going to run out of heap after allocating 256 bytes unless you had hardly any heap remaining before hand. Many FreeRTOS demos will just keep creating objects until all the heap is used, so if your application is based on one of those, then you would be low on heap before your code executed. You may have also done something like use up loads of heap space by creating tasks with huge stacks.
heap_4 and heap_5 will combine adjacent blocks, which will minimise fragmentation as far as practical, but I don't think that will be your problem - especially as you don't mention freeing anything anywhere.
Unless you are using heap_3.c (which just makes the standard C library malloc and free thread safe) you can call xPortGetFreeHeapSize() to see how much free heap you have. You may also have xPortGetMinimumEverFreeHeapSize() available to query how close you have ever come to running out of heap. More information: http://www.freertos.org/a00111.html
You could also define a malloc() failed hook (http://www.freertos.org/a00016.html) to get instant notification of pvPortMalloc() returning NULL.

For the standard allocators you will find a config option in FreeRTOSConfig.h .
However:
It is very well possible you run out of memory already, depending on the allocator used. IIRC there is one that does not free() any blocks (free() is just a dummy). So any block returned will be lost. This is still useful if you only allocate memory e.g. at startup, but then work with what you've got.
Other allocators might just not merge adjacent blocks once returned, increasing fragmentation much faster than a full-grown allocator.
Also, you might loose memory to fragmentation. Depending on your alloc/free pattern, you quickly might end up with a heap looking like swiss cheese: Many holes between allocated blocks. So while there is still enough free memory, no single block is big enough for the size required.
If you only allocate blocks that size there, you might be better of using your own allocator or a pool (blocks of fixed size). Thaqt would be statically allocated (e.g. array) and chained as a linked list during startup. Alloc/free would then just be push/pop on a stack (or put/get on a queue). That would also be very fast and have complexity O(1) (interrupt-safe if properly written).
Note that normal malloc()/free() are not interrupt-safe.
Finally: Do not cast void *. (Well, that's actually what standard malloc() returns and I expect that FreeRTOS-variant does the same).

Related

Reduce malloc calls by slicing one big malloc'd memory

First, here is where I got the idea from:
There was once an app I wrote that used lots of little blobs of
memory, each allocated with malloc(). It worked correctly but was
slow. I replaced the many calls to malloc with just one, and then
sliced up that large block within my app. It was much much faster.
I was profiling my application, and I got a unexpectedly nice performance boost when I reduced the number of malloc calls. I am still allocating the same amount of memory, though.
So, I would like to do what this guy did, but I am unsure what's the best way to do it.
My Idea:
// static global variables
static void * memoryForStruct1 = malloc(sizeof(Struct1) * 10000);
int struct1Index = 0;
...
// somewhere, I need memory, fast:
Struct1* data = memoryForStruct1[struct1Index++];
...
// done with data:
--struct1Index;
Gotchas:
I have to make sure I don't exceed 10000
I have to release the memory in the same order I occupied. (Not a major issue in my case, since I am using recursion, but I would like to avoid it if possible).
Inspired from Mihai Maruseac:
First, I create a linked list of int that basically tells me which memory indexes are free. I then added a property to my struct called int memoryIndex which helps me return the memory occupied in any order. And Luckily, I am sure my memory needs would never exceed 5 MB at any given time, so I can safely allocate that much memory. Solved.

The system call which gives you memory is brk. The usual malloc and calloc, realloc functions simply use the space given by brk. When that space is not enough, another brk is made to create new space. Usually, the space is increased in sizes of a virtual memory page.
Thus, if you really want to have a premade pool of objects, then make sure to allocate memory in multiples of pagesize. For example, you can create one pool of 4KB. 8KB, ... space.
Next idea, look at your objects. Some of them have one size, some have other size. It will be a big pain to handle allocations for all of them from the same pool. Create pools for objects of various sizes (powers of 2 is best) and allocate from them. For example, if you'll have an object of size 34B you'd allocate space for it from the 64B pool.
Lastly, the remaining space can be either left unused or it can be moved down to the other pools. In the above example, you have 30B left. You'd split it in 16B, 8B, 4B and 2B chunks and add each chunk to their respective pool.
Thus, you'd use linked lists to manage the preallocated space. Which means that your application will use more memory than it actually needs but if this really helps you, why not?
Basically, what I've described is a mix between buddy allocator and slab allocator from the Linux kernel.
Edit: After reading your comments, it will be pretty easy to allocate a big area with malloc(BIG_SPACE) and use this as a pool for your memory.

If you can, look at using glib which has memory slicing API that supports this. It's very easy to use, and saves you from having to re-implement it.

making your own malloc function?

I read that some games rewrite their own malloc to be more efficient. I don't understand how this is possible in a virtual memory world. If I recall correctly, malloc actually calls an OS specific function, which maps the virtual address to a real address with the MMU. So then how can someone make their own memory allocator and allocate real memory, without calling the actual runtime's malloc?
Thanks

It's certainly possible to write an allocator more efficient than a general purpose one.
If you know the properties of your allocations, you can blow general purpose allocators out of the water.
Case in point: many years ago, we had to design and code up a communication subsystem (HDLC, X.25 and proprietary layers) for embedded systems. The fact that we knew the maximum allocation would always be less than 128 bytes (or something like that) meant that we didn't have to mess around with variable sized blocks at all. Every allocation was for 128 bytes no matter how much you asked for.
Of course, if you asked for more, it returned NULL.
By using fixed-length blocks, we were able to speed up allocations and de-allocations greatly, using bitmaps and associated structures to hold accounting information rather than relying on slower linked lists. In addition, the need to coalesce freed blocks was not needed.
Granted, this was a special case but you'll find that's so for games as well. In fact, we've even used this in a general purpose system where allocations below a certain threshold got a fixed amount of memory from a self-managed pre-allocated pool done the same way. Any other allocations (larger than the threshold or if the pool was fully allocated) were sent through to the "real" malloc.

Just because malloc() is a standard C function doesn't mean that it's the lowest level access you have to the memory system. In fact, malloc() is probably implemented in terms of lower-level operating system functionality. That means you could call those lower level interfaces too. They might be OS-specific, but they might allow you better performance than you would get from the malloc() interface. If that were the case, you could implement your own memory allocation system any way you want, and maybe be even more efficient about it - optimizing the algorithm for the characteristics of the size and frequency of allocations you're going to make, for example.

In general, malloc will call an OS-specific function to obtain a bunch of memory (at least one VM page), and will then divide that memory up into smaller chunks as needed to return to the caller of malloc.
The malloc library will also have a list (or lists) of free blocks, so it can often meet a request without asking the OS for more memory. Determining how many different block sizes to handle, deciding whether to attempt to combine adjacent free blocks, and so forth, are the choices the malloc library implementor has to make.
It's possible for you to bypass the malloc library and directly invoke the OS-level "give me some memory" function and do your own allocation/freeing within the memory you get from the OS. Such implementations are likely to be OS-specific. Another alternative is to use malloc for initial allocations, but maintain your own cache of freed objects.

One thing you can do is have your allocator allocate a pool of memory, then service requests from than (and allocate a bigger pool if it runs out). I'm not sure if that's what they're doing though.

If I recall correctly, malloc actually
calls an OS specific function
Not quite. Most hardware has a 4KB page size. Operating systems generally don't expose a memory allocation interface offering anything smaller than page-sized (and page-aligned) chunks.
malloc spends most of its time managing the virtual memory space that has already been allocated, and only occasionally requests more memory from the OS (obviously this depends on the size of the items you allocate and how often you free).
There is a common misconception that when you free something it is immediately returned to the operating system. While this sometimes occurs (particularly for larger memory blocks) it is generally the case that freed memory remains allocated to the process and can then be re-used by later mallocs.
So most of the work is in bookkeeping of already-allocated virtual space. Allocation strategies can have many aims, such as fast operation, low memory wastage, good locality, space for dynamic growth (e.g. realloc) and so on.
If you know more about your pattern of memory allocation and release, you can optimise malloc and free for your usage patterns or provide a more extensive interface.
For instance, you may be allocating lots of equal-sized objects, which may change the optimal allocation parameters. Or you may always free large amounts of objects at once, in which case you don't want free to be doing fancy things.
Have a look at memory pools and obstacks.

See How do games like GTA IV not fragment the heap?.

What happens when there is a request for memory block which is not a power of 2?

Suppose we do a malloc request for memory block of size n where 2 ^k !=n for k>0.
Malloc returns us space for that requestted memory block but how is the remainig buffer handled from the page. I read Pages are generally blocks of memory which are powers of two.
Wiki states the following:
Like any method of memory allocation, the heap will become fragmented; that is,
there will be sections of used and unused memory in the allocated
space on the heap. A good allocator will attempt to find an unused area
of already allocated memory to use before resorting to expanding the heap.
So my question is how is this tracked?
EDIT: How is the unused memory tracked when using malloc ?

This really depends on the specific implementation, as Morten Siebuhr pointed out already. In very simple cases, there might be a list of free, fixed-size blocks of memory (possibly all having the same size), so the unused memory is simply wasted. Note that real implementations will never use such simplistic algorithms.
This is an overview over some simple possibilities: http://www.osdcom.info/content/view/31/39/
This Wikipedia entry has several interesting links, including the one above: http://en.wikipedia.org/wiki/Dynamic_memory_allocation#Implementations
As a final remark, googling "malloc implementation" turns up a heap (pun intended) of valuable links.

A standard BSD-style memory allocator basically works like this:
It keeps a linked list of pre-allocated memory blocks for sizes 2^k for k<=12 (for example).
In reality, each list for a given k is composed of memory-blocks from different areas, see below.
A malloc request for n bytes is serviced by calculating n', the closest 2^k >= n, then looking up the first area in the list for k, and then returning the first free block in the free-list for the given area.
When there is no pre-allocated memory block for size 2^k, an area is allocated, an area being some larger piece of continuous memory, say a 4kB piece of memory. This piece of memory is then chopped up into pieces that are 2^k bytes. At the beginning of the continuous memory area there is book-keeping information such as where to find the linked list of free blocks within the area. A bitmap can also be used, but a linked list typically has better cache behavior (you want the next allocated block to return memory that is already in the cache).
The reason for using areas is that free(ptr) can be implemented efficiently. ptr & 0xfffff000 in this example points to the beginning of the area which contains the book-keeping structures and makes it possible to link the memory block back into the area.
The BSD allocator will waste space by always returning a memory block 2^k in size, but it can reuse the memory of the block to keep the free-list, which is a nice property. Also allocation is blazingly fast.
Modifications to the above general idea include:
Using anonymous mmap for large allocations. This shifts the work over to the kernel for handling large mallocs and avoids wasting a lot of memory in these cases.
The GNU version of malloc have special cases for non-power-of-two buckets. There is nothing inherent in the BSD allocator that requires returning 2^k memory blocks, only that there are pre-defined bucket sizes. The GNU allocator has more buckets and thus waste less space.
Sharing memory between threads is a tricky subject. Lock-contention during allocation is an important consideration, so in the GNU allocator for example will eagerly create extra areas for different threads for a given bucket size if it ever encounters lock-contention during allocation.

This varies a lot from implementation to implementation. Some waste the space, some sub-divide pages until they get the requested size (or close to it) &c.
If you are asking out of curiosity, I suggest you read the source code for the implementation in question,
If it's because of performance worries, try to benchmark it and see what happens.

Minimizing the amount of malloc() calls improves performance?

Consider two applications: one (num. 1) that invokes malloc() many times, and the other (num. 2) that invokes malloc() few times.
Both applications allocate the same amount of memory (assume 100MB).
For which application the next malloc() call will be faster, #1 or #2?
In other words: Does malloc() have an index of allocated locations in memory?

You asked 2 questions:
for which application the next malloc() call will be faster, #1 or #2?
In other words: Does malloc() have an index of allocated locations in memory?
You've implied that they are the same question, but they are not. The answer to the latter question is YES.
As for which will be faster, it is impossible to say. It depends on the allocator algorithm, the machine state, the fragmentation in the current process, and so on.
Your idea is sound, though: you should think about how malloc usage will affect performance.
There was once an app I wrote that used lots of little blobs of memory, each allocated with malloc(). It worked correctly but was slow. I replaced the many calls to malloc with just one, and then sliced up that large block within my app. It was much much faster.
I don't recommend this approach; it's just an illustration of the point that malloc usage can materially affect performance.
My advice is to measure it.

Of course this completely depends on the malloc implementation, but in this case, with no calls to free, most malloc implementations will probably give you the same algorithmic speed.
As another answer commented, usually there will be a list of free blocks, but if you have not called free, there will just be one, so it should be O(1) in both cases.
This assumes that the memory allocated for the heap is big enough in both cases. In case #1, you will have allocated more total memory, as each allocation involves memory overhead to store meta-data, as a result you may need to call sbrk(), or equivalent to grow the heap in case #1, which would add an additional overhead.
They will probably be different due to cache and other second order effects, since the memory alignments for the new allocation won't be the same.
If you have been freeing some of the memory blocks, then it is likely that #2 will be faster due to less fragmentation, and so a smaller list of free blocks to search.
If you have freed all the memory blocks, it should end up being exactly the same, since any sane free implementation will have coalesced the blocks back into a single arena of memory.

Malloc has to run through a linked list of free blocks to find one to allocate. This takes time. So, #1 will usually be slower:
The more often you call malloc, the more time it will take - so reducing the number of calls will give you a speed improvement (though whether it is significant will depend on your exact circumstances).
In addition, if you malloc many small blocks, then as you free those blocks, you will fragment the heap much more than if you only allocate and free a few large blocks. So you are likely to end up with many small free blocks on your heap rather than a few big blocks, and therefore your mallocs may have to search further through the free-space lists to find a suitable block to allocate. WHich again will make them slower.

These are of course implementation details, but typically free() will insert the memory into a list of free blocks. malloc() will then look at this list for a free block that is the right size, or larger. Typically, only if this fails does malloc() ask the kernel for more memory.
There are also other considerations, such as when to coalesce multiple adjacent blocks into a single, larger block.
And, another reason that malloc() is expensive: If malloc() is called from multiple threads, there must be some kind of synchronization on these global structures. (i.e. locks.) There exist malloc() implementations with different optimization schemes to make it better for multple threads, but generally, keeping it multi-thread safe adds to the cost, as multiple threads will contend for those locks and block progress on each other.

You can always do a better job using malloc() to allocate a large chunk of memory and sub-dividing it yourself. Malloc() was optimized to work well in the general case and makes no assumptions whether or not you use threads or what the size of the program's allocations might be.
Whether it is a good idea to implement your own sub-allocator is a secondary question. It rarely is, explicit memory management is already hard enough. You rarely need another layer of code that can screw up and crash your program without any good way to debug it. Unless you are writing a debug allocator.

The answer is that it depends, most of the potential slowness rather comes from malloc() and free() in combination and usually #1 and #2 will be of similar speed.
All malloc() implementations do have an indexing mechanism, but the speed of adding a new block to the index is usually not dependant on the number of blocks already in the index.
Most of the slowness of malloc comes from two sources
searching for a suitable free block among the previously freed(blocks)
multi-processor problems with locking
Writing my own almost standards compliant malloc() replacement tool malloc() && free() times from 35% to 3-4%, and it seriously optimised those two factors. It would likely have been a similar speed to use some other high-performance malloc, but having our own was more portable to esoteric devices and of course allows free to be inlined in some places.

You don't define the relative difference between "many" and "few" but I suspect most mallocs would function almost identically in both scenarios. The question implies that each call to malloc has as much overhead as a system call and page table updates. When you do a malloc call, e.g. malloc(14), in a non-brain-dead environment, malloc will actually allocate more memory than you ask for, often a multiple of the system MMU page size. You get your 14 bytes and malloc keeps track of the newly allocated area so that later calls can just return a chunk of the already allocated memory, until more memory needs to be requested from the OS.
In other words, if I call malloc(14) 100 times or malloc(1400) once, the overhead will be about the same. I'll just have to manage the bigger allocated memory chunk myself.

Allocating one block of memory is faster than allocating many blocks. There is the overhead of the system call and also searching for available blocks. In programming reducing the number of operations usually speeds up the execution time.
Memory allocators may have to search to find a block of memory that is the correct size. This adds to the overhead of the execution time.
However, there may be better chances of success when allocating small blocks of memory versus one large block. Is your program allocating one small block and releasing it or does it need to allocate (and preserve) small blocks. When memory becomes fragmented, there are less big chunks available, so the memory allocator may have to coalesce all the blocks to form a block big enough for the allocation.
If your program is allocating and destroying many small blocks of memory you may want to consider allocating a static array and using that for your memory.

Can I write a C application without using the heap?

I'm experiencing what appears to be a stack/heap collision in an embedded environment (see this question for some background).
I'd like to try rewriting the code so that it doesn't allocate memory on the heap.
Can I write an application without using the heap in C? For example, how would I use the stack only if I have a need for dynamic memory allocation?

I did it once in an embedded environment where we were writing "super safe" code for biomedical machines.
Malloc()s were explicitly forbidden, partly for the resources limits and for the unexpected behavior you can get from dynamic memory (look for malloc(), VxWorks/Tornado and fragmentation and you'll have a good example).
Anyway, the solution was to plan in advance the needed resources and statically allocate the "dynamic" ones in a vector contained in a separate module, having some kind of special purpose allocator give and take back pointers. This approach avoided fragmentation issues altogether and helped getting finer grained error info, if a resource was exhausted.
This may sound silly on big iron, but on embedded systems, and particularly on safety critical ones, it's better to have a very good understanding of which -time and space- resources are needed beforehand, if only for the purpose of sizing the hardware.

Funnily enough, I once saw a database application which completly relied on static allocated memory. This application had a strong restriction on field and record lengths. Even the embedded text editor (I still shiver calling it that) was unable to create texts with more than 250 lines of text. That solved some question I had at this time: why are only 40 records allowed per client?
In serious applications you can not calculate in advance the memory requirements of your running system. Therefore it is a good idea to allocate memory dynamically as you need it. Nevertheless it is common case in embedded systems to preallocate memory you really need to prevent unexpected failures due to memory shortage.
You might allocate dynamic memory on the stack using the alloca() library calls. But this memory is tight to the execution context of the application and it is a bad idea to return memory of this type the caller, because it will be overwritten by later subroutine calls.
So I might answer your question with a crisp and clear "it depends"...

You can use alloca() function that allocates memory on the stack - this memory will be freed automatically when you exit the function. alloca() is GNU-specific, you use GCC so it must be available.
See man alloca.
Another option is to use variable-length arrays, but you need to use C99 mode.

It's possible to allocate a large amount of memory from the stack in main() and have your code sub-allocate it later on. It's a silly thing to do since it means your program is taking up memory that it doesn't actually need.
I can think of no reason (save some kind of silly programming challenge or learning exercise) for wanting to avoid the heap. If you've "heard" that heap allocation is slow and stack allocation is fast, it's simply because the heap involves dynamic allocation. If you were to dynamically allocate memory from a reserved block within the stack, it would be just as slow.
Stack allocation is easy and fast because you may only deallocate the "youngest" item on the stack. It works for local variables. It doesn't work for dynamic data structures.
Edit: Having seen the motivation for the question...
Firstly, the heap and the stack have to compete for the same amount of available space. Generally, they grow towards each other. This means that if you move all your heap usage into the stack somehow, then rather than stack colliding with heap, the stack size will just exceed the amount of RAM you have available.
I think you just need to watch your heap and stack usage (you can grab pointers to local variables to get an idea of where the stack is at the moment) and if it's too high, reduce it. If you have lots of small dynamically-allocated objects, remember that each allocation has some memory overhead, so sub-allocating them from a pool can help cut down on memory requirements. If you use recursion anywhere think about replacing it with an array-based solution.

You can't do dynamic memory allocation in C without using heap memory. It would be pretty hard to write a real world application without using Heap. At least, I can't think of a way to do this.
BTW, Why do you want to avoid heap? What's so wrong with it?

1: Yes you can - if you don't need dynamic memory allocation, but it could have a horrible performance, depending on your app. (i.e. not using the heap won't give you better apps)
2: No I don't think you can allocate memory dynamically on the stack, since that part is managed by the compiler.

Yes, it's doable. Shift your dynamic needs out of memory and onto disk (or whatever mass storage you have available) -- and suffer the consequent performance penalty.
E.g., You need to build and reference a binary tree of unknown size. Specify a record layout describing a node of the tree, where pointers to other nodes are actually record numbers in your tree file. Write routines that let you add to the tree by writing an additional record to file, and walk the tree by reading a record, finding its child as another record number, reading that record, etc.
This technique allocates space dynamically, but it's disk space, not RAM space. All the routines involved can be written using statically allocated space -- on the stack.

Embedded applications need to be careful with memory allocations but I don't think using the stack or your own pre-allocated heap is the answer. If possible, allocate all required memory (usually buffers and large data structures) at initialization time from a heap. This requires a different style of program than most of us are used to now but it's the best way to get close to deterministic behavior.
A large heap that is sub-allocated later would still be subject to running out of memory and the only thing to do then is have a watchdog kick in (or similar action). Using the stack sounds appealing but if you're going to allocate large buffers/data structures on the stack you have to be sure that the stack is large enough to handle all possible code paths that your program could execute. This is not easy and in the end is similar to a sub-allocated heap.

My foremost concern is, does abolishing the heap really helps?
Since your wish of not using heap stems from stack/heap collision, assuming the start of stack and start of heap are set properly (e.g. in the same setting, small sample programs have no such collision problem), then the collision means the hardware has not enough memory for your program.
Not using heap, one may indeed save some waste space from heap fragmentation; but if your program does not use the heap for a bunch of irregular large size allocation, the waste there are probably not much. I will see your collision problem more of an out of memory problem, something not fixable by merely avoiding heap.
My advices on tackling this case:
Calculate the total potential memory usage of your program. If it is too close to but not yet exceeding the amount of memory you prepared for the hardware, then you may
Try using less memory (improve the algorithms) or using the memory more efficiently (e.g. smaller and more-regular-sized malloc() to reduce heap fragmentation); or
Simply buy more memory for the hardware
Of course you may try pushing everything into pre-defined static memory space, but it is very probable that it will be stack overwriting into static memory this time. So improve the algorithm to be less memory-consuming first and buy more memory the second.

I'd attack this problem in a different way - if you think the the stack and heap are colliding, then test this by guarding against it.
For example (assuming a *ix system) try mprotect()ing the last stack page (assuming a fixed size stack) so it is not accessible. Or - if your stack grows - then mmap a page in the middle of the stack and heap. If you get a segv on your guard page you know you've run off the end of the stack or heap; and by looking at the address of the seg fault you can see which of the stack & heap collided.

It is often possible to write your embedded application without using dynamic memory allocation. In many embedded applications the use of dynamic allocation is deprecated because of the problems that can arise due to heap fragmentation. Over time it becomes highly likely that there will not be a suitably sized region of free heap space to allow the memory to be allocated and unless there is a scheme in place to handle this error the application will crash. There are various schemes to get around this, one being to always allocate fixed size objects on the heap so that a new allocation will always fit into a freed memory area. Another to detect the allocation failure and to perform a defragmentation process on all of the objects on the heap (left as an exercise for the reader!)
You do not say what processor or toolset you are using but in many the static, heap and stack are allocated to separate defined segments in the linker. If this is the case then it must be that your stack is growing outside the memory space that you have defined for it. The solution that you require is to reduce the heap and/or static variable size (assuming that these two are contiguous) so that there is more available for the stack. It may be possible to reduce the heap unilaterally although this can increase the probability of fragmentation problems. Ensuring that there are no unnecessary static variables will free some space at the cost of possibly increasing the stack usage if the variable is made auto.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight