Use of malloc in a real-time application - C

We had an issue with one of our real-time applications. The idea was to run one of the threads every 2 ms (500 Hz). After the application had run for half an hour or so, we noticed that the thread was falling behind.
After a few discussions, people suspect that the malloc allocations in the real-time thread are the root cause.
I am wondering: is it always a good idea to avoid all dynamic memory allocation in real-time threads?
The internet has very few resources on this. If you can point to some discussion, that would be great too.
Thanks

First step is to profile the code and make sure you understand exactly where the bottleneck is. People are often bad at guessing bottlenecks in code, and you might be surprised by the findings. You can simply instrument several parts of this routine yourself and dump min/avg/max durations at regular intervals. You want to see the worst case (max), and whether the average duration increases as time goes by.
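For example, a minimal sketch of that kind of instrumentation using clock_gettime(CLOCK_MONOTONIC) (names are illustrative; note that the printf itself does not belong in the time-critical path - in a real system you would hand the stats to a lower-priority thread):

#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* In the 500 Hz thread, wrap the suspected section like this: */
void thread_iteration(void) {
    static uint64_t min = UINT64_MAX, max, sum, count;

    uint64_t start = now_ns();
    /* ... the work you suspect of being slow ... */
    uint64_t dur = now_ns() - start;

    if (dur < min) min = dur;
    if (dur > max) max = dur;
    sum += dur;

    if (++count % 10000 == 0)   /* dump stats at regular intervals */
        printf("min=%llu avg=%llu max=%llu ns\n",
               (unsigned long long)min,
               (unsigned long long)(sum / count),
               (unsigned long long)max);
}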
I doubt that malloc will take any significant portion of these 2ms on a reasonable microcontroller capable of running Linux; I'd say it's more likely you would run out of memory due to fragmentation, than having performance issues. If you have any other syscalls in your function, they will easily take an order of magnitude more than malloc.
But if malloc is really the problem, depending on how short-lived your objects are, how much memory you can afford to waste, and how much your requirements are known in advance, there are several approaches you can take:
General purpose allocation (malloc from your standard library, or any third party implementation): best approach if you have "more than enough" RAM, many short-lived objects, and no strict latency requirements
PROS: works for any object size out of the box, familiar interface, memory is shared dynamically, no need to "plan ahead" if memory is not an issue
CONS: slight performance penalty during allocation and/or deallocation, memory fragmentation issues when doing lots of allocations/deallocations of objects of different sizes, whether a run-time allocation will fail is less deterministic and cannot be easily mitigated at runtime
Memory pool: best approach in most cases where memory is limited, low latency is required, and the object needs to live longer than a single block scope (a minimal sketch follows this list)
PROS: allocation/deallocation time is guaranteed to be O(1) in any reasonable implementation, does not suffer from fragmentation, easier to plan its size in advance, failure to allocate at run-time is (likely) easier to mitigate
CONS: works only for a single specific object size - memory is not shared with other parts of the program - and requires planning the pool size in advance or risking wasted memory
Stack based (automatic-duration) objects: best for smaller, short-lived objects (single block scope)
PROS: allocation and deallocation is done automatically, allows having optimum usage of RAM for the object's lifetime, there are tools which can sometimes do a static analysis of your code and estimate the stack size
CONS: objects limited to a single block scope - cannot share objects between interrupt invocations
Individual statically allocated objects: best approach for long lived objects
PROS: no allocation whatsoever - all needed objects exist throughout the application life-cycle, no problems with allocation/deallocation
CONS: wastes memory if the objects should be short-lived
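For illustration, a minimal sketch of such a fixed-size pool - a free list threaded through a static array. Sizes and names are purely illustrative, and it is not thread-safe as written; a real one would add whatever locking and sanity checks your system needs:

#include <stddef.h>

#define POOL_ENTRIES 32
#define ENTRY_SIZE   128   /* one specific object size */

typedef union pool_entry {
    union pool_entry *next;          /* used while the entry is free */
    unsigned char data[ENTRY_SIZE];  /* used while the entry is allocated */
} pool_entry;

static pool_entry pool[POOL_ENTRIES];
static pool_entry *free_list;

void pool_init(void) {
    for (int i = 0; i < POOL_ENTRIES - 1; i++)
        pool[i].next = &pool[i + 1];
    pool[POOL_ENTRIES - 1].next = NULL;
    free_list = &pool[0];
}

void *pool_alloc(void) {             /* O(1): pop the head of the free list */
    pool_entry *e = free_list;
    if (e)
        free_list = e->next;
    return e;                        /* NULL when the pool is exhausted */
}

void pool_free(void *p) {            /* O(1): push back onto the free list */
    pool_entry *e = p;
    e->next = free_list;
    free_list = e;
}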
Even if you decide to go for memory pools all over the program, make sure you add profiling/instrumentation to your code. And then leave it there forever to see how the performance changes over time.

As a realtime software engineer in the aerospace industry, I see this question a lot. Even among our own engineers, we see attempts to use non-realtime programming techniques learned elsewhere, or to pull open-source code into realtime programs. Never allocate from the heap during realtime execution. One of our engineers created a tool that intercepts malloc and records the overhead. You can see in the numbers that you cannot predict when an allocation attempt will take a long time. Even on very high-end computers (72-core, 256 GB RAM servers) running a realtime hybrid of Linux, we record mallocs taking hundreds of milliseconds. A malloc can end up in a system call, which is cross-ring and therefore high overhead, and you don't know when you will get hit by garbage collection, or when it decides it must request another large chunk of memory for the task from the OS.

Related

Looking for a custom memory allocator which allocates from within a large pre-allocated block of memory

I have a memory-heavy application which is supposed to run with low latency and at constant speed, but in practice it has poor performance during the first few seconds of startup. This appears to be because the initial memory accesses trigger page faults, which have significant performance implications.
I would like to try preallocating a single large block of memory, paging it all in (via mlock() or just by touching each byte), and then using a custom malloc()/free() implementation to ensure that all further allocations are done from within this block.
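Something along these lines is what I have in mind for the pre-faulting part (just a sketch, error handling omitted):

#include <sys/mman.h>
#include <string.h>
#include <stdlib.h>

/* Reserve the block up front and fault every page in before the
   latency-sensitive work starts. */
void *reserve_block(size_t size) {
    void *block = malloc(size);
    if (!block)
        return NULL;
    if (mlock(block, size) != 0)      /* pin it; may need privileges */
        memset(block, 0, size);       /* fall back to touching every byte */
    return block;
}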
I am aware of numerous custom memory allocators (TCMalloc, Hoard, jemalloc, etc) but it is not clear to me whether they can be backed by user-provided memory, or whether they always perform their internal allocations from the OS. Does anyone have any insight or recommendations here?
To be clear, I am not looking for a memory pooling system (which would be for reusing small objects). The custom implementation of malloc()/free() should be able to perform any size allocation while limiting fragmentation of its backing store and following other best practices.
Edit based on comments: I do not expect to make the system faster - I just want to move the slow part (allocation, initial page faults) to the start of the process, and then do the real computation work once the system is 'primed'.
Thanks!
A bit late to the party.
dlmalloc is one choice that can be backed by pre-allocated memory. You can find it here. You may just need to add some extra definitions at the beginning to force it to use your pre-allocated memory rather than calling the system mmap; you can refer to the nice documentation at the beginning of the file.
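Very roughly, the idea is to point dlmalloc's MORECORE hook at your own block and disable mmap. This is only a sketch of that idea; check the option documentation at the top of malloc.c for the exact macro names and the expected MORECORE contract (trimming, prototypes, etc.):

/* Compile dlmalloc with something like:
   gcc -DHAVE_MMAP=0 -DHAVE_MORECORE=1 -DMORECORE=my_morecore -c malloc.c */

#include <stddef.h>

#define BLOCK_SIZE (256 * 1024 * 1024)   /* your big pre-allocated block */
static char   big_block[BLOCK_SIZE];
static size_t used;

/* sbrk-style hook: dlmalloc asks for more core, we hand out slices of
   the static block and return (void *)-1 when it is exhausted. */
void *my_morecore(ptrdiff_t increment) {
    if (increment < 0 || used + (size_t)increment > BLOCK_SIZE)
        return (void *)-1;
    void *p = big_block + used;
    used += (size_t)increment;
    return p;
}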

Can memset be parallelized on 4 cores?

I am not sure about this. Can I split a large memset (for example 10 MB) across four cores to gain a speedup?
Is such RAM-level parallelization possible at all, and how big is the time cost of firing up the other threads - more than a millisecond, or less?
You are asking the right question, but it is difficult to give a simple answer. There are several aspects involved:
Overhead of starting new threads (or picking them from some cache);
Contention on the memory bus.
These aspects differ and have very different costs on different platforms.
Bigger PCs have several memory buses; smaller ones have only one. On a single-bus system this does not make any sense. If your system has several memory buses (channels), your array of data may be split arbitrarily between memory banks. If the whole array happens to sit in the same memory bank, the parallelization will be useless. Figuring out the layout of your array is yet more overhead. In other words, before splitting the operation between cores, it is necessary to figure out whether it is worth doing at all.
The simple answer is that these hard-to-predict overheads will most likely consume the benefit and make the overall result worse.
At the same time, for a really huge memory area, it can make sense on some architectures.
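If you want to measure it on your own platform anyway, a minimal sketch of the split using POSIX threads could look like this (purely illustrative; the thread-creation overhead is part of what you would be measuring):

#include <pthread.h>
#include <string.h>

#define NTHREADS 4

struct chunk { void *dst; size_t len; };

static void *memset_worker(void *arg) {
    struct chunk *c = arg;
    memset(c->dst, 0, c->len);   /* each thread clears its own slice */
    return NULL;
}

void parallel_memset0(void *dst, size_t len) {
    pthread_t tid[NTHREADS];
    struct chunk chunks[NTHREADS];
    size_t slice = len / NTHREADS;

    for (int i = 0; i < NTHREADS; i++) {
        chunks[i].dst = (char *)dst + i * slice;
        chunks[i].len = (i == NTHREADS - 1) ? len - i * slice : slice;
        pthread_create(&tid[i], NULL, memset_worker, &chunks[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
}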

Does any operating system implement buffering for malloc()?

A lot of calloc()/malloc() calls in a for/while/do loop can consume a lot of time, so I am curious whether any operating system buffers memory for fast mallocs.
I have been pondering whether I could speed up mallocs by writing a "greedy" wrapper for malloc. E.g. when I ask for 1 MB of memory, the initial allocation would grab 10 MB, and on the 2nd, 3rd, 4th, etc. calls the wrapper would simply return memory from the chunk that was first allocated the "normal" way. Of course, if there is not enough memory available, you would need to allocate a new greedy chunk of memory.
Somehow I think someone must have done this, or something similar, before. So my question is simply: is this something that will speed up the memory allocation process significantly? (Yes, I could have tried it before asking the question, but I am just too lazy to write such a thing if there is no need to do it.)
All versions of malloc() do buffering of the type you describe to some extent - they will grab a bigger chunk than the current request and use the big chunk to satisfy multiple requests, but only up to some request size. This means that multiple requests for 16 bytes at a time will only require more memory from the o/s once every 50-100 calls, or something along those general lines.
What is less clear is what the boundary size is. It might well be that they allocate some relatively small multiple of 4 KiB at a time. Bigger requests - MiB size requests - will go back to the system for more memory every time that the request cannot be satisfied from what is in the free list. That threshold is typically considerably smaller than 1 MiB, though.
Some versions of malloc() allow you to tune their allocation characteristics, to greater or lesser extents. It has been a fertile area of research - lots of different systems. See Knuth 'The Art of Computer Programming' Vol 1 (Fundamental Algorithms) for one set of discussions.
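On glibc, for instance, mallopt() exposes a couple of these knobs; the values below are purely illustrative:

#include <malloc.h>

int main(void) {
    /* Serve requests below 256 KiB from the main heap instead of mmap() */
    mallopt(M_MMAP_THRESHOLD, 256 * 1024);
    /* Only return memory to the OS when more than 1 MiB is free at the top */
    mallopt(M_TRIM_THRESHOLD, 1024 * 1024);
    /* ... rest of the program ... */
    return 0;
}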
As I was browsing the Google Chrome code some time ago, I found http://www.canonware.com/jemalloc/.
It is a free, general-purpose and scalable malloc implementation.
Apparently, it is being used in numerous projects, as it usually outperforms standard malloc implementations in many real-world scenarios (many small allocations instead of a few big ones).
Definitely worth a look!
That technique is called a slab allocator, and most operating systems support it, but I can't find information on whether it is available to a userspace malloc - it seems to be just for kernel allocations.
You can find the paper by Jeff Bonwick here, which describes the original technique on Solaris.
Google has a greedy implementation of malloc() that does roughly what you're thinking of. It has some drawbacks, but it's very fast in many use cases.
What you're describing has probably been done; I don't really know. However, I doubt that buffering your malloc() at the system level would decrease latency much. You've still got to take the time to enter privileged mode for a system call, potentially lock kernel-level structures (which means more system calls and WAITING for the locks), and things of that nature.
If you can write your own memory manager in user-space for your program and only call malloc() when you need more memory for your pool, you are likely to see a decrease in latency.
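As a sketch of that idea, here is a "greedy" bump allocator that grabs a large chunk once and then hands out pieces with no further system involvement. Names are illustrative; it ignores alignment and never returns individual pieces to the chunk, so it only suits specific workloads:

#include <stdlib.h>

#define CHUNK_SIZE (10 * 1024 * 1024)   /* grab 10 MB at a time */

static char  *chunk;
static size_t offset = CHUNK_SIZE;      /* forces a refill on first use */

void *greedy_malloc(size_t size) {
    if (offset + size > CHUNK_SIZE) {
        if (size > CHUNK_SIZE)
            return malloc(size);        /* oversized request: plain malloc */
        chunk = malloc(CHUNK_SIZE);     /* refill with a new greedy chunk */
        if (!chunk)
            return NULL;
        offset = 0;
    }
    void *p = chunk + offset;
    offset += size;
    return p;
}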

What are out-of-memory handling strategies in C programming?

One strategy that I thought of myself is allocating 5 megabytes of memory (or whatever number you feel necessary) at the program startup.
Then when at any point program's malloc() returns NULL, you free the 5 megabytes and call malloc() again, which will succeed and let the program continue running.
What do you think about this strategy?
And what other strategies do you know?
Thanks, Boda Cydo.
Handle malloc failures by exiting gracefully. With modern operating systems, pagefiles, etc., you should never pre-emptively brace for memory failure; just exit gracefully. It is unlikely you will ever encounter out-of-memory errors unless you have an algorithmic problem.
Also, allocating 5MB for no reason at startup is insane.
For the last few years, the (embedded) software I have been working with generally does not permit the use of malloc(). The sole exception is that it is permissible during the initialization phase, but once it is decided that no more memory allocations are allowed, all future calls to malloc() fail. As memory may become fragmented due to malloc()/free(), it becomes difficult at best in many cases to prove that future calls to malloc() will not fail.
Such a scenario might not apply to your case. However, knowing why malloc() is failing can be useful. The following technique, which we use in our code since malloc() is not generally available, might (or might not) be applicable to your scenario.
We tend to rely upon memory pools. The memory for each pool is allocated during the transient startup phase. Once we have the pools, we get an entry from the pool when we need it and release it back to the pool when we are done. Each pool is configurable and is usually reserved for a particular object type. We can track the usage of each over time. If we run out of pool entries, we can find out why. If we don't, we have the option of making our pool smaller and saving some resources.
Hope this helps.
As a method of testing that you handle out of memory situations gracefully, this can be a reasonably useful technique.
Under any other circumstance, it sounds useless at best. You're causing the out of memory situation to happen, then fixing the problem by freeing memory you didn't need to start with.
"try-again-later". Just because you're OOM now, doesn't mean you will be later when the system is less busy.
#include <stdlib.h>
#include <unistd.h>

/* "Sleepy" malloc: retry for up to 100 seconds before giving up. */
void *smalloc(size_t size) {
    for (int i = 0; i < 100; i++) {
        void *p = malloc(size);
        if (p)
            return p;
        sleep(1);   /* wait and hope some memory is freed in the meantime */
    }
    return NULL;
}
You should of course think hard about where you employ such a strategy, as it is quite hideous, but it has saved some of our systems in various cases.
It actually depends on the policy you'd like to implement, meaning: what is the expected behavior of your program when it's out of memory?
A great solution would be to allocate memory during initialization only and never during runtime. In that case you'll never run out of memory if the program managed to start.
Another could be freeing resources when you hit the memory limit. It'd be difficult to implement and test.
Keep in mind that when you are getting NULL from malloc it means both physical and virtual memory have no more free space, meaning your program is swapping all the time, making it slow and the computer unresponsive.
You actually need to make sure (by estimated calculation or by checking the amount of memory in runtime) that the expected amount of free memory the computer has is enough for your program.
Generally the purpose of freeing the memory is so that you have enough to report the error before you terminate the program.
If you are just going to keep running, there is no point in preallocating the emergency reserve.
Most modern OSes in their default configuration allow memory overcommit, so your program won't get NULL from malloc() at all, or at least not until it has somehow (by error, I guess) exhausted all available address space (not memory).
And then it writes to some perfectly legal memory location, gets a page fault, there is no memory page in the backing store, and BANG (SIGBUS) - you're dead, and there is no good way out.
So just forget about it; you can't handle it.
Yeah, this doesn't work in practice. First, for a technical reason: a typical low-fragmentation heap implementation doesn't make large free blocks available for small allocations.
But the real problem is that you don't know why you ran out of virtual memory space. And if you don't know why, then there's nothing you can do to prevent that extra memory from being consumed very rapidly, and your program will still crash with OOM. Which is very likely to happen: you've already consumed close to two gigabytes, so that extra 5 MB is a drop of water on a hot plate.
Any kind of scheme that switches the app into 'emergency mode' is very impractical. You'll have to abort running code so that you can stop, say, loading an enormous data file. That requires an exception. Now you're back to what you already had before, std::bad_alloc.
I want to second the sentiment that the 5 MB pre-allocation approach is "insane", but for another reason: it's subject to race conditions. If the cause of memory exhaustion is within your program (virtual address space exhausted), another thread could claim the 5 MB after you free it but before you get to use it. If the cause of memory exhaustion is lack of physical resources on the machine due to other processes using too much memory, those other processes could claim the 5 MB after you free it (if the malloc implementation returns the space to the system).
Some applications, like a music or movie player, would be perfectly justified just exiting/crashing on allocation failures - they're managing little if any modifiable data. On the other hand, I believe any application that is being used to modify potentially-valuable data needs to have a way to (1) ensure that data already on disk is left in a consistent, non-corrupted state, and (2) write out a recovery journal of some sort so that, on subsequent invocations, the user can recover any data lost when the application was forced to close.
As we've seen in the first paragraph, due to race conditions your "malloc 5 MB and free it" approach does not work. Ideally, the code to synchronize data and write recovery information would be completely allocation-free; if your program is well-designed, it's probably naturally allocation-free. One possible approach if you know you will need allocations at this stage is to implement your own allocator that works in a small static buffer/pool, and use it during allocation-failure shutdown.

How to fix a memory size for my application in C?

I would like to allocate a fixed amount of memory for my application (developed using C). Say my application should not exceed 64 MB of memory. And also I should avoid using too much CPU. How is this possible?
Regards
Marcel.
Under Unix: "ulimit -d 65536" (the -d limit is given in kilobytes).
One fairly low-tech way to ensure you do not cross a maximum memory threshold in your application would be to define your own special malloc() function which keeps count of how much memory has been allocated, and returns a NULL pointer if the threshold has been exceeded. This would of course rely on you checking the return value of malloc() every time you call it, which is generally considered good practice anyway because there is no guarantee that malloc() will find a contiguous block of memory of the size that you requested.
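A minimal sketch of such a counting wrapper (names are illustrative and the 64 MB budget is taken from the question; it stores the requested size in a small header so the count can be decremented on free, ignores alignment subtleties, and is not thread-safe):

#include <stdlib.h>

#define MEM_LIMIT (64u * 1024u * 1024u)  /* 64 MB budget */

static size_t g_used;                    /* bytes handed out so far */

void *my_malloc(size_t size) {
    if (g_used + size > MEM_LIMIT)
        return NULL;                     /* over budget: refuse */
    /* allocate room for a small header that remembers the size */
    size_t *p = malloc(sizeof(size_t) + size);
    if (!p)
        return NULL;
    *p = size;
    g_used += size;
    return p + 1;                        /* hand back the payload */
}

void my_free(void *ptr) {
    if (!ptr)
        return;
    size_t *p = (size_t *)ptr - 1;       /* recover the header */
    g_used -= *p;
    free(p);
}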
This wouldn't be foolproof though, because it probably won't take into account memory padding for word alignment, so you'd probably end up reaching the 64 MB memory limit long before your function reports that you have reached it.
Also, assuming you are using Win32, there are probably APIs that you could use to get the current process size and check this within your custom malloc() function. Keep in mind that adding this checking overhead to your code will most likely cause it to use more CPU and run a lot slower than normal, which leads nicely into your next question :)
And also I should avoid using too much CPU.
This is a very general question and there is no easy answer. You could write two different programs which essentially do the same thing, and one could be 100 times more CPU intensive than another one due to the algorithms that have been used. The best technique is to:
Set some performance benchmarks.
Write your program.
Measure to see whether it reaches your benchmarks.
If it doesn't reach your benchmarks, optimise and go to step (3).
You can use profiling programs to help you work out where your algorithms need to be optimised. Rational Quantify is an example of a commercial one, but there are many free profilers out there too.
If you are on a POSIX, System V-, or BSD-derived system, you can use setrlimit() with the RLIMIT_DATA resource - similar to ulimit -d.
Also take a look at the RLIMIT_CPU resource - it's probably what you need (similar to ulimit -t).
Check man setrlimit for details.
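For example, a minimal sketch of setting both limits from inside the program at startup (the 64 MB figure is taken from the question; the 60-second CPU cap is just illustrative):

#include <sys/resource.h>
#include <stdio.h>

int main(void) {
    /* Cap the data segment at 64 MB, like ulimit -d */
    struct rlimit data_lim = { 64 * 1024 * 1024, 64 * 1024 * 1024 };
    if (setrlimit(RLIMIT_DATA, &data_lim) != 0)
        perror("setrlimit(RLIMIT_DATA)");

    /* Cap CPU time at 60 seconds, like ulimit -t */
    struct rlimit cpu_lim = { 60, 60 };
    if (setrlimit(RLIMIT_CPU, &cpu_lim) != 0)
        perror("setrlimit(RLIMIT_CPU)");

    /* ... rest of the application ... */
    return 0;
}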
For CPU, we've had a very low-priority task (lower than everything else) that does nothing but count. Then you can see how often that task gets to run, and you know whether the rest of your processes are consuming too much CPU. This approach doesn't work if you want to limit your process to 10% while other processes are running, but if you want to ensure that you have 50% CPU free then it works fine.
For memory limitations you are either stuck implementing your own layer on top of malloc, or taking advantage of your OS in some way. On Unix systems ulimit is your friend. On VxWorks I bet you could probably figure out a way to take advantage of the task control block to see how much memory the application is using... if there isn't already a function for that. On Windows you could probably at least set up a monitor to report if your application does go over 64 MB.
The other question is: what do you do in response? Should your application crash if it exceeds 64 MB? Do you want this just as a guide to help you limit yourself? That might make the difference between choosing an "enforcing" approach versus a "monitor and report" approach.
Hmm; good question. I can see how you could do this for memory allocated off the heap, using a custom version of malloc and free, but I don't know about enforcing it on the stack too.
Managing the CPU is harder still...
Interesting.
