Malloc fails to allocate memory on atmega2561 and freeRTOS - c

I am trying to use the malloc() function to create nodes for a Linked List. The function in my case returns NULL on the following dummy segment of code.
I am running FREERTOS on an atmega 2561.
if (!malloc(sizeof(struct Event)))
{
//The code gets inside here
} else {
//
}
The struct of the nodes is the following:
struct Event
{
uint8_t shouldCarBrake;
uint16_t tachoPoint;
struct Event *next;
};

If the project is set up to use one of the four example heap memory management files shipped with FreeRTOS that do not call malloc() (heap_1, heap_2, heap_4 or heap_5), then the heap supplied by the C library may have been sized to zero. Of the five, only heap_3 uses malloc(). See http://www.freertos.org/a00111.html for more information.

The standard library malloc() is not normally thread-safe, so it should probably not be used in this instance. Some libraries provide stubs for mutex locks for the standard library, but if that is not the case, or you choose not to implement them, then malloc() itself should be wrapped in a function with mutex locks, or an alternative thread-safe heap management implementation should be used. FreeRTOS provides both of these options as described here.
The standard malloc() implementation is also non-deterministic, which is another reason not to use it in real-time critical code: you cannot determine how long the mutex will be locked for, so not only will the allocating thread be non-deterministic, but so will any threads waiting to allocate.
Even without FreeRTOS's support for various allocators, it is easy to implement a deterministic fixed-block memory allocator in a manner that is easily portable to any RTOS. You simply pre-allocate a pool (or pools) of memory blocks (either statically or from the standard heap before real-time scheduling starts), and stuff an RTOS queue with pointers to the blocks. Allocation is then simply a case of taking a pointer from the queue, and deallocation a case of returning the pointer to the queue.
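The fixed-block scheme described above might look roughly like the following minimal sketch. It reuses the struct Event from the question; BLOCK_COUNT and the pool_* names are hypothetical, and a plain array of free pointers stands in for the RTOS queue. As written it is not thread-safe: on a real RTOS the array would be replaced by a queue of pointers (in FreeRTOS, for instance, one filled with xQueueSend and drained with xQueueReceive), which supplies the locking.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_COUNT 8   /* hypothetical pool size */

struct Event {
    uint8_t shouldCarBrake;
    uint16_t tachoPoint;
    struct Event *next;
};

/* Statically reserved storage: no heap is needed at run time. */
static struct Event pool_storage[BLOCK_COUNT];

/* Stand-in for the RTOS queue: a stack of pointers to free blocks. */
static struct Event *free_blocks[BLOCK_COUNT];
static size_t free_count;

/* Put every block in the "queue"; call once before the scheduler starts. */
void pool_init(void)
{
    for (size_t i = 0; i < BLOCK_COUNT; i++)
        free_blocks[i] = &pool_storage[i];
    free_count = BLOCK_COUNT;
}

/* Constant-time allocation; returns NULL when the pool is exhausted. */
struct Event *pool_alloc(void)
{
    if (free_count == 0)
        return NULL;
    return free_blocks[--free_count];
}

/* Constant-time release of a block previously returned by pool_alloc(). */
void pool_free(struct Event *block)
{
    if (block != NULL && free_count < BLOCK_COUNT)
        free_blocks[free_count++] = block;
}
```

Both operations take a fixed number of steps regardless of how many blocks are in use, which is what makes the approach deterministic.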

Related

Does terminating a program reclaim memory in the same way as free()?

I saw this answer to a Stack Overflow question that says that freeing memory at the very end of a C program is actually harmful because it moves variables that wouldn't be used again into system memory.
I'm confused why the free() method in C would do anything different than the operating system reclaiming the heap at the end of the program.
Does anyone know if there is a real difference between free() and termination in terms of memory management and if so how the operating system may treat these two differently?
e.g.
would anything different happen between these two short programs?
#include <stdlib.h>
int main(void) {
int* mem = malloc(1);
return 0;
}
#include <stdlib.h>
int main(void) {
int* mem = malloc(1);
free(mem);
return 0;
}
No, terminating a program, as with exit or abort, does not reclaim memory in the same way as free. Using free causes some activity that ultimately has no effect when the operating system discards the data maintained by malloc and free.
exit has some complications, as it does not immediately terminate the program. For now, let’s just consider the effect of immediately terminating the program and consider the complications later.
In a general-purpose multi-user operating system, when a process is terminated, the operating system releases the memory it was using to make it available for other purposes.1 In large part, this simply means the operating system does some accounting operations.
In contrast, when you call free, software inside the program runs, and it has to look up the size of the memory you are freeing and then insert information about that memory into the pool of memory it is maintaining. There could be thousands or tens of thousands (or more) of such allocations. A program that frees all its data may have to execute many thousands of calls to free. Yet, in the end, when the program exits, all of the changes produced by free will vanish, as the operating system will discard all the data about that pool of memory—all of the data is in memory pages the operating system does not preserve.
So, in this regard, the answer you link to is correct: calling free is a waste. And, as it points out, the necessity of going through all the data structures in the program to fetch the pointers in them so the memory they point to can be freed causes all those data structures to be read into memory if they had been swapped out to disk. For large programs, it can take a considerable amount of time and other resources.
On the other hand, it is not clear it is easy to avoid many calls to free. This is because releasing memory is not the only thing a terminating program has to clean up. A program may want to write final data to files or send final messages to network connections. Furthermore, a program may not have established all of this context directly. Most large programs rely on layers of software, and each software package may have set up its own context, and often no way is provided to tell other software “I want to exit now. Finish the valuable context, but skip all the freeing of memory.” So all the desired clean-up tasks may be intertwined with the free-memory tasks, and there may be no good way to untangle them.
Software should generally be written so that nothing terrible happens if a program is suddenly aborted (since this can happen from a loss of power, not just deliberate user action). But even though a program might be able to tolerate an abort, there can still be value in a graceful exit.
Getting back to exit, calling the C exit routine does not exit the program immediately. Exit handlers (registered with atexit) are called, stream buffers are flushed, and streams are closed. Any software libraries you called may have set up their own exit handlers so that they can finish up when the program is exiting. So, if you want to be sure libraries you have used in your program are not calling free when you end the program, you have to call abort, not exit. But it is generally preferred to end a program gracefully, not by aborting. Calling abort will not call exit handlers, flush streams, close streams, or perform other wind-down code that exit does—data can be lost when a program calls abort.
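As a small illustration of the exit-handler behaviour described above, here is a minimal sketch. The names flush_log and install_exit_handler are hypothetical; atexit, exit and abort are the standard C library routines.

```c
#include <stdio.h>
#include <stdlib.h>

/* Runs during exit() or a return from main(), but not after abort(). */
static void flush_log(void)
{
    fputs("exit handler: final clean-up\n", stderr);
}

/* Registers the handler; returns 0 on success, as atexit() does. */
int install_exit_handler(void)
{
    return atexit(flush_log);
}
```

If a program calls install_exit_handler() early in main and later ends with exit(EXIT_SUCCESS), flush_log runs before the process terminates. If it ends with abort() instead, the handler is skipped, along with any handlers that libraries registered.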
Footnote
1 Releasing memory does not mean it is immediately available for other purposes. The specific result of this depends on each page of memory. For example:
If the memory is shared with other processes, it is still needed for them, so releasing it from use by this process only decrements the number of processes using the memory. It is not immediately available for any other use.
If the memory is not in use by any other processes but contains data mapped from a file on disk, the operating system might mark it as available when needed but leave it alone for the moment. This is because you might run the same program again, and it would be nice if the data were still in memory, so why not just leave it in place just in case? The data might even be used by a different program that uses the same file. (For example, many programs might use the same shared library.)
If the memory is not in use by any other processes and was just used by the program as a work area, not mapped from a file, then the system may mark it as immediately available and not containing anything useful.
would anything different happen between these two short programs?
The simple answer is: it makes no difference, the memory is released to the system in both cases. Calling free() is not strictly necessary and does incur an infinitesimal overhead but may prove useful when trying to track memory leaks in more complex programs.
Does terminating a program reclaim memory in the same way as free?
Not exactly:
Terminating a program releases the memory used by the program, be it for the program code, data, stack or heap. It also releases some other resources such as file handles, device handles, network sockets... All this is done efficiently, no matter how many blocks of memory have been allocated with malloc().
Conversely, free() makes the block of memory available for further use by the program for later calls to malloc() or realloc(). Depending on its size and the implementation of the heap, this freed block may or may not be returned to the OS for use by other programs. Also worth noting is the fragmentation problem, where small blocks of freed memory may not be usable for a larger allocation because they are surrounded by allocated blocks. The C heap does not perform packing or de-fragmentation; it merely coalesces adjacent free blocks. Freeing all allocated blocks before leaving the program may be useful for debugging purposes, but may be complicated and time consuming, while not necessary for the memory to be reused by the system after the program terminates.
free() is a user-level memory management function and depends on the malloc implementation you are currently using. The user-level allocator might maintain a linked list of memory chunks, and malloc/free will take a chunk of appropriate size or put it back.
exit() destroys an address space and all regions.
This is related to malloced heap as well as some other regions and in-kernel data structures used for managing address space of the process:
Each address space consists of a number of page-aligned regions
of memory that are in use. They never overlap and represent a set
of addresses which contain pages that are related to each other in
terms of protection and purpose. These regions are represented by
a struct vm_area_struct and are roughly analogous to the
vm_map_entry struct in BSD. For clarity, a region may represent the
process heap for use with malloc(), a memory mapped file such as
a shared library or a block of anonymous memory allocated with
mmap(). The pages for this region may still have to be allocated, be
active and resident or have been paged out
Reference: https://www.kernel.org/doc/gorman/html/understand/understand007.html
The reason well-designed programs free memory at exit is to check for memory leaks. If your application-level memory allocation does not go to zero after your last deallocation, you know that you have memory that is not being managed properly and probably have a memory leak in your code.
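One simple way to apply that check is to count live allocations through thin wrappers. The sketch below is only an illustration: the tracked_* names are hypothetical, the counter is not thread-safe, and real leak checkers record far more than a count.

```c
#include <stddef.h>
#include <stdlib.h>

/* Number of blocks handed out and not yet freed. */
static size_t live_allocations;

void *tracked_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        live_allocations++;
    return p;
}

void tracked_free(void *p)
{
    if (p != NULL) {
        live_allocations--;
        free(p);
    }
}

/* Should be zero after the last deallocation if nothing leaked. */
size_t tracked_live(void)
{
    return live_allocations;
}
```

A program that routes its allocations through these wrappers can test tracked_live() just before exiting; a non-zero value points at a block that was never freed.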
would anything different happen between these two short programs?
YES
I'm confused why the free() method in C would do anything different than the operating system reclaiming the heap at the end of the program.
The operating system allocates memory in pages. Heap managers (such as malloc/free implementations) allocate pages from the operating system and subdivide the pages into smaller allocations. Calls to free() normally return memory to the heap. They do not return the pages to the operating system.

How does malloc work in a multithreaded environment?

Does the typical malloc (for x86-64 platform and Linux OS) naively lock a mutex at the beginning and release it when done, or does it lock a mutex in a more clever way at a finer level, so that lock contention is reduced for concurrent calls? If it indeed does it the second way, how does it do it?
glibc 2.15 operates multiple allocation arenas. Each arena has its own lock. When a thread needs to allocate memory, malloc() picks an arena, locks it, and allocates memory from it.
The mechanism for choosing an arena is somewhat elaborate and is aimed at reducing lock contention:
/* arena_get() acquires an arena and locks the corresponding mutex.
First, try the one last locked successfully by this thread. (This
is the common case and handled with a macro for speed.) Then, loop
once over the circularly linked list of arenas. If no arena is
readily available, create a new one. In this latter case, `size'
is just a hint as to how much memory will be required immediately
in the new arena. */
With this in mind, malloc() basically looks like this (edited for brevity):
mstate ar_ptr;
void *victim;
arena_lookup(ar_ptr);
arena_lock(ar_ptr, bytes);
if(!ar_ptr)
return 0;
victim = _int_malloc(ar_ptr, bytes);
if(!victim) {
/* Maybe the failure is due to running out of mmapped areas. */
if(ar_ptr != &main_arena) {
(void)mutex_unlock(&ar_ptr->mutex);
ar_ptr = &main_arena;
(void)mutex_lock(&ar_ptr->mutex);
victim = _int_malloc(ar_ptr, bytes);
(void)mutex_unlock(&ar_ptr->mutex);
} else {
/* ... or sbrk() has failed and there is still a chance to mmap() */
ar_ptr = arena_get2(ar_ptr->next ? ar_ptr : 0, bytes);
(void)mutex_unlock(&main_arena.mutex);
if(ar_ptr) {
victim = _int_malloc(ar_ptr, bytes);
(void)mutex_unlock(&ar_ptr->mutex);
}
}
} else
(void)mutex_unlock(&ar_ptr->mutex);
return victim;
This allocator is called ptmalloc. It is based on earlier work by Doug Lea, and is maintained by Wolfram Gloger.
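The arena selection described in the arena_get() comment can be pictured with the simplified sketch below. This is not glibc's code: the fixed ARENA_COUNT, the arena_acquire and arena_release names and the empty arena structure are all assumptions made for illustration, and where glibc would create a new arena this sketch simply waits for one.

```c
#include <pthread.h>
#include <stddef.h>

#define ARENA_COUNT 4   /* hypothetical fixed number of arenas */

struct arena {
    pthread_mutex_t lock;
    /* A real arena would also hold free lists and heap bounds. */
};

static struct arena arenas[ARENA_COUNT] = {
    { PTHREAD_MUTEX_INITIALIZER }, { PTHREAD_MUTEX_INITIALIZER },
    { PTHREAD_MUTEX_INITIALIZER }, { PTHREAD_MUTEX_INITIALIZER },
};

/* Index of the arena this thread locked successfully last time. */
static _Thread_local size_t last_arena;

/* Returns a locked arena; the caller must release it when done. */
struct arena *arena_acquire(void)
{
    /* One pass over the arenas, starting with the last one used. */
    for (size_t i = 0; i < ARENA_COUNT; i++) {
        size_t idx = (last_arena + i) % ARENA_COUNT;
        if (pthread_mutex_trylock(&arenas[idx].lock) == 0) {
            last_arena = idx;
            return &arenas[idx];
        }
    }
    /* All busy: glibc would create a new arena; this sketch just waits. */
    pthread_mutex_lock(&arenas[last_arena].lock);
    return &arenas[last_arena];
}

void arena_release(struct arena *a)
{
    pthread_mutex_unlock(&a->lock);
}
```

Because each thread tends to return to the arena it used last, threads that allocate concurrently usually end up holding different locks.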
Doug Lea's malloc used coarse locking (or no locking, depending on the configuration settings), where every call to malloc/realloc/free is protected by a global mutex. This is safe but can be inefficient in highly multithreaded environments.
ptmalloc3, which is the default malloc implementation in the GNU C library (libc) used on most Linux systems these days, has a more fine-grained strategy, as described in aix's answer, which allows multiple threads to concurrently allocate memory safely.
nedmalloc is another independent implementation which claims even better multithreaded performance than ptmalloc3 and various other allocators. I don't know how it works, and there doesn't seem to be any obvious documentation, so you'll have to check the source code to see how it works.
In addition to the ptmalloc mentioned by @NPE, there is also tcmalloc, offered by Google. Under certain scenarios it has been measured to run slightly faster than ptmalloc (e.g. when calling malloc() 10^6 times and then freeing the blocks).
It uses a global heap plus per-thread heaps. The global heap can be used by every thread, so only the global heap has a mutex; the per-thread heaps do not. Each thread can pull storage from the global heap, and when it frees that storage, the storage goes to the thread's own per-thread heap.
Each per-thread heap is owned by a single thread. Only when things become imbalanced do threads move storage back to the global heap.
The code is available here: https://github.com/google/tcmalloc

Lockless circular queue with pthreads. Anything to watch out for?

I'd like to implement a lockless single-producer, single-consumer circular queue between two pthreads; in C, on ARM Linux.
The queue will hold bytes, the producer will memcpy() things in, and the consumer will write() them out to file.
Is it naive to think I can store head and tail offsets in ints and everything will just work?
I am wondering about such things as compiler optimisations meaning my head/tail writes sit in registers and are not visible to the other thread, or needing a memory barrier somewhere.
The memory consistency model of pthreads doesn't offer you any assistance for building lockless algorithms. You're on your own - you will have to use whatever atomic instructions and memory barriers are provided and required by your architecture. You'll also have to consult your compiler documentation to determine how to request a compiler barrier.
You are almost certainly better off using a normal queue implementation protected by a mutex and condition variable - if the queue simply stores pointers to the buffers that are being written out to file (rather than the data itself), then lock contention shouldn't be a problem, as the lock will only have to be held while a pointer is added or removed from the queue.
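A minimal sketch of such a mutex-and-condition-variable queue of buffer pointers follows. QUEUE_SLOTS and the ptr_queue names are hypothetical, and error checking of the pthread calls is omitted for brevity.

```c
#include <pthread.h>
#include <stddef.h>

#define QUEUE_SLOTS 16  /* hypothetical capacity */

struct ptr_queue {
    void *slots[QUEUE_SLOTS];
    size_t head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

void queue_init(struct ptr_queue *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Producer: blocks while the queue is full. */
void queue_push(struct ptr_queue *q, void *buf)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_SLOTS)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->slots[q->tail] = buf;
    q->tail = (q->tail + 1) % QUEUE_SLOTS;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Consumer: blocks while the queue is empty. */
void *queue_pop(struct ptr_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    void *buf = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_SLOTS;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return buf;
}
```

The lock is held only while one pointer is copied and a few indices are updated; the memcpy() into a buffer and the write() to file both happen outside it.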

Any single-consumer single-producer lock free queue implementation in C?

I'm writing a program with a consumer thread and a producer thread, now it seems queue synchronization is a big overhead in the program, and I looked for some lock free queue implementations, but only found Lamport's version and an improved version on PPoPP '08:
enqueue_nonblock(data) {
if (NULL != buffer[head]) {
return EWOULDBLOCK;
}
buffer[head] = data;
head = NEXT(head);
return 0;
}
dequeue_nonblock(data) {
data = buffer[tail];
if (NULL == data) {
return EWOULDBLOCK;
}
buffer[tail] = NULL;
tail = NEXT(tail);
return 0;
}
Both versions require a pre-allocated array for the data, my question is that is there any single-consumer single-producer lock-free queue implementation which uses malloc() to allocate space dynamically?
And another related question is, how can I measure exact overhead in queue synchronization? Such as how much time it takes of pthread_mutex_lock(), etc.
If you are worried about performance, adding malloc() to the mix won't help things. And if you are not worried about performance, why not simply control access to the queue via a mutex? Have you actually measured the performance of such an implementation? It sounds to me as though you are going down the familiar route of premature optimisation.
The algorithm you show manages to work because although the two threads share the resource (i.e., the queue), they share it in a very particular way. Because only one thread ever alters the head-index of the queue (the producer), and only one thread ever alters the tail-index (the consumer, of course), you can't get an inconsistent state of the shared object. It's also important that the producer puts the actual data in before updating the head index, and that the consumer reads the data it wants before updating the tail index.
It works as well as it does because the array is quite static; both threads can count on the storage for the elements being there. You probably can't replace the array entirely, but what you can do is change what the array is used for.
I.e., instead of keeping the data in the array, use it to keep pointers to the data. Then you can malloc() and free() the data items, while passing references (pointers) to them between your threads via the array.
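The idea of passing pointers through a fixed array might be sketched as follows, using C11 atomics to make the ordering requirement explicit (data first, then the index). RING_SLOTS and the ring_* names are hypothetical, and this differs from the NULL-slot scheme in the question in that it tracks fullness through the two indices. It is valid for exactly one producer thread and one consumer thread.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 64   /* hypothetical capacity; one slot stays unused */

struct spsc_ring {
    void *slots[RING_SLOTS];
    atomic_size_t head;  /* written only by the producer */
    atomic_size_t tail;  /* written only by the consumer */
};

/* Producer side: returns false when the ring is full. */
bool ring_push(struct spsc_ring *r, void *item)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t next = (head + 1) % RING_SLOTS;
    if (next == atomic_load_explicit(&r->tail, memory_order_acquire))
        return false;
    r->slots[head] = item;
    /* Release: the slot write becomes visible before the new head. */
    atomic_store_explicit(&r->head, next, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
bool ring_pop(struct spsc_ring *r, void **item)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    if (tail == atomic_load_explicit(&r->head, memory_order_acquire))
        return false;
    *item = r->slots[tail];
    /* Release: the slot read completes before the slot is handed back. */
    atomic_store_explicit(&r->tail, (tail + 1) % RING_SLOTS,
                          memory_order_release);
    return true;
}
```

The producer can malloc() each item before calling ring_push, and the consumer can free() it after ring_pop, so the items themselves are dynamically sized even though the ring is not.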
Also, POSIX does support reading a nanosecond clock, although the actual precision is system dependent. You can read this high-resolution clock before and after and just subtract.
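A minimal sketch of that measurement, assuming clock_gettime with CLOCK_MONOTONIC is available on the target system. The function name mutex_pair_cost_ns is hypothetical, and the loop measures only the uncontended case, which is usually the cheapest one.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime under strict C modes */
#include <pthread.h>
#include <time.h>

/* Average cost in nanoseconds of one uncontended lock/unlock pair. */
double mutex_pair_cost_ns(long iterations)
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (double)(end.tv_sec - start.tv_sec) * 1e9
                   + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed / (double)iterations;
}
```

Averaging over many iterations matters because a single lock/unlock pair may well be shorter than the clock's resolution.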
Yes.
There exist a number of lock-free multiple-reader multiple-writer queues.
I have implemented one, by Michael and Scott, from their 1996 paper.
I will (after some more testing) be releasing a small library of lock-free data structures (in C) which will include this queue.
You should look at the FastFlow library.
I recall seeing one that looked interesting a few years ago, though I can't seem to find it now. :( The lock-free implementation that was proposed did require use of a CAS primitive, though even the locking implementation (if you didn't want to use the CAS primitive) had pretty good perf characteristics--- the locks only prevented multiple readers or multiple producers from hitting the queue at the same time, the producer still never raced with the consumer.
I do remember that the fundamental concept behind the queue was to create a linked list that always had one extra "empty" node in it. This extra node meant that the head and the tail pointers of the list would only ever refer to the same data when the list was empty. I wish I could find the paper, I'm not doing the algorithm justice with my explanation...
AH-ha!
I've found someone who transcribed the algorithm without the remainder of the article. This could be a useful starting point.
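One way to read the "extra empty node" idea, restricted to a single producer and a single consumer, is sketched below. This is an interpretation and not necessarily the algorithm from the paper being recalled; the list_queue names are hypothetical. Nodes are allocated with malloc() on push and released with free() on pop.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct node {
    void *data;
    _Atomic(struct node *) next;
};

struct list_queue {
    struct node *head;  /* dummy node; touched only by the consumer */
    struct node *tail;  /* last node; touched only by the producer */
};

bool list_queue_init(struct list_queue *q)
{
    struct node *dummy = malloc(sizeof *dummy);
    if (dummy == NULL)
        return false;
    dummy->data = NULL;
    atomic_init(&dummy->next, NULL);
    q->head = q->tail = dummy;
    return true;
}

/* Producer: fill in a node, then publish it through the old tail. */
bool list_queue_push(struct list_queue *q, void *data)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return false;
    n->data = data;
    atomic_init(&n->next, NULL);
    atomic_store_explicit(&q->tail->next, n, memory_order_release);
    q->tail = n;
    return true;
}

/* Consumer: the node after the dummy holds the oldest item. */
bool list_queue_pop(struct list_queue *q, void **data)
{
    struct node *old = q->head;
    struct node *first = atomic_load_explicit(&old->next, memory_order_acquire);
    if (first == NULL)
        return false;       /* only the dummy is left: queue is empty */
    *data = first->data;
    q->head = first;        /* 'first' becomes the new dummy */
    free(old);
    return true;
}
```

Because of the dummy node, the producer only ever touches tail and the consumer only ever touches head, so the two sides never write to the same field. The calls to malloc() and free() may still take locks inside the allocator.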
I've worked with a fairly simple queue implementation that meets most of your criteria. It used a static, maximum-size pool of bytes, and then we implemented messages within that. There was a head pointer that one process would move, and a tail pointer that the other process would move.
Locks were still required, but we used Peterson's 2-Processor Algorithm, which is pretty lightweight since it doesn't involve system calls. The lock is only required for a very small, well-bounded area: a few CPU cycles at most, so you never block for long.
I think the allocator can be a performance problem. You can try a custom multithreaded memory allocator that uses a linked list to maintain freed blocks. If your blocks are not (nearly) the same size, you can implement a "buddy system" memory allocator, which is very fast. You have to synchronise your queue (ring buffer) with a mutex.
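For reference, Peterson's algorithm for two parties can be sketched with C11 atomics as below. This is not the code from that project: the peterson_* names are hypothetical, and the sketch relies on the default sequentially consistent ordering of atomic_store and atomic_load, which the algorithm needs on modern hardware.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool wants[2];  /* wants[i]: party i wishes to enter */
static atomic_int turn;       /* which party goes first when both want in */

/* self must be 0 or 1, with each value used by exactly one thread. */
void peterson_lock(int self)
{
    int other = 1 - self;
    atomic_store(&wants[self], true);
    atomic_store(&turn, other);          /* give the other side priority */
    while (atomic_load(&wants[other]) && atomic_load(&turn) == other)
        ;                                /* spin: no system call involved */
}

void peterson_unlock(int self)
{
    atomic_store(&wants[self], false);
}
```

Since waiting is a busy loop, this only makes sense when the protected region is as short as the answer describes.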
To avoid too much synchronisation, you can try write/read multiple values to/from the queue at each access.
If you still want to use lock-free algorithms, then you must use pre-allocated data or a lock-free allocator.
There is a paper about a lock-free allocator, "Scalable Lock-Free Dynamic Memory Allocation", and an implementation, Streamflow.
Before starting with lock-free stuff, look at: Circular lock-free buffer
Adding malloc would kill any performance gain you may make and a lock based structure would be just as effective. This is so because malloc requires some sort of CAS lock over the heap and hence some forms of malloc have their own lock so you may be locking in the Memory Manager.
To use malloc you would need to pre allocate all the nodes and manage them with another queue...
Note you can make some form of expandable array which would need to lock if it was expanded.
Also, while interlocked operations are lock-free on the CPU, they do place a memory lock and block memory for the duration of the instruction, and often stall the pipeline.
This implementation uses C++'s new and delete which can trivially be ported to the C standard library using malloc and free:
http://www.drdobbs.com/parallel/writing-lock-free-code-a-corrected-queue/210604448?pgno=2

Do threads have a distinct heap?

As far as I know each thread gets a distinct stack when the thread is created by the operating system. I wonder if each thread has a heap distinct to itself also?
No. All threads share a common heap.
Each thread has a private stack, which it can quickly add and remove items from. This makes stack based memory fast, but if you use too much stack memory, as occurs in infinite recursion, you will get a stack overflow.
Since all threads share the same heap, access to the allocator/deallocator must be synchronized. There are various methods and libraries for avoiding allocator contention.
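A small sketch of what the shared heap means in practice, assuming POSIX threads: a block allocated by one thread is read and freed by another. The function names are hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>

/* Runs in the worker thread: allocate on the shared heap. */
static void *worker(void *arg)
{
    (void)arg;
    int *value = malloc(sizeof *value);
    if (value != NULL)
        *value = 42;
    return value;   /* the pointer stays valid after this thread ends */
}

/* Returns the value written by the worker, or -1 on failure. */
int read_from_other_thread(void)
{
    pthread_t tid;
    void *result = NULL;

    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;
    pthread_join(tid, &result);
    if (result == NULL)
        return -1;

    int seen = *(int *)result;
    free(result);   /* freed by a different thread than the one that allocated */
    return seen;
}
```

A pointer to one of the worker's local (stack) variables could not be handed back this way, because the worker's stack is gone once it exits.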
Some languages allow you to create private pools of memory, or individual heaps, which you can assign to a single thread.
By default, C has only a single heap.
That said, some thread-aware allocators will partition the heap so that each thread has its own area to allocate from. The idea is that this should make the heap scale better.
One example of such a heap is Hoard.
Depends on the OS. The standard C runtime on Windows and Unices uses a shared heap across threads. This means locking on every malloc/free.
On Symbian, for example, each thread comes with its own heap, although threads can share pointers to data allocated in any heap. Symbian's design is better in my opinion since it not only eliminates the need for locking during alloc/free, but also encourages clean specification of data ownership among threads. Also in that case when a thread dies, it takes all the objects it allocated along with it - i.e. it cannot leak objects that it has allocated, which is an important property to have in mobile devices with constrained memory.
Erlang also follows a similar design where a "process" acts as a unit of garbage collection. All data is communicated between processes by copying, except for binary blobs which are reference counted (I think).
Each thread has its own stack and call stack.
Each thread shares the same heap.
It depends on what exactly you mean when saying "heap".
All threads share the address space, so heap-allocated objects are accessible from all threads. Technically, stacks are shared as well in this sense, i.e. nothing prevents you from accessing other thread's stack (though it would almost never make any sense to do so).
On the other hand, there are heap structures used to allocate memory. That is where all the bookkeeping for heap memory allocation is done. These structures are sophisticatedly organized to minimize contention between the threads - so some threads might share a heap structure (an arena), and some might use distinct arenas.
See the following thread for an excellent explanation of the details: How does malloc work in a multithreaded environment?
Typically, threads share the heap and other resources, however there are thread-like constructions that don't. Among these thread-like constructions are Erlang's lightweight processes, and UNIX's full-on processes (created with a call to fork()). You might also be working on multi-machine concurrency, in which case your inter-thread communication options are considerably more limited.
Generally speaking, all threads use the same address space and therefore usually have just one heap.
However, it can be a bit more complicated. You might be looking for Thread Local Storage (TLS), but it stores single values only.
Windows-Specific:
TLS-space can be allocated using TlsAlloc and freed using TlsFree (Overview here). Again, it's not a heap, just DWORDs.
Strangely, Windows supports multiple heaps per process. One can store a heap's handle in TLS, which gives something like a "thread-local heap". However, only the handle is hidden from the other threads; they can still access the heap's memory through pointers, as it is still the same address space.
EDIT: Some memory allocators (specifically jemalloc on FreeBSD) use TLS to assign "arenas" to threads. This is done to optimize allocation for multiple cores by reducing synchronization overhead.
On the FreeRTOS operating system, tasks (threads) share the same heap, but each has its own stack. This comes in very handy on low-power, low-RAM architectures, because the same pool of memory can be accessed and shared by several threads. It comes with a small catch: the developer needs a mechanism for synchronizing malloc and free, so some type of synchronization or lock, for example a semaphore or a mutex, is necessary when allocating or freeing memory on the heap.
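The kind of wrapper this calls for can be sketched as below. A POSIX mutex is used here only so the example is self-contained; that is an assumption, not FreeRTOS code. On FreeRTOS the lock would be one of its own semaphores or mutexes (taken with xSemaphoreTake and given back with xSemaphoreGive), or the application would call pvPortMalloc and vPortFree instead of malloc and free. The locked_* names are hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>

/* One lock guarding every call into the heap. */
static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

void *locked_malloc(size_t size)
{
    pthread_mutex_lock(&heap_lock);
    void *p = malloc(size);
    pthread_mutex_unlock(&heap_lock);
    return p;
}

void locked_free(void *p)
{
    pthread_mutex_lock(&heap_lock);
    free(p);
    pthread_mutex_unlock(&heap_lock);
}
```

Every task must go through the wrappers for the protection to hold; a single direct call to malloc() elsewhere bypasses the lock.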

Resources