Memory consumption of a concurrent data structure in C

I would like to know how much memory a given data structure is consuming. So suppose I have a concurrent linked list. I would like to know how big the list is. I have a few options: malloc_hooks, which I do not think is thread-safe, and getrusage's ru_maxrss, but I don't really know what that gives me (how much memory the whole process consumed during its execution?). I would like to know if someone has actually measured memory consumption in this way. Is there a tool to do this? How does massif fare?

To get an idea of how many bytes it actually costs to malloc some structure, like a linked list node, make an isolated test case (non-concurrent!) which allocates thousands of them, and look at the delta values in the program's memory usage. There are various ways to do that. If your C library provides a mallinfo structure, as the GNU C Library on GNU/Linux systems does, you can look at the statistics before and after. Another way is to trace the program's system calls to watch its pattern of allocating from the OS. If, say, we allocate 10,000,000 list nodes, and the program performs an sbrk() call about 39,000 times, increasing the size of the process by 8192 bytes in each call, then a list node takes up about 32 bytes, overhead and all.
Keep in mind that allocating thousands of objects of the same size in a single thread does not realistically represent the memory usage of a real program, which also includes fragmentation.
If you want to allocate small structures while wasting close to zero bytes (or at least not causing any waste that you don't know about and control), and to control fragmentation, then allocate large arrays of the objects from malloc (or your system allocator of choice) and carve them up yourself. There is still unknown overhead in each malloc call, but it is divided over a large number of objects, making it negligible.
Or, generally, write your own allocator whose behavior and overheads you understand in detail, and which itself takes big chunks from the system.

Conceptually speaking, you need to know the number of items you are working with, and the size of each data type used in your data structure. You also have to take into account the size of pointers and anything else that consumes memory per item.
Then you can come up with a formula that looks like the following:
consumption = N * sizeof(node)
In other words, add the sizes of the data types in one node together, and multiply by the number of items.

Related

For what do we need storage allocation?

For what do we actually require manual storage allocation?
The only use case I can think of is large binary data which does not fit into a 32-bit integer.
Is this correct?
What are other use cases?
In general, you need to do manual storage allocation every time that the size of your data is not known at compile time. Almost all situations fall into two categories:
Your program must read data from files/networks/user inputs etc., and the exact amount of that data is not known at compile time, or
Your program must produce and store some output, and the exact amount of that output is not known to you at the time when you write your program.
Many very common data structures assume an ability to allocate memory of arbitrary size, where the precise size is determined at runtime. Doing so lets the data structure "grow" and "shrink" dynamically, as the storage demands of your program change over time with the amount of data it processes.
There are tons of examples: allocating memory to fill a struct, for instance a linked-list node.
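A classic instance of the first category is reading input whose length is only known at runtime. A minimal sketch using a geometrically grown buffer (the function name is illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

/* Read an entire stream into a heap buffer, growing it as needed.
 * The final size is only known once EOF is reached. */
static char *read_all(FILE *fp, size_t *out_len) {
    size_t cap = 64, len = 0;
    char *buf = malloc(cap);
    if (!buf) return NULL;

    int c;
    while ((c = fgetc(fp)) != EOF) {
        if (len + 1 >= cap) {              /* full: double the capacity */
            cap *= 2;
            char *tmp = realloc(buf, cap);
            if (!tmp) { free(buf); return NULL; }
            buf = tmp;
        }
        buf[len++] = (char)c;
    }
    buf[len] = '\0';
    *out_len = len;
    return buf;
}
```

Doubling the capacity keeps the total copying cost linear in the final size, which is why most growable containers use a geometric growth factor.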

Reduce malloc calls by slicing one big malloc'd memory

First, here is where I got the idea from:
There was once an app I wrote that used lots of little blobs of
memory, each allocated with malloc(). It worked correctly but was
slow. I replaced the many calls to malloc with just one, and then
sliced up that large block within my app. It was much much faster.
I was profiling my application, and I got an unexpectedly nice performance boost when I reduced the number of malloc calls. I am still allocating the same amount of memory, though.
So, I would like to do what this guy did, but I am unsure of the best way to do it.
My Idea:
// static global variables
static Struct1 *memoryForStruct1;  // set once at startup:
                                   //   memoryForStruct1 = malloc(sizeof(Struct1) * 10000);
static int struct1Index = 0;
...
// somewhere, I need memory, fast:
Struct1 *data = &memoryForStruct1[struct1Index++];
...
// done with data:
--struct1Index;
Gotchas:
I have to make sure I don't exceed 10000
I have to release the memory in the reverse order I acquired it, i.e. stack discipline. (Not a major issue in my case, since I am using recursion, but I would like to avoid the restriction if possible.)
Inspired from Mihai Maruseac:
First, I create a linked list of int that basically tells me which memory indexes are free. I then added a property to my struct called int memoryIndex, which lets me return occupied memory in any order. Luckily, I know my memory needs will never exceed 5 MB at any given time, so I can safely allocate that much up front. Solved.
One system call which gives you memory is brk (modern allocators also obtain large blocks with mmap). The usual malloc, calloc, and realloc functions simply parcel out the space obtained that way. When that space is not enough, another brk call is made to create new space. Usually, the space is increased in multiples of the virtual memory page size.
Thus, if you really want a premade pool of objects, make sure to allocate memory in multiples of the page size. For example, you can create pools of 4 KB, 8 KB, ... of space.
Next idea: look at your objects. Some have one size, some another. It would be a big pain to handle allocations for all of them from the same pool, so create pools for objects of various sizes (powers of 2 are best) and allocate from those. For example, for an object of size 34 B you would allocate space from the 64 B pool.
Lastly, the remaining space can be either left unused or it can be moved down to the other pools. In the above example, you have 30B left. You'd split it in 16B, 8B, 4B and 2B chunks and add each chunk to their respective pool.
Thus, you'd use linked lists to manage the preallocated space. Which means that your application will use more memory than it actually needs but if this really helps you, why not?
Basically, what I've described is a mix between buddy allocator and slab allocator from the Linux kernel.
Edit: After reading your comments, it will be pretty easy to allocate a big area with malloc(BIG_SPACE) and use this as a pool for your memory.
If you can, look at using GLib, which has a memory slicing API that supports this. It's very easy to use, and saves you from having to re-implement it.
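The free-index scheme described in the question's edit might be sketched like this; here a stack of free indexes stands in for the linked list of ints, with the same effect of allowing slots to be released in any order (all names and sizes are illustrative, and nothing here is thread-safe):

```c
#include <stdlib.h>

typedef struct {
    double payload;   /* stand-in for the real fields */
    int memoryIndex;  /* which pool slot this object occupies */
} Struct1;

#define POOL_SIZE 10000

static Struct1 pool[POOL_SIZE];
static int free_stack[POOL_SIZE];  /* indexes of free slots */
static int free_top;               /* number of entries in free_stack */

static void pool_init(void) {
    for (int i = 0; i < POOL_SIZE; i++)
        free_stack[i] = POOL_SIZE - 1 - i;  /* hand out slot 0 first */
    free_top = POOL_SIZE;
}

static Struct1 *pool_alloc(void) {
    if (free_top == 0) return NULL;         /* pool exhausted */
    int idx = free_stack[--free_top];
    pool[idx].memoryIndex = idx;
    return &pool[idx];
}

static void pool_free(Struct1 *p) {
    free_stack[free_top++] = p->memoryIndex;  /* any release order is fine */
}
```

Both allocation and release are O(1), and the only extra cost is one int per slot for the free stack plus the memoryIndex field.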

Fragmentation-resistant Microcontroller Heap Algorithm

I am looking to implement a heap allocation algorithm in C for a memory-constrained microcontroller. I have narrowed my search down to 2 options I'm aware of; however, I am very open to suggestions, and I am looking for advice or comments from anyone with experience in this.
My Requirements:
-Speed definitely counts, but is a secondary concern.
-Timing determinism is not important - any part of the code requiring deterministic worst-case timing has its own allocation method.
-The MAIN requirement is fragmentation immunity. The device is running a lua script engine, which will require a range of allocation sizes (heavy on the 32 byte blocks). The main requirement is for this device to run for a long time without churning its heap into an unusable state.
Also Note:
-For reference, we are talking about Cortex-M and PIC32 parts, with memory ranging from 128 KB to 16 MB (with a focus on the lower end).
-I don't want to use the compiler's heap because 1) I want consistent performance across all compilers and 2) their implementations are generally very simple and are the same or worse for fragmentation.
-Double-indirection options are out because of the huge Lua code base that I don't want to fundamentally change and revalidate.
My Favored Approaches Thus Far:
1) Have a binary buddy allocator, and sacrifice memory usage efficiency (rounding up to a power of 2 size).
-this would (as I understand) require a binary tree for each order/bin to store free nodes sorted by memory address for fast buddy-block lookup for rechaining.
2) Have two binary trees for free blocks, one sorted by size and one sorted by memory address. (all binary tree links are stored in the block itself)
-allocation would be best-fit using a lookup on the table by size, and then remove that block from the other tree by address
-deallocation would lookup adjacent blocks by address for rechaining
-Both algorithms would also require storing an allocation size before the start of the allocated block, and have blocks go out as a power of 2 minus 4 (or 8 depending on alignment). (Unless they store a binary tree elsewhere to track allocations sorted by memory address, which I don't consider a good option)
-Both algorithms require height-balanced binary tree code.
-Algorithm 2 does not have the requirement of wasting memory by rounding up to a power of two.
-In either case, I will probably have a fixed bank of 32-byte blocks allocated by nested bit fields to off-load blocks this size or smaller, which would be immune to external fragmentation.
My Questions:
-Is there any reason why approach 1 would be more immune to fragmentation than approach 2?
-Are there any alternatives that I am missing that might fit the requirements?
If block sizes are not rounded up to powers of two or some equivalent, certain sequences of allocation and deallocation will generate an essentially unbounded amount of fragmentation, even if the number of non-permanent small objects that exist at any given time is limited. A binary-buddy allocator will, of course, avoid that particular issue. Otherwise, if one is using a limited number of nicely related object sizes but not a "binary buddy" system, one may still have to use some judgment in deciding where to allocate new blocks.
Another approach to consider is having different allocation methods for things that are expected to be permanent, temporary, or semi-persistent. Fragmentation often causes the most trouble when temporary and permanent things get interleaved on the heap. Avoiding such interleaving may minimize fragmentation.
Finally, I know you don't really want to use double-indirect pointers, but allowing object relocation can greatly reduce fragmentation-related issues. Many Microsoft-derived microcomputer BASICs used a garbage-collected string heap; Microsoft's garbage collector was really horrible, but its string-heap approach can be used with a good one.
You can pick up a (never used for real) Buddy system allocator at http://www.mcdowella.demon.co.uk/buddy.html, with my blessing for any purpose you like. But I don't think you have a problem that is easily solved just by plugging in a memory allocator. The long-running high integrity systems I am familiar with have predictable resource usage, described in 30+ page documents for each resource (mostly cpu and I/O bus bandwidth - memory is easy because they tend to allocate the same amount at startup every time and then never again).
In your case none of the usual tricks (static allocation, free lists, allocation on the stack) can be shown to work, because, at least as described to us, you have a Lua interpreter hovering in the background ready to do who knows what at run time. What if it just gets into a loop allocating memory until it runs out?
Could you separate the memory use into two sections - traditional code allocating almost all of what it needs on startup, and never again, and expendable code (e.g. Lua) allowed to allocate whatever it needs when it needs it, from whatever is left over after static allocation? Could you then trigger a restart or some sort of cleanup of the expendable code if it manages to use all of its area of memory, or fragments it, without bothering the traditional code?
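The split suggested above could be sketched as two regions carved from one static buffer, with the expendable (Lua) side resettable wholesale. This is a minimal sketch of the partitioning idea only; all sizes and names are illustrative, the bump allocators here don't free individual objects, and real Lua integration would hook a proper allocator over the expendable region via lua_Alloc:

```c
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE   (128 * 1024)
#define STATIC_PART (96 * 1024)   /* permanent allocations, made at startup */

static uint8_t heap[HEAP_SIZE];
static size_t static_top = 0;           /* bumps within [0, STATIC_PART) */
static size_t lua_top    = STATIC_PART; /* bumps within [STATIC_PART, HEAP_SIZE) */

/* Round n up to pointer alignment. */
static size_t align_up(size_t n) {
    return (n + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
}

/* Startup-only allocations; never freed, so never fragmented. */
static void *static_alloc(size_t n) {
    n = align_up(n);
    if (static_top + n > STATIC_PART) return NULL;
    void *p = &heap[static_top];
    static_top += n;
    return p;
}

/* Expendable allocations for the script engine. */
static void *lua_alloc_raw(size_t n) {
    n = align_up(n);
    if (lua_top + n > HEAP_SIZE) return NULL;
    void *p = &heap[lua_top];
    lua_top += n;
    return p;
}

/* On exhaustion or fragmentation, restart the script side wholesale
 * without disturbing the traditional code's memory. */
static void lua_reset(void) {
    lua_top = STATIC_PART;
}
```

The key property is that no matter what the expendable side does, it can never fragment or exhaust the permanent region.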

Optimization of C program with SLAB-like technologies

I have a programming project with highly intensive use of the malloc/free functions.
It has three types of structures with very high turnover, in large numbers. As a result, malloc and free are heavily used, called thousands of times per second. Can replacing standard memory allocation with a user-space version of SLAB solve this problem? Is there any implementation of such an algorithm?
P.S.
System is Linux-oriented.
Sizes of the structures are less than 100 bytes.
Finally, I'd prefer to use a ready-made implementation, because memory management is a really hard topic.
If you only have three different structure sizes, you would gain greatly by using a pool allocator (either custom-made or something like boost::pool, but for C). Doug Lea's binning-based malloc would serve as a very good base for a pool allocator (it's used in glibc).
However, you also need to take other factors into account, such as multi-threading and memory reuse (will objects be allocated, freed, then reallocated, or just allocated then freed?). From this angle you can look into tcmalloc (which is designed for extreme allocation loads, both in quantity and memory usage), nedmalloc, or Hoard. All of these allocators are open source and thus can be easily altered to suit the sizes of the objects you allocate.
Without knowing more it's impossible to give you a good answer, but yes, managing your own memory (often by allocating a large block and then doing your own allocations within that large block) can avoid the high cost associated with general-purpose memory managers. For example, on Windows many small allocations will bring performance to its knees. Existing implementations exist for almost every type of memory manager, but I'm not sure what kind you're asking for exactly...
When programming on Windows I find calling malloc/free is like death for performance: almost any in-app scheme that amortizes allocations by batching will save you gobs of processor time when allocating/freeing, so it may not matter much which approach you use, as long as you're not calling the default allocator.
That being said, here's some simplistic multithreading-naive ideas:
This isn't strictly a slab manager, but it seems to achieve a good balance and is commonly used.
I personally find I often end up using a fairly simple-to-implement memory-reusing manager for blocks of the same size: it maintains a linked list of unused memory blocks of a fixed size and allocates a new block when it needs to. The trick is to store the pointers for the linked list in the unused memory blocks themselves, so the overhead is a single pointer per free block. The entire process is O(1) whenever it's reusing memory. When it has to allocate memory it calls a slab allocator (which itself is trivial).
For a pure allocate-only slab allocator you just ask the system (nicely) to give you a large chunk of memory and keep track of what space you haven't used yet (just maintain a pointer to the start of the unused area and a pointer to the end). When you don't have enough space to allocate the requested size, allocate a new slab. (For large chunks, just pass through to the system allocator.)
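The two pieces just described, a free list threaded through the unused blocks and an allocate-only slab behind it, might be chained like this (sizes are illustrative, the fall-through to the system allocator for large requests is omitted, and nothing here is thread-safe):

```c
#include <stdlib.h>

#define BLOCK_SIZE 64          /* fixed object size; must hold a pointer */
#define SLAB_SIZE  (64 * 1024) /* one big chunk from the system at a time */

static char *slab_cur;   /* start of the unused area in the current slab */
static char *slab_end;   /* end of the current slab */
static void *free_list;  /* chain of returned blocks; link stored in block */

static void *block_alloc(void) {
    if (free_list) {                       /* O(1) reuse path */
        void *p = free_list;
        free_list = *(void **)p;           /* next pointer lives in the block */
        return p;
    }
    if (slab_cur == NULL || slab_cur + BLOCK_SIZE > slab_end) {
        slab_cur = malloc(SLAB_SIZE);      /* slab exhausted: get a new one */
        if (!slab_cur) return NULL;
        slab_end = slab_cur + SLAB_SIZE;
    }
    void *p = slab_cur;                    /* bump-pointer path */
    slab_cur += BLOCK_SIZE;
    return p;
}

static void block_free(void *p) {
    *(void **)p = free_list;               /* push onto the free list */
    free_list = p;
}
```

Note the memory is never returned to the system, exactly the trade-off discussed below.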
The problem with chaining these approaches? Your application will never free any memory, but performance-critical applications often are either one-shot processing applications or create many objects of the same sizes and then stop using them.
If you're careful, the above approach isn't too hard to make multithread friendly, even with just atomic operations.
I recently implemented my own userspace slab allocator, and it proved to be much more efficient (speedwise and memory-wise) than malloc/free for a large amount of fixed-size allocations. You can find it here.
Allocation and freeing work in O(1) time, and are sped up by bitvectors being used to represent empty/full slots. When allocating, the __builtin_ctzll GCC intrinsic is used to locate the first set bit in the bitvector (representing an empty slot), which should translate to a single instruction on modern hardware. When freeing, some clever bitwise arithmetic is performed on the pointer itself, in order to locate the header of the corresponding slab and to mark the corresponding slot as free.
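The bitvector trick can be sketched for a single slab of 64 fixed-size slots (GCC/Clang builtins assumed; a real implementation would manage many slabs and recover the slab header by masking the pointer with the slab alignment, which this single-slab sketch skips):

```c
#include <stddef.h>
#include <stdint.h>

#define SLOT_SIZE 64
#define SLOTS     64   /* one 64-bit word tracks the whole slab */

static _Alignas(4096) char slab[SLOTS * SLOT_SIZE];
static uint64_t free_bits = ~(uint64_t)0;   /* bit i set => slot i free */

static void *slot_alloc(void) {
    if (free_bits == 0) return NULL;        /* slab full */
    int i = __builtin_ctzll(free_bits);     /* first free slot, ~1 instruction */
    free_bits &= ~((uint64_t)1 << i);       /* mark it used */
    return &slab[(size_t)i * SLOT_SIZE];
}

static void slot_free(void *p) {
    /* Pointer arithmetic recovers the slot index from the address. */
    size_t i = (size_t)((char *)p - slab) / SLOT_SIZE;
    free_bits |= (uint64_t)1 << i;
}
```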

Combining two buffers into one

I need to have two buffers (A and B), and when either of them is full it needs to write its contents to the "merged" buffer, C. Using memcpy seems to be too slow for this operation, as noted below in my question. Any insight?
I haven't tried, but I've been told that memcpy will not work. This is an embedded system. 2 buffers, both of different sizes, and when they are full they dump to a common buffer C, which is bigger than the other two. Not sure why I got downvoted.
Edit: Buffers A and B will be written to before C is completely empty.
The memcpy is taking too long and the common buffer C is getting overrun.
memcpy is pretty much the fastest way to copy memory. It's frequently a compiler intrinsic and is highly optimized. If it's too slow you're probably going to have to find another way to speed your program up.
I'd expect that copying memory faster is not the lowest hanging fruit in a program.
Some other opportunities could be to copy less memory or copy less often. See if you can profile your program to analyze its performance and find where the biggest opportunities are.
Edit: With your edit it sounds like the problem is that there's not enough time for you to deal with some data all at once between the time you notice that it needs to be handled and the time that more data comes in. A solution in this case could be, as one of the commenters noted, to have additional buffers that you can flip between. So you may then have time to handle the data in one while another is filled up.
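The flip-between-buffers idea might be sketched as a pair of buffers where the producer fills one while the consumer drains the other. This is a single-threaded skeleton with illustrative names and sizes; real code would add whatever synchronization (interrupt masking, DMA-complete flags, etc.) the platform requires:

```c
#include <stddef.h>

#define BUF_SIZE 256

static char   bufs[2][BUF_SIZE];
static size_t fill_len;     /* bytes in the buffer being filled */
static int    fill_idx;     /* which buffer the producer writes into */

/* Producer side: append a byte. Returns the index of a buffer that
 * just became full and is ready to be consumed, or -1 otherwise. */
static int produce(char byte) {
    bufs[fill_idx][fill_len++] = byte;
    if (fill_len == BUF_SIZE) {
        int ready = fill_idx;
        fill_idx = 1 - fill_idx;   /* flip: keep filling the other buffer */
        fill_len = 0;
        return ready;              /* consumer drains bufs[ready] meanwhile */
    }
    return -1;
}
```

No bytes are ever copied between the two buffers; the consumer works on a full buffer in place while the producer fills the other, which is what buys back the memcpy time.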
The only way you can merge two buffers without memcpy is by linking them, like a linked list of buffer fragments (or an array of fragments).
Consider that a buffer may not always have to be contiguous. I've done a lot of work with 600 dpi images, which means very large buffers. If you can break them up into a sequence of smaller fragments, that helps reduce fragmentation as well as unnecessary copying due to buffer growth.
In some cases buffers must be contiguous, if your API or microcontroller mandates it. For example, Windows bitmap functions require contiguous memory. You could try the C realloc function, but it may internally work like a combination of malloc + memcpy + free. Either way, as others have said, memcpy is about the fastest possible way of copying contiguous buffers.
If the buffer must be contiguous, you could reserve a large address space and commit it on demand. The implementation depends on the platform. For example, on Win32 the VirtualAlloc function can do that. This gives you a very large contiguous buffer, of which only a portion is allocated (committed). Later you can commit further pages as the buffer needs to grow. This trick requires the concept of virtual memory, which may not be available on a microcontroller.
