Malloc vs custom allocator: Malloc has a lot of overhead. Why?

I have an image compression application that now has two different versions of its memory allocation system. In the original one, malloc is used everywhere; in the second one, I implemented a simple pool allocator that just allocates a chunk of memory and returns parts of that memory to myalloc() calls.
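Roughly, the pool allocator works like this (a simplified sketch for illustration; only the name myalloc() comes from our code):

#include <stdlib.h>

static char  *pool;        /* one big chunk, allocated once up front */
static size_t pool_size;
static size_t pool_used;

int pool_init(size_t bytes)
{
    pool = malloc(bytes);
    pool_size = bytes;
    pool_used = 0;
    return pool != NULL;
}

void *myalloc(size_t n)
{
    n = (n + 7) & ~(size_t)7;          /* keep 8-byte alignment */
    if (pool_used + n > pool_size)
        return NULL;                   /* pool exhausted */
    void *p = pool + pool_used;
    pool_used += n;
    return p;
}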
We've been noticing a huge memory overhead when malloc is used: at the height of its memory usage, the malloc() version requires about 170 megabytes of memory for a 1920x1080x16bpp image, while the pool allocator allocates just 48 megabytes, of which 47 are used by the program.
In terms of memory allocation patterns, the program allocates a lot of 8-byte blocks (most), 32-byte blocks (many) and 1080-byte blocks (some) with the test image. Apart from these, there are no dynamic memory allocations in the code.
The OS of the testing system is Windows 7 (64 Bit).
How did we test memory usage?
With the custom allocator, we could see how much memory is used because all malloc calls are deferred to the allocator. With malloc(), in Debug mode we just stepped through the code and watched the memory usage in Task Manager. In Release mode we did the same, but less fine-grained, because the compiler optimizes a lot of stuff away, so we couldn't step through the code piece by piece (the memory difference between Release and Debug was about 20 MB, which I would attribute to optimization and the lack of debug information in Release mode).
Could malloc alone be the cause of such a huge overhead? If so, what exactly causes this overhead inside malloc?

On Windows 7 you will always get the low-fragmentation heap allocator, without explicitly calling HeapSetInformation() to ask for it. That allocator sacrifices virtual memory space to reduce fragmentation. Your program is not actually using 170 megabytes, you are just seeing a bunch of free blocks lying around, waiting for an allocation of a similar size.
This algorithm is very easy to beat with a custom allocator that doesn't do anything to reduce fragmentation. That may well work out for you, although you won't see the side effects of it until you keep the program running longer than a single debug session. You do need to make sure it is stable for days or weeks if that is the expected usage pattern.
Best thing to do is just not fret about it, 170 MB is rather small potatoes. And do keep in mind that this is virtual memory, it doesn't cost anything.
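For reference, this is how the LFH was requested explicitly on older Windows versions (a minimal sketch; on Windows 7 it is already the default):

#include <windows.h>

void enable_lfh(void)
{
    ULONG info = 2;   /* 2 selects the Low Fragmentation Heap */
    HeapSetInformation(GetProcessHeap(), HeapCompatibilityInformation,
                       &info, sizeof(info));
}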

First of all, malloc aligns the pointers to 16-byte boundaries. Furthermore, it stores at least one pointer (or the allocated length) in the addresses preceding the returned value. Then it probably adds a magic value or release counter to indicate that the linked list is not broken or that the memory block has not been released twice (free asserts on double frees).
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    int *foo = malloc(4);
    int *bar = malloc(4);
    /* print the distance between the two returned addresses */
    printf("%td\n", (char *)bar - (char *)foo);
    return 0;
}

Output: 32

Caution: When you run your program in Visual Studio or with any debugger attached, the malloc behaviour is changed a lot by default: the Low Fragmentation Heap is not used, and the memory overhead may not be representative of real usage (see also https://stackoverflow.com/a/3768820/16673). You need to set the environment variable _NO_DEBUG_HEAP=1 to avoid being hit by this, or measure the memory usage when not running under a debugger.

Related

Is there a performance cost to using large mmap calls that go beyond expected memory usage?

Edit: On systems that use on-demand paging
For initializing data structures that are both persistent for the duration of the program and require a dynamic amount of memory, is there any reason not to mmap an upper bound from the start?
An example is an array that will persist for the entire program's life but whose final size is unknown. The approach I am most familiar with is something along the lines of:
type *array = malloc(size);
and, when the array has reached capacity, doubling it with:
array = realloc(array, 2 * size);
size *= 2;
I understand this is probably the best way to do this if the array might be freed mid-execution so that its VM can be reused, but if it is persistent, is there any reason not to just initialize the array as follows:
array = mmap(0, huge_size,
             PROT_READ | PROT_WRITE,
             MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE,
             -1, 0);
so that the elements never need to be copied.
Edit: Specifically for an OS that uses on-demand paging.
Don't try to be smarter than the standard library, unless you 100% know what you are doing.
malloc() already does this for you. If you request a large amount of memory, malloc() will mmap() you a dedicated memory area. If what you are concerned about is the performance hit coming from doing size *= 2; realloc(old, size), then just malloc(huge_size) at the beginning, and then keep track of the actual used size in your program. There really is no point in doing an mmap() unless you explicitly need it for some specific reason: it isn't faster or better in any particular way, and if malloc() thinks it's needed, it will do it for you.
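A minimal sketch of that suggestion (the names and the bound are illustrative, not a prescribed API):

#include <stdlib.h>

#define HUGE_SIZE ((size_t)1 << 30)   /* assumed 1 GiB upper bound, in bytes */

typedef struct {
    char  *data;   /* one big allocation made up front */
    size_t used;   /* how many bytes are actually in use */
} big_buffer;

int big_buffer_init(big_buffer *b)
{
    b->data = malloc(HUGE_SIZE);   /* the untouched tail costs only virtual memory */
    b->used = 0;
    return b->data != NULL;
}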
It's fine to allocate upper bounds as long as:
You're building a 64-bit program: 32-bit ones have restricted virtual space, even on 64-bit CPUs
Your upper bounds don't approach 2^47, as a mathematically derived one might
You're fine with crashing as your out-of-memory failure mode
You'll only run on systems where overcommit is enabled
As a side note, an end user application doing this may want to borrow a page from GHC's book and allocate 1TB up front even if 10GB would do. This unrealistically large amount will ensure that users don't confuse virtual memory usage with physical memory usage.
If you know for a fact that wasting a chunk of memory (most likely an entire page, which is likely 4096 bytes) will not cause your program or the other programs running on your system to run out of memory, AND you know for a fact that your program will only ever be compiled and run on UNIX machines, then this approach is not incorrect, but it is not good programming practice for the following reasons:
The <stdlib.h> file you #include to use malloc() and free() in your C programs is specified by the C standard, but it is specifically implemented for your architecture by the writers of the operating system. This means that your specific system was kept in mind when these functions were written, so finding a sneaky way to improve efficiency for memory allocation is unlikely unless you know the inner workings of memory management in your OS better than those who wrote it.
Furthermore, the <sys/mman.h> file you include to mmap() stuff is not part of the C standard, and will only compile on UNIX machines, which reduces the portability of your code.
There's also a really good chance (assuming a UNIX environment) that malloc() and realloc() already use mmap() behind-the-scenes to allocate memory for your process anyway, so it's almost certainly better to just use them. (read that as "realloc doesn't necessarily actively allocate more space for me, because there's a good chance there's already a chunk of memory that my process has control of that can satisfy my new memory request without calling mmap() again")
Hope that helps!

C allocation and memory overhead

Apologies if this is a stupid question, but it's been kinda bothering me for a long time.
I'd like to know some details on how the memory manager knows what memory is in use.
Imagine a one-chip microcomputer with 1024B of RAM - not much to spare.
Now you allocate 100 ints - each int is 4 bytes, each pointer 4 bytes too (yea, an 8-bit one-chip will most likely have smaller pointers, but w/e).
So you've just used 800B of RAM for 100 ints? But it's worse - the allocation system must somehow take note of where the memory is malloc'd and where it's free - 200 extra bytes or something? Or some bit marks?
If this is true, why is C favoured over assembler so often?
Is this really how it works? So super inefficient?
(Or am I having a totally incorrect idea about it?)
It may surprise younger developers to learn that greying old ones like myself used to write in C on systems with 1 or 2k of RAM.
In systems this size, dynamic memory allocation would have been a luxury that we could not afford. It's not just the pointer overhead of managing the free store, but also the effects of free store fragmentation making memory allocation inefficient and very probably leading to a fatal out-of-memory condition (virtual memory was not an option).
So we used to use static memory allocation (i.e. global variables), kept a very tight control on the depth of function call nesting, and an even tighter control over nested interrupt handling.
When writing on these systems, I didn't even link the standard library. I wrote my own C startup routines and provided custom minimal I/O routines.
One program I wrote in a 2k ram system used the lower part of RAM as the data area and the upper part as the stack. In the final cut, I proved that the maximal use of stack reached so far down in memory that it was 1 byte away from the last variable in the data area.
Ah, the good old days...
EDIT:
To answer your question specifically, the original K&R free store manager used to add a header block to the beginning of each block of memory allocated via malloc.
The header block looked something like this:
union header {
    struct {
        union header *ptr;
        unsigned size;
    } s;
};
Where ptr is the address of the next header block and size is the size of the memory allocated (in blocks). The malloc function would actually return the address computed by &header + sizeof(header). The free function would subtract the size of the header from the pointer you provided in order to re-link the block back into the free list.
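A sketch of that pointer arithmetic (illustrative, not the verbatim K&R source; the names here are made up):

union kr_header {
    struct {
        union kr_header *ptr;   /* next block on the free list */
        unsigned size;          /* block size, in header-sized units */
    } s;
};

void *block_to_user(union kr_header *h)
{
    return (void *)(h + 1);            /* user memory starts just past the header */
}

union kr_header *user_to_block(void *p)
{
    return (union kr_header *)p - 1;   /* free() steps back over the header */
}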
There are several approaches to how you can do that:
as you write, malloc() one memory block for every int you have. Completely inefficient, thus I strike it out.
malloc() an array of 100 ints. That needs in total 100*sizeof(int) + 1*sizeof(int*) + whatever malloc() internally needs. Much better.
statically allocate the array. Here you just need 100*sizeof(int).
allocate the array on the stack. That needs the same, but only for the current function call.
Which of these you need depends on how long you need the memory and other criteria.
If you have that little RAM, it might even be questionable whether it is useful to use malloc() at all. It can be an option if several code blocks need a lot of RAM, but not at the same time.
How the memory addresses are tracked also depends on the approach:
for malloc(), you have to put the pointer in a place where you don't lose it.
for an array on the stack, it is expressed relative to the current frame pointer. The code sets up the frame and thus knows the offset, so it is normally not necessary to store it anywhere.
for an array in the data segment, the compiler and linker know about the address and statically put the address where it is needed.
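For illustration, the three cases from the list above in one hypothetical snippet:

#include <stdlib.h>

static int data_arr[100];   /* data segment: address fixed by the linker */

void demo(void)
{
    int stack_arr[100];                          /* stack: a frame-pointer offset */
    int *heap_arr = malloc(100 * sizeof(int));   /* heap: you must keep the pointer */

    if (heap_arr != NULL) {
        /* ... use the arrays ... */
        free(heap_arr);
    }
    (void)data_arr;
    (void)stack_arr;
}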
If this is true, why is C favoured over assembler so often?
You're simplifying the problem too much. C or assembler - doesn't matter, you still need to manage the memory chunks. The main issue is fragmentation, not the actual management overhead. In a system like the one you described, you would probably just allocate the memory and never ever release it, thus no need to check what's free - whatever is below the watermark is free, and that's it.
Is this really how it works? So super inefficient?
There are many algorithms around this problem, but if you're simplifying - yes, it basically is. In reality it's a much more complicated problem - and there are much more complicated systems revolving around servicing memory, dealing with fragmentation, garbage collection (on an OS level), etc.

Unit testing involving C free()

I'm working in Unix/Linux with C. I have a basic understanding of how memory allocation works, enough to know that if I malloc() then free(), I'm not likely going to actually free an entire page; thus if I use getrusage() before and after a free() I'm not likely going to see any difference.
I'd like to write a unit test for a function which destroys a data structure, to see that the memory regions involved have actually been freed. I'm open to an OS-dependent solution, in which case my primary platform is
Linux beast 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
with OS X and FreeBSD as secondaries. I'm also open to a drop-in replacement for malloc() if there's a solution that makes checking free() relatively easy.
To be clear, I'm testing a routine that is to delete a large data structure, and I want to make sure that all the allocated regions are actually freed, in essence a unit test that the particular unit does not have a basic memory leak. I'm going to assume that free() does its job, I'm just making sure that my code actually calls free on all allocated regions it's responsible for.
In this particular case it's a tree structure and for each piece of data in the tree the structure is responsible for calling the routine that deletes the data stored in the tree as well, which might be some other arbitrary thing...
I hate to do this again, but after a night of sleep I found an explicit answer. It comes from SVID (does anyone even remember System V?) but is incorporated in glibc on Linux, likely through its use in dlmalloc; hence it can be used on other systems by using dlmalloc as a drop-in replacement malloc.
Use the routine
struct mallinfo mallinfo(void);
and struct mallinfo is
struct mallinfo
{
    int arena;    /* non-mmapped space allocated from system */
    int ordblks;  /* number of free chunks */
    int smblks;   /* number of fastbin blocks */
    int hblks;    /* number of mmapped regions */
    int hblkhd;   /* space in mmapped regions */
    int usmblks;  /* maximum total allocated space */
    int fsmblks;  /* space available in freed fastbin blocks */
    int uordblks; /* total allocated space */
    int fordblks; /* total free space */
    int keepcost; /* top-most, releasable (via malloc_trim) space */
};
In particular, uordblks will give you the number of bytes allocated by malloc(), independent of the size of the pages requested from the OS using sbrk() or mmap(); those page-level totals are given by arena and hblkhd.
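A sketch of how a unit test could use it (build_tree() and destroy_tree() here are trivial stand-ins for the real unit under test):

#include <malloc.h>   /* mallinfo() is glibc-specific */
#include <stdlib.h>
#include <assert.h>

struct tree { int *payload; };   /* stand-in structure */

static struct tree *build_tree(void)
{
    struct tree *t = malloc(sizeof *t);
    t->payload = malloc(100);
    return t;
}

static void destroy_tree(struct tree *t)
{
    free(t->payload);
    free(t);
}

int main(void)
{
    struct mallinfo before = mallinfo();
    struct tree *t = build_tree();
    destroy_tree(t);
    struct mallinfo after = mallinfo();
    /* if destroy_tree() freed everything, allocated bytes should match;
       allocator caching (fastbins) can make this inexact in practice */
    assert(after.uordblks == before.uordblks);
    return 0;
}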
My best suggestion is to use valgrind.
It wraps malloc and free and shows you memory leaks better than you will with a homegrown system. And it catches all kinds of other errors not easily spotted and tested with unit tests, like uninitialized variables.
So always run your tests with valgrind and make your success criteria that all tests pass and that valgrind reports no errors and no leaked memory.
PS: It might be a bit of a pain getting started if you already have a bit of a codebase, but if you take the time to fix all the valgrind errors, both your tests and your system will be much more reliable!
One trick you can use for all kinds of memory-leak debugging is to simply increase the scale until the problem stands out more. Do one malloc/free cycle, then capture the results of getrusage(), then do 1000 more malloc/free cycles and ensure your process didn't actually allocate more memory. By leaking 1000 of them you increase the scale of the problem to the point where it stands out.
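A sketch of that amplification idea (cycle() is a trivial stand-in for the allocate/free path under test):

#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>

/* one allocate/free round trip; comment out the free() to simulate a leak */
static void cycle(void)
{
    void *p = malloc(1024);
    free(p);
}

int main(void)
{
    struct rusage before, after;

    cycle();                           /* warm the allocator up once */
    getrusage(RUSAGE_SELF, &before);
    for (int i = 0; i < 1000; i++)
        cycle();                       /* a leak now shows up 1000-fold */
    getrusage(RUSAGE_SELF, &after);

    /* ru_maxrss is reported in kilobytes on Linux */
    printf("RSS growth: %ld KB\n", after.ru_maxrss - before.ru_maxrss);
    return 0;
}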
If you want something more deterministic (but slightly more intrusive) you can override malloc() and free() to track allocations and observe the tracking data from the test harness. You can find several examples of debug malloc libraries to show you how it's done.
The easiest way is to make wrappers for malloc and free. If you use the real malloc and free then the following could happen:
you allocate memory
you do something with that memory
you free the memory
someone else in the OS, or even something else in your own code, needs memory, and the OS allocates that now-free memory to it
you test to see if the memory is free, and it's not, but you think it should be, so you get a false test failure.
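A minimal sketch of such wrappers (one possible approach; the names are illustrative):

#include <stdlib.h>

static long live_allocations = 0;   /* blocks handed out and not yet freed */

void *my_malloc(size_t n)
{
    void *p = malloc(n);
    if (p != NULL)
        live_allocations++;
    return p;
}

void my_free(void *p)
{
    if (p != NULL)
        live_allocations--;
    free(p);
}

A test can then assert that live_allocations returns to zero after the teardown routine runs, without caring where the OS actually placed the memory.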

How to find how much memory is actually used up by a malloc call?

If I call:
char *myChar = (char *)malloc(sizeof(char));
I am likely to be using more than 1 byte of memory, because malloc is likely to be using some memory of its own to keep track of free blocks in the heap, and it may effectively cost me some memory by always aligning allocations along certain boundaries.
My question is: Is there a way to find out how much memory is really used up by a particular malloc call, including the effective cost of alignment, and the overhead used by malloc/free?
Just to be clear, I am not asking to find out how much memory a pointer points to after a call to malloc. Rather, I am debugging a program that uses a great deal of memory, and I want to be aware of which parts of the code are allocating how much memory. I'd like to be able to have internal memory accounting that very closely matches the numbers reported by top. Ideally, I'd like to be able to do this programmatically on a per-malloc-call basis, as opposed to getting a summary at a checkpoint.
There isn't a portable solution to this, however there may be operating-system specific solutions for the environments you're interested in.
For example, with glibc on Linux, you can use the mallinfo() function from <malloc.h>, which returns a struct mallinfo. The uordblks and hblkhd members of this structure contain the dynamically allocated address space used by the program, including book-keeping overhead - if you take the difference of these before and after each malloc() call, you will know the amount of space used by that call. (The overhead is not necessarily constant for every call to malloc().)
Using your example:
#include <malloc.h>   /* mallinfo() is glibc-specific */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t s = sizeof(char);
    struct mallinfo before = mallinfo();
    char *myChar = malloc(s);
    struct mallinfo after = mallinfo();
    int mused = (after.uordblks - before.uordblks) + (after.hblkhd - before.hblkhd);

    printf("Requested size %zu, used space %d, overhead %d\n", s, mused, (int)(mused - s));
    free(myChar);
    return 0;
}
Really though, the overhead is likely to be pretty minor unless you are making a very very high number of very small allocations, which is a bad idea anyway.
It really depends on the implementation, so you should use a memory debugger. On Linux, Valgrind's Massif tool can be useful, and there are memory debugging libraries like dmalloc, ...
That said, typical overhead:
1 int for storing size + flags of this block.
possibly 1 int for storing the size of the previous/next block, to assist in coalescing blocks.
2 pointers, but these may only be used in free()'d blocks, being reused for application storage in allocated blocks.
Alignment to an appropriate type, e.g. double.
-1 int (yes, that's a minus) off the next/previous chunk's field containing our size if we are an allocated block, since we cannot be coalesced until we're freed.
So the minimum block size can be 16 to 24 bytes, and the minimum overhead can be 4 bytes.
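As a concrete illustration, a dlmalloc-style chunk header along those lines (layouts vary between allocators; this is a sketch, not any library's exact definition):

#include <stddef.h>

struct chunk {
    size_t prev_size;   /* size of the previous chunk, valid only while it is free */
    size_t size;        /* this chunk's size, with flag bits packed into the low bits */
    struct chunk *fd;   /* free-list links: present only in freed chunks, */
    struct chunk *bk;   /* reused as application data while the chunk is allocated */
};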
But you could also satisfy every allocation by mapping whole memory pages (typically 4 KB), which would mean the overhead for smaller allocations would be huge. I think OpenBSD does this.
There is nothing defined in the C library to query the total amount of physical memory used by a malloc() call. The amount of memory allocated is controlled by whatever memory manager is hooked up behind the scenes that malloc() calls into. That memory manager can allocate as much extra memory as it deems necessary for its internal tracking purposes, on top of whatever extra memory the OS itself requires. When you call free(), it accesses the memory manager, which knows how to access that extra memory so it all gets released properly, but there is no way for you to know how much memory that involves. If you need that much fine detail, then you need to write your own memory manager.
If you do use valgrind/Massif, there's an option to show either the malloc value or the top value, which differ a LOT in my experience. Here's an excerpt from the Valgrind manual (http://valgrind.org/docs/manual/ms-manual.html):
...However, if you wish to measure all the memory used by your program, you can use the --pages-as-heap=yes option. When this option is enabled, Massif's normal heap block profiling is replaced by lower-level page profiling. Every page allocated via mmap and similar system calls is treated as a distinct block. This means that code, data and BSS segments are all measured, as they are just memory pages. Even the stack is measured...

Why should I use malloc() when "char bigchar[ 1u << 31 - 1 ];" works just fine?

What's the advantage of using malloc (besides the NULL return on failure) over static arrays? The following program will eat up all my ram and start filling swap only if the loops are uncommented. It does not crash.
...
#include <stdio.h>

unsigned int  bigint[ 1u << 29 - 1 ];
unsigned char bigchar[ 1u << 31 - 1 ];

int main (int argc, char **argv) {
    int i;
    /* for (i = 0; i < 1u << 29 - 1; i++) bigint[i] = i; */
    /* for (i = 0; i < 1u << 31 - 1; i++) bigchar[i] = i & 0xFF; */
    getchar();
    return 0;
}
...
After some trial and error I found the above is the largest static array allowed on my 32-bit Intel machine with GCC 4.3. Is this a standard limit, a compiler limit, or a machine limit? Apparently I can have as many of them as I want. It will segfault, but only if I ask for (and try to use) more than malloc would give me anyway.
Is there a way to determine if a static array was actually allocated and safe to use?
EDIT: I'm interested in why malloc is used to manage the heap instead of letting the virtual memory system handle it. Apparently I can size an array to many times the size I think I'll need and the virtual memory system will only keep in ram what is needed. If I never write to e.g. the end (or beginning) of these huge arrays then the program doesn't use the physical memory. Furthermore, if I can write to every location then what does malloc do besides increment a pointer in the heap or search around previous allocations in the same process?
Editor's note: 1 << 31 causes undefined behaviour if int is 32-bit, so I have modified the question to read 1u. The intent of the question is to ask about allocating large static buffers.
Well, for two reasons really:
Because of portability, since some systems won't do the virtual memory management for you.
You'll inevitably need to divide this array into smaller chunks for it to be useful, then keep track of all the chunks, and eventually, as you start "freeing" chunks of the array you no longer require, you'll hit the problem of memory fragmentation.
All in all you'll end up implementing a lot of memory management functionality (actually pretty much reimplementing malloc) without the benefit of portability.
Hence the reasons:
Code portability via memory management encapsulation and standardisation.
Personal productivity enhancement by the way of code re-use.
Please see:
malloc() and the C/C++ heap
Should a list of objects be stored on the heap or stack?
C++ Which is faster: Stack allocation or Heap allocation
Proper stack and heap usage in C++?
About C/C++ stack allocation
Stack, Static and Heap in C++
Of Memory Management, Heap Corruption, and C++
new on stack instead of heap (like alloca vs malloc)
With malloc you can grow and shrink your array: it becomes dynamic, so you can allocate exactly what you need.
This is called custom memory management, I guess.
You can do that, but you'll have to manage that chunk of memory yourself.
You'd end up writing your own malloc() working over this chunk.
Regarding:
After some trial and error I found the above is the largest static array allowed on my 32-bit Intel machine with GCC 4.3. Is this a standard limit, a compiler limit, or a machine limit?
One upper bound will depend on how the 4GB (32-bit) virtual address space is partitioned between user-space and kernel-space. For Linux, I believe the most common partitioning scheme has a 3 GB range of addresses for user-space and a 1 GB range of addresses for kernel-space. The partitioning is configurable at kernel build-time, 2GB/2GB and 1GB/3GB splits are also in use. When the executable is loaded, virtual address space must be allocated for every object regardless of whether real memory is allocated to back it up.
You may be able to allocate that gigantic array in one context, but not others. For example, if your array is a member of a struct and you wish to pass the struct around. Some environments have a 32K limit on struct size.
As previously mentioned, you can also resize your memory to use exactly what you need. It's important in performance-critical contexts to not be paging out to virtual memory if it can be avoided.
There is no way to free a stack allocation other than going out of scope. So when you actually use global allocation and the VM has to give you real physical memory, it is allocated and will stay there until your program exits. This means that any process will only grow in its virtual memory use (functions have local stack allocations, and those will be "freed").
You cannot keep stack memory once it goes out of the function's scope; it is always freed. So you must know how much memory you will use at compile time.
Which then boils down to how many int foo[1 << 29] arrays you can have. The first one takes up the whole address space (on 32-bit) and will start at, let's pretend, 0x000000; the second would resolve to 0xffffffff or thereabouts; and the third would need an address that 32-bit pointers cannot express. (Remember that stack reservations are resolved partly at compile time and partly at run time, via offsets: how far the stack offset is pushed when you allocate this or that variable.)
So the answer is pretty much that once you have int foo[1 << 29], you can't have any reasonable depth of function calls with other local stack variables anymore.
You really should avoid doing this unless you know what you're doing. Try to request only as much memory as you need. Even if it's not being used or getting in the way of other programs, it can mess up the process itself. There are two reasons for this. First, on certain systems, particularly 32-bit ones, it can cause the address space to be exhausted prematurely in rare circumstances. Additionally, many kernels have some kind of per-process limit on reserved/virtual/not-in-use memory. If your program asks for memory at points during run time, the kernel can kill the process if it asks for memory to be reserved beyond this limit. I've seen programs that have either crashed or exited due to a failed malloc because they were reserving GBs of memory while only using a few MB.
