I haven't coded in a while, so excuse me upfront. I have an odd problem. I am trying to malloc 8GB in one go, and I plan to manage that heap with TLSF later on. That is, I want to avoid mallocing throughout my application altogether: just get one big glob at the beginning and free it at the end.

Here is the peculiarity, though. Until now I always used dlmalloc in my programs: I linked it in and everything went well. However, now when I try to malloc 8GB at once with dlmalloc linked in, I get segmentation fault 11 on OSX when I run the program; without dlmalloc everything works. It doesn't matter whether I use gcc or clang. The system has only 4GB of RAM, not 8GB, but interestingly enough the same thing happens on a Windows machine with 32GB of RAM and an Ubuntu one with 16GB. With the system malloc it all works: the allocation goes through, and a simple iteration over the allocated memory behaves as expected on all three systems. But when I link in dlmalloc, it fails. I tried it with both the malloc and dlmalloc function calls.
The allocation itself is nothing extraordinary, plain C99.
[...]
size_t bytes = 1024LL*1024LL*1024LL*8LL;
unsigned long *m = (unsigned long*)malloc(bytes);
[...]
I'm confused by several things here. How come the system malloc gives me an 8GB allocation even though the system doesn't have 8GB of RAM? Are those virtual pages? Why doesn't dlmalloc do the same? I am aware there might not be a contiguous 8GB block of RAM to allocate, but why a segmentation fault then, rather than a null pointer?
Is there a viable, robust (and hopefully platform-neutral) way to get that amount of memory in one go from malloc, even if I'm not sure the system will have that much RAM?
edit: the program is 64-bit, as are the OSes I'm running it on.
edit2:
So I played with it some more. It turns out that if I break the allocation down into 1GB chunks, that is, 8 separate mallocs, it works with dlmalloc. So it seems to be an issue with contiguous range allocation, where dlmalloc probably allocates only if a contiguous block exists. That makes my question even harder to formulate: is there a reasonably sure way to get a memory chunk of that size, with or without dlmalloc, across platforms, and not have it fail when there is no physical memory left (swap is fine, as long as the allocation doesn't fail)? Also, would it be possible to tell, in a cross-platform manner, whether an allocation is in RAM or in swap?
I will give you just a bit of perspective, if not an outright answer. When I see you attempting to allocate 8GB of contiguous RAM, I cringe. Yes, with 64-bit computing and all, that is probably "legal", but on a normal machine you are likely to run into a lot of edge cases: 32-bit legacy code choking on a 64-bit size, and plain practical trouble getting a chunk of memory big enough to make this work. If you want to try this sort of thing, perhaps attempt to malloc the single chunk, and if that fails, fall back to smaller chunks (see the sketch below). This somewhat defeats the purpose of a one-chunk system, though. Perhaps there is some "page size" in the OS that you could round your malloc size to, in order to help performance and your plain ability to get memory in the amount you wish.
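Here is a minimal sketch of that fallback, assuming made-up names (try_alloc, MIN_CHUNK) and a caller that keeps collecting chunks until it has gathered the total it needs:

#include <stdlib.h>

/* Give up below 64MB chunks; purely an illustrative threshold. */
#define MIN_CHUNK ((size_t)64 * 1024 * 1024)

/* Try the full size first, halving on failure; returns NULL if even
   MIN_CHUNK cannot be had. */
static void *try_alloc(size_t want, size_t *got)
{
    for (size_t sz = want; sz >= MIN_CHUNK; sz /= 2) {
        void *p = malloc(sz);
        if (p) {
            *got = sz;
            return p;
        }
    }
    *got = 0;
    return NULL;
}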
On game consoles, this approach to memory management is somewhat common - allocate 1 buffer from the OS at bootup as big as possible, then place your own memory manager on there to avoid OS overhead and possible inferior allocation code. It also allows one to better control memory fragmentation on such systems where virtual memory doesn't exist. But on these systems, you also know up front exactly how much RAM you have.
Is there a way to see whether memory is physical or virtual in a platform-independent way? I don't think so, but perhaps someone else can give a good answer to that, and if so I'll edit this part away.
So not a 100% answer, but some random thoughts to help out, plus me internally wondering what you are doing that wants 8GB of RAM in one chunk, when it sounds like multiple chunks would work fine. :)
For data structures that are persistent for the duration of the program and require a dynamic amount of memory, is there any reason not to mmap an upper bound from the start?
An example is an array that will persist for the entire program's life but whose final size is unknown. The approach I am most familiar with is something along the lines of:
type *array = malloc(size);
and, when the array has reached capacity, doubling it with:
array = realloc(array, 2 * size);
size *= 2;
I understand this is probably the best way to do it if the array might be freed mid-execution so that its VM can be reused, but if it is persistent, is there any reason not to just initialize the array as follows:
array = mmap(NULL, huge_size,
             PROT_READ | PROT_WRITE,
             MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE,
             -1, 0);
so that the elements never need to be copied.
Edit: Specifically for an OS that uses on-demand paging.
Don't try to be smarter than the standard library, unless you 100% know what you are doing.
malloc() already does this for you. If you request a large amount of memory, malloc() will mmap() you a dedicated memory area. If what you are concerned about is the performance hit from doing size *= 2; realloc(old, size), then just malloc(huge_size) at the beginning and keep track of the actual used size in your program yourself (a sketch of that bookkeeping follows below). There really is no point in doing an mmap() unless you explicitly need it for some specific reason: it isn't faster or better in any particular way, and if malloc() thinks mmap() is needed, it will do it for you.
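For illustration, here is one way that "allocate once, track the used size yourself" might look; arena_t and arena_push are made-up names, not a real API:

#include <stddef.h>

/* One big allocation up front; pieces are handed out by bumping a counter. */
typedef struct {
    char  *base;  /* the single big allocation */
    size_t cap;   /* total bytes reserved      */
    size_t used;  /* bytes handed out so far   */
} arena_t;

static void *arena_push(arena_t *a, size_t n)
{
    if (a->used + n > a->cap)
        return NULL;              /* arena exhausted */
    void *p = a->base + a->used;
    a->used += n;                 /* bump the watermark */
    return p;
}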
It's fine to allocate upper bounds as long as:
- You're building a 64-bit program: 32-bit ones have a restricted virtual address space, even on 64-bit CPUs
- Your upper bounds don't approach 2^47, as a mathematically derived one might
- You're fine with crashing as your out-of-memory failure mode
- You'll only run on systems where overcommit is enabled
As a side note, an end-user application doing this may want to borrow a page from GHC's book and allocate 1TB up front even if 10GB would do (a sketch of such a reservation is below). This unrealistically large amount ensures that users don't confuse virtual memory usage with physical memory usage.
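As a sketch, such a reservation could look like the following on a Linux-like system; flag names vary slightly across platforms (e.g. MAP_ANON instead of MAP_ANONYMOUS), and whether it succeeds depends on the overcommit policy:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Reserve 1TB of address space; on a demand-paged, overcommitting
       OS, pages only consume RAM once they are touched. */
    size_t huge = (size_t)1 << 40;
    void *base = mmap(NULL, huge, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE,
                      -1, 0);
    if (base == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    return 0;
}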
If you know for a fact that wasting a chunk of memory (most likely an entire page, which is likely 4096 bytes) will not cause your program or the other programs on your system to run out of memory, AND you know for a fact that your program will only ever be compiled and run on UNIX machines, then this approach is not incorrect. But it is not good programming practice, for the following reasons:
The <stdlib.h> file you #include to use malloc() and free() in your C programs is specified by the C standard, but it is implemented specifically for your architecture by the writers of the operating system. This means your specific system was kept in mind when these functions were written, so finding a sneaky way to improve memory-allocation efficiency is unlikely unless you know the inner workings of memory management in your OS better than those who wrote it.
Furthermore, the <sys/mman.h> file you #include to mmap() things is not part of the C standard and will only compile on UNIX machines, which reduces the portability of your code.
There's also a really good chance (assuming a UNIX environment) that malloc() and realloc() already use mmap() behind the scenes to allocate memory for your process anyway, so it's almost certainly better to just use them. (Read that as: "realloc doesn't necessarily actively allocate more space for me, because there's a good chance a chunk of memory my process already controls can satisfy the new request without calling mmap() again.")
Hope that helps!
I have C code with embedded SQL for Oracle, through Pro*C.
Whenever I do an insert or update, it looks something like this (an update example):
update TBL1 set COL1 = :v, . . . where rowid = :v
To manage bulk insertions and updates, I have allocated several memory chunks, so I can insert in bulk and commit once. There are other memory allocations going on too, as and when necessary. How do I better manage the heap for these dynamic allocations?

One option is to make the heap size configurable at GNU link time. I'm using g++ version 2.95; I know it's quite old, but I have to use it for legacy reasons. Since the executable (which runs on Solaris 10), once built, could run in several production environments with varied resources, a one-size-fits-all heap size may not be appropriate. As an alternative, I need some mechanism whereby the heap can grow elastically as and when needed.

Unlike Linux, Solaris, I think, does not have the concept of overcommitted memory, so allocations can fail with ENOMEM if there is no more space left. What would be a good strategy for knowing that we are approaching the danger level, so that we can either deallocate chunks we are storing, if they are done with, or transfer still-pending chunks to the Oracle DB and then deallocate them? Any strategy you could suggest?
C is not Java, where the heap size is fixed at startup.
The heap and the stack of a compiled C application share the same virtual memory space and adjust dynamically.
The size of this space depends on whether you are compiling a 32 bit or a 64 bit binary, and also whether your kernel is a 32 bit or a 64 bit one (on SPARC hardware, it's always 64 bit).
If you don't have enough RAM and want Solaris to accept large memory reservations anyway, similar to the way Linux overcommits memory, you can just add enough swap for the reservation to be backed by actual storage.
If for some reason you are unhappy with the Solaris libc memory allocator, you can evaluate the bundled alternatives like libumem, mtmalloc, or the third-party hoard. See http://www.oracle.com/technetwork/articles/servers-storage-dev/mem-alloc-1557798.html for details.
One solution would be to employ soft limits in your code for various purposes, e.g. so that only 100 transactions at a time are handled and further transactions have to wait until previous ones are deallocated (a sketch follows below). This guarantees predictable behavior, as no code path can use more memory than allowed.
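A minimal sketch of such a soft limit; the limit value and flush_pending_to_oracle() are hypothetical stand-ins for your own bookkeeping:

#include <stdlib.h>

#define MAX_INFLIGHT 100   /* illustrative soft limit */

extern void flush_pending_to_oracle(void);  /* hypothetical: commit pending
                                               rows, free their buffers, and
                                               reset the counter */

static int inflight = 0;

static void *alloc_txn_buffer(size_t n)
{
    if (inflight >= MAX_INFLIGHT)
        flush_pending_to_oracle();  /* make room before growing further */
    void *p = malloc(n);
    if (p != NULL)
        inflight++;
    return p;
}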
The question is:
Do you really run out of memory in your application, or do you fragment your memory and fail to get a sufficiently large contiguous block? The strategies for handling each case are different.
I am trying to use the debugging functionality of nedmalloc to find potential memory leaks in my code, so I activated the flags ENABLE_LOGGING and NEDMALLOC_TESTLOGENTRY.
In my program I only use the system memory pool. At the very end of the program, I call the function neddestroysyspool in order to flush all memory events.
First of all, I don't manage to activate the stack-trace functionality: when I change the backtrace depth, the program crashes after a few allocations. In order to compile under VS2010, I had to define DeinitSym myself with a call to CloseHandle; I hope I'm doing that right, but it does not work properly, so I don't use it.
So I just parse the file nedmalloc.csv: I sort it by address, then sum the allocated sizes and subtract the freed ones per address. For some unknown reason, for several big chunks (size > 400kb), the size given at allocation is right, but the size given at free is different and above the allocated size. For example, I allocated a block of 840352 bytes, but when it was freed, the recorded size was 851932 bytes. Is this normal?
Does anyone have some answer(s) or hint(s) for this problem?
Firstly, I really wouldn't use nedmalloc's logging facilities for memory leak detection; valgrind is a vastly superior way of finding resource leaks. I even have a hack in nedmalloc to have it use the system allocator instead of dlmalloc, precisely in order to avail of valgrind.
Secondly, yes, the stack backtracing code can be a little brittle. If I remember rightly, I did something naughty to get the performance considerably higher, at the expense of it not quite working well. I'll be blunt: almost no one but me ever used that code path, so I never had much reason to debug it properly. It worked well enough for my purposes.
Thirdly, for larger blocks dlmalloc will round allocations, including their bookkeeping, up to a 64Kb multiple, so it is expected that a 12-chunk request gets turned into a 13-chunk allocation (the arithmetic is sketched below).
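For what it's worth, the arithmetic for the sizes quoted in the question (the exact granularity and bookkeeping overhead are implementation details):

#include <stdio.h>

int main(void)
{
    size_t request = 840352;              /* size logged at allocation  */
    size_t gran    = 64 * 1024;           /* 64KiB rounding granularity */
    size_t rounded = (request + gran - 1) / gran * gran;
    printf("%zu\n", rounded);             /* prints 851968; the 851932
                                             logged at free is this minus
                                             a few bytes of bookkeeping */
    return 0;
}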
Lastly, as the author of nedmalloc I would recommend that you not use nedmalloc on Windows 7 or later, on any Linux with a 3.x kernel, or on any Mac OS X produced in the past three years. Why? The system allocator is probably good enough, and as I personally haven't needed nedmalloc for some years, I'll freely admit the code is bit-rotting.
Hope that helps.
Niall
I've been thinking for days about the following question:
On a common PC, when you allocate some memory, you ask the OS for it; the OS keeps track of which memory segments are occupied and which are not, doesn't let you mess around with other programs' memory, and so on.
But what about a microcontroller? A microcontroller doesn't have an operating system running, so when you ask for a bunch of memory, what is going on? You cannot simply access a random place in the memory chip, because it may be occupied. Who keeps track of which parts of memory are already occupied and gives you a free place to store something?
EDIT:
I've programmed microcontrollers in C, and I was thinking the answer could be language-independent. But let me be more clear: suppose I have this program running on a microcontroller:
int i=0;
int d=3;
What makes sure that my i and d variables are not stored at the same place in memory?
I think the comments have already covered this...
To ask for memory implies that there is some operating system managing the memory you are mallocing from (using a loose sense of the term operating system). As a general rule you shouldn't be mallocing memory on a microcontroller at all (I may get flamed for that statement). It can be done in some cases, but you are in control of your memory: you own the system with your application, so asking for memory means asking yourself for it.
Unless you have reasons why you cannot, statically allocate your structures and arrays, or use a union if there are mutually exclusive code paths that might each want much or all of the spare memory (a sketch follows below). You can try to allocate dynamically and free, but it is a harder system-engineering problem to solve.
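A minimal sketch of the union idea: two phases that never run at the same time share one statically allocated buffer. The sizes and names are made up:

/* Used only while receiving vs. only while processing; the phases are
   mutually exclusive, so one buffer serves both. */
static union {
    unsigned char rx_frame[2048];
    unsigned char work[2048];
} shared_buf;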
There is a difference between runtime allocation of memory and compile-time allocation; your example has nothing to do with the rest of the question:
int i=0;
int d=3;
At compile time the compiler allocates two locations in .data, one for each of those items. The linker and/or linker script manages where .data lives and what its size limits are; if .data is bigger than what is available, you should get a linker warning, and if you don't, you need to fix your linker commands or script to match your system.
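As a sketch of where things typically land with a GNU-style toolchain (exact placement depends on the compiler and linker script):

int d = 3;   /* initialized file-scope data -> .data */
int i = 0;   /* also initialized, though many toolchains place
                zero-initialized objects in .bss instead      */
int u;       /* uninitialized               -> .bss  */

void f(void)
{
    int local = 0;       /* -> stack (or a register) */
    static int count;    /* zero-initialized -> .bss */
}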
Runtime allocation is managed at runtime, and where and how the memory is managed is determined by the allocator library. Even if you have plenty of memory, a bad or improperly written library could overlap .text, .data, .bss and/or the stack and cause a lot of problems.
Excessive use of the stack is also a pretty serious system-engineering problem which, coming from non-embedded systems, is often overlooked these days because there is so much memory. It is a very real problem when dealing with embedded code on a microcontroller. You need to know your worst-case stack usage and leave room for at least that much memory, whether you are going to have a heap for dynamic allocation or allocate everything statically (one way to measure it is sketched below).
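One common way to measure worst-case stack usage is "stack painting": fill the stack region with a known pattern at boot and later see how much of it was overwritten. A sketch, where __stack_start and __stack_end are hypothetical symbols that would come from your linker script:

#include <stddef.h>

extern unsigned char __stack_start[], __stack_end[];
#define STACK_PATTERN 0xAAu

/* Call from startup code, before the stack is in use (real code paints
   only the region below the current stack pointer). */
void stack_paint(void)
{
    for (unsigned char *p = __stack_start; p < __stack_end; ++p)
        *p = STACK_PATTERN;
}

/* Call later: returns the number of stack bytes ever used. */
size_t stack_high_water(void)
{
    unsigned char *p = __stack_start;
    while (p < __stack_end && *p == STACK_PATTERN)
        ++p;
    return (size_t)(__stack_end - p);
}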
I have an image compression application that now has two different memory allocation systems. In the original one, malloc is used everywhere; in the second one, I implemented a simple pool allocator that just allocates a chunk of memory and returns parts of that memory to myalloc() calls.
We've been noticing a huge memory overhead when malloc is used: at the height of its memory usage, the malloc() code requires about 170 megabytes for a 1920x1080x16bpp image, while the pool allocator allocates just 48 megabytes, of which 47 are used by the program.
In terms of allocation patterns, the program allocates a lot of 8-byte blocks (most), 32-byte blocks (many) and 1080-byte blocks (some) with the test image. Apart from these, there are no dynamic memory allocations in the code.
The OS of the test system is Windows 7 (64-bit).
How did we test memory usage?
With the custom allocator we could see how much memory is used, because all malloc calls are deferred to the allocator. With malloc(), in Debug mode, we just stepped through the code and watched the memory usage in the Task Manager. In Release mode we did the same, but less fine-grained, because the compiler optimizes a lot of stuff away, so we couldn't step through the code piece by piece (the memory difference between Release and Debug was about 20MB, which I would attribute to optimization and the lack of debug information in Release mode).
Could malloc alone be the cause of such a huge overhead? If so, what exactly causes this overhead inside malloc?
On Windows 7 you always get the low-fragmentation heap allocator, without having to call HeapSetInformation() explicitly to ask for it. That allocator sacrifices virtual memory space to reduce fragmentation. Your program is not actually using 170 megabytes; you are just seeing a bunch of free blocks lying around, waiting for an allocation of a similar size.
This algorithm is very easy to beat with a custom allocator that doesn't do anything to reduce fragmentation. That may well work out for you, although you won't see the side effects until you keep the program running longer than a single debug session. You do need to make sure it is stable for days or weeks if that is the expected usage pattern.
The best thing to do is just not fret about it; 170 MB is rather small potatoes. And do keep in mind that this is virtual memory, so it doesn't cost anything.
First of all, malloc aligns the returned pointers to 16-byte boundaries. Furthermore, it stores at least one pointer (or the allocated length) in the addresses preceding the returned value. It probably also adds a magic value or release counter to check that the linked list is not broken or that a memory block has not been released twice (free asserts on double frees).
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    int *foo = malloc(4);
    int *bar = malloc(4);
    /* Subtract as char pointers: casting a pointer to int truncates
       on 64-bit platforms. */
    printf("%td\n", (char *)bar - (char *)foo);
    return 0;
}
Output: 32
Caution: when you run your program inside Visual Studio or with any debugger attached, the default malloc behaviour changes a lot: the Low Fragmentation Heap is not used, and the memory overhead may not be representative of real usage (see also https://stackoverflow.com/a/3768820/16673). Set the environment variable _NO_DEBUG_HEAP=1 to avoid being hit by this, or measure the memory usage when not running under a debugger.