How does Windows 10 avoid memory fragmentation? - mmu

I tried a test program that repeatedly allocated 3 bytes, storing the results in a large array of pointers, and the virtual addresses returned from malloc were only 0x20 apart (32 bytes). Now, I'm familiar with most algorithms for avoiding fragmentation, which usually involve playing around with the MMU page table entries, so what I expected was a page size of 4K or something like that. I'm more familiar with Linux or embedded RTOS solutions to these problems, but when it comes to Windows, I'm unsure. Does anyone out there know what Windows does that allows it to allocate memory with this kind of tight resolution (32 bytes) instead of the more usual 4K page sizes? Is it that they have a special block of memory for small allocations and then do some kind of garbage collection later? Interested in any feedback. Thank you in advance.

Related

First use of malloc sets up the heap?

I had a bug which I have now fixed but which I need to explain in a report.
I am working on an embedded device running FreeRTOS, which does its own heap memory management. FreeRTOS has its own version of malloc(), pvPortMalloc(), which I was unaware of; using it fixed the memory issues I was having.
My question relates to the size of the memory overflow caused by malloc(). The data was only 8 bytes in size, but the overflow was significant: kilobytes, if not larger. My guess is that the first and only use of malloc in this application set up a second heap, at least several KB in size, in competition with FreeRTOS's heap.
Can anyone confirm this or give a better explanation. Pointers to more info or references greatly appreciated.
It is a common trait of many malloc implementations to request a larger chunk of memory from the system than is needed for a single request. For example, glibc's ptmalloc has this:
#define MINIMUM_MORECORE_SIZE (64 * 1024)
This serves as a lower bound on the amount of memory (in bytes) to request from the OS (via sbrk()) at a single time. So you would expect to see a single tiny allocation result in 64 KB "used."
One reason to do this sort of thing is to reduce system calls; another might be to reduce fragmentation.
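The batching idea can be sketched as a tiny bump allocator: one large block obtained from the system (here plain malloc() stands in for sbrk(), and the names are illustrative) serves many small requests:

```c
#include <stddef.h>
#include <stdlib.h>

#define ARENA_SIZE (64 * 1024)   /* mirrors MINIMUM_MORECORE_SIZE */

static unsigned char *arena;
static size_t arena_used;

/* Hand out small pieces of one big system allocation. */
static void *tiny_alloc(size_t n) {
    n = (n + 15) & ~(size_t)15;       /* 16-byte alignment, like real allocators */
    if (arena == NULL) {
        arena = malloc(ARENA_SIZE);   /* the single large system request */
        arena_used = 0;
    }
    if (arena == NULL || arena_used + n > ARENA_SIZE)
        return NULL;                  /* a real allocator would grow here */
    void *p = arena + arena_used;
    arena_used += n;
    return p;
}
```

So the first tiny request costs one 64 KB system allocation, and every later request is just pointer arithmetic.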

C: Store a large "virtual" array

I am translating a 32 bit CPU emulator from python to C.
32 bit address space ==> 4GB of memory, but that is more memory than a lot of machines can handle. For this reason in the Python emulator, I used a dict, because it gave access to the entire address space, but only a small subset would be used at once.
In C, I would like to preserve access to the whole address space (since a C-based emulator could read or write the whole address space in a matter of seconds), but keep the memory manageable (so no 4GB array) and maintain high performance (the main reason for rewriting the emulator in C).
One solution I have thought of is creating a paging system, so only a small amount of the array is stored in memory and the rest on disk. How could I implement this (I am new to C), and are there any better solutions?
Consider looking into mmap and memory-mapped storage.
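An alternative that keeps the dict-like lazy behavior without OS-specific calls is a two-level page table: 4 KB pages are allocated only on first write, and untouched addresses read as zero. This is a sketch with illustrative names, not code from any particular emulator:

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BITS 12
#define PAGE_SIZE (1u << PAGE_BITS)          /* 4 KB pages */
#define NPAGES    (1u << (32 - PAGE_BITS))   /* 2^20 page slots */

static uint8_t *pages[NPAGES];               /* 8 MB of pointers on 64-bit */

static uint8_t *page_for(uint32_t addr) {
    uint32_t idx = addr >> PAGE_BITS;
    if (pages[idx] == NULL)
        pages[idx] = calloc(1, PAGE_SIZE);   /* zero-filled, like the dict */
    return pages[idx];
}

void mem_write8(uint32_t addr, uint8_t v) {
    page_for(addr)[addr & (PAGE_SIZE - 1)] = v;
}

uint8_t mem_read8(uint32_t addr) {
    uint32_t idx = addr >> PAGE_BITS;
    if (pages[idx] == NULL)
        return 0;                            /* never-written memory reads 0 */
    return pages[idx][addr & (PAGE_SIZE - 1)];
}
```

Memory cost is proportional to the pages actually touched, and each access is two array indexings, so performance stays close to a flat array.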

System malloc vs DLMalloc on large malloc

I haven't coded in a while, so excuse me upfront. I have this odd problem. I am trying to malloc 8GB in one go, and I plan to manage that heap with TLSF later on. That is, I want to avoid mallocing throughout my application at all: just get one big glob at the beginning and free it at the end. Here is the peculiarity, though. I had always used dlmalloc in my programs until now; linking it in, everything went well. However, now when I try to malloc 8GB at once and link in dlmalloc, I get segmentation fault 11 on OSX when I run it; without dlmalloc everything goes well. It doesn't matter whether I use gcc or clang. The system doesn't have 8GB of RAM, though; it has 4GB. Interestingly enough, the same thing happens on a Windows machine which has 32GB of RAM and an Ubuntu one that has 16GB of RAM. With the system malloc it all works: the allocation goes through and simple iteration through the allocated memory works as expected on all three systems. But when I link in dlmalloc it fails. I tried it both with malloc and dlmalloc function calls.
The allocation itself is nothing extraordinary, plain C99.
[...]
size_t bytes = 1024LL*1024LL*1024LL*8LL;
unsigned long *m = (unsigned long*)malloc(bytes);
[...]
I'm confused by several things here. How come the system malloc gives me an 8GB allocation even though the system doesn't have that much RAM; are those virtual pages? Why doesn't dlmalloc do the same? I am aware there might not be a contiguous block of 8GB of RAM to allocate, but why a segmentation fault then, why not a null pointer?
Is there a viable robust (hopefully platform neutral) solution to get that amount of RAM in one go from malloc even if I'm not sure system will have that much RAM?
edit: program is 64-bit as are OS' which I'm running on.
edit2:
So I played with it some more. It turns out that if I break the allocation down into 1GB chunks, that is, 8 separate mallocs, then it works with dlmalloc. So it seems to be an issue with contiguous range allocation, where dlmalloc probably only succeeds if there is a contiguous block. This makes my question even harder to formulate: is there a reasonably sure way to get a memory chunk of that size, with or without dlmalloc, across platforms, and not have it fail if there is no physical memory left (it can be in swap, as long as it doesn't fail)? Also, is it possible to tell, in a cross-platform manner, whether malloc'd memory is in RAM or swap?
I will give you just a bit of perspective, if not an outright answer. When I see you attempting to allocate 8GB of contiguous RAM, I cringe. Yes, with 64-bit computing and all, that is probably "legal", but on a normal machine, you are probably going to run into a lot of edge cases, 32-bit legacy code choking on a 64-bit size, and just plain usability issues getting a chunk of memory big enough to make this work. If you want to try this sort of thing, perhaps attempt to malloc the single chunk, then if that fails, use smaller chunks. This somewhat defeats the purpose of a 1 chunk system though. Perhaps there is some sort of "page size" in the OS that you could link your malloc size to - in order to help performance and just plain ability to get memory in the amount you wish.
On game consoles, this approach to memory management is somewhat common - allocate 1 buffer from the OS at bootup as big as possible, then place your own memory manager on there to avoid OS overhead and possible inferior allocation code. It also allows one to better control memory fragmentation on such systems where virtual memory doesn't exist. But on these systems, you also know up front exactly how much RAM you have.
Is there a way to see if memory is physical or virtual in a platform independent way? I don't think so, but perhaps someone else can give a good answer to that and I'll edit this part away.
So, not a 100% answer, but some random thoughts to help out, while I internally wonder what you are doing that wants 8GB of RAM in one chunk when it sounds like multiple chunks would work fine. :)
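The "try the single chunk, then fall back to smaller chunks" idea could be sketched like this (illustrative names; a real version would also need a matching free routine and a policy for how small a chunk is still acceptable):

```c
#include <stdlib.h>

/* Try to allocate `total` bytes as one block; on failure, halve the chunk
 * size and retry with multiple blocks. Writes the chunk pointers to out[]
 * (at most max_chunks of them) and returns the chunk count, or 0 on failure. */
static size_t alloc_chunked(size_t total, void **out, size_t max_chunks) {
    size_t chunk = total;
    while (chunk > 0) {
        size_t n = (total + chunk - 1) / chunk;   /* chunks needed */
        if (n <= max_chunks) {
            size_t i;
            for (i = 0; i < n; i++) {
                /* last chunk carries the remainder */
                out[i] = malloc(i == n - 1 ? total - chunk * i : chunk);
                if (out[i] == NULL) break;
            }
            if (i == n) return n;                 /* success */
            while (i > 0) free(out[--i]);         /* roll back, go smaller */
        }
        chunk /= 2;
    }
    return 0;
}
```

This sidesteps the contiguous-range problem, but the caller then has to treat the memory as a set of regions rather than one flat buffer, which may or may not suit a TLSF-style manager.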

Determining size of bit vectors for memory management given hard limit on memory

After searching around a bit and consulting the Dinosaur Book, I've come to SO seeking wisdom. Note that this is somewhat homework-related, but actually isn't a homework problem. Also, this is using the C programming language.
I'm working with a kernel that currently allocates memory in 4K chunks. In an attempt to cut down on wasted memory, I've written my own malloc-like allocator that grabs a 4K page and then hands out memory from it as needed. That part is currently working fine. I plan to have a linked list of pages of memory. Memory is handled as a char*, so my struct has a char* in it, as well as some ints describing the page and a pointer to the next node.
The question is this: I plan to use a bit vector to keep track of free and used memory. I want to figure out how many integers (4 bytes, 32 bits) I need to keep track of all the 1 byte blocks in the page of memory. So 1 bit in the bit vector will correspond to 1 byte in the page. The catch is that I must fit this all within the 4K I've been allocated, so I need to figure out the number of integers necessary to satisfy the 1-bit-per-byte constraint and fit in the 4K.
Or rather, I need to maximize the "actual" memory I'll have, while minimizing the number of integers required to map one bit per byte, while both parts ("actual" memory and bit vector) are in the same page.
Due to information about the page, and the pointers, I won't actually have 4K to play with, but something closer to 4062 bytes.
I believe this to be a linear programming problem, but the formulations of it that I've tried haven't worked out.
You want to use a bitmap to keep track of allocated bytes in a 4K page, and you're wondering how big the bitmap should be (in bytes)? The answer is 456 (after rounding up), found by solving this equation for the usable payload X:
X + X/8 = 4096
which reduces to:
9X = 32768
But ... the whole approach of keeping a bitmap within each allocated page sounds very wrong to me. What if you want to allocate a 12k buffer?
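The arithmetic above can be packaged as a pair of helpers: every 9 bytes of page hold 8 payload bytes plus 1 bitmap byte (8 bits), so the bitmap takes ceil(page / 9) bytes. The 4062-byte figure from the question works the same way:

```c
#include <stddef.h>

/* Bytes of bitmap needed so that one bit covers each payload byte. */
static size_t bitmap_bytes(size_t page) {
    return (page + 8) / 9;            /* ceil(page / 9) */
}

/* Payload bytes left over after the bitmap is carved out of the page. */
static size_t payload_bytes(size_t page) {
    return page - bitmap_bytes(page);
}
```

For a full 4096-byte page this gives a 456-byte bitmap and 3640 payload bytes; for the 4062 usable bytes mentioned in the question, a 452-byte bitmap and 3610 payload bytes.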

What's a good C memory allocator for embedded systems? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I have a single-threaded, embedded application that allocates and deallocates lots and lots of small blocks (32-64 bytes). The perfect scenario for a cache-based allocator. And although I could try to write one, it would likely be a waste of time, and not as well tested and tuned as a solution that's already been on the front lines.
So what would be the best allocator I could use for this scenario?
Note: I'm using a Lua Virtual Machine in the system (which is the culprit of 80+% of the allocations), so I can't trivially refactor my code to use stack allocations to increase allocation performance.
I'm a bit late to the party, but I just want to share a very efficient memory allocator for embedded systems that I've recently found and tested: https://github.com/dimonomid/umm_malloc
This is a memory management library specifically designed to work with the ARM7; personally, I use it on a PIC32 device, but it should work on any 16- or 8-bit device (I have plans to test it on a 16-bit PIC24, but I haven't tested it yet).
I was seriously beaten by fragmentation with the default allocator: my project often allocates blocks of various sizes, from several bytes to several hundred bytes, and sometimes I faced an 'out of memory' error. My PIC32 device has 32K of RAM in total, and 8192 bytes are used for the heap. At a particular moment there was more than 5K of free memory, but because of fragmentation the default allocator's largest non-fragmented block was only about 700 bytes. This is too bad, so I decided to look for a more efficient solution.
I was already aware of some allocators, but all of them had some limitations (such as the block size having to be a power of 2, starting not from 2 but from, say, 128 bytes), or were just buggy. Every time before, I had to switch back to the default allocator.
But this time, I'm lucky: I've found this one: http://hempeldesigngroup.com/embedded/stories/memorymanager/
When I tried this memory allocator, in exactly the same situation with 5K of free memory, it had a free block of more than 3800 bytes! That was so unbelievable to me (compared to 700 bytes) that I performed a hard test: the device worked heavily for more than 30 hours. No memory leaks; everything works as it should.
I also found this allocator in the FreeRTOS repository: http://svnmios.midibox.org/listing.php?repname=svn.mios32&path=%2Ftrunk%2FFreeRTOS%2FSource%2Fportable%2FMemMang%2F&rev=1041&peg=1041# , and this fact is additional evidence of the stability of umm_malloc.
So I completely switched to umm_malloc, and I'm quite happy with it.
I just had to change it a bit: the configuration was a bit buggy when the macro UMM_TEST_MAIN was not defined, so I've created the GitHub repository (the link is at the top of this post). Now user-dependent configuration is stored in a separate file, umm_malloc_cfg.h.
I haven't yet dug deeply into the algorithms applied in this allocator, but it has a very detailed explanation of them, so anyone who is interested can look at the top of the file umm_malloc.c. At the least, the "binning" approach should greatly help reduce fragmentation: http://g.oswego.edu/dl/html/malloc.html
I believe that anyone who needs an efficient memory allocator for microcontrollers should at least try this one.
In a past C project I worked on, we went down the road of implementing our own memory management routines for a library that ran on a wide range of platforms, including embedded systems. The library also allocated and freed a large number of small buffers. It ran relatively well and didn't take a large amount of code to implement. I can give you a bit of background on that implementation in case you want to develop something yourself.
The basic implementation included a set of routines that managed buffers of a set size. The routines were used as wrappers around malloc() and free(). We used these routines to manage allocation of structures that we frequently used and also to manage generic buffers of set sizes. A structure was used to describe each type of buffer being managed. When a buffer of a specific type was allocated, we'd malloc() the memory in blocks (if the list of free buffers was empty). I.e., if we were managing 10-byte buffers, we might make a single malloc() containing space for 100 of these buffers, to reduce fragmentation and the number of underlying mallocs needed.
At the front of each buffer would be a pointer that would be used to chain the buffers in a free list. When the 100 buffers were allocated, each buffer would be chained together in the free list. When the buffer was in use, the pointer would be set to null. We also maintained a list of the "blocks" of buffers, so that we could do a simple cleanup by calling free() on each of the actual malloc'd buffers.
For management of dynamic buffer sizes, we also added a size_t variable at the beginning of each buffer telling the size of the buffer. This was then used to identify which buffer block to put the buffer back into when it was freed. We had replacement routines for malloc() and free() that did pointer arithmetic to get the buffer size and then to put the buffer into the free list. We also had a limit on how large of buffers we managed. Buffers larger than this limit were simply malloc'd and passed to the user. For structures that we managed, we created wrapper routines for allocation and freeing of the specific structures.
Eventually we also evolved the system to include garbage collection when requested by the user to clean up unused memory. Since we had control over the whole system, there were various optimizations we were able to make over time to increase performance of the system. As I mentioned, it did work quite well.
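The core of the scheme described above (buffers carved from one malloc() and chained through a free list stored in the first bytes of each free buffer) can be sketched like this; the names and sizes are illustrative, not from the original project:

```c
#include <stdlib.h>

#define BUF_SIZE  32   /* size of each managed buffer; must hold a pointer */
#define BUF_COUNT 100  /* buffers obtained per underlying malloc() */

static void *free_list;   /* head of the chain of free buffers */

/* Grab one large block and chain all of its buffers onto the free list. */
static int pool_grow(void) {
    char *block = malloc((size_t)BUF_SIZE * BUF_COUNT);
    if (block == NULL) return -1;
    for (int i = 0; i < BUF_COUNT; i++) {
        void **p = (void **)(block + (size_t)i * BUF_SIZE);
        *p = free_list;   /* link pointer lives in the buffer itself */
        free_list = p;
    }
    return 0;
}

static void *pool_alloc(void) {
    if (free_list == NULL && pool_grow() != 0)
        return NULL;
    void **p = free_list;
    free_list = *p;        /* pop the head of the free list */
    return p;
}

static void pool_free(void *buf) {
    *(void **)buf = free_list;   /* push back onto the free list */
    free_list = buf;
}
```

A full version along the lines described would add a per-size-class structure, a size_t header for dynamic sizes, and a list of the underlying blocks so everything can be released at shutdown.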
I did some research on this very topic recently, as we had an issue with memory fragmentation. In the end we decided to stay with GNU libc's implementation and add some application-level memory pools where necessary. There were other allocators with better fragmentation behavior, but we weren't comfortable enough with them to replace malloc globally. GNU's has the benefit of a long history behind it.
In your case it seems justified; assuming you can't fix the VM, those tiny allocations are very wasteful. I don't know what your whole environment is, but you might consider wrapping the calls to malloc/realloc/free on just the VM so that you can pass it off to a handler designed for small pools.
Although it's been some time since I asked this, my final solution was to use Loki's SmallObjectAllocator, and it works great. It got rid of all the OS calls and improved the performance of my Lua engine for embedded devices. Very nice and simple, and just about 5 minutes' worth of work!
Since version 5.1, Lua has allowed a custom allocator to be set when creating new states.
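The hook Lua exposes for this is the lua_Alloc function passed to lua_newstate: nsize == 0 means free the block, otherwise behave like realloc. A minimal pass-through version looks like this (a pool-backed one would route small nsize values to a pool instead of realloc):

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocator following Lua 5.1's lua_Alloc contract. */
static void *l_alloc(void *ud, void *ptr, size_t osize, size_t nsize) {
    (void)ud; (void)osize;   /* a pool allocator would use both */
    if (nsize == 0) {
        free(ptr);           /* nsize == 0 is a free request */
        return NULL;
    }
    return realloc(ptr, nsize);
}

/* Installed with: lua_State *L = lua_newstate(l_alloc, NULL); */
```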
I'd just like to add to this even though it's an old thread. In an embedded application, if you can analyze your application's memory usage and come up with a maximum number of allocations of each of the varying sizes, the fastest type of allocator is usually one using memory pools. In our embedded apps we can determine all allocation sizes that will ever be needed at run time. If you can do this, you can completely eliminate heap fragmentation and have very fast allocations. Most of these implementations have an overflow pool which will do a regular malloc for the special cases, which will hopefully be few and far between if you did your analysis right.
I have used the 'binary buddy' system to good effect under VxWorks. Basically, you portion out your heap by cutting blocks in half to get the smallest power-of-two-sized block that holds your request, and when blocks are freed, you can make a pass up the tree to merge blocks back together to mitigate fragmentation. A Google search should turn up all the info you need.
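Two small building blocks of the binary-buddy scheme can illustrate the mechanics: rounding a request up to the next power of two, and finding a block's buddy (its other half for merging) by XOR-ing the block's offset with its size. This is a sketch, not a complete allocator:

```c
#include <stddef.h>

/* Smallest power of two >= n (for n >= 1). */
static size_t next_pow2(size_t n) {
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

/* Offset of the buddy of the block at `offset` with power-of-two `size`,
 * both relative to the start of the heap. When a block and its buddy are
 * both free, they merge into one block of twice the size. */
static size_t buddy_of(size_t offset, size_t size) {
    return offset ^ size;
}
```

The XOR trick works because splitting a size-2s block at offset o always yields the pair (o, o+s), and those two offsets differ in exactly the bit that s sets.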
I am writing a C memory allocator called tinymem that is intended to be able to defragment the heap, and re-use memory. Check it out:
https://github.com/vitiral/tinymem
Note: this project has been discontinued in favor of a Rust implementation:
https://github.com/vitiral/defrag-rs
Also, I had not heard of umm_malloc before. Unfortunately, it doesn't seem to be able to deal with fragmentation, but it definitely looks useful. I will have to check it out.
