How to handle memory management via mmap properly? - c

I'm trying to write my own malloc and free implementation for the sake of learning, using just mmap and munmap (since brk and sbrk are obsolete). I've read a fair amount of documentation on the subject, but every example I see either uses sbrk or doesn't explain very well how to handle large zones of mapped memory.
The idea of what I'm trying to write is this: I first map a big zone (i.e. 512 pages); this zone will contain all allocations between 1 and 992 bytes, in 16-byte increments. I'll do the same later with a 4096-page zone for bigger allocations (or mmap directly if the requested size is bigger than a page). So I need a way to store information about every chunk that I allocate or free.
My question is: how do I handle this information properly?
My problems are these: if I create a linked list, how do I allocate more space for each node? Or do I need to copy it into the mapped zone? If so, how can I juggle between data space and reserved space? Or is it better to use a statically sized array? The problem with that is that my zone's size depends on the page size.

There are several possible implementations for a mmap-based malloc:
Sequential (first-fit, best-fit).
Idea: Use a linked list of chunks, with the last chunk sized to the remaining space of your page.
struct chunk
{
    size_t size;
    struct chunk *next;
    int is_free;
};
To allocate:
Iterate your list for a suitable free chunk (optimizable).
If nothing is found, resize the last chunk to the required size and create a new free chunk from the remaining space.
If you reach the end of the page (the remaining size is too small and next is NULL), simply mmap a new page (optimizable: make a custom-sized mapping if the request is abnormally large...).
To free, it's even simpler: just set is_free to 1. Optionally, check whether the next chunk is also free and merge both into a bigger free chunk (watch out for page borders).
Pros: Easy to implement, trivial to understand, simple to tweak.
Cons: not very efficient (you may iterate your whole list to find a block), needs lots of optimization, hectic memory organization.
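A minimal sketch of this sequential approach (zone_start is a hypothetical pointer to the first chunk of an already-mmap'd zone, initialized as one big free chunk; a real version would mmap a new page instead of returning NULL):
#include <stddef.h>

static struct chunk *zone_start;  /* set when the zone is first mmap'd */

void *my_malloc(size_t size)
{
    for (struct chunk *c = zone_start; c != NULL; c = c->next) {
        if (!c->is_free || c->size < size)
            continue;
        /* Split if the leftover can hold a header plus a minimal chunk. */
        if (c->size >= size + sizeof(struct chunk) + 16) {
            struct chunk *rest = (struct chunk *)((char *)(c + 1) + size);
            rest->size = c->size - size - sizeof(struct chunk);
            rest->next = c->next;
            rest->is_free = 1;
            c->size = size;
            c->next = rest;
        }
        c->is_free = 0;
        return c + 1;              /* user memory starts after the header */
    }
    return NULL;                   /* out of space: mmap a new page here */
}

void my_free(void *ptr)
{
    struct chunk *c = (struct chunk *)ptr - 1;
    c->is_free = 1;
    /* Merge with the next chunk if it is free and physically adjacent. */
    if (c->next && c->next->is_free &&
        (char *)(c + 1) + c->size == (char *)c->next) {
        c->size += sizeof(struct chunk) + c->next->size;
        c->next = c->next->next;
    }
}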
Binary buddies (I love binary arithmetic and recursion)
Idea: Use powers-of-2 as size units:
struct chunk
{
    size_t size;
    int is_free;
};
The structure here does not need a next pointer, as you'll see.
The principle is the following:
You have a 4096-byte page; minus 16 bytes for metadata, that is 4080 usable bytes.
To allocate a small block, split the page into two 2048-byte chunks, split the first half again into two 1024-byte chunks, and so on, until you get a suitable usable space (the minimum being 32 bytes, 16 of them usable).
Every block, unless it is a full page, has a buddy.
You end up with a tree-like structure of buddy pairs.
To access your buddy, use a binary XOR between your pointer and your block size (this works because every block is aligned to its own size).
Implementation:
Allocating a block of size Size
Compute the required block_size = 2^k, the smallest power of two ≥ size + sizeof(struct chunk).
Find the smallest free block in the tree that fits block_size.
If it can be split smaller and still fit, split it, recursively.
Freeing a block
Set is_free to 1.
Check whether your buddy is free (XOR with the size, and don't forget to verify it is the same size as you).
If it is, merge the two into a block of double the size, and recurse.
Pros: Extremely fast and memory-efficient, clean.
Cons: Complicated, with a few tricky cases (page borders and buddy sizes); you also need to keep a list of your pages.
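A sketch of the buddy arithmetic (assuming all blocks live inside one page whose base is aligned, so XORing a block's offset with its size yields its buddy; the names are mine, and the merge is written iteratively rather than recursively):
#include <stddef.h>
#include <stdint.h>

/* Buddy of a block: XOR its offset within the page with its size. */
static struct chunk *buddy_of(struct chunk *c, void *page_base)
{
    uintptr_t off = (uintptr_t)c - (uintptr_t)page_base;
    return (struct chunk *)((char *)page_base + (off ^ c->size));
}

/* Free with coalescing, stopping when the buddy is busy or split. */
static void buddy_free(struct chunk *c, void *page_base, size_t page_size)
{
    c->is_free = 1;
    while (c->size < page_size) {
        struct chunk *b = buddy_of(c, page_base);
        if (!b->is_free || b->size != c->size)
            break;            /* buddy is busy, or was split further */
        if (b < c)
            c = b;            /* the merged block starts at the lower one */
        c->size *= 2;
        c->is_free = 1;
    }
}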
Buckets (I have a lot of time to lose)
This is the only method of the three I have not attempted to implement myself, so I can only speak of the theory:
struct bucket
{
    size_t buck_num;   // number of data segments
    size_t buck_size;  // size of a data segment
    void *page;
    void *freeinfo;
};
From the start you have a few small pages, each split into blocks of a constant size (one page of 8-byte blocks, one of 16-byte blocks, one of 32-byte blocks, and so on).
The "freedom information" for those data buckets is stored in bitsets (structures representing a large array of bits), either at the start of each page or in a separate memory zone.
For example, for 512-byte buckets in a 4096-byte page, the bitset representing the page would be an 8-bit bitset; supposing *freeinfo = 01001000, the second and fifth buckets would be free.
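A sketch of how allocation against such a bitset could look (here freeinfo is treated as a byte array with bit i set meaning segment i is free; that convention is an assumption, not part of the answer above):
#include <stddef.h>

void *bucket_alloc(struct bucket *b)
{
    unsigned char *bits = b->freeinfo;
    for (size_t i = 0; i < b->buck_num; i++) {
        unsigned char mask = (unsigned char)(1u << (i % 8));
        if (bits[i / 8] & mask) {
            bits[i / 8] &= (unsigned char)~mask;          /* claim it */
            return (char *)b->page + i * b->buck_size;
        }
    }
    return NULL;          /* bucket full: fall back to another page */
}

void bucket_free(struct bucket *b, void *ptr)
{
    unsigned char *bits = b->freeinfo;
    size_t i = (size_t)((char *)ptr - (char *)b->page) / b->buck_size;
    bits[i / 8] |= (unsigned char)(1u << (i % 8));        /* mark free */
}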
Pros: By far the fastest and cleanest over the long run; the most efficient for many small allocations.
Cons: Very cumbersome to implement, quite heavy for a small program, and it needs a separate memory space for the bitsets.
There are probably other algorithms and implementations, but those three are the most used, so I hope this gives you a lead on what you want to do.

Related

Memory manager implementation in C for memory blocks of 1 byte to 8k bytes - interview question

You need to implement a memory manager in C with the following 3 functions:
void init() - initialize the memory manager.
void* get(int numOfBytes) - return a memory block (on the heap) of size "numOfBytes". The value of "numOfBytes" can be in the range [1,8k].
void free(void* ptr) - free the memory block pointed by "ptr".
A few rules:
You can call the malloc function only in "init()".
The methods "get" and "free" should be as efficient as possible, but "init" doesn't have to be, as long as you don't waste too much memory.
You can assume your memory manager will not need to allocate more than some fixed number of bytes in total, say no more than 1GB.
My attempt:
I thought of just implementing a fixed-size memory pool where each block is 8k bytes, as described here. This gives O(1) run time for "get" and "free", which is great, but the problem is that we waste a lot of memory that way if the user only calls "get" for small numbers of bytes (say, 1 byte each time).
But if I try to implement it with variable block sizes, I'll need to handle fragmentation, which will make the run time worse.
So do you have a better idea?
I'd avoid a fixed-size block.
A common strategy is to form pools at powers of 2: 16, 32, ..., 1G, with everything initially in the largest pool.
Each allocated block is the user size n plus overhead (estimated 4-8 bytes), rounded up to a power of 2.
If a pool lacks an available block, cut a larger one in half.
As similar allocation sizes tend to occur in groups, this avoids excessive size waste.
De-allocation (and coalescing for reuse) only requires checking whether a freed block's pair is also free; if so, re-form the larger block (which may in turn re-join its own pair), reducing fragmentation.
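As a sketch of the size-to-pool mapping (the OVERHEAD estimate and the 16-byte smallest pool are assumptions, not part of the question):
#include <stddef.h>

#define OVERHEAD 8    /* estimated per-block bookkeeping */
#define MIN_POOL 16   /* smallest pool: 16-byte blocks   */

/* Map a request of n bytes to a pool index; pool i holds
   blocks of MIN_POOL << i bytes. */
static int pool_index(size_t n)
{
    size_t size = MIN_POOL;
    int i = 0;
    while (size < n + OVERHEAD) {   /* 16, 32, 64, ..., 1G */
        size <<= 1;
        i++;
    }
    return i;
}
get(n) would then pop a block from pools[pool_index(n)]; if that pool is empty, it splits a block from the next non-empty larger pool, recursively, putting the unused half back.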
Note: all *alloc() functions return a pointer suitably aligned for max_align_t, so that is the lower bound expected of get() as well (maybe size 4?). As part of an interview, mentioning alignment and portability concerns is good.
There are various improvements, like handling exact power-of-2-sized requests well, but for an interview question you only need to touch on such ideas.
free() is a standard library function - best not to redefine it; use a different name.

use malloc and realloc to use C struct but take twice the memory space as I calculated

I use a C struct to build a prefix tree, i.e. a trie. I want to use it to store a lot of number lists, to get the next possible number on a given path. (For example, if I have [10, 15, 17] and [10, 15, 18], then the next possible numbers after [10, 15] are 17 and 18.) The problem I've met is memory use: each of my trie nodes takes only 12 bytes (I used sizeof() to check), and I have 0.83 billion nodes in total, which should take 0.83 billion * 12 bytes = 10G of memory, but I actually use 20G, and I want to reduce it to 10G.
I use an unsigned short to store the data of each node, an unsigned short n_child to store how many children the node has, and a pointer to the beginning of its children list; I realloc a 12-byte-bigger memory space for each new node.
#pragma pack(2)
typedef struct trie_node {
    unsigned short data;
    unsigned short n_child;
    struct trie_node *children;
} Trie;
When I have to add a new child, I use:
this_node->n_child = this_node->n_child + 1;
this_node->children = (Trie *)realloc(this_node->children, this_node->n_child * sizeof(Trie));
I want to know why the memory use is bigger than calculated, and whether I can reduce it.
The problem here is that you are allocating very small chunks of data (struct trie_node is very small, roughly the size of the bookkeeping data the malloc() library needs for each allocation), and you call realloc() every time you add a single element. Consider this scenario: you have a chunk with, say, 10 nodes and you add one, reallocating from 10 to 11; but since you have allocated plenty of other arrays in between, the array cannot grow in place, so the memory manager has to leave a free hole with space for 10 nodes and allocate space for 11 elsewhere.
If you are lucky enough to have another chunk grow from 9 to 10 nodes, it can reuse that hole... but that is not normally the case: the hole is (partially) reused before anything else needs to grow from 9 to 10, and this causes heavy fragmentation in the dynamic memory area.
That fragmentation, together with the fact that you use a very small node structure, is what generates the overhead of 100% of the used memory.
You can alleviate this in several ways:
Don't realloc() by one. Grow by doubling, for example (see the sketch after this list). This has two advantages: first, the chunk sizes are only powers of two, so fragmentation is far lower, since a freed power-of-two-sized hole is far more likely to fit a later request when all requests are powers of two. Another good policy is to grow by the sum of the last two sizes (as in a Fibonacci series).
Add a pointer to the sibling and organize your tree as linked lists. This way all the chunks you allocate are the same size (memory managers do best when sizes are uniform). But be careful: you grow your structure by one pointer per node, so part of what you gain on one side goes out the other.
If you know in advance the average number of children a node will have, preallocating that average number is worthwhile, as it keeps the chunks near a fixed size, which the allocator manages better.
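As a sketch of the doubling strategy from the first point (add_child is a hypothetical helper; note that the capacity needs no extra field in the 12-byte node, since it is implied by n_child):
#include <stdlib.h>

/* Grow the children array by doubling. Capacity is the smallest
   power of two >= n_child, so realloc happens only when n_child
   crosses a power of two. */
static int add_child(Trie *node, unsigned short data)
{
    unsigned short n = node->n_child;
    if (n == 0 || (n & (n - 1)) == 0) {          /* capacity is full */
        size_t cap = n ? (size_t)n * 2 : 1;
        Trie *p = realloc(node->children, cap * sizeof(Trie));
        if (p == NULL)
            return -1;
        node->children = p;
    }
    node->children[n].data = data;
    node->children[n].n_child = 0;
    node->children[n].children = NULL;
    node->n_child = n + 1;
    return 0;
}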
Finally, there is one thing you cannot avoid: some overhead due to the metadata the memory manager keeps, apart from your data, to manage the heap properly. But the bigger your allocations are, the lower the relative loss, since memory managers normally need a fixed amount of metadata per allocation, independent of the allocation size.

Cannot get memory allocated from `flex_array_alloc` when requesting a relatively big size in linux kernel

I'm doing some linux kernel development.
And I'm going to allocate some memory space with something like:
ptr = flex_array_alloc(size=136B, num=1<<16, GFP_KERNEL)
And ptr turns out to be NULL every time I try.
What's more, when I change the size to 20B or num to 256, nothing goes wrong and the memory is obtained.
So I want to know whether there are limitations on requesting memory in Linux kernel modules, and how to debug this or how to allocate a big memory space.
Thanks.
And kzalloc shows similar behavior in my environment: requesting a 136B * (1<<16) space fails, while 20B or 1<<8 succeeds.
There are two limits on the size of an array allocated with flex_array_alloc. First, the object size itself must not exceed a single page, as indicated in https://www.kernel.org/doc/Documentation/flexible-arrays.txt:
The down sides are that the arrays cannot be indexed directly, individual object size cannot exceed the system page size, and putting data into a flexible array requires a copy operation.
Second, there is a maximum number of elements in the array.
Both limitations are the result of the implementation technique:
…the need for memory from vmalloc() can be eliminated by piecing together an array from smaller parts…
A flexible array holds an arbitrary (within limits) number of fixed-sized objects, accessed via an integer index.… Only single-page allocations are made…
The array is "pieced together" using an array of pointers to the individual parts, where each part is one system page. Since this pointer array is itself allocated, and only single-page allocations are made (as noted above), the maximum number of parts is slightly less than the number of pointers that fit in one page (slightly less because there is also some bookkeeping data). In effect, this limits the total size of a flexible array to about 2MB on systems with 8-byte pointers and 4KB pages. (The precise limit varies with the amount of wasted space in a page when the object size is not a power of two.)
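Running the rough numbers from the question against that limit (a back-of-envelope sketch assuming 4KB pages and 8-byte pointers; the real part limit is slightly lower because of bookkeeping):
#include <stdio.h>

int main(void)
{
    long page = 4096, ptrsz = 8;
    long max_parts   = page / ptrsz;   /* ~512 pointers fit in a page  */
    long per_part136 = page / 136;     /* 30 objects per one-page part */
    long per_part20  = page / 20;      /* 204 objects per part         */

    printf("136B x 65536: ~%ld parts needed, limit ~%ld\n",
           (65536L + per_part136 - 1) / per_part136, max_parts);
    printf("20B  x 65536: ~%ld parts needed\n",
           (65536L + per_part20 - 1) / per_part20);
    return 0;
}
So 1<<16 elements of 136 bytes would need roughly 2185 single-page parts, far beyond the roughly 500 that fit in the one-page pointer array, while the 20-byte case needs only about 322; that matches the observed behavior.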

One large malloc versus multiple smaller reallocs

Sorry if this has been asked before, I haven't been able to find just what I am looking for.
I am reading fields from a list and writing them to a block of memory. I could:
Walk the whole list, find the total needed size, do one malloc and THEN walk the list again and copy each field;
Walk the whole list and realloc the block of memory as I write the values;
Right now the first seems the most efficient to me (smallest number of calls). What are the pros and cons of either approach ?
Thank you for your time.
The first approach is almost always better. A realloc() typically works by copying the entire contents of the memory block into a freshly allocated, larger block. So n reallocs can mean n copies, each one larger than the last (if you are adding m bytes to your allocation each time, the first realloc has to copy m bytes, the next one 2m, the next 3m, ...).
The pedantic answer would be that the internal performance implications of realloc() are implementation-specific, not explicitly defined by the standard - in some implementation it could work by magic fairies that move the bytes around instantly - but in any realistic implementation, realloc() means a copy.
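Here is a minimal sketch of the first approach (the struct field layout is a hypothetical stand-in for the question's list):
#include <stdlib.h>
#include <string.h>

struct field {            /* hypothetical stand-in for the list nodes */
    const void *data;
    size_t len;
    struct field *next;
};

void *pack_fields(const struct field *head, size_t *out_len)
{
    size_t total = 0;
    for (const struct field *f = head; f; f = f->next)
        total += f->len;                    /* pass 1: size everything */

    char *block = malloc(total);
    if (block == NULL)
        return NULL;

    char *p = block;
    for (const struct field *f = head; f; f = f->next) {
        memcpy(p, f->data, f->len);         /* pass 2: one copy each   */
        p += f->len;
    }
    *out_len = total;
    return block;
}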
You're probably better off allocating a decent amount of space initially, based on what you think is the most likely maximum.
Then, if you find you need more space, don't just allocate enough for the extra, allocate a big chunk extra.
This will minimise the number of re-allocations while still only processing the list once.
By way of example, initially allocate 100K. If you then find you need more, re-allocate to 200K, even if you only need 101K.
Don't reinvent the wheel: use CCAN's darray, which implements an approach similar to what paxdiablo described. See darray on GitHub.

Efficient heap-manager for heavy churn, tiny allocs?

I'm looking for ideas for a heap-manager to handle a very specific situation: Lots and lots of very small allocations, ranging from 12 to 64 bytes each. Anything bigger, I will pass on to the regular heap-manager, so only tiny blocks need be catered for. Only 4-byte alignment is needed.
My main concerns are
Overhead. The regular libc heap typically rounds an allocation up to a multiple of 16 bytes, then adds a 16-byte header - that means over 50% overhead on a 20-byte allocation, which sucks.
Performance
One helpful aspect is that Lua (which is the user of this heap) will tell you the size of the block it's freeing when it calls free() - this may enable certain optimisations.
I'll post my current approach, which works ok, but I'd like to improve on it if at all possible. Any ideas?
It is possible to build a heap manager that is very efficient for objects that are all the same size. You could create one of these heaps for each size of object you need, or, if you don't mind using a bit of space, create one for 16-byte objects, one for 32, and one for 64. The maximum overhead would be 31 bytes, for a 33-byte allocation (which would go on the 64-byte-block heap).
To expand on what Greg Hewgill says, one way to do an ultra-efficient fixed-size heap is:
Split a big buffer into nodes. Node size must be at least sizeof(void*).
String them together into a singly-linked list (the "free list"), using the first sizeof(void*) bytes of each free node as a link pointer. Allocated nodes will not need a link pointer, so per-node overhead is 0.
Allocate by removing the head of the list and returning it (2 loads, 1 store).
Free by inserting at the head of the list (1 load, 2 stores).
Obviously step 3 also has to check whether the list is empty, and if so do a bunch of work to get a new big buffer (or fail).
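A minimal sketch of that free list (the names are mine; the point is that free nodes store the link in their own first bytes, so allocated nodes carry zero overhead):
#include <stddef.h>

struct fixed_pool {
    void *free_head;                /* head of the intrusive free list */
};

/* Split a big buffer into count nodes of node_size bytes each and
   string them together; node_size must be at least sizeof(void *). */
void pool_init(struct fixed_pool *p, void *buf, size_t node_size, size_t count)
{
    char *base = buf;
    p->free_head = NULL;
    for (size_t i = 0; i < count; i++) {
        void *node = base + i * node_size;
        *(void **)node = p->free_head;
        p->free_head = node;
    }
}

void *pool_alloc(struct fixed_pool *p)           /* 2 loads, 1 store  */
{
    void *node = p->free_head;
    if (node != NULL)
        p->free_head = *(void **)node;
    return node;
}

void pool_free(struct fixed_pool *p, void *node) /* 1 load, 2 stores */
{
    *(void **)node = p->free_head;
    p->free_head = node;
}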
Even more efficient, as Greg D and hazzen say, is to allocate by incrementing or decrementing a pointer (1 load, 1 store), and not offer a way to free a single node at all.
Edit: In both cases, free can deal with the complication "anything bigger I pass on to the regular heap-manager" thanks to the helpful fact that you get the size back in the call to free. Otherwise you would be looking at either a flag (overhead probably 4 bytes per node) or a lookup in some kind of record of the buffer(s) you have used.
The answer may depend on the lifetime patterns for these objects. If the objects are all instantiated as you proceed, and then all removed in one fell swoop, it may make sense to create a very simple heap manager that allocates memory by simply incrementing a pointer. Then, when you're done, blow away the entire heap.
Raymond Chen made an interesting post that may help to inspire you. :)
I like onebyone's answer.
You might also consider the buddy system for your sets of fixed-size heaps.
If a bunch of memory is allocated, used, and freed before moving on to the next round of allocation, I'd suggest using the simplest allocator possible:
#include <stdlib.h>

typedef struct _allocator {
    char *buffer;   /* char * so the pointer arithmetic is standard C */
    size_t start;
    size_t max;
} allocator;

void init_allocator(size_t size, allocator *alloc) {
    alloc->buffer = malloc(size);
    alloc->start = 0;
    alloc->max = size;
}

void *allocator_malloc(allocator *alloc, size_t amount) {
    if (alloc->max - alloc->start < amount) return NULL;  /* out of room */
    void *mem = alloc->buffer + alloc->start;
    alloc->start += amount;
    return mem;
}

void allocator_free(allocator *alloc) {
    alloc->start = 0;   /* blow away the whole heap at once */
}
I use a mostly O(1) Small Block Memory Manager (SBMM). Basically it works this way:
1) It allocates larger SuperBlocks from the OS and tracks their start and end addresses as a range. The SuperBlock size is adjustable, but 1MB makes a pretty good size.
2) The SuperBlocks are broken into Blocks (also adjustable in size; 4K-64K is good, depending on your app). Each Block handles allocations of one specific size and strings all the items in the Block into a singly linked list. When you allocate a SuperBlock, you make a linked list of free Blocks.
3) Allocating an Item means: A) checking whether there is a Block with free Items handling that size, and if not, allocating a new Block from the SuperBlocks; B) removing the Item from the Block's free list.
4) Freeing an Item by address means: A) finding the SuperBlock containing the address(*); B) finding the Block within the SuperBlock (subtract the SuperBlock start address and divide by the Block size); C) pushing the Item back onto the Block's free Item list.
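A sketch of the address arithmetic in step 4, with hypothetical names and sizes (the rangemap lookup for step A, and any per-Block metadata beyond the free list, are omitted):
#include <stddef.h>

#define SUPERBLOCK_SIZE (1u << 20)   /* 1MB, as suggested above      */
#define BLOCK_SIZE      (16u << 10)  /* 16K, within the 4K-64K range */

struct block {
    void *free_head;                 /* singly linked free item list */
};

struct superblock {
    char *start;                     /* start of the address range   */
    struct block blocks[SUPERBLOCK_SIZE / BLOCK_SIZE];
};

void sbmm_free(struct superblock *sb, void *ptr)
{
    size_t off = (size_t)((char *)ptr - sb->start);   /* B: subtract */
    struct block *b = &sb->blocks[off / BLOCK_SIZE];  /* B: divide   */
    *(void **)ptr = b->free_head;                     /* C: push     */
    b->free_head = ptr;
}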
As I stated, this SBMM is very fast, as it runs with O(1) performance(*). In the version I have implemented, I use an AtomicSList (similar to SLIST in Windows), so that it is not only O(1) but also thread-safe and lock-free. You could actually implement the algorithm using the Win32 SLIST if you wanted to.
Interestingly, the algorithms for allocating Blocks from the SuperBlocks and Items from the Blocks result in nearly identical code (they are both O(1) allocations off a free list).
(*) The SuperBlocks are arranged in a rangemap with O(1) average performance (but a potential O(lg N) worst case, where N is the number of SuperBlocks). The width of the rangemap depends on knowing roughly how much memory you are going to need in order to get the O(1) performance. If you overshoot, you waste a bit of memory but still get O(1) performance. If you undershoot, you approach O(lg N) performance, but N is the SuperBlock count, not the Item count. Since the SuperBlock count is very low compared to the Item count (by about 20 binary orders of magnitude in my code), it is not as critical as the rest of the allocator.
