I'm looking for ideas for a heap-manager to handle a very specific situation: Lots and lots of very small allocations, ranging from 12 to 64 bytes each. Anything bigger, I will pass on to the regular heap-manager, so only tiny blocks need be catered for. Only 4-byte alignment is needed.
My main concerns are
Overhead. The regular libc heap will typically round up an allocation to a multiple of 16 bytes, then add another 16 byte header - this means over 50% overhead on a 20-byte allocation, which sucks.
Performance
One helpful aspect is that Lua (which is the user of this heap) will tell you the size of the block it's freeing when it calls free() - this may enable certain optimisations.
I'll post my current approach, which works ok, but I'd like to improve on it if at all possible. Any ideas?
It is possible to build a heap manager that is very efficient for objects that are all the same size. You could create one of these heaps for each size of object that you need, or if you don't mind using a bit of space, create one for 16 byte objects, one for 32, and one for 64. The maximum overhead would be 31 bytes for a 33 byte allocation (which would go on the 64 blocksize heap).
To expand on what Greg Hewgill says, one way to do an ultra-efficient fixed-size heap is:
Split a big buffer into nodes. Node size must be at least sizeof(void*).
String them together into a singly-linked list (the "free list"), using the first sizeof(void*) bytes of each free node as a link pointer. Allocated nodes will not need a link pointer, so per-node overhead is 0.
Allocate by removing the head of the list and returning it (2 loads, 1 store).
Free by inserting at the head of the list (1 load, 2 stores).
Obviously step 3 also has to check if the list's empty, and if so do a bunch of work getting a new big buffer (or fail).
Even more efficient, as Greg D and hazzen say, is to allocate by incrementing or decrementing a pointer (1 load, 1 store), and not offer a way to free a single node at all.
Edit: In both cases, free can deal with the complication "anything bigger I pass on to the regular heap-manager" thanks to the helpful fact that you get the size back in the call to free. Otherwise you'd be looking at either a flag (overhead probably 4 bytes per node) or else a lookup in some kind of record of the buffer(s) you've used.
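A minimal sketch of that free-list scheme (the names pool_init, pool_alloc, pool_free and the single global list are illustrative assumptions, not from the original post):

#include <stddef.h>

typedef struct free_node { struct free_node *next; } free_node;

static free_node *free_list = NULL;

/* Carve a big buffer into fixed-size nodes and push them all onto the free
   list. node_size must be at least sizeof(void*). */
static void pool_init(void *buffer, size_t buf_size, size_t node_size) {
    char *p = buffer;
    for (size_t i = 0; i + node_size <= buf_size; i += node_size) {
        free_node *n = (free_node *)(p + i);
        n->next = free_list;
        free_list = n;
    }
}

/* Allocate: pop the head of the free list (2 loads, 1 store). */
static void *pool_alloc(void) {
    free_node *n = free_list;
    if (n == NULL) return NULL;   /* or grab a new big buffer here */
    free_list = n->next;
    return n;
}

/* Free: push the node back onto the head of the free list (1 load, 2 stores). */
static void pool_free(void *p) {
    free_node *n = p;
    n->next = free_list;
    free_list = n;
}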
The answer may depend on the lifetime patterns for these objects. If the objects are all instantiated as you proceed, and then all removed in one fell swoop, it may make sense to create a very simple heap manager that allocates memory by simply incrementing a pointer. Then, when you're done, blow away the entire heap.
Raymond Chen made an interesting post that may help to inspire you. :)
I like onebyone's answer.
You might also consider the buddy system for your sets of fixed-size heaps.
If a bunch of memory is allocated, used, and freed before moving on to the next round of allocation, I'd suggest using the simplest allocator possible:
#include <stdlib.h>

typedef struct _allocator {
    void*  buffer;
    size_t start;
    size_t max;
} allocator;

void init_allocator(size_t size, allocator* alloc) {
    alloc->buffer = malloc(size);   /* check for NULL in real code */
    alloc->start = 0;
    alloc->max = size;
}

void* allocator_malloc(allocator* alloc, size_t amount) {
    /* round amount up here if you need alignment, e.g. to a multiple of 4 */
    if (alloc->max - alloc->start < amount) return NULL;
    void* mem = (char*)alloc->buffer + alloc->start;
    alloc->start += amount;
    return mem;
}

/* Resets the whole arena; individual blocks cannot be freed. */
void allocator_free(allocator* alloc) {
    alloc->start = 0;
}
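For illustration, a typical round-based use of this arena might look like the following (a sketch assuming the functions above):

allocator a;
init_allocator(1024 * 1024, &a);           /* one 1 MB arena */
for (int round = 0; round < 10; round++) {
    void* x = allocator_malloc(&a, 64);    /* many small allocations... */
    void* y = allocator_malloc(&a, 20);
    /* ... use x and y ... */
    allocator_free(&a);                    /* ...then reset the whole arena */
}
free(a.buffer);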
I use a mostly O(1) Small Block Memory Manager (SBMM). Basically it works this way:
1) It allocates larger SuperBlocks from the OS and tracks the Start+End Addresses as a range. The size of the SuperBlock is adjustable but 1MB makes a pretty good size.
2) The SuperBlocks are broken into Blocks (also adjustable in size... 4K-64K is good depending on your app). Each of these Blocks handles allocations of a specific size and stores all the items in the Block as a singly linked list. When you allocate a SuperBlock, you make a linked list of Free Blocks.
3) Allocating an Item means A) Checking to see if there is a Block with Free Items handling that size - and if not, allocating a new Block from the SuperBlocks. B) Removing the Item from the Block's Free List.
4) Freeing an Item by address means A) Finding the SuperBlock containing the address(*) B) Finding the Block in the SuperBlock (subtract the SuperBlock start address and divide by the Block size) C) Pushing the Item back onto the Block's Free Item list.
As I stated, this SBMM is very fast, as it runs with O(1) performance(*). In the version I have implemented, I use an AtomicSList (similar to SLIST in Windows), so that it is not only O(1) but also thread-safe and lock-free. You could actually implement the algorithm using the Win32 SLIST if you wanted to.
Interestingly, the algorithms for allocating Blocks from the SuperBlocks and Items from the Blocks result in nearly identical code (they're both O(1) allocations off a Free List).
(*) The SuperBlocks are arranged in a rangemap with O(1) average performance (but a potential O(lg N) worst case, where N is the number of SuperBlocks). The width of the rangemap depends on knowing roughly how much memory you're going to need in order to get the O(1) performance. If you overshoot, you'll waste a bit of memory but still get O(1) performance. If you undershoot, you'll approach O(lg N) performance, but the N is the SuperBlock count -- not the Item count. Since the SuperBlock count is very low compared to the Item count (by about 20 binary orders of magnitude in my code), it is not as critical as the rest of the allocator.
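As a rough sketch of the free path in step 4, with all type and field names assumed for illustration (the rangemap lookup of step 4A is omitted; the containing SuperBlock is taken as given):

#include <stddef.h>

typedef struct FreeItem { struct FreeItem *next; } FreeItem;
typedef struct Block { FreeItem *free_list; } Block;
typedef struct SuperBlock {
    char  *start;        /* start address of the SuperBlock's memory */
    size_t block_size;   /* size of each Block, e.g. 4K-64K */
    Block *blocks;       /* one descriptor per Block in the SuperBlock */
} SuperBlock;

/* Step 4: free an Item by address, given its SuperBlock (found in 4A). */
void sbmm_free(SuperBlock *sb, void *ptr) {
    size_t offset  = (size_t)((char *)ptr - sb->start);
    Block *blk     = &sb->blocks[offset / sb->block_size];  /* 4B */
    FreeItem *item = ptr;                                   /* 4C */
    item->next     = blk->free_list;
    blk->free_list = item;
}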
You need to implement a memory manager in C with the following 3 functions:
void init() - initialize the memory manager.
void* get(int numOfBytes) - return a memory block (on the heap) of size "numOfBytes". The value of "numOfBytes" can be in the range [1,8k].
void free(void* ptr) - free the memory block pointed by "ptr".
A few rules:
You can call the malloc function only in the "init()" function.
The methods "get" and "free" should be as efficient as possible, but the method "init" doesn't have to be, as long as you don't waste too much memory or anything like that.
You can assume your memory manager will not need to allocate more than some fixed number of bytes in total, say no more than 1 GB.
My attempt:
I thought of just implementing a fixed-size memory pool where each block is 8k bytes, like in here. This will give us O(1) run time for the methods "get" and "free", which is great, but the problem is that we waste a lot of memory that way if the user only calls "get" for a small number of bytes (say, 1 byte each time).
But if I try to implement it with variable block sizes, I'll need to handle fragmentation, which will make the run time worse.
So do you have a better idea?
I'd avoid a fixed block size.
A common strategy is to form pools at powers of 2: 16, 32, ... 1G, with everything initially in the largest pool.
Each block allocated is the user size n plus overhead (est. 4-8 bytes), rounded up ("ceiling") to a power of 2.
If a pool lacks an available block, cut a larger one in half.
As similar allocation sizes tend to occur in groups, this avoids excess size waste.
De-allocation only requires checking whether the freed block's paired "buddy" is also free; if so, they re-form the larger block (which may in turn re-join its own buddy), reducing fragmentation.
Note: all *alloc() functions return a pointer suitably aligned for max_align_t, so that is the lower bound expected of get() likewise (maybe relaxed to 4-byte alignment here?). As part of an interview, mentioning alignment and portability concerns is good.
There are various improvements, like accommodating exact power-of-2-sized requests well, yet for an interview question you only need to touch on such ideas.
free() is a standard library function - best not to redefine it - use a different name.
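To make the size-to-pool mapping concrete, here is a sketch (the overhead constant and the smallest pool size are assumptions):

#include <stddef.h>

#define OVERHEAD 8                /* assumed per-block header size */

/* Map a request of n bytes to a pool index: pool k holds 2^k-byte blocks. */
static int pool_index(size_t n) {
    size_t need = n + OVERHEAD;
    int k = 4;                    /* assumed smallest pool: 2^4 = 16 bytes */
    while (((size_t)1 << k) < need)
        k++;
    return k;
}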
I use a C struct to build a prefix tree, i.e. a Trie. I want to use it to store a lot of number lists, to get the next possible number in a given path. (For example, if I have [10, 15, 17] and [10, 15, 18], then the next possible number after [10, 15] is 17 or 18.) The problem I have is memory use: each of my struct Trie nodes takes only 12 bytes (I used sizeof() to check it), and I have 0.83 billion nodes in total, which should take 0.83 billion * 12 bytes = 10G of memory, but I actually use 20G, and I want to reduce the memory use to 10G.
I use an unsigned short to store the data for each node, an unsigned short n_child to store how many children this node has, and a pointer to the beginning of its children list, and I realloc a 12-bytes-bigger memory space for each new node.
#pragma pack(2)
typedef struct trie_node{
unsigned short data;
unsigned short n_child;
struct trie_node * children;
} Trie;
When I have to add a new child, I use:
this_node->n_child = this_node->n_child + 1;
this_node->children = (Trie *)realloc(this_node->children, this_node->n_child * sizeof(Trie));
I want to know why the memory use is bigger than calculated, and whether I could reduce it.
The problem here is that you are allocating very small chunks of data (the size of struct trie_node is very small, about the size of the data the malloc() library itself needs to manage each allocation), and you call realloc() each time you add a single element. Consider this scenario: you have a chunk with, let's say, 10 nodes and you add one, reallocating from 10 to 11; but as you have allocated plenty of different arrays, your array doesn't fit in the hole it currently occupies, so the memory manager has to leave a free hole with space for 10 nodes and allocate another with space for 11 (elsewhere).
If you are lucky enough to have another chunk with 9 nodes growing to 10, you can reuse the last hole... but that's not normally the case; that hole is usually reused (partially) before there's another need to grow from 9 to 10, and this causes heavy fragmentation in the dynamic memory area.
The fragmentation, together with the fact that you use a very small node structure, is generating this overhead of 100% of the used memory.
You can alleviate this in several ways:
Don't realloc() by one; grow by doubling, for example (see the sketch below). This has two advantages: first, the chunk sizes are only powers of two, so fragmentation is far lower, because a freed power-of-two hole is far more likely to fit a later power-of-two request; second, you call realloc() only O(log n) times instead of O(n) times. Another good policy is to grow by adding the last two used sizes (as in a Fibonacci series).
Add a pointer to the sibling and organise your tree as linked lists. This way you allocate all chunks the same size (memory managers work best when the sizes are all the same). But be careful: you grow your structure by one pointer, so part of what you gain on one side goes out the other.
If you know in advance the average number of children a node will have, preallocating that average number will be interesting, as it will make the chunks close to this fixed size, so the manager will cope better.
Finally, there's one thing you cannot avoid: some overhead, due to the metadata the memory manager needs, apart from your data, to manage the heap properly. But the bigger your allocations are, the lower the losses, as memory managers normally need a fixed amount of metadata per allocation, independent of the allocation size.
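A minimal sketch of the doubling suggestion, reusing the struct from the question; growing only when n_child is 0 or a power of two means the capacity (always the next power of two) never needs its own field (add_child is a hypothetical helper):

#include <stdlib.h>

/* Grow the children array only when it is full, i.e. when n_child is 0 or a
   power of two, so the capacity never has to be stored. */
void add_child(Trie *node, Trie child) {
    unsigned short n = node->n_child;
    if (n == 0 || (n & (n - 1)) == 0) {   /* full: capacity == n */
        size_t new_cap = (n == 0) ? 1 : 2 * (size_t)n;
        Trie *p = realloc(node->children, new_cap * sizeof(Trie));
        if (p == NULL)
            return;                       /* allocation failure: node unchanged */
        node->children = p;
    }
    node->children[n] = child;
    node->n_child = n + 1;
}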
I'm trying to write my own malloc and free implementation for the sake of learning, with just mmap and munmap (since brk and sbrk are obsolete). I've read a fair amount of documentation on the subject, but every example I see either uses sbrk or doesn't explain very well how to handle large zones of mapped memory.
The idea of what I'm trying to write is this: I first map a big zone (i.e. 512 pages); this zone will contain all allocations between 1 and 992 bytes, in 16-byte increments. I'll do the same later with a 4096-page zone for bigger allocations (or mmap directly if the requested size is bigger than a page). So I need a way to store information about every chunk that I allocate or free.
My question is, how do I store this information properly?
My concerns are: if I create a linked list, how do I allocate more space for each node? Or do I need to copy it to the mapped zone? If so, how can I juggle between data space and reserved space? Or is it better to use a statically sized array? The problem with that is that my zone's size depends on the page size.
There are several possible implementations for a mmap-based malloc:
Sequential (first-fit, best-fit).
Idea: Use a linked list with the last chunk sized to the remaining size of your page.
struct chunk
{
    size_t size;
    struct chunk *next;
    int is_free;
};
To allocate
Iterate your list for a suitable free chunk (optimizable)
If nothing's found, resize the last chunk to the required size and create a free chunk of the remaining size.
If you reach the end of the page (the remaining size is too small and next is NULL), simply mmap a new page (optimisable: map a custom-sized page if the request is abnormal ...)
To free, it's even simpler: simply set is_free to 1. Optionally, you can check if the next chunk is also free and merge both into a bigger free chunk (watch out for page borders).
Pros: Easy to implement, trivial to understand, simple to tweak.
Cons: not very efficient (you may iterate your whole list to find a block), needs lots of optimisation, hectic memory organisation
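The "iterate your list" step might look like this sketch, using the struct above (splitting and page handling omitted):

/* First fit: walk the chunk list for a free chunk that is big enough. */
static struct chunk *find_free(struct chunk *head, size_t size) {
    for (struct chunk *c = head; c != NULL; c = c->next)
        if (c->is_free && c->size >= size)
            return c;
    return NULL;   /* caller resizes the last chunk or mmaps a new page */
}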
Binary buddies (I love binary arithmetic and recursion)
Idea: Use powers-of-2 as size units:
struct chunk
{
    size_t size;
    int is_free;
};
the structure here does not need a next as you'll see.
The principle is the following:
You have a 4096-byte page; that is (minus 16 for metadata) 4080 usable bytes.
To allocate a small block, simply split up the page in two 2048-byte chunks, and split the first half again into 1024-byte chunks... until you get a suitably small usable space (the minimum being 32 bytes (16 usable)).
Every block, if it isn't a full page, has a buddy.
You end up with a tree-like structure of paired power-of-two blocks.
To access your buddy, use a binary XOR between your pointer and your block size.
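In code, the buddy computation is a single XOR; a sketch, relying on the page being page-aligned and every block being naturally aligned to its own power-of-two size:

#include <stdint.h>

/* The buddy of a block sits at the block's address XOR-ed with its size. */
static struct chunk *buddy_of(struct chunk *c) {
    return (struct chunk *)((uintptr_t)c ^ (uintptr_t)c->size);
}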
Implementation:
Allocating a block of size Size
Get the required block_size = 2^k >= size + sizeof(struct chunk), the smallest such power of two
Find the smallest free block in the tree that fits block_size
If it can be split smaller, split it, recursively.
Freeing a block
Set is_free to 1
Check whether your buddy is free (XOR your address with your size; don't forget to verify the buddy is the same size as you)
If it is, merge: double the size, and recurse (see the sketch below).
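A sketch of that recursive merge, assuming buddy_of from above, a 4096-byte page, and chunk sizes that include the metadata:

#define PAGE_SIZE 4096   /* assumed page size */

/* Free and coalesce: keep merging with the buddy while it is free and the
   same size, stopping once the block covers a full page. */
static void buddy_free(struct chunk *c) {
    c->is_free = 1;
    while (c->size < PAGE_SIZE) {
        struct chunk *b = buddy_of(c);
        if (!b->is_free || b->size != c->size)
            break;
        if (b < c)
            c = b;        /* the merged block starts at the lower address */
        c->size *= 2;
    }
}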
Pros: Extremely fast and memory-efficient, clean.
Cons: Complicated, a few tricky cases (page borders and buddy sizes)
Need to keep a list of your pages
Buckets (I have a lot of time to lose)
This is the only method of the three I have not attempted to implement myself, so I can only speak of the theory:
struct bucket
{
size_t buck_num; //number of data segments
size_t buck_size; //size of a data segment
void *page;
void *freeinfo;
}
You have from the start a few small pages, each split into blocks of a constant size (one page of 8-byte blocks, one of 16-byte blocks, one of 32-byte blocks, and so on)
The "freedom information" for those data buckets is stored in bitsets (structures representing a large set of ints), either at the start of each page or in a separate memory zone.
For example, for 512-byte buckets in a 4096-byte page, the bitset representing the page would be an 8-bit bitset;
supposing *freeinfo = 01001000, this would mean the second and fifth buckets are free.
Pros: By far the fastest and cleanest over the long run,
Most efficient on many small allocations
Cons: Very cumbersome to implement, quite heavy for a small program, need for a separate memory space for bitsets.
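Finding a free bucket in such a bitset is a single bit scan; a sketch using POSIX ffs(), treating a set bit as "free" as in the example above:

#include <strings.h>   /* ffs() is POSIX */

/* Return the 0-based index of a free bucket, or -1 if the page is full. */
static int find_free_bucket(unsigned freeinfo) {
    return ffs((int)freeinfo) - 1;   /* ffs() returns 0 when no bit is set */
}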
There are probably other algorithms and implementations, but those three are the most used, so I hope you can get a lead on what you want to do from this.
Sorry if this has been asked before, I haven't been able to find just what I am looking for.
I am reading fields from a list and writing them to a block of memory. I could
Walk the whole list, find the total needed size, do one malloc and THEN walk the list again and copy each field;
Walk the whole list and realloc the block of memory as I write the values;
Right now the first seems the most efficient to me (smallest number of calls). What are the pros and cons of either approach?
Thank you for your time.
The first approach is almost always better. A realloc() typically works by copying the entire contents of a memory block into the freshly allocated, larger block. So n reallocs can mean n copies, each one larger than the last. (If you are adding m bytes to your allocation each time, then the first realloc has to copy m bytes, the next one 2m, the next 3m, ..., for a total of m(1 + 2 + ... + n), i.e. O(n^2 * m) bytes copied.)
The pedantic answer would be that the internal performance implications of realloc() are implementation specific, not explicitly defined by the standard, in some implementation it could work by magic fairies that move the bytes around instantly, etc etc etc - but in any realistic implementation, realloc() means a copy.
You're probably better off allocating a decent amount of space initially, based on what you think is the most likely maximum.
Then, if you find you need more space, don't just allocate enough for the extra, allocate a big chunk extra.
This will minimise the number of re-allocations while still only processing the list once.
By way of example, initially allocate 100K. If you then find you need more, re-allocate to 200K, even if you only need 101K.
Don't reinvent the wheel; use CCAN's darray, which implements an approach similar to what paxdiablo described. See darray on GitHub.
So the title is somewhat misleading... I'll keep this simple: I'm comparing these two data structures:
An array, whereby it starts at size 1, and for each subsequent addition there is a realloc() call to expand the memory, after which the new (malloced) element is appended at position n-1.
A linked list, whereby I keep track of the head, tail, and size. And addition involves mallocing for a new element and updating the tail pointer and size.
Don't worry about any of the other details of these data structures. This is the only functionality I'm concerned with for this testing.
In theory, the LL should perform better. However, they're nearly identical in time tests involving 10, 100, 1000... up to 5,000,000 elements.
My gut feeling is that the heap is large. I think the data segment defaults to 10 MB on Redhat? I could be wrong. Anyway, realloc() is first checking to see if space is available at the end of the already-allocated contiguous memory location (0-[n-1]). If the n-th position is available, there is not a relocation of the elements. Instead, realloc() just reserves the old space + the immediately following space. I'm having a hard time finding evidence of this, and I'm having a harder time proving that this array should, in practice, perform worse than the LL.
Here is some further analysis, after reading posts below:
[Update #1]
I've modified the code to have a separate list that mallocs memory every 50th iteration for both the LL and the Array. For 1 million additions to the array, there are almost consistently 18 moves. There's no concept of moving for the LL. I've done a time comparison, they're still nearly identical. Here's some output for 10 million additions:
(Array)
time ./a.out a 10,000,000
real 0m31.266s
user 0m4.482s
sys 0m1.493s
(LL)
time ./a.out l 10,000,000
real 0m31.057s
user 0m4.696s
sys 0m1.297s
I would expect the times to be drastically different with 18 moves. The array addition requires one more assignment and one more comparison, to get and check the return value of realloc and detect whether a move occurred.
[Update #2]
I ran an ltrace on the testing that I posted above, and I think this is an interesting result... It looks like realloc (or some memory manager) is preemptively moving the array to larger contiguous locations based on the current size.
For 500 iterations, a memory move was triggered on iterations:
1, 2, 4, 7, 11, 18, 28, 43, 66, 101, 154, 235, 358
Which is pretty close to a summation sequence. I find this to be pretty interesting - thought I'd post it.
You're right, realloc will just increase the size of the allocated block unless it is prevented from doing so. In a real-world scenario you will most likely have other objects allocated on the heap in between subsequent additions to the list; in that case realloc will have to allocate a completely new chunk of memory and copy the elements already in the list.
Try allocating another object on the heap using malloc for every ten insertions or so, and see if they still perform the same.
So you're testing how quickly you can expand an array versus a linked list?
In both cases you're calling a memory allocation function. Generally memory allocation functions grab a chunk of memory (perhaps a page) from the operating system, then divide that up into smaller pieces as required by your application.
The other assumption is that, from time to time, realloc() will spit the dummy and allocate a large chunk of memory elsewhere because it could not get contiguous chunks within the currently allocated page. If you're not making any other calls to memory allocation functions in between your list expand then this won't happen. And perhaps your operating system's use of virtual memory means that your program heap is expanding contiguously regardless of where the physical pages are coming from. In which case the performance will be identical to a bunch of malloc() calls.
Expect performance to change where you mix up malloc() and realloc() calls.
Assuming your linked list is a pointer to the first element, if you want to add an element to the end, you must first walk the list. This is an O(n) operation.
Assuming realloc has to move the array to a new location, it must traverse the array to copy it. This is an O(n) operation.
In terms of complexity, both operations are equal. However, as others have pointed out, realloc may be avoiding relocating the array, in which case adding the element to the array is O(1). Others have also pointed out that the vast majority of your program's time is probably spent in malloc/realloc, which both implementations call once per addition.
Finally, another reason the array is probably faster is cache locality and the generally high performance of linear copies. Jumping around to erratic addresses with significant gaps between them (both the larger elements and the malloc bookkeeping) is not usually as fast as doing a bulk copy of the same volume of data.
The performance of an array-based solution expanded with realloc() will depend on your strategy for creating more space.
If you increase the amount of space by adding a fixed amount of storage on each re-allocation, you'll end up with an expansion that, on average, depends on the number of elements you have stored in the array. This is on the assumption that realloc will need to (occasionally) allocate space elsewhere and copy the contents, rather than just expanding the existing allocation.
If you increase the amount of space by adding a proportion of your current number of elements (doubling is pretty standard), you'll end up with an expansion that, on average, takes constant time.
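For a concrete comparison, suppose the array grows from 1K to 1M bytes. With a fixed 1K increment, that is about 1000 re-allocations copying roughly 1K + 2K + ... + 1M, about 500M bytes in the worst case; with doubling, it is 10 re-allocations copying at most 1K + 2K + ... + 512K, under 1M bytes in total, i.e. constant amortized cost per appended byte.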
This is not a real life situation. Presumably, in real life, you are interested in looking at or even removing items from your data structures as well as adding them.
If you allow removal, but only from the head, the linked list becomes better than the array because removing an item is trivial and, if instead of freeing the removed item, you put it on a free list to be recycled, you can eliminate a lot of the mallocs needed when you add items to the list.
On the other hand, if you need random access to the structure, an array clearly beats the linked list.
(Updated.)
As others have noted, if there are no other allocations in between reallocs, then no copying is needed. Also as others have noted, the risk of memory copying lessens (but so does its impact, of course) for very small blocks, smaller than a page.
Also, if all you ever do in your test is to allocate new memory space, I am not very surprised you see little difference, since the syscalls to allocate memory are probably taking most of the time.
Instead, choose your data structures depending on how you want to actually use them. A framebuffer is for instance probably best represented by a contiguous array.
A linked list is probably better if you have to reorganise or sort data within the structure quickly.
Then these operations will be more or less efficient depending on what you want to do.
(Thanks for the comments below, I was initially confused myself about how these things work.)
What's the basis of your theory that the linked list should perform better for insertions at the end? I would not expect it to, for exactly the reason you stated. realloc will only copy when it has to in order to maintain contiguity; in other cases it may have to combine free chunks and/or increase the chunk size.
However, every linked list node requires a fresh allocation and (assuming a doubly linked list) two writes. If you want evidence of how realloc works, you can just compare the pointer before and after realloc (see the sketch below). You should find that it usually doesn't change.
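A quick sketch of that experiment (strictly speaking, comparing the old pointer value after a successful realloc is undefined behaviour, but it works fine as a practical measurement):

#include <stdio.h>
#include <stdlib.h>

/* Count how often realloc() actually moves the block while an array grows
   one element at a time. */
int main(void) {
    size_t moves = 0;
    int *arr = NULL;
    for (size_t n = 1; n <= 1000000; n++) {
        int *old = arr;
        int *p = realloc(arr, n * sizeof *arr);
        if (p == NULL) { free(arr); return 1; }
        if (old != NULL && p != old)
            moves++;
        arr = p;
        arr[n - 1] = (int)n;
    }
    printf("realloc moved the block %zu times\n", moves);
    free(arr);
    return 0;
}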
I suspect that since you're calling realloc for every element (obviously not wise in production), the realloc/malloc call itself is the biggest bottleneck for both tests, even though realloc often doesn't provide a new pointer.
Also, you're confusing the heap and data segment. The heap is where malloced memory lives. The data segment is for global and static variables.