(Dynamic Memory Allocation) why maximize peak utilization?

(Dynamic Memory Allocation) why maximize peak utilization? - c

I was reading a textbook which says that:
I'm totally lost, let's says:
n = 10 and p (required payload) = 800 bytes,
Does it mean that on the n = 9 which is 9th allocate request, P9 needs to be 792 bytes( suppose a single minimum allocate is 8 bytes)? Is my understanding correct?

As far as I understand the text, that allocator objective is to maximize Pi (sum of allocated memory at instant i). The peak utilization up to k is the ratio of the max of what could be allocated divided by the heap size at k.
Since there is a number of alloc and free at i, if the allocator is too basic and does not handle well the requests, it might be unable to answer another allocation request (for instance due to heap fragmentation, see exemple below).
A smart allocator might allow a maximal payload, at the expense of a slower response.
On the opposite you might have a fast allocator that might not be able to maximize the aggregate payload Pk after a number of requests.
To give a (simple) exemple, having that chain of requests
R1: alloc(1000)
R2: alloc(2000)
R3: alloc(1500)
R4: free(R1)
R5: free(R2)
R6: alloc(3000) => use space from R1+R2?
At R6 a basic allocator might not be able to understand it could reuse the space freed from R1 and R2, giving a low peak ratio, and the heap size is unnecessary higher than it should.
A smarter one might do, but at the likely expense of more CPU /resources.

The assumption behind this is that memory is a precious commodity, and we should try to avoid wasting it. If peak utilization is low, it means there's lots of memory reserved for the heap that isn't being used.
Peak utilization doesn't refer to any specific allocation. Pi is the total payloads from i successive requests. So it doesn't make sense to refer to a single p = 800 bytes in this analysis. maxi≤kPi is the peak of a graph of Pi from request 0 to k. Each allocation adds to the heap utilization, although frees will reduce it.
You need enough heap memory to satisfy the maximum amount of memory you'll ever allocate at simultaneously, but if you have much more then the rest is being wasted. Maximizing peak utilization means to try to find this sweet spot.

Related

Memory manager implementation in C for memory blocks of 1 byte to 8k bytes - interview question

You need to implement a memory manager in C with the following 3 functions:
void init() - initialize the memory manager.
void* get(int numOfBytes) - return a memory block (on the heap) of size "numOfBytes". The value of "numOfBytes" can be in the range [1,8k].
void free(void* ptr) - free the memory block pointed by "ptr".
Few rules:
You can call for malloc function only in the "init()" function.
The methods "get" and "free" should be as efficient as possible, but the method "init" doesn't have to be as long as you don't waste too much memory or something like that.
You can assume your memory manager will not need to allocate more than some fixed size number of bytes in total, say no more than 1GB at total.
My attempt:
I thought of just implementing fixed size memory pool where each block is 8k bytes, like in here. This will give us O(1) run time for methods "get" and "free" which is great, but the problem is that we are wasting a lot of memoy like that if the user only calls "get" for small number of bytes (say, 1 byte each time).
But if I try to implement it with variable block sizes - I'll need to handle fragmentation which will make the run time worse.
So do you have a better idea?

I'd avoid a fixed size block.
A common strategy is to form pools at power-of-2: 16,32,...1G, with everything initially in the largest pool.
Each block allocated is the user size n + overhead (est. 4-8 bytes) "ceiling" up to a power-of-2.
If a pool lacks an available block, cut a larger one in half.
As similar allocation sizes tend to occur in groups, this avoids excess size waste.
On de-allocation (and collapsing for reuse) only requires free'ing of a pair'd block to re-form the larger block (which may in turn re-join another block) and reduce fragmentation.
Note: All *alloc() return a pointer OK to align max_align_t, thus that is the lower bound expected likewise for get() - (maybe size 4?). As part of an interview, mentioning alignment and portability concerns is good.
There are various improves like well accommodating power-of-2 size blocks, yet for an interview question, only need to touch on such improvement ideas.
free() is a standard lib. function - best to not redefine - use a different name.

How much time to initialize an array to 0?

I am confused whether
int arr[n]={0} takes constant time i.e. O(1), or O(n)?

You should expect O(N) time, but there are caveats:
It is O(1) if memory occupied by array is smaller than the word size. (it may be O(1) all the way to the cache line size on modern CPUs)
It is O(N) if array fits within a single tier in the memory.
It is complicated if array pushes through the tiers: There are multiple tiers on all modern computers (registers, L0 cache, L1 cache, L3 cache?, NUMA on multi-CPU machines, Virtual memory(mapping to swap), ...). If array can't fit in one - there will be a severe performance penalty.
CPU cache architecture impacts the time needed to zero out memory quite severely. In practice calling it O(N) is somewhat misleading given that going from 100 to 101 may increase time 10x if it falls on a cache boundary (line or whole). It may be even more dramatic if swapping is involved. Beware of the tiered memory model...

Generally initialization to zero of non-static storage is linear in the size of the storage. If you are reusing the array, it will need to be zero'ed each time. There are computer architectures that try to make this free via maintaining bit masks on pages or cache lines and returning zeroes at some point in the cache fill or load unit machinery. These are somewhat rare, but the effect can often be replicated in software if performance is paramount.
Arguably zeroed static storage is free but in practice the pages to back it up will be faulted in and there will be some cost to zero them on most architectures.
(One can end up in situations where the cost of the faults to provide zero filled pages is quite noticeable. E.g. repeated malloc/free of blocks bigger than some threshold can result in the address space backing the allocation being returned to the OS at each deallocation. The OS then has to zero it for security reasons even though malloc isn't guaranteed to return zero filled storage. Worse case the program then writes zeroes into the same block after it is returned from malloc so it ends up being twice the cost.)
For cases where large arrays of mostly zeroes will be accessed in a sparse fashion, the zero fill on demand behavior mentioned above can reduce the cost to linear in the number of pages actually used, not the total size. Arranging to do this usually requires using mmap or similar directly, not just allocating an array in C and zero initializing it.

Memory allocator for small chunks (Typ <= 16 bytes, Rare >= 64 Bytes, Max = 192) with static heap

This allocator will be used inside an embedded system with static memory (ie, no system heap available, so the 'heap' will simply be 'char heap[4096]')
There seems to be lots of "small memory allocators" around, but I'm looking for one that handles REALLY small allocations. I'm talking typical sizes of 16 bytes with small CPU use and smaller memory use.
Considering the typical allocation sizes are <= 16 bytes, Rare allocations being <= 64 bytes, and the "one in a million" allocations being upto 192 bytes, I would thinking of simply chopping those 4096 bytes into 255 pages of 16 bytes each and having a bitmap and "next free chunk" pointer. So rather than searching, if the memory is available, the appropriate chunks are marked and the function returns the pointer. Only once the end is reached would it go searching for an appropriate slot of required size. Due to the nature of the system, earlier blocks 'should' have been released by the time the 'Next free chunk' arrives at the end of the 'heap'.
So,
Does anyone know something like this already exists?
If not, can anyone poke holes in my theory?
Or, can they suggest something better?
C only, no C++. (Entire application must fit into <= 64KB, and there's about 40K of graphics so far...)

OP: can anyone poke holes in my theory?
In reading the first half, I thought out a solution using a bit array to record usage and came up with effectively the same thing you outline in the 2nd half.
So here is the hole: avoid hard coding a 16-bite block. Allow your bit map to work with, say 20 or 24 byte blocks at the beginning of you development. During this time, you may want to put tag information and sentinels on the edges of the block. Thus you can more readily track down double free(), usage outside allocation, etc. Of course, the price is a smaller effective pool.
After your debug stage, go with your 16-byte solution with confidence.
Be sure to keep track of 0 <= total allocation <= (2048 - overhead) and allow a check of it versus your bitmap.
For debug, consider filling a freed block with "0xDEAD", etc. to help force inadvertent free usage errors.

What are the differences between (and reasons to choose) tcmalloc/jemalloc and memory pools?

tcmalloc/jemalloc are improved memory allocators, and memory pool is also introduced for better memory allocation. So what are the differences between them and how to choose them in my application?

It depends upon requirement of your program. If your program has more dynamic memory allocations, then you
need to choose a memory allocator, from available allocators, which would generate most optimal performance
out of your program.
For good memory management you need to meet the following requirements at minimum:
Check if your system has enough memory to process data.
Are you albe to allocate from the available memory ?
Returning the used memory / deallocated memory to the pool (program or operating system)
The ability of a good memory manager can be tested on basis of (at the bare minimum) its efficiency in retriving / allocating and
returning / dellaocating memory. (There are many more conditions like cache locality, managing overhead, VM environments, small or large
environments, threaded environment etc..)
With respect to tcmalloc and jemalloc there are many people who have done comparisions. With reference to one of the
comparisions:
http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/
tcmalloc scores over all other in terms of CPU cycles per allocation if the number of threads are less.
jemalloc is very close to tcmalloc but better than ptmalloc (std glibc implementation).
In terms of memory overhead jemalloc is the best, seconded by ptmalloc, followed by tcmalloc.
Overall it can be said that jemalloc scores over others. You can also read more about jemalloc here:
https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919
I have just quoted from tests done and published by other people and have not tested it myself. I hope
this could be a good starting point for you and use it to test and select the most optimal for
your application.

Summary from this doc
Tcmalloc
tcmalloc is a memory management library open sourced by Google as an alternative to glibc malloc. It has been used in well-known software such as chrome and safari. According to the official test report, ptmalloc takes about 300 nanoseconds to execute malloc and free on a 2.8GHz P4 machine (for small objects). The TCMalloc version takes about 50 nanoseconds for the same operation.
Small object allocation
tcmalloc allocates a thread-local ThreadCache for each thread. Small memory is allocated from ThreadCache. In addition, there is a central heap (CentralCache). When ThreadCache is not enough, it will get space from CentralCache and put it in ThreadCache.
Small objects (<=32K) are allocated from ThreadCache, and large objects are allocated from CentralCache. The space allocated by large objects is aligned with 4k pages, and multiple pages can also be cut into multiple small objects and divided into ThreadCache
CentralCache allocation management
Large objects (>32K) are first aligned with 4k and then allocated from CentralCache.
When there is no free space in the page linked list of the best fit, the page space is always larger. If all 256 linked lists are traversed, the allocation is still not successful. Use sbrk, mmap,/dev/mem to allocate from the system.
The contiguous pages managed by tcmalloc PageHeap are called span. If span is not allocated, span is a linked list element in PageHeap.
Recycle
When an object is free, the page number is calculated according to the address alignment, and then the corresponding span is found through the central array.
If it is a small object, span will tell us its size class, and then insert the object into the ThreadCache of the current thread. If ThreadCache exceeds a budget value (default 2MB) at this time, the garbage collection mechanism will be used to move unused objects from ThreadCache to CentralCache's central free lists.
If it is a large object, span will tell us the page number range where the object is locked. Assuming that this range is [p,q], first search for the span where pages p-1 and q+1 are located. If these adjacent spans are also free, merge them into the span where [p,q] is located, and then recycle this span To PageHeap.
The central free lists of CentralCache are similar to the FreeList of ThreadCache, but it adds a first-level structure.
Jemalloc
jemalloc was launched by facebook, and it was first implemented by freebsd's libc malloc. At present, it is widely used in various components of firefox and facebook server.
memory management
Similar to tcmalloc, each thread also uses thread-local cache without lock when it is less than 32KB.
Jemalloc uses the following size-class classifications on 64bits systems: Small: [8], [16, 32, 48, …, 128], [192, 256, 320, …, 512], [768, 1024, 1280, …, 3840]
Large: [4 KiB, 8 KiB, 12 KiB, …, 4072 KiB]
Huge: [4 MiB, 8 MiB, 12 MiB, …]
Small/large objects need constant time to find metadata, and huge objects are searched in logarithmic time through the global red-black tree.
The virtual memory is logically divided into chunks (the default is 4MB, 1024 4k pages), and the application thread allocates arenas at the first malloc through the round-robin algorithm. Each arena is independent of each other and maintains its own chunks. Chunk cuts pages into small/large objects. The memory of free() is always returned to the arena to which it belongs, regardless of which thread calls free().
Compare
The biggest advantage of jemalloc is its powerful multi-core/multi-thread allocation capability. The more cores the CPU has, the more program threads, and the faster jemalloc allocates
When allocating a lot of small memory, the space for recording meta data of jemalloc will be slightly more than tcmalloc.
When allocating large memory allocations, there will also be less memory fragmentation than tcmalloc.
Jemalloc classifies memory allocation granularity more finely, it leads to less lock contention than ptmalloc.

Is it better to allocate memory in the power of two?

When we use malloc() to allocate memory, should we give the size which is in power of two? Or we just give the exact size that we need?
Like
//char *ptr= malloc( 200 );
char *ptr= malloc( 256 );//instead of 200 we use 256
If it is better to give size which is in the power of two, what is the reason for that? Why is it better?
Thanks
Edit
The reason of my confusion is following quote from Joel's blog Back to Basics
Smart programmers minimize the
potential distruption of malloc by
always allocating blocks of memory
that are powers of 2 in size. You
know, 4 bytes, 8 bytes, 16 bytes,
18446744073709551616 bytes, etc. For
reasons that should be intuitive to
anyone who plays with Lego, this
minimizes the amount of weird
fragmentation that goes on in the free
chain. Although it may seem like this
wastes space, it is also easy to see
how it never wastes more than 50% of
the space. So your program uses no
more than twice as much memory as it
needs to, which is not that big a
deal.
Sorry, I should have posted the above quote earlier. My apologies!
Most replies, so far, say that allocating memory in the power of two is a bad idea, then in which scenario its better to follow Joel's point about malloc()? Why did he say that? Is the above quoted suggestion obsolete now?
Kindly explain it.
Thanks

Just give the exact size you need. The only reason that a power-of-two size might be "better" is to allow quicker allocation and/or to avoid memory fragmentation.
However, any non-trivial malloc implementation that concerns itself with being efficient will internally round allocations up in this way if and when it is appropriate to do so. You don't need to concern yourself with "helping" malloc; malloc can do just fine on its own.
Edit:
In response to your quote of the Joel on Software article, Joel's point in that section (which is hard to correctly discern without the context that follows the paragraph that you quoted) is that if you are expecting to frequently re-allocate a buffer, it's better to do so multiplicatively, rather than additively. This is, in fact, exactly what the std::string and std::vector classes in C++ (among others) do.
The reason that this is an improvement is not because you are helping out malloc by providing convenient numbers, but because memory allocation is an expensive operation, and you are trying to minimize the number of times you do it. Joel is presenting a concrete example of the idea of a time-space tradeoff. He's arguing that, in many cases where the amount of memory needed changes dynamically, it's better to waste some space (by allocating up to twice as much as you need at each expansion) in order to save the time that would be required to repeatedly tack on exactly n bytes of memory, every time you need n more bytes.
The multiplier doesn't have to be two: you could allocate up to three times as much space as you need and end up with allocations in powers of three, or allocate up to fifty-seven times as much space as you need and end up with allocations in powers of fifty-seven. The more over-allocation you do, the less frequently you will need to re-allocate, but the more memory you will waste. Allocating in powers of two, which uses at most twice as much memory as needed, just happens to be a good starting-point tradeoff until and unless you have a better idea of exactly what your needs are.
He does mention in passing that this helps reduce "fragmentation in the free chain", but the reason for that is more because of the number and uniformity of allocations being done, rather than their exact size. For one thing, the more times you allocate and deallocate memory, the more likely you are to fragment the heap, no matter in what size you're allocating. Secondly, if you have multiple buffers that you are dynamically resizing using the same multiplicative resizing algorithm, then it's likely that if one resizes from 32 to 64, and another resizes from 16 to 32, then the second's reallocation can fit right where the first one used to be. This wouldn't be the case if one resized from 25 to 60 and and the other from 16 to 26.
And again, none of what he's talking about applies if you're going to be doing the allocation step only once.

Just to play devil's advocate, here's how Qt does it:
Let's assume that we append 15000
characters to the QString string. Then
the following 18 reallocations (out of
a possible 15000) occur when QString
runs out of space: 4, 8, 12, 16, 20,
52, 116, 244, 500, 1012, 2036, 4084,
6132, 8180, 10228, 12276, 14324,
16372. At the end, the QString has 16372 Unicode characters allocated,
15000 of which are occupied.
The values above may seem a bit
strange, but here are the guiding
principles:
QString allocates 4 characters at a
time until it reaches size 20. From 20
to 4084, it advances by doubling the
size each time. More precisely, it
advances to the next power of two,
minus 12. (Some memory allocators
perform worst when requested exact
powers of two, because they use a few
bytes per block for book-keeping.)
From 4084 on, it advances by blocks of
2048 characters (4096 bytes). This
makes sense because modern operating
systems don't copy the entire data
when reallocating a buffer; the
physical memory pages are simply
reordered, and only the data on the
first and last pages actually needs to
be copied.
I like the way they anticipate operating system features in code that is meant to perform well from smartphones to server farms. Given that they're smarter people than me, I'd assume that said feature is available in all modern OSes.

It might have been true once, but it's certainly not better.
Just allocate the memory you need, when you need it and free it up as soon as you've finished.
There are far too many programs that are profligate with resources - don't make yours one of them.

It's somewhat irrelevant.
Malloc actually allocates slightly more memory than you request, because it has it's own headers to deal with. Therefore the optimal storage is probably something like 4k-12 bytes... but that varies depending on the implementation.
In any case, there is no reason for you to round up to more storage than you need as an optimization technique.

You may want to allocate memory in terms of the processor's word size; not any old power of 2 will do.
If the processor has a 32-bit word (4 bytes), then allocate in units of 4 bytes. Allocating in terms of 2 bytes may not be helpful since the processor prefers data to start on a 4 byte boundary.
On the other hand, this may be a micro-optimization. Most memory allocation libraries are set up to return memory that is aligned at the correct position and will leave the least amount of fragmentation. If you allocate 15 bytes, the library may pad out and allocate 16 bytes. Some memory allocators have different pools based on the allocation size.
In summary, allocate the amount of memory that you need. Let the allocation library / manager handle the actual amount for you. Put more energy into correctness and robustness than worry about these trivial issues.

When I'm allocating a buffer that may need to keep growing to accommodate as-yet-unknown-size data, I start with a power of 2 minus 1, and every time it runs out of space, I realloc with twice the previous size plus 1. This makes it so I never have to worry about integer overflows; the size can only overflow when the previous size was SIZE_MAX, at which point the allocation would already have failed, and 2*SIZE_MAX+1 == SIZE_MAX anyway.
In contrast, if I just used a power of 2 and doubled it each time, I might successfully get a 2^31 byte buffer and then reallocate to a 0 byte buffer next time I doubled the size.
As some people have commented about power-of-2-minus-12 being good for certain malloc implementations, one could equally start with a power of 2 minus 12, then double it and add 12 at each step...
On the other hand if you're just allocating small buffers that won't need to grow, request exactly the size you need. Don't try to second-guess what's good for malloc.

This is totally dependent on the given libc implementation of malloc(3). It's up to that implementation to reserve heap chunks in whatever order it sees fit.
To answer the question - no, it's not "better" (here by "better" you mean ...?). If the size you ask for is too small, malloc(3) will reserve bigger chunk internally, so just stick with your exact size.

With today's amount of memory and its speed I don't think it's relevant anymore.
Furthermore, if you're gonna allocate memory frequently you better consider custom memory pooling / pre-allocation.

There is always testing...
You can try a "sample" program that allocates memory in a loop. This way you can see if your compiler magically allocates memory in powers of 2.
With that information, you can try to allocate the same amount of total memory using the 2 strategies: random sized blocks and power of 2 sized blocks.
I would only expect differences, if any, for large amounts of memory though.

If you're allocating some sort of expandable buffer where you need to pick some number for initial allocations, then yes, powers of 2 are good numbers to choose. If you need to allocate memory for struct foo, then just malloc(sizeof(struct foo)). The recommendation for power-of-2 allocations stems from the inefficiency of internal fragmentation, but modern malloc implementations intended for multiprocessor systems are starting to use CPU-local pools for allocations small enough for this to matter, which prevents the lock contention that used to result when multiple threads would attempt to malloc at the same time, and spend more time blocked due to fragmentation.
By allocating only what you need, you ensure that data structures are packed more densely in memory, which improves cache hit rate, which has a much bigger impact on performance than internal fragmentation. There exist scenarios with very old malloc implementations and very high-end multiprocessor systems where explicitly padding allocations can provide a speedup, but your resources in that case would be better spent getting a better malloc implementation up and running on that system. Pre-padding also makes your code less portable, and prevents the user or the system selecting the malloc behavior at run-time, either programmatically or with environment variables.
Premature optimization is the root of all evil.

You should use realloc() instead of malloc() when reallocating.
http://www.cplusplus.com/reference/clibrary/cstdlib/realloc/
Always use a power of two? It depends on what your program is doing. If you need to reprocess your whole data structure when it grows to a power of two, yeah it makes sense. Otherwise, just allocate what you need and don't hog memory.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight