Assume I have used ptr = malloc(old_size); to allocate a memory block of old_size bytes, of which only the first header_size bytes are meaningful. I'm going to increase the size to new_size.
new_size is greater than old_size, and old_size is greater than header_size.
before:
/- - - - - - - old_size - - - - - - - \
+===============+---------------------+
\-header_size-/
after:
/- - - - - - - - - - - - - - - new_size - - - - - - - - - - - - - - - - - - -\
+===============+------------------------------------------------------------+
\- header_size-/
I don't care what is stored after ptr + header_size because I'll read data into that region afterwards.
method 1: go straight to new_size
ptr = realloc(ptr, new_size);
method 2: shrink to header_size and grow to new_size
ptr = realloc(ptr, header_size);
ptr = realloc(ptr, new_size);
method 3: allocate a new memory block and copy the first header_size bytes
void *newptr = malloc(new_size);
memcpy(newptr, ptr, header_size);
free(ptr);
ptr = newptr;
Which is faster?
Neither malloc (for the whole block) nor realloc (for the space beyond the old block's size when growing) guarantees what the memory you receive will contain, so, if you want those excess bytes set to zero (for example), you'll have to do it yourself with something like:
// ptr contains the current block.
void *saveptr = ptr;
ptr = realloc(ptr, new_size);
if (ptr == NULL) {
    // do something intelligent like recover saveptr and exit.
}
memset((char *)ptr + header_size, 0, new_size - header_size);
However, since you've stated that you don't care about the content beyond the header, the fastest is almost certainly a single realloc since that's likely to be optimised under the covers.
Calling it twice for contraction and expansion, or calling malloc-new/memcpy/free-old, is very unlikely to be as efficient. Though, as with all optimisations: measure, don't guess!
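If you do want to measure, a rough harness along these lines is one place to start (the method bodies mirror the three options from the question; ITERATIONS and the example sizes are made up, error checks are omitted, and each iteration recreates the block, so allocation noise is part of the measurement):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define ITERATIONS 100000

static void method1(size_t old_size, size_t new_size, size_t header_size)
{
    (void)header_size;
    void *ptr = malloc(old_size);
    ptr = realloc(ptr, new_size);              /* straight to new_size */
    free(ptr);
}

static void method2(size_t old_size, size_t new_size, size_t header_size)
{
    void *ptr = malloc(old_size);
    ptr = realloc(ptr, header_size);           /* shrink to the header... */
    ptr = realloc(ptr, new_size);              /* ...then grow */
    free(ptr);
}

static void method3(size_t old_size, size_t new_size, size_t header_size)
{
    void *ptr = malloc(old_size);
    void *newptr = malloc(new_size);           /* new block, copy only the header */
    memcpy(newptr, ptr, header_size);
    free(ptr);
    free(newptr);
}

static double time_it(void (*m)(size_t, size_t, size_t),
                      size_t old_size, size_t new_size, size_t header_size)
{
    clock_t start = clock();
    for (int i = 0; i < ITERATIONS; i++)
        m(old_size, new_size, header_size);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    size_t old_size = 1 << 20, new_size = 2 << 20, header_size = 64;   /* example values */
    printf("method 1: %.3f s\n", time_it(method1, old_size, new_size, header_size));
    printf("method 2: %.3f s\n", time_it(method2, old_size, new_size, header_size));
    printf("method 3: %.3f s\n", time_it(method3, old_size, new_size, header_size));
    return 0;
}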
Keep in mind that realloc doesn't necessarily have to copy your memory at all. If the expansion can be done in place, then an intelligent heap manager will just increase the size of the block without copying anything, such as:
+-----------+ ^ +-----------+ <- At same address,
| Old block | | Need | New block | no copying
| | | this | | involved.
+-----------+ | much | |
| Free | | now. | |
| | v +-----------+
| | | Free |
| | | |
+-----------+ +-----------+
It almost certainly depends on the values of old_size, new_size and header_size, and also it depends on the implementation. You'd have to pick some values and measure.
(1) is probably best in the case where header_size == old_size-1 && old_size == new_size-1, since it gives you the best chance of the single realloc being basically a no-op. (2) should be only very slightly slower in that case (two almost-no-ops being marginally slower than one).
(3) is probably best in the case where header_size == 1 && old_size == 1024*1024 && new_size == 2048*1024, because the realloc would have to move the allocation, but you avoid copying 1MB of data you don't care about. (2) should be only very slightly slower in that case.
(2) is probably best when header_size is much smaller than old_size, and new_size is in a range where it's reasonably likely that the realloc will relocate, but also reasonably likely that it won't. Then you can't predict which of (1) and (3) will be very slightly faster than (2).
In analyzing (2), I have assumed that realloc downwards is approximately free and returns the same pointer. This is not guaranteed. I can think of two things that can mess you up:
realloc downwards copies to a new allocation
realloc downwards splits the buffer to create a new chunk of free memory, but then when you realloc back up again the allocator doesn't merge that new free chunk straight back onto your buffer again in order to return without copying.
Either of those could make (2) significantly more expensive than (1). So it's an implementation detail whether or not (2) is a good way of hedging your bets between the advantages of (1) (sometimes avoids copying anything) and the advantages of (3) (sometimes avoids copying too much).
Btw, this kind of idle speculation about performance is better at tentatively explaining observations you've already made than at predicting the observations you would make in the unlikely event that you actually cared enough about performance to test it.
Furthermore, I suspect that for large allocations, the implementation might be able to do even a relocating realloc without copying anything, by re-mapping the memory to a new address. In which case they would all be fast. I haven't looked into whether implementations actually do that, though.
That probably depends on what the sizes are and whether copying is needed.
Method 1 will copy everything contained in the old block - but if you don't do that too often, you won't notice.
Method 2 will only copy what you need to keep, as you discard everything else beforehand.
Method 3 will copy unconditionally, while the others only copy if the memory block cannot be resized where it is.
Personally, I would prefer method 2 if you do this quite often, or method 1 if you do it rarely. Either way, I would profile to see which of them is actually faster.
Related
I'm thinking about re-implementing the malloc(3) function (as well as free(3), realloc(3) and calloc(3)) using mmap(2) and munmap(2) syscalls (as sbrk(2) is now deprecated on some operating systems).
My strategy for allocating memory blocks on the page returned by mmap(2) would be to store metadata right before the block of data. The metadata would consist of 3 attributes:
is_free : a char (1 byte) telling whether the block is considered free;
size : a size_t (4 bytes) holding the size of the block in bytes;
next : a pointer (1 byte) to the next block's metadata (or to the first block of the next page if there's no more space after this block).
But as I can't use malloc to allocate a struct for them, I would simply put 6 bytes of metadata in front of each block I create (a rough struct equivalent is sketched after the layout below):
+---------+---------+--------+------------------+---------+---------+--------+---------+---
| is_free | size | next | Block1 | is_free | size | next | Block2 | ...
+---------+---------+--------+------------------+---------+---------+--------+---------+---
| 1 byte | 4 bytes | 1 byte | n bytes | 1 byte | 4 bytes | 1 byte | m bytes | ...
+---------+---------+--------+------------------+---------+---------+--------+---------+---
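For illustration only, the layout above corresponds roughly to a struct like the following (field widths here are natural C widths rather than the exact 1/4/1 bytes drawn above, since a 1-byte pointer cannot be expressed directly in C):

#include <stddef.h>

typedef struct block_meta {
    unsigned char      is_free;   /* non-zero if the block may be reused */
    size_t             size;      /* usable bytes in the block */
    struct block_meta *next;      /* next block's metadata, or NULL */
} block_meta;

/* Recover a block's metadata from the pointer that was handed to the user. */
static block_meta *meta_of(void *user_ptr)
{
    return (block_meta *)((char *)user_ptr - sizeof(block_meta));
}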
The question is:
How can I be sure the user/process using my malloc won't be able to read/write the metadata of the blocks with such an architecture?
E.g.: with the previous schema, I return Block1's first byte to the user/process. If they do *(Block1 + n) = Some1ByteData, they can alter the metadata of the next block, which will cause issues with my program when I try to allocate a new block later on.
On the mmap(2) man page I read that I could give protection flags for the pages, but if I use them, the user/process using my malloc won't be able to use the block I give out. How is this achieved in the real malloc?
PS: For the moment I'm not considering a thread-safe implementation, nor am I looking for top-tier performance. I just want something robust and functional.
Thanks.
I'm re-coding the malloc function using brk, sbrk & getpagesize()
I must follow two rules:
1)
I must align my memory on a power of 2
It means: if the call is malloc(9), I must return a block of 16 bytes (the nearest power of 2).
2)
I must align the break (program end data segment) on a multiple of 2 pages.
I'm thinking about the rules, and I'm wondering if I've understood them correctly.
Rule 1)
I just need to make the return value of my malloc (the address returned by malloc, in hex) a multiple of 2?
And for the Rule 2)
the break is the last address in the heap, if I'm not wrong;
do I need to set my break so that (the break - the heap start) % (2 * getpagesize()) == 0,
or just the break % (2 * getpagesize()) == 0?
Thanks
1) I must align my memory on a power of 2
…
Rule 1) I just need to make the return value of my malloc (the address returned by malloc, in hex) a multiple of 2?
For an address to be aligned on a power of 2, say 2^p, the address must be a multiple of 2^p.
2) I must align the break (program end data segment) on a multiple of 2 pages.
…
the break is the last address in the heap, if I'm not wrong; do I need to set my break so that (the break - the heap start) % (2 * getpagesize()) == 0, or just the break % (2 * getpagesize()) == 0?
The phrase “set my break” is unclear. You need to use sbrk(0) to get the current value of the break and calculate how much you need to add to it to make it a multiple of twice the page size. That tells you where you need to start a block of memory that is aligned to a multiple of twice the page size. Then you need additional memory to contain whatever amount of data you want to put there (the amount being allocated).
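A minimal sketch of one reading of the two rules (rule 1 taken as rounding the requested size up to a power of two, as in the malloc(9) -> 16 example; rule 2 as padding the break to a multiple of two pages); the helper names are made up and error handling is kept to a minimum:

#include <unistd.h>
#include <stdint.h>
#include <stddef.h>

/* Rule 1: round a request up to the next power of two, e.g. 9 -> 16. */
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Rule 2: pad the current break up to a multiple of two pages. */
static void *align_break(void)
{
    size_t    align   = 2 * (size_t)getpagesize();
    uintptr_t brk_now = (uintptr_t)sbrk(0);            /* current break */
    size_t    pad     = (align - (brk_now % align)) % align;

    if (pad != 0 && sbrk((intptr_t)pad) == (void *)-1)
        return NULL;                                   /* could not move the break */
    return sbrk(0);                                    /* now a multiple of 2 pages */
}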
I'm writing a program that reads data from a stream (a pipe or socket in my example) and puts that data into an array. The problem is that I can't know how much data I need to read from my stream, and that's why I don't know how much memory to allocate for the array. If I knew that, there would be no need for this question. The only thing I know is that some value (-1 for example) appears in the stream to mark its end. So the function that reads data from the stream could look like this:
// stand-in for reading one value from the stream; -1 signals end of stream
int next_value() {
return (rand() % 100) - 1;
}
The code that works with this data looks like this:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main()
{
int len = 0;
int *arr = NULL;
int val, res = 0;
srand(time(NULL));
while ((val = next_value()) != -1) {
if ((res = set_value_in_array(val, &arr, &len))) {
perror("set_value_in_array");
exit(EXIT_FAILURE);
}
}
// uncomment the next line if using set_value_in_array_v2 or set_value_in_array_v3
//arr = realloc(arr, len * sizeof(*arr));
free(arr);
return 0;
}
I have three strategies for putting data into the array, each with its own memory allocation routine.
The easiest is to allocate (reallocate) memory for each new value that comes from next_value(), like this:
// allocate new element in array for each call
int set_value_in_array_v1(int val, int **arr, int *len) {
int *tmp;
tmp = realloc(*arr, ((*len) + 1) * sizeof(**arr));
if (tmp) {
*arr = tmp;
} else {
return -1;
}
*((*arr) + (*len)) = val;
(*len)++;
return 0;
}
Easy, but I think it's not ideal. I don't know how many values will be read from the stream; the number could be anywhere from 0 to infinity. Another strategy is to allocate memory for more than one element at a time. This decreases the number of calls to the memory manager. The code can look like this:
// allocate ELEMS_PER_ALLOC every time allocation needed
int set_value_in_array_v2(int val, int **arr, int *len) {
#define ELEMS_PER_ALLOC 4 // how many elements allocate on next allocation
int *tmp;
if ((*len) % ELEMS_PER_ALLOC == 0) {
tmp = realloc(*arr, ((*len) + ELEMS_PER_ALLOC) * sizeof(**arr));
if (tmp) {
*arr = tmp;
} else {
return -1;
}
}
*((*arr) + (*len)) = val;
(*len)++;
return 0;
}
Much better, but is it the best solution? What if I allocate memory in a geometric progression, like this:
// allocate *len * FRAC_FOR_ALLOC each time allocation needed
int set_value_in_array_v3(int val, int **arr, int *len) {
#define FRAC_FOR_ALLOC 3 // growth factor applied on each new allocation
static int allocated = 0; // I know using static is bad, but it's for experiments only
int *tmp;
if (allocated == (*len)) {
if (allocated == 0) {
allocated = 1;
}
allocated *= FRAC_FOR_ALLOC;
tmp = realloc(*arr, allocated * sizeof(**arr));
if (tmp) {
*arr = tmp;
} else {
return -1;
}
}
*((*arr) + (*len)) = val;
(*len)++;
return 0;
}
The same approach is used by the .NET Framework's List<T> data structure. It has one big problem: after a hundred or so elements it starts allocating a lot of memory, and situations where the current chunk of memory cannot be grown in place become more likely.
On the other hand, set_value_in_array_v2 calls the memory manager very often, which is also not a good idea if there is a lot of data in the stream.
So my question is: what is the best memory allocation strategy in situations like mine? I can't find any answer to this question on the Internet; every link just shows me best practices for using the memory management API.
Thanks in advance.
The number of reallocations if you reallocate every time a new element is added is n, and no memory is ever wasted.
The number of reallocations if you reallocate memory in multiples of 4 is roughly n/4. In the worst case, you waste a constant 3 elements' worth of memory.
The number of reallocations required if you grow the memory by a factor of k each time you run out of space is log_k(n). In the worst case, (1 - 1/k)*100% of the allocated memory is wasted. For k = 2, 50% of the allocated memory is unused. On average, (1 - 1/k)*0.5*100% of your memory goes unused.
While reallocating memory in a geometric sequence, you are guaranteed a logarithmic number of reallocations. However, large factors of k also put a limit on the maximum amount of memory you can usefully allocate.
Suppose you could allocate just 1 GB of memory for your requirement and you are already storing 216 MB. If you use a factor k of 20, the next reallocation would fail because it would demand more than 1 GB of memory.
The larger your base is, the smaller the time complexity, but a larger base also increases the amount of memory going unused in the worst (and average) case, and it caps the maximum usable size at something less than what you could actually have allocated. This of course varies from situation to situation: if you had 1296 MB of allocatable memory and your base was 6, the cap on the array size would be the full 1296 MB, since 1296 is a power of 6 (assuming you started with an allocation that is a power of 6).
What you need depends on your situation. In most cases you have a rough estimate of your memory requirements, so a first optimisation is to set the initial allocation to that estimate. From then on, you can double the memory every time you run out. After the stream is closed, you can reallocate the memory to match the exact size of your data (if you really need to free the unused memory).
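A sketch of that strategy (the names and the initial estimate are made up for illustration):

#include <stdlib.h>

#define INITIAL_ESTIMATE 64        /* whatever your rough estimate is */

typedef struct {
    int    *data;
    size_t  len;
    size_t  cap;
} int_vec;

static int vec_push(int_vec *v, int val)
{
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : INITIAL_ESTIMATE;   /* double when full */
        int *tmp = realloc(v->data, new_cap * sizeof *tmp);
        if (!tmp)
            return -1;
        v->data = tmp;
        v->cap = new_cap;
    }
    v->data[v->len++] = val;
    return 0;
}

static void vec_trim(int_vec *v)   /* call once the stream is closed */
{
    if (v->len && v->len < v->cap) {
        int *tmp = realloc(v->data, v->len * sizeof *tmp);
        if (tmp) {
            v->data = tmp;
            v->cap = v->len;
        }
    }
}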
This question was part of my bachelor's thesis; unfortunately, it is in German.
I compared 3 allocation methods: fixed increase (your case 2), fixed factor (your case 3), and a dynamic factor.
The analysis in the other answers is quite good, but I want to add an important finding from my practical tests: the fixed-step increase can use the most memory at runtime! (And it is some orders of magnitude slower...)
Why? Suppose you have allocated space for 10 items. Then, when adding the 11th item, the space should grow by 10. Now it might not be possible to simply extend the space adjacent to the first 10 items (because it is in use for something else). So fresh space for 20 items is allocated, the original 10 are copied, and the original space is freed. You have now allocated space for 30 items, of which you can actually only use 20. This gets worse with every allocation.
My dynamic-factor approach is intended to grow fast as long as the steps are not too big, and later to use smaller factors, so that the risk of running out of memory is minimised. It is a kind of inverted sigmoid function.
The thesis can be found here: XML Toolbox for Matlab. Relevant chapters are 3.2 (implementation) and 5.3.2 (practical tests)
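Roughly, and only as an illustration rather than the exact formula from the thesis, a dynamic factor might look like this: large steps while the array is small, tapering off as it grows.

#include <stddef.h>

static size_t next_capacity(size_t cap)
{
    double factor = (cap < 1024)        ? 2.0     /* grow aggressively while small */
                  : (cap < 1024 * 1024) ? 1.5
                  :                       1.1;    /* cautious once the array is large */
    return (size_t)(cap * factor) + 1;
}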
When making automatically expanding arrays (like C++'s std::vector) in C, it is common (or at least common advice) to double the size of the array each time it fills up, in order to limit the number of calls to realloc and avoid copying the entire array as much as possible.
E.g. we start by allocating room for 8 elements, 8 elements are inserted, we then allocate room for 16 elements, 8 more elements are inserted, we allocate room for 32, etc.
But realloc does not have to actually copy the data if it can expand the existing memory allocation. For example, the following code only does 1 copy (the initial NULL allocation, so it is not really a copy) on my system, even though it calls realloc 10000 times:
#include <stdlib.h>
#include <stdio.h>
int main()
{
int i;
int copies = 0;
void *data = NULL;
void *ndata;
for (i = 0; i < 10000; i++)
{
ndata = realloc(data, i * sizeof(int));
if (data != ndata)
copies++;
data = ndata;
}
printf("%d\n", copies);
}
I realize that this example is very clinical - a real world application would probably have more memory fragmentation and would do more copies, but even if I make a bunch of random allocations before the realloc loop, it only does marginally worse with 2-4 copies instead.
So, is the "doubling method" really necessary? Would it not be better to just call realloc each time an element is added to the dynamic array?
You have to step back from your code for a minute and think abstractly. What is the cost of growing a dynamic container? Programmers and researchers don't think in terms of "this took 2 ms", but rather in terms of asymptotic complexity: what is the cost of growing by one element given that I already have n elements, and how does this change as n increases?
If you only ever grew by a constant (or bounded) amount, then you would periodically have to move all the data, and so the cost of growing would depend on, and grow with, the size of the container. By contrast, when you grow the container geometrically, i.e. multiply its size by a fixed factor, every time it is full, then the expected cost of inserting is actually independent of the number of elements, i.e. constant.
It is of course not always constant, but it's amortized constant, meaning that if you keep inserting elements, then the average cost per element is constant. Every now and then you have to grow and move, but those events get rarer and rarer as you insert more and more elements.
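A quick back-of-the-envelope check: growing by doubling from 1 up to n copies at most 1 + 2 + 4 + ... + n/2 < n elements in total over the whole sequence of insertions, i.e. less than one copy per element on average, whereas growing by a fixed amount c forces copies of roughly c + 2c + 3c + ... which is about n^2/(2c) elements, i.e. a per-element cost that keeps growing with n.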
I once asked whether it makes sense for C++ allocators to be able to grow in the way that realloc does. The answers I got indicated that the non-moving growing behaviour of realloc is actually a bit of a red herring when you think asymptotically. Eventually you won't be able to grow any more, and you'll have to move, so for the sake of studying the asymptotic cost it's actually irrelevant whether realloc can sometimes be a no-op or not. (Moreover, non-moving growth seems to upset modern, arena-based allocators, which expect all their allocations to be of a similar size.)
Compared to almost every other type of operation, malloc, calloc, and especially realloc are very expensive. I've personally benchmarked 10,000,000 reallocs, and it takes a HUGE amount of time to do that.
Even though I had other operations going on at the same time (in both benchmark tests), I found that I could literally cut HOURS off of the run time by using max_size *= 2 instead of max_size += 1.
Q: "Is doubling the capacity of a dynamic array necessary?"
A: No. One could grow only by the amount needed. But then you may truly copy data many times. It is a classic trade-off between memory and processor time. A good growth algorithm takes into account what is known about the program's data needs, without over-thinking those needs. An exponential growth factor of 2x is a happy compromise.
But now to your claim that the "following code only does 1 copy".
The amount of copying with advanced memory allocators may not be what OP thinks. Getting the same address does not mean that the underlying memory mapping did not perform significant work. All sorts of activity go on under-the-hood.
For memory allocations that grow & shrink a lot over the life of the code, I like grow and shrink thresholds geometrically placed apart from each other.
const size_t Grow[] = {1, 4, 16, 64, 256, 1024, 4096, ... };
const size_t Shrink[] = {0, 2, 8, 32, 128, 512, 2048, ... };
By using the grow thresholds while getting larger and the shrink thresholds while contracting, one avoids thrashing near a boundary. Sometimes a factor of 1.5 is used instead.
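One simple way to get the same effect (a sketch, not the exact threshold tables above, and assuming elements are added or removed one at a time): double when full, but only halve once the array is less than a quarter full, so a count bouncing around a boundary never triggers back-to-back reallocations.

#include <stddef.h>

static size_t adjust_capacity(size_t cap, size_t count)
{
    if (count > cap)
        return cap ? cap * 2 : 1;      /* grow: double, or start at 1 */
    if (cap >= 4 && count < cap / 4)
        return cap / 2;                /* shrink: the halved capacity still leaves headroom */
    return cap;                        /* otherwise keep the current capacity */
}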
I am working on a Windows C project which is string-intensive: I need to convert a marked up string from one form to another. The basic flow is something like:
DWORD convert(char *point, DWORD extent)
{
    char *point_end = point + extent;
    char *result = memory_alloc(1);
    char *p_result = result;
    DWORD result_extent;

    while (point < point_end)
    {
        switch (*point)
        {
            case FOO:
                // grow the result buffer by 12 bytes; memory_realloc may move it,
                // so re-anchor p_result at the same offset within the new block
                result_extent = p_result - result;
                result = memory_realloc(12);
                p_result = result + result_extent;
                *p_result++ = '\n';
                *p_result++ = '\t';
                memcpy(p_result, point, 10);
                point += 10;
                p_result += 10;
                break;
            case BAR:
                // grow the result buffer by a single byte
                result_extent = p_result - result;
                result = memory_realloc(1);
                p_result = result + result_extent;
                *p_result++ = *point++;
                break;
            default:
                point++;
                break;
        }
    }

    result_extent = p_result - result;
    // assume point is big enough to take anything I would copy to it
    memcpy(point, result, result_extent);
    return result_extent;
}
memory_alloc() and memory_realloc() are fake functions used to highlight the purpose of my question. I do not know beforehand how big the result 'string' will be (technically, it's not a C-style/null-terminated string I'm working with, just a pointer to a memory address and a length/extent), so I'll need to size the result string dynamically (it might be bigger than the input, or smaller).
In my initial pass, I used malloc() to create room for the first byte/bytes and then subsequently called realloc() whenever I needed to append another byte/handful of bytes. It works, but it feels like this approach will needlessly hammer away at the OS and likely result in shifting bytes around in memory over and over.
So I made a second pass, which determines how long the result_string will be after an individual unit of the transformation (illustrated above with the FOO and BAR cases) and picks a 'preferred allocation size', e.g. 256 bytes. For example, if result_extent is 250 bytes and I'm in the FOO case, I know I need to grow the memory 12 bytes (newline, tab and 10 bytes from the input string) -- rather than reallocating 260 bytes of memory, I'd reach for 512 bytes, hedging my bet that I'm likely going to continue to add more data (and thus I can save myself a few calls into realloc).
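In code, the rounding I have in mind looks roughly like this (the CHUNK size and the helper name are just for illustration):

#include <stddef.h>

#define CHUNK 256   /* preferred allocation granularity */

static size_t rounded_capacity(size_t needed)
{
    return (needed + CHUNK - 1) / CHUNK * CHUNK;   /* round up to a CHUNK multiple, e.g. 260 -> 512 */
}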
On to my question: is this latter thinking sound or is it premature optimization that the compiler/OS is probably already taking care of for me? Other than not wasting memory space, is there an advantage to reallocating memory by a couple bytes, as needed?
I have some rough ideas of what I might expect during a single conversion instance, e.g. a worst-case scenario might be a 2 MB input string with a couple hundred bytes of markup, each markup instance adding 50-100 bytes of data to the result string (so, say, 200 reallocs stretching the string by 50-100 bytes, plus another 100 reallocations caused by simply copying data from the input string into the result string).
Any thoughts on the subject would be appreciated. Thanks.
As you might know, realloc can move your data on each call, which results in an additional copy. In cases like this, I think it is much better to allocate a large buffer that will most probably be sufficient for the whole operation (an upper bound). At the end, you can allocate the exact amount for the result and do a final copy/free. This is better, and it is not premature optimization at all; IMO, using realloc here might be the premature optimization.
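A minimal sketch of that approach, with made-up names and a hypothetical worst-case expansion of 12 output bytes per input byte:

#include <stdlib.h>
#include <string.h>

char *convert_once(const char *in, size_t in_len, size_t *out_len)
{
    size_t cap  = in_len * 12 + 1;                 /* upper bound, allocated once */
    char  *buf  = malloc(cap);
    size_t used = 0;

    if (buf == NULL)
        return NULL;

    for (size_t i = 0; i < in_len; i++) {
        /* ... the real transformation writes into buf + used, bounded by cap ... */
        buf[used++] = in[i];                       /* placeholder for the real rules */
    }

    *out_len = used;
    char *exact = realloc(buf, used ? used : 1);   /* final trim to the exact size */
    return exact ? exact : buf;
}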