Are multiple realloc more expensive than a huge malloc? - c

I am using a dynamic array to represent a min-heap. There is a loop that removes minimum, and add random elements to the min-heap until some condition occur. Although I don't know how the length of the heap will change during run-time (there is a lot of randomness), I know the upper bound, which is 10 million. I have two options:
1) Declare a small array using malloc, then call realloc when there number of elements in the heap exceeds the length.
2) Declare a 10 million entry array using malloc. This avoids ever calling realloc.
Question
Is option 2 more efficient than option 1?
I tested this with my code and there seems to be significant (20%) run-time reduction from using 2. This is estimated because of the randomness in the code. Is there any drawback to declaring a large 10-50 million entry array with malloc up front?

If you can spare the memory to make the large up-front allocation, and it gives a worthwhile performance increase, then by all means do it.
If you stick with realloc, then you might find that doubling the size every time instead of increasing by a fixed amount can give a good trade-off between performance and efficient memory usage.

It's not said that when you use realloc, the memory will be expanded from the same place.It may also happen that the memory will be displaced in another area.
So using realloc may cause to copy the previous chuck of memory that you had.
Also consider that a system call may take some overhead, so you'd better call malloc once.

The drawback is that if you are not using all that space you are taking up a large chunk of memory which might be needed. If you know exactly how many bytes you need it is going to be more efficient to allocate at once, due to system call overhead, then to allocate it piece by piece. Usually you might have an upper bound but not know the exact number. Taking the time to malloc up the space to handle the upper bound might take 1 second. If however, this particular case only has half of the upper bound it might take .75 seconds allocating piece by piece. So it depends on how close to the upper bound you think you are going to get.

Related

optimal way of using malloc and realloc for dynamic storing

I'm trying to figure out what is the optimal way of using malloc and realloc for recieving unknown amount of characters from the user ,storing them, and printing them only by the end.
I've figured that calling realloc too many times wont be so smart.
so instead, I allocate a set amount of space each time,lets say
sizeof char*100
and by the end of file,i use realloc to fit the size of the whole thing precisely.
what do you think?is this a good way to go about?
would you go in a different path?
Please note,I have no intention of using linked lists,getchar(),putchar().
using malloc and realloc only is a must.
If you realloc to fit the exact amount of data needed, then you are optimizing for memory consumption. This will likely give slower code because 1) you get extra realloc calls and 2) you might not allocate amounts that fit well with CPU alignment and data cache. Possibly this also causes heap segmentation issues because of the repeated reallocs, in which case it could actually waste memory.
It's hard to answer what's "best" generically, but the below method is fairly common, as it is a good compromise between reducing execution speed for realloc calls and lowering memory use:
You allocate a segment, then keep track of how much of this segment that is user data. It is a good idea to allocate size_t mempool_size = n * _Alignof(int); bytes and it is probably also wise to use a n which is divisible by 8.
Each time you run out of free memory in this segment, you realloc to mempool_size*2 bytes. That way you keep doubling the available memory each time.
I've figured that calling realloc too many times wont be so smart.
How have you figured it out? Because the only way to really know is to measure the performance.
Your strategy may need to differ based on how you are reading the data from the user. If you are using getchar() you probably don't want to use realloc() to increase the buffer size by one char each time you read a character. However, a good realloc() will be much less inefficient than you think even in these circumstances. The minimum block size that glibc will actually give you in response to a malloc() is, I think, 16 bytes. So going from 0 to 16 characters and reallocing each time doesn't involve any copying. Similarly for larger reallocations, a new block might not need to be allocated, it may be possible to make the existing block bigger. Don't forget that even at its slowest, realloc() will be faster than a person can type.
Most people don't go for that strategy. What can by typed can be piped so the argument that people don't type very fast doesn't necessarily work. Normally, you introduce the concept of capacity. You allocate a buffer with a certain capacity and when it gets full, you increase its capacity (with realloc()) by adding a new chunk of a certain size. The initial size and the reallocation size can be tuned in various ways. If you are reading user input, you might go for small values e.g. 256 bytes, if you are reading files off disk or across the network, you might go for larger values e.g. 4Kb or bigger.
The increment size doesn't even need to be constant, you could choose to double the size for each needed reallocation. This is the strategy used by some programming libraries. For example the Java implementation of a hash table uses it I believe and so possibly does the Cocoa implementation of an array.
It's impossible to know beforehand what the best strategy in any particular situation is. I would pick something that feels right and then, if the application has performance issues, I would do testing to tune it. Your code doesn't have to be the fastest possible, but only fast enough.
However one thing I absolutely would not do is overlay a home rolled memory algorithm over the top of the built in allocator. If you find yourself maintaining a list of blocks you are not using instead of freeing them, you are doing it wrong. This is what got OpenSSL into trouble.

Using realloc() for exact amount vs malloc() for too much

I have a bunch of files that I'm going to be processing in batches of about 1000. Some calculations are done on the files and approximately 75% of them will need to have data stored in a struct array.
I have no way of knowing how many files will need to be stored in the array until the calculations are done at runtime.
Right now I'm counting the number of files processed that need a struct array space, and using malloc(). Then for the next batch of files I use realloc(), and so on until all files are done. This way I allocate the exact amount of memory I need.
I'm able to count the total number of files in advance. Would I be better off using one big malloc() right at the start, even though it's only going to be 75% filled?
I would try it both ways, and see if there is a noticeable difference in performance.
If there isn't I would stick with using realloc() as then you wont allocate any excess memory that you don't need. You can also do something like the vector class in C++ which increases the memory logarithmically.
Libraries can implement different strategies for growth to balance between memory usage and reallocations, but in any case, reallocations should only happen at logarithmically growing intervals of size so that the insertion of individual elements at the end of the vector can be provided with amortized constant time complexity
Keep in mind that you can also allocate memory in advance, and free some of it when you are done. It really depends on what your constraints are.

One large malloc versus multiple smaller reallocs

Sorry if this has been asked before, I haven't been able to find just what I am looking for.
I am reading fields from a list and writing them to a block of memory. I could
Walk the whole list, find the total needed size, do one malloc and THEN walk the list again and copy each field;
Walk the whole list and realloc the block of memory as I write the values;
Right now the first seems the most efficient to me (smallest number of calls). What are the pros and cons of either approach ?
Thank you for your time.
The first approach is almost always better. A realloc() typically works by copying the entire contents of a memory block into the freshly allocated, larger block. So n reallocs can mean n copies, each one larger than the last. (If you are adding m bytes to your allocation each time, then the first realloc has to copy m bytes, the next one 2m, the next 3m, ...).
The pedantic answer would be that the internal performance implications of realloc() are implementation specific, not explicitly defined by the standard, in some implementation it could work by magic fairies that move the bytes around instantly, etc etc etc - but in any realistic implementation, realloc() means a copy.
You're probably better off allocating a decent amount of space initially, based on what you think is the most likely maximum.
Then, if you find you need more space, don't just allocate enough for the extra, allocate a big chunk extra.
This will minimise the number of re-allocations while still only processing the list once.
By way of example, initially allocate 100K. If you then find you need more, re-allocate to 200K, even if you only need 101K.
Don't reinvent the wheel and use CCAN's darray which implements an approach similar to what paxdiablo described. See darray on GitHub

Array Performance very similar to LinkedList - What gives?

So the title is somewhat misleading... I'll keep this simple: I'm comparing these two data structures:
An array, whereby it starts at size 1, and for each subsequent addition, there is a realloc() call to expand the memory, and then append the new (malloced) element to the n-1 position.
A linked list, whereby I keep track of the head, tail, and size. And addition involves mallocing for a new element and updating the tail pointer and size.
Don't worry about any of the other details of these data structures. This is the only functionality I'm concerned with for this testing.
In theory, the LL should be performing better. However, they're near identical in time tests involving 10, 100, 1000... up to 5,000,000 elements.
My gut feeling is that the heap is large. I think the data segment defaults to 10 MB on Redhat? I could be wrong. Anyway, realloc() is first checking to see if space is available at the end of the already-allocated contiguous memory location (0-[n-1]). If the n-th position is available, there is not a relocation of the elements. Instead, realloc() just reserves the old space + the immediately following space. I'm having a hard time finding evidence of this, and I'm having a harder time proving that this array should, in practice, perform worse than the LL.
Here is some further analysis, after reading posts below:
[Update #1]
I've modified the code to have a separate list that mallocs memory every 50th iteration for both the LL and the Array. For 1 million additions to the array, there are almost consistently 18 moves. There's no concept of moving for the LL. I've done a time comparison, they're still nearly identical. Here's some output for 10 million additions:
(Array)
time ./a.out a 10,000,000
real 0m31.266s
user 0m4.482s
sys 0m1.493s
(LL)
time ./a.out l 10,000,000
real 0m31.057s
user 0m4.696s
sys 0m1.297s
I would expect the times to be drastically different with 18 moves. The array addition is requiring 1 more assignment and 1 more comparison to get and check the return value of realloc to ensure a move occurred.
[Update #2]
I ran an ltrace on the testing that I posted above, and I think this is an interesting result... It looks like realloc (or some memory manager) is preemptively moving the array to larger contiguous locations based on the current size.
For 500 iterations, a memory move was triggered on iterations:
1, 2, 4, 7, 11, 18, 28, 43, 66, 101, 154, 235, 358
Which is pretty close to a summation sequence. I find this to be pretty interesting - thought I'd post it.
You're right, realloc will just increase the size of the allocated block unless it is prevented from doing so. In a real world scenario you will most likely have other objects allocated on the heap in between subsequent additions to the list? In that case realloc will have to allocate a completely new chunk of memory and copy the elements already in the list.
Try allocating another object on the heap using malloc for every ten insertions or so, and see if they still perform the same.
So you're testing how quickly you can expand an array verses a linked list?
In both cases you're calling a memory allocation function. Generally memory allocation functions grab a chunk of memory (perhaps a page) from the operating system, then divide that up into smaller pieces as required by your application.
The other assumption is that, from time to time, realloc() will spit the dummy and allocate a large chunk of memory elsewhere because it could not get contiguous chunks within the currently allocated page. If you're not making any other calls to memory allocation functions in between your list expand then this won't happen. And perhaps your operating system's use of virtual memory means that your program heap is expanding contiguously regardless of where the physical pages are coming from. In which case the performance will be identical to a bunch of malloc() calls.
Expect performance to change where you mix up malloc() and realloc() calls.
Assuming your linked list is a pointer to the first element, if you want to add an element to the end, you must first walk the list. This is an O(n) operation.
Assuming realloc has to move the array to a new location, it must traverse the array to copy it. This is an O(n) operation.
In terms of complexity, both operations are equal. However, as others have pointed out, realloc may be avoiding relocating the array, in which case adding the element to the array is O(1). Others have also pointed out that the vast majority of your program's time is probably spent in malloc/realloc, which both implementations call once per addition.
Finally, another reason the array is probably faster is cache coherency and the generally high performance of linear copies. Jumping around to erratic addresses with significant gaps between them (both the larger elements and the malloc bookkeeping) is not usually as fast as doing a bulk copy of the same volume of data.
The performance of an array-based solution expanded with realloc() will depend on your strategy for creating more space.
If you increase the amount of space by adding a fixed amount of storage on each re-allocation, you'll end up with an expansion that, on average, depends on the number of elements you have stored in the array. This is on the assumption that realloc will need to (occasionally) allocate space elsewhere and copy the contents, rather than just expanding the existing allocation.
If you increase the amount of space by adding a proportion of your current number of elements (doubling is pretty standard), you'll end up with an expansion that, on average, takes constant time.
Will the compiler output be much different in these two cases?
This is not a real life situation. Presumably, in real life, you are interested in looking at or even removing items from your data structures as well as adding them.
If you allow removal, but only from the head, the linked list becomes better than the array because removing an item is trivial and, if instead of freeing the removed item, you put it on a free list to be recycled, you can eliminate a lot of the mallocs needed when you add items to the list.
On the other had, if you need random access to the structure, clearly an array beats the linked list.
(Updated.)
As others have noted, if there are no other allocations in between reallocs, then no copying is needed. Also as others have noted, the risk of memory copying lessens (but also its impact of course) for very small blocks, smaller than a page.
Also, if all you ever do in your test is to allocate new memory space, I am not very surprised you see little difference, since the syscalls to allocate memory are probably taking most of the time.
Instead, choose your data structures depending on how you want to actually use them. A framebuffer is for instance probably best represented by a contiguous array.
A linked list is probably better if you have to reorganise or sort data within the structure quickly.
Then these operations will be more or less efficient depending on what you want to do.
(Thanks for the comments below, I was initially confused myself about how these things work.)
What's the basis of your theory that the linked list should perform better for insertions at the end? I would not expect it to, for exactly the reason you stated. realloc will only copy when it has to to maintain contiguity; in other cases it may have to combine free chunks and/or increase the chunk size.
However, every linked list node requires fresh allocation and (assuming double linked list) two writes. If you want evidence of how realloc works, you can just compare the pointer before and after realloc. You should find that it usually doesn't change.
I suspect that since you're calling realloc for every element (obviously not wise in production), the realloc/malloc call itself is the biggest bottleneck for both tests, even though realloc often doesn't provide a new pointer.
Also, you're confusing the heap and data segment. The heap is where malloced memory lives. The data segment is for global and static variables.

Is it better to allocate memory in the power of two?

When we use malloc() to allocate memory, should we give the size which is in power of two? Or we just give the exact size that we need?
Like
//char *ptr= malloc( 200 );
char *ptr= malloc( 256 );//instead of 200 we use 256
If it is better to give size which is in the power of two, what is the reason for that? Why is it better?
Thanks
Edit
The reason of my confusion is following quote from Joel's blog Back to Basics
Smart programmers minimize the
potential distruption of malloc by
always allocating blocks of memory
that are powers of 2 in size. You
know, 4 bytes, 8 bytes, 16 bytes,
18446744073709551616 bytes, etc. For
reasons that should be intuitive to
anyone who plays with Lego, this
minimizes the amount of weird
fragmentation that goes on in the free
chain. Although it may seem like this
wastes space, it is also easy to see
how it never wastes more than 50% of
the space. So your program uses no
more than twice as much memory as it
needs to, which is not that big a
deal.
Sorry, I should have posted the above quote earlier. My apologies!
Most replies, so far, say that allocating memory in the power of two is a bad idea, then in which scenario its better to follow Joel's point about malloc()? Why did he say that? Is the above quoted suggestion obsolete now?
Kindly explain it.
Thanks
Just give the exact size you need. The only reason that a power-of-two size might be "better" is to allow quicker allocation and/or to avoid memory fragmentation.
However, any non-trivial malloc implementation that concerns itself with being efficient will internally round allocations up in this way if and when it is appropriate to do so. You don't need to concern yourself with "helping" malloc; malloc can do just fine on its own.
Edit:
In response to your quote of the Joel on Software article, Joel's point in that section (which is hard to correctly discern without the context that follows the paragraph that you quoted) is that if you are expecting to frequently re-allocate a buffer, it's better to do so multiplicatively, rather than additively. This is, in fact, exactly what the std::string and std::vector classes in C++ (among others) do.
The reason that this is an improvement is not because you are helping out malloc by providing convenient numbers, but because memory allocation is an expensive operation, and you are trying to minimize the number of times you do it. Joel is presenting a concrete example of the idea of a time-space tradeoff. He's arguing that, in many cases where the amount of memory needed changes dynamically, it's better to waste some space (by allocating up to twice as much as you need at each expansion) in order to save the time that would be required to repeatedly tack on exactly n bytes of memory, every time you need n more bytes.
The multiplier doesn't have to be two: you could allocate up to three times as much space as you need and end up with allocations in powers of three, or allocate up to fifty-seven times as much space as you need and end up with allocations in powers of fifty-seven. The more over-allocation you do, the less frequently you will need to re-allocate, but the more memory you will waste. Allocating in powers of two, which uses at most twice as much memory as needed, just happens to be a good starting-point tradeoff until and unless you have a better idea of exactly what your needs are.
He does mention in passing that this helps reduce "fragmentation in the free chain", but the reason for that is more because of the number and uniformity of allocations being done, rather than their exact size. For one thing, the more times you allocate and deallocate memory, the more likely you are to fragment the heap, no matter in what size you're allocating. Secondly, if you have multiple buffers that you are dynamically resizing using the same multiplicative resizing algorithm, then it's likely that if one resizes from 32 to 64, and another resizes from 16 to 32, then the second's reallocation can fit right where the first one used to be. This wouldn't be the case if one resized from 25 to 60 and and the other from 16 to 26.
And again, none of what he's talking about applies if you're going to be doing the allocation step only once.
Just to play devil's advocate, here's how Qt does it:
Let's assume that we append 15000
characters to the QString string. Then
the following 18 reallocations (out of
a possible 15000) occur when QString
runs out of space: 4, 8, 12, 16, 20,
52, 116, 244, 500, 1012, 2036, 4084,
6132, 8180, 10228, 12276, 14324,
16372. At the end, the QString has 16372 Unicode characters allocated,
15000 of which are occupied.
The values above may seem a bit
strange, but here are the guiding
principles:
QString allocates 4 characters at a
time until it reaches size 20. From 20
to 4084, it advances by doubling the
size each time. More precisely, it
advances to the next power of two,
minus 12. (Some memory allocators
perform worst when requested exact
powers of two, because they use a few
bytes per block for book-keeping.)
From 4084 on, it advances by blocks of
2048 characters (4096 bytes). This
makes sense because modern operating
systems don't copy the entire data
when reallocating a buffer; the
physical memory pages are simply
reordered, and only the data on the
first and last pages actually needs to
be copied.
I like the way they anticipate operating system features in code that is meant to perform well from smartphones to server farms. Given that they're smarter people than me, I'd assume that said feature is available in all modern OSes.
It might have been true once, but it's certainly not better.
Just allocate the memory you need, when you need it and free it up as soon as you've finished.
There are far too many programs that are profligate with resources - don't make yours one of them.
It's somewhat irrelevant.
Malloc actually allocates slightly more memory than you request, because it has it's own headers to deal with. Therefore the optimal storage is probably something like 4k-12 bytes... but that varies depending on the implementation.
In any case, there is no reason for you to round up to more storage than you need as an optimization technique.
You may want to allocate memory in terms of the processor's word size; not any old power of 2 will do.
If the processor has a 32-bit word (4 bytes), then allocate in units of 4 bytes. Allocating in terms of 2 bytes may not be helpful since the processor prefers data to start on a 4 byte boundary.
On the other hand, this may be a micro-optimization. Most memory allocation libraries are set up to return memory that is aligned at the correct position and will leave the least amount of fragmentation. If you allocate 15 bytes, the library may pad out and allocate 16 bytes. Some memory allocators have different pools based on the allocation size.
In summary, allocate the amount of memory that you need. Let the allocation library / manager handle the actual amount for you. Put more energy into correctness and robustness than worry about these trivial issues.
When I'm allocating a buffer that may need to keep growing to accommodate as-yet-unknown-size data, I start with a power of 2 minus 1, and every time it runs out of space, I realloc with twice the previous size plus 1. This makes it so I never have to worry about integer overflows; the size can only overflow when the previous size was SIZE_MAX, at which point the allocation would already have failed, and 2*SIZE_MAX+1 == SIZE_MAX anyway.
In contrast, if I just used a power of 2 and doubled it each time, I might successfully get a 2^31 byte buffer and then reallocate to a 0 byte buffer next time I doubled the size.
As some people have commented about power-of-2-minus-12 being good for certain malloc implementations, one could equally start with a power of 2 minus 12, then double it and add 12 at each step...
On the other hand if you're just allocating small buffers that won't need to grow, request exactly the size you need. Don't try to second-guess what's good for malloc.
This is totally dependent on the given libc implementation of malloc(3). It's up to that implementation to reserve heap chunks in whatever order it sees fit.
To answer the question - no, it's not "better" (here by "better" you mean ...?). If the size you ask for is too small, malloc(3) will reserve bigger chunk internally, so just stick with your exact size.
With today's amount of memory and its speed I don't think it's relevant anymore.
Furthermore, if you're gonna allocate memory frequently you better consider custom memory pooling / pre-allocation.
There is always testing...
You can try a "sample" program that allocates memory in a loop. This way you can see if your compiler magically allocates memory in powers of 2.
With that information, you can try to allocate the same amount of total memory using the 2 strategies: random sized blocks and power of 2 sized blocks.
I would only expect differences, if any, for large amounts of memory though.
If you're allocating some sort of expandable buffer where you need to pick some number for initial allocations, then yes, powers of 2 are good numbers to choose. If you need to allocate memory for struct foo, then just malloc(sizeof(struct foo)). The recommendation for power-of-2 allocations stems from the inefficiency of internal fragmentation, but modern malloc implementations intended for multiprocessor systems are starting to use CPU-local pools for allocations small enough for this to matter, which prevents the lock contention that used to result when multiple threads would attempt to malloc at the same time, and spend more time blocked due to fragmentation.
By allocating only what you need, you ensure that data structures are packed more densely in memory, which improves cache hit rate, which has a much bigger impact on performance than internal fragmentation. There exist scenarios with very old malloc implementations and very high-end multiprocessor systems where explicitly padding allocations can provide a speedup, but your resources in that case would be better spent getting a better malloc implementation up and running on that system. Pre-padding also makes your code less portable, and prevents the user or the system selecting the malloc behavior at run-time, either programmatically or with environment variables.
Premature optimization is the root of all evil.
You should use realloc() instead of malloc() when reallocating.
http://www.cplusplus.com/reference/clibrary/cstdlib/realloc/
Always use a power of two? It depends on what your program is doing. If you need to reprocess your whole data structure when it grows to a power of two, yeah it makes sense. Otherwise, just allocate what you need and don't hog memory.

Resources