Why are compile-time known format-strings not optimized?
Compilers optimize aggressively these days; gcc and clang in particular sometimes perform really surprising transformations.
So I'm wondering why the following piece of code is not optimized:
#include <stdio.h>
#include <string.h>
int main() {
    printf("%d\n", 0);
}
I would expect a heavy-optimizing compiler to generate code equivalent to this:
#include <stdio.h>
#include <string.h>
int main() {
    printf("0\n");
}
But gcc and clang don't apply the format string at compile time (see https://godbolt.org/z/Taa44c1n7). Is there a reason why such an optimization is not applied?
I regularly see format arguments that are known at compile time, so I imagine it could be worthwhile (especially for floating-point values, where the formatting is probably relatively expensive).
Suppose that at various spots within a function one has the following four lines:
printf("%d\n", 100);
printf("%d\n", 120);
printf("%d\n", 157);
printf("%d\n", 192);
When compiling for a typical ARM microcontroller, each of those printf calls would take eight bytes, and all four calls would share four bytes used to hold a copy of the address of the string plus four bytes for the string itself. The total for code and data would thus be 40 bytes.
If one were to replace that with:
printf("100\n");
printf("120\n");
printf("157\n");
printf("192\n");
then each printf call would only need six bytes of code, but would need its own separate five-byte string in memory, along with four bytes to hold that string's address. The total code-plus-data requirement would thus increase from 40 bytes to 60. Far from being an optimization, this would actually increase the space requirement by 50%.
Even if there were only one printf call, the savings would be minuscule. Eight bytes for the original version of the call plus eight bytes of data overhead come to 16 bytes. The second version would require eight bytes of code plus seven bytes of data, for a total of 15 bytes. An optimization that would save a little storage in the best case, and cost a lot in scenarios that are hardly implausible, isn't really an optimization.
Some potential optimizations are easy for compiler developers to support and have a large impact on the quality/performance of the generated code; others are extremely difficult to support and have only a small impact. Obviously compiler developers are going to spend most of their time on the "easier with higher impact" optimizations, and some of the "harder with less impact" optimizations are going to be postponed (and possibly never implemented).
Something like optimizing printf("%d\n", 0); into a puts() (or better, an fputs()) looks like it'd be relatively easy to implement, but it would have a very small performance impact, partly because it'd be rare for such a call to occur in source code anyway.
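For illustration, here is a minimal sketch of what such a rewrite would amount to; the exact call a compiler would pick is its own business (gcc and clang already do a related transformation, turning printf("hello\n") into puts("hello")):
#include <stdio.h>

int main(void) {
    printf("%d\n", 0);     /* what the programmer wrote */

    /* what the hypothetical optimization could emit instead: */
    puts("0");             /* puts appends the newline itself */
    fputs("0\n", stdout);  /* or fputs, which skips the format-string
                              scan and appends nothing */
    return 0;
}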
The real problem is that it's a "slippery slope".
If compiler developers do the work needed for the compiler to optimize printf("%d\n", 0);, then what about optimizing printf("%d\n", x); too? If you optimize those, then why not also optimize printf("%04d\n", 0); and printf("%04d\n", x);, and floating point, and more complex format strings? Surely printf("Hello %04d\n Foo is %0.6f!\n", x, y); could be broken down into a series of calls to smaller functions (fputs(), integer- and floating-point-to-string conversions, ...)?
This "slippery slope" means that the complexity increases rapidly (as the compiler supports more permutations of format strings) while the performance impact barely improves at all.
If compiler developers are going to spend most of their time on the "easier with higher impact" optimizations, they're not going to be enthusiastic about clawing further up that particular slippery slope.
How could the compiler know what kind of printf() function you will link?
There is a standard meaning for this function, but nobody forces you to link against the standard library's implementation.
So it's beyond the compiler's competence to perform in advance what the function is supposed to do at run time.
My guess is that how printf() converts from the internal C representation to the user-visible text is a matter for the operating system and its locale conventions (often or always configurable?), so it would be wrong to assume that printing an "integer 0" must produce the ASCII character 48 (an extreme example, I know). In particular, and here I may be wrong, printf("\n") outputs a bare newline on Unix and CR+LF on some other systems. I am not sure where this transformation is done and, to be honest, I think it is a glitch: I've seen, more than once, programs that output the wrong line termination, and I think this is because the C compiler should not bake a particular line-termination character into the program. It is the same as assuming that a particular number has to be written in a particular way.
A little addition: below this question there's a comment saying that it is possible to change the behaviour of format specifiers at run time. If this is true, then we have the final answer to the question... can someone confirm?
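As a concrete illustration of the locale point, here is a small sketch using the ' flag (a POSIX extension, not ISO C): the very same call prints different bytes depending on the locale in effect at run time, so the compiler could not have baked the formatted text in. The en_US.UTF-8 locale name is an assumption and has to be installed on the system.
#include <locale.h>
#include <stdio.h>

int main(void) {
    printf("%'d\n", 1234567);          /* default "C" locale: 1234567 */

    setlocale(LC_ALL, "en_US.UTF-8");  /* assumed to be available */
    printf("%'d\n", 1234567);          /* typically prints 1,234,567 */
    return 0;
}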
I have to do a class exercise where I have to write some subroutines and then check the cache misses; this is the attachment:
I have to create two subroutines, where s1 is the capacity of my L1 cache (32 KB) and b1 is the cache-line width (64 bytes):
subroutine A: increment all bytes of a memory buffer containing 2*s1 bytes, in the order of increasing memory addresses;
subroutine B: increment each b1-th byte of a memory buffer containing 2*s1 bytes, in the order of increasing memory addresses;
For subroutine A I think I just have to do:
char buffer[2*s1];
printf...
buffer++;
printf...
both printf will show:
buffer[0]= 0x7fff36769fe0 buffer[1]= 0x7fff36769fe1
buffer[0]= 0x7fff36769fe1 buffer[1]= 0x7fff36769fe2
All bytes would be increased, so I think that is correct. For subroutine B, though, I have no idea, so I would like some help with it.
It would be nice if someone could help me.
Thank you!
These exercises only make sense as a lesson in optimisation. Unfortunately, they're not quite hitting the figurative nail on the head.
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
That's a famous quote, alongside its context, from perhaps the most reputable teacher we can agree upon in this field: Donald Knuth. Following on from this exercise, your professor will probably introduce you to profilers by walking you through profiling your unoptimised code.
What can we hope to improve if we apply manual optimisations as though the compiler performs none, and only later enable the compiler's automatic optimisations? Perhaps the compiler will automatically perform the same optimisations (or better ones) that we perform manually, but in a minuscule fraction of the time.
There's a marketing/planning strategy to consider behind the figures in that quote. In the commercial world, or even as a hobbyist, you will have spent 97% of your time up front developing and testing a program that solves the actual problem. Notice the emphasis on actual problem. You won't be writing and testing programs for the purpose of observing cache behaviours. You'll be writing and testing programs to solve real-life problems.
Once you've written and tested the entire program thoroughly, you'll know whether or not it feels fast enough (e.g. you'll probably be told by your clients/employers to optimise it). If it feels fast enough, you won't optimise it, even if there are cache misses everywhere; you won't even know there are cache misses. If it feels too slow, however, you'll want to use a profiler on your fully optimised code to determine where the most significant bottleneck is, and spend roughly 3% of your time on that bottleneck...
It is possible to program optimally from the start, but that's not an exercise in observation like these exercises; it's an exercise in planning and factoring. One guideline is to avoid repeating yourself, which is what I've done with the code below. By determining the common behaviours between the two exercises (e.g. both loop from low indexes to high indexes, incrementing each element by 1), I've reduced two loops to one and avoided repeating myself at the same time. Factor your programs to avoid as much repetition as possible and your code will be smaller, resulting in less testing, less maintenance and, of course, fewer repeated keystrokes. Most of the optimisation here is an optimisation of our time, not of the computer's time! It just so happens that we are also optimising the computer's time by making our code smaller.
#include <stddef.h> /* size_t */

void increment_every(char *buffer, size_t multiple, size_t maximum) {
    /* Increment buffer[0], buffer[multiple], buffer[2*multiple], ...
     * while staying within the first `maximum` bytes. Using >= (rather
     * than >) ensures the final element is incremented too. */
    while (maximum >= multiple) {
        buffer[0]++;
        buffer += multiple;
        maximum -= multiple;
    }
}
You should be able to solve both of these problems in terms of this function. For example:
#define S1 32768 /* 32KB */
#define B1 64 /* 64 B */
int main(void) {
    char buffer[2 * S1] = { 0 };

    increment_every(buffer, 1, sizeof buffer);                // subroutine A

    if (B1 < S1) {                                            // subroutine B
        increment_every(buffer + B1, B1, sizeof buffer - B1);
    }
}
This is assuming S1 and B1 are fixed values that don't come from a file or an interactive device. Such an assumption lets the compiler be quite aggressive with optimisation. In fact, this program will probably be optimised down to int main(void) { } because there's no observable behaviour: there's no input from or output to any files or interactive devices anywhere in this program! Hopefully you'll now see a few true lessons behind your next lesson:
Write code optimally rather than write optimal code. e.g. Avoid repeating yourself.
Profile your code only when it feels too slow. e.g. Optimise only after you've written and tested the entire program.
From a memory access standpoint... is it worth attempting an optimization like this?
int boolean_value = 0;
//magical code happens and boolean_value could be 0 or 1
if(boolean_value)
{
//do something
}
Instead of
unsigned char boolean_value = 0;
//magical code happens and boolean_value could be 0 or 1
if(boolean_value)
{
//do something
}
The unsigned char of course takes up only 1 byte as opposed to the int's 4 (assuming a 32-bit platform here), but my understanding is that it would be faster for a processor to read the integer value from memory.
It may or may not be faster, and the speed depends on so many things that a generic answer is impossible. For example: hardware architecture, compiler, compiler options, amount of data (does it fit into L1 cache?), other things competing for the CPU, etc.
The correct answer, therefore, is: try both ways and measure for your particular case.
If measurement does not indicate that one method is significantly faster than the other, opt for the one that is clearer.
From a memory access standpoint... is it worth attempting an optimization like this?
Probably not. On almost all modern processors, memory is fetched based on the word size of the processor. In your case, even to read one byte, your processor probably fetches an entire 32-bit word or more, depending on that processor's caching. Your architecture may vary, so you will want to understand how your CPU works to gauge the effect.
But as others have said, it doesn't hurt to try it and measure it.
This is almost never a good idea. Many systems can only read word-sized chunks from memory at once, so reading a byte then masking or shifting will actually take more code space and just as much (data) memory. If you're using an obscure tiny system, measure, but in general this will actually slow down and bloat your code.
Asking how much memory unsigned char takes versus int is only meaningful when it's in an array (or possibly a structure, if you're careful to order the elements to take care of alignment). As a lone variable, it's very unlikely that you save any memory at all, and the compiler is likely to generate larger code to truncate the upper bits of registers.
As a general policy, never use smaller-than-int types except in arrays unless you have a really good reason other than trying to save space.
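To make the array point concrete, here is a small sketch (the exact sizes are implementation-defined; 1 and 4 bytes are merely typical for char and int):
#include <stdio.h>

int main(void) {
    unsigned char flag = 0;           /* a lone flag: any space saved here  */
    int wide_flag = 0;                /* is usually lost to alignment and   */
                                      /* wider register operations anyway   */

    unsigned char flags[1000] = {0};  /* typically ~1000 bytes              */
    int wide_flags[1000] = {0};       /* typically ~4000 bytes              */

    printf("%d %d\n", flag, wide_flag);
    printf("%zu vs %zu bytes\n", sizeof flags, sizeof wide_flags);
    return 0;
}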
Follow the standard rules of optimization. First, don't optimize. Then test if your code needs it at some point. Then optimize that point. This link provides an excellent intro to the topic of optimization.
http://www.catb.org/~esr/writings/taoup/html/optimizationchapter.html
I'm trying to write a crc32 implementation on Linux that's as fast as possible, as an exercise in learning to optimise C. I've tried my best, but I haven't been able to find many good resources online. I'm not even sure if my buffer size is sensible; it was chosen by repeated experimentation.
#include <stdio.h>
#define BUFFSIZE 1048567
const unsigned long int lookupbase = 0xEDB88320;
unsigned long int crctable[256] = {
    0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA,
    /* LONG LIST OF PRECALCULATED VALUES */
    0xB40BBE37, 0xC30C8EA1, 0x5A05DF1B, 0x2D02EF8D};
int main(int argc, char *argv[]){
    register unsigned long int x;
    int i;
    register unsigned char *c, *endbuff;
    unsigned char buff[BUFFSIZE];
    register FILE *thisfile=NULL;
    for (i = 1; i < argc; i++){
        thisfile = fopen(argv[i], "r");
        if (thisfile == NULL) {
            printf("Unable to open ");
        } else {
            x = 0xFFFFFFFF;
            c = &(buff[0]);
            endbuff = &(buff[fread(buff, (sizeof (unsigned char)), BUFFSIZE, thisfile)]);
            while (c != endbuff){
                while (c != endbuff){
                    x=(x>>8) ^ crctable[(x&0xFF)^*c];
                    c++;
                }
                c = &(buff[0]);
                endbuff = &(buff[fread(buff, (sizeof (unsigned char)), BUFFSIZE, thisfile)]);
            }
            fclose(thisfile);
            x = x ^ 0xFFFFFFFF;
            printf("%0.8X ", x);
        }
        printf("%s\n", argv[i]);
    }
    return 0;
}
Thanks in advance for any suggestions or resources I can read through.
On Linux? Forget about the register keyword, that's just a suggestion to the compiler and, from my experience with gcc, it's a waste of space. gcc is more than capable of figuring that out for itself.
I would just make sure you're compiling with the insane optimisation level, -O3, and check the result. I've seen gcc produce code at that level which took me hours to understand, so sneaky was it.
And, on the buffer size, make it as large as you possibly can. Even with buffering, the cost of calling fread is still a cost, so the less often you do it, the better. You would see a huge improvement if you increased the buffer size from 1K to 1M, not so much if you upped it from 1M to 2M, but even a small amount of increased performance is an increase. And 2M isn't the upper bound of what you can use; I'd set it to one or more gigabytes if possible.
You may then want to put it at file level (rather than inside main). At some point, the stack won't be able to hold it.
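A minimal sketch of that change (the 64 MB figure is just an example; pick whatever fits your machine):
#include <stdio.h>

#define BUFFSIZE (64u * 1024 * 1024)   /* e.g. 64 MB */

/* File scope (static storage duration): the buffer is not on the stack,
 * so its size isn't constrained by the stack limit. */
static unsigned char buff[BUFFSIZE];

int main(void) {
    printf("buffer of %zu bytes ready\n", sizeof buff);
    return 0;
}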
As with most optimisations, you can usually trade space for time. Keep in mind that, for small files (less than 1M), you won't see any improvement since there is still only one read no matter how big you make the buffer. You may even find a slight slowdown if the loading of the process has to take more time to set up memory.
But, since this would only be for small files (where the performance isn't a problem anyway), it shouldn't really matter. Large files, where the performance is an issue, should hopefully find an improvement.
And I know I don't need to tell you this (since you indicate you are doing it), but I will mention it anyway for those who don't know: Measure, don't guess! The ground is littered with the corpses of those who optimised with guesswork :-)
You've asked for three values to be stored in registers, but standard x86 only has four general purpose registers: that's an awful lot of burden to place on the last remaining register, which is one reason why I expect register really only prevents you from ever using &foo to find the address of the variable. I don't think any modern compiler even uses it as a hint, these days. Feel free to remove all three uses and re-time your application.
Since you're reading in huge chunks of the file yourself, you might as well use open(2) and read(2) directly, and remove all the standard IO handling behind the scenes. Another common approach is to open(2) and mmap(2) the file into memory: let the OS page it in as pages are required. This may allow future pages to be optimistically read from disk while you're doing your computation: this is a common access pattern, and one the OS designers have attempted to optimize. (The simple mechanism of mapping the entire file at once does put an upper limit on the size of the files you can handle, probably about 2.5 gigabytes on 32-bit platforms and absolutely huge on 64-bit platforms. Mapping the file in chunks will allow you to handle arbitrary sized files even on 32-bit platforms, but at the cost of loops like you've got now for reading, but for mapping.)
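Here's a rough, hedged sketch of the whole-file mmap approach (error handling abbreviated; it assumes a non-empty regular file and maps it in one go, so it is subject to the address-space limits mentioned above). The bit-at-a-time CRC update is used only to keep the example self-contained; in practice you'd keep your table-driven inner loop:
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Bit-at-a-time CRC-32 (same 0xEDB88320 polynomial as the table). */
static unsigned long crc32_update(unsigned long x, const unsigned char *p, size_t n) {
    while (n--) {
        x ^= *p++;
        for (int k = 0; k < 8; k++)
            x = (x >> 1) ^ (0xEDB88320UL & (0UL - (x & 1)));
    }
    return x & 0xFFFFFFFFUL;
}

int main(int argc, char *argv[]) {
    for (int i = 1; i < argc; i++) {
        int fd = open(argv[i], O_RDONLY);
        if (fd < 0) { perror(argv[i]); continue; }
        struct stat st;
        if (fstat(fd, &st) < 0) { perror(argv[i]); close(fd); continue; }

        /* Let the OS page the file in as the loop touches it. */
        unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                        /* the mapping survives the close */
        if (p == MAP_FAILED) { perror("mmap"); continue; }

        unsigned long x = crc32_update(0xFFFFFFFFUL, p, st.st_size) ^ 0xFFFFFFFFUL;
        munmap(p, st.st_size);
        printf("%08lX %s\n", x, argv[i]);
    }
    return 0;
}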
As David Gelhar points out, you're using an odd-length buffer -- this might complicate the code path of reading the file into memory. If you want to stick with reading from files into buffers, I suggest using a multiple of 8192 (two pages of memory), as it won't have special cases until the last loop.
If you're really into eking out the last bit of speed and don't mind drastically increasing the size of your pre-computation table, you can look at the file in 16-bit chunks rather than 8-bit chunks. Frequently, accessing memory along 16-bit alignment is faster than along 8-bit alignment, and you'd cut the number of iterations through your loop in half, which usually gives a huge speed boost. The downside, of course, is increased memory pressure (65,536 entries of 8 bytes each, rather than just 256 entries of 4 bytes each), and the much larger table is much less likely to fit entirely in the CPU cache.
And the last optimization idea that crosses my mind is to fork(2) into 2, 3, or 4 processes (or use threading), each of which can compute the crc32 of a portion of the file, and then combine the end results after all processes have completed. crc32 may not be computationally intensive enough to actually benefit from trying to use multiple cores out of SMP or multicore computers, and figuring out how to combine partial computations of crc32 may not be feasible -- I haven't looked into it myself :) -- but it might repay the effort, and learning how to write multi-process or multi-threaded software is well worth the effort regardless.
You are not going to be able to speed up the actual arithmetic of the CRC calculation, so the areas you can look at are the overhead of (a) reading the file, and (b) looping.
You're using a pretty large buffer size, which is good (but why is it an odd number?). Using a read(2) system call (assuming you're on a Unix-like system) instead of the fread(3) standard library function may save you one copy operation (copying the data from fread's internal buffer into your buffer).
For the loop overhead, look into loop unrolling.
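For instance, here is a hedged sketch of 4-way manual unrolling of the question's inner loop, written as a helper that reuses the crctable from the question's file (the compiler may already do something similar at -O3, so measure before and after):
static unsigned long crc_chunk(unsigned long x,
                               const unsigned char *c,
                               const unsigned char *endbuff) {
    /* main body: four table lookups per iteration */
    while (endbuff - c >= 4) {
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ c[0]];
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ c[1]];
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ c[2]];
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ c[3]];
        c += 4;
    }
    /* tail: at most three leftover bytes */
    while (c != endbuff)
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ *c++];
    return x;
}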
Your code also has some redundancies that you might want to eliminate.
sizeof (unsigned char) is 1 (by definition in C); no need to explicitly compute it
c = &(buff[0]) is exactly equivalent to c = buff
Neither of these changes will improve the performance of the code (assuming a decent compiler), but they will make it clearer and more in accordance with usual C style.
C++ has std::vector and Java has ArrayList, and many other languages have their own form of dynamically allocated array. When a dynamic array runs out of space, it gets reallocated into a larger area and the old values are copied into the new array. A question central to the performance of such an array is how fast the array grows in size. If you always only grow large enough to fit the current push, you'll end up reallocating every time. So it makes sense to double the array size, or multiply it by, say, 1.5x.
Is there an ideal growth factor? 2x? 1.5x? By ideal I mean mathematically justified, best balancing performance and wasted memory. I realize that, theoretically, given that your application could have any potential distribution of pushes, this is somewhat application dependent. But I'm curious to know if there's a value that's "usually" best, or is considered best within some rigorous constraint.
I've heard there's a paper on this somewhere, but I've been unable to find it.
I remember reading many years ago why 1.5 is preferred over two, at least as applied to C++ (this probably doesn't apply to managed languages, where the runtime system can relocate objects at will).
The reasoning is this:
Say you start with a 16-byte allocation.
When you need more, you allocate 32 bytes, then free up 16 bytes. This leaves a 16-byte hole in memory.
When you need more, you allocate 64 bytes, freeing up the 32 bytes. This leaves a 48-byte hole (if the 16 and 32 were adjacent).
When you need more, you allocate 128 bytes, freeing up the 64 bytes. This leaves a 112-byte hole (assuming all previous allocations are adjacent).
And so on and so forth.
The idea is that, with a 2x expansion, there is no point in time at which the resulting hole will ever be large enough to reuse for the next allocation. With a 1.5x expansion, we have this instead:
Start with 16 bytes.
When you need more, allocate 24 bytes, then free up the 16, leaving a 16-byte hole.
When you need more, allocate 36 bytes, then free up the 24, leaving a 40-byte hole.
When you need more, allocate 54 bytes, then free up the 36, leaving a 76-byte hole.
When you need more, allocate 81 bytes, then free up the 54, leaving a 130-byte hole.
When you need more, use 122 bytes (rounding up) from the 130-byte hole.
In the limit as n → ∞, it would be the golden ratio: ϕ = 1.618...
For finite n, you want something close, like 1.5.
The reason is that you want to be able to reuse older memory blocks, to take advantage of caching and avoid constantly making the OS give you more memory pages. The equation you'd solve to ensure that a subsequent allocation can reuse all prior blocks reduces to x^(n-1) - 1 = x^(n+1) - x^n, whose solution approaches x = ϕ for large n. In practice n is finite and you'll want to be able to reuse the last few blocks every few allocations, so 1.5 is great for ensuring that.
(See the link for a more detailed explanation.)
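Here's a small sketch that replays the accounting above for an arbitrary factor, starting from 16 bytes and assuming, as above, that all freed blocks are adjacent and merge into one hole:
#include <stdio.h>

static void simulate(double factor, int steps) {
    double size = 16.0, hole = 0.0;
    printf("factor %.2f:\n", factor);
    for (int i = 0; i < steps; i++) {
        double next = size * factor;            /* size of the next block */
        printf("  need %6.0f, hole %6.0f -> %s\n",
               next, hole, next <= hole ? "fits in the hole" : "fresh memory");
        hole += size;                           /* the old block is freed */
        size = next;
    }
}

int main(void) {
    simulate(2.0, 8);   /* the hole never catches up with the next request */
    simulate(1.5, 8);   /* the hole eventually becomes big enough to reuse */
    return 0;
}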
It will entirely depend on the use case. Do you care more about the time wasted copying data around (and reallocating arrays) or the extra memory? How long is the array going to last? If it's not going to be around for long, using a bigger buffer may well be a good idea - the penalty is short-lived. If it's going to hang around (e.g. in Java, going into older and older generations) that's obviously more of a penalty.
There's no such thing as an "ideal growth factor." It's not just theoretically application dependent, it's definitely application dependent.
2 is a pretty common growth factor - I'm pretty sure that's what ArrayList and List<T> in .NET use. ArrayList<T> in Java uses 1.5.
EDIT: As Erich points out, Dictionary<,> in .NET uses "double the size then increase to the next prime number" so that hash values can be distributed reasonably between buckets. (I'm sure I've recently seen documentation suggesting that primes aren't actually that great for distributing hash buckets, but that's an argument for another answer.)
One approach when answering questions like this is to just "cheat" and look at what popular libraries do, under the assumption that a widely used library is, at the very least, not doing something horrible.
So just checking very quickly, Ruby (1.9.1-p129) appears to use 1.5x when appending to an array, and Python (2.6.2) uses 1.125x plus a constant (in Objects/listobject.c):
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
*/
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);

/* check for integer overflow */
if (new_allocated > PY_SIZE_MAX - newsize) {
    PyErr_NoMemory();
    return -1;
} else {
    new_allocated += newsize;
}
newsize above is the number of elements in the array. Note well that newsize is added to new_allocated, so the expression with the bitshifts and ternary operator is really just calculating the over-allocation.
Let's say you grow the array size by x. So assume you start with size T. The next time you grow the array its size will be T*x. Then it will be T*x^2 and so on.
If your goal is to be able to reuse the memory that has been created before, then you want to make sure the new memory you allocate is less than the sum of previous memory you deallocated. Therefore, we have this inequality:
T*x^n <= T + T*x + T*x^2 + ... + T*x^(n-2)
We can remove T from both sides. So we get this:
x^n <= 1 + x + x^2 + ... + x^(n-2)
Informally, what we are saying is that at the nth allocation, we want all our previously deallocated memory to be greater than or equal to the memory needed at the nth allocation, so that we can reuse the previously deallocated memory.
For instance, if we want to be able to do this at the 3rd step (i.e., n=3), then we have
x^3 <= 1 + x
This inequality holds for all x such that 0 < x <= 1.3 (roughly).
See what x we get for different n's below:
n maximum-x (roughly)
3 1.3
4 1.4
5 1.53
6 1.57
7 1.59
22 1.61
Note that the growth factor has to be less than 2, since x^n > x^(n-2) + ... + x^2 + x + 1 for all x >= 2.
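If you want to reproduce that table, here is a small bisection sketch solving x^n = 1 + x + ... + x^(n-2) for each n (with this formula the n = 4 root actually comes out nearer 1.47 than 1.4, and the values approach the golden ratio as n grows):
#include <stdio.h>
#include <math.h>

/* f(x) = x^n - (1 + x + ... + x^(n-2)); its root in (1, 2) is the
 * largest growth factor that still allows reuse at the nth step. */
static double f(double x, int n) {
    double sum = 0.0;
    for (int k = 0; k <= n - 2; k++)
        sum += pow(x, k);
    return pow(x, n) - sum;
}

int main(void) {
    int ns[] = { 3, 4, 5, 6, 7, 22 };
    for (size_t i = 0; i < sizeof ns / sizeof ns[0]; i++) {
        double lo = 1.0, hi = 2.0;        /* f(1) < 0 and f(2) > 0 */
        for (int it = 0; it < 60; it++) { /* plain bisection       */
            double mid = (lo + hi) / 2.0;
            if (f(mid, ns[i]) > 0.0) hi = mid; else lo = mid;
        }
        printf("n = %2d  maximum x ~ %.4f\n", ns[i], lo);
    }
    return 0;
}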
Another two cents
Most computers have virtual memory! In physical memory you can have random pages scattered everywhere, which appear as a single contiguous space in your program's virtual memory; the hardware resolves the indirection. Virtual-memory exhaustion was a problem on 32-bit systems, but it is really not a problem anymore, so filling the hole is no longer a concern (except in special environments). Since Windows 7, even Microsoft supports 64-bit without extra effort. (Written in 2011.)
Amortized O(1) append is reached with any factor r > 1; the same mathematical proof works not only for 2 as the parameter.
r = 1.5 can be calculated with old*3/2, so there is no need for floating-point operations. (I say /2 because compilers will replace it with a bit shift in the generated assembly code if they see fit.) A small integer-math sketch follows at the end of this answer.
MSVC went for r = 1.5, so there is at least one major compiler that does not use 2 as ratio.
As someone mentioned, 2 feels better than 8. And 2 also feels better than 1.1.
My feeling is that 1.5 is a good default. Other than that it depends on the specific case.
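Here is the small integer-math sketch mentioned above: 1.5x growth without floating point, written as cap + cap/2 (which also overflows later than cap*3/2 would):
#include <stdio.h>

/* Grow by roughly 1.5x using integer math only. */
static size_t grow_capacity(size_t cap) {
    return cap < 2 ? cap + 1 : cap + cap / 2;
}

int main(void) {
    size_t cap = 16;
    for (int i = 0; i < 10; i++) {
        printf("%zu ", cap);
        cap = grow_capacity(cap);
    }
    printf("\n");   /* 16 24 36 54 81 121 181 271 406 609 */
    return 0;
}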
The top-voted and the accepted answer are both good, but neither answers the part of the question asking for a "mathematically justified" "ideal growth rate", "best balancing performance and wasted memory". (The second-top-voted answer does try to answer this part of the question, but its reasoning is confused.)
The question perfectly identifies the 2 considerations that have to be balanced, performance and wasted memory. If you choose a growth rate too low, performance suffers because you'll run out of extra space too quickly and have to reallocate too frequently. If you choose a growth rate too high, like 2x, you'll waste memory because you'll never be able to reuse old memory blocks.
In particular, if you do the math [1] you'll find that the upper limit on the growth rate is the golden ratio ϕ = 1.618… . Growth rates larger than ϕ (like 2x) mean that you'll never be able to reuse old memory blocks. Growth rates only slightly less than ϕ mean you won't be able to reuse old memory blocks until after many, many reallocations, during which time you'll be wasting memory. So you want to be as far below ϕ as you can get without sacrificing too much performance.
Therefore I'd suggest these candidates for "mathematically justified" "ideal growth rate", "best balancing performance and wasted memory":
≈1.466x (the solution to x^4 = 1 + x + x^2) allows memory reuse after just 3 reallocations, one sooner than 1.5x allows, while reallocating only slightly more frequently
≈1.534x (the solution to x^5 = 1 + x + x^2 + x^3) allows memory reuse after 4 reallocations, same as 1.5x, while reallocating slightly less frequently for improved performance
≈1.570x (the solution to x^6 = 1 + x + x^2 + x^3 + x^4) only allows memory reuse after 5 reallocations, but reallocates even less frequently for even further improved performance (barely)
Clearly there's some diminishing returns there, so I think the global optimum is probably among those. Also, note that 1.5x is a great approximation to whatever the global optimum actually is, and has the advantage of being extremely simple.
[1] Credit to user541686 for this excellent source.
It really depends. Some people analyze common usage cases to find the optimal number.
I've seen 1.5x, 2.0x, ϕx, and powers of 2 used before.
If you have a distribution over array lengths, and you have a utility function that says how much you like wasting space vs. wasting time, then you can definitely choose an optimal resizing (and initial sizing) strategy.
The reason the simple constant multiple is used is obviously so that each append has amortized constant time. But that doesn't mean you can't use a different (larger) ratio for small sizes.
In Scala, you can override loadFactor for the standard library hash tables with a function that looks at the current size. Oddly, the resizable arrays just double, which is what most people do in practice.
I don't know of any doubling (or 1.5*ing) arrays that actually catch out of memory errors and grow less in that case. It seems that if you had a huge single array, you'd want to do that.
I'd further add that if you're keeping the resizable arrays around long enough, and you favor space over time, it might make sense to dramatically overallocate (for most cases) initially and then reallocate to exactly the right size when you're done.
I was recently fascinated by the experimental data I obtained on the wasted-memory aspect of things. The chart below shows the "overhead factor", calculated as the amount of overhead space divided by the useful space; the x-axis shows the growth factor. I'm yet to find a good explanation/model of what it reveals.
Simulation snippet: https://gist.github.com/gubenkoved/7cd3f0cb36da56c219ff049e4518a4bd.
Neither the shape nor the absolute values that the simulation reveals are something I had expected.
Higher-resolution chart showing dependency on the max useful data size is here: https://i.stack.imgur.com/Ld2yJ.png.
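I don't know the exact model used in the gist, but a rough guess at the kind of quantity being plotted is the relative slack (capacity minus length, divided by length) averaged over all lengths up to some maximum; here is a small sketch of that idea:
#include <stdio.h>

int main(void) {
    const long N = 1000000;                     /* maximum useful length */
    for (double factor = 1.25; factor <= 2.0; factor += 0.25) {
        double total = 0.0;
        long cap = 1;
        for (long n = 1; n <= N; n++) {
            while (cap < n)                     /* grow when exceeded */
                cap = (long)(cap * factor) + 1;
            total += (double)(cap - n) / n;     /* relative slack at length n */
        }
        printf("factor %.2f  mean overhead %.3f\n", factor, total / N);
    }
    return 0;
}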
UPDATE. After pondering this some more, I've finally come up with the correct model to explain the simulation data, and, hopefully, it matches the experimental data nicely. The formula is quite easy to infer simply by looking at the size of the array we would need for a given number of elements.
The GitHub gist referenced earlier was updated to include calculations using scipy.integrate for numerical integration, which allows creating the plot below that verifies the experimental data pretty nicely.
UPDATE 2. One should, however, keep in mind that what we model/emulate there mostly has to do with virtual memory: the over-allocation overhead can be left entirely in virtual-memory territory, since a physical-memory footprint is only incurred when we first access a page of virtual memory. So it's possible to malloc a big chunk of memory, but until we first access the pages all we are doing is reserving virtual address space. I've updated the GitHub gist with a C++ program that has a very basic dynamic-array implementation allowing the growth factor to be changed, plus a Python snippet that runs it multiple times to gather the "real" data. Please see the final graph below.
The conclusion could be that for x64 environments, where virtual address space is not a limiting factor, there could be really little to no difference in physical-memory footprint between different growth factors. Additionally, as far as virtual memory is concerned, the model above seems to make pretty good predictions!
The simulation snippet was built with g++.exe simulator.cpp -o simulator.exe on Windows 10 (build 19043); the g++ version is below.
g++.exe (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0
PS. Note that the end result is implementation-specific. Depending on implementation details, a dynamic array might or might not access memory outside the "useful" boundaries. Some implementations use memset to zero-initialize POD elements for the whole capacity; this causes the virtual-memory pages to be backed by physical memory. However, the std::vector implementation on the compiler referenced above does not seem to do that, and so behaves like the mock dynamic array in the snippet -- meaning the overhead is incurred on the virtual-memory side and is negligible on the physical-memory side.
I agree with Jon Skeet; even my theorycrafter friend insists that this can be proven to be O(1) when setting the factor to 2x.
The ratio between CPU time and memory is different on each machine, and so the factor will vary just as much. If you have a machine with gigabytes of RAM and a slow CPU, copying the elements to a new array is a lot more expensive than on a fast machine, which might in turn have less memory. It's a question that can be answered in theory, for a uniform computer, which in real scenarios doesn't help you at all.
I know it is an old question, but there are several things that everyone seems to be missing.
First, this is multiplication by 2: size << 1. And this is multiplication by anything between 1 and 2: int(float(size) * x), where x is the factor, the multiplication is done in floating point, and the processor has to run additional instructions to cast between float and int. In other words, at the machine level, doubling takes a single, very fast instruction to find the new size. Multiplying by something between 1 and 2 requires at least one instruction to cast size to a float, one instruction to multiply (which is a float multiplication, so it probably takes at least twice as many cycles, if not 4 or even 8 times as many), and one instruction to cast back to int, and that assumes your platform can perform float math on the general-purpose registers instead of requiring the use of special registers. In short, you should expect the math for each allocation to take at least 10 times as long as a simple left shift. If you are copying a lot of data during the reallocation, though, this might not make much of a difference.
Second, and probably the big kicker: everyone seems to assume that the memory being freed is both contiguous with itself and contiguous with the newly allocated memory. Unless you are pre-allocating all of the memory yourself and then using it as a pool, this is almost certainly not the case. The OS might occasionally end up doing this, but most of the time there is going to be enough free-space fragmentation that any half-decent memory management system will be able to find a small hole where your memory will just fit. Once you get to really big chunks, you are more likely to end up with contiguous pieces, but by then your allocations are big enough that you are not doing them frequently enough for it to matter anymore. In short, it is fun to imagine that using some ideal number will allow the most efficient use of free memory space, but in reality it is not going to happen unless your program is running on bare metal (as in, there is no OS underneath it making all of the decisions).
My answer to the question? Nope, there is no ideal number. It is so application specific that no one really even tries. If your goal is ideal memory usage, you are pretty much out of luck. For performance, less frequent allocations are better, but if we went just with that, we could multiply by 4 or even 8! Of course, when Firefox jumps from using 1GB to 8GB in one shot, people are going to complain, so that does not even make sense. Here are some rules of thumb I would go by though:
If you cannot optimize memory usage, at least don't waste processor cycles. Multiplying by 2 is at least an order of magnitude faster than doing floating point math. It might not make a huge difference, but it will make some difference at least (especially early on, during the more frequent and smaller allocations).
Don't overthink it. If you just spent 4 hours trying to figure out how to do something that has already been done, you just wasted your time. Quite honestly, if there were a better option than *2, it would have been done in the C++ vector class (and many other places) decades ago.
Lastly, if you really want to optimize, don't sweat the small stuff. Nowadays, no one cares about 4 KB of memory being wasted, unless they are working on embedded systems. When you get to 1 GB of objects that are between 1 MB and 10 MB each, doubling is probably way too much (I mean, that is between 100 and 1,000 objects). If you can estimate the expected expansion rate, you can level it out to a linear growth rate at a certain point. If you expect around 10 objects per minute, then growing by 5 to 10 object sizes per step (once every 30 seconds to a minute) is probably fine.
What it all comes down to is: don't overthink it, optimize what you can, and customize to your application (and platform) if you must.