Why is it that data structures usually have a size of 2^n? - c

Is there a historical reason or something? I've seen quite a few times something like char foo[256]; or #define BUF_SIZE 1024. I mostly use 2^n-sized buffers myself, mainly because I think it looks more elegant and that way I don't have to think of a specific number. But I'm not quite sure that's the reason most people use them; more information would be appreciated.

There may be a number of reasons, although many people will as you say just do it out of habit.
One place where it is very useful is in the efficient implementation of circular buffers, especially on architectures where the % operator is expensive (those without a hardware divide - primarily 8-bit microcontrollers). By using a 2^n buffer in this case, the modulo is simply a matter of bit-masking the upper bits, or in the case of, say, a 256-byte buffer, simply using an 8-bit index and letting it wrap around.
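As a minimal sketch of the 256-byte case (the buffer size and names here are illustrative), an 8-bit index wraps around on its own, so neither % nor a mask is needed:
#include <stdint.h>

#define BUF_SIZE 256               // 2^8, so a uint8_t index wraps for free

static uint8_t buf[BUF_SIZE];
static uint8_t head, tail;         // 8-bit indices: 255 + 1 wraps back to 0

static void put(uint8_t byte)
{
    buf[head++] = byte;            // no % and no mask needed
}

static uint8_t get(void)
{
    return buf[tail++];
}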
In other cases, alignment with page boundaries, caches, etc. may provide opportunities for optimisation on some architectures - but that would be very architecture specific. It may also simply be that such buffers give the compiler optimisation possibilities, so all other things being equal, why not?

Cache lines are usually some power of 2 in size (often 32 or 64 bytes). Data that is an integral multiple of that number fits into (and fully utilizes) the corresponding number of cache lines. The more data you can pack into your cache, the better the performance, so I think people who design their structures that way are optimizing for that.
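As a sketch of that idea, assuming a 64-byte cache line (common but by no means universal), C11 lets you size and align a structure to exactly one line:
#include <stdint.h>
#include <stdalign.h>

#define CACHE_LINE 64                         // assumed line size

// alignas rounds the struct's size and alignment up to one full cache line,
// so each element of an array of these occupies exactly one line.
struct counter {
    alignas(CACHE_LINE) uint64_t value;
};

static struct counter counters[16];           // 16 lines, no sharing between elements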

Another reason, in addition to what everyone else has mentioned, is that SSE instructions operate on multiple elements at a time, and the number of elements processed is always some power of two. Making the buffer size a power of two guarantees you won't be reading unallocated memory. This only applies if you're actually using SSE instructions, though.
I think in the end though, the overwhelming reason in most cases is that programmers like powers of two.
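As a hedged illustration (the buffer size and the summing task are invented for this example, and it assumes an x86 target with SSE available), a power-of-two element count divides evenly into 4-float chunks, so the vector loop never runs past the end of the buffer:
#include <immintrin.h>   // SSE intrinsics
#include <stdio.h>

#define N 256            // power of two, hence a multiple of 4

int main(void)
{
    float buf[N];
    for (int i = 0; i < N; ++i)
        buf[i] = 1.0f;

    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < N; i += 4)               // 4 floats per SSE register
        acc = _mm_add_ps(acc, _mm_loadu_ps(&buf[i]));

    float partial[4];
    _mm_storeu_ps(partial, acc);
    printf("sum = %f\n", partial[0] + partial[1] + partial[2] + partial[3]);
    return 0;
}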

Hash Tables, Allocation by Pages
This really helps for hash tables, because you compute the index modulo the size, and if that size is a power of two, the modulus can be computed with a simple bitwise AND (&) rather than a much slower divide-class instruction implementing the % operator.
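A minimal sketch of that trick (the table size and names are illustrative):
#include <stdint.h>

#define TABLE_SIZE 1024u               // must be a power of two

// Because TABLE_SIZE is a power of two, hash % TABLE_SIZE equals
// hash & (TABLE_SIZE - 1), which avoids a division instruction.
static inline uint32_t bucket_index(uint32_t hash)
{
    return hash & (TABLE_SIZE - 1u);
}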
Looking at an old Intel i386 book, an and is 2 cycles and a div is 40 cycles. A disparity persists today due to the much greater fundamental complexity of division, even though the roughly 1000x faster cycle times tend to hide the impact of even the slowest machine ops.
There was also a time when malloc overhead was occasionally avoided at great length. Allocations available directly from the operating system would be (and still are) a specific number of pages, so a power of two would be likely to make the most use of the allocation granularity.
And, as others have noted, programmers like powers of two.

I can think of a few reasons off the top of my head:
2^n is a very common value in computing. It follows directly from the way bits are represented (2 possible values each), which means variables tend to have value ranges whose boundaries are powers of 2.
Because of the point above, you'll often find the value 256 as the size of a buffer. This is because it is the number of distinct values a single byte can represent. So, if you want to store a string together with its size, you'll be most efficient if you store it as SIZE_BYTE+ARRAY, where the size byte tells you the size of the array; the size byte can describe any length from 0 to 255. (A sketch of this layout appears after these points.)
Many other times, sizes are chosen based on physical constraints (for example, the amount of memory an operating system can address is related to the size of the CPU's registers, etc.), and these are also a specific number of bits. Meaning, the amount of memory you can use will usually be some power of 2 (for a 32-bit system, 2^32 bytes).
There might be performance benefits/alignment issues for such values. Most processors can access a certain number of bytes at a time, so even if you have a variable whose size is (let's say) 20 bits, a 32-bit processor will still read 32 bits, no matter what. So it's often more efficient to just make the variable 32 bits. Also, some processors require variables to be aligned to a certain number of bytes (because they can't read memory from, for example, odd addresses). Of course, sometimes it's not about odd memory locations, but locations that are multiples of 4, or 8, etc. So in these cases, it's more efficient to just make buffers that will always be aligned.
Ok, those points came out a bit jumbled. Let me know if you need further explanation, especially point 4 which IMO is the most important.
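As a rough illustration of the SIZE_BYTE+ARRAY layout from point 2 (the struct and function names here are made up for the example):
#include <stdint.h>
#include <string.h>

// A length-prefixed ("Pascal-style") string: one size byte followed by the data.
struct pstring {
    uint8_t size;        // number of valid bytes in data[]
    char    data[255];   // payload; size can be 0..255
};

static void pstring_set(struct pstring *p, const char *s)
{
    size_t n = strlen(s);
    if (n > sizeof p->data)
        n = sizeof p->data;            // truncate to what the size byte can describe
    p->size = (uint8_t)n;
    memcpy(p->data, s, n);
}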

Because of the simplicity (read also cost) of base 2 arithmetic in electronics: shift left (multiply by 2), shift right (divide by 2).
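A trivial illustration in C (for unsigned values, the compiler treats these as the same operations):
#include <stdio.h>

int main(void)
{
    unsigned x = 40;
    printf("%u %u\n", x << 1, x * 2);   // both 80: shift left multiplies by 2
    printf("%u %u\n", x >> 1, x / 2);   // both 20: shift right divides by 2
    return 0;
}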
In the CPU domain, lots of constructs revolve around base 2 arithmetic. Buses (control and data) used to access memory are often aligned on powers of 2. The cost of implementing logic in electronics (e.g. a CPU) makes base-2 arithmetic compelling.
Of course, if we had analog computers, the story would be different.
FYI: the attributes of a system sitting at layer X are a direct consequence of the attributes of the layers below it, i.e. layers < X. The reason I am stating this stems from some comments I received regarding my posting.
E.g. the properties that can be manipulated at the "compiler" level are inherited and derived from the properties of the system below it, i.e. the electronics in the CPU.

I was going to use the shift argument, but couldn't think of a good reason to justify it.
One thing that is nice about a buffer that is a power of two is that circular buffer handling can use simple ands rather than divides:
#define BUFSIZE 1024
++index; // increment the index.
index &= BUFSIZE - 1; // Mask keeps it in the range 0..BUFSIZE-1.
If it weren't a power of two, a divide would be necessary. In the olden days (and currently on small chips) that mattered.

It's also common for page sizes to be powers of 2.
On Linux I like to use getpagesize() when doing something like chunking a buffer and writing it to a socket or file descriptor.
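A rough sketch of that pattern, assuming a POSIX system where getpagesize() and write() are available (the function name is made up for the example):
#include <unistd.h>
#include <sys/types.h>

// Write a buffer to a file descriptor in page-sized chunks.
static int write_chunked(int fd, const char *buf, size_t len)
{
    size_t chunk = (size_t)getpagesize();
    size_t off = 0;
    while (off < len) {
        size_t n = len - off < chunk ? len - off : chunk;
        ssize_t w = write(fd, buf + off, n);
        if (w < 0)
            return -1;                 // caller inspects errno
        off += (size_t)w;
    }
    return 0;
}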

It makes a nice, round number in base 2, just as 10, 100 or 1000000 are nice, round numbers in base 10.
If it weren't a power of 2 (or something close, such as 96 = 64 + 32 or 192 = 128 + 64), then you could wonder why there's the added precision. A size that isn't rounded in base 2 can come from external constraints or programmer ignorance. You'll want to know which one it is.
Other answers have pointed out a bunch of technical reasons as well that are valid in special cases. I won't repeat any of them here.

In hash tables, 2^n makes it easier to handle key collisions in a certain way. In general, when there is a key collision, you either make a substructure, e.g. a list, of all entries with the same hash value, or you find another free slot. You could just add 1 to the slot index until you find a free slot, but this strategy is not optimal, because it creates clusters of blocked places. A better strategy is to calculate a second hash number h2, so that gcd(n, h2) = 1; then add h2 to the slot index until you find a free slot (with wraparound). If n is a power of 2, finding an h2 that fulfills gcd(n, h2) = 1 is easy: every odd number will do.
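A minimal sketch of such a double-hashing probe over a power-of-two table (the names, table size and empty-slot convention are illustrative):
#include <stdint.h>
#include <stddef.h>

#define TABLE_SIZE 256u                            // power of two

// h2 is forced to be odd, so gcd(TABLE_SIZE, h2) == 1 and the probe
// sequence visits every slot before repeating.
static size_t find_slot(const uint32_t *keys, uint32_t key,
                        uint32_t h1, uint32_t h2)
{
    size_t idx  = h1 & (TABLE_SIZE - 1u);
    size_t step = (h2 | 1u) & (TABLE_SIZE - 1u);   // masking keeps it odd

    while (keys[idx] != 0 && keys[idx] != key)     // 0 marks an empty slot
        idx = (idx + step) & (TABLE_SIZE - 1u);
    return idx;
}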

Related

Fastest use of a dataset of just over 64 bytes?

Structure: I have 8 64-bit integers (512 bits = 64 bytes, the assumed cache line width) that I would like to compare to another, single 64-bit integer, in turn, without cache misses. The data set is, unfortunately, absolutely inflexible -- it's already as small as possible.
Access pattern: Each uint64_t is in fact an array of 4x4x4 bits, each bit representing the presence or absence of a voxel. This means sometimes I will be using half of one chunk and half of another, or even corners of 8 different 64-bit chunks.... I guess what this means is there is a high likelihood of a lack of alignment.
How can I do this as fast as possible i.e. without thrashing the cache?
P.S. The idea is that this code will ultimately run on a fairly wide range of architectures with at least a 64B cache line width, so I'd prefer this were absolutely as fast as possible. This also means I can't rely on MOVNTDQA, which anyway may incur a performance hit of its own in spite of loading the 9th element directly to the CPU.
P.P.S. My knowledge of this area is fairly limited so please take it easy on me. But please spare me the premature optimisation comments; be sure that this is the 3% of this application that really counts.
I wouldn't worry about it. If your dataset is really only 9 integers, most of it will likely be stored in registers anyway. Also, there isn't really any way to optimize cache usage without specifying an architecture, since cache structure is architecture dependent. If you can list several target architectures you may be able to find some commonalities that you can optimize toward, but without knowing those architectures, I don't think there's much we can do for you.
Lastly, this seems like a good example of optimizing too early. I would suggest you take the following steps:
Decide what your maximum acceptable run time is
Finish your program in C
Compile for all of your target architectures
For those platforms that don't meet your speed spec, hand-optimize the intermediate assembly files and recompile until you meet your spec.
Are you sure you get cache misses?
Even if the value being compared is not in a register, I think your uint64 array should sit in one cache line and your other data in another.
Your cache surely has some n-way associativity, which prevents your data from being evicted from the cache just by accessing the value you compare against.
Do not waste your time on micro-optimizations. Improve your algorithms and data structures.
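If it helps, here is a minimal sketch of the layout being discussed, assuming a 64-byte cache line and a C11 compiler (the names are made up; this is not the asker's actual code):
#include <stdint.h>
#include <stdalign.h>

// Keep the eight 64-bit chunks aligned to one cache line so a scan
// touches a single line, then compare each chunk against the query word.
struct voxel_block {
    alignas(64) uint64_t chunk[8];     // 8 * 64 bits = one 64-byte line
};

static int matches(const struct voxel_block *b, uint64_t query)
{
    for (int i = 0; i < 8; ++i)
        if (b->chunk[i] == query)
            return i;                  // index of the matching chunk
    return -1;                         // no match
}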

C fastest way to compare two bitmaps

There are two arrays of bitmaps in the form of char arrays with millions of records. What would be the fastest way to compare them using C?
I can imagine using the bitwise XOR operator one byte at a time in a for loop.
Important point about bitmaps:
1% to 10% of the times the algorithm is run, the bitmaps can differ. Most of the time they will be the same. When they do differ, they can differ by as much as 100%. There is a high probability that changed bits occur in continuous streaks.
Both bitmaps are of same length.
Aim:
Check whether they differ, and if so, where.
Be correct every time (the probability of detecting an error, if there is one, should be 1).
This answer assumes you mean 'bitmap' as a sequence of 0/1 values rather than 'bitmap image format'
If you simply have two bitmaps of the same length and wish to compare them quickly, memcmp() will be effective, as someone suggested in the comments. You could, if you want, try SSE-type optimizations, but these are not as easy as memcmp(). memcmp() assumes you simply want to know 'they are different' and nothing more.
If you want to know how many bits they differ by, e.g. 615 bits differ, then again you have little option except to XOR every byte and count the number of differences. As others have noted, you probably want to do this 32, 64 or even 256 bits at a time, depending on your platform. However, if the arrays are millions of bytes long, then the biggest delay (with current CPUs) will be the time to transfer main memory to the CPU, and it won't matter terribly what the CPU does (lots of caveats here).
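A sketch of that XOR-and-count approach, processing 64 bits at a time (it assumes a GCC/Clang-style __builtin_popcountll and that both bitmaps have the same length in bytes):
#include <stdint.h>
#include <stddef.h>
#include <string.h>

static size_t count_differing_bits(const unsigned char *a,
                                   const unsigned char *b, size_t len)
{
    size_t diff = 0, i = 0;

    for (; i + 8 <= len; i += 8) {          // 64-bit chunks
        uint64_t wa, wb;
        memcpy(&wa, a + i, 8);              // avoid alignment assumptions
        memcpy(&wb, b + i, 8);
        diff += (size_t)__builtin_popcountll(wa ^ wb);
    }
    for (; i < len; ++i)                    // trailing bytes
        diff += (size_t)__builtin_popcountll((uint64_t)(a[i] ^ b[i]));
    return diff;
}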
If your question is more about comparing A to B, but really you are doing this lots of times, such as A to B and C, D, E, etc., then you can do a couple of things:
A. Store a checksum of each array and first compare the checksums, if these are the same then there is a high chance the arrays are the same. Obviously there is a risk here that checksums can be equal but the data can differ, so make sure that a false result in this case will not have dramatic side effects. And, if you cannot withstand false results, do not use this technique.
B. If the arrays have structure, such as image data, then leverage specific tools for this; how to do so is beyond the scope of this answer.
C. If the image data can be compressed effectively, then compress each array and compare using the compressed form. If you use a ZIP type of compression you cannot tell directly from the zip how many bits differ, but other techniques such as RLE can be effective for quickly counting bit differences (though they are a lot of work to build and get correct and fast).
D. If the risk with (A) is acceptable, then you can checksum each chunk of, say, 262144 bits, and only count differences where the checksums differ. This heavily reduces main memory access and will go lots faster.
All of the options A..D are about reducing main memory access, as this is the nub of any performance gain (for the problem as stated).

Is it possible to create a float array of 10^13 elements in C?

I am writing a program in C to solve an optimisation problem, for which I need to create an array of type float with on the order of 10^13 elements. Is it practically possible to do so on a machine with 20 GB of memory?
A float in C occupies 4 bytes (assuming IEEE floating point arithmetic, which is pretty close to universal nowadays). That means 10^13 elements are naïvely going to require 4×10^13 bytes of space. That's quite a bit (40 TB, a.k.a. quite a lot of disk for a desktop system, and rather more than most people can afford when it comes to RAM) so you need to find another approach.
Is the data sparse (i.e., mostly zeroes)? If it is, you can try using a hash table or tree to store only the values which are anything else; if your data is sufficiently sparse, that'll let you fit everything in. Also be aware that processing 10^13 elements will take a very long time. Even if you could process a billion items a second (very fast, even now) it would still take 10^4 seconds (several hours) and I'd be willing to bet that in any non-trivial situation you'll not be able to get anything near that speed. Can you find some way to make not just the data storage sparse but also the processing, so that you can leave that massive bulk of zeroes alone?
Of course, if the data is non-sparse then you're doomed. In that case, you might need to find a smaller, more tractable problem instead.
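A very rough sketch of that sparse-storage idea, using an open-addressing hash table keyed by element index; the capacity, hash function and probing policy are illustrative assumptions, not a tuned design (and there is no resizing or full-table handling here):
#include <stdint.h>
#include <stdlib.h>

#define CAPACITY (1u << 24)            // power of two; ~16M slots in this sketch

struct sparse_entry { uint64_t index; float value; uint8_t used; };

static struct sparse_entry *table;

static int sparse_init(void)
{
    table = calloc(CAPACITY, sizeof *table);
    return table ? 0 : -1;
}

static struct sparse_entry *slot_for(uint64_t index)
{
    uint64_t h = index * 0x9E3779B97F4A7C15ull;       // Fibonacci-style hash
    size_t i = (size_t)(h & (CAPACITY - 1u));
    while (table[i].used && table[i].index != index)
        i = (i + 1u) & (CAPACITY - 1u);               // linear probing
    return &table[i];
}

static void sparse_set(uint64_t index, float value)
{
    struct sparse_entry *e = slot_for(index);
    e->index = index; e->value = value; e->used = 1;
}

static float sparse_get(uint64_t index)               // 0.0f if absent
{
    struct sparse_entry *e = slot_for(index);
    return e->used ? e->value : 0.0f;
}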
I suppose if you had a 64 bit machine with a lot of swap space, you could just declare an array of size 10^13 and it may work.
But for a data set of this size it becomes important to consider carefully the nature of the problem. Do you really need random access read and write operations for all 10^13 elements? Is the array at all sparse? Could you express this as a map/reduce problem? If so, sequential access to 10^13 elements is much more practical than random access.

Maximum values for array sizes in C

Just a quick question: what are people's practices when you have to define the (arbitrary) maximum size that some array can take in C? Some people just choose a round number hoping it will be big enough, others the prime number closest to a round number (!), and others some more esoteric number, and so on.
I'm wondering, then, what are some best practices for deciding such values?
Thanks.
There is no general rule. Powers of twos work for buffers, I use 1024 quite often for string buffers in C but any other number would work. Prime numbers are useful for hash tables where simple modulo-hashing works well with prime-number sizes. Of course you define the size as a symbolic constant so that you can change it later.
If I can't pin down a reasonable maximum I tend to use malloc and realloc to grow the array as needed. Using a fixed size array when you can't guarantee that it is large enough for the intended purpose is hazardous.
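A small sketch of that grow-on-demand approach (the names are illustrative); it starts small and reallocs, here doubling, whenever the buffer runs out of room:
#include <stdlib.h>
#include <string.h>

struct dynbuf {
    char  *data;
    size_t len;
    size_t cap;
};

static int dynbuf_append(struct dynbuf *b, const char *src, size_t n)
{
    if (b->len + n > b->cap) {
        size_t newcap = b->cap ? b->cap : 64;
        while (newcap < b->len + n)
            newcap *= 2;                       // power-of-two growth
        char *p = realloc(b->data, newcap);
        if (!p)
            return -1;                         // original buffer still valid
        b->data = p;
        b->cap  = newcap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}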
Best practice is to avoid arbitrary limits whenever possible.
It's not always possible, so second-best practice is to take an educated estimate of the largest thing that the array is ever likely to need to hold, and then round up by a healthy margin, at least 25%. I tend to prefer powers of ten when I do this, because it makes it obvious on inspection that the number is an arbitrary limit. (Powers of two also often signify that, but only if the reader recognizes the number as a power of two, and most readers-of-code don't have that table memorized much past 2^16. If there's a good reason to use a power of two and it needs to be bigger than that, write it in hex. End of digression.) Always document the reasoning behind your estimate of the largest thing the array needs to hold, even if it's as simple as "anyone with a single source file bigger than 2GB needs to rethink their coding style" (actual example).
Don't use a prime number unless you specifically need the properties of a prime number (e.g. as Juho mentions, for hash tables -- but you only need that there if your hash function isn't very good -- but often it is, unfortunately.) When you do, document that you are intentionally using prime numbers and why, because most people do not recognize prime numbers on sight or know why they might be necessary in a particular situation.
If I need to do this I usually go with either a power of two, or for larger data sets, the number of pages required to hold the data. Most of the time though I prefer to allocate a chunk of memory on the heap and then realloc if the buffer size is insufficient later.
I only define a maximum when I have a strong reason for a particular number to be the maximum. Otherwise, I size it dynamically, perhaps with a sanity-check maximum (e.g. a person's name should not be several megabytes long).
Round numbers (powers of 2) are used because they are often easy for things like malloc to use (many implementations keep up with memory in blocks of various power of two sizes), easier for linkers to use (in the case of static or global arrays), and also because you can use bitwise operations to test for limits of them, which are often faster than < and >.
Prime numbers are used because using prime number sized hash tables is supposed to avoid collision.
Many people likely use both prime number and power of two sizes for things in cases where they don't actually provide any benefit, though.
It really isn't possible to predict at the outset what the maximum size could be.
For example, I coded a small cmdline interpreter, where each line of output produced was stored in a char array of size 200. Sufficient for all possible outputs, don't you think?
That was until I issued the env command which had a line with ~ 400 characters(!).
LS_COLORS='no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:cd=40;33;01:or=01;
05;37;41:mi=01;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;
32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;
31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:*.gif=01;35:*.bmp=01;
35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35:';
Moral of the story: Try to use dynamic allocation as far as possible.

What is the ideal growth rate for a dynamically allocated array?

C++ has std::vector and Java has ArrayList, and many other languages have their own form of dynamically allocated array. When a dynamic array runs out of space, it gets reallocated into a larger area and the old values are copied into the new array. A question central to the performance of such an array is how fast the array grows in size. If you always only grow large enough to fit the current push, you'll end up reallocating every time. So it makes sense to double the array size, or multiply it by say 1.5x.
Is there an ideal growth factor? 2x? 1.5x? By ideal I mean mathematically justified, best balancing performance and wasted memory. I realize that theoretically, given that your application could have any potential distribution of pushes that this is somewhat application dependent. But I'm curious to know if there's a value that's "usually" best, or is considered best within some rigorous constraint.
I've heard there's a paper on this somewhere, but I've been unable to find it.
I remember reading many years ago why 1.5 is preferred over two, at least as applied to C++ (this probably doesn't apply to managed languages, where the runtime system can relocate objects at will).
The reasoning is this:
Say you start with a 16-byte allocation.
When you need more, you allocate 32 bytes, then free up 16 bytes. This leaves a 16-byte hole in memory.
When you need more, you allocate 64 bytes, freeing up the 32 bytes. This leaves a 48-byte hole (if the 16 and 32 were adjacent).
When you need more, you allocate 128 bytes, freeing up the 64 bytes. This leaves a 112-byte hole (assuming all previous allocations are adjacent).
And so on and so forth.
The idea is that, with a 2x expansion, there is no point in time that the resulting hole is ever going to be large enough to reuse for the next allocation. Using a 1.5x allocation, we have this instead:
Start with 16 bytes.
When you need more, allocate 24 bytes, then free up the 16, leaving a 16-byte hole.
When you need more, allocate 36 bytes, then free up the 24, leaving a 40-byte hole.
When you need more, allocate 54 bytes, then free up the 36, leaving a 76-byte hole.
When you need more, allocate 81 bytes, then free up the 54, leaving a 130-byte hole.
When you need more, use 122 bytes (rounding up) from the 130-byte hole.
In the limit as n → ∞, it would be the golden ratio: ϕ = 1.618...
For finite n, you want something close, like 1.5.
The reason is that you want to be able to reuse older memory blocks, to take advantage of caching and avoid constantly making the OS give you more memory pages. The equation you'd solve to ensure that a subsequent allocation can re-use all prior blocks reduces to x^(n-1) - 1 = x^(n+1) - x^n, whose solution approaches x = ϕ for large n. In practice n is finite and you'll want to be able to reuse the last few blocks every few allocations, and so 1.5 is great for ensuring that.
(See the link for a more detailed explanation.)
It will entirely depend on the use case. Do you care more about the time wasted copying data around (and reallocating arrays) or the extra memory? How long is the array going to last? If it's not going to be around for long, using a bigger buffer may well be a good idea - the penalty is short-lived. If it's going to hang around (e.g. in Java, going into older and older generations) that's obviously more of a penalty.
There's no such thing as an "ideal growth factor." It's not just theoretically application dependent, it's definitely application dependent.
2 is a pretty common growth factor - I'm pretty sure that's what ArrayList and List<T> in .NET use. ArrayList<T> in Java uses 1.5.
EDIT: As Erich points out, Dictionary<,> in .NET uses "double the size then increase to the next prime number" so that hash values can be distributed reasonably between buckets. (I'm sure I've recently seen documentation suggesting that primes aren't actually that great for distributing hash buckets, but that's an argument for another answer.)
One approach when answering questions like this is to just "cheat" and look at what popular libraries do, under the assumption that a widely used library is, at the very least, not doing something horrible.
So just checking very quickly, Ruby (1.9.1-p129) appears to use 1.5x when appending to an array, and Python (2.6.2) uses 1.125x plus a constant (in Objects/listobject.c):
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
*/
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);

/* check for integer overflow */
if (new_allocated > PY_SIZE_MAX - newsize) {
    PyErr_NoMemory();
    return -1;
} else {
    new_allocated += newsize;
}
newsize above is the number of elements in the array. Note well that newsize is added to new_allocated, so the expression with the bitshifts and ternary operator is really just calculating the over-allocation.
Let's say you grow the array size by x. So assume you start with size T. The next time you grow the array its size will be T*x. Then it will be T*x^2 and so on.
If your goal is to be able to reuse the memory that has been created before, then you want to make sure the new memory you allocate is less than the sum of previous memory you deallocated. Therefore, we have this inequality:
T*x^n <= T + T*x + T*x^2 + ... + T*x^(n-2)
We can remove T from both sides. So we get this:
x^n <= 1 + x + x^2 + ... + x^(n-2)
Informally, what we are saying is that at the nth allocation, we want all our previously deallocated memory to be greater than or equal to the memory needed at the nth allocation, so that we can reuse the previously deallocated memory.
For instance, if we want to be able to do this at the 3rd step (i.e., n=3), then we have
x^3 <= 1 + x
This equation is true for all x such that 0 < x <= 1.3 (roughly)
See what x we get for different n's below:
n maximum-x (roughly)
3 1.3
4 1.4
5 1.53
6 1.57
7 1.59
22 1.61
Note that the growing factor has to be less than 2 since x^n > x^(n-2) + ... + x^2 + x + 1 for all x>=2.
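For anyone who wants to check the table above numerically, here is a small sketch that bisects for the largest x satisfying x^n <= 1 + x + ... + x^(n-2) (compile with -lm):
#include <stdio.h>
#include <math.h>

static double excess(double x, int n)          // > 0 once x is too large
{
    double rhs = 0.0;
    for (int k = 0; k <= n - 2; ++k)
        rhs += pow(x, k);
    return pow(x, n) - rhs;
}

int main(void)
{
    int ns[] = { 3, 4, 5, 6, 7, 22 };
    for (size_t i = 0; i < sizeof ns / sizeof ns[0]; ++i) {
        double lo = 1.0, hi = 2.0;             // the root lies between 1 and 2
        for (int it = 0; it < 60; ++it) {
            double mid = (lo + hi) / 2.0;
            if (excess(mid, ns[i]) <= 0.0) lo = mid; else hi = mid;
        }
        printf("n = %2d  max x ~= %.3f\n", ns[i], lo);
    }
    return 0;
}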
Another two cents
Most computers have virtual memory! In physical memory you can have random pages everywhere, which are presented to your program as a single contiguous space in its virtual memory. The resolving of the indirection is done by the hardware. Virtual memory exhaustion was a problem on 32-bit systems, but it is really not a problem anymore. So filling the hole is not a concern anymore (except in special environments). Since Windows 7, even Microsoft supports 64 bit without extra effort. # 2011
Amortized O(1) appending is reached with any factor r > 1. The same mathematical proof works not only for 2 as the parameter.
r = 1.5 can be calculated with old*3/2 so there is no need for floating point operations. (I say /2 because compilers will replace it with bit shifting in the generated assembly code if they see fit.)
MSVC went for r = 1.5, so there is at least one major compiler that does not use 2 as ratio.
As mentioned by someone 2 feels better than 8. And also 2 feels better than 1.1.
My feeling is that 1.5 is a good default. Other than that it depends on the specific case.
The top-voted and the accepted answer are both good, but neither answer the part of the question asking for a "mathematically justified" "ideal growth rate", "best balancing performance and wasted memory". (The second-top-voted answer does try to answer this part of the question, but its reasoning is confused.)
The question perfectly identifies the 2 considerations that have to be balanced, performance and wasted memory. If you choose a growth rate too low, performance suffers because you'll run out of extra space too quickly and have to reallocate too frequently. If you choose a growth rate too high, like 2x, you'll waste memory because you'll never be able to reuse old memory blocks.
In particular, if you do the math1 you'll find that the upper limit on the growth rate is the golden ratio ϕ = 1.618… . Growth rates larger than ϕ (like 2x) mean that you'll never be able to reuse old memory blocks. Growth rates only slightly less than ϕ mean you won't be able to reuse old memory blocks until after many, many reallocations, during which time you'll be wasting memory. So you want to be as far below ϕ as you can get without sacrificing too much performance.
Therefore I'd suggest these candidates for "mathematically justified" "ideal growth rate", "best balancing performance and wasted memory":
≈1.466x (the solution to x^4 = 1 + x + x^2) allows memory reuse after just 3 reallocations, one sooner than 1.5x allows, while reallocating only slightly more frequently
≈1.534x (the solution to x^5 = 1 + x + x^2 + x^3) allows memory reuse after 4 reallocations, same as 1.5x, while reallocating slightly less frequently for improved performance
≈1.570x (the solution to x^6 = 1 + x + x^2 + x^3 + x^4) only allows memory reuse after 5 reallocations, but will reallocate even less frequently for even further improved performance (barely)
Clearly there's some diminishing returns there, so I think the global optimum is probably among those. Also, note that 1.5x is a great approximation to whatever the global optimum actually is, and has the advantage of being extremely simple.
1 Credits to #user541686 for this excellent source.
It really depends. Some people analyze common usage cases to find the optimal number.
I've seen 1.5x, 2.0x, phi x, and powers of 2 used before.
If you have a distribution over array lengths, and you have a utility function that says how much you like wasting space vs. wasting time, then you can definitely choose an optimal resizing (and initial sizing) strategy.
The reason the simple constant multiple is used, is obviously so that each append has amortized constant time. But that doesn't mean you can't use a different (larger) ratio for small sizes.
In Scala, you can override loadFactor for the standard library hash tables with a function that looks at the current size. Oddly, the resizable arrays just double, which is what most people do in practice.
I don't know of any doubling (or 1.5*ing) arrays that actually catch out of memory errors and grow less in that case. It seems that if you had a huge single array, you'd want to do that.
I'd further add that if you're keeping the resizable arrays around long enough, and you favor space over time, it might make sense to dramatically overallocate (for most cases) initially and then reallocate to exactly the right size when you're done.
I recently was fascinated by the experimental data I've got on the wasted memory aspect of things. The chart below is showing the "overhead factor" calculated as the amount of overhead space divided by the useful space, the x-axis shows a growth factor. I'm yet to find a good explanation/model of what it reveals.
Simulation snippet: https://gist.github.com/gubenkoved/7cd3f0cb36da56c219ff049e4518a4bd.
Neither the shape nor the absolute values that the simulation reveals are something I had expected.
Higher-resolution chart showing dependency on the max useful data size is here: https://i.stack.imgur.com/Ld2yJ.png.
UPDATE. After pondering this more, I've finally come up with the correct model to explain the simulation data, and hopefully, it matches experimental data nicely. The formula is quite easy to infer simply by looking at the size of the array that we would need to have for a given amount of elements we need to contain.
Referenced earlier GitHub gist was updated to include calculations using scipy.integrate for numerical integration that allows creating the plot below which verifies the experimental data pretty nicely.
UPDATE 2. One should, however, keep in mind that what we model/emulate there mostly has to do with virtual memory: the over-allocation overhead can be left entirely in virtual-memory territory, since the physical memory footprint is only incurred when we first access a page of virtual memory. So it's possible to malloc a big chunk of memory, but until we first access the pages, all we do is reserve virtual address space. I've updated the GitHub gist with a C++ program that has a very basic dynamic array implementation allowing the growth factor to be changed, plus the Python snippet that runs it multiple times to gather the "real" data. Please see the final graph below.
The conclusion there could be that for x64 environments where virtual address space is not a limiting factor there could be really little to no difference in terms of the Physical Memory footprint between different growth factors. Additionally, as far as Virtual Memory is concerned the model above seems to make pretty good predictions!
Simulation snippet was built with g++.exe simulator.cpp -o simulator.exe on Windows 10 (build 19043), g++ version is below.
g++.exe (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0
PS. Note that the end result is implementation-specific. Depending on implementation details, a dynamic array might or might not access memory outside the "useful" boundaries. Some implementations use memset to zero-initialize POD elements for the whole capacity -- this causes the virtual memory pages to be mapped to physical memory. However, the std::vector implementation on the compiler referenced above does not seem to do that, and so behaves like the mock dynamic array in the snippet -- meaning the overhead is incurred on the virtual memory side and is negligible on the physical memory side.
I agree with Jon Skeet; even my theorycrafter friend insists that this can be proven to be O(1) when setting the factor to 2x.
The ratio between CPU time and memory is different on each machine, and so the factor will vary just as much. If you have a machine with gigabytes of RAM and a slow CPU, copying the elements to a new array is a lot more expensive than on a fast machine, which might in turn have less memory. It's a question that can be answered in theory, for a uniform computer, which in real scenarios doesn't help you at all.
I know it is an old question, but there are several things that everyone seems to be missing.
First, this is multiplication by 2: size << 1. This is multiplication by anything between 1 and 2: int(float(size) * x), where x is the number, the * is floating point math, and the processor has to run additional instructions for casting between float and int. In other words, at the machine level, doubling takes a single, very fast instruction to find the new size. Multiplying by something between 1 and 2 requires at least one instruction to cast size to a float, one instruction to multiply (which is float multiplication, so it probably takes at least twice as many cycles, if not 4 or even 8 times as many), and one instruction to cast back to int, and that assumes that your platform can perform float math on the general purpose registers, instead of requiring the use of special registers. In short, you should expect the math for each allocation to take at least 10 times as long as a simple left shift. If you are copying a lot of data during the reallocation though, this might not make much of a difference.
Second, and probably the big kicker: Everyone seems to assume that the memory that is being freed is both contiguous with itself, as well as contiguous with the newly allocated memory. Unless you are pre-allocating all of the memory yourself and then using it as a pool, this is almost certainly not the case. The OS might occasionally end up doing this, but most of the time, there is going to be enough free space fragmentation that any half decent memory management system will be able to find a small hole where your memory will just fit. Once you get to really big chunks, you are more likely to end up with contiguous pieces, but by then, your allocations are big enough that you are not doing them frequently enough for it to matter anymore. In short, it is fun to imagine that using some ideal number will allow the most efficient use of free memory space, but in reality, it is not going to happen unless your program is running on bare metal (as in, there is no OS underneath it making all of the decisions).
My answer to the question? Nope, there is no ideal number. It is so application specific that no one really even tries. If your goal is ideal memory usage, you are pretty much out of luck. For performance, less frequent allocations are better, but if we went just with that, we could multiply by 4 or even 8! Of course, when Firefox jumps from using 1GB to 8GB in one shot, people are going to complain, so that does not even make sense. Here are some rules of thumb I would go by though:
If you cannot optimize memory usage, at least don't waste processor cycles. Multiplying by 2 is at least an order of magnitude faster than doing floating point math. It might not make a huge difference, but it will make some difference at least (especially early on, during the more frequent and smaller allocations).
Don't overthink it. If you just spent 4 hours trying to figure out how to do something that has already been done, you just wasted your time. Totally honestly, if there was a better option than *2, it would have been done in the C++ vector class (and many other places) decades ago.
Lastly, if you really want to optimize, don't sweat the small stuff. Nowadays, no one cares about 4KB of memory being wasted, unless they are working on embedded systems. When you get to 1GB of objects that are between 1MB and 10MB each, doubling is probably way too much (I mean, that is between 100 and 1,000 objects). If you can estimate the expected expansion rate, you can level it out to a linear growth rate at a certain point. If you expect around 10 objects per minute, then growing at 5 to 10 object sizes per step (once every 30 seconds to a minute) is probably fine.
What it all comes down to is, don't over think it, optimize what you can, and customize to your application (and platform) if you must.
