Is there an optimal batch size for arc4random_buf? - c

I need billions of random bytes from arc4random_buf, and my strategy is to request X random bytes at a time, and repeat this many times.
My question is how large X should be. Since the nbytes argument to arc4random_buf can be arbitrarily large, I suppose there must be some kind of internal loop that generates some entropy each time its body is executed. Say, if X is a multiple of the number of random bytes generated per iteration, performance could be improved because I'm not wasting any entropy.
I’m on macOS, which is unfortunately closed-source, so I cannot simply read the source code. Is there any portable way to determine the optimal X?

Doing some benchmarks on typical target systems is probably the best way to figure this out, but looking at a couple of implementations, it seems unlikely that the buffer size will make much difference to the cost of arc4random_buf.
The original implementation implements arc4random_buf as a simple loop around a function that generates one byte at a time. As long as the buffer is big enough to avoid excessive call overhead, it should make little difference.
The FreeBSD library implementation appears to attempt to optimise by periodically computing about 1K of random bytes. Then arc4random_buf uses memcpy to copy the bytes from the internal buffer to the user buffer.
For the FreeBSD implementation, the optimal buffer size would be the amount of data available in the internal buffer, because that minimizes the number of calls to memcpy. However, there's no way to know how much that is, and it will not be the same on every call because of the rekeying algorithm.
My guess is that you will find very little difference between buffer sizes above, say, 16K, and probably little difference even below that. For the FreeBSD implementation, it will be very slightly more efficient if your buffer size is a multiple of 8.
Addendum: All the implementations I know of have a global rekey threshold, so you cannot influence the cost of rekeying by changing the buffer size passed to arc4random_buf. The library simply rekeys after a fixed number of bytes generated.
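For illustration, here is a minimal sketch of the chunked approach being discussed, assuming a fixed 64 KiB chunk; that constant is an arbitrary placeholder and exactly the parameter you would benchmark:

#include <stdlib.h>   /* arc4random_buf() on macOS and the BSDs */

/* Fill `total` bytes of `dst` by calling arc4random_buf in fixed-size chunks. */
static void fill_random(unsigned char *dst, size_t total)
{
    const size_t chunk = 64 * 1024;      /* assumed chunk size X; tune by measuring */
    size_t done = 0;

    while (done < total) {
        size_t n = total - done;
        if (n > chunk)
            n = chunk;
        arc4random_buf(dst + done, n);   /* nbytes may be any size */
        done += n;
    }
}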

Related

Is there a limit to received packet count / bytes in Linux?

I want to write a program that sends the network's current received packet count / byte count.
I get these data from /proc/net/dev.
But I can't decide what type to use to store these values.
I just have a feeling that using unsigned long long int is wasteful.
Is there a limit for received packet count / bytes, like RLIMIT_*?
uint64_t, uint_fast64_t, or unsigned long long are the correct types to use here. The first two are available from <stdint.h> or <inttypes.h>, and are what I'd recommend. unsigned long long is perfectly acceptable too. [*]
You are suffering from a misguided instinct towards premature optimization.
Even if you had a thousand of these counters (and you usually do not), they would take a paltry amount of RAM, some 8192 bytes. This is a tiny fraction of the RAM use of a typical userspace process, because even the standard C library (especially functions like printf(), and anything that does file I/O using <stdio.h>) uses a couple of orders of magnitude more.
So, when you worry about how much memory you're "wasting" by using an unsigned integer type that might be larger than strictly necessary for most cases, you're probably wasting an order of magnitude more by not choosing a better approach or a better algorithm in the first place.
It is make-work worry. There are bigger things you are not thinking about yet (because you lack the experience or knowledge or both) that affect the results you might be thinking of (efficiency, memory footprint, run time to complete the task at hand), often by an order of magnitude more than those small details. You need to learn to think of the big picture instead: Is this needed? Is this useful, or is there a better way to look at it?
[*] You can verify this by looking at how the data is generated by net/core/net-procfs.c:dev_seq_printf_stats(), as well as at the data structure, include/uapi/linux/if_link.h:struct rtnl_link_stats64.
__u64 is what the Linux kernel calls the type, and %llu is how the kernel's seq_printf() implementation formats 64-bit unsigned integers.
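As an illustration of using 64-bit types for these counters, here is a small sketch that parses the receive byte and packet counts out of /proc/net/dev (the first two numeric fields after the interface name); error handling is kept minimal:

#include <inttypes.h>   /* uint64_t, SCNu64, PRIu64 */
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("/proc/net/dev", "r");
    char line[512];

    if (fp == NULL)
        return 1;

    fgets(line, sizeof line, fp);   /* skip the two header lines */
    fgets(line, sizeof line, fp);

    while (fgets(line, sizeof line, fp) != NULL) {
        char name[32];
        uint64_t rx_bytes, rx_packets;

        /* the first two fields after "iface:" are received bytes and packets */
        if (sscanf(line, " %31[^:]: %" SCNu64 " %" SCNu64,
                   name, &rx_bytes, &rx_packets) == 3)
            printf("%s: %" PRIu64 " bytes, %" PRIu64 " packets received\n",
                   name, rx_bytes, rx_packets);
    }

    fclose(fp);
    return 0;
}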
What type to use depends on pragmatic conditions like:
How long do you want to record the number of packets received
What is the worst-case average data rate over the above time
How many times will you be storing (or transferring over a network) this number
Once you've figured this out, calculate the rough maximum value of this number. Then, depending on how many times you want to store(/transfer) this number you can determine the storage(/transfer) volume per possible type.
Finally, select the type that has a comfortable value margin and doesn't take up too much space.
I wouldn't expect long long to be wasteful. However, when you're counting packets, a 4-byte integer seems to be more than sufficient, unless you're in an extreme environment with crazy data volumes.
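For a rough sense of scale (the link speed here is purely an assumption for illustration): at a sustained 10 Gbit/s, about 1.25 GB/s, a 32-bit byte counter (max ~4.29·10^9) wraps in roughly 3.5 seconds, while a 64-bit byte counter (max ~1.8·10^19) would take on the order of 450 years; that is consistent with the kernel exposing these counters as __u64.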

How to choose the best buffer size when you need to read large data

Let's assume a scenario where I have a lot of log files for a given system, let's imagine that it's petabytes of data. This is my scenario.
Used Technology
For my purpose, I'm going to use C/C++ to do this.
My Problem
I need to read these files, which are on disk, and do some processing afterwards, whether sending them to a topic on some pub/sub system or simply displaying the logs on screen.
Questions
What is the best buffer size to get the best performance when reading this data while also conserving hardware resources such as disk and RAM?
I just don't know whether I should choose 64 kilobytes, 128 kilobytes, 5 megabytes, or 10 megabytes; how do I calculate this?
And if this calculation depends on how many resources I have available, how do I calculate it from those resources?
The optimal buffer size depends on many factors, most notably the hardware. You can find out which size is optimal by picking one size, measuring how long the operation takes, then picking another size, measuring, and comparing. Repeat until you find the optimal size.
Caveats:
You need to measure with the hardware matching the target system to have meaningful measurements.
You also need to measure with inputs comparable to the target task. You may reduce the size of the input by using a subset of real data to make measuring faster, but at some point this may affect the quality of the measurement.
It's possible to encounter a local optimum: a buffer size that is faster than either slightly larger or slightly smaller buffers, but not as fast as some other, much larger or smaller, buffer size. General global optimisation techniques, such as simulated annealing, may be used to avoid getting stuck in the search for the optimal value.
Although benchmarking is a simple concept, it's actually quite difficult to do correctly. It's possible and likely that your measurements are biased by incidental factors that may cause differences in performance of the target system. Environment randomisation may help reduce this.
Typical sizes that may be a good starting point to measure are the size of the caches on the system:
Cache line size
L1 cache size
L2 cache size
L3 cache size
Memory page size
SSD DRAM cache size
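As a hedged sketch of the measure-and-compare approach described above, assuming POSIX open()/read() and a placeholder file name; in a real benchmark you would repeat the runs and account for the page cache:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const char *path = "big.log";   /* placeholder: use a file representative of your logs */
    const size_t sizes[] = { 4096, 64 * 1024, 1024 * 1024, 8 * 1024 * 1024 };

    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        int fd = open(path, O_RDONLY);
        char *buf = malloc(sizes[i]);
        struct timespec start, end;
        ssize_t n;

        if (fd < 0 || buf == NULL)
            return 1;

        clock_gettime(CLOCK_MONOTONIC, &start);
        while ((n = read(fd, buf, sizes[i])) > 0)
            ;                                    /* the actual processing would go here */
        clock_gettime(CLOCK_MONOTONIC, &end);

        printf("%9zu-byte reads: %.3f s\n", sizes[i],
               (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9);

        free(buf);
        close(fd);
    }
    return 0;
}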
I saw this answer regarding the same question in C#: basically, buffer size doesn't really matter performance-wise (as long as it's a reasonable value). Regarding RAM and disk usage, you will have the same quantity of data to read/write whatever your buffer size might be. Again, as long as you stay within reasonable values you shouldn't have a problem.
Actually, you don't have to load all your data into memory to do anything with it. You only have to read the parts that are relevant.
I need to read these files, which are on disk, and do some processing afterwards
Just load them later and pass them to the subsystem as you go. If you want to display them, simply read, process and display.
What is the best buffer size to get the best performance when reading this data while also conserving hardware resources such as disk and RAM?
Why do you want to save disk resources? Isn't that where your files are? You have to load data from there into RAM in small quantities, like one particular log file, then do whatever you want and finally flush it all. Repeat.
I just don't know whether I should choose 64 kilobytes, 128 kilobytes, 5 megabytes, or 10 megabytes; how do I calculate this?
Again, load files one by one, rather than loading their data in specific amounts.
And if this calculation depends on how many resources I have available, how do I calculate it from those resources?
No calculation needed. Just handle RAM resources smartly by focusing on one, or maybe two, files at a time. Don't worry about disk resources.

zlib and buffer sizes

I am currently trying to use zlib for compression in one of my projects. I had a look at the basic zlib tutorial and I am confused by the following statements:
CHUNK is simply the buffer size for feeding data to and pulling data
from the zlib routines. Larger buffer sizes would be more efficient,
especially for inflate(). If the memory is available, buffer sizes on
the order of 128K or 256K bytes should be used.
#define CHUNK 16384
In my case I will always have a small buffer already available at the output end (around 80 bytes) and will continually feed very small data (a few bytes) from the input side through zlib. This means I will not have a larger buffer on either side, but I am planning on using much smaller ones.
However, I am not sure how to interpret "larger buffer sizes would be more efficient". Is this referring to the efficiency of the encoding, or to time/space efficiency?
One idea I have to remedy this situation would be to add some more layers of buffering that accumulate from the input and flush to the output repeatedly. However, this would mean I would have to accumulate data and add more levels of copying, which would also hurt performance.
Now if efficiency is just referring to time/space efficiency, I could just measure the impact of both methods and decide on one to use. However, if the actual encoding could be impacted by the smaller buffer size, this might be really hard to detect.
Does anyone have an experience on using zlib with very small buffers?
It means time efficiency. If you give inflate large input and output buffers, it will use faster inflation code internally. It will work just fine with buffers as small as you like (even size 1), but it will be slower.
It is probably worthwhile for you to accumulate input and feed it to inflate in larger chunks. You would also need to provide larger output buffers.
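A rough sketch of that accumulate-then-inflate idea; get_fragment() and consume() are hypothetical callbacks standing in for the few-bytes-at-a-time input source and the output sink, and the 16K/64K sizes are assumptions to be tuned:

#include <string.h>
#include <zlib.h>

#define STAGE_SIZE 16384    /* accumulate this much compressed input per inflate call */
#define OUT_SIZE   65536    /* larger output buffer, as suggested above */

/* get_fragment() returns the next few compressed bytes (0 at end of stream);
 * consume() receives decompressed output. Both are hypothetical callbacks. */
int inflate_accumulated(size_t (*get_fragment)(unsigned char *, size_t),
                        void (*consume)(const unsigned char *, size_t))
{
    unsigned char stage[STAGE_SIZE], out[OUT_SIZE];
    z_stream strm;
    int ret = Z_OK;

    memset(&strm, 0, sizeof strm);
    if (inflateInit(&strm) != Z_OK)
        return Z_MEM_ERROR;

    for (;;) {
        size_t filled = 0;
        /* gather many small fragments before calling inflate */
        while (filled < STAGE_SIZE) {
            size_t got = get_fragment(stage + filled, STAGE_SIZE - filled);
            if (got == 0)
                break;
            filled += got;
        }
        if (filled == 0)
            break;                       /* no more input */

        strm.next_in = stage;
        strm.avail_in = (uInt)filled;
        do {                             /* drain this chunk of input */
            strm.next_out = out;
            strm.avail_out = OUT_SIZE;
            ret = inflate(&strm, Z_NO_FLUSH);
            if (ret != Z_OK && ret != Z_STREAM_END) {
                inflateEnd(&strm);
                return ret;
            }
            consume(out, OUT_SIZE - strm.avail_out);
        } while (strm.avail_out == 0);

        if (ret == Z_STREAM_END)
            break;
    }
    inflateEnd(&strm);
    return ret == Z_STREAM_END ? Z_OK : ret;
}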

Buffer growth strategy

I have a generic growing buffer intended to accumulate "random" string pieces and then fetch the result. Code to handle that buffer is written in plain C.
Pseudocode API:
void write(buffer_t * buf, const unsigned char * bytes, size_t len); /* appends */
const unsigned char * buffer(buffer_t * buf); /* returns accumulated data */
I'm thinking about the growth strategy I should pick for that buffer.
I do not know whether my users would prefer memory or speed, or what the nature of the users' data would be.
I've seen two strategies in the wild: grow buffer by fixed size increments (that is what I've currently implemented) or grow data exponentially. (There is also a strategy to allocate the exact amount of memory needed — but this is not that interesting in my case.)
Perhaps I should let the user pick the strategy... But that would make the code a bit more complex...
Once upon a time, Herb Sutter wrote (referencing Andrew Koenig) that the best strategy is, probably, exponential growth with factor 1.5 (search for "Growth Strategy"). Is this still the best choice?
Any advice? What does your experience say?
Unless you have a good reason to do otherwise, exponential growth is probably the best choice. Using 1.5 for the factor isn't really magical, and in fact that's not what Andrew Koenig originally said. What he originally said was that the growth factor should be less than (1+sqrt(5))/2 (~1.6).
Pete Becker says that when he was at Dinkumware, P.J. Plauger, the owner of Dinkumware, said they had done some testing and found that 1.5 worked well. When you allocate a block of memory, the allocator will usually allocate a block that's at least slightly larger than you requested, to leave room for a little book-keeping information. My guess (though unconfirmed by any testing) is that reducing the factor a little lets the real block size still fit within the limit.
References:
I believe Andrew originally published this in a magazine (the Journal of Object Oriented Programming, IIRC) which hasn't been published in years now, so getting a re-print would probably be quite difficult.
Andrew Koenig's Usenet post, and P.J. Plauger's Usenet post.
The exponential growth strategy is used throughout STL and it seems to work fine. I'd say stick with that at least until you find a definite case where it won't work.
I usually use a combination of adding a small fixed amount and multiplying by 1.5, because it is efficient to implement and leads to reasonable step widths, which are bigger at first and more memory-conscious as the buffer grows. As the fixed offset I usually use the initial size of the buffer, and I start with rather small initial sizes:
new_size = old_size + ( old_size >> 1 ) + initial_size;
As initial_size I use 4 for collection types, 8, 12 or 16 for string types and 128 to 4096 for in-/output buffers depending on the context.
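A hedged sketch of how that growth rule might sit inside the append path of the buffer_t from the question (the struct fields and the function name here are invented for illustration, and assume initial_size > 0):

#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned char *data;
    size_t len;           /* bytes currently stored */
    size_t cap;           /* bytes allocated */
    size_t initial_size;  /* also reused as the fixed increment */
} buffer_t;

void buffer_append(buffer_t *buf, const unsigned char *bytes, size_t len)
{
    if (buf->len + len > buf->cap) {
        size_t new_cap = buf->cap;
        while (buf->len + len > new_cap)
            new_cap = new_cap + (new_cap >> 1) + buf->initial_size;  /* ~1.5x plus fixed offset */
        buf->data = realloc(buf->data, new_cap);  /* error handling omitted for brevity */
        buf->cap = new_cap;
    }
    memcpy(buf->data + buf->len, bytes, len);
    buf->len += len;
}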
Here is a little chart that shows that this grows much faster (yellow+red) in the early steps compared to multiplying by 1.5 only (red).
So, if you started with 100 you would need, for example, 6 increases to accommodate 3000 elements, while multiplying by 1.5 alone would need 9.
At larger sizes the influence of the addition becomes negligible, which makes both approaches scale equally well by a factor of 1.5 then. These are the effective growth factors if you use the initial size as fixed amount for the addition:
2.5, 1.9, 1.7, 1.62, 1.57, 1.54, 1.53, 1.52, 1.51, 1.5, ...
The key point is that the exponential growth strategy lets you avoid expensive copies of the buffer content when you hit the current size, at the cost of some wasted memory. The article you link to has the numbers for the trade-off.
The answer, as always, is "it depends".
The idea behind exponential growth (i.e. allocating a new buffer that is x times the current size) is that, as you require more buffer, you'll keep needing more, and the chances are that you'll need much more buffer than a small fixed increment provides.
So, if you have an 8-byte buffer and need more, allocating an extra 8 bytes is OK, but allocating an additional 16 bytes is probably a better idea; someone with a 16-byte buffer is not likely to need just 1 extra byte. And if they do, all that happens is that you waste a little memory.
I thought the best growth factor was 2, i.e. double your buffer, but if Koenig/Sutter say 1.5 is optimal, then I agree with them. You may want to tweak your growth rate after gathering some usage statistics, though.
So exponential growth is a good trade-off between performance and keeping memory usage low.
Double the size until a threshold (~100 MB?) and then lower the exponential growth factor toward 1.5, ..., 1.3.
Another option would be to make the default buffer size configurable at runtime.
The point of using exponential growth (whether the factor is 1.5 or 2) is to avoid copies. Each time you realloc the array, you can trigger an implicit copy of the data, which, of course, gets more expensive the larger the array gets. By using exponential growth, you get an amortized constant number of recopies per element -- i.e. you rarely end up copying.
As long as you're running on a desktop computer of some kind, you can expect an essentially unlimited amount of memory, so time is probably the right side of that tradeoff. For hard real-time systems, you would probably want to find a way to avoid the copies altogether -- a linked list comes to mind.
There's no way anyone can give good advice without knowing something about the allocations, runtime environment, execution characteristics, etc., etc.
Code which works is way more important than highly optimized code... which is under development. Choose some algorithm, any workable algorithm, and try it! If it proves suboptimal, then change the strategy. Placing this in the control of the library user often does them no favors. But if you already have some option scheme in place, then adding it could be useful, unless you hit on a good algorithm (and growing by 1.5x is a pretty good one).
Also, the use of a function named write in C (not C++) conflicts with write() as declared in <unistd.h> on POSIX systems (or <io.h> on Windows). It's fine if nothing uses them, but it would also be hard to add them later. Best to use a more descriptive name.
As a wild idea, for this specific case, you could change the API to require the caller to allocate the memory for each chunk, and then remembering the chunks instead of copying the data.
Then, when it's time to actually produce the result, you know exactly how much memory is going to be needed and can allocate exactly that.
This has the benefit that the caller will need to allocate memory for the chunks anyway, and so you might as well make use of that. This also avoids copying data more than once.
It has the disadvantage that the caller will have to dynamically allocate each chunk. To get around that, you could have the buffer itself allocate memory for each chunk and remember those, rather than keeping one large buffer that gets resized when it fills up. This way you copy the data twice (once into the chunk, and once into the resulting string), but no more; with a single resized buffer, if you have to resize several times, you may end up with more than two copies.
Further, really large areas of free memory may be difficult for the memory allocator to find. Allocating smaller chunks may well be easier. There might not be space for a one-gigabyte chunk of memory, but there might be space for a thousand one-megabyte chunks.
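A rough sketch of that design (all names here are invented for illustration): the buffer only remembers the caller-provided chunks, and a single exact-size allocation and copy happen when the result is requested:

#include <stdlib.h>
#include <string.h>

typedef struct {
    const unsigned char *ptr;   /* caller-owned chunk, not copied here */
    size_t len;
} chunk_t;

typedef struct {
    chunk_t *chunks;
    size_t count, cap;
    size_t total_len;
} chunk_list_t;

/* remember a chunk; no data is copied at this point (error handling omitted) */
void chunk_list_append(chunk_list_t *cl, const unsigned char *ptr, size_t len)
{
    if (cl->count == cl->cap) {
        cl->cap = cl->cap ? cl->cap * 2 : 8;
        cl->chunks = realloc(cl->chunks, cl->cap * sizeof *cl->chunks);
    }
    cl->chunks[cl->count].ptr = ptr;
    cl->chunks[cl->count].len = len;
    cl->count++;
    cl->total_len += len;
}

/* allocate exactly the needed size and copy each chunk once */
unsigned char *chunk_list_build(const chunk_list_t *cl)
{
    unsigned char *out = malloc(cl->total_len);
    size_t off = 0;

    for (size_t i = 0; i < cl->count; i++) {
        memcpy(out + off, cl->chunks[i].ptr, cl->chunks[i].len);
        off += cl->chunks[i].len;
    }
    return out;
}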

What is the ideal growth rate for a dynamically allocated array?

C++ has std::vector and Java has ArrayList, and many other languages have their own form of dynamically allocated array. When a dynamic array runs out of space, it gets reallocated into a larger area and the old values are copied into the new array. A question central to the performance of such an array is how fast the array grows in size. If you always only grow large enough to fit the current push, you'll end up reallocating every time. So it makes sense to double the array size, or multiply it by say 1.5x.
Is there an ideal growth factor? 2x? 1.5x? By ideal I mean mathematically justified, best balancing performance and wasted memory. I realize that theoretically, given that your application could have any potential distribution of pushes that this is somewhat application dependent. But I'm curious to know if there's a value that's "usually" best, or is considered best within some rigorous constraint.
I've heard there's a paper on this somewhere, but I've been unable to find it.
I remember reading many years ago why 1.5 is preferred over two, at least as applied to C++ (this probably doesn't apply to managed languages, where the runtime system can relocate objects at will).
The reasoning is this:
Say you start with a 16-byte allocation.
When you need more, you allocate 32 bytes, then free up 16 bytes. This leaves a 16-byte hole in memory.
When you need more, you allocate 64 bytes, freeing up the 32 bytes. This leaves a 48-byte hole (if the 16 and 32 were adjacent).
When you need more, you allocate 128 bytes, freeing up the 64 bytes. This leaves a 112-byte hole (assuming all previous allocations are adjacent).
And so on, and so forth.
The idea is that, with a 2x expansion, there is no point in time that the resulting hole is ever going to be large enough to reuse for the next allocation. Using a 1.5x allocation, we have this instead:
Start with 16 bytes.
When you need more, allocate 24 bytes, then free up the 16, leaving a 16-byte hole.
When you need more, allocate 36 bytes, then free up the 24, leaving a 40-byte hole.
When you need more, allocate 54 bytes, then free up the 36, leaving a 76-byte hole.
When you need more, allocate 81 bytes, then free up the 54, leaving a 130-byte hole.
When you need more, use 122 bytes (rounding up) from the 130-byte hole.
In the limit as n → ∞, it would be the golden ratio: ϕ = 1.618...
For finite n, you want something close, like 1.5.
The reason is that you want to be able to reuse older memory blocks, to take advantage of caching and avoid constantly making the OS give you more memory pages. The equation you'd solve to ensure that a subsequent allocation can re-use all prior blocks reduces to x^(n-1) - 1 = x^(n+1) - x^n, whose solution approaches x = ϕ for large n. In practice n is finite and you'll want to be able to reuse the last few blocks every few allocations, so 1.5 is great for ensuring that.
(See the link for a more detailed explanation.)
It will entirely depend on the use case. Do you care more about the time wasted copying data around (and reallocating arrays) or the extra memory? How long is the array going to last? If it's not going to be around for long, using a bigger buffer may well be a good idea - the penalty is short-lived. If it's going to hang around (e.g. in Java, going into older and older generations) that's obviously more of a penalty.
There's no such thing as an "ideal growth factor." It's not just theoretically application dependent, it's definitely application dependent.
2 is a pretty common growth factor - I'm pretty sure that's what ArrayList and List<T> in .NET use. ArrayList<T> in Java uses 1.5.
EDIT: As Erich points out, Dictionary<,> in .NET uses "double the size then increase to the next prime number" so that hash values can be distributed reasonably between buckets. (I'm sure I've recently seen documentation suggesting that primes aren't actually that great for distributing hash buckets, but that's an argument for another answer.)
One approach when answering questions like this is to just "cheat" and look at what popular libraries do, under the assumption that a widely used library is, at the very least, not doing something horrible.
So just checking very quickly, Ruby (1.9.1-p129) appears to use 1.5x when appending to an array, and Python (2.6.2) uses 1.125x plus a constant (in Objects/listobject.c):
/* This over-allocates proportional to the list size, making room
 * for additional growth. The over-allocation is mild, but is
 * enough to give linear-time amortized behavior over a long
 * sequence of appends() in the presence of a poorly-performing
 * system realloc().
 * The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
 */
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);

/* check for integer overflow */
if (new_allocated > PY_SIZE_MAX - newsize) {
    PyErr_NoMemory();
    return -1;
} else {
    new_allocated += newsize;
}
newsize above is the number of elements in the array. Note well that newsize is added to new_allocated, so the expression with the bitshifts and ternary operator is really just calculating the over-allocation.
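As a quick sanity check of that reading, this little snippet replays the same formula for one-at-a-time appends and reproduces the growth pattern quoted in the comment:

#include <stdio.h>

int main(void)
{
    size_t allocated = 0;
    for (size_t newsize = 1; newsize <= 88; newsize++) {
        if (newsize > allocated) {   /* list is full: grow the same way CPython does */
            allocated = newsize + ((newsize >> 3) + (newsize < 9 ? 3 : 6));
            printf("%zu ", allocated);   /* prints: 4 8 16 25 35 46 58 72 88 */
        }
    }
    printf("\n");
    return 0;
}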
Let's say you grow the array size by a factor of x. So assume you start with size T. The next time you grow the array, its size will be T*x. Then it will be T*x^2, and so on.
If your goal is to be able to reuse the memory that has been created before, then you want to make sure the new memory you allocate is less than the sum of previous memory you deallocated. Therefore, we have this inequality:
T*x^n <= T + T*x + T*x^2 + ... + T*x^(n-2)
We can remove T from both sides. So we get this:
x^n <= 1 + x + x^2 + ... + x^(n-2)
Informally, what we are saying is that at the nth allocation, we want all our previously deallocated memory to be greater than or equal to the memory needed at the nth allocation, so that we can reuse the previously deallocated memory.
For instance, if we want to be able to do this at the 3rd step (i.e., n=3), then we have
x^3 <= 1 + x
This equation is true for all x such that 0 < x <= 1.3 (roughly)
See what x we get for different n's below:
n maximum-x (roughly)
3 1.3
4 1.4
5 1.53
6 1.57
7 1.59
22 1.61
Note that the growing factor has to be less than 2 since x^n > x^(n-2) + ... + x^2 + x + 1 for all x>=2.
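To reproduce those bounds numerically, here is a small sketch that solves x^n = 1 + x + ... + x^(n-2) for the largest admissible growth factor by bisection; the values can be compared with the rough figures in the table above and approach the golden ratio as n grows:

#include <math.h>
#include <stdio.h>

/* f(x) = (1 + x + ... + x^(n-2)) - x^n; the largest usable growth factor for a
 * given n is the root of f in (1, 2). */
static double f(double x, int n)
{
    double sum = 0.0;
    for (int k = 0; k <= n - 2; k++)
        sum += pow(x, k);
    return sum - pow(x, n);
}

int main(void)
{
    for (int n = 3; n <= 22; n++) {
        double lo = 1.0, hi = 2.0;       /* f(lo) > 0 and f(hi) < 0 for n >= 3 */
        for (int i = 0; i < 60; i++) {
            double mid = 0.5 * (lo + hi);
            if (f(mid, n) > 0.0)
                lo = mid;
            else
                hi = mid;
        }
        printf("n = %2d  maximum x = %.3f\n", n, lo);
    }
    return 0;
}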
Another two cents
Most computers have virtual memory! In physical memory you can have pages scattered everywhere, which are presented as a single contiguous space in your program's virtual memory; the indirection is resolved by the hardware. Virtual memory exhaustion was a problem on 32-bit systems, but it is really not a problem anymore. So filling the hole is not a concern anymore (except in special environments). Since Windows 7, even Microsoft supports 64 bit without extra effort. (Written in 2011.)
O(1) is reached with any r > 1 factor. Same mathematical proof works not only for 2 as parameter.
r = 1.5 can be calculated with old*3/2 so there is no need for floating point operations. (I say /2 because compilers will replace it with bit shifting in the generated assembly code if they see fit.)
MSVC went for r = 1.5, so there is at least one major compiler that does not use 2 as ratio.
As mentioned by someone 2 feels better than 8. And also 2 feels better than 1.1.
My feeling is that 1.5 is a good default. Other than that it depends on the specific case.
The top-voted and the accepted answer are both good, but neither answers the part of the question asking for a "mathematically justified" "ideal growth rate", "best balancing performance and wasted memory". (The second-top-voted answer does try to answer this part of the question, but its reasoning is confused.)
The question perfectly identifies the 2 considerations that have to be balanced, performance and wasted memory. If you choose a growth rate too low, performance suffers because you'll run out of extra space too quickly and have to reallocate too frequently. If you choose a growth rate too high, like 2x, you'll waste memory because you'll never be able to reuse old memory blocks.
In particular, if you do the math¹ you'll find that the upper limit on the growth rate is the golden ratio ϕ = 1.618… . Growth rates larger than ϕ (like 2x) mean that you'll never be able to reuse old memory blocks. Growth rates only slightly less than ϕ mean you won't be able to reuse old memory blocks until after many, many reallocations, during which time you'll be wasting memory. So you want to be as far below ϕ as you can get without sacrificing too much performance.
Therefore I'd suggest these candidates for "mathematically justified" "ideal growth rate", "best balancing performance and wasted memory":
≈1.466x (the solution to x^4 = 1 + x + x^2) allows memory reuse after just 3 reallocations, one sooner than 1.5x allows, while reallocating only slightly more frequently
≈1.534x (the solution to x^5 = 1 + x + x^2 + x^3) allows memory reuse after 4 reallocations, same as 1.5x, while reallocating slightly less frequently for improved performance
≈1.570x (the solution to x^6 = 1 + x + x^2 + x^3 + x^4) only allows memory reuse after 5 reallocations, but will reallocate even less frequently for even further improved performance (barely)
Clearly there are some diminishing returns there, so I think the global optimum is probably among those. Also, note that 1.5x is a great approximation to whatever the global optimum actually is, and has the advantage of being extremely simple.
¹ Credits to user541686 for this excellent source.
It really depends. Some people analyze common usage cases to find the optimal number.
I've seen 1.5x, 2.0x, phi x, and powers of 2 used before.
If you have a distribution over array lengths, and you have a utility function that says how much you like wasting space vs. wasting time, then you can definitely choose an optimal resizing (and initial sizing) strategy.
The reason the simple constant multiple is used is obviously so that each append has amortized constant time. But that doesn't mean you can't use a different (larger) ratio for small sizes.
In Scala, you can override loadFactor for the standard library hash tables with a function that looks at the current size. Oddly, the resizable arrays just double, which is what most people do in practice.
I don't know of any doubling (or 1.5*ing) arrays that actually catch out of memory errors and grow less in that case. It seems that if you had a huge single array, you'd want to do that.
I'd further add that if you're keeping the resizable arrays around long enough, and you favor space over time, it might make sense to dramatically overallocate (for most cases) initially and then reallocate to exactly the right size when you're done.
I was recently fascinated by the experimental data I've gathered on the wasted-memory aspect of things. The chart below shows the "overhead factor", calculated as the amount of overhead space divided by the useful space; the x-axis shows the growth factor. I'm yet to find a good explanation/model of what it reveals.
Simulation snippet: https://gist.github.com/gubenkoved/7cd3f0cb36da56c219ff049e4518a4bd.
Neither the shape nor the absolute values that the simulation reveals are something I expected.
Higher-resolution chart showing dependency on the max useful data size is here: https://i.stack.imgur.com/Ld2yJ.png.
UPDATE. After pondering this more, I've finally come up with the correct model to explain the simulation data, and hopefully, it matches experimental data nicely. The formula is quite easy to infer simply by looking at the size of the array that we would need to have for a given amount of elements we need to contain.
Referenced earlier GitHub gist was updated to include calculations using scipy.integrate for numerical integration that allows creating the plot below which verifies the experimental data pretty nicely.
UPDATE 2. One should, however, keep in mind that what we model/emulate there mostly has to do with virtual memory. The over-allocation overhead can stay entirely in virtual-memory territory, since a physical memory footprint is only incurred when we first access a page of virtual memory; it's possible to malloc a big chunk of memory, but until we first touch the pages all we do is reserve virtual address space. I've updated the GitHub gist with a C++ program that has a very basic dynamic array implementation that allows changing the growth factor, and a Python snippet that runs it multiple times to gather the "real" data. Please see the final graph below.
The conclusion there could be that for x64 environments, where virtual address space is not a limiting factor, there could be really little to no difference in physical memory footprint between different growth factors. Additionally, as far as virtual memory is concerned, the model above seems to make pretty good predictions!
Simulation snippet was built with g++.exe simulator.cpp -o simulator.exe on Windows 10 (build 19043), g++ version is below.
g++.exe (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0
PS. Note that the end result is implementation-specific. Depending on implementation details, a dynamic array might or might not access memory outside the "useful" boundaries. Some implementations use memset to zero-initialize POD elements for the whole capacity; this causes the virtual memory pages to be mapped to physical memory. However, the std::vector implementation of the compiler referenced above does not seem to do that, and so behaves like the mock dynamic array in the snippet, meaning the overhead is incurred on the virtual memory side and is negligible on the physical memory side.
I agree with Jon Skeet; even my theorycrafter friend insists that this can be proven to be O(1) when setting the factor to 2x.
The ratio between CPU time and memory is different on each machine, and so the factor will vary just as much. If you have a machine with gigabytes of RAM and a slow CPU, copying the elements to a new array is a lot more expensive than on a fast machine, which might in turn have less memory. It's a question that can be answered in theory, for a uniform computer, which in real scenarios doesn't help you at all.
I know it is an old question, but there are several things that everyone seems to be missing.
First, this is multiplication by 2: size << 1. This is multiplication by anything between 1 and 2: int(float(size) * x), where x is the number, the * is floating point math, and the processor has to run additional instructions for casting between float and int. In other words, at the machine level, doubling takes a single, very fast instruction to find the new size. Multiplying by something between 1 and 2 requires at least one instruction to cast size to a float, one instruction to multiply (which is float multiplication, so it probably takes at least twice as many cycles, if not 4 or even 8 times as many), and one instruction to cast back to int, and that assumes that your platform can perform float math on the general purpose registers, instead of requiring the use of special registers. In short, you should expect the math for each allocation to take at least 10 times as long as a simple left shift. If you are copying a lot of data during the reallocation though, this might not make much of a difference.
Second, and probably the big kicker: Everyone seems to assume that the memory that is being freed is both contiguous with itself, as well as contiguous with the newly allocated memory. Unless you are pre-allocating all of the memory yourself and then using it as a pool, this is almost certainly not the case. The OS might occasionally end up doing this, but most of the time, there is going to be enough free space fragmentation that any half decent memory management system will be able to find a small hole where your memory will just fit. Once you get to really big chunks, you are more likely to end up with contiguous pieces, but by then, your allocations are big enough that you are not doing them frequently enough for it to matter anymore. In short, it is fun to imagine that using some ideal number will allow the most efficient use of free memory space, but in reality, it is not going to happen unless your program is running on bare metal (as in, there is no OS underneath it making all of the decisions).
My answer to the question? Nope, there is no ideal number. It is so application specific that no one really even tries. If your goal is ideal memory usage, you are pretty much out of luck. For performance, less frequent allocations are better, but if we went just with that, we could multiply by 4 or even 8! Of course, when Firefox jumps from using 1GB to 8GB in one shot, people are going to complain, so that does not even make sense. Here are some rules of thumb I would go by though:
If you cannot optimize memory usage, at least don't waste processor cycles. Multiplying by 2 is at least an order of magnitude faster than doing floating point math. It might not make a huge difference, but it will make some difference at least (especially early on, during the more frequent and smaller allocations).
Don't overthink it. If you just spent 4 hours trying to figure out how to do something that has already been done, you just wasted your time. Totally honestly, if there was a better option than *2, it would have been done in the C++ vector class (and many other places) decades ago.
Lastly, if you really want to optimize, don't sweat the small stuff. Nowadays, no one cares about 4KB of memory being wasted, unless they are working on embedded systems. When you get to 1GB of objects that are between 1MB and 10MB each, doubling is probably way too much (I mean, that is between 100 and 1,000 objects). If you can estimate the expected expansion rate, you can level it out to a linear growth rate at a certain point. If you expect around 10 objects per minute, then growing at 5 to 10 object sizes per step (once every 30 seconds to a minute) is probably fine.
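A tiny hedged sketch of that leveling-out idea (the threshold and step sizes are made-up numbers to be tuned for the application):

#include <stddef.h>

/* Grow exponentially while the buffer is small, then linearly once it is large. */
size_t next_capacity(size_t current)
{
    const size_t threshold   = 64 * 1024 * 1024;  /* assumed switch-over point: 64 MB */
    const size_t linear_step = 8 * 1024 * 1024;   /* assumed step after that: 8 MB */

    if (current < threshold)
        return current * 2;
    return current + linear_step;
}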
What it all comes down to is: don't overthink it, optimize what you can, and customize to your application (and platform) if you must.
