What is the ideal growth rate for a dynamically allocated array? - arrays

C++ has std::vector and Java has ArrayList, and many other languages have their own form of dynamically allocated array. When a dynamic array runs out of space, it gets reallocated into a larger area and the old values are copied into the new array. A question central to the performance of such an array is how fast the array grows in size. If you always only grow large enough to fit the current push, you'll end up reallocating every time. So it makes sense to double the array size, or multiply it by say 1.5x.
Is there an ideal growth factor? 2x? 1.5x? By ideal I mean mathematically justified, best balancing performance and wasted memory. I realize that theoretically, given that your application could have any potential distribution of pushes that this is somewhat application dependent. But I'm curious to know if there's a value that's "usually" best, or is considered best within some rigorous constraint.
I've heard there's a paper on this somewhere, but I've been unable to find it.

I remember reading many years ago why 1.5 is preferred over two, at least as applied to C++ (this probably doesn't apply to managed languages, where the runtime system can relocate objects at will).
The reasoning is this:
Say you start with a 16-byte allocation.
When you need more, you allocate 32 bytes, then free up 16 bytes. This leaves a 16-byte hole in memory.
When you need more, you allocate 64 bytes, freeing up the 32 bytes. This leaves a 48-byte hole (if the 16 and 32 were adjacent).
When you need more, you allocate 128 bytes, freeing up the 64 bytes. This leaves a 112-byte hole (assuming all previous allocations are adjacent).
And so and and so forth.
The idea is that, with a 2x expansion, there is no point in time that the resulting hole is ever going to be large enough to reuse for the next allocation. Using a 1.5x allocation, we have this instead:
Start with 16 bytes.
When you need more, allocate 24 bytes, then free up the 16, leaving a 16-byte hole.
When you need more, allocate 36 bytes, then free up the 24, leaving a 40-byte hole.
When you need more, allocate 54 bytes, then free up the 36, leaving a 76-byte hole.
When you need more, allocate 81 bytes, then free up the 54, leaving a 130-byte hole.
When you need more, use 122 bytes (rounding up) from the 130-byte hole.

In the limit as n → ∞, it would be the golden ratio: ϕ = 1.618...
For finite n, you want something close, like 1.5.
The reason is that you want to be able to reuse older memory blocks, to take advantage of caching and avoid constantly making the OS give you more memory pages. The equation you'd solve to ensure that a subsequent allocation can re-use all prior blocks reduces to xn − 1 − 1 = xn + 1 − xn, whose solution approaches x = ϕ for large n. In practice n is finite and you'll want to be able to reusing the last few blocks every few allocations, and so 1.5 is great for ensuring that.
(See the link for a more detailed explanation.)

It will entirely depend on the use case. Do you care more about the time wasted copying data around (and reallocating arrays) or the extra memory? How long is the array going to last? If it's not going to be around for long, using a bigger buffer may well be a good idea - the penalty is short-lived. If it's going to hang around (e.g. in Java, going into older and older generations) that's obviously more of a penalty.
There's no such thing as an "ideal growth factor." It's not just theoretically application dependent, it's definitely application dependent.
2 is a pretty common growth factor - I'm pretty sure that's what ArrayList and List<T> in .NET uses. ArrayList<T> in Java uses 1.5.
EDIT: As Erich points out, Dictionary<,> in .NET uses "double the size then increase to the next prime number" so that hash values can be distributed reasonably between buckets. (I'm sure I've recently seen documentation suggesting that primes aren't actually that great for distributing hash buckets, but that's an argument for another answer.)

One approach when answering questions like this is to just "cheat" and look at what popular libraries do, under the assumption that a widely used library is, at the very least, not doing something horrible.
So just checking very quickly, Ruby (1.9.1-p129) appears to use 1.5x when appending to an array, and Python (2.6.2) uses 1.125x plus a constant (in Objects/listobject.c):
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
*/
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);
/* check for integer overflow */
if (new_allocated > PY_SIZE_MAX - newsize) {
PyErr_NoMemory();
return -1;
} else {
new_allocated += newsize;
}
newsize above is the number of elements in the array. Note well that newsize is added to new_allocated, so the expression with the bitshifts and ternary operator is really just calculating the over-allocation.

Let's say you grow the array size by x. So assume you start with size T. The next time you grow the array its size will be T*x. Then it will be T*x^2 and so on.
If your goal is to be able to reuse the memory that has been created before, then you want to make sure the new memory you allocate is less than the sum of previous memory you deallocated. Therefore, we have this inequality:
T*x^n <= T + T*x + T*x^2 + ... + T*x^(n-2)
We can remove T from both sides. So we get this:
x^n <= 1 + x + x^2 + ... + x^(n-2)
Informally, what we say is that at nth allocation, we want our all previously deallocated memory to be greater than or equal to the memory need at the nth allocation so that we can reuse the previously deallocated memory.
For instance, if we want to be able to do this at the 3rd step (i.e., n=3), then we have
x^3 <= 1 + x
This equation is true for all x such that 0 < x <= 1.3 (roughly)
See what x we get for different n's below:
n maximum-x (roughly)
3 1.3
4 1.4
5 1.53
6 1.57
7 1.59
22 1.61
Note that the growing factor has to be less than 2 since x^n > x^(n-2) + ... + x^2 + x + 1 for all x>=2.

Another two cents
Most computers have virtual memory! In the physical memory you can have random pages everywhere which are displayed as a single contiguous space in your program's virtual memory. The resolving of the indirection is done by the hardware. Virtual memory exhaustion was a problem on 32 bit systems, but it is really not a problem anymore. So filling the hole is not a concern anymore (except special environments). Since Windows 7 even Microsoft supports 64 bit without extra effort. # 2011
O(1) is reached with any r > 1 factor. Same mathematical proof works not only for 2 as parameter.
r = 1.5 can be calculated with old*3/2 so there is no need for floating point operations. (I say /2 because compilers will replace it with bit shifting in the generated assembly code if they see fit.)
MSVC went for r = 1.5, so there is at least one major compiler that does not use 2 as ratio.
As mentioned by someone 2 feels better than 8. And also 2 feels better than 1.1.
My feeling is that 1.5 is a good default. Other than that it depends on the specific case.

The top-voted and the accepted answer are both good, but neither answer the part of the question asking for a "mathematically justified" "ideal growth rate", "best balancing performance and wasted memory". (The second-top-voted answer does try to answer this part of the question, but its reasoning is confused.)
The question perfectly identifies the 2 considerations that have to be balanced, performance and wasted memory. If you choose a growth rate too low, performance suffers because you'll run out of extra space too quickly and have to reallocate too frequently. If you choose a growth rate too high, like 2x, you'll waste memory because you'll never be able to reuse old memory blocks.
In particular, if you do the math1 you'll find that the upper limit on the growth rate is the golden ratio ϕ = 1.618… . Growth rate larger than ϕ (like 2x) mean that you'll never be able to reuse old memory blocks. Growth rates only slightly less than ϕ mean you won't be able to reuse old memory blocks until after many many reallocations, during which time you'll be wasting memory. So you want to be as far below ϕ as you can get without sacrificing too much performance.
Therefore I'd suggest these candidates for "mathematically justified" "ideal growth rate", "best balancing performance and wasted memory":
≈1.466x (the solution to x4=1+x+x2) allows memory reuse after just 3 reallocations, one sooner than 1.5x allows, while reallocating only slightly more frequently
≈1.534x (the solution to x5=1+x+x2+x3) allows memory reuse after 4 reallocations, same as 1.5x, while reallocating slightly less frequently for improved performance
≈1.570x (the solution to x6=1+x+x2+x3+x4) only allows memory reuse after 5 reallocations, but will reallocate even less infrequently for even further improved performance (barely)
Clearly there's some diminishing returns there, so I think the global optimum is probably among those. Also, note that 1.5x is a great approximation to whatever the global optimum actually is, and has the advantage being extremely simple.
1 Credits to #user541686 for this excellent source.

It really depends. Some people analyze common usage cases to find the optimal number.
I've seen 1.5x 2.0x phi x, and power of 2 used before.

If you have a distribution over array lengths, and you have a utility function that says how much you like wasting space vs. wasting time, then you can definitely choose an optimal resizing (and initial sizing) strategy.
The reason the simple constant multiple is used, is obviously so that each append has amortized constant time. But that doesn't mean you can't use a different (larger) ratio for small sizes.
In Scala, you can override loadFactor for the standard library hash tables with a function that looks at the current size. Oddly, the resizable arrays just double, which is what most people do in practice.
I don't know of any doubling (or 1.5*ing) arrays that actually catch out of memory errors and grow less in that case. It seems that if you had a huge single array, you'd want to do that.
I'd further add that if you're keeping the resizable arrays around long enough, and you favor space over time, it might make sense to dramatically overallocate (for most cases) initially and then reallocate to exactly the right size when you're done.

I recently was fascinated by the experimental data I've got on the wasted memory aspect of things. The chart below is showing the "overhead factor" calculated as the amount of overhead space divided by the useful space, the x-axis shows a growth factor. I'm yet to find a good explanation/model of what it reveals.
Simulation snippet: https://gist.github.com/gubenkoved/7cd3f0cb36da56c219ff049e4518a4bd.
Neither shape nor the absolute values that simulation reveals are something I've expected.
Higher-resolution chart showing dependency on the max useful data size is here: https://i.stack.imgur.com/Ld2yJ.png.
UPDATE. After pondering this more, I've finally come up with the correct model to explain the simulation data, and hopefully, it matches experimental data nicely. The formula is quite easy to infer simply by looking at the size of the array that we would need to have for a given amount of elements we need to contain.
Referenced earlier GitHub gist was updated to include calculations using scipy.integrate for numerical integration that allows creating the plot below which verifies the experimental data pretty nicely.
UPDATE 2. One should however keep in mind that what we model/emulate there mostly has to do with the Virtual Memory, meaning the over-allocation overheads can be left entirely on the Virtual Memory territory as physical memory footprint is only incurred when we first access a page of Virtual Memory, so it's possible to malloc a big chunk of memory, but until we first access the pages all we do is reserving virtual address space. I've updated the GitHub gist with CPP program that has a very basic dynamic array implementation that allows changing the growth factor and the Python snippet that runs it multiple times to gather the "real" data. Please see the final graph below.
The conclusion there could be that for x64 environments where virtual address space is not a limiting factor there could be really little to no difference in terms of the Physical Memory footprint between different growth factors. Additionally, as far as Virtual Memory is concerned the model above seems to make pretty good predictions!
Simulation snippet was built with g++.exe simulator.cpp -o simulator.exe on Windows 10 (build 19043), g++ version is below.
g++.exe (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0
PS. Note that the end result is implementation-specific. Depending on implementation details dynamic array might or might not access the memory outside the "useful" boundaries. Some implementations would use memset to zero-initialize POD elements for whole capacity -- this will cause virtual memory page translated into physical. However, std::vector implementation on a referenced above compiler does not seem to do that and so behaves as per mock dynamic array in the snippet -- meaning overhead is incurred on the Virtual Memory side, and negligible on the Physical Memory.

I agree with Jon Skeet, even my theorycrafter friend insists that this can be proven to be O(1) when setting the factor to 2x.
The ratio between cpu time and memory is different on each machine, and so the factor will vary just as much. If you have a machine with gigabytes of ram, and a slow CPU, copying the elements to a new array is a lot more expensive than on a fast machine, which might in turn have less memory. It's a question that can be answered in theory, for a uniform computer, which in real scenarios doesnt help you at all.

I know it is an old question, but there are several things that everyone seems to be missing.
First, this is multiplication by 2: size << 1. This is multiplication by anything between 1 and 2: int(float(size) * x), where x is the number, the * is floating point math, and the processor has to run additional instructions for casting between float and int. In other words, at the machine level, doubling takes a single, very fast instruction to find the new size. Multiplying by something between 1 and 2 requires at least one instruction to cast size to a float, one instruction to multiply (which is float multiplication, so it probably takes at least twice as many cycles, if not 4 or even 8 times as many), and one instruction to cast back to int, and that assumes that your platform can perform float math on the general purpose registers, instead of requiring the use of special registers. In short, you should expect the math for each allocation to take at least 10 times as long as a simple left shift. If you are copying a lot of data during the reallocation though, this might not make much of a difference.
Second, and probably the big kicker: Everyone seems to assume that the memory that is being freed is both contiguous with itself, as well as contiguous with the newly allocated memory. Unless you are pre-allocating all of the memory yourself and then using it as a pool, this is almost certainly not the case. The OS might occasionally end up doing this, but most of the time, there is going to be enough free space fragmentation that any half decent memory management system will be able to find a small hole where your memory will just fit. Once you get to really bit chunks, you are more likely to end up with contiguous pieces, but by then, your allocations are big enough that you are not doing them frequently enough for it to matter anymore. In short, it is fun to imagine that using some ideal number will allow the most efficient use of free memory space, but in reality, it is not going to happen unless your program is running on bare metal (as in, there is no OS underneath it making all of the decisions).
My answer to the question? Nope, there is no ideal number. It is so application specific that no one really even tries. If your goal is ideal memory usage, you are pretty much out of luck. For performance, less frequent allocations are better, but if we went just with that, we could multiply by 4 or even 8! Of course, when Firefox jumps from using 1GB to 8GB in one shot, people are going to complain, so that does not even make sense. Here are some rules of thumb I would go by though:
If you cannot optimize memory usage, at least don't waste processor cycles. Multiplying by 2 is at least an order of magnitude faster than doing floating point math. It might not make a huge difference, but it will make some difference at least (especially early on, during the more frequent and smaller allocations).
Don't overthink it. If you just spent 4 hours trying to figure out how to do something that has already been done, you just wasted your time. Totally honestly, if there was a better option than *2, it would have been done in the C++ vector class (and many other places) decades ago.
Lastly, if you really want to optimize, don't sweat the small stuff. Now days, no one cares about 4KB of memory being wasted, unless they are working on embedded systems. When you get to 1GB of objects that are between 1MB and 10MB each, doubling is probably way too much (I mean, that is between 100 and 1,000 objects). If you can estimate expected expansion rate, you can level it out to a linear growth rate at a certain point. If you expect around 10 objects per minute, then growing at 5 to 10 object sizes per step (once every 30 seconds to a minute) is probably fine.
What it all comes down to is, don't over think it, optimize what you can, and customize to your application (and platform) if you must.

Related

Is array access always constant-time / O(1)?

From Richard Bird, Pearls of Functional Algorithm Design (2010), page 6:
For a pure functional programmer, an update operation takes logarithmic time in the size of the array. To be fair, procedural programmers also appreciate that constant-time indexing and updating
are only possible when the arrays are small.
Under what conditions do arrays have non-constant-time access? Is this related to CPU cache?
Most modern machine architectures try to offer small unit time access to memory.
They fail. Instead we see layers of caches with differing speeds.
The problem is simple: the speed of light. If you need an enormous memory [array] (in the extreme, imagine a memory the size of the Andromeda galaxy), it will take enormous space, and light cannot cross enormous space in short periods of time. Information cannot travel faster than the speed of light. You are screwed by physics from the start.
So the best you can do is build part of memory "nearby" where light takes only fractions of nanosecond to traverse (thus registers and L1 cache), and part of the memory far away (the disk drive). Other practical complications ensue, such as capacitance (think inertia) to slow down access to things further away.
Now, if you are willing to take the access time of your farthest memory element as "unit" time, yes, access to everything takes the same amount of time, e.g., O(1). In practical computing, we treat RAM memory this way most of the time, and we leave out other, slower devices to avoid screwing up our simple model.
Then you discover people that aren't satisfied with that, and voila, you have people optimizing for cache line access. So it may be O(1) in theory, and acts like O(1) for small arrays (that fit in the first level of cache), but it often is not in practice.
An extreme practical case is an array that doesn't fit in main memory; now an array access may cause paging from the disk.
Sometimes we don't care even in that case. Google is essentially a giant cache. We tend to think of Google searches as O(1).
Can oranges be red?
Yes, they can be red due to a number of reasons -
You color them red.
You grow a genetically modified variety.
You grow them on Mars, the red planet, where every thing is supposed to look red.
The (theoretical) list of some practical (given todays technology) and impractical (fiction / or future reality) goes on...
Point is, that I think the question you are asking, is really about two orthogonal concepts. Namely -
Big O Notation - "In mathematics, big O notation describes the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions."
vs
Practicalities (hardware and software) a good software engineer should be aware of, while architecting / designing their app and writing code.
In other words, while the concept of Big O Notation can be called academic, but it is most appropriate way of classifying algorithms complexity (Time / Space).. and that's where it ends. There is no need to muddy the waters with orthogonal concerns.
To be clear, I am not saying that one should not be aware of the under the hood implementation details and workings of things, which affect the performance of the software you write.. but there is no point of mixing the two together. For example, does it make sense to say -
Arrays do not have constant time access (with indexes) because -
Large arrays do not fit in CPU cache, and hence incur high cost of cache misses.
On a system under memory pressure, the array, big or small, has been swapped out from Physical Memory to Hard Disk, and not only is impacted by a cache miss, but also a hard page fault.
On a system under extreme CPU load, the thread which read the supposed array can be pre-empted, and may not get a chance to execute for several seconds.
On a hypothetical OS, which backs its memory not just with disk, but with additional memory on another computer on the other corner of the world, will make array access un-imaginably slow.
Like my apple and orange example, as you read through my increasingly absurd examples, hope the point I am trying to make is clear.
Conclusion - Any day, I'd answer the question "Do Arrays have constant time O(1) access (with indexes)", as yes.. without any doubt or ifs and buts, they do.
EDIT:
Put it another way - If O(1) is not the answer.. then neither is O(log n), or O(n log n), or O(n^2) or O(n ^ 3)..... and certainly not 42.
He is talking about Computation Models, and in particular the word-based RAM machine
A RAM machine is a formalization of something of very similar to an actual computer: we model the computer memory as a big array of memory words of w bits each, and we can read/write any words in O(1) time
But we have yet something important to define: how large should a word be?
We need a word size w ≥ Ω(log n) to be able at least to address the n parts of our input.
For this reason, word-based RAMs normally assume a word length of O(log n)
But having the word length of your machine depends on the size of the input appears strange and unrealistic
What if we keep the word length fixed? Then even following a pointer needs Ω(log n) time just to read the entire pointer
We need Ω(log n) words to store pointer and Ω(log n) time to access input element
if a language supports sparse arrays, access to the array would have to go through a directory, and a tree-structured directory would have non-linear access time. Or did you mean realistic conditions? ;-)

How can we allocate memory of order 10^15 in C

I need to allocate memory of order of 10^15 to store integers which can be of long long type.
If i use an array and declare something like
long long a[1000000000000000];
that's never going to work. So how can i allocate such a huge amount of memory.
Really large arrays generally aren't a job for memory, more one for disk. 1015 array elements at 64 bits apiece is (I think) 8 petabytes. You can pick up 8G memory slices for about $15 at the moment so, even if your machine could handle that much memory or address space, you'd be outlaying about $15 million dollars.
In addition, with upcoming DDR4 being clocked up to about 4GT/s (giga-transfers), even if each transfer was a 64-bit value, it would still take about one million seconds just to initialise that array to zero. Do you really want to be waiting around for eleven and a half days before your code even starts doing anything useful?
And, even if you go the disk route, that's quite a bit. At (roughly) $50 per TB, you're still looking at $400,000 and you'll possibly have to provide your own software for managing those 8,000 disks somehow. And I'm not even going to contemplate figuring out how long it would take to initialise the array on disk.
You may want to think about rephrasing your question to indicate the actual problem rather than what you currently have, a proposed solution. It may be that you don't need that much storage at all.
For example, if you're talking about an array where many of the values are left at zero, a sparse array is one way to go.
You can't. You don't have all this memory, and you'll don't have it for a while. Simple.
EDIT: If you really want to work with data that does not fit into your RAM, you can use some library that work with mass storage data, like stxxl, but it will work a lot slower, and you have always disk size limits.
MPI is what you need, that's actually a small size for parallel computing problems the blue gene Q monster at Lawerence Livermore National Labs holds around 1.5 PB of ram. you need to use block decomposition to divide up your problem and viola!
the basic approach is dividing up the array into equal blocks or chunks among many processors
You need to uppgrade to a 64-bit system. Then get 64-bit-capable compiler then put a l at the end of 100000000000000000.
Have you heard of sparse matrix implementation? In one of the sparse matrices, you just use very little part of the matrix despite of the matrix being huge.
Here are some libraries for you.
Here is a basic info about sparse-matrices You dont actually use all of it. Just the needed few points.

Why double stack capacity instead of just increasing it by fixed amount?

I'm using using an array implementation of a stack, if the stack is full instead of throwing error I am doubling the array size, copying over the elements, changing stack reference and adding the new element to the stack. (I'm following a book to teach my self this stuff).
What I don't fully understand is why should I double it, why not increase it by a fixed amount, why not just increase it by 3 times.
I assume it has something to do with the time complexity or something?
A explanation would be greatly appreciated!
Doubling has just become the standard for generic implementations of things like array lists ("dynamically" sized arrays that really just do what you're doing in the background) and really most dynamically sized data types that are backed by arrays. If you knew your scenario and had the time and willpower to write a custom stack/array list implementation you could certainly write a more optimal solution.
If you knew in your software that items would be added incredibly infrequently after the initial array was built, you could initialise it with a specific size then only increase it by the size of what was being added to preserve memory.
On the other hand if you knew the list would be expanded very frequently, you might chose to increase the list size by 3 times or more when it runs out of space.
For a generic implementation that's part of a common library, your implementation specifics and requirements aren't known so doubling is just a happy medium.
In theory, you indeed arrive at different time complexities. If you increase by a constant size, you divide the number of re-allocations (and thus O(n) copies) by a constant, but you still get O(n) time complexity for appending. If you double them, you get a better time complexity for appending (armortized O(1) IIRC), and as you at most consume twice as much memory as needed, you still got the same space complexity.
In practice, it's less severe, but nevertheless viable. Copies are expensive, while a bit of memory usually doesn't hurt. It's a tradeoff, but you'd have to be quite low on memory to choose another strategy. Often, you don't know beforehand (or can't let the stack know due to API limits) how much space you'll actually need. For instance, if you build a 1024 element stack starting with one element, you get down to (I may be off by one) 10 re-allocations, from 1024/K -- assuming K=3, that would be roughly 34 times as many re-allocations, only to save a bit of memory.
The same holds for any other factor. 2 is nice because you never end up with non-integer sizes and it's still quite small, limiting the wasted space to 50%. Specific use cases may be better-served by other factors, but usually the ROI is too small to justify re-implementing and optimizing what's already available in some library.
The problem with a fixed amount is choosing that fixed amount - if you (say) choose 100 items as your fixed amount, that makes sense if your stack is currently ~100 items in size. However, if your stack is already 10,000 items in size, it's likely to grow to 11,000 items. You don't want to do 10 reallocations / moves to grow the size of your stack by 10%.
As for 2x versus 3x, that's pretty arbitrary - nothing wrong with choosing 3x; which is "better" will depend on your exact use case and how you define "better".
Scaling by 2x is easy, and will ensure that on average items get copied no more than twice [an expansion will copy half the items for the first time, a quarter for the second, an eighth for the third, etc.] If things instead grew by a fixed amount, then when e.g. the twentieth expansion was performed, half the items will be copied for the tenth time.
Growing by a factor of more than 2x will increase the average "permanent" slack space; growing by a smaller factor will increase the amount of storage that is allocated and abandoned. Depending upon the relative perceived "costs" of permanent and abandoned allocations, the optimal growth factor may be larger or smaller, but growth factors which are anywhere close to optimum will generally not perform too much worse than would optimum growth factors. Regardless of what the optimum growth factor would be, a growth factor of 2x will be close enough to yield decent performance.

Buffer growth strategy

I have a generic growing buffer indended to accumulate "random" string pieces and then fetch the result. Code to handle that buffer is written in plain C.
Pseudocode API:
void write(buffer_t * buf, const unsigned char * bytes, size_t len);/* appends */
const unsigned char * buffer(buffer_t * buf);/* returns accumulated data */
I'm thinking about the growth strategy I should pick for that buffer.
I do not know if my users would prefer memory or speed — or what would be the nature of user's data.
I've seen two strategies in the wild: grow buffer by fixed size increments (that is what I've currently implemented) or grow data exponentially. (There is also a strategy to allocate the exact amount of memory needed — but this is not that interesting in my case.)
Perhaps I should let user to pick the strategy... But that would make code a bit more complex...
Once upon a time, Herb Sutter wrote (referencing Andrew Koenig) that the best strategy is, probably, exponential growth with factor 1.5 (search for "Growth Strategy"). Is this still the best choice?
Any advice? What does your experience say?
Unless you have a good reason to do otherwise, exponential growth is probably the best choice. Using 1.5 for the exponent isn't really magical, and in fact that's not what Andrew Koenig originally said. What he originally said was that the growth factor should be less than (1+sqrt(5))/2 (~1.6).
Pete Becker says when he was at Dinkumware P.J. Plauger, owner of Dinkumware, says they did some testing and found that 1.5 worked well. When you allocate a block of memory, the allocator will usually allocate a block that's at least slightly larger than you requested to give it room for a little book-keeping information. My guess (though unconfirmed by any testing) is that reducing the factor a little lets the real block size still fit within the limit.
References:
I believe Andrew originally published this in a magazine (the Journal of Object Oriented Programming, IIRC) which hasn't been published in years now, so getting a re-print would probably be quite difficult.
Andrew Koenig's Usenet post, and P.J. Plauger's Usenet post.
The exponential growth strategy is used throughout STL and it seems to work fine. I'd say stick with that at least until you find a definite case where it won't work.
I usually use a combination of addition of a small fixed amount and multiplication by 1.5 because it is efficent to implement and leads to reasonable step widths which are bigger at first and more memory sensible when the buffer grows. As fixed offset I usually use the initial size of the buffer and start with rather small initial sizes:
new_size = old_size + ( old_size >> 1 ) + initial_size;
As initial_size I use 4 for collection types, 8, 12 or 16 for string types and 128 to 4096 for in-/output buffers depending on the context.
Here is a little chart that shows that this grows much faster (yellow+red) in the early steps compared to multiplying by 1.5 only (red).
So, if you started with 100 you would need for example 6 increases to accommodate 3000 elements while multiplying with 1.5 alone would need 9.
At larger sizes the influence of the addition becomes negligible, which makes both approaches scale equally well by a factor of 1.5 then. These are the effective growth factors if you use the initial size as fixed amount for the addition:
2.5
1.9
1.7
1.62
1.57
1.54
1.53
1.52
1.51
1.5
...
The key point is that the exponential growth strategy lets you avoid expensive copies of the buffer content when you hit the current size for the cost of some wasted memory. The article you link has the numbers for the trade-of.
The answer, as always is, it "depends".
The idea behind exponential growth - ie allocating a new buffer that is x times the current size is that as you require more buffer, you'll need more buffer ansd the chances are that you'll be needing much more buffer than a small fixed increment provides.
So, if you have a 8-byte buffer, and need more allocating an extra 8 bytes is ok, then allocating an additional 16 bytes is probably a good idea - someone with a 16-byte buffer is not likely to require a extra 1 byte. And if they do, all that's happening is you're wasting a little memory.
I thought the best growth factor was 2 - ie double your buffer, but if Koenig/Sutter say 1.5 is optimal, then I'm agreeing with them. You may want to tweak your growth rate after getting some usage statistics though.
So exponential growth is a good trade-off between performance and keeping memory usage low.
Double the size until a threshold (~100MB?) and then lower the exponential growth toward 1.5,..,
1.3
Another option would be to make the default buffer size configurable at runtime.
The point of using exponential growth (whether the factor be 1.5 or 2) is to avoid copies. Each time you realloc the array, you can trigger an implicit copy of the item, which, of course, gets more expensive the larger it gets. By using an exponential growth, you get an amortized constant number of recopies -- i.e. you rarely end up copying.
As long as you're running on a desktop computer of some kind, you can expect an essentially unlimited amount of memory, so time is probably the right side of that tradeoff. For hard real-time systems, you would probably want to find a way to avoid the copies altogether -- a linked list comes to mind.
There's no way anyone can give good advice without knowing something about the allocations, runtime environment, execution characteristics, etc., etc.
Code which works is way more important than highly optimized code... which is under development. Choose some algorithm—any workable algorithm—and try it! If it proves suboptimal, then change the strategy. Placing this in the control of the library user often does them no favors. But if you already have some option scheme in place, then adding it could be useful, unless you hit on a good algorithm (and n^1.5 is a pretty good one).
Also, the use of a function named write in C (not C++) conflicts with <io.h> and <stdio.h>. It's fine if nothing uses them, but it would also be hard to add them later. Best to use a more descriptive name.
As a wild idea, for this specific case, you could change the API to require the caller to allocate the memory for each chunk, and then remembering the chunks instead of copying the data.
Then, when it's time to actually produce the result, you know exactly how much memory is going to be needed and can allocate exactly that.
This has the benefit that the caller will need to allocate memory for the chunks anyway, and so you might as well make use of that. This also avoids copying data more than once.
It has the disadvantage that the caller will have to dynamically allocate each chunk. To get around that, you could allocate memory for each chunk, and remember those, rather than keeping one large buffer, which gets resized when it gets full. This way, you'll copy data twice (once into the chunk you allocate, another time into the resulting string), but no more. If you have to resize several times, you may end up with more than two copies.
Further, really large areas of free memory may be difficult for the memory allocator to find. Allocating smaller chunks may well be easier. There might not be space for a one-gigabyte chunk of memory, but there might be space for a thousand megabyte chunks.

Why is that data structures usually have a size of 2^n?

Is there a historical reason or something ? I've seen quite a few times something like char foo[256]; or #define BUF_SIZE 1024. Even I do mostly only use 2n sized buffers, mostly because I think it looks more elegant and that way I don't have to think of a specific number. But I'm not quite sure if that's the reason most people use them, more information would be appreciated.
There may be a number of reasons, although many people will as you say just do it out of habit.
One place where it is very useful is in the efficient implementation of circular buffers, especially on architectures where the % operator is expensive (those without a hardware divide - primarily 8 bit micro-controllers). By using a 2^n buffer in this case, the modulo, is simply a case of bit-masking the upper bits, or in the case of say a 256 byte buffer, simply using an 8-bit index and letting it wraparound.
In other cases alignment with page boundaries, caches etc. may provide opportunities for optimisation on some architectures - but that would be very architecture specific. But it may just be that such buffers provide the compiler with optimisation possibilities, so all other things being equal, why not?
Cache lines are usually some multiple of 2 (often 32 or 64). Data that is an integral multiple of that number would be able to fit into (and fully utilize) the corresponding number of cache lines. The more data you can pack into your cache, the better the performance.. so I think people who design their structures in that way are optimizing for that.
Another reason in addition to what everyone else has mentioned is, SSE instructions take multiple elements, and the number of elements input is always some power of two. Making the buffer a power of two guarantees you won't be reading unallocated memory. This only applies if you're actually using SSE instructions though.
I think in the end though, the overwhelming reason in most cases is that programmers like powers of two.
Hash Tables, Allocation by Pages
This really helps for hash tables, because you compute the index modulo the size, and if that size is a power of two, the modulus can be computed with a simple bitwise-and or & rather than using a much slower divide-class instruction implementing the % operator.
Looking at an old Intel i386 book, and is 2 cycles and div is 40 cycles. A disparity persists today due to the much greater fundamental complexity of division, even though the 1000x faster overall cycle times tend to hide the impact of even the slowest machine ops.
There was also a time when malloc overhead was occasionally avoided at great length. Allocation's available directly from the operating system would be (still are) a specific number of pages, and so a power of two would be likely to make the most use of the allocation granularity.
And, as others have noted, programmers like powers of two.
I can think of a few reasons off the top of my head:
2^n is a very common value in all of computer sizes. This is directly related to the way bits are represented in computers (2 possible values), which means variables tend to have ranges of values whose boundaries are 2^n.
Because of the point above, you'll often find the value 256 as the size of the buffer. This is because it is the largest number that can be stored in a byte. So, if you want to store a string together with a size of the string, then you'll be most efficient if you store it as: SIZE_BYTE+ARRAY, where the size byte tells you the size of the array. This means the array can be any size from 1 to 256.
Many other times, sizes are chosen based on physical things (for example, the size of the memory an operating system can choose from is related to the size of the registers of the CPU etc) and these are also going to be a specific amount of bits. Meaning, the amount of memory you can use will usually be some value of 2^n (for a 32bit system, 2^32).
There might be performance benefits/alignment issues for such values. Most processors can access a certain amount of bytes at a time, so even if you have a variable whose size is let's say) 20 bits, a 32 bit processor will still read 32 bits, no matter what. So it's often times more efficient to just make the variable 32 bits. Also, some processors require variables to be aligned to a certain amount of bytes (because they can't read memory from, for example, addresses in the memory that are odd). Of course, sometimes it's not about odd memory locations, but locations that are multiples of 4, or 6 of 8, etc. So in these cases, it's more efficient to just make buffers that will always be aligned.
Ok, those points came out a bit jumbled. Let me know if you need further explanation, especially point 4 which IMO is the most important.
Because of the simplicity (read also cost) of base 2 arithmetic in electronics: shift left (multiply by 2), shift right (divide by 2).
In the CPU domain, lots of constructs revolve around base 2 arithmetic. Busses (control & data) to access memory structure are often aligned on power 2. The cost of logic implementation in electronics (e.g. CPU) makes for arithmetics in base 2 compelling.
Of course, if we had analog computers, the story would be different.
FYI: the attributes of a system sitting at layer X is a direct consequence of the server layer attributes of the system sitting below i.e. layer < x. The reason I am stating this stems from some comments I received with regards to my posting.
E.g. the properties that can be manipulated at the "compiler" level are inherited & derived from the properties of the system below it i.e. the electronics in the CPU.
I was going to use the shift argument, but could think of a good reason to justify it.
One thing that is nice about a buffer that is a power of two is that circular buffer handling can use simple ands rather than divides:
#define BUFSIZE 1024
++index; // increment the index.
index &= BUFSIZE; // Make sure it stays in the buffer.
If it weren't a power of two, a divide would be necessary. In the olden days (and currently on small chips) that mattered.
It's also common for pagesizes to be powers of 2.
On linux I like to use getpagesize() when doing something like chunking a buffer and writing it to a socket or file descriptor.
It's makes a nice, round number in base 2. Just as 10, 100 or 1000000 are nice, round numbers in base 10.
If it wasn't a power of 2 (or something close such as 96=64+32 or 192=128+64), then you could wonder why there's the added precision. Not base 2 rounded size can come from external constraints or programmer ignorance. You'll want to know which one it is.
Other answers have pointed out a bunch of technical reasons as well that are valid in special cases. I won't repeat any of them here.
In hash tables, 2^n makes it easier to handle key collissions in a certain way. In general, when there is a key collission, you either make a substructure, e.g. a list, of all entries with the same hash value; or you find another free slot. You could just add 1 to the slot index until you find a free slot; but this strategy is not optimal, because it creates clusters of blocked places. A better strategy is to calculate a second hash number h2, so that gcd(n,h2)=1; then add h2 to the slot index until you find a free slot (with wrap around). If n is a power of 2, finding a h2 that fulfills gcd(n,h2)=1 is easy, every odd number will do.

Resources