I'm writing a program that reads data from a stream (a pipe or socket in my example) and puts that data into an array. The problem is that I can't know in advance how much data I will need to read from the stream, so I don't know how much memory to allocate for the array. If I knew that, there would be no need for this question. The only thing I know is that a sentinel value (-1 in this example) appears in the stream to mark its end. So the function that reads data from the stream could look like this:
// stand-in for reading one value from the stream; returns -1 (end of stream) about 1% of the time
int next_value() {
    return (rand() % 100) - 1;
}
The code that works with this data looks like this:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main()
{
    int len = 0;
    int *arr = NULL;
    int val, res = 0;

    srand(time(NULL));
    while ((val = next_value()) != -1) {
        if ((res = set_value_in_array(val, &arr, &len))) {
            perror("set_value_in_array");
            exit(EXIT_FAILURE);
        }
    }
    // uncomment the next lines to trim unused capacity when using set_value_in_array_v2 or set_value_in_array_v3
    //int *tmp = realloc(arr, len * sizeof(*arr));
    //if (tmp) arr = tmp;
    free(arr);
    return 0;
}
I have three strategies for putting data into the array, each with its own memory allocation routine.
The easiest is to reallocate memory for each new value that comes from next_value(), like this:
// allocate new element in array for each call
int set_value_in_array_v1(int val, int **arr, int *len) {
    int *tmp;

    tmp = realloc(*arr, ((*len) + 1) * sizeof(**arr));
    if (tmp) {
        *arr = tmp;
    } else {
        return -1;
    }
    *((*arr) + (*len)) = val;
    (*len)++;
    return 0;
}
Easy, but I don't think it's ideal. I don't know how many values will be read from the stream; the count could be anything from zero to practically infinite. Another strategy is to allocate memory for more than one element at a time, which decreases the number of calls to the memory manager. The code can look like this:
// allocate ELEMS_PER_ALLOC more elements every time an allocation is needed
int set_value_in_array_v2(int val, int **arr, int *len) {
#define ELEMS_PER_ALLOC 4 // how many elements to allocate on the next allocation
    int *tmp;

    if ((*len) % ELEMS_PER_ALLOC == 0) {
        tmp = realloc(*arr, ((*len) + ELEMS_PER_ALLOC) * sizeof(**arr));
        if (tmp) {
            *arr = tmp;
        } else {
            return -1;
        }
    }
    *((*arr) + (*len)) = val;
    (*len)++;
    return 0;
}
Much better, but is it the best solution? What if I allocate memory in a geometric progression, like this:
// multiply the allocated size by FRAC_FOR_ALLOC each time an allocation is needed
int set_value_in_array_v3(int val, int **arr, int *len) {
#define FRAC_FOR_ALLOC 3 // factor by which to grow the allocated size on each allocation
    static int allocated = 0; // I know using static here is bad, but it's for experiments only
    int *tmp;

    if (allocated == (*len)) {
        if (allocated == 0) {
            allocated = 1;
        }
        allocated *= FRAC_FOR_ALLOC;
        tmp = realloc(*arr, allocated * sizeof(**arr));
        if (tmp) {
            *arr = tmp;
        } else {
            return -1;
        }
    }
    *((*arr) + (*len)) = val;
    (*len)++;
    return 0;
}
The same approach is used by the .NET Framework's List<T> data structure. It has one big problem: after a hundred or so elements it starts allocating large chunks of memory at once, and situations where the current chunk cannot be grown in place become more and more likely.
On the other hand, set_value_in_array_v2 calls the memory manager very often, which is also not a good idea when there is a lot of data in the stream.
So my question is: what is the best memory allocation strategy in situations like mine? I can't find an answer to this on the Internet; every link just shows me best practices for using the memory management API.
Thanks in advance.
If you reallocate every time a new element is added, the number of reallocations is n, and there is no wasted memory to worry about.
If you reallocate memory in multiples of 4, the number of reallocations is roughly n/4. In the worst case you waste a constant 3 elements' worth of memory.
The number of reallocations required if you are reallocating the memory by a factor of k each time you run out of space is log n where the base of the logarithm is k. In the worst case scenario, you will have (1 - 1/k)*100% of memory being wasted. For k = 2, you'll have 50% of the allocated memory being unused. On average, you will have (1 - 1/k)*0.5*100% of your memory unused.
While reallocating memory using a geometric sequence, you will be guaranteed logarithmic time complexity. However, large factors of k will also put a limit on the maximum amount of memory you can allocate.
Suppose you could allocate just 1GB of memory for your requirement and you are already storing 216MB. If you use a k factor of 20, your next reallocation would fail because it would demand more than 1GB of memory.
The larger your base is, the lower the time complexity, but a larger base also increases the amount of memory left unused in the worst (and average) case, and it caps the maximum usable memory at less than what you could actually have used. (This varies from situation to situation: if you had 1296MB of allocatable memory and your base was 6, the cap on the array size would be the full 1296MB, since 1296 is a power of 6, assuming you started from a size that is a power of 6.)
What you need depends on your situation. In most cases you have a rough estimate of your memory requirements, so the first optimization is to set the initial allocation to that estimate. You can keep doubling the memory from then on every time you run out. After your stream is closed, you can reallocate the memory to match the exact size of your data (if you really, really need to free the unused memory).
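A minimal sketch of that strategy, in the style of the functions above (the extra cap parameter and the initial estimate of 64 are mine, for illustration only):

// start from an estimated capacity, double when full, optionally trim at the end
int set_value_in_array_v4(int val, int **arr, int *len, int *cap) {
    if (*len == *cap) {
        int newcap = (*cap == 0) ? 64 : (*cap) * 2; // 64 stands in for your own estimate
        int *tmp = realloc(*arr, newcap * sizeof(**arr));
        if (!tmp)
            return -1;
        *arr = tmp;
        *cap = newcap;
    }
    (*arr)[(*len)++] = val;
    return 0;
}

When the stream ends, one final realloc(arr, len * sizeof(*arr)) (keeping its return value) trims the unused tail.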
This question was part of my bachelor's thesis; unfortunately, it is in German.
I compared 3 allocation methods: fixed increase (your case 2), fixed factor (your case 3), and a dynamic factor.
The analyses in the other answers are quite good, but I want to add an important finding from my practical tests: the fixed-step increase can use the most memory at runtime! (And it is some orders of magnitude slower...)
Why? Suppose you have allocated space for 10 items. Then, when adding the 11th item, the space has to grow by another 10. It might not be possible to simply extend the space adjacent to the first 10 items (because it is in use by something else), so fresh space for 20 items is allocated, the original 10 are copied over, and the original space is freed. You have now allocated space for 30 items while you can actually use only 20. This gets worse with every allocation.
My dynamic-factor approach is intended to grow fast as long as the steps are not too big, and to use smaller factors later, so that the risk of running out of memory is minimized. It is a kind of inverted sigmoid function.
The thesis can be found here: XML Toolbox for Matlab. Relevant chapters are 3.2 (implementation) and 5.3.2 (practical tests)
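I don't know the thesis's exact growth function, but a rough sketch of the dynamic-factor idea could look like this (the breakpoints and factors below are invented for illustration):

// grow aggressively while the array is small, conservatively once it is large
static size_t next_capacity(size_t current)
{
    double factor;
    if (current < 1024)             factor = 4.0;  // small array: big steps are cheap
    else if (current < 1024 * 1024) factor = 2.0;  // medium: classic doubling
    else                            factor = 1.25; // large: limit waste and OOM risk
    size_t next = (size_t)(current * factor);
    return next > current ? next : current + 1;    // always make progress
}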
I want to define a 3D array/pointer to store my computing results:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    float ***ww;
    int i, j, k;
    int n1 = 3000, n2 = 6000, n3 = 9000;

    ww = floatalloc3(n1, n2, n3); /* floatalloc3 is a self-defined function used to allocate memory space for a 3D array/pointer */
    for (i = 0; i < n1; i++) {
        for (j = 0; j < n2; j++) {
            for (k = 0; k < n3; k++) {
                ww[i][j][k] = 0.0;
            }
        }
    }
    free(**ww);
    free(*ww);
    free(ww);
    return 0;
}
But an error pops up saying "cannot allocate 648000000000 bytes: Cannot allocate memory". How can I solve this problem, any ideas?
Is it possible to split the data and store it on different cores?
You want to allocate about 3000 * 6000 * 9000 * 4 bytes of data, which is ~648GB, a little too much.
If you truly intend to compute 3000×6000×9000 floating-point values (162,000,000,000 values), you need to reconsider your approach.
There are several approaches, but the most typical is to split the task into smaller sections, and compute them one by one.
If the grid is mostly empty, but you need to access it in an unpredictable fashion (making splitting the task undesirable), you can use OS-specific methods to memory-map the dataset. (You do need a 64-bit OS to do this with this particular dataset, though; and you do also need sufficient storage on some filesystem to store the data in.) I've shown how to do this in Linux in 2011 in another forum here; this example program manipulates a terabyte-sized memory-mapped dataset, almost twice the size of the dataset OP is considering.
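As a rough illustration (not the program from the linked post), a file-backed mapping on Linux could be set up like this; error handling is minimal, and the backing filesystem must have enough free space or support sparse files:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map a file-backed grid of n1*n2*n3 floats; pages are loaded on demand. */
float *map_grid(const char *path, size_t n1, size_t n2, size_t n3)
{
    size_t bytes = n1 * n2 * n3 * sizeof(float);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd == -1)
        return NULL;
    if (ftruncate(fd, (off_t)bytes) == -1) {   /* set the backing file size */
        close(fd);
        return NULL;
    }
    float *grid = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                 /* the mapping outlives the descriptor */
    return grid == MAP_FAILED ? NULL : grid;
}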
In any case, you definitely do not wish to use two-level indirection to access the data. This wreaks havoc with current CPUs' ability to predict and cache accesses, and will lead to poor performance. Instead, you should use a linear data structure. For example:
size_t xsize;
size_t ysize;
size_t zsize;
float *cells;
#define CELL(x,y,z) cells[(x) + xsize*( (y) + ysize * (z) )]
In other words, the index of each cell in the grid is (x) + (y)*xsize + (z)*xsize*ysize. Not only is the data then consecutive in memory (which is important for caching), but the CPU (and your compiler) can also better predict future accesses, based on access patterns.
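Usage is then a single allocation plus the macro; the dimensions below are scaled down, since the original 3000 x 6000 x 9000 grid would still be ~648 GB:

xsize = 300; ysize = 600; zsize = 900;                 /* ~648 MB of floats */
cells = calloc(xsize * ysize * zsize, sizeof *cells);  /* contiguous, zero-filled */
if (cells != NULL) {
    CELL(10, 20, 30) = 1.0f;                           /* plain index arithmetic, no pointer chasing */
    free(cells);
}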
The amount of memory you are trying to allocate seems humongous: 648 billion bytes plus the overhead of 2 levels of indirection! Your system probably does not let you allocate that much memory.
You should test the return value of floatalloc3 to detect allocation failure. As a matter of fact, it would be useful to post the source code for this self-defined function to ascertain its correctness.
Note also that the 3 free calls might not be enough to free the allocated blocks, but without the source code to floatalloc3, one can only speculate.
In terms of performance, what are the benefits of allocating a contiguous memory block versus separate memory blocks for a matrix? I.e., instead of writing code like this:
char **matrix = malloc(sizeof(char *) * 50);
for(i = 0; i < 50; i++)
matrix[i] = malloc(50);
giving me 50 disparate blocks of 50 bytes each and one block of 50 pointers, if I were to instead write:
char **matrix = malloc(sizeof(char *) * 50 + 50 * 50);
char *data = (char *)(matrix + 50); /* the data area starts right after the 50 row pointers */
for(i = 0; i < 50; i++) {
    matrix[i] = data;
    data += 50;
}
giving me one contiguous block of data, what would the benefits be? Avoiding cache misses is the only thing I can think of, and even that's only for small amounts of data (small enough to fit in the cache), right? I've tested this in a small application, noticed a small speed-up, and was wondering why.
It's complicated - you need to measure.
Using an intermediate pointer instead of calculating addresses in a two-dimensional array is most likely a loss on current processors, and both of your examples do that.
Next, everything fitting into L1 cache is a big win. malloc() most likely rounds allocations up to multiples of 64 bytes. A single 180 x 180 = 32,400-byte block might fit into L1 cache, while 180 individual mallocs of 180 bytes each, rounded up to 192 bytes (180 x 192 = 34,560 bytes total), might not, especially once you add another 180 pointers.
One contiguous array means you know how the data fits into cache lines, and you know you'll have the minimum number of page table lookups in the hardware. With hundreds of mallocs, no guarantee.
Watch Scott Meyers' "CPU Caches and Why You Care" presentation on Youtube. The performance gains can be entire orders of magnitude.
https://www.youtube.com/watch?v=WDIkqP4JbkE
As for the discussion above, the intermediate pointer argument died a long time ago. Compilers optimize them away. An N-dimensional array is allocated as a flat 1D vector, ALWAYS. If you do std::vector<std::vector<T>>, THEN you might get the equivalent of an ordered forward list of vectors, but for raw arrays, they're always allocated as one long, contiguous strip in a flat manner, and multi-dimensional access reduces to pointer arithmetic the same way 1-dimensional access does.
To access array[i][j][k] (assuming a width, height, and depth of {A, B, C}), you add i*(B*C) + j*C + k to the address at the front of the array. You'd have to do this math manually in a 1-D representation anyway.
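Applied to the 50 x 50 matrix from the question, that means the row-pointer table can be dropped entirely and a single flat block indexed directly (the AT macro below is a made-up helper for illustration):

#include <stdlib.h>

#define ROWS 50
#define COLS 50
#define AT(m, r, c) ((m)[(size_t)(r) * COLS + (c)])   /* row-major offset: r*COLS + c */

char *matrix = malloc((size_t)ROWS * COLS);   /* one contiguous block, no per-row pointers */
/* ... */
AT(matrix, 3, 7) = 'x';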
I am struggling to decide between two optimisations for building a numerical solver for the Poisson equation.
Essentially, I have a two-dimensional array of which I require n doubles in the first row, n/2 in the second, n/4 in the third, and so on...
Now my difficulty is deciding whether or not to use a contiguous 2D array grid[m][n], which for a large n would have many unused zeroes but would probably reduce the chance of a cache miss. The other, and more memory-efficient, method would be to dynamically allocate an array of pointers to arrays of decreasing size. This is considerably more efficient in terms of memory storage, but would it potentially hinder performance?
I don't think I clearly understand the trade-offs in this situation. Could anybody help?
For reference, I made a nice plot of the memory requirements in each case:
There is no hard and fast answer to this one. If your algorithm needs more memory than you expect to be given then you need to find one which is possibly slower but fits within your constraints.
Beyond that, the only option is to implement both and then compare their performance. If saving memory results in a 10% slowdown, is that acceptable for your use? If the version using more memory is 50% faster but only runs on the biggest computers, will it be used? These are the questions that we have to grapple with in Computer Science. But you can only look at them once you have numbers. Otherwise you are just guessing, and a fair amount of the time our intuition when it comes to optimizations is not correct.
Build a custom array that will follow the rules you have set.
The implementation will use a simple 1D contiguous array. You will need a function that returns the start of a given row. Something like this:
int* Get( int* array , int n , int row ) //might contain logical errors
{
int pos = 0 ;
while( row-- )
{
pos += n ;
n /= 2 ;
}
return array + pos ;
}
Where n is the same n you described and is rounded down on every iteration.
You will have to call this function only once per entire row.
This function will never take more than O(log n) time, but if you want you can replace it with a single expression: http://en.wikipedia.org/wiki/Geometric_series#Formula
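For instance, if n is a power of two (so the repeated halving is exact), the loop collapses into the closed form of the geometric series:

/* start of row 'row' = n + n/2 + ... (row terms) = 2*n - n/2^(row-1); 0 for row 0 */
int GetRowStart(int n, int row) /* assumes n is a power of two */
{
    return row ? 2 * n - (n >> (row - 1)) : 0;
}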
You could use a single array and just calculate your offset yourself
size_t get_offset(int n, int row, int column) {
    size_t offset = column;
    while (row--) {
        offset += n;
        n /= 2;   /* each row is half as long as the one above it */
    }
    return offset;
}
double *array = calloc(get_offset(n, 64, 0), sizeof(double)); /* room for 64 rows */
access via
array[get_offset(n, row, column)]
When making automatically expanding arrays (like C++'s std::vector) in C, it is common (or at least common advice) to double the size of the array each time it fills up, to limit the number of calls to realloc and avoid copying the entire array more often than necessary.
E.g. we start by allocating room for 8 elements, 8 elements are inserted, we then allocate room for 16 elements, 8 more elements are inserted, we allocate room for 32, etc.
But realloc does not have to actually copy the data if it can expand the existing memory allocation. For example, the following code only does 1 copy (the initial NULL allocation, so it is not really a copy) on my system, even though it calls realloc 10000 times:
#include <stdlib.h>
#include <stdio.h>

int main()
{
    int i;
    int copies = 0;
    void *data = NULL;
    void *ndata;

    for (i = 0; i < 10000; i++)
    {
        ndata = realloc(data, i * sizeof(int));
        if (data != ndata)
            copies++;
        data = ndata;
    }
    printf("%d\n", copies);
}
I realize that this example is very clinical - a real world application would probably have more memory fragmentation and would do more copies, but even if I make a bunch of random allocations before the realloc loop, it only does marginally worse with 2-4 copies instead.
So, is the "doubling method" really necessary? Would it not be better to just call realloc each time a element is added to the dynamic array?
You have to step back from your code for a minute and think abstractly. What is the cost of growing a dynamic container? Programmers and researchers don't think in terms of "this took 2ms", but rather in terms of asymptotic complexity: what is the cost of growing by one element given that I already have n elements, and how does this change as n increases?
If you only ever grew by a constant (or bounded) amount, then you would periodically have to move all the data, and so the cost of growing would depend on, and grow with, the size of the container. By contrast, when you grow the container geometrically, i.e. multiply its size by a fixed factor, every time it is full, then the expected cost of inserting is actually independent of the number of elements, i.e. constant.
It is of course not always constant, but it's amortized constant, meaning that if you keep inserting elements, then the average cost per element is constant. Every now and then you have to grow and move, but those events get rarer and rarer as you insert more and more elements.
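A quick way to convince yourself of the amortized-constant claim is to count element moves under doubling; this is a toy simulation, not a benchmark:

#include <stdio.h>

int main(void)
{
    long n = 1000000, cap = 1, moves = 0;

    for (long i = 0; i < n; i++) {
        if (i == cap) {      /* array is full: pretend we realloc and copy everything */
            moves += cap;
            cap *= 2;
        }
    }
    /* total moves stay below 2*n (about 1.05 per insert here), i.e. amortized constant */
    printf("%ld inserts, %ld element moves (%.2f per insert)\n",
           n, moves, (double)moves / n);
    return 0;
}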
I once asked whether it makes sense for C++ allocators to be able to grow, in the way that realloc does. The answers I got indicated that the non-moving growing behaviour of realloc is actually a bit of a red herring when you think asymptotically. Eventually you won't be able to grow anymore, and you'll have to move, so for the sake of studying the asymptotic cost it's actually irrelevant whether realloc can sometimes be a no-op or not. (Moreover, non-moving growth seems to upset modern, arena-based allocators, which expect all their allocations to be of a similar size.)
Compared to almost every other type of operation, malloc, calloc, and especially realloc are very expensive. I've personally benchmarked 10,000,000 reallocs, and it takes a HUGE amount of time to do that.
Even though I had other operations going on at the same time (in both benchmark tests), I found that I could literally cut HOURS off of the run time by using max_size *= 2 instead of max_size += 1.
Q: "Is doubling the capacity of a dynamic array necessary?"
A: No. One could grow only to the extent needed. But then you may truly copy data many times. It is a classic trade-off between memory and processor time. A good growth algorithm takes into account what is known about the program's data needs, without over-thinking those needs. An exponential growth factor of 2x is a happy compromise.
But now to your claim "following code only does 1 copy".
The amount of copying with advanced memory allocators may not be what OP thinks. Getting the same address does not mean that the underlying memory mapping did not perform significant work. All sorts of activity go on under-the-hood.
For memory allocations that grow & shrink a lot over the life of the code, I like grow and shrink thresholds geometrically placed apart from each other.
const size_t Grow[] = {1, 4, 16, 64, 256, 1024, 4096, ... };
const size_t Shrink[] = {0, 2, 8, 32, 128, 512, 2048, ... };
By using the grow thresholds while getting larger and the shrink thresholds while contracting, one avoids thrashing near a boundary. Sometimes a factor of 1.5 is used instead.
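One possible reading of that scheme, sketched under my own assumptions (capacity is always Grow[level]; growth is triggered by the current level's Grow value, shrinking by the lower level's Shrink value, so the two boundaries never coincide):

#include <stddef.h>

static const size_t Grow[]   = {1, 4, 16, 64, 256, 1024, 4096};
static const size_t Shrink[] = {0, 2, 8, 32, 128, 512, 2048};
enum { NLEVELS = sizeof Grow / sizeof Grow[0] };

/* Return the level whose Grow[] value should be used as the new capacity. */
static size_t adjust_level(size_t level, size_t count)
{
    while (level + 1 < NLEVELS && count > Grow[level])
        level++;                                  /* element count exceeds capacity: step up */
    while (level > 0 && count <= Shrink[level - 1])
        level--;                                  /* well under capacity: step down */
    return level;
}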
I am working on a Windows C project which is string-intensive: I need to convert a marked up string from one form to another. The basic flow is something like:
DWORD convert(char *point, DWORD extent)
{
    char *point_end = point + extent;
    char *result = memory_alloc(1);       /* fake: allocate the initial result buffer */
    char *p_result = result;              /* write cursor into result */
    DWORD result_extent = 0;

    while (point < point_end)
    {
        switch (*point)
        {
        case FOO:
            result_extent = p_result - result;
            result = memory_realloc(12);          /* fake: grow the result by 12 bytes */
            p_result = result + result_extent;    /* re-point the write cursor after realloc */
            *p_result++ = '\n';
            *p_result++ = '\t';
            memcpy(p_result, point, 10);
            point += 10;
            p_result += 10;
            break;
        case BAR:
            result_extent = p_result - result;
            result = memory_realloc(1);           /* fake: grow the result by 1 byte */
            p_result = result + result_extent;
            *p_result++ = *point++;
            break;
        default:
            point++;
            break;
        }
    }
    result_extent = p_result - result;
    // assume point is big enough to take anything I would copy to it
    memcpy(point, result, result_extent);
    return result_extent;
}
memory_alloc() and memory_realloc() are fake functions to highlight the purpose of my question. I do not know beforehand how big the result 'string' will be (technically, it's not a C-style/null-terminated string I'm working with, just a pointer to a memory address and a length/extent), so I'll need to size the result string dynamically (it might be bigger than the input, or smaller).
In my initial pass, I used malloc() to create room for the first byte/bytes and then subsequently realloc() whenever I needed to append another byte/handful of bytes...it works, but it feels like this approach will needlessly hammer away at the OS and likely result in shifting bytes around in memory over and over.
So I made a second pass, which determines how long the result_string will be after an individual unit of the transformation (illustrated above with the FOO and BAR cases) and picks a 'preferred allocation size', e.g. 256 bytes. For example, if result_extent is 250 bytes and I'm in the FOO case, I know I need to grow the memory 12 bytes (newline, tab and 10 bytes from the input string) -- rather than reallocating 260 bytes of memory, I'd reach for 512 bytes, hedging my bet that I'm likely going to continue to add more data (and thus I can save myself a few calls into realloc).
On to my question: is this latter thinking sound or is it premature optimization that the compiler/OS is probably already taking care of for me? Other than not wasting memory space, is there an advantage to reallocating memory by a couple bytes, as needed?
I have some rough ideas of what I might expect during a single conversion instance, e.g. a worst-case scenario might be a 2MB input string with a couple hundred bytes of markup, resulting in 50-100 bytes of data added to the result string per markup instance (so, say, 200 reallocs stretching the string by 50-100 bytes, with another 100 reallocations caused by simply copying data from the input string into the result string, aside from the markup).
Any thoughts on the subject would be appreciated. Thanks.
As you might know, realloc can move your data on each call, which results in an additional copy. In cases like this, I think it is much better to allocate a large buffer that will most probably be sufficient for the operation (an upper bound). At the end, you can allocate the exact amount for the result and do a final copy/free. This is better and is not premature optimization at all. IMO, using realloc here might be considered premature optimization in this case.
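A rough sketch of that approach (the function name and the expansion bound are mine, purely for illustration):

#include <stdlib.h>
#include <string.h>

#define MAX_EXPANSION 12   /* hypothetical worst-case output bytes per input byte */

char *convert_trimmed(const char *input, size_t extent, size_t *out_len)
{
    char *scratch = malloc(extent * MAX_EXPANSION);   /* single upper-bound buffer */
    if (scratch == NULL)
        return NULL;

    size_t used = 0;
    /* ... run the real conversion here, writing into scratch and advancing used ... */
    memcpy(scratch, input, extent);                   /* placeholder: plain copy */
    used = extent;

    char *result = malloc(used);                      /* exact-size final buffer */
    if (result != NULL)
        memcpy(result, scratch, used);
    free(scratch);

    *out_len = used;
    return result;
}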