When making automatically expanding arrays (like C++'s std::vector) in C, it is common (or at least common advice) to double the size of the array each time it fills up, in order to limit the number of calls to realloc and avoid copying the entire array more often than necessary.
E.g. we start by allocating room for 8 elements; once 8 elements are inserted, we reallocate to 16; after 8 more, to 32; and so on.
But realloc does not have to actually copy the data if it can expand the existing memory allocation. For example, the following code only does 1 copy (the initial NULL allocation, so it is not really a copy) on my system, even though it calls realloc 10000 times:
#include <stdlib.h>
#include <stdio.h>

int main()
{
    int i;
    int copies = 0;
    void *data = NULL;
    void *ndata;

    for (i = 0; i < 10000; i++)
    {
        ndata = realloc(data, i * sizeof(int));
        if (data != ndata)
            copies++;
        data = ndata;
    }
    printf("%d\n", copies);
}
I realize that this example is very clinical - a real world application would probably have more memory fragmentation and would do more copies, but even if I make a bunch of random allocations before the realloc loop, it only does marginally worse with 2-4 copies instead.
So, is the "doubling method" really necessary? Would it not be better to just call realloc each time an element is added to the dynamic array?
You have to step back from your code for a minute and think abstractly. What is the cost of growing a dynamic container? Programmers and researchers don't think in terms of "this took 2 ms", but rather in terms of asymptotic complexity: what is the cost of growing by one element given that I already have n elements, and how does this change as n increases?
If you only ever grew by a constant (or bounded) amount, then you would periodically have to move all the data, and so the cost of growing would depend on, and grow with, the size of the container. By contrast, when you grow the container geometrically, i.e. multiply its size by a fixed factor every time it is full, then the expected cost of inserting is actually independent of the number of elements, i.e. constant.
It is of course not always constant, but it's amortized constant, meaning that if you keep inserting elements, then the average cost per element is constant. Every now and then you have to grow and move, but those events get rarer and rarer as you insert more and more elements.
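As a concrete sketch of the geometric-growth strategy (illustrative only; the struct and function names here are made up, and error handling is minimal):

```c
#include <stdlib.h>

typedef struct {
    int *data;
    size_t size;     /* elements in use */
    size_t capacity; /* elements allocated */
} IntVec;

/* Append one element; grow geometrically (factor 2) when full.
   Returns 0 on success, -1 on allocation failure. */
static int vec_push(IntVec *v, int value)
{
    if (v->size == v->capacity) {
        size_t new_cap = v->capacity ? v->capacity * 2 : 8;
        int *p = realloc(v->data, new_cap * sizeof *p);
        if (!p)
            return -1;
        v->data = p;
        v->capacity = new_cap;
    }
    v->data[v->size++] = value;
    return 0;
}
```

Inserting n elements triggers only O(log n) reallocations, which is where the amortized-constant insertion cost comes from.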
I once asked whether it makes sense for C++ allocators to be able to grow, in the way that realloc does. The answers I got indicated that the non-moving growing behaviour of realloc is actually a bit of a red herring when you think asymptotically. Eventually you won't be able to grow anymore, and you'll have to move, and so for the sake of studying the asymptotic cost, it's actually irrelevant whether realloc can sometimes be a no-op or not. (Moreover, non-moving growth seems to upset modern, arena-based allocators, which expect all their allocations to be of a similar size.)
Compared to almost every other type of operation, malloc, calloc, and especially realloc are very expensive. I've personally benchmarked 10,000,000 reallocs, and it takes a HUGE amount of time to do that.
Even though I had other operations going on at the same time (in both benchmark tests), I found that I could literally cut HOURS off of the run time by using max_size *= 2 instead of max_size += 1.
Q: Is doubling the capacity of a dynamic array necessary?
A: No. One could grow only to the extent needed. But then you may truly copy data many times. It is a classic trade-off between memory and processor time. A good growth algorithm takes into account what is known about the program's data needs, without over-thinking those needs. An exponential growth factor of 2x is a happy compromise.
But now to your claim that the "following code only does 1 copy".
The amount of copying with advanced memory allocators may not be what OP thinks. Getting the same address back does not mean that the underlying memory mapping did not perform significant work. All sorts of activity goes on under the hood.
For memory allocations that grow & shrink a lot over the life of the code, I like grow and shrink thresholds that are spaced geometrically apart from each other.
const size_t Grow[] = {1, 4, 16, 64, 256, 1024, 4096, ... };
const size_t Shrink[] = {0, 2, 8, 32, 128, 512, 2048, ... };
By using the grow thresholds while expanding and the shrink thresholds while contracting, one avoids thrashing near a boundary. Sometimes a factor of 1.5 is used instead of 2.
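A sketch of how such paired thresholds might be used (the level-picking helper is my own illustration, not a standard API; the threshold values are the example tables above, truncated to a fixed length):

```c
#include <stddef.h>

/* Geometrically spaced capacity levels. Growing crosses the Grow[]
   thresholds; shrinking only happens when the count falls below the
   *previous* level's Shrink[] value, leaving a hysteresis gap so that
   oscillating around a boundary does not cause repeated reallocs. */
static const size_t Grow[]   = {1, 4, 16, 64, 256, 1024, 4096};
static const size_t Shrink[] = {0, 2, 8, 32, 128, 512, 2048};
#define NLEVELS (sizeof Grow / sizeof Grow[0])

/* Return the capacity level to use for `count` elements, starting
   from the current `level` (capacity at level l is Grow[l]). */
static size_t pick_level(size_t level, size_t count)
{
    while (level + 1 < NLEVELS && count > Grow[level])
        level++;                 /* capacity exceeded: grow */
    while (level > 0 && count < Shrink[level - 1])
        level--;                 /* well below the previous level: shrink */
    return level;
}
```

For example, after growing from 64 to 256 elements of capacity, the container does not shrink back until the count drops below 32, so adding and removing a single element near the 64 boundary never thrashes.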
Related
I will read a file which contains an array of unknown size, like this:
1, 2, 3, ....
5, 6, 8 ....
Is this algorithm safe and fast to use?
array = NULL; /* for realloc */
for (i = 0; fgets(line, 256, input) != NULL; ++i) {
    array = realloc(array, sizeof(double *) * (i + 1));
    array[i] = NULL; /* must be NULL before the first inner realloc */
    value = strtok(line, selector);
    for (j = 0; value != NULL; ++j) {
        array[i] = realloc(array[i], sizeof(double) * (j + 1));
        sscanf(value, "%lf", &array[i][j]);
        value = strtok(NULL, selector);
    }
}
On the speed: Your algorithm has quadratic complexity O(n^2), where n is the number of values per line or the number of lines. This is not efficient.
The normal workaround for this is to keep track of two sizes: the size of the allocated array, and the number of elements currently in use. A value is added either by just incrementing the number of elements in use (and storing the value at the correct location, of course), or, when the array is full, by first realloc()ing it to twice its current size. The result is that even when n is very large, the average element in the array is copied only once. This brings the complexity down to O(n).
Of course, all of this is irrelevant if you only ever have something like ten entries in your arrays. But you were asking for speed.
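A sketch of that bookkeeping for a single growable array of doubles (the helper name is made up; in the two-dimensional reading code above, each row would carry its own size/capacity pair):

```c
#include <stdlib.h>

/* Append one double to a growable array, doubling capacity when full.
   `used` counts elements in use, `cap` counts elements allocated.
   Returns 0 on success, -1 on allocation failure. */
static int push_double(double **arr, size_t *used, size_t *cap, double v)
{
    if (*used == *cap) {
        size_t ncap = *cap ? *cap * 2 : 8;
        double *p = realloc(*arr, ncap * sizeof *p);
        if (!p)
            return -1;
        *arr = p;
        *cap = ncap;
    }
    (*arr)[(*used)++] = v;
    return 0;
}
```

The reading loop would then call push_double per token instead of realloc'ing by one element each time.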
On the safety: The only risk that I see is that you are fragmenting your address space more than necessary by creating tons of temporary objects which are just created to be replaced by a slightly larger one in the next iteration. This may lead to increased memory hunger in the long run, but it's virtually impossible to gauge this effect precisely.
I have a program where I need to read bytes of data in RAM, sort that data using qsort, and write the data back out to a file, the catch is that I'm only allowed use a certain amount of memory to do so.
Here's the gist of what I've done:
FILE *fp;
/* open file for reading up here, blah blah blah */
...
int mb = 1024 * 1024;
int mem_size = 20 * mb;
int total_cookies = mem_size / sizeof(Cookie);
Cookie *buffer = (Cookie *) calloc(total_cookies, sizeof(Cookie));

/* read bytes into buffer */
while ((result = fread(buffer, sizeof(Cookie), total_cookies, fp)) > 0) {
    qsort(buffer, result, sizeof(Cookie), compare); /* sort only what was read */
    fwrite(....)
}
free(buffer);
My problem is, when I run my program against /usr/bin/time -v and check the maximum resident set size, I use twice the amount of memory that I'm intended to, and the problem points back to the qsort function.
How do I get qsort to sort in place, and not use extra memory?
Since the specification places no requirement on qsort's memory consumption or time complexity, you will need to use another function (possibly self-implemented) if you have constraints on memory consumption (at least if you want portability).
There are indeed algorithms (e.g. quicksort) that have O(n log n) time complexity and only O(log n) extra memory. There are probably implementations that actually use such an algorithm for qsort, but yours is apparently not one of them.
There are other decent algorithms that require more memory. For example, merge sort requires making a copy of the data for the last merge step (which is consistent with your observation). Merge sort has its advantages, for example a better worst-case time complexity, which may be why the implementation opted for that algorithm.
Actually, the implementation of qsort could have been a lot worse than that, both time- and memory-wise. The statement that qsort sorts in place is only true to the extent that the result finally ends up in the same array as the input; before that, it could scatter the data all over the place.
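If a guaranteed in-place sort is needed, heapsort is the usual self-implemented choice: O(n log n) worst-case time with O(1) extra space. A rough, untuned sketch with a qsort-like interface (this is my own illustration, not any particular library's implementation; it assumes elements of at most 64 bytes):

```c
#include <stddef.h>
#include <string.h>

/* Restore the max-heap property below `root` in a heap of n elements. */
static void sift_down(char *base, size_t root, size_t n, size_t sz,
                      int (*cmp)(const void *, const void *), char *tmp)
{
    for (;;) {
        size_t child = 2 * root + 1;
        if (child >= n)
            break;
        if (child + 1 < n && cmp(base + child * sz, base + (child + 1) * sz) < 0)
            child++;                            /* pick the larger child */
        if (cmp(base + root * sz, base + child * sz) >= 0)
            break;                              /* heap property holds */
        memcpy(tmp, base + root * sz, sz);      /* swap root and child */
        memcpy(base + root * sz, base + child * sz, sz);
        memcpy(base + child * sz, tmp, sz);
        root = child;
    }
}

/* In-place heapsort: O(n log n) time, O(1) extra space beyond one
   element-sized swap buffer. Silently no-ops if sz > 64. */
static void heapsort_inplace(void *vbase, size_t n, size_t sz,
                             int (*cmp)(const void *, const void *))
{
    char *base = vbase;
    char tmp[64]; /* assumed max element size; adjust as needed */
    if (n < 2 || sz > sizeof tmp)
        return;
    for (size_t i = n / 2; i-- > 0;)            /* build the heap */
        sift_down(base, i, n, sz, cmp, tmp);
    for (size_t i = n; i-- > 1;) {              /* extract max repeatedly */
        memcpy(tmp, base, sz);
        memcpy(base, base + i * sz, sz);
        memcpy(base + i * sz, tmp, sz);
        sift_down(base, 0, i, sz, cmp, tmp);
    }
}

/* Example comparator, same contract as for qsort. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}
```

It is typically somewhat slower than a tuned quicksort on average, but its memory footprint is fixed, which is what matters under a hard RAM budget.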
Suppose we have an array of data and another array with indexes.
data = [1, 2, 3, 4, 5, 7]
index = [5, 1, 4, 0, 2, 3]
We want to create a new array from elements of data at position from index. Result should be
[4, 2, 5, 7, 3, 1]
The naive algorithm runs in O(N) but performs random memory accesses.
Can you suggest a CPU-cache-friendly algorithm with the same complexity?
PS
In my particular case, all elements in the data array are integers.
PPS
Arrays might contain millions of elements.
PPPS I'm OK with SSE/AVX or any other x64-specific optimizations
Combine index and data into a single array. Then use some cache-friendly sorting algorithm to sort these pairs (by index). Then get rid of indexes. (You could combine merging/removing indexes with the first/last pass of the sorting algorithm to optimize this a little bit).
For cache-friendly O(N) sorting, use radix sort with a small enough radix (at most half the number of cache lines in the CPU cache).
Here is a C implementation of the radix-sort-like algorithm (kRadix and the g_* arrays are globals defined in the full benchmark, not shown here):
void reorder2(const unsigned size)
{
    const unsigned min_bucket = size / kRadix;
    const unsigned large_buckets = size % kRadix;

    g_counters[0] = 0;
    for (unsigned i = 1; i <= large_buckets; ++i)
        g_counters[i] = g_counters[i - 1] + min_bucket + 1;
    for (unsigned i = large_buckets + 1; i < kRadix; ++i)
        g_counters[i] = g_counters[i - 1] + min_bucket;

    for (unsigned i = 0; i < size; ++i)
    {
        const unsigned dst = g_counters[g_index[i] % kRadix]++;
        g_sort[dst].index = g_index[i] / kRadix;
        g_sort[dst].value = g_input[i];
        __builtin_prefetch(&g_sort[dst + 1].value, 1);
    }

    g_counters[0] = 0;
    for (unsigned i = 1; i < (size + kRadix - 1) / kRadix; ++i)
        g_counters[i] = g_counters[i - 1] + kRadix;

    for (unsigned i = 0; i < size; ++i)
    {
        const unsigned dst = g_counters[g_sort[i].index]++;
        g_output[dst] = g_sort[i].value;
        __builtin_prefetch(&g_output[dst + 1], 1);
    }
}
It differs from radix sort in two aspects: (1) it does not do counting passes because all counters are known in advance; (2) it avoids using power-of-2 values for radix.
This C++ code was used for benchmarking (if you want to run it on a 32-bit system, slightly decrease the kMaxSize constant).
Here are the benchmark results (on a Haswell CPU with a 6 MB cache; diagram not reproduced here):
It is easy to see that small arrays (below ~2,000,000 elements) are cache-friendly even for the naive algorithm. You may also notice that the sorting approach starts to become cache-unfriendly at the last point on the diagram (with size/radix near 0.75 cache lines in the L3 cache). Between these limits, the sorting approach is more efficient than the naive algorithm.
In theory (comparing only the memory bandwidth needed by these algorithms, with 64-byte cache lines and 4-byte values), the sorting algorithm should be 3 times faster. In practice the difference is much smaller, about 20%. This could be improved by using smaller 16-bit values for the data array (in which case the sorting algorithm is about 1.5 times faster).
One more problem with the sorting approach is its worst-case behavior when size/radix is close to some power of 2. This may either be ignored (because there are not so many "bad" sizes) or fixed by making the algorithm slightly more complicated.
If we increase the number of passes to 3, all 3 passes mostly use the L1 cache, but memory bandwidth is increased by 60%. After determining (experimentally) the best radix value, I got somewhat better results for sizes greater than 4,000,000 (where the 2-pass algorithm uses L3 cache for one pass) but somewhat worse results for smaller arrays (where the 2-pass algorithm uses L2 cache for both passes). As might be expected, performance is better for 16-bit data.
Conclusion: the performance difference is much smaller than the difference in complexity of the algorithms, so the naive approach is almost always better; if performance is very important and only 2- or 4-byte values are used, the sorting approach is preferable.
data = [1, 2, 3, 4, 5, 7]
index = [5, 1, 4, 0, 2, 3]
We want to create a new array from elements of data at position from index. Result should be
result -> [4, 2, 5, 7, 3, 1]
Single thread, one pass
I think, for a few million elements and on a single thread, the naive approach might be the best here.
Both data and index are accessed (read) sequentially, which is already optimal for the CPU cache. That leaves the random writing, but writing to memory isn't as cache friendly as reading from it anyway.
This would only need one sequential pass through data and index. And chances are some (sometimes many) of the writes will already be cache-friendly too.
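That single sequential pass is just (a minimal sketch; reorder_naive is a made-up name):

```c
#include <stddef.h>

/* data[] and index[] are read sequentially; only the writes into
   result[] are random. index must be a permutation of 0..n-1 and
   result must have room for n elements. */
static void reorder_naive(const int *data, const size_t *index,
                          int *result, size_t n)
{
    for (size_t i = 0; i < n; i++)
        result[index[i]] = data[i];
}
```

With the example data above, data = [1, 2, 3, 4, 5, 7] and index = [5, 1, 4, 0, 2, 3] produce [4, 2, 5, 7, 3, 1].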
Using multiple blocks for result - multiple threads
We could allocate or use cache-friendly sized blocks for the result (blocks being regions in the result array), and loop through index and data multiple times (while they stay in the cache).
In each loop we then only write the elements to result that fit in the current result block. This would be 'cache friendly' for the writes too, but needs multiple loops (the number of loops could even get rather high, i.e. size of data / size of result block).
The above might be an option when using multiple threads: data and index, being read-only, would be shared by all cores at some level in the cache (depending on the cache architecture). The result blocks in each thread would be totally independent (one core never has to wait for the result of another core, or for a write in the same region). For example: 10 million elements - each thread could be working on an independent result block of, say, 500,000 elements (the number should be a power of 2).
Combining the values as a pair and sorting them first: this would already take much more time than the naive option (and wouldn't be that cache friendly either).
Also, if there are only a few million elements (integers), it won't make much of a difference. If we were talking about billions, or data that doesn't fit in memory, other strategies might be preferable (like, for example, memory-mapping the result set if it doesn't fit in memory).
If your problem deals with a lot more data than you show here, the fastest way - and probably the most cache-friendly - would be to do a large and wide merge-sort operation.
So you would divide the input data into reasonable chunks, and have a separate thread operate on each chunk. The result of this operation would be two arrays much like the input (one of data and one of destination indexes), except the indexes would be sorted. Then a final thread would do a merge operation on the data into the final output array.
As long as the segments are chosen wisely, this should be quite a cache-friendly algorithm. By wisely I mean that the data used by different threads maps onto different cache lines (of your chosen processor), so as to avoid cache thrashing.
If you have a lot of data and it is indeed the bottleneck, you will need to use a block-based algorithm where you read and write from the same blocks as much as possible. It will take up to 2 passes over the data to ensure the new array is entirely populated, and the block size needs to be set appropriately. The pseudocode is below.
def populate(index, data, newArray, cache)
    blockSize = 1000
    for i = 0; i < size(index); i++
        // We cached the value destined for this position earlier
        if i in cache
            newArray[i] = cache[i]
            remove(cache, i)
        newIndex = index[i]
        newValue = data[i]
        // Check if the destination index is in our current block
        // (same block means same quotient, hence integer division)
        if i / blockSize != newIndex / blockSize
            // The destination is not in the current block; cache the value
            cache[newIndex] = newValue
        else
            // The destination is in the current block; write it directly
            newArray[newIndex] = newValue

cache = {}
newArray = []
populate(index, data, newArray, cache)
populate(index, data, newArray, cache)
Analysis
The naive solution accesses the index and data array in order but the new array is accessed in random order. Since the new array is randomly accessed you essentially end up with O(N^2) where N is the number of blocks in the array.
The block based solution does not jump from block to block. It reads the index, data, and new array all in sequence to read and write to the same blocks. If an index will be in another block, it is cached and either retrieved when the block it belongs in comes up or if the block is already passed, it will be retrieved in the second pass. A second pass will not hurt at all. This is O(N).
The only caveat is in dealing with the cache. There are a lot of opportunities to get creative here but in general if a lot of the reads and writes end up being on different blocks, the cache will grow and this is not optimal. It depends on the makeup of your data, how often this occurs and your cache implementation.
Let's imagine that all of the information in the cache exists in one block and it fits in memory. And let's say the cache has y elements. The naive approach would have made at least y random accesses. The block-based approach will get those in the second pass.
I notice your index completely covers the domain but is in random order.
If you were to sort the index but also apply the same operations to the index array to the data array, the data array would become the result you are after.
There are plenty of sort algorithms to select from, and all would satisfy your cache-friendly criteria. But their complexity varies. I'd consider either quicksort or mergesort.
If you're interested in this answer I can elaborate with pseudo code.
I am concerned this may not be a winning pattern.
We had a piece of code which performed well, and we optimized it by removing a copy.
The result was that it performed poorly (due to cache issues). I can't see how you can produce a single-pass algorithm which solves the issue. Using OpenMP may allow the stalls this will cause to be shared amongst multiple threads.
I assume that the reordering happens only once, in the same way. If it happens multiple times, then creating some better strategy beforehand (by an appropriate sorting algorithm) will improve performance.
I wrote the following program to test whether a simple split of the target into N blocks helps, and my findings were:
a) even in the worst cases, the single-thread performance of segmented writes does not exceed the naive strategy, and is usually worse by at least a factor of 2
b) however, the performance approaches parity for some subdivisions (probably depending on the processor) and array sizes, indicating that it actually would improve multi-core performance
The consequence of this is: yes, it's more "cache-friendly" than not subdividing, but for a single thread (and only one reordering) this won't help you a bit.
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    int N = 1 << 26;
    double *source = malloc(N * sizeof(double));
    double *target = malloc(N * sizeof(double));
    int *idx = malloc(N * sizeof(int));
    int i;

    for (i = 0; i < N; i++) {
        source[i] = i;
        target[i] = 0;
        idx[i] = rand() % N;
    }

    struct timeval now, then;

    /* naive strategy: one pass, random writes */
    gettimeofday(&now, NULL);
    for (i = 0; i < N; i++) {
        target[idx[i]] = source[i];
    }
    gettimeofday(&then, NULL);
    printf("%f\n", (0.0 + then.tv_sec * 1e6 + then.tv_usec
                        - now.tv_sec * 1e6 - now.tv_usec) / N);

    /* segmented strategy: one pass over the input per target block */
    gettimeofday(&now, NULL);
    int j;
    int M = 24;
    int targetblocksize = 1 << M;
    int targetblocks = N / targetblocksize;
    for (j = 0; j < targetblocks; j++) {
        for (i = 0; i < N; i++) {
            int k = idx[i];
            if ((k >> M) == j) {
                target[k] = source[i];
            }
        }
    }
    gettimeofday(&then, NULL);
    printf("%d,%f\n", targetblocks, (0.0 + then.tv_sec * 1e6 + then.tv_usec
                                         - now.tv_sec * 1e6 - now.tv_usec) / N);
    return 0;
}
Given following data, what is the best way to organize an array of elements so that the fastest random access will be possible?
Each element has some int number, a name of 3 characters with '\0' at the end, and a floating point value.
I see two possible methods to organize and access such array:
First:
typedef struct { int num; char name[4]; float val; } t_Element;
t_Element array[900000000];

// random access:
num = array[i].num;
name = array[i].name;
val = array[i].val;

// sequential access:
some_cycle:
    num = array[i].num;
    i++;
Second:
#define NUMS 0
#define NAMES 1
#define VALS 2
#define SIZE (VALS+1)

int array[SIZE][900000000];

// random access:
num = array[NUMS][i];
name = (char *) array[NAMES][i];
val = (float) array[VALS][i];

// sequential access:
p_array_nums = &array[NUMS][i];
some_cycle:
    num = *p_array_nums;
    p_array_nums++;
My question is, which method is faster, and why? My first thought was that the second method makes for the fastest code and allows the fastest block copy, but I doubt whether it saves any significant number of CPU instructions compared to the first method.
It depends on the common access patterns. If you plan to iterate over the data, accessing every element as you go, the struct approach is better. If you plan to iterate independently over each component, then parallel arrays are better.
This is not a subtle distinction, either. With main memory typically being around two orders of magnitude slower than L1 cache, using the data structure that is appropriate for the usage pattern can possibly triple performance.
I must say, though, that your approach to implementing parallel arrays leaves much to be desired. You should simply declare three arrays instead of getting "clever" with two-dimensional arrays and casting:
int nums[900000000];
char names[900000000][4];
float vals[900000000];
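To see why the parallel-array layout helps component-wise iteration, consider a pass over just one component (an illustrative sketch; sum_vals is a made-up name). Every cache line loaded is filled entirely with val data, whereas with the struct layout the unused num and name fields would be dragged through the cache as well:

```c
#include <stddef.h>

/* With parallel arrays (structure of arrays), a pass over one
   component touches only that component's memory, so each loaded
   cache line is used in full. */
static float sum_vals(const float *vals, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += vals[i];
    return total;
}
```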
Impossible to say. As with any performance-related question, the answer may vary with your OS, your CPU, your memory, your compiler, etc.
So you need to test for yourself. Set your performance targets, measure, optimise, repeat.
The first one is probably faster, since memory access latency will be the dominant factor in performance. Ideally you should access memory sequentially and contiguously, to make best use of loaded cache lines and reduce cache misses.
Of course the access pattern is critical in any such discussion, which is why sometimes it's better to use SoA (structure of arrays) and other times AoS (array of structures), at least when performance is critical.
Most of the time of course you shouldn't worry about such things (premature optimisation, and all that).
Imagine you have some memory containing a bunch of bytes:
++++ ++-- ---+ +++-
-++- ++++ ++++ ----
---- ++++ +
Let us say + means allocated and - means free.
I'm searching for the formula of how to calculate the percentage of fragmentation.
Background
I'm implementing a tiny dynamic memory management for an embedded device with static memory. My goal is to have something I can use for storing small amounts of data. Mostly incoming packets over a wireless connection, at about 128 Bytes each.
As R. says, it depends exactly what you mean by "percentage of fragmentation" - but one simple formula you could use would be:
(free - freemax)
---------------- x 100% (or 100% for free=0)
free
where
free = total number of bytes free
freemax = size of largest free block
That way, if all memory is in one big block, the fragmentation is 0%, and if memory is all carved up into hundreds of tiny blocks, it will be close to 100%.
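That formula is straightforward to compute from a list of free block sizes (a sketch; fragmentation_pct is a made-up name):

```c
#include <stddef.h>

/* Fragmentation as defined above: 0% when all free memory is one
   contiguous block, approaching 100% when it is carved into many
   tiny blocks. Takes the sizes of the free blocks. */
static double fragmentation_pct(const size_t *free_blocks, size_t nblocks)
{
    size_t total = 0, largest = 0;
    for (size_t i = 0; i < nblocks; i++) {
        total += free_blocks[i];
        if (free_blocks[i] > largest)
            largest = free_blocks[i];
    }
    if (total == 0)
        return 100.0; /* the free = 0 case, per the formula above */
    return 100.0 * (double)(total - largest) / (double)total;
}
```

For example, a single 100-byte free block gives 0%, while two 50-byte blocks give 50%.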
Calculate how many 128-byte packets you could fit in the current memory layout.
Call that number n.
Calculate how many 128-byte packets you could fit in a memory layout with the same number of bytes allocated as the current one, but with no holes (that is, move all the + to the left, for example).
Call that number N.
Your "fragmentation ratio" would be alpha = n/N.
If your allocations are all roughly the same size, just split your memory up into TOTAL/MAXSIZE pieces each consisting of MAXSIZE bytes. Then fragmentation is irrelevant.
To answer your question in general, there is no magic number for "fragmentation". You have to evaluate the merits of different functions in reflecting how fragmented memory is. Here is one I would recommend, as a function of a size n:
fragmentation(n) = -log(n * number_of_free_slots_of_size_n / total_bytes_free)
Note that the log is just there to map things to a "0 to infinity" scale; you should not actually evaluate that in practice. Instead you might simply evaluate:
freespace_quality(n) = n * number_of_free_slots_of_size_n / total_bytes_free
with 1.0 being ideal (able to allocate the maximum possible number of objects of size n) and 0.0 being very bad (unable to allocate any).
If you had [++++++-----++++--++-++++++++--------+++++] and you wanted to measure the fragmentation of the free space (or any other allocation)
You could measure the average contiguous free block size:
total free blocks / count of contiguous free runs
In this case it would be
(5 + 2 + 1 + 8) / 4 = 4
Based on R..'s answer, I came up with the following way of computing fragmentation as a single percentage number:

fragmentation = 1 - (sum for i = 1 to n of FreeSlots(i) / IdealFreeSlots(i)) / n

Where:
n is the total amount of free memory
FreeSlots(i) means how many i-sized slots you can fit in the available free memory space
IdealFreeSlots(i) means how many i-sized slots would fit in a perfectly unfragmented memory of size n. This is a simple calculation: IdealFreeSlots(i) = floor(n / i).
How I came up with this formula:
I was thinking about how I could combine all the freespace_quality(i) values to get a single fragmentation percentage, but I wasn't very happy with the result of that function. Even in an ideal scenario, you could have freespace_quality(i) != 1 if the free space size n is not divisible by i. For example, if n = 10 and i = 3, freespace_quality(3) = 9/10 = 0.9.
So, I created a derived function freespace_relative_quality(i), which looks like this:

freespace_relative_quality(i) = FreeSlots(i) / IdealFreeSlots(i)

This always has the output 1 in the ideal "perfectly unfragmented" scenario.
All that's left to do now to get to the final fragmentation formula is to calculate the average freespace quality over all values of i (from 1 to n), and then invert the range by taking 1 minus the average quality, so that 0 means completely unfragmented (maximum quality) and 1 means most fragmented (minimum quality).
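Taken together, the metric might be implemented directly like this (an unoptimized sketch over a list of free region sizes; the function names are mine, and n, the total free space, is taken as the sum of the regions):

```c
#include <stddef.h>

/* FreeSlots(i): how many i-sized slots fit in the free regions. */
static size_t free_slots(const size_t *regions, size_t nregions, size_t i)
{
    size_t slots = 0;
    for (size_t r = 0; r < nregions; r++)
        slots += regions[r] / i;
    return slots;
}

/* Fragmentation in [0, 1]: 0 = completely unfragmented, 1 = worst.
   Averages FreeSlots(i) / IdealFreeSlots(i) over i = 1..n and
   inverts the range. O(n * nregions); fine for small n. */
static double fragmentation(const size_t *regions, size_t nregions)
{
    size_t n = 0;
    for (size_t r = 0; r < nregions; r++)
        n += regions[r];
    if (n == 0)
        return 0.0;

    double quality_sum = 0.0;
    for (size_t i = 1; i <= n; i++) {
        size_t ideal = n / i;   /* IdealFreeSlots(i), >= 1 for i <= n */
        quality_sum += (double)free_slots(regions, nregions, i)
                     / (double)ideal;
    }
    return 1.0 - quality_sum / (double)n;
}
```

A single free region always yields 0, and splitting the same amount of free space into smaller pieces pushes the value toward 1.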