Imagine you have some memory containing a bunch of bytes:
++++ ++-- ---+ +++-
-++- ++++ ++++ ----
---- ++++ +
Let us say + means allocated and - means free.
I'm looking for a formula to calculate the percentage of fragmentation.
Background
I'm implementing a tiny dynamic memory management for an embedded device with static memory. My goal is to have something I can use for storing small amounts of data. Mostly incoming packets over a wireless connection, at about 128 Bytes each.
As R. says, it depends exactly what you mean by "percentage of fragmentation" - but one simple formula you could use would be:
(free - freemax)
---------------- x 100% (or 100% for free=0)
free
where
free = total number of bytes free
freemax = size of largest free block
That way, if all memory is in one big block, the fragmentation is 0%, and if memory is all carved up into hundreds of tiny blocks, it will be close to 100%.
Calculate how many 128-byte packets you could fit in the current memory layout.
Call that number n.
Calculate how many 128-byte packets you could fit in a memory layout with the same number of bytes allocated as the current one, but with no holes (that is, move all the + to the left, for example).
Call that number N.
Your "fragmentation ratio" would be alpha = n/N
If your allocations are all roughly the same size, just split your memory up into TOTAL/MAXSIZE pieces each consisting of MAXSIZE bytes. Then fragmentation is irrelevant.
To answer your question in general, there is no magic number for "fragmentation". You have to evaluate the merits of different functions in reflecting how fragmented memory is. Here is one I would recommend, as a function of a size n:
fragmentation(n) = -log(n * number_of_free_slots_of_size_n / total_bytes_free)
Note that the log is just there to map things to a "0 to infinity" scale; you should not actually evaluate that in practice. Instead you might simply evaluate:
freespace_quality(n) = n * number_of_free_slots_of_size_n / total_bytes_free
with 1.0 being ideal (able to allocate the maximum possible number of objects of size n) and 0.0 being very bad (unable to allocate any).
If you had [++++++-----++++--++-++++++++--------+++++] and you wanted to measure the fragmentation of the free space (or any other allocation)
You could measure the average contiguous block size of the free space:
total free bytes / count of contiguous free blocks.
In this case it would be
(5 + 2 + 1 + 8) / 4 = 4
Based on R.. GitHub STOP HELPING ICE's answer, I came up with the following way of computing fragmentation as a single percentage number:

                    sum(i = 1..n) FreeSlots(i) / IdealFreeSlots(i)
fragmentation = 1 - ----------------------------------------------
                                        n

Where:
n is the total number of free bytes
FreeSlots(i) means how many i-sized slots you can fit in the available free memory space
IdealFreeSlots(i) means how many i-sized slots would fit in a perfectly unfragmented free memory of size n. This is a simple calculation: IdealFreeSlots(i) = floor(n / i).
How I came up with this formula:
I was thinking about how I could combine all the freespace_quality(i) values to get a single fragmentation percentage, but I wasn't very happy with the result of this function. Even in an ideal scenario, you could have freespace_quality(i) != 1 if the free space size n is not divisible by i. For example, if n=10 and i=3, freespace_quality(3) = 9/10 = 0.9.
So, I created a derived function freespace_relative_quality(i), which looks like this:

freespace_relative_quality(i) = freespace_quality(i) / ideal_freespace_quality(i)

where ideal_freespace_quality(i) = i * IdealFreeSlots(i) / n is the value freespace_quality(i) takes in a perfectly unfragmented memory of size n.
This would always have the output 1 in the ideal "perfectly unfragmented" scenario.
After doing the math, the division simplifies to:

freespace_relative_quality(i) = FreeSlots(i) / IdealFreeSlots(i)
All that's left to do now to get to the final fragmentation formula is to calculate the average freespace quality for all values of i (from 1 to n), and then invert the range by doing 1 - the average quality so that 0 means completely unfragmented (maximum quality) and 1 means most fragmented (minimum quality).
I have the following problem:

A tourist wants to plan an optimal route for a hike. He has a terrain elevation map at a certain scale - an NxM matrix containing the elevation values at the corresponding terrain points. The tourist wants to make a route from the start point to the end point in such a way that the total change in altitude while passing the route is minimal. The total change in altitude on the route is the sum of the absolute values of the altitude changes on each segment of the route. For example, if there is a continuous ascent or descent from the starting point of the route to the end point, then such a route is optimal.

For simplicity, let's assume that you can only walk along the lines of an imaginary grid, i.e. from position (i, j), which is not on the edge of the map, you can go to position (i-1, j), (i+1, j), (i, j-1), or (i, j+1). You cannot go beyond the edge of the map.
On standard input, non-negative integers N, M, N0, M0, N1, M1 are given. N is the number of rows in the height map, M is the number of columns. Point (N0, M0) is the starting point of the route, and point (N1, M1) is the end point. Point coordinates are numbered starting from zero. The points can be the same. It is known that the total number of matrix elements is limited to 1,100,000.

After these numbers, the height map is entered row by row - first the first row, then the second, and so on. Each height is a non-negative integer not exceeding 10000.

Print to standard output the total change in elevation while traversing the optimal route.
I came to the conclusion that this is a shortest-path problem in a graph, and wrote this
But for m=n=1000 the program eats too much memory (~169MiB, mostly heap).
Limits are as following:
Time limit: 2 s
Memory limit: 50M
Stack limit: 64M
I also wrote a C++ program doing the same thing with priority_queue (just to check; the problem must be solved in C), but it still needs ~78MiB (mostly heap).
How should I solve this problem (use another algorithm, optimize existing C code or something else)?
You can fit a height value into a 16-bit unsigned int (uint16_t, if using C++11). To store 1.1M of those 2-byte values requires 2.2M of memory. So far so good.
Your code builds an in-memory graph with linked lists and lots of pointers. Your heap-based queue also has a lot of pointers. You can greatly decrease memory usage by relying on a more-efficient representation - for example, you could build an NxM array of elements with
struct Cell {
    uint32_t distance; // shortest distance from start, initially INF
    uint16_t height;
    int16_t parent;    // add this to cell index to find parent; 0 for `none`
};
A cell at row, col will be at index row + N*col in this array. I strongly recommend building utility methods to find each neighbor, which would return -1 to indicate "out of bounds" and a direct index otherwise. The difference between two indices would be usable in parent.
You can implement a (not very efficient) priority queue in C by calling qsort from stdlib on an array of node indices, sorting them by distance. This would cut a lot of additional overhead, as each pointer in your program probably takes 8 bytes (64-bit pointers).
By avoiding a lot of pointers, and using an in-memory description instead of a graph, you should be able to cut down memory consumption to
1.1M Cells x 8 bytes/cell = 8.8M
1.1M indices in the pq-array x 4 bytes/index = 4.4M
For a total of around 16 MB with overheads - well under the stated limits. If it takes too long, you should use a better queue.
Here l_1 = 1, l_2 = 4, l_3 = 5 are blocks with different lengths, and I need to make one big block with the length of l = 8 using the formula.
Can someone explain the following formula to me:
The formula is in LaTeX, with an array L of size l + 1.
Sorry about the formatting, but I can't upload images.
The question seems to be about finding what is the minimum number of blocks needed to make a bigger block. Also, there seems to be no restriction on the number of individual blocks available.
Assume you have blocks of n different lengths l1, l2, ..., ln. What is the minimum number of blocks you can use to make one big block of length k?
The idea behind the recursive formula is that you can make a block of length i by adding one block of length l1 to a hypothetical big block of length i-l1 that you might already have made using the minimum number of blocks (because that is what your L array holds: for any index j, it holds the minimum number of blocks needed to make a block of size j). Say the i-l1 block was built using 4 blocks. Using those 4 blocks and 1 more block of size l1, you have created a block of size i using 5 blocks.
But now, say a block of size i-l2 was made only using 3 blocks. Then you could easily add another block of size l2 to this block of size i-l2 and make a block of size i using only 4 blocks!
That is the idea behind iterating over all possible block lengths and choosing the minimum of them all (mentioned in the third line of your latex image).
Hope that helps.
When making automatically expanding arrays (like C++'s std::vector) in C, it is often common (or at least common advice) to double the size of the array each time it is filled to limit the amount of calls to realloc in order to avoid copying the entire array as much as possible.
Eg. we start by allocating room for 8 elements, 8 elements are inserted, we then allocate room for 16 elements, 8 more elements are inserted, we allocate for 32.., etc.
But realloc does not have to actually copy the data if it can expand the existing memory allocation. For example, the following code only does 1 copy (the initial NULL allocation, so it is not really a copy) on my system, even though it calls realloc 10000 times:
#include <stdlib.h>
#include <stdio.h>

int main()
{
    int i;
    int copies = 0;
    void *data = NULL;
    void *ndata;

    for (i = 0; i < 10000; i++)
    {
        ndata = realloc(data, i * sizeof(int));
        if (data != ndata)
            copies++;
        data = ndata;
    }
    printf("%d\n", copies);
}
I realize that this example is very clinical - a real world application would probably have more memory fragmentation and would do more copies, but even if I make a bunch of random allocations before the realloc loop, it only does marginally worse with 2-4 copies instead.
So, is the "doubling method" really necessary? Would it not be better to just call realloc each time an element is added to the dynamic array?
You have to step back from your code for a minute and think abstractly. What is the cost of growing a dynamic container? Programmers and researchers don't think in terms of "this took 2ms", but rather in terms of asymptotic complexity: what is the cost of growing by one element given that I already have n elements, and how does this change as n increases?
If you only ever grew by a constant (or bounded) amount, then you would periodically have to move all the data, and so the cost of growing would depend on, and grow with, the size of the container. By contrast, when you grow the container geometrically, i.e. multiply its size by a fixed factor, every time it is full, then the expected cost of inserting is actually independent of the number of elements, i.e. constant.
It is of course not always constant, but it's amortized constant, meaning that if you keep inserting elements, then the average cost per element is constant. Every now and then you have to grow and move, but those events get rarer and rarer as you insert more and more elements.
I once asked whether it makes sense for C++ allocators to be able to grow, in the way that realloc does. The answers I got indicated that the non-moving growing behaviour of realloc is actually a bit of a red herring when you think asymptotically. Eventually you won't be able to grow anymore, and you'll have to move, and so for the sake of studying the asymptotic cost, it's actually irrelevant whether realloc can sometimes be a no-op or not. (Moreover, non-moving growth seems to upset modern, arena-based allocators, which expect all their allocations to be of a similar size.)
Compared to almost every other type of operation, malloc, calloc, and especially realloc are very expensive. I've personally benchmarked 10,000,000 reallocs, and it takes a HUGE amount of time to do that.
Even though I had other operations going on at the same time (in both benchmark tests), I found that I could literally cut HOURS off of the run time by using max_size *= 2 instead of max_size += 1.
Q: Is doubling the capacity of a dynamic array necessary?
A: No. One could grow only to the extent needed. But then you may truly copy data many times. It is a classic trade-off between memory and processor time. A good growth algorithm takes into account what is known about the program's data needs, and also does not over-think those needs. An exponential growth factor of 2x is a happy compromise.
But now to your claim that the following code "only does 1 copy".
The amount of copying with advanced memory allocators may not be what OP thinks. Getting the same address back does not mean that the underlying memory mapping did not perform significant work. All sorts of activity goes on under the hood.
For memory allocations that grow & shrink a lot over the life of the code, I like grow and shrink thresholds geometrically placed apart from each other.
const size_t Grow[] = {1, 4, 16, 64, 256, 1024, 4096, ... };
const size_t Shrink[] = {0, 2, 8, 32, 128, 512, 2048, ... };
By using the grow thresholds while getting larger and the shrink thresholds while contracting, one avoids thrashing near a boundary. Sometimes a factor of 1.5 is used instead.
I have allocated memory using valloc, let's say array A of [15*sizeof(double)]. Now I divided it into three pieces and I want to bind each piece (of length 5) into three NUMA nodes (let's say 0,1, and 2). Currently, I am doing the following:
double* A=(double*)valloc(15*sizeof(double));
piece=5;
nodemask=1;
mbind(&A[0],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
nodemask=2;
mbind(&A[5],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
nodemask=4;
mbind(&A[10],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
First question: am I doing it right? I.e. are there any problems with being properly aligned to page size, for example? Currently, with a size of 15 for array A, it runs fine; but if I set the array size to something like 6156000 and piece=2052000, so that the three calls to mbind start with &A[0], &A[2052000], and &A[4104000], then I get a segmentation fault (and sometimes it just hangs). Why does it run fine for a small size but segfault for a larger one? Thanks.
For this to work, you need to deal with chunks of memory that are at least page-size and page-aligned - that means 4KB in most systems. In your case, I suspect the page gets moved twice (possibly three times), due to you calling mbind() three times over.
The way NUMA memory is laid out is that CPU socket 0 has a range of 0..X-1 MB, socket 1 has X..2X-1, socket 2 has 2X..3X-1, etc. Of course, if you stick a 4GB stick of RAM next to socket 0 and a 16GB one next to socket 1, then the distribution isn't even. But the principle still stands that a large chunk of memory is allocated for each socket, in accordance with where the memory is actually located.
As a consequence of how the memory is located, the physical location of the memory you are using will have to be placed in the linear (virtual) address space by page-mapping.
So, for large "chunks" of memory, it is fine to move it around, but for small chunks, it won't work quite right - you certainly can't "split" a page into something that is affine to two different CPU sockets.
Edit:
To split an array, you first need to find the page-aligned size.
page_size = sysconf(_SC_PAGESIZE);
objs_per_page = page_size / sizeof(A[0]);

// There should be a whole number of "objects" per page. This checks that
// no object straddles a page boundary.
ASSERT(page_size % sizeof(A[0]) == 0);

split_three = SIZE / 3;
aligned_size = (split_three / objs_per_page) * objs_per_page;
remnant = SIZE - (aligned_size * 3);

piece = aligned_size;
mbind(&A[0], piece*sizeof(double), MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE);
mbind(&A[aligned_size], piece*sizeof(double), MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE);
// The third piece starts page-aligned at aligned_size*2 and absorbs the remnant.
mbind(&A[aligned_size*2], (piece + remnant)*sizeof(double), MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE);
Obviously, you will now need to split the three threads similarly using the aligned size and remnant as needed.
How do they map an index directly to a value without having to iterate though the indices?
If it's quite complex where can I read more?
An array is just a contiguous chunk of memory, starting at some known address. So if the start address is p, and you want to access the i-th element, then you just need to calculate:
p + i * size
where size is the size (in bytes) of each element.
Crudely speaking, accessing an arbitrary memory address takes constant time.
Essentially, computer memory can be described as a series of addressed slots. To make an array, you set aside a continuous block of those. So, if you need fifty slots in your array, you set aside 50 slots from memory. In this example, let's say you set aside the slots from 1019 through 1068 for an array called A. Slot 0 in A is slot 1019 in memory. Slot 1 in A is slot 1020 in memory. Slot 2 in A is slot 1021 in memory, and so forth. So, in general, to get the nth slot in an array we would just do 1019+n. So all we need to do is to remember what the starting slot is and add to it appropriately.
If we want to make sure that we don't write to memory beyond the end of our array, we may also want to store the length of A and check our n against it. It's also the case that not all values we wish to keep track of are the same size, so we may have an array where each item takes up more than one slot. In that case, if s is the size of each item, then we need to set aside s times the number of items in the array, and when we fetch the nth item, we need to add s times n to the start rather than just n. But in practice, this is pretty easy to handle. The only restriction is that each item in the array be the same size.
Wikipedia explains this very well:
http://en.wikipedia.org/wiki/Array_data_structure
Basically, a memory base is chosen. Then the index is added to the base. Like so:
if base = 2000 and the size of each element is 5 bytes, then:
array[5] is at 2000 + 5*5.
array[i] is at 2000 + 5*i.
Two-dimensional arrays extend this idea, like so:
base = 2000, size-of-each = 5 bytes, columns-per-row = 10
array[i][j] is at 2000 + 5*(i*10 + j)
And if the elements are of different sizes, more calculation is necessary:
for each index
slot-in-memory += size-of-element-at-index
So, in this case, it is almost impossible to map directly without iteration.