Where l_1 = 1, l_2 = 4, l_3 = 5 are blocks of different lengths, and I need to make one big block of length l = 8 using the formula.
Can someone explain the following formula to me?
The formula defines an array L of size l + 1. Transcribed from the LaTeX:
L[0] = 0
L[i] = infinity, if no l_j fits in i
L[i] = min over all j with l_j <= i of ( L[i - l_j] + 1 )
Sorry about the formatting, but I can't upload images.
The question seems to be about finding the minimum number of blocks needed to make a bigger block. Also, there seems to be no restriction on the number of individual blocks available.
Assume you have blocks of n different lengths l1, l2, ..., ln. What is the minimum number of blocks you can use to make one big block of length k?
The idea behind the recursive formula is that you can make a block of length i by adding one block of length l1 to a block of length i-l1 that you have already made using the minimum number of blocks (that is what your L array holds: for any index j, it holds the minimum number of blocks needed to make a block of size j). Say the block of size i-l1 was built using 4 blocks. Using those 4 blocks and 1 more block of size l1, you have created a block of size i using 5 blocks.
But now, say a block of size i-l2 was made only using 3 blocks. Then you could easily add another block of size l2 to this block of size i-l2 and make a block of size i using only 4 blocks!
That is the idea behind iterating over all possible block lengths and choosing the minimum of them all (the minimum in the third line of your LaTeX formula).
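In code, a minimal C sketch of that recurrence (my own illustration; len[] holds the n block lengths and INF marks lengths that cannot be built):

#include <limits.h>

#define INF INT_MAX

/* L[i] = minimum number of blocks needed to build a block of length i. */
int min_blocks(const int *len, int n, int k)
{
    int L[k + 1];
    L[0] = 0;                                   /* length 0 needs 0 blocks */
    for (int i = 1; i <= k; i++) {
        L[i] = INF;
        for (int j = 0; j < n; j++)             /* try every block length */
            if (len[j] <= i && L[i - len[j]] != INF && L[i - len[j]] + 1 < L[i])
                L[i] = L[i - len[j]] + 1;
    }
    return L[k];                                /* INF means k is unreachable */
}

For the example in the question (lengths 1, 4, 5 and k = 8) this returns 2, corresponding to two blocks of length 4.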
Hope that helps.
I would like to use the "MPI_Reduce" function with a variable number of elements for each process.
For example, we have 4 processes, each with a dynamically allocated buffer:
P(0) buffer size = 21
P(1) buffer size = 24
P(2) buffer size = 21
P(3) buffer size = 12
I would like to reduce the values of these elements on the process with rank 0.
My idea is to allocate a receive buffer with a size equal to the maximum number of elements to be received from any process (in this case 24) and use that to collect the values from the various processes.
Is there a way to do this without increasing the execution time too much?
I am using Open MPI 2.1.1 in C. Thanks.
There is no reduction variant in MPI that works with different numbers of elements per rank; it wouldn't know what to fill in for the missing operands of the reduction operation. It's pretty straightforward to write yourself though, just as you suggested (a sketch follows these steps):
Determine the maximum buffer size
Allocate max-sized buffer on each rank, copy in local buffer, pad with whatever the neutral element of your reduction operation is
Run reduction on the now equal-sized buffers
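For example, a minimal sketch, assuming the buffers hold doubles and the reduction is a sum (the function and variable names are only illustrative):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Reduce buffers of varying length onto rank 0 by padding them
 * to a common size with the neutral element of the operation. */
void reduce_varying(const double *local_buf, int local_count, MPI_Comm comm)
{
    int rank, max_count;
    MPI_Comm_rank(comm, &rank);

    /* 1. Determine the maximum buffer size across all ranks. */
    MPI_Allreduce(&local_count, &max_count, 1, MPI_INT, MPI_MAX, comm);

    /* 2. Copy the local data into a max-sized buffer; calloc pads the
     *    tail with 0, the neutral element of MPI_SUM. */
    double *padded = calloc(max_count, sizeof(double));
    memcpy(padded, local_buf, local_count * sizeof(double));

    /* 3. Run the reduction on the now equal-sized buffers. */
    double *result = (rank == 0) ? malloc(max_count * sizeof(double)) : NULL;
    MPI_Reduce(padded, result, max_count, MPI_DOUBLE, MPI_SUM, 0, comm);

    free(padded);
    if (rank == 0) {
        /* ... use result[0 .. max_count-1] ... */
        free(result);
    }
}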
Hey so I'm looking at a matrix shift code, and need to make it cache friendly (fewest cache misses possible). The code looks like this:
int i, j, temp;
for (i = 1; i < M; i++) {
    for (j = 0; j < N; j++) {
        temp = A[i][j];
        A[i][j] = A[i-1][j];
        A[i-1][j] = temp;
    }
}
Assume M and N are parameters of the function, where M is the number of rows and N is the number of columns. Now, to make this more cache friendly, the book gives two cache configurations to optimize for: when the matrix is 4x4, s=1, E=2, b=3, and when the matrix is 128x128, s=5, E=2, b=3.
(s = # of set index bits, so S = s^2 is the number of sets; E = number of lines per set; b = # of block bits, so B = b^2 is the block size.)
So using the blocking method, I should access the matrix by block size, to avoid misses that force the cache to fetch the data from the next level up. So here is what I assume:
Block size is 9 bytes in each case
With the 4x4 matrix, the number of elements that fit evenly on a block is:
blocksize*(number of columns/blocksize) = 9*(4/9) = 4
So if each row will fit on one block, why is it not cache friendly?
With the 128x128 matrix, with the same logic as above, each block will hold (9*(128/9)) = 128.
So obviously after calculating that, this equation is wrong. I'm looking at the code from this page http://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf
Once I reached this point, I knew I was lost, which is where you guys come in! Is it as simple as saying each block holds 9 bytes, and 8 bytes (two integers) are what fits evenly into it? Sorry this stuff really confuses me, I know I'm all over the place. Just to be clear, these are my concerns:
How do you know how many elements will fit in a block?
Do the number of lines or sets affect this number? If so, how?
Any in depth explanation of the code posted on the linked page.
Really just trying to get a grasp of this.
UPDATE:
Okay so here is where I'm at for the 4x4 matrix.
I can read 8 bytes at a time, which is 2 integers. The original function will have cache misses because C arrays are stored in row-major order, so every time it wants A[i-1][j] it will miss and load the block that holds A[i-1][j], which would be either A[i-1][0] and A[i-1][1] or A[i-1][2] and A[i-1][3].
So, would the best way to go about this be to create another temp variable, and do A[i][0] = temp, A[i][1] = temp2, then load A[i-1][0] A[i-1][1] and set them to temp, and temp2 and just set the loop to j<2? For this question, it is specifically for the matrices described; I understand this wouldn't work on all sizes.
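In code, the idea I'm describing would look roughly like this (just my sketch, assuming 4-byte ints so that two ints share one 8-byte block):

/* Handle two ints (one 8-byte cache block) per inner iteration. */
for (i = 1; i < M; i++) {
    for (j = 0; j < N; j += 2) {
        int t0 = A[i][j];
        int t1 = A[i][j+1];
        A[i][j]   = A[i-1][j];
        A[i][j+1] = A[i-1][j+1];
        A[i-1][j]   = t0;
        A[i-1][j+1] = t1;
    }
}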
The solution to this problem was to think of the matrix in column-major order rather than row-major order.
Hopefully this helps someone in the future. Thanks to @Michael Dorgan for getting me thinking.
End results for 128x128 matrix:
Original: 16218 misses
Optimized: 8196 misses
for I := 1 to 1024 do
  for J := 1 to 1024 do
    A[J,I] := A[J,I] * B[I,J]
For the given code, I want to count how many pages are transferred between disk and main memory given the following assumptions:
page size = 512 words
no more than 256 pages can be in main memory
LRU replacement strategy
all 2d arrays size (1:1024,1:1024)
each array element occupies 1 word
2d arrays are mapped in main memory in row-major order
I was given the solution, and my questions stems from that:
A[J,I] := A[J,I] * B[I,J]
writeA := readA * readB
Notice that there are 2 transfers changing every J loop and 1 transfer
that only changes every I loop.
1024 * (8 + 1024 * (1 + 1)) = 2105344 transfers
So the entire row of B is read every time we use it, therefore we
count the entire row as transferred (8 pages). But since we only read
a portion of each A row (1 value) when we transfer it, we only grab 1
page each time.
So what I'm trying to figure out is, how do we get that 8 pages are transferred every time we read B but only 1 transfer for each read and write of A?
I'm not surprised you're confused, because I certainly am.
Part of the confusion comes from labelling the arrays 1:1024. I couldn't think like that, I relabelled them 0:1023.
I take "row-major order" to mean that A[0,0] is in the same disk block as A[0,511]. The next block is A[0,512] to A[0,1023]. Then A[1,0] to A[1,511]... And the same arrangement for B.
As the inner loop first executes, the system will fetch the block containing A[0,0], then B[0,0]. As J increments, each element of A referenced will come from a separate disk block. A[1,0] is in a different block from A[0,0]. But only every 512th B element referenced will come from a different block; B[0,0] is in the same block as B[0,511]. So for one complete iteration through the inner loop, 1024 calculations, there will be 1024 fetches of blocks from A, 1024 writes of dirty blocks from A, and 2 fetches of blocks from B. 2050 accesses overall. I don't understand why the answer you have says there will be 8 fetches from B. If B were not aligned on a 512-word boundary, there would be 3 fetches from B per cycle; but not 8.
This same pattern happens for each value of I in the outer loop. That makes 2050*1024 = 2099200 total blocks read and written, assuming B is 512-word aligned.
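Laid out side by side, the per-pass counts of the outer loop are:
my count: 1024 (A reads) + 1024 (A writes) + 2 (B reads) = 2050, and 2050 * 1024 = 2099200
the given solution: 1024 * (1 + 1) + 8 = 2056, and 2056 * 1024 = 2105344
The difference is entirely in how many transfers are charged to B per pass (2 versus 8).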
I'm entirely prepared for someone to point out my obvious bloomer - they usually do - but the explanation you've been given seems wrong to me.
How do they map an index directly to a value without having to iterate through the indices?
If it's quite complex where can I read more?
An array is just a contiguous chunk of memory, starting at some known address. So if the start address is p, and you want to access the i-th element, then you just need to calculate:
p + i * size
where size is the size (in bytes) of each element.
Crudely speaking, accessing an arbitrary memory address takes constant time.
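As a small illustration (a sketch with arbitrary values), this is essentially what a[i] compiles down to:

#include <stdio.h>

int main(void)
{
    int a[5] = {10, 20, 30, 40, 50};
    char *p = (char *)a;                       /* start address, in bytes */
    size_t i = 3;

    int *elem = (int *)(p + i * sizeof(int));  /* p + i * size */
    printf("%d\n", *elem);                     /* prints 40, same as a[3] */
    return 0;
}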
Essentially, computer memory can be described as a series of addressed slots. To make an array, you set aside a contiguous block of those. So, if you need fifty slots in your array, you set aside 50 slots from memory. In this example, let's say you set aside the slots from 1019 through 1068 for an array called A. Slot 0 in A is slot 1019 in memory. Slot 1 in A is slot 1020 in memory. Slot 2 in A is slot 1021 in memory, and so forth. So, in general, to get the nth slot in an array we would just do 1019+n. So all we need to do is remember what the starting slot is and add to it appropriately.
If we want to make sure that we don't write to memory beyond the end of our array, we may also want to store the length of A and check our n against it. It's also the case that not all values we wish to keep track of are the same size, so we may have an array where each item in the array takes up more than one slot. In that case, if s is the size of each item, then we need to set aside s times the number of items in the array, and when we fetch the nth item, we need to add s times n to the start rather than just n. But in practice, this is pretty easy to handle. The only restriction is that each item in the array be the same size.
Wikipedia explains this very well:
http://en.wikipedia.org/wiki/Array_data_structure
Basically, a memory base is chosen. Then the index is added to the base. Like so:
if base = 2000 and the size of each element is 5 bytes, then:
array[5] is at 2000 + 5*5.
array[i] is at 2000 + 5*i.
Two-dimensional arrays build on the same idea, like so:
base = 2000, size-of-each = 5 bytes, ncols = number of columns per row
array[i][j] is at 2000 + 5*(i*ncols + j) in row-major layout; for example, with ncols = 4, array[2][3] is at 2000 + 5*(2*4 + 3) = 2055.
And if the elements are of different sizes, more calculation is necessary:
for each index
slot-in-memory += size-of-element-at-index
So, in this case, it is almost impossible to map directly without iteration.
Imagine you have some memory containing a bunch of bytes:
++++ ++-- ---+ +++-
-++- ++++ ++++ ----
---- ++++ +
Let us say + means allocated and - means free.
I'm searching for the formula of how to calculate the percentage of fragmentation.
Background
I'm implementing a tiny dynamic memory management for an embedded device with static memory. My goal is to have something I can use for storing small amounts of data. Mostly incoming packets over a wireless connection, at about 128 Bytes each.
As R. says, it depends exactly what you mean by "percentage of fragmentation" - but one simple formula you could use would be:
(free - freemax)
---------------- x 100% (or 100% for free=0)
free
where
free = total number of bytes free
freemax = size of largest free block
That way, if all memory is in one big block, the fragmentation is 0%, and if memory is all carved up into hundreds of tiny blocks, it will be close to 100%.
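In C, that works out to something like this (a sketch; free_total and free_max would come from walking your allocator's free list):

#include <stddef.h>

double fragmentation_pct(size_t free_total, size_t free_max)
{
    if (free_total == 0)
        return 100.0;                /* convention used above for free = 0 */
    return (double)(free_total - free_max) / free_total * 100.0;
}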
Calculate how many 128-byte packets you could fit in the current memory layout.
Call that number n.
Calculate how many 128-byte packets you could fit in a memory layout with the same number of bytes allocated as the current one, but with no holes (that is, move all the + to the left, for example).
Call that number N.
Your "fragmentation ratio" would be alpha = n/N
If your allocations are all roughly the same size, just split your memory up into TOTAL/MAXSIZE pieces each consisting of MAXSIZE bytes. Then fragmentation is irrelevant.
To answer your question in general, there is no magic number for "fragmentation". You have to evaluate the merits of different functions in reflecting how fragmented memory is. Here is one I would recommend, as a function of a size n:
fragmentation(n) = -log(n * number_of_free_slots_of_size_n / total_bytes_free)
Note that the log is just there to map things to a "0 to infinity" scale; you should not actually evaluate that in practice. Instead you might simply evaluate:
freespace_quality(n) = n * number_of_free_slots_of_size_n / total_bytes_free
with 1.0 being ideal (able to allocate the maximum possible number of objects of size n) and 0.0 being very bad (unable to allocate any).
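For instance (with made-up numbers): if 10 bytes are free, split into holes of 5, 3 and 2 bytes, then number_of_free_slots_of_size_3 = 1 + 1 + 0 = 2, so freespace_quality(3) = 3 * 2 / 10 = 0.6; a single 10-byte hole would instead give floor(10/3) = 3 slots and a quality of 3 * 3 / 10 = 0.9.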
If you had [++++++-----++++--++-++++++++--------+++++] and you wanted to measure the fragmentation of the free space (or any other allocation)
You could measure the average contiguous block size
Total blocks / Count of contiguous blocks.
In this case it would be
(5 + 2 + 1 + 8) / 4 = 4
Based on R.. GitHub STOP HELPING ICE's answer, I came up with the following way of computing fragmentation as a single percentage number:
fragmentation = 1 - (1/n) * sum for i = 1 to n of FreeSlots(i) / IdealFreeSlots(i)
Where:
n is the total number of free blocks
FreeSlots(i) means how many i-sized slots you can fit in the available free memory space
IdealFreeSlots(i) means how many i-sized slots would fit in a perfectly unfragmented memory of size n. This is a simple calculation: IdealFreeSlots(i) = floor(n / i).
How I came up with this formula:
I was thinking about how I could combine all the freespace_quality(i) values to get a single fragmentation percentage, but I wasn't very happy with the result of this function. Even in an ideal scenario, you could have freespace_quality(i) != 1 if the free space size n is not divisible by i. For example, if n=10 and i=3, freespace_quality(3) = 9/10 = 0.9.
So, I created a derived function freespace_relative_quality(i) which looks like this:
freespace_relative_quality(i) = freespace_quality(i) / (i * IdealFreeSlots(i) / n)
(the denominator is simply the value freespace_quality(i) takes in the unfragmented case). This would always have the output 1 in the ideal "perfectly unfragmented" scenario.
After doing the math:
freespace_relative_quality(i) = FreeSlots(i) / IdealFreeSlots(i)
All that's left to do now to get to the final fragmentation formula is to calculate the average freespace quality for all values of i (from 1 to n), and then invert the range by doing 1 - the average quality so that 0 means completely unfragmented (maximum quality) and 1 means most fragmented (minimum quality).
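A sketch of the whole formula in C (runs[] holds the lengths of the contiguous free regions; the names are mine):

#include <stddef.h>

/* fragmentation = 1 - average over i of FreeSlots(i) / IdealFreeSlots(i) */
double fragmentation(const size_t *runs, size_t num_runs)
{
    size_t n = 0;                              /* total free space */
    for (size_t r = 0; r < num_runs; r++)
        n += runs[r];
    if (n == 0)
        return 0.0;                            /* nothing free, nothing fragmented */

    double quality_sum = 0.0;
    for (size_t i = 1; i <= n; i++) {
        size_t free_slots = 0;                 /* FreeSlots(i) */
        for (size_t r = 0; r < num_runs; r++)
            free_slots += runs[r] / i;
        size_t ideal = n / i;                  /* IdealFreeSlots(i) = floor(n / i) */
        quality_sum += (double)free_slots / (double)ideal;
    }
    return 1.0 - quality_sum / (double)n;      /* 0 = unfragmented, 1 = worst */
}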