for I := 1 to 1024 do
  for J := 1 to 1024 do
    A[J,I] := A[J,I] * B[I,J]
For the given code, I want to count how many pages are transferred between disk and main memory given the following assumptions:
page size = 512 words
no more than 256 pages can be in main memory
LRU replacement strategy
all 2d arrays size (1:1024,1:1024)
each array element occupies 1 word
2d arrays are mapped in main memory in row-major order
I was given the solution, and my question stems from that:
A[J,I] := A[J,I] * B[I,J]
writeA := readA * readB
Notice that there are 2 transfers changing every J loop and 1 transfer
that only changes every I loop.
1024 * (8 + 1024 * (1 + 1)) = 2105344 transfers
So the entire row of B is read every time we use it, therefore we
count the entire row as transferred (8 pages). But since we only read
a portion of each A row (1 value) when we transfer it, we only grab 1
page each time.
So what I'm trying to figure out is, how do we get that 8 pages are transferred every time we read B but only 1 transfer for each read and write of A?
I'm not surprised you're confused, because I certainly am.
Part of the confusion comes from labelling the arrays 1:1024. I couldn't think like that, I relabelled them 0:1023.
I take "row-major order" to mean that A[0,0] is in the same disk block as A[0,511]. The next block is A[0,512] to A[0,1023]. Then A[1,0] to A[1,511]... And the same arrangement for B.
As the inner loop first executes, the system will fetch the block containing A[0,0], then B[0,0]. As J increments, each element of A referenced will come from a separate disk block. A[1,0] is in a different block from A[0,0]. But only every 512th B element referenced will come from a different block; B[0,0] is in the same block as B[0,511]. So for one complete iteration through the inner loop, 1024 calculations, there will be 1024 fetches of blocks from A, 1024 writes of dirty blocks from A, and 2 fetches of blocks from B. 2050 accesses overall. I don't understand why the answer you have says there will be 8 fetches from B. If B were not aligned on a 512-word boundary, there would be 3 fetches from B per cycle; but not 8.
This same pattern happens for each value of I in the outer loop. That makes 2050*1024 = 2099200 total blocks read and written, assuming B is 512-word aligned.
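If you want to sanity-check that figure, here is a minimal C sketch that simulates the paging directly. The names (`touch`, `count_transfers`) are mine, and it rests on assumptions the question does not state: each array starts on a page boundary, A and B share no pages, only data pages are counted, and dirty pages still in memory at the end are written back.

```c
#include <assert.h>

#define N               1024  /* arrays are N x N words */
#define PAGE_WORDS      512   /* words per page */
#define FRAMES          256   /* pages that fit in main memory */
#define PAGES_PER_ARRAY ((N * N) / PAGE_WORDS)

static int  frame_page[FRAMES];             /* page in each frame, -1 if empty */
static long frame_used[FRAMES];             /* logical time of last access */
static int  frame_dirty[FRAMES];
static int  frame_of[2 * PAGES_PER_ARRAY];  /* page -> frame, -1 if absent */
static long now, transfers;

/* Access one page; on a miss, evict the least recently used frame. */
static void touch(int page, int is_write)
{
    assert(page >= 0 && page < 2 * PAGES_PER_ARRAY);
    int f = frame_of[page];
    if (f < 0) {
        f = 0;
        for (int k = 1; k < FRAMES; k++)    /* empty frames have time 0 */
            if (frame_used[k] < frame_used[f])
                f = k;
        if (frame_page[f] >= 0) {
            if (frame_dirty[f])
                transfers++;                /* write the dirty victim back */
            frame_of[frame_page[f]] = -1;
        }
        transfers++;                        /* fetch the new page */
        frame_page[f] = page;
        frame_dirty[f] = 0;
        frame_of[page] = f;
    }
    frame_used[f] = ++now;
    if (is_write)
        frame_dirty[f] = 1;
}

long count_transfers(void)
{
    for (int f = 0; f < FRAMES; f++) {
        frame_page[f] = -1;
        frame_used[f] = 0;
        frame_dirty[f] = 0;
    }
    for (int p = 0; p < 2 * PAGES_PER_ARRAY; p++)
        frame_of[p] = -1;
    now = transfers = 0;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int a = (j * N + i) / PAGE_WORDS;                    /* A[j][i] */
            int b = PAGES_PER_ARRAY + (i * N + j) / PAGE_WORDS;  /* B[i][j] */
            touch(a, 0);                    /* read A  */
            touch(b, 0);                    /* read B  */
            touch(a, 1);                    /* write A */
        }
    for (int f = 0; f < FRAMES; f++)        /* flush what is still dirty */
        if (frame_page[f] >= 0 && frame_dirty[f])
            transfers++;
    return transfers;
}
```

Under those assumptions `count_transfers()` should come out at 2099200: 1024*1024 fetches of A pages, the same number of write-backs, and 2048 fetches of B pages.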
I'm entirely prepared for someone to point out my obvious bloomer - they usually do - but the explanation you've been given seems wrong to me.
Where l_1 = 1, l_2 = 4, l_3 = 5 are blocks with different length and I need to make one big block with the length of l = 8 using the formula.
Can someone explain the following formula to me:
The formula is in LaTeX, with array L = [l + 1]
Sorry about the formatting, but I can't upload images.
The question seems to be about finding what is the minimum number of blocks needed to make a bigger block. Also, there seems to be no restriction on the number of individual blocks available.
Assuming you have blocks of n different lengths. l1, l2 .. ln. What is the minimum number of blocks you can use to make one big block of length k?
The idea behind the recursive formula is that you can make a block of length i by adding one block of length l1 to a hypothetical big block of length i-l1 that you have already made using the minimum number of blocks. That is what your L array holds: for any index j, it holds the minimum number of blocks needed to make a block of size j. Say the i-l1 block was built using 4 blocks. Using those 4 blocks and 1 more block of size l1, you have created a block of size i using 5 blocks.
But now, say a block of size i-l2 was made only using 3 blocks. Then you could easily add another block of size l2 to this block of size i-l2 and make a block of size i using only 4 blocks!
That is the idea behind iterating over all possible block lengths and choosing the minimum of them all (mentioned in the third line of your latex image).
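As a rough illustration of that recurrence, here is a minimal C sketch. The function name `min_blocks` and the -1 return for an unreachable length are my own choices, not something from your formula; the array `L` plays the role described above.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Minimum number of blocks (lengths lens[0..n-1], unlimited supply of each)
 * whose lengths add up to exactly k. Returns -1 if k cannot be built
 * (or if the table cannot be allocated). */
int min_blocks(const int *lens, int n, int k)
{
    assert(lens != NULL && n > 0 && k >= 0);
    int *L = malloc((size_t)(k + 1) * sizeof *L);
    if (L == NULL)
        return -1;

    L[0] = 0;                               /* length 0 needs no blocks */
    for (int i = 1; i <= k; i++) {
        L[i] = INT_MAX;                     /* "not reachable yet" */
        for (int t = 0; t < n; t++) {
            int len = lens[t];
            /* Extend the best solution for i - len by one block of len. */
            if (len > 0 && len <= i && L[i - len] != INT_MAX
                    && L[i - len] + 1 < L[i])
                L[i] = L[i - len] + 1;
        }
    }

    int result = (L[k] == INT_MAX) ? -1 : L[k];
    free(L);
    return result;
}
```

With your lengths 1, 4 and 5 and a target of 8, this returns 2 (two blocks of length 4).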
Hope that helps.
I have allocated memory using valloc, let's say array A of [15*sizeof(double)]. Now I divided it into three pieces and I want to bind each piece (of length 5) into three NUMA nodes (let's say 0,1, and 2). Currently, I am doing the following:
double* A=(double*)valloc(15*sizeof(double));
piece=5;
nodemask=1;
mbind(&A[0],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
nodemask=2;
mbind(&A[5],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
nodemask=4;
mbind(&A[10],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
My first question is: am I doing it right? I.e., are there any problems with being properly aligned to the page size, for example? Currently, with a size of 15 for array A it runs fine, but if I change the array size to something like 6156000 and piece=2052000, so that the three calls to mbind start at &A[0], &A[2052000], and &A[4104000], then I get a segmentation fault (and sometimes it just hangs). Why does it run fine for the small size but segfault for the larger one? Thanks.
For this to work, you need to deal with chunks of memory that are at least page-size and page-aligned - that means 4KB in most systems. In your case, I suspect the page gets moved twice (possibly three times), due to you calling mbind() three times over.
The way NUMA memory is laid out is that CPU socket 0 has the range 0..X-1 MB, socket 1 has X..2X-1, socket 2 has 2X..3X-1, etc. Of course, if you put a 4GB stick of RAM next to socket 0 and a 16GB stick next to socket 1, then the distribution isn't even. But the principle still stands that a large chunk of memory is allocated for each socket, according to where the memory is physically located.
As a consequence of how the memory is located, the physical location of the memory you are using will have to be placed in the linear (virtual) address space by page-mapping.
So, for large "chunks" of memory, it is fine to move it around, but for small chunks, it won't work quite right - you certainly can't "split" a page into something that is affine to two different CPU sockets.
Edit:
To split an array, you first need to find the page-aligned size.
page_size = sysconf(_SC_PAGESIZE);
objs_per_page = page_size / sizeof(A[0]);
// There should be a whole number of "objects" per page. This checks that
// no object straddles a page boundary.
ASSERT(page_size % sizeof(A[0]) == 0);
split_three = SIZE / 3;
aligned_size = (split_three / objs_per_page) * objs_per_page;
remnant = SIZE - (aligned_size * 3);
piece = aligned_size;
// Every piece starts on a page boundary (A comes from valloc, so &A[0] is
// page-aligned). The last `remnant` elements are not bound by these calls.
nodemask = 1;  // node 0
mbind(&A[0],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
nodemask = 2;  // node 1
mbind(&A[aligned_size],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
nodemask = 4;  // node 2
mbind(&A[aligned_size*2],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
Obviously, you will now need to split the three threads similarly using the aligned size and remnant as needed.
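To put numbers on this for your sizes, here is a small C sketch of the arithmetic only (no mbind calls). The helper names are made up for illustration, and the figures quoted below assume 4096-byte pages and 8-byte doubles; check `sysconf(_SC_PAGESIZE)` on your machine.

```c
#include <assert.h>
#include <stddef.h>

/* Largest element count <= n_elems / pieces that fills whole pages. */
size_t aligned_piece(size_t n_elems, size_t pieces,
                     size_t page_size, size_t elem_size)
{
    assert(pieces > 0 && elem_size > 0 && page_size % elem_size == 0);
    size_t objs_per_page = page_size / elem_size;
    return (n_elems / pieces) / objs_per_page * objs_per_page;
}

/* Byte offset of element `index` inside its page (0 means page-aligned),
 * assuming the array itself starts on a page boundary. */
size_t offset_in_page(size_t index, size_t page_size, size_t elem_size)
{
    assert(page_size > 0);
    return (index * elem_size) % page_size;
}
```

With those assumptions, `offset_in_page(2052000, 4096, 8)` is 3328, so `&A[2052000]` sits in the middle of a page, whereas `aligned_piece(6156000, 3, 4096, 8)` gives 2051584 elements per piece, which is page-aligned and leaves 1248 elements over. For the 15-element array the aligned piece size is 0: the whole array fits inside one page, so it cannot be split across nodes at all.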
Having some trouble figuring out the hit and miss rates of the following two snippets of code.
Given info: we have a 1024 Byte direct-mapped cache with block sizes of 16 bytes. So that makes 64 lines (sets in this case) then. Assume the cache starts empty. Consider the following code:
struct pos {
int x;
int y;
};
struct pos grid[16][16];
int total_x = 0; int total_y = 0;
void function1() {
int i, j;
for (i = 0; i < 16; i++) {
for (j = 0; j < 16; j++) {
total_x += grid[j][i].x;
total_y += grid[j][i].y;
}
}
}
void function2() {
int i, j;
for (i = 0; i < 16; i++) {
for (j = 0; j < 16; j++) {
total_x += grid[i][j].x;
total_y += grid[i][j].y;
}
}
}
I can tell from some basic rules (i.e. C arrays are row-major order) that function2 should be better. But I don't understand how to calculate the hit/miss percentages. Apparently function1() misses 50% of the time, while function2() only misses 25% of the time.
Could somebody walk me through how those calculations work? All I can really see is that no more than half the grid will ever fit inside the cache at once. Also, is this concept easy to extend to k-way associative caches?
Thanks.
How data are stored in memory
Every struct pos has a size of 8 bytes, so the total size of grid[16][16] is 2048 bytes. Writing pos[i][j] for the element grid[i][j], the array is laid out in this order:
pos[0][0] pos[0][1] pos[0][2] ...... pos[0][15] pos[1][0] ...... pos[1][15] ...... pos[15][0] ...... pos[15][15]
The cache organization compared to the data
For the cache, each block is 16 Bytes, which is the same size as two elements of the array. The Entire cache is 1024 Bytes, which is half the size of the entire array. Since cache is direct-mapped, that means if we label the cache block from 0 to 63, we can safely assume that the mapping should look like this
------------ memory----------------------------cache
pos[0][0] pos[0][1] -----------> block 0
pos[0][2] pos[0][3] -----------> block 1
pos[0][4] pos[0][5] -----------> block 2
.......
pos[0][14] pos[0][15] --------> block 7
pos[1][0] pos[1][1] -----------> block 8
pos[1][2] pos[1][3] -----------> block 9
.......
pos[7][14] pos[7][15] --------> block 63
pos[8][0] pos[8][1] -----------> block 0
.......
pos[15][14] pos[15][15] -----> block 63
How function1 manipulates memory
The inner loop walks down a column, so the first iteration loads pos[0][0] and pos[0][1] into cache block 0, and the second iteration loads pos[1][0] and pos[1][1] into cache block 8. The cache starts cold, so in the first column every x access is a miss and every y access is a hit. You might expect the second column to hit, since each block fetched during the first column also holds the second column's element, but this is NOT the case: the access to pos[8][0] has already evicted the block holding pos[0][0] (they both map to block 0!). The same conflict repeats for every column, so every x access misses and every y access hits, and the miss rate is 50%.
How function2 manipulates memory
The second function has nice stride-1 access pattern. That means when accessing pos[0][0].x pos[0][0].y pos[0][1].x pos[0][1].y only the first one is a miss due to the cold cache. The following patterns are all the same. So the miss rate is only 25%.
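If you would rather check these two figures than take them on trust, here is a minimal C sketch of the direct-mapped cache. It is only a model under my own assumptions: `grid` starts at address 0 (so it is block-aligned), `int` is 4 bytes, `total_x` and `total_y` live in registers, and the `.x` and `.y` reads count as two separate accesses. The name `count_misses` is made up.

```c
#include <assert.h>

#define CACHE_LINES 64   /* 1024-byte cache / 16-byte blocks */
#define BLOCK_BYTES 16
#define ELEM_BYTES  8    /* sizeof(struct pos), assuming 4-byte int */
#define DIM         16

/* Number of misses among the 512 accesses of one full traversal.
 * column_major != 0 models function1 (grid[j][i]);
 * column_major == 0 models function2 (grid[i][j]). */
int count_misses(int column_major)
{
    int line_block[CACHE_LINES];   /* memory block held by each cache line */
    int misses = 0;

    for (int l = 0; l < CACHE_LINES; l++)
        line_block[l] = -1;        /* cold cache */

    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++) {
            int row = column_major ? j : i;
            int col = column_major ? i : j;
            int base = (row * DIM + col) * ELEM_BYTES;
            for (int field = 0; field < 2; field++) {   /* .x, then .y */
                int block = (base + 4 * field) / BLOCK_BYTES;
                int line = block % CACHE_LINES;
                if (line_block[line] != block) {        /* miss: load block */
                    misses++;
                    line_block[line] = block;
                }
            }
        }
    }
    assert(misses <= 2 * DIM * DIM);
    return misses;
}
```

`count_misses(1)` gives 256 misses out of 512 accesses (50%) and `count_misses(0)` gives 128 (25%), matching the analysis above.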
A k-way associative cache can be analyzed the same way, although it may be more tedious. To get the most out of the cache, aim for a regular access pattern such as stride-1, and use as much of the data as possible each time a block is loaded from memory. Real-world CPU microarchitectures employ other intelligent designs and algorithms to improve efficiency. The best method is always to measure the time on real hardware, dump the code of the core loop, and do a thorough analysis.
Ok, my computer science lectures are a bit far off but I think I figured it out (it's actually a very easy example when you think about it).
Your struct is 8 bytes long (2 x 4). Since your cache blocks are 16 bytes, a memory access to grid[i][j] will fetch exactly two struct entries (grid[i][j] and grid[i][j+1]). Therefore, if you loop through the second index, only every 4th access will lead to a memory read. If you loop through the first index, you probably throw away the second entry that has been fetched; that depends on the number of fetches in the inner loop vs. the overall cache size, though.
Now we have to think about the cache size as well: You say that you have 64 lines that are directly mapped. In function 1, an inner loop is 16 fetches. That means, the 17th fetch you get to grid[j][i+1]. This should actually be a hit, since it should have been kept in the cache since the last inner loop walk. Every second inner loop should therefore only consist of hits.
Well, if my reasonings are correct, the answer that has been given to you should be wrong. Both functions should perform with 25% misses. Maybe someone finds a better answer but if you understand my reasoning I'd ask a TA about that.
Edit: Thinking about it again, we should first define what actually qualifies as a miss/hit. When you look at
total_x += grid[j][i].x;
total_y += grid[j][i].y;
are these defined as two memory accesses or one? A decent compiler with optimization settings should optimize this to
struct pos temp = grid[j][i];
total_x += temp.x;
total_y += temp.y;
which could be counted as one memory access. I therefore propose the universal answer to all CS questions: "It depends."
Given the code :
void transpose2(array dst,array src)
{
int i,j;
for ( i=0; i<4; i++) {
for ( j=0; j<4; j++) {
dst[i][j] = src[j][i];
}
}
}
Assumptions :
int is 4 bytes
src array starts at address 0 , dst starts at address 64
the size of the cache is 32 bytes , at the beginning the cache is empty
Assuming that I have a cache of size 32 bytes, with write-through, write-allocate and LRU, 2-way set associative, where each block is 8 bytes:
When I read from memory, how many bytes do I take from memory on each iteration?
Is it 4 or 8?
What I'm quite sure about is that the cache has 4 cells, or rows, and each row has 8 bytes. Is this correct?
What is a little confusing is the 2-way part. I think that each way has 4 bytes, right? Please correct me if I'm wrong...
Then when I "take" a block from memory, I just don't understand exactly how many bytes that is.
Thanks in advance
Ron
The cache way (aka its associativity) does not affect the amount of data that's transferred when a transfer occurs; the block size is the block size.
Associativity is simply a measure how many possible locations there are in the cache that a given block from memory could be stored. So:
For a direct-mapped cache (associativity=1), memory address xyz will always map to the same cache location.
For a two-way cache, xyz could map to either of two cache locations.
For a fully-associative cache, xyz could map to anywhere in cache.
I'm really not saying anything here which isn't already explained at e.g. Wikipedia: http://en.wikipedia.org/wiki/CPU_cache#Associativity.
When the CPU references (loads or stores) a word from a block that is not in the cache, that block is requested from memory. So, with the parameters supplied, every cache miss involves an 8-byte transfer from memory to cache.
Regarding terminology, your cache has 4 entries, also called containers or cache lines (32 bytes / 8 bytes per block). As it is 2-way associative, there are 2 sets of 2 entries each. Blocks with even addresses map to set 0, while blocks with odd addresses map to set 1.
Block addresses are obtained by shifting the byte address right by log2(block_size) bits (3 bits in your cache).
For example:
address 64 belongs to block 8
address 72 belongs to block 9
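The same mapping written out as a small C sketch, in case it helps. The helper names are mine; the address helpers simply encode the layout you gave (src at address 0, dst at address 64, 4-byte int, 4x4 arrays).

```c
#include <assert.h>

#define BLOCK_BYTES 8    /* block size of this cache */
#define NUM_SETS    2    /* 4 lines / 2 ways */

/* Block address: byte address shifted right by log2(BLOCK_BYTES) = 3 bits. */
unsigned block_of(unsigned addr)
{
    assert((BLOCK_BYTES & (BLOCK_BYTES - 1)) == 0);  /* power of two */
    return addr >> 3;
}

/* Set index: even block addresses go to set 0, odd ones to set 1. */
unsigned set_of(unsigned addr)
{
    return block_of(addr) % NUM_SETS;
}

/* Byte addresses of src[j][i] and dst[i][j] in transpose2. */
unsigned src_addr(unsigned j, unsigned i) { return 4 * (4 * j + i); }
unsigned dst_addr(unsigned i, unsigned j) { return 64 + 4 * (4 * i + j); }
```

For the first iteration (i = 0, j = 0), src[0][0] is at address 0 (block 0, set 0) and dst[0][0] is at address 64 (block 8, set 0), so both accesses compete for the two ways of set 0.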
How do they map an index directly to a value without having to iterate though the indices?
If it's quite complex where can I read more?
An array is just a contiguous chunk of memory, starting at some known address. So if the start address is p, and you want to access the i-th element, then you just need to calculate:
p + i * size
where size is the size (in bytes) of each element.
Crudely speaking, accessing an arbitrary memory address takes constant time.
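In C you can see this calculation directly. A minimal sketch (the function name `element_address` is just for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Address of element i in an array that starts at p and whose elements
 * are `size` bytes each: p + i * size, computed in bytes. */
const void *element_address(const void *p, size_t i, size_t size)
{
    assert(p != NULL && size > 0);
    return (const char *)p + i * size;
}
```

For `double a[10];`, `element_address(a, 3, sizeof a[0])` is the same address as `&a[3]`. No iteration is involved, just one multiplication and one addition.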
Essentially, computer memory can be described as a series of addressed slots. To make an array, you set aside a contiguous block of those. So, if you need fifty slots in your array, you set aside 50 slots from memory. In this example, let's say you set aside the slots from 1019 through 1068 for an array called A. Slot 0 in A is slot 1019 in memory. Slot 1 in A is slot 1020 in memory. Slot 2 in A is slot 1021 in memory, and so forth. So, in general, to get the nth slot in an array we would just do 1019+n. All we need to do is remember what the starting slot is and add to it appropriately.
If we want to make sure that we don't write to memory beyond the end of our array, we may also want to store the length of A and check our n against it. It's also the case that not all values we wish to keep track of are the same size, so we may have an array where each item takes up more than one slot. In that case, if s is the size of each item, then we need to set aside s times the number of items in the array, and when we fetch the nth item, we need to add s times n to the start rather than just n. But in practice, this is pretty easy to handle. The only restriction is that each item in the array be the same size.
Wikipedia explains this very well:
http://en.wikipedia.org/wiki/Array_data_structure
Basically, a memory base is chosen. Then the index is added to the base. Like so:
if base = 2000 and the size of each element is 5 bytes, then:
array[5] is at 2000 + 5*5.
array[i] is at 2000 + 5*i.
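Two-dimensional arrays extend this. With row-major storage each row holds ncols elements, so the row index is scaled by the row length before the column index is added:
base = 2000, size-of-each = 5 bytes, ncols = number of elements per row
array[i][j] is at 2000 + 5*(i*ncols + j)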
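A short C sketch of the row-major calculation, base + size * (i * ncols + j). The function name and the ncols value in the example are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of element [i][j] from the start of a row-major array
 * with ncols elements per row, each elem_size bytes. */
size_t row_major_offset(size_t i, size_t j, size_t ncols, size_t elem_size)
{
    assert(j < ncols && elem_size > 0);
    return (i * ncols + j) * elem_size;
}
```

With base = 2000, 5-byte elements and, say, 10 elements per row, array[2][3] is at 2000 + row_major_offset(2, 3, 10, 5) = 2115. This is also how a C compiler locates `g[2][3]` in a declaration like `int g[4][6];`.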
And if the elements are of different sizes, more calculation is necessary:
for each index before the one wanted
slot-in-memory += size-of-element-at-index
So, in this case, it is almost impossible to map directly without iteration.