Given the code:
typedef int array[4][4];   /* assumption: the question's "array" type, a 4x4 int matrix */

void transpose2(array dst, array src)
{
    int i, j;
    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            dst[i][j] = src[j][i];
        }
    }
}
Assumptions:
int is 4 bytes
the src array starts at address 0, dst starts at address 64
the cache size is 32 bytes, and at the beginning the cache is empty
the cache is 2-way set associative with 8-byte blocks, using write-through, write-allocate, and LRU replacement
When I read from memory, how many bytes do I take from memory on each iteration: is it 4 or 8?
What I'm fairly sure about is that the cache has 4 cells, or rows, and each row holds 8 bytes. Is this correct?
What is a little confusing is the 2-way part. I think each way has 4 bytes, right? Please correct me if I'm wrong...
Then when I "take" a block from memory, I just don't understand exactly how many bytes I get!
Thanks in advance
Ron
The cache way (aka its associativity) does not affect the amount of data that's transferred when a transfer occurs; the block size is the block size.
Associativity is simply a measure of how many possible locations there are in the cache where a given block from memory could be stored. So:
For a direct-mapped cache (associativity=1), memory address xyz will always map to the same cache location.
For a two-way cache, xyz could map to either of two cache locations.
For a fully-associative cache, xyz could map to anywhere in cache.
I'm really not saying anything here which isn't already explained at e.g. Wikipedia: http://en.wikipedia.org/wiki/CPU_cache#Associativity.
When the CPU references (loads or stores) a word from a block that is not in the cache, that block is fetched from memory. So, with the parameters supplied, every cache miss involves an 8-byte transfer from memory to cache.
Regarding terminology: your cache has 4 entries, or cache lines (32 bytes / 8 bytes per block). As it is 2-way associative, there are 2 sets of 2 entries each. Blocks with even block addresses map to set 0, while blocks with odd block addresses map to set 1.
Block addresses are obtained by shifting the byte address right by log2(block_size) bits (3 bits for your 8-byte blocks).
For example:
address 64 belongs to block 8
address 72 belongs to block 9
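As a quick check, here is a minimal sketch of that calculation with your parameters (8-byte blocks, 2 sets); the addresses are the ones from your assumptions:

#include <stdio.h>

#define BLOCK_SIZE 8   /* bytes per block */
#define NUM_SETS   2   /* 4 lines, 2-way -> 2 sets */

int main(void) {
    unsigned addrs[] = { 0, 8, 64, 72 };  /* sample src/dst addresses */
    for (int i = 0; i < 4; i++) {
        unsigned block = addrs[i] / BLOCK_SIZE;  /* shift right by log2(8) = 3 */
        unsigned set   = block % NUM_SETS;       /* even blocks -> set 0 */
        printf("address %2u -> block %u, set %u\n", addrs[i], block, set);
    }
    return 0;
}

Note that src (address 0, block 0) and dst (address 64, block 8) both map to set 0, which matters for the transpose access pattern.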
I have allocated memory using valloc, say an array A of 15*sizeof(double) bytes. Now I divide it into three pieces, and I want to bind each piece (of length 5) to three NUMA nodes (say 0, 1, and 2). Currently, I am doing the following:
#include <numaif.h>   /* mbind(); link with -lnuma */

double *A = (double*)valloc(15*sizeof(double));
int piece = 5;
unsigned long nodemask = 1;
mbind(&A[0],  piece*sizeof(double), MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE);
nodemask = 2;
mbind(&A[5],  piece*sizeof(double), MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE);
nodemask = 4;
mbind(&A[10], piece*sizeof(double), MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE);
My first question is: am I doing this right? I.e. are there any problems with, for example, being properly aligned to the page size? Currently, with a size of 15 for array A, it runs fine, but if I set the array size to something like 6156000 and piece = 2052000, with the three mbind calls starting at &A[0], &A[2052000], and &A[4104000], then I get a segmentation fault (and sometimes it just hangs). Why does it run fine for a small size but give a segfault for a larger one? Thanks.
For this to work, you need to deal with chunks of memory that are at least page-sized and page-aligned - that means 4KB on most systems. In your case, I suspect the page gets moved twice (possibly three times), due to your calling mbind() three times over.
The way NUMA memory is located is that CPU socket 0 has a range of 0..X-1 MB, socket 1 has X..2X-1, socket 2 has 2X..3X-1, etc. Of course, if you stick a 4GB stick of RAM next to socket 0 and a 16GB stick next to socket 1, then the distribution isn't even. But the principle still stands that a large chunk of memory is allocated for each socket, according to where the memory is actually located.
As a consequence of how the memory is located, the physical memory you are using has to be placed into the linear (virtual) address space by page mapping.
So, for large "chunks" of memory, it is fine to move it around, but for small chunks, it won't work quite right - you certainly can't "split" a page into something that is affine to two different CPU sockets.
Edit:
To split an array, you first need to find the page-aligned size.
page_size = sysconf(_SC_PAGESIZE);
objs_per_page = page_size / sizeof(A[0]);
// There should be a whole number of "objects" per page. This checks that
// no object straddles a page boundary.
assert(page_size % sizeof(A[0]) == 0);
split_three = SIZE / 3;
aligned_size = (split_three / objs_per_page) * objs_per_page;
remnant = SIZE - (aligned_size * 3);
piece = aligned_size;
nodemask = 1;
mbind(&A[0], piece*sizeof(double), MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE);
nodemask = 2;
mbind(&A[aligned_size], piece*sizeof(double), MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE);
nodemask = 4;
// The last piece takes the remnant too, so its start stays page-aligned.
mbind(&A[aligned_size*2], (piece + remnant)*sizeof(double), MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE);
Obviously, you will now need to split the work between the three threads similarly, using the aligned size and remnant as needed.
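Putting this together, a minimal self-contained sketch might look like the following (a hedged version, not tested on real NUMA hardware; it assumes nodes 0, 1, and 2 exist, uses the array size from the question, and folds the remnant into the last piece as above):

#define _DEFAULT_SOURCE   /* for valloc() */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <numaif.h>       /* mbind(); link with -lnuma */

int main(void) {
    size_t n = 6156000;                      /* the larger size from the question */
    double *A = valloc(n * sizeof(double));  /* valloc returns page-aligned memory */
    if (!A) { perror("valloc"); return 1; }

    size_t objs_per_page = sysconf(_SC_PAGESIZE) / sizeof(double);
    size_t piece = ((n / 3) / objs_per_page) * objs_per_page;  /* page-aligned piece */

    for (int node = 0; node < 3; node++) {
        size_t start = (size_t)node * piece;
        /* Fold the remnant into the last piece so all of A is covered. */
        size_t len = (node == 2 ? n - start : piece) * sizeof(double);
        unsigned long nodemask = 1UL << node;
        /* Each piece starts on a page boundary; a misaligned start would
           make mbind() fail with EINVAL, so always check the result. */
        if (mbind(&A[start], len, MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE) != 0)
            fprintf(stderr, "mbind piece %d: %s\n", node, strerror(errno));
    }
    free(A);
    return 0;
}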
for I := 1 to 1024 do
  for J := 1 to 1024 do
    A[J,I] := A[J,I] * B[I,J]
For the given code, I want to count how many pages are transferred between disk and main memory given the following assumptions:
page size = 512 words
no more than 256 pages can be in main memory
LRU replacement strategy
all 2D arrays are of size (1:1024, 1:1024)
each array element occupies 1 word
2D arrays are mapped into main memory in row-major order
I was given the solution, and my questions stems from that:
A[J,I] := A[J,I] * B[I,J]
writeA := readA * readB
Notice that there are 2 transfers changing every J loop and 1 transfer
that only changes every I loop.
1024 * (8 + 1024 * (1 + 1)) = 2105344 transfers
So the entire row of B is read every time we use it, therefore we
count the entire row as transferred (8 pages). But since we only read
a portion of each A row (1 value) when we transfer it, we only grab 1
page each time.
So what I'm trying to figure out is, how do we get that 8 pages are transferred every time we read B but only 1 transfer for each read and write of A?
I'm not surprised you're confused, because I certainly am.
Part of the confusion comes from labelling the arrays 1:1024. I couldn't think like that, I relabelled them 0:1023.
I take "row-major order" to mean that A[0,0] is in the same disk block as A[0,511]. The next block is A[0,512] to A[0,1023]. Then A[1,0] to A[1,511]... And the same arrangement for B.
As the inner loop first executes, the system will fetch the block containing A[0,0], then B[0,0]. As J increments, each element of A referenced will come from a separate disk block: A[1,0] is in a different block from A[0,0]. But only every 512th B element referenced will come from a different block; B[0,0] is in the same block as B[0,511]. So for one complete pass through the inner loop, 1024 calculations, there will be 1024 fetches of blocks of A, 1024 writes of dirty blocks of A, and 2 fetches of blocks of B: 2050 transfers overall.

I don't understand why the answer you have says there will be 8 fetches from B. If B were not aligned on a 512-word boundary, there would be 3 fetches from B per pass; but not 8.
This same pattern repeats for each value of I in the outer loop. That makes 2050 * 1024 = 2,099,200 total blocks read and written, assuming B is 512-word aligned.
I'm entirely prepared for someone to point out my obvious bloomer - they usually do - but the explanation you've been given seems wrong to me.
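If you want to check the numbers empirically, here is a hedged sketch of a tiny LRU paging simulator for this loop (assumptions: 512-word pages, 256 frames, and B laid out immediately after A so both are 512-word aligned). It counts page fetches plus write-backs of dirty pages; the total comes out within a few hundred of 2050 * 1024, the difference being the dirty pages still resident when the loop ends:

#include <stdio.h>

#define PAGE_WORDS 512   /* page size = 512 words */
#define FRAMES     256   /* at most 256 pages in main memory */
#define N          1024

static long frame_page[FRAMES];   /* which page each frame holds (-1 = empty) */
static long frame_stamp[FRAMES];  /* time of last use, for LRU */
static int  frame_dirty[FRAMES];
static long tick, fetches, writebacks;

/* Touch one page; mark it dirty on a store. LRU replacement on a miss. */
static void touch(long page, int store) {
    int lru = 0;
    tick++;
    for (int i = 0; i < FRAMES; i++) {
        if (frame_page[i] == page) {          /* hit */
            frame_stamp[i] = tick;
            frame_dirty[i] |= store;
            return;
        }
        if (frame_stamp[i] < frame_stamp[lru]) lru = i;
    }
    if (frame_dirty[lru]) writebacks++;       /* write back the dirty victim */
    frame_page[lru]  = page;                  /* fetch the missing page */
    frame_stamp[lru] = tick;
    frame_dirty[lru] = store;
    fetches++;
}

int main(void) {
    for (int i = 0; i < FRAMES; i++) frame_page[i] = -1;
    long A = 0, B = (long)N * N;  /* word addresses; B directly after A */
    for (long i = 0; i < N; i++)
        for (long j = 0; j < N; j++) {
            touch((A + j*N + i) / PAGE_WORDS, 0);  /* read  A[J,I] */
            touch((B + i*N + j) / PAGE_WORDS, 0);  /* read  B[I,J] */
            touch((A + j*N + i) / PAGE_WORDS, 1);  /* write A[J,I] */
        }
    printf("fetches=%ld writebacks=%ld total=%ld\n",
           fetches, writebacks, fetches + writebacks);
    return 0;
}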
Having some trouble figuring out the hit and miss rates of the following two snippets of code.
Given info: we have a 1024-byte direct-mapped cache with a block size of 16 bytes. That makes 64 lines (which are also the sets, in this case). Assume the cache starts empty. Consider the following code:
struct pos {
    int x;
    int y;
};

struct pos grid[16][16];

int total_x = 0;
int total_y = 0;

void function1() {
    int i, j;
    for (i = 0; i < 16; i++) {
        for (j = 0; j < 16; j++) {
            total_x += grid[j][i].x;
            total_y += grid[j][i].y;
        }
    }
}

void function2() {
    int i, j;
    for (i = 0; i < 16; i++) {
        for (j = 0; j < 16; j++) {
            total_x += grid[i][j].x;
            total_y += grid[i][j].y;
        }
    }
}
I can tell from some basic rules (i.e. C arrays are row-major order) that function2 should be better. But I don't understand how to calculate the hit/miss percentages. Apparently function1() misses 50% of the time, while function2() only misses 25% of the time.
Could somebody walk me through how those calculations work? All I can really see is that no more than half the grid will ever fit inside the cache at once. Also, is this concept easy to extend to k-way associative caches?
Thanks.
How data are stored in memory
Every struct pos has a size of 8 bytes, so the total size of grid[16][16] is 2048 bytes. The elements of the array are laid out in this order:
grid[0][0] grid[0][1] grid[0][2] ...... grid[0][15] grid[1][0] ...... grid[1][15] ...... grid[15][0] ...... grid[15][15]
The cache organization compared to the data
For the cache, each block is 16 bytes, which is the same size as two elements of the array. The entire cache is 1024 bytes, half the size of the entire array. Since the cache is direct-mapped, if we label the cache blocks from 0 to 63, we can safely assume that the mapping looks like this:
memory                            cache
grid[0][0]   grid[0][1]   -----> block 0
grid[0][2]   grid[0][3]   -----> block 1
grid[0][4]   grid[0][5]   -----> block 2
.......
grid[0][14]  grid[0][15]  -----> block 7
grid[1][0]   grid[1][1]   -----> block 8
grid[1][2]   grid[1][3]   -----> block 9
.......
grid[7][14]  grid[7][15]  -----> block 63
grid[8][0]   grid[8][1]   -----> block 0
.......
grid[15][14] grid[15][15] -----> block 63
How function1 manipulates memory
The inner loop walks down a column: the first iteration loads grid[0][0] and grid[0][1] into cache block 0, the second iteration loads grid[1][0] and grid[1][1] into cache block 8. The cache is cold, so within the first column each x access misses while the corresponding y access hits. You might expect the data for the second column to have been loaded during the first column's accesses, but this is NOT the case: the access to grid[8][0] has already evicted grid[0][0]'s line (they both map to block 0!). And so on: the miss rate is 50%.
How function2 manipulates memory
The second function has a nice stride-1 access pattern. When accessing grid[0][0].x, grid[0][0].y, grid[0][1].x, grid[0][1].y, only the first access is a miss, due to the cold cache; the next three hit because the whole 16-byte block was brought in. The pattern then repeats throughout, so the miss rate is only 25%.
A k-way associative cache follows the same analysis, although it may be more tedious. To get the most out of the cache system, try to establish a nice access pattern, say stride-1, and use the data as much as possible while it is loaded. Real-world CPU microarchitectures employ other intelligent designs and algorithms to enhance efficiency. The best method is always to measure the time in the real world, dump the core code, and do a thorough analysis.
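As a quick sanity check on the mapping above, here is a hedged sketch that uses the offset from the start of grid as a stand-in for the address (i.e. it assumes grid begins at a cache-aligned address) and prints which direct-mapped line a row's first element lands on:

#include <stdio.h>

struct pos { int x; int y; };
struct pos grid[16][16];

int main(void) {
    /* Direct-mapped: line = (byte address / 16-byte block) mod 64 lines.
       We use the offset within grid, assuming grid starts cache-aligned. */
    char *base = (char *)&grid[0][0];
    for (int j = 0; j < 16; j += 8) {
        unsigned long off = (unsigned long)((char *)&grid[j][0] - base);
        printf("grid[%2d][0] -> cache line %lu\n", j, (off / 16) % 64);
    }
    return 0;
}

grid[0][0] and grid[8][0] land on the same line, which is exactly the conflict that makes function1 miss on every x access.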
Ok, my computer science lectures are a bit far off but I think I figured it out (it's actually a very easy example when you think about it).
Your struct is 8 bytes long (2 x 4). Since your cache blocks are 16 bytes, a memory access to grid[i][j] will fetch exactly two struct entries (grid[i][j] and grid[i][j+1]). Therefore, if you loop through the second index, only every 4th access will lead to a memory read. If you loop through the first index, you probably throw away the second entry that has been fetched; whether you do depends on the number of fetches in the inner loop vs. the overall cache size, though.
Now we have to think about the cache size as well: you say you have 64 lines, direct-mapped. In function1, an inner loop is 16 fetches. That means on the 17th fetch you get to grid[0][i+1]. This should actually be a hit, since it should have been kept in the cache since the last walk of the inner loop. Every second inner loop should therefore consist only of hits.
Well, if my reasoning is correct, the answer you have been given would be wrong. Both functions should perform with 25% misses. Maybe someone can find a better answer, but if you understand my reasoning, I'd ask a TA about it.
Edit: Thinking about it again, we should first define what actually qualifies as a miss/hit. When you look at
total_x += grid[j][i].x;
total_y += grid[j][i].y;
are these defined as two memory accesses or one? A decent compiler with optimization settings should optimize this to
struct pos temp = grid[j][i];
total_x += temp.x;
total_y += temp.y;
which could be counted as one memory access. I therefore propose the universal answer to all CS questions: "It depends."
I have a pointer to 100 bytes of data.
I would like to add 5 to every 2nd byte.
Example:
1 2 3 4 5 6
will become:
1 7 3 9 5 11
Now I know I can do a for loop; is there a quicker way? Something like memset that will increase the value of every 2nd byte?
thanks
A loop would be the best way. memset() is efficient for setting a contiguous block of memory, but it wouldn't be much help to you here.
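For reference, a minimal sketch of that loop (the function name and the unsigned-byte assumption are mine):

#include <stddef.h>

/* Add 5 to every 2nd byte (indices 1, 3, 5, ...), wrapping modulo 256. */
void add5_every_other(unsigned char *buf, size_t len) {
    for (size_t i = 1; i < len; i += 2)
        buf[i] = (unsigned char)(buf[i] + 5);
}

For 100 bytes this touches just 50 bytes, so there is little for a cleverer approach to win.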
In which format do you have your bytes? As concatenated chars? Or are the bytes a subpart of e.g. a uint32?
In general, a loop is the best way to do this - even if you were able to apply a pattern-like mask with memset, you would still need to create it, and that would take the same number of CPU cycles.
If you had 4 bytes per element (e.g. uint32), you could cut the CPU cycles roughly in half by adding a predefined mask word. But beware: such a solution does not check for overflow (pseudocode):
uint32_t* ptr = new uint32_t[16];  // creates 64 bytes of data
// (...) fill data
for (int k = 0; k < 16; ++k)
{
    // Hard-coded add mask for little-endian systems:
    // 0x05000500 adds 5 to bytes 1 and 3 of each 32-bit word.
    ptr[k] += 0x05000500;  // dereference and add mask to content
}
Edit: Please note that this assumes a little-endian system and is C++ pseudocode.
If memset supported increasing the value of every nth byte, how do you think it would accomplish it?
You can in fact make it faster by using loop unrolling, but you'll need to know that your array is a fixed size. Then you can skip the loop overhead by just repeatedly applying the increment:
array[ 1 ] += 5;
array[ 3 ] += 5;
array[ 5 ] += 5;
...
By doing this you avoid the overhead of the jump and test instructions that would be present in a loop, but you pay for it in code bloat.
This is a question about OS file storage management and inodes, from a review for my final exam. The lecturer didn't give the answer to the second question. Can anybody work through this and help me, or give some hints?
Thanks!
[10 points] File Storage Management and Inode
b) Consider the organization of a Unix file as represented by its inode. Assume that there are 10 direct block pointers, and a singly, doubly, and triply indirect pointer in each inode. Assume that the system block size is 4K and a disk block pointer is 4 bytes.
i. What is the maximum file size supported by the system?
ii. Assuming no information other than the file inode is in main memory, how many disk accesses are required to access the byte in position 54,423,956?
10 direct block pointers = 10 4K blocks = 40KB
singly indirect: 1 block full of pointers = 4K / 4 bytes per pointer = 1024 pointers = 1024 x 4KB = 4MB
doubly indirect: 1 block of pointers = 1024 singly indirects = 4GB
triply indirect: 1 block of pointers = 1024 doubly indirects = 4TB
total max size = 4TB + 4GB + 4MB + 40KB = 4402345713664 bytes
Position 54,423,956 is past the direct blocks (40KB) and the singly indirect range (the next 4MB), so it falls in the doubly indirect range. With the inode already in memory, we read the doubly indirect block, then one singly indirect block, then the data block itself => 3 disk block reads.
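To pin down exactly which pointers get followed, here is a small sketch of the arithmetic (parameters from the question; the slot labels are just for illustration):

#include <stdio.h>

#define BLOCK  4096UL            /* system block size: 4K */
#define PTRS   (BLOCK / 4)       /* 1024 pointers per indirect block */
#define DIRECT (10UL * BLOCK)    /* bytes reachable via direct pointers: 40KB */
#define SINGLE (PTRS * BLOCK)    /* bytes reachable via the singly indirect: 4MB */

int main(void) {
    unsigned long pos = 54423956UL;
    unsigned long off = pos - DIRECT - SINGLE;  /* offset into doubly indirect range */
    unsigned long blk = off / BLOCK;            /* data block index in that range */
    printf("doubly indirect slot %lu, singly indirect slot %lu, byte %lu in the data block\n",
           blk / PTRS, blk % PTRS, off % BLOCK);
    /* Three disk reads follow: the doubly indirect block, one singly
       indirect block, and the data block (the inode is already in memory). */
    return 0;
}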