Estimating the miss rate for a C function

I've been stuck on this question for a while; I think I'm missing something in my approach.
Assumptions:
16-way set associative L1 cache (E = 16) with a block size of 32 bytes (B = 32).
N is very large, so that a single row or column cannot fit in the cache.
sizeof(int) == 4
Variables i, k, and sum are stored in registers.
The cache is cold before each function is called.
int sum1(int A[N][N], int B[N][N])
{
    int i, k, sum = 0;
    for (i = 0; i < N; i++)
        for (k = 0; k < N; k++)
            sum += A[i][k] + B[k][i];
    return sum;
}
Find the closest miss rate for sum1. (answer is 9/16)
I tried to solve it as follows:
A[0][0],...,A[0][7] map to the first cache line in the first set.
A[0][8],...,A[0][15] map to the first cache line in the second set, and so on up to the last set in the cache; then we start filling the second cache line of each set until A is finished. The part about B was tricky, because if there is still space in the cache we can keep filling it, or we can start replacing the oldest cache block in each set.
Miss-rate-wise, A will miss every time a new block is mapped -> a miss rate of (N/32)/N = 1/32 for one cache line and 1/2 (16/32) for all of them.
Now I'm stuck trying to work out B's misses, because I don't understand precisely how that part behaves.
Thanks in advance
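For reference, here is one way to account for the stated 9/16 (a sketch, assuming row-major storage and counting one memory reference per array element): a 32-byte block holds 8 ints, so the row-wise walk over A misses once per 8 references, a rate of 1/8. The column-wise walk over B jumps 4·N bytes between references, so with N very large every reference to B misses, a rate of 1. Averaging over the 2·N² references gives (N²/8 + N²) / (2·N²) = 9/16.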

Related

How to do a proper Cache Blocked Matrix Transposition?

I am trying to do a cache-blocked matrix transposition in C, but I am having some trouble with my code. My guess is that it has to do with the indexing. Can you tell me where I am going wrong?
I am working from these two algorithms I found on the web: http://users.cecs.anu.edu.au/~Alistair.Rendell/papers/coa.pdf and http://iosrjen.org/Papers/vol3_issue11%20(part-4)/I031145055.pdf
But I haven't figured out yet how to code them correctly.
for (i = 0; i < N; i += block) {
    for (j = 0; j < i; j += block) {
        for (ii = i; ii < i + block; ii++) {
            for (jj = j; jj < j + block; jj++) {
                temp1[ii][jj] = A2[ii][jj];
                temp2[ii][jj] = A2[jj][ii];
                A2[ii][jj] = temp1[ii][jj];
                A2[ii][jj] = temp2[ii][jj];
            }
        }
    }
}
temp1 and temp2 are two matrices of size block x block filled with zeros.
I am not sure whether I need another for loop to write the values back into A2 (the matrix before and after transposition).
I also tried this:
for (i = 0; i < N; i += block) {
    for (j = 0; j < N; j += block) {
        ii = A2[i][j];
        jj = A2[j][i];
        A2[j][i] = ii;
        A2[i][j] = jj;
    }
}
I am expecting better performance than the "naive" Matrix Transposition algorithm:
for (i = 1; i < N; i++) {
    for (j = 0; j < i; j++) {
        TEMP = A[i][j];
        A[i][j] = A[j][i];
        A[j][i] = TEMP;
    }
}
The proper way to do a blocked matrix transposition is not what is in your program. The extra temp1 and temp2 arrays uselessly fill your cache, and your second version is incorrect. Moreover, you do too many operations: elements are transposed twice, and the diagonal elements are also "transposed".
But first we can do some simple (and approximate) cache behavior analysis. I assume that you have a matrix of doubles and that cache lines are 64 bytes (8 doubles).
A blocked implementation is equivalent to a naive implementation if the cache can completely contain the matrix. You then only have compulsory cache misses to fetch the matrix elements. The number of cache misses will be N×N/8 to process N×N elements, an average of 1/8 miss per element.
Now, for the naive implementation, look at the situation after you have processed one line through the cache. Assuming your cache is large enough, you will have in your cache:
* the complete line A[0][i]
* the first 8 elements of every other line of the matrix, A[i][0..7]
This means that, if your cache is large enough, you can process the next 7 lines without any cache misses other than the ones needed to fetch those lines. So if your matrix is N×N and the cache size is larger than ~2×N×8 bytes, you will have only 8×N/8 (lines) + N (columns) = 2N cache misses to process 8×N elements, an average of 1/4 miss per element. Numerically, if the L1 cache size is 32k, this happens if N < 2k. And if the L2 cache is 256k, data will remain in L2 if N < 16k. I do not think the difference between data in L1 and data in L2 will be really visible, thanks to the very efficient prefetch in modern processors.
If you have a very large matrix, by the end of the first line the beginning of the second line has already been evicted from the cache. This happens when one line of your matrix completely fills the cache. In that situation, the number of cache misses is much higher. Every line incurs N/8 misses (to fetch the line) + N misses (to fetch the first elements of the columns), an average of (9×N/8)/N ≈ 1 miss per element.
So you can gain with a blocked implementation, but only for large matrices.
Here is a correct implementation of matrix transpose. It avoids processing element A[l][m] twice (once when i=l and j=m, and again when i=m and j=l), does not transpose the diagonal elements, and uses a register for the swap.
Naive version
for (i = 0; i < N; i++)
    for (j = i + 1; j < N; j++)
    {
        temp = A[i][j];
        A[i][j] = A[j][i];
        A[j][i] = temp;
    }
And the blocked version (we assume the matrix size is a multiple of block size)
for (ii = 0; ii < N; ii += block)
    for (jj = 0; jj < N; jj += block)
        for (i = ii; i < ii + block; i++)
            for (j = jj + i + 1; j < jj + block; j++)
            {
                temp = A[i][j];
                A[i][j] = A[j][i];
                A[j][i] = temp;
            }
I am using your code, but I am not getting the same answer when I compare the naive and the blocked algorithms. With this matrix A, I get the matrix At as follows:
A
2 8 1 8
6 8 2 4
7 2 6 5
6 8 6 5
At
2 6 1 6
8 8 2 4
7 2 6 5
8 8 6 5
with a matrix of size N=4 and block= 2
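For what it's worth, the mismatch seems to come from the inner-loop bound j = jj + i + 1 in the blocked version, which skips most of each off-diagonal block. A sketch of a corrected variant (my own adjustment, still assuming N is a multiple of block):

/* Visit only the blocks on or above the diagonal; inside a diagonal block,
   start just above the diagonal so no element is swapped twice. */
for (ii = 0; ii < N; ii += block)
    for (jj = ii; jj < N; jj += block)
        for (i = ii; i < ii + block; i++)
        {
            int jstart = (ii == jj) ? i + 1 : jj;   /* diagonal block vs. off-diagonal block */
            for (j = jstart; j < jj + block; j++)
            {
                temp = A[i][j];
                A[i][j] = A[j][i];
                A[j][i] = temp;
            }
        }

With N=4 and block=2 this swaps every pair (i,j)/(j,i) with i<j exactly once, so it reproduces the naive transpose.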

What type of cache misses occur when accessing consecutive elements of array?

In C, if you have this type of loop:
for (i = 0; i < N; i++)
    sum += a[i];
where the array 'a' contains ints (4 bytes) and a cache block can store, say, 32 bytes, then I know there will be a cold miss every 8 iterations of the loop, since the processor loads 8 ints into a block and then doesn't miss again until the 9th iteration. Am I understanding correctly that when it gets a cache miss at a[0] it loads a[0]-a[7] into a cache block, and won't load any of 'a' into the cache again until it misses at a[8]?
Assuming that ^^ is correct, my real question is, what happens if you have something like this:
for (i = 0; i < N; i++)
    a[i] = a[i+1];
where 'a' has not been initialized? Would you get something similar to the above, where the processor fetches each consecutive value of a[i+1] and misses only every 8th iteration? Or does it also look up a[i] in the cache in order to set its value? Would there be cache misses associated with a[i], or just with a[i+1]?
And finally, what would happen if you had
for (i = 0; i < N; i++)
    b[i] = a[i];
Would this be analogous to the first example, where it looks for each value of a[i] and gets cache misses on every 8th iteration, or does setting the value of b[i] incur cache misses as well?
Thanks!
It depends on the generated assembly code; you should read it and see whether your program reads a[i] or a[i+1] first.
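If it helps, one way to check (assuming gcc; any compiler with an assembly-output switch works) is to compile a minimal version of the loop and read the emitted assembly:

/* shift.c -- compile with: gcc -O2 -S shift.c and inspect shift.s to see how
   many loads and stores the compiler emits per iteration. The array name and
   size here are just for illustration. */
#define N 1024
int a[N + 1];

void shift(void)
{
    for (int i = 0; i < N; i++)
        a[i] = a[i + 1];   /* typically one load (a[i+1]) and one store (a[i]) per element */
}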

Best approach to a FIFO implementation in an OpenCL kernel

Goal: implement the diagram shown below in OpenCL. The main thing needed from the OpenCL kernel is to multiply the coefficient array by the temp array and then accumulate all those values into one at the end. (That is probably the most time-intensive operation; parallelism would really help here.)
I am using a helper function for the kernel that does the multiplication and addition (I am hoping this function will be parallel as well).
Description of the picture:
One at a time, values are passed into an array (the temp array) that is the same size as the coefficient array. Every time a single value is passed in, the temp array is multiplied element-wise with the coefficient array in parallel, and the products are then summed into a single element. This continues until the input array reaches its final element.
What happens with my code?
For 60 elements from the input it takes over 8000 ms, and I have a total of 1.2 million inputs that still have to be passed in. I know for a fact that there is a far better way to do what I am attempting. Here is my code below.
Here are some things I know are wrong with the code for sure. When I try to multiply the coefficient values with the temp array, it crashes. This is because of the global_id. All I want that line to do is multiply the two arrays in parallel.
I tried to figure out why the FIFO function was taking so long, so I started commenting lines out. I first commented out everything except the first for loop of the FIFO function; that took 50 ms. When I uncommented the next loop, it jumped to 8000 ms. So the delay has to do with the transfer of data.
Is there a register shift I could use in OpenCL? Perhaps some logical shifting method for integer arrays? (I know there is a '>>' operator.)
float constant temp[58];
float constant tempArrayForShift[58];
float constant multipliedResult[58];

float fifo(float inputValue, float *coefficients, int sizeOfCoeff) {
    //take array of 58 elements (or same size as number of coefficients)
    //shift all elements to the right one
    //bring next element into index 0 from input
    //multiply the coefficient array with the array that's the same size as the coefficients and accumulate
    //store into one output value of the output array
    //repeat till input array has reached the end

    int globalId = get_global_id(0);
    float output = 0.0f;

    //Shift everything down from 1 to 57
    //takes about 50ms here
    for (int i = 1; i < 58; i++) {
        tempArrayForShift[i] = temp[i];
    }

    //Input the new value passed from the main kernel. The rest of the values were shifted over, so the element is written at index 0.
    tempArrayForShift[0] = inputValue;

    //Takes about 8000ms with this loop included
    //Write values back into temp array
    for (int i = 0; i < 58; i++) {
        temp[i] = tempArrayForShift[i];
    }

    //all 58 elements of the coefficient array and temp array are multiplied at the same time and stored in a new array
    //I am 100% sure this line is crashing the program.
    //multipliedResult[globalId] = coefficients[globalId] * temp[globalId];

    //Sum the temp array elements. The temp array consists of coefficients*fifo buffer
    for (int i = 0; i < 58; i++) {
        // output = multipliedResult[i] + output;
    }

    //Return the summed value of the temp array
    return output;
}

__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
    //Initialize the temporary array values to 0
    for (int i = 0; i < 58; i++) {
        temp[i] = 0;
        tempArrayForShift[i] = 0;
        multipliedResult[i] = 0;
    }

    //fifo adds one element in and calls the fifo function. ALL I NEED TO DO IS SEND ONE VALUE AT A TIME HERE.
    for (int i = 0; i < 60; i++) {
        Output[i] = fifo(Array[i], coefficients, 58);
    }
}
I have had this problem with OpenCL for a long time: I am not sure how to combine parallel and sequential instructions.
Another alternative I was thinking about
In the main cpp file, I could implement the FIFO buffer there and have the kernel do the multiplication and addition. But that would mean calling the kernel 1000+ times in a loop. Would that be the better solution, or would it just be completely inefficient?
To get good performance out of a GPU, you need to parallelize your work across many threads. In your code you are using just a single thread; a GPU is very slow per thread, but can be very fast if many threads are running at the same time. In this case you can use one thread for each output value. You do not actually need to shift values through an array: for every output value a window of 58 values is considered, so you can just grab these values from memory, multiply them with the coefficients and write back the result.
A simple implementation would be (launch with as many threads as output values):
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output)
{
    int globalId = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < 58; i++)
    {
        float tmp = 0;
        if (globalId + i > 56)
        {
            tmp = Array[i + globalId - 57] * coefficients[57 - i];
        }
        sum += tmp;
    }
    Output[globalId] = sum;
}
This is not perfect, as the memory access pattern it generates is not optimal for GPUs. The cache will likely help a bit, but there is clearly a lot of room for optimization, since the values are reused several times. The operation you are trying to perform is called a convolution (1D). NVidia has a 2D example called oclConvolutionSeparable in their GPU Computing SDK that shows an optimized version. You can adapt their convolutionRows kernel to a 1D convolution.
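In case it helps, a minimal host-side launch for a kernel like this just uses one work item per output sample (a sketch with hypothetical buffer and variable names; error checking omitted):

/* Hypothetical host-side setup: arrayBuf, coeffBuf and outputBuf are cl_mem
   buffers created earlier; kernel and queue already exist. */
size_t numOutputs = 1200000;   /* the 1.2M input samples mentioned in the question */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &arrayBuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &coeffBuf);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &outputBuf);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &numOutputs, NULL, 0, NULL, NULL);
clFinish(queue);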
Here's another kernel you can try out. There are a lot of synchronization points (barriers), but this should perform fairly well. The 65-item work group is not very optimal.
the steps:
1. init local values to 0
2. copy coefficients to local variable
Then, looping over the output elements, compute:
3. shift existing elements (work items > 0 only)
4. copy new element (work item 0 only)
5. compute dot product
   5a. multiplication - one per work item
   5b. reduction loop to compute sum
6. copy dot product to output (WI 0 only)
7. final barrier
the code:
__kernel void lowpass(__global float *Array, __constant float *coefficients, __global float *Output, __local float *localArray, __local float *localSums)
{
    int globalId = get_global_id(0);
    int localId = get_local_id(0);
    int localSize = get_local_size(0);

    //1 init local values to 0
    localArray[localId] = 0.0f;

    //2 copy coefficients to local
    //don't bother with this if __constant is working for you
    //requires another local to be passed in: localCoeff
    //localCoeff[localId] = coefficients[localId];

    //barrier for both steps 1 and 2
    barrier(CLK_LOCAL_MEM_FENCE);

    float tmp;
    //outputSize: total number of output samples; pass it in as a kernel argument or #define it
    for (int i = 0; i < outputSize; i++)
    {
        //3 shift elements (+barrier)
        if (localId > 0) {
            tmp = localArray[localId - 1];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        localArray[localId] = tmp;

        //4 copy new element (work item 0 only, + barrier)
        if (localId == 0) {
            localArray[0] = Array[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        //5 compute dot product
        //5a multiply + barrier
        localSums[localId] = localArray[localId] * coefficients[localId];
        barrier(CLK_LOCAL_MEM_FENCE);

        //5b reduction loop + barrier
        for (int j = 1; j < localSize; j <<= 1) {
            int mask = (j << 1) - 1;
            if ((localId & mask) == 0) {
                localSums[localId] += localSums[localId + j];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        //6 copy dot product (WI 0 only)
        if (localId == 0) {
            Output[i] = localSums[0];
        }

        //7 barrier
        //only needed if there is more code after the loop.
        //the barrier in #3 covers this in the case where the loop continues
        //barrier(CLK_LOCAL_MEM_FENCE);
    }
}
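One host-side detail worth noting (standard OpenCL behaviour, not something specific to this answer): the two __local arguments are allocated by passing a byte count and a NULL pointer from the host, for example:

/* __local kernel arguments are sized from the host: pass the size and a NULL
   pointer. 65 here matches the work-group size used in this answer. */
clSetKernelArg(kernel, 3, 65 * sizeof(cl_float), NULL);   /* localArray */
clSetKernelArg(kernel, 4, 65 * sizeof(cl_float), NULL);   /* localSums  */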
What about more work groups?
This is slightly simplified to let a single 1x65 work group compute the entire 1.2M-element Output. To allow multiple work groups, you could divide the total output size by get_num_groups(0) to calculate the amount of work each group should do (workAmount), and adjust the i for-loop:
for (i = workAmount * get_group_id(0); i < (workAmount * (get_group_id(0) + 1)) - 1; i++)
Step #1 must be changed as well to initialize to the correct starting state for localArray, rather than all 0s.
//1 init local values
if (groupId == 0) {        // groupId = get_group_id(0)
    localArray[localId] = 0.0f;
} else {
    localArray[localSize - localId] = Array[workAmount - localId];
}
These two changes should allow you to use a more optimal number of work groups; I suggest some multiple of the number of compute units on the device. Try to keep the amount of work for each group in the thousands, though. Play around with this; sometimes what seems optimal at a high level will be detrimental to the kernel when it's running.
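If you want to query the compute unit count at run time (standard OpenCL host API; device is assumed to be your cl_device_id), something like this works:

/* Query the compute unit count so the number of work groups can be chosen
   as a multiple of it, e.g. numGroups = 4 * computeUnits. */
cl_uint computeUnits = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(computeUnits), &computeUnits, NULL);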
Advantages
At almost every point in this kernel, the work items have something to do. The only time fewer than 100% of the items are working is during the reduction loop in step 5b. Read more here about why that is a good thing.
Disadvantages
The barriers will slow down the kernel just by the nature of what barriers do: they pause a work item until the others reach that point. Maybe there is a way to implement this with fewer barriers, but I still feel this is close to optimal for the problem you are trying to solve.
There isn't room for more work items per group, and 65 is not a very optimal size. Ideally, you should try to use a power of 2 or a multiple of 64. This won't be a huge issue though, because the many barriers in the kernel already make all of the work items wait fairly regularly.

Is vectorization profitable in this case?

I broke a kernel down into several loops in order to vectorize each of them afterwards. One of these loops looks like:
int *array1; //Its size is "size+1"
int *array2; //Its size is "size+1"

//All positions of array1 and array2 are set to 0 here

int *sArray1 = array1 + 1; //Shift one position so I start writing at pos 1
int *sArray2 = array2 + 1; //Shift one position so I start writing at pos 1

int bb = 0;
for (int i = 0; i < size; i++) {
    if (A[i] + bb > B[i]) {
        bb = 1;
        sArray1[i] = S;
        sArray2[i] = 1;
    }
    else
        bb = 0;
}
Please note the loop-carried dependency on bb: each comparison depends on bb's value, which is modified in the previous iteration.
What I thought about:
I can be absolutely certain in some cases. For example, when A[i] is already greater than B[i], I do not need to know what value bb carries over from the previous iteration.
When A[i] equals B[i], I do need to know what value bb carries over from the previous iteration; I also need to account for the case where this happens in two consecutive positions. As I started to work through these cases, they became overly complicated and vectorization no longer seemed to pay off.
Essentially, I'd like to know if this can be vectorized in an effective manner or if it is simply better to run this without any vectorization whatsoever.
You might not want to iterate over single elements, but instead loop over chunks (where a chunk is defined as all consecutive elements yielding the same bb).
The search for the chunk boundaries could be vectorized (by hand, probably using compiler-specific SIMD intrinsics).
And the action taken for a single chunk with bb=1 could be vectorized, too.
The loop transformation is as follows:
size_t i_chunk_start = 0, i_chunk_end;
int bb_chunk = A[0] > B[0] ? 1 : 0;
while (i_chunk_start < size) {
    if (bb_chunk) {
        /* find end of current chunk */
        for (i_chunk_end = i_chunk_start + 1; i_chunk_end < size; ++i_chunk_end) {
            if (A[i_chunk_end] < B[i_chunk_end]) {
                break;
            }
        }
        /* process current chunk */
        for (size_t i = i_chunk_start; i < i_chunk_end; ++i) {
            sArray1[i] = S;
            sArray2[i] = 1;
        }
        bb_chunk = 0;
    } else {
        /* find end of current chunk */
        for (i_chunk_end = i_chunk_start + 1; i_chunk_end < size; ++i_chunk_end) {
            if (A[i_chunk_end] > B[i_chunk_end]) {
                break;
            }
        }
        bb_chunk = 1;
    }
    /* prepare for next chunk */
    i_chunk_start = i_chunk_end;
}
Now, each of the inner loops (all for loops) could potentially get vectorized.
Whether vectorization in this manner is superior to no vectorization depends on whether the chunks have a sufficient average length. You will only find out by benchmarking.
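A minimal way to benchmark the two versions (my own sketch; run_scalar and run_chunked are hypothetical wrappers around the original loop and the chunked loop above):

#include <stdio.h>
#include <time.h>

/* Wall-clock helper for timing the two variants. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* usage:
       double t0 = now_sec(); run_scalar(A, B, sArray1, sArray2, size);  double t1 = now_sec();
       run_chunked(A, B, sArray1, sArray2, size);                        double t2 = now_sec();
       printf("scalar %.3f ms, chunked %.3f ms\n", (t1 - t0) * 1e3, (t2 - t1) * 1e3);
*/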
The effect of your loop body depends on two conditions:
A[i] > B[i]
A[i] + 1 > B[i]
Their calculation can be vectorized easily. Assuming int has 32 bits, and vectorized instructions work on 4 int values at a time, there are 8 bits per vectorized iteration (4 bits for each condition).
You can harvest those bits from an SSE register with _mm_movemask_epi8. It's a bit inconvenient that it works on bytes rather than ints, but you can take care of that with a suitable shuffle.
Afterwards, use the 8 bits as an index into a LUT (of 256 entries) that stores 4-bit masks. These masks can be used to store the elements into the destination conditionally, using _mm_maskmoveu_si128.
I am not sure such a complicated program is worthwhile: it involves a lot of bit-fiddling for just a 4x improvement in speed. Maybe it's better to build the masks by examining the decision bits individually. But vectorizing your comparisons and stores seems worthwhile in any case.
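A sketch of the comparison step only (my own illustration, not the full LUT + _mm_maskmoveu_si128 pipeline; it uses _mm_movemask_ps on a cast instead of _mm_movemask_epi8 to get one bit per int lane directly):

#include <emmintrin.h>   /* SSE2 intrinsics */

/* For 4 consecutive ints, build the bits of the two conditions
   A[i] > B[i] and A[i]+1 > B[i]; the result can index a 256-entry LUT. */
static inline int condition_bits(const int *A, const int *B, int i)
{
    __m128i va = _mm_loadu_si128((const __m128i *)&A[i]);
    __m128i vb = _mm_loadu_si128((const __m128i *)&B[i]);
    __m128i gt = _mm_cmpgt_epi32(va, vb);                                   /* A[i]   > B[i] */
    __m128i ge = _mm_cmpgt_epi32(_mm_add_epi32(va, _mm_set1_epi32(1)), vb); /* A[i]+1 > B[i] */
    int gt_bits = _mm_movemask_ps(_mm_castsi128_ps(gt));   /* one bit per lane */
    int ge_bits = _mm_movemask_ps(_mm_castsi128_ps(ge));
    return (gt_bits << 4) | ge_bits;                        /* 8 bits total */
}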

Improve C function performance with cache locality?

I have to find the diagonal difference in a matrix represented as a 2D array, and the function prototype is:
int diagonal_diff(int x[512][512])
I have to use a 2D array, and the data is 512x512. This is tested on a SPARC machine; my current timing is 6 ms, but I need to be under 2 ms.
Sample data:
[3][4][5][9]
[2][8][9][4]
[6][9][7][3]
[5][8][8][2]
The difference is:
|4-2| + |5-6| + |9-5| + |9-9| + |4-8| + |3-8| = 2 + 1 + 4 + 0 + 4 + 5 = 16
In order to do that, I use the following algorithm:
int i, j, result = 0;
for (i = 0; i < 4; i++)
    for (j = 0; j < 4; j++)
        result += abs(array[i][j] - array[j][i]);
return result;
But this algorithm keeps alternating between row and column accesses, which makes inefficient use of the cache.
Is there a way to improve my function?
EDIT: Why is a block-oriented approach faster? We take advantage of the CPU's data cache by ensuring that, whether we iterate over a block by row or by column, the entire block fits into the cache.
For example, if you have a cache line of 32 bytes and an int is 4 bytes, you can fit an 8x8 int matrix into 8 cache lines. Assuming you have a big enough data cache, you can iterate over that matrix either by row or by column and be guaranteed that you do not thrash the cache. Another way to think about it: if your matrix fits in the cache, you can traverse it any way you want.
If you have a matrix that is much bigger, say 512x512, then you need to tune your matrix traversal so that you don't thrash the cache. For example, if you traverse the matrix in the opposite order of its layout in memory, you will almost always miss the cache on every element you visit.
A block-oriented approach ensures that you only take a cache miss for data you will fully use before the CPU has to evict that cache line. In other words, a block-oriented approach tuned to the cache line size ensures you don't thrash the cache.
So, if you are trying to optimize for the cache line size of the machine you are running on, you can iterate over the matrix in block form and ensure you only visit each matrix element once:
int sum_diagonal_difference(int array[512][512], int block_size)
{
    int i, j, block_i, block_j, result = 0;

    // sum the off-diagonal blocks (upper blocks only; doubled below to match
    // the original loop, which counts each (i,j)/(j,i) pair twice)
    for (block_i = 0; block_i < 512; block_i += block_size)
        for (block_j = block_i + block_size; block_j < 512; block_j += block_size)
            for (i = 0; i < block_size; i++)
                for (j = 0; j < block_size; j++)
                    result += abs(array[block_i + i][block_j + j] - array[block_j + j][block_i + i]);
    result += result;

    // sum the diagonal blocks
    for (int block_offset = 0; block_offset < 512; block_offset += block_size)
    {
        for (i = 0; i < block_size; ++i)
        {
            for (j = i + 1; j < block_size; ++j)
            {
                int value = abs(array[block_offset + i][block_offset + j] - array[block_offset + j][block_offset + i]);
                result += value + value;
            }
        }
    }
    return result;
}
You should experiment with various values for block_size. On my machine, 8 led to the biggest speedup (2.5x) compared to a block_size of 1 (and ~5x compared to the original iteration over the entire matrix). block_size should ideally be cache_line_size_in_bytes/sizeof(int).
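If you would rather derive block_size at run time than hard-code it, one option on Linux/glibc (an assumption on my part; this sysconf name is a glibc extension and may not exist on the SPARC system in the question) is:

#include <unistd.h>

/* Pick block_size from the L1 data cache line size when the query is
   available; otherwise fall back to 8, the value that worked best above. */
static int pick_block_size(void)
{
    long line_bytes = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    return (line_bytes > 0) ? (int)(line_bytes / (long)sizeof(int)) : 8;
}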
If you have a good vector/matrix library like Intel MKL, also try the vectorized way.
It is very simple in MATLAB:
result = sum(sum(abs(x-x')));
I reproduced Hans's method and MSN's method in MATLAB too, and the results are:
Elapsed time is 0.211480 seconds. (Hans)
Elapsed time is 0.009172 seconds. (MSN)
Elapsed time is 0.002193 seconds. (Mine)
With one minor change you can have your loops operate only on the desired indices. I just changed the initialization of the j loop.
int i, j, result = 0;
for (i = 0; i < 4; ++i) {
    for (j = i + 1; j < 4; ++j) {
        result += abs(array[i][j] - array[j][i]);
    }
}
