How to do a proper cache-blocked matrix transposition in C?

I am trying to do a cache-blocked matrix transposition in C, but I am having some trouble with my code. My guess is that it has to do with the indexes. Can you tell me where I am going wrong?
I am considering both of these algorithms I found on the web: http://users.cecs.anu.edu.au/~Alistair.Rendell/papers/coa.pdf and http://iosrjen.org/Papers/vol3_issue11%20(part-4)/I031145055.pdf
But I couldn't figure out yet how to code them correctly.
for (i = 0; i < N; i += block) {
    for (j = 0; j < i; j += block) {
        for (ii = i; ii < i + block; ii++) {
            for (jj = j; jj < j + block; jj++) {
                temp1[ii][jj] = A2[ii][jj];
                temp2[ii][jj] = A2[jj][ii];
                A2[ii][jj] = temp1[ii][jj];
                A2[ii][jj] = temp2[ii][jj];
            }
        }
    }
}
temp1 and temp2 are two matrices of size block x block filled with zeros.
I am not sure if I need another loop to copy the values back to A2 (the matrix before and after the transposition).
I also tried this:
for (i = 0; i < N; i += block) {
    for (j = 0; j < N; j += block) {
        ii = A2[i][j];
        jj = A2[j][i];
        A2[j][i] = ii;
        A2[i][j] = jj;
    }
}
I am expecting better performance than the "naive" Matrix Transposition algorithm:
for (i = 1; i < N; i++) {
    for (j = 0; j < i; j++) {
        TEMP = A[i][j];
        A[i][j] = A[j][i];
        A[j][i] = TEMP;
    }
}

The proper way to do blocked matrix transposition is not what is in your program. The extra temp1 and temp2 arrays will uselessly fill your cache, and your second version is incorrect. Moreover, you do too many operations: elements are transposed twice, and diagonal elements are needlessly "transposed".
But first we can do some simple (and approximate) cache behavior analysis. I assume that you have a matrix of doubles and that cache lines are 64 bytes (8 doubles).
A blocked implementation is equivalent to a naive implementation if the cache can completely contain the matrix. You only have mandatory cache misses to fetch the matrix elements. The number of cache misses will be N×N/8 to process N×N elements, with an average number of misses of 1/8 per element.
Now, for the naive implementation, look at the situation after you have processed one line in the cache. Assuming your cache is large enough, you will then have in your cache:
* the complete line A[0][i]
* the first 8 elements of every other line of the matrix, A[i][0..7]
This means that, if your cache is large enough, you can process the 7 following lines without any cache miss other than the ones to fetch the lines themselves. So if your matrix is N×N and the cache size is larger than ~2×N×8 bytes, you will have only 8×N/8 (lines) + N (columns) = 2N cache misses to process 8×N elements, and the average number of misses per element is 1/4. Numerically, if the L1 cache size is 32k, this will happen if N < 2k. And if the L2 cache is 256k, data will remain in L2 if N < 16k. I do not think the difference between data in L1 and data in L2 will really be visible, thanks to the very efficient prefetch in modern processors.
If you have a very large matrix, the beginning of the second line has already been evicted from the cache by the end of the first line. This will happen if one line of your matrix completely fills the cache. In this situation, the number of cache misses will be much larger. Every line will incur N/8 misses (to fetch the line) + N misses (to fetch the first elements of the columns), for an average of (9×N/8)/N = 9/8 ≈ 1 miss per element.
So you can gain with a blocked implementation, but only for large matrices.
Here is a correct implementation of matrix transpose. It avoids processing element A[l][m] twice (once when i=l and j=m, and again when i=m and j=l), does not touch diagonal elements, and uses a register for the swap.
Naive version
for (i = 0; i < N; i++)
    for (j = i + 1; j < N; j++)
    {
        temp = A[i][j];
        A[i][j] = A[j][i];
        A[j][i] = temp;
    }
And the blocked version (we assume the matrix size is a multiple of block size)
for (ii = 0; ii < N; ii += block)
    for (jj = 0; jj < N; jj += block)
        for (i = ii; i < ii + block; i++)
            for (j = jj + i + 1; j < jj + block; j++)
            {
                temp = A[i][j];
                A[i][j] = A[j][i];
                A[j][i] = temp;
            }

I am using your code, but I am not getting the same answer when I compare the naive and the blocked algorithms. Starting from this matrix A, I get the matrix At as follows:
A
2 8 1 8
6 8 2 4
7 2 6 5
6 8 6 5
At
2 6 1 6
8 8 2 4
7 2 6 5
8 8 6 5
with a matrix of size N=4 and block= 2
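The mismatch reported above most likely comes from the inner loop bound j = jj+i+1 in the blocked version: it mixes the absolute row index i with the block offset jj, so most off-diagonal blocks are only partially swapped (which is exactly the pattern visible in At). A minimal corrected sketch (my adjustment, not from the original answer), assuming N is a multiple of block, is to visit only blocks on or above the diagonal and start the column index at i+1 only inside diagonal blocks:
for (ii = 0; ii < N; ii += block)
    for (jj = ii; jj < N; jj += block)              // only blocks on or above the diagonal
        for (i = ii; i < ii + block; i++)
            for (j = (ii == jj) ? i + 1 : jj; j < jj + block; j++)
            {
                temp = A[i][j];                     // swap A[i][j] with its mirror A[j][i]
                A[i][j] = A[j][i];
                A[j][i] = temp;
            }
Each off-diagonal pair is now swapped exactly once and diagonal elements are never touched, so the result matches the naive version.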

Related

Estimating the miss rate for C function

I've been stuck on this question for a while; I think I might be missing something while trying to solve it.
Assumptions:
16-way set associative L1 cache (E = 16) with a block size of 32 bytes (B = 32).
N is very large, so that a single row or column cannot fit in the cache.
sizeof(int) == 4
Variables i, k, and sum are stored in registers.
The cache is cold before each function is called.
int sum1(int A[N][N], int B[N][N])
{
    int i, k, sum = 0;
    for (i = 0; i < N; i++)
        for (k = 0; k < N; k++)
            sum += A[i][k] + B[k][i];
    return sum;
}
Find the closest miss rate for sum1. (answer is 9/16)
I tried to solve as follows:
A[0][0],...,A[0][7] map to the first cache line in the first set
A[0][8],...,A[0][15] map to the first cache line in the second set, and so on until the last set in the cache; then we start filling the second cache line of each set until A is finished. The part about calculating B was tricky, because if we still have space in the cache we can fill it, or we can start replacing the oldest cache blocks in each set.
Miss-rate-wise, A will miss once each time it maps to a new line -> a miss of N/32*N = 1/32 for one cache line and 1/2 for all (16/32).
Now I'm stuck trying to approach B's misses, as I don't understand precisely how it's being done.
Thanks in advance
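For what it's worth, one way to arrive at the quoted 9/16 (my own reasoning, not part of the original post): each iteration of the inner loop makes two memory accesses, A[i][k] and B[k][i]. A is read along its rows, and with 32-byte blocks and 4-byte ints a block holds 8 ints, so A misses once per 8 accesses (rate 1/8). B is read down a column; since N is very large, by the time B[k][i+1] is needed the block that held B[k][i] has long been evicted, so essentially every B access misses (rate 1). Averaged over the two accesses per iteration: (1/8 + 1)/2 = 9/16.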

Tiled Matrix Multiplication using AVX

I have coded the following C function for multiplying two NxN matrices using tiling/blocking and AVX vectors to speed up the calculation. Right now though I'm getting a segmentation fault when I try to combine AVX intrinsics with tiling. Any idea why that happens?
Also, is there a better memory access pattern for matrix B? Maybe transposing it first or even changing the k and j loop? Because right now, I'm traversing it column-wise which is probably not very efficient in regards to spatial locality and cache lines.
void mmult(double A[SIZE_M][SIZE_N], double B[SIZE_N][SIZE_K], double C[SIZE_M][SIZE_K])
{
    int i, j, k, i0, j0, k0;
    // double sum;
    __m256d sum;
    for (i0 = 0; i0 < SIZE_M; i0 += BLOCKSIZE) {
        for (k0 = 0; k0 < SIZE_N; k0 += BLOCKSIZE) {
            for (j0 = 0; j0 < SIZE_K; j0 += BLOCKSIZE) {
                for (i = i0; i < MIN(i0+BLOCKSIZE, SIZE_M); i++) {
                    for (j = j0; j < MIN(j0+BLOCKSIZE, SIZE_K); j++) {
                        // sum = C[i][j];
                        sum = _mm256_load_pd(&C[i][j]);
                        for (k = k0; k < MIN(k0+BLOCKSIZE, SIZE_N); k++) {
                            // sum += A[i][k] * B[k][j];
                            sum = _mm256_add_pd(sum, _mm256_mul_pd(_mm256_load_pd(&A[i][k]), _mm256_broadcast_sd(&B[k][j])));
                        }
                        // C[i][j] = sum;
                        _mm256_store_pd(&C[i][j], sum);
                    }
                }
            }
        }
    }
}
_mm256_load_pd is an alignment-required load, but you're only stepping by k++, not k += 4, in the inner-most loop that loads a 32-byte vector of 4 doubles. So it faults because 3 out of every 4 loads are misaligned.
You don't want to be doing overlapping loads; your real bug is the indexing. If your input pointers are 32-byte aligned, you should be able to keep using _mm256_load_pd instead of _mm256_loadu_pd. So using _mm256_load_pd successfully caught your bug instead of silently giving numerically wrong results.
Your strategy for vectorizing four row*column dot products (to produce a C[i][j+0..3] vector) should load 4 contiguous doubles from 4 different columns (B[k][j+0..3] via a vector load from B[k][j]), and broadcast 1 double from A[i][k]. Remember you want 4 dot products in parallel.
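As an illustration of that strategy, here is a sketch of how the three innermost loops of mmult above could look (my sketch, not the asker's code; it assumes BLOCKSIZE and SIZE_K are multiples of 4, reuses the MIN macro and loop variables from the question, and keeps the running partial sums in C between k0 blocks):
for (i = i0; i < MIN(i0 + BLOCKSIZE, SIZE_M); i++) {
    for (j = j0; j < MIN(j0 + BLOCKSIZE, SIZE_K); j += 4) {   // 4 columns per vector
        __m256d sum = _mm256_loadu_pd(&C[i][j]);              // C[i][j+0..3]: 4 partial dot products
        for (k = k0; k < MIN(k0 + BLOCKSIZE, SIZE_N); k++) {
            __m256d a = _mm256_broadcast_sd(&A[i][k]);        // one A element, broadcast to all lanes
            __m256d b = _mm256_loadu_pd(&B[k][j]);            // 4 contiguous elements of one B row
            sum = _mm256_add_pd(sum, _mm256_mul_pd(a, b));    // or _mm256_fmadd_pd(a, b, sum) with FMA
        }
        _mm256_storeu_pd(&C[i][j], sum);
    }
}
The unaligned load/store variants are used here only so the sketch also works when j0 or the row length isn't 32-byte aligned; if your arrays and BLOCKSIZE guarantee alignment, the aligned variants are fine, as noted above.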
Another strategy might involve a horizontal sum at the end down to a scalar C[i][j] += horizontal_add(__m256d), but I think that would require transposing one input first so both row and column vectors are in contiguous memory for one dot product. But then you need shuffles for a horizontal sum at the end of each inner loop.
You probably also want to use at least 2 sum variables so you can read a whole cache line at once, and hide FMA latency in the inner loop and hopefully bottleneck on throughput. Or better do 4 or 8 vectors in parallel. So you produce C[i][j+0..15] as sum0, sum1, sum2, sum3. (Or use an array of __m256d; compilers will typically fully unroll a loop of 8 and optimize the array into registers.)
I think you only need 5 nested loops, to block over rows and columns. Although apparently 6 nested loops are a valid option: see loop tiling/blocking for large dense matrix multiplication which has a 5-nested loop in the question but a 6-nested loop in an answer. (Just scalar, though, not vectorized).
There might be other bugs besides the row*column dot product strategy here, I'm not sure.
If you're using AVX, you might want to use FMA as well, unless you need to run on Sandybridge/Ivybridge or AMD Bulldozer. (Piledriver and later have FMA3.)
Other matmul strategies include adding into the destination inside the inner loop so you're loading C and A inside the inner loop, with a load from B hoisted. (Or B and A swapped, I forget.) What Every Programmer Should Know About Memory? has a vectorized cache-blocked example that works this way in an appendix, for SSE2 __m128d vectors. https://www.akkadia.org/drepper/cpumemory.pdf

How to use AVX/SIMD with nested loops and += format?

I am writing a PageRank program, and I am writing a method for updating the rankings. I have successfully gotten it working with nested for loops and also a threaded version. However, I would like to use SIMD/AVX instead.
This is the code I would like to change into a SIMD/AVX implementation.
#define IDX(a, b) ((a * npages) + b)   // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
    temp[i] = 0.0;
    for (size_t j = 0; j < npages; j++) {
        temp[i] += P[j] * matrix_cap[IDX(i,j)];
    }
}
For this code, P[] is of size npages and matrix_cap[] is of size npages * npages. P[] holds the ranks of the pages, and temp[] is used to store the next iteration's page ranks so as to be able to check convergence.
I don't know how to interpret += with AVX, or how I would get my data, which involves two arrays/vectors of size npages and one matrix of size npages * npages (in row-major order), into a format which could be used with SIMD/AVX operations.
As far as AVX goes, this is what I have so far, though it's very incorrect and was just a stab at what I would roughly like to do.
ssize_t g_mod = npages - (npages % 4);
double* res = malloc(sizeof(double) * npages);
double sum = 0.0;
for (size_t i = 0; i < npages; i++) {
    for (size_t j = 0; j < mod; j += 4) {
        __m256d p = _mm256_loadu_pd(P + j);
        __m256d m = _mm256_loadu_pd(matrix_hat + i + j);
        __m256d pm = _mm256_mul_pd(p, m);
        _mm256_storeu_pd(&res + j, pm);
        for (size_t k = 0; k < 4; k++) {
            sum += res[j + k];
        }
    }
    for (size_t i = mod; i < npages; i++) {
        for (size_t j = 0; j < npages; j++) {
            sum += P[j] * matrix_cap[IDX(i,j)];
        }
    }
    temp[i] = sum;
    sum = 0.0;
}
How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimise it, as it will be called a lot?
Consider using OpenMP 4.x #pragma omp simd reduction for the innermost loop. Keep in mind that omp reductions are not applicable to C++ arrays; therefore you have to use a temporary reduction variable as shown below.
#define IDX(a, b) ((a * npages) + b)   // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
    my_type tmp_reduction = 0.0;       // was: temp[i] = 0.0;
    #pragma omp simd reduction (+:tmp_reduction)
    for (size_t j = 0; j < npages; j++) {
        tmp_reduction += P[j] * matrix_cap[IDX(i,j)];
    }
    temp[i] = tmp_reduction;
}
For x86 platforms, OpenMP4.x is currently supported by fresh GCC (4.9+) and Intel Compilers. Some LLVM and PGI compilers may also support it.
P.S. Auto-vectorization ("auto" meaning vectorization by the compiler without any pragmas, i.e. without explicit guidance from developers) may sometimes work for some compiler variants (although it's very unlikely here, due to the array element used as the reduction variable). However, strictly speaking it is incorrect to auto-vectorize this code. You have to use an explicit SIMD pragma to "resolve" the reduction dependency and (as a good side effect) disambiguate the pointers (in case the arrays are accessed via pointers).
First, EOF is right, you should see how well gcc/clang/icc do at auto-vectorizing your scalar code. I can't check for you, because you only posted code-fragments, not anything I can throw on http://gcc.godbolt.org/.
You definitely don't need to malloc anything. Notice that your intrinsics version only ever uses 32B at a time of res[], and always overwrites whatever was there before. So you might as well use a single 32B array. Or better, use a better method to get a horizontal sum of your vector.
(see the bottom for a suggestion on a different data arrangement for the matrix)
Calculating each temp[i] uses every P[j], so there is actually something to be gained from being smarter about vectorizing. For every load from P[j], use that vector with 4 different loads from matrix_cap[] for that j, but 4 different i values. You'll accumulate 4 different vectors, and have to hsum each of them down to a temp[i] value at the end.
So your inner loop will have 5 read streams (P[] and 4 different rows of matrix_cap). It will do 4 horizontal sums, and 4 scalar stores at the end, with the final result for 4 consecutive i values. (Or maybe do two shuffles and two 16B stores). (Or maybe transpose-and-sum together, which is actually a good use-case for the shuffling power of the expensive _mm256_hadd_pd (vhaddpd) instruction, but be careful of its in-lane operation)
It's probably even better to accumulate 8 to 12 temp[i] values in parallel, so every load from P[j] is reused 8 to 12 times. (check the compiler output to make sure you aren't running out of vector regs and spilling __m256d vectors to memory, though.) This will leave more work for the cleanup loop.
FMA throughput and latency are such that you need 10 vector accumulators to keep 10 FMAs in flight to saturate the FMA unit on Haswell. Skylake reduced the latency to 4c, so you only need 8 vector accumulators to saturate it on SKL. (See the x86 tag wiki). Even if you're bottlenecked on memory, not execution-port throughput, you will want multiple accumulators, but they could all be for the same temp[i] (so you'd vertically sum them down to one vector, then hsum that).
However, accumulating results for multiple temp[i] at once has the large advantage of reusing P[j] multiple times after loading it. You also save the vertical adds at the end. Multiple read streams may actually help hide the latency of a cache miss in any one of the streams. (HW prefetchers in Intel CPUs can track one forward and one reverse stream per 4k page, IIRC). You might strike a balance, and use two or three vector accumulators for each of 4 temp[i] results in parallel, if you find that multiple read streams are a problem, but that would mean you'd have to load the same P[j] more times total.
So you should do something like
#define IDX(a, b) ((a * npages) + b)   // 2D matrix indexing
for (size_t i = 0; i < (npages & (~7ULL)); i += 8) {
    __m256d s0 = _mm256_setzero_pd(),
            s1 = _mm256_setzero_pd(),
            s2 = _mm256_setzero_pd(),
            ...
            s7 = _mm256_setzero_pd();   // 8 accumulators for 8 i values
    for (size_t j = 0; j < (npages & ~(3ULL)); j += 4) {
        __m256d Pj = _mm256_loadu_pd(P+j);   // reused 8 times after loading
        // temp[i] += P[j] * matrix_cap[IDX(i,j)];
        s0 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+0,j)]), s0);
        s1 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+1,j)]), s1);
        // ...
        s7 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+7,j)]), s7);
    }
    // or do this block with a hsum+transpose and do vector stores,
    // taking advantage of the power of vhaddpd to be doing 4 useful hsums with each instruction.
    temp[i+0] = hsum_pd256(s0);   // See the horizontal-sum helper below for how to write this function
    temp[i+1] = hsum_pd256(s1);
    // ...
    temp[i+7] = hsum_pd256(s7);
    // if npages isn't a multiple of 4, add the last couple scalar elements to the results of the hsum_pd256()s.
}
// TODO: cleanup for the last up-to-7 odd elements.
You could probably write __m256d sums[8] and loop over your vector accumulators, but you'd have to check that the compiler fully unrolls it and still actually keeps everything live in registers.
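For reference, hsum_pd256 above is not an intrinsic; it stands for a user-written horizontal-sum helper. One common way to write it (my sketch, requires <immintrin.h>):
static inline double hsum_pd256(__m256d v) {
    __m128d lo = _mm256_castpd256_pd128(v);        // low 128 bits: [a, b]
    __m128d hi = _mm256_extractf128_pd(v, 1);      // high 128 bits: [c, d]
    lo = _mm_add_pd(lo, hi);                       // [a+c, b+d]
    __m128d high64 = _mm_unpackhi_pd(lo, lo);      // [b+d, b+d]
    return _mm_cvtsd_f64(_mm_add_sd(lo, high64));  // (a+c) + (b+d)
}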
How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimise it, as it will be called a lot?
I missed this part of the question earlier. First of all, obviously float will give you 2x the number of elements per vector (and per unit of memory bandwidth). The factor-of-2 smaller memory / cache footprint might give more speedup than that if the cache hit rate increases.
Ideally, the matrix would be "striped" to match the vector width. Every load from the matrix would get a vector of matrix_cap[IDX(i,j)] for 4 adjacent i values, but the next 32B would be the next j value for the same 4 i values. This means that each vector accumulator is accumulating the sum for a different i in each element, so no need for horizontal sums at the end.
P[j] stays linear, but you broadcast-load each element of it, for use with 8 vectors of 4 i values each (or 8 vectors of 8 i values each for float). So you increase your reuse factor for P[j] loads by a factor of the vector width. Broadcast-loads are near-free on Haswell and later (still only take a load-port uop), and plenty cheap for this on SnB/IvB where they also take a shuffle-port uop.
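To make the striping idea concrete, here is a rough sketch (my own illustration; matrix_striped is a hypothetical repacked copy of matrix_cap, and it assumes npages is a multiple of 4 and FMA is available). The matrix is stored so that the four values matrix[i0..i0+3][j] are contiguous for every j, i.e. element (i0+r, j) lives at matrix_striped[i0*npages + j*4 + r]:
for (size_t i0 = 0; i0 < npages; i0 += 4) {
    const double *stripe = matrix_striped + i0 * npages;  // 4 interleaved rows
    __m256d acc = _mm256_setzero_pd();                    // lane r accumulates temp[i0+r]
    for (size_t j = 0; j < npages; j++) {
        __m256d pj = _mm256_broadcast_sd(&P[j]);          // same P[j] for all 4 rows
        __m256d m  = _mm256_loadu_pd(&stripe[j * 4]);     // matrix[i0..i0+3][j]
        acc = _mm256_fmadd_pd(pj, m, acc);
    }
    _mm256_storeu_pd(&temp[i0], acc);                     // no horizontal sums needed
}
For throughput you would still want several such accumulator vectors in flight per broadcast of P[j] (e.g. 8 groups of 4 rows, as discussed above), but this shows why the horizontal sums at the end disappear with the striped layout.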

"Blocking" method to make code cache friendly

Hey so I'm looking at a matrix shift code, and need to make it cache friendly (fewest cache misses possible). The code looks like this:
int i, j, temp;
for (i = 1; i < M; i++){
    for (j = 0; j < N; j++){
        temp = A[i][j];
        A[i][j] = A[i-1][j];
        A[i-1][j] = temp;
    }
}
Assume M and N are parameters of the function, with M the number of rows and N the number of columns. Now, to make this more cache friendly, the book gives two problems for optimization: when the matrix is 4x4, s=1, E=2, b=3, and when the matrix is 128x128, s=5, E=2, b=3.
(s = # of set index bits (S = s^2 is the number of sets), E = number of lines per set, and b = # of block bits (so B = b^2 is block size))
So using the blocking method, I should access the matrix by block size, to avoid getting a miss and having the cache fetch the information from the next cache level up. So here is what I assume:
Block size is 9 bytes for each
With the 4x4 matrix, the number of elements that fit evenly on a block is:
blocksize*(number of columns/blocksize) = 9*(4/9) = 4
So if each row will fit on one block, why is it not cache friendly?
With the 128x128 matrix, with the same logic as above, each block will hold (9*(128/9)) = 128.
So obviously after calculating that, this equation is wrong. I'm looking at the code from this page http://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf
Once I reached this point, I knew I was lost, which is where you guys come in! Is it as simple as saying each block holds 9 bytes, and 8 bytes (two integers) are what fits evenly into it? Sorry this stuff really confuses me, I know I'm all over the place. Just to be clear, these are my concerns:
How do you know how many elements will fit in a block?
Do the number of lines or sets affect this number? If so, how?
Any in depth explanation of the code posted on the linked page.
Really just trying to get a grasp of this.
UPDATE:
Okay so here is where I'm at for the 4x4 matrix.
I can read 8 bytes at a time, which is 2 integers. The original function will have cache misses because C stores arrays in row-major order, so every time it wants A[i-1][j] it will miss and load the block that holds A[i-1][j], which would be either A[i-1][0] and A[i-1][1] or A[i-1][2] and A[i-1][3].
So, would the best way to go about this be to create another temp variable, and do A[i][0] = temp, A[i][1] = temp2, then load A[i-1][0] A[i-1][1] and set them to temp, and temp2 and just set the loop to j<2? For this question, it is specifically for the matrices described; I understand this wouldn't work on all sizes.
The solution to this problem was to think of the matrix in column major order rather than row major order.
Hopefully this helps someone in the future. Thanks to @Michael Dorgan for getting me thinking.
End results for 128x128 matrix:
Original: 16218 misses
Optimized: 8196 misses

Improve C function performance with cache locality?

I have to find a diagonal difference in a matrix represented as a 2D array, and the function prototype is
int diagonal_diff(int x[512][512])
I have to use a 2d array, and the data is 512x512. This is tested on a SPARC machine: my current timing is 6ms but I need to be under 2ms.
Sample data:
[3][4][5][9]
[2][8][9][4]
[6][9][7][3]
[5][8][8][2]
The difference is:
|4-2| + |5-6| + |9-5| + |9-9| + |4-8| + |3-8| = 2 + 1 + 4 + 0 + 4 + 5 = 16
In order to do that, I use the following algorithm:
int i, j, result = 0;
for (i = 0; i < 4; i++)
    for (j = 0; j < 4; j++)
        result += abs(array[i][j] - array[j][i]);
return result;
But this algorithm keeps accessing column, row, column, row, etc., which makes inefficient use of the cache.
Is there a way to improve my function?
EDIT: Why is a block oriented approach faster? We are taking advantage of the CPU's data cache by ensuring that whether we iterate over a block by row or by column, we guarantee that the entire block fits into the cache.
For example, if you have a cache line of 32 bytes and an int is 4 bytes, you can fit an 8x8 int matrix into 8 cache lines. Assuming you have a big enough data cache, you can iterate over that matrix either by row or by column and be guaranteed that you do not thrash the cache. Another way to think about it is: if your matrix fits in the cache, you can traverse it any way you want.
If you have a matrix that is much bigger, say 512x512, then you need to tune your matrix traversal such that you don't thrash the cache. For example, if you traverse the matrix in the opposite order of the layout of the matrix, you will almost always miss the cache on every element you visit.
A block oriented approach ensures that you only have a cache miss for data you will eventually visit before the CPU has to flush that cache line. In other words, a block oriented approach tuned to the cache line size will ensure you don't thrash the cache.
So, if you are trying to optimize for the cache line size of the machine you are running on, you can iterate over the matrix in block form and ensure you only visit each matrix element once:
int sum_diagonal_difference(int array[512][512], int block_size)
{
    int i, j, block_i, block_j, result = 0;

    // sum off-diagonal blocks (above the diagonal)
    for (block_i = 0; block_i < 512; block_i += block_size)
        for (block_j = block_i + block_size; block_j < 512; block_j += block_size)
            for (i = 0; i < block_size; i++)
                for (j = 0; j < block_size; j++)
                    result += abs(array[block_i + i][block_j + j] - array[block_j + j][block_i + i]);
    result += result;

    // sum diagonal blocks
    for (int block_offset = 0; block_offset < 512; block_offset += block_size)
    {
        for (i = 0; i < block_size; ++i)
        {
            for (j = i + 1; j < block_size; ++j)
            {
                int value = abs(array[block_offset + i][block_offset + j] - array[block_offset + j][block_offset + i]);
                result += value + value;
            }
        }
    }
    return result;
}
You should experiment with various values for block_size. On my machine, 8 led to the biggest speedup (2.5x) compared to a block_size of 1 (and ~5x compared to the original iteration over the entire matrix). block_size should ideally be cache_line_size_in_bytes/sizeof(int).
If you have a good vector/matrix library like Intel MKL, also try the vectorized way.
It is very simple in MATLAB:
result = sum(sum(abs(x-x')));
I reproduced Hans's method and MSN's method in MATLAB too, and the results are:
Elapsed time is 0.211480 seconds. (Hans)
Elapsed time is 0.009172 seconds. (MSN)
Elapsed time is 0.002193 seconds. (Mine)
With one minor change you can have your loops only operate on the desired indices. I just changed the j loop initialization.
int i, j, result = 0;
for (i = 0; i < 4; ++i) {
for (j = i + 1; j < 4; ++j) {
result += abs(array[i][j] - array[j][i]);
}
}
