I have to find a diagonal difference in a matrix represented as 2d array and the function prototype is
int diagonal_diff(int x[512][512])
I have to use a 2d array, and the data is 512x512. This is tested on a SPARC machine: my current timing is 6ms but I need to be under 2ms.
Sample data:
The difference is:
|4-2| + |5-6| + |9-5| + |9-9| + |4-8| + |3-8| = 2 + 1 + 4 + 0 + 4 + 5 = 16
In order to do that, I use the following algorithm:
int i,j,result=0;
for(i=0; i<4; i++)
    for(j=0; j<4; j++)
        result+=abs(array[i][j]-array[j][i]);
return result;
But this algorithm keeps alternating between row-wise and column-wise accesses, which makes inefficient use of the cache.
Is there a way to improve my function?

EDIT: Why is a block oriented approach faster? We are taking advantage of the CPU's data cache by ensuring that whether we iterate over a block by row or by column, we guarantee that the entire block fits into the cache.
For example, if you have a cache line of 32 bytes and an int is 4 bytes, you can fit an 8x8 int matrix into 8 cache lines. Assuming you have a big enough data cache, you can iterate over that matrix either by row or by column and be guaranteed that you do not thrash the cache. Another way to think about it: if your matrix fits in the cache, you can traverse it any way you want.
If you have a matrix that is much bigger, say 512x512, then you need to tune your matrix traversal such that you don't thrash the cache. For example, if you traverse the matrix in the opposite order of the layout of the matrix, you will almost always miss the cache on every element you visit.
A block oriented approach ensures that you only have a cache miss for data you will eventually visit before the CPU has to flush that cache line. In other words, a block oriented approach tuned to the cache line size will ensure you don't thrash the cache.
So, if you are trying to optimize for the cache line size of the machine you are running on, you can iterate over the matrix in block form and ensure you only visit each matrix element once:
int sum_diagonal_difference(int array[512][512], int block_size)
{
    int i, j, block_i, block_j, result = 0;
    // sum the off-diagonal blocks, visiting each mirrored pair of blocks once
    for (block_i = 0; block_i < 512; block_i += block_size)
        for (block_j = block_i + block_size; block_j < 512; block_j += block_size)
            for (i = 0; i < block_size; i++)
                for (j = 0; j < block_size; j++)
                    result += abs(array[block_i + i][block_j + j] - array[block_j + j][block_i + i]);
    // double it: each pair contributes once as (i,j) and once as (j,i)
    result += result;
    // sum the blocks that lie on the diagonal
    for (int block_offset = 0; block_offset < 512; block_offset += block_size) {
        for (i = 0; i < block_size; ++i) {
            for (j = i + 1; j < block_size; ++j) {
                int value = abs(array[block_offset + i][block_offset + j] - array[block_offset + j][block_offset + i]);
                result += value + value;
            }
        }
    }
    return result;
}
You should experiment with various values for block_size. On my machine, 8 led to the biggest speedup (2.5x) compared to a block_size of 1 (and ~5x compared to the original iteration over the entire matrix). The block_size should ideally be cache_line_size_in_bytes/sizeof(int).
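To make that experiment concrete, here is a minimal sketch of a harness, assuming a 512x512 matrix and block sizes that divide 512 evenly. The names diff_naive, diff_blocked, fill_pattern and time_blocked are made up for illustration, and diff_blocked is a compact restatement of the blocked idea rather than the exact function above. It first checks that the blocked sum agrees with the plain double loop, then times a given block size.

```c
#include <stdlib.h>
#include <time.h>

#define N 512

/* Reference: visits every (i, j), so each mirrored pair is counted twice. */
static int diff_naive(int a[N][N])
{
    int result = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            result += abs(a[i][j] - a[j][i]);
    return result;
}

/* Blocked variant; bs is assumed to divide N evenly. */
static int diff_blocked(int a[N][N], int bs)
{
    int result = 0;
    for (int bi = 0; bi < N; bi += bs)
        for (int bj = bi; bj < N; bj += bs)
            for (int i = 0; i < bs; i++)
                /* inside a diagonal block, start just past the diagonal */
                for (int j = (bi == bj) ? i + 1 : 0; j < bs; j++)
                    result += abs(a[bi + i][bj + j] - a[bj + j][bi + i]);
    return 2 * result; /* count both (i, j) and (j, i) */
}

/* Hypothetical helper: deterministic test data with small values. */
static void fill_pattern(int a[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (i * 31 + j * 17) % 100;
}

/* Hypothetical helper: CPU seconds spent on `reps` calls with block size bs. */
static double time_blocked(int a[N][N], int bs, int reps)
{
    volatile unsigned sink = 0; /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink += (unsigned)diff_blocked(a, bs);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_blocked(m, bs, 1000) for bs in 1, 2, 4, 8, 16, 32 and comparing the results should show where the sweet spot is on a given machine. The numbers will differ between CPUs, so treat any single measurement as indicative only.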

If you have a good vector/matrix library like Intel MKL, also try the vectorized way.
It is very simple in MATLAB:
result = sum(sum(abs(x-x')));
I reproduced Hans's method and MSN's method in MATLAB too, and the results are:
Elapsed time is 0.211480 seconds. (Hans)
Elapsed time is 0.009172 seconds. (MSN)
Elapsed time is 0.002193 seconds. (Mine)

With one minor change you can have your loops only operate on the desired indices. I just changed the j loop initialization.
int i, j, result = 0;
for (i = 0; i < 4; ++i) {
    for (j = i + 1; j < 4; ++j) {
        result += abs(array[i][j] - array[j][i]);
    }
}


What is stride in C

I am new to C; I learned Python first, so I don't know what a stride is or how to use one in code.
This question has found its answer. Thanks, @Eric Postpischil.
Generally, stride is the distance steps take through something.
In the addition routine, we have these loops:
for (long i = 0; i < COLS; i++)
    for (long j = 0; j < ROWS; j++) {
        sum += table[j][i];
    }
In successive iterations of the innermost loop with j equal to x in the first iteration, one iteration accesses table[x][i], and the next accesses table[x+1][i]. The distance between these two accesses is the size of one table[j], which is COLS (2000) elements of short (likely two bytes), so likely 4000 bytes. So the stride is 4000 bytes.
This is generally bad for the cache memory on typical processors, as cache memory is designed mostly for memory accesses that are close to each other (small strides). This is the cause of the program’s slow performance.
Since the operation in the loop, sum += table[j][i];, is independent of the order it is executed in for all the i and j, we can easily remedy this problem by swapping the two for statements:
for (long j = 0; j < ROWS; j++)
    for (long i = 0; i < COLS; i++)
        sum += table[j][i];
Then successive iterations of the innermost loop will access table[j][x] and table[j][x+1], which have a stride of one short, likely two bytes.
On my system, the program runs about twenty times faster with this change.
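The two strides can be measured directly with pointer arithmetic. The sketch below assumes a table of short with COLS = 2000 as in the text. ROWS = 1000 is my own assumption, since the row count is not given here, and the function names are made up for illustration.

```c
#include <stddef.h>

#define ROWS 1000 /* assumed; only COLS is given in the text */
#define COLS 2000

static short table[ROWS][COLS];

/* Bytes between table[j][i] and table[j+1][i]: the stride of the slow loop. */
static size_t stride_down_column(void)
{
    return (size_t)((char *)&table[1][0] - (char *)&table[0][0]);
}

/* Bytes between table[j][i] and table[j][i+1]: the stride of the fast loop. */
static size_t stride_along_row(void)
{
    return (size_t)((char *)&table[0][1] - (char *)&table[0][0]);
}

/* Column-first traversal: the inner loop jumps a whole row each step. */
static long sum_column_order(void)
{
    long sum = 0;
    for (long i = 0; i < COLS; i++)
        for (long j = 0; j < ROWS; j++)
            sum += table[j][i];
    return sum;
}

/* Row-first traversal: the inner loop touches adjacent elements. */
static long sum_row_order(void)
{
    long sum = 0;
    for (long j = 0; j < ROWS; j++)
        for (long i = 0; i < COLS; i++)
            sum += table[j][i];
    return sum;
}
```

Both traversals return the same sum; only the order of memory accesses differs. stride_down_column() evaluates to COLS * sizeof(short), which is 4000 bytes where short is two bytes.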

How to do a proper Cache Blocked Matrix Transposition?

I am trying to do a Cache Blocked Matrix Transposition in C but I am having some troubles with something in my code. My guess is that it has to do with the indexes. Can you tell me where am I going wrong?
I am considering two algorithms that I found on the web,
but I haven't yet figured out how to code them correctly.
for (i = 0; i < N; i += block) {
for (j = 0; j < i; j += block ) {
temp1[ii][jj] = A2[ii][jj];
temp2[ii][jj] = A2[jj][ii];
A2[ii][jj] = temp1[ii][jj];
A2[ii][jj] = temp2[ii][jj];
temp1 and temp2 are two matrices of size block x block filled with zeros.
I am not sure if I have to do another for when I am returning the values to A2 (the before and after transposed matrix).
I also tried this:
for (i = 0; i < N; i += block) {
for (j = 0; j < N; j += block ) {
ii = A2[i][j];
jj = A2[j][i];
A2[j][i] = ii;
A2[i][j] = jj;
I am expecting better performance than the "naive" Matrix Transposition algorithm:
for (i = 1; i < N; i++) {
    for(j = 0; j < i; j++) {
        TEMP= A[i][j]; A[i][j]= A[j][i]; A[j][i]= TEMP;
    }
}
The proper way to do blocked matrix transposition is not what is in your program. The extra temp1 and temp2 arrays will uselessly fill your cache. And your second version is incorrect. Moreover, you do too many operations: elements are transposed twice and diagonal elements are "transposed".
But first we can do some simple (and approximate) cache behavior analysis. I assume that you have a matrix of doubles and that cache lines are 64 bytes (8 doubles).
A blocked implementation is equivalent to a naive implementation if the cache can completely contain the matrix. You only have mandatory cache misses to fetch the matrix elements. The number of cache misses will be N×N/8 to process N×N elements, with an average number of misses of 1/8 per element.
Now, for the naive implementation, look at the situation after you have processed one line. Assuming your cache is large enough, you will have in your cache:
* the complete line A[0][i]
* the first 8 elements of every other line of the matrix, A[i][0..7]
This means that, if your cache is large enough, you can process the next 7 lines without any cache miss other than the ones needed to fetch the lines themselves. So if your matrix is N×N and the cache size is larger than ~2×N×8, you will have only 8×N/8 (lines) + N (cols) = 2N cache misses to process 8×N elements, and the average number of misses per element is 1/4. Numerically, if the L1 cache size is 32k, this will happen if N<2k. And if the L2 cache is 256k, data will remain in L2 if N<16k. I do not think the difference between data in L1 and data in L2 will be really visible, thanks to the very efficient prefetch in modern processors.
If you have a very large matrix, then by the end of the first line, the beginning of the second line has been evicted from the cache. This will happen if a line of your matrix completely fills the cache. In this situation, the number of cache misses will be much higher. Every line will have N/8 (to get the line) + N (to get the first elements of columns) cache misses, for an average of (9×N/8)/N ≈ 1 miss per element.
So you can gain with a blocked implementation, but only for large matrices.
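The estimates above are plain arithmetic and can be written down as a few helper functions. This is only a sketch of the back-of-the-envelope model from the text (64-byte lines, 8-byte doubles, so 8 elements per line). The function names are mine and the model ignores associativity and prefetching.

```c
#define ELEMS_PER_LINE 8 /* 64-byte line / 8-byte double, as assumed in the text */

/* Lines fit in cache: 8 rows cost 8*N/8 row misses plus N column misses. */
static double misses_per_elem_cached(double n)
{
    double misses = ELEMS_PER_LINE * n / ELEMS_PER_LINE + n; /* = 2N */
    return misses / (ELEMS_PER_LINE * n);                    /* = 1/4 */
}

/* One row evicts the cache: every row pays N/8 + N misses for N elements. */
static double misses_per_elem_thrashing(double n)
{
    return (n / ELEMS_PER_LINE + n) / n;                     /* = 9/8 */
}

/* Largest N for which roughly 2*N*8 bytes still fit in the given cache size. */
static long max_n_for_cache(long cache_bytes)
{
    return cache_bytes / (2 * 8);
}
```

For a 32 KiB L1 cache, max_n_for_cache gives 2048, and for a 256 KiB L2 it gives 16384, which matches the N<2k and N<16k thresholds quoted above.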
Here is a correct implementation of matrix transpose. It avoids processing element A[l][m] twice (when i=l and j=m or i=m and j=l), does not transpose diagonal elements, and uses registers for the transposition.
Naive version:
for (i=0;i<N;i++)
    for (j=i+1;j<N;j++)
        { temp=A[i][j]; A[i][j]=A[j][i]; A[j][i]=temp; }
And the blocked version (we assume the matrix size is a multiple of the block size):
for (ii=0;ii<N;ii+=block)
    for (jj=0;jj<N;jj+=block)
        for (i=ii;i<ii+block;i++)
            for (j=jj+i+1;j<jj+block;j++)
                { temp=A[i][j]; A[i][j]=A[j][i]; A[j][i]=temp; }
I am using your code, but I am not getting the same answer when I compare the naive with the blocked algorithm. With a matrix of size N=4 and block=2, I put in this matrix A:
2 8 1 8
6 8 2 4
7 2 6 5
6 8 6 5
and I am getting the matrix At as follows:
2 6 1 6
8 8 2 4
7 2 6 5
8 8 6 5
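One plausible explanation for the mismatch reported in this comment: in the blocked loop, j starts at jj+i+1, which mixes the absolute row index i with the block offset jj, so with N=4 and block=2 only the pairs (0,1) and (0,3) are ever swapped. Below is a minimal sketch of a variant that avoids this, assuming N is a multiple of block. It visits only blocks on or above the diagonal, and skips the lower part only inside diagonal blocks. The function name and the fixed N are my own choices for illustration, not the answerer's code.

```c
#define N 4 /* assumed small size for the example; must be a multiple of block */

/* In-place blocked transpose: each pair (i, j) with i < j is swapped exactly once. */
static void transpose_blocked(int A[N][N], int block)
{
    for (int ii = 0; ii < N; ii += block)
        for (int jj = ii; jj < N; jj += block)       /* blocks on or above the diagonal */
            for (int i = ii; i < ii + block; i++) {
                /* diagonal block: start past the diagonal; otherwise take the whole row */
                int j0 = (jj == ii) ? i + 1 : jj;
                for (int j = j0; j < jj + block; j++) {
                    int temp = A[i][j];
                    A[i][j] = A[j][i];
                    A[j][i] = temp;
                }
            }
}
```

Applied to the matrix A from the comment with block=2, this swaps all six off-diagonal pairs, so the result matches what the naive version produces.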

Matrix-Multiplication: Why non-blocked outperforms blocked?

I'm trying to speed up a matrix multiplication algorithm by blocking the loops to improve cache performance, yet the non-blocked version remains significantly faster regardless of matrix size, block size (I've tried lots of values between 2 and 200, powers of 2 and others) and optimization level.
Non-blocked version:
for(size_t i = 0; i < n; ++i)
    for(size_t k = 0; k < n; ++k) {
        int r = a[i][k];
        for(size_t j = 0; j < n; ++j)
            c[i][j] += r * b[k][j];
    }
Blocked version:
for(size_t kk = 0; kk < n; kk += BLOCK)
    for(size_t jj = 0; jj < n; jj += BLOCK)
        for(size_t i = 0; i < n; ++i)
            for(size_t k = kk; k < kk + BLOCK; ++k) {
                int r = a[i][k];
                for(size_t j = jj; j < jj + BLOCK; ++j)
                    c[i][j] += r * b[k][j];
            }
I also have a bijk version and a 6-loop bikj version, but they all get outperformed by the non-blocked version and I don't get why this happens. Every paper and tutorial that I've come across seems to indicate that the blocked version should be significantly faster. I'm running this on a Core i5 if that matters.
Try blocking in one dimension only, not in both dimensions.
Matrix multiplication exhaustively processes elements from both matrices. Each row vector on the left matrix is repeatedly processed, taken into successive columns of the right matrix.
If the matrices do not both fit into the cache, some data will invariably end up loaded multiple times.
What we can do is break up the operation so that we work with about a cache-sized amount of data at one time. We want the row vector from the left operand to be cached, since it is repeatedly applied against multiple columns. But we should only take enough columns (at a time) to stay within the limit of the cache. For instance, if we can only take 25% of the columns, it means we will have to pass over the row vectors four times. We end up loading the left matrix from memory four times, and the right matrix only once.
(If anything is to be loaded more than once, it should be the row vectors on the left, because they are flat in memory, which benefits from burst loading. Many cache architectures can perform a burst load from memory into adjacent cache lines faster than random access loads. If the right matrix were stored in column-major order, that would be even better: then we are doing dot products between flat arrays, which prefetch nicely.)
Let's also not forget the output matrix. The output matrix occupies space in the cache also.
I suspect one flaw in the 2D blocked approach is that each element of the output matrix depends on two inputs: its entire row in the left matrix, and the entire column in the right matrix. If the matrices are visited in blocks, that means that each target element is visited multiple times to accumulate the partial result.
If we do a complete row-column dot product, we don't have to visit the c[i][j] more than once; once we take column j into row i, we are done with that c[i][j].
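As a rough sketch of "blocking in one dimension only", the version below keeps the i-k-j loop order from the question and blocks only over the columns of the right matrix (and of the output), so that a panel of `panel` columns is reused for every row of the left matrix. It uses flat row-major arrays rather than the question's 2D arrays, and the function names and the `panel` parameter are assumptions of mine. It is not a full row-column dot product, so each c element is still updated once per k, but those updates stay within one narrow panel.

```c
#include <stddef.h>

/* Reference i-k-j product on flat row-major n x n arrays; c must start zeroed. */
static void matmul_plain(size_t n, const int *a, const int *b, int *c)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < n; ++k) {
            int r = a[i * n + k];
            for (size_t j = 0; j < n; ++j)
                c[i * n + j] += r * b[k * n + j];
        }
}

/* Blocks only over columns of b and c: one panel of `panel` columns at a time. */
static void matmul_col_panels(size_t n, size_t panel,
                              const int *a, const int *b, int *c)
{
    for (size_t jj = 0; jj < n; jj += panel) {
        size_t jend = (jj + panel < n) ? jj + panel : n; /* last panel may be short */
        for (size_t i = 0; i < n; ++i)
            for (size_t k = 0; k < n; ++k) {
                int r = a[i * n + k];
                for (size_t j = jj; j < jend; ++j)
                    c[i * n + j] += r * b[k * n + j];
            }
    }
}
```

With panel chosen so that n * panel elements of b fit in cache, the left matrix is streamed once per panel while the panel itself stays resident. Whether this beats the plain loop depends on the matrix size and the machine, so it is worth measuring both.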

How to use AVX/SIMD with nested loops and += format?

I am writing a page rank program. I am writing a method for updating the rankings. I have successfully got it working with nested for loops and also a threaded version. However, I would like to use SIMD/AVX instead.
This is the code I would like to change into a SIMD/AVX implementation.
#define IDX(a, b) (((a) * npages) + (b)) // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
    temp[i] = 0.0;
    for (size_t j = 0; j < npages; j++) {
        temp[i] += P[j] * matrix_cap[IDX(i,j)];
    }
}
For this code P[] is of size npages and matrix_cap[] is of size npages * npages. P[] is the ranks of the pages and temp[] is used to store the next iterations page ranks so as to be able to check convergence.
I don't know how to interpret += with AVX and how I would get my data which involves two arrays/vectors of size npages and one matrix of size npages * npages (in row major order) into a format of which could be used with SIMD/AVX operations.
As far as AVX this is what I have so far though it's very very incorrect and was just a stab at what I would roughly like to do.
ssize_t g_mod = npages - (npages % 4);
double* res = malloc(sizeof(double) * npages);
double sum = 0.0;
for (size_t i = 0; i < npages; i++) {
for (size_t j = 0; j < mod; j += 4) {
__m256d p = _mm256_loadu_pd(P + j);
__m256d m = _mm256_loadu_pd(matrix_hat + i + j);
__m256d pm = _mm256_mul_pd(p, m);
_mm256_storeu_pd(&res + j, pm);
for (size_t k = 0; k < 4; k++) {
sum += res[j + k];
for (size_t i = mod; i < npages; i++) {
for (size_t j = 0; j < npages; j++) {
sum += P[j] * matrix_cap[IDX(i,j)];
temp[i] = sum;
sum = 0.0;
How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimise it, as it will be called a lot?
Consider using the OpenMP 4.x #pragma omp simd reduction for the innermost loop. Keep in mind that omp reductions are not applicable to C++ arrays, so you have to use a temporary reduction variable as shown below.
#define IDX(a, b) (((a) * npages) + (b)) // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
    my_type tmp_reduction = 0.0; // was: // temp[i] = 0.0;
    #pragma omp simd reduction (+:tmp_reduction)
    for (size_t j = 0; j < npages; j++) {
        tmp_reduction += P[j] * matrix_cap[IDX(i,j)];
    }
    temp[i] = tmp_reduction;
}
For x86 platforms, OpenMP4.x is currently supported by fresh GCC (4.9+) and Intel Compilers. Some LLVM and PGI compilers may also support it.
P.S. Auto-vectorization ("auto" means vectorization by the compiler without any pragmas, i.e. without explicit guidance from developers) may sometimes work for some compiler variants (although it's very unlikely, because the reduction variable is an array element). However, it is strictly speaking incorrect to auto-vectorize this code. You have to use an explicit SIMD pragma to "resolve" the reduction dependency and (as a good side effect) disambiguate pointers (in case arrays are accessed via pointers).
First, EOF is right, you should see how well gcc/clang/icc do at auto-vectorizing your scalar code. I can't check for you, because you only posted code-fragments, not anything I can throw on
You definitely don't need to malloc anything. Notice that your intrinsics version only ever uses 32B at a time of res[], and always overwrites whatever was there before. So you might as well use a single 32B array. Or better, use a better method to get a horizontal sum of your vector.
(see the bottom for a suggestion on a different data arrangement for the matrix)
Calculating each temp[i] uses every P[j], so there is actually something to be gained from being smarter about vectorizing. For every load from P[j], use that vector with 4 different loads from matrix_cap[] for that j, but 4 different i values. You'll accumulate 4 different vectors, and have to hsum each of them down to a temp[i] value at the end.
So your inner loop will have 5 read streams (P[] and 4 different rows of matrix_cap). It will do 4 horizontal sums, and 4 scalar stores at the end, with the final result for 4 consecutive i values. (Or maybe do two shuffles and two 16B stores). (Or maybe transpose-and-sum together, which is actually a good use-case for the shuffling power of the expensive _mm256_hadd_pd (vhaddpd) instruction, but be careful of its in-lane operation)
It's probably even better to accumulate 8 to 12 temp[i] values in parallel, so every load from P[j] is reused 8 to 12 times. (check the compiler output to make sure you aren't running out of vector regs and spilling __m256d vectors to memory, though.) This will leave more work for the cleanup loop.
FMA throughput and latency are such that you need 10 vector accumulators to keep 10 FMAs in flight to saturate the FMA unit on Haswell. Skylake reduced the latency to 4c, so you only need 8 vector accumulators to saturate it on SKL. (See the x86 tag wiki). Even if you're bottlenecked on memory, not execution-port throughput, you will want multiple accumulators, but they could all be for the same temp[i] (so you'd vertically sum them down to one vector, then hsum that).
However, accumulating results for multiple temp[i] at once has the large advantage of reusing P[j] multiple times after loading it. You also save the vertical adds at the end. Multiple read streams may actually help hide the latency of a cache miss in any one of the streams. (HW prefetchers in Intel CPUs can track one forward and one reverse stream per 4k page, IIRC). You might strike a balance, and use two or three vector accumulators for each of 4 temp[i] results in parallel, if you find that multiple read streams are a problem, but that would mean you'd have to load the same P[j] more times total.
So you should do something like
#define IDX(a, b) (((a) * npages) + (b)) // 2D matrix indexing; arguments parenthesized so IDX(i+1,j) expands correctly
for (size_t i = 0; i < (npages & (~7ULL)); i+=8) {
    __m256d s0 = _mm256_setzero_pd(),
            s1 = _mm256_setzero_pd(),
            s2 = _mm256_setzero_pd(),
            // ...
            s7 = _mm256_setzero_pd(); // 8 accumulators for 8 i values
    for (size_t j = 0; j < (npages & ~(3ULL)); j+=4) {
        __m256d Pj = _mm256_loadu_pd(P+j); // reused 8 times after loading
        //temp[i] += P[j] * matrix_cap[IDX(i,j)];
        s0 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+0,j)]), s0);
        s1 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+1,j)]), s1);
        // ...
        s7 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+7,j)]), s7);
    }
    // or do this block with a hsum+transpose and do vector stores.
    // taking advantage of the power of vhaddpd to be doing 4 useful hsums with each instruction.
    temp[i+0] = hsum_pd256(s0); // See the horizontal-sum link earlier for how to write this function
    temp[i+1] = hsum_pd256(s1);
    // ...
    temp[i+7] = hsum_pd256(s7);
    // if npages isn't a multiple of 4, add the last couple scalar elements to the results of the hsum_pd256()s.
}
// TODO: cleanup for the last up-to-7 odd elements.
You could probably write __m256d sums[8] and loop over your vector accumulators, but you'd have to check that the compiler fully unrolls it and still actually keeps everything live in registers.
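The reuse idea can be tried out without intrinsics first. The sketch below is a plain scalar analogue, not the AVX code above: it processes four rows at a time so that each P[j] is loaded once and feeds four independent accumulators, with a cleanup loop for leftover rows. The function names and the choice of four rows are assumptions for illustration. A compiler may or may not vectorize it, so check the generated code before drawing conclusions.

```c
#include <stddef.h>

/* Baseline: temp[i] = sum over j of P[j] * m[i*n + j], one row at a time. */
static void matvec_plain(size_t n, const double *m, const double *P, double *temp)
{
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += P[j] * m[i * n + j];
        temp[i] = s;
    }
}

/* Four rows at once: each P[j] is loaded once and used by 4 accumulators. */
static void matvec_rows4(size_t n, const double *m, const double *P, double *temp)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const double *r0 = m + (i + 0) * n, *r1 = m + (i + 1) * n;
        const double *r2 = m + (i + 2) * n, *r3 = m + (i + 3) * n;
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (size_t j = 0; j < n; j++) {
            double pj = P[j]; /* one load, reused for four rows */
            s0 += pj * r0[j];
            s1 += pj * r1[j];
            s2 += pj * r2[j];
            s3 += pj * r3[j];
        }
        temp[i + 0] = s0;
        temp[i + 1] = s1;
        temp[i + 2] = s2;
        temp[i + 3] = s3;
    }
    for (; i < n; i++) { /* cleanup for the last up-to-3 rows */
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += P[j] * m[i * n + j];
        temp[i] = s;
    }
}
```

Each row is still summed in increasing j order, so the results match the baseline loop. The four accumulators are independent dependency chains, which is the same property the vector accumulators s0..s7 provide in the intrinsics version.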
"How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimise it, as it will be called a lot?"
I missed this part of the question earlier. First of all, obviously float will give you 2x the number of elements per vector (and per unit of memory bandwidth). The factor of 2 smaller memory / cache footprint might give more speedup than that if the cache hit rate increases.
Ideally, the matrix would be "striped" to match the vector width. Every load from the matrix would get a vector of matrix_cap[IDX(i,j)] for 4 adjacent i values, but the next 32B would be the next j value for the same 4 i values. This means that each vector accumulator is accumulating the sum for a different i in each element, so no need for horizontal sums at the end.
P[j] stays linear, but you broadcast-load each element of it, for use with 8 vectors of 4 i values each (or 8 vectors of 8 i values each for float). So you increase your reuse factor for P[j] loads by a factor of the vector width. Broadcast-loads are near-free on Haswell and later (still only take a load-port uop), and plenty cheap for this on SnB/IvB where they also take a shuffle-port uop.
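To illustrate the striped layout, here is a scalar sketch, assuming a vector width of W = 4 doubles and an n that is a multiple of W. The small array acc[] plays the role of one vector register and the multiplication by pj plays the role of the broadcast. The function names, W, and the index formula are my own illustration of the idea, not code from the answer.

```c
#include <stddef.h>

#define W 4 /* assumed vector width in doubles (4 for AVX) */

/* Copy row-major m (n x n, n a multiple of W) into the striped layout:
   within each group of W rows, the W values of column j are adjacent. */
static void to_striped(size_t n, const double *m, double *striped)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            striped[(i / W) * n * W + j * W + (i % W)] = m[i * n + j];
}

/* temp = m * P using the striped copy; acc[] stands in for one vector register. */
static void matvec_striped(size_t n, const double *striped,
                           const double *P, double *temp)
{
    for (size_t g = 0; g < n / W; g++) {
        const double *group = striped + g * n * W;
        double acc[W] = {0.0};
        for (size_t j = 0; j < n; j++) {
            double pj = P[j]; /* the "broadcast" of P[j] */
            for (size_t lane = 0; lane < W; lane++)
                acc[lane] += pj * group[j * W + lane];
        }
        for (size_t lane = 0; lane < W; lane++)
            temp[g * W + lane] = acc[lane]; /* no horizontal sum needed */
    }
}
```

Each lane of acc accumulates the result for a different row, so the final step is a plain store of W results. The conversion with to_striped costs one pass over the matrix, which only pays off if the matrix is reused across many iterations, as it is in a page rank loop.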

What memory access patterns are most efficient for outer-product-type double loops?

What access patterns are most efficient for writing cache-efficient outer-product type code that maximally exploits data locality?
Consider a block of code for processing all pairs of elements of two arrays such as:
for (int i = 0; i < N; i++)
for (int j = 0; j < M; j++)
out[i*M + j] = X[i] binary-op Y[j];
This is a standard vector-vector outer product when binary-op is scalar multiplication and X and Y are 1d, but this same pattern is also matrix multiplication when X and Y are matrices and binary-op is a dot product between the ith row and j-th column of two matrices.
For matrix multiplication, I know optimized BLASs like OpenBLAS and MKL can get much higher performance than you get from the double loop style code above, because they process the elements in chunks in such a way as to exploit the CPU cache much more. Unfortunately, OpenBLAS kernels are written in assembly so it's pretty difficult to figure out what's going on.
Are there any good "tricks of the trade" for re-organizing these types of double loops to improve cache performance?
Since each element of out is only hit once, we're clearly free to reorder the iterations. The straight linear traversal of out is the easiest to write, but I don't think it's the most efficient pattern to execute, since you don't exploit any locality in X.
I'm especially interested in the setting where M and N are large, and the size of each element (X[i], and Y[j]) is pretty small (like O(1) bytes), so we're talking about something analogous to vector-vector outer product or the multiplication of a tall and skinny matrix by a short and fat matrix (e.g. N x D by D x M where D is small).
For large enough M, the Y vector will exceed the L1 cache size.* Thus on every new outer iteration, you'll be reloading Y from main memory (or at least, a slower cache). In other words, you won't be exploiting temporal locality in Y.
You should block up your accesses to Y; something like this:
for (jj = 0; jj < M; jj += CACHE_SIZE) { // Iterate over blocks
    for (i = 0; i < N; i++) {
        for (j = jj; j < (jj + CACHE_SIZE); j++) { // Iterate within block
            out[i*M + j] = X[i] * Y[j];
        }
    }
}
The above doesn't do anything smart with accesses to X, but new values are only being accessed 1/CACHE_SIZE as often, so the impact is probably negligible.
* If everything is small enough to already fit in cache, then you can't do better than what you already have (vectorisation opportunities notwithstanding).
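A self-contained sketch of the blocked loop, with one addition that the fragment above leaves out: the last block is clamped so that M does not have to be a multiple of the block size. The parameter is called block here rather than CACHE_SIZE because it counts elements of Y, not bytes. The function names and the use of double are assumptions for illustration.

```c
#include <stddef.h>

/* Straight traversal: for every X[i], walk all of Y. */
static void outer_plain(size_t n, size_t m,
                        const double *X, const double *Y, double *out)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            out[i * m + j] = X[i] * Y[j];
}

/* Blocked over Y: a chunk of `block` elements of Y is reused for every X[i]. */
static void outer_blocked(size_t n, size_t m, size_t block,
                          const double *X, const double *Y, double *out)
{
    for (size_t jj = 0; jj < m; jj += block) {           /* iterate over blocks */
        size_t jend = (jj + block < m) ? jj + block : m; /* clamp the last block */
        for (size_t i = 0; i < n; i++)
            for (size_t j = jj; j < jend; j++)           /* iterate within block */
                out[i * m + j] = X[i] * Y[j];
    }
}
```

A reasonable starting point is to pick block so that block * sizeof(double) is a fraction of the L1 data cache, leaving room for the corresponding slice of out, and then measure. Note that blocking over Y makes the writes to out jump by a whole row between consecutive i values, so the best block size is a trade-off between reuse of Y and write locality.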
