Optimized matrix multiplication in C

I'm trying to compare different methods for matrix multiplication.
The first one is the normal method:
c = 0;
do
{
    for (j = 0; j < i; j++)
    {
        for (k = 0; k < i; k++)
        {
            suma = 0;
            for (l = 0; l < i; l++)
                suma += MatrixA[j][l]*MatrixB[l][k];
            MatrixR[j][k] = suma;
        }
    }
    c++;
} while (c < iteraciones);
The second one consists of transposing matrix B first and then doing the multiplication by rows:
int f, co;
for (f = 0; f < i; f++) {
    for (co = 0; co < i; co++) {
        MatrixB[f][co] = MatrixB[co][f];
    }
}
c = 0;
do
{
    for (j = 0; j < i; j++)
    {
        for (k = 0; k < i; k++)
        {
            suma = 0;
            for (l = 0; l < i; l++)
                suma += MatrixA[j][l]*MatrixB[k][l];
            MatrixR[j][k] = suma;
        }
    }
    c++;
} while (c < iteraciones);
The second method is supposed to be much faster, because we are accessing contiguous memory slots, but I'm not getting a significant improvement in performance. Am I doing something wrong?
I can post the complete code, but I don't think it is needed.

What Every Programmer Should Know About Memory (pdf link) by Ulrich Drepper has a lot of good ideas about memory efficiency, but in particular, he uses matrix multiplication as an example of how knowing about memory and using that knowledge can speed up this process. Look at appendix A.1 in his paper, and read through section 6.2.1. Table 6.2 in the paper shows that he could get his running time down to 10% of a naive implementation's time for a 1000x1000 matrix.
Granted, his final code is pretty hairy and uses a lot of system-specific stuff and compile-time tuning, but still, if you really need speed, reading that paper and reading his implementation is definitely worth it.

Getting this right is non-trivial. Using an existing BLAS library is highly recommended.
Should you really be inclined to roll your own matrix multiplication, loop tiling is an optimization that is of particular importance for large matrices. The tiling should be tuned to the cache size to ensure that the cache is not being continually thrashed, which will occur with a naive implementation. I once measured a 12x performance difference tiling a matrix multiply with matrix sizes picked to consume multiples of my cache (circa '97 so the cache was probably small).
Loop tiling algorithms assume that a contiguous linear array of elements is used, as opposed to rows or columns of pointers. With such a storage choice, your indexing scheme determines which dimension changes fastest, and you are free to decide whether row or column access will have the best cache performance.
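As a rough illustration of the idea (this is my own minimal sketch, not code from the references below), here is a tiled multiply assuming contiguous, row-major storage; BLOCK is a tuning parameter that should be chosen so a few BLOCK x BLOCK tiles fit in cache:
#include <stddef.h>

#define BLOCK 64   /* tune to your cache */

void matmul_tiled(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n * n; i++)
        C[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                /* multiply one pair of BLOCK x BLOCK tiles */
                for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                    for (size_t k = kk; k < kk + BLOCK && k < n; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}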
There's a lot of literature on the subject. The following references, especially the Banerjee books, may be helpful:
[Ban93] Banerjee, Utpal, Loop Transformations for Restructuring Compilers: the Foundations, Kluwer Academic Publishers, Norwell, MA, 1993.
[Ban94] Banerjee, Utpal, Loop Parallelization, Kluwer Academic Publishers, Norwell, MA, 1994.
[BGS93] Bacon, David F., Susan L. Graham, and Oliver Sharp, Compiler Transformations for High-Performance Computing, Computer Science Division, University of California, Berkeley, Calif., Technical Report No UCB/CSD-93-781.
[LRW91] Lam, Monica S., Edward E. Rothberg, and Michael E. Wolf, The Cache Performance and Optimizations of Blocked Algorithms, In 4th International Conference on Architectural Support for Programming Languages, held in Santa Clara, Calif., April, 1991, 63-74.
[LW91] Lam, Monica S., and Michael E. Wolf, A Loop Transformation Theory and an Algorithm to Maximize Parallelism, In IEEE Transactions on Parallel and Distributed Systems, 1991, 2(4):452-471.
[PW86] Padua, David A., and Michael J. Wolfe, Advanced Compiler Optimizations for Supercomputers, In Communications of the ACM, 29(12):1184-1201, 1986.
[Wolfe89] Wolfe, Michael J. Optimizing Supercompilers for Supercomputers, The MIT Press, Cambridge, MA, 1989.
[Wolfe96] Wolfe, Michael J., High Performance Compilers for Parallel Computing, Addison-Wesley, CA, 1996.

ATTENTION: You have a BUG in your second implementation
for (f = 0; f < i; f++) {
    for (co = 0; co < i; co++) {
        MatrixB[f][co] = MatrixB[co][f];
    }
}
When you do f=0, co=1
MatrixB[0][1] = MatrixB[1][0];
you overwrite MatrixB[0][1] and lose that value! When the loop gets to f=1, co=0
MatrixB[1][0] = MatrixB[0][1];
the value copied is the same one that was already there.
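For reference, a minimal corrected in-place transpose sketch (the element type is assumed to be int here; use whatever type MatrixB actually holds): walk only the upper triangle (co > f) so each pair is swapped exactly once and the diagonal is left alone.
int f, co, tmp;
for (f = 0; f < i; f++) {
    for (co = f + 1; co < i; co++) {
        tmp = MatrixB[f][co];
        MatrixB[f][co] = MatrixB[co][f];
        MatrixB[co][f] = tmp;
    }
}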

You should not write matrix multiplication yourself. You should depend on external libraries. In particular, you should use the GEMM routine from the BLAS library. GEMM often provides the following optimizations:
Blocking
Efficient Matrix Multiplication relies on blocking your matrix and performing several smaller blocked multiplies. Ideally the size of each block is chosen to fit nicely into cache greatly improving performance.
Tuning
The ideal block size depends on the underlying memory hierarchy (how big is the cache?). As a result libraries should be tuned and compiled for each specific machine. This is done, among others, by the ATLAS implementation of BLAS.
Assembly Level Optimization
Matrix multiplication is so common that developers will optimize it by hand. In particular, this is done in GotoBLAS.
Heterogeneous (GPU) Computing
Matrix Multiply is very FLOP/compute intensive, making it an ideal candidate to be run on GPUs. cuBLAS and MAGMA are good candidates for this.
In short, dense linear algebra is a well studied topic. People devote their lives to the improvement of these algorithms. You should use their work; it will make them happy.
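For illustration, a minimal sketch of what calling GEMM through a CBLAS interface (e.g. OpenBLAS or ATLAS) looks like for square, row-major, contiguously stored double matrices; the wrapper name and storage assumptions are mine, not from the question:
#include <cblas.h>   /* link with -lopenblas, -lcblas, or similar */

/* C = A * B for n x n row-major matrices stored contiguously. */
void multiply_with_blas(int n, const double *A, const double *B, double *C)
{
    /* cblas_dgemm computes C = alpha*A*B + beta*C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,      /* M, N, K       */
                1.0, A, n,    /* alpha, A, lda */
                B, n,         /* B, ldb        */
                0.0, C, n);   /* beta, C, ldc  */
}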

If the matrix is not large enough or you don't repeat the operations a high number of times you won't see appreciable differences.
If the matrix is, say, 1,000x1,000 you will begin to see improvements, but I would say that if it is below 100x100 you should not worry about it.
Also, any 'improvement' may be of the order of milliseconds, unless you are either working with extremely large matrices or repeating the operation thousands of times.
Finally, if you change the computer you are using for a faster one, the differences will be even narrower!

How big an improvement you get will depend on:
The size of the cache
The size of a cache line
The degree of associativity of the cache
For small matrix sizes and modern processors it's highly probable that the data from both MatrixA and MatrixB will be kept nearly entirely in the cache after you touch it the first time.
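If you want to see the actual numbers for your machine, on glibc/Linux you can query them with sysconf. This is a system-specific sketch; the _SC_* names below are GNU extensions and may be absent or return 0 elsewhere:
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Level-1 data cache parameters, as reported by glibc. */
    printf("L1d size:      %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1d line size: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L1d assoc:     %ld-way\n",   sysconf(_SC_LEVEL1_DCACHE_ASSOC));
    return 0;
}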

Just something for you to try (but this would only make a difference for large matrices): separate out your addition logic from the multiplication logic in the inner loop like so:
for (k = 0; k < i; k++)
{
    int sums[i]; /* a variable-length array; legal in C99, otherwise treat this as pseudo-code */
    for (l = 0; l < i; l++)
        sums[l] = MatrixA[j][l]*MatrixB[k][l];
    int suma = 0;
    for (int s = 0; s < i; s++)
        suma += sums[s];
    MatrixR[j][k] = suma;
}
This is because you end up stalling your pipeline when you write to suma. Granted, much of this is taken care of by register renaming and the like, but with my limited understanding of hardware, if I wanted to squeeze every ounce of performance out of the code, I would do this, because now you don't have to stall the pipeline to wait for a write to suma. Since multiplication is more expensive than addition, you want to let the machine parallelize it as much as possible, so saving your stalls for the addition means you spend less time waiting in the addition loop than you would in the multiplication loop.
This is just my logic. Others with more knowledge in the area may disagree.

Can you post some data comparing your two approaches for a range of matrix sizes? It may be that your expectations are unrealistic and that your second version is faster, but you haven't done the measurements yet.
Don't forget, when measuring execution time, to include the time to transpose MatrixB.
Something else you might want to try is comparing the performance of your code with that of the equivalent operation from your BLAS library. This may not answer your question directly, but it will give you a better idea of what you might expect from your code.
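For the measurements suggested above, a rough sketch using the standard clock() function; multiply_naive, transpose_B and multiply_transposed are placeholders for the question's own code, and the transpose is deliberately timed as part of the second method:
#include <stdio.h>
#include <time.h>

clock_t t0 = clock();
multiply_naive();                  /* method 1 */
clock_t t1 = clock();

transpose_B();                     /* charge the transpose to method 2 */
multiply_transposed();             /* method 2 */
clock_t t2 = clock();

printf("naive:                %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
printf("transpose + multiply: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);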

The computational complexity of multiplying two N*N matrices is O(N^3). Performance will be dramatically improved if you use a sub-cubic algorithm such as Strassen's (roughly O(N^2.81)), which has probably been adopted by MATLAB. If you have MATLAB installed, try multiplying two 1024*1024 matrices. On my computer, MATLAB completes it in 0.7 s, but a C/C++ implementation of the naive algorithm like yours takes 20 s. If you really care about performance, look into lower-complexity algorithms. I have heard there are O(N^2.4) algorithms, but they need very large matrices before the extra bookkeeping becomes negligible.
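For reference, the sub-cubic approach alluded to here is Strassen-style multiplication. A minimal sketch of the 7-multiplication trick on a 2x2 matrix (in the real algorithm the entries are sub-blocks and each '*' becomes a recursive call, which is where the sub-cubic complexity comes from):
/* One level of Strassen's scheme: 7 multiplications instead of 8. */
void strassen_2x2(const double a[2][2], const double b[2][2], double c[2][2])
{
    double m1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    double m2 = (a[1][0] + a[1][1]) *  b[0][0];
    double m3 =  a[0][0]            * (b[0][1] - b[1][1]);
    double m4 =  a[1][1]            * (b[1][0] - b[0][0]);
    double m5 = (a[0][0] + a[0][1]) *  b[1][1];
    double m6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
    double m7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);

    c[0][0] = m1 + m4 - m5 + m7;
    c[0][1] = m3 + m5;
    c[1][0] = m2 + m4;
    c[1][1] = m1 - m2 + m3 + m6;
}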

Not so special, but better: unroll the inner loop by two with two accumulators (this assumes MatrixB has already been transposed, as in the second method):
c = 0;
do
{
    for (j = 0; j < i; j++)
    {
        for (k = 0; k < i; k++)
        {
            sum = 0; sum_ = 0;
            for (l = 0; l + 1 < i; l += 2) {
                sum  += MatrixA[j][l]   * MatrixB[k][l];
                sum_ += MatrixA[j][l+1] * MatrixB[k][l+1];
            }
            if (l < i)  /* odd i: pick up the last element */
                sum += MatrixA[j][l] * MatrixB[k][l];
            MatrixR[j][k] = sum + sum_;
        }
    }
    c++;
} while (c < iteraciones);

Generally speaking, transposing B should end up being much faster than the naive implementation, but at the expense of wasting another NxN worth of memory. I just spent a week digging around matrix multiplication optimization, and so far the absolute hands-down winner is this:
for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            if (likely(k)) /* #define likely(x) __builtin_expect(!!(x), 1) */
                C[i][j] += A[i][k] * B[k][j];
            else
                C[i][j] = A[i][k] * B[k][j];
This is even better than Drepper's method mentioned in an earlier comment, as it works optimally regardless of the cache properties of the underlying CPU. The trick lies in reordering the loops so that all three matrices are accessed in row-major order.
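If C can be zeroed up front, the branch disappears entirely. A small sketch assuming C99 variable-length array parameters (the function name is mine):
#include <string.h>

void matmul_ikj(int n, double A[n][n], double B[n][n], double C[n][n])
{
    memset(C, 0, sizeof(double) * n * n);   /* clear C once */
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)     /* all three matrices walked row-major */
                C[i][j] += A[i][k] * B[k][j];
}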

If you are working with small matrices, the improvement you mention will be negligible. Performance will also vary depending on the hardware you are running on. But if you are working with matrices in the millions of elements, then it will have an effect.
Coming to the program: could you post the complete code you have written?

Very old question, but here's my current implementation for my OpenGL projects:
typedef float matN[N][N];
inline void matN_mul(matN dest, matN src1, matN src2)
{
    unsigned int i;
    for (i = 0; i < N*N; i++)
    {
        unsigned int row = i / N, col = i % N;
        dest[row][col] = src1[row][0] * src2[0][col] +
                         src1[row][1] * src2[1][col] +
                         ....
                         src1[row][N-1] * src2[N-1][col];
    }
}
Where N is replaced with the size of the matrix. So if you are multiplying 4x4 matrices, then you use:
typedef float mat4[4][4];
inline void mat4_mul(mat4 dest, mat4 src1, mat4 src2)
{
    unsigned int i;
    for (i = 0; i < 16; i++)
    {
        unsigned int row = i / 4, col = i % 4;
        dest[row][col] = src1[row][0] * src2[0][col] +
                         src1[row][1] * src2[1][col] +
                         src1[row][2] * src2[2][col] +
                         src1[row][3] * src2[3][col];
    }
}
This function mainly minimizes loops but the modulus might be taxing... On my computer this function performed roughly 50% faster than a triple for loop multiplication function.
Cons:
Lots of code needed (ex. different functions for mat3 x mat3, mat5 x mat5...)
Tweaks needed for irregular multiplication (ex. mat3 x mat4).....
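To sidestep the modulus concern mentioned above, an equivalent sketch that just spells out the row/column loops (same mat4 typedef assumed; compilers will typically unroll these fixed-count loops anyway):
static inline void mat4_mul_nodiv(mat4 dest, mat4 src1, mat4 src2)
{
    for (unsigned int row = 0; row < 4; row++)
        for (unsigned int col = 0; col < 4; col++)
            dest[row][col] = src1[row][0] * src2[0][col] +
                             src1[row][1] * src2[1][col] +
                             src1[row][2] * src2[2][col] +
                             src1[row][3] * src2[3][col];
}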

This is a very old question but I recently wandered down the rabbit hole and developed 9 different matrix multiplication implementations for both contiguous memory and non-contiguous memory (about 18 different functions). The results are interesting:
https://github.com/cubiclesoft/matrix-multiply
Blocking (aka loop tiling) didn't always produce the best results. In fact, I found that blocking may produce worse results than other algorithms depending on matrix size. Blocking only started doing marginally better than other algorithms somewhere around 1200x1200, performed worse at around 2000x2000, but got better again past that point. This seems to be a common problem with blocking: certain matrix sizes just don't work well. Also, blocking on contiguous memory performed slightly worse than the non-contiguous version. Contrary to common thinking, non-contiguous memory storage also generally outperformed contiguous memory storage. Blocking on contiguous memory also performed worse than an optimized straight pointer math version.
I'm sure someone will point out areas of optimization that I missed or overlooked, but the general conclusion is that blocking/loop tiling may do slightly better, do slightly worse (smaller matrices), or do much worse. Blocking adds a lot of complexity to the code for largely inconsequential gains and a non-smooth, wacky performance curve that's all over the place.
In my opinion, while it isn't the fastest implementation of the nine options I developed and tested, Implementation 6 has the best balance between code length, code readability, and performance:
void MatrixMultiply_NonContiguous_6(double **C, double **A, double **B, size_t A_rows, size_t A_cols, size_t B_cols)
{
    double tmpa;

    for (size_t i = 0; i < A_rows; i++)
    {
        tmpa = A[i][0];
        for (size_t j = 0; j < B_cols; j++)
        {
            C[i][j] = tmpa * B[0][j];
        }

        for (size_t k = 1; k < A_cols; k++)
        {
            tmpa = A[i][k];
            for (size_t j = 0; j < B_cols; j++)
            {
                C[i][j] += tmpa * B[k][j];
            }
        }
    }
}

void MatrixMultiply_Contiguous_6(double *C, double *A, double *B, size_t A_rows, size_t A_cols, size_t B_cols)
{
    double tmpa;

    for (size_t i = 0; i < A_rows; i++)
    {
        tmpa = A[i * A_cols];
        for (size_t j = 0; j < B_cols; j++)
        {
            C[i * B_cols + j] = tmpa * B[j];
        }

        for (size_t k = 1; k < A_cols; k++)
        {
            tmpa = A[i * A_cols + k];
            for (size_t j = 0; j < B_cols; j++)
            {
                C[i * B_cols + j] += tmpa * B[k * B_cols + j];
            }
        }
    }
}
Simply swapping j and k (Implementation 3) does a lot all on its own, but the small adjustments of using a temporary variable for A and removing the if conditional notably improve performance over Implementation 3.
Here are the implementations (copied verbatim from the linked repository):
Implementation 1 - The classic naive implementation. Also the slowest. Good for showing the baseline worst case and validating the other implementations. Not so great for actual, real world usage.
Implementation 2 - Uses a temporary variable for matrix C which might end up using a CPU register to do the addition.
Implementation 3 - Swaps the j and k loops from Implementation 1. The result is a bit more CPU cache friendly but adds a comparison per loop and the temporary from Implementation 2 is lost.
Implementation 4 - The temporary variable makes a comeback but this time on one of the operands (matrix A) instead of the assignment.
Implementation 5 - Move the conditional outside the innermost for loop. Now we have two inner for-loops.
Implementation 6 - Remove conditional altogether. This implementation arguably offers the best balance between code length, code readability, and performance. That is, both contiguous and non-contiguous functions are short, easy to understand, and faster than the earlier implementations. It is good enough that the next three Implementations use it as their starting point.
Implementation 7 - Precalculate base row start for the contiguous memory implementation. The non-contiguous version remains identical to Implementation 6. However, the performance gains are negligible over Implementation 6.
Implementation 8 - Sacrifice a LOT of code readability to use simple pointer math. The result completely removes all array access multiplication. Variant of Implementation 6. The contiguous version performs better than Implementation 9. The non-contiguous version performs about the same as Implementation 6.
Implementation 9 - Return to the readable code of Implementation 6 to implement a blocking algorithm. Processing small blocks at a time allows larger arrays to stay in the CPU cache during inner loop processing for a small increase in performance at around 1200x1200 array sizes but also results in a wacky performance graph that shows it can actually perform far worse than other Implementations.

Related

Would it be faster to use a for loop or list out the operations?

I'm working on code to do matrix operations for a satellite my school is making. Would it be faster and less resource intensive to use a for loop or to just write out the operations? All matrices are of a known size.
for (i = 0; i < 3; i++)        // Row
{
    for (j = 0; j < 3; j++)    // Column
    {
        result[i][j] = a[i][j] * b;
    }
}
or
result[1][1] = a[1][1] * b;
result[1][2] = a[1][2] * b;
etc...
You are talking about loop unrolling. You're right, it is a common technique for decreasing the computing time of a program. However, as has been said in the comments, it is not guaranteed that you will save time, because it depends on many factors (compiler, compiler optimisation level, etc.). It is also possible that the compiler will unroll the loops itself if you choose a high optimisation level.
Don't forget it takes more code size, which is also a valuable resource.
Keep in mind there are other ways to optimize code. For example, here you multiply all elements of an array by the same variable. Perhaps you can do this multiplication later in the code, when you next access the result array? That would save traversing the array, with all the memory accesses it implies.

Matrix multiplication in 2 different ways (comparing time)

I've got an assignment - compare two matrix multiplications, the default way and multiplication after transposing the second matrix - and we should point out which method is faster. I've written something like the code below, but time and time2 are nearly equal to each other. In one run the first method is faster, and in another run with the same matrix size the second method is faster. Is something done wrong? Should I change something in my code?
clock_t start = clock();
int sum;
for(int i=0; i<size; ++i) {
    for(int j=0; j<size; ++j) {
        sum = 0;
        for(int k=0; k<size; ++k) {
            sum = sum + (m1[i][k] * m2[k][j]);
        }
        score[i][j] = sum;
    }
}
clock_t end = clock();
double time = (end-start)/(double)CLOCKS_PER_SEC;

for(int i=0; i<size; ++i) {
    for(int j=0; j<size; ++j) {
        int temp = m2[i][j];
        m2[i][j] = m2[j][i];
        m2[j][i] = temp;
    }
}

clock_t start2 = clock();
int sum2;
for(int i=0; i<size; ++i) {
    for(int j=0; j<size; ++j) {
        sum2 = 0;
        for(int k=0; k<size; ++k) {
            sum2 = sum2 + (m1[k][i] * m2[k][j]);
        }
        score[i][j] = sum2;
    }
}
clock_t end2 = clock();
double time2 = (end2-start2)/(double)CLOCKS_PER_SEC;
You have multiple severe issues with your code and/or your understanding. Let me try to explain.
Matrix multiplication is bottlenecked by the rate at which the processor can load and store the values to memory. Most current architectures use cache to help with this. Data is moved from memory to cache and from cache to memory in blocks. To maximize the benefit of caching, you want to make sure you will use all the data in that block. To do that, you make sure you access the data sequentially in memory.
In C, multi-dimensional arrays are specified in row-major order. It means that the rightmost index is consecutive in memory; i.e. that a[i][k] and a[i][k+1] are consecutive in memory.
Depending on the architecture, the time it takes for the processor to wait (and do nothing) for the data to be moved from RAM to cache (and vice versa), may or may not be included in the CPU time (that e.g. clock() measures, albeit at a very poor resolution). For this kind of measurement ("microbenchmark"), it is much better to measure and report both CPU and real (or wall clock) time used; especially so if the microbenchmark is run on different machines, to get a better idea of the practical impact of the change.
There will be a lot of variation, so typically, you measure the time taken by a few hundred repeats (each repeat possibly making more than one operation; enough to be easily measured), storing the duration of each, and report their median. Why median, and not minimum, maximum, average? Because there will always be occasional glitches (unreasonable measurement due to an external event, or something), which typically yield a much higher value than normal; this makes the maximum uninteresting, and skews the average (mean) unless removed. The minimum is typically an over-optimistic case, where everything just happened to go perfectly; that rarely occurs in practice, so is only a curiosity, not of practical interest. The median time, on the other hand, gives you a practical measurement: you can expect 50% of all runs of your test case to take no more than the median time measured.
On POSIXy systems (Linux, Mac, BSDs), you should use clock_gettime() to measure the time. The struct timespec format has nanosecond precision (1 second = 1,000,000,000 nanoseconds), but the resolution may be coarser (i.e., the clocks may advance by more than 1 nanosecond at a time). I personally use
#define _POSIX_C_SOURCE 200809L
#include <time.h>

static struct timespec cpu_start, wall_start;
double cpu_seconds, wall_seconds;

void timing_start(void)
{
    clock_gettime(CLOCK_REALTIME, &wall_start);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_start);
}

void timing_stop(void)
{
    struct timespec cpu_end, wall_end;
    clock_gettime(CLOCK_REALTIME, &wall_end);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_end);
    wall_seconds = (double)(wall_end.tv_sec - wall_start.tv_sec)
                 + (double)(wall_end.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
    cpu_seconds = (double)(cpu_end.tv_sec - cpu_start.tv_sec)
                + (double)(cpu_end.tv_nsec - cpu_start.tv_nsec) / 1000000000.0;
}
You call timing_start() before the operation, and timing_stop() after the operation; then, cpu_seconds contains the amount of CPU time taken and wall_seconds the real wall clock time taken (both in seconds, use e.g. %.9f to print all meaningful decimals).
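A minimal usage sketch for the helpers above (the multiply call and the size variable are placeholders for your own code):
timing_start();
multiply(score, m1, m2, size);   /* the operation being measured */
timing_stop();
printf("CPU:  %.9f s\n", cpu_seconds);
printf("Wall: %.9f s\n", wall_seconds);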
The above won't work on Windows, because Microsoft does not want your C code to be portable to other systems; it prefers to develop its own "standard" instead. (Those C11 "safe" _s() I/O function variants are a stupid sham, compared to e.g. POSIX getline(), or the state of wide character support on all systems except Windows.)
Matrix multiplication is
c[r][c] = a[r][0] * b[0][c]
+ a[r][1] * b[1][c]
: :
+ a[r][L] * b[L][c]
where a has L+1 columns, and b has L+1 rows.
In order to make the summation loop use consecutive elements, we need to transpose b. If B[c][r] = b[r][c], then
c[r][c] = a[r][0] * B[c][0]
+ a[r][1] * B[c][1]
: :
+ a[r][L] * B[c][L]
Note that it suffices that a and B are consecutive in memory, but separate (possibly "far" away from each other), for the processor to utilize cache efficiently in such cases.
OP uses a simple loop, similar to the following pseudocode, to transpose b:
For r in rows:
For c in columns:
temporary = b[r][c]
b[r][c] = b[c][r]
b[c][r] = temporary
End For
End For
The problem above is that each element participates in a swap twice. For example, if b has 10 rows and columns, r = 3, c = 5 swaps b[3][5] and b[5][3], but then later, r = 5, c = 3 swaps b[5][3] and b[3][5] again! Essentially, the double loop ends up restoring the matrix to the original order; it does not do a transpose.
Consider the following entries and the actual transpose:
b[0][0] b[0][1] b[0][2] b[0][0] b[1][0] b[2][0]
b[1][0] b[1][1] b[1][2] ⇔ b[0][1] b[1][1] b[2][1]
b[2][0] b[2][1] b[2][2] b[0][2] b[1][2] b[2][2]
The diagonal entries are not swapped. You only need to do the swap in the upper triangular portion (where c > r) or in the lower triangular portion (where r > c), to swap all entries, because each swap swaps one entry from the upper triangular to the lower triangular, and vice versa.
So, to recap:
Is something done wrong?
Yes. Your transpose does nothing. You haven't understood the reason why one would want to transpose the second matrix. Your time measurement relies on a low-precision CPU time, which may not reflect the time taken by moving data between RAM and CPU cache. In the second test case, with m2 "transposed" (except it isn't, because you swap each element pair twice, returning them back to the way they were), your innermost loop is over the leftmost array index, which means it calculates the wrong result. (Moreover, because consecutive iterations of the innermost loop accesses items far from each other in memory, it is anti-optimized: it uses the pattern that is worst in terms of speed.)
All the above may sound harsh, but it isn't intended to be, at all. I do not know you, and I am not trying to evaluate you; I am only pointing out the errors in this particular answer, in your current understanding, and only in the hopes that it helps you, and anyone else encountering this question in similar circumstances, to learn.

Cache Performance (concerning loops) in C

I was wondering, why does one set of loops allow for better cache performance than another in spite of logically doing the same thing?
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        accum = 0.0;
        for (k = 0; k < n; k++) {
            accum += b[j][k] * a[k][i];
        }
        c[j][i] = accum;
    }
}
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        val = b[j][k];
        for (i = 0; i < n; i++) {
            c[j][i] += val * a[k][i];
        }
    }
}
I believe the first one above delivers better cache performance, but why?
Also, when we increase block size, but keep cache size and associativity constant, does it influence the miss rate? At a certain point increasing block size can cause a higher miss rate, right?
Just generally speaking, the most efficient loops through a matrix are going to cycle through the last dimension, not the first ("last" being c in m[a][b][c]).
For example, given a 2D matrix like an image which has its pixels represented in memory from top-left to bottom-right, the quickest way to sequentially iterate through it is going to be horizontally across each scanline, like so:
for (int y=0; y < h; ++y) {
    for (int x=0; x < w; ++x)
        // access pixel[y][x]
}
... not like this:
for (int x=0; x < w; ++x) {
    for (int y=0; y < h; ++y)
        // access pixel[y][x]
}
... due to spatial locality. It's because the computer grabs memory from slower, bigger regions of the hierarchy and moves it to faster, smaller regions in large, aligned chunks (e.g., 64-byte cache lines, 4-kilobyte pages, and down to a teeny 64-bit general-purpose register). The first example accesses all the data from such a contiguous chunk immediately and prior to eviction.
harold on this site gave me a nice view on how to look at and explain this subject by suggesting not to focus so much on cache misses, but instead focusing on striving to use all the data in a cache prior to eviction. The second example fails to do that for all but the most trivially-small images by iterating through the image vertically with a large, scanline-sized stride rather than horizontally with a small, pixel-sized one.
Also, when we increase block size, but keep cache size and associativity constant, does it influence the miss rate? At a certain point increasing block size can cause a higher miss rate, right?
The answer here would be "yes", as an increase in block size would naturally equate to more compulsory misses (that would be more simply "misses" though rather than "miss rate") but also just more data to process which won't all necessarily fit into the fastest L1 cache. If we're accessing a large amount of data with a large stride, we end up getting a higher non-compulsory miss rate as a result of more data being evicted from the cache before we utilize it, only to then redundantly load it back into a faster cache.
There is also a case where, if the block size is small enough and aligned properly, all the data will just fit into a single cache line and it wouldn't matter so much how we sequentially access it.
Matrix Multiplication
Now your example is quite a bit more complex than this straightforward image example above, but the same concepts tend to apply.
Let's look at the first one:
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        accum = 0.0;
        for (k = 0; k < n; k++)
            accum += b[j][k] * a[k][i];
        c[j][i] = accum;
    }
}
If we look at the innermost k loop, we access b[j][k]. That's a fairly optimal access pattern: "horizontal" if we imagine a row-order memory layout. However, we also access a[k][i]. That's not so optimal, especially for a very large matrix, as it's accessing memory in a vertical pattern with a large stride and will tend to suffer from data being evicted from the fastest but smallest forms of memory before it is used, only to load that chunk of data again redundantly.
If we look at the second j loop, that's accessing c[j][i], again in a vertical fashion which is not so optimal.
Now let's have a glance at the second example:
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        val = b[j][k];
        for (i = 0; i < n; i++)
            c[j][i] += val * a[k][i];
    }
}
If we look at the second k loop in this case, it's starting off accessing b[j][k] which is optimal (horizontal). Furthermore, it's explicitly memoizing the value to val, which might improve the odds of the compiler moving that to a register and keeping it there for the following loop (this relates to compiler concepts related to aliasing, however, rather than CPU cache).
In the innermost i loop, we're accessing c[j][i] which is also optimal (horizontal) along with a[k][i] which is also optimal (horizontal).
So this second version is likely to be more efficient in practice. Note that we can't absolutely say that, as aggressive optimizing compilers can do all sorts of magical things like rearranging and unrolling loops for you. Yet short of that, we should be able to say the second one has higher odds of being more efficient.
"What's a profiler?"
I just noticed this question in the comments. A profiler is a measuring tool that can give you a precise breakdown of where time is spent in your code, along with possibly further statistics like cache misses and branch mispredictions.
It's not only good for optimizing real-world production code and helping you more effectively prioritize your efforts to places that really matter, but it can also accelerate the learning process of understanding why inefficiencies exist through the process of chasing one hotspot after another.
Loop Tiling/Blocking
It's worth mentioning an advanced optimization technique which can be useful for large matrices -- loop tiling/blocking. It's beyond the scope of this subject but that one plays to temporal locality.
Deep C Optimization
Hopefully later you will be able to C these things clearly as a deep C explorer. While most optimization is best saved for hindsight with a profiler in hand, it's useful to know the basics of how the memory hierarchy works as you go deeper and deeper exploring the C.

What memory access patterns are most efficient for outer-product-type double loops?

What access patterns are most efficient for writing cache-efficient outer-product type code that maximally exploits data locality?
Consider a block of code for processing all pairs of elements of two arrays such as:
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        out[i*M + j] = X[i] binary-op Y[j];
This is a standard vector-vector outer product when binary-op is scalar multiplication and X and Y are 1d, but this same pattern is also matrix multiplication when X and Y are matrices and binary-op is a dot product between the ith row and j-th column of two matrices.
For matrix multiplication, I know optimized BLASs like OpenBLAS and MKL can get much higher performance than you get from the double loop style code above, because they process the elements in chunks in such a way as to exploit the CPU cache much more. Unfortunately, OpenBLAS kernels are written in assembly so it's pretty difficult to figure out what's going on.
Are there any good "tricks of the trade" for re-organizing these types of double loops to improve cache performance?
Since each element of out is only hit once, we're clearly free to reorder the iterations. The straight linear traversal of out is the easiest to write, but I don't think it's the most efficient pattern to execute, since you don't exploit any locality in X.
I'm especially interested in the setting where M and N are large, and the size of each element (X[i], and Y[j]) is pretty small (like O(1) bytes), so we're talking about something analogous to a vector-vector outer product or the multiplication of a tall and skinny matrix by a short and fat matrix (e.g. N x D by D x M where D is small).
For large enough M, the Y vector will exceed the L1 cache size.* Thus on every new outer iteration, you'll be reloading Y from main memory (or at least, a slower cache). In other words, you won't be exploiting temporal locality in Y.
You should block up your accesses to Y; something like this:
for (jj = 0; jj < M; jj += CACHE_SIZE) {                    // Iterate over blocks
    for (i = 0; i < N; i++) {
        for (j = jj; j < jj + CACHE_SIZE && j < M; j++) {   // Iterate within block
            out[i*M + j] = X[i] * Y[j];
        }
    }
}
The above doesn't do anything smart with accesses to X, but new values are only being accessed 1/CACHE_SIZE as often, so the impact is probably negligible.
* If everything is small enough to already fit in cache, then you can't do better than what you already have (vectorisation opportunities notwithstanding).

Why does the order of loops in a matrix multiply algorithm affect performance? [duplicate]

I am given two functions for finding the product of two matrices:
void MultiplyMatrices_1(int **a, int **b, int **c, int n){
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i][j] = c[i][j] + a[i][k]*b[k][j];
}

void MultiplyMatrices_2(int **a, int **b, int **c, int n){
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                c[i][j] = c[i][j] + a[i][k]*b[k][j];
}
I ran and profiled two executables using gprof, each with identical code except for this function. The second of these is significantly (about 5 times) faster for matrices of size 2048 x 2048. Any ideas as to why?
I believe that what you're looking at is the effects of locality of reference in the computer's memory hierarchy.
Typically, computer memory is segregated into different types that have different performance characteristics (this is often called the memory hierarchy). The fastest memory is in the processor's registers, which can (usually) be accessed and read in a single clock cycle. However, there are usually only a handful of these registers (usually no more than 1KB). The computer's main memory, on the other hand, is huge (say, 8GB), but is much slower to access.
In order to improve performance, the computer is usually physically constructed to have several levels of caches in-between the processor and main memory. These caches are slower than registers but much faster than main memory, so if you do a memory access that looks something up in the cache it tends to be a lot faster than if you have to go to main memory (typically, between 5-25x faster). When accessing memory, the processor first checks the memory cache for that value before going back to main memory to read the value in. If you consistently access values in the cache, you will end up with much better performance than if you're skipping around memory, randomly accessing values.
Most programs are written in a way where if a single byte in memory is read into memory, the program later reads multiple different values from around that memory region as well. Consequently, these caches are typically designed so that when you read a single value from memory, a block of memory (usually somewhere between 1KB and 1MB) of values around that single value is also pulled into the cache. That way, if your program reads the nearby values, they're already in the cache and you don't have to go to main memory.
Now, one last detail - in C/C++, arrays are stored in row-major order, which means that all of the values in a single row of a matrix are stored next to each other. Thus in memory the array looks like the first row, then the second row, then the third row, etc.
Given this, let's look at your code. The first version looks like this:
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            c[i][j] = c[i][j] + a[i][k]*b[k][j];
Now, let's look at that innermost line of code. On each iteration, the value of k is increasing. This means that when running the innermost loop, each iteration of the loop is likely to have a cache miss when loading the value of b[k][j]. The reason for this is that because the matrix is stored in row-major order, each time you increment k, you're skipping over an entire row of the matrix and jumping much further into memory, possibly far past the values you've cached. However, you don't have a miss when looking up c[i][j] (since i and j are the same), nor will you probably miss a[i][k], because the values are in row-major order and if the value of a[i][k] is cached from the previous iteration, the value of a[i][k] read on this iteration is from an adjacent memory location. Consequently, on each iteration of the innermost loop, you are likely to have one cache miss.
But consider this second version:
for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
            c[i][j] = c[i][j] + a[i][k]*b[k][j];
Now, since you're increasing j on each iteration, let's think about how many cache misses you'll likely have on the innermost statement. Because the values are in row-major order, the value of c[i][j] is likely to be in-cache, because the value of c[i][j] from the previous iteration is likely cached as well and ready to be read. Similarly, b[k][j] is probably cached, and since i and k aren't changing, chances are a[i][k] is cached as well. This means that on each iteration of the inner loop, you're likely to have no cache misses.
Overall, this means that the second version of the code is unlikely to have cache misses on each iteration of the loop, while the first version almost certainly will. Consequently, the second loop is likely to be faster than the first, as you've seen.
Interestingly, many compilers are starting to have prototype support for detecting that the second version of the code is faster than the first. Some will try to automatically rewrite the code to maximize parallelism. If you have a copy of the Purple Dragon Book, Chapter 11 discusses how these compilers work.
Additionally, you can optimize the performance of this loop even further using more complex loops. A technique called blocking, for example, can be used to notably increase performance by splitting the array into subregions that can be held in cache longer, then using multiple operations on these blocks to compute the overall result.
Hope this helps!
This may well be the memory locality. When you reorder the loop, the memory that's needed in the inner-most loop is nearer and can be cached, while in the inefficient version you need to access memory from the entire data set.
The way to test this hypothesis is to run a cache debugger (like cachegrind) on the two pieces of code and see how many cache misses they incur.
Apart from locality of memory there is also compiler optimisation. A key one for vector and matrix operations is loop unrolling.
for (int k = 0; k < n; k++)
    c[i][j] = c[i][j] + a[i][k]*b[k][j];
You can see in this inner loop i and j do not change. This means it can be rewritten as
for (int k = 0; k < n; k += 4) {
    int *aik = &a[i][k];
    c[i][j] += aik[0]*b[k][j]
             + aik[1]*b[k+1][j]
             + aik[2]*b[k+2][j]
             + aik[3]*b[k+3][j];
}
You can see there will be
four times fewer loops and accesses to c[i][j]
a[i][k] is being accessed continuously in memory
the memory accesses and multiplies can be pipelined (almost concurrently) in the CPU.
What if n is not a multiple of 4 or 6 or 8? (or whatever the compiler decides to unroll it to) The compiler handles this tidy up for you. ;)
To speed this up further, you could try transposing the b matrix first. This is a little extra work and coding, but it means that accesses to the transposed b are also contiguous in memory (as you are swapping [k] with [j]).
Another thing you can do to improve performance is to multi-thread the multiplication. This can improve performance by a factor of 3 on a 4 core CPU.
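A sketch of that multi-threading suggestion using OpenMP (compile with -fopenmp on GCC/Clang); rows of c are independent, so the outer loop can be split across threads. It assumes c is zero-initialized by the caller, as in the original functions:
void MultiplyMatrices_2_parallel(int **a, int **b, int **c, int n){
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                c[i][j] = c[i][j] + a[i][k]*b[k][j];
}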
Lastly, you might consider using float or double. You might think int would be faster, however that is not always the case, as floating point operations can be more heavily optimised (both in hardware and by the compiler).
The second example has c[i][j] changing on each iteration of the inner loop, which makes it harder to optimise.
Probably the second one has to skip around in memory more to access the array elements. It might be something else, too -- you could check the compiled code to see what is actually happening.
