Cache Performance (concerning loops) in C

I was wondering, why does one set of loops allow for better cache performance than another in spite of logically doing the same thing?
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        accum = 0.0;
        for (k = 0; k < n; k++) {
            accum += b[j][k] * a[k][i];
        }
        c[j][i] = accum;
    }
}
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        val = b[j][k];
        for (i = 0; i < n; i++) {
            c[j][i] += val * a[k][i];
        }
    }
}
I believe the first one above delivers better cache performance, but why?
Also, when we increase block size, but keep cache size and associativity constant, does it influence the miss rate? At a certain point increasing block size can cause a higher miss rate, right?

Just generally speaking, the most efficient loops through a matrix are going to cycle through the last dimension in the innermost loop, not the first ("last" being c in m[a][b][c]).
For example, given a 2D matrix like an image which has its pixels represented in memory from top-left to bottom-right, the quickest way to sequentially iterate through it is going to be horizontally across each scanline, like so:
for (int y=0; y < h; ++y) {
    for (int x=0; x < w; ++x)
        // access pixel[y][x]
}
... not like this:
for (int x=0; x < w; ++x) {
    for (int y=0; y < h; ++y)
        // access pixel[y][x]
}
... due to spatial locality. It's because the computer grabs memory from slower, bigger regions of the hierarchy and moves it to faster, smaller regions in large, aligned chunks (e.g. 64-byte cache lines, 4-kilobyte pages, and all the way down to a teeny 64-bit general-purpose register). The first example accesses all the data from such a contiguous chunk immediately and prior to eviction.
harold on this site gave me a nice way to look at and explain this subject: don't focus so much on cache misses, but instead strive to use all the data in a cache prior to eviction. The second example fails to do that for all but the most trivially small images, because it iterates through the image vertically with a large, scanline-sized stride rather than horizontally with a small, pixel-sized one.
Also, when we increase block size, but keep cache size and associativity constant, does it influence the miss rate? At a certain point increasing block size can cause a higher miss rate, right?
The answer here would be "yes". With the cache size and associativity held constant, increasing the block size means fewer blocks fit in the cache at once, and each miss drags in more data which won't all necessarily fit into the fastest L1 cache. If we're accessing a large amount of data with a large stride, we only touch a small part of each block before it gets evicted, so past a certain point the miss rate goes back up: data is evicted from the cache before we utilize it, only for us to redundantly load it back into a faster cache later.
There is also the case where, if the data we're working on is small enough and aligned properly, it all fits into a single cache line, and then it doesn't matter so much in what order we access it.
Matrix Multiplication
Now your example is quite a bit more complex than this straightforward image example above, but the same concepts tend to apply.
Let's look at the first one:
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        accum = 0.0;
        for (k = 0; k < n; k++)
            accum += b[j][k] * a[k][i];
        c[j][i] = accum;
    }
}
If we look at the innermost k loop, we access b[j][k]. That's a fairly optimal access pattern: "horizontal" if we imagine a row-order memory layout. However, we also access a[k][i]. That's not so optimal, especially for a very large matrix, as it's accessing memory in a vertical pattern with a large stride and will tend to suffer from data being evicted from the fastest but smallest forms of memory before it is used, only to load that chunk of data again redundantly.
If we look at the middle j loop, it's writing c[j][i], again in a vertical fashion, which is not so optimal.
Now let's have a glance at the second example:
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        val = b[j][k];
        for (i = 0; i < n; i++)
            c[j][i] += val * a[k][i];
    }
}
If we look at the middle k loop in this case, it starts off accessing b[j][k], which is optimal (horizontal). Furthermore, it explicitly hoists the value into val, which improves the odds of the compiler keeping it in a register for the following loop (this relates to compiler concerns about aliasing, though, rather than to the CPU cache).
In the innermost i loop, we're accessing c[j][i] which is also optimal (horizontal) along with a[k][i] which is also optimal (horizontal).
So this second version is likely to be more efficient in practice. Note that we can't absolutely say that, as aggressive optimizing compilers can do all sorts of magical things like rearranging and unrolling loops for you. Yet short of that, we should be able to say the second one has higher odds of being more efficient.
"What's a profiler?"
I just noticed this question in the comments. A profiler is a measuring tool that can give you a precise breakdown of where time is spent in your code, along with possibly further statistics like cache misses and branch mispredictions.
It's not only good for optimizing real-world production code and helping you more effectively prioritize your efforts to places that really matter, but it can also accelerate the learning process of understanding why inefficiencies exist through the process of chasing one hotspot after another.
Loop Tiling/Blocking
It's worth mentioning an advanced optimization technique which can be useful for large matrices -- loop tiling/blocking. It's beyond the scope of this subject but that one plays to temporal locality.
Deep C Optimization
Hopefully later you will be able to C these things clearly as a deep C explorer. While most optimization is best saved for hindsight with a profiler in hand, it's useful to know the basics of how the memory hierarchy works as you go deeper and deeper exploring the C.

Related

How to loop through blocks of pixels with minimum number of for loops

I have an image of width * height pixels in which I want to loop through blocks of pixels, say a block size of 10 * 10. How can I do this with the minimum number of loops?
I have tried first looping through each column, then through each row, taking the starting x and y positions from these two outer loops. Then the inner loops go from the start position of the block up to the block size and manipulate the pixels. This consumes four nested loops.
for (int i = 0; i < Width; i += Block_Size) {
    for (int j = 0; j < Height; j += Block_Size) {
        for (int x = i; x < i + Block_Size; x++) {
            for (int y = j; y < j + Block_Size; y++) {
                //Get pixel values within the block
            }
        }
    }
}
How can I do this with the minimum number of loops?
You can reduce the number of loops by completely unrolling as many loop levels as you like. For fixed raster dimensions, you could unroll them all, yielding a (probably lengthy) implementation with zero loops. For known Block_Size you can unroll one or both of the inner loops regardless of whether the overall dimensions are known, yielding as few as two loops remaining.
But why would you consider such a thing? The question seems to assume that there would be some kind of inherent advantage to reducing the depth of the loop nest, but that's not necessarily true, and whatever effect there might be is likely to be small.
I'm inclined to guess that you've studied a bit of computational complexity theory, and taken away the idea that deep loop nests necessarily yield poorly-scaling performance, or even that deep loop nests have inherently poor performance, period. These are misconceptions, albeit relatively common ones, and they anyway look at the problem backwards.
The primary consideration in how the performance of your loop nest scales is how many times the body of the innermost loop (//Get pixel values within the block) is executed. You'll have roughly the same performance for any reasonable approach that causes it to be executed exactly once for every pixel in the raster, regardless of how many loops are involved. With that being the case, code clarity should be your goal, and your original four-loop nest is pretty clear.
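That said, if you really do want fewer loops for their own sake, one option (distinct from unrolling) is to fold the block coordinates and the intra-block offsets into single linear indices. This is only a sketch, assuming Width and Height are exact multiples of Block_Size, and the divisions and modulos it introduces are unlikely to make it any faster than the four-loop version:

/* Two loops: b walks the blocks, p walks the pixels within a block. */
int blocksPerRow = Width / Block_Size;
int blocksPerCol = Height / Block_Size;

for (int b = 0; b < blocksPerRow * blocksPerCol; b++) {
    int blockX = (b % blocksPerRow) * Block_Size;
    int blockY = (b / blocksPerRow) * Block_Size;
    for (int p = 0; p < Block_Size * Block_Size; p++) {
        int x = blockX + p % Block_Size;
        int y = blockY + p / Block_Size;
        // Get pixel values within the block, e.g. pixel[y][x]
    }
}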
It is possible to achieve this with three loops, but in order to do that you will need to store information about where each block of pixels starts and how many blocks of pixels there are in total!
Independently of that, both the width and the height of the image have to be multiples of your Block_Size.
Here is how it is possible with three loops:
int numberOfBlocks = x;
int pixelBlockStartingPoints[numberOfBlocks] = { startingPoint1, startingPoint2, ... };

for (int i = 0; i < numberOfBlocks; i++) {
    for (int j = pixelBlockStartingPoints[i]; j < pixelBlockStartingPoints[i] + Block_Size; j++) {
        for (int k = pixelBlockStartingPoints[i]; k < pixelBlockStartingPoints[i] + Block_Size; k++) {
            // Get Pixel-Data
        }
    }
}

What memory access patterns are most efficient for outer-product-type double loops?

What access patterns are most efficient for writing cache-efficient outer-product-type code that maximally exploits data locality?
Consider a block of code for processing all pairs of elements of two arrays such as:
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        out[i*M + j] = X[i] binary-op Y[j];
This is a standard vector-vector outer product when binary-op is scalar multiplication and X and Y are 1d, but this same pattern is also matrix multiplication when X and Y are matrices and binary-op is a dot product between the ith row and j-th column of two matrices.
For matrix multiplication, I know optimized BLASs like OpenBLAS and MKL can get much higher performance than you get from the double loop style code above, because they process the elements in chunks in such a way as to exploit the CPU cache much more. Unfortunately, OpenBLAS kernels are written in assembly so it's pretty difficult to figure out what's going on.
Are there any good "tricks of the trade" for re-organizing these types of double loops to improve cache performance?
Since each element of out is only hit once, we're clearly free to reorder the iterations. The straight linear traversal of out is the easiest to write, but I don't think it's the most efficient pattern to execute, since you don't exploit any locality in X.
I'm especially interested in the setting where M and N are large, and the size of each element (X[i] and Y[j]) is pretty small (like O(1) bytes), so we're talking about something analogous to a vector-vector outer product or the multiplication of a tall and skinny matrix by a short and fat matrix (e.g. N x D by D x M where D is small).
For large enough M, the Y vector will exceed the L1 cache size.* Thus on every new outer iteration, you'll be reloading Y from main memory (or at least, a slower cache). In other words, you won't be exploiting temporal locality in Y.
You should block up your accesses to Y; something like this:
for (jj = 0; jj < M; jj += CACHE_SIZE) { // Iterate over blocks of Y (assumes M is a multiple of CACHE_SIZE)
    for (i = 0; i < N; i++) {
        for (j = jj; j < (jj + CACHE_SIZE); j++) { // Iterate within block
            out[i*M + j] = X[i] * Y[j];
        }
    }
}
The above doesn't do anything smart with accesses to X, but new values are only being accessed 1/CACHE_SIZE as often, so the impact is probably negligible.
* If everything is small enough to already fit in cache, then you can't do better than what you already have (vectorisation opportunities notwithstanding).
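If you also want to keep a block of X hot (for instance when X[i] and Y[j] are really rows and columns of matrices, as in the matrix-multiplication reading of the question), the same idea extends to blocking both loops. A sketch, where X_BLOCK and Y_BLOCK are made-up tuning constants and N and M are assumed to be multiples of them; for the plain scalar outer product this buys little beyond the single-level blocking above:

/* Tile both dimensions so one block of X and one block of Y stay cache-resident. */
for (int ii = 0; ii < N; ii += X_BLOCK) {       // blocks of X
    for (int jj = 0; jj < M; jj += Y_BLOCK) {   // blocks of Y
        for (int i = ii; i < ii + X_BLOCK; i++) {
            for (int j = jj; j < jj + Y_BLOCK; j++) {
                out[i*M + j] = X[i] * Y[j];
            }
        }
    }
}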

Why does the order of loops in a matrix multiply algorithm affect performance? [duplicate]

This question already has answers here: Why does the order of the loops affect performance when iterating over a 2D array?
I am given two functions for finding the product of two matrices:
void MultiplyMatrices_1(int **a, int **b, int **c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i][j] = c[i][j] + a[i][k]*b[k][j];
}
void MultiplyMatrices_2(int **a, int **b, int **c, int n) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                c[i][j] = c[i][j] + a[i][k]*b[k][j];
}
I ran and profiled two executables using gprof, each with identical code except for this function. The second of these is significantly (about 5 times) faster for matrices of size 2048 x 2048. Any ideas as to why?
I believe that what you're looking at is the effects of locality of reference in the computer's memory hierarchy.
Typically, computer memory is segregated into different types that have different performance characteristics (this is often called the memory hierarchy). The fastest memory is in the processor's registers, which can (usually) be accessed and read in a single clock cycle. However, there are usually only a handful of these registers (usually no more than 1KB). The computer's main memory, on the other hand, is huge (say, 8GB), but is much slower to access. In order to improve performance, the computer is usually physically constructed to have several levels of caches in-between the processor and main memory. These caches are slower than registers but much faster than main memory, so if you do a memory access that looks something up in the cache it tends to be a lot faster than if you have to go to main memory (typically, between 5-25x faster). When accessing memory, the processor first checks the memory cache for that value before going back to main memory to read the value in. If you consistently access values in the cache, you will end up with much better performance than if you're skipping around memory, randomly accessing values.
Most programs are written in a way where, if a single value is read from memory, the program later reads multiple other values from around that memory region as well. Consequently, these caches are typically designed so that when you read a single value from memory, a block of memory around that value (a cache line, typically something like 64 bytes) is also pulled into the cache. That way, if your program reads the nearby values, they're already in the cache and you don't have to go to main memory.
Now, one last detail - in C/C++, arrays are stored in row-major order, which means that all of the values in a single row of a matrix are stored next to each other. Thus in memory the array looks like the first row, then the second row, then the third row, etc.
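To make the layout concrete, here is a small standalone illustration with a true two-dimensional array (the array and its sizes are made up for the example); all three expressions print the same address, because element (i, j) lives i*4 + j elements past the start:

#include <stdio.h>

int main(void) {
    int a[3][4];  /* row-major: a[0][0..3], then a[1][0..3], then a[2][0..3] */

    printf("%p %p %p\n",
           (void*)&a[1][2],
           (void*)(&a[0][0] + (1*4 + 2)),
           (void*)((char*)a + (1*4 + 2) * sizeof(int)));
    return 0;
}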
Given this, let's look at your code. The first version looks like this:
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            c[i][j] = c[i][j] + a[i][k]*b[k][j];
Now, let's look at that innermost line of code. On each iteration, the value of k is increasing. This means that when running the innermost loop, each iteration is likely to have a cache miss when loading the value of b[k][j]. The reason is that, because the matrix is stored in row-major order, each time you increment k you're skipping over an entire row of the matrix and jumping much further into memory, possibly far past the values you've cached. However, you don't have a miss when looking up c[i][j] (since i and j don't change in this loop), nor will you probably miss on a[i][k], because the values are laid out in row-major order and, if the value of a[i][k] is cached from the previous iteration, the value of a[i][k] read on this iteration is from an adjacent memory location. Consequently, on each iteration of the innermost loop, you are likely to have one cache miss.
But consider this second version:
for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
            c[i][j] = c[i][j] + a[i][k]*b[k][j];
Now, since you're increasing j on each iteration, let's think about how many cache misses you'll likely have on the innermost statement. Because the values are in row-major order, the value of c[i][j] is likely to be in-cache, because the value of c[i][j] from the previous iteration is likely cached as well and ready to be read. Similarly, b[k][j] is probably cached, and since i and k aren't changing, chances are a[i][k] is cached as well. This means that on each iteration of the inner loop, you're likely to have no cache misses.
Overall, this means that the second version of the code is unlikely to have cache misses on each iteration of the loop, while the first version almost certainly will. Consequently, the second loop is likely to be faster than the first, as you've seen.
Interestingly, many compilers are starting to have prototype support for detecting that the second version of the code is faster than the first. Some will try to automatically rewrite the code to maximize parallelism. If you have a copy of the Purple Dragon Book, Chapter 11 discusses how these compilers work.
Additionally, you can optimize the performance of this loop even further using more complex loops. A technique called blocking, for example, can be used to notably increase performance by splitting the array into subregions that can be held in cache longer, then using multiple operations on these blocks to compute the overall result.
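As a rough illustration of the blocking idea (a sketch, not a tuned implementation): the i-k-j loop above can be wrapped in three outer loops that walk TILE x TILE blocks, where TILE is a made-up tuning constant chosen so the working set of three tiles fits in cache; this assumes n is a multiple of TILE and, like the original code, that c starts out zeroed.

#define TILE 64

for (int ii = 0; ii < n; ii += TILE)
    for (int kk = 0; kk < n; kk += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int k = kk; k < kk + TILE; k++)
                    for (int j = jj; j < jj + TILE; j++)
                        c[i][j] = c[i][j] + a[i][k]*b[k][j];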
Hope this helps!
This may well be the memory locality. When you reorder the loop, the memory that's needed in the inner-most loop is nearer and can be cached, while in the inefficient version you need to access memory from the entire data set.
The way to test this hypothesis is to run a cache debugger (like cachegrind) on the two pieces of code and see how many cache misses they incur.
Apart from locality of memory there is also compiler optimisation. A key one for vector and matrix operations is loop unrolling.
for (int k = 0; k < n; k++)
    c[i][j] = c[i][j] + a[i][k]*b[k][j];
You can see in this inner loop i and j do not change. This means it can be rewritten as
for (int k = 0; k < n; k += 4) {
    int * aik = &a[i][k];
    c[i][j] += aik[0]*b[k][j]
             + aik[1]*b[k+1][j]
             + aik[2]*b[k+2][j]
             + aik[3]*b[k+3][j];
}
You can see there will be:
four times fewer loop iterations and accesses to c[i][j]
a[i][k] being accessed contiguously in memory
memory accesses and multiplies that can be pipelined (almost concurrently) in the CPU
What if n is not a multiple of 4 or 6 or 8 (or whatever the compiler decides to unroll it to)? The compiler handles this tidy-up for you. ;)
To speed this solution up further, you could try transposing the b matrix first. This is a little extra work and coding, but it means that accesses to the transposed b are also contiguous in memory (as you are swapping [k] with [j]).
Another thing you can do to improve performance is to multi-thread the multiplication. This can improve performance by a factor of 3 on a 4 core CPU.
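For instance, with OpenMP (a sketch, not part of the original answer; compile with -fopenmp on GCC or Clang), parallelising the outer loop of the cache-friendly version needs only a single pragma, since each thread then writes a disjoint set of rows of c:

#pragma omp parallel for
for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
            c[i][j] = c[i][j] + a[i][k]*b[k][j];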
Lastly, you might consider using float or double. You might think int would be faster, however that is not always the case, as floating-point operations can be more heavily optimised (both in hardware and by the compiler).
The second example has c[i][j] changing on each iteration of the inner loop, which makes it harder to optimise.
Probably the slower one has to skip around in memory more to access the array elements. It might be something else, too -- you could check the compiled code to see what is actually happening.

Performance of memory operations on iPhone

Here's the code that I use to create a differently ordered array:
const unsigned int height = 1536;
const unsigned int width = 2048;

uint32_t* buffer1 = (uint32_t*)malloc(width * height * BPP);
uint32_t* buffer2 = (uint32_t*)malloc(width * height * BPP);

int i = 0;
for (int x = 0; x < width; x++)
    for (int y = 0; y < height; y++)
        buffer1[x+y*width] = buffer2[i++];
Can anyone explain why using the following assignment:
buffer1[i++] = buffer2[x+y*width];
instead of the one in my code take twice as much time?
It's likely down to CPU cache behaviour (at 12MB, your images far exceed the 256KB L2 cache in the ARM Cortex-A8 that's inside an iPhone 3GS).
The first example accesses the reading array in sequential order, which is fast, but has to access the writing array out of order, which is slow.
The second example is the opposite - the writing array is written in fast, sequential order and the reading array is accessed in a slower fashion. Write misses are evidently less costly under this workload than read misses.
Ulrich Drepper's article What Every Programmer Should Know About Memory is recommended reading if you want to know more about this kind of thing.
Note that if you have this operation wrapped up into a function, then you will help the optimiser to generate better code if you use the restrict qualifier on your pointer arguments, like this:
void reorder(uint32_t * restrict buffer1, uint32_t * restrict buffer2)
{
    int i = 0;
    for (int x = 0; x < width; x++)
        for (int y = 0; y < height; y++)
            buffer1[x+y*width] = buffer2[i++];
}
(The restrict qualifier promises the compiler that the data pointed to by the two pointers doesn't overlap - which in this case is necessary for the function to make sense anyway).
Each pixel access in the first has linear locality of reference; the second blows your cache on every read, having to go to main memory for each one.
The processor can handle writes with bad locality much more efficiently than reads: if a write has to go to main memory, it can happen in parallel with other read/arithmetic operations. If a read misses the cache, it can completely stall the processor while it waits for data to filter through the cache hierarchy.
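If you need decent locality on both buffers at once, one option (a sketch, not from the original answers) is to do the reorder in small tiles, so the strided side of the copy stays within a cache-sized working set. TILE is a made-up constant; the given width (2048) and height (1536) are both multiples of 32, and i in the original code equals x*height + y:

enum { TILE = 32 };  /* each tile touches roughly 4 KB of each buffer */

for (unsigned int y0 = 0; y0 < height; y0 += TILE)
    for (unsigned int x0 = 0; x0 < width; x0 += TILE)
        for (unsigned int x = x0; x < x0 + TILE; x++)
            for (unsigned int y = y0; y < y0 + TILE; y++)
                buffer1[x + y*width] = buffer2[x*height + y];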

Optimized matrix multiplication in C

I'm trying to compare different methods for matrix multiplication.
The first one is normal method:
c = 0;
do
{
    for (j = 0; j < i; j++)
    {
        for (k = 0; k < i; k++)
        {
            suma = 0;
            for (l = 0; l < i; l++)
                suma += MatrixA[j][l]*MatrixB[l][k];
            MatrixR[j][k] = suma;
        }
    }
    c++;
} while (c < iteraciones);
The second one consists of transposing matrix B first and then doing the multiplication by rows:
int f, co;
for (f = 0; f < i; f++) {
    for (co = 0; co < i; co++) {
        MatrixB[f][co] = MatrixB[co][f];
    }
}
c = 0;
do
{
    for (j = 0; j < i; j++)
    {
        for (k = 0; k < i; k++)
        {
            suma = 0;
            for (l = 0; l < i; l++)
                suma += MatrixA[j][l]*MatrixB[k][l];
            MatrixR[j][k] = suma;
        }
    }
    c++;
} while (c < iteraciones);
The second method is supposed to be much faster, because we are accessing contiguous memory slots, but I'm not getting a significant improvement in performance. Am I doing something wrong?
I can post the complete code, but I think it is not needed.
What Every Programmer Should Know About Memory (pdf link) by Ulrich Drepper has a lot of good ideas about memory efficiency, but in particular he uses matrix multiplication as an example of how knowing about memory and using that knowledge can speed up this process. Look at appendix A.1 in his paper, and read through section 6.2.1. Table 6.2 in the paper shows that he could get the running time down to 10% of a naive implementation's time for a 1000x1000 matrix.
Granted, his final code is pretty hairy and uses a lot of system-specific stuff and compile-time tuning, but still, if you really need speed, reading that paper and reading his implementation is definitely worth it.
Getting this right is non-trivial. Using an existing BLAS library is highly recommended.
Should you really be inclined to roll your own matrix multiplication, loop tiling is an optimization that is of particular importance for large matrices. The tiling should be tuned to the cache size to ensure that the cache is not being continually thrashed, which will occur with a naive implementation. I once measured a 12x performance difference tiling a matrix multiply with matrix sizes picked to consume multiples of my cache (circa '97 so the cache was probably small).
Loop tiling algorithms assume that a contiguous linear array of elements is used, as opposed to rows or columns of pointers. With such a storage choice, your indexing scheme determines which dimension changes fastest, and you are free to decide whether row or column access will have the best cache performance.
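For example, a single contiguous allocation with manual row-major indexing looks like this (a sketch; the function and variable names are just for illustration). Swapping which index gets multiplied by n flips which dimension is contiguous, which is exactly the freedom referred to above:

#include <stdlib.h>

/* Allocate an n x n matrix as one contiguous block and zero it.
   Element (row, col) lives at A[row*n + col], so col is the fast dimension. */
double *alloc_matrix(size_t n)
{
    double *A = malloc(n * n * sizeof *A);
    if (A != NULL)
        for (size_t idx = 0; idx < n * n; idx++)
            A[idx] = 0.0;
    return A;
}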
There's a lot of literature on the subject. The following references, especially the Banerjee books, may be helpful:
[Ban93] Banerjee, Utpal, Loop Transformations for Restructuring Compilers: the Foundations, Kluwer Academic Publishers, Norwell, MA, 1993.
[Ban94] Banerjee, Utpal, Loop Parallelization, Kluwer Academic Publishers, Norwell, MA, 1994.
[BGS93] Bacon, David F., Susan L. Graham, and Oliver Sharp, Compiler Transformations for High-Performance Computing, Computer Science Division, University of California, Berkeley, Calif., Technical Report No UCB/CSD-93-781.
[LRW91] Lam, Monica S., Edward E. Rothberg, and Michael E Wolf. The Cache Performance and Optimizations of Blocked Algorithms, In 4th International Conference on Architectural Support for Programming Languages, held in Santa Clara, Calif., April, 1991, 63-74.
[LW91] Lam, Monica S., and Michael E Wolf. A Loop Transformation Theory and an Algorithm to Maximize Parallelism, In IEEE Transactions on Parallel and Distributed Systems, 1991, 2(4):452-471.
[PW86] Padua, David A., and Michael J. Wolfe, Advanced Compiler Optimizations for Supercomputers, In Communications of the ACM, 29(12):1184-1201, 1986.
[Wolfe89] Wolfe, Michael J. Optimizing Supercompilers for Supercomputers, The MIT Press, Cambridge, MA, 1989.
[Wolfe96] Wolfe, Michael J., High Performance Compilers for Parallel Computing, Addison-Wesley, CA, 1996.
ATTENTION: You have a BUG in your second implementation
for (f = 0; f < i; f++) {
    for (co = 0; co < i; co++) {
        MatrixB[f][co] = MatrixB[co][f];
    }
}
When you do f=0, co=1
MatrixB[0][1] = MatrixB[1][0];
you overwrite MatrixB[0][1] and lose that value! When the loop gets to f=1, co=0
MatrixB[1][0] = MatrixB[0][1];
the value copied is the same that was already there.
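A corrected in-place transpose swaps each symmetric pair exactly once through a temporary, visiting only the elements above the diagonal (a sketch; the element type is assumed to be int here and should match whatever MatrixB actually holds):

for (f = 0; f < i; f++) {
    for (co = f + 1; co < i; co++) {
        int tmp = MatrixB[f][co];          /* save before overwriting */
        MatrixB[f][co] = MatrixB[co][f];
        MatrixB[co][f] = tmp;
    }
}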
You should not write matrix multiplication yourself. You should depend on external libraries. In particular, you should use the GEMM routine from a BLAS library. GEMM implementations often provide the following optimizations:
Blocking
Efficient Matrix Multiplication relies on blocking your matrix and performing several smaller blocked multiplies. Ideally the size of each block is chosen to fit nicely into cache greatly improving performance.
Tuning
The ideal block size depends on the underlying memory hierarchy (how big is the cache?). As a result libraries should be tuned and compiled for each specific machine. This is done, among others, by the ATLAS implementation of BLAS.
Assembly Level Optimization
Matrix multiplication is so common that developers will optimize it by hand. In particular this is done in GotoBLAS.
Heterogeneous(GPU) Computing
Matrix Multiply is very FLOP/compute intensive, making it an ideal candidate to be run on GPUs. cuBLAS and MAGMA are good candidates for this.
In short, dense linear algebra is a well studied topic. People devote their lives to the improvement of these algorithms. You should use their work; it will make them happy.
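For example, with a CBLAS interface (OpenBLAS, ATLAS and MKL all provide one), an n x n double-precision product C = A*B over contiguous row-major storage looks roughly like this (a sketch; link against the BLAS of your choice, e.g. -lopenblas):

#include <cblas.h>

/* C = 1.0 * A * B + 0.0 * C, all n x n, row-major, contiguous storage */
void matmul_blas(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,
                B, n,
                0.0, C, n);
}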
If the matrix is not large enough or you don't repeat the operations a high number of times you won't see appreciable differences.
If the matrix is, say, 1,000x1,000 you will begin to see improvements, but I would say that if it is below 100x100 you should not worry about it.
Also, any 'improvement' may be of the order of milliseconds, unless you are either working with extremely large matrices or repeating the operation thousands of times.
Finally, if you swap the computer you are using for a faster one, the differences will be even narrower!
How big improvements you get will depend on:
The size of the cache
The size of a cache line
The degree of associativity of the cache
For small matrix sizes and modern processors it's highly probable that the data from both MatrixA and MatrixB will be kept nearly entirely in the cache after you touch it the first time.
Just something for you to try (but this would only make a difference for large matrices): separate out your addition logic from the multiplication logic in the inner loop like so:
for (k = 0; k < i; k++)
{
    int sums[i];  // a C99 variable-length array; otherwise consider this pseudo-code
    for (l = 0; l < i; l++)
        sums[l] = MatrixA[j][l]*MatrixB[k][l];
    int suma = 0;
    for (int s = 0; s < i; s++)
        suma += sums[s];
    MatrixR[j][k] = suma;
}
This is because you end up stalling your pipeline when you write to suma. Granted, much of this is taken care of by register renaming and the like, but with my limited understanding of hardware, if I wanted to squeeze every ounce of performance out of the code, I would do this because now you don't have to stall the pipeline to wait for a write to suma. Since multiplication is more expensive than addition, you want to let the machine parallelize it as much as possible, so saving your stalls for the addition means you spend less time waiting in the addition loop than you would in the multiplication loop.
This is just my logic. Others with more knowledge in the area may disagree.
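A related and more conventional way to shorten the dependency chain on suma, without a temporary array, is to keep two (or more) independent accumulators and combine them after the loop. A sketch, assuming the elements are doubles and i is even:

double suma0 = 0.0, suma1 = 0.0;
for (l = 0; l < i; l += 2) {
    suma0 += MatrixA[j][l]   * MatrixB[k][l];    /* two independent chains,  */
    suma1 += MatrixA[j][l+1] * MatrixB[k][l+1];  /* so the adds can overlap  */
}
MatrixR[j][k] = suma0 + suma1;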
Can you post some data comparing your 2 approaches for a range of matrix sizes? It may be that your expectations are unrealistic and that your 2nd version is faster, but you haven't done the measurements yet.
Don't forget, when measuring execution time, to include the time to transpose matrixB.
Something else you might want to try is comparing the performance of your code with that of the equivalent operation from your BLAS library. This may not answer your question directly, but it will give you a better idea of what you might expect from your code.
The computational complexity of multiplying two N*N matrices is O(N^3). Performance will improve dramatically if you use a lower-complexity algorithm such as Strassen's O(N^2.81) method, which has probably been adopted by MATLAB. If you have MATLAB installed, try multiplying two 1024*1024 matrices. On my computer, MATLAB completes it in 0.7s, but a C/C++ implementation of the naive algorithm like yours takes 20s. If you really care about performance, look into lower-complexity algorithms. I have heard there exists an O(N^2.4) algorithm, however it needs very large matrices before the other overheads become negligible.
Not so special, but better:
c = 0;
do
{
    for (j = 0; j < i; j++)
    {
        for (k = 0; k < i; k++)
        {
            sum = 0; sum_ = 0;
            for (l = 0; l < i; l++) {
                MatrixB[j][k] = MatrixB[k][j];
                sum += MatrixA[j][l]*MatrixB[k][l];
                l++;
                MatrixB[j][k] = MatrixB[k][j];
                sum_ += MatrixA[j][l]*MatrixB[k][l];
            }
            MatrixR[j][k] = sum + sum_;
        }
    }
    c++;
} while (c < iteraciones);
Generally speaking, transposing B should end up being much faster than the naive implementation, but at the expense of wasting another NxN worth of memory. I just spent a week digging around matrix multiplication optimization, and so far the absolute hands-down winner is this:
for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            if (likely(k)) /* #define likely(x) __builtin_expect(!!(x), 1) */
                C[i][j] += A[i][k] * B[k][j];
            else
                C[i][j] = A[i][k] * B[k][j];
This is even better than Drepper's method mentioned in an earlier comment, as it works optimally regardless of the cache properties of the underlying CPU. The trick lies in reordering the loops so that all three matrices are accessed in row-major order.
If you are working with small numbers, then the improvement you are mentioning is negligible. Also, performance will vary depending on the hardware you are running on. But if you are working with numbers in the millions, then it will have an effect.
Coming to the program, can you paste the program you have written?
Very old question, but here's my current implementation for my OpenGL projects:
typedef float matN[N][N];

inline void matN_mul(matN dest, matN src1, matN src2)
{
    unsigned int i;
    for (i = 0; i < N*N; i++)
    {
        unsigned int row = i / N, col = i % N;
        dest[row][col] = src1[row][0] * src2[0][col] +
                         src1[row][1] * src2[1][col] +
                         ....
                         src1[row][N-1] * src2[N-1][col];
    }
}
Where N is replaced with the size of the matrix. So if you are multiplying 4x4 matrices, then you use:
typedef float mat4[4][4];

inline void mat4_mul(mat4 dest, mat4 src1, mat4 src2)
{
    unsigned int i;
    for (i = 0; i < 16; i++)
    {
        unsigned int row = i / 4, col = i % 4;
        dest[row][col] = src1[row][0] * src2[0][col] +
                         src1[row][1] * src2[1][col] +
                         src1[row][2] * src2[2][col] +
                         src1[row][3] * src2[3][col];
    }
}
This function mainly minimizes loops but the modulus might be taxing... On my computer this function performed roughly 50% faster than a triple for loop multiplication function.
Cons:
Lots of code needed (e.g. different functions for mat3 x mat3, mat5 x mat5...)
Tweaks needed for irregular multiplication (e.g. mat3 x mat4)...
This is a very old question but I recently wandered down the rabbit hole and developed 9 different matrix multiplication implementations for both contiguous memory and non-contiguous memory (about 18 different functions). The results are interesting:
https://github.com/cubiclesoft/matrix-multiply
Blocking (aka loop tiling) didn't always produce the best results. In fact, I found that blocking may produce worse results than other algorithms depending on matrix size. Blocking only started doing marginally better than other algorithms somewhere around 1200x1200, performed worse at around 2000x2000, but got better again past that point. This seems to be a common problem with blocking - certain matrix sizes just don't work well.
Also, blocking on contiguous memory performed slightly worse than the non-contiguous version, and worse than an optimized straight pointer math version. Contrary to common thinking, non-contiguous memory storage also generally outperformed contiguous memory storage.
I'm sure someone will point out areas of optimization that I missed/overlooked, but the general conclusion is that blocking/loop tiling may do slightly better, do slightly worse (smaller matrices), or do much worse. Blocking adds a lot of complexity to the code for largely inconsequential gains and a non-smooth/wacky performance curve that's all over the place.
In my opinion, while it isn't the fastest implementation of the nine options I developed and tested, Implementation 6 has the best balance between code length, code readability, and performance:
void MatrixMultiply_NonContiguous_6(double **C, double **A, double **B, size_t A_rows, size_t A_cols, size_t B_cols)
{
    double tmpa;

    for (size_t i = 0; i < A_rows; i++)
    {
        tmpa = A[i][0];
        for (size_t j = 0; j < B_cols; j++)
        {
            C[i][j] = tmpa * B[0][j];
        }

        for (size_t k = 1; k < A_cols; k++)
        {
            tmpa = A[i][k];
            for (size_t j = 0; j < B_cols; j++)
            {
                C[i][j] += tmpa * B[k][j];
            }
        }
    }
}
void MatrixMultiply_Contiguous_6(double *C, double *A, double *B, size_t A_rows, size_t A_cols, size_t B_cols)
{
    double tmpa;

    for (size_t i = 0; i < A_rows; i++)
    {
        tmpa = A[i * A_cols];
        for (size_t j = 0; j < B_cols; j++)
        {
            C[i * B_cols + j] = tmpa * B[j];
        }

        for (size_t k = 1; k < A_cols; k++)
        {
            tmpa = A[i * A_cols + k];
            for (size_t j = 0; j < B_cols; j++)
            {
                C[i * B_cols + j] += tmpa * B[k * B_cols + j];
            }
        }
    }
}
Simply swapping j and k (Implementation 3) does a lot all on its own, but small adjustments (using a temporary variable for A and removing the if conditional) notably improve performance over Implementation 3.
Here are the implementations (copied verbatim from the linked repository):
Implementation 1 - The classic naive implementation. Also the slowest. Good for showing the baseline worst case and validating the other implementations. Not so great for actual, real world usage.
Implementation 2 - Uses a temporary variable for matrix C which might end up using a CPU register to do the addition.
Implementation 3 - Swaps the j and k loops from Implementation 1. The result is a bit more CPU cache friendly but adds a comparison per loop and the temporary from Implementation 2 is lost.
Implementation 4 - The temporary variable makes a comeback but this time on one of the operands (matrix A) instead of the assignment.
Implementation 5 - Move the conditional outside the innermost for loop. Now we have two inner for-loops.
Implementation 6 - Remove conditional altogether. This implementation arguably offers the best balance between code length, code readability, and performance. That is, both contiguous and non-contiguous functions are short, easy to understand, and faster than the earlier implementations. It is good enough that the next three Implementations use it as their starting point.
Implementation 7 - Precalculate base row start for the contiguous memory implementation. The non-contiguous version remains identical to Implementation 6. However, the performance gains are negligible over Implementation 6.
Implementation 8 - Sacrifice a LOT of code readability to use simple pointer math. The result completely removes all array access multiplication. Variant of Implementation 6. The contiguous version performs better than Implementation 9. The non-contiguous version performs about the same as Implementation 6.
Implementation 9 - Return to the readable code of Implementation 6 to implement a blocking algorithm. Processing small blocks at a time allows larger arrays to stay in the CPU cache during inner loop processing for a small increase in performance at around 1200x1200 array sizes but also results in a wacky performance graph that shows it can actually perform far worse than other Implementations.
