Row-wise permutation of a matrix using SIMD instructions

I'm trying to figure out a suitable way to apply a row-wise permutation of a matrix using SIMD intrinsics (mainly AVX/AVX2 and AVX-512).
The problem is basically calculating R = PX, where P is a (sparse) permutation matrix with only one nonzero element per column. This allows P to be represented as a vector p where p[i] is the row index of the nonzero value in column i. The code below shows a simple loop to achieve this:
// R and X are 2-D matrices with shape (m, n), same size
for (size_t i = 0; i < m; ++i) {
    for (size_t j = 0; j < n; ++j) {
        R[p[i]][j] += X[i][j];
    }
}
I assume it all boils down to gather, but before spending a long time implementing various approaches, I would love to know what you folks think about this and what the most suitable approach to tackling it would be.
Isn't it strange that none of the compilers use avx-512 for this?
https://godbolt.org/z/ox9nfjh8d
Why is it that gcc doesn't do register blocking? I see clang does a better job, is this common?
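For what it's worth, since p moves whole rows, the inner loop over j is contiguous and no gather is actually needed; each source row can simply be streamed into the destination row with plain vector loads and stores. A minimal sketch of that idea (my own, assuming float data and AVX; not the asker's code):

#include <immintrin.h>
#include <stddef.h>

/* R[p[i]][j] += X[i][j], using plain 256-bit loads/adds/stores per row. */
void permute_accumulate(float *R, const float *X, const size_t *p,
                        size_t m, size_t n)
{
    for (size_t i = 0; i < m; ++i) {
        float *r = R + p[i] * n;       /* destination row p[i] */
        const float *x = X + i * n;    /* source row i */
        size_t j = 0;
        for (; j + 8 <= n; j += 8) {
            __m256 acc = _mm256_loadu_ps(r + j);
            __m256 src = _mm256_loadu_ps(x + j);
            _mm256_storeu_ps(r + j, _mm256_add_ps(acc, src));
        }
        for (; j < n; ++j)             /* scalar tail */
            r[j] += x[j];
    }
}

A gather/scatter path would only become interesting if the permutation were applied within rows (i.e. column-wise) rather than to whole rows.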

Related

Efficiently accessing columns of a matrix in C

I have an Nx x Ny matrix U stored as a one-dimensional array of length Nx*Ny. In terms of application, each entry represents the solution value of some differential equation at the grid point (x_i, y_j), although I don't think that's important.
I am not very proficient in C, but I know that its arrays are row-major, so to avoid too many cache misses it is better to have the column index vary fastest, i.e. loop over j in the inner loop:
#define U(i,j) U[j+Ny*i]
for (int i=0; i<Nx; ++i)
for (int j=0; j<Ny; ++j)
U(i,j) = i*j; // example operation
My algorithm requires me to do two different types of operations:
For row i of U, do some computation that outputs row i of another array F
For column j of U, do some computation that outputs column j of another array G
where F and G have the same length and "shape" as U. The goal is a computational step like this:
#define U(i,j) U[j+Ny*i]
#define F(i,j) F[j+Ny*i]
#define G(i,j) G[j+Ny*i]
for (int i = 0; i < Nx; ++i)
    /* use U(i,:) to compute F(i,:); the : is just pseudocode short-hand for an entire row or column */
for (int j = 0; j < Ny; ++j)
    /* use U(:,j) to compute G(:,j) */
for (int i = 0; i < Nx; ++i)
    for (int j = 0; j < Ny; ++j)
        U(i,j) += F(i,j) + G(i,j); // example computation
I am struggling a bit to see how to do this computation efficiently. The steps that operate on rows of U seem fine, but then the operations on the columns of U will be quite slow, and entering values into G in a column-wise fashion will also be slow.
One method I thought of would involve storing both U and its transpose, that way operations on columns of U can be done on rows of UT. But I have to do the computational steps many thousands of times, and it seems like explicitly computing a transpose would be even slower. Likewise, I could assemble the transpose of G so that I'm only ever entering values in a row-major fashion, but then in the step U(i,j) += F(i,j) + G(j,i), I am now having to get column-wise values of G.
How should I deal with this situation in an efficient way?
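If you do end up keeping a transposed copy of U, the transpose itself need not be the bottleneck: a cache-blocked transpose touches both arrays in small tiles, and the column operations then become row operations on UT. A rough sketch (my own; it assumes double elements and a hypothetical tile size BLK to tune):

#define BLK 32   /* tile size; tune to your cache */

/* UT[j*Nx + i] = U[i*Ny + j], done tile by tile so both arrays stay in cache. */
void transpose_blocked(const double *U, double *UT, int Nx, int Ny)
{
    for (int ib = 0; ib < Nx; ib += BLK)
        for (int jb = 0; jb < Ny; jb += BLK)
            for (int i = ib; i < ib + BLK && i < Nx; ++i)
                for (int j = jb; j < jb + BLK && j < Ny; ++j)
                    UT[j * Nx + i] = U[i * Ny + j];
}

Whether the extra Nx*Ny of memory and the per-step transpose pay off depends on how heavy the per-column computation is, so it is worth measuring both variants.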

Get the sum of surrounding elements in a matrix

In an [N][N] matrix, what would be the best way of obtaining the sum of the 8 elements surrounding a certain element?
We've been doing it the brute-force way, just checking with a lot of if statements, but I was wondering if there could be a cleverer way of doing this.
The problem we face is the borders of the matrix, since we cannot find a way that looks more subtle than the original bunch of if (i > 0 && j > 0) {...}.
Assuming the matrix has been initialized and you only consider elements whose eight neighbours all exist, you can save time by restricting the double loop to those elements.
For an N x N matrix, the following covers all elements satisfying that condition:
for (i = 1; i < N - 1; i++)
{
    for (j = 1; j < N - 1; j++)
    {
        // YOUR CODE
    }
}
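If the border cells are needed as well, one way to avoid the forest of if statements is to clamp the 3x3 window to the matrix edges once per cell. A small sketch (my own, with N assumed to be a compile-time size as in the question):

#define N 8   /* illustrative size */

/* Sum of the up-to-8 neighbours of (i, j); the window is clamped to the
   matrix edges, so border cells need no special cases. */
int neighbour_sum(int m[N][N], int i, int j)
{
    int r0 = (i > 0)     ? i - 1 : 0;
    int r1 = (i < N - 1) ? i + 1 : N - 1;
    int c0 = (j > 0)     ? j - 1 : 0;
    int c1 = (j < N - 1) ? j + 1 : N - 1;
    int sum = 0;
    for (int r = r0; r <= r1; r++)
        for (int c = c0; c <= c1; c++)
            sum += m[r][c];
    return sum - m[i][j];   /* the centre element itself is not a neighbour */
}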

What memory access patterns are most efficient for outer-product-type double loops?

What access patterns are most efficient for writing cache-efficient, outer-product-type code that maximally exploits data locality?
Consider a block of code for processing all pairs of elements of two arrays such as:
for (int i = 0; i < N; i++)
for (int j = 0; j < M; j++)
out[i*M + j] = X[i] binary-op Y[j];
This is a standard vector-vector outer product when binary-op is scalar multiplication and X and Y are 1-D, but the same pattern is also matrix multiplication when X and Y are matrices and binary-op is a dot product between the i-th row of X and the j-th column of Y.
For matrix multiplication, I know optimized BLASs like OpenBLAS and MKL can get much higher performance than you get from the double loop style code above, because they process the elements in chunks in such a way as to exploit the CPU cache much more. Unfortunately, OpenBLAS kernels are written in assembly so it's pretty difficult to figure out what's going on.
Are there any good "tricks of the trade" for re-organizing these types of double loops to improve cache performance?
Since each element of out is only hit once, we're clearly free to reorder the iterations. The straight linear traversal of out is the easiest to write, but I don't think it's the most efficient pattern to execute, since you don't exploit any locality in X.
I'm especially interested in the setting where M and N are large and the size of each element (X[i] and Y[j]) is pretty small (like O(1) bytes), so we're talking about something analogous to a vector-vector outer product or the multiplication of a tall-and-skinny matrix by a short-and-fat matrix (e.g. N x D by D x M where D is small).
For large enough M, the Y vector will exceed the L1 cache size.* Thus on every new outer iteration you'll be reloading Y from main memory (or at least from a slower cache). In other words, you won't be exploiting temporal locality in Y.
You should block up your accesses to Y; something like this:
for (int jj = 0; jj < M; jj += CACHE_SIZE) {            // Iterate over blocks of Y
    for (int i = 0; i < N; i++) {
        int jend = (jj + CACHE_SIZE < M) ? jj + CACHE_SIZE : M;
        for (int j = jj; j < jend; j++) {                // Iterate within a block
            out[i*M + j] = X[i] * Y[j];
        }
    }
}
The above doesn't do anything smart with accesses to X (it now re-reads X once per block of Y), but a new X value is only needed once every CACHE_SIZE inner iterations, so the impact is probably negligible.
* If everything is small enough to already fit in cache, then you can't do better than what you already have (vectorisation opportunities notwithstanding).

What sort of indexing method can I use to store the distances between X^2 vectors in an array without redundancy?

I'm working on a demo that requires a lot of vector math, and in profiling, I've found that it spends the most time finding the distances between given vectors.
Right now, it loops through an array of X^2 vectors and finds the distance from each one to every other, meaning it runs the distance function X^4 times, even though (I think) there are only (X^2)/2 unique distances.
It works something like this: (pseudo c)
#define MATRIX_WIDTH 8
typedef float vec2_t[2];
vec2_t matrix[MATRIX_WIDTH * MATRIX_WIDTH];
...
for (int i = 0; i < MATRIX_WIDTH; i++)
{
    for (int j = 0; j < MATRIX_WIDTH; j++)
    {
        float xd, yd;
        float distance;
        for (int k = 0; k < MATRIX_WIDTH; k++)
        {
            for (int l = 0; l < MATRIX_WIDTH; l++)
            {
                int index_a = (i * MATRIX_WIDTH) + j;
                int index_b = (k * MATRIX_WIDTH) + l;
                xd = matrix[index_a][0] - matrix[index_b][0];
                yd = matrix[index_a][1] - matrix[index_b][1];
                distance = sqrtf(powf(xd, 2) + powf(yd, 2));
            }
        }
        // More code that uses the distances between each vector
    }
}
What I'd like to do is create and populate an array of (X^2) / 2 distances without redundancy, then reference that array when I finally need it. However, I'm drawing a blank on how to index this array in a way that would work. A hash table would do it, but I think it's much too complicated and slow for a problem that seems like it could be solved by a clever indexing method.
EDIT: This is for a flocking simulation.
Performance ideas:
a) if possible, work with the squared distance to avoid the square-root calculation
b) never use pow for constant, integer powers - use xd*xd instead (see the snippet below)
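Both ideas together, as a tiny sketch (radius is an illustrative threshold, not from the original code):

/* Compare squared distances; no sqrtf or powf needed. */
float xd = matrix[index_a][0] - matrix[index_b][0];
float yd = matrix[index_a][1] - matrix[index_b][1];
float dist2 = xd * xd + yd * yd;
if (dist2 < radius * radius) {
    /* the two points are within 'radius' of each other */
}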
I would consider changing your algorithm - O(n^4) is really bad. When dealing with pairwise interactions in physics (likewise all-pairs distances over a 2-D field), one would use tree structures and neglect interactions with a low impact. But it will depend on what the "more code that uses the distances..." really does.
Just did some considerations: the number of unique distances is 0.5*n*(n+1), with n = w*h.
If you write down where unique distances occur, you will see that both inner loops can be reduced by starting them at i and j.
Additionally, if you only need to access those distances via the matrix index, you can set up a 4-D distance matrix.
If memory is limited we can save nearly 50%, as mentioned above, with a lookup function that accesses a triangular matrix, as Code-Guru said. We would probably precalculate the line indices to avoid summing them up on every access:
float distanceArray[(H*W + 1) * H*W / 2];   /* packed triangular matrix */
int lineIndices[H*W];                       /* lineIndices[j] = j*(j+1)/2 */

float searchDistance(int i, int j)
{
    return i < j ? distanceArray[i + lineIndices[j]]
                 : distanceArray[j + lineIndices[i]];
}
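A sketch of how that table might be filled (my own illustration; it assumes n = H*W points stored in matrix[] as in the question, and stores squared distances per idea (a) above):

/* lineIndices[j] = j*(j+1)/2, so the pair (i, j) with i <= j lives at
   distanceArray[i + lineIndices[j]]; the diagonal (i == j) holds zeros. */
void buildDistanceTable(void)
{
    int n = H * W;
    for (int j = 0; j < n; j++)
        lineIndices[j] = j * (j + 1) / 2;

    for (int j = 0; j < n; j++)
        for (int i = 0; i <= j; i++)
        {
            float xd = matrix[i][0] - matrix[j][0];
            float yd = matrix[i][1] - matrix[j][1];
            distanceArray[i + lineIndices[j]] = xd * xd + yd * yd;   /* squared */
        }
}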

Optimized matrix multiplication in C

I'm trying to compare different methods for matrix multiplication.
The first one is the normal method:
c = 0;
do
{
    for (j = 0; j < i; j++)
    {
        for (k = 0; k < i; k++)
        {
            suma = 0;
            for (l = 0; l < i; l++)
                suma += MatrixA[j][l]*MatrixB[l][k];
            MatrixR[j][k] = suma;
        }
    }
    c++;
} while (c < iteraciones);
The second one consists of transposing matrix B first and then doing the multiplication by rows:
int f, co;
for (f = 0; f < i; f++) {
    for (co = 0; co < i; co++) {
        MatrixB[f][co] = MatrixB[co][f];
    }
}
c = 0;
do
{
    for (j = 0; j < i; j++)
    {
        for (k = 0; k < i; k++)
        {
            suma = 0;
            for (l = 0; l < i; l++)
                suma += MatrixA[j][l]*MatrixB[k][l];
            MatrixR[j][k] = suma;
        }
    }
    c++;
} while (c < iteraciones);
The second method is supposed to be much faster, because we are accessing contiguous memory slots, but I'm not getting a significant improvement in performance. Am I doing something wrong?
I can post the complete code, but I think it is not needed.
What Every Programmer Should Know About Memory (pdf link) by Ulrich Drepper has a lot of good ideas about memory efficiency, but in particular he uses matrix multiplication as an example of how knowing about memory and using that knowledge can speed up this process. Look at appendix A.1 in his paper, and read through section 6.2.1. Table 6.2 in the paper shows that he could get the running time down to 10% of a naive implementation's time for a 1000x1000 matrix.
Granted, his final code is pretty hairy and uses a lot of system-specific stuff and compile-time tuning, but still, if you really need speed, reading that paper and reading his implementation is definitely worth it.
Getting this right is non-trivial. Using an existing BLAS library is highly recommended.
Should you really be inclined to roll your own matrix multiplication, loop tiling is an optimization that is of particular importance for large matrices. The tiling should be tuned to the cache size to ensure that the cache is not being continually thrashed, which will occur with a naive implementation. I once measured a 12x performance difference tiling a matrix multiply with matrix sizes picked to consume multiples of my cache (circa '97 so the cache was probably small).
Loop tiling algorithms assume that a contiguous linear array of elements is used, as opposed to rows or columns of pointers. With such a storage choice, your indexing scheme determines which dimension changes fastest, and you are free to decide whether row or column access will have the best cache performance.
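For concreteness, here is a minimal loop-tiling sketch for C = A*B on n x n row-major matrices (my own; BS is a hypothetical block size to tune to the cache, and C is assumed to be zero-initialized):

#define BS 64   /* block size; tune to the cache */

void matmul_tiled(int n, const double *A, const double *B, double *C)
{
    /* C must be zeroed before the call. */
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++)
                    {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}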
There's a lot of literature on the subject. The following references, especially the Banerjee books, may be helpful:
[Ban93] Banerjee, Utpal, Loop Transformations for Restructuring Compilers: the Foundations, Kluwer Academic Publishers, Norwell, MA, 1993.
[Ban94] Banerjee, Utpal, Loop Parallelization, Kluwer Academic Publishers, Norwell, MA, 1994.
[BGS93] Bacon, David F., Susan L. Graham, and Oliver Sharp, Compiler Transformations for High-Performance Computing, Computer Science Division, University of California, Berkeley, Calif., Technical Report No UCB/CSD-93-781.
[LRW91] Lam, Monica S., Edward E. Rothberg, and Michael E Wolf. The Cache Performance and Optimizations of Blocked Algorithms, In 4th International Conference on Architectural Support for Programming Languages, held in Santa Clara, Calif., April, 1991, 63-74.
[LW91] Lam, Monica S., and Michael E Wolf. A Loop Transformation Theory and an Algorithm to Maximize Parallelism, In IEEE Transactions on Parallel and Distributed Systems, 1991, 2(4):452-471.
[PW86] Padua, David A., and Michael J. Wolfe, Advanced Compiler Optimizations for Supercomputers, In Communications of the ACM, 29(12):1184-1201, 1986.
[Wolfe89] Wolfe, Michael J. Optimizing Supercompilers for Supercomputers, The MIT Press, Cambridge, MA, 1989.
[Wolfe96] Wolfe, Michael J., High Performance Compilers for Parallel Computing, Addison-Wesley, CA, 1996.
ATTENTION: You have a BUG in your second implementation
for (f = 0; f < i; f++) {
for (co = 0; co < i; co++) {
MatrixB[f][co] = MatrixB[co][f];
}
}
When you do f=0, co=1:
MatrixB[0][1] = MatrixB[1][0];
you overwrite MatrixB[0][1] and lose that value! When the loop gets to f=1, co=0:
MatrixB[1][0] = MatrixB[0][1];
the value copied is the same one that was already there.
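A fixed version swaps only the upper triangle, so nothing is overwritten before it has been read (a sketch; the element type is assumed to be double):

for (f = 0; f < i; f++) {
    for (co = f + 1; co < i; co++) {
        double tmp = MatrixB[f][co];      /* save before overwriting */
        MatrixB[f][co] = MatrixB[co][f];
        MatrixB[co][f] = tmp;
    }
}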
You should not write matrix multiplication. You should depend on external libraries. In particular you should use the GEMM routine from the BLAS library. GEMM often provides the following optimizations
Blocking
Efficient Matrix Multiplication relies on blocking your matrix and performing several smaller blocked multiplies. Ideally the size of each block is chosen to fit nicely into cache greatly improving performance.
Tuning
The ideal block size depends on the underlying memory hierarchy (how big is the cache?). As a result libraries should be tuned and compiled for each specific machine. This is done, among others, by the ATLAS implementation of BLAS.
Assembly Level Optimization
Matrix multiplication is so common that developers will optimize it by hand. In particular, this is done in GotoBLAS.
Heterogeneous(GPU) Computing
Matrix Multiply is very FLOP/compute intensive, making it an ideal candidate to be run on GPUs. cuBLAS and MAGMA are good candidates for this.
In short, dense linear algebra is a well studied topic. People devote their lives to the improvement of these algorithms. You should use their work; it will make them happy.
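For reference, calling GEMM through the standard CBLAS interface looks roughly like this (a sketch assuming row-major double matrices and a CBLAS implementation such as OpenBLAS, ATLAS, or MKL):

#include <cblas.h>

/* C = A * B, where A is MxK, B is KxN and C is MxN, all row-major doubles. */
void gemm_rowmajor(int M, int N, int K,
                   const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0, A, K,    /* alpha, A, lda */
                     B, N,    /*        B, ldb */
                0.0, C, N);   /* beta,  C, ldc */
}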
If the matrix is not large enough, or you don't repeat the operation a large number of times, you won't see appreciable differences.
If the matrix is, say, 1,000x1,000 you will begin to see improvements, but I would say that if it is below 100x100 you should not worry about it.
Also, any 'improvement' may be of the order of milliseconds, unless you are either working with extremely large matrices or repeating the operation thousands of times.
Finally, if you switch to a faster computer, the differences will be even narrower!
How big an improvement you get will depend on:
The size of the cache
The size of a cache line
The degree of associativity of the cache
For small matrix sizes and modern processors it's highly probable that the data from both MatrixA and MatrixB will be kept nearly entirely in the cache after you touch it the first time.
Just something for you to try (though this would only make a difference for large matrices): separate your addition logic from the multiplication logic in the inner loop, like so:
for (k = 0; k < i; k++)
{
    int sums[i];   /* C99 variable-length array; treat as pseudo-code on older compilers */
    for (l = 0; l < i; l++)
        sums[l] = MatrixA[j][l]*MatrixB[k][l];
    int suma = 0;
    for (int s = 0; s < i; s++)
        suma += sums[s];
    MatrixR[j][k] = suma;   /* store the result as in the original loop */
}
This is because you end up stalling your pipeline when you write to suma. Granted, much of this is taken care of by register renaming and the like, but with my limited understanding of hardware, if I wanted to squeeze every ounce of performance out of the code, I would do this, because now you don't have to stall the pipeline waiting for a write to suma. Since multiplication is more expensive than addition, you want to let the machine parallelize it as much as possible, so saving your stalls for the addition means you spend less time waiting in the addition loop than you would in the multiplication loop.
This is just my logic. Others with more knowledge in the area may disagree.
Can you post some data comparing your 2 approaches for a range of matrix sizes ? It may be that your expectations are unrealistic and that your 2nd version is faster but you haven't done the measurements yet.
Don't forget, when measuring execution time, to include the time to transpose matrixB.
Something else you might want to try is comparing the performance of your code with that of the equivalent operation from your BLAS library. This may not answer your question directly, but it will give you a better idea of what you might expect from your code.
The computational complexity of multiplying two N*N matrices is O(N^3). Performance improves dramatically with a sub-cubic algorithm such as Strassen's, which is roughly O(N^2.8); much of MATLAB's speed, though, comes from a highly tuned BLAS. If you have MATLAB installed, try multiplying two 1024*1024 matrices. On my computer MATLAB completes it in 0.7 s, while a C/C++ implementation of the naive algorithm like yours takes 20 s. If you really care about performance, look into lower-complexity algorithms. There are algorithms around O(N^2.4) (e.g. Coppersmith-Winograd), but they only pay off for very large matrices, where the other overheads become negligible.
Not so special, but better - the inner loop unrolled by two with two accumulators (MatrixB is assumed to be already transposed, as in the second version):
c = 0;
do
{
    for (j = 0; j < i; j++)
    {
        for (k = 0; k < i; k++)
        {
            sum = 0; sum_ = 0;
            for (l = 0; l + 1 < i; l += 2)
            {
                sum  += MatrixA[j][l]     * MatrixB[k][l];
                sum_ += MatrixA[j][l + 1] * MatrixB[k][l + 1];
            }
            if (l < i)   /* leftover element when i is odd */
                sum += MatrixA[j][l] * MatrixB[k][l];
            MatrixR[j][k] = sum + sum_;
        }
    }
    c++;
} while (c < iteraciones);
Generally speaking, transposing B should end up being much faster than the naive implementation, but at the expense of wasting another NxN worth of memory. I just spent a week digging around matrix multiplication optimization, and so far the absolute hands-down winner is this:
for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            if (likely(k))   /* #define likely(x) __builtin_expect(!!(x), 1) */
                C[i][j] += A[i][k] * B[k][j];
            else
                C[i][j] = A[i][k] * B[k][j];
This is even better than Drepper's method mentioned in an earlier comment, as it works optimally regardless of the cache properties of the underlying CPU. The trick lies in reordering the loops so that all three matrices are accessed in row-major order.
If you are working with small matrices, then the improvement you are mentioning is negligible. Also, performance will vary depending on the hardware you are running on. But if you are working with millions of elements, then it will have an effect.
Coming to the program, can you paste the code you have written?
Very old question, but here's my current implementation for my OpenGL projects:
typedef float matN[N][N];
inline void matN_mul(matN dest, matN src1, matN src2)
{
    unsigned int i;
    for (i = 0; i < N*N; i++)   /* note: N^2 would be XOR in C, so use N*N */
    {
        unsigned int row = i / N, col = i % N;
        dest[row][col] = src1[row][0] * src2[0][col] +
                         src1[row][1] * src2[1][col] +
                         ....
                         src1[row][N-1] * src2[N-1][col];
    }
}
Where N is replaced with the size of the matrix. So if you are multiplying 4x4 matrices, then you use:
typedef float mat4[4][4];
inline void mat4_mul(mat4 dest, mat4 src1, mat4 src2)
{
unsigned int i;
for(i = 0; i < 16; i++)
{
unsigned int row = (int) i / 4, col = i % 4;
dest[row][col] = src1[row][0] * src2[0][col] +
src1[row][1] * src2[1][col] +
src1[row][2] * src2[2][col] +
src1[row][3] * src2[3][col];
}
}
This function mainly minimizes loops but the modulus might be taxing... On my computer this function performed roughly 50% faster than a triple for loop multiplication function.
Cons:
Lots of code needed (ex. different functions for mat3 x mat3, mat5 x mat5...)
Tweaks needed for irregular multiplication (ex. mat3 x mat4).....
This is a very old question but I recently wandered down the rabbit hole and developed 9 different matrix multiplication implementations for both contiguous memory and non-contiguous memory (about 18 different functions). The results are interesting:
https://github.com/cubiclesoft/matrix-multiply
Blocking (aka loop tiling) didn't always produce the best results. In fact, I found that blocking may produce worse results than other algorithms depending on matrix size. Blocking really only started doing marginally better than other algorithms somewhere around 1200x1200, performed worse at around 2000x2000, but got better again past that point. This seems to be a common problem with blocking - certain matrix sizes just don't work well.

Also, blocking on contiguous memory performed slightly worse than the non-contiguous version. Contrary to common thinking, non-contiguous memory storage also generally outperformed contiguous memory storage, and blocking on contiguous memory performed worse than an optimized straight-pointer-math version.

I'm sure someone will point out areas of optimization that I missed/overlooked, but the general conclusion is that blocking/loop tiling may do slightly better, slightly worse (smaller matrices), or much worse. Blocking adds a lot of complexity to the code for largely inconsequential gains and a non-smooth/wacky performance curve that's all over the place.
In my opinion, while it isn't the fastest implementation of the nine options I developed and tested, Implementation 6 has the best balance between code length, code readability, and performance:
void MatrixMultiply_NonContiguous_6(double **C, double **A, double **B, size_t A_rows, size_t A_cols, size_t B_cols)
{
    double tmpa;

    for (size_t i = 0; i < A_rows; i++)
    {
        tmpa = A[i][0];
        for (size_t j = 0; j < B_cols; j++)
        {
            C[i][j] = tmpa * B[0][j];
        }

        for (size_t k = 1; k < A_cols; k++)
        {
            tmpa = A[i][k];
            for (size_t j = 0; j < B_cols; j++)
            {
                C[i][j] += tmpa * B[k][j];
            }
        }
    }
}

void MatrixMultiply_Contiguous_6(double *C, double *A, double *B, size_t A_rows, size_t A_cols, size_t B_cols)
{
    double tmpa;

    for (size_t i = 0; i < A_rows; i++)
    {
        tmpa = A[i * A_cols];
        for (size_t j = 0; j < B_cols; j++)
        {
            C[i * B_cols + j] = tmpa * B[j];
        }

        for (size_t k = 1; k < A_cols; k++)
        {
            tmpa = A[i * A_cols + k];
            for (size_t j = 0; j < B_cols; j++)
            {
                C[i * B_cols + j] += tmpa * B[k * B_cols + j];
            }
        }
    }
}
Simply swapping j and k (Implementation 3) does a lot all on its own, but the small adjustments of using a temporary variable for A and removing the if conditional notably improve performance over Implementation 3.
Here are the implementations (copied verbatim from the linked repository):
Implementation 1 - The classic naive implementation. Also the slowest. Good for showing the baseline worst case and validating the other implementations. Not so great for actual, real world usage.
Implementation 2 - Uses a temporary variable for matrix C which might end up using a CPU register to do the addition.
Implementation 3 - Swaps the j and k loops from Implementation 1. The result is a bit more CPU cache friendly but adds a comparison per loop and the temporary from Implementation 2 is lost.
Implementation 4 - The temporary variable makes a comeback but this time on one of the operands (matrix A) instead of the assignment.
Implementation 5 - Move the conditional outside the innermost for loop. Now we have two inner for-loops.
Implementation 6 - Remove conditional altogether. This implementation arguably offers the best balance between code length, code readability, and performance. That is, both contiguous and non-contiguous functions are short, easy to understand, and faster than the earlier implementations. It is good enough that the next three Implementations use it as their starting point.
Implementation 7 - Precalculate base row start for the contiguous memory implementation. The non-contiguous version remains identical to Implementation 6. However, the performance gains are negligible over Implementation 6.
Implementation 8 - Sacrifice a LOT of code readability to use simple pointer math. The result completely removes all array access multiplication. Variant of Implementation 6. The contiguous version performs better than Implementation 9. The non-contiguous version performs about the same as Implementation 6.
Implementation 9 - Return to the readable code of Implementation 6 to implement a blocking algorithm. Processing small blocks at a time allows larger arrays to stay in the CPU cache during inner loop processing for a small increase in performance at around 1200x1200 array sizes but also results in a wacky performance graph that shows it can actually perform far worse than other Implementations.
