A is an MxK matrix, B is a vector of size K, and C is a KxN matrix. What set of BLAS operators should I use to compute the matrix below?
M = A*diag(B)*C
One way to implement this would be using three nested for loops like below:
for (int i=0; i<M; ++i)
    for (int j=0; j<N; ++j)
        for (int k=0; k<K; ++k)
            M(i,j) += A(i,k)*B(k)*C(k,j); // M(i,j) assumed zero-initialized before the k loop
Is it actually worth implementing this in BLAS in order to gain better speed efficiency?
First compute D = diag(B)*C, then use the appropriate BLAS matrix-multiply to compute A*D.
You can implement diag(B)*C with a loop over the elements of B, calling the appropriate BLAS scaling routine (xSCAL) on each row of C.
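A minimal sketch of that approach with CBLAS (assuming double precision, row-major storage, a scratch K x N matrix D, and an output matrix called Mout here so the name does not clash with the dimension M; the calls are standard cblas_dscal and cblas_dgemm):

#include <cblas.h>
#include <string.h>

// D = diag(B) * C : start from a copy of C, then scale row k by B[k]
memcpy(D, C, (size_t)K * N * sizeof(double));
for (int k = 0; k < K; ++k)
    cblas_dscal(N, B[k], &D[k*N], 1);

// Mout = A * D : a single GEMM (A is MxK, D is KxN, Mout is MxN)
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            M, N, K, 1.0, A, K, D, N, 0.0, Mout, N);

Whether this beats the plain triple loop depends on the sizes: for small matrices the call overhead may dominate, while for large ones an optimized BLAS is usually much faster.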
I'm trying to figure out a suitable way to apply row-wise permutation of a matrix using SIMD intrinsics (mainly AVX/AVX2 and AVX512).
The problem is basically calculating R = PX where P is a permutation matrix (sparse) with only one nonzero element per column. This allows one to represent the matrix P as a vector p where p[i] is the row index of the nonzero value for column i. The code below shows a simple loop to achieve this:
// R and X are 2D matrices with shape (m, n); R is assumed zero-initialized
for (size_t i = 0; i < m; ++i) {
    for (size_t j = 0; j < n; ++j) {
        R[p[i]][j] += X[i][j];
    }
}
I assume it all boils down to gather, but before spending a long time trying to implement various approaches, I would love to know what you folks think about this and what the most suitable approach for tackling it would be.
Isn't it strange that none of the compilers use AVX-512 for this?
https://godbolt.org/z/ox9nfjh8d
Why is it that GCC doesn't do register blocking? I see Clang does a better job; is this common?
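Since the permutation acts on whole rows, each row of X is a contiguous block that gets added into a contiguous row of R, so no gather/scatter is strictly required; the inner loop is just a contiguous vector add. A minimal AVX2 sketch under that assumption (double data, row-major storage with leading dimension n, and n a multiple of 4 for brevity; the function name is illustrative):

#include <immintrin.h>
#include <stddef.h>

void permute_rows_add(double *R, const double *X, const size_t *p,
                      size_t m, size_t n)
{
    for (size_t i = 0; i < m; ++i) {
        double *dst = R + p[i] * n;      // row p[i] of R
        const double *src = X + i * n;   // row i of X
        for (size_t j = 0; j < n; j += 4) {
            __m256d r = _mm256_loadu_pd(dst + j);
            __m256d x = _mm256_loadu_pd(src + j);
            _mm256_storeu_pd(dst + j, _mm256_add_pd(r, x));
        }
    }
}

A plain scalar loop over contiguous rows is often auto-vectorized well by the compiler, so the gain from hand-written intrinsics here may be modest.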
//Cannot understand use of this function
void gemm_ATA(double *A, double *C, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0;
            for (int k = 0; k < n; k++) {
                //Why is i*n+k used here?
                sum += A[i*n+k] * A[j*n+k];
            }
            C[i*n+j] = sum;
        }
    }
}
int main() {
    double *m4 = (double*)malloc(sizeof(double)*n*n);
    // m3 and n are defined earlier in the linked program
    //Why was gemm_ATA function used here?
    gemm_ATA(m3, m4, n); //make a positive-definite matrix
    printf("\n");
    //show_matrix(m4,n);
}
I am working on a project to parallelize the Cholesky method and found some useful code. In the given project this function is used, and I have no idea why.
Also, can someone help me understand the code and the functions used in it? The code is given at this link:
http://coliru.stacked-crooked.com/a/6f5750c20d456da9
The function gemm_ATA takes an input matrix A and calculates C = A*A^T, which is positive semi-definite by construction (note the semi-definiteness: whether it is strictly positive definite depends on the properties of the input matrix).
Mathematically, calculating this matrix would be:
c_i,j = sum_k a_i,k * a_j,k
c_i,j is the entry of C in the i-th row and j-th column. The expressions i*n+k and j*n+k transform these 2D indices (row and column) to a 1D index of the underlying array.
gemm_ATA calculates A*A^T and stores it in C. A^T is A flipped over its diagonal. So A*A^T multiplies each row of A (call it A[i,-]) with each column of A^T, which is itself a row of A (call it A[j,-]).
Also, if you flatten an n*n 2D matrix into a 1D array, the first element of the i-th row is at index i*n+0. So the element in row i and column k is A[i*n+k].
Note that since you pass C (a pointer) into the function, after the call m4 is the positive (semi-)definite matrix computed from m3.
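For reference, the same product can be obtained with a single BLAS call instead of the hand-written triple loop; a sketch assuming doubles, row-major storage, and a linked CBLAS implementation (e.g. OpenBLAS or MKL). dsyrk computes C = alpha*A*A^T + beta*C and writes only one triangle of C, so mirror it afterwards if the full matrix is needed:

#include <cblas.h>

// C = 1.0 * A * A^T + 0.0 * C, upper triangle of C only
cblas_dsyrk(CblasRowMajor, CblasUpper, CblasNoTrans,
            n, n, 1.0, A, n, 0.0, C, n);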
I have a Nx x Ny matrix U stored as a one-dimensional array of length Nx*Ny. In terms of application, each entry represents the solution value to some differential equation at the grid point (x_i, y_j), although I don't think that's important.
I am not very proficient in C, but I know that it stores arrays in row-major order, so to avoid too many cache misses it is better to have the inner loop run over the column index (i.e., traverse each row contiguously):
#define U(i,j) U[j+Ny*i]
for (int i=0; i<Nx; ++i)
for (int j=0; j<Ny; ++j)
U(i,j) = i*j; // example operation
My algorithm requires me to do two different types of operations:
For row i of U, do some computation that outputs row i of another array F
For column j of U, do some computation that outputs column j of another array G
where F and G have the same length and "shape" as U. The goal is a computational step like this:
#define U(i,j) U[j+Ny*i]
#define F(i,j) F[j+Ny*i]
#define G(i,j) G[j+Ny*i]
for (int i = 0; i < Nx; ++i)
    /* use U(i,:) to compute F(i,:); the : is just pseudocode short-hand to indicate an entire row or column */
for (int j = 0; j < Ny; ++j)
    /* use U(:,j) to compute G(:,j) */
for (int i=0; i<Nx; ++i)
for (int j=0; j<Ny; ++j)
U(i,j) += F(i,j) + G(i,j); // example computation
I am struggling a bit to see how to do this computation efficiently. The steps that operate on rows of U seem fine, but then the operations on the columns of U will be quite slow, and entering values into G in a column-wise fashion will also be slow.
One method I thought of would involve storing both U and its transpose, that way operations on columns of U can be done on rows of UT. But I have to do the computational steps many thousands of times, and it seems like explicitly computing a transpose would be even slower. Likewise, I could assemble the transpose of G so that I'm only ever entering values in a row-major fashion, but then in the step U(i,j) += F(i,j) + G(j,i), I am now having to get column-wise values of G.
How should I deal with this situation in an efficient way?
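One possible shape for the transposed-G idea described above (just a sketch, not from the original code; BLK and the GT macro are illustrative): compute the column-wise results into GT, i.e. G stored transposed so those writes become row-major, and then do the final update in tiles so that both the U/F layout and the transposed GT layout are traversed with reasonable locality.

#define GT(j,i) GT[i + Nx*j]   /* GT holds G transposed: GT(j,i) == G(i,j) */
#define BLK 64                 /* tile size; tune for the cache */

for (int ii = 0; ii < Nx; ii += BLK)
    for (int jj = 0; jj < Ny; jj += BLK)
        for (int i = ii; i < Nx && i < ii + BLK; ++i)
            for (int j = jj; j < Ny && j < jj + BLK; ++j)
                U(i,j) += F(i,j) + GT(j,i);

Each BLK x BLK tile of GT is reused across BLK consecutive values of i, so the strided reads of GT are served from cache rather than from memory on every access.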
I have a matrix S(n x m) and a vector Sigma(n), and I would like to multiply each row S(i) by Sigma(i).
I have thought of 3 things :
-> Convert Sigma to a square diagonal matrix and compute S = Sigma * S, but it seems the functions exist only for general or triangular matrices...
-> Multiply each row by the scalar Sigma[i] using DSCAL, in a loop
-> mkl_ddiamm, but it seems kinda obscure to me.
Any advice on how I should implement this? Thank you!
It is a very simple operation that MKL/BLAS does not provide a dedicated function for. You can implement it yourself with for loops:
for (int i = 0; i < nrows; ++i) {
    for (int j = 0; j < ncols; ++j) {
        s[i][j] *= sigma[i];
    }
}
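If you would rather keep it in MKL/BLAS (option 2 from the question), a sketch assuming S is stored row-major as a flat double array with n rows and m columns:

#include <mkl_cblas.h>   /* or cblas.h with another BLAS */

/* scale row i of S in place by Sigma[i], one DSCAL call per row */
for (int i = 0; i < n; ++i)
    cblas_dscal(m, Sigma[i], &S[i*m], 1);

For a tall, skinny matrix this makes n small BLAS calls, so the plain double loop above may well be just as fast.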
What access patterns are most efficient for writing cache-efficient outer-product type code that maximally exploits data locality?
Consider a block of code for processing all pairs of elements of two arrays such as:
for (int i = 0; i < N; i++)
for (int j = 0; j < M; j++)
out[i*M + j] = X[i] binary-op Y[j];
This is a standard vector-vector outer product when binary-op is scalar multiplication and X and Y are 1d, but this same pattern is also matrix multiplication when X and Y are matrices and binary-op is a dot product between the ith row and j-th column of two matrices.
For matrix multiplication, I know optimized BLASs like OpenBLAS and MKL can get much higher performance than you get from the double loop style code above, because they process the elements in chunks in such a way as to exploit the CPU cache much more. Unfortunately, OpenBLAS kernels are written in assembly so it's pretty difficult to figure out what's going on.
Are there any good "tricks of the trade" for re-organizing these types of double loops to improve cache performance?
Since each element of out is only hit once, we're clearly free to reorder the iterations. The straight linear traversal of out is the easiest to write, but I don't think it's the most efficient pattern to execute, since you don't exploit any locality in X.
I'm especially interested in the setting where M and N are large, and the size of each element (X[i] and Y[j]) is pretty small (like O(1) bytes), so we're talking about something analogous to a vector-vector outer product or the multiplication of a tall and skinny matrix by a short and fat matrix (e.g. N x D by D x M where D is small).
For large enough M, the Y vector will exceed the L1 cache size.* Thus on every new outer iteration, you'll be reloading Y from main memory (or at least, a slower cache). In other words, you won't be exploiting temporal locality in Y.
You should block up your accesses to Y; something like this:
for (int jj = 0; jj < M; jj += CACHE_SIZE) {          // Iterate over blocks of Y
    for (int i = 0; i < N; i++) {
        int jend = (jj + CACHE_SIZE < M) ? jj + CACHE_SIZE : M;
        for (int j = jj; j < jend; j++) {              // Iterate within a block
            out[i*M + j] = X[i] * Y[j];
        }
    }
}
The above doesn't do anything smart with accesses to X, but a new X value is only needed once per CACHE_SIZE inner iterations, so the impact is probably negligible.
* If everything is small enough to already fit in cache, then you can't do better than what you already have (vectorisation opportunities notwithstanding).