OpenFOAM, PETSc, or other sparse matrix multiplication source code

Could someone tell me where I can find the source code for matrix multiplication as implemented by OpenFOAM, PETSc, or something similar? It can't be a trivial algorithm.
I have found the homepages of OpenFOAM and PETSc, but I can't find the multiply methods or their source code in the documentation.

PETSc implements matrix multiplication for many formats; look at MatMult_SeqAIJ for the most basic implementation. For a sparse matrix stored in compressed sparse row (CSR) form with row starts ai, column indices aj, and entries aa, multiplication consists of the following simple kernel.
for (i = 0; i < m; i++) {                  /* for each of the m rows      */
    y[i] = 0;
    for (j = ai[i]; j < ai[i+1]; j++)      /* nonzeros of row i           */
        y[i] += aa[j] * x[aj[j]];          /* entry times matching x      */
}
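
To make the indexing concrete, here is a small hypothetical example: a 3x3 matrix written out in CSR form, ready to be fed to the kernel above (the names ai, aj, and aa follow the text; the expected product is noted in a comment).
/* A hypothetical 3x3 matrix in CSR form:
       [10  0  2]
       [ 0  5  0]
       [ 1  0  7]
   ai holds row start offsets, aj column indices, aa nonzero values. */
int    m    = 3;
int    ai[] = {0, 2, 3, 5};
int    aj[] = {0, 2, 1, 0, 2};
double aa[] = {10.0, 2.0, 5.0, 1.0, 7.0};
double x[]  = {1.0, 1.0, 1.0};
double y[3];
/* Running the kernel above gives y = {12.0, 5.0, 8.0}. */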

Related

Row-wise permutation of matrix using SIMD instructions

I'm trying to figure out a suitable way to apply a row-wise permutation of a matrix using SIMD intrinsics (mainly AVX/AVX2 and AVX-512).
The problem is basically calculating R = PX, where P is a sparse permutation matrix with only one nonzero element per column. This allows one to represent P as a vector p where p[i] is the row index of the nonzero value in column i. The code below shows a simple loop to achieve this:
// R and X are row-major 2-D matrices with the same shape (m, n)
for (size_t i = 0; i < m; ++i) {
    for (size_t j = 0; j < n; ++j) {
        R[p[i]][j] += X[i][j];
    }
}
I assume it all boils down to gather/scatter, but before spending a long time implementing various approaches, I would love to know what you folks think, and what the most suitable approach to tackling this is.
Isn't it strange that none of the compilers use AVX-512 for this?
https://godbolt.org/z/ox9nfjh8d
Why is it that gcc doesn't do register blocking? I see clang does a better job; is this common?
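
For what it's worth, since the operation moves whole rows (row i of X lands in row p[i] of R), one simple vectorization needs no gathers at all: just vector loads and stores per row. A minimal AVX2 sketch, assuming row-major float matrices, n a multiple of 8, and a hypothetical function name (it assigns rather than accumulates, which is equivalent for a pure permutation):
#include <immintrin.h>
#include <stddef.h>

/* Copy row i of X into row p[i] of R, 8 floats at a time. */
static void permute_rows_avx2(float *R, const float *X, const size_t *p,
                              size_t m, size_t n)
{
    for (size_t i = 0; i < m; ++i) {
        const float *src = X + i * n;       /* source row i          */
        float       *dst = R + p[i] * n;    /* destination row p[i]  */
        for (size_t j = 0; j < n; j += 8) {
            __m256 v = _mm256_loadu_ps(src + j);
            _mm256_storeu_ps(dst + j, v);
        }
    }
}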

MKL Matrix Transpose

I have very large rectangular and square matrices, float as well as complex. I want to know: is there any in-place MKL transpose routine? There is mkl_?imatcopy in MKL; please help me with an example.
I have tried this, but it did not transpose the matrix:
size_t nEle = noOfCols * noOfRows;
float *data = (float*)calloc(nEle,sizeof(float));
initalizeData(data,noOfCols,noOfRows);
printdata(data,noOfCols,noOfRows);
printf("After transpose \n\n");
mkl_simatcopy('R','T',noOfCols,noOfRows,1,data,noOfPix,noOfCols);
//writeDataFile((char *)data,"AfterTranspose.img",nEle*sizeof(float));
printdata(data,noOfCols,noOfRows);
Take a look at the existing in-place transposition examples for the real and complex float datatypes. The MKL package contains such examples: cimatcopy.c, dimatcopy.c, simatcopy.c, and zimatcopy.c. Please refer to the mklroot/examples/transc/source directory.
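
For reference, here is a minimal sketch of an in-place row-major transpose with mkl_simatcopy, assuming a rows x cols float matrix; note that lda is the source leading dimension (cols) and ldb the destination leading dimension (rows), which is the usual trip-up:
#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void)
{
    size_t rows = 3, cols = 4;
    float *data = malloc(rows * cols * sizeof(float));
    for (size_t i = 0; i < rows * cols; ++i)
        data[i] = (float)i;                 /* 0, 1, 2, ... row-major */

    /* 'R' = row-major ordering, 'T' = transpose, alpha = 1.0f.
       After the call, data holds the cols x rows transpose. */
    mkl_simatcopy('R', 'T', rows, cols, 1.0f, data, cols, rows);

    for (size_t i = 0; i < cols; ++i) {     /* print the transpose */
        for (size_t j = 0; j < rows; ++j)
            printf("%5.1f ", data[i * rows + j]);
        printf("\n");
    }
    free(data);
    return 0;
}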

Elementwise product between a vector and a matrix using GNU Blas subroutines

I am working in C, using the GNU Scientific Library (GSL). Essentially, I need to do the equivalent of the following MATLAB code:
x = x .* (A*x);
where x is a gsl_vector and A is a gsl_matrix.
I managed to compute (A*x) with the following call:
gsl_blas_dgemv(CblasNoTrans, 1.0, A, x, 0.0, res);
where res is another gsl_vector that stores the result. If the matrix A has size m * m and the vector x has size m * 1, then the vector res will have size m * 1.
Now, what remains to be done is the elementwise product of the vectors x and res (the result should be a vector). Unfortunately, I am stuck on this and cannot find the function that does it.
If anyone can help me with that, I would be very grateful. In addition, does anyone know of better documentation for GSL than https://www.gnu.org/software/gsl/manual/html_node/GSL-BLAS-Interface.html#GSL-BLAS-Interface, which so far has been confusing me?
Finally, would I lose time performance if I did this step with a simple for loop (the vector size is around 11000 and this step will be repeated 500-5000 times)?
for (i = 0; i < m; i++)
    gsl_vector_set(res, i, gsl_vector_get(x, i) * gsl_vector_get(res, i));
Thanks!
The function you want is:
gsl_vector_mul(res, x)
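Putting it together with the dgemv call from the question, the whole MATLAB line maps to three calls; a minimal sketch, assuming A, x, and res are already allocated (res is a scratch m-vector):
gsl_blas_dgemv(CblasNoTrans, 1.0, A, x, 0.0, res);  /* res = A*x        */
gsl_vector_mul(res, x);                             /* res = res .* x   */
gsl_vector_memcpy(x, res);                          /* x  = x .* (A*x)  */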
I have used Intel's MKL, and I like the documentation on their website for these BLAS routines.
The for loop is fine if GSL is well designed; for example, gsl_vector_set() and gsl_vector_get() can be inlined. You could compare the running time against gsl_blas_daxpy: if the timings are similar, the for loop is already well optimized.
On the other hand, you may want to try Eigen, a much better matrix library, with which you can implement your operation with code similar to this:
x = x.array() * (A * x).array();

C Code Wavelet Transform and Explanation

I am trying to implement a wavelet transform in C, and I have never done it before. I have read a bit about wavelets, and I understand the 'growing subspaces' idea and how Mallat's one-sided filter bank is essentially the same idea.
However, I am stuck on how to actually implement Mallat's fast wavelet transform. This is what I understand so far:
The high-pass filter, h(t), gives you the detail coefficients. For a given scale j, it is a reflected, dilated, and normed version of the mother wavelet W(t).
g(t) is then the low-pass filter that makes up the difference. It is supposed to be the quadrature mirror filter of h(t).
To get the detail or approximation coefficients for the jth level, you convolve your signal block with h(t) or g(t) respectively, and downsample the signal by 2^j (i.e., take every 2^j-th value).
However, these are my questions:
How can I find g(t) when I know h(t)?
How can I compute the inverse of this transform?
Do you have any C code that I can reference? (Yes, I found the one on the wiki, but it doesn't help.)
What I would like some code to say is:
A. Here is the filter.
B. Here is the transform (very explicitly).
C. Here is the inverse transform (again, for dummies).
Thanks for your patience, but there doesn't seem to be a step 1, step 2, step 3 guide out there with explicit examples (that aren't Haar, where all the coefficients are 1s, which makes things confusing).
The Mallat recipe for the FWT is really simple. If you look at the MATLAB code, e.g. the script by Jeffrey Kantor, all the steps are obvious.
In C it is a bit more work, but that is mainly because you need to take care of your own declarations and allocations.
Firstly, about your summary:
usually the filter h is a low-pass filter, representing the scaling function (the father wavelet)
likewise, g is usually the high-pass filter representing the (mother) wavelet
you cannot perform a J-level decomposition in one filtering+downsampling step. At each level, you create an approximation signal c by filtering with h and downsampling, and a detail signal d by filtering with g and downsampling, and you repeat this at the next level (using the current c)
About your questions:
for a filter h of an orthogonal wavelet basis, [h_1 h_2 .. h_m h_n], the QMF is [h_n -h_m .. h_2 -h_1], where n is an even number and m == n-1 (see the sketch after this list)
the inverse transform does the opposite of the FWT: at each level it upsamples the detail d and the approximation c, convolves d with g and c with h, and adds the two signals together -- see the corresponding MATLAB script.
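
In 0-indexed C, that alternating-flip rule is a one-line loop; a minimal sketch, assuming arrays h and g of doubles and an even filter length n (sign conventions vary between texts):
/* QMF: reverse h and negate every other tap, g[k] = (-1)^k * h[n-1-k]. */
for (int k = 0; k < n; ++k)
    g[k] = (k % 2 == 0) ? h[n-1-k] : -h[n-1-k];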
Using this information, and given a signal x of len points of type double, a scaling filter h and a wavelet filter g of f coefficients each (also of type double), and a decomposition level lev, this piece of code implements the Mallat FWT:
/* y: caller-provided output of len doubles; after the loop it holds the
   final approximation followed by the detail coefficients of each level. */
double *t = calloc(len+f-1, sizeof(double));   /* zero-padded workspace */
memcpy(t, x, len*sizeof(double));
for (int i = 0; i < lev; i++) {
    memset(y, 0, len*sizeof(double));
    int len2 = len/2;
    for (int j = 0; j < len2; j++)
        for (int k = 0; k < f; k++) {
            y[j]      += t[2*j+k] * h[k];      /* approximation c */
            y[j+len2] += t[2*j+k] * g[k];      /* detail d        */
        }
    len = len2;
    memcpy(t, y, len*sizeof(double));          /* next level filters c */
}
free(t);
It uses one extra array: a 'workspace' t into which the approximation c (initially the input signal x) is copied for the next iteration.
See this example C program, which you can compile with gcc -std=c99 -fpermissive main.cpp and run with ./a.out.
The inverse should also be something along these lines; see the sketch below. Good luck!
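
For completeness, one reconstruction level can be written as the adjoint of the analysis kernel above; a minimal sketch, assuming orthogonal filters, with the boundary/padding issues discussed in the next answer ignored (r must have room for len+f-2 doubles):
/* Rebuild a length-len signal from approximation c and detail d
   (each len/2 long): upsample, filter with h and g, and add. */
memset(r, 0, (len+f-2)*sizeof(double));
int len2 = len/2;
for (int j = 0; j < len2; j++)
    for (int k = 0; k < f; k++)
        r[2*j+k] += c[j]*h[k] + d[j]*g[k];   /* adjoint of the analysis step */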
The only thing that is missing is some padding for the filter operation.
The lines
    y[j]      += t[2*j+k] * h[k];
    y[j+len2] += t[2*j+k] * g[k];
exceed the boundaries of the valid data in the t array during the first iteration, and exceed the approximation part of the array during the following iterations. One must add (f-1) zero elements at the beginning of the t array:
/* y: caller-provided output of len doubles, as before */
double *t = calloc(len+f-1, sizeof(double));
memcpy(&t[f-1], x, len*sizeof(double));     /* data preceded by f-1 zeros */
for (int i = 0; i < lev; i++) {
    memset(t, 0, (f-1)*sizeof(double));     /* re-zero the padding        */
    memset(y, 0, len*sizeof(double));
    int len2 = len/2;
    for (int j = 0; j < len2; j++)
        for (int k = 0; k < f; k++) {
            y[j]      += t[2*j+k] * h[k];
            y[j+len2] += t[2*j+k] * g[k];
        }
    len = len2;
    memcpy(&t[f-1], y, len*sizeof(double));
}
free(t);

Multiplying three matrices in BLAS with the middle one being diagonal

A is an M x K matrix, B is a vector of size K, and C is a K x N matrix. What set of BLAS operations should I use to compute the matrix below?
M = A*diag(B)*C
One way to implement this would be with three nested loops, as below:
for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < K; ++k)
            M(i,j) += A(i,k)*B(k)*C(k,j);
Is it actually worth implementing this with BLAS in order to gain better speed?
First compute D = diag(B)*C, then use the appropriate BLAS matrix-matrix multiply to compute A*D.
You can implement D = diag(B)*C with a loop over the elements of B, calling the appropriate BLAS scaling routine on each row of C.
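A minimal CBLAS sketch of that two-step plan, assuming row-major double matrices and a caller-provided K x N scratch buffer D (the function name is illustrative):
#include <cblas.h>
#include <string.h>

/* M_out = A * diag(B) * C, with A MxK, B length K, C KxN, D KxN scratch. */
void diag_sandwich(int m, int k, int n,
                   const double *A, const double *B, const double *C,
                   double *D, double *M_out)
{
    /* D = diag(B)*C: scale row i of C by B[i]. */
    memcpy(D, C, (size_t)k * n * sizeof(double));
    for (int i = 0; i < k; ++i)
        cblas_dscal(n, B[i], D + (size_t)i * n, 1);

    /* M_out = A*D with one gemm call. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A, k, D, n, 0.0, M_out, n);
}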
