What does the parameter superb in LAPACKE_dgesvd(..) mean?

Asking questions like this one gives me a bad conscience... nevertheless, I find it surprisingly difficult to google this one. I am experimenting with
lapack_int LAPACKE_dgesvd(
int matrix_order, char jobu, char jobvt,
lapack_int m, lapack_int n, double* a,
lapack_int lda, double* s, double* u, lapack_int ldu,
double* vt, lapack_int ldvt, double* superb);
which promises a singular value decomposition. Having already stopped fearing Fortran, I found a gold mine of information here: http://www.netlib.no/netlib/lapack/double/dgesvd.f
That page explains all parameters except the LAPACKE-specific double* superb (well, and the matrix_order parameter, but in Fortran everything is COL_MAJOR).
Next, here http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_lapack_examples/lapacke_dgesvd_row.c.htm I found a program which seems to hint that this is some kind of worker cache.
However, if that were true what is the reason for LAPACKE_dgesvd_work(..)?
In addition, I have a second question: in the example they use min(M,N)-1 as the size of superb. Why?

According to http://www.netlib.no/netlib/lapack/double/dgesvd.f, the documentation of the parameter WORK in the Fortran version says:
WORK (workspace/output) DOUBLE PRECISION array, dimension (MAX(1,LWORK))
On exit, if INFO = 0, WORK(1) returns the optimal LWORK; if INFO > 0, WORK(2:MIN(M,N)) contains the unconverged superdiagonal elements of an upper bidiagonal matrix B whose diagonal is in S (not necessarily sorted). B satisfies A = U * B * VT, so it has the same singular values as A, and singular vectors related by U and VT.
There is a good chance that superb is the superdiagonal of this upper bidiagonal matrix B, which has the same singular values as A. This also explains the length min(m,n)-1.
A look at lapack-3.5.0/lapacke/src/lapacke_dgesvd.c downloaded from http://www.netlib.org/lapack/ confirms it.
The source code also shows that the high-level function LAPACKE_dgesvd() calls the middle-level interface LAPACKE_dgesvd_work(). If you use the high-level interface, you don't have to care about the optimal size of WORK: it is computed, and WORK is allocated, inside LAPACKE_dgesvd().
I wonder if there is any gain in using the middle-level interface instead... maybe when the function is called many times on small matrices of the same size...
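For illustration, a typical call through the high-level interface might look like this minimal sketch (full_svd is a hypothetical wrapper, assuming a row-major m-by-n matrix and full U and VT):

#include <stdlib.h>
#include <lapacke.h>

/* Sketch: full SVD of a row-major m-by-n matrix a. superb needs only
   min(m,n)-1 entries -- the superdiagonal of the bidiagonal matrix B. */
lapack_int full_svd(lapack_int m, lapack_int n, double* a,
                    double* s, double* u, double* vt)
{
    lapack_int minmn = m < n ? m : n;
    double* superb = malloc((minmn - 1) * sizeof(double));
    lapack_int info = LAPACKE_dgesvd(LAPACK_ROW_MAJOR, 'A', 'A',
                                     m, n, a, n, s, u, m, vt, n, superb);
    /* info > 0 means the bidiagonal QR iteration did not converge; superb
       then holds the unconverged superdiagonal elements of B */
    free(superb);
    return info;
}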

Related

Elementwise product between a vector and a matrix using GNU Blas subroutines

I am working in C, using the GNU Scientific Library (GSL). Essentially, I need to do the equivalent of the following MATLAB code:
x=x.*(A*x);
where x is a gsl_vector, and A is a gsl_matrix.
I managed to do (A*x) with the following command:
gsl_blas_dgemv(CblasNoTrans, 1.0, A, x, 0.0, res);
where res is another gsl_vector, which stores the result. If the matrix A has size m-by-m and the vector x has size m-by-1, then res will have size m-by-1.
Now, what remains to be done is the elementwise product of the vectors x and res (the result should be a vector). Unfortunately, I am stuck on this and cannot find the function which does that.
If anyone can help me with that, I would be very grateful. In addition, does anyone know whether there is better documentation for GSL than https://www.gnu.org/software/gsl/manual/html_node/GSL-BLAS-Interface.html#GSL-BLAS-Interface, which so far is confusing me?
Finally, would I lose performance if I did this step simply with a for loop (the size of the vector is around 11000, and this step will be repeated 500-5000 times)?
for (i = 0; i < m; i++)
gsl_vector_set(res, i, gsl_vector_get(x, i) * gsl_vector_get(res, i));
Thanks!
The function you want is:
gsl_vector_mul(res, x)
I have used Intel's MKL, and I like the documentation on their website for these BLAS routines.
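Putting the pieces together, the whole x = x.*(A*x) step might look like the following minimal sketch (not from the original answers; elementwise_step is a hypothetical name, and x and res are assumed to be pre-allocated gsl_vectors of length m):

#include <gsl/gsl_blas.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>

/* Sketch: x <- x .* (A*x) for an m-by-m matrix A. */
void elementwise_step(const gsl_matrix *A, gsl_vector *x, gsl_vector *res)
{
    /* res = 1.0 * A * x + 0.0 * res; beta = 0 simply overwrites res */
    gsl_blas_dgemv(CblasNoTrans, 1.0, A, x, 0.0, res);
    /* gsl_vector_mul stores the elementwise product in its first argument */
    gsl_vector_mul(x, res);
}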
The for-loop is OK if GSL is well designed; for example, gsl_vector_set() and gsl_vector_get() can be inlined. You could compare the running time with gsl_blas_daxpy: the for-loop is well optimized if the timing results are similar.
On the other hand, you may want to try a much better matrix library, Eigen, with which you can implement your operation with code similar to this:
x = x.array() * (A * x).array();

C Code Wavelet Transform and Explanation

I am trying to implement a wavelet transform in C, and I have never done it before. I have read a bit about wavelets, and I understand the 'growing subspaces' idea and how Mallat's one-sided filter bank is essentially the same idea.
However, I am stuck on how to actually implement Mallat's fast wavelet transform. This is what I understand so far:
The high-pass filter, h(t), gives you the detail coefficients. For a given scale j, it is a reflected, dilated, and normed version of the mother wavelet W(t).
g(t) is then the low-pass filter that makes up the difference. It is supposed to be the quadrature mirror of h(t).
To get the detail coefficients, or the approximation coefficients, for the jth level, you need to convolve your signal block with h(t) or g(t) respectively, and downsample the signal by 2^{j} (i.e. take every 2^{j}th value).
However these are my questions:
How can I find g(t) when I know h(t)?
How can I compute the inverse of this transform?
Do you have any C code that I can reference? (Yes I found the one on wiki but it doesn't help)
What I would like some code to say is:
A. Here is the filter
B. Here is the transform (very explicitly)
C. Here is the inverse transform (again for dummies)
Thanks for your patience, but there doesn't seem to be a Step 1 - Step 2 - Step 3 guide out there with explicit examples (that aren't Haar, because all its coefficients are 1s and that makes things confusing).
The Mallat recipe for the FWT is really simple. If you look at MATLAB code, e.g. the script by Jeffrey Kantor, all the steps are obvious.
In C it is a bit more work but that is mainly because you need to take care of your own declarations and allocations.
Firstly, about your summary:
usually the filter h is a lowpass filter, representing the scaling function (father)
likewise, g is usually the highpass filter representing the wavelet (mother)
you cannot perform a J-level decomposition in 1 filtering+downsampling step. At each level, you create an approximation signal c by filtering with h and downsampling, and a detail signal d by filtering with g and downsampling, and repeat this at the next level (using the current c)
About your questions:
for a filter h of an orthogonal wavelet basis, [h_1 h_2 .. h_m h_n], the QMF is [h_n -h_m .. h_2 -h_1], where n is an even number and m == n-1 (see the snippet after this list)
the inverse transform does the opposite of the FWT: at each level it upsamples detail d and approximation c, convolves d with g and c with h, and adds the signals together -- see the corresponding MATLAB script.
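To make the QMF relation concrete, with this convention g can be derived from h in one loop (a sketch, assuming arrays h and g of even length f):

/* g[k] = (-1)^k * h[f-1-k]: reverse h and alternate the signs,
   matching [h_n -h_m .. h_2 -h_1] above */
for (int k = 0; k < f; k++)
    g[k] = (k % 2 == 0 ? 1.0 : -1.0) * h[f - 1 - k];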
Using this information, and given a signal x of len points of type double, a scaling filter h and a wavelet filter g of f coefficients each (also of type double), and a decomposition level lev, this piece of code implements the Mallat FWT:
/* y: output array of len doubles, allocated by the caller; after the loop it
   holds the details of all levels followed by the final approximation */
double *t = calloc(len + f - 1, sizeof(double)); /* zero-padded workspace */
memcpy(t, x, len * sizeof(double));
for (int i = 0; i < lev; i++) {
    memset(y, 0, len * sizeof(double));
    int len2 = len / 2;
    for (int j = 0; j < len2; j++)
        for (int k = 0; k < f; k++) {
            y[j]      += t[2*j+k] * h[k]; /* approximation: filter with h, downsample by 2 */
            y[j+len2] += t[2*j+k] * g[k]; /* detail: filter with g, downsample by 2 */
        }
    len = len2;
    memcpy(t, y, len * sizeof(double)); /* next level processes the approximation */
}
free(t);
It uses one extra array: a 'workspace' t to copy the approximation c (the input signal x to start with) for the next iteration.
See this example C program, which you can compile with gcc -std=c99 -fpermissive main.cpp and run with ./a.out.
The inverse should also be something along these lines. Good luck!
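For completeness, a single-level reconstruction under the same conventions could look like this sketch (not from the original answer; c and d are the approximation and detail halves of length len2, x is a zeroed output buffer with room for 2*len2 + f - 1 doubles, and boundary handling is ignored):

/* synthesis is the transpose of the analysis step of an orthogonal
   filter bank: upsample by 2, filter with h and g, and add */
for (int j = 0; j < len2; j++)
    for (int k = 0; k < f; k++)
        x[2*j + k] += c[j] * h[k] + d[j] * g[k];

For a multi-level inverse, apply this from the coarsest level upwards, feeding each reconstructed approximation into the next level.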
The only thing missing in the forward-transform code above is some padding for the filter operation.
The lines
y[j] +=t[2*j+k]*h[k];
y[j+len2]+=t[2*j+k]*g[k];
exceed the boundaries of the t-array during first iteration and exceed the approximation part of the array during the following iterations. One must add (f-1) elements at the beginning of the t-array.
double *t = calloc(len + f - 1, sizeof(double));
memcpy(&t[f-1], x, len * sizeof(double)); /* offset f-1: the signal starts after the (f-1) padding zeros */
for (int i = 0; i < lev; i++) {
    memset(t, 0, (f - 1) * sizeof(double)); /* re-zero the front padding */
    memset(y, 0, len * sizeof(double));     /* y: caller-allocated output, as before */
    int len2 = len / 2;
    for (int j = 0; j < len2; j++)
        for (int k = 0; k < f; k++) {
            y[j]      += t[2*j+k] * h[k];
            y[j+len2] += t[2*j+k] * g[k];
        }
    len = len2;
    memcpy(&t[f-1], y, len * sizeof(double));
}
free(t);

Solve a banded matrix system of equations

I need to solve a 2D Poisson equation, that is, a system of equations of the form AX=B, where A is an n-by-n matrix and B is an n-by-1 vector. Since A is a discretization matrix for the 2D Poisson problem, I know that only 5 of its diagonals are non-zero: the main diagonal, the first diagonals above and below it, and the two diagonals m positions above and below it. LAPACK doesn't provide functions to solve this particular problem, but it has functions for solving banded systems of equations, namely DGBTRF (for the LU factorization) and DGBTRS. After reading the LAPACK documentation about band storage, I learned that I have to create a (3*m+1)-by-n matrix to store A in band storage format; let's call this matrix AB. Now the questions:
1) What is the difference between dgbtrs and dgbtrs_? Intel MKL provides both, but I can't understand why.
2) dgbtrf requires the band-storage matrix to be an array. Should I linearize AB by rows or by columns?
3) Is this the correct way to call the two functions?
int n, m;
double *AB;
/*... fill n, m, AB, with appropriate numbers */
int *pivots;
int nrows = 3 * m + 1, info, rhs = 1;
dgbtrf_(&n, &n, &m, &m, AB, &nrows, pivots, &info);
char trans = 'N';
dgbtrs_(&trans, &n, &m, &m, &rhs, AB, &nrows, pivots, B, &n, &info);
Intel MKL also provides DGBTRS and DGBTRS_. Those are Fortran administrativa that you should not care about; just call dgbtrs. (The reason is that on some architectures Fortran routine names have an underscore appended, on others not, and names may be either upper or lower case; Intel MKL #defines the right one to dgbtrs.)
LAPACK routines expect column-major matrices (i.e. Fortran style): store the columns one after the other. The banded storage you must use is not hard: http://www.netlib.org/lapack/lug/node124.html.
It seems good to me, but please try it on small problems beforehand (always a good idea, by the way). Also make sure you handle a non-zero info (this is the way errors are reported).
Better style is to use MKL_INT instead of plain int; it is a typedef to the right integer type (which may be different on some architectures).
Also make sure to allocate memory for pivots before calling dgbtrf.
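Putting those points together, a corrected call sequence might look like the following sketch (the dgbtrf_/dgbtrs_ prototypes are written out by hand for illustration; with MKL you would include mkl.h and use MKL_INT instead of int):

#include <stdio.h>
#include <stdlib.h>

void dgbtrf_(const int *m, const int *n, const int *kl, const int *ku,
             double *ab, const int *ldab, int *ipiv, int *info);
void dgbtrs_(const char *trans, const int *n, const int *kl, const int *ku,
             const int *nrhs, const double *ab, const int *ldab,
             const int *ipiv, double *b, const int *ldb, int *info);

/* Solve A*X = B for a banded n-by-n A with kl = ku = m, where AB is in
   column-major LAPACK band storage with ldab = 3*m+1 rows. */
int solve_banded(int n, int m, double *AB, double *B)
{
    int kl = m, ku = m, ldab = 2*kl + ku + 1, nrhs = 1, info;
    int *pivots = malloc(n * sizeof(int)); /* dgbtrf needs n pivot entries */
    if (!pivots) return -1;

    dgbtrf_(&n, &n, &kl, &ku, AB, &ldab, pivots, &info);
    if (info == 0) {
        char trans = 'N';
        dgbtrs_(&trans, &n, &kl, &ku, &nrhs, AB, &ldab, pivots, B, &n, &info);
    }
    if (info != 0)
        fprintf(stderr, "banded solve failed: info = %d\n", info);
    free(pivots);
    return info; /* 0 on success */
}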
This might be off-topic, but for the Poisson equation an FFT-based solution is much faster: take the 2D FFT of the right-hand side, divide by -(k^2 + lambda^2), then take the inverse FFT. Here lambda is a small number that avoids a division by zero at k = 0. The 5-diagonal system is a band-limited approximation of the Poisson equation, which approximates the differential operator by finite differences.
http://en.wikipedia.org/wiki/Screened_Poisson_equation
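For instance, with FFTW that approach could be sketched as follows (poisson_fft is a hypothetical name; a periodic nx-by-ny grid with unit spacing is assumed, and the scaling should be checked against your discretization):

#include <fftw3.h>
#include <math.h>

/* Sketch: solve (laplacian - lambda^2) phi = rhs spectrally on a
   periodic grid; rhs and phi are nx*ny real arrays. */
void poisson_fft(int nx, int ny, double *rhs, double *phi, double lambda)
{
    int nk = nx * (ny/2 + 1); /* size of the real-to-complex spectrum */
    fftw_complex *spec = fftw_malloc(nk * sizeof(fftw_complex));
    fftw_plan fwd = fftw_plan_dft_r2c_2d(nx, ny, rhs, spec, FFTW_ESTIMATE);
    fftw_plan bwd = fftw_plan_dft_c2r_2d(nx, ny, spec, phi, FFTW_ESTIMATE);

    fftw_execute(fwd);
    for (int i = 0; i < nx; i++) {
        double kx = 2.0 * M_PI * (i <= nx/2 ? i : i - nx) / nx;
        for (int j = 0; j <= ny/2; j++) {
            double ky = 2.0 * M_PI * j / ny;
            /* divide by -(k^2 + lambda^2); lambda != 0 keeps k = 0 finite;
               the extra nx*ny undoes FFTW's unnormalized transform pair */
            double d = -(kx*kx + ky*ky + lambda*lambda) * nx * ny;
            spec[i*(ny/2+1) + j][0] /= d;
            spec[i*(ny/2+1) + j][1] /= d;
        }
    }
    fftw_execute(bwd);
    fftw_destroy_plan(fwd); fftw_destroy_plan(bwd); fftw_free(spec);
}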

Is it safe to pass GEMV the same output- as input vector to achieve a destructive matrix application?

If A is an n x n matrix and x is a vector of dimension n, is it possible to pass x to GEMV as the argument for both the x and the y parameters, with beta = 0, to achieve the operation x ← A ⋅ x?
I'm specifically interested in the Cublas implementation, with C interface.
No. And for Fortran it has nothing to do with the implementation: aliasing actual arguments of a subprogram breaks the language standard unless those arguments are Intent(In). Thus, if the interface has dummy arguments that are Intent(Out), Intent(InOut), or have no Intent, you should always use separate variables for the corresponding actual arguments when invoking the subprogram.
NO.
Each element of the output depends on ALL elements of the input vector x.
For example, if x is the input, y is the output, and A is the matrix, the ith element of y is generated as
y_i = A_i1*x_1 + A_i2*x_2 + ... + A_in*x_n
So if you overwrite x_i with the result from above, some later y_r, which depends on x_i, will be computed from the already-overwritten value and produce improper results.
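In practice, the safe pattern is to run GEMV into a scratch vector and then swap, e.g. with cuBLAS (a sketch; d_A, d_x, and d_tmp are assumed pre-allocated device buffers, and error checking is omitted):

#include <cublas_v2.h>

/* x <- A*x without aliasing: GEMV into a scratch vector, then swap pointers */
void gemv_inplace(cublasHandle_t handle, int n, const double *d_A,
                  double **d_x, double **d_tmp)
{
    const double one = 1.0, zero = 0.0;
    /* tmp = 1.0*A*x + 0.0*tmp: input and output vectors are distinct */
    cublasDgemv(handle, CUBLAS_OP_N, n, n, &one, d_A, n, *d_x, 1,
                &zero, *d_tmp, 1);
    double *swap = *d_x; *d_x = *d_tmp; *d_tmp = swap;
}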
EDIT
I was going to make this a comment, but it was getting too big. So here is the explanation of why the above reasoning also holds for parallel implementations.
Unless each parallel group/thread makes a local copy of the original data (in which case the original data can safely be destroyed), this line of reasoning holds.
However, doing so (making a local copy) is only practical and beneficial when:
1. each parallel thread/block would not be able to access the original array without a significant amount of overhead;
2. there is enough local memory (call it cache, or shared memory, or even regular memory in the case of MPI) to hold a separate copy for each parallel thread/block.
Notes:
(1) may not be true for many multi-threaded applications on a single machine.
(1) may be true for CUDA but (2) is definitely not applicable for CUDA.

Mex sparse matrix

I created a sparse matrix in MEX using mxCreateSparse.
mxArray *W;
W = mxCreateSparse(n*n, n*n, xsize, mxREAL);
double *wpoint;
wpoint = mxGetPr(W);
for (p = 0; p < xsize; p++)
{
    wpoint[(returnindex1(xi[p][0], xi[p][1]) - 1)*n*n + (returnindex1(xj[p][0], xj[p][1]))] = exp(-df[p]/(SIGMAI*SIGMAI)) * exp(-dx[p]/(SIGMAJ*SIGMAJ));
}
The maximum value that can come from (returnindex1(xi[p][0],xi[p][1])-1)*n*n + (returnindex1(xj[p][0],xj[p][1])) is n*n*n*n, and I have created the sparse matrix with dimensions (n*n)-by-(n*n).
When I display the whole matrix, some of the zero elements come out as junk.
Also, for large values of n, a segmentation fault occurs at wpoint.
The pr array holds only xsize elements, and you are accessing it with out-of-bounds indices. Hence the segmentation violation.
I think your fundamental problem is that you have not fully grasped how sparse matrices are stored in MATLAB. I'm not an expert on the MATLAB implementation details but my recollection is that it uses compressed column storage.
In essence there are 3 arrays as follows:
double pr[NZMAX], which contains the NZMAX non-zero values.
int ir[NZMAX], which contains the row number of each value in pr.
int jc[m+1], which indexes into pr and ir: jc[j] is the offset of the first item of column j, and jc[m] is the total number of non-zeros.
That's the executive summary, but I recommend that you read up on the details more carefully.
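To make that concrete, here is a sketch of filling the three arrays directly (fill_sparse is a hypothetical helper; it assumes the coordinate data is sorted by column, and by row within each column):

#include "mex.h"

/* Fill an N-by-N sparse matrix created with mxCreateSparse(N, N, nzmax, mxREAL)
   from coordinate data (rows, cols, vals), sorted by column then row. */
void fill_sparse(mxArray *W, mwIndex N, mwIndex nzmax,
                 const mwIndex *rows, const mwIndex *cols, const double *vals)
{
    double  *pr = mxGetPr(W); /* non-zero values             */
    mwIndex *ir = mxGetIr(W); /* row index of each value     */
    mwIndex *jc = mxGetJc(W); /* column offsets, N+1 entries */
    mwIndex k = 0;
    for (mwIndex col = 0; col < N; col++) {
        jc[col] = k; /* index of the first entry of this column */
        while (k < nzmax && cols[k] == col) {
            ir[k] = rows[k];
            pr[k] = vals[k];
            k++;
        }
    }
    jc[N] = k; /* total number of stored non-zeros */
}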
