Sparse matrix-matrix multiplication in C

I'm currently working with sparse matrices, and I have to compare the computation time of sparse matrix-matrix multiplication against full (dense) matrix-matrix multiplication. The issue is that my sparse computation is far slower than the dense one.
I'm storing my matrices in Compressed Row Storage (CRS) format, and multiplying two matrices is very time consuming (a quadruple for loop), so I'm wondering whether there is a compression format better suited to matrix-matrix operations (CRS is very handy for matrix-vector computation).
Thanks in advance!

It's usually referred to as "Compressed Sparse Row" (CSR), not CRS. Its transpose, Compressed Sparse Column (CSC), is also commonly used, including by the CSparse package that ends up being the backend of quite a few systems, including MATLAB and SciPy (I think).
There is also a less common Doubly Compressed Sparse Column (DCSC) format, used by the Combinatorial BLAS. It compresses the column index a second time and is useful when the matrix is hypersparse. A hypersparse matrix has most of its columns empty, something that happens, for example, after a 2D decomposition of the matrix across processes.
That said, yes, there is more overhead in sparse formats. However, your operation count is now governed by the number of nonzeros rather than by the matrix dimensions, so even though your FLOP rate is lower, you can still get the answer sooner.
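To make that concrete: a row-by-row CSR product (Gustavson's algorithm) avoids the quadruple loop entirely and only touches the nonzeros. A minimal C++ sketch; the Csr struct and spgemm function are illustrative names, not from any particular library:

```cpp
#include <vector>

// Minimal CSR container: row_ptr has n_rows+1 entries; col and val hold the nnz entries.
struct Csr {
    int n_rows = 0, n_cols = 0;
    std::vector<int> row_ptr, col;
    std::vector<double> val;
};

// Row-by-row SpGEMM (Gustavson's algorithm): for each row i of A, scatter the
// scaled rows of B into a dense accumulator, then gather the nonzeros.
// The cost is proportional to the number of multiply-adds actually needed,
// not to the full matrix dimensions.
Csr spgemm(const Csr& A, const Csr& B) {
    Csr C;
    C.n_rows = A.n_rows;
    C.n_cols = B.n_cols;
    C.row_ptr.assign(A.n_rows + 1, 0);

    std::vector<double> acc(B.n_cols, 0.0);  // dense accumulator for one row of C
    std::vector<int> occupied(B.n_cols, 0);  // marks which columns were touched
    for (int i = 0; i < A.n_rows; ++i) {
        std::vector<int> cols_in_row;
        for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
            int k = A.col[p];
            double a_ik = A.val[p];
            for (int q = B.row_ptr[k]; q < B.row_ptr[k + 1]; ++q) {
                int j = B.col[q];
                if (!occupied[j]) { occupied[j] = 1; cols_in_row.push_back(j); }
                acc[j] += a_ik * B.val[q];
            }
        }
        // Gather this row's nonzeros (left unsorted here for brevity; real
        // implementations usually sort the column indices of each row).
        for (int j : cols_in_row) {
            C.col.push_back(j);
            C.val.push_back(acc[j]);
            acc[j] = 0.0;
            occupied[j] = 0;
        }
        C.row_ptr[i + 1] = (int)C.col.size();
    }
    return C;
}
```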

You might look at the paper "Efficient Sparse Matrix-Matrix Products Using Colorings" (http://www.mcs.anl.gov/papers/P5007-0813_1.pdf) for a discussion of how to achieve high performance with sparse matrix-matrix products.

Related

Julia: all eigenvalues of large sparse matrix

I have a large sparse matrix, for example, 128000×128000 SparseMatrixCSC{Complex{Float64},Int64} with 1376000 stored entries.
How can I quickly get all eigenvalues of the sparse matrix? Is it possible?
I tried eigs on the 128000×128000 matrix with 1376000 stored entries, but the kernel died.
I'm using a MacBook Pro with 16 GB of memory and Julia 1.3.1 in a Jupyter notebook.
As far as I'm aware (and I would love to be proven wrong), there is no efficient way to get all the eigenvalues of a general sparse matrix.
The main algorithm for computing all the eigenvalues of a matrix is the QR algorithm. The first step of the QR algorithm is to reduce the matrix to Hessenberg form (so that each QR factorisation costs O(n^2) instead of O(n^3)). The problem is that reducing a matrix to Hessenberg form destroys the sparsity, and you just end up with a dense matrix.
There are also other methods to compute the eigenvalues of a matrix, such as (inverse) power iteration, which only require matrix-vector products and solving linear systems. However, these only give you the largest or smallest eigenvalues, and they become expensive when you want to compute all the eigenvalues (they require storing the computed eigenvectors for the "deflation").
So that was the general case. If your matrix has some special structure, there may be better alternatives. For example, if your matrix is symmetric, then its Hessenberg form is tridiagonal, and you can compute all the eigenvalues of a tridiagonal matrix pretty fast.
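To illustrate the symmetric case: once the matrix has been reduced to tridiagonal form, LAPACK's dstev computes all its eigenvalues directly. A minimal C++ sketch, assuming LAPACKE is available; the 5×5 example matrix is made up for illustration:

```cpp
#include <lapacke.h>
#include <cstdio>
#include <vector>

int main() {
    // Toy tridiagonal matrix: d is the diagonal (n entries), e the
    // off-diagonal (n-1 entries). A symmetric sparse matrix would first be
    // reduced to this form (e.g. with DSYTRD or a Lanczos process).
    const int n = 5;
    std::vector<double> d(n, 2.0), e(n - 1, -1.0);

    // 'N' = eigenvalues only; on success, d is overwritten with the
    // eigenvalues in ascending order (z is not referenced for jobz = 'N').
    lapack_int info = LAPACKE_dstev(LAPACK_ROW_MAJOR, 'N', n,
                                    d.data(), e.data(), nullptr, n);
    if (info != 0) { std::fprintf(stderr, "dstev failed: %d\n", (int)info); return 1; }
    for (double ev : d) std::printf("%g\n", ev);
}
```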
TL;DR: Is it possible? In general, no.
P.S.: I tried to keep this short, but if you're interested I can give you more details on any part of the answer.

MPI partition a matrix into smaller matrices

The situation is this:
I have an array of dimensions 4x4, and what I have to do is partition this matrix into "blocks" (i.e. smaller matrices) and distribute them to the "slave" processes. More specifically, suppose the total number of processes is 4 (1 master, 3 slaves, and all of them take part in the computation), which makes for a partitioning of the 4x4 matrix into four 2x2 matrices. However, I would like to avoid allocating intermediate 2x2 "buffers" for the packing.
The question is: is there any "clever", more "painless" way to manage it?
P.S.: I have to work on this problem (http://www.cas.usf.edu/~cconnor/parallel/2dheat/2dheat.html), which means that a Cartesian communicator will be created.
This is essentially how (plain) MPI works. The 2×2 matrices constitute a distributed data structure; together, they comprise the actual 4×4 matrix. You could of course also use four 1×4 or 4×1 matrices, which has some advantages (simpler programming) and disadvantages (more communication needed when scaling up).
In actual problems, such as a 2D heat equation, you often need to consider the halo around each local matrix. This halo is then exchanged during each simulation step.
Note that the code you linked uses full-sized matrices on each worker rank. This is a simplification, but it wastes resources and is thus not scalable.
MPI gives you some help managing this distributed data, for instance via Cartesian communicators, derived datatypes for describing blocks (as sketched below), or one-sided communication for easier halo exchange, but essentially you still have to manage the distributed data structure yourself.
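For the concrete 4×4-into-2×2 case, one relatively painless approach is a derived datatype built with MPI_Type_create_subarray and resized so that MPI_Scatterv can address the blocks directly, with no intermediate packing buffers. A minimal sketch, assuming exactly 4 processes and hard-coding this particular layout:

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   // assumed to be exactly 4 here

    const int N = 4, B = 2;                   // global size, block size
    double global[N * N];
    if (rank == 0)
        for (int i = 0; i < N * N; ++i) global[i] = i;  // 0..15, row-major

    // Describe one B x B block of the N x N array...
    MPI_Datatype block, blocktype;
    int sizes[2]    = {N, N};
    int subsizes[2] = {B, B};
    int starts[2]   = {0, 0};
    MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C,
                             MPI_DOUBLE, &block);
    // ...then shrink its extent to B doubles so displacements can count blocks.
    MPI_Type_create_resized(block, 0, B * sizeof(double), &blocktype);
    MPI_Type_commit(&blocktype);

    // Block row bi, block column bj starts bi*N + bj extents into the array.
    int counts[4] = {1, 1, 1, 1};
    int displs[4] = {0, 1, N, N + 1};         // blocks (0,0),(0,1),(1,0),(1,1)

    double local[B * B];                      // contiguous 2x2 block per rank
    MPI_Scatterv(global, counts, displs, blocktype,
                 local, B * B, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    std::printf("rank %d got %g %g %g %g\n",
                rank, local[0], local[1], local[2], local[3]);

    MPI_Type_free(&blocktype);
    MPI_Type_free(&block);
    MPI_Finalize();
}
```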
There are parallel paradigms that provide higher-level abstractions of distributed data structures, but even an overview would IMHO be too broad for this format. Many of them are related to the Partitioned Global Address Space (PGAS) concept; implementations range from new languages and language extensions (Co-array Fortran) to libraries and frameworks, some of which use MPI internally.

Eigen Sparse Matrix

I am trying to multiply two large sparse matrices, of sizes 300k × 1000k and 1000k × 300k, using Eigen. The matrices are highly sparse, with ~0.01% nonzero entries, but there is no block or other structure in their sparsity.
It turns out that Eigen chokes and ends up taking 55-60 GB of memory. Actually, it makes the final matrix dense, which explains why it takes so much memory.
I have tried multiplying matrices of similar sizes where one of the matrices is diagonal, and that multiplication works fine, with ~2-3 GB of memory.
Any thoughts on what's going wrong?
Even though your matrices are sparse, the result might be completely dense. You can try to remove the smallest entries with (A*B).pruned(ref, eps), where ref is a reference value for what counts as a nonzero and eps is a tolerance: all entries smaller than ref*eps will be removed during the computation of the product, reducing both the memory usage and the size of the result. A better option would be to find a way to avoid performing this product in the first place.
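A minimal sketch of the pruned product on toy matrices (the sizes and values are made up; in the question they would be 300k × 1000k with ~0.01% nonzeros):

```cpp
#include <Eigen/Sparse>
#include <iostream>

int main() {
    using SpMat = Eigen::SparseMatrix<double>;

    // Tiny stand-in matrices with a couple of near-zero entries.
    SpMat A(4, 6), B(6, 4);
    A.insert(0, 1) = 1.0;  A.insert(2, 3) = 1e-12;  A.insert(3, 5) = 2.0;
    B.insert(1, 0) = 3.0;  B.insert(3, 2) = 4.0;    B.insert(5, 1) = 1e-12;

    // pruned(ref, eps) drops entries with magnitude below ref*eps while the
    // product is being formed, so near-zeros never inflate the result.
    double ref = 1.0, eps = 1e-9;
    SpMat C = (A * B).pruned(ref, eps);

    std::cout << "nonzeros in C: " << C.nonZeros() << "\n";  // 1, not 3
}
```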

Finding eigenvalues of large, sparse matrix

I am working on the Fermion and Boson Hubbard models, in which the dimension of the Hilbert space is quite large (~50k). I am currently using the LAPACK routine DSYEV to determine the eigenvalues and eigenfunctions of the large (50k × 50k) Hamiltonian matrix, but this takes a long time, about 8 hours on a Xeon workstation.
I would like to reduce this run time on this particular machine. I am looking at the Lanczos method and wondering whether this is the best option, or whether there is another choice.
The Lanczos method (like other iterative methods) is used to compute extreme (smallest/largest) eigenvalues. It is better than the direct DSYEV if you need far fewer eigenvalues and eigenfunctions than the system size (50k). And if your matrix is sparse, the acceleration you get is much greater, since each iteration only needs a matrix-vector product.
If you are looking for all eigenvalues and your matrix is dense, then the direct DSYEV is the better method.
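To make the "matrix-vector products only" point concrete, here is a minimal power-iteration sketch in C++ with Eigen. It is not Lanczos (which keeps a whole Krylov basis and converges much faster, especially for clustered eigenvalues), but it uses the same primitive; the toy sparse Hamiltonian is a made-up stand-in for the 50k × 50k matrix:

```cpp
#include <Eigen/Sparse>
#include <Eigen/Dense>
#include <cstdio>

int main() {
    const int n = 200;
    Eigen::SparseMatrix<double> H(n, n);
    H.reserve(Eigen::VectorXi::Constant(n, 3));
    // Toy sparse "Hamiltonian": diagonal 1..n with a weak nearest-neighbour
    // coupling. By Gershgorin, the largest eigenvalue lies close to n.
    for (int i = 0; i < n; ++i) {
        H.insert(i, i) = i + 1.0;
        if (i + 1 < n) { H.insert(i, i + 1) = -0.25; H.insert(i + 1, i) = -0.25; }
    }
    H.makeCompressed();

    // Power iteration: H is only ever touched through the product H * v.
    Eigen::VectorXd v = Eigen::VectorXd::Random(n).normalized();
    double lambda = 0.0;
    for (int it = 0; it < 2000; ++it) {
        Eigen::VectorXd w = H * v;   // one sparse matrix-vector product
        lambda = v.dot(w);           // Rayleigh quotient estimate
        v = w.normalized();
    }
    std::printf("largest eigenvalue ~ %g (expected close to %d)\n", lambda, n);
}
```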

How to transpose a matrix in an optimal way using BLAS?

I'm doing some calculations and analyzing the strengths and weaknesses of different BLAS implementations. However, I have come across a problem.
I'm testing cuBLAS. Doing linear algebra on the GPU would seem like a good idea, but there is one problem.
The cuBLAS implementation uses the column-major format, and since this is not what I need in the end, I'm curious whether there is a way to make BLAS do a matrix transpose?
BLAS doesn't have a matrix transpose routine built in. The CUDA SDK includes a matrix transpose example, with a paper that discusses optimal strategies for performing a transpose. Your best strategy is probably to use row-major inputs to CUBLAS with the transposed-input versions of the calls, perform the intermediate calculations in column-major, and finally perform a transpose afterwards using the SDK transpose kernel.
Edited to add that CUBLAS version 5 added a suitable routine, geam, which can perform matrix transposition in GPU memory and should be regarded as optimal for whatever architecture you are using.
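A minimal sketch of an out-of-place transpose with cublasDgeam (sizes are illustrative; geam computes C = alpha*op(A) + beta*op(B), so alpha = 1, op(A) = A^T, beta = 0 yields C = A^T):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    const int m = 3, n = 2;                 // A is m x n, column-major
    std::vector<double> hA = {1, 2, 3,      // column 0
                              4, 5, 6};     // column 1

    double *dA, *dC;
    cudaMalloc(&dA, m * n * sizeof(double));
    cudaMalloc(&dC, m * n * sizeof(double));
    cudaMemcpy(dA, hA.data(), m * n * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const double alpha = 1.0, beta = 0.0;
    // C is n x m. When beta == 0 the B argument is not read, so we pass dA
    // as a placeholder with a valid leading dimension.
    cublasDgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N, n, m,
                &alpha, dA, m, &beta, dA, n, dC, n);

    std::vector<double> hC(m * n);
    cudaMemcpy(hC.data(), dC, m * n * sizeof(double), cudaMemcpyDeviceToHost);
    for (double x : hC) std::printf("%g ", x);  // 1 4 2 5 3 6
    std::printf("\n");

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dC);
}
```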
