How to transpose a matrix in an optimal way using blas? - c

I'm doing some calculations and analysing the strengths and weaknesses of different BLAS implementations. However, I have come across a problem.
I'm testing cuBLAS; doing linear algebra on the GPU seems like a good idea, but there is one problem.
The cuBLAS implementation uses column-major format, and since this is not what I need in the end, I'm curious whether there is a way to make BLAS do a matrix transpose.

BLAS doesn't have a matrix transpose routine built in. The CUDA SDK includes a matrix transpose example, with a paper that discusses the optimal strategy for performing a transpose. Your best strategy is probably to pass row-major inputs to CUBLAS using the transposed-input versions of the calls, perform the intermediate calculations in column-major, and finally transpose the result with the SDK transpose kernel.
Edited to add: CUBLAS added a transpose-capable routine, geam, in CUBLAS version 5. It can perform matrix transposition in GPU memory and should be regarded as optimal for whatever architecture you are using.
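To make the geam idea concrete, here is a minimal CPU sketch in plain C of what a geam-style operation computes on column-major data, C = alpha * op(A) + beta * B. The names geam_ref and row_major_to_col_major are hypothetical stand-ins, not the cuBLAS API (the real call also takes a handle and operation enums, so check the cuBLAS documentation for the exact signature). The point it illustrates: a row-major rows x cols buffer, read as column-major, is already the transpose with leading dimension cols, so one transposing pass yields true column-major storage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU stand-in for a geam-style call on column-major data:
   C (m x n) = alpha * op(A) + beta * B, where op(A) is A or A^T.
   Element (i, j) of a column-major matrix X lives at X[j * ldx + i]. */
void geam_ref(int trans_a, int m, int n,
              double alpha, const double *A, int lda,
              double beta, const double *B, int ldb,
              double *C, int ldc)
{
    assert(m >= 0 && n >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            /* op(A)(i, j) is A(j, i) when transposed, A(i, j) otherwise. */
            double a = trans_a ? A[(size_t)i * lda + j]
                               : A[(size_t)j * lda + i];
            /* B is only read when beta is nonzero, so it may be NULL. */
            double b = (beta != 0.0) ? B[(size_t)j * ldb + i] : 0.0;
            C[(size_t)j * ldc + i] = alpha * a + beta * b;
        }
    }
}

/* Convert a rows x cols row-major buffer into column-major storage.
   Read as column-major, src is the cols x rows matrix A^T with leading
   dimension cols, so transposing that view gives A in column-major. */
void row_major_to_col_major(int rows, int cols, const double *src, double *dst)
{
    geam_ref(1, rows, cols, 1.0, src, cols, 0.0, NULL, rows, dst, rows);
}
```

For the 2 x 3 row-major matrix {1, 2, 3, 4, 5, 6}, the column-major result is {1, 4, 2, 5, 3, 6}. The same call in the other direction converts a column-major result back to row-major at the end of the computation.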

Related

Sparse matrix-matrix multiplication

I'm currently working with sparse matrices, and I have to compare the computation time of sparse matrix-matrix multiplication with full matrix-matrix multiplication. The issue is that sparse matrix computation is waaaaay slower than full matrix computation.
I'm compressing my matrices with Compressed Row Storage, and multiplying two matrices is very time consuming (a quadruple for loop), so I'm wondering if there is a compression format better suited to matrix-matrix operations (CRS is very handy for matrix-vector computation).
Thanks in advance!
It's usually referred to as "Compressed Sparse Rows" (CSR), not CRS. The transpose, Compressed Sparse Columns (CSC), is also commonly used, including by the CSparse package that ends up being the backend of quite a lot of systems including MATLAB and SciPy (I think).
There is also a less-common Doubly-Compressed Sparse Columns (DCSC) format used by the Combinatorial BLAS. It compresses the column index again, and is useful for cases where the matrix is hypersparse. A hypersparse matrix has most columns empty, something that happens with 2D matrix decomposition.
That said, yes, there is more overhead. However, your operations are now dominated by the number of nonzeros, not the dimensions. So your FLOP rate might be lower, but you still get your answer quicker.
You might look at the paper "Efficient Sparse Matrix-Matrix Products Using Colorings" (http://www.mcs.anl.gov/papers/P5007-0813_1.pdf) for a discussion of how to achieve high performance with sparse matrix-matrix products.
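As a rough illustration of "dominated by the number of nonzeros", here is a minimal sketch of a row-by-row CSR product (often attributed to Gustavson), which replaces the quadruple loop with a triple loop over nonzeros plus a dense accumulator. The type csr_t and the function csr_spgemm are names made up for this sketch, not part of any library, and the column indices inside each output row are left unsorted.

```c
#include <assert.h>
#include <stdlib.h>

/* CSR matrix: row_ptr has rows + 1 entries; col_idx and val hold the nonzeros. */
typedef struct {
    int rows, cols;
    int *row_ptr;
    int *col_idx;
    double *val;
} csr_t;

/* C = A * B, one output row at a time. Returns 0 on success, -1 if an
   allocation fails. The caller frees C->row_ptr, C->col_idx and C->val. */
int csr_spgemm(const csr_t *A, const csr_t *B, csr_t *C)
{
    int n = B->cols, nnz = 0;
    size_t cap = 16;
    double *acc = calloc((size_t)n + 1, sizeof *acc);   /* dense row accumulator */
    int *mark = malloc(((size_t)n + 1) * sizeof *mark); /* mark[j] == i: column j already in row i */
    int *row_ptr = malloc(((size_t)A->rows + 1) * sizeof *row_ptr);
    int *col_idx = malloc(cap * sizeof *col_idx);
    double *val = malloc(cap * sizeof *val);

    assert(A->cols == B->rows);
    if (!acc || !mark || !row_ptr || !col_idx || !val) goto fail;
    for (int j = 0; j < n; ++j) mark[j] = -1;

    for (int i = 0; i < A->rows; ++i) {
        row_ptr[i] = nnz;
        for (int p = A->row_ptr[i]; p < A->row_ptr[i + 1]; ++p) {
            int k = A->col_idx[p];
            double a = A->val[p];
            /* Row i of C accumulates a * (row k of B). */
            for (int q = B->row_ptr[k]; q < B->row_ptr[k + 1]; ++q) {
                int j = B->col_idx[q];
                if (mark[j] != i) {          /* first contribution to C(i, j) */
                    if ((size_t)nnz == cap) { /* grow the output arrays */
                        int *ci = realloc(col_idx, 2 * cap * sizeof *ci);
                        if (!ci) goto fail;
                        col_idx = ci;
                        double *v = realloc(val, 2 * cap * sizeof *v);
                        if (!v) goto fail;
                        val = v;
                        cap *= 2;
                    }
                    mark[j] = i;
                    acc[j] = 0.0;
                    col_idx[nnz++] = j;
                }
                acc[j] += a * B->val[q];
            }
        }
        /* Copy the accumulated values of this row into the output. */
        for (int p = row_ptr[i]; p < nnz; ++p) val[p] = acc[col_idx[p]];
    }
    row_ptr[A->rows] = nnz;
    free(acc);
    free(mark);
    C->rows = A->rows; C->cols = n;
    C->row_ptr = row_ptr; C->col_idx = col_idx; C->val = val;
    return 0;

fail:
    free(acc); free(mark); free(row_ptr); free(col_idx); free(val);
    return -1;
}
```

The work is proportional to the number of scalar multiplications between nonzeros, not to the matrix dimensions, which is why this tends to beat a dense product only when the matrices are sparse enough to offset the indexing overhead.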

Finding eigenvalues of large, sparse matrix

I am working on the Fermion and Boson Hubbard models, in which the dimension of the Hilbert space is quite large (~50k). I am currently using the LAPACK routine DSYEV to determine the eigenvalues and eigenfunctions of the large (50k x 50k) Hamiltonian matrix, but this takes a long time, about 8 hours on a Xeon workstation.
I would like to reduce this run time on this particular machine. I am looking at the Lanczos method and wondering if this is the best option, or if there is another choice.
The Lanczos method (like other iterative methods) is used to compute extreme (smallest/largest) eigenvalues. It is better than the direct DSYEV if the number of eigenvalues and eigenfunctions you need is much smaller than the system size (50k). The speed-up is especially large if your matrix is sparse.
If you are looking for all eigenvalues and your matrix is dense, then the direct DSYEV is the better method.
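A full Lanczos implementation is longer than fits here, so the sketch below swaps in plain power iteration, a much simpler iterative method, purely to show the property that makes iterative solvers attractive for sparse problems: the matrix is only ever touched through matrix-vector products. The names matvec_fn, tridiag_matvec and power_iteration are made up for this sketch, and the tridiagonal operator is a toy example, not the Hubbard Hamiltonian.

```c
#include <assert.h>
#include <stdlib.h>

/* y = H * x for an n x n symmetric operator; ctx carries user data. */
typedef void (*matvec_fn)(int n, const double *x, double *y, void *ctx);

/* Toy operator (an assumption for illustration): 2 on the diagonal and -1 on
   the off-diagonals, applied without ever storing the matrix. */
void tridiag_matvec(int n, const double *x, double *y, void *ctx)
{
    (void)ctx;
    for (int i = 0; i < n; ++i) {
        y[i] = 2.0 * x[i];
        if (i > 0)     y[i] -= x[i - 1];
        if (i + 1 < n) y[i] -= x[i + 1];
    }
}

/* Power iteration. Returns the Rayleigh-quotient estimate of the
   largest-magnitude eigenvalue after at most `iters` steps. On entry x is a
   nonzero start vector; on exit it holds the eigenvector estimate, scaled so
   that its largest component has magnitude 1. */
double power_iteration(matvec_fn apply, void *ctx, int n, double *x, int iters)
{
    double *y = malloc((size_t)n * sizeof *y);
    double lambda = 0.0;
    assert(y != NULL && n > 0);
    for (int it = 0; it < iters; ++it) {
        double xx = 0.0, xy = 0.0, scale = 0.0;
        apply(n, x, y, ctx);                 /* the only access to the matrix */
        for (int i = 0; i < n; ++i) {
            double a = y[i] < 0.0 ? -y[i] : y[i];
            xx += x[i] * x[i];
            xy += x[i] * y[i];
            if (a > scale) scale = a;
        }
        lambda = xy / xx;                    /* Rayleigh quotient x'Hx / x'x */
        if (scale == 0.0) break;             /* x was mapped to zero */
        for (int i = 0; i < n; ++i) x[i] = y[i] / scale;
    }
    free(y);
    return lambda;
}
```

For the 3 x 3 toy operator the largest eigenvalue is 2 + sqrt(2), which the iteration reaches from the start vector (1, 1, 1). Lanczos builds on the same matrix-vector products but converges much faster and can return several extreme eigenvalues at once, so a tested library implementation is the practical choice.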

Matrix operations in CUDA

What is the best way to organize matrix operations in CUDA (in terms of performance)?
For example, I want to calculate C * C^(-1) * B^T + C, C and B are matrices.
Should I write separate functions for multiplication, transposition and so on or write one function for the whole expression?
Which way is the fastest?
I'd recommend using the CUBLAS library. It's normally much faster and more reliable than anything you could write on your own. In addition, its API is similar to that of the BLAS library, which is the standard library for numerical linear algebra.
I think the answer depends heavily on the size of your matrices.
If you can fit a matrix in shared memory, I would probably use a single block to compute that and have all inside a single kernel (probably bigger, where this computation is only a part of it). Hopefully, if you have more matrices, and you need to compute the above equation several times, you can do it in parallel, utilising all GPU computing power.
However, if your matrices are much bigger, you will want more blocks to compute that (check the matrix multiplication example in the CUDA manual). You then need a guarantee that all blocks have finished the multiplication before you proceed with the next part of your equation, which means you will need a separate kernel call for each of your operations.
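One thing worth checking before choosing between separate functions and a single one: if C is invertible, C * C^(-1) is the identity, so the example expression reduces to B^T + C (with B square, so the shapes match). That can be done in a single pass with no multiplication or inversion at all. Below is a minimal plain-C sketch of that fused pass. The function name bt_plus_c is made up for illustration, and in a CUDA kernel each (i, j) iteration would roughly correspond to one thread.

```c
#include <assert.h>
#include <stddef.h>

/* out = B^T + C for n x n row-major matrices, computed in one pass.
   Assumes C is invertible, so that C * C^(-1) * B^T + C equals B^T + C.
   out must not alias B; it may alias C. */
void bt_plus_c(int n, const double *B, const double *C, double *out)
{
    assert(n > 0 && out != B);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            /* Element (i, j) of B^T is element (j, i) of B. */
            out[(size_t)i * n + j] = B[(size_t)j * n + i] + C[(size_t)i * n + j];
        }
    }
}
```

Computing the inverse and the two products explicitly would give the same result only up to rounding error, at far higher cost, so simplifying the algebra first is usually worth more than any kernel-level tuning.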

Which Haskell array implementation to use? AKA what are the pros and cons of each

What do I need? [an unordered list]
VERY easy parallelization
support for map, filter etc.
ability to perform array-based computations efficiently, like A=B+C, sort of like MATLAB arrays.
Generation of SIMD code. I guess this is out of the question in the near future for anything, but hey, I can ask :)
support for matrices should be there at a minimum, higher dimensions are lower priority right now.
ability to get a pointer to it and create one from a C pointer.
Support from other libraries. IE, bindings to popular C math packages, i/o to disk or images if the arrays are 2D
What do I see?
Array package in haskell-platform. It's the blessed one and can do parallel
Data.Vector. Has loop fusion, but not in platform, so its maturity is unknown to me.
repa package, contributed by the DPH team, but doesn't work well with any stable GHC today.
Lots of variation in the level of support for array implementations. For instance, there doesn't seem to be an easy way to dump a 2D vector to an image file. IOW, the Haskell community apparently hasn't settled on an array implementation.
So please, help me choose.
EDIT A=B+C refers to element wise addition, and not list concatenation
Correct, the community hasn't settled on a good array implementation. I think it would be a good Haskell Prime submission to put forward the Vector API and remove Data.Array.
Vector is very mature! It has:
VERY easy parallelization
support for map, filter etc.
performs array-based computations efficiently, like A=B+C (but I'm not in tune with how MATLAB does it)
vector creation from a pointer via Vector.Storable
It does not:
have enough support from other libraries. IE, bindings to popular C math packages
support matrices, but you can have vectors of vectors. If you build some vector-based matrix operations then perhaps you could upload them to Hackage as vector-matrix.
Generate SIMD code.
NOTE: You can turn bytestrings into vectors of whatever, so if you have an image as a bytestring then, via Vector.Storable, you might be able to do what you want with the image as a vector.
(I am not allowed to comment)
rpg: Does hmatrix accept Data.Vector? It has a Data.Packed.Vector but are they the same?
Yes. The latest version of hmatrix uses Data.Vector.Storable by default for 1D vectors (previously it was optional). The dependency on vector is not shown in Hackage, probably because it is in a configuration flag.
For LAPACK compatibility matrices are not Vector or Vector t, but they can be easily converted (e.g.: Data.Vector.fromList . toRows).
If you want bindings to popular C libraries, the best options are probably hmatrix and blas. Blas is just a binding to a BLAS library, whereas hmatrix provides some higher-level operations. There are also many libraries built upon hmatrix offering further functionality. If you're doing any sort of matrix work, that's what I would start with.
The vector package is also a good choice; it's stable and provides excellent performance. The Data.Vector.Storable types are represented as C arrays, so it's trivial to interface from them to other C libraries. The biggest drawback is that there's no matrix support, so you'd have to do that yourself.
As for exporting to an image format, most Haskell image libraries seem to use ByteStrings. You could either convert to a ByteString, or bind to a C library that does what you want. If you find a Haskell library that does what you want, it should be easy enough to convert hmatrix data to the proper format.

BLAS Library Benchmark

Is there a benchmark that compares the different BLAS (Basic Linear Algebra Subprograms) libraries? I am especially interested in sparse matrix multiplication for single- and multi-core systems.
BLAS performance is very much system dependent, so you'd best run the benchmarks yourself on the very machine you want to use. Since there are only a few BLAS implementations, that is less work than it sounds (normally the hardware vendor's implementation, ATLAS and GotoBLAS).
But note that BLAS only covers dense matrices, so for sparse matrix multiplication you'll need Sparse-BLAS or some other code. Here performance will differ not only depending on hardware but also on the sparse format you want to use and even on the type of matrix you are working with (things like sparsity pattern, bandwidth etc. matter). So even more than in the dense case, if you need maximum performance you will need to do your own benchmarks.
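A do-it-yourself benchmark can be very small. The sketch below times a naive dense product with clock() from the C standard library. The names naive_gemm and time_gemm are made up for this sketch, and the naive loop is only a placeholder workload: to compare libraries you would replace it with the routine under test (for example a cblas_dgemm call from whichever BLAS you link against, or your sparse kernel).

```c
#include <assert.h>
#include <time.h>

/* Naive row-major n x n matrix product, used here only as a workload. */
void naive_gemm(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int k = 0; k < n; ++k)
                s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
    }
}

/* Average processor time in seconds per call over `reps` repetitions. */
double time_gemm(int n, const double *A, const double *B, double *C, int reps)
{
    clock_t start;
    assert(reps > 0);
    start = clock();
    for (int r = 0; r < reps; ++r)
        naive_gemm(n, A, B, C);
    return (double)(clock() - start) / CLOCKS_PER_SEC / reps;
}
```

Note that clock() reports processor time, which on POSIX systems is summed over all threads, so for multi-core runs a wall-clock timer is the better choice. Repeating the call several times and using matrices that match your real sizes and sparsity patterns matters more than the timing mechanism itself.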
