Matrix operations in CUDA - C

What is the best way to organize matrix operations in CUDA (in terms of performance)?
For example, I want to calculate C * C^(-1) * B^T + C, C and B are matrices.
Should I write separate functions for multiplication, transposition and so on or write one function for the whole expression?
Which way is the fastest?

I'd recommend using the CUBLAS library. It's normally much faster and more reliable than anything you could write on your own. In addition, its API is similar to the BLAS library, which is the standard library for numerical linear algebra.
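For instance, here is a minimal sketch of how the multiplication chain could be strung together from CUBLAS calls (assumptions: all matrices are n x n, column-major, already on the device; d_Cinv already holds C^-1, e.g. from the batched getrf/getri routines; d_Result starts out as a copy of C; handle creation and error checking are the caller's job):

#include <cublas_v2.h>

// Sketch only: computes C * C^-1 * B^T + C with two SGEMM calls.
void chain(cublasHandle_t handle, int n,
           const float *d_C, const float *d_Cinv, const float *d_B,
           float *d_Tmp, float *d_Result)
{
    const float one = 1.0f, zero = 0.0f;
    // d_Tmp = C * C^-1
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, d_C, n, d_Cinv, n, &zero, d_Tmp, n);
    // d_Result = d_Tmp * B^T + 1.0 * d_Result   (d_Result already contains a copy of C)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, n, n,
                &one, d_Tmp, n, d_B, n, &one, d_Result, n);
}

Each call keeps the data on the GPU, so the only transfers are the initial upload and the final download of the result.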

I think the answer depends heavily on the size of your matrices.
If a matrix fits in shared memory, I would probably use a single block to compute it and keep everything inside a single kernel (probably a bigger one, where this computation is only one part). Hopefully, if you have more matrices and need to compute the above equation several times, you can do them in parallel, utilising all of the GPU's computing power.
However, if your matrices are much bigger, you will want more blocks to compute them (check the matrix multiplication example in the CUDA manual). You need a guarantee that the multiplication is finished by all blocks before you proceed with the next part of your equation, and in that case you will need a kernel call for each of your operations.

Related

Matrix vector multiplication using BLAS taking more time than for loop

I am using the following code to multiply matrices:
cblas_sgemv(CblasRowMajor, CblasNoTrans, n, n, 1, (float *)A, n, B, 1, 1.0f, C, 1);
Where A is an n x n matrix and B is an n x 1 matrix.
The alternative is to do it the usual way -
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        C[i] += A[i * n + k] * B[k];
Surprisingly, the BLAS implementation is taking more time than the for-loop version. What could be the reason for that?
If we look at the documentation for that function here:
https://developer.apple.com/documentation/accelerate/1513065-cblas_sgemv
... then it's a pretty complex function: it has to handle different memory layouts, apply a scaling factor, optionally transpose, and use an arbitrary stride across rows and columns, little of which is likely to be subject to constant folding if calling the function requires crossing module boundaries, etc.
So that's a whole lot more to do, and with the compiler having much less compile-time information to optimize against, than your simple loop version. Where I think it could have an edge in performance is if your matrices are extremely large. Then the BLAS implementation might be able to manually use SIMD in ways that beat your most aggressive optimizers, parallelize the loop, use loop tiling, etc. But those methods usually only provide improvements to justify their overhead when used against especially large matrices, at which point the extra overhead to handle all those extra parameters and the cost of the indirect function call would also be trivialized.
If n in your example is sufficiently large (say 1000+), then I would be slightly more surprised that your simple loopy version is beating it, but it still doesn't seem like a huge surprise since that's a pretty complex function that involves a lot of runtime overhead that can't be optimized away (given that it's a dylib API from what I can tell) with all the possible parameters you can specify. If the library is decent, then I suspect it will begin to beat your simple scalar code at some threshold for n, but that might require n to be quite large, especially given that our optimizers are getting better and better these days at vectorizing our scalar logic.
I'm not familiar with this library but browsing over its API, it's quite generalized in nature. Typically if you want to get the most out of SIMD, you have to organize your data in a certain way suited for your application requirements. For example, if you can represent matrices and vectors in SoA form, then I've found that I can get close to the theoretical boosts in data consumption that SIMD offers (ex: close to 4x with 128-bit registers for SPFP and 8x with 256-bit) and beat my optimizers. But in AoS form doing a single matrix/vector multiplication at a time, I find the speedups far more negligible (say 25% or less), or sometimes even slower than straightforward loops involving scalar instructions that I leave up to my optimizers to vectorize away.
What I would typically expect as far as API design in a library that offers the fastest matrix/vector multiplication is for the multiply function to input an entire container/array of vectors (multiple vectors at once, i.e., against a single matrix). That would generally give it the maximum breathing room to most effectively use parallelization as well as trivializing the overhead of the function call (ex: the branching overhead to determine if scale factors are ~1.0f to determine whether or not an additional multiplication is needed per iteration), and that should have a higher probability of performing well even if your matrices/vectors are on the smaller side provided that you pass in many small vectors at once. There might be a way to do that with your library. If so, I'd give that a shot.
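As a concrete illustration of that idea (a hedged sketch, not the poster's actual setup): if you have m vectors to multiply by the same n x n matrix A, pack them as the columns of an n x m matrix X and issue one GEMM call instead of m GEMV calls. The function and matrix names below are invented for the example:

#include <cblas.h>

/* Multiply one n x n matrix A against m vectors at once.
   The vectors are the columns of the n x m matrix X (row-major here),
   and the results land in the n x m matrix Y. A single SGEMM call gives
   the library far more room for blocking, SIMD and threading than m SGEMV calls. */
void multiply_many(const float *A, const float *X, float *Y, int n, int m)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n,              /* rows of A and Y            */
                m,              /* columns of X and Y         */
                n,              /* columns of A / rows of X   */
                1.0f, A, n,
                      X, m,
                0.0f, Y, m);
}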

LAPACK from Pytrilinos for least-squares

I'm trying to solve a large, sparse 30,000 x 1,000 system using a LAPACK solver in the Trilinos package. My goal is to minimise computation time; however, this LAPACK solver only takes square matrices. So I'm manually converting my non-square matrix (A) into a square one by multiplying by its transpose, essentially solving the system like so:
(A^T A)x = A^T b
What is making my solve slow are the multiplications by A^T.
Any ideas on how to fix this?
You can directly compute the QR-decomposition (wikipedia-link) and use that to solve your least squares problem.
This way you don't have to compute the matrix product A^T A.
There are multiple LAPACK routines for the computation of QR-decompositions, e.g. gels is a driver routine for general matrices (intel-link).
I must say that I don't know if Trilinos supplies that routine but it is a standard LAPACK routine.
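For reference, here is a hedged sketch of the direct least-squares call through the standard LAPACKE C interface, treating A as dense (whether Trilinos exposes this routine under the same name is exactly the open question above, so treat the names as an assumption):

#include <lapacke.h>

/* Sketch: solve min ||A x - b|| for a dense m x n (e.g. 30,000 x 1,000) row-major A
   without ever forming A^T A. On entry b holds the m right-hand-side values;
   on exit its first n entries hold the least-squares solution x.
   Returns 0 on success (the LAPACK info code otherwise). */
int solve_ls(double *A, double *b, int m, int n)
{
    return LAPACKE_dgels(LAPACK_ROW_MAJOR, 'N',
                         m, n, 1,   /* one right-hand side        */
                         A, n,      /* lda = n in row-major       */
                         b, 1);     /* ldb = nrhs = 1 in row-major */
}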

Basic GPU application, integer calculations

Long story short: I have done several prototypes of interactive software. I use pygame now (a Python SDL wrapper) and everything is done on the CPU. I am starting to port it to C and, at the same time, looking for existing ways to use some GPU power to relieve the CPU of redundant operations. However, I cannot find a good "guideline" for which exact technology/tools I should pick in my situation. I have read a plethora of docs, and it drains my mental powers very fast. I am not even sure it is possible at all, so I'm puzzled.
Here is a very rough sketch of the typical application skeleton that I develop, but assuming it now uses the GPU (note: I have almost zero practical knowledge of GPU programming). What is still important is that data types and functionality must be preserved exactly. Here it is:
So F(A,R,P) is some custom function, for example element substitution, repetition, etc. The function is presumably constant over the program's lifetime, and the resulting rectangles' shapes are generally not equal to A's shape, so it is not an in-place calculation; the results are simply generated with my functions. Examples of F: repeat rows and columns of A; substitute values with values from substitution tables; compose some tiles into a single array; apply any math function to A's values; etc. As said, all of this can easily be done on the CPU, but the app must be really smooth. BTW, in pure Python it became simply unusable after adding several visual features, which are based on numpy arrays. Cython helps to make fast custom functions, but then the source code is already a kind of salad.
Question:
Does this scheme reflect some (standard) technology/dev tools?
Is CUDA what I am looking for? If yes, some links/examples that coincide with my application structure would be great.
I realise this is a big question, so I will give more details if it helps.
Update
Here is a concrete example of two typical calculations for my prototype of a bitmap editor. The editor works with indexes, and the data consist of layers with corresponding bit masks. I can pin down the sizes: masks are the same size as layers, and, say, all layers are the same size (1024^2 pixels = 4 MB for 32-bit values). And my palette has, say, 1024 elements (4 kilobytes for 32 bpp format).
Consider I want to do two things now:
Step 1. I want to flatten all layers into one. Say A1 is the default layer (background) and layers 'A2' and 'A3' have masks 'm2' and 'm3'. In Python I'd write:
from numpy import logical_not
...
Result = (A1 * logical_not(m2) + A2 * m2) * logical_not(m3) + A3 * m3
Since the data are independent, I believe this must give a speedup proportional to the number of parallel blocks (a rough CUDA sketch of this step is included at the end of this update).
Step 2. Now I have an array and want to 'colorize' it with some palette, so the palette will be my lookup table. As I see it now, there is a problem with simultaneous reads of lookup-table elements.
But my idea is that one could probably just duplicate the palette for all blocks, so each block can read its own palette.
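Here is the rough CUDA sketch of Step 1 promised above (a hedged illustration only: the names are invented and the masks are assumed to hold 0/1 per pixel):

__global__ void flatten_layers(const unsigned int *A1, const unsigned int *A2, const unsigned int *A3,
                               const unsigned int *m2, const unsigned int *m3,
                               unsigned int *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Same formula as the numpy line above; every pixel is independent.
    unsigned int r = A1[i] * (1u - m2[i]) + A2[i] * m2[i];
    result[i] = r * (1u - m3[i]) + A3[i] * m3[i];
}
// launched e.g. as: flatten_layers<<<(n + 255) / 256, 256>>>(dA1, dA2, dA3, dm2, dm3, dResult, n);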
When your code is highly parallel (i.e. there are few or no data dependencies between stages of processing), you can go for CUDA (more fine-grained control over synchronisation) or OpenCL (a very similar AND portable OpenGL-like API for kernel processing on the GPU). Most of the acceleration work we do happens in OpenCL, which has excellent interop with both OpenGL and DirectX, but we also have the same setup working with CUDA. One big difference between CUDA and OpenCL is that in CUDA you can compile kernels once and delay-load (and/or link) them in your app, whereas in OpenCL the compiler plays nice with the OpenCL driver stack to ensure the kernel is compiled when the app starts.
One alternative that is often overlooked if you're using Microsoft Visual Studio is C++ AMP, a C++-syntax-friendly and intuitive API for those who do not want to dig into the twists and turns of the OpenCL/CUDA APIs. The big advantage here is that the code also works if you do not have a GPU in the system, but then you do not have as many options to tweak performance. Still, in a lot of cases this is a fast and efficient way to write proof-of-concept code and re-implement bits and pieces in CUDA or OpenCL later.
OpenMP and Threading Building Blocks are only good alternatives when you have synchronisation issues and lots of data dependencies. Native threading using worker threads is also a viable solution, but only if you have a good idea of how synchronisation points can be set up between the different processes so that threads do not starve each other out when fighting for priority. This is a lot harder to get right, and tools such as Parallel Studio are a must. But then, so is NVIDIA Nsight if you're writing GPU code.
Appendix:
A new platform called Quasar (http://quasar.ugent.be/blog/) is being developed that enables you to write your math problems in a syntax that is very similar to Matlab, with full support for C/C++/C# or Java integration, and cross-compiles (LLVM, Clang) your "kernel" code to any underlying hardware configuration. It generates CUDA PTX files, or runs on OpenCL, or even on your CPU using TBB, or a mixture of them. Using a few annotations, you can decorate the algorithm so that the underlying compiler can infer types (you can also use strict typing explicitly), leaving the type-heavy stuff entirely up to the compiler. To be fair, at the time of writing the system is still a work in progress and the first OpenCL-compiled programs are just being tested, but the most important benefit is fast prototyping with almost identical performance compared to optimized CUDA.
What you want to do is send values to the GPU really fast using high-frequency dispatch and then display the result of a function, which is basically texture lookups plus some parameters.
I would say this problem will only be worth solving on the GPU if two conditions are met:
1. The size of A[] is such that the transfer times become irrelevant compared to the computation (see http://blog.theincredibleholk.org/blog/2012/11/29/a-look-at-gpu-memory-transfer/).
2. The lookup table is not too big and/or the lookup values are organized in a way that lets the cache be maximally utilized; in general, random lookups on the GPU can be slow. Ideally you can pre-load the R[] values into a shared memory buffer for each block's portion of the A[] buffer.
If you can answer both of those points positively, then and only then consider having a go at using the GPU for your problem; otherwise those two factors will overpower the computational speed-up that the GPU can provide.
Another thing you can look at is overlapping the transfer and computation times as much as you can, to hide the slow CPU->GPU transfer rates as far as possible.
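A hedged sketch of what that overlap can look like with CUDA streams (the process() kernel, the chunk sizes and the buffer names are placeholders; the host buffer h_A must be page-locked, e.g. allocated with cudaMallocHost, for the copies to be truly asynchronous):

cudaStream_t s[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&s[i]);

for (int c = 0; c < nChunks; ++c) {
    int i = c % 2;                              // alternate between the two streams
    size_t off = (size_t)c * chunkSize;
    // copy chunk c while the kernel for the previous chunk runs in the other stream
    cudaMemcpyAsync(d_A + off, h_A + off, chunkSize * sizeof(float),
                    cudaMemcpyHostToDevice, s[i]);
    process<<<(chunkSize + 255) / 256, 256, 0, s[i]>>>(d_A + off, d_out + off, chunkSize);
}
cudaDeviceSynchronize();                        // wait for all chunks before reading results back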
Regarding your F(A, R, P) function: you need to make sure that you do not need to know the value of F(A, R, P)[0] in order to compute the value of F(A, R, P)[1], because if you do, then you need to rewrite F(A, R, P) to get around this issue using some parallelization technique. If you have a limited number of F() functions, then this can be solved by writing a parallel version of each F() function for the GPU to use, but if F() is user-defined, your problem becomes a bit trickier.
I hope this is enough information to have an informed guess towards whether you should or not use a GPU to solve your problem.
EDIT
Having read your edit, I would say yes. The palette could fit in shared memory (see "GPU shared memory size is very small - what can I do about it?"), which is very fast. If you have more than one palette, you could fit 16 KB (the size of shared memory on most cards) / 4 KB per palette = 4 palettes per block of threads.
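A hedged sketch of that per-block palette copy (names invented; the indices stored in src are assumed to be smaller than paletteSize):

__global__ void colorize(const unsigned int *src, unsigned int *dst,
                         const unsigned int *palette, int paletteSize, int n)
{
    extern __shared__ unsigned int sPal[];      // sized at launch: paletteSize * sizeof(unsigned int)
    for (int p = threadIdx.x; p < paletteSize; p += blockDim.x)
        sPal[p] = palette[p];                   // each block copies the palette into its own shared memory once
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = sPal[src[i]];                  // per-pixel lookup from fast shared memory
}
// launched e.g. as: colorize<<<(n + 255) / 256, 256, paletteSize * sizeof(unsigned int)>>>(dSrc, dDst, dPal, paletteSize, n);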
One last warning: integer operations are not the fastest on the GPU. Once you have implemented your algorithm and it is working, consider switching to floating point if necessary, as a cheap optimization.
There is not much difference between OpenCL and CUDA, so choose whichever works better for you. Just remember that CUDA will limit you to NVIDIA GPUs.
If I understand your problem correctly, the kernel (the function executed on the GPU) should be simple. It should follow this sketch:
// R, P and F are the application-specific parameter structs and function from the question
__global__ void mainKernel(const int *A, int *outA, const struct R R, const struct P P, const int maxOut, const int sizeA)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x; // offset in the input array
    if (index >= sizeA) return;                        // the launch size is rounded up (GPUs often work better with 2^n threads), so guard the tail
    int outIndex = index * maxOut;                     // offset of this element's slot in the output array
    F(A[index], R, P, &outA[outIndex]);                // F writes its (at most maxOut) results into the slot
}
The F functions should be inlined, and you can use switch or if to select between different functions. Since the size of F's output is not known in advance, you have to use more memory: each kernel instance must know the positions for correct memory reads and writes, so there has to be some maximum size (if there is none, then this is all useless and you have to use the CPU!). If elements whose output size differs are rare, I would identify those after getting the array back to RAM and compute those few on the CPU, while filling outA with zeros or indicator values.
The array sizes are obviously related by length(outA) = length(A) * maxOut.
I forgot to mention that if the execution of F is not the same in most cases (i.e. the threads do not follow the same code path), the GPU will serialize it. GPU multiprocessors have groups of cores tied to the same instruction stream, so code that diverges between cores has to be serialized! OpenMP or threads are a better choice for that kind of problem!

Computing the determinant of a C array

I have written a program that generates a few N x N matrices whose determinants I would like to compute. Related to this, I have two questions:
What library is the best for doing so? I would like the fastest possible library since I have millions of such matrices.
Are there any specifics I should take care of when casting the result to an integer? All matrices that I will generate have integer determinants and I would like to make sure that no rounding error skews the correct value of the determinant.
Edit: if possible, provide an example of computing the determinant with the recommended library.
As for Matrix Libraries, it looks like that question is answered here:
Recommendations for a small c-based vector and matrix library
As for casting to an integer: if the computed determinant is not exactly an integer, you shouldn't just cast it; you should use round, floor, or ceil to convert it in an acceptable way. These will give you integral (floating-point) values, but you will still need to cast them; now, however, you can do so without fear of losing any information.
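A trivial sketch of that, assuming some library call has already produced the determinant as a double:

#include <math.h>

/* 'det' is the floating-point determinant returned by whatever library you pick;
   the matrix is known to have an integer determinant. */
long long det_as_int(double det)
{
    return llround(det);   /* round to the nearest integer instead of truncating */
}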
You can work wonders with matrices using BLAS and LAPACK. They are actually written in Fortran, and using them from C takes a bit of glue, but all in all they can crunch numbers at tremendous speed.
http://www.netlib.org/lapack/lug/node11.html
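Since the question asked for an example, here is a hedged sketch using the standard LAPACKE C interface: dgetrf computes an LU factorization, and the determinant is the product of U's diagonal with a sign flip for every row interchange (the function name is my own):

#include <lapacke.h>
#include <stdlib.h>

/* Sketch: determinant of an n x n row-major matrix of doubles; a is overwritten by the LU factors. */
double determinant(double *a, int n)
{
    lapack_int *ipiv = (lapack_int *)malloc(n * sizeof(lapack_int));
    double det = 1.0;
    lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, a, n, ipiv);
    if (info != 0) {
        det = 0.0;                  /* exactly singular (info > 0) or bad argument (info < 0) */
    } else {
        for (int i = 0; i < n; ++i) {
            det *= a[i * n + i];    /* product of U's diagonal entries */
            if (ipiv[i] != i + 1)   /* ipiv is 1-based; each row swap flips the sign */
                det = -det;
        }
    }
    free(ipiv);
    return det;
}

With integer determinants you would then round the result as suggested above rather than truncate it.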
You have GSL, but the choice really depends on your matrices. Are they dense or sparse? Is N big or small? For small N you may find that coding the determinant yourself using Cramer's rule or Gaussian elimination is faster, since most high-performance libraries focus on big matrices, and their optimisations may introduce overhead on simple problems.

How to transpose a matrix in an optimal way using blas?

I'm doing some calculations and some analysis of the strengths and weaknesses of different BLAS implementations. However, I have come across a problem.
I'm testing cuBLAS; doing linear algebra on the GPU seems like a good idea, but there is one problem.
The cuBLAS implementation uses column-major format, and since that is not what I need in the end, I'm curious whether there is a way to make BLAS do a matrix transpose?
BLAS doesn't have a matrix transpose routine built in. The CUDA SDK includes a matrix transpose example with a paper which discusses optimal strategy for performing a transpose. Your best strategy is probably to use row major inputs to CUBLAS with the transpose input version of the calls, then perform the intermediate calculations in column major, and lastly perform a transpose operation afterwards using the SDK transpose kernel.
Edited to add: CUBLAS gained a transpose routine in version 5, geam, which can perform matrix transposition in GPU memory and should be regarded as optimal for whatever architecture you are using.
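For reference, a minimal sketch of an out-of-place transpose with geam (error checking omitted; d_A is an m x n column-major matrix, d_At must be a separate n x m buffer):

#include <cublas_v2.h>

// C = alpha * op(A) + beta * op(B); with alpha = 1, op = transpose and beta = 0 this is a pure transpose.
void transpose(cublasHandle_t handle, int m, int n, const float *d_A, float *d_At)
{
    const float one = 1.0f, zero = 0.0f;
    cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                n, m,                 // dimensions of the result d_At
                &one,  d_A,  m,       // leading dimension of the original m x n matrix
                &zero, d_A,  n,       // B is not read because beta = 0 (must still be a valid pointer)
                       d_At, n);      // leading dimension of the n x m result
}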
