BLAS Library Benchmark

Is there a benchmark that compares the different BLAS (Basic Linear Algebra Subprograms) libraries? I am especially interested in sparse matrix multiplication, for both single- and multi-core systems.

BLAS performance is very much system dependent, so you are best off running the benchmarks yourself on the very machine you want to use. Since there are only a few BLAS implementations (normally the hardware vendor's implementation, ATLAS, and GotoBLAS), that is less work than it sounds.
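A minimal sketch of such a do-it-yourself benchmark (timing uses POSIX clock_gettime; the CBLAS header name varies between implementations, e.g. mkl.h for MKL):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>   /* header name/location varies by BLAS distribution */

    int main(void)
    {
        const int n = 2000;
        double *A = malloc(sizeof *A * n * n);
        double *B = malloc(sizeof *B * n * n);
        double *C = malloc(sizeof *C * n * n);
        for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("dgemm %dx%d: %.3f s, %.2f GFLOP/s\n",
               n, n, s, 2.0 * n * n * n / s / 1e9);

        free(A); free(B); free(C);
        return 0;
    }

Relink the same binary against each candidate library in turn and compare the numbers on your own machine.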
But note that BLAS only covers dense matrices, so for sparse matrix multiplication you'll need Sparse BLAS or some other code. Here performance differs not only with the hardware but also with the sparse storage format you use, and even with the type of matrix you are working on (things like sparsity pattern and bandwidth matter). So, even more than in the dense case, if you need maximum performance you will have to run your own benchmarks.
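To make the format point concrete, here is a sketch of sparse matrix-vector multiply in CSR (compressed sparse row), one common format; CSC, COO, or blocked formats would have different access patterns and therefore different performance on the same matrix:

    #include <stdio.h>

    /* SpMV in CSR format: val holds the nonzeros, col their column
       indices, and rowptr[i] is the index in val where row i starts. */
    static void spmv_csr(int nrows, const int *rowptr, const int *col,
                         const double *val, const double *x, double *y)
    {
        for (int i = 0; i < nrows; ++i) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
                sum += val[k] * x[col[k]];
            y[i] = sum;
        }
    }

    int main(void)
    {
        /* the 3x3 matrix [[4,0,1],[0,3,0],[2,0,5]] in CSR */
        int    rowptr[] = {0, 2, 3, 5};
        int    col[]    = {0, 2, 1, 0, 2};
        double val[]    = {4, 1, 3, 2, 5};
        double x[] = {1, 1, 1}, y[3];

        spmv_csr(3, rowptr, col, val, x, y);
        printf("%g %g %g\n", y[0], y[1], y[2]);   /* prints 5 3 7 */
        return 0;
    }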

Related

Does C internally use Karatsuba algorithm to multiply two integers?

When C multiplies two n-bit integers, does it internally use the normal O(n^2) multiplication algorithm, or does it use a variation of Karatsuba's O(n^log2(3)) ≈ O(n^1.585) multiplication algorithm?
No. The book-keeping overhead of the Karatsuba algorithm is too high and too complex. It would take up far more silicon than a multiplier, even if it were to achieve break-even recursion depth at the machine-word level. A hardware crypto accelerator or FPGA might make it worthwhile for large enough n; even then, the break-even point might be too high to be useful for crypto needs. There's no free lunch.
At the other end of the spectrum, we can look at the gmp-mparam.h files in the GMP library, which define the threshold values at which asymptotically faster algorithms actually begin to pay off. Karatsuba is the 2x2 case of the more general Toom-Cook algorithm. Even on monsters like Broadwell and Skylake CPUs, the threshold is around 28 64-bit words, or 1792 bits. That's due to the overhead of (recursively) adding the three partial results back together, with carry propagation. These thresholds will keep rising as multiply-instruction throughput increases.
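For reference, here is the Karatsuba identity itself, sketched on a 32x32 -> 64-bit multiply split into 16-bit halves (purely illustrative; as noted above, at these sizes the hardware multiplier wins easily):

    #include <stdint.h>
    #include <stdio.h>
    #include <assert.h>

    /* Three 16x16 multiplies replace the schoolbook four, at the cost of
       the extra adds/subtracts -- the "book-keeping" that makes this a
       loss below the GMP thresholds quoted above. */
    static uint64_t karatsuba32(uint32_t x, uint32_t y)
    {
        uint32_t xh = x >> 16, xl = x & 0xFFFF;
        uint32_t yh = y >> 16, yl = y & 0xFFFF;

        uint64_t z2 = (uint64_t)xh * yh;   /* high * high */
        uint64_t z0 = (uint64_t)xl * yl;   /* low  * low  */
        /* (xh+xl) and (yh+yl) fit in 17 bits, so the product fits in 64 */
        uint64_t z1 = (uint64_t)(xh + xl) * (yh + yl) - z2 - z0;

        return (z2 << 32) + (z1 << 16) + z0;
    }

    int main(void)
    {
        uint32_t a = 0xDEADBEEF, b = 0xCAFEBABE;
        assert(karatsuba32(a, b) == (uint64_t)a * b);
        printf("%llx\n", (unsigned long long)karatsuba32(a, b));
        return 0;
    }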
C doesn't multiply. It defines the semantics of the binary * operator. How it's implemented is left to the compiler.
The compiler doesn't multiply. It converts the binary * operator into (a) machine instruction(s), which are executed by the CPU.
The CPU multiplies. It generally provides specialized instructions for that. What instructions it offers to the compiler, and how it implements them, depends on the CPU and its intended use. But multiplication is such a common task that any desktop CPU worth its salt has dedicated hardware circuitry for it.
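To make that chain concrete: the function below contains one C multiply expression, and on x86-64 a compiler such as gcc at -O2 typically lowers its body to a single hardware multiply instruction (imul or similar; the exact instruction depends on the target):

    /* The binary * operator defines semantics only; the compiler picks
       the machine instruction, and the CPU's multiplier executes it. */
    int mul(int a, int b)
    {
        return a * b;
    }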

Should cublas be outperformed by atlas?

According to my measurements of dgemm from both CUBLAS and ATLAS, ATLAS severely beats CUBLAS in terms of speed. Is this to be expected for a system with an Intel i7 950 and an Nvidia GTX 470?
I tested matrices of sizes from 10x10 up to 6000x6000 in increments of 50. ATLAS always wins. I time both total application execution and just the multiplication step.
Does anyone else have experience with this? Are these the expected results?
Thanks in advance.
edit: (same code, same results on a Xeon X5670 and Nvidia Tesla C2050)
edit2: It appears a great deal of the slowness is attributable to initialisation of the cublas library. I am continuing to work on it; I'll update here when I learn more.
Did you use the single-threaded versions of both libraries? As far as I understand, both GotoBLAS and Atlas tend to sneakily use multiple threads when working on large matrices.
That said, at large matrix sizes the algorithm used tends to matter much more than the low-level implementation. Naive matrix multiplication is O(N^3), whereas the Strassen algorithm scales much better, at about O(N^2.81). It also happens to vectorize very nicely (SSE and AVX registers yield roughly a 2- to 8-fold increase in efficiency, depending on the floating-point format and register width).
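To make the algorithmic gap concrete, here is the 2x2 instance of Strassen's seven-product scheme (a sketch; in a real implementation the scalars below are submatrix blocks, and the recursion stops at a crossover size below which ordinary vectorized multiplication takes over):

    #include <stdio.h>

    /* Strassen: seven multiplies instead of the schoolbook eight. */
    int main(void)
    {
        double a11 = 1, a12 = 2, a21 = 3, a22 = 4;   /* A = [[1,2],[3,4]] */
        double b11 = 5, b12 = 6, b21 = 7, b22 = 8;   /* B = [[5,6],[7,8]] */

        double m1 = (a11 + a22) * (b11 + b22);
        double m2 = (a21 + a22) * b11;
        double m3 = a11 * (b12 - b22);
        double m4 = a22 * (b21 - b11);
        double m5 = (a11 + a12) * b22;
        double m6 = (a21 - a11) * (b11 + b12);
        double m7 = (a12 - a22) * (b21 + b22);

        double c11 = m1 + m4 - m5 + m7;   /* = 19 */
        double c12 = m3 + m5;             /* = 22 */
        double c21 = m2 + m4;             /* = 43 */
        double c22 = m1 - m2 + m3 + m6;   /* = 50 */

        printf("%g %g\n%g %g\n", c11, c12, c21, c22);
        return 0;
    }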
I am not sure how well the two GPUs you mentioned handle double-precision math. Typically they're optimized for single precision (32-bit floats), dropping to a third or a quarter of that speed when handling doubles.
There are other factors in your tests that may skew the results. For example, you may be including the time to transfer the matrices to and from the GPU. Whether that matches real-world use cases, I don't know; I don't have an Nvidia GPU to test with, but I suspect not. Usually there are multiple operations, and the matrix does not need to be transferred between operations.
I've been writing my own low-level matrix functions using the SSE/AVX vector built-ins provided by the GCC and ICC C99 compilers; early testing indicates they beat the current Fortran implementations by a wide margin, especially at very small sizes (say up to 8x8, optimized for each size) and very large ones (above 1000x1000, using the Strassen algorithm) for dense matrices.
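Following up on the initialisation and transfer-time caveats above, here is a sketch of how one might time only the multiply itself: the handle creation, device copies, and a warm-up call all happen before the timer starts (untested here; error checking omitted; the cuBLAS v2 API calls are standard but verify against your CUDA version):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void)
    {
        const int n = 2048;
        const double alpha = 1.0, beta = 0.0;
        size_t bytes = (size_t)n * n * sizeof(double);

        double *hA = malloc(bytes);
        for (size_t i = 0; i < (size_t)n * n; ++i) hA[i] = 1.0;

        cublasHandle_t handle;
        cublasCreate(&handle);                 /* pay the init cost up front */

        double *dA, *dB, *dC;
        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hA, bytes, cudaMemcpyHostToDevice);

        /* warm-up call so driver setup isn't timed either */
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
        cudaDeviceSynchronize();

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0, 0);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
        cudaEventRecord(t1, 0);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("dgemm %dx%d: %.3f ms (kernel only)\n", n, n, ms);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(hA);
        return 0;
    }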

How to transpose a matrix in an optimal way using blas?

I'm doing some calculations and some analysis of the strengths and weaknesses of different BLAS implementations. However, I have come across a problem.
I'm testing cuBLAS; doing linear algebra on the GPU seems like a good idea, but there is one problem.
The cuBLAS implementation uses column-major format, and since this is not what I need in the end, I'm curious whether there is a way to make BLAS transpose a matrix?
BLAS doesn't have a built-in matrix transpose routine. The CUDA SDK includes a matrix transpose example, with a paper that discusses optimal strategies for performing a transpose. Your best strategy is probably to feed row-major inputs to CUBLAS using the transposed-input versions of the calls, perform the intermediate calculations in column-major, and finally perform a transpose afterwards using the SDK transpose kernel.
Edited to add: CUBLAS 5 added a routine, geam, which can perform matrix transposition in GPU memory and should be regarded as optimal for whatever architecture you are using.
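A minimal sketch of the geam route (assumes cuBLAS >= 5; error checking omitted): geam computes C = alpha*op(A) + beta*op(B), so alpha = 1, beta = 0 and op(A) = transpose yields C = A^T.

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void)
    {
        const int rows = 2, cols = 3;          /* A is rows x cols, column-major */
        double hA[6] = {1, 2, 3, 4, 5, 6};     /* columns: (1,2), (3,4), (5,6)   */
        double hC[6];

        cublasHandle_t handle;
        cublasCreate(&handle);

        double *dA, *dC;
        cudaMalloc((void **)&dA, sizeof hA);
        cudaMalloc((void **)&dC, sizeof hC);
        cudaMemcpy(dA, hA, sizeof hA, cudaMemcpyHostToDevice);

        const double one = 1.0, zero = 0.0;
        cublasDgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                    cols, rows,                /* dimensions of the result C  */
                    &one,  dA, rows,           /* A: rows x cols, lda = rows  */
                    &zero, dC, cols,           /* beta = 0, B's values unused */
                    dC, cols);                 /* C: cols x rows, ldc = cols  */

        cudaMemcpy(hC, dC, sizeof hC, cudaMemcpyDeviceToHost);
        for (int i = 0; i < 6; ++i)
            printf("%g ", hC[i]);              /* prints 1 3 5 2 4 6 */
        printf("\n");

        cublasDestroy(handle);
        cudaFree(dA);
        cudaFree(dC);
        return 0;
    }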

Should I use a GPU?

How can I know whether my serial code will run faster if I use a GPU? I know it depends on a lot of things, e.g. whether the code can be parallelized in a SIMD fashion, but what considerations should I take into account to be reasonably sure that I will gain speed? Should the algorithm be embarrassingly parallel, so that I shouldn't bother trying the GPU if parts of the algorithm cannot be parallelized? Should I take into consideration how much memory a sample input requires?
What are the "specs" of serial code that would make it run faster on a GPU? Can a complex algorithm gain speed on a GPU?
I don't want to waste time coding my algorithm for the GPU unless I can be reasonably sure that speed will be gained; that is my problem.
I think my algorithm could be parallelized on a GPU. Would it be worth trying?
It depends upon two factors:
1) The speedup of having many cores performing the floating point operations
This depends on the inherent parallelism of the operations you are performing, the number of cores on your GPU, and the differences in clock rates between your CPU and GPU.
2) The overhead of transferring the data back and forth between main memory and GPU memory.
This depends mainly on the bandwidth of the link to your particular GPU, and is greatly reduced on architectures like Sandy Bridge, where the CPU and GPU are on the same die. With older architectures, some operations, such as matrix multiplications where the inner dimensions are small, see no improvement: it takes longer to transfer the inner vectors back and forth across the system bus than it does to compute the dot products on the CPU.
Unfortunately these two factors are tough to estimate, and there is no way to know without trying it. If you currently use BLAS for your SIMD operations, it is fairly simple to substitute in CUBLAS, which has the same API except that it sends the operations over to the GPU to perform.
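Still, a rough back-of-the-envelope check of factor 2 is possible; a sketch, where every hardware number is an illustrative placeholder you would measure on your own system:

    #include <stdio.h>

    /* Is a GPU dgemm worth the bus transfer? Placeholder figures only. */
    int main(void)
    {
        double n = 4096.0;                   /* square matrix dimension   */
        double bytes = 3.0 * n * n * 8.0;    /* A, B in; C out (doubles)  */
        double flops = 2.0 * n * n * n;      /* multiply-adds in dgemm    */

        double bus_bps   = 6.0e9;            /* assumed bus, bytes/s      */
        double gpu_flops = 300.0e9;          /* assumed sustained GPU     */
        double cpu_flops = 40.0e9;           /* assumed sustained CPU     */

        double t_gpu = bytes / bus_bps + flops / gpu_flops;
        double t_cpu = flops / cpu_flops;

        printf("GPU (incl. transfer): %.4f s, CPU: %.4f s\n", t_gpu, t_cpu);
        /* for small n the transfer term dominates and the CPU wins,
           matching the small-inner-dimension argument above */
        return 0;
    }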
When looking for a parallel solution you should typically ask yourself these questions:
How much data do you have?
How much floating-point computation do you have?
How complicated is your algorithm, i.e. how many conditions and branches does it have? Is there any data locality?
What kind of speedup is required?
Is it a real-time computation or not?
Do alternative algorithms exist (even if they are not the most efficient serial algorithms)?
What kind of software and hardware do you have access to?
Depending on the answers, you may want to use GPGPU, cluster computation, or distributed computation, or a combination of GPU and cluster/distributed machines.
If you could share any information on your algorithm and the size of your data, it would be easier to comment.
Regular C code can be converted to CUDA remarkably easily. If the heavy hitters in your algorithm's profile can be parallelized, try it and see if it helps.

Dot product - SSE2 vs BLAS

What's my best bet for computing the dot product of a vector x with a large number of vectors y_i, where x and the y_i have length 10k or so?
Shove the y's in a matrix and use an optimized s/dgemv routine?
Or maybe try handcoding an SSE2 solution (I don't have SSE3, according to cpuinfo).
I'm just looking for general guidance here, so any suggestions will be useful.
And yes, I do need the performance.
Thanks for any light.
I think GPUs are specifically designed to perform operations like this quickly (among others), so you could probably use the DirectX or OpenGL libraries to perform the vector operations (D3DXVec2Dot is one such helper). This would also save you CPU time.
Alternatives for optimised BLAS routines:
If you use Intel compilers, you may have access to Intel MKL.
For other compilers, ATLAS usually provides nice performance numbers.
Handcoding an SSE2 solution is not very difficult and will bring a nice speedup over a pure C routine. How much it gains over a BLAS routine is something you must determine yourself.
The greatest speedup comes from structuring the data into a format that lets you exploit data parallelism and alignment, as in the sketch below.
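For illustration, a minimal SSE2 double-precision dot product (assumes 16-byte-aligned arrays and an even length; a production version needs a scalar tail and unaligned-load fallbacks):

    #include <stddef.h>
    #include <stdio.h>
    #include <emmintrin.h>   /* SSE2 intrinsics */

    static double dot_sse2(const double *x, const double *y, size_t n)
    {
        __m128d acc = _mm_setzero_pd();
        for (size_t i = 0; i < n; i += 2) {
            __m128d vx = _mm_load_pd(x + i);   /* two doubles per step */
            __m128d vy = _mm_load_pd(y + i);
            acc = _mm_add_pd(acc, _mm_mul_pd(vx, vy));
        }
        /* horizontal sum of the two lanes (no SSE3 haddpd needed) */
        __m128d hi = _mm_unpackhi_pd(acc, acc);
        return _mm_cvtsd_f64(_mm_add_sd(acc, hi));
    }

    int main(void)
    {
        _Alignas(16) double x[4] = {1, 2, 3, 4};
        _Alignas(16) double y[4] = {5, 6, 7, 8};
        printf("%g\n", dot_sse2(x, y, 4));   /* 1*5+2*6+3*7+4*8 = 70 */
        return 0;
    }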
I use GotoBLAS. Its kernel routines are high performance, often many times faster than MKL and the reference BLAS.
The following provides BLAS level 1 (vector operations) routines using SSE.
http://www.applied-mathematics.net/miniSSEL1BLAS/miniSSEL1BLAS.html
If you have an Nvidia graphics card you can get cuBLAS, which will perform the operation on the graphics card.
http://developer.nvidia.com/cublas
For ATI (AMD) graphics cards
http://developer.amd.com/libraries/appmathlibs/pages/default.aspx
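And to sketch the gemv route suggested in the question (the CBLAS header name and linking details vary by BLAS distribution): stack the y_i as rows of a matrix Y, so one dgemv call computes every dot product y_i . x at once.

    #include <stdio.h>
    #include <cblas.h>

    int main(void)
    {
        enum { NVEC = 3, LEN = 4 };             /* tiny stand-ins for 10k */
        double Y[NVEC * LEN] = {
            1, 0, 0, 0,                         /* y_0 */
            0, 1, 0, 0,                         /* y_1 */
            1, 1, 1, 1,                         /* y_2 */
        };
        double x[LEN] = {2, 3, 5, 7};
        double out[NVEC];

        /* out = 1.0 * Y * x + 0.0 * out */
        cblas_dgemv(CblasRowMajor, CblasNoTrans, NVEC, LEN,
                    1.0, Y, LEN, x, 1, 0.0, out, 1);

        printf("%g %g %g\n", out[0], out[1], out[2]);   /* 2 3 17 */
        return 0;
    }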
