According to my measurements of dgemm from both CUBLAS and ATLAS, ATLAS severely beats CUBLAS in terms of speed. Is this to be expected for a system with an Intel i7 950 and an Nvidia GTX 470?
I tested matrices from 10x10 up to 6000x6000 in increments of 50. ATLAS always wins. I measure both total application execution time and just the multiplication step.
Does anyone else have experience with this? Are these the expected results?
Thanks in advance.
edit: (same code, same results on a Xeon X5670 and Nvidia Tesla C2050)
edit2: It appears a great deal of the slowness is attributable to initialisation of the CUBLAS library. I continue to work on it; I'll update here when I learn more.
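For reference, one way to keep that one-time initialisation out of the dgemm timing is to create the handle and issue a tiny warm-up call before starting the clock. This is only a rough sketch with made-up buffer names (cuBLAS v2 API), not the code under test:

#include <cuda_runtime.h>
#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);                            /* CUDA context + CUBLAS init */

double *dA, *dB, *dC;                             /* tiny throwaway device buffers */
cudaMalloc((void**)&dA, sizeof(double));
cudaMalloc((void**)&dB, sizeof(double));
cudaMalloc((void**)&dC, sizeof(double));

const double alpha = 1.0, beta = 0.0;
cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 1, 1, 1,
            &alpha, dA, 1, dB, 1, &beta, dC, 1);  /* 1x1 warm-up gemm */
cudaDeviceSynchronize();                          /* wait for the init work to finish */

/* ... start timer, run the real dgemm calls, stop timer ... */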
Did you use the single-threaded versions of both libraries? As far as I understand, both GotoBLAS and Atlas tend to sneakily use multiple threads when working on large matrices.
That said, at large matrix sizes the algorithm used tends to matter much more than the low-level implementation. Naive matrix multiplication is O(N^3), whereas the Strassen algorithm scales much better, about O(N^2.81) (the exponent is log2(7), because Strassen does 7 block multiplications instead of 8). However, naive multiplication happens to vectorize very nicely onto the wide SSE and AVX registers, yielding almost a 2- to 8-fold increase in efficiency, depending on floating-point format and register size.
I am not sure how well the two GPUs you mentioned handle double-precision math. Typically they're optimized for single precision (32-bit floats), dropping to a third or a quarter of that speed when handling doubles.
There are other factors in your tests that may skew the results. For example, you may be including the time to transfer the matrices to and from the GPU. Whether that matches real-world use cases, I don't know; I don't have an Nvidia GPU to test, but I suspect not. Usually there are multiple operations, and the matrix does not need to be transferred between them.
I've been writing my own low-level SSE/AVX matrix functions using the vector built-ins provided by the GCC and ICC C99 compilers; early testing indicates they beat the current Fortran implementations by a wide margin, especially at very small (say up to 8x8, optimized for each size) and very large (above 1000x1000, using the Strassen algorithm) dense matrix sizes.
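As a flavour of what such code looks like, here is a tiny sketch using GCC's portable vector extensions for a 4x4 column-major matrix-vector product. It is illustrative only (my own names), not the code described above:

/* one SSE-sized vector of four floats, expressed portably */
typedef float v4sf __attribute__((vector_size(16)));

static inline v4sf splat(float s) { return (v4sf){ s, s, s, s }; }

/* y = M * x, with M stored as four v4sf columns (column-major) */
static inline v4sf matvec4(const v4sf col[4], const float x[4])
{
    return col[0] * splat(x[0]) + col[1] * splat(x[1])
         + col[2] * splat(x[2]) + col[3] * splat(x[3]);
}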
When C multiplies two n-bit integers, does it internally use the normal O(n^2) multiplication algorithm, or does it use a variation of Karatsuba's O(n^log2(3)) multiplication algorithm?
No. The 'book-keeping' overhead of the Karatsuba algorithm is too high and too complex. It would take up far more silicon than a multiplier, even if it were to reach break-even recursion depth at the machine-word level. A hardware crypto accelerator or FPGA might make it worthwhile for large enough n. Even then, the break-even point might be too high to be useful for crypto needs. There's no free lunch.
On the other end of the spectrum, we can look at the gmp-mparam.h files in the GMP library, which define threshold values at which asymptotically faster algorithms actually begin to pay off. 'Karatsuba' is the 2x2 case of the more general Toom-Cook algorithm. Even on monsters like Broadwell and Skylake CPUs, the threshold is around 28 'words', or 1792 bits. That's due to the overhead in (recursively) adding 3 results back together, with carry propagation. These thresholds will keep getting higher as multiply instruction throughput increases.
C doesn't multiply. It defines the semantics of the binary * operator. How it's implemented is left to the compiler.
The compiler doesn't multiply. It converts the binary * operator into (a) machine instruction(s), which are executed by the CPU.
The CPU multiplies. It generally provides specialized instructions for that. What instructions it offers to the compiler, and how it implements these instructions, depends on the CPU and its intended use. But multiplication is such a common task that any desktop CPU worth its salt has dedicated hardware circuitry for it.
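For example, this trivial sketch (my own, not from the question) is typically compiled by gcc -O2 on x86-64 down to a single imul instruction; C itself says nothing about how the multiply is carried out:

#include <stdint.h>

/* one 64x64 -> 64-bit multiply; the compiler picks the machine instruction */
uint64_t mul(uint64_t a, uint64_t b)
{
    return a * b;
}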
How can I know if my serial code will run faster if I use a GPU? I know it depends on a lot of things, i.e. whether the code can be parallelized in a SIMD fashion and so on, but what considerations should I take into account to be "sure" that I will gain speed? Should the algorithm be embarrassingly parallel? In other words, should I not even bother trying the GPU if parts of the algorithm cannot be parallelized? Should I take into consideration how much memory is required for a sample input?
What are the "specs" of a serial code that would make it run faster on a GPU? Can a complex algorithm gain speed on a GPU?
I don't want to waste time coding my algorithm for the GPU unless I am 100% sure that speed will be gained... that is my problem.
I think that my algorithm could be parallelized on a GPU... would it be worth trying?
It depends upon two factors:
1) The speedup of having many cores performing the floating point operations
This is dependent upon the inherent parallelization of the operations you are performing, the number of cores on your GPU, and the differences in clock rates between your CPU and GPU.
2) The overhead of transferring the data back and forth between main memory and GPU memory.
This is mainly dependent upon the "memory bandwidth" of your particular GPU, and is greatly reduced by the Sandy Bridge architecture, where the CPU and GPU are on the same die. With older architectures, some operations, such as matrix multiplications where the inner dimensions are small, get no improvement: it takes longer to transfer the inner vectors back and forth across the system bus than it does to compute the dot products on the CPU.
Unfortunately these two factors are tough to estimate and there is no way to "know" without trying it. If you currently use BLAS for your SIMD operations, it is fairly simple to substitute in CUBLAS which has the same API except it sends the operations over to the GPU to perform.
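For a sense of what that substitution looks like, here is a rough sketch (my own illustrative function, cuBLAS v2 API, column-major double matrices: A is m x k, B is k x n, C is m x n) of the same C = alpha*A*B + beta*C that cblas_dgemm computes, executed on the GPU:

#include <cuda_runtime.h>
#include <cublas_v2.h>

void gpu_dgemm(int m, int n, int k, const double *A, const double *B, double *C)
{
    double *dA, *dB, *dC;
    cudaMalloc((void**)&dA, (size_t)m * k * sizeof(double));
    cudaMalloc((void**)&dB, (size_t)k * n * sizeof(double));
    cudaMalloc((void**)&dC, (size_t)m * n * sizeof(double));

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* upload the inputs -- this transfer is part of the real cost */
    cublasSetMatrix(m, k, sizeof(double), A, m, dA, m);
    cublasSetMatrix(k, n, sizeof(double), B, k, dB, k);

    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA, m, dB, k, &beta, dC, m);

    /* download the result */
    cublasGetMatrix(m, n, sizeof(double), dC, m, C, m);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}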
When looking for a parallel solution you should typically ask yourself these questions:
How much data do you have?
How much floating-point computation do you have?
How complicated is your algorithm, i.e. how many conditions and branches does it have? Is there any data locality?
What kind of speedup is required?
Is it a real-time computation or not?
Do alternative algorithms exist (even if they are not the most efficient serial algorithms)?
What kind of software/hardware do you have access to?
Depending on the answers, you may want to use GPGPU, cluster computation, distributed computation, or a combination of GPU and cluster/distributed machines.
If you could share any information about your algorithm and the size of your data, it would be easier to comment.
Regular C code can be converted to CUDA remarkably easily. If the heavy hitters in your algorithm's profile can be parallelized, try it and see if it helps.
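For instance, converting a serial loop usually amounts to something like the following minimal sketch (names are my own, not from the question); the serial version for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i]; becomes one thread per element:

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per element */
    if (i < n)                                       /* guard the last partial block */
        y[i] = a * x[i] + y[i];
}

/* launch with enough blocks to cover n elements, x and y already on the device:
   saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y); */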
How much faster can an algorithm on CUDA or OpenCL code run compared to a general single processor core? (considering the algorithm is written and optimized for both the CPU and GPU target).
I know it depends on both the graphics card and the CPU, but say, one of the fastest NVIDIA GPUs and a (single core of an) Intel i7 processor?
And I know it also depends on the type of algorithm.
I do not need a strict answer, but examples from experience, like: an image manipulation algorithm using double-precision floating point and 10 operations per pixel used to take 5 minutes and now runs in x seconds on this hardware.
Your question is overly broad, and very difficult to answer. Moreover, only a small percentage of algorithms (the ones without much shared state) are feasible on GPUs.
But I do want to urge you to be critical about claims. I'm in image processing, and have read many an article on the subject, but quite often in the GPU case the time to upload the input data to the GPU and download the results back to main memory is not included in the calculation of the speedup factor.
While there are a few cases where this doesn't matter (both are small or there is a second stage calculation that further reduces the result in size), usually one does have to transfer the results and initial data.
I've seen this turning a claimed plus into a negative, because the upload/download time alone was longer than the main CPU would require to do the calculation.
Pretty much the same thing applies to combining results of different GPU cards.
Update Newer GPUs seem to be able to upload/download and calculate at the same time using ping-pong buffers. But the advice to check the boundary conditions thoroughly still stands. There is a lot of spin out there.
Update 2 Quite often, using a GPU that is also shared with video output for this is not optimal. Consider e.g. driving the display from a low-budget card or the onboard video, and dedicating the main GPU to GPGPU tasks.
I think that this video introduction to OpenCL gives a good answer to your question in the first or second episode (I do not remember). I think it was at the end of the first episode...
In general it depends on how well you can "parallelize" the problem. The problem size itself is also a factor, because it costs time to copy the data to the graphics card.
It depends very much on the algorithm and how efficient the implementation can be.
Overall, it's fair to say that GPUs are better at raw computation than CPUs. Thus, an upper bound is to divide the theoretical GFlops rating of a top-end GPU by that of a top-end CPU. You can do a similar computation for theoretical memory bandwidth.
For example, 1581.1 GFlops for a GTX 580 vs. 107.55 GFlops for an i7 980XE. Note that the rating for the GTX 580 is for single precision. I believe you need to cut that down by a factor of 4 for Fermi-class non-Tesla cards to get the double-precision rating. So in this instance, you might expect roughly 4x.
Caveats, and reasons why you might do better (or see results which claim far bigger speedups):
GPUs have better memory bandwidth than CPUs once the data is on the card. Sometimes, memory-bound algorithms can do well on the GPU.
Clever use of caches (texture memory etc.) which can let you do better than advertised bandwidth.
As Marco says, the transfer time often doesn't get included. I personally always include such time in my work, and have thus found that the biggest speedups I've seen are in iterative algorithms where all the data fits on the GPU (I've personally gotten over 300x going from a midrange CPU to a midrange GPU). See the timing sketch after this list for how to include the transfers.
Apples-to-oranges comparisons. Comparing a top-end GPU vs. a low-end CPU is inherently unfair. The rebuttal is that a high-end CPU costs much more than a high-end GPU. Once you go to a GFlops/$ or GFlops/Watt comparison, things can look much more favorable to the GPU.
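As promised above, a rough sketch (hypothetical kernel and buffer names) of a timing that counts the host/device transfers, using CUDA events:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    /* upload input */
my_kernel<<<grid, block>>>(d_in, d_out);                  /* the actual computation */
cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  /* download result */
cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   /* wall-clock milliseconds, transfers included */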
__kernel void vecAdd(__global float* results)
{
    /* deliberately near-empty: each work-item only fetches its global id,
       so the measurement below reflects thread-spawn overhead, not arithmetic */
    int id = get_global_id(0);
}
This kernel can spawn 16M threads on a new $60 R7 240 GPU in 10 milliseconds. That is equivalent to 16 thread creations or context switches every 10 nanoseconds. What is the timing for a $140 FX-8150 8-core CPU? It is 1 thread per 50 nanoseconds per core.
Every instruction added to this kernel is a win for the GPU, until it branches.
Your question is, in general, hard to answer; there are simply many different variables that make it hard to give answers that are either accurate or fair.
Notably, you are comparing 1) choice of algorithm, 2) relative performance of hardware, 3) compiler optimisation ability, 4) choice of implementation languages, and 5) efficiency of algorithm implementation, all at the same time...
Note that, for example, different algorithms may be preferable on GPU vs CPU; and data transfers to and from GPU need to be accounted for in timings, too.
AMD has a case study (several, actually) in OpenCL performance for OpenCL code executing on the CPU and on the GPU. Here is one with performance results for sparse matrix vector multiply.
I've seen figures ranging from 2x to 400x. I also know that mid-range GPUs cannot compete with high-end CPUs in double-precision computation - MKL on an 8-core Xeon will be faster than CULA or CUBLAS on a $300 GPU.
OpenCL is anecdotally much slower than CUDA.
A new benchmark suite called SHOC (Scalable Heterogeneous Computing) from Oak Ridge National Lab and Georgia Tech has both OpenCL and CUDA implementations of many important kernels. You can download the suite from http://bit.ly/shocmarx. Enjoy.
We are planning to port a big part of our digital signal processing routines from hardware-specific chips to a common desktop CPU architecture, such as a quad core. I am trying to estimate the limits of such an architecture for a program built with GCC. I am mostly interested in a high SDRAM-CPU bandwidth [GB/sec] and in a high number of 32-bit IEEE 754 floating-point multiply-accumulate operations per second.
I have selected a typical representative of modern desktop CPUs: quad core, about 10 MB cache, 3 GHz, 45 nm. Can you please help me find out its limits:
1) The highest possible number of multiply-accumulate operations per second, assuming the CPU-specific instructions that GCC supports via input flags are used, and all cores are used. The source code itself must not require changes if we decide to port it to a different CPU architecture, such as AltiVec on PowerPC - the best option is to use GCC flags like -msse or -maltivec. I assume the program also has to have 4 threads in order to utilize all available cores, right?
2) The SDRAM-CPU bandwidth (the upper limit, i.e. independent of the mainboard).
UPDATE: Since GCC 3, GCC can automatically generate scalar SSE/SSE2 code when the target supports those instructions. Automatic vectorization for SSE/SSE2 was added in GCC 4. SSE4.1 introduces the DPPS and DPPD instructions - dot products for array-of-structs data. The new 45nm Intel processors support SSE4 instructions.
First off, know that it will most likely not be possible for your code to both run as fast as possible on modern vector FPU units and be completely portable across architectures. It is possible to abstract away some aspects of the architectures via macros, etc, but compilers are (at present) capable of generating nearly optimal auto-vectorized code only for very simple programs.
Now, on to your questions: current x86 hardware does not have a multiply-accumulate, but is capable of one vector add and one vector multiply per cycle per core. Assuming that your code achieves full computational density, and you either hand-write vector code or your code is simple enough for the compiler to handle the task, the peak throughput that can be achieved independent of memory access latency is:
number of cores * cycles per second * flops per cycle * vector width
Which in your case sounds like:
4 * 3.2 GHz * 2 vector flops/cycle * 4 floats/vector = 102.4 Gflops
If you are going to write scalar code, divide that by four. If you are going to write vector code in C with some level of portable abstraction, plan to be leaving some performance on the table, but you can certainly go substantially faster than scalar code will allow. 50% of theoretical peak is a conservative guess (I would expect to do better assuming the algorithms are amenable to vectorization, but make sure you have some headroom in your estimates).
edit: notes on DPPS:
DPPS is not a multiply-add, and using it as one is a performance hazard on current architectures. Looking it up in the Intel Optimization Manual, you will find that it has a latency of 11 cycles, and throughput is only one vector result every two cycles. DPPS does up to four multiplies and three adds, so you're getting 2 multiplies per cycle and 1.5 adds, whereas using MULPS and ADDPS would get you 4 of each every cycle.
More generally, horizontal vector operations should be avoided unless absolutely necessary; lay out your data so that your operations stay within vector lanes to the maximum extent possible.
In fairness to Intel, if you can't change your data layout, and DPPS happens to be exactly the operation that you need, then you want to use it. Just be aware that you're limiting yourself to less than 50% of peak FP throughput right off the bat by doing so.
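To make the lane-oriented advice concrete, here is a sketch (my own illustrative functions, assuming SSE4.1 and n a multiple of 4) of a dot product done the horizontal way with DPPS versus the vertical MULPS/ADDPS way:

#include <smmintrin.h>   /* SSE4.1: _mm_dp_ps */

/* horizontal: one DPPS per 4 elements; every iteration waits on a long-latency op */
float dot_dpps(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i += 4)
        sum += _mm_cvtss_f32(_mm_dp_ps(_mm_loadu_ps(a + i),
                                       _mm_loadu_ps(b + i), 0xF1));
    return sum;
}

/* vertical: MULPS/ADDPS accumulate within lanes; one horizontal reduction at the end */
float dot_vertical(const float *a, const float *b, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i),
                                         _mm_loadu_ps(b + i)));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}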
This may not directly answer your question, but have you considered using the PC's graphics cards for parallel floating-point computations? It's getting to the point where GPUs will outperform CPUs for some tasks; and the nice thing is that graphics cards are reasonably competitively priced.
I'm short on details, sorry; this is just to give you an idea.
Some points you should consider:
1) Intel's i7 architecture is at the moment your fastest option for 1 or 2 CPU sockets. Only at 4 or more sockets can AMD's Opterons compete.
2) Intel's compilers generate code that is often significantly faster than code generated by other compilers (when used on AMD CPUs you have to patch away some CPU checks Intel puts in to prevent AMD chips from looking good).
3) No x86 CPU supports multiply-and-add yet; AMD's next architecture, "Bulldozer", will probably be the first to support it.
4) You get high memory bandwidth on any AMD CPU, but on Intel only with the new i7 architecture (socket 1366 is better than 775).
5) Use Intel's highly efficient libraries if possible.
Is there a benchmark that compares the different BLAS (Basic Linear Algebra Subprograms) libraries? I am especially interested in sparse matrix multiplication, for both single- and multi-core systems.
BLAS performance is very much system dependent, so you'd best do the benchmarks yourself on the very machine you want to use. Since there are only a few BLAS implementations (normally the hardware vendor's implementation, ATLAS, and the GotoBLAS), that is less work than it sounds.
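A minimal sketch of such a benchmark (my own illustrative code, timing DGEMM through the portable CBLAS interface so you can relink against ATLAS, GotoBLAS, or the vendor BLAS without changing it):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void)
{
    const int n = 2000;   /* pick the sizes you actually care about */
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    for (long i = 0; i < (long)n * n; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("DGEMM %dx%d: %.3f s, %.2f GFlops\n",
           n, n, secs, 2.0 * n * n * (double)n / secs / 1e9);

    free(A); free(B); free(C);
    return 0;
}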
But note that BLAS only covers dense matrices, so for sparse matrix multiplication you'll need Sparse-BLAS or some other code. Here performance will differ not only depending on hardware but also on the sparse format you want to use and even on the type of matrix you are working with (things like sparsity pattern, bandwidth etc. matter). So even more than in the dense case, if you need maximum performance you will need to do your own benchmarks.