What's my best bet for computing the dot product of a vector x with a large number of vectors y_i, where x and y_i are of length 10k or so?
Shove the y's in a matrix and use an optimized s/dgemv routine?
Or maybe try handcoding an SSE2 solution (I don't have SSE3, according to cpuinfo).
I'm just looking for general guidance here, so any suggestions will be useful.
And yes, I do need the performance.
Thanks for any light.
I think GPUs are specifically designed to perform operations like this quickly (among others), so you could probably use DirectX or OpenGL libraries to perform the vector operations, e.g. D3DXVec2Dot. This will also save you CPU time.
Alternatives for optimised BLAS routines:
If you use Intel compilers, you may have access to Intel MKL.
For other compilers, ATLAS usually provides nice performance numbers.
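Whichever BLAS you pick, the gemv approach from the question might look roughly like this with the CBLAS interface (just a sketch; here Y holds the y_i as rows in row-major order):

/* Compute all dot products r_i = y_i . x with a single dgemv call.
   ATLAS ships cblas.h; with MKL include mkl.h instead. */
#include <cblas.h>

void all_dot_products(const double *Y,   /* num_vec x len, row-major: row i is y_i */
                      const double *x,   /* length len */
                      double *r,         /* output, length num_vec */
                      int num_vec, int len)
{
    /* r = 1.0 * Y * x + 0.0 * r */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, num_vec, len,
                1.0, Y, len, x, 1, 0.0, r, 1);
}

A single gemv over all the y_i lets the library block for cache and use its tuned kernels, which is usually hard to beat by hand.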
Handcoding an SSE2 solution is not very difficult and will bring a nice speedup over a pure C routine. How much it gains over a BLAS routine is something you will have to measure yourself.
The greatest speedup comes from structuring the data in a format that lets you exploit data parallelism and alignment.
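For comparison, here is a minimal hand-coded sketch of a single-precision SSE dot product (assumptions: 16-byte aligned data, a length that is a multiple of 4, and a horizontal sum that avoids SSE3):

#include <emmintrin.h>  /* SSE2 */

/* Single dot product; call once per y_i, or better, unroll over several y_i
   to reuse the loaded x values. Assumes aligned data and len % 4 == 0. */
float dot_sse(const float *a, const float *b, int len)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < len; i += 4) {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    /* horizontal sum without SSE3 */
    float tmp[4] __attribute__((aligned(16)));
    _mm_store_ps(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}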
I use GotoBLAS. It provides high-performance kernel routines and, in my experience, is many times faster than MKL and the reference BLAS.
The following provides BLAS level 1 (vector operations) routines using SSE.
http://www.applied-mathematics.net/miniSSEL1BLAS/miniSSEL1BLAS.html
If you have an Nvidia graphics card you can get cuBLAS, which will perform the operation on the graphics card.
http://developer.nvidia.com/cublas
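For the batched dot products in the question, a rough sketch of the cuBLAS call (assuming the y_i are already on the device, packed as columns of a len x num_vec matrix in cuBLAS's column-major layout; error checking omitted):

#include <cublas_v2.h>

/* r = Y^T * x on the GPU. d_Y is len x num_vec in column-major order
   (each y_i is one column); all pointers are device memory, and the
   handle comes from an earlier cublasCreate(). */
void all_dot_products_gpu(cublasHandle_t handle,
                          const float *d_Y, const float *d_x, float *d_r,
                          int len, int num_vec)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_T, len, num_vec,
                &alpha, d_Y, len, d_x, 1, &beta, d_r, 1);
}

Keep in mind that transferring 10k-element vectors to the card can easily dominate the runtime unless the data stays on the GPU across many operations.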
For ATI (AMD) graphics cards
http://developer.amd.com/libraries/appmathlibs/pages/default.aspx
Related
I'm trying to do an FFT -> signal manipulation -> inverse FFT using Project NE10 in my C++ project, converting the complex output to amplitudes and phases after the FFT and doing the reverse before the IFFT. But the performance of my C++ code is not as good as the SIMD-enabled NE10 code, as per the benchmarks. Since I have no experience with ARM assembly, I'm looking for some help writing NEON code for the unoptimised C module. For example, before the IFFT I do this:
for (int bin = 0; bin < NUM_FREQUENCY_BINS; bin++) {
    /* polar -> cartesian: real = r * cos(theta), imag = r * sin(theta) */
    input[bin].real = amplitudes[bin] * cosf(phases[bin]);
    input[bin].imag = amplitudes[bin] * sinf(phases[bin]);
}
where input is an array of C structs (for complex values), and amplitudes & phases are float arrays.
The above block (O(n) complexity) takes about 0.6 ms for 8192 bins, while the NE10 FFT (O(n*log(n)) complexity) takes only 0.1 ms because of its SIMD operations. From what I've read so far on StackOverflow and other places, intrinsics are not worth the effort, so I'm looking at hand-written NEON assembly only.
You can use NEON for trig functions if you settle for approximations. I am not affiliated, but there is an implementation here that uses intrinsics to create vectorised sin/cos functions accurate to many decimal places that perform substantially better than simply calling sinf, etc (benchmarks are provided by the author).
The code is especially well suited to your polar to cartesian calculation, as it generates sin and cos results simultaneously. It might not be suitable for something where absolute precision is crucial, but for anything to do with frequency domain audio processing, this normally is not the case.
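To give an idea of the shape of the resulting loop, here is a rough sketch using NEON intrinsics. sincos4 is a stand-in for whatever vectorised sin/cos routine you adopt (the per-lane libm fallback below is only a placeholder so the sketch compiles), and the loop assumes NUM_FREQUENCY_BINS is a multiple of 4:

#include <arm_neon.h>
#include <math.h>

struct cpx { float real, imag; };

/* Placeholder: per-lane libm calls. Replace the body with the library's
   vectorised sin/cos to get the actual speedup. */
static inline void sincos4(float32x4_t x, float32x4_t *s, float32x4_t *c)
{
    float tmp[4], st[4], ct[4];
    vst1q_f32(tmp, x);
    for (int k = 0; k < 4; k++) { st[k] = sinf(tmp[k]); ct[k] = cosf(tmp[k]); }
    *s = vld1q_f32(st);
    *c = vld1q_f32(ct);
}

void polar_to_cartesian(struct cpx *input, const float *amplitudes,
                        const float *phases, int n)   /* n assumed multiple of 4 */
{
    for (int i = 0; i < n; i += 4) {
        float32x4_t amp = vld1q_f32(amplitudes + i);
        float32x4_t ph  = vld1q_f32(phases + i);
        float32x4_t s, c;
        sincos4(ph, &s, &c);                   /* vectorised sin/cos */
        float32x4x2_t out;
        out.val[0] = vmulq_f32(amp, c);        /* real = amp * cos(phase) */
        out.val[1] = vmulq_f32(amp, s);        /* imag = amp * sin(phase) */
        vst2q_f32((float *)&input[i], out);    /* interleaved store into {real, imag} pairs */
    }
}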
As far as I know, NEON doesn't provide vector instructions for trigonometric functions (sin, cos). But you can of course still improve your code: for example, you can use a table of pre-calculated sine and cosine values, which can lead to a significant improvement in performance.
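A minimal sketch of that table approach (the table size and the assumption that phases are non-negative are mine; it trades accuracy for speed):

#include <math.h>

#define TABLE_SIZE 4096            /* power of two; tune for the accuracy you need */
#define TWO_PI     6.283185307f

static float sin_table[TABLE_SIZE];

void init_sin_table(void)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        sin_table[i] = sinf(TWO_PI * (float)i / TABLE_SIZE);
}

/* Truncating lookup; assumes phase >= 0 (the mask wraps it modulo 2*pi). */
static inline float sin_lut(float phase)
{
    int idx = (int)(phase * (TABLE_SIZE / TWO_PI)) & (TABLE_SIZE - 1);
    return sin_table[idx];
}

static inline float cos_lut(float phase)
{
    return sin_lut(phase + TWO_PI / 4.0f);   /* cos(x) = sin(x + pi/2) */
}

The loop body then becomes input[bin].real = amplitudes[bin] * cos_lut(phases[bin]); and likewise for imag. Note that the table lookup is a gather, so the compiler will not auto-vectorise it as readily; measure whether it actually wins.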
Concerning intrinsics versus assembly for NEON: I have tried both, and in most cases they give practically the same result (with a modern compiler), but writing assembler is more labor-intensive. The main performance improvement comes from handling the data correctly (loading, storing) and from using vector instructions, and all of that can be done with intrinsics.
Of course, if you want to achieve 100% utilization of the CPU you sometimes need assembler, but that is a rare case.
According to my measurements of dgemm from both cublas and atlas, atlas severely beats cublas in terms of speed. Is this to be expected for a system with an Intel i7 950 and Nvidia GTX470?
I tested matrices of size 10x10 up to 6000x6000 in increments of 50. Atlas always wins. I measure both total application execution and just the multiplication step.
Anyone else have experience with this? Is this the expected result?
Thanks in advance.
edit: (same code, same results on a Xeon X5670 and Nvidia Tesla C2050)
edit2: It appears a great deal of the slowness is attributable to initialisation of the cublas library. I continue to work on it. I'll update here when I learn more.
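For reference, this is roughly how the timing can be structured so that handle creation and the host/device transfers are kept out of the measured region (just a sketch, error checking omitted; remember cuBLAS assumes column-major storage):

#include <cuda_runtime.h>
#include <cublas_v2.h>

void time_dgemm(const double *h_A, const double *h_B, double *h_C, int n)
{
    /* One-time setup: creating the handle initialises the CUDA context,
       which can cost hundreds of milliseconds. Keep it out of the timing. */
    cublasHandle_t handle;
    cublasCreate(&handle);

    size_t bytes = (size_t)n * n * sizeof(double);
    double *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, bytes);
    cudaMalloc((void **)&d_B, bytes);
    cudaMalloc((void **)&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    const double alpha = 1.0, beta = 0.0;
    /* start timer here to measure only the multiplication */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_A, n, d_B, n, &beta, d_C, n);
    cudaDeviceSynchronize();   /* the call is asynchronous; wait before stopping the timer */
    /* stop timer here */

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    cublasDestroy(handle);
}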
Did you use the single-threaded versions of both libraries? As far as I understand, both GotoBLAS and Atlas tend to sneakily use multiple threads when working on large matrices.
That said, at large matrix sizes the algorithm used tends to matter much more than the low-level implementation. Naive matrix multiplication is O(N^3), whereas the Strassen algorithm scales much better, roughly O(N^2.81). The Strassen algorithm also happens to vectorize very nicely onto the wide SSE and AVX registers, yielding an almost 2- to 8-fold increase in efficiency depending on the floating-point format and register width.
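For reference, the Strassen step partitions each matrix into 2x2 blocks and replaces the eight block multiplications of the naive method with seven:

M1 = (A11 + A22) * (B11 + B22)
M2 = (A21 + A22) * B11
M3 = A11 * (B12 - B22)
M4 = A22 * (B21 - B11)
M5 = (A11 + A12) * B22
M6 = (A21 - A11) * (B11 + B12)
M7 = (A12 - A22) * (B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

Applied recursively this gives T(N) = 7*T(N/2) + O(N^2), i.e. O(N^log2(7)) which is approximately O(N^2.807), and that is where the 2.81 exponent above comes from.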
I am not sure how well the two GPUs you mentioned handle double-precision math. Typically they're optimized for single precision (32-bit floats), dropping to a third or a quarter of that speed when handling doubles.
There are other factors in your tests that may skew the results. For example, you may be including the time to transfer the matrices to and from the GPU. Whether that matches real-world use cases, I don't know; I don't have an Nvidia GPU to test, but I suspect not. Usually there are multiple operations, and the matrix does not need to be transferred between operations.
I've been writing my own low-level SSE3 matrix functions using the SSE/AVX vector built-ins provided by the GCC and ICC C99 compilers; early testing indicates they beat the current Fortran implementations by a wide margin, especially at the very small (say up to 8x8, optimized for each size) and very large (above 1000x1000, using the Strassen algorithm) sizes for dense matrices.
I'm looking to calculate highly parallelized trig functions (in blocks of around 1024), and I'd like to take advantage of at least some of the parallelism that modern architectures offer.
When I compile a block
for (int i = 0; i < SIZE; i++) {
    arr[i] = sin((float)i / 1024);
}
GCC won't vectorize it, and says
not vectorized: relevant stmt not supported: D.3068_39 = __builtin_sinf (D.3069_38);
Which makes sense to me. However, I'm wondering if there's a library to do parallel trig computations.
With just a simple Taylor series up to the 11th order, GCC will vectorize all the loops, and I'm getting speeds over twice as fast as a naive sin loop (with bit-exact answers; with a 9th-order series, only a single bit is off for the last two out of 1600 values, for a >3x speedup). I'm sure someone has encountered a problem like this before, but when I google, I find no mention of any libraries or the like.
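Roughly, the kind of loop I mean is sketched below (an 11th-order series in Horner form; GCC needs something like -O3 -ffast-math to vectorise it, and the series is only this accurate because the arguments stay small here):

/* 11th-order Taylor series for sin(x), accurate for small x (here x in [0, SIZE/1024)).
   Straight-line float math, so GCC can auto-vectorise the loop. */
void sin_block(float *arr, int size)
{
    for (int i = 0; i < size; i++) {
        float x  = (float)i / 1024.0f;
        float x2 = x * x;
        /* x - x^3/3! + x^5/5! - x^7/7! + x^9/9! - x^11/11!, in Horner form */
        arr[i] = x * (1.0f - x2 * (1.0f/6.0f - x2 * (1.0f/120.0f
                   - x2 * (1.0f/5040.0f - x2 * (1.0f/362880.0f
                   - x2 * (1.0f/39916800.0f))))));
    }
}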
A. Is there something existing already?
B. If not, advice for optimizing parallel trig functions?
EDIT: I found the following library called "SLEEF": http://shibatch.sourceforge.net/ which is described in this paper and uses SIMD instructions to calculate several elementary functions. It uses SSE and AVX specific code, but I don't think it will be hard to turn it into standard C loops.
Since you said you were using GCC it looks like there are some options:
http://gruntthepeon.free.fr/ssemath/
This uses SSE and SSE2 instructions to implement it.
http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php
This has an alternate implementation. Some of the comments are pretty good.
That said, I'd probably look into GPGPU for a solution, maybe writing it in CUDA or OpenCL (if I remember correctly, CUDA supports the sine function). Here are some libraries that look like they might make it easier.
https://code.google.com/p/slmath/
https://code.google.com/p/thrust/
Since you are looking to calculate harmonics here, I have some code that addressed a similar problem. It is vectorized already and faster than anything else I have found. As a side benefit, you get the cosine for free.
What platform are you using? Many libraries of this sort already exist:
Intel provides the Vector Math Library (VML) with icc.
Apple provides the vForce library as part of the Accelerate framework.
HP provides their own Vector Math Library for Itanium (and maybe other architectures, too).
Sun provided libmvec with their compiler tools.
...
Instead of the Taylor series, I would look at the algorithms fdlibm uses. They should get you as much precision with fewer steps.
My answer was to create my own library to do exactly this called vectrig: https://github.com/jeremysalwen/vectrig
I need to optimize some C code, which does lots of physics computations, using SIMD extensions on the SPE of the Cell Processor. Each vector operator can process 4 floats at the same time. So ideally I would expect a 4x speedup in the most optimistic case.
Do you think the use of vector operators could give bigger speedups?
Thanks
The best optimization comes from rethinking the algorithm. Eliminate unnecessary steps. Find a more direct way of accomplishing the same result. Compute the solution in a domain more relevant to the problem.
For example, if the vector array is a list of n points that all lie on the same line, then it is sufficient to transform the end points only and interpolate the intermediate points.
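A trivial illustration of that idea (a sketch; the 3-component points are my assumption, and the shortcut is only exact when the transform is affine, since affine maps commute with linear interpolation):

/* Transform n collinear points by transforming only the two endpoints
   and linearly interpolating the rest. Requires n >= 2. */
void transform_line(const float p0[3], const float p1[3], float *out, int n,
                    void (*transform)(const float in[3], float out[3]))
{
    float t0[3], t1[3];
    transform(p0, t0);
    transform(p1, t1);
    for (int i = 0; i < n; i++) {
        float t = (float)i / (float)(n - 1);
        for (int k = 0; k < 3; k++)
            out[3 * i + k] = t0[k] + t * (t1[k] - t0[k]);   /* lerp between transformed endpoints */
    }
}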
It CAN give a better speedup than 4x over straight floating point, as the SIMD instructions could be less exact (not so much as to give too many problems, though) and so take fewer cycles to execute. It really depends.
The best plan is to learn as much as possible about the processor you are optimising for. You may find it can give you far better than 4x improvements. You may find out it can't. We can't say, though, without knowing more about the algorithm you are optimising and what CPU you are targeting.
On their own, no. But if the process of re-writing your algorithms to support them also happens to improve, say, cache locality or branching behaviour, then you could find unrelated speed-ups. However, this is true of any re-write...
This is entirely possible.
You can do more clever instruction-level micro optimizations than a compiler, if you know what you're doing.
Most SIMD instruction sets offer several powerful operations that don't have any equivalent in normal scalar FPU/ALU code (e.g. PAVG/PMIN etc. in SSE2). Even if these don't fit your problem exactly, you can often combine these instructions to great effect.
Not sure about Cell, but most SIMD instruction sets have features to optimize memory access, for example to prefetch data into cache. I've had very good results with these.
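As a small x86 illustration of both points (the SPE has its own equivalents of these): PAVGB averages 16 byte pairs per instruction, and a software prefetch pulls upcoming cache lines in ahead of use. The prefetch distance here is just an example and needs tuning:

#include <xmmintrin.h>  /* _mm_prefetch */
#include <emmintrin.h>  /* SSE2 */

/* Average two 8-bit images, 16 pixels per iteration.
   Assumes n is a multiple of 16 and the buffers are 16-byte aligned. */
void average_images(const unsigned char *a, const unsigned char *b,
                    unsigned char *dst, int n)
{
    for (int i = 0; i < n; i += 16) {
        _mm_prefetch((const char *)(a + i + 256), _MM_HINT_T0);  /* fetch ~16 iterations ahead */
        _mm_prefetch((const char *)(b + i + 256), _MM_HINT_T0);
        __m128i va = _mm_load_si128((const __m128i *)(a + i));
        __m128i vb = _mm_load_si128((const __m128i *)(b + i));
        /* PAVGB: rounded average of 16 unsigned bytes, no scalar equivalent */
        _mm_store_si128((__m128i *)(dst + i), _mm_avg_epu8(va, vb));
    }
}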
Now this isn't Cell or PPC at all, but a simple image convolution filter of mine got a 20x speedup (C vs. SSE2) on Atom, which is higher than the degree of parallelism (16 pixels at a time).
It depends on the architecture. For the moment I'll assume x86 (i.e. SSE).
You can easily get a factor of four on tight loops. Just replace your existing math with SSE instructions and you're done.
You can even get a little more than that, because with SSE you do the math in registers that are usually not used by the compiler. This frees up the general-purpose registers for other tasks such as loop control and address calculation. In short, the code that surrounds the SSE instructions will be more compact and execute faster.
And then there is the option of hinting to the memory controller how you want to access memory, e.g. whether you want to store data in a way that bypasses the cache. For bandwidth-hungry algorithms that may give you some extra speed on top of that.
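On x86 that is what the non-temporal (streaming) store instructions are for; a minimal sketch (assumes 16-byte aligned buffers and a length that is a multiple of 4):

#include <xmmintrin.h>

/* Scale an array using non-temporal stores so the output does not
   evict useful data from the cache. */
void scale_stream(const float *src, float *dst, float s, int n)
{
    __m128 vs = _mm_set1_ps(s);
    for (int i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, _mm_mul_ps(_mm_load_ps(src + i), vs));
    _mm_sfence();   /* make the streaming stores globally visible */
}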
Is there a benchmark that compares the different BLAS (Basic Linear Algebra Subprograms) libraries? I am especially interested in sparse matrix multiplication for single- and multi-core systems.
BLAS performance is very much system-dependent, so you had best do the benchmarks yourself on the very machine you want to use. Since there are only a few BLAS implementations, that is less work than it sounds (normally the hardware vendor's implementation, ATLAS, and GotoBLAS).
But note that BLAS only covers dense matrices, so for sparse matrix multiplication you'll need Sparse-BLAS or some other code. Here performance will differ not only depending on hardware but also on the sparse format you want to use and even on the type of matrix you are working with (things like sparsity pattern, bandwidth etc. matter). So even more than in the dense case, if you need maximum performance you will need to do your own benchmarks.
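For reference, a CSR-format sparse matrix-vector product boils down to a kernel like the sketch below, and it is exactly this loop whose performance swings with the sparsity pattern (irregular access to x, short rows, etc.):

/* y = A * x for a sparse matrix A stored in CSR format:
   row_ptr has n_rows + 1 entries; col_idx and values have one entry per nonzero. */
void csr_spmv(int n_rows, const int *row_ptr, const int *col_idx,
              const double *values, const double *x, double *y)
{
    for (int i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += values[j] * x[col_idx[j]];   /* indirect load of x: the expensive part */
        y[i] = sum;
    }
}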