I'm trying to do an FFT -> signal manipulation -> inverse FFT pipeline using Project NE10 in my C++ project, converting the complex FFT output to amplitudes and phases, and converting back again before the IFFT. But the performance of my own C code is not as good as the SIMD-enabled NE10 code, according to my benchmarks. Since I have no experience with ARM assembly, I'm looking for some help writing NEON code for the unoptimised C module. For example, before the IFFT I do this:
for (int bin = 0; bin < NUM_FREQUENCY_BINS; bin++) {
    input[bin].real = amplitudes[bin] * cosf(phases[bin]);
    input[bin].imag = amplitudes[bin] * sinf(phases[bin]);
}
where input is an array of C structs (one per complex value), and amplitudes and phases are float arrays.
The above block (O(n) complexity) takes about 0.6 ms for 8192 bins, while the NE10 FFT (O(n log n) complexity) takes only 0.1 ms because of its SIMD operations. From what I've read so far on Stack Overflow and elsewhere, intrinsics are not worth the effort, so I'm trying to do this in ARM NEON assembly only.
You can use NEON for trig functions if you settle for approximations. I am not affiliated, but there is an implementation here that uses intrinsics to create vectorised sin/cos functions, accurate to many decimal places, that perform substantially better than simply calling sinf, etc. (benchmarks are provided by the author).
The code is especially well suited to your polar to cartesian calculation, as it generates sin and cos results simultaneously. It might not be suitable for something where absolute precision is crucial, but for anything to do with frequency domain audio processing, this normally is not the case.
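To give a feel for how that slots into your loop, here is a rough sketch processing four bins per iteration. It assumes a hypothetical vectorised sincos4() with the signature shown (the linked code provides an equivalent; adjust the name and types to match whatever it actually exports) and assumes your complex struct is two packed floats:

#include <arm_neon.h>

// Hypothetical vectorised sincos: sin and cos of four floats at once.
// Replace with the function the linked library actually provides.
void sincos4(float32x4_t x, float32x4_t *s, float32x4_t *c);

typedef struct { float real; float imag; } cpx;

// num_bins is assumed to be a multiple of 4 (8192 is).
void polar_to_cartesian(cpx *input, const float *amplitudes,
                        const float *phases, int num_bins) {
    for (int bin = 0; bin < num_bins; bin += 4) {
        float32x4_t amp = vld1q_f32(&amplitudes[bin]);
        float32x4_t ph  = vld1q_f32(&phases[bin]);

        float32x4_t s, c;
        sincos4(ph, &s, &c);                 // one call yields both results

        float32x4x2_t out;
        out.val[0] = vmulq_f32(amp, c);      // real = amplitude * cos(phase)
        out.val[1] = vmulq_f32(amp, s);      // imag = amplitude * sin(phase)

        // Interleaved store writes {real, imag} pairs straight into the struct array.
        vst2q_f32((float *)&input[bin], out);
    }
}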
As far as I know, NEON doesn't have vector instructions for trigonometric functions (sin, cos). But you can of course still improve your code. One option is to use a table of pre-calculated sine and cosine values; this can give a significant performance improvement.
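For example, a minimal sketch of the table approach (table size and helper names are only illustrative; accuracy depends on the table size and on how you quantise the phase):

#include <math.h>

#define TRIG_TABLE_SIZE 4096   // must be a power of two for the & mask below

static float sin_table[TRIG_TABLE_SIZE];
static float cos_table[TRIG_TABLE_SIZE];
static const float two_pi = 6.28318530718f;

// Fill the tables once at start-up.
static void init_trig_tables(void) {
    for (int i = 0; i < TRIG_TABLE_SIZE; i++) {
        float angle = two_pi * (float)i / (float)TRIG_TABLE_SIZE;
        sin_table[i] = sinf(angle);
        cos_table[i] = cosf(angle);
    }
}

// Nearest-entry lookup; the phase is assumed to be in [0, 2*pi).
static inline int trig_index(float phase) {
    return (int)(phase * (TRIG_TABLE_SIZE / two_pi)) & (TRIG_TABLE_SIZE - 1);
}

// Usage inside the original loop:
//   int idx = trig_index(phases[bin]);
//   input[bin].real = amplitudes[bin] * cos_table[idx];
//   input[bin].imag = amplitudes[bin] * sin_table[idx];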
Concerning intrinsics versus assembly for NEON: I have tried both, and in most cases they give practically the same result (with a modern compiler), while writing assembly is considerably more labor-intensive. The main performance improvement comes from handling the data correctly (loading, storing) and from using vector instructions, and all of that can be done with intrinsics.
Of course, if you want to achieve 100% utilization of the CPU you sometimes need to use assembly, but that is a rare case.
The problem
I'm working on implementing and refining an optimization algorithm with some fairly large arrays (from tens of millions of floats and up), using mainly Intel MKL in C (not C++, at least not so far) to squeeze out every possible bit of performance. Now I've run into a silly problem: I have a parameter that sets maxima and minima for subsets of a set of (tens of millions of) coefficients. Actually applying these maxima and minima using MKL functions is easy; I can create equally-sized vectors with the limits for every element and use V?Fmax and V?Fmin to apply them. But I also need to account for this clipping in my error metric, which requires me to count the number of elements that fall outside these constraints.
However, I can't find an MKL function that allows me to do things like counting the number of elements that fulfill some condition, the way you can create and sum logical arrays with e.g. NumPy in Python or in MATLAB. Irritatingly, when I try to google this question, I only get answers relating to Python and R.
Obviously I can just write a loop that increments a counter for each element that fulfills one of the conditions, but if there is an already optimized implementation that allows me to achieve this, I would much prefer that just owing to the size of my arrays.
Does anyone know of a clever way to achieve this robustly and very efficiently using Intel MKL (maybe with the statistics toolbox or some creative use of elementary functions?), a similarly optimized library that does this, or a highly optimized way to hand-code this? I've been racking my brain trying to come up with some out-of-the-box method, but I'm coming up empty.
Note that it's necessary for me to be able to do this in C, that it's not viable for me to shift this task to my Python frontend, and that it is indeed necessary for me to code this particular subprogram in C in the first place.
Thanks!
If you were using C++, std::count_if from the algorithms library with an execution policy of std::execution::par_unseq may parallelize and vectorize the count. On Linux at least, it typically uses Intel TBB under the hood to do this.
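A sketch of what that could look like with per-element min/max arrays as described in the question (index-based because the predicate needs to look at three arrays at once; on gcc you typically need to link with -ltbb):

#include <algorithm>
#include <cstddef>
#include <execution>
#include <numeric>
#include <vector>

// Count the coefficients falling outside their per-element [lo, hi] limits.
std::size_t count_clipped(const std::vector<float>& c,
                          const std::vector<float>& lo,
                          const std::vector<float>& hi) {
    std::vector<std::size_t> idx(c.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    return std::count_if(std::execution::par_unseq, idx.begin(), idx.end(),
                         [&](std::size_t i) { return c[i] < lo[i] || c[i] > hi[i]; });
}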
It's not likely to be as easy in C. Because C doesn't have concepts like templates, callables or lambdas, the only way to specialize a generic (library-provided) count() function would be to pass a function pointer as a callback (like qsort() does). Unless the compiler manages to devirtualize and inline the callback, you can't vectorize at all, leaving you with (possibly thread-parallelized) scalar code. OTOH, if you use for example gcc vector intrinsics (my favourite!), you get vectorization but not parallelization. You could try to combine the approaches, but I'd say get over yourself and use C++.
However, if you only need vectorization, you can almost certainly just write sequential code and have the compiler autovectorize it, unless the predicate for what should be counted is poorly written, or your compiler is brain-damaged.
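For concreteness, a minimal version of such a counting loop might look like this (names are made up):

#include <stddef.h>

// Count elements outside [mins[i], maxs[i]]. The bitwise | (rather than the
// short-circuiting ||) evaluates both comparisons unconditionally, which keeps
// the loop trivially vectorizable (see the note below).
size_t count_out_of_bounds(const float *vals, const float *mins,
                           const float *maxs, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (vals[i] < mins[i]) | (vals[i] > maxs[i]);
    return count;
}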
For example, gcc vectorizes this code on x86 if at least SSE4 instructions are available (-msse4). With AVX[2/512] (-mavx / -mavx2 / -mavx512f) you can get wider vectors to process more elements at once. In general, if you're compiling on the same hardware you will be running the program on, I'd recommend letting gcc autodetect the optimal instruction set extensions (-march=native).
Note that in the code above, the conditions must not use the short-circuiting OR (||), because then the read from the max vector is semantically forbidden whenever the comparison with the min vector was already true for the current element, which severely hinders vectorization (though AVX-512 could potentially vectorize it, with a somewhat catastrophic slowdown).
I'm pretty sure gcc is not nearly optimal in the code it generates for AVX-512, since it could do the OR in the mask registers (k-regs) with kor[b/w/d/q], but maybe somebody with more experience in AVX-512 (*cough* Peter Cordes *cough*) could weigh in on that.
MKL doesn't provide such functions, but you may want to check another performance library, IPP, which contains a set of threshold functions that could be useful for your case. Please refer to the IPP Developer Reference for more details: https://software.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top/volume-1-signal-and-data-processing/essential-functions/conversion-functions/threshold.html
I am trying to run my code on an ARM device. So far it's running, and I also have a tool to measure complexity. Now I have lots of standard functions I use to perform mathematical operations, like dividing, multiplying, adding and so on.
Is it easier (i.e. less complex) if I write those operations as, e.g.,
result = a + b;
or as
"qadd %0, %1, %4;"
which would be the ARM code for this operation if the values are in the respective registers. I am just wondering if writing everything in ARM assembly would really reduce the complexity.
Also, how does that work out for conditionals (like if and else)?
Thank you.
Let the compiler take care of it, until you discover a bottleneck.
Note that QADD is saturating arithmetic and has different behavior to the C code you show.
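For reference, here is a plain C equivalent of what QADD computes, which shows how it differs from an ordinary a + b on int (which wraps or overflows instead of saturating):

#include <stdint.h>

// Roughly what the ARM QADD instruction does: a signed 32-bit add that
// saturates at INT32_MIN / INT32_MAX instead of wrapping.
int32_t qadd_ref(int32_t a, int32_t b) {
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}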
THIS QUESTION IS ABOUT C++ CODE TARGETED FOR AVX/AVX2 INSTRUCTIONS, as shipped in Intel processors since 2013 (and/or AVX-512 since 2015).
How do I generate one million random Gaussian unit normals fast on Intel processors with new instructions sets?
More generic versions of this question have been asked a few times before, e.g., in Generate random numbers following a normal distribution in C/C++. Yes, I know about Box-Muller, adding, and other techniques. I am tempted to build my own inverse normal distribution, sample (i.e., map) exactly according to expectations (pseudo-normals, then), and then randomly rearrange the order.
But I also know I am using an Intel Core processor with recent AVX vector and AES instruction sets. Besides, I need C (not C++ with its std library), and it needs to work on Linux and OS X with gcc.
So, is there a better processor-specific way to generate so many random numbers fast? For such large quantities of random numbers, does Intel processor hardware even offer useful instructions? Are they an option worth looking into, and if so, is there an existing standard function implementation of "rnorm"?
According to my measurements of dgemm from both cuBLAS and ATLAS, ATLAS severely beats cuBLAS in terms of speed. Is this to be expected for a system with an Intel i7 950 and an Nvidia GTX 470?
I tested matrices of size 10x10 up to 6000x6000 in increments of 50. ATLAS always wins. I measure both total application execution time and just the multiplication step.
Anyone else have experience with this? Is this the expected results?
Thanks in advance.
edit: (same code, same results on a Xeon X5670 and Nvidia Tesla C2050)
edit2: It appears a great deal of the slowness is attributable to initialisation of the cublas library. I continue to work on it. I'll update here when I learn more.
Did you use the single-threaded versions of both libraries? As far as I understand, both GotoBLAS and Atlas tend to sneakily use multiple threads when working on large matrices.
That said, at large matrix sizes the algorithm used tends to matter much more than the low-level implementation. Naive matrix multiplication is O(N^3), whereas the Strassen algorithm scales much better, at about O(N^2.81). The Strassen algorithm also happens to vectorize very nicely with the larger SSE and AVX registers, yielding almost a 2- to 8-fold increase in efficiency, depending on floating-point format and register size.
I am not sure how well the two GPUs you mentioned handle double-precision math. Typically they're optimized for single precision (32-bit floats), dropping to a third or a quarter of that speed when handling doubles.
There are other factors in your tests that may skew the results. For example, you may be including the matrix transfer time between CPU and GPU. Whether that matches real-world use cases, I don't know; I don't have an Nvidia GPU to test, but I suspect not. Usually there are multiple operations, and the matrix does not need to be transferred between operations.
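For what it's worth, with the cublas_v2 interface you can keep library start-up and transfers out of the timed region along these lines (a rough sketch; dA, dB and dC are assumed to be already allocated and filled on the device, and cublasCreate has been called once beforehand):

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Time only the n x n double-precision multiply, excluding init and transfers.
void timed_dgemm(cublasHandle_t handle, int n,
                 const double *dA, const double *dB, double *dC) {
    const double alpha = 1.0, beta = 0.0;

    cudaDeviceSynchronize();   // make sure all prior work (copies etc.) is done
    // start timer here
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();   // the call returns before the GPU has finished
    // stop timer here
}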
I've been writing my own low-level SSE3 matrix functions using the SSE/AVX vector built-ins provided by the GCC and ICC C99 compilers; early testing indicates they beat the current Fortran implementations by a wide margin, especially at very small sizes (say up to 8x8, optimized for each size) and very large sizes (above 1000x1000, using the Strassen algorithm) for dense matrices.
What's my best bet for computing the dot product of a vector x with a large number of vectors y_i, where x and y_i are of length 10k or so?
Shove the y's in a matrix and use an optimized s/dgemv routine?
Or maybe try handcoding an SSE2 solution (I don't have SSE3, according to cpuinfo).
I'm just looking for general guidance here, so any suggestions will be useful.
And yes, I do need the performance.
Thanks for any light.
I think GPUs are specifically designed to perform operations like this quickly (among others), so you could probably make use of the DirectX or OpenGL libraries to perform the vector operations (e.g. D3DXVec2Dot). This will also save you CPU time.
Alternatives for optimised BLAS routines:
If you use Intel compilers, you may have access to Intel MKL.
For other compilers ATLAS usually provides nice performance numbers
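Whichever BLAS you pick, the call for your use case is the same: stack the y_i as rows of a matrix and do a single gemv. A sketch using the standard CBLAS interface (array names are placeholders):

#include <cblas.h>

// results[i] = dot(x, y_i), with the y_i stored as the rows of the
// row-major m-by-n matrix Y.
void all_dots(const float *Y, const float *x, float *results, int m, int n) {
    cblas_sgemv(CblasRowMajor, CblasNoTrans, m, n,
                1.0f,          // alpha
                Y, n,          // matrix and its leading dimension
                x, 1,          // vector x with unit stride
                0.0f,          // beta = 0: overwrite results
                results, 1);
}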
Handcoding an SSE2 solution is not very difficult and will bring a nice speedup over a pure C routine. How much it gains over a BLAS routine is something you will have to measure yourself.
The greatest speedup comes from structuring the data in a format that lets you exploit data parallelism and alignment.
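For the hand-coded route, a minimal sketch of a single dot product with SSE intrinsics (assumes 16-byte-aligned arrays and a length that is a multiple of 4, to keep it short):

#include <emmintrin.h>   // SSE2 (pulls in the SSE float intrinsics)

float dot_sse(const float *x, const float *y, int n) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 xv = _mm_load_ps(x + i);
        __m128 yv = _mm_load_ps(y + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(xv, yv));   // four partial sums
    }
    // Horizontal sum of the four partial sums, without SSE3's haddps.
    __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(acc, shuf);
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

Most of the extra speed over this naive version comes from keeping x in registers while streaming many y_i against it, and from accumulating into more than one register to hide latency.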
I use GotoBLAS. It has high-performance kernel routines, and in my experience it is many times faster than MKL and the reference BLAS.
The following provides BLAS level 1 (vector operations) routines using SSE.
http://www.applied-mathematics.net/miniSSEL1BLAS/miniSSEL1BLAS.html
If you have an nVidia graphics card you can get cuBLAS which will perform the operation on the graphics card.
http://developer.nvidia.com/cublas
For ATI (AMD) graphics cards
http://developer.amd.com/libraries/appmathlibs/pages/default.aspx