I'm looking to calculate highly parallelized trig functions (in blocks of 1024 or so), and I'd like to take advantage of at least some of the parallelism that modern architectures have.
When I compile a block like
for(int i=0; i<SIZE; i++) {
    arr[i] = sin((float)i/1024);
}
GCC won't vectorize it, and says
not vectorized: relevant stmt not supported: D.3068_39 = __builtin_sinf (D.3069_38);
Which makes sense to me. However, I'm wondering if there's a library to do parallel trig computations.
With just a simple Taylor series up to the 11th order, GCC will vectorize all the loops, and I'm getting speeds over twice as fast as a naive sin loop (with bit-exact answers; with a 9th-order series, only a single bit is off for the last two out of 1600 values, for a >3x speedup). I'm sure someone has encountered a problem like this before, but when I google, I find no mentions of any libraries or the like.
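For reference, here is a minimal sketch of the kind of loop I mean (a 9th-order truncated Taylor series in Horner form; no range reduction is done, so it is only accurate for small arguments, and the function name is just illustrative):

#include <stddef.h>

/* Truncated Taylor series for sin(x): x - x^3/6 + x^5/120 - x^7/5040 + x^9/362880.
   No range reduction, so this only illustrates the vectorizable structure. */
void sin_taylor(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float x  = in[i];
        float x2 = x * x;
        out[i] = x * (1.0f - x2 * (1.0f/6.0f - x2 * (1.0f/120.0f
                      - x2 * (1.0f/5040.0f - x2 * (1.0f/362880.0f)))));
    }
}

With -O3 (and -ffast-math), GCC should be willing to vectorize a loop of this shape, since it is only multiplies and adds.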
A. Is there something existing already?
B. If not, advice for optimizing parallel trig functions?
EDIT: I found the following library called "SLEEF": http://shibatch.sourceforge.net/ which is described in this paper and uses SIMD instructions to calculate several elementary functions. It uses SSE and AVX specific code, but I don't think it will be hard to turn it into standard C loops.
Since you said you were using GCC, it looks like there are some options:
http://gruntthepeon.free.fr/ssemath/
This uses SSE and SSE2 instructions to implement vectorized math functions.
http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php
This has an alternate implementation. Some of the comments are pretty good.
That said, I'd probably look into GPGPU for a solution. Maybe write it in CUDA or OpenCL (if I remember correctly, CUDA supports the sine function). Here are some libraries that look like they might make it easier.
https://code.google.com/p/slmath/
https://code.google.com/p/thrust/
Since you are looking to calculate harmonics here, I have some code that addressed a similar problem. It is vectorized already and faster than anything else I have found. As a side benefit, you get the cosine for free.
What platform are you using? Many libraries of this sort already exist:
Intel provides the Vector Math Library (VML) with icc.
Apple provides the vForce library as part of the Accelerate framework.
HP provides their own Vector Math Library for Itanium (and many other architectures, too).
Sun provided libmvec with their compiler tools.
...
Instead of the Taylor series, I would look at the algorithms fdlibm uses. They should get you as much precision with fewer steps.
My answer was to create my own library to do exactly this called vectrig: https://github.com/jeremysalwen/vectrig
The problem
I'm working on implementing and refining an optimization algorithm with some fairly large arrays (from tens of millions of floats and up), using mainly Intel MKL in C (not C++, at least not so far) to squeeze out every possible bit of performance. Now I've run into a silly problem: I have a parameter that sets maxima and minima for subsets of a set of coefficients (tens of millions of them). Actually applying these maxima and minima using MKL functions is easy - I can create equally-sized vectors with the limits for every element and use V?Fmax and V?Fmin to apply them. But I also need to account for this clipping in my error metric, which requires me to count the number of elements that fall outside these constraints.
However, I can't find an MKL function that allows me to do things like counting the number of elements that fulfill some condition, the way you can create and sum logical arrays with e.g. NumPy in Python or in MATLAB. Irritatingly, when I try to google this question, I only get answers relating to Python and R.
Obviously I can just write a loop that increments a counter for each element that fulfills one of the conditions, but if there is an already optimized implementation that allows me to achieve this, I would much prefer that just owing to the size of my arrays.
Does anyone know of a clever way to achieve this robustly and very efficiently using Intel MKL (maybe with the statistics toolbox or some creative use of elementary functions?), a similarly optimized library that does this, or a highly optimized way to hand-code this? I've been racking my brain trying to come up with some out-of-the-box method, but I'm coming up empty.
Note that it's necessary for me to be able to do this in C, that it's not viable for me to shift this task to my Python frontend, and that it is indeed necessary for me to code this particular subprogram in C in the first place.
Thanks!
If you were using C++, std::count_if from <algorithm> with an execution policy of std::execution::par_unseq may parallelize and vectorize the count. On Linux at least, it typically uses Intel TBB to do this.
It's not likely to be as easy in C. Because C doesn't have concepts like templates, callables or lambdas, the only way to specialize a generic (library-provided) count() function would be to pass a function pointer as a callback (like qsort() does). Unless the compiler manages to devirtualize and inline the callback, you can't vectorize at all, leaving you with (possibly thread-parallelized) scalar code. OTOH, if you use for example GCC vector intrinsics (my favourite!), you get vectorization but not parallelization. You could try to combine the approaches, but I'd say get over yourself and use C++.
However, if you only need vectorization, you can almost certainly just write sequential code and have the compiler autovectorize, unless the predicate for what should be counted is poorly written, or your compiler is braindamaged.
For example, GCC vectorizes the code on x86 if at least SSE4 instructions are available (-msse4). With AVX[2/512] (-mavx / -mavx2 / -mavx512f) you can get wider vectors to do more elements at once. In general, if you're compiling on the same hardware you will be running the program on, I'd recommend letting GCC autodetect the optimal instruction set extensions (-march=native).
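Since the original code isn't reproduced here, a minimal sketch of the kind of counting loop I mean (names are illustrative) might look like this:

#include <stddef.h>

/* Count elements of x that fall outside their per-element bounds.
   The bitwise | evaluates both comparisons unconditionally, keeping the
   loop a simple sum reduction that the compiler can vectorize. */
size_t count_clipped(const float *x, const float *lo, const float *hi, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (x[i] < lo[i]) | (x[i] > hi[i]);
    return count;
}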
Note that in the code above, the conditions use the bitwise or (|) rather than the short-circuiting or (||), because with || the read from the max vector is semantically forbidden whenever the comparison with the min vector was already true for the current element, which severely hinders vectorization (though AVX-512 could potentially vectorize it, with a somewhat catastrophic slowdown).
I'm pretty sure GCC is not nearly optimal in the code it generates for AVX-512, since it could do the OR directly in the k-regs (mask registers) with kor[b/w/d/q], but maybe somebody with more experience with AVX-512 (*cough* Peter Cordes *cough*) could weigh in on that.
MKL doesn't provide such functions, but you may want to check another performance library, IPP, which contains a set of threshold functions that could be useful in your case. Please refer to the IPP Developer Reference for more details: https://software.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top/volume-1-signal-and-data-processing/essential-functions/conversion-functions/threshold.html
I would like to perform a non-linear optimization algorithm using C.
The problem is:
over the five points that are in vector X.
X, Y(X), lower and upper bounds are known.
I have found the nlopt library for C, but I do not know if it is possible to perform the optimization over the five discrete points.
Anything to suggest, even another library?
Thanks!
I would suggest Octave. For nonlinear programming on Octave, refer to
Octave Optimization.
You could implement it using its MATLAB-like language.
It also has a C/C++ API.
See this post: How to embed the GNU Octave in C/C++ program?.
And also, this pdf
Consider optimizing your MATLAB code instead of reimplementing the algorithm in another language; MATLAB can be pretty fast if optimized properly (avoid for loops, use vectorized computations, pre-allocate memory).
Take a look at http://www.mathworks.com/company/newsletters/news_notes/june07/patterns.html
My question may seem primitive or dumb because, I've just switched to C.
I have been working with MATLAB for several years and I've learned that any computation should be vectorized in MATLAB and I should avoid any for loop to get an acceptable performance.
It seems that in C, if I want to add two vectors, multiply matrices, or do any other matrix computation, I have to use a for loop.
I would appreciate it if you could let me know whether there is any way to do these computations in a vectorized sense, e.g. reading all elements of a vector with only one command and adding them to another vector with one command.
Thanks
MATLAB suggests you avoid for loops because most of the operations available on vectors and matrices are already implemented in its API and ready to be used. They are probably optimized, and they work directly on the underlying data instead of at the MATLAB language level; a sort of opaque implementation, I guess.
Even MATLAB uses for loops underneath to implement most of its magic (or delegates them to highly specialized assembly instructions or through CUDA to the GPU).
What you are asking is not directly possible: you will need to use loops to work on vectors and matrices. In practice you would look for a library that lets you do most of the work without writing the for loop yourself, by calling already-defined functions that wrap those loops.
As was mentioned, it is not possible to avoid the for loops. However, I doubt that the code MATLAB produces is in any way faster than the one produced by C. If you compile your C code with -O3, the compiler will try to use every hardware feature your computer has available, such as SIMD extensions and multiple issue. Moreover, if your code is good, doesn't cause too many pipeline stalls, and uses the cache well, it will be really fast.
But I think what you are looking for are libraries; search for BLAS or LAPACK, they might be what you need.
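For example, with a BLAS implementation installed, a vector update like y = y + 2*x becomes a single library call instead of a hand-written loop (a minimal sketch; link against your BLAS of choice, e.g. -lopenblas or -lblas):

#include <cblas.h>  /* CBLAS interface, shipped with OpenBLAS, ATLAS, MKL, ... */

int main(void)
{
    double x[4] = { 1.0,  2.0,  3.0,  4.0};
    double y[4] = {10.0, 20.0, 30.0, 40.0};

    /* y = 2.0*x + y, computed by the (typically SIMD-optimized) library */
    cblas_daxpy(4, 2.0, x, 1, y, 1);

    return 0;
}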
In C there is no way to perform operations in a vectorized way. You can use structures and functions to abstract away the details of the operations, but in the end you will always be using for loops to process your data.
As for speed, C is a compiled language and you will not get a performance hit from using for loops in C. C has the benefit (compared to MATLAB) that it does not hide anything from you, so you can always see where your time is being used. On the downside, you will notice that things that MATLAB makes trivial (svd, cholesky, inv, cond, imread, etc.) are challenging in C.
Has anyone taken advantage of the automatic vectorization that gcc can do? In the real world (as opposed to example code)? Does it take restructuring of existing code to take advantage? Are there a significant number of cases in any production code that can be vectorized this way?
I have yet to see either GCC or Intel C++ automatically vectorize anything but very simple loops, even when given the code of algorithms that can (and were, after I manually rewrote them using SSE intrinsics) be vectorized.
Part of this is being conservative - especially when faced with possible pointer aliasing, it can be very difficult for a C/C++ compiler to 'prove' to itself that a vectorization would be safe, even if you as the programmer know that it is. Most compilers (sensibly) prefer to not optimize code rather than risking miscompiling it. This is one area where higher level languages have a real advantage over C, at least in theory (I say in theory since I'm not actually aware of any automatically vectorizing ML or Haskell compilers).
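To illustrate the aliasing point: a loop like the one below is often only vectorized once you promise the compiler the arrays don't overlap (restrict); otherwise it has to either emit a runtime overlap check or stay scalar. A sketch:

/* With restrict, the compiler may assume dst, a and b don't overlap,
   which makes vectorizing the loop straightforward. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}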
Another part of it is simply analytical limitations - most research in vectorization, I understand, is related to optimizing classical numerical problems (fluid dynamics, say) which was the bread and butter of most vector machines before a few years ago (when, between CUDA/OpenCL, Altivec/SSE, and the STI Cell, vector programming in various forms became widely available in commercial systems).
It's fairly unlikely that code written with a scalar processor in mind will be easy for a compiler to vectorize. Happily, many things you can do to make it easier for a compiler to understand how to vectorize it, like loop tiling and partial loop unrolling, also (tend to) help performance on modern processors even if the compiler doesn't figure out how to vectorize it.
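For instance, partially unrolling a reduction into several independent accumulators is the kind of restructuring I mean; it exposes the parallelism to the compiler and also helps scalar execution by breaking the single dependency chain (a sketch, assuming n is a multiple of 4):

float sum4(const float *a, int n)
{
    /* Four independent partial sums; a tail loop would handle n % 4 != 0. */
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}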
It is hard to use in any business logic, but gives speed ups when you are processing volumes of data in the same way.
Good example is sound/video processing where you apply the same operation to every sample/pixel.
I have used VisualDSP for this, and you had to check the results after compiling to see whether vectorization was really used where it should be.
Vector instructions are not limited to Cell processors - most modern workstation-class CPUs have them (PPC, x86 since the Pentium 3, SPARC, etc.). When used well for floating-point operations, they can help quite a lot with very compute-intensive tasks (filters, etc.). In my experience, automatic vectorization does not work so well.
You may have noticed that pretty much no one actually knows how to make good use of GCC's automatic vectorization. If you search around the web for people's comments, it always comes down to the idea that GCC allows you to enable automatic vectorization but extremely rarely makes actual use of it, and so if you want to use SIMD acceleration (e.g. MMX, SSE, AVX, NEON, AltiVec), then you basically have to figure out how to write it using compiler intrinsics or assembly language code.
But the problem with intrinsics is that you effectively need to understand the assembly language side of it and then also learn the intrinsics way of describing what you want, which is likely to result in much less efficient code than if you wrote it in assembly code (such as by a factor of 10x), because the compiler is still going to have trouble making good use of your intrinsic instructions!
For example, you might be using SIMD Intrinsics so that many operations can be performed in parallel at the same time, but your compiler will probably generate Assembly code that transfers the data between the SIMD registers and the normal CPU registers and back, effectively making your SIMD code run at a similar speed (or even slower) than normal code!
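To give a flavour of what intrinsics look like, here is a minimal sketch that adds two float arrays four elements at a time with SSE (n assumed to be a multiple of 4; the function name is illustrative):

#include <xmmintrin.h>  /* SSE intrinsics */

void add_ps(float *dst, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats (unaligned load) */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&dst[i], _mm_add_ps(va, vb));  /* 4 adds at once */
    }
}

Whether something like this beats the compiler's own output depends heavily on how the surrounding code keeps data in the SIMD registers, which is exactly the problem described above.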
So basically:
If you want up to 100% speedups (2x speed), then either buy the official Intel/ARM compilers or convert some of your code to use SIMD C/C++ intrinsics.
If you want 1000% speedups (10x speed), then write it in assembly code using SIMD instructions by hand. Or, if available on your hardware, use GPU acceleration instead, such as OpenCL or Nvidia's CUDA SDK, since they can provide similar speedups on the GPU as SIMD does on the CPU.
I have a lot of nice MATLAB code that runs too slowly and would be a pain to write over in C. The MATLAB compiler for C does not seem to help much, if at all. Should it be speeding execution up more? Am I screwed?
If you are using the MATLAB compiler (on a recent version of MATLAB) then you will almost certainly not see any speedups at all. This is because all the compiler actually does is give you a way of packaging up your code so that it can be distributed to people who don't have MATLAB. It doesn't convert it to anything faster (such as machine code or C) - it merely wraps it in C so you can call it.
It does this by getting your code to run on the MATLAB Compiler Runtime (MCR) which is essentially the MATLAB computational kernel - your code is still being interpreted. Thanks to the penalty incurred by having to invoke the MCR you may find that compiled code runs more slowly than if you simply ran it on MATLAB.
Put another way - you might say that the compiler doesn't actually compile - in the traditional sense of the word at least.
Older versions of the compiler worked differently and speedups could occur in certain situations. For MathWorks' take on this, go to
http://www.mathworks.com/support/solutions/data/1-1ARNS.html
In my experience slow MATLAB code usually comes from not vectorizing your code (i.e., writing for-loops instead of just multiplying arrays (simple example)).
If you are doing file I/O, watch out for reading the data one piece at a time. Look in the help files for the vectorized version of fscanf.
Don't forget that MATLAB includes a profiler, too!
I'll echo what dwj said: if your MATLAB code is slow, this is probably because it is not sufficiently vectorized. If you're doing explicit loops when you could be doing operations on whole arrays, that's the culprit.
This applies equally to all array-oriented dynamic languages: Perl Data Language, Numeric Python, MATLAB/Octave, etc. It's even true to some extent in compiled C and Fortran code: specially designed vectorization libraries generally use carefully hand-coded inner loops and SIMD instructions (e.g. MMX, SSE, AltiVec).
First, I second all the above comments about profiling and vectorizing.
For a historical perspective...
Older versions of MATLAB allowed the user to convert m-files to MEX functions by pre-parsing the m-code and converting it to a set of MATLAB library calls. These calls have all the error checking that the interpreter did, but old versions of the interpreter and/or online parser were slow, so compiling the m-file would sometimes help. Usually it helped when you had loops, because MATLAB was smart enough to inline some of that in C. If you have one of those versions of MATLAB, you can try telling the mex script to save the .c file, and you can see exactly what it's doing.
In more recent versions (probably 2006a and later, but I don't remember), MathWorks started using a just-in-time compiler for the interpreter. In effect, this JIT does automatically what compiling m-files used to do, so doing it explicitly offline doesn't help at all. In each version since then, they've also put a lot of effort into making the interpreter much faster. I believe that newer versions of MATLAB don't even let you compile m-files to MEX files any more because it doesn't make sense.
The MATLAB compiler wraps up your m-code and dispatches it to a MATLAB runtime. So, the performance you see in MATLAB should be the performance you see with the compiler.
Per the other answers, vectorizing your code is helpful. But the MATLAB JIT is pretty good these days and lots of things perform roughly as well vectorized or not. That's not to say there aren't performance benefits to be gained from vectorization; it's just not the magic bullet it once was. The only way to really tell is to use the profiler to find out where your code is seeing bottlenecks. Often there are some places where you can do local refactoring to really improve the performance of your code.
There are a couple of other hardware approaches you can take on performance. First, much of the linear algebra subsystem is multithreaded. You may want to make sure you have enabled that in your preferences if you are working on a multi-core or multi-processor platform. Second, you may be able to use the Parallel Computing Toolbox to take more advantage of multiple processors. Finally, if you are a Simulink user, you may be able to use emlmex to compile m-code into C. This is particularly effective for fixed-point work.
Have you tried profiling your code? You don't need to vectorize ALL your code, just the functions that dominate running time. The MATLAB profiler will give you some hints on where your code is spending the most time.
There are many other things you can do; you should read up on the Tips for Improving Performance section in the MathWorks manual.
mcc won't speed up your code at all--it's not really a compiler.
Before you give up, you need to run the profiler and figure out where all your time is going (Tools->Open Profiler). Also, judicious use of "tic" and "toc" can help. Don't optimize your code until you know where the time is going (don't try to guess).
Keep in mind that in matlab:
bit-level operations are really slow
file I/O is slow
loops are generally slow, but vectorizing is fast (if you don't know the vector syntax, learn it)
core operations are really fast (e.g. matrix multiply, fft)
if you think you can do something faster in C/Fortran/etc, you can write a MEX file (see the sketch after this list)
there are commercial solutions to convert matlab to C (google "matlab to c") and they work
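For the MEX route mentioned above, here is a minimal sketch of a C MEX file that doubles a real double array (the file/function name timestwo is just illustrative; build with mex timestwo.c and call y = timestwo(x)):

#include "mex.h"

/* y = timestwo(x): returns 2*x for a real double array x. */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs != 1 || !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]))
        mexErrMsgTxt("Expected one real double array.");

    mwSize n = mxGetNumberOfElements(prhs[0]);
    double *in = mxGetPr(prhs[0]);

    plhs[0] = mxCreateDoubleMatrix(mxGetM(prhs[0]), mxGetN(prhs[0]), mxREAL);
    double *out = mxGetPr(plhs[0]);

    for (mwSize i = 0; i < n; i++)
        out[i] = 2.0 * in[i];
}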
You could port your code to "Embedded MATLAB" and then use Real-Time Workshop to translate it to C.
Embedded MATLAB is a subset of MATLAB. It does not support cell arrays, graphics, matrices of dynamic size, or some matrix addressing modes. It may take considerable effort to port to Embedded MATLAB.
Real-Time Workshop is at the core of the code generation products. It spits out generic C, or can optimize for a range of embedded platforms. Most interesting to you is perhaps the xPC Target, which treats general-purpose hardware as an embedded target.
I would vote for profiling + then look at what are the bottlenecks.
If the bottleneck is matrix math, you're probably not going to do any better... EXCEPT that one big gotcha is array allocation. For example, if you have a loop:
s = [];
for i = 1:50000
    s(i) = 3;
end
This has to keep resizing the array; it's much faster to pre-size the array (start with zeros or NaN) and fill it in from there:
s = zeros(50000,1);
for i = 1:50000
    s(i) = 3;
end
If the bottleneck is repeated executions of a lot of function calls, that's a tough one.
If the bottleneck is stuff that MATLAB doesn't do quickly (certain types of parsing, XML, stuff like that) then I would use Java since MATLAB already runs on a JVM and it interfaces really easily to arbitrary JAR files. I looked at interfacing with C/C++ and it's REALLY ugly. Microsoft COM is ok (on Windows only) but after learning Java I don't think I'll ever go back to that.
As others have noted, slow MATLAB code is often the result of insufficient vectorization.
However, sometimes even perfectly vectorized code is slow. Then, you have several more options:
See if there are any libraries / toolboxes you can use. These were usually written to be very optimized.
Profile your code, find the tight spots and rewrite those in plain C. Connecting C code (as DLLs for instance) to Matlab is easy and is covered in the documentation.
By MATLAB compiler you probably mean the command mcc, which does speed up the code a little bit by circumventing the MATLAB interpreter. What would speed up the MATLAB code significantly (by a factor of 50-200) is the use of actual C code compiled with the mex command.