I have a segmentation algorithm written in C/C++ that makes extensive use of C pointers to traverse linked lists of structures, which are calloc'ed at the beginning of the program.
This algorithm takes about 3 seconds to run on Ubuntu 14.04 with gcc 4.8.2. It also uses OpenCV 2.4.8.
The algorithm is meant to be embedded in an OpenFX library, so that the library can be added as a plug-in to software suites such as Natron.
When executed as a plug-in of a host on SUSE with gcc 4.3.2, the very same method with the same inputs takes 12 seconds to execute. I've been debugging and can't figure out why it takes so long when executed within OpenFX. My strongest guess is that OpenFX handles memory access differently, and that slows down the algorithm.
Could anyone give me any clue? Please let me know if you need further information.
Link to MIRACL crypto library by CertiVox
Following the instructions in fastgf2m.txt, I've been able to get everything to compile. However, when run, the benchmark program (bmark.exe) halts while evaluating curves over GF(2^m) with the error, "This is not a point on the curve!"
I am able to get everything to work without the optimization, but I'm unsure where the problem lies. I haven't modified any curve parameters and have followed the instructions in the distribution. I'm compiling on 64-bit Windows 8.1, on an Intel i7-3520M.
If anyone has any advice on how to correct this, it would be greatly appreciated.
Thanks!!
The method outlined in fastgf2m.txt is for generating unrolled code for a fixed value of m determined at compile time. The bmark program changes m at runtime, so the unrolled code will often not be correct in this case. The documentation could be clearer on this point.
Also make sure your processor actually supports the PCLMULQDQ instruction - many older processors do not.
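If in doubt, here is one quick standalone way to check from C using gcc's CPU-feature builtins (gcc 4.8 or later; this check is my own suggestion, not part of MIRACL itself):

#include <stdio.h>

int main(void) {
    __builtin_cpu_init();  /* initialize CPU feature detection (needed on older gcc) */
    if (__builtin_cpu_supports("pclmul"))
        puts("PCLMULQDQ is available");
    else
        puts("PCLMULQDQ is NOT available");
    return 0;
}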
It might be better to test the method on the ecsgen2/ecssign2/ecsver2 programs, which implement ECDSA over GF(2^283), for example.
I've never used MPI before nor taken a formal course on parallel programming. I'm an applied math student working on a large project that consists of a series of for loops. In each for loop, the iterations are completely independent of the others, so I'm pretty sure this can easily be parallelized.
I tried using OpenMP first, but I get an error that a file can't be found (I'm a Mac user), and I'm not really sure how to fix that.
So is this a simple task to do in MPI? Google for some reason comes up short on answers here.
If the iterations are independent, then OpenMP is the simplest way to go. On Mac OS X, you need to install gcc to compile with OpenMP, because the clang compiler does not (yet) support it. You can easily install a precompiled version of gcc from here:
http://hpc.sourceforge.net
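To give a flavor of how little code this takes, here is a minimal OpenMP sketch (the array and the per-iteration work are placeholders, not from your program); compile with gcc -fopenmp:

#include <stdio.h>
#include <omp.h>

#define N 100000

static double result[N];

int main(void) {
    /* The pragma hands iterations out to a thread pool; this is safe
       only because the iterations are independent of each other. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        result[i] = 0.5 * i;  /* stand-in for the real per-iteration work */

    printf("computed %d independent iterations\n", N);
    return 0;
}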
You can also use MPI, of course, but that is going to be much more difficult for you (installation aside), given that you are not trained. If you want to stick with MPI: parallelizing a loop of independent iterations requires computing, for each process, its initial and final iteration, managing data structures appropriately (remember that in MPI, memory is not shared among processes), and so on, as in the sketch below.
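To make that concrete, here is a rough MPI sketch of the same idea (the names and the per-iteration work are illustrative only): each rank computes its own [begin, end) slice, and partial results must be combined explicitly because nothing is shared:

#include <stdio.h>
#include <mpi.h>

#define N 100000

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Split the N iterations as evenly as possible across the ranks. */
    int begin = (int)((long long)N * rank / size);
    int end   = (int)((long long)N * (rank + 1) / size);

    double local_sum = 0.0;
    for (int i = begin; i < end; i++)
        local_sum += 0.5 * i;  /* stand-in for the real per-iteration work */

    /* Memory is not shared among processes, so combine partial results
       with an explicit reduction on rank 0. */
    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}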
If you need OpenMP on Mac OS X, you can try the unofficial clang with OpenMP support. The source code is available here: http://clang-omp.github.com. You also need an OpenMP runtime library, which is available here: http://openmp.llvm.org/.
I'm writing a C program that looks like this:
for (int i = 0; i < n; i++) {
    /* make a new file for i: i.txt */
    /* run a long and intensive computation */
    /* write to i.txt */
    /* close i.txt */
}
where n is some large number. Obviously, several iterations at once could be run in parallel, as they do not depend on each other. I have eight cores on my processor, so it seems that I would want my program to distribute the iterations on the eight cores, so it runs as quickly as possible. This means I would want the program to be multi-threaded.
My question is whether I would need to manually multi-thread this process in C, or whether multi-threading happens automatically at some level. I'm compiling with GCC, and I know GCC optimizes well, but I don't know whether it automatically multithreads. I also don't know whether my OS (Debian Linux) automatically distributes the program across all eight cores of my processor. If not, is there a way to get it to multi-thread without having to actually write multi-threading code in C?
You can multithread it manually if you want, or you can use libraries and tools that take over as much or as little of the responsibility as you like. OpenMP is definitely worth looking at.
The simple answer is no: GCC does no automatic multithreading in itself. Multithreading is defined in layers strictly above C, so GCC doing that would be kind of a layering violation.
That being said, there are lots of libraries that make the job easier (not that manually multithreading your particular case is a lot of work to begin with), and some of them integrate more tightly at the language level. OpenMP has been mentioned by others.
You may also consider the new additions to the C11 standard regarding multithreaded applications.
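For instance, here is a minimal sketch using C11's <threads.h> (assuming your toolchain ships it; glibc only added it in 2.28), with the per-iteration work left as a placeholder:

#include <stdio.h>
#include <threads.h>

#define N 1000
#define NTHREADS 8

typedef struct { int begin, end; } range;

static int worker(void *arg) {
    const range *r = arg;
    for (int i = r->begin; i < r->end; i++) {
        /* open i.txt, run the long computation, write, close */
    }
    return 0;
}

int main(void) {
    thrd_t tid[NTHREADS];
    range chunk[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        /* Give each thread a contiguous slice of the iterations. */
        chunk[t].begin = t * N / NTHREADS;
        chunk[t].end   = (t + 1) * N / NTHREADS;
        thrd_create(&tid[t], worker, &chunk[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        thrd_join(tid[t], NULL);  /* wait for every worker to finish */
    return 0;
}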
I'm at a loss to explain (and avoid) the differences in speed between a Matlab mex program and the corresponding C program with no Matlab interface. I've been profiling a numerical analysis program:
int main() {
    Well_optimized_code();
}
compiled with gcc 4.4 against the Matlab-Mex equivalent (directed to use gcc44, which is not the version currently supported by Matlab, but it's required for other reasons):
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    Well_optimized_code(); // literally the exact same code
}
I performed the timings as:
$ time ./C_version
vs.
>> tic; mex_version(); toc
The difference in timing is staggering. The version run from the command line takes 5.8 seconds on average. The version in Matlab runs in 21 seconds. For context, the mex file replaces an algorithm in the SimBiology toolbox that takes about 26 seconds to run.
Compared to Matlab's algorithm, both the C and mex versions scale linearly up to 27 threads using OpenMP calls, but for the purposes of profiling those calls have been disabled and commented out.
The two versions were compiled in the same way, except for the flags needed to build a mex file: -fPIC --shared -lmex -DMATLAB_MEX_FILE, applied during the mex compilation/linking. I've removed all references to the left- and right-hand arguments of the mex file; that is, it takes no inputs and gives no outputs. It is solely for profiling.
The Great and Glorious Google has informed me that the position independent code should not be the source of the slowdown and beyond that I'm at a loss.
Any help will be appreciated,
Andrew
After a month of emailing with my contacts at Mathworks, playing around with my own code, and profiling my code every which way, I have an answer; however, it may be the most dissatisfying answer I have ever had to a technical question:
The short version is "upgrade to Matlab version 2011a (officially released last week), this issue has now been resolved".
The longer version concerns the overhead associated with the mex gateway in versions 2010b and earlier. The best explanation I've been able to extract is that this overhead is not assessed once; rather, we pay a little bit every time a function calls another function that lives in a linked library.
While why this occurs baffles me, it is at least consistent with the Shark profiling that I did. When I profile and compare the native app and the mex app, there is a recurring pattern. The time spent in functions in the source code I wrote for the app does not change. The time spent in library functions increases a little between the native and mex implementations. Functions in another library, used to build this library, increase the difference a lot. The time difference continues to increase as we proceed ever deeper, until we reach the BLAS implementation.
A couple of heavily used BLAS functions were the main culprits. A function that took ~1% of my computation time in the native app was clocking in at 30% in the mex function.
The implementation of the mex gateway appears to have changed between 2010b and 2011a. On my MacBook the native app takes about 6 seconds and the mex version takes 6.5 seconds. That is overhead I can deal with.
As for the underlying cause, I can only speculate. Matlab has its roots in interpreted code. Since mex functions are dynamic libraries, I'm guessing that each mex library is unaware of what it is linked against until runtime. Since Matlab suggests users rarely use mex, and then only for small, computationally intensive chunks, I assume that large programs (such as an ODE solver) are rarely implemented this way. Programs like mine are the ones that suffer the most.
I've profiled a couple of Matlab functions that I know to be implemented in C and compiled with mex (especially sbiosimulate after calling sbioaccelerate on kinetic models, part of the SimBiology toolbox), and there appear to be some significant speed-ups. So the 2011a update appears to be more broadly beneficial than the usual semi-yearly upgrade.
Best of luck to other coders with similar issues. Thanks for all of the helpful advice that got me started in the right direction.
--Andrew
Recall that Matlab stores arrays in column-major order, and C/C++ in row-major order. Is it possible that your loop structure/algorithm iterates in a row-major fashion, resulting in poor memory access times in Matlab but fast access times in C/C++?
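To illustrate (a generic sketch, not the asker's code): when C code walks a Matlab-owned, column-major array, as it would inside a mex function, keeping the row index in the inner loop makes memory access sequential:

#include <stddef.h>

/* Scale a column-major rows-by-cols matrix, as Matlab stores it.
   The inner loop moves down a column, i.e., through adjacent addresses. */
void scale_col_major(double *a, size_t rows, size_t cols, double s) {
    for (size_t j = 0; j < cols; j++)       /* columns in the outer loop */
        for (size_t i = 0; i < rows; i++)   /* rows in the inner loop   */
            a[j * rows + i] *= s;
}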
I have a lot of nice MATLAB code that runs too slowly and would be a pain to write over in C. The MATLAB compiler for C does not seem to help much, if at all. Should it be speeding execution up more? Am I screwed?
If you are using the MATLAB compiler (on a recent version of MATLAB), then you will almost certainly not see any speedups at all. This is because all the compiler actually does is give you a way of packaging up your code so that it can be distributed to people who don't have MATLAB. It doesn't convert your code to anything faster (such as machine code or C); it merely wraps it in C so you can call it.
It does this by getting your code to run on the MATLAB Compiler Runtime (MCR), which is essentially the MATLAB computational kernel: your code is still being interpreted. Because of the penalty incurred by having to invoke the MCR, you may find that compiled code runs more slowly than if you simply ran it in MATLAB.
Put another way - you might say that the compiler doesn't actually compile - in the traditional sense of the word at least.
Older versions of the compiler worked differently, and speedups could occur in certain situations. For MathWorks' take on this, go to
http://www.mathworks.com/support/solutions/data/1-1ARNS.html
In my experience, slow MATLAB code usually comes from not vectorizing it, i.e., writing for-loops instead of just multiplying arrays (simple example).
If you are doing file I/O, look out for reading data one piece at a time. Look in the help files for the vectorized version of fscanf.
Don't forget that MATLAB includes a profiler, too!
I'll echo what dwj said: if your MATLAB code is slow, this is probably because it is not sufficiently vectorized. If you're doing explicit loops when you could be doing operations on whole arrays, that's the culprit.
This applies equally to all array-oriented dynamic languages: Perl Data Language, Numeric Python, MATLAB/Octave, etc. It's even true to some extent in compiled C and Fortran code: specially designed vectorization libraries generally use carefully hand-coded inner loops and SIMD instructions (e.g., MMX, SSE, AltiVec).
First, I second all the above comments about profiling and vectorizing.
For a historical perspective...
Older versions of Matlab allowed the user to convert m-files to mex functions by pre-parsing the m-code and converting it to a set of Matlab library calls. These calls keep all the error checking that the interpreter did, but old versions of the interpreter and/or online parser were slow, so compiling the m-file would sometimes help. Usually it helped when you had loops, because Matlab was smart enough to inline some of that in C. If you have one of those versions of Matlab, you can try telling the mex script to save the .c file, and you can see exactly what it's doing.
In more recent versions (probably 2006a and later, but I don't remember exactly), MathWorks started using a just-in-time compiler in the interpreter. In effect, the JIT compiles all m-code automatically, so explicitly doing it offline doesn't help at all. In each version since then, they've also put a lot of effort into making the interpreter much faster. I believe that newer versions of Matlab don't even let you automatically compile m-files to mex files, because it doesn't make sense any more.
The MATLAB compiler wraps up your m-code and dispatches it to a MATLAB runtime. So, the performance you see in MATLAB should be the performance you see with the compiler.
Per the other answers, vectorizing your code is helpful. But the MATLAB JIT is pretty good these days, and lots of things perform roughly as well vectorized or not. That's not to say there aren't performance benefits to be gained from vectorization; it's just not the magic bullet it once was. The only way to really tell is to use the profiler to find out where your code hits bottlenecks. Often there are places where you can do local refactoring to really improve the performance of your code.
There are a couple of other hardware approaches you can take to performance. First, much of the linear algebra subsystem is multithreaded. You may want to make sure you have enabled that in your preferences if you are working on a multi-core or multi-processor platform. Second, you may be able to use the Parallel Computing Toolbox to take more advantage of multiple processors. Finally, if you are a Simulink user, you may be able to use emlmex to compile m-code into C. This is particularly effective for fixed-point work.
Have you tried profiling your code? You don't need to vectorize ALL your code, just the functions that dominate running time. The MATLAB profiler will give you some hints on where your code is spending the most time.
There are many other things covered in the Tips for Improving Performance section of the MathWorks manual; you should read up on them.
mcc won't speed up your code at all--it's not really a compiler.
Before you give up, you need to run the profiler and figure out where all your time is going (Tools->Open Profiler). Also, judicious use of "tic" and "toc" can help. Don't optimize your code until you know where the time is going (don't try to guess).
Keep in mind that in MATLAB:
bit-level operations are really slow
file I/O is slow
loops are generally slow, but vectorizing is fast (if you don't know the vector syntax, learn it)
core operations are really fast (e.g. matrix multiply, fft)
if you think you can do something faster in C/Fortran/etc, you can write a MEX file
there are commercial solutions to convert matlab to C (google "matlab to c") and they work
You could port your code to "Embedded Matlab" and then use Real-Time Workshop to translate it to C.
Embedded Matlab is a subset of Matlab. It does not support cell arrays, graphics, matrices of dynamic size, or some matrix addressing modes. It may take considerable effort to port to Embedded Matlab.
Real-Time Workshop is at the core of the code generation products. It spits out generic C, or can optimize for a range of embedded platforms. Most interesting to you is perhaps the xPC Target, which treats general-purpose hardware as an embedded target.
I would vote for profiling, and then looking at what the bottlenecks are.
If the bottleneck is matrix math, you're probably not going to do any better... EXCEPT that one big gotcha is array allocation. For example, if you have a loop:
s = [];
for i = 1:50000
    s(i) = 3;
end
This has to keep resizing the array; it's much faster to preallocate the array (starting with zeros or NaN) and fill it in from there:
s = zeros(50000,1);
for i = 1:50000
    s(i) = 3;
end
If the bottleneck is repeated executions of a lot of function calls, that's a tough one.
If the bottleneck is stuff that MATLAB doesn't do quickly (certain types of parsing, XML, stuff like that), then I would use Java, since MATLAB already runs on a JVM and interfaces really easily with arbitrary JAR files. I looked at interfacing with C/C++, and it's REALLY ugly. Microsoft COM is OK (on Windows only), but after learning Java I don't think I'll ever go back to that.
As others have noted, slow Matlab code is often the result of insufficient vectorization.
However, sometimes even perfectly vectorized code is slow. Then, you have several more options:
See if there are any libraries/toolboxes you can use. These are usually written to be very optimized.
Profile your code, find the tight spots, and rewrite those in plain C. Connecting C code (as DLLs, for instance) to Matlab is easy and is covered in the documentation; a minimal sketch of such a mex rewrite follows this list.
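As a sketch of what such a rewrite can look like (the file name times2.c and the doubling loop are illustrative stand-ins for your real hot spot):

#include "mex.h"

/* Build from the Matlab prompt with: mex times2.c
   Usage: y = times2(x), where x is a double array. */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    if (nrhs != 1 || !mxIsDouble(prhs[0]))
        mexErrMsgTxt("Expected one double array input.");

    plhs[0] = mxDuplicateArray(prhs[0]);  /* copy the input to the output */
    double *out = mxGetPr(plhs[0]);
    size_t n = mxGetNumberOfElements(plhs[0]);

    for (size_t i = 0; i < n; i++)
        out[i] *= 2.0;  /* the ported hot loop would go here */
}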
By "Matlab compiler" you probably mean the command mcc, which does speed the code up a little by circumventing the Matlab interpreter. What would speed the Matlab code up significantly (by a factor of 50-200) is the use of actual C code compiled with the mex command.