Nested for loops extremely slow in MATLAB (preallocated)

I am trying to learn MATLAB and one of the first problems I encountered was to guess the background from an image sequence with a static camera and moving objects. For a start I just want to do a mean or median on pixels over time, so it's just a single function I would like to apply to one of the rows of the 4 dimensional array.
I have loaded my RGB images in a 4 dimensional array with the following dimensions:
uint8 [ num_images, width, height, RGB ]
Here is the function I wrote which includes 4 nested loops. I use preallocation but still, it is extremely slow. In C++ I believe this function could run at least 10x-20x faster, and I think on CUDA it could actually run in real time. In MATLAB it takes about 20 seconds with the 4 nested loops. My stack is 100 images with 640x480x3 dimensions.
function background = calc_background(stack)
    tic;
    si = size(stack,1);   % number of images
    sy = size(stack,2);
    sx = size(stack,3);
    sc = size(stack,4);   % color channels
    background = zeros(sy,sx,sc);
    A = zeros(si,1);
    for x = 1:sx
        for y = 1:sy
            for c = 1:sc
                for i = 1:si
                    A(i) = stack(i,y,x,c);
                end
                background(y,x,c) = median(A);
            end
        end
    end
    background = uint8(background);
    disp(toc);
end
Could you tell me how to make this code much faster? I have tried experimenting with somehow getting the data directly from the array using only the indexes and it seems MUCH faster. It completes in 3 seconds vs. 20 seconds, so that’s a 7x performance difference, just by writing a smaller function.
function background = calc_background2(stack)
    tic;
    % bad code, confusing
    % background = uint8(squeeze(median(stack(:, 1:size(stack,2), 1:size(stack,3), 1:3 ))));
    % good code (credits: Laurent)
    background = uint8(squeeze(median(stack,1)));
    disp(toc);
end
So now I don't understand if MATLAB could be this fast then why is the nested loop version so slow? I am not making any dynamic resizing and MATLAB must be running the same 4 nested loops inside.
Why is this happening?
Is there any way to make nested loops run fast, like it would happen naturally in C++?
Or should I get used to the idea of programming MATLAB in this crazy one line statements way to get optimal performance?
Update
Thank you for all the great answers, now I understand a lot more. My original code with stack(:, 1:size(stack,2), 1:size(stack,3), 1:3 ) didn't make any sense: it is exactly the same as stack. I was just lucky that median, by default, works along the first non-singleton dimension, which here is the 1st.
I think it's better to ask how to write efficient vectorized code in a separate question, so I asked it here:
How to write vectorized functions in MATLAB

If I understand your question, you're asking why Matlab is faster for matrix operations than for procedural programming calls. The answer is simply that that's how it's designed. If you really want to know what makes it that way, you can read this newsletter from Matlab's website which discusses some of the underlying technology, but you probably won't get a great answer, as the software is proprietary. I also found some relevant pages by simply googling, and this old SO question also seems to address your question.

Matlab is an interpreted language, meaning that it must evaluate each line of code of your script.
Evaluating is a lengthy process since it must parse, 'compile' and interpret each line*.
Using for loops with simple operations means that matlab takes far more time parsing/compiling than actually executing your code.
Builtin functions, on the other hand are coded in a compiled language and heavily optimized. They're very fast, hence the speed difference.
Bottom line: we're very used to procedural language and for loops, but there's almost always a nice and fast way to do the same things in a vectorized way.
* To be complete and to give credit where it is due: recent versions of Matlab actually try to accelerate loops by analyzing repeated operations and compiling chunks of repetitive code into native executable code. This is called Just In Time compilation (JIT) and was pointed out by Jonas in the following comments.
Original answer:
If I understood well (and you want the median of the first dimension) you might try:
background = uint8(squeeze(median(stack,1)));
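One way to check that the one-liner really computes the same thing as the four nested loops is to run both on a small synthetic stack. This is only a sketch: the sizes and variable names are made up, the [num_images, height, width, RGB] layout is taken from the question, and an odd number of images is used so that the median is always one of the stored values and no rounding differences can appear.

```matlab
% Small synthetic stack: 5 images, 4x6 pixels, 3 channels (illustrative sizes).
stack = uint8(randi([0 255], 5, 4, 6, 3));

% Loop version, following the structure of the question's code.
[si, sy, sx, sc] = size(stack);
bg_loop = zeros(sy, sx, sc);
A = zeros(si, 1);
for x = 1:sx
    for y = 1:sy
        for c = 1:sc
            for i = 1:si
                A(i) = stack(i, y, x, c);
            end
            bg_loop(y, x, c) = median(A);
        end
    end
end
bg_loop = uint8(bg_loop);

% Vectorized version: median along dimension 1, then drop the singleton dimension.
bg_vec = uint8(squeeze(median(stack, 1)));

% Compare the two results element by element.
disp(isequal(bg_loop, bg_vec));
```

Passing the dimension explicitly (the second argument to median) avoids relying on the default choice of dimension.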

Well, the difference between both is their method of executing code. To sketch it very roughly: in C you feed your code to a compiler which will try to optimize your code or at any rate convert it to machine code. This takes some time, but when you actually execute your program, it is in machine code already and therefore executes very fast. Your compiler can take a lot of time trying to optimize the code for you; in general you don't care whether it takes 1 minute or 10 minutes to compile a distribution-ready program.
MATLAB (and other interpreted languages) don't generally work that way. When you execute your program, an interpreter will interpret each line of code and transform it into a sequence of machine code on the fly. This is a bit slower if you write for-loops as it has to interpret the code over and over again (at least in principle; there are other overheads which might matter more for the newest versions of MATLAB). Here the hurdle is the fact that everything has to be done at runtime: the interpreter can perform some optimizations, but it is not useful to perform time-consuming optimizations that might increase performance by a lot in some cases, as they will cause performance to suffer in most other cases.
You might ask what you gain by using MATLAB? You gain flexibility and clear semantics. When you want to do a matrix multiplication, you just write it as such; in C this would yield a double for loop. You have to worry very little about data types, memory management, ...
Behind the scenes, MATLAB uses compiled code (Fortran/C/C++ if I'm not mistaken) to perform large operations: so a matrix multiplication is really performed by a piece of machine code which was compiled from another language. For smaller operations, this is the case as well, but you won't notice the speed of these calculations as most of your time is spent in management code (passing variables, allocating memory, ...).
To sum it all up: yes you should get used to such compact statements. If you see a line of code like Laurent's example, you immediately see that it computes a median of stack. Your code requires 11 lines of code to express the same, so when you are looking at code like yours (which might be embedded in hundreds of lines of other code), you will have a harder time understanding what is happening and pinpointing where a certain operation is performed.
To argue even further: you shouldn't program in MATLAB in the same way as you'd program in C/C++; nor should you do the other way round. Each language has its stronger and weaker points, learn to know them and use each language for what it's made for. E.g. you could write a whole compiler or webserver in MATLAB but in general that will be really slow as MATLAB was not intended to handle or concatenate strings (it can, but it might be very slow).

Related

Matlab Array Ops Optimization

In a MATLAB program I call a function many times (over 3 million) that converts from local coordinates to global coordinates in an image, just a simple transformation. My whole code takes 6 minutes to run, and the coordinate conversion function takes 20% of that time.
How can I optimize this code?
function LMP_glb = do(pnt_val,LMP,NP_glb)
    NP_co = ones(1,3)*round(pnt_val+1);
    LMP_glb = [NP_glb(1:3) + LMP(1:3) - NP_co(1:3)]; %basic operations
end
Note: this function is called from several other functions in my code (not in a single endless for loop).
Thank you.
You can make this a little bit simpler:
function LMP_glb = do(pnt_val,LMP,NP_glb)
    LMP_glb = NP_glb(1:3) + LMP(1:3) - round(pnt_val+1);
end
as MATLAB will handle expanding the last part of the expression from a scalar to an array for you.
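To see the scalar expansion at work, the two forms can be compared directly. The input values below are hypothetical and chosen only for illustration:

```matlab
% Hypothetical sample inputs (not from the original question).
pnt_val = 2.4;
LMP     = [10 20 30];
NP_glb  = [100 200 300];

% Original form: build a 1x3 vector explicitly before subtracting.
NP_co = ones(1,3) * round(pnt_val + 1);
a = NP_glb(1:3) + LMP(1:3) - NP_co(1:3);

% Simplified form: the scalar is expanded across the 1x3 vector automatically.
b = NP_glb(1:3) + LMP(1:3) - round(pnt_val + 1);

% Both forms should agree exactly.
disp(isequal(a, b));
```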
I can't see much of a way to optimize it beyond that, as it isn't doing much. If you were to inline this (i.e. put the expression directly in your code rather than calling out to it as a function), you would additionally save the time taken in the overhead of a function call - if there are 3 million calls, that could be expensive.
@CitizenInsane's suggestion is also a good one: if it's possible to precompute this function for all your coordinates at once, and then use them later, it might be possible to vectorize the operation for additional speedup.
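If the inputs for many points can be collected up front, the precomputation idea could look roughly like the sketch below. The storage layout (one point per row, with pnt_val as an N-by-1 column) is an assumption, since the question does not say how the data is organized; bsxfun is used so the subtraction also works on releases without implicit expansion.

```matlab
% Hypothetical batch of N points, one per row (this layout is an assumption).
N = 1000;
pnt_val = 10 * rand(N, 1);
LMP     = randi(50, N, 3);
NP_glb  = randi(500, N, 3);

% One evaluation per point, mimicking the original call pattern.
out_loop = zeros(N, 3);
for k = 1:N
    out_loop(k, :) = NP_glb(k, 1:3) + LMP(k, 1:3) - round(pnt_val(k) + 1);
end

% All points at once: bsxfun subtracts the N-by-1 column from each of the 3 columns.
out_vec = bsxfun(@minus, NP_glb + LMP, round(pnt_val + 1));

% The batch result should match the per-point results.
disp(isequal(out_loop, out_vec));
```

Whether this helps in practice depends on whether the callers can be restructured to hand over their points in bulk.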
Lastly, it would be possible (if you're comfortable with a little C), to implement this function in C, compile it, and call it as a MEX function. This may or may not be faster - you'd need to try it out.
I guess your do function is called inside some kind of for loop. If your for loop doesn't have data inter-dependencies, you can simply change it to parfor, and Matlab will start a parallel pool to take advantage of a multicore processor, i.e. parallel computing. Even if your data is interlaced inside the for loop, you can use some tricks to get rid of the dependency (declare an additional matrix or array to hold another copy of the data structure, trading memory usage for speed).
As mentioned by Sam, you can rewrite the function in C/C++, compile it with Matlab's mex command, and call the compiled MEX function directly. From my experience, a C/C++ MEX function is faster than Matlab's built-in operations for simple arithmetic.
Parfor guide:
http://www.mathworks.com/help/distcomp/getting-started-with-parfor.html
MEX example:
http://www.mathworks.com/help/matlab/matlab_external/standalone-example.html
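As a rough illustration of the parfor suggestion, here is a loop whose iterations are independent, written with parfor and checked against a plain for loop. The per-iteration work is a made-up placeholder. An actual speedup requires the Parallel Computing Toolbox, and for very cheap iterations the overhead of starting and feeding the pool can outweigh the gain.

```matlab
% Hypothetical per-iteration work; iterations must not depend on each other.
N = 200;
results = zeros(N, 1);
parfor k = 1:N
    results(k) = sum(sqrt(1:k));   % each iteration writes only its own slot
end

% Same computation with a plain for loop, for comparison.
expected = zeros(N, 1);
for k = 1:N
    expected(k) = sum(sqrt(1:k));
end

% Largest difference between the two results.
disp(max(abs(results - expected)));
```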

How come the mex code is running more slowly than the matlab code

I use matlab to write a program with many iterations. It cannot be vectorized since the data processing in each iteration is related to that in the previous iteration.
Then I converted the matlab code to mex using the built-in MATLAB Coder, and the resulting speed is even lower. I don't know whether I need to write the mex code myself, since the generated mex code doesn't seem to help.
I'd suggest that if you can, you get in touch with MathWorks to ask them for some advice. If you're not able to do that, then I would suggest really reading through the documentation and trying everything you find before giving up.
I've found that a few small changes to the way one implements the MATLAB code, and a few small changes to the project settings (such as disabling responsiveness to Ctrl-C, or extrinsic calls back to MATLAB) can give a speed difference of an order of magnitude or more in the generated code. There are not many people outside MathWorks who would be able to give good advice on exactly what changes might be worthwhile/sensible for you.
I should say that I've only used MATLAB Coder on one project, and I'm not at all an expert (actually not even a competent) C programmer. Nevertheless I've managed to produce C code that was about 10-15 times as fast as the original MATLAB code when mexed. I achieved that by a) just fiddling with all the different settings to see what happened and b) methodically going through the documentation, and seeing if there were places in my MATLAB code where I could apply any of the constructs I came across (such as coder.nullcopy, coder.unroll etc). Of course, your code may differ substantially.

Any way to vectorize in C

My question may seem primitive or dumb because I've just switched to C.
I have been working with MATLAB for several years and I've learned that any computation should be vectorized in MATLAB and I should avoid any for loop to get an acceptable performance.
It seems that if I want to add two vectors, or multiply matrices, or do any other matrix computation, I should use a for loop.
It is appreciated if you let me know whether or not there is any way to do the computations in a vectorized sense, e.g. reading all elements of a vector using only one command and adding those elements to another vector using one command.
Thanks
MATLAB suggests that you avoid for loops because most of the operations available on vectors and matrices are already implemented in its API and ready to be used. They are probably optimized and they work directly on the underlying data instead of working at the MATLAB language level, a sort of opaque implementation I guess.
Even MATLAB uses for loops underneath to implement most of its magic (or delegates them to highly specialized assembly instructions or through CUDA to the GPU).
What you are asking is not directly possible, you will need to use loops to work on vectors and matrices, actually you would search for a library which allows you to do most of the work without directly using a for loop but by using functions already defined that wraps them.
As was mentioned, it is not possible to hide the for loops. However, I doubt that the code MATLAB produces is in any way faster than the one produced by C. If you compile your C code with the -O3 flag, the compiler will try to use every hardware feature your computer has available, such as SIMD extensions and multiple issue. Moreover, if your code is good, doesn't cause too many pipeline stalls, and makes good use of the cache, it will be really fast.
But I think what you are looking for are some libraries; search Google for LAPACK or BLAS, they might be what you are looking for.
In C there is no way to perform operations in a vectorized way. You can use structures and functions to abstract away the details of operations but in the end you will always be using fors to process your data.
As for speed C is a compiled language and you will not get a performance hit from using for loops in C. C has the benefit (compared to MATLAB) that it does not hide anything from you, so you can always see where your time is being used. On the downside you will notice that things that MATLAB makes trivial (svd,cholesky,inv,cond,imread,etc) are challenging in C.

How much faster is C than R in practice?

I wrote a Gibbs sampler in R and decided to port it to C to see whether it would be faster. A lot of pages I have looked at claim that C will be up to 50 times faster, but every time I have used it, it's only about five or six times faster than R. My question is: is this to be expected, or are there tricks which I am not using which would make my C code significantly faster than this (like how using vectorization speeds up code in R)? I basically took the code and rewrote it in C, replacing matrix operations with for loops and making all the variables pointers.
Also, does anyone know of good resources for C from the point of view of an R programmer? There's an excellent book called The Art of R Programming by Matloff, but it seems to be written from the perspective of someone who already knows C.
Also, the screen tends to freeze when my C code is running in the standard R GUI for Windows. It doesn't crash; it unfreezes once the code has finished running, but it stops me from doing anything else in the GUI. Does anybody know how I could avoid this? I am calling the function using .C()
Many of the existing posts have explicit examples you can run, for example Darren Wilkinson has several posts on his blog analyzing this in different languages, and later even on different hardware (eg comparing his high-end laptop to his netbook and to a Raspberry Pi). Some of his posts are
the initial (then revised) post
another later post
and there are many more on his site -- these often compare C, Java, Python and more.
Now, I also turned this into a version using Rcpp -- see this blog post. We also used the same example in a comparison between Julia, Python and R/C++ at useR this summer so you should find plenty other examples and references. MCMC is widely used, and "easy pickings" for speedups.
Given these examples, allow me to add that I disagree with the two earlier comments your question received. The speed will not be the same, it is easy to do better in an example such as this, and your C/C++ skills will mostly determine how much better.
Finally, an often overlooked aspect is that the speed of the RNG matters a lot. Running down loops and adding things up is cheap -- doing "good" draws is not, and a lot of inter-system variation comes from that too.
About the GUI freezing, you might want to call R_CheckUserInterrupt and perhaps R_ProcessEvents every now and then.
I would say C, done properly, is much faster than R.
Some easy gains you could try:
Set the compiler to optimize for more speed.
Compiling with the -march flag.
Also if you're using VS, make sure you're compiling with release options, not debug.
Your observed performance difference will depend on a number of things: the type of operations that you are doing, how you write the C code, what type of compiler-level optimizations you use, your target CPU architecture, etc etc.
You can write basic, sloppy C and get something that works and runs with decent efficiency. You can also fine-tune your code for the unique characteristics of your target CPU - perhaps invoking specialized assembly instructions - and squeeze every last drop of performance that you can out of the code. You could even write code that runs significantly slower than the R version. C gives you a lot of flexibility. The limiting factor here is how much time that you want to put into writing and optimizing the C code.
The reverse is also true (duplicate the previous paragraph here, but swap "C" and "R").
I'm not trying to sound facetious, but there's really not a straightforward answer to your question. The only way to tell how much faster your C version would be is to write the code both ways and benchmark them.

Why don't I see a significant speed-up when using the MATLAB compiler?

I have a lot of nice MATLAB code that runs too slowly and would be a pain to write over in C. The MATLAB compiler for C does not seem to help much, if at all. Should it be speeding execution up more? Am I screwed?
If you are using the MATLAB compiler (on a recent version of MATLAB) then you will almost certainly not see any speedups at all. This is because all the compiler actually does is give you a way of packaging up your code so that it can be distributed to people who don't have MATLAB. It doesn't convert it to anything faster (such as machine code or C) - it merely wraps it in C so you can call it.
It does this by getting your code to run on the MATLAB Compiler Runtime (MCR) which is essentially the MATLAB computational kernel - your code is still being interpreted. Thanks to the penalty incurred by having to invoke the MCR you may find that compiled code runs more slowly than if you simply ran it on MATLAB.
Put another way - you might say that the compiler doesn't actually compile - in the traditional sense of the word at least.
Older versions of the compiler worked differently and speedups could occur in certain situations. For MathWorks' take on this go to
http://www.mathworks.com/support/solutions/data/1-1ARNS.html
In my experience slow MATLAB code usually comes from not vectorizing your code (i.e., writing for-loops instead of just multiplying arrays (simple example)).
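As a simple illustration of that point, the sketch below multiplies two vectors element by element, once with an explicit loop and once as a single array operation, and times both with tic/toc. The array size is arbitrary, and the measured times will vary by machine and MATLAB version, so no particular ratio should be expected.

```matlab
n = 1e5;
a = rand(n, 1);
b = rand(n, 1);

% Explicit loop over the elements (output preallocated).
c_loop = zeros(n, 1);
tic;
for k = 1:n
    c_loop(k) = a(k) * b(k);
end
t_loop = toc;

% Vectorized element-wise product in a single statement.
tic;
c_vec = a .* b;
t_vec = toc;

fprintf('loop: %.4f s, vectorized: %.4f s\n', t_loop, t_vec);
disp(isequal(c_loop, c_vec));
```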
If you are doing file I/O look out for reading data in one piece at a time. Look in the help files for the vectorized version of fscanf.
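For the file I/O case, the difference is between calling fscanf once per value and calling it once for the whole file. The sketch below writes a small temporary file first so that it is self-contained; the file contents are made up for illustration.

```matlab
% Write a small temporary file of numbers, one per line.
fname = tempname();
fid = fopen(fname, 'w');
fprintf(fid, '%g\n', 1:1000);
fclose(fid);

% Slow pattern: read one value per fscanf call.
fid = fopen(fname, 'r');
vals_loop = zeros(1000, 1);
for k = 1:1000
    vals_loop(k) = fscanf(fid, '%f', 1);
end
fclose(fid);

% Faster pattern: fscanf reapplies the format until the end of the file.
fid = fopen(fname, 'r');
vals_all = fscanf(fid, '%f');
fclose(fid);

% Clean up and compare the two results.
delete(fname);
disp(isequal(vals_loop, vals_all));
```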
Don't forget that MATLAB includes a profiler, too!
I'll echo what dwj said: if your MATLAB code is slow, this is probably because it is not sufficiently vectorized. If you're doing explicit loops when you could be doing operations on whole arrays, that's the culprit.
This applies equally to all array-oriented dynamic languages: Perl Data Language, Numeric Python, MATLAB/Octave, etc. It's even true to some extent in compiled C and FORTRAN code: specially-designed vectorization libraries generally use carefully hand-coded inner loops and SIMD instructions (e.g. MMX, SSE, AltiVec).
First, I second all the above comments about profiling and vectorizing.
For a historical perspective...
Older versions of Matlab allowed the user to convert m files to mex functions by pre-parsing the m code and converting it to a set of matlab library calls. These calls have all the error checking that the interpreter did, but old versions of the interpreter and/or online parser were slow, so compiling the m file would sometimes help. Usually it helped when you had loops because Matlab was smart enough to inline some of that in C. If you have one of those versions of Matlab, you can try telling the mex script to save the .c file and you can see exactly what it's doing.
In more recent versions (probably 2006a and later, but I don't remember), Mathworks started using a just-in-time compiler for the interpreter. In effect, this JIT compiler automatically compiles all mex functions, so explicitly doing it offline doesn't help at all. In each version since then, they've also put a lot of effort into making the interpreter much faster. I believe that newer versions of Matlab don't even let you automatically compile m files to mex files because it doesn't make sense any more.
The MATLAB compiler wraps up your m-code and dispatches it to a MATLAB runtime. So, the performance you see in MATLAB should be the performance you see with the compiler.
Per the other answers, vectorizing your code is helpful. But the MATLAB JIT is pretty good these days and lots of things perform roughly as well vectorized or not. That's not to say there aren't performance benefits to be gained from vectorization; it's just not the magic bullet it once was. The only way to really tell is to use the profiler to find out where your code is seeing bottlenecks. Often there are some places where you can do local refactoring to really improve the performance of your code.
There are a couple of other hardware approaches you can take on performance. First, much of the linear algebra subsystem is multithreaded. You may want to make sure you have enabled that in your preferences if you are working on a multi-core or multi-processor platform. Second, you may be able to use the parallel computing toolbox to take more advantage of multiple processors. Finally, if you are a Simulink user, you may be able to use emlmex to compile m-code into c. This is particularly effective for fixed point work.
Have you tried profiling your code? You don't need to vectorize ALL your code, just the functions that dominate running time. The MATLAB profiler will give you some hints on where your code is spending the most time.
There are many other things you should read up on in the Tips For Improving Performance section of the MathWorks manual.
mcc won't speed up your code at all--it's not really a compiler.
Before you give up, you need to run the profiler and figure out where all your time is going (Tools->Open Profiler). Also, judicious use of "tic" and "toc" can help. Don't optimize your code until you know where the time is going (don't try to guess).
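A minimal sketch of the tic/toc approach: time each stage of the work separately so that it is clear which one dominates. The workload here is a made-up placeholder, and the profiler commands are shown only as comments because the report viewer is interactive.

```matlab
% Hypothetical workload split into two stages, each timed separately.
n = 300;

tic;
M = rand(n);          % stage 1: set up the data
t_setup = toc;

tic;
P = M * M;            % stage 2: core operation (matrix multiply)
t_mult = toc;

fprintf('setup: %.4f s, multiply: %.4f s\n', t_setup, t_mult);

% For a per-line breakdown in MATLAB, wrap the code of interest like this:
%   profile on
%   ... code to measure ...
%   profile off
%   profile viewer
```

Single tic/toc measurements are noisy for very short stages, so repeating the stage a few times and looking at the typical value gives a more reliable picture.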
Keep in mind that in matlab:
bit-level operations are really slow
file I/O is slow
loops are generally slow, but vectorizing is fast (if you don't know the vector syntax, learn it)
core operations are really fast (e.g. matrix multiply, fft)
if you think you can do something faster in C/Fortran/etc, you can write a MEX file
there are commercial solutions to convert matlab to C (google "matlab to c") and they work
You could port your code to "Embedded Matlab" and then use the Realtime-Workshop to translate it to C.
Embedded Matlab is a subset of Matlab. It does not support Cell-Arrays, Graphics, Matrices of dynamic size, or some Matrix addressing modes. It may take considerable effort to port to Embedded Matlab.
Realtime-Workshop is at the core of the Code Generation Products. It spits out generic C, or can optimize for a range of embedded Platforms. Most interesting to you is perhaps the xPC-Target, which treats general purpose hardware as an embedded target.
I would vote for profiling + then look at what are the bottlenecks.
If the bottleneck is matrix math, you're probably not going to do any better... EXCEPT one big gotcha is array allocation. e.g. if you have a loop:
s = [];
for i = 1:50000
    s(i) = 3;
end
This has to keep resizing the array; it's much faster to presize the array (start with zeros or NaN) & fill it from there:
s = zeros(50000,1);
for i = 1:50000
    s(i) = 3;
end
If the bottleneck is repeated executions of a lot of function calls, that's a tough one.
If the bottleneck is stuff that MATLAB doesn't do quickly (certain types of parsing, XML, stuff like that) then I would use Java since MATLAB already runs on a JVM and it interfaces really easily to arbitrary JAR files. I looked at interfacing with C/C++ and it's REALLY ugly. Microsoft COM is ok (on Windows only) but after learning Java I don't think I'll ever go back to that.
As others have noted, slow Matlab code is often the result of insufficient vectorization.
However, sometimes even perfectly vectorized code is slow. Then, you have several more options:
See if there are any libraries / toolboxes you can use. These were usually written to be very optimized.
Profile your code, find the tight spots and rewrite those in plain C. Connecting C code (as DLLs for instance) to Matlab is easy and is covered in the documentation.
By Matlab compiler you probably mean the command mcc, which does speed the code up a little bit by circumventing the Matlab interpreter. What would speed up the Matlab code significantly (by a factor of 50-200) is the use of actual C code compiled by the mex command.

Resources