Matlab Array Ops Optimization - arrays

In a MATLAB program I call a function many times (over 3 million) that converts from local coordinates to global coordinates in an image; it is just a simple transformation. My whole code takes 6 minutes to run, and the coordinate conversion function takes 20% of that time.
How can I optimize this code?
function LMP_glb = do(pnt_val,LMP,NP_glb)
NP_co = ones(1,3)*round(pnt_val+1);
LMP_glb = [NP_glb(1:3) + LMP(1:3) - NP_co(1:3)]; %basic operations
end
Note: this function is called from several other functions in my code (not in a single endless for loop).
Thank you.

You can make this a little bit simpler:
function LMP_glb = do(pnt_val,LMP,NP_glb)
LMP_glb = NP_glb(1:3) + LMP(1:3) - round(pnt_val+1);
end
as MATLAB will handle expanding the last part of the expression from a scalar to an array for you.
I can't see much of a way to optimize it beyond that, as it isn't doing much. If you were to inline this (i.e. put the expression directly in your code rather than calling out to it as a function), you would additionally save the time taken in the overhead of a function call - if there are 3 million calls, that could be expensive.
@CitizenInsane's suggestion is also a good one: if it's possible to precompute this function for all your coordinates at once, and then use them later, it might be possible to vectorize the operation for additional speedup.
Lastly, it would be possible (if you're comfortable with a little C), to implement this function in C, compile it, and call it as a MEX function. This may or may not be faster - you'd need to try it out.

I guess your do function is called inside some kind of for loop. If the loop iterations have no data dependences between them, you can simply change the loop to parfor, and MATLAB will start a parallel pool to take advantage of a multicore processor, i.e. parallel computing. Even if the data is interleaved across iterations, you can often remove the dependence with a trick such as declaring an additional matrix or array to hold another copy of the data structure, trading memory usage for speed.
As mentioned by Sam, you can rewrite the function in C/C++, compile it with MATLAB's mex command, and call the compiled MEX function directly. In my experience, a C/C++ MEX function is faster than MATLAB's built-in operations for simple arithmetic.
Parfor guide:
http://www.mathworks.com/help/distcomp/getting-started-with-parfor.html
MEX example:
http://www.mathworks.com/help/matlab/matlab_external/standalone-example.html

Related

In c: do internal states improve speed?

Let's say I have a function with two parameters that is called repeatedly. Does it increase the memory usage when you have functions with arguments?
Would it be faster to generate a function for each repetitive case, and call that function with no parameters?
I believe this is sometimes referred to as 'internal state', but my question is which of the two options will perform faster?
EDIT:
Your answers are all enlightening, allow me to clarify all at once.
It seems logical that
x = x + 10
would be faster than:
x = x + y
And I'm not talking about the time it takes to define and initialize y, I am just talking about the operation itself. Logically, in the second case there must be some extra step in which the CPU must find y before performing the operation. When you amplify this with functions and then multiply it over and over, I would assume this would make a significant difference.
And yes, in my case this applies to physics, and the speed will likely be felt.
PS I am very interested in compiler functionality and debating learning assembler.
Parameters are typically passed on the stack so they don't take up more memory.
Parameters may be "un-noticeably" slower because the values may be copied to the stack (depends on how good the compiler is at optimizing).
The compiler is way smarter than you are, so don't try to outsmart the compiler. Write clear code and let the compiler worry about performance.
re: your edit
"it depends"
Does your processor have a different instruction to add 10 to a variable?
What sort of addressing modes does it support?
Regardless of the answers to the above, does the compiler make use of all the processor's features that might squeeze out every drop of performance?
e.g. - The good old 68000 chips had an "INC" opcode to increment a register by 1. It was much faster than other methods. If you were hand rolling assembly the fastest way to do x = x + 10 might have been to call INC 10 times...
I've worked with time constrained real time embedded apps and never had to worry about this level of optimization. I'd write the code and worry about performance if/when it becomes an issue.
If the repetitive call is made with compile-time parameters, then you can indeed improve performance by "instantiating" a special version of the same function for the given set of compile-time parameters. In such cases the function will not even have a "state": the parameter values will essentially be embedded into the function code. Some compilers can do it implicitly.
The amount of improvement will depend on the nature of the function. In each given version of the function the entire blocks of code might be easily recognized as unreachable and eliminated entirely. One can also say that function inlining by nature involves the same kind of optimization.
Obviously, using such optimizations thoughtlessly might easily lead to a combinatorial explosion of the number of different versions of the same function (i.e. to code bloat), so it should be used with care.
(BTW, this is very similar to what C++ function templates with non-type template parameters do.)
If the repetitive call is made with run-time parameters, then pre-saving them in a run-time state might not achieve any significant improvement. Retrieving parameter values from some "state" is not necessarily more efficient than retrieving them from the "regular" function parameters.
Of course, there are such classic techniques as packing multiple function parameters into a struct object and passing such struct object to the function (instead of passing a large number of independent parameters). If the parameters remain unchanged between multiple calls, then this does improve overall performance by saving time on parameter preparation. But whether to call such struct object a "state" or not is a different question. It is definitely a manual technique, not something done by the compiler and involving any "internal state".
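The struct-packing technique described above can be sketched as follows. This is only an illustration under assumed names: `StepParams`, `advance`, and `advance_all` are hypothetical and not taken from the question, and the formula is a stand-in for whatever physics update is actually being done.

```c
#include <stddef.h>

/* Hypothetical parameter bundle: values that stay fixed across many calls. */
typedef struct {
    double dt;       /* time step */
    double gravity;  /* constant acceleration */
    double drag;     /* linear drag coefficient */
} StepParams;

/* Advance one velocity value; all parameters arrive through a single pointer. */
double advance(double v, const StepParams *p)
{
    return v + p->dt * (p->gravity - p->drag * v);
}

/* Apply the same parameters to n values; the caller prepares the struct once. */
void advance_all(double *v, size_t n, const StepParams *p)
{
    for (size_t i = 0; i < n; ++i)
        v[i] = advance(v[i], p);
}
```

The struct is filled in once and reused, so each call passes one pointer instead of three separate values. Whether that is measurably faster depends on the compiler and the calling convention, so it is worth profiling before relying on it.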
Does it increase the memory usage when you have functions with arguments?
No, function arguments are passed on the stack (or in registers if x64 calling convention).
Would it be faster to generate a function for each repetitive case, and call that function with no parameters?
No, your compiler should optimize it for you, there's no need to make your code less readable

C and MPI: function works differently with same data

I have successfully written a complicated function with the PETSc library (it's an MPI-based scientific library for solving huge linear systems in parallel). This library provides its own "malloc" version and basic datatypes (i.e. "PetscInt" as standard "int"). For this function, I've always been using PETSc stuff instead of standard stuff such as "malloc" and "int". The function has been extensively tested and always worked fine. Despite the use of MPI, the function is fully serial, and all processors perform it on the same data (each processor has its own copy): no communication involved at all.
Then, I decided to not use PETSc and write a standard MPI version. Basically, I rewrote all the code, substituting PETSc stuff with classic C stuff, not by brute force but paying attention to each substitution (no "Replace" tool of any editor, I mean! All done by hand). During the substitution, a few minor changes were made, such as declaring two different variables a and b instead of declaring a[2]. These are the substitutions:
PetscMalloc -> malloc
PetscScalar -> double
PetscInt -> int
PetscBool -> created an enum structure to replicate it, as C doesn't have boolean datatype.
Basically, the algorithms have not been changed during the substitution process. The main function is a "for" loop (actually 4 nested loops). At each iteration, it calls another function. Let's call it Disfunction. Well, Disfunction works perfectly outside the 4 nested loops (as I tested it separately), but inside them it works in some cases and not in others. Also, I checked the data passed to Disfunction at each iteration: with EXACTLY the same input, Disfunction performs different computations between one iteration and another.
Also, computed data doesn't seem to be Undefined Behaviour, as Disfunction always gives back the same results with different runs of the program.
I've noticed that changing the number of processors for "mpiexec" gives different computational results.
That's my problem. A few other considerations: the program uses "malloc" extensively; computed data is the same for all processes, correct or not; Valgrind doesn't detect errors (apart from detecting an error with normal use of printf, which is another problem and off-topic); Disfunction recursively calls two other functions (extensively tested in the PETSc version as well); the algorithms involved are mathematically correct; Disfunction depends on an integer parameter p>0: for p=1,2,3,4,5 it works PERFECTLY, for p>=6 it does not.
If asked, I can post the code but it's long and complicated (scientifically, not informatically) and I think it requires time to be explained.
My idea is that I mess up with memory allocations, but I can't understand where.
Sorry for my English and for the bad formatting.
Well, I don't know if anyone is still interested, but the problem was that the PETSc function PetscMalloc zero-initializes the data, unlike standard C malloc. Stupid mistake... – user3029623
The only suggestion I can offer without reference to the code itself is to try to construct progressively simpler test cases that demonstrate your issue.
When you narrow down the iterative process to a single point in your data set or a single step (by eliminating some loops), does the error still occur? If not, that might suggest their bounds are wrong.
Does the erroneous output always occur on particular loop indices, especially the first or last? Perhaps there are some ghost or halo values you're missing or some boundary condition that you're not properly accounting for.
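The root cause reported in the comment above (code that relied on zeroed memory being ported to plain malloc) can be illustrated with a minimal sketch. The helper names `alloc_zeroed` and `sum` are hypothetical; the only claim about the standard library is that calloc zero-fills the block while malloc leaves its contents indeterminate.

```c
#include <stdlib.h>

/* Allocate n doubles that start out as 0.0.
   calloc fills the block with zero bytes, which represents 0.0 on
   IEEE 754 platforms; malloc gives no such guarantee. */
double *alloc_zeroed(size_t n)
{
    return calloc(n, sizeof(double));
}

/* Sum the buffer: only well-defined if every element was initialized. */
double sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}
```

Reading a malloc'ed buffer before writing to it can give results that look deterministic from run to run yet change with the allocation pattern, which would be consistent with the symptoms described in the question.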

Any way to vectorize in C

My question may seem primitive or dumb because I've just switched to C.
I have been working with MATLAB for several years and I've learned that any computation should be vectorized in MATLAB and I should avoid any for loop to get an acceptable performance.
It seems that if I want to add two vectors, or multiply matrices, or do any other matrix computation, I should use a for loop.
It is appreciated if you let me know whether or not there is any way to do the computations in a vectorized sense, e.g. reading all elements of a vector using only one command and adding those elements to another vector using one command.
Thanks
MATLAB suggests that you avoid for loops because most of the operations available on vectors and matrices are already implemented in its API and ready to be used. They are probably optimized, and they work directly on the underlying data instead of working at the MATLAB language level: a sort of opaque implementation, I guess.
Even MATLAB uses for loops underneath to implement most of its magic (or delegates them to highly specialized assembly instructions or through CUDA to the GPU).
What you are asking is not directly possible: you will need loops to work on vectors and matrices. What you would actually look for is a library that lets you do most of the work without writing the for loops yourself, by calling predefined functions that wrap them.
As was mentioned, it is not possible to hide the for loops. However, I doubt that the code MATLAB produces is in any way faster than the code produced by a C compiler. If you compile your C code with -O3, the compiler will try to use every hardware feature your computer has available, such as SIMD extensions and multiple issue. Moreover, if your code is good, doesn't cause too many pipeline stalls, and uses the cache well, it will be really fast.
But I think what you are looking for are libraries; search for LAPACK or BLAS, they might be what you need.
In C there is no way to perform operations in a vectorized way. You can use structures and functions to abstract away the details of operations but in the end you will always be using fors to process your data.
As for speed C is a compiled language and you will not get a performance hit from using for loops in C. C has the benefit (compared to MATLAB) that it does not hide anything from you, so you can always see where your time is being used. On the downside you will notice that things that MATLAB makes trivial (svd,cholesky,inv,cond,imread,etc) are challenging in C.
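As a minimal sketch of what "adding two vectors" looks like in C, here is a plain loop wrapped in a function. The name `vec_add` is hypothetical. The `restrict` qualifiers (C99) are an assumption that the three arrays never overlap; that promise tends to make it easier for an optimizing compiler to generate SIMD code for the loop.

```c
#include <stddef.h>

/* Element-wise c = a + b over n elements.
   `restrict` promises the arrays do not overlap, which helps the
   compiler auto-vectorize the loop at higher optimization levels. */
void vec_add(size_t n, const double *restrict a,
             const double *restrict b, double *restrict c)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

The loop is still there, but it is written once and called like a single command, which is essentially what a library such as BLAS does for you.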

Nested for loops extremely slow in MATLAB (preallocated)

I am trying to learn MATLAB and one of the first problems I encountered was to guess the background from an image sequence with a static camera and moving objects. For a start I just want to do a mean or median on pixels over time, so it's just a single function I would like to apply to one of the rows of the 4 dimensional array.
I have loaded my RGB images in a 4 dimensional array with the following dimensions:
uint8 [ num_images, width, height, RGB ]
Here is the function I wrote which includes 4 nested loops. I use preallocation but still, it is extremely slow. In C++ I believe this function could run at least 10x-20x faster, and I think on CUDA it could actually run in real time. In MATLAB it takes about 20 seconds with the 4 nested loops. My stack is 100 images with 640x480x3 dimensions.
function background = calc_background(stack)
tic;
si = size(stack,1);
sy = size(stack,2);
sx = size(stack,3);
sc = size(stack,4);
background = zeros(sy,sx,sc);
A = zeros(si,1);
for x = 1:sx
for y = 1:sy
for c = 1:sc
for i = 1:si
A(i) = stack(i,y,x,c);
end
background(y,x,c) = median(A);
end
end
end
background = uint8(background);
disp(toc);
end
Could you tell me how to make this code much faster? I have tried experimenting with somehow getting the data directly from the array using only the indexes and it seems MUCH faster. It completes in 3 seconds vs. 20 seconds, so that’s a 7x performance difference, just by writing a smaller function.
function background = calc_background2(stack)
tic;
% bad code, confusing
% background = uint8(squeeze(median(stack(:, 1:size(stack,2), 1:size(stack,3), 1:3 ))));
% good code (credits: Laurent)
background = uint8(squeeze(median(stack,1)));
disp(toc);
end
So now I don't understand if MATLAB could be this fast then why is the nested loop version so slow? I am not making any dynamic resizing and MATLAB must be running the same 4 nested loops inside.
Why is this happening?
Is there any way to make nested loops run fast, like it would happen naturally in C++?
Or should I get used to the idea of programming MATLAB in this crazy one-line-statement style to get optimal performance?
Update
Thank you for all the great answers, now I understand a lot more. My original code with stack(:, 1:size(stack,2), 1:size(stack,3), 1:3 ) didn't make any sense; it is exactly the same as stack. I was just lucky that median's default is to work along the 1st dimension.
I think it's better to ask how to write efficient code in a separate question, so I asked it here:
How to write vectorized functions in MATLAB
If I understand your question, you're asking why Matlab is faster for matrix operations than for procedural programming calls. The answer is simply that that's how it's designed. If you really want to know what makes it that way, you can read this newsletter from Matlab's website which discusses some of the underlying technology, but you probably won't get a great answer, as the software is proprietary. I also found some relevant pages by simply googling, and this old SO question also seems to address your question.
Matlab is an interpreted language, meaning that it must evaluate each line of code of your script.
Evaluating is a lengthy process since it must parse, 'compile' and interpret each line*.
Using for loops with simple operations means that matlab takes far more time parsing/compiling than actually executing your code.
Builtin functions, on the other hand are coded in a compiled language and heavily optimized. They're very fast, hence the speed difference.
Bottom line: we're very used to procedural language and for loops, but there's almost always a nice and fast way to do the same things in a vectorized way.
* To be complete and to pay honour to whom honour is due: recent versions of Matlab actually try to accelerate loops by analyzing repeated operations and compiling chunks of repetitive code into native executable code. This is called Just In Time compilation (JIT) and was pointed out by Jonas in the following comments.
Original answer:
If I understood well (and you want the median of the first dimension) you might try:
background = uint8(squeeze(median(stack,1)));
Well, the difference between the two is their method of executing code. To sketch it very roughly: in C you feed your code to a compiler which will try to optimize your code, or at any rate convert it to machine code. This takes some time, but when you actually execute your program, it is in machine code already and therefore executes very fast. Your compiler can take a lot of time trying to optimize the code for you; in general you don't care whether it takes 1 minute or 10 minutes to compile a distribution-ready program.
MATLAB (and other interpreted languages) don't generally work that way. When you execute your program, an interpreter will interpret each line of code and transform it into a sequence of machine code on the fly. This is a bit slower if you write for-loops, as it has to interpret the code over and over again (at least in principle; there are other overheads which might matter more for the newest versions of MATLAB). Here the hurdle is the fact that everything has to be done at runtime: the interpreter can perform some optimizations, but it is not useful to perform time-consuming optimizations that might increase performance by a lot in some cases, as they will cause performance to suffer in most other cases.
You might ask what you gain by using MATLAB? You gain flexibility and clear semantics. When you want to do a matrix multiplication, you just write it as such; in C this would yield a double for loop. You have to worry very little about data types, memory management, ...
Behind the scenes, MATLAB uses compiled code (Fortran/C/C++ if I'm not mistaken) to perform large operations: so a matrix multiplication is really performed by a piece of machine code which was compiled from another language. For smaller operations, this is the case as well, but you won't notice the speed of these calculations as most of your time is spent in management code (passing variables, allocating memory, ...).
To sum it all up: yes you should get used to such compact statements. If you see a line of code like Laurent's example, you immediately see that it computes a median of stack. Your code requires 11 lines of code to express the same, so when you are looking at code like yours (which might be embedded in hundreds of lines of other code), you will have a harder time understanding what is happening and pinpointing where a certain operation is performed.
To argue even further: you shouldn't program in MATLAB in the same way as you'd program in C/C++; nor should you do the other way round. Each language has its stronger and weaker points, learn to know them and use each language for what it's made for. E.g. you could write a whole compiler or webserver in MATLAB but in general that will be really slow as MATLAB was not intended to handle or concatenate strings (it can, but it might be very slow).

Why is C slow with function calls?

I am new here so apologies if I did the post in a wrong way.
I was wondering if someone could please explain why C is so slow with function calls?
It's easy to give a shallow answer to the standard question about Recursive Fibonacci, but I would appreciate knowing the "deeper" reason, as deep as possible.
Thanks.
Edit1 : Sorry for that mistake. I misunderstood an article in Wiki.
When you make a function call, your program has to put several registers on the stack, maybe push some more stuff, and mess with the stack pointer. That's about all for what can be "slow". Which is, actually, pretty fast. About 10 machine instructions on an x86_64 platform.
It's slow if your code is sparse and your functions are very small. This is the case with the Fibonacci function. However, you have to distinguish between "slow calls" and a "slow algorithm": calculating the Fibonacci sequence with a recursive implementation is pretty much the slowest straightforward way of doing it. There is almost as much code involved in the function body as in the function prologue and epilogue (where the pushing and popping take place).
There are cases in which calling functions will actually make your code faster overall. When you deal with large functions and your registers are crowded, the compiler may have a rough time deciding in which register to store data. However, isolating code inside a function call will simplify the compiler's task of deciding which register to use.
So, no, C calls are not slow.
Based on the additional information you posted in the comment, it seems that what is confusing you is this sentence:
"In languages (such as C and Java) that favor iterative looping constructs, there is usually significant time and space cost associated with recursive programs, due to the overhead required to manage the stack and the relative slowness of function calls;"
in the context of a recursive implementation of Fibonacci calculations.
What this is saying is that making recursive function calls is slower than looping but this does not mean that function calls are slow in general or that function calls in C are slower than function calls in other languages.
Fibonacci generation is naturally a recursive algorithm, and so the most obvious and natural implementation involves many function calls, but it can also be expressed as an iteration (a loop) instead.
The Fibonacci computation in particular can be written in a form that has a special property called tail recursion (the naive version with two recursive calls is not itself tail-recursive, but a version that carries the running pair of values as arguments is). A tail-recursive function can be easily and automatically converted into an iteration, even if it is expressed as a recursive function. Some languages, particularly functional languages where recursion is very common and iteration is rare, guarantee that they will recognize this pattern and automatically transform such a recursion into an iteration "under the hood". Some optimizing C compilers will do this as well, but it is not guaranteed. In C, since iteration is both common and idiomatic, and since the tail-recursion optimization is not necessarily going to be made for you by the compiler, it is a better idea to write it explicitly as an iteration to achieve the best performance.
So interpreting this quote as a comment on the speed of C function calls, relative to other languages, is comparing apples to oranges. The other languages in question are those that can take certain patterns of function calls (which happen to occur in Fibonacci number generation) and automatically transform them into something that is faster, but is faster because it is actually not a function call at all.
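To make the tail-recursion point concrete, here is a minimal sketch of a Fibonacci function whose recursive call is the last thing it does. The names `fib_acc` and `fib_tail` are hypothetical, and the sketch assumes the convention fib(0) = 0, fib(1) = 1. Whether a given C compiler turns the call into a loop depends on the compiler and the optimization level; the C standard does not require it.

```c
/* Tail-recursive helper: a and b carry two consecutive Fibonacci numbers. */
static unsigned long fib_acc(unsigned n, unsigned long a, unsigned long b)
{
    if (n == 0)
        return a;
    return fib_acc(n - 1, b, a + b);  /* the call is the function's last action */
}

/* fib_tail(0) = 0, fib_tail(1) = 1, fib_tail(2) = 1, ... */
unsigned long fib_tail(unsigned n)
{
    return fib_acc(n, 0, 1);
}
```

Because nothing remains to be done after the inner call returns, the current stack frame can in principle be reused, which is exactly the transformation into a loop described above.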
C is not slow with function calls.
The overhead of calling a C function is extremely low.
I challenge you to provide evidence to support your assertion.
There are a couple of reasons C can be slower than some other languages for a job like computing Fibonacci numbers recursively. Neither really has anything to do with slow function calls though.
In quite a few functional languages (and languages where a more or less functional style is common), recursion (often very deep recursion) is quite common. To keep speed reasonable, many implementations of such languages do a fair amount of work optimizing recursive calls to (among other things) turn them into iteration when possible.
Quite a few also "memoize" results from previous calls -- i.e., they keep track of the results from a function for a number of values that have been passed recently. When/if the same value is passed again, they can simply return the appropriate value without re-calculating it.
It should be noted, however, that the optimization here isn't really faster function calls -- it's avoiding (often many) function calls.
The Recursive Fibonacci is the reason, not C-language. Recursive Fibonacci is something like
int f(int i)
{
return i < 2 ? 1 : f(i-1) + f(i-2);
}
This is the slowest algorithm for calculating a Fibonacci number, and every pending call also has to be kept on the stack, which makes it slower still.
I'm not sure what you mean by "a shallow answer to the standard question about Recursive Fibonacci".
The problem with the naive recursive implementation is not that the function calls are slow, but that you make an exponentially large number of calls. By caching the results (memoization) you can reduce the number of calls, allowing the algorithm to run in linear time.
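A minimal sketch of the memoization idea, assuming the convention fib(0) = 0 and fib(1) = 1. The name `fib_memo` and the limit `FIB_MAX` are hypothetical choices for this illustration; the static cache also makes the function non-reentrant, which a real implementation might want to avoid.

```c
#define FIB_MAX 90  /* fib(90) still fits in a 64-bit unsigned integer */

unsigned long long fib_memo(unsigned n)
{
    /* Static storage starts zeroed; 0 is used here to mean "not computed yet". */
    static unsigned long long cache[FIB_MAX + 1];

    if (n < 2)
        return n;
    if (n > FIB_MAX)
        return 0;  /* out of range for this sketch */
    if (cache[n] == 0)
        cache[n] = fib_memo(n - 1) + fib_memo(n - 2);
    return cache[n];
}
```

Each value is computed at most once, so the number of calls grows linearly with n instead of exponentially.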
Of all the languages out there, C is probably the fastest (unless you are an assembly language programmer). Most C function calls are 100% pure stack operations. This means that when you call a function, what it translates to in your binary code is that the CPU pushes any parameters you pass to your function onto the stack. Afterwards, it calls the function. The function then pops your parameters. After that, it executes whatever code makes up your function. Finally, any return values are pushed onto the stack, then the function ends and the values are popped off. Stack operations on any CPU are usually faster than anything else.
If you are using a profiler or something that is saying a function call you are making is slow, then it HAS to be the code inside your function. Try posting your code here and we will see what is going on.
I'm not sure what you mean. C is basically one abstraction layer on top of CPU assembly instructions, which is pretty fast.
You should clarify your question really.
In some languages, mostly of the functional paradigm, function calls made at the end of a function body can be optimized so that the same stack frame is re-used. This can potentially save both time and space. The benefit is particularly significant when the function is both short and recursive, so that the stack overhead might otherwise dwarf the actual work being done.
The naive Fibonacci algorithm will therefore run much faster with such optimization available. C does not generally perform this optimization, so its performance could suffer.
BUT, as has been stated already, the naive algorithm for the Fibonacci numbers is horrendously inefficient in the first place. A more efficient algorithm will run much faster, whether in C or another language. Other Fibonacci algorithms probably will not see nearly the same benefit from the optimization in question.
So in a nutshell, there are certain optimizations that C does not generally support that could result in significant performance gains in certain situations, but for the most part, in those situations, you could realize equivalent or greater performance gains by using a slightly different algorithm.
I agree with Mark Byers, since you mentioned the recursive Fibonacci. Try adding a printf, so that a message is printed each time you do an addition. You will see that the recursive Fibonacci is doing a lot more additions than it may appear at first glance.
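A variation on that experiment, sketched with a counter instead of a printf so the total is easy to read off. The names `calls`, `naive_fib`, and `count_calls` are hypothetical, and the sketch assumes the convention fib(0) = 0, fib(1) = 1.

```c
static unsigned long calls;  /* number of times naive_fib was entered */

unsigned long naive_fib(unsigned n)
{
    ++calls;
    return n < 2 ? n : naive_fib(n - 1) + naive_fib(n - 2);
}

/* Returns how many function calls it takes to compute naive_fib(n). */
unsigned long count_calls(unsigned n)
{
    calls = 0;
    naive_fib(n);
    return calls;
}
```

With this definition, computing the 10th number already takes 177 calls and the 20th takes 21891, which shows how quickly the work grows.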
What the article is talking about is the difference between recursion and iteration.
This is under the topic called algorithm analysis in computer science.
Suppose I write the fibonacci function and it looks something like this:
// finds the nth Fibonacci number (n >= 1)
int rec_fib(int n) {
    if (n == 1)
        return 1;
    else if (n == 2)
        return 1;
    else
        return rec_fib(n - 1) + rec_fib(n - 2);
}
Which, if you write it out on paper (I recommend this), you will see this pyramid-looking shape emerge.
It's taking A Whole Lotta Calls to get the job done.
However, there is another way to write fibonacci (there are several others too)
int fib(int n) //this one taken from scriptol.com, since it takes more thought to write it out.
{
int first = 0, second = 1;
int tmp;
while (n--)
{
tmp = first+second;
first = second;
second = tmp;
}
return first;
}
This one only takes a length of time that is directly proportional to n, instead of the big pyramid shape you saw earlier that grew out in two dimensions.
With algorithm analysis you can determine exactly the speed of growth in terms of run-time vs. size of n of these two functions.
Also, some recursive algorithms are fast (or can be tricked into being faster). It depends on the algorithm, which is why algorithm analysis is important and useful.
Does that make sense?

Resources