Is the Julia language really as fast as it claims? - benchmarking

Following this post I decided to benchmark Julia against GNU Octave, and the results were inconsistent with the speed-ups shown on julialang.org.
I compiled both Julia and GNU Octave with CXXFLAGS='-std=c++11 -O3'; here are the results I got:
GNU Octave
a=0.9999;
tic;y=a.^(1:10000);toc
Elapsed time is 0.000159025 seconds.
tic;y=a.^(1:10000);toc
Elapsed time is 0.000162125 seconds.
tic;y=a.^(1:10000);toc
Elapsed time is 0.000159979 seconds.
--
tic;y=cumprod(ones(1,10000)*a);toc
Elapsed time is 0.000280142 seconds.
tic;y=cumprod(ones(1,10000)*a);toc
Elapsed time is 0.000280142 seconds.
tic;y=cumprod(ones(1,10000)*a);toc
Elapsed time is 0.000277996 seconds.
Julia
a = 0.9999;
tic();y=a.^(1:10000);toc()
elapsed time: 0.003486508 seconds
tic();y=a.^(1:10000);toc()
elapsed time: 0.003909662 seconds
tic();y=a.^(1:10000);toc()
elapsed time: 0.003465313 seconds
--
tic();y=cumprod(ones(1,10000)*a);toc()
elapsed time: 0.001692931 seconds
tic();y=cumprod(ones(1,10000)*a);toc()
elapsed time: 0.001690245 seconds
tic();y=cumprod(ones(1,10000)*a);toc()
elapsed time: 0.001689241 seconds
Could someone explain why Julia is slower than GNU Octave at these basic operations? Once warmed up, shouldn't it call LAPACK/BLAS without overhead?
EDIT:
As explained in the comments and answers, the code above is not a good benchmark, nor does it illustrate the benefits of using the language in a real application. I used to think of Julia as a faster "Octave/MATLAB", but it is much more than that: it is a huge step towards productive, high-performance scientific computing. By using Julia, I was able to 1) outperform software in my research field written in Fortran and C++, and 2) provide users with a much nicer API.

Vectorized operations like .^ are exactly the kind of thing that Octave is good at because they're actually entirely implemented in specialized C code. Somewhere in the code that is compiled when Octave is built, there is a C function that computes .^ for a double and an array of doubles – that's what you're really timing here, and it's fast because it's written in C. Julia's .^ operator, on the other hand, is written in Julia:
julia> a = 0.9999;
julia> @which a.^(1:10000)
.^(x::Number,r::Ranges{T}) at range.jl:327
That definition consists of this:
.^(x::Number, r::Ranges) = [ x^y for y=r ]
It uses a one-dimensional array comprehension to raise x to each value y in the range r, returning the result as a vector.
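For contrast, here is a rough sketch in C of the kind of specialized kernel Octave dispatches to for this operation – a compiled loop over pow(). This is purely illustrative, not Octave's actual source:

#include <math.h>
#include <stddef.h>

/* Hypothetical kernel for a .^ (1:n): one compiled loop over pow(). */
void scalar_pow_range(double x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = pow(x, (double)(i + 1));   /* x^1, x^2, ..., x^n */
}

The entire benchmark runs inside one such compiled routine, which is why Octave is fast here despite being an interpreted language.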
Edward Garson is quite right that one shouldn't use globals for optimal performance in Julia. The reason is that the compiler can't reason very well about the types of globals because they can change at any point where execution leaves the current scope. Leaving the current scope doesn't sound like it happens that often, but in Julia, even basic things like indexing into an array or adding two integers are actually method calls and thus leave the current scope. In the code in this question, however, all the time is spent inside the .^ function, so the fact that a is a global doesn't actually matter:
julia> @elapsed a.^(1:10000)
0.000809698
julia> let a = 0.9999;
@elapsed a.^(1:10000)
end
0.000804208
Ultimately, if all you're ever doing is calling vectorized operations on floating-point arrays, Octave is just fine. However, this is often not actually where most of the time is spent, even in high-level dynamic languages. If you ever find yourself wanting to iterate over an array with a for loop, operating on each element with scalar arithmetic, you'll find that Octave is quite slow at that sort of thing – often thousands of times slower than C or Julia code doing the same thing (see the sketch at the end of this answer). Writing for loops in Julia, on the other hand, is a perfectly reasonable thing to do – in fact, all our sorting code is written in Julia and is comparable to C in performance.
There are also many other reasons to use Julia that don't have to do with performance. As a Matlab clone, Octave inherits many of Matlab's design problems, and doesn't fare very well as a general-purpose programming language. You wouldn't, for example, want to write a web service in Octave or Matlab, but it's quite easy to do so in Julia.
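To make that scalar-loop point concrete, here is the cumprod benchmark from the question written as an explicit element-by-element loop in C – a minimal sketch, not a tuned implementation:

#include <stddef.h>

/* y = cumprod([a, a, ..., a]): a running product computed one scalar
   multiply at a time. */
void cumprod_scalar(double a, double *y, size_t n)
{
    double p = 1.0;
    for (size_t i = 0; i < n; i++) {
        p *= a;          /* p is now a^(i+1) */
        y[i] = p;
    }
}

An equivalent for loop in Julia looks almost the same and compiles to comparable machine code, whereas writing it element by element in Octave pays interpreter overhead on every iteration.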

You're using global variables, which is a performance gotcha in Julia.
The issue is that globals can potentially change type whenever your code calls another function. As a result, the compiler has to generate extremely slow code that cannot make any assumptions about the types of the global variables being used.
Simple modifications of your code in line with https://docs.julialang.org/en/stable/manual/performance-tips/ should yield more satisfactory results.

Related

Efficient way to perform tensor products in Fortran

I need to perform some tensor products and contractions on some large arrays in Fortran. Sometimes they are vectors or matrices and sometimes some of the objects involved are 3-arrays or 4-arrays.
Of course, it is very easy to write a subroutine achieving this with some nested loops, and that's just what I've done. But I have to call this subroutine with all its loops a lot of times for very large arrays, and I was just wondering whether there is some optimized function or subroutine implemented in Fortran which I could benefit from.
Last time I looked (about a year ago) I did not find a high-performance general-purpose tensor product library in Fortran. I think one of the reasons for this might be Fortran's cumbersome way of resizing arrays, which is a constant requirement when dealing with tensors.
If you only need multiplication, you might be able to get away with using your own code (a sketch of such a kernel follows the links below). However, if you need high performance or more general operations, I would highly recommend writing a C interface and using one of the excellent C++ libraries out there, which are probably already optimized for your type of application:
Physics:
http://itensor.org/
Machine learning:
https://github.com/tensorflow/tensorflow
These are only examples. For a more complete listing see:
Tensor multiplication library
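If you do end up rolling your own kernels behind a C interface, they are short. A minimal sketch of one contraction, C(i,j) = sum over k of A(i,j,k)*v(k), with the 3-array stored flat in row-major order (illustrative only – a tuned library would add blocking and vectorization):

#include <stddef.h>

/* C(i,j) = sum_k A(i,j,k) * v(k); A is flattened row-major, so
   A(i,j,k) lives at A[(i*nj + j)*nk + k]. */
void contract_3_1(const double *A, const double *v, double *C,
                  size_t ni, size_t nj, size_t nk)
{
    for (size_t i = 0; i < ni; i++)
        for (size_t j = 0; j < nj; j++) {
            const double *a = A + (i * nj + j) * nk;
            double s = 0.0;
            for (size_t k = 0; k < nk; k++)
                s += a[k] * v[k];
            C[i * nj + j] = s;
        }
}

Note that Fortran arrays are column-major, so a real interface would have to fix the layout convention at the boundary.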

Matlab Array Ops Optimization

In a MATLAB program I call a function many times (over 3 million) to convert from local coordinates to global coordinates in an image; it is just a simple transformation. My whole code takes 6 minutes to run, and the coordinate-conversion function takes 20% of that time.
How can I optimize this code?
function LMP_glb = do(pnt_val,LMP,NP_glb)
NP_co = ones(1,3)*round(pnt_val+1);
LMP_glb = [NP_glb(1:3) + LMP(1:3) - NP_co(1:3)]; %basic operations
end
Note: this function is called from several other functions in my code (not in a single endless for loop).
Thank you.
You can make this a little bit simpler:
function LMP_glb = do(pnt_val,LMP,NP_glb)
LMP_glb = NP_glb(1:3) + LMP(1:3) - round(pnt_val+1);
end
as MATLAB will handle expanding the last part of the expression from a scalar to an array for you.
I can't see much of a way to optimize it beyond that, as it isn't doing much. If you were to inline this (i.e. put the expression directly in your code rather than calling out to it as a function), you would additionally save the time taken in the overhead of a function call - if there are 3 million calls, that could be expensive.
@CitizenInsane's suggestion is also a good one – if it's possible to precompute this function for all your coordinates at once, and then use the results later, it might be possible to vectorize the operation for additional speedup.
Lastly, it would be possible (if you're comfortable with a little C), to implement this function in C, compile it, and call it as a MEX function. This may or may not be faster - you'd need to try it out.
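For illustration, here is a minimal sketch of what that MEX port might look like, based on the simplified one-line version above. It assumes pnt_val+1 is positive, so floor(x + 0.5) matches MATLAB's round():

#include "mex.h"
#include <math.h>

/* Hypothetical MEX version of:
   LMP_glb = NP_glb(1:3) + LMP(1:3) - round(pnt_val+1) */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs != 3)
        mexErrMsgTxt("Expected pnt_val, LMP, NP_glb.");

    double pnt_val       = mxGetScalar(prhs[0]);
    const double *LMP    = mxGetPr(prhs[1]);
    const double *NP_glb = mxGetPr(prhs[2]);
    double np_co         = floor(pnt_val + 1.0 + 0.5);   /* round(pnt_val+1) */

    plhs[0] = mxCreateDoubleMatrix(1, 3, mxREAL);
    double *out = mxGetPr(plhs[0]);
    for (int i = 0; i < 3; i++)
        out[i] = NP_glb[i] + LMP[i] - np_co;
}

Keep in mind that every MEX call still pays a fixed crossing cost, so for a function this small the win (if any) only shows up at very high call counts.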
I guess your do function is called inside some kind of for loop. If your for loop doesn't have data interdependencies, you can simply change it to parfor, and Matlab will start a parallel pool to take advantage of your multicore processor, i.e. parallel computing. Even if your data is interlaced inside the for loop, you can use some tricks to get rid of the dependency (declare an additional matrix or array to hold another copy of the data structure, trading memory usage for speed).
As mentioned by Sam, you can rewrite the function in C/C++, compile it with Matlab's MEX compiler, and call the compiled MEX function directly. In my experience, a C/C++ MEX function is faster than the Matlab built-in for simple arithmetic.
Parfor guide:
http://www.mathworks.com/help/distcomp/getting-started-with-parfor.html
MEX example:
http://www.mathworks.com/help/matlab/matlab_external/standalone-example.html

Improving computational speed of a Kalman filter in Matlab

I am computing a statistical model in Matlab which has to run about 200 Kalman filters per iteration, and I want to iterate the model at least 10,000 times, which means running the filter at least 2,000,000 times. I am therefore searching for a way to optimize the computational speed of this part in Matlab. I have already gone operation by operation trying to optimize the computation, using all the tricks that could be used, but I would like to go further...
I am not familiar with C/C++, but I read that MEX-files could be useful in some cases. Could anyone tell me whether it would be worth going in this direction?
Thanks...
Writing MEX files will definitely speed up the entire process, but you will not be able to use a lot of the built-in MATLAB functions. You are limited to what you can do in C and C++; of course you can write your own functions, as long as you know how to.
The speed increase mainly comes from the fact that MEX files are compiled rather than interpreted line by line like standard MATLAB scripts. Once you compile the MEX file you can call it the same way you do any other MATLAB function.
For a class I took in college I had to write my own image scaling function. I initially wrote it as a standard script, and it would take a couple of seconds to complete on large images, but when I rewrote it in C as a MEX function it would complete in less than 0.1 seconds.
MEX Files Documentation
If you are not familiar at all with C/C++ this will be hard going. Hopefully you have some experience with another language besides Matlab? You can try learning/copying from the many included examples, but you'll really need to figure out the basics first.
One thing in particular: if you use MEX you'll need some way to obtain decent random numbers for your Kalman filter noise. You might be surprised, but a very large percentage of your calculation time will likely be spent generating random numbers for the noise (depending on the complexity of your filter it could be > 50%).
Don't use the default random number generators in C/C++.
These are not suitable for scientific computation, especially when generating vast numbers of values as you seem to need. Your first option is to pass a large array of random numbers, generated via randn in Matlab, into your MEX code. Alternatively, look into including the C code for the Mersenne Twister algorithm itself and find/implement a scheme for generating normal random numbers from the uniform ones (log-polar is simplest, but Ziggurat will be faster). This is not too hard – I've done it myself, and the double-precision SIMD-oriented Fast Mersenne Twister (dSFMT) is actually 2+ times faster than Matlab's current implementation for uniform variates.
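For the normal-from-uniform step, Marsaglia's polar method (the log-polar scheme mentioned above) is only a few lines. A minimal sketch, where uniform01() is a placeholder for whatever uniform(0,1) generator you adopt (e.g. dSFMT) – not C's rand():

#include <math.h>

double uniform01(void);   /* placeholder: your uniform(0,1) generator */

/* Marsaglia's polar method: turns pairs of uniforms into pairs of
   standard normal variates, caching the second one for the next call. */
double randn_polar(void)
{
    static int have_spare = 0;
    static double spare;
    if (have_spare) {
        have_spare = 0;
        return spare;
    }
    double u, v, s;
    do {
        u = 2.0 * uniform01() - 1.0;
        v = 2.0 * uniform01() - 1.0;
        s = u * u + v * v;
    } while (s >= 1.0 || s == 0.0);
    double m = sqrt(-2.0 * log(s) / s);
    spare = v * m;
    have_spare = 1;
    return u * m;
}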
You could use parfor loops, or the Parallel Computing Toolbox in general, to speed up your calculations. Have you already checked whether MATLAB is using 100% CPU?

How much can mxRealloc affect a C-MEX Matlab code?

For the past few days I have been working on C-MEX code in order to improve the speed of a DBSCAN Matlab implementation. I have in fact finished a DBSCAN in C-MEX, but it takes more time (53.39 seconds in C-MEX versus 14.64 seconds in Matlab) on my test data, which is a 3 x 14414 matrix. I think this is due to the use of the mxRealloc function in several parts of my code. It would be great if someone could give me some suggestions on how to get better results.
Here is the code DBSCAN1.c:
https://www.dropbox.com/sh/mxn757a2qmniy06/PmromUQCbO
Using mxRealloc in every iteration of a loop is indeed a performance killer. You can use std::vector or a similar class instead; dynamic allocation is not needed at all in your distance function.
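As a concrete pattern, allocate the output buffer once outside the loop and have the distance routine just fill it in. A sketch (a hypothetical helper, not the poster's actual code) for a dim-by-n data matrix, which MATLAB stores column-major:

#include "mex.h"

/* Squared Euclidean distances from query point q to all n columns of
   pts (dim rows, n columns, column-major). out is allocated once by
   the caller; no mxRealloc anywhere in the loop. */
static void dist_sq(const double *pts, const double *q,
                    mwSize dim, mwSize n, double *out)
{
    for (mwSize j = 0; j < n; j++) {
        double s = 0.0;
        for (mwSize d = 0; d < dim; d++) {
            double diff = pts[j * dim + d] - q[d];
            s += diff * diff;
        }
        out[j] = s;
    }
}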
If your goal is not to implement DBSCAN as a mex but to speed it up, I will offer you a different solution.
I don't know which Matlab implementation you are using, but you won't make a trivial n^2 implementation much faster just by rewriting it in C the same way. Most of the time is spent calculating the nearest neighbors, which won't be faster in C than it is in Matlab. DBSCAN can run in n log n time by using an index structure to find the nearest neighbors.
For my application, I am using this implementation of dbscan, but I have changed the calculation of nearest neighbors to use a KD-tree (available here). The speedup was sufficient for my application and no reimplementation was required. I think this will be faster than any n^2 C implementation, no matter how well you write it.

Nested for loops extremely slow in MATLAB (preallocated)

I am trying to learn MATLAB, and one of the first problems I encountered was to guess the background of an image sequence with a static camera and moving objects. For a start I just want to take a mean or median of each pixel over time, so it's just a single function I would like to apply along one dimension of the 4-dimensional array.
I have loaded my RGB images in a 4 dimensional array with the following dimensions:
uint8 [ num_images, width, height, RGB ]
Here is the function I wrote, which uses 4 nested loops. I use preallocation, but it is still extremely slow. In C++ I believe this function could run at least 10x-20x faster, and I think on CUDA it could actually run in real time. In MATLAB it takes about 20 seconds with the 4 nested loops. My stack is 100 images of 640x480x3.
function background = calc_background(stack)
tic;
si = size(stack,1);
sy = size(stack,2);
sx = size(stack,3);
sc = size(stack,4);
background = zeros(sy,sx,sc);
A = zeros(si,1);
for x = 1:sx
for y = 1:sy
for c = 1:sc
for i = 1:si
A(i) = stack(i,y,x,c);
end
background(y,x,c) = median(A);
end
end
end
background = uint8(background);
disp(toc);
end
Could you tell me how to make this code much faster? I have tried experimenting with getting the data directly from the array using only indexing, and it seems MUCH faster: it completes in 3 seconds vs. 20 seconds, a roughly 7x difference, just from writing a smaller function.
function background = calc_background2(stack)
tic;
% bad code, confusing
% background = uint8(squeeze(median(stack(:, 1:size(stack,2), 1:size(stack,3), 1:3 ))));
% good code (credits: Laurent)
background = uint8(squeeze(median(stack,1)));
disp(toc);
end
So now I don't understand: if MATLAB can be this fast, then why is the nested-loop version so slow? I am not doing any dynamic resizing, and MATLAB must be running the same 4 nested loops inside.
Why is this happening?
Is there any way to make nested loops run fast, as they naturally would in C++?
Or should I get used to the idea of programming MATLAB in these crazy one-line statements to get optimal performance?
Update
Thank you for all the great answers; now I understand a lot more. My original code with stack(:, 1:size(stack,2), 1:size(stack,3), 1:3 ) didn't make any sense; it is exactly the same as stack, and I was just lucky that median defaults to using the 1st dimension as its working range.
I think it's better to ask how to write efficient vectorized functions in a separate question, so I asked it here:
How to write vectorized functions in MATLAB
If I understand your question, you're asking why Matlab is faster for matrix operations than for procedural programming calls. The answer is simply that that's how it's designed. If you really want to know what makes it that way, you can read this newsletter from Matlab's website, which discusses some of the underlying technology, but you probably won't get a great answer, as the software is proprietary. I also found some relevant pages by simply googling, and this old SO question also seems to address your question.
Matlab is an interpreted language, meaning that it must evaluate each line of code of your script.
Evaluating is a lengthy process since it must parse, 'compile' and interpret each line*.
Using for loops with simple operations means that Matlab spends far more time parsing/compiling than actually executing your code.
Built-in functions, on the other hand, are coded in a compiled language and heavily optimized. They're very fast, hence the speed difference.
Bottom line: we're very used to procedural languages and for loops, but there's almost always a nicer and faster way to do the same thing in a vectorized way.
* To be complete, and to pay honour to whom honour is due: recent versions of Matlab actually try to accelerate loops by analyzing repeated operations and compiling chunks of repetitive operations into native executable code. This is called Just-In-Time compilation (JIT) and was pointed out by Jonas in the comments.
Original answer:
If I understood well (and you want the median over the first dimension), you might try:
background = uint8(squeeze(median(stack,1)));
Well, the difference between the two is their method of executing code. To sketch it very roughly: in C you feed your code to a compiler which will try to optimize it, or at any rate convert it to machine code. This takes some time, but when you actually execute your program, it is in machine code already and therefore executes very fast. Your compiler can take a lot of time trying to optimize the code for you; in general you don't care whether it takes 1 minute or 10 minutes to compile a distribution-ready program.
MATLAB (and other interpreted languages) don't generally work that way. When you execute your program, an interpreter interprets each line of code and transforms it into a sequence of machine instructions on the fly. This is a bit slower if you write for loops, as it has to interpret the code over and over again (at least in principle; there are other overheads which may matter more in the newest versions of MATLAB). The hurdle is that everything has to be done at runtime: the interpreter can perform some optimizations, but it is not worthwhile to perform time-consuming optimizations that might greatly increase performance in some cases, as they would cause performance to suffer in most other cases.
You might ask what you gain by using MATLAB? You gain flexibility and clear semantics. When you want to do a matrix multiplication, you just write it as such; in C this would mean writing nested for loops yourself (see the sketch below). And you have to worry very little about data types, memory management, ...
Behind the scenes, MATLAB uses compiled code (Fortran/C/C++, if I'm not mistaken) to perform large operations: so a matrix multiplication is really performed by a piece of machine code which was compiled from another language. For smaller operations this is the case as well, but you won't notice the speed of these calculations because most of the time is spent in management code (passing variables, allocating memory, ...).
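As a concrete example of what a single MATLAB expression hides, A*B dispatches to an optimized BLAS routine, whereas the naive C version you would otherwise write yourself looks like this (a sketch – real BLAS implementations add blocking, SIMD and threading):

#include <stddef.h>

/* Naive C = A*B with A m-by-p and B p-by-n, all stored row-major. */
void matmul_naive(const double *A, const double *B, double *C,
                  size_t m, size_t p, size_t n)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < p; k++)
                s += A[i * p + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}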
To sum it all up: yes, you should get used to such compact statements. If you see a line of code like Laurent's example, you immediately see that it computes the median of stack. Your code requires 11 lines to express the same thing, so when you are looking at code like yours (which might be embedded in hundreds of lines of other code), you will have a harder time understanding what is happening and pinpointing where a certain operation is performed.
To argue even further: you shouldn't program in MATLAB the same way you'd program in C/C++, nor the other way round. Each language has its strong and weak points; learn to know them, and use each language for what it's made for. E.g. you could write a whole compiler or web server in MATLAB, but in general that would be really slow, as MATLAB was not intended for handling or concatenating strings (it can, but it may be very slow).
