Matlab mex file is slow compared to its straight C equivalent - c

I'm at a loss to explain (and avoid) the differences in speed between a Matlab mex program and the corresponding C program with no Matlab interface. I've been profiling a numerical analysis program:
int main(){
Well_optimized_code();
}
compiled with gcc 4.4 against the Matlab-Mex equivalent (directed to use gcc44, which is not the version currently supported by Matlab, but it's required for other reasons):
void mexFunction(int nlhs,mxArray* plhs[], int nrhs, const mxArray* prhs[]){
Well_optimized_code(); //literally the exact same code
}
I performed the timings as:
$ time ./C_version
vs.
>> tic; mex_version(); toc
The difference in timing is staggering. The version run from the command line takes 5.8 seconds on average. The version in Matlab runs in 21 seconds. For context, the mex file replaces an algorithm in the SimBiology toolbox that takes about 26 seconds to run.
As compared to Matlab's algorithm, both the C and mex versions scale linearly up to 27 threads using calls to openMP, but for the purposes of profiling these calls have been disabled and commented out.
The two versions have been compiled in the same way with the exception of the necessary flags to compile as a mex file: -fPIC --shared -lmex -DMATLAB_MEX_FILE being applied in the mex compilation/linking. I've removed all references to the left and right arguments of the mex file. That is to say it takes no inputs and gives no outputs, it is solely for profiling.
The Great and Glorious Google has informed me that the position independent code should not be the source of the slowdown and beyond that I'm at a loss.
Any help will be appreciated,
Andrew

After a month of emailing with my contacts at Mathworks, playing around with my own code, and profiling my code every which way, I have an answer; however, it may be the most dissatisfying answer I have ever had to a technical question:
The short version is "upgrade to Matlab version 2011a (officially released last week), this issue has now been resolved".
The longer version regards an issue of the overhead associated with the mex gateway in versions 2010b and earlier. The best explanation that I've been able to extract is that this overhead is not assessed once, rather we pay a little bit every time a function calls another function that is in a linked library.
While why this occurs baffles me, it is at least consistent with the SHARK profiling that I did. When I profile and compare the differences between the native app and the mex app there is a recurring pattern. The time spent in functions that are in the source code I wrote for the app does not change. The time spent in library functions increases a little when comparing between the native and mex implementations. Functions in another library used to build this library increase the difference a lot. The time difference continues to increase as we proceed ever deeper until we reach by BLAS implementation.
A couple of heavily used BLAS functions were the main culprits. A function that took ~1% of my computation time in the native app was clocking in at 30% in the mex function.
The implementation of the mex gateway appears to have changed between 2010b and 2011a. On my macbook the native app takes about 6 seconds and the mex version takes 6.5 seconds. This is overhead that I can deal with.
As for the underlying cause, I can only speculate. Matlab has it's roots in interpretive coding. Since mex functions are dynamic libraries, I'm guessing that each mex library was unaware of what it was linked against until runtime. Since Matlab suggests the user rarely use mex and then only for small computationally intensive chunks, I assume that large programs (such as an ODE solver) are rarely implemented. These programs, like mine, are the ones that suffer the most.
I've profiled a couple of Matlab functions that I know to be implemented in C then compiled using mex (especially sbiosimulate after calling sbioaccelerate on kinetic models, part of the SimBiology toolbox) and there appears to be some significant speed ups. So the 2011a update appears to be more broadly beneficial than the usual semi-yearly upgrade.
Best of luck to other coders with the similar issues. Thanks for all of the helpful advice that got me started in the right direction.
--Andrew

Recall that Matlab stores arrays as column major, and C/C++ as row major. Is it possible that your loop structure/algorithm is iterating in a row major fashion, resulting in poor memory access times in Matlab, but fast access times in C/C++ ?

Related

C/C++ code is much slower when executed within OpenFX

I have a segmentation algorithm written in C/C++ that makes extensive use of C pointers, in order to access linked lists of structures, which are calloc'ed in the beginning of the programme.
This algorithm takes about 3 sec. to run on Ubuntu 14.04, gcc 4.8.2. It also uses OpenCV 2.4.8.
The algorithm is meant to be embedded within a library of OpenFX, so that this library can be added as plug-in to software suites, like Natron.
When executed as a plug-in of a host, on SUSE, gcc 4.3.2, the very same method with the same inputs takes 12 sec. to execute. I've been debugging and can't figure out why it takes so long, when executed within OpenFX. My strongest guess is that OpenFX handles differently the access to memory and that makes the execution of the algorithm slower.
Could anyone give me any clue? Please let me know if you need further information.

How come the mex code is running more slowly than the matlab code

I use matlab to write a program with many iterations. It cannot be vectorized since the data processing in each iteration is related to that in the previous iteration.
Then I transform the matlab code to mex using the build-in MATLAB coder and the resulting speed is even lower. I don't know whether I need to write the mex code by myself since it seems the mex code doesn't help.
I'd suggest that if you can, you get in touch with MathWorks to ask them for some advice. If you're not able to do that, then I would suggest really reading through the documentation and trying everything you find before giving up.
I've found that a few small changes to the way one implements the MATLAB code, and a few small changes to the project settings (such as disabling responsiveness to Ctrl-C, extrinsic calls back to MATLAB) can make give a speed difference of an order of magnitude or more in the generated code. There are not many people outside MathWorks who would be able to give good advice on exactly what changes might be worthwhile/sensible for you.
I should say that I've only used MATLAB Coder on one project, and I'm not at all an expert (actually not even a competent) C programmer. Nevertheless I've managed to produce C code that was about 10-15 times as fast as the original MATLAB code when mexed. I achieved that by a) just fiddling with all the different settings to see what happened and b) methodically going through the documentation, and seeing if there were places in my MATLAB code where I could apply any of the constructs I came across (such as coder.nullcopy, coder.unroll etc). Of course, your code may differ substantially.

How much faster is C than R in practice?

I wrote a Gibbs sampler in R and decided to port it to C to see whether it would be faster. A lot of pages I have looked at claim that C will be up to 50 times faster, but every time I have used it, it's only about five or six times faster than R. My question is: is this to be expected, or are there tricks which I am not using which would make my C code significantly faster than this (like how using vectorization speeds up code in R)? I basically took the code and rewrote it in C, replacing matrix operations with for loops and making all the variables pointers.
Also, does anyone know of good resources for C from the point of view of an R programmer? There's an excellent book called The Art of R Programming by Matloff, but it seems to be written from the perspective of someone who already knows C.
Also, the screen tends to freeze when my C code is running in the standard R GUI for Windows. It doesn't crash; it unfreezes once the code has finished running, but it stops me from doing anything else in the GUI. Does anybody know how I could avoid this? I am calling the function using .C()
Many of the existing posts have explicit examples you can run, for example Darren Wilkinson has several posts on his blog analyzing this in different languages, and later even on different hardware (eg comparing his high-end laptop to his netbook and to a Raspberry Pi). Some of his posts are
the initial (then revised) post
another later post
and there are many more on his site -- these often compare C, Java, Python and more.
Now, I also turned this into a version using Rcpp -- see this blog post. We also used the same example in a comparison between Julia, Python and R/C++ at useR this summer so you should find plenty other examples and references. MCMC is widely used, and "easy pickings" for speedups.
Given these examples, allow me to add that I disagree with the two earlier comments your question received. The speed will not be the same, it is easy to do better in an example such as this, and your C/C++ skills will mostly determines how much better.
Finally, an often overlooked aspect is that the speed of the RNG matters a lot. Running down loops and adding things up is cheap -- doing "good" draws is not, and a lot of inter-system variation comes from that too.
About the GUI freezing, you might want to call R_CheckUserInterrupt and perhaps R_ProcessEvents every now and then.
I would say C, done properly, is much faster than R.
Some easy gains you could try:
Set the compiler to optimize for more speed.
Compiling with the -march flag.
Also if you're using VS, make sure you're compiling with release options, not debug.
Your observed performance difference will depend on a number of things: the type of operations that you are doing, how you write the C code, what type of compiler-level optimizations you use, your target CPU architecture, etc etc.
You can write basic, sloppy C and get something that works and runs with decent efficiency. You can also fine-tune your code for the unique characteristics of your target CPU - perhaps invoking specialized assembly instructions - and squeeze every last drop of performance that you can out of the code. You could even write code that runs significantly slower than the R version. C gives you a lot of flexibility. The limiting factor here is how much time that you want to put into writing and optimizing the C code.
The reverse is also true (duplicate the previous paragraph here, but swap "C" and "R").
I'm not trying to sound facetious, but there's really not a straightforward answer to your question. The only way to tell how much faster your C version would be is to write the code both ways and benchmark them.

what are the steps/strategy to analyze and improve performance of an embedded system

I will break down this question in to sub questions. I am confused if I should ask them separately or in one question. So I will just stick to one SO question.
What are generally the steps to analyze and improve performance of C applications?
Do these steps change if I am developing for an embedded system?
What tools are out there which can help me?
Recently I have been given a task to improve the performance of our product on ARM11 platform. I am relatively new to this field of embedded systems and need gurus here on SO to help me out.
simply changing compilers can improve your C performance for the same source code by many times over. GCC has not necessarily gotten better for performance over the years, for some programs gcc 3.x produces much tighter code than 4.x. Back when I had access to the tools, ARMs compiler produced significantly better code than gcc. As much as 3 or 4 times faster. LLVM has caught up to GCC 4.x and I suspect will pass gcc by in terms of performance and overall use for cross compiling embedded code. Try different versions of gcc, 3.x and 4.x if you are using gcc. Metaware's compiler and arms adt ran circles around gcc3.x, gcc3.x will give gcc4.x a run for its money with arm code, for thumb code gcc4.x is better and for thumb2 (which doesnt apply to you) gcc4.x also better. Remember I have not said a word about changing a single line of code (yet).
LLVM is capable of full program optimization in addition to infinitely more tuning knobs than gcc. Despite that the code generated (ver 27) is only just catching up to the current gcc 4.x in terms of performance for the few programs I tried. And I didnt try the n factoral number of optimization combinations (optimize on the compile step, different options for each file, or combine two files or three files or all files and optimize those bundles, my theory is do no optimization on the C to bc steps, link all the bc together then do a single optimization pass on the whole program, the allow the default optimization when llc takes it to the target).
By the same token simply knowing your compiler and the optimizations can greatly improve the performance of the code without having to change any of it. You have an ARM11 arr you compiling for arm11 or generic arm? You can gain a few to a dozen percent by telling the compiler specifically which architecture/family (armv6 for example) over the generic armv4 (ARM7) that is often chosen as the default. Knowing to use -O2 or -O3 if you are brave.
It is often not the case but switching to thumb mode can improve performance for specific platforms. Doesnt apply to you but the gameboy advance is a perfect example, loaded with non-zero wait state 16 bit busses. Thumb has a handful of a percent overhead because it takes more instructions to do the same thing, but by increasing the fetch times, and taking advantage of some of the sequential read features of the gba thumb code can run significantly faster than arm code for the same source code.
having an arm11 you probably have an L1 and maybe L2 cache, are they on? Are they configured? Do you have an mmu and is your heavy use memory cached? or are you running zero wait state memory and dont need a cache and should turn it off? In addition to not realizing that you can take the same source code and make it run many times faster by changing compilers or options, folks often dont realize that when you use a cache simply adding a single up to a few nops in your startup code (as a trick to adjust where code lands in memory by one, two, a few words) you can change your codes execution speed by as much as 10 to 20 percent. Where those cache line reads hit in heavily used functions/loops makes a big difference. Even saving one cache line read by adjusting where the code lands is noticeable (cutting it from 3 to 2 or 2 to 1 for example).
Knowing your architecture, both the processor and your memory environment is where the tuning if any would start. Most C libraries if you are high level enough to use one (I often dont use a C library as I run without an operating system and with very limited resources) both in their C code and sometimes add some assembler to make bottleneck routines like memcpy, much faster. If your programs are operating on aligned 32 or even better 64 bit addresses, and you adjust even if it means using a handful of bytes more memory for every structure/array/memcpy to be an integral multiple of 32 bits or 64 bits you will see noticeable improvements (if your code uses structs or copies data in other ways). In addition to getting your structures (if you use them, I certainly dont with embedded code) size aligned, even if you waste memory, getting elements aligned, consider using 32 bit integers for every element instead of bytes or halfwords. Depending on your memory system this can help (it can hurt too btw). As with the GBA example above looking at specific functions that either by profiling or intuition you know are not being implemented in a manner that takes advantage of your processor or platform or libraries you may want to turn to assembler either from scratch or compiling from C initially then disassembling and hand tuning. Memcpy is a good example you may know your systems memory performance and may chose to create your own memcpy specifically for aligned data, copying 64 or 128 or more bits per instruction.
Likewise mixing global and local variables can make a noticeable performance difference. Traditionally folks are told never to use globals, but in embedded this isnt necessarily true, depends on how deeply embedded and how much tuning and speed and other factors you are interested in. This is a touchy subject and I may get flamed for it, so I will leave it at that.
The compiler has to burn and evict registers in order to make function calls, plus if you use local variables a stack frame may be required, so function calls are expensive, but at the same time, depending on the code within a function that has now grown in size by avoiding functions, you may create the problem you were trying to avoid, evicting registers to re-use them. Even a single line of C code can make the difference between all the variables in a function fits in registers to having to start evicting a bunch of registers. For functions or segments of code where you know you need some performance gain compile and disassemble (and look at register usage, how often it fetches memory or writes to memory). You can and will find places where you need to take a well used loop and make it its own function even though the function call has a penalty because by doing that the compiler can better optimize the loop and not evict/reuse registers and you get an overall net gain. Even a single extra instruction in a loop that goes around hundreds of times is a measurable performance hit.
Hopefully you already know to absolutely not compile for debug, turn all of the compile for debug options off. You may already know that code compile for debug that runs without bugs doesnt mean it is debugged, compiling for debug and using debuggers hide bugs leaving them as time bombs in your code for your final compile for release. Learn to always compile for release and test with the release version both for performance and finding bugs in your code.
Most instruction sets do not have a divide function. Avoid using divides or modulo in your code as much as humanly possible they are performance killers. Naturally this is not the case for powers of two, to save the compiler and to mentally avoid divides and modulos try to use shifts and ands. Multplies are easier and more often found in instruction sets, but are still costly. This is a good case to write assembler to do your multiplies instead of letting the C copiler do it. The arm multiply is a 32bit * 32bit = 32 bit so to do accurate math without overflowing there has to be extra C code wrapped around the multiply, if you already know you wont overflow, burn the registers for a function call and do the multiply in assembler (for the arm).
Likewise most instruction sets do not have a floating point unit, with yours you might, even so avoid float if at all possible. If you have to use float that is a whole other pandora's box of performance issues. Most folks dont see the performance problems with code as simple as this:
float a,b;
...
a = b * 7.0;
The rest of the problem is not understanding floating point accuracy and how good or bad the C libraries are just trying to get your constants into floating point form. Again float is a whole other long discussion on performance problems.
I am a product of Michael Abrash (I actually have a print copy of zen of assembly language) and the bottom line is time your code. Come up with an accurate way to time the code, you may think you know where the bottlenecks are and you may think you know your architecture but trying different things even if you think they are wrong, and timing them you may find and eventually have to figure out the error in your thinking. Adding nops to start.S as a final tuning step is a good example of this, all the other work you have done for performance can be instantly erased by not having a good alignment with the cache, this also means re-arranging functions within your source code so that they land in different places in the binary image. I have seen 10 to 20 percent swings of speed increase and decrease as a result of cache line alignments.
Code Review:
What are good code review techniques ?
Static and dynamic analysis of the code.
Tools for static analysis: Sparrow, Prevent, Klockworks
Tools for dynamic analysis : Valgrind, purify
Gprof allows you to learn where your program spent its time and which functions called which other functions while it was executing.
Steps are same
Apart from what is listed is point 1, there are tools like memcheck etc.
There is a big list here based on platform
Phew!! Quite a big question!
What are generally the steps to
analyze and improve performance of C
applications?
As well as other static code analysers mentioned here there is a fairly cheap version called PC-Lint which has been around for ages. Sometimes throws up lots of errors and warnings for one error but by the end of it you'll be happy and know waaaaay more about C/C++ because of it.
With all code analysers some of the issues may be more structural to the code so best to start analysing it from day 1 of coding; running analysis on old software may swamp you with issues which may take a while to untangle, best to keep it clean from the beginning.
But code analysers will not catch all logical errors, i.e. it doesn't do what you want it to do! These are best done by code reviews first, then testing. Performance is often improved by by trying to keep the algorithms as simple as possible, keeping instructions in loops tight, possibly unrolling loops (your compiler optimisations may do this), use of fast caches when accessing data which is slow to get.
Code reviews can raise a lot of issues from lots of other peoples eyes looking at it. Don't get too many people, try to get 3 other people if possible, sometimes junior developers ask the most insightful questions like, "why are we doing this?".
Testing can be roughly split into two sections, automated and manual. Automated testing requires effort producing test handlers for functions/units but once run can be run again and again very quickly. Manual testing requires planning, self-discipline to perform them all to the required, imagination to think up of scenarios that may impair performance and you have to be observant (you may have passed the test but the 'scope trace has a bit of an anomaly before/after the test).
"Do these steps change if I am
developing for an embedded system?"
Performance ananlysis can be different on embedded systems to applications systems; with the very broad brush that "embedded" now covers it depends how hardware-centric you are. It can be done using profilers, if you want a more cheap and chearful method then use test output pins to measure sections of code, or measure them with breakpoints on simulators that come with the development environment.
Make sure that not just a typical length of task is measured but also a maximum, as that is where one task may start impeding on other tasks and your scheduled tasks are not completed in time.
What tools are out there which can
help me?
Simulators on the IDEs, static analysis tools, dynamic analysis tools, but most of all you and other humans getting the requirements right, decent reviewing (of code and testing) and thorough testing (automated and manual).
Good luck!
My experiences.
Function calls are slow, eliminate with macros or inlined methods. Look at the disassembler listing to see.
If using GCC, mark optimized sections with #pragma GCC optimize("O3") or compile them separately.
Play with different combinations of applying the inline attribute (basically find a balance between size and speed).
It is a difficult question to be answered shortly since various techniques have been proposed such as flowchart and state diagram,so you can take a look at some titles:
ARM System-on-Chip Architecture, 2nd Edition -- Steve Furber
ARM System Developer's Guide - Designing and Optimizing System Software -- Andrew N. Sloss, Dominic Symes, Chris Wright & John Rayfield
The Definitive Guide to the ARM Cortex-M3 --Joseph Yiu
C Programming for Embedded Systems --Kirk Zurell
Embedded C -- Michael J. Pont
Programming Embedded Systems in C and C++ --Michael Barr
An Embedded Software Primer --David E, Simon
Embedded Microprocessor Systems 3rd Edition --Stuart Ball
Global Specification and Validation of Embedded Systems - Integrating Heterogeneous Components --G. Nicolescu & A.A Jerraya
Embedded Systems: Modeling, Technology and Applications --Gunter Hommel & Sheng Huanye
Embedded Systems and Computer Architecture --Graham Wilson
Designing Embedded Hardware --John Catsoulis
You have to use a profiler. It will help you identify your application's bottleneck(s). Then focus on improving the functions you spend the most time in and the ones you call the most. Repeat this procedure until you're satisfied with your application performance.
No they don't.
Depending on the platform you're developing onto :
Windows : AMD Code Analyst, VTune, Sleepy
Linux : valgrind / callgrind / cachegrind
Mac : the Xcode profiler is quite good.
Try to find a profiler for the architecture you actually work on.

Why don't I see a significant speed-up when using the MATLAB compiler?

I have a lot of nice MATLAB code that runs too slowly and would be a pain to write over in C. The MATLAB compiler for C does not seem to help much, if at all. Should it be speeding execution up more? Am I screwed?
If you are using the MATLAB complier (on a recent version of MATLAB) then you will almost certainly not see any speedups at all. This is because all the compiler actually does is give you a way of packaging up your code so that it can be distributed to people who don't have MATLAB. It doesn't convert it to anything faster (such as machine code or C) - it merely wraps it in C so you can call it.
It does this by getting your code to run on the MATLAB Compiler Runtime (MCR) which is essentially the MATLAB computational kernel - your code is still being interpreted. Thanks to the penalty incurred by having to invoke the MCR you may find that compiled code runs more slowly than if you simply ran it on MATLAB.
Put another way - you might say that the compiler doesn't actually compile - in the traditional sense of the word at least.
Older versions of the compiler worked differently and speedups could occur in certain situations. For Mathwork's take on this go to
http://www.mathworks.com/support/solutions/data/1-1ARNS.html
In my experience slow MATLAB code usually comes from not vectorizing your code (i.e., writing for-loops instead of just multiplying arrays (simple example)).
If you are doing file I/O look out for reading data in one piece at a time. Look in the help files for the vectorized version of fscanf.
Don't forget that MATLAB includes a profiler, too!
I'll echo what dwj said: if your MATLAB code is slow, this is probably because it is not sufficiently vectorized. If you're doing explicit loops when you could be doing operations on whole arrays, that's the culprit.
This applies equally to all array-oriented dynamic languages: Perl Data Language, Numeric Python, MATLAB/Octave, etc. It's even true to some extent in compiled C and FORTRAN compiled code: specially-designed vectorization libraries generally use carefully hand-coded inner loops and SIMD instructions (e.g. MMX, SSE, AltiVec).
First, I second all the above comments about profiling and vectorizing.
For a historical perspective...
Older version of Matlab allowed the user to convert m files to mex functions by pre-parsing the m code and converting it to a set of matlab library calls. These calls have all the error checking that the interpreter did, but old versions of the interpreter and/or online parser were slow, so compiling the m file would sometimes help. Usually it helped when you had loops because Matlab was smart enough to inline some of that in C. If you have one of those versions of Matlab, you can try telling the mex script to save the .c file and you can see exactly what it's doing.
In more recent version (probably 2006a and later, but I don't remember), Mathworks started using a just-in-time compiler for the interpreter. In effect, this JIT compiler automatically compiles all mex functions, so explicitly doing it offline doesn't help at all. In each version since then, they've also put a lot of effort into making the interpreter much faster. I believe that newer versions of Matlab don't even let you automatically compile m files to mex files because it doesn't make sense any more.
The MATLAB compiler wraps up your m-code and dispatches it to a MATLAB runtime. So, the performance you see in MATLAB should be the performance you see with the compiler.
Per the other answers, vectorizing your code is helpful. But, the MATLAB JIT is pretty good these days and lots of things perform roughly as well vectorized or not. That'a not to say there aren't performance benefits to be gained from vectorization, it's just not the magic bullet it once was. The only way to really tell is to use the profiler to find out where your code is seeing bottlenecks. Often times there are some places where you can do local refactoring to really improve the performance of your code.
There are a couple of other hardware approaches you can take on performance. First, much of the linear algebra subsystem is multithreaded. You may want to make sure you have enabled that in your preferences if you are working on a multi-core or multi-processor platform. Second, you may be able to use the parallel computing toolbox to take more advantage of multiple processors. Finally, if you are a Simulink user, you may be able to use emlmex to compile m-code into c. This is particularly effective for fixed point work.
Have you tried profiling your code? You don't need to vectorize ALL your code, just the functions that dominate running time. The MATLAB profiler will give you some hints on where your code is spending the most time.
There are many other things you you should read up on the Tips For Improving Performance section in the MathWorks manual.
mcc won't speed up your code at all--it's not really a compiler.
Before you give up, you need to run the profiler and figure out where all your time is going (Tools->Open Profiler). Also, judicious use of "tic" and "toc" can help. Don't optimize your code until you know where the time is going (don't try to guess).
Keep in mind that in matlab:
bit-level operations are really slow
file I/O is slow
loops are generally slow, but vectorizing is fast (if you don't know the vector syntax, learn it)
core operations are really fast (e.g. matrix multiply, fft)
if you think you can do something faster in C/Fortran/etc, you can write a MEX file
there are commercial solutions to convert matlab to C (google "matlab to c") and they work
You could port your code to "Embedded Matlab" and then use the Realtime-Workshop to translate it to C.
Embedded Matlab is a subset of Matlab. It does not support Cell-Arrays, Graphics, Marices of dynamic size, or some Matrix addressing modes. It may take considerable effort to port to Embedded Matlab.
Realtime-Workshop is at the core of the Code Generation Products. It spits out generic C, or can optimize for a range of embedded Platforms. Most interresting to you is perhaps the xPC-Target, which treats general purpose hardware as embedded target.
I would vote for profiling + then look at what are the bottlenecks.
If the bottleneck is matrix math, you're probably not going to do any better... EXCEPT one big gotcha is array allocation. e.g. if you have a loop:
s = [];
for i = 1:50000
s(i) = 3;
end
This has to keep resizing the array; it's much faster to presize the array (start with zeros or NaN) & fill it from there:
s = zeros(50000,1);
for i = 1:50000
s(i) = 3;
end
If the bottleneck is repeated executions of a lot of function calls, that's a tough one.
If the bottleneck is stuff that MATLAB doesn't do quickly (certain types of parsing, XML, stuff like that) then I would use Java since MATLAB already runs on a JVM and it interfaces really easily to arbitrary JAR files. I looked at interfacing with C/C++ and it's REALLY ugly. Microsoft COM is ok (on Windows only) but after learning Java I don't think I'll ever go back to that.
As others has noted, slow Matlab code is often the result of insufficient vectorization.
However, sometimes even perfectly vectorized code is slow. Then, you have several more options:
See if there are any libraries / toolboxes you can use. These were usually written to be very optimized.
Profile your code, find the tight spots and rewrite those in plain C. Connecting C code (as DLLs for instance) to Matlab is easy and is covered in the documentation.
By Matlab compiler you probably mean the command mcc, which does speed the code a little bit by circumventing Matlab interpreter. What would speed the MAtlab code significantly (by a factor of 50-200) is use of actual C code compiled by the mex command.

Resources