OCaml Microbenchmark - C

I am trying a basic microbenchmark comparison of C with OCaml. I have heard that for the Fibonacci program, C and OCaml are about the same, but I can't replicate those results. I compile the C code with gcc -O3 fib.c -o c-code, and compile the OCaml code with ocamlopt -o ocaml-code fibo.ml. I time both with time ./c-code and time ./ocaml-code. Every time I do this the OCaml version takes about 0.10 seconds, whereas the C version takes about 0.03 seconds. Besides the fact that this is a naive benchmark, is there a way to make the OCaml faster? Can anyone see what the times on their computers are?
C
#include <stdio.h>
int fibonacci(int n)
{
    return n < 3 ? 1 : fibonacci(n - 1) + fibonacci(n - 2);
}

int main(void)
{
    printf("%d", fibonacci(34));
    return 0;
}
OCaml
let rec fibonacci n = if n < 3 then 1 else fibonacci(n-1) + fibonacci(n-2);;
print_int(fibonacci 34);;

The ML version already beats the C version when compiled with gcc -O2, which I think is a pretty decent showing. Looking at the assembly generated by gcc -O3, it looks like gcc is doing some aggressive inlining and loop unrolling. To make the code faster I think you would have to rewrite it, but you should focus on the higher-level abstraction instead.

I think this is just fixed overhead from OCaml; it would be more relevant to compare with a larger program.
You can use the -S option to produce assembly output, along with -verbose to see how ocamlopt calls external tools (such as gcc). Additionally, compiling with the -p option and running your application through gprof will help determine whether this is overhead from OCaml, or something you can actually improve.
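A sketch of those invocations, using the fibo.ml and executable name from the question (flag meanings as described above):
ocamlopt -S -verbose -o ocaml-code fibo.ml   # keep the generated .s file, print external commands
ocamlopt -p -o ocaml-code fibo.ml            # rebuild with profiling support
./ocaml-code                                 # run once to produce gmon.out
gprof ./ocaml-code gmon.out                  # inspect where the time goes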
Cheers.
On my computer I get the following:
OCaml: 0.035 s (std dev = 0.02; 10 trials)
C: 0.027 s (std dev = 0.03; 10 trials)

Related

Order of arguments for C bitwise operations?

I've gotten a piece of software working, and am now trying to tune it up so it runs faster. I discovered something that struck me as, well, just bizarre. It's no longer relevant, because I switched to using a pointer instead of indexing an array (it's faster with the pointers), but I'd still like to know what is going on.
Here's the code:
short mask_num_vals(short mask)
{
    short count = 0;
    for (short val = 0; val < NUM_VALS; val++)
        if (mask & val_masks[val])
            count++;
    return count;
}
This small piece of code is called many many times. What really surprised me is that this code runs significantly faster than its predecessor, which simply had the two arguments to the "&" operation reversed.
Now, I would have thought the two versions would be, for all practical purposes, identical, and they do produce the same result. But the version above is faster - noticeably faster. It makes about a 5% difference in the running time of the overall code that uses it. My attempt to measure the amount of time spent in the function above failed completely - measuring the time used up far more time than actually executing the rest of the code. (A version of Heisenberg's principle for software, I guess.)
So my picture here is, the compiled code evaluates the two arguments, and then does a bitwise "and" on them. Who cares which order the arguments are in? Apparently the compiler or the computer does.
My completely unsupported conjecture is that the compiled code must be evaluating "val_masks[val]" for each bit. If "val_masks[val]" comes first, it evaluates it for every bit, if "mask" comes first, it doesn't bother with "val_masks[val]" if that particular bit in "mask" is zero. I have no evidence whatsoever to support this conjecture; I just can't think of anything else that might cause this behaviour.
Does this seem likely? This behaviour just seemed weird to me, and I think it points to some difference between my picture of how the compiled code works and how it actually works. Again, it's not all that relevant any more, as I've evolved the code further (using pointers instead of arrays), but I'd still be interested in knowing what is causing this.
The hardware is an Apple MacBook Pro 15-inch 2018, macOS 10.15.5. The software is the gcc compiler, and "gcc --version" produces the following output.
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 11.0.3 (clang-1103.0.32.62)
Target: x86_64-apple-darwin19.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Compiled with the command "gcc -c -Wall 'C filename'", linked with "gcc -o -Wall 'object filenames'".
Code optimizers are often unpredictable. Their output can change after small meaningless tweaks in code, or after changing command-line options, or after upgrading the compiler. You cannot always explain why the compiler does some optimization in one case but not in another; you can guess all you want, but only experience can show.
One powerful technique for determining what is going on: compile both versions of the code to assembly language and compare the output.
GCC can be invoked with the command-line switch -S for that:
gcc -S -Wall -O -fverbose-asm your-c-source.c
This produces a textual assembler file your-c-source.s from the C file your-c-source.c; you can glance into it with a pager like less or a source-code editor like GNU Emacs.
The Clang compiler has similar options.
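For instance, here is a minimal pair to compare, with NUM_VALS and the contents of val_masks stubbed out with made-up values (the real definitions live in the asker's code):
#include <stdio.h>

/* made-up stand-ins for the question's NUM_VALS and val_masks */
#define NUM_VALS 8
static const short val_masks[NUM_VALS] = { 1, 2, 4, 8, 16, 32, 64, 128 };

/* version A: mask on the left of & (the faster one in the question) */
short mask_num_vals_a(short mask)
{
    short count = 0;
    for (short val = 0; val < NUM_VALS; val++)
        if (mask & val_masks[val])
            count++;
    return count;
}

/* version B: operands of & swapped (the original, slower one) */
short mask_num_vals_b(short mask)
{
    short count = 0;
    for (short val = 0; val < NUM_VALS; val++)
        if (val_masks[val] & mask)
            count++;
    return count;
}

int main(void)
{
    /* both versions must agree; the interesting part is the generated assembly */
    printf("%d %d\n", mask_num_vals_a(0x2A), mask_num_vals_b(0x2A));
    return 0;
}
Compiling each variant with gcc -S -O2 (or with whatever flags the real build uses) and diffing the two .s files shows directly whether the operand order changes the generated code.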

How to convert gcc -S output to C code?

I have a program written in C which does a lot of processing on a 2D array, and I would like to optimize it with regard to function calls and cache-line misses.
gcc -O3 performs really well, but I would like to see what it actually does, or at least get an idea of it, to help me do the same in the C code.
gcc -O3 -S outputs assembly, and I was wondering whether any of you know a way to get it back to C.
Or is there some other way to see what the code looks like after -O3 optimization? Or at least a glimpse of it?

GCC optimization differences in recursive functions using globals

The other day I ran into a weird problem using GCC and the '-Ofast' optimization flag. I compiled the program below using 'gcc -Ofast -o fib1 fib1.c'.
#include <stdio.h>
int f1(int n) {
    if (n < 2) {
        return n;
    }
    int a, b;
    a = f1(n - 1);
    b = f1(n - 2);
    return a + b;
}

int main(){
    printf("%d", f1(40));
}
When measuring execution time, the result is:
peter@host ~ $ time ./fib1
102334155
real 0m0.511s
user 0m0.510s
sys 0m0.000s
Now let's introduce a global variable in our program and compile again using 'gcc -Ofast -o fib2 fib2.c'.
#include <stdio.h>
int global;
int f1(int n) {
    if (n < 2) {
        return n;
    }
    int a, b;
    a = f1(n - 1);
    b = f1(n - 2);
    global = 0;
    return a + b;
}

int main(){
    printf("%d", f1(40));
}
Now the execution time is:
peter@host ~ $ time ./fib2
102334155
real 0m0.265s
user 0m0.265s
sys 0m0.000s
The new global variable does not do anything meaningful. However, the difference in execution time is considerable.
Apart from the question of (1) what the reason for this behavior is, it would also be nice to know whether (2) the better performance can be achieved without introducing meaningless variables. Any suggestions?
Thanks
Peter
I believe you hit some very clever and very weird gcc (mis-?)optimization. That's about as far as I got in researching this.
I modified your code to have an #ifdef G around the global:
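A sketch of that modified source, reconstructed from the two versions in the question (building with -DG gives the fib2 behavior):
#include <stdio.h>

#ifdef G
int global;
#endif

int f1(int n) {
    if (n < 2) {
        return n;
    }
    int a, b;
    a = f1(n - 1);
    b = f1(n - 2);
#ifdef G
    global = 0;          /* the "meaningless" store from fib2.c */
#endif
    return a + b;
}

int main(){
    printf("%d", f1(40));
}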
$ cc -O3 -o foo foo.c && time ./foo
102334155
real 0m0.634s
user 0m0.631s
sys 0m0.001s
$ cc -O3 -DG -o foo foo.c && time ./foo
102334155
real 0m0.365s
user 0m0.362s
sys 0m0.001s
So I have the same weird performance difference.
When in doubt, read the generated assembler.
$ cc -S -O3 -o foo.s -S foo.c
$ cc -S -DG -O3 -o foog.s -S foo.c
Here it gets truly bizarre. Normally I can follow gcc-generated code pretty easily. The code that got generated here is just incomprehensible. What should be pretty straightforward recursion and addition that should fit in 15-20 instructions, gcc expanded to a several hundred instructions with a flurry of shifts, additions, subtractions, compares, branches and a large array on the stack. It looks like it tried to partially convert one or both recursions into an iteration and then unrolled that loop. One thing struck me though, the non-global function had only one recursive call to itself (the second one is the call from main):
$ grep 'call.*f1' foo.s | wc
2 4 18
While the global one had:
$ grep 'call.*f1' foog.s | wc
33 66 297
My educated guess (I've seen this many times before)? Gcc tried to be clever, and in its fervor the function that in theory should be easier to optimize ended up with worse code, while the write to the global variable confused it just enough that it couldn't optimize as aggressively, which led to better code. This happens all the time: many optimizations that gcc uses (and other compilers too, let's not single them out) are very specific to certain benchmarks and might not generate faster-running code in many other cases. In fact, from experience I only ever use -O2 unless I've benchmarked things very carefully to see that -O3 makes sense. It very often doesn't.
If you really want to research this further, I'd recommend reading the gcc documentation about which optimizations are enabled by -O3 but not by -O2, then trying them one by one until you find the one that causes this behavior; that optimization should be a pretty good hint for what's going on. I was about to do this myself, but I ran out of time (last-minute Christmas shopping).
On my machine (gcc (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010) I've got this:
time ./fib1 0,36s user 0,00s system 98% cpu 0,364 total
time ./fib2 0,20s user 0,00s system 98% cpu 0,208 total
From man gcc:
-Ofast
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.
Not such a safe option, so let's try -O2:
time ./fib1 0,38s user 0,00s system 99% cpu 0,377 total
time ./fib2 0,47s user 0,00s system 99% cpu 0,470 total
I think that with -Ofast some of the aggressive optimizations weren't applied to fib1 but were applied to fib2. When I switched -Ofast for -O2, some optimizations weren't applied to fib2 but were applied to fib1.
Let's try -O0:
time ./fib1 0,81s user 0,00s system 99% cpu 0,812 total
time ./fib2 0,81s user 0,00s system 99% cpu 0,814 total
They are equal without optimizations.
So introducing a global variable into a recursive function can break some optimizations on the one hand, and enable other optimizations on the other hand.
This results from inline limits kicking in earlier in the second version, because the version with the global variable does more per call. That strongly suggests that inlining makes run-time performance worse in this particular example.
Compile both versions with -Ofast -fno-inline and the difference in time is gone. In fact, the version without the global variable runs faster.
Alternatively, just mark the function with __attribute__((noinline)).
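For example, applied to the version without the global (a sketch; the global version is the same apart from the extra store):
#include <stdio.h>

/* __attribute__((noinline)) asks GCC not to inline f1, so the
   recursive calls stay as real calls in both variants */
__attribute__((noinline))
int f1(int n) {
    if (n < 2) {
        return n;
    }
    int a, b;
    a = f1(n - 1);
    b = f1(n - 2);
    return a + b;
}

int main(){
    printf("%d", f1(40));
}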

Why is this NodeJS 2x faster than native C?

For the sake of a presentation at work, I wanted to compare the performance of NodeJS to C. Here is what I wrote:
Node.js (for.js):
var d = 0.0,
    start = new Date().getTime();
for (var i = 0; i < 100000000; i++)
{
    d += i >> 1;
}
var end = new Date().getTime();
console.log(d);
console.log(end - start);
C (for.c)
#include <stdio.h>
#include <time.h>
int main () {
    clock_t start = clock();
    long d = 0.0;
    for (long i = 0; i < 100000000; i++)
    {
        d += i >> 1;
    }
    clock_t end = clock();
    clock_t elapsed = (end - start) / (CLOCKS_PER_SEC / 1000);
    printf("%ld\n", d);
    printf("%lu\n", elapsed);
}
Using GCC I compiled my for.c and ran it:
gcc for.c
./a.out
Results:
2499999950000000
198
Then I tried it in NodeJS:
node for.js
Results:
2499999950000000
116
After running numerous times, I discovered this held true no matter what. If I switched for.c to use a double instead of a long in the loop, the time C took was even longer!
Not trying to start a flame war, but why is Node.JS (116 ms.) so much faster than native C (198 ms.) at performing this same operation? Is Node.JS applying an optimization that GCC does not do out of the box?
EDIT:
Per a suggestion in the comments, I ran gcc -Wall -O2 for.c. The result improved to C taking 29 ms. This raises the question: how is it that the default C settings are not optimized as much as a JavaScript engine's? Also, what are -Wall and -O2 doing? I'm really curious about the details of what is going on here.
How is it that the default C settings are not optimized as much as a JavaScript engine's?
C is statically compiled and linked, which can require a lengthy build step over your entire codebase (I once worked on one that took almost an hour for a full optimized build, but only 10 minutes otherwise), and it is a dangerous, hardware-level language that risks a lot of undefined behavior if you don't treat it with care. So the default settings of compilers usually don't optimize to smithereens: the default is a developer/debug build intended to help with debugging and productivity through faster turnaround.
So in C you get a distinct separation between an unoptimized but faster-to-build, easier-to-debug developer/debug build and a very optimized, slower-to-build, harder-to-debug production/release build that runs really fast, and the default settings of compilers often favor the former.
With something like v8/NodeJS, you're dealing with a just-in-time compiler (dynamic compilation) that builds and optimizes only the necessary code on the fly at run-time. On top of that, JS is a much safer language, often designed with security in mind, that doesn't allow you to work with the raw bits and bytes of the hardware.
As a result, it doesn't need that kind of strong release/debug build distinction of a native, statically-compiled language like C/C++. But it also doesn't let you put the pedal to the metal as you can in C if you really want.
A lot of people trying to benchmark C/C++ coming from other languages often fail to understand this build distinction and the importance of compiler/linker optimization settings and get confused. As you can see, with the proper settings, it's hard to beat the performance of these native compilers and languages that allow you to write really low-level code.
Adding the register keyword helps, as expected:
#include <stdio.h>
#include <time.h>
int main () {
    register long i, d;
    clock_t start = clock();
    i = d = 0L;
    for (i = 0; i < 100000000L; i++) {
        d += i >> 1;
    }
    clock_t end = clock();
    clock_t elapsed = (end - start) / (CLOCKS_PER_SEC / 1000);
    printf("%ld\n", d);
    printf("%lu\n", elapsed);
}
and compile with the C compiler
cc for.c -o for
./for ; node for.js
returns
2499999950000000
97
2499999950000000
222
I've also done some testing calculating prime numbers, and I've found that Node.js is about twice as fast as C (see here).
When you have a very simple counting type of loop, the -O2 optimization can sometimes convert the output to a simple formula without even iterating the loop. See Karl's blog for an explanation. If you add something more complicated to the routine, it is likely node.js will be faster again. For example, I added a divisor term to your sample program, and the C -O2 optimization was no longer able to convert it to a simple formula, so node.js became faster again.
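To illustrate what such a formula looks like, here is a sketch of the closed form for this particular loop (summing i >> 1 for i from 0 to 99999999; it assumes a 64-bit long, as in the original run):
#include <stdio.h>

int main(void) {
    long n = 100000000L;
    /* sum of (i >> 1) for i = 0 .. n-1, with n even: each pair (2k, 2k+1)
       contributes k + k = 2k, so the total is (n/2) * (n/2 - 1) */
    long d = (n / 2) * (n / 2 - 1);
    printf("%ld\n", d);   /* prints 2499999950000000, the same value the loop prints */
    return 0;
}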
I am still baffled as to how node.js can be faster than C at simple integer calculations, but in every test I've done so far it is faster. I've also performed some tests with bitwise calculations that I haven't posted yet, and still node.js was faster.
node.js and C are distinct in that node.js runs JavaScript through the V8 just-in-time compiler, whereas C is compiled ahead of time to machine language. As such, the two are handled differently. For node.js, you simply run the .js file. C is much different: when you compile code to machine language using GCC, you must supply compiler optimization settings, aka "flags." Your node.js program is actually slower than the C program if you specify the -O3 flag to GCC. In fact, the node.js program took twice as long for me as the C program did. You stated that you would like to know more about what the C compiler does to optimize code. This is a complex topic/field, and I highly recommend reading the Wikipedia article on compiler optimization to learn more.
In short, you made an unfair comparison because you did not optimize your C code.

gcc -O1 optimization slower than -O0

I've been rewriting a program that is a mix of Fortran and C, which is around 10k lines of random particle simulation. However, I recently realised that Release mode was running a lot slower than Debug.
Debug (-O0): 23 seconds. Release (-O1 or -O2): 43 seconds. This was for a small test run.
This only changes when the C optimization settings are changed within Project->Properties->C/C++ Build->GCC C Compiler->Optimization Level, and is unaffected by the GNU Fortran compiler settings.
Looking into this, it seems that -O1 and -O2 run slowly, while -O0 runs a lot faster. Even with all the optimization flags set manually (from the GCC docs), it still runs faster than with -O1.
It may be that running in Release produces different results, which cause extra computations to be made (values ending up outside their expected ranges, etc.). Would that be likely? And if so, would it be possible to change the behaviour back to match the original Debug settings?
Any help would be appreciated, let me know if you need more information to help.
Chris.
Edit: System Information:
Windows 8.1 Pro
GCC version 4.8.1
Eclipse for Parallel Application Developers, Juno SR2
Okay, it turns out that this occurs with gcc 4.8.1 but not with later versions (it is fixed in 4.8.3). For both the equivalent C and Fortran code, using -O1 or higher caused the compiler to optimise the code incorrectly.
To simplify, the incorrect result comes from the lines:
a = b - c;
c = c + a;
Substituting a gives c = c + (b - c), which is the same as c = b + (c - c) and so should reduce to c = b. Under optimisation, however, it seems to actually evaluate c + (b - c) in floating point rather than reducing it, which creates a very slight inaccuracy (-0.0999999978 compared to -0.1000000015).
In this case, a loop that should have run 2 times instead ran 1500 times, and this happened over 500 times across the run.
The way it was (accidentally) fixed was to force the compiler to calculate a = b - c before running c = c + a. This can be done with a printf("%f\n", a); or a c = c + 0; between the calculation of a and the update of c (both seemed to work). Note that this line never actually has to run: you can put it inside an if statement that is never true (e.g. if (a == INT_MAX)), but not somewhere the compiler knows will never run (e.g. if (false)).
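A minimal sketch of that workaround, with b, c, and the update function as hypothetical stand-ins for the real simulation variables:
#include <limits.h>
#include <stdio.h>

/* hypothetical stand-in for one step of the simulation described above */
float update(float b, float c)
{
    float a = b - c;
    if (a == INT_MAX)          /* never true for these values, but the   */
        printf("%f\n", a);     /* compiler must still compute a first    */
    c = c + a;
    return c;
}

int main(void)
{
    printf("%f\n", update(0.3f, 0.4f));
    return 0;
}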
At least these are my, and my supervisor's, current thoughts on the matter. I'm still not exactly sure what part of the optimisation causes this, or what fixed it in the most recent versions, but I hope this can be helpful for someone else and save them a week of confusion.
Thanks for the help anyway. - Chris.
