GCC optimization differences in recursive functions using globals - c

The other day I ran into a weird problem using GCC and the '-Ofast' optimization flag. I compiled the program below using 'gcc -Ofast -o fib1 fib1.c':
#include <stdio.h>

int f1(int n) {
    if (n < 2) {
        return n;
    }
    int a, b;
    a = f1(n - 1);
    b = f1(n - 2);
    return a + b;
}

int main() {
    printf("%d", f1(40));
}
When measuring execution time, the result is:
peter@host ~ $ time ./fib1
102334155
real 0m0.511s
user 0m0.510s
sys 0m0.000s
Now let's introduce a global variable in our program and compile again using 'gcc -Ofast -o fib2 fib2.c'.
#include <stdio.h>

int global;

int f1(int n) {
    if (n < 2) {
        return n;
    }
    int a, b;
    a = f1(n - 1);
    b = f1(n - 2);
    global = 0;
    return a + b;
}

int main() {
    printf("%d", f1(40));
}
Now the execution time is:
peter@host ~ $ time ./fib2
102334155
real 0m0.265s
user 0m0.265s
sys 0m0.000s
The new global variable does not do anything meaningful. However, the difference in execution time is considerable.
Apart from (1) what the reason for this behavior is, it would also be nice to know (2) whether the better performance can be achieved without introducing a meaningless variable. Any suggestions?
Thanks
Peter

I believe you hit some very clever and very weird gcc (mis-?)optimization. That's about as far as I got in researching this.
I modified your code to have an #ifdef G around the global:
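Presumably something like this (my reconstruction, since the modified file isn't shown; it folds fib1 and fib2 into one source guarded by -DG):

#include <stdio.h>

#ifdef G
int global;
#endif

int f1(int n) {
    if (n < 2) {
        return n;
    }
    int a, b;
    a = f1(n - 1);
    b = f1(n - 2);
#ifdef G
    global = 0;   /* the "meaningless" global store from the question */
#endif
    return a + b;
}

int main() {
    printf("%d", f1(40));
}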
$ cc -O3 -o foo foo.c && time ./foo
102334155
real 0m0.634s
user 0m0.631s
sys 0m0.001s
$ cc -O3 -DG -o foo foo.c && time ./foo
102334155
real 0m0.365s
user 0m0.362s
sys 0m0.001s
So I have the same weird performance difference.
When in doubt, read the generated assembler.
$ cc -S -O3 -o foo.s foo.c
$ cc -S -DG -O3 -o foog.s foo.c
Here it gets truly bizarre. Normally I can follow gcc-generated code pretty easily. The code that got generated here is just incomprehensible. What should be pretty straightforward recursion and addition, fitting in 15-20 instructions, gcc expanded into several hundred instructions with a flurry of shifts, additions, subtractions, compares, branches, and a large array on the stack. It looks like it tried to partially convert one or both recursions into an iteration and then unrolled that loop.
One thing struck me though: the non-global function had only one recursive call to itself (the second one is the call from main):
$ grep 'call.*f1' foo.s | wc
2 4 18
While the global one had:
$ grep 'call.*f1' foog.s | wc
33 66 297
My educated guess (I've seen this many times before)? GCC tried to be clever, and in its fervor generated worse code for the function that in theory should be easier to optimize, while the write to the global variable confused it enough that it couldn't optimize as hard, which led to better code. This happens all the time; many optimizations that gcc uses (and other compilers too, let's not single anyone out) are very specific to certain benchmarks and might not generate faster-running code in many other cases. In fact, from experience I only ever use -O2 unless I've benchmarked things very carefully to confirm that -O3 makes sense. It very often doesn't.
If you really want to research this further, I'd recommend reading the gcc documentation about which optimizations get enabled with -O3 as opposed to -O2 (which doesn't show this behavior), then trying them one by one until you find the one that causes it; that optimization should be a pretty good hint for what's going on. I was about to do this, but I ran out of time (must do last-minute Christmas shopping).
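A convenient way to enumerate the candidate flags (a sketch; -Q --help=optimizers is a real gcc option, the output file names are mine):

$ gcc -O2 -Q --help=optimizers > o2.txt
$ gcc -O3 -Q --help=optimizers > o3.txt
$ diff o2.txt o3.txt

Then recompile with -O2 plus each -f flag that differs, one at a time, until the slowdown reproduces.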

On my machine (gcc (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010) I've got this:
time ./fib1 0,36s user 0,00s system 98% cpu 0,364 total
time ./fib2 0,20s user 0,00s system 98% cpu 0,208 total
From man gcc:
-Ofast
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.
Not such a safe option, so let's try -O2:
time ./fib1 0,38s user 0,00s system 99% cpu 0,377 total
time ./fib2 0,47s user 0,00s system 99% cpu 0,470 total
I think some aggressive optimizations were applied to fib2 but not to fib1. When I switched from -Ofast to -O2, it was the other way around: some optimizations were applied to fib1 but not to fib2.
Let's try -O0:
time ./fib1 0,81s user 0,00s system 99% cpu 0,812 total
time ./fib2 0,81s user 0,00s system 99% cpu 0,814 total
They are equal without optimizations.
So introducing a global variable in a recursive function can break some optimizations on the one hand and enable other optimizations on the other.

This results from inline limits kicking in earlier in the second version, because the version with the global variable does more. That strongly suggests that inlining makes run-time performance worse in this particular example.
Compile both versions with -Ofast -fno-inline and the difference in time is gone. In fact, the version without the global variable runs faster.
Alternatively, just mark the function with __attribute__((noinline)).
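Something like this (a sketch applied to the function from the question; __attribute__((noinline)) is GCC syntax):

__attribute__((noinline))
int f1(int n) {
    /* body unchanged from the question */
    if (n < 2) {
        return n;
    }
    int a, b;
    a = f1(n - 1);
    b = f1(n - 2);
    return a + b;
}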

Related

Why is my program performing differently based only on the order I give source file operands to Clang?

I have a Brainfuck interpreter project with two source files. Altering the order in which the source files are given as operands to Clang, and nothing else, results in consistent performance differences.
I am using Clang, with the following arguments:
clang -I../ext -D VERSION=\"1.0.0\" main.c lex.c
clang -I../ext -D VERSION=\"1.0.0\" lex.c main.c
Performance differences are seen regardless of optimisation level.
Benchmark results:
-O0 lex before main: 13.68s, main before lex: 13.02s
-O1 lex before main: 6.91s, main before lex: 6.65s
-O2 lex before main: 7.58s, main before lex: 7.50s
-O3 lex before main: 6.25s, main before lex: 7.40s
Which order performs worse is not always consistent between the optimisation levels, but for each level, the same operand order always performs worse than the other.
Notes:
Source code can be found here.
The mandelbrot benchmark that I am using with the interpreter can be found here.
Edits:
The executable files for each optimisation level are exactly the same size, but are structured differently.
Object files are identical with either operand order.
The I/O and parsing process is indistinguishably quick regardless of operand order; even running a 500 MiB random file through it results in no variation, so the performance variation must be occurring in the run loop.
Upon comparing the objdump of each executable, it appears to me that the primary, if not the only, difference is the order of the sections, and the memory addresses that have changed because of this (a sketch of how to reproduce this comparison follows after these notes).
The objdumps can be found here.
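For reference, the comparison can be reproduced along these lines (a sketch; the -o output names are mine, and objdump -h prints the section headers whose order differs):

$ clang -I../ext -D VERSION=\"1.0.0\" main.c lex.c -o mainlex
$ clang -I../ext -D VERSION=\"1.0.0\" lex.c main.c -o lexmain
$ objdump -h mainlex > mainlex.sections
$ objdump -h lexmain > lexmain.sections
$ diff mainlex.sections lexmain.sections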
I don't have a complete answer, but I think I know what's causing the difference between the two link orders.
First, I got similar results. I'm using gcc on Cygwin. Some sample runs:
Building like this:
$ gcc -I../ext -D VERSION=\"1.0.0\" main.c lex.c -O3 -o mainlex
$ gcc -I../ext -D VERSION=\"1.0.0\" lex.c main.c -O3 -o lexmain
Then running (multiple times to confirm, but here's a sample run)
$ time ./mainlex.exe input.txt > /dev/null
real 0m7.377s
user 0m7.359s
sys 0m0.015s
$ time ./lexmain.exe input.txt > /dev/null
real 0m6.945s
user 0m6.921s
sys 0m0.000s
Then I noticed these declarations:
static char arr[30000] = { 0 }, *ptr = arr;
static tok_t **dat; static size_t cap, top;
And that caused me to recognize that a 30K zero-initialized array is getting inserted into the linkage of the program. That might introduce a page-load hit, and the link ordering might influence whether code in main is within the same page as the functions in lex. Or just accessing the array means jumping to a page that isn't in the cache anymore. Or some combination thereof. It was just a hypothesis, not a theory.
So I moved the declarations of these globals directly into main and dropped the static qualifier, keeping the zero-initialization on the variables.
int main(int argc, char *argv[]) {
    char arr[30000] = { 0 }, *ptr = arr;
    tok_t **dat = NULL; size_t cap = 0, top = 0;
That will certainly shrink the object code and binary size by 30K and the stack allocation should be near instant.
I get near identical perf when I run it both ways. As a matter of fact, both builds run faster.
$ time ./mainlex.exe input.txt > /dev/null
real 0m6.385s
user 0m6.359s
sys 0m0.015s
$ time ./lexmain.exe input.txt > /dev/null
real 0m6.353s
user 0m6.343s
sys 0m0.015s
I'm not an expert on page sizes, code paging, or even how the linker and loader operate. But I do know that global variables, including that 30K array, are expanded into the object code directly (thus increasing the object code size itself) and are effectively part of the binary's final image. And smaller code is often faster code.
That 30K buffer in global space might be introducing a sufficiently large number of bytes between the functions in lex, main, and the C runtime itself to affect the way code gets paged in and out. Or it might just be causing the loader to take longer to load the binary.
In other words, globals cause code bloat and increase object size. By moving the array declaration to the stack, the memory allocation is near instant. And now the linkage of lex and main probably fits within the same page in memory. Further, because the variables are on the stack, the compiler can likely take more liberty with optimizations.
So in other words, I think I found the root cause, but I'm not 100% sure as to why. There are not a whole lot of function invocations being made, so it's not as if the instruction pointer is jumping around a lot between code in lex.o and code in main.o such that the cache has to keep reloading pages.
A better test might be to find a much larger input file that triggers a longer run. That way, we can see if the runtime delta is fixed or linear between the two original builds.
Any more insight will require doing some actual code profiling, instrumentation, or binary analysis.

Same object files with different performance

I am using an ARM Cortex-A9 platform to measure the performance of some algorithms. More specifically, I measure the execution time of one algorithm using the clock() function (time.h). I call the latter function right before calling my algorithm and right after the algorithm returns.
....
....
....
start=clock();
alg();
end=clock();
...
...
...
Then I compile the code with exactly the same options and produce two different object files. The first one is named n and the second one nn. On the ARM platform I run my code on one core; all the other tasks' affinity is set to the other cores. Object file n returns 0.12 sec while object file nn returns 0.10 sec. I compared the two binary files and found no differences. I noticed that if I give the object file a name longer than one letter, then I always get a lower execution time for my algorithm. Moreover, if I run the n.c file and then rename it and run it again, I also get different performance numbers.
Could you please give me some ideas why something like this happens? Thanks in advance
P.S.1: I am using the gcc 4.8.1 cross compiler.
P.S.2: I compile my code with
arm-none-linux-gnueabi-gcc -mthumb -march=armv7-a -mtune=cortex-a9 -mcpu=cortex-a9 -mfloat-abi=softfp -mfpu=neon -Ofast code.c -o n
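For reference, a self-contained version of the measurement pattern described in the question (a sketch; alg() is a placeholder for the algorithm under test):

#include <stdio.h>
#include <time.h>

void alg(void);   /* the algorithm being measured */

int main(void) {
    clock_t start = clock();
    alg();
    clock_t end = clock();
    /* clock() counts CPU time; divide by CLOCKS_PER_SEC to get seconds */
    printf("%f s\n", (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}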

Is there are flag in gcc compiler for pthread code to minimize execution time?

I am writing pthread code in C and using the gcc compiler. I have implemented the code with pthread condition variables, mutex locks, and semaphores. Is there any flag or option in gcc to enhance the execution time?
The program is written to solve this problem.
the gcc manpage reveals:
-O
-O1 Optimize.
Optimizing compilation takes somewhat more time, and a lot more
memory for a large function. With -O, the compiler tries to reduce
code size and execution time, without performing any optimizations
that take a great deal of compilation time.
-O2 Optimize even more.
GCC performs nearly all supported optimizations that do not involve a
space-speed tradeoff. As compared to -O, this option increases both
compilation time and the performance of the generated code.
-O3 Optimize yet more.
-O3 turns on all optimizations specified by -O2 and also turns on the
-finline-functions, -funswitch-loops, -fpredictive-commoning,
-fgcse-after-reload, -ftree-vectorize and -fipa-cp-clone options.
So if you want your code to run faster ("minimize execution time"), a good start is to use -O3.
Since the optimizations are generic, you will have to do a lot of benchmarking to get the best results for a given piece of code.
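For example, assuming a single source file prog.c (the file name is mine; -pthread is the standard gcc option for POSIX-threads code and is independent of the -O levels):

$ gcc -O3 -pthread prog.c -o prog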

OCaml MicroBenchmark

I am trying a basic microbenchmark comparison of C with OCaml. I have heard that for the Fibonacci program, C and OCaml are about the same, but I can't replicate those results. I compile the C code with gcc -O3 fib.c -o c-code, and compile the OCaml code with ocamlopt -o ocaml-code fibo.ml. I time them using time ./c-code and time ./ocaml-code. Every time I do this, OCaml takes 0.10 seconds whereas the C code takes about 0.03 seconds. Besides the fact that this is a naive benchmark, is there a way to make the OCaml faster? Can anyone see what the times on their computers are?
C
#include <stdio.h>

int fibonacci(int n)
{
    return n < 3 ? 1 : fibonacci(n - 1) + fibonacci(n - 2);
}

int main(void)
{
    printf("%d", fibonacci(34));
    return 0;
}
OCaml
let rec fibonacci n = if n < 3 then 1 else fibonacci(n-1) + fibonacci(n-2);;
print_int(fibonacci 34);;
The ML version already beats the C version when compiled with gcc -O2, which I think is a pretty decent job. Looking at the assembly generated by gcc -O3, it looks like gcc is doing some aggressive inlining and loop unrolling. To make the OCaml code faster, I think you would have to rewrite it, but you should focus on higher-level abstractions instead.
I think this is just OCaml overhead; it would be more relevant to compare with a larger program.
You can use the -S option to produce assembly output, along with -verbose to see how ocaml calls external applications (gcc). Additionally, compiling with the -p option and running your application through gprof will help determine whether this is overhead from OCaml or something you can actually improve.
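Concretely, something like this should work (a sketch; -S, -verbose and -p are documented ocamlopt options, the file names are from the question):

$ ocamlopt -S -verbose -o ocaml-code fibo.ml   # keeps fibo.s and prints the external gcc/as invocations
$ ocamlopt -p -o ocaml-code fibo.ml            # build with profiling support
$ ./ocaml-code && gprof ./ocaml-code gmon.out  # running it writes gmon.out for gprof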
Cheers.
On my computer I get the following:
ocaml - 0.035 (std-dev=0.02; 10 trials)
c - 0.027 (std-dev=0.03; 10 trials)

How many GCC optimization levels are there?

How many GCC optimization levels are there?
I tried gcc -O1, gcc -O2, gcc -O3, and gcc -O4
If I use a really large number, it won't work.
However, I have tried
gcc -O100
and it compiled.
How many optimization levels are there?
To be pedantic, there are 8 different valid -O options you can give to gcc, though there are some that mean the same thing.
The original version of this answer stated there were 7 options. GCC has since added -Og to bring the total to 8.
From the man page:
-O (Same as -O1)
-O0 (do no optimization, the default if no optimization level is specified)
-O1 (optimize minimally)
-O2 (optimize more)
-O3 (optimize even more)
-Ofast (optimize very aggressively to the point of breaking standard compliance)
-Og (Optimize debugging experience. -Og enables optimizations that do not interfere with debugging. It should be the
optimization level of choice for the standard edit-compile-debug cycle, offering a reasonable level of optimization
while maintaining fast compilation and a good debugging experience.)
-Os (Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations
designed to reduce code size.
-Os disables the following optimization flags: -falign-functions -falign-jumps -falign-loops -falign-labels -freorder-blocks -freorder-blocks-and-partition -fprefetch-loop-arrays -ftree-vect-loop-version)
There may also be platform-specific optimizations; as @pauldoo notes, OS X has -Oz.
Let's interpret the source code of GCC 5.1
We'll try to understand what happens on -O100, since it is not clear on the man page.
We shall conclude that:
anything above -O3 up to INT_MAX is the same as -O3, but that could easily change in the future, so don't rely on it.
GCC 5.1 runs into C undefined behavior if you enter integers larger than INT_MAX.
the argument can only contain digits, or it fails gracefully. In particular, this excludes negative integers like -O-1.
Focus on subprograms
First remember that GCC is just a front-end for cpp, as, cc1, collect2. A quick ./XXX --help says that only collect2 and cc1 take -O, so let's focus on them.
And:
gcc -v -O100 main.c |& grep 100
gives:
COLLECT_GCC_OPTIONS='-O100' '-v' '-mtune=generic' '-march=x86-64'
/usr/local/libexec/gcc/x86_64-unknown-linux-gnu/5.1.0/cc1 [[noise]] hello_world.c -O100 -o /tmp/ccetECB5.
so -O was forwarded to both cc1 and collect2.
O in common.opt
common.opt is a GCC specific CLI option description format described in the internals documentation and translated to C by opth-gen.awk and optc-gen.awk.
It contains the following interesting lines:
O
Common JoinedOrMissing Optimization
-O<number> Set optimization level to <number>
Os
Common Optimization
Optimize for space rather than speed
Ofast
Common Optimization
Optimize for speed disregarding exact standards compliance
Og
Common Optimization
Optimize for debugging experience rather than speed or size
which specify all the O options. Note how -O<n> is in a separate family from the other Os, Ofast and Og.
When we build, this generates a options.h file that contains:
OPT_O = 139, /* -O */
OPT_Ofast = 140, /* -Ofast */
OPT_Og = 141, /* -Og */
OPT_Os = 142, /* -Os */
As a bonus, while we are grepping for \bO\n inside common.opt we notice the lines:
-optimize
Common Alias(O)
which teaches us that --optimize (double dash because it starts with a dash -optimize on the .opt file) is an undocumented alias for -O which can be used as --optimize=3!
Where OPT_O is used
Now we grep:
git grep -E '\bOPT_O\b'
which points us to two files:
opts.c
lto-wrapper.c
Let's first track down opts.c
opts.c:default_options_optimization
All opts.c usages happen inside: default_options_optimization.
We grep backwards to see who calls this function, and find that the only code path is:
main.c:main
toplev.c:toplev::main
opts-global.c:decode_opts
opts.c:default_options_optimization
and main.c is the entry point of cc1. Good!
The first part of this function:
does integral_argument which calls atoi on the string corresponding to OPT_O to parse the input argument
stores the value inside opts->x_optimize, where opts is a struct gcc_options.
struct gcc_options
After grepping in vain, we notice that this struct is also generated at options.h:
struct gcc_options {
int x_optimize;
[...]
}
where x_optimize comes from the lines:
Variable
int optimize
present in common.opt. We also notice that options.c contains:
struct gcc_options global_options;
so we guess that this is what contains the entire configuration global state, and int x_optimize is the optimization value.
255 is an internal maximum
in opts.c:integral_argument, atoi is applied to the input argument, so INT_MAX is an upper bound. And if you put anything larger, it seems that GCC runs into C undefined behavior. Ouch?
integral_argument also thinly wraps atoi and rejects the argument if any character is not a digit. So negative values fail gracefully.
Back to opts.c:default_options_optimization, we see the line:
if ((unsigned int) opts->x_optimize > 255)
opts->x_optimize = 255;
so that the optimization level is truncated to 255. While reading opth-gen.awk I had come across:
# All of the optimization switches gathered together so they can be saved and restored.
# This will allow attribute((cold)) to turn on space optimization.
and on the generated options.h:
struct GTY(()) cl_optimization
{
unsigned char x_optimize;
which explains the truncation: the options must also be forwarded to cl_optimization, which uses a char to save space. So 255 really is an internal maximum.
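To make the parse-and-clamp path concrete, here is a standalone sketch (my own code, not GCC's):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int level = atoi("100");           /* roughly what integral_argument does with "-O100" */
    if ((unsigned int) level > 255)    /* the clamp from default_options_optimization */
        level = 255;
    printf("effective level: %d\n", level);  /* prints 100; "-O300" would give 255 */
    return 0;
}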
opts.c:maybe_default_options
Back to opts.c:default_options_optimization, we come across maybe_default_options which sounds interesting. We enter it, and then maybe_default_option where we reach a big switch:
switch (default_opt->levels)
{
[...]
case OPT_LEVELS_1_PLUS:
enabled = (level >= 1);
break;
[...]
case OPT_LEVELS_3_PLUS:
enabled = (level >= 3);
break;
There are no >= 4 checks, which indicates that 3 is the largest possible.
Then we search for the definition of OPT_LEVELS_3_PLUS in common-target.h:
enum opt_levels
{
OPT_LEVELS_NONE, /* No levels (mark end of array). */
OPT_LEVELS_ALL, /* All levels (used by targets to disable options
enabled in target-independent code). */
OPT_LEVELS_0_ONLY, /* -O0 only. */
OPT_LEVELS_1_PLUS, /* -O1 and above, including -Os and -Og. */
OPT_LEVELS_1_PLUS_SPEED_ONLY, /* -O1 and above, but not -Os or -Og. */
OPT_LEVELS_1_PLUS_NOT_DEBUG, /* -O1 and above, but not -Og. */
OPT_LEVELS_2_PLUS, /* -O2 and above, including -Os. */
OPT_LEVELS_2_PLUS_SPEED_ONLY, /* -O2 and above, but not -Os or -Og. */
OPT_LEVELS_3_PLUS, /* -O3 and above. */
OPT_LEVELS_3_PLUS_AND_SIZE, /* -O3 and above and -Os. */
OPT_LEVELS_SIZE, /* -Os only. */
OPT_LEVELS_FAST /* -Ofast only. */
};
Ha! This is a strong indicator that there are only 3 levels.
opts.c:default_options_table
opt_levels is so interesting that we grep OPT_LEVELS_3_PLUS and come across opts.c:default_options_table:
static const struct default_options default_options_table[] = {
/* -O1 optimizations. */
{ OPT_LEVELS_1_PLUS, OPT_fdefer_pop, NULL, 1 },
[...]
/* -O3 optimizations. */
{ OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
[...]
}
so this is where the -On to specific optimization mapping mentioned in the docs is encoded. Nice!
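As a usage note, any mapping in that table can be overridden individually from the command line with the corresponding -fno- flag, e.g. for the -O3 entry shown above:

$ gcc -O3 -fno-tree-loop-distribute-patterns main.c

which keeps the rest of -O3 but disables that one optimization.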
Assure that there are no more uses for x_optimize
The main usage of x_optimize was to set other specific optimization options like -fdefer_pop as documented on the man page. Are there any more?
We grep, and find a few more. The number is small, and upon manual inspection we see that every usage only does at most a x_optimize >= 3, so our conclusion holds.
lto-wrapper.c
Now we go for the second occurrence of OPT_O, which was in lto-wrapper.c.
LTO means Link Time Optimization, which, as the name suggests, is going to need an -O option, and it will be linked to collect2 (which is basically a linker).
In fact, the first line of lto-wrapper.c says:
/* Wrapper to call lto. Used by collect2 and the linker plugin.
In this file, the OPT_O occurrences seem to only normalize the value of O to pass it forward, so we should be fine.
Seven distinct levels:
-O0 (default): No optimization.
-O or -O1 (same thing): Optimize, but do not spend too much time.
-O2: Optimize more aggressively
-O3: Optimize most aggressively
-Ofast: Equivalent to -O3 -ffast-math. -ffast-math triggers non-standards-compliant floating point optimizations. This allows the compiler to pretend that floating point numbers are infinitely precise, and that algebra on them follows the standard rules of real number algebra. It also tells the compiler to tell the hardware to flush denormals to zero and treat denormals as zero, at least on some processors, including x86 and x86-64. Denormals trigger a slow path on many FPUs, and so treating them as zero (which does not trigger the slow path) can be a big performance win.
-Os: Optimize for code size. This can actually improve speed in some cases, due to better I-cache behavior.
-Og: Optimize, but do not interfere with debugging. This enables non-embarrassing performance for debug builds and is intended to replace -O0 for debug builds.
There are also other options that are not enabled by any of these, and must be enabled separately. It is also possible to use an optimization option, but disable specific flags enabled by this optimization.
For more information, see GCC website.
Four (0-3): See the GCC 4.4.2 manual. Anything higher is just -O3, but at some point you will overflow the variable size limit.
