How many GCC optimization levels are there?
I tried gcc -O1, gcc -O2, gcc -O3, and gcc -O4
I assumed that if I used a really large number, it wouldn't work.
However, I have tried
gcc -O100
and it compiled.
How many optimization levels are there?
To be pedantic, there are 8 different valid -O options you can give to gcc, though there are some that mean the same thing.
The original version of this answer stated there were 7 options. GCC has since added -Og to bring the total to 8.
From the man page:
-O (Same as -O1)
-O0 (do no optimization, the default if no optimization level is specified)
-O1 (optimize minimally)
-O2 (optimize more)
-O3 (optimize even more)
-Ofast (optimize very aggressively to the point of breaking standard compliance)
-Og (Optimize debugging experience. -Og enables optimizations that do not interfere with debugging. It should be the
optimization level of choice for the standard edit-compile-debug cycle, offering a reasonable level of optimization
while maintaining fast compilation and a good debugging experience.)
-Os (Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations
designed to reduce code size.
-Os disables the following optimization flags: -falign-functions -falign-jumps -falign-loops -falign-labels -freorder-blocks -freorder-blocks-and-partition -fprefetch-loop-arrays -ftree-vect-loop-version)
There may also be platform-specific optimizations; as @pauldoo notes, OS X has -Oz.
Let's interpret the source code of GCC 5.1
We'll try to understand what happens on -O100, since it is not clear on the man page.
We shall conclude that:
anything above -O3 up to INT_MAX is the same as -O3, but that could easily change in the future, so don't rely on it.
GCC 5.1 runs undefined behavior if you enter integers larger than INT_MAX.
the argument can only have digits, or it fails gracefully. In particular, this excludes negative integers like -O-1
Focus on subprograms
First remember that GCC is just a driver for cpp, as, cc1, and collect2. A quick ./XXX --help says that only collect2 and cc1 take -O, so let's focus on them.
And:
gcc -v -O100 main.c |& grep 100
gives:
COLLECT_GCC_OPTIONS='-O100' '-v' '-mtune=generic' '-march=x86-64'
/usr/local/libexec/gcc/x86_64-unknown-linux-gnu/5.1.0/cc1 [[noise]] hello_world.c -O100 -o /tmp/ccetECB5.
so -O was forwarded to both cc1 and collect2.
O in common.opt
common.opt is a GCC specific CLI option description format described in the internals documentation and translated to C by opth-gen.awk and optc-gen.awk.
It contains the following interesting lines:
O
Common JoinedOrMissing Optimization
-O<number> Set optimization level to <number>
Os
Common Optimization
Optimize for space rather than speed
Ofast
Common Optimization
Optimize for speed disregarding exact standards compliance
Og
Common Optimization
Optimize for debugging experience rather than speed or size
which specify all the O options. Note how -O<n> is in a separate family from the other Os, Ofast and Og.
When we build, this generates a options.h file that contains:
OPT_O = 139, /* -O */
OPT_Ofast = 140, /* -Ofast */
OPT_Og = 141, /* -Og */
OPT_Os = 142, /* -Os */
As a bonus, while we are grepping for \bO\n inside common.opt we notice the lines:
-optimize
Common Alias(O)
which teaches us that --optimize (double dash because the entry starts with a dash, -optimize, in the .opt file) is an undocumented alias for -O which can be used as --optimize=3!
Where OPT_O is used
Now we grep:
git grep -E '\bOPT_O\b'
which points us to two files:
opts.c
lto-wrapper.c
Let's first track down opts.c
opts.c:default_options_optimization
All usages in opts.c happen inside default_options_optimization.
We grep backwards to see who calls this function, and we see that the only code path is:
main.c:main
toplev.c:toplev::main
opts-global.c:decode_opts
opts.c:default_options_optimization
and main.c is the entry point of cc1. Good!
The first part of this function:
calls integral_argument, which uses atoi on the string corresponding to OPT_O to parse the input argument
stores the value inside opts->x_optimize, where opts is a struct gcc_options.
struct gcc_options
After grepping in vain, we notice that this struct is also generated at options.h:
struct gcc_options {
int x_optimize;
[...]
}
where x_optimize comes from the lines:
Variable
int optimize
present in common.opt, and options.c contains:
struct gcc_options global_options;
so we guess that this is what contains the entire configuration global state, and int x_optimize is the optimization value.
255 is an internal maximum
In opts.c:integral_argument, atoi is applied to the input argument, so INT_MAX is an upper bound. If you pass anything larger, it seems that GCC runs into C undefined behaviour. Ouch?
integral_argument also thinly wraps atoi and rejects the argument if any character is not a digit. So negative values fail gracefully.
Back to opts.c:default_options_optimization, we see the line:
if ((unsigned int) opts->x_optimize > 255)
opts->x_optimize = 255;
so that the optimization level is truncated to 255. While reading opth-gen.awk I had come across:
# All of the optimization switches gathered together so they can be saved and restored.
# This will allow attribute((cold)) to turn on space optimization.
and on the generated options.h:
struct GTY(()) cl_optimization
{
unsigned char x_optimize;
which explains the truncation: the options must also be forwarded to cl_optimization, which uses an unsigned char to save space. So 255 really is an internal maximum.
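To make the flow concrete, here is a minimal C sketch of the behaviour described so far (not GCC's actual code; parse_O_level and its layout are invented for illustration): digits-only parsing via atoi, then a clamp to 255 because the value must fit in the unsigned char of cl_optimization.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative re-creation of integral_argument plus the 255 clamp.
   Not GCC source: the function name and layout are invented. */
static int parse_O_level(const char *arg)
{
    for (const char *p = arg; *p; p++)
        if (!isdigit((unsigned char)*p))
            return -1;                 /* fail gracefully, e.g. -O-1 */
    int level = atoi(arg);             /* past INT_MAX: undefined behaviour */
    if ((unsigned int)level > 255)
        level = 255;                   /* the internal maximum */
    return level;
}

int main(void)
{
    printf("%d\n", parse_O_level("100"));  /* 100, later treated like 3 */
    printf("%d\n", parse_O_level("999"));  /* clamped to 255 */
    printf("%d\n", parse_O_level("-1"));   /* rejected: -1 */
    return 0;
}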
opts.c:maybe_default_options
Back to opts.c:default_options_optimization, we come across maybe_default_options which sounds interesting. We enter it, and then maybe_default_option where we reach a big switch:
switch (default_opt->levels)
{
[...]
case OPT_LEVELS_1_PLUS:
enabled = (level >= 1);
break;
[...]
case OPT_LEVELS_3_PLUS:
enabled = (level >= 3);
break;
There are no >= 4 checks, which indicates that 3 is the largest possible.
Then we search for the definition of OPT_LEVELS_3_PLUS in common-target.h:
enum opt_levels
{
OPT_LEVELS_NONE, /* No levels (mark end of array). */
OPT_LEVELS_ALL, /* All levels (used by targets to disable options
enabled in target-independent code). */
OPT_LEVELS_0_ONLY, /* -O0 only. */
OPT_LEVELS_1_PLUS, /* -O1 and above, including -Os and -Og. */
OPT_LEVELS_1_PLUS_SPEED_ONLY, /* -O1 and above, but not -Os or -Og. */
OPT_LEVELS_1_PLUS_NOT_DEBUG, /* -O1 and above, but not -Og. */
OPT_LEVELS_2_PLUS, /* -O2 and above, including -Os. */
OPT_LEVELS_2_PLUS_SPEED_ONLY, /* -O2 and above, but not -Os or -Og. */
OPT_LEVELS_3_PLUS, /* -O3 and above. */
OPT_LEVELS_3_PLUS_AND_SIZE, /* -O3 and above and -Os. */
OPT_LEVELS_SIZE, /* -Os only. */
OPT_LEVELS_FAST /* -Ofast only. */
};
Ha! This is a strong indicator that there are only 3 levels.
opts.c:default_options_table
opt_levels is so interesting that we grep for OPT_LEVELS_3_PLUS and come across opts.c:default_options_table:
static const struct default_options default_options_table[] = {
/* -O1 optimizations. */
{ OPT_LEVELS_1_PLUS, OPT_fdefer_pop, NULL, 1 },
[...]
/* -O3 optimizations. */
{ OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
[...]
}
so this is where the -On to specific optimization mapping mentioned in the docs is encoded. Nice!
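As a rough mental model (names and entries simplified; the real table lives in opts.c and uses the OPT_LEVELS_* enum shown above), the walk over default_options_table amounts to something like this:

#include <stdio.h>

/* Toy sketch of the default_options_table walk, for illustration only. */
struct default_option { int min_level; const char *flag; };

static const struct default_option table[] = {
    { 1, "-fdefer-pop" },                      /* OPT_LEVELS_1_PLUS */
    { 2, "-fgcse" },                           /* OPT_LEVELS_2_PLUS */
    { 3, "-ftree-loop-distribute-patterns" },  /* OPT_LEVELS_3_PLUS */
};

static void apply_level(int level)
{
    printf("-O%d enables:", level);
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (level >= table[i].min_level)       /* no threshold uses >= 4 */
            printf(" %s", table[i].flag);
    printf("\n");
}

int main(void)
{
    apply_level(3);
    apply_level(100);   /* same output as 3: there is no higher threshold */
    return 0;
}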
Ensure that there are no more uses of x_optimize
The main usage of x_optimize was to set other specific optimization options like -fdefer-pop as documented in the man page. Are there any more?
We grep and find a few more. The number is small, and upon manual inspection we see that every usage does at most an x_optimize >= 3 comparison, so our conclusion holds.
lto-wrapper.c
Now we go for the second occurrence of OPT_O, which was in lto-wrapper.c.
LTO means Link Time Optimization, which, as the name suggests, needs an -O option and is tied to collect2 (which is basically a linker).
In fact, the first line of lto-wrapper.c says:
/* Wrapper to call lto. Used by collect2 and the linker plugin.
In this file, the OPT_O occurrences seem to only normalize the value of O to pass it forward, so we should be fine.
Seven distinct levels:
-O0 (default): No optimization.
-O or -O1 (same thing): Optimize, but do not spend too much time.
-O2: Optimize more aggressively
-O3: Optimize most aggressively
-Ofast: Equivalent to -O3 -ffast-math. -ffast-math triggers non-standards-compliant floating point optimizations. This allows the compiler to pretend that floating point numbers are infinitely precise, and that algebra on them follows the standard rules of real number algebra. It also tells the compiler to configure the hardware to flush denormals to zero and treat input denormals as zero, at least on some processors, including x86 and x86-64. Denormals trigger a slow path on many FPUs, so treating them as zero (which does not trigger the slow path) can be a big performance win.
-Os: Optimize for code size. This can actually improve speed in some cases, due to better I-cache behavior.
-Og: Optimize, but do not interfere with debugging. This enables non-embarrassing performance for debug builds and is intended to replace -O0 for debug builds.
There are also other options that are not enabled by any of these, and must be enabled separately. It is also possible to use an optimization option, but disable specific flags enabled by this optimization.
For more information, see GCC website.
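To see the -Ofast/-ffast-math point in practice (a sketch; check the generated assembly for your own compiler and target), the reduction below can typically only be reassociated and vectorized when -ffast-math lets the compiler pretend float addition is associative:

/* Compare: gcc -O3 -S sum.c   vs.   gcc -Ofast -S sum.c
   Under -O3 the additions must stay in source order (IEEE float
   addition is not associative); -Ofast allows reassociation, which
   typically enables vectorization of this loop. */
float sum(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}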
Four (0-3): See the GCC 4.4.2 manual. Anything higher is just -O3, but at some point you will overflow the variable size limit.
Related
In our project, we are using the ticlang compiler, i.e. a flavor of Clang from TI.
Optimization is set to level -Os.
In the code we have variables of a struct type that are only used within one C file and are therefore defined as static struct_type_xy variable;
The compiler performs some optimization where the members of such a struct are not kept in sequence in one block of memory but are re-ordered and even split.
This means that while debugging such variables cannot be displayed properly.
Of course, I could define them as volatile, but that would also prevent optimizing multiple accesses to the same members, which I don't want.
Therefore I want to prevent this kind of optimization.
What is the name of such an optimization and how can I disable it in clang?
I don't have a MCVE yet but I can provide a few details:
typedef struct
{
Command_t Command; // this is an enum type
int Par_1; // System uses 32 bit integers.
int Par_2;
int Par_3;
int Par_4;
size_t Num_Tok;
} Cmd_t;
static Cmd_t Cmd;
The map file then contains:
20000540 00000004 Cmd.o (.bss.Cmd.1)
20000544 00000004 Cmd.o (.bss.Cmd.2)
20000548 00000004 Cmd.o (.bss.Cmd.5)
2000054c 00000004 HAL_*
...
2000057b 00000001 XY_*
2000057c 00000001 Cmd.o (.bss.Cmd.0)
The parts of Cmd are split across memory and some are even removed. (I used a build configuration in which the two missing members are not used, but the struct definition is identical for all configurations.)
If I remove static this changes to
200004c4 00000018 (.common:Cmd)
Clang is apparently scalarizing the static struct, breaking it up into separate members, since the address is never taken or used, and doesn't escape the compilation unit. This lets it optimize away unused members.
LLVM has a "Scalar Replacement of Aggregates" (sroa) optimization pass.
https://llvm.org/docs/Passes.html#sroa-scalar-replacement-of-aggregates
(The alloca mentioned in that doc is an LLVM IR instruction, not the C alloca() function. Also, google found a random copy of the LLVM source that implements this while I was trying to find the right search terms.)
clang -O3 -Rpass=sroa might print a "remark" for each struct it optimizes, if that pass supports optimization reports.
According to Clang optimization levels, -sroa is enabled at -O1 and higher. But -sroa isn't a clang option, nor is it an LLVM option for clang -mllvm -sroa. In 2011, someone asked about adding a command-line option to disable an arbitrary optimization pass; IDK if any feature ever got added.
clang -cc1 -mllvm -help-list-hidden does show some interesting option names, like --stop-before=<pass-name> and --start-after=<pass-name>, and there's a --sroa-strict-inbounds.
clang -mllvm --sroa-strict-inbounds -O1 does actually compile, but I don't know what it does.
clang -mllvm --stop-before=sroa -O3 hello.c doesn't work on my system with clang 13. Or with --stop-before=-sroa. I get error in backend: "sroa" pass is not registered.
So I don't know how to actually disable this optimization pass, but that's almost certainly the one responsible. This is as far as I've gotten.
It's enabled at -O1, so it's not viable to use a lower optimization level and enable the other optimization flags it normally implies. -O0 is special, and marks everything as optnone, to make sure code-gen is suitably literal, storing/reloading everything between C statements.
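One workaround worth trying (a sketch, not verified with TI's toolchain): if the address of the object escapes the translation unit, the compiler has to assume other code may look at the whole aggregate, which normally defeats this kind of splitting, at the cost of the optimization itself. For example:

/* Hypothetical, simplified sketch: export a pointer to the otherwise
   file-local object so its address escapes the translation unit. */
typedef struct {
    int Par_1;
    int Par_2;
} Cmd_t;

static Cmd_t Cmd;

/* Non-static, so the address is visible outside this file and the
   compiler must keep the struct laid out as one object. */
Cmd_t *const Cmd_debug_view = &Cmd;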
Is there a way that GCC does not optimize any function calls?
In the generated assembly code, the printf call is replaced by putchar. This happens even at the default -O0 level.
#include <stdio.h>
int main(void) {
printf("a");
return 0;
}
(Godbolt is showing GCC 9 doing it, and Clang 8 keeping it unchanged.)
Use -fno-builtin to disable all replacement and inlining of standard C functions with equivalents. (This is very bad for performance in code that assumes memcpy(x,y, 4) will compile to just an unaligned/aliasing-safe load, not a function call. And disables constant-propagation such as strlen of string literals. So normally you'd want to avoid that for practical use.)
Or use -fno-builtin-FUNCNAME for a specific function, like -fno-builtin-printf.
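The memcpy point is worth seeing concretely (a sketch; compare the assembly of -O2 vs. -O2 -fno-builtin on your own GCC version):

#include <stdint.h>
#include <string.h>

/* This idiom relies on the memcpy builtin being folded into a single
   (unaligned, aliasing-safe) load; with -fno-builtin it may become a
   real library call instead. */
uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}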
By default, some commonly-used standard C functions are handled as builtin functions, similar to __builtin_popcount. The handler for printf replaces it with putchar or puts if possible.
6.59 Other Built-in Functions Provided by GCC
The implementation details of a C statement like printf("a") are not considered a visible side effect by default, so they aren't something that get preserved. You can still set a breakpoint at the call site and step into the function (at least in assembly, or in source mode if you have debug symbols installed).
To disable other kinds of optimizations for a single function, see __attribute__((optimize(0))) on a function or #pragma GCC optimize. But beware:
The optimize attribute should be used for debugging purposes only. It is not suitable in production code.
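A minimal sketch of the per-function form (debug-only, per the warning above):

/* Everything else in the file is optimized normally; this one
   function is compiled roughly as if -O0 had been given. */
__attribute__((optimize(0)))
int debug_me(int x)
{
    return x * 2;   /* code-gen here stays close to the source */
}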
You can't disable all optimizations. Some optimization is inherent in the way GCC transforms through an internal representation on the way to assembly. See Disable all optimization options in GCC.
E.g., even at -O0, GCC will optimize x / 10 to a multiplicative inverse.
It still stores everything to memory between C statements (for consistent debugging; that's what -O0 really means); GCC doesn't have a "fully dumb" mode that tries to transliterate C to assembly as naively as possible. Use tcc for that. Clang and ICC with -O0 are somewhat more literal than GCC, and so is MSVC debug mode.
Note that -g never has any effect on code generation, only on the metadata emitted. GCC uses other options (mostly -O, -f*, and -m*) to control code generation, so you can always safely enable -g without hurting performance, other than a larger binary. It's not debug mode (that's -O0); it's just debug symbols.
The other day I ran into a weird problem using GCC and the '-Ofast' optimization flag. I compiled the program below using 'gcc -Ofast -o fib1 fib1.c'.
#include <stdio.h>
int f1(int n) {
if (n < 2) {
return n;
}
int a, b;
a = f1(n - 1);
b = f1(n - 2);
return a + b;
}
int main(){
printf("%d", f1(40));
}
When measuring execution time, the result is:
peter@host ~ $ time ./fib1
102334155
real 0m0.511s
user 0m0.510s
sys 0m0.000s
Now let's introduce a global variable in our program and compile again using 'gcc -Ofast -o fib2 fib2.c'.
#include <stdio.h>
int global;
int f1(int n) {
if (n < 2) {
return n;
}
int a, b;
a = f1(n - 1);
b = f1(n - 2);
global = 0;
return a + b;
}
int main(){
printf("%d", f1(40));
}
Now the execution time is:
peter@host ~ $ time ./fib2
102334155
real 0m0.265s
user 0m0.265s
sys 0m0.000s
The new global variable does not do anything meaningful. However, the difference in execution time is considerable.
Apart from (1) what the reason for this behavior is, it would also be nice to know whether (2) the better performance can be achieved without introducing meaningless variables. Any suggestions?
Thanks
Peter
I believe you hit some very clever and very weird gcc (mis-?)optimization. That's about as far as I got in researching this.
I modified your code to have an #ifdef G around the global:
$ cc -O3 -o foo foo.c && time ./foo
102334155
real 0m0.634s
user 0m0.631s
sys 0m0.001s
$ cc -O3 -DG -o foo foo.c && time ./foo
102334155
real 0m0.365s
user 0m0.362s
sys 0m0.001s
So I have the same weird performance difference.
When in doubt, read the generated assembler.
$ cc -S -O3 -o foo.s -S foo.c
$ cc -S -DG -O3 -o foog.s -S foo.c
Here it gets truly bizarre. Normally I can follow gcc-generated code pretty easily. The code that got generated here is just incomprehensible. What should be pretty straightforward recursion and addition that should fit in 15-20 instructions, gcc expanded into several hundred instructions with a flurry of shifts, additions, subtractions, compares, branches and a large array on the stack. It looks like it tried to partially convert one or both recursions into an iteration and then unrolled that loop.

One thing struck me though: the non-global function had only one recursive call to itself (the second one is the call from main):
$ grep 'call.*f1' foo.s | wc
2 4 18
While the global one had:
$ grep 'call.*f1' foog.s | wc
33 66 297
My educated guess (I've seen this many times before)? GCC tried to be clever, and in its fervor the function that in theory should be easier to optimize ended up with worse code, while the write to the global variable confused it enough that it couldn't optimize as hard, which led to better code. This happens all the time; many optimizations that gcc (and other compilers too, let's not single them out) uses are very specific to certain benchmarks and might not generate faster running code in many other cases. In fact, from experience I only ever use -O2 unless I've benchmarked things very carefully to see that -O3 makes sense. It very often doesn't.
If you really want to research this further, I'd recommend reading the gcc documentation about which optimizations get enabled with -O3 as opposed to -O2, then trying them one by one until you find the one that causes this behavior; that optimization should be a pretty good hint for what's going on. I was about to do this, but I ran out of time (must do last-minute Christmas shopping).
On my machine (gcc (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010) I've got this:
time ./fib1 0,36s user 0,00s system 98% cpu 0,364 total
time ./fib2 0,20s user 0,00s system 98% cpu 0,208 total
From man gcc:
-Ofast
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.
Not so safe option, let's try -O2:
time ./fib1 0,38s user 0,00s system 99% cpu 0,377 total
time ./fib2 0,47s user 0,00s system 99% cpu 0,470 total
I think that some aggressive optimizations weren't applied to fib1 but were applied to fib2. When I switched from -Ofast to -O2, some optimizations weren't applied to fib2 but were applied to fib1.
Let's try -O0:
time ./fib1 0,81s user 0,00s system 99% cpu 0,812 total
time ./fib2 0,81s user 0,00s system 99% cpu 0,814 total
They are equal without optimizations.
So introducing a global variable in a recursive function can break some optimizations on the one hand and enable other optimizations on the other hand.
This results from inline limits kicking in earlier in the second version, because the version with the global variable does more work. That strongly suggests that inlining makes run-time performance worse in this particular example.
Compile both versions with -Ofast -fno-inline and the difference in time is gone. In fact, the version without the global variable runs faster.
Alternatively, just mark the function with __attribute__((noinline)).
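For instance, a sketch of the attribute form applied to the function from the question:

/* Forbid inlining of f1 so -Ofast cannot expand it into the huge
   partially-unrolled body described above. */
__attribute__((noinline))
int f1(int n)
{
    if (n < 2)
        return n;
    return f1(n - 1) + f1(n - 2);
}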
-Og is a relatively new optimization option that is intended to improve the debugging experience while applying optimizations. If a user selects -Og, then I'd like my source files to activate alternate code paths to enhance the debugging experience. GCC offers the __OPTIMIZE__ preprocessor macro, but it's only set to 1 when optimizations are in effect.
Is there a way to learn the optimization level, like -O1, -O3 or -Og, for use with the preprocessor?
I don't know if this is a clever hack, but it is a hack.
$ gcc -Xpreprocessor -dM -E - < /dev/null > 1
$ gcc -Xpreprocessor -dM -O -E - < /dev/null > 2
$ diff 1 2
53a54
> #define __OPTIMIZE__ 1
68a70
> #define _FORTIFY_SOURCE 2
154d155
< #define __NO_INLINE__ 1
clang didn't produce the FORTIFY one.
I believe it is not possible to directly know the optimization level used to compile the software, as it is not in the list of defined preprocessor symbols.
You could rely on -DNDEBUG (no debug), which is used to disable assertions in release code, and enable your "debug" code path when it is absent.
However, I believe a better approach is to have a project-wide set of symbols and let the user choose what to use explicitly:
MYPROJECT_DNDEBUG
MYPROJECT_OPTIMIZE
MYPROJECT_OPTIMIZE_AGGRESSIVELY
This makes debugging, and understanding the differences in behavior between release and debug builds, much easier, as you can incrementally turn the different behaviors on and off.
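A sketch of how that could look (macro names as suggested above; the corresponding -D flags are supplied by your build system, e.g. -DMYPROJECT_OPTIMIZE):

#include <stdio.h>

/* The build defines exactly one of the project-level symbols; the
   source only tests these, never the compiler's own optimization state. */
#if defined(MYPROJECT_OPTIMIZE_AGGRESSIVELY)
#  define DBG(msg)                              /* compiled out entirely */
#elif defined(MYPROJECT_OPTIMIZE)
#  define DBG(msg) puts(msg)                    /* terse */
#else
#  define DBG(msg) printf("%s:%d: %s\n", __FILE__, __LINE__, msg)
#endif

int main(void)
{
    DBG("debug-path active");
    return 0;
}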
Some system-specific preprocessor macros exist, depending on your target. For example, the Microchip-specific XC16 variant of gcc (currently based on gcc 4.5.1) has the preprocessor macro __OPTIMIZATION_LEVEL__, which takes on values 0, 1, 2, s, or 3.
Note that overriding optimization for a specific routine, e.g. with __attribute__((optimize(0))), does not change the value of __OPTIMIZE__ or __OPTIMIZATION_LEVEL__ within that routine.
Which gcc compiler options may be safely used for numerical programming?
The easy way to turn on optimizations for gcc is to add -O# to the compiler options. It is tempting to say -O3. However, I know that -O3 includes optimizations which are non-safe in the sense that results of numerical computations may differ once this option is included. Small changes in the result may be insignificant if the algorithm is stable. On the other hand, precision can be an issue for certain math operations, so math optimization can have a significant impact.
I find it inconvenient to take compiler dependent issues into account in the process of debugging. I.e. I don't want to wonder whether minor changes in the code will lead to strongly different behavior because the compiler changed its optimizations internally.
Which options are safe to add if I want deterministic--and hence controllable--behavior in my code? Which are almost safe, that is, which options induce only minor uncertainties compared to performance benefits?
I am thinking of options like -finline -finline-limit=2000, which inlines functions even if they are long.
It is not true that -O3 includes numerically unsafe optimizations. According to the manual, -O3 includes the following optimization passes in comparison to -O2:
-finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and -fipa-cp-clone
You might be referring to -ffast-math, turned on by default with -Ofast, but not with -O3:
-ffast-math Sets -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range. This option causes the preprocessor macro __FAST_MATH__ to be defined.
This option is not turned on by any -O option besides -Ofast since it
can result in incorrect output for programs that depend on an exact
implementation of IEEE or ISO rules/specifications for math functions.
It may, however, yield faster code for programs that do not require
the guarantees of these specifications.
In other words, all of -O, -O2, and -O3 are safe for numeric programming.
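If you want to see the difference for yourself, a small sketch of the kind of result -ffast-math (and therefore -Ofast) is allowed to change but -O3 is not: compensated (Kahan) summation depends on exact IEEE semantics, and fast-math may legally fold the compensation term to zero.

#include <stdio.h>

/* Compare: gcc -O3 kahan.c   vs.   gcc -O3 -ffast-math kahan.c
   (results may vary by GCC version and target). */
int main(void)
{
    double sum = 0.0, c = 0.0;            /* c carries the rounding error */
    for (int i = 0; i < 10000000; i++) {
        double y = 0.1 - c;
        double t = sum + y;
        c = (t - sum) - y;                /* fast-math may simplify this to 0 */
        sum = t;
    }
    printf("%.17g\n", sum);
    return 0;
}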