While compiling some code I noticed big differences in the assembly generated at -O0 and -O1. I wanted to run through enabling/disabling optimisations until I found out what was causing a certain change in the assembly.
If I use -fverbose-asm to find out exactly which flags -O1 enables compared to -O0, and then disable them manually, why is the assembly produced still so massively different? Even if I run gcc with -O0 and manually add all the flags that -fverbose-asm said were enabled at -O1, I don't get the same assembly that I would have got just by using -O1.
Is there anything apart from '-f...' and '-m...' that can be changed?
Or is it just that -O1 has some magic compared with -O0 that cannot be turned off?
Sorry for the crypticness - this is related to Reducing stack usage during recursion with GCC + ARM, but mentioning that was making the question a bit hard to understand.
If all you want is to see which passes are enabled at -O1 but not at -O0, you could run something like:
gcc -O0 test.c -fdump-tree-all -da
ls > O0
rm -f test.c.*
gcc -O1 test.c -fdump-tree-all -da
ls > O1
diff O0 O1
A similar process, using the set of flags you discovered, will let you see what extra magic passes not controlled by flags are undertaken by GCC at -O1.
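For example, a sketch of that comparison (the -f flags below are only placeholders for whatever set -fverbose-asm reported as enabled at -O1):
gcc -O0 -fguess-branch-probability -ftree-ccp -ftree-dce test.c -fdump-tree-all -da
ls > O0-plus-flags
rm -f test.c.*
gcc -O1 test.c -fdump-tree-all -da
ls > O1
diff O0-plus-flags O1
Any dump files that appear only in the -O1 listing correspond to passes that the individual flags did not switch on.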
EDIT:
A less messy way might be to compare the output of -fdump-passes, which lists to stderr which passes are ON or OFF.
So something like:
gcc -O0 test.c -fdump-passes |& grep ON > O0
gcc -O1 test.c -fdump-passes |& grep ON > O1
diff O0 O1
Not that this helps, other than providing some evidence for your suspicions about -O1 magic that can't be turned off:
From http://gcc.gnu.org/ml/gcc-help/2007-11/msg00214.html:
CAVEAT, not all optimizations enabled by -O1 have a command-line toggle flag to disable them.
From Hagen's "Definitive Guide to GCC, 2nd Ed":
Note: Not all of GCC’s optimizations can be controlled using a flag. GCC performs some optimizations automatically and, short of modifying the source code, you cannot disable these optimizations when you request optimization using -O
Unfortunately, I haven't found any clear statement about what these hard-coded optimizations might be. Hopefully someone who is knowledgeable about GCC's internals will post an answer with some information about that.
In addition to the many options you can also change parameters, e.g.
--param max-crossjump-edges=1
which affects the code generation. Check the source file params.def for all available params.
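For example (a minimal sketch; the value is arbitrary and only illustrates the syntax):
gcc -O2 --param max-crossjump-edges=1 test.c -S -o test.s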
But there is no way to switch from -O0 to -O1, from -O1 to -O2, to or from -Os, and so on, just by adding options and without patching the source code, since there are several hard-coded locations where the optimization level is checked without consulting a command-line option, e.g.:
return perform_tree_ssa_dce (/*aggressive=*/optimize >= 2);
Related
I'm interested in the loop unswitching optimization option -funswitch-loops in GCC, in particular what actually enables it. According to the documentation:
The following options control optimizations that may improve performance, but are not enabled by any -O options. This section includes experimental options that may produce broken code.
...
-funswitch-loops
Move branches with loop invariant conditions out of the loop, with duplicates of the loop on both branches (modified according to result of the condition).
Enabled by -fprofile-use and -fauto-profile.
So if I'm not already using -fprofile-use or -fauto-profile, it seems that I have to explicitly add -funswitch-loops to my list of compiler flags in order to activate loop unswitching. Fair enough. Though elsewhere in the same documentation, we find
-O3
Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the following optimization flags:
...
-funswitch-loops
...
So the documentation seems to claim that -funswitch-loops is switched on by -O3, but also that it is not turned on by any of the -O options. Which one is it?
You can compile with -v -Q to see a list of all the optimization options that are in effect. With -v -Q -O3 on gcc 10.2, I see -funswitch-loops is included.
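Another quick way to check the same thing, without compiling anything (the exact output format varies between gcc versions):
gcc -Q -O2 --help=optimizers | grep unswitch-loops
gcc -Q -O3 --help=optimizers | grep unswitch-loops
The first command should report the flag as [disabled] and the second as [enabled].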
So its listing under "not enabled by any -O options" is apparently an error. You could report it as a documentation bug.
I've noticed an interesting phenomenon where compiler/linker flags affect the running code in ways I cannot understand.
I have a library that presents different implementations of the same algorithm in order to test the run speed of those different implementations.
Initially, I tested the situation with a pair of identical implementations to check that the correct thing happened (both ran at roughly the same speed). I began by compiling the objects (one per implementation) with the following compiler flags:
-g -funroll-loops -flto -Ofast -Werror
and then during linking passed gcc the following flags:
-Ofast -flto=4 -fuse-linker-plugin
This gave a library that ran blazingly fast, but curiously was reliably and repeatably ~7% faster for the first object that was included in the arguments during linking (so either implementation was faster if it was linked first).
so with:
gcc -o libfoo.so -O3 -ffast-math -flto=4 -fuse-linker-plugin -shared support_obj.os obj1.os obj2.os -lm
vs
gcc -o libfoo.so -O3 -ffast-math -flto=4 -fuse-linker-plugin -shared support_obj.os obj2.os obj1.os -lm
the first case had the implementation in obj1 running faster than the implementation in obj2. In the second case, the converse was true. To be clear, the code is identical in both cases except for the function entry name.
Now I removed this strange link-argument-order difference (and actually sped it up a bit) by removing the -Ofast flag during linking.
I can replicate mostly the same situation by changing -Ofast to -O3 -ffast-math, but in that case I need to supply -ffast-math during linking, which leads again to the strange ordering speed difference. I'm not sure why the speed-up is maintained for -Ofast but not for -ffast-math when -ffast-math is not passed during linking, but I can accept it might be down to the link time optimisation passing the relevant info in one case but not the other. This doesn't explain the speed disparity though.
Removing -ffast-math means it runs ~8 times slower.
Is anybody able to shed some light on what might be happening to cause this effect? I'm really keen to know what might be going on to cause this funny behaviour so I can not accidentally trigger it down the line.
The run speed test is performed in python using a wrapper around the library and timeit, and I'm fairly sure this is doing the right thing (I can twiddle orders and things to show the python side effects are negligible).
I also tested the library for correctness of output, so I can be reasonably confident of that too.
Too long for a comment, so posted as an answer:
Using -ffast-math and/or -Ofast can lead to incorrect results in math operations, so I would suggest not using them, as expressed in these excerpts from the gcc manual:
option -ffast-math: Sets the options:
-fno-math-errno,
-funsafe-math-optimizations,
-ffinite-math-only,
-fno-rounding-math,
-fno-signaling-nans and
-fcx-limited-range.
This option causes the preprocessor macro __FAST_MATH__ to be defined.
This option is not turned on by any -O option besides -Ofast since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.
option -Ofast:
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.
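A quick way to confirm which flag combinations end up enabling fast-math is to check for the __FAST_MATH__ macro mentioned above, e.g.:
echo | gcc -x c -Ofast -dM -E - | grep __FAST_MATH__
echo | gcc -x c -O3 -dM -E - | grep __FAST_MATH__
The first command should print the macro definition; the second should print nothing.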
Is there any way to reduce the memory used by an executable generated with a command like gcc source_file.c -o result? I browsed the Internet and also looked in the man page for "gcc" and I think that I should use something related to -c or -S. So is gcc -c -S source_file.c -o result working? (This seems to reduce the space used...is there any other way to reduce even more?)
Thanks,
Polb
The standard compiler option on POSIX-like systems to instruct the compiler to optimize is -O (capital letter O for optimize). Many compilers allow you to optionally specify an optimization level after -O. Common optimization levels include:
-O0 no optimization at all
-O1 basic optimization for speed
-O2 all of -O1 plus some advanced optimizations
-O3 all of -O2 plus expensive optimizations that aren't usually needed
-Os optimize for size instead of speed (gcc, clang)
-Oz optimize even more for size (clang)
-Og optimize, but only with optimizations that do not interfere with debugging (gcc)
-Ofast all of -O3 and some numeric optimizations not in conformance with standard C. Use with caution. (gcc)
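To address the size question directly: a minimal sketch (using the file names from the question) is to compile with -Os and then strip symbol information, which usually shrinks the executable further:
gcc -Os source_file.c -o result
strip result
ls -l result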
The -S option generates assembler output instead of an executable.
A better option is to use LLVM and generate assembler for multiple architectures. Something similar: http://kripken.github.io/llvm.js/demo.html
llc
Here is an example: https://idea.popcount.org/2013-07-24-ir-is-better-than-assembly/
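A rough sketch of that workflow (assuming clang and llc are installed; the target triples are just examples, and IR generated for one target is not always valid for another):
clang -O2 -emit-llvm -S source_file.c -o source_file.ll
llc -mtriple=armv7-linux-gnueabihf source_file.ll -o source_file_arm.s
llc -mtriple=x86_64-linux-gnu source_file.ll -o source_file_x86.s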
I am new to compiler-related work. I want to analyse some source code before and after optimising with the -O1, -O2 and -O3 flags. I am using Intel's PIN tool for analysis purposes. I am using source code from the cBench benchmark suite, but I cannot figure out how to set the optimization option in it.
The cBench tutorial mentions the following:
use __compile batch script with compiler name as the first parameter to compile the benchmark with a specific compiler, i.e. gcc, open64, pathscale or intel. In the second parameter you can specify optimization flags.
So I compile every source file with these three optimization flags as follows:
./__compile gcc -O3
./__compile gcc -O2
./__compile gcc -O1
But when I analyse the object files in the PIN tool, I am not able to find any difference in any of the 24 programs of cBench.
What am I missing?
If I want to minimize the running time of my C programs, what optimization flags should I use? (I want to keep it standard, too.)
Currently I'm using:
-Wall -Wextra -pedantic -ansi -O3
Should I also use
-std=c99
for example?
And is there a specific order in which I should put those flags in my makefile? Does it make any difference?
And also, is there any reason not to use all the optimization flags I can find? Do they ever counter each other or something like that?
I'd recommend compiling new code with -std=gnu11, or -std=c11 if needed. Silencing all -Wall warnings is usually a good idea, IIRC. -Wextra warns for some things you might not want to change.
A good way to check how something compiles is to look at the compiler asm output. http://gcc.godbolt.org/ formats the asm output nicely (stripping out the noise). Putting some key functions up there and looking at what different compiler versions do is useful if you understand asm at all.
Use a new compiler version. gcc and clang have both improved significantly in newer versions. gcc 5.3 and clang 3.8 are the current releases. gcc5 makes noticeably better code than gcc 4.9.3 in some cases.
If you only need the binary to run on your own machine, you should use -O3 -march=native.
If you need the binary to run on other machines, choose the baseline for instruction-set extensions with stuff like -mssse3 -mpopcnt. You can use -mtune=haswell to optimize for Haswell even while making code that still runs on older CPUs (as determined by -march).
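For example (a sketch with a hypothetical file name; the right baseline of extensions depends on the oldest CPUs you need to support):
gcc -O3 -march=native prog.c -o prog                      # only needs to run on this machine
gcc -O3 -mssse3 -mpopcnt -mtune=haswell prog.c -o prog    # SSSE3+popcnt baseline, tuned for Haswell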
If your program doesn't depend on strict FP rounding behaviour, use -ffast-math. If it does, you can usually still use -fno-math-errno and stuff like that, without enabling -funsafe-math-optimizations. Some FP code can get big speedups from fast-math, like auto-vectorization.
If you can usefully do a test-run of your program that exercises most of the code paths that need to be optimized for a real run, then use profile-directed optimization:
gcc -fprofile-generate -Wall -Wextra -std=gnu11 -O3 -ffast-math -march=native -fwhole-program *.c -o my_program
./my_program -option1 < test_input1
./my_program -option2 < test_input2
gcc -fprofile-use -Wall -Wextra -std=gnu11 -O3 -ffast-math -march=native -fwhole-program *.c -o my_program
-fprofile-use enables -funroll-loops, since it has enough information to decide when to actually unroll. Unrolling loops all over the place can make things worse. However, it's worth trying -funroll-loops to see if it helps.
If your test runs don't cover all the code paths, then some important ones will be marked as "cold" and optimized less.
-O3 enables auto-vectorization, which -O2 doesn't. This can give big speedups.
-fwhole-program allows cross-file inlining, but only works when you put all the source files on one gcc command-line. -flto is another way to get the same effect. (Link-Time Optimization). clang supports -flto but not -fwhole-program.
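For example, a sketch of the two approaches (with hypothetical file names):
gcc -O3 -fwhole-program file1.c file2.c -o prog    # all sources on one command line
gcc -O3 -flto -c file1.c
gcc -O3 -flto -c file2.c
gcc -O3 -flto file1.o file2.o -o prog              # cross-file inlining happens at link time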
-fomit-frame-pointer has been the default for a while now for x86-64, and more recently for x86 (32bit).
As well as gcc, try compiling your program with clang. Clang sometimes makes better code than gcc, sometimes worse. Try both and benchmark.
The flag -std=c99 does not change the optimization levels. It only changes which target language standard you want the compiler to conform to.
You use -std=c99 when you want your program to be treated as a C99 program by the compiler.
The only flag that has to do with optimization among those you specified is -O3. Others serve for other purposes.
You may want to add -funroll-loops and -fomit-frame-pointer. -fomit-frame-pointer is already implied by the -O levels on most targets, but -funroll-loops is not turned on by any -O level; it is enabled automatically only when profile feedback (-fprofile-use) is available.