Format for setting Optimization flags in cBench - c

I am new to compiler-related work. I want to analyse some source code before and after optimising with the -O1, -O2 and -O3 flags. I am using Intel's PIN tool for the analysis and source code from the cBench benchmark suite, but I cannot work out how to set the optimization level there.
The cBench tutorial gives the following instruction:
use __compile batch script with compiler name as the first parameter to compile the benchmark with a specific compiler, i.e. gcc, open64, pathscale or intel. In the second parameter you can specify optimization flags.
So I compiled every program with these three optimization flags as follows:
./__compile gcc -O3
./__compile gcc -O2
./__compile gcc -O1
But when I analyse the object files with the PIN tool, I cannot find any difference in any of the 24 programs in cBench.
What am I missing?
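One way to check whether an optimization flag actually reached the compiler is to test the predefined macro __OPTIMIZE__, which GCC and Clang define when compiling at -O1 or higher (and __OPTIMIZE_SIZE__ for -Os). The sketch below is not part of cBench, and the function name optimization_state is made up for illustration. It assumes you can add one small function to a benchmark source file and build it through the same __compile script.

```c
/* Hypothetical helper, not part of cBench: reports how this translation
   unit was compiled, based on macros predefined by GCC and Clang. */
const char *optimization_state(void)
{
#if defined(__OPTIMIZE_SIZE__)
    return "optimized for size";      /* e.g. -Os */
#elif defined(__OPTIMIZE__)
    return "optimized";               /* -O1 or higher */
#else
    return "not optimized";           /* -O0 or no -O flag at all */
#endif
}
```

If a build made with ./__compile gcc -O3 still reports "not optimized" when the result is printed from main, the flag is probably not being forwarded to the compiler command line by the script.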

Related

Optimization setting

In C we can enable optimization by passing the flag -O, which enables all possible optimizations, while -O0 disables all of them.
My question is: who are these flags addressed to, the compiler or the kernel?
All command line arguments you supply are interpreted by the compiler (or compiler driver, in the case of some compilers like gcc). They may then be passed on to other programs that the compiler (or compiler driver) executes to complete particular tasks.
Incidentally, -o is not an optimisation setting with quite a few compilers. It usually specifies the name of an output file. For example, gcc -c file.c -o anotherfile.o compiles file.c and produces an object file named anotherfile.o.
The optimisation setting is usually -O (for example -O3). Note the uppercase O. It won't necessarily be passed to every program executed by the compiler/driver. For example, gcc -O3 file.c -o program compiles file.c with optimisation setting -O3 and produces an output executable named program. To do that, the linker is invoked, as well as various compilation phases (preprocessor, compiler proper, etc). -O3 will not normally be passed to the linker - it is a compilation option which linkers normally do not understand.
The O flags are passed to the compiler, not the kernel. The kernel has nothing to do with compilation. These flags determine how aggressively the optimizer will do its job. A practical example would be clang -O3 WannabeObjectFile.c.
Edit: I made a mistake, the lower case o flag is used to specify the output file. The uppercase O is used to specify optimization level.

Why assembly produced by objdump is huge?

I am trying to view the assembly for my simple C application. So, I have tried to produce assembly from binary by using objdump and it produces about 4.3MB sized file with 103228 lines of assembly code. Then, I have tried to do so by providing -S & -save-temps flags to the gcc.
I have used the following three commands:
1. arm-linux-gnueabi-objdump -d hello_simple > hello_simple.dump
2. arm-linux-gnueabi-gcc -save-temps -static hello_simple.c -o hello_simple -lm
3. arm-linux-gnueabi-gcc -S -static hello_simple.c -o hello_simple.asm -lm
In case of 2 & 3, exactly same results are produced, i.e., 65 lines of assembly code. I understand objdump produces some extra details too.
But, why is there a huge difference?
EDIT1: I have used the following command to build that binary:
arm-linux-gnueabi-gcc -static hello_simple.c -o hello_simple -lm
EDIT2: The -static and -lm flags may look unnecessary here, but I have to execute this binary on a simulator after compile-time additions of some assembly components, which makes them a must.
So, which assembly code should I consider as the most relevant during my analysis of execution traces? (I know it's another question but it would be handy to cover it in the same answer.)
The second two are just saving the asm for your functions.
The first one also has the CRT startup code. And, since you statically linked it, all the library functions you called.
Note that for 3, -static and -lm don't do anything, because you're not linking. gcc foo.c -S -O3 -fverbose-asm -o- | less is often handy.
I notice that none of your command lines included a -O3, or a -march=. You should compile with optimization on, and have gcc optimize your code for the target hardware.
.s is the standard suffix for machine-generated asm. (.S for hand-written asm: gcc foo.S will run it through cpp first). gcc -S produces a .s, the same way -c produces a .o.
For x86, .asm is usually only used for Intel-syntax (NASM/YASM), but IDK what the conventions are for ARM.
So, which assembly code should I consider as the most relevant during my analysis of execution traces?
It depends what you're trying to learn! If you have a good sense of how "expensive" each library function call is (in terms of number of instructions, number of branches polluting the branch-predictors, and data-cache pollution), then you don't need to trace execution through library calls. If you have math library functions that are used from some of your inner loops, then it's worth looking at them if the code is time-critical.
Usually a profiler or single-stepping in a debugger is useful for that, though. Just having disassembly output of a lot of library code is usually just clutter.

Curious result from the gcc linker behaviour around -ffast-math

I've noticed an interesting phenomenon around flags to the compiler linker affecting the running code in ways I cannot understand.
I have a library that presents different implementations of the same algorithm in order to test the run speed of those different implementations.
Initially, I tested the situation with a pair of identical implementations to check that the correct thing happened (both ran at roughly the same speed). I began by compiling the objects (one per implementation) with the following compiler flags:
-g -funroll-loops -flto -Ofast -Werror
and then during linking passed gcc the following flags:
-Ofast -flto=4 -fuse-linker-plugin
This gave a library that ran blazingly fast, but curiously was reliably and repeatably ~7% faster for the first object that was included in the arguments during linking (so either implementation was faster if it was linked first).
so with:
gcc -o libfoo.so -O3 -ffast-math -flto=4 -fuse-linker-plugin -shared support_obj.os obj1.os obj2.os -lm
vs
gcc -o libfoo.so -O3 -ffast-math -flto=4 -fuse-linker-plugin -shared support_obj.os obj2.os obj1.os -lm
the first case had the implementation in obj1 running faster than the implementation in obj2. In the second case, the converse was true. To be clear, the code is identical in both cases except for the function entry name.
Now I removed this strange link-argument-order difference (and actually sped it up a bit) by removing the -Ofast flag during linking.
I can replicate mostly the same situation by changing -Ofast to -O3 -ffast-math, but in that case I need to supply -ffast-math during linking, which leads again to the strange ordering speed difference. I'm not sure why the speed-up is maintained for -Ofast but not for -ffast-math when -ffast-math is not passed during linking, but I can accept it might be down to the link time optimisation passing the relevant info in one case but not the other. This doesn't explain the speed disparity though.
Removing -ffast-math means it runs ~8 times slower.
Is anybody able to shed some light on what might be happening to cause this effect? I'm really keen to know what is going on behind this funny behaviour so that I don't accidentally trigger it down the line.
The run speed test is performed in python using a wrapper around the library and timeit, and I'm fairly sure this is doing the right thing (I can twiddle orders and things to show the python side effects are negligible).
I also tested the library for correctness of output, so I can be reasonably confident of that too.
Too long for a comment, so posted as an answer:
Due to the risk of obtaining incorrect results in math operations, I would suggest not using -ffast-math.
Using -ffast-math and/or -Ofast can lead to incorrect results, as expressed in these excerpts from the gcc manual:
option:-ffast-math Sets the options:
-fno-math-errno,
-funsafe-math-optimizations,
-ffinite-math-only,
-fno-rounding-math,
-fno-signaling-nans and
-fcx-limited-range.
This option causes the preprocessor macro __FAST_MATH__ to be defined.
This option is not turned on by any -O option besides -Ofast since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.
option: -Ofast
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.
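To make the "incorrect results" risk concrete, here is a minimal sketch (not from the gcc manual) of why reassociation matters. The function names are made up for illustration. -ffast-math implies -funsafe-math-optimizations, which lets the compiler regroup floating-point additions, and the two groupings below are not equivalent under IEEE arithmetic. The third function only reports whether the __FAST_MATH__ macro mentioned in the excerpt was defined for this file.

```c
/* Left-to-right grouping: (a + b) + c */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Same operands, different grouping: a + (b + c) */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}

/* Returns 1 if this file was compiled with -ffast-math (or -Ofast), else 0. */
int built_with_fast_math(void)
{
#ifdef __FAST_MATH__
    return 1;
#else
    return 0;
#endif
}
```

Compiled without -ffast-math, sum_left(1e30, -1e30, 1.0) gives 1.0 while sum_right(1e30, -1e30, 1.0) gives 0.0, because 1.0 is absorbed when it is added to -1e30 first. With -ffast-math the compiler is allowed to treat the two groupings as interchangeable, so the result of either function is no longer guaranteed.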

Differences between -O0 and -O1 in GCC

While compiling some code I noticed big differences in the assembler created between -O0 and -O1. I wanted to run through enabling/disabling optimisations until I found out what was causing a certain change in the assembler.
If I use -fverbose-asm to find out exactly which flags O1 is enabling compared to O0, and then disable them manually, why is the assembler produced still so massively different? Even if I run gcc with O0 and manually add all the flags that fverbose-asm said were enabled with O1, I don't get the same assembler that I would have got just by using O1.
Is there anything apart from '-f...' and '-m...' that can be changed?
Or is it just that 'O1' has some magic compared with 'O0' that cannot be turned off?
Sorry for the crypticness - this was related to Reducing stack usage during recursion with GCC + ARM however the mention of it was making the question a bit hard to understand.
If all you want is to see which passes are enabled at O1 which are not enabled at O0 you could run something like:
gcc -O0 test.c -fdump-tree-all -da
ls > O0
rm -f test.c.*
gcc -O1 test.c -fdump-tree-all -da
ls > O1
diff O0 O1
A similar process, using the set of flags which you discovered, will let you see what extra magic passes not controlled by flags are undertaken by GCC at O1.
EDIT:
A less messy way might be to compare the output of -fdump-passes, which will list which passes are ON or OFF to stderr.
So something like:
gcc -O0 test.c -fdump-passes |& grep ON > O0
gcc -O1 test.c -fdump-passes |& grep ON > O1
diff O0 O1
Not that this helps, other than providing some evidence for your suspicions about -O1 magic that can't be turned off:
From http://gcc.gnu.org/ml/gcc-help/2007-11/msg00214.html:
CAVEAT, not all optimizations enabled by -O1 have a command-line toggle flag to disable them.
From Hagen's "Definitive Guide to GCC, 2nd Ed":
Note: Not all of GCC’s optimizations can be controlled using a flag. GCC performs some optimizations automatically and, short of modifying the source code, you cannot disable these optimizations when you request optimization using -O
Unfortunately, I haven't found any clear statement about what these hard-coded optimizations might be. Hopefully someone who is knowledgeable about GCC's internals might post an answer with some information about that.
In addition to the many options you can also change parameters, e.g.
--param max-crossjump-edges=1
which affects the code generation. Check the source file params.def for all available params.
But there is no way to switch from -O0 to -O1, from -O1 to -O2, or to or from -Os, and so on, just by adding options and without patching the source code, because there are several hard-coded locations where the level is checked without consulting a command-line option, e.g.:
return perform_tree_ssa_dce (/*aggressive=*/optimize >= 2);
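For narrowing down which function is affected by a level change, one option is GCC's per-function optimize attribute, which requests a different level for a single function. The sketch below assumes GCC; the macro AT_O0 and both function names are made up for illustration, and the GCC documentation describes this attribute as a debugging aid rather than something for production code. It puts the same loop body in one file at two levels so that one gcc -S output contains both versions side by side.

```c
/* The optimize attribute is GCC-specific, so fall back to nothing elsewhere. */
#if defined(__GNUC__) && !defined(__clang__)
#define AT_O0 __attribute__((optimize("O0")))
#else
#define AT_O0 /* no per-function override available */
#endif

/* Compiled at whatever level is given on the command line. */
unsigned long sum_to(unsigned long n)
{
    unsigned long total = 0;
    for (unsigned long i = 1; i <= n; i++)
        total += i;
    return total;
}

/* Same body, but GCC is asked to compile this one function at -O0. */
AT_O0 unsigned long sum_to_unoptimized(unsigned long n)
{
    unsigned long total = 0;
    for (unsigned long i = 1; i <= n; i++)
        total += i;
    return total;
}
```

Compiling this with gcc -O1 -S and comparing the two function bodies shows the effect of the level on identical source. Given the hard-coded checks described above, the attribute may not reproduce a plain -O0 build exactly.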

Which gcc optimization flags should I use?

If I want to minimize the time my C programs take to run, what optimization flags should I use? (I want to keep it standard too.)
Currently I'm using:
-Wall -Wextra -pedantic -ansi -O3
Should I also use
-std=c99
for example?
And is there a specific order in which I should put those flags in my makefile? Does it make any difference?
Also, is there any reason not to use all the optimization flags I can find? Do they ever counteract each other or something like that?
I'd recommend compiling new code with -std=gnu11, or -std=c11 if needed. Silencing all -Wall warnings is usually a good idea, IIRC. -Wextra warns for some things you might not want to change.
A good way to check how something compiles is to look at the compiler asm output. http://gcc.godbolt.org/ formats the asm output nicely (stripping out the noise). Putting some key functions up there and looking at what different compiler versions do is useful if you understand asm at all.
Use a new compiler version. gcc and clang have both improved significantly in newer versions. gcc 5.3 and clang 3.8 are the current releases. gcc5 makes noticeably better code than gcc 4.9.3 in some cases.
If you only need the binary to run on your own machine, you should use -O3 -march=native.
If you need the binary to run on other machines, choose the baseline for instruction-set extensions with stuff like -mssse3 -mpopcnt. You can use -mtune=haswell to optimize for Haswell even while making code that still runs on older CPUs (as determined by -march).
If your program doesn't depend on strict FP rounding behaviour, use -ffast-math. If it does, you can usually still use -fno-math-errno and stuff like that, without enabling -funsafe-math-optimizations. Some FP code can get big speedups from fast-math, like auto-vectorization.
If you can usefully do a test-run of your program that exercises most of the code paths that need to be optimized for a real run, then use profile-directed optimization:
gcc -fprofile-generate -Wall -Wextra -std=gnu11 -O3 -ffast-math -march=native -fwhole-program *.c -o my_program
./my_program -option1 < test_input1
./my_program -option2 < test_input2
gcc -fprofile-use -Wall -Wextra -std=gnu11 -O3 -ffast-math -march=native -fwhole-program *.c -o my_program
-fprofile-use enables -funroll-loops, since it has enough information to decide when to actually unroll. Unrolling loops all over the place can make things worse. However, it's worth trying -funroll-loops to see if it helps.
If your test runs don't cover all the code paths, then some important ones will be marked as "cold" and optimized less.
-O3 enables auto-vectorization, which -O2 doesn't. This can give big speedups.
-fwhole-program allows cross-file inlining, but only works when you put all the source files on one gcc command-line. -flto is another way to get the same effect. (Link-Time Optimization). clang supports -flto but not -fwhole-program.
-fomit-frame-pointer has been the default for a while now for x86-64, and more recently for x86 (32bit).
As well as gcc, try compiling your program with clang. Clang sometimes makes better code than gcc, sometimes worse. Try both and benchmark.
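The points above about auto-vectorization and -ffast-math can be tried on a pair of small loops. This is a sketch with made-up function names, not code from any benchmark. GCC's -fopt-info-vec option reports which loops were vectorized. The first loop has independent iterations, so it is a candidate at -O3 as written. The second is a floating-point reduction, which GCC typically vectorizes only when it is allowed to reorder the additions (for example with -ffast-math).

```c
#include <stddef.h>

/* y[i] += a * x[i]. The restrict qualifiers promise that x and y do not
   overlap, which removes one obstacle to vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Sum of products. Each iteration depends on the previous value of sum,
   so a vectorized version has to add the terms in a different order. */
float dot(size_t n, const float *x, const float *y)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Compiling this file with gcc -O3 -march=native -fopt-info-vec -c, once with and once without -ffast-math, and comparing the reports is a quick way to see which flag made the difference on your compiler version.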
The flag -std=c99 does not change the optimization level. It only changes which language standard you want the compiler to conform to.
You use -std=c99 when you want your program to be treated as a C99 program by the compiler.
Among the flags you listed, the only one that has to do with optimization is -O3. The others serve other purposes.
You may want to try adding -funroll-loops, which -O3 does not enable on its own. -fomit-frame-pointer is normally already enabled at -O1 and higher, so adding it explicitly should not change anything.

Resources