I've noticed an interesting phenomenon where flags passed to the compiler and linker affect the running code in ways I cannot understand.
I have a library that presents different implementations of the same algorithm in order to test the run speed of those different implementations.
Initially, I tested the situation with a pair of identical implementations to check that the correct thing happened (both ran at roughly the same speed). I began by compiling the objects (one per implementation) with the following compiler flags:
-g -funroll-loops -flto -Ofast -Werror
and then during linking passed gcc the following flags:
-Ofast -flto=4 -fuse-linker-plugin
This gave a library that ran blazingly fast, but curiously was reliably and repeatably ~7% faster for the first object that was included in the arguments during linking (so either implementation was faster if it was linked first).
so with:
gcc -o libfoo.so -O3 -ffast-math -flto=4 -fuse-linker-plugin -shared support_obj.os obj1.os obj2.os -lm
vs
gcc -o libfoo.so -O3 -ffast-math -flto=4 -fuse-linker-plugin -shared support_obj.os obj2.os obj1.os -lm
the first case had the implementation in obj1 running faster than the implementation in obj2. In the second case, the converse was true. To be clear, the code is identical in both cases except for the function entry name.
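One thing worth checking (my speculation, not something the toolchain reports directly) is whether the two link orders simply place the hot functions at different addresses and alignments, which can produce small but perfectly repeatable speed differences. A minimal check with binutils, where impl is a hypothetical filter matching the two entry-point names:

nm -nS libfoo.so | grep impl

If the faster build consistently has its implementation starting on a nicer boundary (e.g. 32- or 64-byte aligned), code placement rather than the optimization flags themselves is the likely culprit.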
Now I removed this strange link-argument-order difference (and actually sped it up a bit) by removing the -Ofast flag during linking.
I can replicate much the same situation by changing -Ofast to -O3 -ffast-math, but in that case I need to supply -ffast-math during linking as well, which leads again to the strange ordering speed difference. I'm not sure why the speed-up is maintained for -Ofast but not for -ffast-math when the flag is omitted during linking, though I can accept it might be down to the link-time optimisation passing the relevant info in one case but not the other. That doesn't explain the speed disparity, though.
Removing -ffast-math means it runs ~8 times slower.
Is anybody able to shed some light on what might be happening to cause this effect? I'm really keen to understand what's going on so I can avoid accidentally triggering it down the line.
The run speed test is performed in Python using a wrapper around the library and timeit, and I'm fairly sure this is doing the right thing (I can twiddle orders and things to show that the Python-side effects are negligible).
I also tested the library for correctness of output, so I can be reasonably confident of that too.
Too long for a comment, so posted as an answer:
Using -ffast-math and/or -Ofast can lead to incorrect results in math operations, so I would suggest not using them. Here are the relevant excerpts from the gcc manual:
-ffast-math
Sets the options:
-fno-math-errno,
-funsafe-math-optimizations,
-ffinite-math-only,
-fno-rounding-math,
-fno-signaling-nans and
-fcx-limited-range.
This option causes the preprocessor macro __FAST_MATH__ to be defined.
This option is not turned on by any -O option besides -Ofast since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.
-Ofast
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.
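To make the risk concrete: -ffast-math implies -ffinite-math-only, under which the compiler may assume no NaNs ever occur, so an idiomatic NaN check like x != x can be silently optimized away. A minimal sketch; compile once with gcc -O2 and once with gcc -O2 -ffast-math and compare the output:

#include <stdio.h>

int main(void) {
    volatile double zero = 0.0;  /* volatile so the division happens at run time */
    double x = zero / zero;      /* produces NaN */
    /* Under -ffast-math, gcc may assume x can never be NaN and fold
     * this self-comparison to false, silently printing the wrong answer. */
    if (x != x)
        printf("x is NaN\n");
    else
        printf("x is not NaN\n");
    return 0;
}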
Related
I ran into this while studying a line of code in cmake for building a library:
-Wall -Wfloat-equal -o3 -fPIC
What do these compiler flags mean and how do they work? Why do they need to be inserted?
So
-Wall
Enables apparently not all, but an awful lot of compiler warning messages. It should be used to generate better code since you'll know if anything's wrong.
-Wfloat-equal
Warns if floating point numbers are used in equality comparisons. Comparing floats for equality is risky business because computed values carry rounding error, so two numbers you expect to be equal often aren't, bit for bit (see the example after this list). Note that -Wall does not include -Wfloat-equal, so it has to be enabled separately.
-o3
Is presumably meant to be -O3, optimization level 3, i.e. optimize to the maximum standard level. Note that capitalization matters: gcc actually parses lowercase -o3 as -o 3, naming the output file 3.
-fPIC
Will generate position independent code. This is a bit more complicated and has been asked about before, but it is what you want when the code is going into a shared library.
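As promised above, a minimal example of the comparison -Wfloat-equal complains about; the file name is arbitrary:

/* compile with: gcc -Wall -Wfloat-equal demo.c */
#include <stdio.h>

int main(void) {
    double a = 0.1 + 0.2;
    if (a == 0.3)                            /* -Wfloat-equal warns here */
        printf("equal\n");
    else
        printf("not equal: a = %.17g\n", a); /* prints 0.30000000000000004 */
    return 0;
}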
I am working on a task for the university; there is a website that checks my memory usage, and it compiles the .c files with:
/usr/bin/gcc -DEVAL -std=c11 -O2 -pipe -static -s -o program programname.c -lm
and it says my program exceeds the memory limit of 4 MiB, which is a lot, I think. I was told this command makes it use more memory than the standard compilation I use on my PC, like this:
gcc myprog.c -o myprog
I launched the executable created by this one compilation with:
/usr/bin/time -v ./myprog
and under "maximum resident set size" it says 1708 kilobytes, which should be about 1.6 MiB. So how can it be that for the university checker my program goes over 4 MiB? I have eliminated all the mallocs I could and just left the essential ones, but it still says it goes over the limit. What else should I improve? I'm almost thinking the website has an error or something...
From GNU GCC Manual, Page 197:
-static
On systems that support dynamic linking, this overrides ‘-pie’ and prevents linking with the shared libraries. On other systems, this option has no effect.
If you don't know about the pie flag quoted here, have a look at this section:
-pie
Produce a dynamically linked position independent executable on targets that support it. For predictable results, you must also specify the same set of options used for compilation (‘-fpie’, ‘-fPIE’, or model suboptions) when you specify this linker option.
To answer your question: yes, it is possible that this overhead is generated by the -static flag. With static linking, the library code your program uses is copied into the executable itself rather than shared at run time, so the loaded image is bigger and the resident set size goes up accordingly.
As suggested in the comments, you should compile your code with the same flags the website uses to get an idea of your program's real overhead (make sure your gcc version matches the website's), and you should also apply some common manual optimizations such as constant folding, function inlining, etc. A good reference for these optimizations could be this one
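Concretely, that means reproducing the checker's conditions with the exact commands quoted in the question, then measuring:

/usr/bin/gcc -DEVAL -std=c11 -O2 -pipe -static -s -o program programname.c -lm
/usr/bin/time -v ./program

and comparing the "maximum resident set size" line against the 4 MiB limit.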
Is there any way to reduce the memory used by an executable generated with a command like gcc source_file.c -o result? I browsed the Internet and also looked in the man page for "gcc" and I think that I should use something related to -c or -S. So is gcc -c -S source_file.c -o result working? (This seems to reduce the space used...is there any other way to reduce even more?)
The standard compiler option on POSIX-like systems to instruct the compiler to optimize is -O (capital letter O for optimize). Many compilers allow you to optionally specify an optimization level after -O. Common optimization levels include:
-O0 no optimization at all
-O1 basic optimization for speed
-O2 all of -O1 plus some advanced optimizations
-O3 all of -O2 plus expensive optimizations that aren't usually needed
-Os optimize for size instead of speed (gcc, clang)
-Oz optimize even more for size (clang)
-Og all of -O1 except for optimizations that hinder debugging (gcc)
-Ofast all of -O3 and some numeric optimizations not in conformance with standard C. Use with caution. (gcc)
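To see the speed/size trade-off yourself, compile the same file at two levels (test.c here is a placeholder for your own source) and compare the section sizes with the binutils size tool:

gcc -O3 test.c -o test_O3
gcc -Os test.c -o test_Os
size test_O3 test_Os

The text column is the machine code; -O3 is typically larger because of inlining and vectorization.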
The option -S generates assembler output instead of an object file or executable.
A better option is to use LLVM, which can generate assembler for multiple architectures from the same intermediate representation via llc (similar to http://kripken.github.io/llvm.js/demo.html). Here is an example: https://idea.popcount.org/2013-07-24-ir-is-better-than-assembly/
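A rough sketch of that flow, assuming clang and llc are installed (exact flag spellings vary between LLVM versions):

clang -S -emit-llvm test.c -o test.ll
llc -march=arm test.ll -o test-arm.s
llc -march=x86-64 test.ll -o test-x64.s

The same IR file is lowered to assembler for two different targets.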
While compiling some code I noticed big differences in the assembler created between -O0 and -O1. I wanted to run through enabling/disabling optimisations until I found out what was causing a certain change in the assembler.
If I use -fverbose-asm to find out exactly which flags O1 is enabling compared to O0, and then disable them manually, why is the assembler produced still so massively different? Even if I run gcc with O0 and manually add all the flags that fverbose-asm said were enabled with O1, I don't get the same assembler that I would have got just by using O1.
Is there anything apart from '-f...' and '-m...' that can be changed?
Or is it just that -O1 has some magic compared with -O0 that cannot be turned off?
Sorry for the crypticness - this was related to Reducing stack usage during recursion with GCC + ARM however the mention of it was making the question a bit hard to understand.
If all you want is to see which passes are enabled at -O1 but not at -O0, you could run something like:
gcc -O0 test.c -fdump-tree-all -da
ls > O0
rm -f test.c.*
gcc -O1 test.c -fdump-tree-all -da
ls > O1
diff O0 O1
A similar process, using the set of flags which you discovered, will let you see what extra magic passes not controlled by flags are undertaken by GCC at O1.
EDIT:
A less messy way might be to compare the output of -fdump-passes, which will list which passes are ON or OFF to stderr.
So something like:
gcc -O0 test.c -fdump-passes |& grep ON > O0
gcc -O1 test.c -fdump-passes |& grep ON > O1
diff O0 O1
Not that this helps, other than providing some evidence for your suspicions about -O1 magic that can't be turned off:
From http://gcc.gnu.org/ml/gcc-help/2007-11/msg00214.html:
CAVEAT, not all optimizations enabled by -O1 have a command-line toggle flag to disable them.
From Hagen's "Definitive Guide to GCC, 2nd Ed":
Note: Not all of GCC’s optimizations can be controlled using a flag. GCC performs some optimizations automatically and, short of modifying the source code, you cannot disable these optimizations when you request optimization using -O
Unfortunately, I haven't found any clear statement about what these hard-coded optimizations might be. Hopefully someone who is knowledgeable about GCC's internals will post an answer with some information about that.
In addition to the many options you can also change parameters, e.g.
--param max-crossjump-edges=1
which affects the code generation. Check the source file params.def for all available params.
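For example, assuming any source file test.c:

gcc -O2 --param max-crossjump-edges=1 test.c -S -o test.s

lets you diff the generated assembler against a build without the --param to see its effect.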
But there is no way to get from -O0 to -O1, from -O1 to -O2, to -Os, etc. just by adding options, short of patching the source code, since there are several hard-coded places where the optimization level is checked directly without consulting a command-line option, e.g.:
return perform_tree_ssa_dce (/*aggressive=*/optimize >= 2);
If I want to minimize the time my C programs take to run, what optimization flags should I use? (I want to keep it standard too.)
Currently I'm using:
-Wall -Wextra -pedantic -ansi -O3
Should I also use
-std=c99
for example?
And is there a specific order I should list those flags in my makefile? Does it make any difference?
And also, is there any reason not to use all the optimization flags I can find? Do they ever counter each other or something like that?
I'd recommend compiling new code with -std=gnu11, or -std=c11 if needed. Silencing all -Wall warnings is usually a good idea, IIRC. -Wextra warns for some things you might not want to change.
A good way to check how something compiles is to look at the compiler asm output. http://gcc.godbolt.org/ formats the asm output nicely (stripping out the noise). Putting some key functions up there and looking at what different compiler versions do is useful if you understand asm at all.
Use a new compiler version. gcc and clang have both improved significantly in newer versions. gcc 5.3 and clang 3.8 are the current releases. gcc5 makes noticeably better code than gcc 4.9.3 in some cases.
If you only need the binary to run on your own machine, you should use -O3 -march=native.
If you need the binary to run on other machines, choose the baseline for instruction-set extensions with stuff like -mssse3 -mpopcnt. You can use -mtune=haswell to optimize for Haswell even while making code that still runs on older CPUs (as determined by -march).
If your program doesn't depend on strict FP rounding behaviour, use -ffast-math. If it does, you can usually still use -fno-math-errno and stuff like that, without enabling -funsafe-math-optimizations. Some FP code can get big speedups from fast-math, like auto-vectorization.
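To illustrate the errno point with a sketch (root is just a hypothetical wrapper): with plain -O2, gcc has to keep a guarded call into libm so that errno can be set for negative inputs; with -fno-math-errno that guard disappears.

#include <math.h>

/* gcc -O2:                 hardware sqrt + branch to the library sqrt for
 *                          x < 0, so errno can still be set
 * gcc -O2 -fno-math-errno: a single hardware sqrt instruction */
double root(double x) { return sqrt(x); }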
If you can usefully do a test-run of your program that exercises most of the code paths that need to be optimized for a real run, then use profile-directed optimization:
gcc -fprofile-generate -Wall -Wextra -std=gnu11 -O3 -ffast-math -march=native -fwhole-program *.c -o my_program
./my_program -option1 < test_input1
./my_program -option2 < test_input2
gcc -fprofile-use -Wall -Wextra -std=gnu11 -O3 -ffast-math -march=native -fwhole-program *.c -o my_program
-fprofile-use enables -funroll-loops, since it has enough information to decide when to actually unroll. Unrolling loops all over the place can make things worse. However, it's worth trying -funroll-loops to see if it helps.
If your test runs don't cover all the code paths, then some important ones will be marked as "cold" and optimized less.
-O3 enables auto-vectorization, which -O2 doesn't. This can give big speedups.
-fwhole-program allows cross-file inlining, but only works when you put all the source files on one gcc command-line. -flto is another way to get the same effect. (Link-Time Optimization). clang supports -flto but not -fwhole-program.
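For example, with hypothetical source files a.c, b.c and main.c:

gcc -O3 -fwhole-program a.c b.c main.c -o prog
gcc -O3 -flto a.c b.c main.c -o prog

Both allow inlining across the three files; -flto also works when the files are compiled separately (with -flto on each compile) and combined at link time.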
-fomit-frame-pointer has been the default for a while now for x86-64, and more recently for x86 (32bit).
As well as gcc, try compiling your program with clang. Clang sometimes makes better code than gcc, sometimes worse. Try both and benchmark.
The flag -std=c99 does not change the optimization level. It only changes which language standard you want the compiler to conform to.
You use -std=c99 when you want your program to be treated as a C99 program by the compiler.
The only flag that has to do with optimization among those you specified is -O3. Others serve for other purposes.
You may want to add -funroll-loops and -fomit-frame-pointer. Note, though, that -fomit-frame-pointer is already on by default when optimizing on x86-64 (as mentioned above), and -funroll-loops is not enabled by plain -O3 (it is turned on by -fprofile-use), so it is worth benchmarking separately.