Which gcc compiler options may be safely used for numerical programming?
The easy way to turn on optimizations for gcc is to add -O&lt;n&gt; to the compiler options. It is tempting to say -O3. However, I know that -O3 includes optimizations which are unsafe in the sense that the results of numerical computations may differ once this option is enabled. Small changes in the result may be insignificant if the algorithm is stable. On the other hand, precision can be an issue for certain math operations, so math optimizations can have a significant impact.
I find it inconvenient to take compiler-dependent issues into account while debugging, i.e. I don't want to wonder whether minor changes in the code lead to strongly different behavior because the compiler changed its optimizations internally.
Which options are safe to add if I want deterministic, and hence controllable, behavior in my code? Which are almost safe, i.e. which options trade only minor uncertainties for their performance benefits?
I am thinking of options like -finline -finline-limit=2000, which inlines functions even if they are long.
It is not true that -O3 includes numerically unsafe optimizations. According to the manual, -O3 includes the following optimization passes in comparison to -O2:
-finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and -fipa-cp-clone
You might be referring to -ffast-math, turned on by default with -Ofast, but not with -O3:
-ffast-math Sets -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range. This option causes the preprocessor macro __FAST_MATH__ to be defined.
This option is not turned on by any -O option besides -Ofast since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.
In other words, all of -O, -O2, and -O3 are safe for numeric programming.
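For illustration, here is a minimal sketch (the file name and values are mine, not from the question) that you can build with and without -ffast-math. Under strict IEEE semantics the printed sum is exactly 0; with -ffast-math the compiler is allowed to reassociate (a + b) - a into b, so the output may (or may not) change on your GCC version:

/* fastmath_demo.c - compare:
 *   gcc -O3 fastmath_demo.c -o strict && ./strict
 *   gcc -O3 -ffast-math fastmath_demo.c -o fast && ./fast
 * With a = 1e16 and b = 1, the intermediate a + b rounds back to 1e16 in
 * double precision, so strict IEEE evaluation prints 0. -ffast-math permits
 * reassociation that can turn the expression into just b.
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Take the values from the command line (with defaults) so the compiler
       treats them as runtime data instead of folding everything at build time. */
    double a = argc > 1 ? atof(argv[1]) : 1e16;
    double b = argc > 2 ? atof(argv[2]) : 1.0;
    double sum = (a + b) - a;

    printf("(a + b) - a = %g\n", sum);
#ifdef __FAST_MATH__
    puts("__FAST_MATH__ is defined (built with -ffast-math / -Ofast)");
#endif
    return 0;
}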
Related
I'm building a toy cracking program for self-teaching purposes in C. I want the brute-forcing to run as fast as possible, and one of the considerations there is naturally compiler optimizations. Presumably, cryptographic implementations would break or have their results thrown off by forgoing floating-point precision, but I tested enabling -Ofast (on gcc) with my current script and the final hash output from a long series of cryptographic functions remains the same as with just -O3.
I understand, though, that this isn't necessarily conclusive, as there's a lot that can be going on under the hood with modern compilers. So my question is: will enabling -Ofast on my crypto cracking script potentially throw off the results of my crypto functions?
-Ofast does this:
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens.
-ffast-math turns on a bunch of other flags, but none of them matter unless you're using floating-point arithmetic, which no hash function I'm aware of does.
-fstack-arrays and -fno-protect-parens don't do anything at all unless you're using Fortran.
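As a quick way to convince yourself, here is a minimal sketch using an integer-only hash (FNV-1a, chosen purely as an illustration, not the questioner's crypto functions). Since it performs no floating-point arithmetic, -O3 and -Ofast should produce bit-identical output:

/* fnv1a_check.c - build twice and diff the output:
 *   gcc -O3    fnv1a_check.c -o h_o3    && ./h_o3
 *   gcc -Ofast fnv1a_check.c -o h_ofast && ./h_ofast
 * The hash uses only integer XOR and multiply, so the -ffast-math flags
 * pulled in by -Ofast have nothing to act on.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t fnv1a(const char *s)
{
    uint64_t h = 0xcbf29ce484222325ULL;    /* FNV-1a 64-bit offset basis */
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 0x100000001b3ULL;             /* FNV-1a 64-bit prime */
    }
    return h;
}

int main(void)
{
    printf("%016" PRIx64 "\n", fnv1a("password123"));
    return 0;
}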
I saw somewhere that GCC might sometimes prefer not to use a conditional move (cmov) when converting my code into assembly.
What are the cases where it might choose to do something other than a conditional move?
Compilers often favour if-conversion to CMOV when both sides of the branch are short, especially when a ternary makes the assignment to the C variable unconditional. E.g. if(x) y=bar; sometimes doesn't optimize to CMOV, but y = x ? bar : y; uses CMOV more often. This matters especially when y is an array entry that the if form otherwise wouldn't touch: turning the conditional store into an unconditional non-atomic read-modify-write could create a data race not present in the source. (Compilers can't invent writes to possibly-shared objects.)
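A minimal sketch of the two forms (function and variable names are mine); compiling with something like gcc -O2 -S and comparing the output shows whether a branch or a CMOV was chosen on your GCC version:

/* cmov_sketch.c - e.g.: gcc -O2 -S cmov_sketch.c -o - */

/* Conditional store in the source: converting this to CMOV would mean an
 * unconditional write of y[i], which the compiler may not invent if y could
 * be shared with another thread. Often compiled as a branch. */
void maybe_set(int *y, int i, int x, int bar)
{
    if (x)
        y[i] = bar;
}

/* The source itself always writes y[i], so selecting the value branchlessly
 * with CMOV is legal and is chosen more often. */
void always_set(int *y, int i, int x, int bar)
{
    y[i] = x ? bar : y[i];
}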
The obvious example of if-conversion being legal but not profitable is when there's a lot of work on both sides of an if/else, e.g. some multiplies and divides, a whole loop, and/or table lookups. Even if gcc can prove that it's safe to run both sides and select one result at the end, it would see that doing that much extra work isn't worth avoiding a branch.
If-conversion to a data dependency (branchless CMOV) is only even possible in limited circumstances. E.g. Why is gcc allowed to speculatively load from a struct? shows a case where it can/can't be done. Other blockers include a memory access that the C abstract machine wouldn't do and that the compiler can't prove won't fault, or a non-inline function call that might have side effects.
See also these questions about getting gcc to use CMOV.
Getting GCC/Clang to use CMOV
how to force the use of cmov in gcc and VS
Make gcc use conditional moves
See also Disabling predication in gcc/g++ - apparently gcc -fno-if-conversion -fno-if-conversion2 will disable use of cmov.
For a case where cmov hurts performance, see gcc optimization flag -O3 makes code slower than -O2 - GCC -O3 needs profile-guided optimization to get it right and use a branch for an if that turns out to be highly predictable. GCC -O2 didn't do if-conversion in the first place, even without PGO profiling data.
An example the other way: Is there a good reason why GCC would generate jump to jump just over one cheap instruction?
GCC seemingly misses simple optimization shows a case where a ternary has side effects in both halves: a ternary isn't like CMOV; only one side is even evaluated for side effects.
AVX-512 and Branching shows a Fortran example where GCC needs help from source changes to be able to use branchless SIMD (the SIMD equivalent of scalar CMOV). This is a case of not inventing writes: it can't turn a read/branch into read/maybe-modify/write for elements that the source wouldn't have written. If-conversion is usually necessary for auto-vectorization.
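A small sketch of the same idea in C (a made-up clamp loop, not the Fortran code from that question): the conditional-store form keeps a branch or needs masked stores because the compiler can't invent stores to the skipped elements, while the always-store form gives the vectorizer an unconditional write it can implement as a blend:

/* clamp.c - compare e.g.: gcc -O3 -march=native -S clamp.c -o - */
void clamp_cond(float *a, int n, float t)
{
    for (int i = 0; i < n; ++i)
        if (a[i] > t)
            a[i] = t;        /* store happens only for some elements */
}

void clamp_select(float *a, int n, float t)
{
    for (int i = 0; i < n; ++i)
        a[i] = (a[i] > t) ? t : a[i];   /* every element is written in the
                                           source, so a SIMD blend + store
                                           is clearly legal */
}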
If I run gcc with -O0, and hand-optimize my code using techniques such as the ones mentioned here, will it generally be the case that my optimized code will run faster than my unoptimized code when I run gcc with -O3?
That is, if I hand-optimize code under a particular compiler optimization level, is it generally true that these optimizations will still be productive (rather than counterproductive) under a different (higher or lower) compiler optimization level?
It might not be the same with a different compiler. The compiler may even do away with your hand optimizations, i.e. ignore them. It depends heavily on the implementation and behavior of the compiler itself. Many of these hand optimizations are really just requests to the compiler, which it can drop at any time, usually without any notification.
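As a small illustration (my own toy example, not from the question): hints like register or a manual shift are typically redundant at -O2/-O3 and are not guaranteed to help at -O0:

/* hints.c - compare the generated code of the two functions:
 *   gcc -O0 -S hints.c -o -
 *   gcc -O3 -S hints.c -o -
 * At -O3 GCC usually emits the same body for both; at -O0 neither the
 * register hint nor the manual shift is guaranteed to be honoured or faster.
 */
unsigned times8_plain(unsigned x)
{
    return x * 8u;
}

unsigned times8_hand(unsigned x)
{
    register unsigned r = x << 3;   /* "hand optimization": shift + register hint */
    return r;
}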
Is it OK to say that the 'volatile' keyword makes no difference if compiler optimization is turned off, i.e. gcc -O0?
I made a sample C program and I see a difference between volatile and non-volatile variables in the generated assembly code only when compiler optimization is turned on, i.e. gcc -O1.
No, there is no basis for making such a statement.
volatile has specific semantics that are spelled out in the standard. You are asserting that gcc -O0 always generates code such that every variable -- volatile or not -- conforms to those semantics. This is not guaranteed; even if it happens to be the case for a particular program and a particular version of gcc, it could well change when, for example, you upgrade your compiler.
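A minimal sketch (my own example) of the kind of thing volatile is required to preserve at every optimization level: with -O2 the load in the non-volatile loop is typically hoisted out, leaving an infinite loop, while the volatile loop must reload on every iteration. At -O0 both usually reload from memory, but only the volatile version is guaranteed to.

/* volatile_demo.c - compare:
 *   gcc -O0 -S volatile_demo.c -o -
 *   gcc -O2 -S volatile_demo.c -o -
 */
int ready_plain;            /* ordinary object: the compiler may hoist the load */
volatile int ready_vol;     /* volatile: every read in the loop must happen     */

void wait_plain(void)
{
    while (!ready_plain) {} /* at -O2 the load is typically hoisted out of the loop */
}

void wait_vol(void)
{
    while (!ready_vol) {}   /* must re-read ready_vol on each iteration */
}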
Probably volatile does not make much difference with gcc -O0, at least for GCC 4.7 or earlier. However, this is probably changing in the next version of GCC (i.e. the future 4.8, the current trunk). The next version will also provide -Og to get debug-friendly optimization.
In GCC 4.7 and earlier, no optimization means that values are not always kept in registers from one C (or even GIMPLE, the internal representation inside GCC) statement to the next.
Also, volatile has a specific meaning, both for standard-conforming compilers and for human readers. For instance, I would be upset reading code with a sig_atomic_t variable that is not volatile!
BTW, you could use the -fdump-tree-all option to GCC to get a lot of dump files, or use the MELT domain-specific language and plugin, notably its probe to query the GCC internal representations through a graphical interface.
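To illustrate the sig_atomic_t remark with a sketch (my own minimal example): the conventional pattern is a volatile sig_atomic_t flag set from a signal handler and polled in the main loop, which is why seeing such a flag without volatile is a red flag.

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t got_signal = 0;  /* volatile: the main loop must
                                                 actually re-read it */

static void on_sigint(int sig)
{
    (void)sig;
    got_signal = 1;      /* writing a volatile sig_atomic_t is async-signal-safe */
}

int main(void)
{
    signal(SIGINT, on_sigint);
    while (!got_signal) {}   /* spin until Ctrl-C */
    puts("got SIGINT");
    return 0;
}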
I am converting a number of low-level operations from native MATLAB code into C/mex code, with great speedups. (These low-level operations can be vectorized in .m code, but I think I take memory hits because of the large data. Whatever.) I have noticed that compiling the mex code with different CFLAGS can cause mild improvements. For example, CFLAGS = -O3 -ffast-math does indeed give some speedups, at the cost of mild numerical inaccuracy.
My question: what are the "best" CFLAGS to use, without incurring too many other side effects? It seems that, at the very least,
CFLAGS = -O3 -fno-math-errno -fno-unsafe-math-optimizations -fno-trapping-math -fno-signaling-nans are all OK. I'm not sure about -funroll-loops.
Also, how would you optimize the set of CFLAGS used, semi-automatically, without going nuts?
If you know the target CPU, or are at least willing to guarantee a "minimum" CPU, you should definitely look into -mcpu and -march.
The performance gain can be significant.
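For instance (a sketch with a made-up loop, not the asker's mex code), a kernel like the one below is a typical beneficiary: -O3 lets GCC auto-vectorize it, and -march=native (or a specific -march=/-mcpu= value) lets it use the host's SIMD instructions. One caveat worth knowing: on FMA-capable targets GCC may contract a*x+y into a fused multiply-add, which can change the last bits of the result; -ffp-contract=off disables that if bit-identical results matter.

/* saxpy.c - compare the generated code:
 *   gcc -O3 -S saxpy.c -o -
 *   gcc -O3 -march=native -S saxpy.c -o -
 * (Any speedup is machine-dependent; the point is that -march mainly changes
 * instruction selection rather than introducing -ffast-math-style rewrites.)
 */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}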
Whatever ATLAS uses on your machine (http://math-atlas.sourceforge.net/) is probably a good starting point. I don't know that ATLAS automatically optimizes specific compiler flags, but the developers have probably spent a fair amount of time doing so by hand.