I am converting a number of low-level operations from native matlab code into C/mex code, with great speedups. (These low-level operations can be done vectorized in .m code, but I think I get memory hits b/c of large data. whatever.) I have noticed that compiling the mex code with different CFLAGS can cause mild improvements. For example CFLAGS = -O3 -ffast-math does indeed give some speedups, at the cost of mild numerical inaccuracy.
My question: what are the "best" CFLAGS to use, without incurring too many other side effects? It seems that, at the very least that
CFLAGS = -O3 -fno-math-errno -fno-unsafe-math-optimizations -fno-trapping-math -fno-signaling-nans are all OK. I'm not sure about -funroll-loops.
also, how would you optimize the set of CFLAGS used, semi-automatically, without going nuts?
If you know the target CPU...or are at least willing to guarantee a "minimum" CPU...you should definitely look into -mcpu and -march.
The performance gain can be significant.
Whatever ATLAS uses on your machine (http://math-atlas.sourceforge.net/) is probably a good starting point. I don't know that ATLAS automatically optimizes specific compiler flags, but the developers have probably spent a fair amount of time doing so by hand.
Related
I'm building a toy cracking program for self teaching purposes in C. I want the brute forcing to run as fast as possible, and one of the considerations there is naturally compiler optimizations. Presumably, cryptographic implementations would break or have their results thrown off by forgoing floating point precision, but I tested enabling -Ofast (on gcc) with my current script and the final hash output from a long series of cryptographic functions remains the same as with just -O3.
I understand though that this isn't necessarily conclusive as there's a lot that can be going on under the hood with modern compilers, so my question is, will enabling -Ofast on my crypto cracking script potentially throw off the results of my crypto functions?
-Ofast does this:
Disregard strict standards compliance. -Ofast enables all -O3
optimizations. It also enables optimizations that are not valid for
all standard-compliant programs. It turns on -ffast-math and the
Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is
specified, and -fno-protect-parens.
-ffast-math turns on a bunch of other flags, but none of them matter unless you're using floating-point arithmetic, which no hash function I'm aware of does.
-fstack-arrays and -fno-protect-parens don't do anything at all unless you're using Fortran.
In my class project, my project is set to use gcc's optimization level of -O0 (no optimizations) and we are not allowed to change it for the final submission.
I tested my code using -O2 and got around a 2x speedup of my entire program. So I was wondering, is it possible to go through each optimization that -O2 does, and manually do those optimizations in my code? Or are some of the -O2 optimizations internal to the stack, frame, machine/assembly, etc, thus disallowing me, the programmer, from manually making those optimizations in my source code (If that makes sense)
Is it possible to go through each optimization that -O2 does, and manually do those optimizations in my code?
No. Many of the optimizations performed by the compiler cannot be represented in C. Some of these include:
Disabling the frame pointer
Removing unnecessary register saves/restores at the beginning and end of a function
"Peephole" optimizations on the assembly, such as removing redundant moves, loads, or stores
Inserting no-ops to align loops to specific address boundaries (typically 16 bytes)
This isn't to say that all of the optimizations performed by the compiler are untranslatable, of course -- merely that some of them are.
Yes, but that's the same as building your own 8086-class microprocessor in Minecraft — not worth your time and effort. And yes, many of those optimizations involve stuff below the language level of abstraction. Your professor might have unknown-to-you reasons for wanting an unoptimized executable.
My program heavily depends on the special functions from GSL and thus I would like to make it run faster, so I wish to compile GSL with higher optimization levels.
When I compile gsl, the default CFLAGS is "-g -O2" if I do nothing when I configure with "./configure". I am wondering why gsl is defaulting to an optimization level of O2 only since O3 is compliant to standards. I tried to compile and test with "./configure CFLAGS='-g -O3'", things worked. But I'm still not sure if everything would work.
Can anyone tell me why GSL is defaulting to O2 instead of O3? Would it be dangerous if I default to O3? Thanks!
The optimization level 3 is something that should only be used in case it is absolutely sure that it helps the library.
Since that level activates optimizations that may increase the size of the code a lot. This means in some cases it creates binaries that are even slower compared to a binary optimized with -O2. How ever that happens rarely. More likely are effects like a massively increased time to compile it, along with a increased binary size and a barely measurable performance change.
That -O3 actually breaks something was pretty common some time back, but in the last couple of years I did not have a single case where -O3 actually optimized something that caused the binary to break.
In the end the optimization level is something you can just test. Since -O2 is the default, it is a pretty safe bet that this is the best settings for the compile operation in this case. If you feel like it you could try to compile it with a different setting to see if it makes any performance difference.
Interesting options are -O3 and even -Os. I had cases in the past were both options gave improved performance over -O2.
So the real answer is: Try it and see what happens.
Which gcc compiler options may be safely used for numerical programming?
The easy way to turn on optimizations for gcc is to add -0# to the compiler options. It is tempting to say -O3. However I know that -O3 includes optimization which are non-save in the sense that results of numerical computations may differ once this option is included. Small changes in the result may be insignificant if the algorithm is stable. On the other hand, precision can be an issue for certain math operations, so math optimization can have significant impact.
I find it inconvenient to take compiler dependent issues into account in the process of debugging. I.e. I don't want to wonder whether minor changes in the code will lead to strongly different behavior because the compiler changed its optimizations internally.
Which options are safe to add if I want deterministic--and hence controllable--behavior in my code? Which are almost safe, that is, which options induce only minor uncertainties compared to performance benefits?
I think of options like: -finline -finline-limit=2000 which inlines functions even if they are long.
It is not true that -O3 includes numerically unsafe optimizations. According to the manual, -O3 includes the following optimization passes in comparison to -O2:
-finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and -fipa-cp-clone
You might be referring to -ffast-math, turned on by default with -Ofast, but not with -O3:
-ffast-math Sets -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range. This option causes the preprocessor macro __FAST_MATH__ to be defined.
This option is not turned on by any -O option besides -Ofast since it
can result in incorrect output for programs that depend on an exact
implementation of IEEE or ISO rules/specifications for math functions.
It may, however, yield faster code for programs that do not require
the guarantees of these specifications.
In other words, all of -O, -O2, and -O3 are safe for numeric programming.
For a brief report I have to do, our class ran code on a cluster using both gcc -O0 and icc -O0. We found that gcc was about 2.5 times faster than icc without any optimizations? Why is this? Does gcc -O0 actually do some minor optimization or does it simply happen to work better for this system?
The code was an implementation of the naive string searching algorithm found here, written in c.
Thank you
Performance at -O0 is not interesting or indicative of anything. It explicitly says "I don't care about performance", and the compiler takes you up on that; it just does whatever happens to be simplest. By random luck, what is simplest for GCC is faster than what is simplest for ICC for one highly specific microbenchmark on your specific hardware configuration. If you ran 100 other microbenchmarks, you would probably find some where ICC is faster, too. Even if you didn't, that still wouldn't mean much. If you're going to compare performance across compilers, turn on optimizations, because that's what you do if you care about performance.
If you want to understand why one is faster, profile the execution. Where is the execution time being spent? Where are there stalls? Why do those stalls occur?
A few things to take into account:
The instruction set each compiler uses by default. For example if your GCC build produces i686 code by default, while ICC restricts itself to i586 opcodes, you would probably see a significant performance difference.
The actual CPUs in your cluster. If you are using AMD processors, instead of Intel CPUs, then ICC is at a disadvantage because it is, of course, targeted specifically to Intel processors.
You mentioned using a cluster. Does this speed difference exist on a single processor as well? If you used any parallelisation facilities provided by your compiler, there could be significant differences there.
Simplistically, when optimisations are disabled, the compiler uses pre-made "templates" for each code construct. Since these templates are intended to be optimised afterwards, they are constructed in a way that enables the optimisation passes to produce better code. The fact that they may be slower or faster with -O0 does not really mean anything - for example, more explicit initial code could be easier to optimise but far slower to execute.
That said, the only way to find out what is going on is to profile the execution of your code and, if necessary, have a look at the assembly of those parts of the code where the major differences lie.