Compile GSL with O3 optimization level - c

My program heavily depends on the special functions from GSL and thus I would like to make it run faster, so I wish to compile GSL with higher optimization levels.
When I compile GSL, the default CFLAGS is "-g -O2" if I run a plain "./configure". I am wondering why GSL defaults to -O2 only, since -O3 is also standards-compliant. I tried compiling and testing with "./configure CFLAGS='-g -O3'" and things worked, but I'm still not sure that everything would work.
Can anyone tell me why GSL is defaulting to O2 instead of O3? Would it be dangerous if I default to O3? Thanks!

Optimization level 3 should only be used when you are sure that it actually helps the library, since that level activates optimizations that may increase the size of the code a lot. In some cases this produces binaries that are even slower than a binary optimized with -O2. However, that happens rarely. More likely effects are a massively increased compile time, a larger binary, and a barely measurable performance change.
-O3 actually breaking something was pretty common some time back, but in the last couple of years I have not had a single case where an -O3 optimization caused the binary to break.
In the end, the optimization level is something you can just test. Since -O2 is the default, it is a pretty safe bet that it is a sensible setting for this library. If you feel like it, you could compile with a different setting to see whether it makes any performance difference.
Interesting options are -O3 and even -Os. I have had cases in the past where both options gave improved performance over -O2.
So the real answer is: Try it and see what happens.

Related

Can you do all gcc optimizations (-O2, -O3) manually in your c source code?

In my class project, my project is set to use gcc's optimization level of -O0 (no optimizations) and we are not allowed to change it for the final submission.
I tested my code using -O2 and got around a 2x speedup of my entire program. So I was wondering, is it possible to go through each optimization that -O2 does, and manually do those optimizations in my code? Or are some of the -O2 optimizations internal to the stack, frame, machine/assembly, etc, thus disallowing me, the programmer, from manually making those optimizations in my source code (If that makes sense)
Is it possible to go through each optimization that -O2 does, and manually do those optimizations in my code?
No. Many of the optimizations performed by the compiler cannot be represented in C. Some of these include:
Disabling the frame pointer
Removing unnecessary register saves/restores at the beginning and end of a function
"Peephole" optimizations on the assembly, such as removing redundant moves, loads, or stores
Inserting no-ops to align loops to specific address boundaries (typically 16 bytes)
This isn't to say that all of the optimizations performed by the compiler are untranslatable, of course -- merely that some of them are.
Yes, but that's the same as building your own 8086-class microprocessor in Minecraft — not worth your time and effort. And yes, many of those optimizations involve stuff below the language level of abstraction. Your professor might have unknown-to-you reasons for wanting an unoptimized executable.

Are the effects of hand-optimization of code consistent across different gcc optimization levels?

If I run gcc with -O0, and hand-optimize my code using techniques such as the ones mentioned here, will it generally be the case that my optimized code will run faster than my unoptimized code when I run gcc with -O3?
That is, if I hand-optimize code under a particular compiler optimization level, is it generally true that these optimizations will still be productive (rather than counterproductive) under a different (higher or lower) compiler optimization level?
The result might not be the same with a different compiler or a different optimization level. The compiler can even undo your hand optimizations, that is, simply ignore them. It depends heavily on the implementation and behavior of the compiler itself. Many source-level optimizations are only a request to the compiler, which can be dropped at any time (mostly without any notification).

Performance of compiled code by compiled compiler

If I want to achieve better performance from, let's say, MySQLdb, I can compile it myself and I will get better performance because it is then compiled not for i386, i486 or whatever, but for my own CPU. Further, I can choose the compile options and so on...
Now, I was wondering if this is also true for less ordinary software, such as a compiler.
Here comes the 1st part:
Will compiling a compiler like GCC result in better performance?
and the 2nd part:
Will the code compiled by my own compiled compiler perform better?
(Yes, I know, I can compile my compiler and benchmark it... but maybe ... someone already knows the answer, and will share it with us =)
In answer to your first question, almost certainly yes. Binary versions of gcc will be the "lowest common denominator" and, if you compile them with special flags more appropriate to your system, it will most likely be faster.
As to your second question, no.
The output of the compiler will be the same regardless of how you've optimised it (unless it's buggy, of course).
In other words, even if you totally stuffed up your compiler flags when compiling gcc, to the point where your particular compiled version of gcc takes a week and a half to compile "Hello World", the actual "Hello World" executable should be identical to the one produced by the "lowest common denominator" gcc (if you use the same flags).
(1) It is possible. If you introduce a new optimization to your compiler, and re-compile it with this optimization included - it is possible that the re-compiled code will perform better.
(2) No!!!! A compiler cannot change the logic of the code! In your case, the logic of the code is the native code produced at the end. So whether compiler A_1 is compiled using compiler A_2 or B has no effect on the native code produced by A_1 [here A_1 and A_2 are the same compiler; the index is just for clarity].
a. Well, you can compile the compiler for your system, and maybe it will run faster, like any program. (I think that usually it's not worth it, but do whatever you want.)
b. No. Even if you compile the compiler on your computer, its behavior should not change, and so the code that it generates also doesn't change.
Will compiling a compiler like GCC result in better performance?
A program compiled specifically for the target platform it is used on will usually perform better than a program compiled for a generic platform. Why is this? Knowledge about the hardware can help the compiler align data to be cache friendly and choose an instruction ordering that plays well with a CPU's pipelining.
The most benefit is usually achieved by leveraging specific instruction sets such as SSE (in its various versions).
On the other hand, you should ask yourself whether a program like GCC is really CPU bound (much more likely it will be IO bound) and whether tuning its CPU performance provides any measurable benefit.
Will the code compiled by my own compiled compiler perform better
Hopefully not! Allowing a compiler to optimize a program should never change its behavior. No matter how you compiled your GCC, it should compile code to the same binaries as a generic binary distribution of GCC would.
If code compiled for the specific platform is faster than code compiled for a generic platform, why don't we all ship source code instead of binaries? Guess what, some Linux distros, such as Gentoo, actually follow this philosophy. And while you're at it, make sure to build statically linked binaries; disk space is so cheap nowadays and it gives you at least another 0.001% of performance.
Alright, that was a bit sarcastic. The reason people distribute generic binaries is pretty obvious: it's generic, the lowest common denominator, and it will work everywhere. That's a big bonus in terms of flexibility and user friendliness. I remember once compiling Gnome for my Gentoo box; it took a day or two! (But it must have been so much faster ;-) )
On the other hand, there are occasions where you want to get the best performance possible, and then it makes sense to build and optimize for specific architectures.
GCC uses a three-stage bootstrap when building from source. Basically, it compiles the source three times to ensure that the build tools and the compiler are built successfully. This bootstrapping is used for validation purposes. However, it is possible to use stage 1 as a benchmark for optimizing the later stages. You should build GCC with make profiledbootstrap to use this profile-based optimization.
This profile-based build process increases the performance of GCC itself, but not of the software compiled with it, as other answers point out.

Why would gcc -O0 be faster than icc -O0?

For a brief report I have to do, our class ran code on a cluster using both gcc -O0 and icc -O0. We found that gcc was about 2.5 times faster than icc without any optimizations. Why is this? Does gcc -O0 actually do some minor optimization, or does it simply happen to work better for this system?
The code was an implementation of the naive string searching algorithm found here, written in c.
Thank you
Performance at -O0 is not interesting or indicative of anything. It explicitly says "I don't care about performance", and the compiler takes you up on that; it just does whatever happens to be simplest. By random luck, what is simplest for GCC is faster than what is simplest for ICC for one highly specific microbenchmark on your specific hardware configuration. If you ran 100 other microbenchmarks, you would probably find some where ICC is faster, too. Even if you didn't, that still wouldn't mean much. If you're going to compare performance across compilers, turn on optimizations, because that's what you do if you care about performance.
If you want to understand why one is faster, profile the execution. Where is the execution time being spent? Where are there stalls? Why do those stalls occur?
A few things to take into account:
The instruction set each compiler uses by default. For example if your GCC build produces i686 code by default, while ICC restricts itself to i586 opcodes, you would probably see a significant performance difference.
The actual CPUs in your cluster. If you are using AMD processors, instead of Intel CPUs, then ICC is at a disadvantage because it is, of course, targeted specifically to Intel processors.
You mentioned using a cluster. Does this speed difference exist on a single processor as well? If you used any parallelisation facilities provided by your compiler, there could be significant differences there.
Simplistically, when optimisations are disabled, the compiler uses pre-made "templates" for each code construct. Since these templates are intended to be optimised afterwards, they are constructed in a way that enables the optimisation passes to produce better code. The fact that they may be slower or faster with -O0 does not really mean anything - for example, more explicit initial code could be easier to optimise but far slower to execute.
That said, the only way to find out what is going on is to profile the execution of your code and, if necessary, have a look at the assembly of those parts of the code where the major differences lie.

safe, fast CFLAGS for mex functions in matlab

I am converting a number of low-level operations from native matlab code into C/mex code, with great speedups. (These low-level operations can be done vectorized in .m code, but I think I get memory hits b/c of large data. whatever.) I have noticed that compiling the mex code with different CFLAGS can cause mild improvements. For example CFLAGS = -O3 -ffast-math does indeed give some speedups, at the cost of mild numerical inaccuracy.
My question: what are the "best" CFLAGS to use, without incurring too many other side effects? It seems that, at the very least,
CFLAGS = -O3 -fno-math-errno -fno-unsafe-math-optimizations -fno-trapping-math -fno-signaling-nans are all OK. I'm not sure about -funroll-loops.
Also, how would you optimize the set of CFLAGS used, semi-automatically, without going nuts?
If you know the target CPU, or are at least willing to guarantee a "minimum" CPU, you should definitely look into -march and -mtune (on x86, GCC treats -mcpu as a deprecated synonym for -mtune).
The performance gain can be significant.
Whatever ATLAS uses on your machine (http://math-atlas.sourceforge.net/) is probably a good starting point. I don't know that ATLAS automatically optimizes specific compiler flags, but the developers have probably spent a fair amount of time doing so by hand.
