gcc -O1 optimization slower than -O0 - c

I've been rewriting a program that is a mix of Fortran and C, around 10k lines of a random particle simulation, and I recently realised that Release mode was running a lot slower than Debug.
Debug (-O0): 23 seconds. Release (-O1 or -O2): 43 seconds, for a small test run.
The timing only changes when the C optimization level is changed (Project->Properties->C/C++ Build->GCC C Compiler->Optimization Level); it is unaffected by the GNU Fortran compiler settings.
Looking into this, it seems that -O1 and -O2 run slowly, while -O0 runs a lot faster. Even with all of the optimization flags that -O1 enables set manually (taken from the GCC docs), it still runs faster than with -O1 itself.
Could it be that running in Release produces slightly different results, which in turn cause extra computations (values falling outside their expected ranges, etc.)? And if so, would it be possible to get back the original Debug behaviour?
Any help would be appreciated, let me know if you need more information to help.
Chris.
Edit: System Information:
Windows 8.1 Pro
GCC version 4.8.1
Eclipse for Parallel Application Developers, Juno SR2

Okay, it turns out that this occurs with gcc 4.8.1 but not with later versions (it is fixed in 4.8.3). For both the equivalent C and Fortran codes, using -O1 or higher caused the compiler to incorrectly optimise the code.
Simplified, the incorrect result comes from the lines:
a = b - c;
c = c + a;
Substituting the first line into the second, this is c = c + (b - c), which should reduce to c = b, and computed step by step (as at -O0) that is effectively what happens.
With -O1 and above, however, the expression seems to be evaluated as a single c = c + (b - c) with different intermediate rounding, giving a slightly different result.
This creates a very slight inaccuracy (-0.0999999978 compared to -0.1000000015).
In this case, the inaccuracy caused a loop that should have run 2 times to run 1500 times, and this happened over 500 times across the run.
The way it was (accidentally) fixed was to force the compiler to calculate a = b - c before running c = c + a. This can be done with a printf("%f\n",a); or a c = c + 0; between calculating a and c (both seemed to work). Note that this line never actually has to run: you can put it inside an if statement that is never true (e.g. if (a == INT_MAX)), but not somewhere the compiler knows will never run (e.g. if (false)).
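As a rough illustration only, here is a minimal sketch of the pattern and the workaround (the values and the surrounding program are made up; the real b and c come from the simulation):
#include <stdio.h>

int main(void)
{
    float b = 0.5f, c = 0.6f;
    float a = b - c;
    /* in the real code, forcing a to be evaluated and stored at this
       point (by printing it, or by a harmless c = c + 0;) made the
       following update come out as expected under gcc 4.8.1 -O1 */
    printf("a = %f\n", a);
    c = c + a;                  /* mathematically this is just c = b */
    printf("c = %f  b = %f\n", c, b);
    return 0;
}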
At least these are my and my supervisor's current thoughts on the matter. I am still not exactly sure what part of the optimisation causes this, or what fixed it in the most recent versions, but I hope this can be helpful for someone else and save them a week of confusion.
Thanks for the help anyway. - Chris.

Related

Order of arguments for C bitwise operations?

I've gotten a piece of software working, and am now trying to tune it up so it runs faster. I discovered something that struck me as, well, just bizarre. It's no longer relevant, because I switched to using a pointer instead of indexing an array (it's faster with the pointers), but I'd still like to know what is going on.
Here's the code:
short mask_num_vals(short mask)
{
    /* val_masks[] and NUM_VALS are defined elsewhere in the program */
    short count = 0;
    for (short val = 0; val < NUM_VALS; val++)
        if (mask & val_masks[val])
            count++;
    return count;
}
This small piece of code is called many many times. What really surprised me is that this code runs significantly faster than its predecessor, which simply had the two arguments to the "&" operation reversed.
Now, I would have thought the two versions would be, for all practical purposes, identical, and they do produce the same result. But the version above is faster - noticeably faster. It makes about a 5% difference in the running time of the overall code that uses it. My attempt to measure the amount of time spent in the function above failed completely - measuring the time used up far more time than actually executing the rest of the code. (A version of Heisenberg's principle for software, I guess.)
So my picture here is, the compiled code evaluates the two arguments, and then does a bitwise "and" on them. Who cares which order the arguments are in? Apparently the compiler or the computer does.
My completely unsupported conjecture is that the compiled code must be evaluating "val_masks[val]" for each bit. If "val_masks[val]" comes first, it evaluates it for every bit, if "mask" comes first, it doesn't bother with "val_masks[val]" if that particular bit in "mask" is zero. I have no evidence whatsoever to support this conjecture; I just can't think of anything else that might cause this behaviour.
Does this seem likely? This behaviour just seemed weird to me, and I think points to some difference in my picture of how the compiled code works, and how it actually works. Again, not all that relevant any more, as I've evolved the code further (using pointers instead of arrays). But I'd still be interested in knowing what is causing this.
Hardware is an Apple MacBook Pro 15-inch 2018, MacOS 10.15.5. Software is gcc compiler, and "gcc --version" produces the following output.
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 11.0.3 (clang-1103.0.32.62)
Target: x86_64-apple-darwin19.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Compiled with the command "gcc -c -Wall 'C filename'", linked with "gcc -o -Wall 'object filenames'".
Code optimizers are often unpredictable. Their output can change after small meaningless tweaks in code, or after changing command-line options, or after upgrading the compiler. You cannot always explain why the compiler does some optimization in one case but not in another; you can guess all you want, but only experience can show.
One powerful technique in determining what is going on: convert your two versions of code to assembly language and compare.
GCC can be invoked with the -S command-line switch for that:
gcc -S -Wall -O -fverbose-asm your-c-source.c
This produces a textual assembler file your-c-source.s from the C file your-c-source.c; you can glance at it with a pager like less or a source-code editor such as GNU Emacs.
The Clang compiler has similar options.
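For instance, a minimal sketch of such a comparison, putting both operand orders side by side in one file (the file name and the mask table here are made up just so it compiles on its own):
#include <stdio.h>

#define NUM_VALS 8
static const short val_masks[NUM_VALS] = { 1, 2, 4, 8, 16, 32, 64, 128 };

/* operand order from the question: mask & val_masks[val] */
short count_mask_first(short mask)
{
    short count = 0;
    for (short val = 0; val < NUM_VALS; val++)
        if (mask & val_masks[val])
            count++;
    return count;
}

/* the "slower" predecessor: val_masks[val] & mask */
short count_table_first(short mask)
{
    short count = 0;
    for (short val = 0; val < NUM_VALS; val++)
        if (val_masks[val] & mask)
            count++;
    return count;
}

int main(void)
{
    printf("%d %d\n", count_mask_first(0x55), count_table_first(0x55));
    return 0;
}
Compiling this with gcc -S -Wall -O -fverbose-asm two_orders.c (or the clang equivalent) and diffing the two function bodies in the generated .s file should show whether the two orderings really produce different code.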

Floating Point Multiply Bug Using gcc 2.7.0 on Amiga with 68881 - Any Fixes/Workarounds?

For the heck of it, I decided to see if a program I started writing on an Amiga many years ago, and much further developed on other machines, would still compile and run on an Amiga. I originally used Lattice C because that's what I used before, but the 68881 support in Lattice is VERY buggy, so I decided to try gcc. I think the most recent version of gcc for Amiga is 2.7.0 (so I can't upgrade). It's worked rather well except for one bug in 68881 support: when multiplying any negative number by zero, the result is always:
1.:00000
when printed out (the colon is NOT a typo). BTW, if you set x to zero directly and then print it out, it's 0.00000 like it should be.
Here's a sample program to test the bug. It doesn't matter which variable is 0 and which is negative, and if the non-zero value is positive, it works fine.
#include <stdio.h>
#include <math.h>

int main(void)
{
    float x, a, b;
    a = -10.0;
    b = 0.0;
    x = a * b;
    printf("%f\n", x);
    return 0;
}
and it's compiled with: gcc -o tt -m68020 -m68881 tt.c -lm
Taking out -m68881 works fine (but of course then it doesn't use the FPU).
Taking out -lm and/or math.h makes no difference.
Does anyone know of a bug fix or workaround? Maybe a gcc command line argument? (would rather not have to do UGLY things like "if ((a<0)&&(b==0))")
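In the meantime, here is a minimal sketch of the kind of special-case workaround hinted at above (the helper name safe_mul is made up, and note it ignores the sign of zero):
#include <stdio.h>

/* wrap the multiply so the FPU never sees negative * 0
   (hypothetical helper; a proper fix would obviously be preferable) */
float safe_mul(float a, float b)
{
    if (a == 0.0 || b == 0.0)
        return 0.0;
    return a * b;
}

int main(void)
{
    printf("%f\n", safe_mul(-10.0, 0.0));
    return 0;
}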
BTW, since I don't have a working Amiga anymore, I've had to use an emulator. If you want to see what I've been doing on this project (using Lattice C version), you can view my video at:
https://www.youtube.com/watch?v=x8O-qYQvP4M
(Amiga part starts at 10:07)
Thanks for any help.
This isn't exactly an answer, but a revelation that the problem is rather complicated (more so than a simple bug with gcc). Here's the info:
If I set the Amiga emulator to emulate a 68020 or 68030 with a 68881 or 68882, instead of a 68040 using the 68040's internal FPU, it doesn't produce the 1.:00000 (in other words, it works). That could mean the emulator is to blame for not emulating the 68040's FPU correctly (though I imagine the 68040's FPU is likely compatible with the 68881/68882). I don't know if there's a performance hit in setting the emulator to a 68020/30 with a 68881/2; I have the emulator set to run as fast as possible on the host machine instead of going at the speed of the 680xx.
However, if I compile with the Amiga gcc's -noixemul option, the code works correctly in every combination of CPU and FPU. That would indicate it's a problem with the Amiga's version of gcc, really the part of the gcc system that tries to emulate UNIX on an Amiga (which is what ixemul.library does). So it might not be gcc's fault (compiled on some other system that uses a 68040 it would probably work), but rather the fault of the people who ported gcc to the Amiga.
So, you might say "problem solved, just use -noixemul" - well not so fast... Although the simple test program doesn't crash, my bigger program that exposed this problem crashes on program exit (recoverable GURU meditation) only when compiled with -noixemul (perhaps it's trying to close a library that was never opened, I don't know). This is why I didn't use -noixemul even though I wanted to.
So, it's not exactly solved, but I would say it's not likely a non-Amiga gcc bug.

Why is this NodeJS 2x faster than native C?

For the sake of a presentation at work, I wanted to compare the performance of NodeJS to C. Here is what I wrote:
Node.js (for.js):
var d = 0.0,
    start = new Date().getTime();
for (var i = 0; i < 100000000; i++)
{
    d += i >> 1;
}
var end = new Date().getTime();
console.log(d);
console.log(end - start);
C (for.c):
#include <stdio.h>
#include <time.h>

int main () {
    clock_t start = clock();
    long d = 0.0;
    for (long i = 0; i < 100000000; i++)
    {
        d += i >> 1;
    }
    clock_t end = clock();
    clock_t elapsed = (end - start) / (CLOCKS_PER_SEC / 1000);
    printf("%ld\n", d);
    printf("%lu\n", elapsed);
}
Using GCC I compiled my for.c and ran it:
gcc for.c
./a.out
Results:
2499999950000000
198
Then I tried it in NodeJS:
node for.js
Results:
2499999950000000
116
After running numerous times, I discovered this held true no matter what. If I switched for.c to use a double instead of a long in the loop, the time C took was even longer!
Not trying to start a flame war, but why is Node.JS (116 ms.) so much faster than native C (198 ms.) at performing this same operation? Is Node.JS applying an optimization that GCC does not do out of the box?
EDIT:
Per a suggestion in the comments, I ran gcc -Wall -O2 for.c. Results improved to C taking 29 ms. This begs the question: how is it that the native C settings are not optimized as much as a Javascript compiler? Also, what are -Wall and -O2 doing? I'm really curious about the details of what is going on here.
This begs the question, how is it that the native C settings are not optimized as much as a Javascript compiler?
C is statically compiled and linked, which can require a potentially lengthy build step over your entire codebase (I once worked in one that took almost an hour for a full optimized build, but only 10 minutes otherwise), and it is a very dangerous, hardware-level language that risks a lot of undefined behavior if you don't treat it with care. Because of that, the default settings of compilers usually don't optimize to smithereens: the default is a developer/debug build intended to help with debugging and productivity through faster turnaround.
So in C you get a distinct separation between an unoptimized but faster-to-build, easier-to-debug developer/debug build and a heavily optimized, slower-to-build, harder-to-debug production/release build that runs really fast. The default settings of compilers often favor the former.
With something like v8/NodeJS, you're dealing with a just-in-time compiler (dynamic compilation) that builds and optimizes only the necessary code on the fly at run-time. On top of that, JS is a much safer language, often designed with security in mind, that doesn't let you work with the raw bits and bytes of the hardware.
As a result, it doesn't need that kind of strong release/debug build distinction of a native, statically-compiled language like C/C++. But it also doesn't let you put the pedal to the metal as you can in C if you really want.
A lot of people trying to benchmark C/C++ coming from other languages often fail to understand this build distinction and the importance of compiler/linker optimization settings and get confused. As you can see, with the proper settings, it's hard to beat the performance of these native compilers and languages that allow you to write really low-level code.
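Concretely, the distinction usually comes down to something like the following two invocations (output names are arbitrary):
gcc -O0 -g for.c -o for_debug          # developer/debug build: quick to compile, easy to debug
gcc -O2 -DNDEBUG for.c -o for_release  # release build: slower to compile, much faster to run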
Adding the register keyword helps, as expected:
#include <stdio.h>
#include <time.h>

int main () {
    register long i, d;
    clock_t start = clock();
    i = d = 0L;
    for (i = 0; i < 100000000L; i++) {
        d += i >> 1;
    }
    clock_t end = clock();
    clock_t elapsed = (end - start) / (CLOCKS_PER_SEC / 1000);
    printf("%ld\n", d);
    printf("%lu\n", elapsed);
}
and compile with the C compiler
cc for.c -o for
./for ; node for.js
returns
2499999950000000
97
2499999950000000
222
I've also done some testing calculating prime numbers, and I've found that Node.js is about twice as fast as C (see here).
When you have a very simple counting type of loop, the -O2 optimization can sometimes convert the output to a simple formula without even iterating the loop (see Karl's Blog for an explanation). If you add something more complicated to the routine, it is likely node.js will be faster again. For example, I added a divisor term into your sample program and the C -O2 optimization was no longer able to convert it to a simple formula, and node.js became faster again.
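For this particular loop, the sum of i >> 1 over i = 0 .. N-1 (N even) pairs up as 0+0, 1+1, 2+2, ..., so an optimizer that spots this can replace the whole loop with something like the following hand-written sketch (not GCC's actual output):
#include <stdio.h>

/* closed form for the sum of (i >> 1), i = 0 .. n-1, with n even:
   each pair (2k, 2k+1) contributes k twice, so the total is
   2 * (0 + 1 + ... + n/2 - 1) = (n/2 - 1) * (n/2) */
long closed_form(long n)
{
    long half = n / 2;
    return (half - 1) * half;
}

int main()
{
    printf("%ld\n", closed_form(100000000L));   /* prints 2499999950000000 */
    return 0;
}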
I am still baffled as to how node.js can be faster than C at simple integer calculations, but in every test I've done so far it is faster. I've also performed some tests with bitwise calculations that I haven't posted yet, and node.js was still faster.
node.js and C are distinct in that node.js runs JavaScript through its virtual machine, whereas C is compiled ahead of time to machine language. As such, the two are handled differently. For node.js, you simply run the .js file. C is much different: when you compile code to machine language using GCC, you have to supply the compiler optimization settings, aka "flags", yourself. Your node.js program is actually slower than the C program if you specify the -O3 flag to GCC; in fact, the node.js program took twice as long for me as the C program did. You stated that you would like to know more about what the C compiler does to optimize code. This is a complex topic/field and I highly recommend reading this Wikipedia article on compiler optimization to learn more.
In short, you made an unfair comparison because you did not optimize your C code.

Are programs that are compiled gcc optimised by default?

While at university I learned that the compiler optimises our code in order for the executable to be faster. For example, when a variable is not used after a point, it will not be calculated.
So, as far as I understand, that means that if I have a program that calls a sorting algorithm, and the results of the algorithm are printed, then the algorithm will run. However, if nothing is printed (or used anywhere else), then there is no reason for the program to even make that call.
So, my question is:
Do these things (optimisations) happen by default when compiling with gcc, or only when the code is compiled with the -O1, -O2, or -O3 flags?
When you meet a new program for the first time, it is helpful to type man followed by the program name. When I did it for gcc, it showed me this:
Most optimizations are only enabled if an -O level is set on the command line. Otherwise they are disabled, even if individual optimization flags are specified.
...
-O0 Reduce compilation time and make debugging produce the expected results. This is the default.
To summarize, with -O0, all code that is in the execution path that is taken will actually execute. (Program text that can never be in any execution path, such as if (false) { /* ... */ }, may not generate any machine code, but that is unobservable.) The executed code will feel "as expected", i.e. it'll do what you wrote. That's the goal, at least.
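A quick way to see this for yourself: compile something like the following (hypothetical file name dead.c) with gcc -S -O0 dead.c and then with gcc -S -O2 dead.c, and compare the two .s files. The loop should survive at -O0 and typically disappears at -O2, because its result is never used.
#include <stdio.h>

int main(void)
{
    long unused = 0;
    for (long i = 0; i < 100000000; i++)
        unused += i;            /* result is never printed or used */
    printf("done\n");
    return 0;
}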

Performance of compiled code by compiled compiler

If I want to achieve better performance from, let's say, MySQLdb, I can compile it myself and I will get better performance because it isn't compiled for a generic i386, i486 or whatever, but specifically for my CPU. Further, I can choose the compile options and so on...
Now, I was wondering if this is also true for non-regular software, such as a compiler.
Here comes the first part:
Will compiling a compiler like GCC result in better performance?
and the second part:
Will the code compiled by my own compiled compiler perform better?
(Yes, I know, I can compile my compiler and benchmark it... but maybe ... someone already knows the answer, and will share it with us =)
In answer to your first question, almost certainly yes. Binary versions of gcc will be the "lowest common denominator", and if you compile gcc yourself with flags more appropriate to your system, it will most likely be faster.
As to your second question, no.
The output of the compiler will be the same regardless of how you've optimised it (unless it's buggy, of course).
In other words, even if you totally stuffed up your compiler flags when compiling gcc, to the point where your particular compiled version of gcc takes a week and a half to compile "Hello World", the actual "Hello World" executable should be identical to the one produced by the "lowest common denominator" gcc (if you use the same flags).
(1) It is possible. If you introduce a new optimization to your compiler, and re-compile it with this optimization included - it is possible that the re-compiled code will perform better.
(2) No!!!! A compiler cannot change the logic of the code! In your case, the logic of the code is the native code produced at the end. So, whether compiler A_1 is compiled using compiler A_2 or B has no effect on the native code produced by A_1 [here A_1 and A_2 are the same compiler; the index is just for clarity].
a. Well, you can compile the compiler for your system, and maybe it will run faster, like any program. (I think that usually it's not worth it, but do whatever you want.)
b. No. Even if you compile the compiler on your computer, its behavior should not change, and so the code that it generates also doesn't change.
Will compiling a compiler like GCC result in better performance?
A program compiled specifically for the target platform it is used on will usually perform better than a program compiled for a generic platform. Why is this? Knowledge about the hardware can help the compiler align data to be cache friendly and choose an instruction ordering that plays well with a CPU's pipelining.
The most benefit is usually achieved by leveraging specific instruction sets such as SSE (in its various versions).
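As a rough sketch of what "compiling for your own CPU" looks like in practice (prog.c is a placeholder):
gcc -O2 -o prog_generic prog.c                  # generic build, runs anywhere
gcc -O2 -march=native -o prog_native prog.c     # tuned to the build machine's CPU and its instruction sets (SSE etc.)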
On the other hand, you should ask yourself whether a program like GCC is really CPU bound (much more likely it will be IO bound) and whether tuning its CPU performance provides any measurable benefit.
Will the code compiled by my own compiled compiler perform better
Hopefully not! Allowing a compiler to optimize a program should never change its behavior. No matter how you compiled your GCC, it should compile code to the same binaries as a generic binary distribution of GCC would.
If code compiled for a specific platform is faster than code compiled for a generic platform, why don't we all ship source code instead of binaries? Guess what, some Linux distros, such as Gentoo, actually follow this philosophy. And while you're at it, make sure to build statically linked binaries; disk space is so cheap nowadays, and it gives you at least another 0.001% of performance.
Alright, that was a bit sarcastic. The reason people distribute generic binaries is pretty obvious: it's generic, the lowest common denominator, and it will work everywhere. That's a big bonus in terms of flexibility and user-friendliness. I remember once compiling Gnome for my Gentoo box; it took a day or two! (But it must have been so much faster ;-) )
On the other hand, there are occasions where you want to get the best performance possible, and it makes sense to build and optimize for specific architectures.
GCC uses a three-stage bootstrap when building from source. Basically, it compiles the source three times to ensure the build tools and the compiler are built successfully. This bootstrapping is used for validation purposes. However, it is possible to use stage 1 as a benchmark for optimizing later stages: build GCC with make profiledbootstrap to use this profile-based optimization.
This profile-based build process increases the performance of GCC itself, but not of the software compiled with it, as other answers point out.
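A rough sketch of such a from-source build (version number and install prefix are placeholders):
tar xf gcc-4.8.3.tar.gz
cd gcc-4.8.3 && ./contrib/download_prerequisites && cd ..
mkdir gcc-build && cd gcc-build
../gcc-4.8.3/configure --prefix=$HOME/gcc-4.8.3
make profiledbootstrap
make install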

Resources