I've gotten a piece of software working, and am now trying to tune it up so it runs faster. I discovered something that struck me as, well, just bizarre. It's no longer relevant, because I switched to using a pointer instead of indexing an array (the pointer version is faster), but I'd still like to know what is going on.
Here's the code:
short mask_num_vals(short mask)
{
    short count = 0;
    for (short val = 0; val < NUM_VALS; val++)
        if (mask & val_masks[val])
            count++;
    return count;
}
This small piece of code is called many, many times. What really surprised me is that this code runs significantly faster than its predecessor, which simply had the two arguments to the "&" operation reversed.
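For reference, the predecessor was essentially the same function with just the operands of the "&" swapped:
short mask_num_vals(short mask)
{
    short count = 0;
    for (short val = 0; val < NUM_VALS; val++)
        if (val_masks[val] & mask)   /* operands of & in the original order */
            count++;
    return count;
}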
Now, I would have thought the two versions would be, for all practical purposes, identical, and they do produce the same result. But the version above is faster - noticeably faster. It makes about a 5% difference in the running time of the overall code that uses it. My attempt to measure the amount of time spent in the function above failed completely - the measurement itself used up far more time than actually executing the rest of the code. (A version of Heisenberg's principle for software, I guess.)
So my picture here is, the compiled code evaluates the two arguments, and then does a bitwise "and" on them. Who cares which order the arguments are in? Apparently the compiler or the computer does.
My completely unsupported conjecture is that the compiled code must be evaluating "val_masks[val]" for each bit. If "val_masks[val]" comes first, it evaluates it for every bit, if "mask" comes first, it doesn't bother with "val_masks[val]" if that particular bit in "mask" is zero. I have no evidence whatsoever to support this conjecture; I just can't think of anything else that might cause this behaviour.
Does this seem likely? This behaviour just seemed weird to me, and I think it points to some difference between my picture of how the compiled code works and how it actually works. Again, it's not all that relevant any more, as I've evolved the code further (using pointers instead of arrays). But I'd still be interested in knowing what is causing this.
Hardware is an Apple MacBook Pro 15-inch 2018, MacOS 10.15.5. Software is gcc compiler, and "gcc --version" produces the following output.
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 11.0.3 (clang-1103.0.32.62)
Target: x86_64-apple-darwin19.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Compiled with the command "gcc -c -Wall 'C filename'", linked with "gcc -o -Wall 'object filenames'".
Code optimizers are often unpredictable. Their output can change after small, seemingly meaningless tweaks to the code, after changing command-line options, or after upgrading the compiler. You cannot always explain why the compiler does some optimization in one case but not in another; you can guess all you want, but only experimentation will tell.
One powerful technique for determining what is going on: convert your two versions of the code to assembly language and compare them.
GCC can be invoked with the -S command-line switch for that:
gcc -S -Wall -O -fverbose-asm your-c-source.c
This produces a textual assembler file, your-c-source.s, from the C file your-c-source.c. You can glance into it with a pager such as less or a source-code editor such as GNU Emacs.
The Clang compiler has similar options.
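For instance, assuming the two variants of the function live in hypothetical files version_a.c and version_b.c, you could generate assembly for both and compare:
gcc -S -Wall -O -fverbose-asm version_a.c
gcc -S -Wall -O -fverbose-asm version_b.c
diff version_a.s version_b.s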
Related
The carefully crafted, self-including code in this IOCCC winning entry from 1988:
http://www.ioccc.org/years.html#1988_isaak
...was still too much for certain systems back then. Also, ANSI C was finally emerging as a stable alternative to the chaotic K&R ecosystem. As a result, the IOCCC judges also provided an ANSI version of this entry:
http://www.ioccc.org/1988/isaak.ansi.c
Its main attraction is its gimmick of including <stdio.h> on the last line (!), with well-thought-out #defines, both inside the source and at compile time, to let only certain parts of the code through at each level. This is what allows the <stdio.h> header to be included at the latest stage possible, just before it is needed, in the source fed to the compiler.
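To give a rough idea of the mechanism, here is a minimal sketch of the general self-inclusion pattern (an illustration only, with made-up macro names, not the actual entry's code):
/* sketch.c - a file that includes itself, gated by the preprocessor */
#ifndef SECOND_PASS
#define SECOND_PASS
/* first pass: code that must be seen before any system header goes here */
#include __FILE__          /* re-include this very file for the second pass */
#else
/* second pass: the real header is pulled in as late as possible */
#include <stdio.h>
int main(void) { puts("hello"); return 0; }
#endif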
However, this version still fails to produce its output when compiled today, with the provided compiler settings:
gcc -std=c89 -DI=B -DO=- -Dy isaak.ansi.c
tcc -DI=B -DO=- -Dy isaak.ansi.c
Versions used: GCC 9.3.0, TCC 0.9.27
There isn't any evident reliance on the compiled binary's filename, so I left it to the compiler's default. Even when using -o isaak or -o isaak.ansi, the result is the same: no output.
What is causing this? How are the output functions failing? What can be done to correct this?
Thanks in advance!
NOTE: The IOCCC judges, realising that this entry had portability issues that would detract from its obfuscation value, decided to also include a UUENCODEd version of the code's output:
http://www.ioccc.org/1988/isaak.encode
There is nothing remotely portable about this program. As I see it, it tries to override the exit standard library function with its own code, expecting that returning from an empty main() would call that exit(), which is not true. And even then, such behaviour is not standard-conforming: even C89 said it would have undefined behaviour.
You can "fix" the program on modern GCC / Linux by actually calling exit() inside main - just change the first line to
main(){exit(0);}
I compiled with gcc -std=c89 -DI=B -DO=- -Dy isaak.ansi.c, ran ./a.out, and got sensible output.
For the heck of it, I decided to see if a program I started writing on an Amiga many years ago, and much further developed on other machines since, would still compile and run on an Amiga. I originally used Lattice C because that's what I had used before, but the 68881 support in Lattice is VERY buggy, so I decided to try gcc. I think the most recent version of gcc for the Amiga is 2.7.0 (so I can't upgrade). It has worked rather well except for one bug in the 68881 support: when multiplying any negative number by zero, the result is always:
1.:00000
when printed out (the colon is NOT a typo). BTW, if you set x to zero directly and then print it out, it's 0.00000 like it should be.
Here's a sample program to test the bug. It doesn't matter which variable is 0 and which is negative; if the non-zero value is positive, it works fine.
#include <stdio.h>
#include <math.h>

main()
{
    float x, a, b;

    a = -10.0;
    b = 0.0;
    x = a * b;
    printf("%f\n", x);
}
and it's compiled with: gcc -o tt -m68020 -m68881 tt.c -lm
Taking out -m68881 works fine (but of course, it doesn't use the FPU).
Taking out -lm and/or math.h makes no difference.
Does anyone know of a bug fix or workaround? Maybe a gcc command line argument? (would rather not have to do UGLY things like "if ((a<0)&&(b==0))")
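For illustration, the kind of workaround I'd rather avoid would be something like this (an untested sketch; the function name is made up):
/* avoid the buggy negative-times-zero case in the FPU path */
float safe_mul(float a, float b)
{
    if (a == 0.0 || b == 0.0)
        return 0.0;
    return a * b;
}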
BTW, since I don't have a working Amiga anymore, I've had to use an emulator. If you want to see what I've been doing on this project (using Lattice C version), you can view my video at:
https://www.youtube.com/watch?v=x8O-qYQvP4M
(Amiga part starts at 10:07)
Thanks for any help.
This isn't exactly an answer, but a revelation that the problem is rather complicated (more so than a simple bug with gcc). Here's the info:
If I set the Amiga emulator to emulate a 68020 or 68030 with a 68881 or 68882, INSTEAD of a 68040 using the 68040's internal FPU, it doesn't produce the 1.:00000 (in other words, it works). That could mean the emulator is to blame for not emulating the 68040's FPU correctly (though I imagine the 68040's FPU is likely compatible with the 68881/68882). I don't know if there's a performance hit in setting the emulator to a 68020/30 plus 68881/2; I have the emulator set to run as fast as possible on the host machine instead of going at the speed of a real 680x0.
However, if I compile with the Amiga gcc's -noixemul option, the code works correctly in every combination of CPU and FPU. That would indicate it's a problem with the Amiga's version of gcc - really the part of the gcc system that tries to emulate UNIX on an Amiga, which is what ixemul.library does. So it might not be gcc's fault (if it were compiled on some other system that uses a 68040 it would probably work), but rather the fault of the people who ported gcc to the Amiga.
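For reference, the working build was presumably just the earlier command line with the extra switch added:
gcc -noixemul -o tt -m68020 -m68881 tt.c -lm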
So, you might say "problem solved, just use -noixemul" - well, not so fast... Although the simple test program doesn't crash, my bigger program that exposed this problem crashes on exit (a recoverable GURU meditation) only when compiled with -noixemul (perhaps it's trying to close a library that was never opened, I don't know). This is why I didn't use -noixemul even though I wanted to.
So, it's not exactly solved, but I would say it's not likely a non-Amiga gcc bug.
If I want to achieve better performance from, say, MySQLdb, I can compile it myself and get better performance, because it isn't compiled for a generic i386, i486, or whatever, but for my specific CPU. Furthermore, I can choose the compile options, and so on...
Now, I was wondering whether this is also true for non-regular software, such as a compiler.
Here comes the 1st part:
Will compiling a compiler like GCC result in better performance?
and the 2nd part:
Will the code compiled by my own compiled compiler perform better?
(Yes, I know, I can compile my compiler and benchmark it... but maybe ... someone already knows the answer, and will share it with us =)
In answer to your first question, almost certainly yes. Binary versions of gcc will be the "lowest common denominator", and if you compile gcc yourself with flags better suited to your system, it will most likely be faster.
As to your second question, no.
The output of the compiler will be the same regardless of how you've optimised it (unless it's buggy, of course).
In other words, even if you totally stuffed up your compiler flags when compiling gcc, to the point where your particular compiled version of gcc takes a week and a half to compile "Hello World", the actual "Hello World" executable should be identical to the one produced by the "lowest common denominator" gcc (if you use the same flags).
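If you want to convince yourself, compile the same (hypothetical) hello.c to assembly with both the stock gcc and your own build, using identical flags, and diff the results - assuming both builds are the same gcc version and configuration, the output should match:
/usr/bin/gcc -S -O2 hello.c -o hello-stock.s
$HOME/gcc-custom/bin/gcc -S -O2 hello.c -o hello-own.s
diff hello-stock.s hello-own.s
(The paths above are just examples.)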
(1) It is possible. If you introduce a new optimization to your compiler and re-compile the compiler with this optimization included, it is possible that the re-compiled compiler will perform better.
(2) No!!!! A compiler cannot change the logic of the code! In your case, the logic of the code is the native code produced at the end. So whether compiler A_1 is compiled using compiler A_2 or compiler B has no effect on the native code produced by A_1 [here A_1 and A_2 are the same compiler; the index is just for clarity].
a. Well, you can compile the compiler for your system, and maybe it will run faster, like any program. (I think it's usually not worth it, but do whatever you want.)
b. No. Even if you compile the compiler on your computer, its behavior should not change, and so the code that it generates also doesn't change.
Will compiling a compiler like GCC result in better performance?
A program compiled specifically for the target platform it is used on will usually perform better than a program compiled for a generic platform. Why is this? Knowledge about the hardware can help the compiler align data to be cache friendly and choose an instruction ordering that plays well with a CPU's pipelining.
The most benefit is usually achieved by leveraging specific instruction sets such as SSE (in its various versions).
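For example, with GCC the usual way to opt into everything your own CPU supports is something along the lines of (the file name is just a placeholder):
gcc -O2 -march=native -o myprog myprog.c
-march=native tells GCC to target the CPU of the machine doing the compiling, including whatever SSE level it supports.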
On the other hand, you should ask yourself whether a program like GCC is really CPU bound (much more likely it will be I/O bound) and whether tuning its CPU performance provides any measurable benefit.
Will the code compiled by my own compiled compiler perform better?
Hopefully not! Allowing a compiler to optimize a program should never change its behavior. No matter how you compiled your GCC, it should compile code to the same binaries as a generic binary distribution of GCC would.
If code compiled for a specific platform is faster than code compiled for a generic platform, why don't we all ship source code instead of binaries? Guess what: some Linux distros, such as Gentoo, actually follow this philosophy. And while you're at it, make sure to build statically linked binaries; disk space is so cheap nowadays, and it gives you at least another 0.001% of performance.
Alright, that was a bit sarcastic. The reason people distribute generic binaries is pretty obvious: it's generic, the lowest common denominator, and it will work everywhere. That's a big bonus in terms of flexibility and user friendliness. I remember once compiling Gnome for my Gentoo box; it took a day or two! (But it must have been so much faster ;-) )
On the other hand, there are occasions where you want to get the best performance possible, and then it makes sense to build and optimize for specific architectures.
GCC uses a three-stage bootstrap when building from source. Basically, it compiles the source three times to ensure that the build tools and the compiler are built successfully. This bootstrapping is used for validation purposes. However, it is possible to use stage 1 as a benchmark for optimizing the later stages. You should build GCC with make profiledbootstrap to use this profile-based optimization.
This profile-based build process increases the performance of GCC itself, but not of the software compiled with it, as the other answers point out.
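A typical build sequence might look roughly like this (the source path and install prefix are only placeholders):
/path/to/gcc-source/configure --prefix=$HOME/gcc-custom
make profiledbootstrap
make install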
For a brief report I have to do, our class ran code on a cluster using both gcc -O0 and icc -O0. We found that gcc was about 2.5 times faster than icc without any optimizations. Why is this? Does gcc -O0 actually do some minor optimization, or does it simply happen to work better on this system?
The code was an implementation of the naive string searching algorithm found here, written in C.
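For reference, the naive algorithm is typically something along these lines (a generic sketch, not necessarily the exact code we used):

#include <stddef.h>
#include <string.h>

/* Returns the index of the first occurrence of needle in haystack, or -1. */
long naive_search(const char *haystack, const char *needle)
{
    size_t n = strlen(haystack);
    size_t m = strlen(needle);

    if (m > n)
        return -1;
    for (size_t i = 0; i + m <= n; i++) {
        size_t j = 0;
        while (j < m && haystack[i + j] == needle[j])
            j++;
        if (j == m)
            return (long)i;
    }
    return -1;
}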
Thank you
Performance at -O0 is not interesting or indicative of anything. It explicitly says "I don't care about performance", and the compiler takes you up on that; it just does whatever happens to be simplest. By random luck, what is simplest for GCC is faster than what is simplest for ICC for one highly specific microbenchmark on your specific hardware configuration. If you ran 100 other microbenchmarks, you would probably find some where ICC is faster, too. Even if you didn't, that still wouldn't mean much. If you're going to compare performance across compilers, turn on optimizations, because that's what you do if you care about performance.
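In other words, rerun the comparison with something like (the file name is just a placeholder):
gcc -O2 -o search_gcc search.c
icc -O2 -o search_icc search.c
and then time those two binaries instead.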
If you want to understand why one is faster, profile the execution. Where is the execution time being spent? Where are there stalls? Why do those stalls occur?
A few things to take into account:
The instruction set each compiler uses by default. For example, if your GCC build produces i686 code by default while ICC restricts itself to i586 opcodes, you would probably see a significant performance difference.
The actual CPUs in your cluster. If you are using AMD processors, instead of Intel CPUs, then ICC is at a disadvantage because it is, of course, targeted specifically to Intel processors.
You mentioned using a cluster. Does this speed difference exist on a single processor as well? If you used any parallelisation facilities provided by your compiler, there could be significant differences there.
Simplistically, when optimisations are disabled, the compiler uses pre-made "templates" for each code construct. Since these templates are intended to be optimised afterwards, they are constructed in a way that enables the optimisation passes to produce better code. The fact that they may be slower or faster with -O0 does not really mean anything - for example, more explicit initial code could be easier to optimise but far slower to execute.
That said, the only way to find out what is going on is to profile the execution of your code and, if necessary, have a look at the assembly of those parts of the code where the major differences lie.
Just curious. Using gcc/gdb under Ubuntu 9.10.
I'm reading a C book that often gives the disassembly of the object file. When I read it in January, my disassembly looked a lot like the book's; now it's quite different - possibly more optimized (I notice some rearrangements in the assembly code that, at least in the files I checked, look like optimizations). I have used the optimization options -O1 through -O3 with gcc between the first and second read, but not before the first.
(1) Is the use of optimization options persistent, i.e., if you use them once, do they stay in effect until you switch them off? That would be weird (I browsed the man page and did not see anything to that effect). In the unlikely case that it is true, how do you switch them off?
(2) Has gcc's assembly changed through any recent upgrade?
(3) Does gcc sometimes produce (significantly) different assembly code although same compile options are chosen?
Thanks much.
1) No, the options don't persist.
2) Yes, the optimisers are constantly being changed and improved. If your gcc packages have been upgraded, then it is quite likely that the assembly generated for a particular source file will change.
3) Compiling with gcc is a deterministic process; if you compile the same source with the same version of gcc and the same options for the same target, then the assembly produced should be the same (modulo some symbol names).
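You can check this for yourself by compiling the same file twice with identical options and diffing the generated assembly (file names are placeholders):
gcc -S -O2 foo.c -o foo_run1.s
gcc -S -O2 foo.c -o foo_run2.s
diff foo_run1.s foo_run2.s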