I saw somewhere that the GCC compiler might prefer sometimes not using conditional mov when converting my code into ASM.
What are the cases where it might choose to do something other than conditional mov?
Compilers often favour if-conversion to cmov when both sides of the branch are short, especially with a ternary so you always assign a C variable. e.g. if(x) y=bar; sometimes doesn't optimize to CMOV but y = x ? bar : y; does use CMOV more often. Especially when y is an array entry that otherwise wouldn't be touched: introducing a non-atomic RMW of it could create a data-race not present in the source. (Compilers can't invent writes to possibly-shared objects.)
The obvious example of when if-conversion would be legal but obviously not profitable would be when there's a lot of work on both sides of an if/else. e.g. some multiplies and divides, a whole loop, and/or table lookups. Even if gcc can prove that it's safe to run both sides and select one result at the end, it would see that doing that much more work isn't worth avoiding a branch.
If-conversion to a data-dependency (branchless cmov) is only even possible in limited circumstances. e.g. Why is gcc allowed to speculatively load from a struct? shows a case where it can/can't be done. Other cases include doing a memory access that the C abstract machine doesn't, which the compiler can't prove won't fault. Or a non-inline function call that might have side-effects.
See also these questions about getting gcc to use CMOV.
Getting GCC/Clang to use CMOV
how to force the use of cmov in gcc and VS
Make gcc use conditional moves
See also Disabling predication in gcc/g++ - apparently gcc -fno-if-conversion -fno-if-conversion2 will disable use of cmov.
For a case where cmov hurts performance, see gcc optimization flag -O3 makes code slower than -O2 - GCC -O3 needs profile-guided optimization to get it right and use a branch for an if that turns out to be highly predictable. GCC -O2 didn't do if-conversion in the first place, even without PGO profiling data.
An example the other way: Is there a good reason why GCC would generate jump to jump just over one cheap instruction?
GCC seemingly misses simple optimization shows a case where a ternary has side-effects in both halves: ternary isn't like CMOV: only one side is even evalutated for side effects.
AVX-512 and Branching shows a Fortran example where GCC needs help from source changes to be able to use branchless SIMD. (Equivalent of scalar CMOV). This is a case of not inventing writes: it can't turn a read/branch into read/maybe-modify/write for elements that source wouldn't have written. If-conversion is usually necessary for auto-vectorization.
Related
In my class project, my project is set to use gcc's optimization level of -O0 (no optimizations) and we are not allowed to change it for the final submission.
I tested my code using -O2 and got around a 2x speedup of my entire program. So I was wondering, is it possible to go through each optimization that -O2 does, and manually do those optimizations in my code? Or are some of the -O2 optimizations internal to the stack, frame, machine/assembly, etc, thus disallowing me, the programmer, from manually making those optimizations in my source code (If that makes sense)
Is it possible to go through each optimization that -O2 does, and manually do those optimizations in my code?
No. Many of the optimizations performed by the compiler cannot be represented in C. Some of these include:
Disabling the frame pointer
Removing unnecessary register saves/restores at the beginning and end of a function
"Peephole" optimizations on the assembly, such as removing redundant moves, loads, or stores
Inserting no-ops to align loops to specific address boundaries (typically 16 bytes)
This isn't to say that all of the optimizations performed by the compiler are untranslatable, of course -- merely that some of them are.
Yes, but that's the same as building your own 8086-class microprocessor in Minecraft — not worth your time and effort. And yes, many of those optimizations involve stuff below the language level of abstraction. Your professor might have unknown-to-you reasons for wanting an unoptimized executable.
I need to write optimized NEON code for a project and I'm perfectly happy to write assembly language, but for portability/maintainability I'm using NEON instrinsics. This code needs to be as fast as possible, so I'm using my experience in ARM optimization to properly interleave instructions and avoid pipe stalls. No matter what I do, GCC works against me and creates slower code full of stalls.
Does anyone know how to have GCC get out of the way and just translate my intrinsics into code?
Here's an example: I have a simple loop which negates and copies floating point values. It works with 4 sets of 4 at a time to allow some time for the memory to load and instructions to execute. There are plenty of registers left over, so it's got no reason to mangle things so badly.
float32x4_t f32_0, f32_1, f32_2, f32_3;
int x;
for (x=0; x<n-15; x+=16)
{
f32_0 = vld1q_f32(&s[x]);
f32_1 = vld1q_f32(&s[x+4]);
f32_2 = vld1q_f32(&s[x+8]);
f32_3 = vld1q_f32(&s[x+12]);
__builtin_prefetch(&s[x+64]);
f32_0 = vnegq_f32(f32_0);
f32_1 = vnegq_f32(f32_1);
f32_2 = vnegq_f32(f32_2);
f32_3 = vnegq_f32(f32_3);
vst1q_f32(&d[x], f32_0);
vst1q_f32(&d[x+4], f32_1);
vst1q_f32(&d[x+8], f32_2);
vst1q_f32(&d[x+12], f32_3);
}
This is the code it generates:
vld1.32 {d18-d19}, [r5]
vneg.f32 q9,q9 <-- GCC intentionally causes stalls
add r7,r7,#16
vld1.32 {d22-d23}, [r8]
add r5,r1,r4
vneg.f32 q11,q11 <-- all of my interleaving is undone (why?!!?)
add r8,r3,#256
vld1.32 {d20-d21}, [r10]
add r4,r1,r3
vneg.f32 q10,q10
add lr,r1,lr
vld1.32 {d16-d17}, [r9]
add ip,r1,ip
vneg.f32 q8,q8
More info:
GCC 4.9.2 for Raspbian
compiler flags: -c -fPIE -march=armv7-a -Wall -O3 -mfloat-abi=hard -mfpu=neon
When I write the loop in ASM code patterned exactly as my intrinsics (without even making use of extra src/dest registers to gain some free ARM cycles), it's still faster than GCC's code.
Update: I appreciate James' answer, but in the scheme of things, it doesn't really help with the problem. The simplest of my functions perform a little better with the cortex-a7 option, but the majority saw no change. The sad truth is that GCC's optimization of intrinsics is not great. When I worked with the Microsoft ARM compiler a few years ago, it consistently created well crafted output for NEON intrinsics while GCC consistently stumbled. With GCC 4.9.x, nothing has changed. I certainly appreciate the FOSS nature of GCC and the greater GNU effort, but there is no denying that it doesn't do as good a job as Intel, Microsoft or even ARM's compilers.
Broadly, the class of optimisation you are seeing here is known as "instruction scheduling". GCC uses instruction scheduling to try to build a better schedule for the instructions in each basic block of your program. Here, a "schedule" refers to any correct ordering of the instructions in a block, and a "better" schedule might be one which avoids stalls and other pipeline hazards, or one which reduces the live range of variables (resulting in better register allocation), or some other ordering goal on the instructions.
To avoid stalls due to hazards, GCC uses a model of the pipeline of the processor you are targeting (see here for details of the specification language used for these, and here for an example pipeline model). This model gives some indication to the GCC scheduling algorithms of the functional units of a processor, and the execution characteristics of instructions on those functional units. GCC can then schedule instructions to minimise structural hazards due to multiple instructions requiring the same processor resources.
Without a -mcpu or -mtune option (to the compiler), or a --with-cpu, or --with-tune option (to the configuration of the compiler), GCC for ARM or AArch64 will try to use a representative model for the architecture revision you are targeting. In this case, -march=armv7-a, causes the compiler to try to schedule instructions as if -mtune=cortex-a8 were passed on the command line.
So what you are seeing in your output is GCC's attempt at transforming your input in to a schedule it expects to execute well when running on a Cortex-A8, and to run reasonably well on processors which implement the ARMv7-A architecture.
To improve on this you can try:
Explicitly setting the processor you are targeting (-mcpu=cortex-a7)
Disabling instruction scheduling entirely (`-fno-schedule-insns -fno-schedule-insns2)
Note that disabling instruction scheduling entirely may well cause you problems elsewhere, as GCC will no longer be trying to reduce pipeline hazards across your code.
Edit With regards to your edit, performance bugs in GCC can be reported in the GCC Bugzilla (see https://gcc.gnu.org/bugs/ ) just as correctness bugs can be. Naturally with all optimisations there is some degree of heuristic involved and a compiler may not be able to beat a seasoned assembly programmer, but if the compiler is doing something especially egregious it can be worth highlighting.
If I run gcc with -O0, and hand-optimize my code using techniques such as the ones mentioned here, will it generally be the case that my optimized code will run faster than my unoptimized code when I run gcc with -O3?
That is, if I hand-optimize code under a particular compiler optimization level, is it generally true that these optimizations will still be productive (rather than counterproductive) under a different (higher or lower) compiler optimization level?
It might not be same in different compiler. Even the compiler can do away with your hand optimization, i mean ignore them. It heavily depends the implementation and behavior of the compiler itself. Most of the optimizations are like a request to compiler, which can be dropped at any time, (mostly without any notification)
Here my very simple question. With ICC I know it is possible to use #pragma SIMD to force vectorization of loops that the compiler chooses not to vectorize. Is there something analogous in GCC? Or, is there any plan to add this feature in a future release?
Quite related, what about forcing vectorization with Graphite?
As long as gcc is allowed to use SSE/SSE2/etc instructions, the compiler will in general produce vector instructions when it realizes that it's "worthwhile". Like most things in compilers, this requires some luck/planning/care from the programmer to avoid the compiler thinking "maybe this isn't safe" or "this is too complicated, I can't figure out what's going on". But quite often, it's successful if you are using a reasonably modern version of gcc (4.x versions should all do this).
You can make the compiler use SSE or SSE2 instructions by adding -msse or -msse2 (etc. for later SSE extensions). -msse2 is default in x86-64.
I'm not aware of any way that you can FORCE this, however. The compiler will either do this because it's happy that it's a good solution, or it wont.
Sorry, can't answer about Graphite.
Which gcc compiler options may be safely used for numerical programming?
The easy way to turn on optimizations for gcc is to add -0# to the compiler options. It is tempting to say -O3. However I know that -O3 includes optimization which are non-save in the sense that results of numerical computations may differ once this option is included. Small changes in the result may be insignificant if the algorithm is stable. On the other hand, precision can be an issue for certain math operations, so math optimization can have significant impact.
I find it inconvenient to take compiler dependent issues into account in the process of debugging. I.e. I don't want to wonder whether minor changes in the code will lead to strongly different behavior because the compiler changed its optimizations internally.
Which options are safe to add if I want deterministic--and hence controllable--behavior in my code? Which are almost safe, that is, which options induce only minor uncertainties compared to performance benefits?
I think of options like: -finline -finline-limit=2000 which inlines functions even if they are long.
It is not true that -O3 includes numerically unsafe optimizations. According to the manual, -O3 includes the following optimization passes in comparison to -O2:
-finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and -fipa-cp-clone
You might be referring to -ffast-math, turned on by default with -Ofast, but not with -O3:
-ffast-math Sets -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range. This option causes the preprocessor macro __FAST_MATH__ to be defined.
This option is not turned on by any -O option besides -Ofast since it
can result in incorrect output for programs that depend on an exact
implementation of IEEE or ISO rules/specifications for math functions.
It may, however, yield faster code for programs that do not require
the guarantees of these specifications.
In other words, all of -O, -O2, and -O3 are safe for numeric programming.