How to stop GCC from breaking my NEON intrinsics? - c

I need to write optimized NEON code for a project and I'm perfectly happy to write assembly language, but for portability/maintainability I'm using NEON instrinsics. This code needs to be as fast as possible, so I'm using my experience in ARM optimization to properly interleave instructions and avoid pipe stalls. No matter what I do, GCC works against me and creates slower code full of stalls.
Does anyone know how to have GCC get out of the way and just translate my intrinsics into code?
Here's an example: I have a simple loop which negates and copies floating point values. It works with 4 sets of 4 at a time to allow some time for the memory to load and instructions to execute. There are plenty of registers left over, so it's got no reason to mangle things so badly.
float32x4_t f32_0, f32_1, f32_2, f32_3;
int x;
for (x=0; x<n-15; x+=16)
{
f32_0 = vld1q_f32(&s[x]);
f32_1 = vld1q_f32(&s[x+4]);
f32_2 = vld1q_f32(&s[x+8]);
f32_3 = vld1q_f32(&s[x+12]);
__builtin_prefetch(&s[x+64]);
f32_0 = vnegq_f32(f32_0);
f32_1 = vnegq_f32(f32_1);
f32_2 = vnegq_f32(f32_2);
f32_3 = vnegq_f32(f32_3);
vst1q_f32(&d[x], f32_0);
vst1q_f32(&d[x+4], f32_1);
vst1q_f32(&d[x+8], f32_2);
vst1q_f32(&d[x+12], f32_3);
}
This is the code it generates:
vld1.32 {d18-d19}, [r5]
vneg.f32 q9,q9 <-- GCC intentionally causes stalls
add r7,r7,#16
vld1.32 {d22-d23}, [r8]
add r5,r1,r4
vneg.f32 q11,q11 <-- all of my interleaving is undone (why?!!?)
add r8,r3,#256
vld1.32 {d20-d21}, [r10]
add r4,r1,r3
vneg.f32 q10,q10
add lr,r1,lr
vld1.32 {d16-d17}, [r9]
add ip,r1,ip
vneg.f32 q8,q8
More info:
GCC 4.9.2 for Raspbian
compiler flags: -c -fPIE -march=armv7-a -Wall -O3 -mfloat-abi=hard -mfpu=neon
When I write the loop in ASM code patterned exactly as my intrinsics (without even making use of extra src/dest registers to gain some free ARM cycles), it's still faster than GCC's code.
Update: I appreciate James' answer, but in the scheme of things, it doesn't really help with the problem. The simplest of my functions perform a little better with the cortex-a7 option, but the majority saw no change. The sad truth is that GCC's optimization of intrinsics is not great. When I worked with the Microsoft ARM compiler a few years ago, it consistently created well crafted output for NEON intrinsics while GCC consistently stumbled. With GCC 4.9.x, nothing has changed. I certainly appreciate the FOSS nature of GCC and the greater GNU effort, but there is no denying that it doesn't do as good a job as Intel, Microsoft or even ARM's compilers.

Broadly, the class of optimisation you are seeing here is known as "instruction scheduling". GCC uses instruction scheduling to try to build a better schedule for the instructions in each basic block of your program. Here, a "schedule" refers to any correct ordering of the instructions in a block, and a "better" schedule might be one which avoids stalls and other pipeline hazards, or one which reduces the live range of variables (resulting in better register allocation), or some other ordering goal on the instructions.
To avoid stalls due to hazards, GCC uses a model of the pipeline of the processor you are targeting (see here for details of the specification language used for these, and here for an example pipeline model). This model gives some indication to the GCC scheduling algorithms of the functional units of a processor, and the execution characteristics of instructions on those functional units. GCC can then schedule instructions to minimise structural hazards due to multiple instructions requiring the same processor resources.
Without a -mcpu or -mtune option (to the compiler), or a --with-cpu, or --with-tune option (to the configuration of the compiler), GCC for ARM or AArch64 will try to use a representative model for the architecture revision you are targeting. In this case, -march=armv7-a, causes the compiler to try to schedule instructions as if -mtune=cortex-a8 were passed on the command line.
So what you are seeing in your output is GCC's attempt at transforming your input in to a schedule it expects to execute well when running on a Cortex-A8, and to run reasonably well on processors which implement the ARMv7-A architecture.
To improve on this you can try:
Explicitly setting the processor you are targeting (-mcpu=cortex-a7)
Disabling instruction scheduling entirely (`-fno-schedule-insns -fno-schedule-insns2)
Note that disabling instruction scheduling entirely may well cause you problems elsewhere, as GCC will no longer be trying to reduce pipeline hazards across your code.
Edit With regards to your edit, performance bugs in GCC can be reported in the GCC Bugzilla (see https://gcc.gnu.org/bugs/ ) just as correctness bugs can be. Naturally with all optimisations there is some degree of heuristic involved and a compiler may not be able to beat a seasoned assembly programmer, but if the compiler is doing something especially egregious it can be worth highlighting.

Related

Can you do all gcc optimizations (-O2, -O3) manually in your c source code?

In my class project, my project is set to use gcc's optimization level of -O0 (no optimizations) and we are not allowed to change it for the final submission.
I tested my code using -O2 and got around a 2x speedup of my entire program. So I was wondering, is it possible to go through each optimization that -O2 does, and manually do those optimizations in my code? Or are some of the -O2 optimizations internal to the stack, frame, machine/assembly, etc, thus disallowing me, the programmer, from manually making those optimizations in my source code (If that makes sense)
Is it possible to go through each optimization that -O2 does, and manually do those optimizations in my code?
No. Many of the optimizations performed by the compiler cannot be represented in C. Some of these include:
Disabling the frame pointer
Removing unnecessary register saves/restores at the beginning and end of a function
"Peephole" optimizations on the assembly, such as removing redundant moves, loads, or stores
Inserting no-ops to align loops to specific address boundaries (typically 16 bytes)
This isn't to say that all of the optimizations performed by the compiler are untranslatable, of course -- merely that some of them are.
Yes, but that's the same as building your own 8086-class microprocessor in Minecraft — not worth your time and effort. And yes, many of those optimizations involve stuff below the language level of abstraction. Your professor might have unknown-to-you reasons for wanting an unoptimized executable.

Writing a piece of C code such that compiler uses SSE4.1 instruction for generating assembly Code

I want to write some C code such that gcc using the -msse4.1 flag can optimize it. Basically I want to check whether or not the compiler is taking advantage of SSE4.1 instructions.
There are many SSE4.1 instructions (http://en.wikipedia.org/wiki/SSE4#New_instructions) but I am not able to write a fragment of C Code which is using any of those instructions in the generated assembly code.
Thanks in advance.
From what I've seen, compilers rarely ever generate SSE4.1 instructions. I've seen a few cases where it will use the insert/extract instructions to pack data.
But for the most part, if you want to use the SSE4.1 instructions, you need to do them explicitly using intrinsics:
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_bk_sse41.htm
I doubt GCC would emit SSE4.1 instructions that easily. But you could have a look at Intel SPMD Program Compiler:
Under the SPMD model, the programmer writes a program that mostly
appears to be a regular serial program, though the execution model is
actually that a number of program instances execute in parallel on the
hardware. (See a more detailed example that illustrates this concept.)
ispc compiles a C-based SPMD programming language to run on the SIMD
units of CPUs; it frequently provides a 3x or more speedup on CPUs
with 4-wide SSE units, without any of the difficulty of writing
intrinsics code.

Why would gcc -o0 be faster than icc -o0?

For a brief report I have to do, our class ran code on a cluster using both gcc -O0 and icc -O0. We found that gcc was about 2.5 times faster than icc without any optimizations? Why is this? Does gcc -O0 actually do some minor optimization or does it simply happen to work better for this system?
The code was an implementation of the naive string searching algorithm found here, written in c.
Thank you
Performance at -O0 is not interesting or indicative of anything. It explicitly says "I don't care about performance", and the compiler takes you up on that; it just does whatever happens to be simplest. By random luck, what is simplest for GCC is faster than what is simplest for ICC for one highly specific microbenchmark on your specific hardware configuration. If you ran 100 other microbenchmarks, you would probably find some where ICC is faster, too. Even if you didn't, that still wouldn't mean much. If you're going to compare performance across compilers, turn on optimizations, because that's what you do if you care about performance.
If you want to understand why one is faster, profile the execution. Where is the execution time being spent? Where are there stalls? Why do those stalls occur?
A few things to take into account:
The instruction set each compiler uses by default. For example if your GCC build produces i686 code by default, while ICC restricts itself to i586 opcodes, you would probably see a significant performance difference.
The actual CPUs in your cluster. If you are using AMD processors, instead of Intel CPUs, then ICC is at a disadvantage because it is, of course, targeted specifically to Intel processors.
You mentioned using a cluster. Does this speed difference exist on a single processor as well? If you used any parallelisation facilities provided by your compiler, there could be significant differences there.
Simplistically, when optimisations are disabled, the compiler uses pre-made "templates" for each code construct. Since these templates are intended to be optimised afterwards, they are constructed in a way that enables the optimisation passes to produce better code. The fact that they may be slower or faster with -O0 does not really mean anything - for example, more explicit initial code could be easier to optimise but far slower to execute.
That said, the only way to find out what is going on is to profile the execution of your code and, if necessary, have a look at the assembly of those parts of the code where the major differences lie.

Disable vectorized looping in FORTRAN?

Is it possible to bypass loop vectorization in FORTRAN? I'm writing to F77 standards for a particular project, but the GNU gfortran compiles up through modern FORTRANs, such as F95. Does anyone know if certain FORTRAN standards avoided loop vectorization or if there are any flags/options in gfortran to turn this off?
UPDATE: So, I think the final solution to my specific problem has to "DO" with the FORTRAN DO loops not allowing the updating of the iteration variable. Mention of this can be found in #High Performance Mark's reply on this related thread... Loop vectorization and how to avoid it
[Into the FORT, RAN the noobs for shelter.]
The Fortran standards are generally silent on how the language is to be implemented, leaving that to the compiler writers who are in a better position to determine the best, or good (and bad) options for implementation of the language's various features on whatever chip architecture(s) they are writing for.
What do you mean when you write that you want to bypass loop vectorisation ? And in the next sentence suggest that this would be unavailable to FORTRAN77 programs ? It is perfectly normal for a compiler for a modern CPU to generate vector instructions if the CPU is capable of obeying them. This is true whatever version of the language the program is written in.
If you really don't want to generate vector instructions then you'll have to examine the gfortran documentation carefully -- it's not a compiler I use so I can't point you to specific options or flags. You might want to look at its capabilities for architecture-specific code generation, paying particular attention to SSE level.
You might be able to coerce the compiler into not vectorising loops if all your loops are explicit (so no whole-array operations) and if you make your code hard to vectorise in other ways (dependencies between loop iterations for example). But a good modern compiler, without interference, is going to try its damndest to vectorise loops for your own good.
It seems rather perverse to me to try to force the compiler to go against its nature, perhaps you could explain why you want to do that in more detail.
As High Performance Mark wrote, the compiler is free to select machine instructions to implement your source code as long as the results follow the rules of the language. You should not be able to observe any difference in the output values as a result of loop vectorization ... you code should run faster. So why do you care?
Sometimes differences can be observed across optimization levels, e.g., on some architectures registers have extra precision.
The place to look for these sorts of compiler optimizations is the gcc manual. They are located there since they are common across the gcc compiler suite.
With most modern compilers, the command-line option -O0 should turn off all optimisations, including loop vectorisation.
I have sometimes found that this causes bugs to apparently disappear. However usually this means that there is something wrong with my code so if this sort of thing is happening to you then you have almost certainly written a buggy program.
It is theoretically possible but much less likely that there is a bug in the compiler, you can easily check this by compiling your code in another fortran compiler. (e.g. gfortran or g95).
gfortran doesn't auto-vectorize unless you have set -O3 or -ftree-vectorize. So it's easy to avoid vectorization. You will probably need to read (skim) the gcc manual as well as the gfortran one.
Auto-vectorization has been a well-known feature of Fortran compilers for over 35 years, and even the Fortran 77 definition of DO loops was set with this in mind (and also in view of some known non-portable abuses of F66 standard). You could not count on turning off vectorization as a way of making incorrect code work, although it might expose symptoms of incorrect code.

Ensure compiler always use SSE sqrt instruction

I'm trying to get GCC (or clang) to consistently use the SSE instruction for sqrt instead of the math library function for a computationally intensive scientific application. I've tried a variety of GCCs on various 32 and 64 bit OS X and Linux systems. I'm making sure to enable sse with -mfpmath=sse (and -march=core2 to satisfy GCCs requirement to use -mfpmath=sse on 32 bit). I'm also using -O3. Depending on the GCC or clang version, the generated assembly doesn't consistently use SSE's sqrtss. In some versions of GCC, all the sqrts use the instruction. In others, there is mixed usage of sqrtss and calling the math library function. Is there a way to give a hint or force the compiler to only use the SSE instruction?
Use the sqrtss intrinsic __builtin_ia32_sqrtss?
You should be carefull in using that, you probably know that it has less precicision. That will be the reason that gcc doesn't use it systematically.
There is a trick that is even mentionned in INTEL's SSE manual (I hope that I remember correctly). The result of sqrtss is only one Heron iteration away from the target. Maybe that gcc is sometimes able to inline that surrounding brief iteration at some point (versions) and for others it doesn't.
You could use the builtin as MSN says, but you should definitively look up the specs on INTEL's web site to know what you are trading.

Resources