Ensure the compiler always uses the SSE sqrt instruction - C

I'm trying to get GCC (or Clang) to consistently use the SSE instruction for sqrt instead of the math library function in a computationally intensive scientific application. I've tried a variety of GCC versions on various 32- and 64-bit OS X and Linux systems. I'm enabling SSE with -mfpmath=sse (plus -march=core2, which GCC requires before it will accept -mfpmath=sse on 32-bit), and I'm compiling with -O3. Depending on the GCC or Clang version, the generated assembly doesn't consistently use SSE's sqrtss: in some versions every sqrt uses the instruction, while in others there is a mix of sqrtss and calls to the math library function. Is there a way to give a hint or force the compiler to use only the SSE instruction?

Use the sqrtss intrinsic __builtin_ia32_sqrtss?

You should be careful using that; you probably know that it has less precision, which is likely the reason GCC doesn't use it systematically.
There is a trick that is even mentioned in Intel's SSE manual (I hope that I remember correctly): the result is only one Heron (Newton) iteration away from the target value. Maybe some GCC versions are able to inline that brief surrounding iteration while others are not.
You could use the builtin as MSN says, but you should definitely look up the specs on Intel's web site to know what you are trading.
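As a concrete form of MSN's suggestion, here is a minimal sketch using the portable _mm_sqrt_ss intrinsic instead of the raw builtin (the wrapper name sse_sqrtf is made up for illustration); each call compiles to a single sqrtss on any SSE-enabled build:
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set_ss, _mm_sqrt_ss, _mm_cvtss_f32 */

/* Wraps the scalar SSE square root so the compiler emits sqrtss
   directly rather than a possible libm call. */
static inline float sse_sqrtf(float x)
{
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}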

Related

Implicit definition of non-SIMD Intel intrinsic

The following link has a section for non-SIMD Intel intrinsics:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
These include assembly instructions like bsf and bsr. For SIMD intrinsics I can copy the C function and run it after including the proper header.
For the non-SIMD functions, like _bit_scan_reverse (bsr), GCC reports the function as undefined (an implicit declaration). GCC has similar "builtin functions", e.g. __builtin_ctz, but no _bit_scan_reverse or _mm_popcnt_u32. Why are these intrinsics not available?
#include <stdio.h>
#include <immintrin.h>

int main(void) {
    int x = 5;
    int y = _bit_scan_reverse(x);
    printf("%d\n", y);
    return 0;
}
It appears that I needed two changes:
First, it appears to be best practice to include x86intrin.h rather than more specific headers. This is compiler specific and is covered in much better detail in:
Header files for x86 SIMD intrinsics
Importantly, you would need a different include if not using GCC.
Second, compiler options also need to be enabled. For GCC these are detailed in:
https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
although documentation for many of the flags is lacking.
As my goal is to distribute a compiled binary, I wanted to avoid -march=native.
Most of the "other" intrinsics I'm interested in are bit manipulation related.
Ye Olde Wikipedia has a decent writeup of important bit manipulation intrinsic groups like BMI2:
https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets
I need BMI2 for BZHI (the instruction), i.e. _bzhi_u32 (the C intrinsic).
Thus I can get what I want with something like:
-mavx2 -mbmi2
Using -mbmi2 initially seemed to be sufficient to get things like BMI1 and ABM (see the linked Wikipedia page for definitions), although I don't see any mention of this on the linked GCC page, so I might be wrong about this. EDIT: It seems that adding BMI2 support does not add BMI1 and ABM; I might have been using a __builtin call. I later needed to add -mabm and -mbmi explicitly to get the instructions I wanted. As Peter Cordes suggested, it is probably better to target Haswell (-march=haswell) as a starting point and then add further flags as needed. Haswell, from 2013, is the first processor with AVX2, so in my mind -march=haswell basically says: I expect you to have a computer from 2013 or newer.
Also, based on some quick reading, it sounds like using a __builtin enables the necessary flags (a future question for SO), although there does not appear to be a 1:1 correspondence between intrinsics and builtins. More specifically, not all intrinsics seem to be available as builtins, so the flag-setting approach seems necessary rather than always using builtins and never worrying about flags. It is also useful to know which intrinsics are being used, for distribution purposes, since BMI2 could still be missing on a substantial portion of computers (e.g. needing an AMD CPU from 2015 or later, I think).
It's still not clear to me why just using the include specified in the Intel documentation doesn't work, but this info gets me 99% of the way to where I want to be.
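Putting both changes together, here is a minimal sketch (file name and build line are illustrative; assumes GCC on x86):
#include <stdio.h>
#include <x86intrin.h>  /* umbrella header: pulls in the non-SIMD intrinsics on GCC */

int main(void) {
    int x = 5;
    /* bsr: index of the highest set bit of 5 (0b101) is 2 */
    int y = _bit_scan_reverse(x);
    /* BMI2 BZHI: zero the bits of 0xFF from position 4 upward -> 0x0F */
    unsigned z = _bzhi_u32(0xFFu, 4);
    printf("%d %u\n", y, z);
    return 0;
}
Built with, e.g.: gcc -O2 -mbmi -mbmi2 -mabm bits.c -o bits. Note that _bit_scan_reverse needs no extra flag (bsr is baseline x86), while _bzhi_u32 requires -mbmi2.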

How to stop GCC from breaking my NEON intrinsics?

I need to write optimized NEON code for a project and I'm perfectly happy to write assembly language, but for portability/maintainability I'm using NEON intrinsics. This code needs to be as fast as possible, so I'm using my experience in ARM optimization to properly interleave instructions and avoid pipeline stalls. No matter what I do, GCC works against me and creates slower code full of stalls.
Does anyone know how to have GCC get out of the way and just translate my intrinsics into code?
Here's an example: I have a simple loop which negates and copies floating point values. It works with 4 sets of 4 at a time to allow some time for the memory to load and instructions to execute. There are plenty of registers left over, so it's got no reason to mangle things so badly.
float32x4_t f32_0, f32_1, f32_2, f32_3;
int x;
for (x = 0; x < n - 15; x += 16)
{
    f32_0 = vld1q_f32(&s[x]);
    f32_1 = vld1q_f32(&s[x+4]);
    f32_2 = vld1q_f32(&s[x+8]);
    f32_3 = vld1q_f32(&s[x+12]);
    __builtin_prefetch(&s[x+64]);
    f32_0 = vnegq_f32(f32_0);
    f32_1 = vnegq_f32(f32_1);
    f32_2 = vnegq_f32(f32_2);
    f32_3 = vnegq_f32(f32_3);
    vst1q_f32(&d[x], f32_0);
    vst1q_f32(&d[x+4], f32_1);
    vst1q_f32(&d[x+8], f32_2);
    vst1q_f32(&d[x+12], f32_3);
}
This is the code it generates:
vld1.32 {d18-d19}, [r5]
vneg.f32 q9,q9 <-- GCC intentionally causes stalls
add r7,r7,#16
vld1.32 {d22-d23}, [r8]
add r5,r1,r4
vneg.f32 q11,q11 <-- all of my interleaving is undone (why?!!?)
add r8,r3,#256
vld1.32 {d20-d21}, [r10]
add r4,r1,r3
vneg.f32 q10,q10
add lr,r1,lr
vld1.32 {d16-d17}, [r9]
add ip,r1,ip
vneg.f32 q8,q8
More info:
GCC 4.9.2 for Raspbian
compiler flags: -c -fPIE -march=armv7-a -Wall -O3 -mfloat-abi=hard -mfpu=neon
When I write the loop in ASM code patterned exactly as my intrinsics (without even making use of extra src/dest registers to gain some free ARM cycles), it's still faster than GCC's code.
Update: I appreciate James's answer, but in the scheme of things it doesn't really help with the problem. The simplest of my functions performed a little better with the cortex-a7 option, but the majority saw no change. The sad truth is that GCC's optimization of intrinsics is not great. When I worked with the Microsoft ARM compiler a few years ago, it consistently created well-crafted output for NEON intrinsics, while GCC consistently stumbled. With GCC 4.9.x, nothing has changed. I certainly appreciate the FOSS nature of GCC and the greater GNU effort, but there is no denying that it doesn't do as good a job as Intel's, Microsoft's, or even ARM's compilers.
Broadly, the class of optimisation you are seeing here is known as "instruction scheduling". GCC uses instruction scheduling to try to build a better schedule for the instructions in each basic block of your program. Here, a "schedule" refers to any correct ordering of the instructions in a block, and a "better" schedule might be one which avoids stalls and other pipeline hazards, or one which reduces the live range of variables (resulting in better register allocation), or some other ordering goal on the instructions.
To avoid stalls due to hazards, GCC uses a model of the pipeline of the processor you are targeting (the GCC internals documentation describes the specification language used for these pipeline models, and the GCC source tree contains example models). This model gives the GCC scheduling algorithms some indication of the functional units of a processor and the execution characteristics of instructions on those units. GCC can then schedule instructions to minimise structural hazards caused by multiple instructions requiring the same processor resources.
Without a -mcpu or -mtune option (to the compiler), or a --with-cpu or --with-tune option (at compiler configuration time), GCC for ARM or AArch64 will use a representative model for the architecture revision you are targeting. In this case -march=armv7-a causes the compiler to schedule instructions as if -mtune=cortex-a8 had been passed on the command line.
So what you are seeing in your output is GCC's attempt at transforming your input into a schedule it expects to execute well on a Cortex-A8, and to run reasonably well on other processors which implement the ARMv7-A architecture.
To improve on this you can try:
Explicitly setting the processor you are targeting (-mcpu=cortex-a7)
Disabling instruction scheduling entirely (-fno-schedule-insns -fno-schedule-insns2)
Note that disabling instruction scheduling entirely may well cause you problems elsewhere, as GCC will no longer be trying to reduce pipeline hazards across your code.
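If disabling scheduling for the whole file is too blunt, a narrower variant (a sketch assuming GCC's optimize function attribute; the function name is illustrative) is to turn the scheduler off only for the hand-interleaved hot loop:
#include <arm_neon.h>

/* GCC-specific: disable the instruction scheduler for this one function,
   leaving the rest of the translation unit scheduled normally. */
__attribute__((optimize("no-schedule-insns", "no-schedule-insns2")))
void negate_copy(float *d, const float *s, int n)
{
    int x;
    for (x = 0; x < n - 15; x += 16)
    {
        float32x4_t f32_0 = vld1q_f32(&s[x]);
        float32x4_t f32_1 = vld1q_f32(&s[x+4]);
        float32x4_t f32_2 = vld1q_f32(&s[x+8]);
        float32x4_t f32_3 = vld1q_f32(&s[x+12]);
        __builtin_prefetch(&s[x+64]);
        vst1q_f32(&d[x],    vnegq_f32(f32_0));
        vst1q_f32(&d[x+4],  vnegq_f32(f32_1));
        vst1q_f32(&d[x+8],  vnegq_f32(f32_2));
        vst1q_f32(&d[x+12], vnegq_f32(f32_3));
    }
}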
Edit: With regard to your edit, performance bugs in GCC can be reported in the GCC Bugzilla (see https://gcc.gnu.org/bugs/ ), just as correctness bugs can. Naturally, with all optimisations there is some degree of heuristic involved, and a compiler may not be able to beat a seasoned assembly programmer, but if the compiler is doing something especially egregious it can be worth highlighting.

How to check with Intel intrinsics if AVX extensions are supported by the CPU?

I'm writing a program using Intel intrinsics. I want to use the _mm_permute_pd intrinsic, which is only available on CPUs with AVX. For CPUs without AVX I can use _mm_shuffle_pd, but according to the specs it is much slower than _mm_permute_pd. Do the header files for Intel intrinsics define constants that allow me to distinguish whether AVX is supported, so that I can write something like this:
#ifdef __IS_AVX_SUPPORTED__ // is something like this defined?
    // use _mm_permute_pd
#else
    // use _mm_shuffle_pd
#endif
I have found a tutorial that shows how to perform a runtime check, but I need a static, compile-time check for the current machine.
GCC, ICC, MSVC, and Clang all define a macro __AVX__ which you can check. In fact it's the only SIMD constant defined by all those compilers (MSVC is the one that breaks the mold). This only tells you that your code was compiled with AVX support (e.g. -mavx with GCC or /arch:AVX with MSVC); it does not tell you whether your CPU supports AVX. If you want to know whether the CPU supports AVX you need to check CPUID; there are worked examples of reading CPUID from all of those compilers.
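For the compile-time dispatch asked about in the question, a minimal sketch (the helper name swap_halves is made up for illustration) could look like this:
#include <immintrin.h>

/* Selects the instruction at compile time based on __AVX__. */
static inline __m128d swap_halves(__m128d v)
{
#ifdef __AVX__
    return _mm_permute_pd(v, 1);    /* AVX vpermilpd: single source operand */
#else
    return _mm_shuffle_pd(v, v, 1); /* SSE2 fallback */
#endif
}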
To do this properly I suggest you make a CPU dispatcher.
Edit: In case anyone wants to know how to use the values from CPUID to find out if AVX is available see https://github.com/Mysticial/FeatureDetector
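On GCC and Clang, the runtime half of a dispatcher can lean on the built-in CPU feature check instead of raw CPUID; a minimal sketch:
#include <stdio.h>

int main(void)
{
    /* __builtin_cpu_supports is GCC/Clang-specific (GCC >= 4.8);
       MSVC would need __cpuid plus an XGETBV check instead. */
    if (__builtin_cpu_supports("avx"))
        puts("AVX available at run time");
    else
        puts("AVX not available");
    return 0;
}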
I assume you are using the Intel C++ Compiler. In this case, yes, there are such macros: see the Intel C++ Compiler Reference Guide for __AVX__ and __AVX2__.
P.S. Be aware that if you compile your application with the AVX instruction set enabled it will fail on CPUs that do not support AVX. If you are going to distribute your software as a source-code package and compile it on the target machine, this may be a viable solution. Otherwise you should check for AVX dynamically.
P.P.S. There are several relevant options for ICC; take a look at its code-generation compiler options and the further references from there.
It seems to me that the only way is to compile and run a program that identifies whether AVX is available, then manually or automatically compile separate code with or without the AVX functions. For VS 2013, I used my code in the commomAVX folder in the following to identify hasAVX (or not), and used this to execute one of two different BAT files to compile and link the appropriate program.
http://www.roylongbottom.org.uk/gigaflops-benchmarks.zip
My question was intended to help identify a solution regarding the use of suitable compile options such as /arch:AVX.

Forcing automatic vectorization with GCC

Here is my very simple question. With ICC I know it is possible to use #pragma simd to force vectorization of loops that the compiler chooses not to vectorize. Is there something analogous in GCC? Or is there any plan to add this feature in a future release?
Quite related, what about forcing vectorization with Graphite?
As long as GCC is allowed to use SSE/SSE2/etc. instructions, the compiler will in general produce vector instructions when it decides it's "worthwhile". Like most things in compilers, this requires some luck/planning/care from the programmer to keep the compiler from thinking "maybe this isn't safe" or "this is too complicated, I can't figure out what's going on". But it quite often succeeds if you are using a reasonably modern version of GCC (4.x versions should all do this).
You can make the compiler use SSE or SSE2 instructions by adding -msse or -msse2 (etc. for later SSE extensions); -msse2 is the default on x86-64.
I'm not aware of any way to FORCE this, however. The compiler will either do it because it's satisfied that it's a good solution, or it won't.
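As an illustration of the "planning/care" part, here is a sketch of a loop GCC's auto-vectorizer typically handles at -O3 (file name and build line are illustrative; -fopt-info-vec needs GCC 4.9+):
/* restrict promises the arrays don't alias, removing one common
   "maybe this isn't safe" obstacle to vectorization. */
void add_arrays(float *restrict a, const float *restrict b,
                const float *restrict c, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
/* Build and report which loops were vectorized:
   gcc -O3 -msse2 -fopt-info-vec -c add.c */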
Sorry, can't answer about Graphite.

Enabling strict floating point mode in GCC

I haven't yet created a program to test whether GCC will need it passed. When I do, I'd like to know how to go about enabling strict floating-point mode, which would allow reproducible results between runs and computers. Thanks.
Compiling with -msse2 on an Intel/AMD processor that supports it will get you almost there. Do not let any library put the FPU in FTZ/DAZ mode, and you will be mostly set (processor bugs notwithstanding).
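A quick way to check that point is to read the two MXCSR mode bits via the standard intrinsics; a sketch (older GCCs may want -msse3 on the command line for pmmintrin.h):
#include <stdio.h>
#include <xmmintrin.h>  /* _MM_GET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>  /* _MM_GET_DENORMALS_ZERO_MODE */

int main(void)
{
    /* Both modes must be off for strict IEEE 754 handling of subnormals. */
    if (_MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON)
        puts("warning: FTZ is enabled");
    if (_MM_GET_DENORMALS_ZERO_MODE() == _MM_DENORMALS_ZERO_ON)
        puts("warning: DAZ is enabled");
    return 0;
}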
For other architectures, the answer would be different. Architectures that do not offer any convenient way to get exact IEEE 754 semantics (for instance, pre-SSE2 IA-32 CPUs) would require the use of a floating-point emulation library to get the result you want, at a very high performance penalty.
If your target architecture supports a fused multiply-add instruction (multiplication and addition without intermediate rounding), make sure your compiler does not use it where you have written explicit multiplications and additions in the source code. GCC is not supposed to do this unless you use the -ffast-math option.
If you use -ffloat-store and always store intermediate values to variables or apply (explicit) casts to the desired type/precision, you should be at least 90% of the way to your goal, and maybe more. I'd welcome comments on whether there are cases this approach still misses. Note that I claim this works even without any SSE options.
You can also use GCC's -mpc64 option on i386/IA-32 targets to force double-precision computation even on the x87 FPU. See the GCC manual.
You can also modify the x87 FPU behavior at runtime; see Deterministic cross-platform floating point arithmetics and also An Introduction to GCC.