Implicit definition of non-simd intel intrinsic - c

In the following link there is a section for non-simd intel intrinsics:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
These include assembly instructions like bsf and bsr. For SIMD instructions I can copy the c function and run it after including the proper header.
For the non-simd functions, like _bit_scan_reverse (bsr), I get that this function is undefined for gcc (implicit definition). GCC has similar "builtin functions" e.g. __builtin_ctz, but no _bit_scan_reverse or _mm_popcnt_u32. Why are these intrinsics not available?
#include <stdio.h>
#include <immintrin.h>
int main(void) {
int x = 5;
int y = _bit_scan_reverse (x);
printf("%d\n",y);
return 0;
}

It appears that I needed to have two changes:
First, it appears to be best practice to include x86intrin.h rather than more specific includes. This appears to be compiler specific and is covered in much better detail in:
Header files for x86 SIMD intrinsics
Importantly, you would have a different include if not using gcc.
Second, compiler options also need to be enabled. For gcc these are detailed in:
https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
Although documentation for many flags are lacking.
As my goal is to distribute a compiled binary, I wanted to try and avoid -march=native
Most of the "other" intrinsics I'm interested in are bit manipulation related.
Ye Olde Wikipedia has a decent writeup of important bit manipulation intrinsic groups like bmi2:
https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets
I need bmi2 for BZHI (instruction) or _bzhi_u32 (c)
Thus I can get what I want with something like:
-mavx2 -mbmi2
Using -mbmi2 seems to be sufficient to get things like bmi1 and abm (see linked Wikipedia page for definitions) although I don't see any mention of this in the linked gcc page so I might be wrong about this ... EDIT: It seems like adding bmi2 support does not add bmi1 and abm, I might have been using a __builtin call.... I later needed to add -mabm and -mbmi explicitly to get the instructions I wanted. As Peter Cordes suggested it is probably better to target Haswell -march=haswell as a starting point and then add on additional flags as needed. Haswell is the first processor with AVX2 from 2013 so in my mind -march=haswell is basically saying, I expect that you have a computer from 2013 or newer.
Also, based on some quick reading, it sounds like the use of __builtin enables the necessary flags (a future question for SO), although there does not appear to be a 1:1 correspondence between intrinsics and builtins. More specifically, not all intrinsics seem to be included as builtins, meaning the flag setting approach seems to be necessary, rather than just always using builtins and not worrying about setting flags. Also it is useful to know what intrinsics are being used, for distribution purposes, as it seems like bmi2 could still be missing on a substantial portion of computers (e.g. needing AMD from 2015+ - I think).
It's still not clear to me why just using the specified include in the Intel documentation doesn't work, but this info get's me 99% of the way to where I want to be.

Related

How to use x86intrin.h

In one of my applications, I need to efficiently de-interleave bits in a long stream of data. Ideally, I would like to use the BMI2 pext_u32() and/or pext_u64() x86_64 intrinsic instructions when available. I scoured the internet for doc on x86intrin.h (GCC), but couldn't find much on the subject; so, I am asking the gurus on StackOverflow to help me out.
Where can I find documentation about how to work with functions in x86intrin.h?
Does gcc's implementation of pext_*() already have code behind it to fall back on, or do I need to write the fallback code myself (for conditional compile)?
Is it possible to write a binary that automatically falls back to an alternate implementation if a target does not support the intrinsic? If so, how does one do so?
Is there a known programming pattern that will be recognized by GCC and automatically converted to pext_*() when compiling with optimization enabled and with -mbmi2?
Intel publishes the Intrinsics Guide, which also applies to GCC. You will have to write your own fallback code if you use these intrinsics.
You can achieve automatic switching of implementations by using IFUNC resolvers, but for non-library code, using conditionals or function pointers is probably simpler.
Looking at the gcc/config/i386/i386.md and gcc/config/i386/i386.c files, I don't see anything in GCC 8 which would automatically select the pext instruction without intrinsics in the source code.
The design philosophy of Intel's intrinsics is that you can only use them in functions that will run only on CPUs with the required extensions. Checking for support every instruction would add way too much overhead, and then there's have to be a fallback (there isn't).
Intel intrinsics are not like GNU C __builtin_popcountll (which does use a fallback if compiled without -mpopcnt, but not you can enable target options on a per-function basis with attributes.)

How to stop GCC from breaking my NEON intrinsics?

I need to write optimized NEON code for a project and I'm perfectly happy to write assembly language, but for portability/maintainability I'm using NEON instrinsics. This code needs to be as fast as possible, so I'm using my experience in ARM optimization to properly interleave instructions and avoid pipe stalls. No matter what I do, GCC works against me and creates slower code full of stalls.
Does anyone know how to have GCC get out of the way and just translate my intrinsics into code?
Here's an example: I have a simple loop which negates and copies floating point values. It works with 4 sets of 4 at a time to allow some time for the memory to load and instructions to execute. There are plenty of registers left over, so it's got no reason to mangle things so badly.
float32x4_t f32_0, f32_1, f32_2, f32_3;
int x;
for (x=0; x<n-15; x+=16)
{
f32_0 = vld1q_f32(&s[x]);
f32_1 = vld1q_f32(&s[x+4]);
f32_2 = vld1q_f32(&s[x+8]);
f32_3 = vld1q_f32(&s[x+12]);
__builtin_prefetch(&s[x+64]);
f32_0 = vnegq_f32(f32_0);
f32_1 = vnegq_f32(f32_1);
f32_2 = vnegq_f32(f32_2);
f32_3 = vnegq_f32(f32_3);
vst1q_f32(&d[x], f32_0);
vst1q_f32(&d[x+4], f32_1);
vst1q_f32(&d[x+8], f32_2);
vst1q_f32(&d[x+12], f32_3);
}
This is the code it generates:
vld1.32 {d18-d19}, [r5]
vneg.f32 q9,q9 <-- GCC intentionally causes stalls
add r7,r7,#16
vld1.32 {d22-d23}, [r8]
add r5,r1,r4
vneg.f32 q11,q11 <-- all of my interleaving is undone (why?!!?)
add r8,r3,#256
vld1.32 {d20-d21}, [r10]
add r4,r1,r3
vneg.f32 q10,q10
add lr,r1,lr
vld1.32 {d16-d17}, [r9]
add ip,r1,ip
vneg.f32 q8,q8
More info:
GCC 4.9.2 for Raspbian
compiler flags: -c -fPIE -march=armv7-a -Wall -O3 -mfloat-abi=hard -mfpu=neon
When I write the loop in ASM code patterned exactly as my intrinsics (without even making use of extra src/dest registers to gain some free ARM cycles), it's still faster than GCC's code.
Update: I appreciate James' answer, but in the scheme of things, it doesn't really help with the problem. The simplest of my functions perform a little better with the cortex-a7 option, but the majority saw no change. The sad truth is that GCC's optimization of intrinsics is not great. When I worked with the Microsoft ARM compiler a few years ago, it consistently created well crafted output for NEON intrinsics while GCC consistently stumbled. With GCC 4.9.x, nothing has changed. I certainly appreciate the FOSS nature of GCC and the greater GNU effort, but there is no denying that it doesn't do as good a job as Intel, Microsoft or even ARM's compilers.
Broadly, the class of optimisation you are seeing here is known as "instruction scheduling". GCC uses instruction scheduling to try to build a better schedule for the instructions in each basic block of your program. Here, a "schedule" refers to any correct ordering of the instructions in a block, and a "better" schedule might be one which avoids stalls and other pipeline hazards, or one which reduces the live range of variables (resulting in better register allocation), or some other ordering goal on the instructions.
To avoid stalls due to hazards, GCC uses a model of the pipeline of the processor you are targeting (see here for details of the specification language used for these, and here for an example pipeline model). This model gives some indication to the GCC scheduling algorithms of the functional units of a processor, and the execution characteristics of instructions on those functional units. GCC can then schedule instructions to minimise structural hazards due to multiple instructions requiring the same processor resources.
Without a -mcpu or -mtune option (to the compiler), or a --with-cpu, or --with-tune option (to the configuration of the compiler), GCC for ARM or AArch64 will try to use a representative model for the architecture revision you are targeting. In this case, -march=armv7-a, causes the compiler to try to schedule instructions as if -mtune=cortex-a8 were passed on the command line.
So what you are seeing in your output is GCC's attempt at transforming your input in to a schedule it expects to execute well when running on a Cortex-A8, and to run reasonably well on processors which implement the ARMv7-A architecture.
To improve on this you can try:
Explicitly setting the processor you are targeting (-mcpu=cortex-a7)
Disabling instruction scheduling entirely (`-fno-schedule-insns -fno-schedule-insns2)
Note that disabling instruction scheduling entirely may well cause you problems elsewhere, as GCC will no longer be trying to reduce pipeline hazards across your code.
Edit With regards to your edit, performance bugs in GCC can be reported in the GCC Bugzilla (see https://gcc.gnu.org/bugs/ ) just as correctness bugs can be. Naturally with all optimisations there is some degree of heuristic involved and a compiler may not be able to beat a seasoned assembly programmer, but if the compiler is doing something especially egregious it can be worth highlighting.

How to check with Intel intrinsics if AVX extensions is supported by the CPU?

I'm writing a program using Intel intrinsics. I want to use _mm_permute_pd intrinsic, which is only available on CPUs with AVX. For CPUs without AVX I can use _mm_shuffle_pd but according to the specs it is much slower than _mm_permute_pd. Do the header files for Intel intrinsics define constants that allow me to distinguish whether AVX is supported so that I can write sth like this:
#ifdef __IS_AVX_SUPPORTED__ // is there sth like this defined?
// use _mm_permute_pd
# else
// use _mm_shuffle_pd
#endif
? I have found this tutorial, which shows how to perform a runtime check but I need to do a static, compile-time check for the current machine.
GCC, ICC, MSVC, and Clang all define a macro __AVX__ which you can check. In fact it's the only SIMD constant defined by all those compilers (MSVC is the one that breaks the mold). This only tells you if your code was compiled with AVX support (e.g. -mavx with GCC or /arch:AVX with MSVC) it does not tell you if your CPU supports AVX. If you want to know if the CPU supports AVX you need to check CPUID. Here, asm-in-c-error, is an example to read CPUID from all those compilers.
To do this properly I suggest you make a CPU dispatcher.
Edit: In case anyone wants to know how to use the values from CPUID to find out if AVX is available see https://github.com/Mysticial/FeatureDetector
I assume you are using Intel C++ Compiler. In this case - yes, there are such macros: Intel C++ Compiler Reference Guide: __AVX__, __AVX2__.
P.S. Be aware that if you compile you application with AVX instruction set enabled it will fail on CPUs not supporting AVX. If you are going to distribute your software as source code package and compile on target machine - this is may be a viable solution. Otherwise you should check for AVX dynamically.
P.P.S. There are several options for ICC. Take a look at the following compiler options and also references from it to other.
It seems to me that the only way is to compile and run a program that identifies whether AVX is available. Then manually or automatically compile separate code with or without AVX functions. For VS 2013, I would used my code in commomAVX folder in the following to identify hasAVX (or not) and use this to execute one of two different BAT files to compile and link the appropriate program.
http://www.roylongbottom.org.uk/gigaflops-benchmarks.zip
My question was to help to identify a solution regarding the use of suitable compile options such as /arch:AVX.

Intel c++ - optimizer messages

I wonder if it's possible to make Intel C++ compiler (or other compilers such as gcc or clang) display some messages from optimizer. I would like to know what exactly optimizer did with my code. By default compiler prints only very basic things like unused variable. very simple example - I want to know that expression;
float x = 1.0f/2;
will be evaluated into:
float x = 0.5f;
and there will be no division in code (I know that in this case it's always true, but this is just an example). More advanced example could be loop unroll or operations reorder.
Thanks in advance.
For icc and icpc, you can use the -opt-report -opt-report-level max set of flags.
You can also specify an opt-report file. See here for more details
An optimizing compiler (like GCC, when asked to optimize with -O1 or -O2 etc...) is essentially transforming internal representations of your source code.
If you want to see some of the internal GCC representations, you could pass -fdump-tree-all to GCC. Beware, you'll get hundreds of dump files.
You could also use the MELT probe: MELT is a domain specific language (and plugin implementation) to extend GCC, and it has a probe mode to interactively show some of the internal (notably Gimple) representations.
The optimization you describe at the top of the post is (somewhat strangely) part of icc -fno-prec-div (which is a default which you might be overriding).

Ensure compiler always use SSE sqrt instruction

I'm trying to get GCC (or clang) to consistently use the SSE instruction for sqrt instead of the math library function for a computationally intensive scientific application. I've tried a variety of GCCs on various 32 and 64 bit OS X and Linux systems. I'm making sure to enable sse with -mfpmath=sse (and -march=core2 to satisfy GCCs requirement to use -mfpmath=sse on 32 bit). I'm also using -O3. Depending on the GCC or clang version, the generated assembly doesn't consistently use SSE's sqrtss. In some versions of GCC, all the sqrts use the instruction. In others, there is mixed usage of sqrtss and calling the math library function. Is there a way to give a hint or force the compiler to only use the SSE instruction?
Use the sqrtss intrinsic __builtin_ia32_sqrtss?
You should be carefull in using that, you probably know that it has less precicision. That will be the reason that gcc doesn't use it systematically.
There is a trick that is even mentionned in INTEL's SSE manual (I hope that I remember correctly). The result of sqrtss is only one Heron iteration away from the target. Maybe that gcc is sometimes able to inline that surrounding brief iteration at some point (versions) and for others it doesn't.
You could use the builtin as MSN says, but you should definitively look up the specs on INTEL's web site to know what you are trading.

Resources