Using SSE instructions with gcc without inline assembly - c

I am interested in using the SSE vector instructions of x86-64 with gcc and don't want to use any inline assembly for that. Is there a way I can do that in C? If so, can someone give me an example?

Yes, you can use the intrinsics in the *mmintrin.h headers (emmintrin.h, xmmintrin.h, etc., depending on what level of SSE you want to use). This is generally preferable to using assembler for many reasons.
#include <emmintrin.h>

int main(void)
{
    __m128i a = _mm_set_epi32(4, 3, 2, 1);
    __m128i b = _mm_set_epi32(7, 6, 5, 4);
    __m128i c = _mm_add_epi32(a, b);
    // ...
    return 0;
}
Note that this approach works for most x86 and x86-64 compilers on various platforms, e.g. gcc, clang and Intel's ICC on Linux/Mac OS X/Windows and even Microsoft's Visual C/C++ (Windows only, of course).

Find the *intrin.h headers in your gcc includes (/usr/lib/gcc/x86_64-unknown-linux-gnu/4.8.0/include/ here).
Perhaps noteworthy: the header immintrin.h includes all the other intrinsic headers according to the features you enable (using -msse2 or -mavx, for instance).
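To make that concrete, here is a minimal sketch (the compile command and values are illustrative, not from the original answer): the AVX intrinsics below are only reachable through immintrin.h once AVX code generation is enabled, e.g. with gcc -O2 -mavx.
#include <immintrin.h>

int main(void)
{
    /* requires -mavx; without it gcc will typically reject these calls */
    __m256 a = _mm256_set1_ps(1.0f);   /* broadcast 1.0f into all 8 lanes */
    __m256 b = _mm256_set1_ps(2.0f);
    __m256 c = _mm256_add_ps(a, b);    /* eight single-precision additions at once */
    (void)c;
    return 0;
}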

What you want are intrinsics, which look like library functions but are actually built into the compiler so they translate into specific machine code.
Paul R and hroptatyr describe where to find GCC's documentation. Microsoft also has good documentation on the intrinsics in their compiler; even if you are using GCC, you might find MS' description of the idea a better tutorial.

Related

Implicit definition of non-SIMD Intel intrinsic

In the following link there is a section for non-SIMD Intel intrinsics:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
These include assembly instructions like bsf and bsr. For SIMD intrinsics I can copy the C function and run it after including the proper header.
For the non-SIMD functions, like _bit_scan_reverse (bsr), gcc reports the function as undefined (implicit declaration). GCC has similar "builtin functions", e.g. __builtin_ctz, but no _bit_scan_reverse or _mm_popcnt_u32. Why are these intrinsics not available?
#include <stdio.h>
#include <immintrin.h>

int main(void) {
    int x = 5;
    int y = _bit_scan_reverse(x);
    printf("%d\n", y);
    return 0;
}
It turns out that I needed two changes:
First, it appears to be best practice to include x86intrin.h rather than more specific headers. This is compiler specific and is covered in much better detail in:
Header files for x86 SIMD intrinsics
Importantly, you would need a different include if you are not using gcc.
Second, compiler options also need to be enabled. For gcc these are detailed in:
https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
The documentation for many of the flags is lacking, though.
As my goal is to distribute a compiled binary, I wanted to avoid -march=native.
Most of the "other" intrinsics I'm interested in are bit manipulation related.
Ye Olde Wikipedia has a decent writeup of important bit manipulation intrinsic groups like bmi2:
https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets
I need bmi2 for BZHI (the instruction), i.e. _bzhi_u32 (the C intrinsic).
Thus I can get what I want with something like:
-mavx2 -mbmi2
Using -mbmi2 seemed to be sufficient to get things like bmi1 and abm (see the linked Wikipedia page for definitions), although I don't see any mention of this on the linked gcc page, so I might be wrong about this... EDIT: It seems that adding bmi2 support does not add bmi1 and abm; I might have been using a __builtin call. I later needed to add -mabm and -mbmi explicitly to get the instructions I wanted. As Peter Cordes suggested, it is probably better to target Haswell with -march=haswell as a starting point and then add additional flags as needed. Haswell is the first processor with AVX2, from 2013, so in my mind -march=haswell is basically saying: I expect you to have a computer from 2013 or newer.
Also, based on some quick reading, it sounds like using a __builtin enables the necessary flags (a future question for SO), although there does not appear to be a 1:1 correspondence between intrinsics and builtins. More specifically, not all intrinsics seem to exist as builtins, so the flag-setting approach seems to be necessary, rather than always using builtins and not worrying about flags. It is also useful to know which intrinsics are being used, for distribution purposes, as it seems bmi2 could still be missing on a substantial portion of computers (e.g. AMD needing to be from 2015 or later - I think).
It's still not clear to me why just using the include specified in the Intel documentation doesn't work, but this info gets me 99% of the way to where I want to be.
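For the record, a minimal sketch of the kind of program the above gets working (the values and file name are mine, not the original poster's); with gcc it should build with something like gcc -O2 -mbmi -mbmi2 bzhi_demo.c:
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    unsigned int x = 0xFFu;
    unsigned int low3 = _bzhi_u32(x, 3);   /* BZHI: clear all bits above index 3 -> 7 */
    int msb = _bit_scan_reverse(x);        /* BSR: index of the highest set bit -> 7 */
    printf("%u %d\n", low3, msb);
    return 0;
}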

Is there an XL C builtin for LXVD2X prior to 13.1.4?

I'm working from C/C++ using built-ins. I need the lxvd2x instruction to load unaligned data into a VMX register. It looks like lxvd2x is available on Power7 and Power8 processors.
GCC provides the vec_vsx_ld built-in to perform the task. According to IBM XL C/C++ for Linux, V13.1.5, Chapter 4, Enhancements added in Version 13.1.4:
New built-in functions
The following GCC vector built-in functions are supported:
vec_vsx_ld
...
The code is guarded for XL C, so I don't need GCC's built-ins. The problem is, I can't find XL C's built-in for lxvd2x:
#if defined(__xlc__) || defined(__xlC__)
uint8x16_p8 block = vec_vsx_ld(0, t);
#else
uint64x2_p8 block = (uint64x2_p8)vec_vsx_ld(0, t);
#endif
The GCC compile farm provides AIX with XL C v13.1.3 (5725-C72, 5765-J07). Is there an XL C built-in for LXVD2X prior to 13.1.4? If there is a built-in, then what is it? If not, then how do we gain access to the instruction?
(I'm trying to avoid ASM and inline ASM. I don't know enough about the processor to write it. I've also had a fairly unpleasant experience, and I don't want to amplify the pain by trying to use asm).
The portable function that should be implemented by both GCC and XL is vec_xl. It's part of the PPC64-LE ABI.
The older functions that XLC supported are vec_xld2 (for loading a vector containing 8-byte elements) and vec_xlw4 (for loading a vector containing 4-byte elements.)
Note that if you require big-endian vector element order, you should use vec_xl_be, or compile with -qaltivec=be.
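A minimal sketch of the portable form, assuming a POWER7/POWER8 target with VSX enabled (e.g. gcc -mvsx or xlc -qarch=pwr8); the function and variable names are illustrative only:
#include <altivec.h>

vector unsigned char load_block(const unsigned char *t)
{
    /* unaligned vector load; on POWER7/POWER8 this typically maps to lxvd2x/lxvw4x */
    return vec_xl(0, t);
}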

How to check with Intel intrinsics if AVX extensions are supported by the CPU?

I'm writing a program using Intel intrinsics. I want to use the _mm_permute_pd intrinsic, which is only available on CPUs with AVX. For CPUs without AVX I can use _mm_shuffle_pd, but according to the specs it is much slower than _mm_permute_pd. Do the header files for Intel intrinsics define constants that allow me to distinguish whether AVX is supported, so that I can write something like this:
#ifdef __IS_AVX_SUPPORTED__ // is there something like this defined?
    // use _mm_permute_pd
#else
    // use _mm_shuffle_pd
#endif
? I have found this tutorial, which shows how to perform a runtime check, but I need a static, compile-time check for the current machine.
GCC, ICC, MSVC, and Clang all define a macro __AVX__ which you can check. In fact it's the only SIMD constant defined by all those compilers (MSVC is the one that breaks the mold). This only tells you whether your code was compiled with AVX support (e.g. -mavx with GCC or /arch:AVX with MSVC); it does not tell you whether your CPU supports AVX. If you want to know whether the CPU supports AVX you need to check CPUID. See asm-in-c-error for an example of reading CPUID with all those compilers.
To do this properly I suggest you make a CPU dispatcher.
Edit: In case anyone wants to know how to use the values from CPUID to find out if AVX is available see https://github.com/Mysticial/FeatureDetector
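For GCC and Clang specifically, a minimal sketch combining the compile-time macro with a runtime test might look like this (__builtin_cpu_supports is a GCC/Clang extension and is not available in MSVC):
#include <stdio.h>

int main(void)
{
#ifdef __AVX__
    puts("compiled with AVX code generation enabled");
#else
    puts("compiled without AVX code generation");
#endif
    if (__builtin_cpu_supports("avx"))
        puts("this CPU supports AVX");     /* safe to dispatch to an AVX code path */
    else
        puts("this CPU does not support AVX");
    return 0;
}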
I assume you are using Intel C++ Compiler. In this case - yes, there are such macros: Intel C++ Compiler Reference Guide: __AVX__, __AVX2__.
P.S. Be aware that if you compile your application with the AVX instruction set enabled, it will fail on CPUs that do not support AVX. If you are going to distribute your software as a source code package and compile on the target machine, this may be a viable solution. Otherwise you should check for AVX dynamically.
P.P.S. There are several options for ICC. Take a look at the following compiler options and also the references from them to others.
It seems to me that the only way is to compile and run a program that identifies whether AVX is available, then manually or automatically compile separate code with or without AVX functions. For VS 2013, I used my code in the commomAVX folder in the following to identify hasAVX (or not) and used this to execute one of two different BAT files to compile and link the appropriate program.
http://www.roylongbottom.org.uk/gigaflops-benchmarks.zip
My question was intended to help identify a solution regarding the use of suitable compile options such as /arch:AVX.

Intel C++ - optimizer messages

I wonder if it's possible to make the Intel C++ compiler (or other compilers such as gcc or clang) display some messages from the optimizer. I would like to know what exactly the optimizer did with my code. By default the compiler prints only very basic things, like unused variables. A very simple example - I want to know that the expression
float x = 1.0f/2;
will be evaluated to:
float x = 0.5f;
and that there will be no division in the code (I know that in this case it's always true, but this is just an example). A more advanced example could be loop unrolling or operation reordering.
Thanks in advance.
For icc and icpc, you can use the -opt-report -opt-report-level max set of flags.
You can also specify an opt-report file. See here for more details.
An optimizing compiler (like GCC, when asked to optimize with -O1 or -O2 etc...) is essentially transforming internal representations of your source code.
If you want to see some of the internal GCC representations, you could pass -fdump-tree-all to GCC. Beware, you'll get hundreds of dump files.
You could also use the MELT probe: MELT is a domain specific language (and plugin implementation) to extend GCC, and it has a probe mode to interactively show some of the internal (notably Gimple) representations.
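As a concrete starting point, a minimal sketch (file name and exact flag spellings are assumptions and vary between compiler versions):
/* Compile with report/dump flags to see what the optimizer did, e.g.
 *   icc -O2 -opt-report -opt-report-level max example.c
 *   gcc -O2 -fdump-tree-all -c example.c   (writes many dump files next to the source)
 */
float half_constant(void)
{
    float x = 1.0f / 2;   /* the question's example: folded to 0.5f at compile time */
    return x;
}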
The optimization you describe at the top of the post is (somewhat strangely) part of icc's -fno-prec-div (which is a default that you might be overriding).

Ensure compiler always uses SSE sqrt instruction

I'm trying to get GCC (or clang) to consistently use the SSE instruction for sqrt instead of the math library function for a computationally intensive scientific application. I've tried a variety of GCC versions on various 32- and 64-bit OS X and Linux systems. I'm making sure to enable SSE with -mfpmath=sse (and -march=core2 to satisfy GCC's requirement for using -mfpmath=sse on 32 bit). I'm also using -O3. Depending on the GCC or clang version, the generated assembly doesn't consistently use SSE's sqrtss. In some versions of GCC, all the sqrts use the instruction. In others, there is a mix of sqrtss and calls to the math library function. Is there a way to give a hint or force the compiler to only use the SSE instruction?
Use the sqrtss intrinsic __builtin_ia32_sqrtss?
You should be careful in using that; you probably know that it has less precision. That will be the reason that gcc doesn't use it systematically.
There is a trick that is even mentioned in Intel's SSE manual (I hope that I remember correctly). The result of sqrtss is only one Heron iteration away from the target. Maybe gcc is sometimes able to inline that surrounding brief iteration in some versions and not in others.
You could use the builtin as MSN says, but you should definitely look up the specs on Intel's web site to know what you are trading.
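A minimal sketch of the intrinsic route, using the portable _mm_sqrt_ss from xmmintrin.h rather than the raw gcc builtin (the wrapper function is illustrative only):
#include <xmmintrin.h>

float sse_sqrt(float x)
{
    /* load x into the low lane, apply sqrtss, and extract the scalar result */
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}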

Resources