How to use x86intrin.h - c

In one of my applications, I need to efficiently de-interleave bits in a long stream of data. Ideally, I would like to use the BMI2 pext_u32() and/or pext_u64() x86_64 intrinsic instructions when available. I scoured the internet for doc on x86intrin.h (GCC), but couldn't find much on the subject; so, I am asking the gurus on StackOverflow to help me out.
Where can I find documentation about how to work with functions in x86intrin.h?
Does gcc's implementation of pext_*() already have code behind it to fall back on, or do I need to write the fallback code myself (for conditional compile)?
Is it possible to write a binary that automatically falls back to an alternate implementation if a target does not support the intrinsic? If so, how does one do so?
Is there a known programming pattern that will be recognized by GCC and automatically converted to pext_*() when compiling with optimization enabled and with -mbmi2?

Intel publishes the Intrinsics Guide, which also applies to GCC. You will have to write your own fallback code if you use these intrinsics.
You can achieve automatic switching of implementations by using IFUNC resolvers, but for non-library code, using conditionals or function pointers is probably simpler.
Looking at the gcc/config/i386/i386.md and gcc/config/i386/i386.c files, I don't see anything in GCC 8 which would automatically select the pext instruction without intrinsics in the source code.

The design philosophy of Intel's intrinsics is that you use them only in functions that will run only on CPUs with the required extensions. Checking for support on every instruction would add far too much overhead, and then there would have to be a fallback (there isn't).
Intel intrinsics are not like GNU C __builtin_popcountll, which does use a fallback if compiled without -mpopcnt. Note, however, that you can enable target options on a per-function basis with attributes.
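To make the points above concrete, here is a minimal sketch of a hand-written fallback combined with runtime selection through a function pointer. It assumes GCC or Clang on x86: the target attribute and __builtin_cpu_supports are compiler extensions, and the names pext32, pext32_soft and pext32_bmi2 are made up for this illustration.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Portable fallback: gathers the bits of src selected by mask
   into the low bits of the result, like the pext instruction. */
static uint32_t pext32_soft(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    for (uint32_t out_bit = 1; mask != 0; out_bit <<= 1) {
        uint32_t lowest = mask & (~mask + 1u);  /* isolate lowest set mask bit */
        if (src & lowest)
            result |= out_bit;
        mask &= mask - 1;                       /* clear that mask bit */
    }
    return result;
}

/* BMI2 version: the target attribute enables BMI2 for this function only,
   so the rest of the file can be built without -mbmi2. */
__attribute__((target("bmi2")))
static uint32_t pext32_bmi2(uint32_t src, uint32_t mask)
{
    return _pext_u32(src, mask);
}

/* Picks an implementation on first use. The lazy initialization is not
   strictly thread-safe; threaded code could call this once at startup. */
static uint32_t pext32(uint32_t src, uint32_t mask)
{
    static uint32_t (*impl)(uint32_t, uint32_t);
    if (!impl) {
        __builtin_cpu_init();
        impl = __builtin_cpu_supports("bmi2") ? pext32_bmi2 : pext32_soft;
    }
    return impl(src, mask);
}
```

For de-interleaving, pext32(x, 0x55555555u) collects the even-numbered bits of x and pext32(x, 0xAAAAAAAAu) the odd-numbered ones. An IFUNC resolver would move the same choice to load time, at the cost of being specific to ELF platforms.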

Related

_asm in which cases is it best to use it? [duplicate]

Is there anything that we can do in assembly that we can't do in raw C? Or anything which is easier to do in assembly? Is any modern code actually written using inline assembly, or is it simply implemented as a legacy or educational feature?
Inline assembly (and on a related note, calling external functions written purely in assembly) can be extremely useful or absolutely essential for reasons such as writing device drivers, direct access to hardware or processor capabilities not defined in the language, hardware-supported parallel processing (as opposed to multi-threading) such as CUDA, interfacing with FPGAs, performance, etc.
It is also important because some things are only possible by going "beneath" the level of abstraction provided by the Standard (both C++ and C).
The Standard(s) recognize that some things will be inherently implementation-defined, and allow for that throughout the Standard. One of these allowances (perhaps the lowest-level) is recognition of asm. Well, "sort of" recognition:
In C (N1256), it is found in the Standard under "Common extensions":
J.5.10 The asm keyword
1 The asm keyword may be used to insert assembly language directly into the translator output (6.8). The most common implementation is via a statement of the form:
asm ( character-string-literal );
In C++ (N3337), it has similar caveats:
§7.4/1
An asm declaration has the form
asm-definition:
asm ( string-literal ) ;
The asm declaration is conditionally-supported; its meaning is implementation-defined. [ Note: Typically it is used to pass information through the implementation to an assembler. —end note]
It should be noted that an important development in recent years is that attempting to increase performance by using inline assembly is often counter-productive, unless you know exactly what you are doing. Compiler/optimizer register usage decisions, awareness of pipeline and branch prediction behavior, etc., are almost always enough for most uses.
On the other hand, processors in recent years have added CPU-level support for higher-level operations (such as Intel's AES extensions) that can increase performance by several orders of magnitude for specialized applications.
So:
Legacy feature? Not at all. It is absolutely essential for some requirements.
Educational feature? In an ideal world, only if accompanied by a series of lectures explaining why you'll probably never need it, and if you ever do need it, how to limit its visible surface area to the rest of your application as much as possible.
You also need to code with inlined asm when:
you need to use some processor features not usable in standard C; typically, the add with carry machine instruction is useful in bignum implementations like GMPlib
on today's processors with current optimizing compilers, you usually should not use asm for performance reasons, since compilers generally optimize better than you can (an old example was implementing memcpy with rep movsw on x86).
you need some asm when you are using or implementing a different ABI. For example, the runtime systems of some Ocaml or Common Lisp implementations have different calling conventions, and transitioning to them may require asm; but the current libffi (which itself uses asm) may save you from writing asm yourself
your brand-new hardware might have a recent instruction set not fully implemented by your compiler (e.g. extensions like AVX512...) for which you might need asm
you want to implement some functionality not implementable in C, e.g. backtrace
In general, you should think more than twice before using asm and if you do use it, you should use it in very few places. In general, avoid using asm....
The GCC compiler introduced an extended asm feature which has nearly become a de facto standard supported by many other compilers (e.g. Clang/LLVM...), but the devil is in the details. See also the GCC Inline Assembly HowTo
The Linux kernel (and the many libc implementations, e.g. glibc or musl libc, etc...) uses asm (at least to make syscalls), but few other major free software projects use asm instructions directly.
Read also the Linux Assembly HowTo
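As an illustration of the add-with-carry case mentioned above, here is a small sketch using GCC-style extended asm. It assumes GCC or Clang, x86-64 and AT&T syntax; the function name add128 is invented for this example, and the #else branch is a plain C fallback so that the file still compiles on other targets.

```c
#include <stdint.h>

/* Adds two 128-bit numbers, each given as a (low, high) pair of 64-bit limbs. */
static void add128(uint64_t a_lo, uint64_t a_hi,
                   uint64_t b_lo, uint64_t b_hi,
                   uint64_t *r_lo, uint64_t *r_hi)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* addq sets the carry flag, adcq consumes it. a_lo is marked
       early-clobber (&) because it is written before b_hi is read. */
    __asm__("addq %[blo], %[alo]\n\t"
            "adcq %[bhi], %[ahi]"
            : [alo] "+&r"(a_lo), [ahi] "+r"(a_hi)
            : [blo] "r"(b_lo), [bhi] "r"(b_hi)
            : "cc");
    *r_lo = a_lo;
    *r_hi = a_hi;
#else
    /* Portable fallback: detect the carry out of the low limb by comparison. */
    uint64_t lo = a_lo + b_lo;
    *r_lo = lo;
    *r_hi = a_hi + b_hi + (lo < a_lo);
#endif
}
```

The asm is confined to one small function, which is in the spirit of the advice above to use it in very few places.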

Implicit definition of non-simd intel intrinsic

In the following link there is a section for non-simd intel intrinsics:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
These include assembly instructions like bsf and bsr. For SIMD instructions I can copy the c function and run it after including the proper header.
For the non-simd functions, like _bit_scan_reverse (bsr), I get that this function is undefined for gcc (implicit definition). GCC has similar "builtin functions" e.g. __builtin_ctz, but no _bit_scan_reverse or _mm_popcnt_u32. Why are these intrinsics not available?
#include <stdio.h>
#include <immintrin.h>
int main(void) {
    int x = 5;
    int y = _bit_scan_reverse (x);
    printf("%d\n",y);
    return 0;
}
It appears that I needed to have two changes:
First, it appears to be best practice to include x86intrin.h rather than more specific includes. This appears to be compiler specific and is covered in much better detail in:
Header files for x86 SIMD intrinsics
Importantly, you would have a different include if not using gcc.
Second, compiler options also need to be enabled. For gcc these are detailed in:
https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
Although documentation for many flags is lacking.
As my goal is to distribute a compiled binary, I wanted to try and avoid -march=native
Most of the "other" intrinsics I'm interested in are bit manipulation related.
Ye Olde Wikipedia has a decent writeup of important bit manipulation intrinsic groups like bmi2:
https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets
I need bmi2 for BZHI (instruction) or _bzhi_u32 (c)
Thus I can get what I want with something like:
-mavx2 -mbmi2
Using -mbmi2 initially seemed sufficient to get bmi1 and abm as well (see the linked Wikipedia page for definitions), although the linked gcc page does not mention this. EDIT: Enabling bmi2 does not in fact enable bmi1 and abm; I may have been using a __builtin call. I later needed to add -mabm and -mbmi explicitly to get the instructions I wanted. As Peter Cordes suggested, it is probably better to target Haswell with -march=haswell as a starting point and then add further flags as needed. Haswell (2013) was the first processor with AVX2, so in my mind -march=haswell basically says: I expect you to have a computer from 2013 or newer.
Also, based on some quick reading, it sounds like the use of __builtin enables the necessary flags (a future question for SO), although there does not appear to be a 1:1 correspondence between intrinsics and builtins. More specifically, not all intrinsics seem to be included as builtins, meaning the flag setting approach seems to be necessary, rather than just always using builtins and not worrying about setting flags. Also it is useful to know what intrinsics are being used, for distribution purposes, as it seems like bmi2 could still be missing on a substantial portion of computers (e.g. needing AMD from 2015+ - I think).
It's still not clear to me why just using the specified include in the Intel documentation doesn't work, but this info gets me 99% of the way to where I want to be.
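For reference, a minimal sketch of the include change described above, assuming GCC or Clang on x86. The wrapper names are invented for this example. bsr is part of the baseline instruction set, so this particular intrinsic should need no extra -m flag, unlike _bzhi_u32.

```c
#include <x86intrin.h>

/* Index of the highest set bit, using the Intel-style intrinsic.
   The result is undefined for x == 0. */
static int highest_set_bit_intrin(int x)
{
    return _bit_scan_reverse(x);
}

/* The same value computed with the GNU builtin (also undefined for x == 0). */
static int highest_set_bit_builtin(unsigned x)
{
    return 31 - __builtin_clz(x);
}
```

Both functions return 2 for the input 5 used in the question.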

_InterlockedCompareExchange vs _InterlockedCompareExchange_np

I am developing very specialized and highly optimized kernel code (targeting Windows 7 to Windows 10, x64), and I need some of the Interlocked intrinsics, in particular _InterlockedCompareExchange. The documentation says that there are many variants of the function, but for some of them the description is quite cryptic. MSDN says "The intrinsics with an _np ("no prefetch") suffix prevent a possible prefetch operation from being inserted by the compiler". Does it mean that the compiler may mess up my code by reordering instructions? Should I always use the "_np" version? I tried both functions and the resulting machine code is the same. Any hints?

How to check with Intel intrinsics if AVX extensions is supported by the CPU?

I'm writing a program using Intel intrinsics. I want to use the _mm_permute_pd intrinsic, which is only available on CPUs with AVX. For CPUs without AVX I can use _mm_shuffle_pd, but according to the specs it is much slower than _mm_permute_pd. Do the header files for Intel intrinsics define constants that allow me to distinguish whether AVX is supported, so that I can write something like this?
#ifdef __IS_AVX_SUPPORTED__ // is there something like this defined?
// use _mm_permute_pd
#else
// use _mm_shuffle_pd
#endif
I have found this tutorial, which shows how to perform a runtime check, but I need to do a static, compile-time check for the current machine.
GCC, ICC, MSVC, and Clang all define a macro __AVX__ which you can check. In fact it's the only SIMD constant defined by all those compilers (MSVC is the one that breaks the mold). This only tells you whether your code was compiled with AVX support (e.g. -mavx with GCC or /arch:AVX with MSVC); it does not tell you whether your CPU supports AVX. If you want to know whether the CPU supports AVX you need to check CPUID. Here, asm-in-c-error, is an example to read CPUID from all those compilers.
To do this properly I suggest you make a CPU dispatcher.
Edit: In case anyone wants to know how to use the values from CPUID to find out if AVX is available see https://github.com/Mysticial/FeatureDetector
I assume you are using Intel C++ Compiler. In this case - yes, there are such macros: Intel C++ Compiler Reference Guide: __AVX__, __AVX2__.
P.S. Be aware that if you compile your application with the AVX instruction set enabled, it will fail on CPUs that do not support AVX. If you are going to distribute your software as a source code package and compile it on the target machine, this may be a viable solution. Otherwise you should check for AVX dynamically.
P.P.S. There are several options for ICC. Take a look at the following compiler options and also references from it to other.
It seems to me that the only way is to compile and run a program that identifies whether AVX is available, then manually or automatically compile separate code with or without AVX functions. For VS 2013, I would use my code in the commomAVX folder in the following archive to identify hasAVX (or not), and use this to execute one of two different BAT files to compile and link the appropriate program.
http://www.roylongbottom.org.uk/gigaflops-benchmarks.zip
My question was to help to identify a solution regarding the use of suitable compile options such as /arch:AVX.
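The difference between the compile-time and run-time checks discussed in this thread can be sketched as follows. This assumes GCC or Clang on x86-64 (so SSE2 is always available); __builtin_cpu_supports is a compiler extension standing in for a hand-written CPUID query, and the function names are invented for this example.

```c
#include <immintrin.h>

/* Compile-time: 1 if this translation unit was built with AVX enabled
   (e.g. -mavx), 0 otherwise. Says nothing about the machine it runs on. */
static int compiled_with_avx(void)
{
#ifdef __AVX__
    return 1;
#else
    return 0;
#endif
}

/* Run-time: nonzero if the CPU executing this code reports AVX support. */
static int cpu_has_avx(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") != 0;
}

/* Swaps the two doubles of a pair, choosing the intrinsic at compile time. */
static void swap_pair(const double in[2], double out[2])
{
    __m128d v = _mm_loadu_pd(in);
#ifdef __AVX__
    v = _mm_permute_pd(v, 1);      /* AVX encoding, one source operand */
#else
    v = _mm_shuffle_pd(v, v, 1);   /* SSE2 fallback with the same result */
#endif
    _mm_storeu_pd(out, v);
}
```

A binary built with -mavx takes the first branch unconditionally, which is why a distributed binary still needs the run-time check before calling into such code.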

How to create a C compiler for custom CPU?

What would be the easiest way to create a C compiler for a custom CPU, assuming of course I already have an assembler for it?
Since a C compiler generates assembly, is there some way to just define standard bits and pieces of assembly code for the various C idioms, rebuild the compiler, and thereby obtain a cross compiler for the target hardware?
Preferably the compiler itself would be written in C, and build as a native executable for either Linux or Windows.
Please note: I am not asking how to write the compiler itself. I did take that course in college, I know about general compiler-compilers, etc. In this situation, I'd just like to configure some existing framework if at all possible. I don't want to modify the language, I just want to be able to target an arbitrary architecture. If the answer turns out to be "it doesn't work that way", that information will be useful to myself and anyone else who might make similar assumptions.
Quick overview/tutorial on writing a LLVM backend.
This document describes techniques for writing backends for LLVM which convert the LLVM representation to machine assembly code or other languages.
[ . . . ]
To create a static compiler (one that emits text assembly), you need to implement the following:
Describe the register set.
Describe the instruction set.
Describe the target machine.
Implement the assembly printer for the architecture.
Implement an instruction selector for the architecture.
There's the concept of a cross-compiler, i.e., one that runs on one architecture but targets a different one. You can see how GCC does it (for example) and add a new architecture to the set, if that's the compiler you want to extend.
Edit: I just spotted a question a few years ago on a GCC mailing list on how to add a new target and someone pointed to this
vbcc (at www.compilers.de) is a good and simple retargetable C-compiler written in C. It's much simpler than GCC/LLVM. It's so simple I was able to retarget the compiler to my own CPU with a few weeks of work without having any prior knowledge of compilers.
The short answer is that it doesn't work that way.
The longer answer is that it does take some effort to write a compiler for a new CPU type. You don't need to create a compiler from scratch, however. Most compilers are structured in several passes; here's a typical architecture (a lot of variations are possible):
Syntactic analysis (lexer and parser), and for C preprocessing, leading to an abstract syntax tree.
Type checking, leading to an annotated abstract syntax tree.
Intermediate code generation, leading to architecture-independent intermediate code. Some optimizations are performed at this stage.
Machine code generation, leading to assembly or directly to machine code. More optimizations are performed at this stage.
In this description, only step 4 is machine-dependent. So you can take a compiler where step 4 is clearly separated and plug in your own step 4. Doing this requires a deep understanding of the CPU and some understanding of the compiler internals, but you don't need to worry about what happens before.
Almost all CPUs that are not very small, very rare or very old have a backend (step 4) for GCC. The main documentation for writing a GCC backend is the GCC internals manual, in particular the chapters on machine descriptions and target descriptions. GCC is free software, so there is no licensing cost in using it.
1) Short answer:
"No. There's no such thing as a "compiler framework" where you can just add water (plug in your own assembly set), stir, and it's done."
2) Longer answer: it's certainly possible. But challenging. And likely expensive.
If you wanted to do it yourself, I'd start by looking at Gnu CC. It's already available for a large variety of CPUs and platforms.
3) Take a look at this link for more ideas (including the idea of "just build a library of functions and macros"), that would be my first suggestion:
http://www.instructables.com/answers/Custom-C-Compiler-for-homemade-instruction-set/
You can modify existing open source compilers such as GCC or Clang. Other answers have provided you with links about where to learn more. But these compilers are not designed to be easily retargeted; they are merely "easier" to retarget than compilers hard-wired for specific targets.
But if you want a compiler that is relatively easy to retarget, you want one in which you can specify the machine architecture in explicit terms, and some tool generates the rest of the compiler (GCC does a bit of this; I don't think Clang/LLVM does much but I could be wrong here).
There's a lot of this in the literature, google "compiler-compiler".
But for a concrete solution for C, you should check out ACE, a compiler vendor that generates compilers on demand for customers. Not free, but I hear they produce very good compilers very quickly. I think it produces standard style binaries (ELF?) so it skips the assembler stage. (I have no experience or relationship with ACE.)
If you don't care about code quality, you can likely write a syntax-directed translation of C to assembler using a C AST. You can get C ASTs from GCC, Clang, maybe ANTLR, and from our DMS Software Reengineering Toolkit.
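To give a feel for what syntax-directed translation means here, the following toy sketch walks an expression tree and emits assembly text for a hypothetical stack machine. The Node type and the PUSH, ADD and MUL mnemonics are assumptions made up for this illustration; they do not correspond to any real CPU or to the AST format of GCC, Clang, ANTLR or DMS.

```c
#include <stdio.h>
#include <string.h>

typedef enum { NODE_CONST, NODE_ADD, NODE_MUL } NodeKind;

typedef struct Node {
    NodeKind kind;
    int value;                    /* used by NODE_CONST only */
    const struct Node *lhs, *rhs; /* used by NODE_ADD and NODE_MUL */
} Node;

/* Appends the assembly for n to the NUL-terminated string in out.
   Returns 0 on success, -1 if the buffer of size cap is too small. */
static int emit(const Node *n, char *out, size_t cap)
{
    char line[32];
    if (n->kind == NODE_CONST) {
        snprintf(line, sizeof line, "PUSH %d\n", n->value);
    } else {
        /* Post-order walk: operands first, then the operator. */
        if (emit(n->lhs, out, cap) != 0 || emit(n->rhs, out, cap) != 0)
            return -1;
        snprintf(line, sizeof line, "%s\n",
                 n->kind == NODE_ADD ? "ADD" : "MUL");
    }
    if (strlen(out) + strlen(line) + 1 > cap)
        return -1;
    strcat(out, line);
    return 0;
}
```

Each node kind maps to a fixed snippet of output with no register allocation or optimization, which is why the resulting code quality is poor. For the tree of (2 + 3) * 4 the emitted lines are PUSH 2, PUSH 3, ADD, PUSH 4, MUL.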
