I am working on a project to improve performance on the ARM platform using NEON intrinsics.
I could not find direct equivalents for the intrinsics below:
_mm_mulhi_epi16
_mm_hadd_epi32
_mm_maddubs_epi16
_mm_madd_epi16
_mm_extract_epi8
Knowing the equivalent intrinsics would help a lot in my efforts.
_mm_hadd_epi32 appears to match vpaddq_s32.
_mm_extract_epi8 appears to match vgetq_lane_s8.
Not sure about the others offhand.
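For two of the remaining ones, a combination of widening multiplies is one possible route. The following is only a sketch, not an official mapping: the helper names mulhi_s16x8 and madd_s16x8 are made up for illustration, the NEON path assumes arm_neon.h from GCC or Clang, and a scalar path is included so the intended per-lane behaviour of _mm_mulhi_epi16 and _mm_madd_epi16 can be checked on any machine.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#endif

/* Hypothetical helper: high 16 bits of each signed 16x16 product,
   which is what _mm_mulhi_epi16 computes per lane. */
void mulhi_s16x8(const int16_t a[8], const int16_t b[8], int16_t out[8]) {
#ifdef SKETCH_HAVE_NEON
    int16x8_t va = vld1q_s16(a), vb = vld1q_s16(b);
    /* Widen to 32-bit products, then keep bits 31..16 of each. */
    int32x4_t lo = vmull_s16(vget_low_s16(va), vget_low_s16(vb));
    int32x4_t hi = vmull_s16(vget_high_s16(va), vget_high_s16(vb));
    vst1q_s16(out, vcombine_s16(vshrn_n_s32(lo, 16), vshrn_n_s32(hi, 16)));
#else
    /* Scalar reference; assumes >> on negative values is arithmetic. */
    for (int i = 0; i < 8; i++)
        out[i] = (int16_t)(((int32_t)a[i] * b[i]) >> 16);
#endif
}

/* Hypothetical helper: multiply 16-bit lanes and add adjacent 32-bit
   products, which is what _mm_madd_epi16 computes. */
void madd_s16x8(const int16_t a[8], const int16_t b[8], int32_t out[4]) {
#ifdef SKETCH_HAVE_NEON
    int16x8_t va = vld1q_s16(a), vb = vld1q_s16(b);
    int32x4_t lo = vmull_s16(vget_low_s16(va), vget_low_s16(vb));
    int32x4_t hi = vmull_s16(vget_high_s16(va), vget_high_s16(vb));
#if defined(__aarch64__)
    int32x4_t sum = vpaddq_s32(lo, hi);           /* AArch64 only */
#else
    int32x4_t sum = vcombine_s32(
        vpadd_s32(vget_low_s32(lo), vget_high_s32(lo)),
        vpadd_s32(vget_low_s32(hi), vget_high_s32(hi)));
#endif
    vst1q_s32(out, sum);
#else
    for (int i = 0; i < 4; i++) {
        int32_t p0 = (int32_t)a[2 * i] * b[2 * i];
        int32_t p1 = (int32_t)a[2 * i + 1] * b[2 * i + 1];
        /* Add as unsigned so the one overflowing case wraps like the SIMD op. */
        out[i] = (int32_t)((uint32_t)p0 + (uint32_t)p1);
    }
#endif
}
```

_mm_maddubs_epi16 mixes unsigned and signed 8-bit inputs and saturates the pairwise sum, so it would need a different sequence and is not covered by this sketch.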
In one of my applications, I need to efficiently de-interleave bits in a long stream of data. Ideally, I would like to use the BMI2 pext_u32() and/or pext_u64() x86_64 intrinsic instructions when available. I scoured the internet for doc on x86intrin.h (GCC), but couldn't find much on the subject; so, I am asking the gurus on StackOverflow to help me out.
Where can I find documentation about how to work with functions in x86intrin.h?
Does gcc's implementation of pext_*() already have code behind it to fall back on, or do I need to write the fallback code myself (for conditional compile)?
Is it possible to write a binary that automatically falls back to an alternate implementation if a target does not support the intrinsic? If so, how does one do so?
Is there a known programming pattern that will be recognized by GCC and automatically converted to pext_*() when compiling with optimization enabled and with -mbmi2?
Intel publishes the Intrinsics Guide, which also applies to GCC. You will have to write your own fallback code if you use these intrinsics.
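As a rough illustration of such a fallback, here is a minimal sketch of a compile-time switch. The names pext32_soft and pext32 are invented for this example; the only real API used is _pext_u32 from immintrin.h, and the __BMI2__ macro is the one GCC and Clang define when building with -mbmi2.

```c
#include <stdint.h>
#if defined(__BMI2__)
#include <immintrin.h>
#endif

/* Hypothetical portable emulation of PEXT: gather the bits of src
   selected by mask into the low bits of the result. */
uint32_t pext32_soft(uint32_t src, uint32_t mask) {
    uint32_t result = 0;
    for (uint32_t out_bit = 1; mask != 0; out_bit <<= 1) {
        uint32_t lowest = mask & (0u - mask);   /* lowest set bit of mask */
        if (src & lowest)
            result |= out_bit;
        mask &= mask - 1;                       /* clear that bit */
    }
    return result;
}

/* Uses the instruction when the build enables BMI2, else the loop above. */
uint32_t pext32(uint32_t src, uint32_t mask) {
#if defined(__BMI2__)
    return _pext_u32(src, mask);
#else
    return pext32_soft(src, mask);
#endif
}
```

For de-interleaving, a mask such as 0x55555555 would pull out the even-numbered bits and 0xAAAAAAAA the odd-numbered ones. The loop costs one iteration per set mask bit, so it is much slower than the instruction.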
You can achieve automatic switching of implementations by using IFUNC resolvers, but for non-library code, using conditionals or function pointers is probably simpler.
Looking at the gcc/config/i386/i386.md and gcc/config/i386/i386.c files, I don't see anything in GCC 8 which would automatically select the pext instruction without intrinsics in the source code.
The design philosophy of Intel's intrinsics is that you use them only in functions that will run only on CPUs with the required extensions. Checking for support on every instruction would add far too much overhead, and then there'd have to be a fallback (there isn't).
Intel intrinsics are not like GNU C __builtin_popcountll (which does use a fallback if compiled without -mpopcnt), but note that you can enable target options on a per-function basis with attributes.
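The function-pointer approach combined with a per-function target attribute might look like the sketch below. It assumes a reasonably recent GCC or Clang on x86 (for __attribute__((target("bmi2"))), __builtin_cpu_init and __builtin_cpu_supports("bmi2")); the names pext32_generic, pext32_bmi2 and select_pext32 are hypothetical. On any other compiler or architecture it simply returns the generic version.

```c
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define SKETCH_X86_DISPATCH 1
#endif

/* Portable version, safe to run on any CPU. */
static uint32_t pext32_generic(uint32_t src, uint32_t mask) {
    uint32_t result = 0;
    for (uint32_t out_bit = 1; mask != 0; out_bit <<= 1) {
        if (src & mask & (0u - mask))   /* test src at lowest set mask bit */
            result |= out_bit;
        mask &= mask - 1;
    }
    return result;
}

#ifdef SKETCH_X86_DISPATCH
/* Only this function is compiled with BMI2 enabled, so the rest of the
   file still runs on CPUs without it. It must not be called unless the
   CPU check below succeeded. */
__attribute__((target("bmi2")))
static uint32_t pext32_bmi2(uint32_t src, uint32_t mask) {
    return _pext_u32(src, mask);
}
#endif

typedef uint32_t (*pext32_fn)(uint32_t, uint32_t);

/* Pick an implementation once, at startup. */
pext32_fn select_pext32(void) {
#ifdef SKETCH_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("bmi2"))
        return pext32_bmi2;
#endif
    return pext32_generic;
}
```

A caller would store the result of select_pext32() in a global or pass it into the hot loop, so the check happens once rather than per call.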
I'm looking for the intrinsic corresponding to the operation 'SMLAL2 Vd.8H,Vn.16B,Vm.16B', which according to ARM's own documentation (ARM Neon Intrinsics Ref) should be something like
int16x8_t vmlal_high_s8 (int16x8_t a,int8x16_t b,int8x16_t c)
However, the arm_neon.h that is included in ARM's GNU Toolchain doesn't have anything corresponding to it. So my question is whether I just have to include something else, or whether I can otherwise work around this problem.
Thanks in advance!
For anyone else hitting this problem: I had chosen the ARM embedded toolchain instead of the Linaro one, which is the one suitable for aarch64.
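If the code also has to build with a 32-bit ARM toolchain, where vmlal_high_s8 is not declared, one possible workaround is to take the high halves explicitly and use vmlal_s8. This is a sketch under that assumption; the wrapper name mlal_high_s8 is made up, and the scalar branch is there only so the arithmetic can be checked on a non-ARM host.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#endif

/* Hypothetical wrapper: acc[i] += b[8 + i] * c[8 + i] for i in 0..7,
   with the 8-bit inputs widened to 16 bits before multiplying. */
void mlal_high_s8(int16_t acc[8], const int8_t b[16], const int8_t c[16]) {
#if defined(SKETCH_HAVE_NEON) && defined(__aarch64__)
    /* AArch64: maps to SMLAL2 Vd.8H,Vn.16B,Vm.16B */
    vst1q_s16(acc, vmlal_high_s8(vld1q_s16(acc), vld1q_s8(b), vld1q_s8(c)));
#elif defined(SKETCH_HAVE_NEON)
    /* 32-bit ARM: same result from the high halves and vmlal_s8 */
    int8x16_t vb = vld1q_s8(b), vc = vld1q_s8(c);
    vst1q_s16(acc, vmlal_s8(vld1q_s16(acc), vget_high_s8(vb), vget_high_s8(vc)));
#else
    /* Scalar reference for hosts without NEON */
    for (int i = 0; i < 8; i++)
        acc[i] = (int16_t)(acc[i] + (int16_t)b[8 + i] * c[8 + i]);
#endif
}
```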
I'm writing a program using Intel intrinsics. I want to use the _mm_permute_pd intrinsic, which is only available on CPUs with AVX. For CPUs without AVX I can use _mm_shuffle_pd, but according to the specs it is much slower than _mm_permute_pd. Do the header files for Intel intrinsics define constants that allow me to distinguish whether AVX is supported, so that I can write something like this?
#ifdef __IS_AVX_SUPPORTED__ // is there something like this defined?
// use _mm_permute_pd
#else
// use _mm_shuffle_pd
#endif
I have found this tutorial, which shows how to perform a runtime check, but I need to do a static, compile-time check for the current machine.
GCC, ICC, MSVC, and Clang all define a macro __AVX__ which you can check. In fact it's the only SIMD constant defined by all those compilers (MSVC is the one that breaks the mold). This only tells you whether your code was compiled with AVX support (e.g. -mavx with GCC or /arch:AVX with MSVC); it does not tell you whether your CPU supports AVX. If you want to know whether the CPU supports AVX, you need to check CPUID. Here, asm-in-c-error, is an example of reading CPUID with all of those compilers.
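A compile-time switch on __AVX__ could look roughly like this. It is a sketch only: the function name swap_halves is invented, and the __SSE2__ check is a GCC/Clang assumption (MSVC does not define that macro), so with other compilers the middle branch would need a different condition.

```c
#if defined(__AVX__) || defined(__SSE2__)
#include <immintrin.h>
#endif

/* Hypothetical example: swap the two doubles of a 128-bit vector. */
void swap_halves(const double in[2], double out[2]) {
#if defined(__AVX__)
    /* Built with -mavx or /arch:AVX: the AVX form is available. */
    __m128d v = _mm_loadu_pd(in);
    _mm_storeu_pd(out, _mm_permute_pd(v, 1));
#elif defined(__SSE2__)
    /* No AVX at compile time: SSE2 shuffle with the same vector twice. */
    __m128d v = _mm_loadu_pd(in);
    _mm_storeu_pd(out, _mm_shuffle_pd(v, v, 1));
#else
    /* Not an x86 SIMD build at all. */
    double lo = in[0], hi = in[1];
    out[0] = hi;
    out[1] = lo;
#endif
}
```

Which branch is compiled depends purely on the compiler flags, not on the machine the binary later runs on.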
To do this properly I suggest you make a CPU dispatcher.
Edit: In case anyone wants to know how to use the values from CPUID to find out if AVX is available see https://github.com/Mysticial/FeatureDetector
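For a rough idea of what such a check involves, here is a minimal sketch for GCC or Clang on x86 using cpuid.h and inline assembly. The function name cpu_has_avx is hypothetical. It tests the AVX and OSXSAVE bits in CPUID leaf 1 and then reads XCR0 to confirm the operating system saves the YMM state; on other compilers or architectures it just reports 0.

```c
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>
#define SKETCH_X86_GNU 1
#endif

/* Hypothetical helper: returns 1 if AVX can be used at runtime, else 0. */
int cpu_has_avx(void) {
#ifdef SKETCH_X86_GNU
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                           /* leaf 1 not available */

    const unsigned int osxsave_bit = 1u << 27;  /* OS uses XSAVE/XGETBV */
    const unsigned int avx_bit = 1u << 28;      /* CPU implements AVX */
    if ((ecx & osxsave_bit) == 0 || (ecx & avx_bit) == 0)
        return 0;

    /* XGETBV with ECX = 0 reads XCR0; bits 1 and 2 mean the OS
       preserves XMM and YMM registers across context switches. */
    unsigned int xcr0_lo, xcr0_hi;
    __asm__ volatile("xgetbv" : "=a"(xcr0_lo), "=d"(xcr0_hi) : "c"(0));
    return (xcr0_lo & 0x6u) == 0x6u;
#else
    return 0;
#endif
}
```

A dispatcher would call this once at startup and choose between an AVX build of a routine and a baseline one.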
I assume you are using Intel C++ Compiler. In this case - yes, there are such macros: Intel C++ Compiler Reference Guide: __AVX__, __AVX2__.
P.S. Be aware that if you compile your application with the AVX instruction set enabled, it will fail on CPUs that do not support AVX. If you are going to distribute your software as a source code package and compile it on the target machine, this may be a viable solution. Otherwise you should check for AVX dynamically.
P.P.S. There are several options for ICC. Take a look at the following compiler options and the other options they reference.
It seems to me that the only way is to compile and run a program that identifies whether AVX is available, then manually or automatically compile separate code with or without AVX functions. For VS 2013, I would use my code in the commomAVX folder in the following archive to identify hasAVX (or not), and use this to execute one of two different BAT files to compile and link the appropriate program.
http://www.roylongbottom.org.uk/gigaflops-benchmarks.zip
My question was to help to identify a solution regarding the use of suitable compile options such as /arch:AVX.
I want to write some C code such that gcc using the -msse4.1 flag can optimize it. Basically I want to check whether or not the compiler is taking advantage of SSE4.1 instructions.
There are many SSE4.1 instructions (http://en.wikipedia.org/wiki/SSE4#New_instructions) but I am not able to write a fragment of C Code which is using any of those instructions in the generated assembly code.
Thanks in advance.
From what I've seen, compilers rarely ever generate SSE4.1 instructions. I've seen a few cases where it will use the insert/extract instructions to pack data.
But for the most part, if you want to use the SSE4.1 instructions, you need to do them explicitly using intrinsics:
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_bk_sse41.htm
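As a small illustration of the explicit route, the sketch below uses _mm_max_epi32 (PMAXSD, one of the SSE4.1 additions) when the file is built with -msse4.1 and falls back to plain C otherwise. The function name max_i32 is made up for the example. Compiling with gcc -O2 -msse4.1 -S and searching the assembly for pmaxsd is one way to confirm the instruction is emitted; GCC may also vectorize the plain loop to the same instruction at -O3, but that is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>
#endif

/* Hypothetical example: element-wise maximum of two int32 arrays. */
void max_i32(const int32_t *a, const int32_t *b, int32_t *out, size_t n) {
    size_t i = 0;
#if defined(__SSE4_1__)
    /* Four lanes per iteration with the SSE4.1 signed 32-bit max. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_max_epi32(va, vb));
    }
#endif
    /* Remaining elements, or everything when SSE4.1 is not enabled. */
    for (; i < n; i++)
        out[i] = a[i] > b[i] ? a[i] : b[i];
}
```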
I doubt GCC would emit SSE4.1 instructions that easily. But you could have a look at Intel SPMD Program Compiler:
Under the SPMD model, the programmer writes a program that mostly
appears to be a regular serial program, though the execution model is
actually that a number of program instances execute in parallel on the
hardware. (See a more detailed example that illustrates this concept.)
ispc compiles a C-based SPMD programming language to run on the SIMD
units of CPUs; it frequently provides a 3x or more speedup on CPUs
with 4-wide SSE units, without any of the difficulty of writing
intrinsics code.
The ARM reference manual doesn't go into much detail about the individual instructions ( http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/BABIIBBG.html ). Is there something that's a little more detailed?
For more information on the instructions themselves, you need the Assembler Guide. The list you found there just shows the mapping from compiler intrinsics to assembly instructions.
There's also the ARM C Language Extensions which provides details on the usage of the intrinsics (see chapter 12) that could be useful.
There is now an HTML version of the NEON Intrinsics Reference which is pretty convenient. Each entry includes a link to a more detailed explanation of the relevant instruction.
It's still not quite as good as Intel's, which lets you filter by instruction set and includes pseudo-code implementations, but it's a huge improvement over the old PDFs.
The ARM NEON Intrinsics Reference lists every NEON intrinsic with a mapping to the instruction it behaves like. Like the reference you give, it doesn't go into detail about the behavior of the instruction, so it must be read together with an Architecture Reference Manual, but it is the most complete reference for NEON intrinsics that I'm aware of.