The ARM reference manual doesn't go into much detail about the individual instructions ( http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/BABIIBBG.html ). Is there something that's a little more detailed?
For more information on the instructions themselves, you need the Assembler Guide. The list you found there just shows the mapping from compiler intrinsics to assembly instructions.
There's also the ARM C Language Extensions document, which provides details on the usage of the intrinsics (see chapter 12) and could be useful.
There is now an HTML version of the NEON Intrinsics Reference which is pretty convenient. Each entry includes a link to a more detailed explanation of the relevant instruction.
It's still not quite as good as Intel's, which lets you filter by instruction set and includes pseudo-code implementations, but it's a huge improvement over the old PDFs.
The ARM NEON Intrinsics Reference lists every NEON intrinsic with a mapping to the instruction it behaves like. Like the reference you give, it doesn't go into detail about the behavior of the instruction, so it must be read together with an Architecture Reference Manual, but it is the most complete reference for NEON intrinsics that I'm aware of.
Related
I am working on a project to improve performance on the ARM platform using NEON intrinsics.
I could not find direct NEON equivalents for the following SSE intrinsics:
_mm_mulhi_epi16
_mm_hadd_epi32
_mm_maddubs_epi16
_mm_madd_epi16
_mm_extract_epi8
Equivalent intrinsics would help a lot in my efforts.
_mm_hadd_epi32 appears to match vpaddq_s32.
_mm_extract_epi8 appears to match vgetq_lane_s8.
Not sure about the others offhand.
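For the intrinsics without an obvious one-to-one match, it can help to pin down exactly what the SSE operation computes and then check any NEON replacement against that. The following is only a scalar sketch of _mm_mulhi_epi16 and _mm_madd_epi16; the function names `mulhi_epi16_ref` and `madd_epi16_ref` are made up for illustration, and the NEON sequences mentioned in the comments are untested suggestions rather than confirmed equivalents.

```c
#include <stdint.h>

/* Scalar model of _mm_mulhi_epi16: for each of the 8 lanes, keep the
 * high 16 bits of the signed 16x16 -> 32-bit product.
 * Assumes >> on a negative int32_t is an arithmetic shift (true for
 * GCC and Clang, but implementation-defined in ISO C).
 * A possible NEON construction: vmull_s16 on each half, then
 * vshrn_n_s32(product, 16). */
void mulhi_epi16_ref(const int16_t a[8], const int16_t b[8], int16_t out[8])
{
    for (int i = 0; i < 8; i++) {
        int32_t product = (int32_t)a[i] * (int32_t)b[i];
        out[i] = (int16_t)(product >> 16);
    }
}

/* Scalar model of _mm_madd_epi16: multiply the 8 signed 16-bit lanes
 * pairwise, then add adjacent 32-bit products, giving 4 results.
 * The addition is done in unsigned arithmetic so that the single
 * overflowing case (all four inputs equal to -32768) wraps instead of
 * being undefined behaviour.
 * A possible NEON construction: vmull_s16 on each half, then a
 * pairwise add such as vpaddq_s32. */
void madd_epi16_ref(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        int32_t p0 = (int32_t)a[2 * i] * (int32_t)b[2 * i];
        int32_t p1 = (int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1];
        out[i] = (int32_t)((uint32_t)p0 + (uint32_t)p1);
    }
}
```

Running the same inputs through a candidate NEON sequence and through these reference functions is a cheap way to catch lane-order or rounding differences.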
I have a maybe trivial question: what is the difference between Thumb (-mthumb) and Arm (-marm) state, and why do most tutorials recommend using Thumb state?
I am curious what exactly it means. What is it related to?
Best!
I would suggest reading these two articles: one from Arm, Instruction Set Architecture (-marm means that GCC will generate arm32/A32 code, -mthumb means that it will generate thumb/T32 code), and this research paper, Profile Guided Selection of ARM and Thumb Instructions.
Basically, the two instruction sets differ in the set of instructions available as well as in their encoding. Thumb/T32 mixes 16-bit and 32-bit encodings, so it usually gives a smaller executable than arm/A32, and the better code density can also help performance through the instruction cache, although that is not guaranteed for every workload.
This is the reason why most of the tutorials recommend using the thumb/T32 instruction set.
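To see which state a given translation unit was actually compiled for, one option is to look at the compiler's predefined macros. This is a minimal sketch, assuming GCC or Clang, which define `__thumb__` when generating Thumb code and `__arm__` on 32-bit ARM targets in either state; the function name `target_isa` is made up for illustration.

```c
/* Report the instruction set this file was compiled for, based on
 * GCC/Clang predefined macros. Check __thumb__ first, because __arm__
 * is defined for 32-bit ARM targets in both ARM and Thumb state. */
const char *target_isa(void)
{
#if defined(__thumb__)
    return "thumb/T32";        /* e.g. built with -mthumb */
#elif defined(__arm__)
    return "arm/A32";          /* e.g. built with -marm */
#elif defined(__aarch64__)
    return "A64";              /* 64-bit ARM has no Thumb state */
#else
    return "not an ARM target";
#endif
}
```

Building the same file once with -marm and once with -mthumb and comparing the object sizes is a simple way to check the code-size difference on your own code.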
In one of my applications, I need to efficiently de-interleave bits in a long stream of data. Ideally, I would like to use the BMI2 _pext_u32() and/or _pext_u64() x86_64 intrinsics when available. I scoured the internet for documentation on x86intrin.h (GCC) but couldn't find much on the subject, so I am asking the gurus on StackOverflow to help me out.
Where can I find documentation about how to work with functions in x86intrin.h?
Does GCC's implementation of _pext_*() already have code behind it to fall back on, or do I need to write the fallback code myself (for conditional compilation)?
Is it possible to write a binary that automatically falls back to an alternate implementation if a target does not support the intrinsic? If so, how does one do so?
Is there a known programming pattern that will be recognized by GCC and automatically converted to _pext_*() when compiling with optimization enabled and with -mbmi2?
Intel publishes the Intrinsics Guide, which also applies to GCC. You will have to write your own fallback code if you use these intrinsics.
You can achieve automatic switching of implementations by using IFUNC resolvers, but for non-library code, using conditionals or function pointers is probably simpler.
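The function-pointer approach could look roughly like the sketch below. It assumes GCC or Clang on x86 (for `__builtin_cpu_supports` and the `target` attribute) and falls back to a bit-by-bit software loop everywhere else; the names `pext32`, `pext32_soft`, `pext32_hw` and `pext32_select` are invented for this example.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

/* Portable fallback: collect the bits of src selected by mask and
 * pack them into the low bits of the result. */
uint32_t pext32_soft(uint32_t src, uint32_t mask)
{
    uint32_t out = 0;
    uint32_t bit = 1;
    while (mask != 0) {
        uint32_t lowest = mask & (0u - mask);  /* lowest set bit of mask */
        if (src & lowest)
            out |= bit;
        bit <<= 1;
        mask &= mask - 1;                      /* clear that bit */
    }
    return out;
}

#ifdef HAVE_X86_DISPATCH
/* Only this function is compiled with BMI2 enabled, so the rest of
 * the file still runs on CPUs without it. */
__attribute__((target("bmi2")))
static uint32_t pext32_hw(uint32_t src, uint32_t mask)
{
    return _pext_u32(src, mask);
}
#endif

typedef uint32_t (*pext32_fn)(uint32_t, uint32_t);

/* Pick an implementation once, based on a run-time CPU check. */
static pext32_fn pext32_select(void)
{
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("bmi2"))
        return pext32_hw;
#endif
    return pext32_soft;
}

/* Public entry point. The lazy initialisation is not synchronised,
 * which is an assumption that callers are single-threaded at first use. */
uint32_t pext32(uint32_t src, uint32_t mask)
{
    static pext32_fn impl = 0;
    if (!impl)
        impl = pext32_select();
    return impl(src, mask);
}
```

For a long stream, it is probably better to select the implementation once per buffer (dispatching a whole loop) than once per call, so the indirect call does not dominate.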
Looking at the gcc/config/i386/i386.md and gcc/config/i386/i386.c files, I don't see anything in GCC 8 which would automatically select the pext instruction without intrinsics in the source code.
The design philosophy of Intel's intrinsics is that you use them only in functions that will run only on CPUs with the required extensions. Checking for support on every instruction would add way too much overhead, and then there would have to be a fallback (there isn't).
Intel intrinsics are not like GNU C __builtin_popcountll (which does use a fallback if compiled without -mpopcnt), but note that you can enable target options on a per-function basis with attributes.
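As a small illustration of the difference, the GNU C builtin below compiles with or without -mpopcnt; the compiler decides whether to emit the popcnt instruction or a software sequence. This is a sketch, with `count_set_bits` as a made-up wrapper name and a plain loop for compilers that do not provide the builtin.

```c
#include <stdint.h>

/* Count the set bits in x. With GCC or Clang this uses the builtin,
 * which works regardless of target flags: with -mpopcnt it can become
 * a single instruction, without it the compiler supplies a fallback. */
int count_set_bits(uint64_t x)
{
#if defined(__GNUC__)
    return __builtin_popcountll(x);
#else
    int n = 0;
    while (x != 0) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
#endif
}
```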
I'm looking for the intrinsic corresponding to the operation 'SMLAL2 Vd.8H,Vn.16B,Vm.16B', which according to ARM's own documentation (ARM Neon Intrinsics Ref) should be something like
int16x8_t vmlal_high_s8 (int16x8_t a,int8x16_t b,int8x16_t c)
However, the arm_neon.h that is included in ARM's GNU Toolchain doesn't have anything corresponding to it. So my question is whether I just have to include something else, or whether I can otherwise work around this problem.
Thanks in advance!
For anyone else hitting this problem: I had chosen the ARM embedded toolchain instead of the Linaro one, which is the one suitable for aarch64.
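If you are stuck on a toolchain without the intrinsic, or just want to verify results, the operation itself is easy to model in plain C. The sketch below is a scalar reference for what SMLAL2 Vd.8H,Vn.16B,Vm.16B computes (widening multiply of the upper eight byte lanes, accumulated into eight 16-bit lanes); `mlal_high_s8_ref` is a hypothetical name, not part of arm_neon.h.

```c
#include <stdint.h>

/* Scalar model of vmlal_high_s8 / SMLAL2 (8H <- 16B x 16B):
 * acc[i] += (int16)b[i + 8] * (int16)c[i + 8], for i = 0..7.
 * The sum is formed in unsigned arithmetic and narrowed back so that
 * overflow wraps modulo 2^16, as it does in the instruction. */
void mlal_high_s8_ref(int16_t acc[8], const int8_t b[16], const int8_t c[16])
{
    for (int i = 0; i < 8; i++) {
        int32_t product = (int32_t)b[i + 8] * (int32_t)c[i + 8];
        uint16_t sum = (uint16_t)((uint16_t)acc[i] + (uint16_t)product);
        acc[i] = (int16_t)sum;
    }
}
```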
I want to write some C code that GCC can optimize when given the -msse4.1 flag. Basically, I want to check whether or not the compiler is taking advantage of SSE4.1 instructions.
There are many SSE4.1 instructions (http://en.wikipedia.org/wiki/SSE4#New_instructions), but I have not been able to write a fragment of C code for which any of those instructions show up in the generated assembly.
Thanks in advance.
From what I've seen, compilers rarely ever generate SSE4.1 instructions. I've seen a few cases where it will use the insert/extract instructions to pack data.
But for the most part, if you want to use the SSE4.1 instructions, you need to do them explicitly using intrinsics:
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_bk_sse41.htm
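As a concrete case, here is a minimal sketch that uses one SSE4.1 intrinsic explicitly: _mm_mullo_epi32, which maps to the pmulld instruction. It assumes GCC or Clang; the `target("sse4.1")` attribute stands in for passing -msse4.1 on the command line, and a scalar path is kept for other CPUs and architectures. The names `mul4`, `mul4_sse41` and `mul4_scalar` are made up for the example.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_SSE41_PATH 1

/* Multiply four 32-bit lanes with the SSE4.1 pmulld instruction.
 * Unaligned loads/stores are used so the arrays need no alignment. */
__attribute__((target("sse4.1")))
static void mul4_sse41(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mullo_epi32(va, vb));
}
#endif

/* Scalar equivalent: low 32 bits of each product (wraps on overflow). */
static void mul4_scalar(const int32_t *a, const int32_t *b, int32_t *out)
{
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)((uint32_t)a[i] * (uint32_t)b[i]);
}

/* Use the SSE4.1 version only when the CPU reports support for it. */
void mul4(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
#ifdef HAVE_X86_SSE41_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.1")) {
        mul4_sse41(a, b, out);
        return;
    }
#endif
    mul4_scalar(a, b, out);
}
```

Compiling with -S and searching the assembly for pmulld should confirm whether the instruction was emitted.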
I doubt GCC would emit SSE4.1 instructions that easily. But you could have a look at the Intel SPMD Program Compiler (ispc):
Under the SPMD model, the programmer writes a program that mostly
appears to be a regular serial program, though the execution model is
actually that a number of program instances execute in parallel on the
hardware. (See a more detailed example that illustrates this concept.)
ispc compiles a C-based SPMD programming language to run on the SIMD
units of CPUs; it frequently provides a 3x or more speedup on CPUs
with 4-wide SSE units, without any of the difficulty of writing
intrinsics code.