Why is a vdiv instruction generated with NEON flags? - arm

I disassembled an ARM binary previously compiled with NEON flags:
-mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp -ftree-vectorize
The dump shows a vdiv.f64 instruction generated by the compiler. According to the ARM manual for ARMv7 (Cortex-A9), the NEON SIMD ISA does not support a vdiv instruction, but the floating-point (VFP) engine does. Why is this instruction generated? Is it then a floating-point instruction that will be executed by the VFP? Both NEON and VFP support addition and multiplication for floating point, so how can I differentiate them from each other?

In the case of Cortex-A9, the NEON FPU option also implements VFP; it is a superset of the cut-down 16-register VFP-only FPU option.
More generally, the architecture does not allow implementing floating-point Advanced SIMD without also implementing at least single-precision VFP, therefore GCC's -mfpu=neon implies VFPv3 as well. It is permissible to implement integer-only Advanced SIMD without any floating-point capability at all, but I'm not sure GCC can support that (or that anyone's ever built such a thing).
The actual VFP and Advanced SIMD variants of instructions are unambiguous from the syntax - anything operating on double-precision data (i.e. <op>.F64) is obviously VFP, as Advanced SIMD doesn't support double-precision. Single precision operations (i.e. <op>.F32) operating on 32-bit s registers are scalar, thus VFP; if they're operating on larger 64-bit d or 128-bit q registers, then they are handling multiple 32-bit values at once, thus are vectorised Advanced SIMD instructions.
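As a hedged illustration (a made-up C fragment, not taken from the question's binary), compiling with the flags above would typically produce this mapping; exact instruction selection depends on the compiler version:

/* hypothetical example; compile with:
 *   gcc -O2 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp -ftree-vectorize sketch.c
 */
double div_d(double x, double y)
{
    return x / y;              /* double precision: expect a VFP vdiv.f64 on d registers */
}

void add_f32x4(float *restrict a, const float *restrict b, const float *restrict c)
{
    for (int i = 0; i < 4; i++)
        a[i] = b[i] + c[i];    /* may be vectorised to a NEON vadd.f32 on a q (or d) register;
                                  GCC may also require -funsafe-math-optimizations for this */
}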

Related

FPU version for Cortex-M microcontrollers

From a simple Google search, I found out that the FPU version for the Tiva C LaunchPad is fpv4-sp-d16, but which document tells the FPU version of the various microcontrollers (tm4c123gh6pm, stm32f407, stm32f446re, etc.)?
arm-none-eabi-gcc --print-multi-lib
gives information about the architecture and ABI, but the FPU version is not mentioned for a particular architecture.
The FPU is defined by ARM, hence you need to look at the ARM core definitions. Note that FPU is optional for the cores, so you do need to check the silicon vendors' doc on whether they include the FPU or not.
For Cortex-M4, the optional FPU is 32-bit, i.e. single-precision FP. Note that this means double-precision (64-bit) FP is done without using the FPU.
Cortex-M7 definition includes an optional 64-bit FPU and can execute both single and double precision FP instructions.
Orthogonal to the FPU used is the calling convention your program uses. As it relates to FP, it basically determines whether function arguments are passed in FP registers or in normal ARM registers.
The ARM community suggested the following answer:
"ARM Cortex‑M4 Processor Technical Reference Manual" gives this information
ARM Cortex-M4 TRM
Section 7.1 about the FPU says "The Cortex-M4 FPU is an implementation of the single precision variant of the ARMv7-M Floating Point Extension (FPv4-SP)"
Also, the 32 single-precision registers can be combined into 16 double-word D registers (d16), hence fpv4-sp-d16.
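For example, assuming a GNU toolchain and a TM4C123/STM32F4-class part whose datasheet confirms the FPU is present, the matching options would look something like this (foo.c is a placeholder; use -mfloat-abi=softfp instead of hard if you need the soft-float calling convention):

arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard -O2 -c foo.c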

Will Knights Landing CPU (Xeon Phi) accelerate byte/word integer code?

The Intel Xeon Phi "Knights Landing" processor will be the first to support AVX-512, but it will only support "F" (like SSE without SSE2, or AVX without AVX2), so floating-point stuff mainly.
I'm writing software that operates on bytes and words (8- and 16-bit) using up to SSE4.1 instructions via intrinsics.
I am confused about whether there will be EVEX-encoded versions of all/most SSE4.1 instructions in AVX-512F, and whether this means I can expect my SSE code to automatically gain EVEX-encoded instructions and map to all the new registers.
Wikipedia says this:
The width of the SIMD register file is increased from 256 bits to 512 bits, with a total of 32 registers ZMM0-ZMM31. These registers can be addressed as 256 bit YMM registers from AVX extensions and 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16-XMM31 and YMM16-YMM31 when using EVEX encoded form.
This unfortunately does not clarify whether compiling SSE4 code with AVX-512 enabled will lead to the same (awesome) speedup that compiling it for AVX2 provides (VEX coding of legacy instructions).
Anybody know what will happen when SSE2/4 code (C intrinsics) is compiled for AVX-512F? Could one expect a speed bump like with AVX1's VEX coding of the byte and word instructions?
Okay, I think I've pieced together enough information to make a decent answer. Here goes.
What will happen when native SSE2/4 code is run on Knights Landing (KNL)?
The code will run in the bottom fourth of the registers on a single VPU (called the compatibility layer) within a core. According to a pre-release webinar from Colfax, this means occupying only 1/4 to 1/8 of the total register space available to a core and running in legacy mode.
What happens if the same code is recompiled with compiler flags for AVX-512F?
SSE2/4 code will be generated with the VEX prefix. That means pshufb becomes vpshufb and works with other AVX code in ymm registers. Instructions will NOT be promoted to AVX-512's native EVEX encoding or allowed to address the new zmm registers specifically. Instructions can only be promoted to EVEX with AVX-512VL, in which case they gain the ability to directly address the additional (renamed) registers XMM/YMM16-31. It is unknown whether register sharing is possible at this point, but pipelining on AVX2 has demonstrated similar throughput with half-width AVX-128 code as with full 256-bit AVX2 code in many cases.
Most importantly, how do I get my SSE2/4/AVX128 byte/word size code running on AVX512F?
You'll have to load 128-bit chunks into an xmm register, sign/zero-extend those bytes/words to 32-bit elements in a zmm register, and operate as if they had always been the larger integers. Then, when finished, convert back to bytes/words.
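A hedged sketch of that widen/operate/narrow pattern with AVX-512F intrinsics (the function and names below are made up for illustration; a real version would need a remainder loop):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Add two byte arrays, 16 bytes per iteration, by widening to 32-bit lanes
 * in a zmm register; compile with -mavx512f. Assumes n is a multiple of 16. */
void add_bytes_widened(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m512i wa = _mm512_cvtepu8_epi32(va);      /* zero-extend bytes to 32-bit lanes */
        __m512i wb = _mm512_cvtepu8_epi32(vb);
        __m512i ws = _mm512_add_epi32(wa, wb);      /* operate at 32-bit width */
        __m128i r  = _mm512_cvtepi32_epi8(ws);      /* truncate back down to bytes */
        _mm_storeu_si128((__m128i *)(dst + i), r);
    }
}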
Is this fast?
According to material published on Larrabee (Knights Landing's prototype), type conversions of any integer width are free from xmm to zmm and vice versa, so long as registers are available. Additionally, after calculations are performed, the 32-bit results can be truncated on the fly down to byte/word length and written (packed) to unaligned memory in 128-bit chunks, potentially saving an xmm register.
On KNL, each core has 2 VPUs that seem to be capable of talking to each other. Hence, 32-way 32-bit lookups are possible in a single vperm*2d instruction of presumably reasonable throughput. This is not possible even with AVX2, which can only permute within 128-bit lanes (or between lanes for the 32-bit vpermd only, which is inapplicable to byte/word instructions). Combined with free type conversions, the ability to use masks implicitly with AVX512 (sparing the costly and register-intensive use of blendv or explicit mask generation), and the presence of more comparators (native NOT, unsigned/signed lt/gt, etc), it may provide a reasonable performance boost to rewrite SSE2/4 byte/word code for AVX512F after all. At least on KNL.
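For instance, the two-table permute mentioned above (vpermt2d/vpermi2d) is exposed roughly like this; a hedged sketch with made-up names:

#include <immintrin.h>

/* 16 parallel lookups into a 32-entry table of 32-bit values held in two
 * zmm registers; each index in idx selects from table_lo:table_hi (0-31).
 * Compile with -mavx512f. */
static inline __m512i lookup32x32(__m512i idx, __m512i table_lo, __m512i table_hi)
{
    return _mm512_permutex2var_epi32(table_lo, idx, table_hi);
}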
Don't worry, I'll test the moment I get my hands on mine. ;-)

SSE instruction MOVSD (extended: floating point scalar & vector operations on x86, x86-64)

I am somehow confused by the MOVSD assembly instruction. I wrote some numerical code computing some matrix multiplication, simply using ordinary C code with no SSE intrinsics. I do not even include the header file for SSE2 intrinsics for compilation. But when I check the assembler output, I see that:
1) 128-bit vector registers XMM are used;
2) SSE2 instruction MOVSD is invoked.
I understand that MOVSD essentially operates on a single double-precision floating-point value. It only uses the lower 64 bits of an XMM register and sets the upper 64 bits to 0. But I just don't understand two things:
1) I never gave the compiler any hint about using SSE2. Plus, I am using GCC, not the Intel compiler. As far as I know, the Intel compiler will automatically seek opportunities for vectorization, but GCC will not. So how does GCC know to use MOVSD? Or has this x86 instruction been around since long before the SSE instruction set, and is the _mm_load_sd() intrinsic in SSE2 just there to provide backward compatibility for using XMM registers for scalar computation?
2) Why doesn't the compiler use other floating-point registers, either the 80-bit floating-point stack or 64-bit floating-point registers? Why must it take the toll of using an XMM register (setting the upper 64 bits to 0 and essentially wasting that storage)? Does XMM provide faster access?
By the way, I have another question regarding SSE2. I just can't see the difference between _mm_store_sd() and _mm_storel_sd(). Both store the lower 64-bit value to an address. What is the difference? A performance difference? An alignment difference?
Thank you.
Update 1:
Okay, obviously when I first asked this question I lacked some basic knowledge of how a CPU manages floating-point operations, so experts tended to think my question was nonsense. Since I did not include even the shortest sample C code, people might have found the question vague as well. Here I will provide a review as an answer, which hopefully will be useful to anyone unclear about floating-point operations on modern CPUs.
A review of floating point scalar/vector processing on modern CPUs
The idea of vector processing dates back to the old vector processors, but those machines have been superseded by modern architectures with cache systems. So we focus on modern CPUs, especially x86 and x86-64, the mainstream architectures in high-performance scientific computing.
Early on, with the x87 line of coprocessors (hence the names x87 and 387), Intel introduced the floating-point stack, where floating-point numbers up to 80 bits wide can be held. This stack is commonly known as the x87 or 387 floating-point "registers", with an associated set of x87 FPU instructions. The x87 stack registers are not real, directly addressable registers like the general-purpose registers; they live on a stack. Access to register st(i) is by offsetting the stack-top register %st(0), or simply %st. With the help of the FXCH instruction, which swaps the contents of the current stack top %st and some offset register %st(i), random access can be achieved. But FXCH can impose some performance penalty, though a minimized one. The x87 stack provides high-precision computation by calculating intermediate results with 80 bits of precision by default, to minimise round-off error in numerically unstable algorithms. However, x87 instructions are completely scalar.
The first effort at vectorization was the MMX instruction set, which implemented integer vector operations. The vector registers under MMX are the 64-bit wide registers MMX0, MMX1, ..., MMX7. Each can hold either one 64-bit integer or multiple smaller integers in a "packed" format: a single instruction can then be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once. So now there were the legacy general-purpose registers for scalar integer operations as well as the new MMX registers for integer vector operations. But MMX shared its register state with the scalar x87 FPU: each MMX register corresponded to the lower 64 bits of an x87 register, and the upper 16 bits of the x87 registers were unused. These MMX registers were each directly addressable, but the aliasing made it difficult to work with floating-point and integer vector operations in the same application. To maximize performance, programmers often used the processor exclusively in one mode or the other, deferring the relatively slow switch between them for as long as possible.
Later, SSE created a separate set of 128-bit wide registers, XMM0-XMM7, alongside the x87 stack. SSE instructions focused exclusively on single-precision floating-point operations (32-bit); integer vector operations were still performed using the MMX registers and the MMX instruction set. But now both kinds of operations can proceed at the same time, as they share no execution resources. It is important to know that SSE performs not only floating-point vector operations but also floating-point scalar operations. Essentially it provides a new place where floating-point operations take place, and the x87 stack is no longer the preferred choice for carrying them out. Using XMM registers for scalar floating-point operations is faster than using the x87 stack, as all XMM registers are easy to access, while the x87 stack can't be randomly accessed without FXCH. When I posted my question, I was clearly unaware of this fact. The other concept I was not clear about is that general-purpose registers are integer/address registers. Even though they are 64-bit on x86-64, they cannot hold 64-bit floating-point values. The main reason is that the execution unit associated with general-purpose registers is the ALU (arithmetic and logic unit), which is not for floating-point computation.
SSE2 is a major step forward, as it extends the vector data types, so SSE2 instructions, either scalar or vector, can work with all standard C data types. Such an extension in fact makes MMX obsolete. Also, the x87 stack is no longer as important as it once was. Since there are two alternative places where floating-point operations can take place, you can specify your choice to the compiler. For example for GCC, compilation with the flag
-mfpmath=387
will schedule floating-point operations on the legacy x87 stack. Note that this seems to be the default for 32-bit x86, even if SSE is already available. For example, I have an Intel Core 2 Duo laptop made in 2007, and it was already equipped with SSE up to version SSE4, yet GCC will still by default use the x87 stack, which makes scientific computations unnecessarily slow. In this case, we need to compile with the flag
-mfpmath=sse
and GCC will schedule floating-point operations on XMM registers. x86-64 users need not worry about this configuration, as it is the default on x86-64. This setting only affects scalar floating-point operations. If we have written code using vector instructions and compile the code with the flag
-msse2
then XMM registers will be the only place where that computation can take place. (Note that -msse2 by itself only enables the instructions; on 32-bit x86, scalar code still needs -mfpmath=sse.) For more information see GCC's configuration of x86 and x86-64. For examples of writing SSE2 C code, see my other post How to ask GCC to completely unroll this loop (i.e., peel this loop)?
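As a small hedged illustration (a made-up function; exact output varies with GCC version and options):

/* scale.c -- hypothetical example */
double scale(double x)
{
    return x * 2.5;
}
/* gcc -m32 -O2 -mfpmath=387         -> x87: fld/fmul on the st() stack
 * gcc -m32 -O2 -msse2 -mfpmath=sse  -> SSE2: mulsd on an xmm register
 * gcc -O2 (x86-64)                  -> mulsd on xmm by default */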
The SSE set of instructions, though very useful, is not the latest vector extension. AVX, the Advanced Vector Extensions, enhances SSE by providing three- and four-operand instruction forms. See "number of operands in instruction set" if you are unclear about what this means. Three-operand instructions optimize the fused multiply-add (FMA) operation commonly seen in scientific computing by 1) using one fewer register; 2) reducing the explicit amount of data movement between registers; 3) speeding up FMA computations themselves. For an example of using AVX, see Nominal Animal's answer to my post.
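As a hedged example of the three-operand style, here is the FMA form of an AVX operation (note that FMA instructions are a separate extension on top of AVX and need -mfma; the function name is made up):

#include <immintrin.h>

/* d = a*b + c on four doubles at once; compile with -mavx -mfma */
__m256d fma4d(__m256d a, __m256d b, __m256d c)
{
    return _mm256_fmadd_pd(a, b, c);   /* a single vfmadd-style instruction */
}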

Is __int128_t arithmetic emulated by GCC, even with SSE?

I've heard that the 128-bit integer data-types like __int128_t provided by GCC are emulated and therefore slow. However, I understand that the various SSE instruction sets (SSE, SSE2, ..., AVX) introduced at least some instructions for 128-bit registers. I don't know very much about SSE or assembly / machine code, so I was wondering if someone could explain to me whether arithmetic with __int128_t is emulated or not using modern versions of GCC.
The reason I'm asking this is because I'm wondering if it makes sense to expect big differences in __int128_t performance between different versions of GCC, depending on what SSE instructions are taken advantage of.
So, what parts of __int128_t arithmetic are emulated by GCC, and what parts are implemented with SSE instructions (if any)?
I was confusing two different things in my question.
Firstly, as PaulR explained in the comments: "There are no 128 bit arithmetic operations in SSE or AVX (apart from bitwise operations)". Considering this, 128-bit arithmetic has to be emulated on modern x86-64 based processors (e.g. AMD Family 10 or Intel Core architecture). This has nothing to do with GCC.
The second part of the question is whether or not 128-bit arithmetic emulation in GCC benefits from SSE/AVX instructions or registers. As implied in PaulR's comments, there isn't much in SSE/AVX that's going to allow you to do 128-bit arithmetic more easily; most likely x86-64 instructions will be used for this. The code I'm interested in can't compile with -mno-sse, but it compiles fine with -mno-sse2 -mno-sse3 -mno-ssse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2 -mno-avx -mno-avx2 and performance isn't affected. So my code doesn't benefit from modern SSE instructions.
SSE2-AVX instructions are available for 8-, 16-, 32- and 64-bit integer data types. They are mostly intended to operate on packed data; for example, a 128-bit register may contain four 32-bit integers, and so on.
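A minimal hedged example of that packed usage with SSE2 intrinsics (made-up function name):

#include <emmintrin.h>

/* Add four packed 32-bit integers in a single SSE2 instruction (paddd). */
__m128i add4i(__m128i a, __m128i b)
{
    return _mm_add_epi32(a, b);
}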
Although SSE/AVX/AVX-512/etc. have no 128-bit mode (their vector elements are strictly 64-bit max, and operations will simply overflow), as Paul R has implied, the main CPU does support limited 128-bit operations, by using a pair of registers.
When multiplying two regular 64-bit numbers, MUL/IMUL outputs its 128-bit result in the RDX:RAX register pair.
Conversely, when dividing, DIV/IDIV can take their input from the RDX:RAX pair to divide a 128-bit number by a 64-bit divisor (and output a 64-bit quotient plus a 64-bit remainder).
Of course the CPU's ALU is 64-bit, so, as the Intel docs imply, the extra upper 64 bits come at the cost of extra micro-ops in the microcode. This is more dramatic for divisions (more than 3x as many), which already require a lot of micro-ops.
Still, that means that under some circumstances (like using a rule of three to scale a value), it's possible for a compiler to emit regular CPU instructions and not bother with any 128-bit emulation of its own.
This has been available for a long time:
since the 80386, 32-bit CPUs could do 64-bit multiplication/division using the EDX:EAX pair
since the 8086/88, 16-bit CPUs could do 32-bit multiplication/division using the DX:AX pair
(As for addition and subtraction: thanks to carry support, it's completely trivial to add/subtract numbers of any arbitrary length that fits in your storage.)
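A hedged illustration of the compiler leaning on those plain integer instructions (made-up functions; typical x86-64 GCC output noted in the comments):

/* int128.c -- hypothetical example; gcc -O2 on x86-64 */
unsigned __int128 mul_wide(unsigned long long a, unsigned long long b)
{
    return (unsigned __int128)a * b;   /* one mul: the 128-bit result lands in rdx:rax */
}

unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b)
{
    return a + b;                      /* add + adc carry chain, no SSE registers involved */
}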

ARM Cortex-A8: What's the difference between VFP and NEON

In the ARM Cortex-A8 processor, I understand what NEON is: it is a SIMD co-processor.
But does the VFP (Vector Floating Point) unit, which is also a co-processor, work as a SIMD processor? If so, which one is better to use?
I read a few links, such as:
Link1
Link2.
But not really very clear what they mean. They say that VFP was never intended to be used for SIMD but on Wiki I read the following - "The VFP architecture also supports execution of short vector instructions but these operate on each vector element sequentially and thus do not offer the performance of true SIMD (Single Instruction Multiple Data) parallelism."
It's not so clear what to believe; can anyone elaborate more on this topic?
There are quite a few differences between the two. NEON is a SIMD (Single Instruction Multiple Data) accelerator that is part of the ARM core. It means that during the execution of one instruction the same operation occurs on up to 16 data sets in parallel. Because of this parallelism inside NEON, you can get more MIPS or FLOPS out of NEON than out of a standard SISD processor running at the same clock rate.
The biggest benefit of NEON comes when you execute operations on vectors, e.g. video encoding/decoding. It can also perform single-precision floating-point (float) operations in parallel.
VFP is a classic floating-point hardware accelerator. It is not a parallel architecture like NEON. Basically it performs one operation on one set of inputs and returns one output. Its purpose is to speed up floating-point calculations. It supports single- and double-precision floating point.
You have three possibilities for using NEON:
use intrinsic functions (#include "arm_neon.h"), as in the sketch after this list
inline assembly code
let GCC do the optimization for you by providing -mfpu=neon as an argument (GCC 4.5 is good at this)
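A minimal hedged sketch of the first option (intrinsics), assuming a Cortex-A8 style target and made-up function names:

/* compile with: gcc -O2 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp neon_add.c */
#include <arm_neon.h>

void add4f(float *dst, const float *a, const float *b)
{
    float32x4_t va = vld1q_f32(a);        /* load 4 floats into a q register */
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(dst, vaddq_f32(va, vb));    /* one vadd.f32 over 4 lanes */
}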
For armv7 ISA (and variants)
The NEON is a SIMD and parallel data processing unit for integer and floating point data and the VFP is a fully IEEE-754 compatible floating point unit. In particular on the A8, the NEON unit is much faster for just about everything, even if you don't have highly parallel data, since the VFP is non-pipelined.
So why would you ever use the VFP?!
The most major difference is that the VFP provides double precision floating point.
Secondly, there are some specialized instructions that VFP offers for which there are no equivalent implementations in the NEON unit. SQRT comes to mind, and perhaps some type conversions.
But the most important difference not mentioned in Cosmin's answer is that the NEON floating point pipeline is not entirely IEEE-754 compliant. The best description of the differences are in the FPSCR Register Description.
Because it is not IEEE-754 compliant, a compiler cannot generate these instructions unless you tell the compiler that you are not interested in full compliance. This can be done in several ways.
Using an intrinsic function to force NEON usage, for example see the GCC Neon Intrinsic Function List.
Ask the compiler, very nicely. Even newer GCC versions with -mfpu=neon will not generate floating point NEON instructions unless you also specify -funsafe-math-optimizations.
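For example, a command line along these lines is typically needed before GCC will auto-vectorise float code to NEON (foo.c is a placeholder, exact flags vary by GCC version; -ffast-math also implies the unsafe-math option):

gcc -O2 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -ftree-vectorize -funsafe-math-optimizations foo.c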
For armv8+ ISA (and variants) [Update]
NEON is now fully IEEE-754 compliant, and from a programmer's (and compiler's) point of view, there is actually not too much difference. Double precision has been vectorized. From a micro-architecture point of view I kind of doubt they are even different hardware units. ARM does document scalar and vector instructions separately, but both are part of "Advanced SIMD."
Architecturally, VFP (it wasn't called Vector Floating Point for nothing) indeed has a provision for operating on a floating-point vector in a single instruction. I don't think it ever actually executes multiple operations simultaneously (like true SIMD), but it could save some code size. However, if you read the ARM Architecture Reference Manual in the Shark help (as I describe in my introduction to NEON, link 1 in the question), you'll see in section A2.6 that the vector feature of VFP is deprecated in ARMv7 (which is what the Cortex-A8 implements), and software should use Advanced SIMD for floating-point vector operations.
Worse yet, in the Cortex-A8 implementation, VFP is implemented with a VFP Lite execution unit (read lite as occupying a smaller silicon area, not as having fewer features), which means that it's actually slower than on the ARM11, for instance! Fortunately, most single-precision VFP instructions get executed by the NEON unit, but I'm not sure vector VFP operations do; and even if they do, they certainly execute more slowly than with NEON instructions.
Hope that clears things up!
IIRC, the VFP is a floating point coprocessor which works sequentially.
This means that you can use instruction on a vector of floats for SIMD-like behaviour, but internally, the instruction is performed on each element of the vector in sequence.
While the overall time required for the instruction is reduced by this because of the single load instruction, the VFP still needs time to process all elements of the vector.
True SIMD will gain more net floating-point performance, but using the VFP with vectors is still faster than using it purely sequentially.
