Software optimization guide for AArch64 Neon and SVE

There is an Arm software optimization guide (e.g., https://developer.arm.com/documentation/swog309707/latest for the Neoverse N1).
This guide doesn't seem to contain the latency and throughput for Neon or SVE. Is there a separate guide for NEON or SVE (e.g., the instruction latency and throughput for INSR (SIMD&FP scalar) instruction)?
A pointer would be very helpful!

The timings for Neon instructions are in that document, listed under ASIMD (which is Arm's more formal name for that instruction set). See Sections 3.15 onward.
There are no timings for SVE instructions because, as I understand it, the N1 simply doesn't support that extension. But if you look at the guide for some core that does support SVE, you'll see the timings included. For the Neoverse N2 they are from Section 3.26 onward.

Related

Why doesn’t Clang use vcnt for __builtin_popcountll on AArch32?

The simple test,
unsigned f(unsigned long long x) {
    return __builtin_popcountll(x);
}
when compiled with clang --target=arm-none-linux-eabi -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a15 -Os,⁎ results in the compiler emitting the numerous instructions required to implement the classic popcount for the low and high words in x in parallel, then add the results.
It seems to me from skimming the architecture manuals that NEON code similar to that generated for
#include <arm_neon.h>
unsigned f(unsigned long long x) {
    uint8x8_t v = vcnt_u8(vcreate_u8(x));
    return vget_lane_u64(vpaddl_u32(vpaddl_u16(vpaddl_u8(v))), 0);
}
should have been beneficial in terms of size at least, even if not necessarily a performance improvement.
Why doesn’t Clang† do that? Am I just giving it the wrong options? Are the ARM-to-NEON-to-ARM transitions so spectacularly slow, even on the A15, that it wouldn’t be worth it? (This is what a comment on a related question seems to suggest, but very briefly.) Is Clang codegen for AArch32 lacking for care and attention, seeing as almost every modern mobile device uses AArch64? (That seems farfetched, but GCC, for example, is known to occasionally have bad codegen on non-prominent architectures such as PowerPC or MIPS.)
⁎ Clang options could be wrong or redundant, adjust as necessary.
† GCC doesn’t seem to do that in my experiments, either, just emitting a call to __popcountdi2, but that suggests I might simply be calling it wrong.

Are the ARM-to-NEON-to-ARM transitions so spectacularly slow, even on
the A15, that it wouldn’t be worth it?
Well, you asked exactly the right question.
In short, yes, they are. Moving data between the ARM core and NEON, in either direction, is slow, and in most cases the penalty outweighs the gain from using 'fast' NEON instructions.
In detail, NEON is an optional co-processor in ARMv7-based chips.
The ARM CPU and NEON work in parallel and, I might say, 'independently' of each other.
Interaction between the CPU and the NEON co-processor is organised via a FIFO. The CPU places NEON instructions in the FIFO, and the NEON co-processor fetches and executes them.
The delay comes at the point where the CPU and NEON need to sync with each other. Syncing means accessing the same memory region or transferring data between registers.
So the whole process of using vcnt would be something like:
ARM CPU placing vcnt into NEON FIFO
Moving data from CPU register into NEON register
NEON fetching vcnt from FIFO
NEON executing vcnt
Moving data from NEON register to CPU register
And all that time the CPU is simply waiting while NEON is doing its work.
Due to NEON pipelining, the delay might be up to 20 cycles (if I remember this number correctly).
Note: "up to 20 cycles" is arbitrary, since if the ARM CPU has other instructions that do not depend on the result of the NEON computation, it can execute them in the meantime.
Conclusion: as a rule of thumb it's not worth it, unless you manually optimise the code to reduce or eliminate those sync delays.
PS: That's true for ARMv7. ARMv8 has the NEON extension as part of the core, so this is not relevant there.
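For reference, the "classic popcount" that the question describes the compiler emitting can be written in portable C roughly as follows. This is only a sketch of the well-known bit-twiddling technique, not Clang's exact output; the function names are made up for illustration.

```c
#include <stdint.h>

/* Classic branch-free population count of one 32-bit word. */
static unsigned popcount32_swar(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                 /* 2-bit partial sums */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* 4-bit partial sums */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit partial sums */
    return (v * 0x01010101u) >> 24;                   /* add the four bytes */
}

/* 64-bit count done as two independent 32-bit halves, then summed,
   which is how a 32-bit core has to handle an unsigned long long. */
static unsigned popcount64_swar(uint64_t x)
{
    return popcount32_swar((uint32_t)x) + popcount32_swar((uint32_t)(x >> 32));
}
```

Everything here stays in the integer registers, so there is no core-to-NEON round trip to pay for.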

Is Eigen NEON backend optimized to take advantage of the 2x128b NEON execution units which exist starting from ARM A76?

Going over the Eigen documentation, it's not clear whether it has been updated since the release of the A76 CPU core to take advantage of the wider SIMD it contains (2x128b vs. the previous 128b).
I am hoping someone from the development team (or an expert user) can help clarify that.

I'm not familiar with Eigen in particular, but in general, one doesn't need to do much to SIMD code to take advantage of different amounts of hardware execution units - especially when the CPUs support out of order execution, they will pick up more instructions that can be executed in parallel when there are more execution units.
If compiling e.g. SIMD intrinsics with a compiler, the compiler may be able to tune the exact scheduling of code if told to optimize specifically for that core (and if the compiler knows the scheduling characteristics for the core). Same thing for handwritten assembly code - it can be tuned and tweaked a bit for different cores' characteristics, but in most cases, it doesn't change very dramatically; more capable cores will execute it faster.
(The factor that primarily affects the bigger picture of how the code is written, which would require a proper rewrite to take advantage of, is usually the number of registers available in the instruction set - but that doesn't change with a hardware implementation with more execution units.)

ARM/Thumb interworking confusion regarding Thumb-2

I've been going through ARM ISA related documentation for a while and so far I believe that I've got a good understanding of the basics of ARM/Thumb interworking. I'll quickly summarize that in the following:
Instructions can be either 4 byte aligned (ARM) or 2 byte aligned (Thumb).
Thumb and ARM instructions reside in separate regions i.e. they are not intermixed without explicit processor state change.
State change can happen upon executing either of bx, blx, ldm, ldr. Choosing between ARM or Thumb depends on the value of the least significant bit in the address which can be 0 or 1 respectively.
The current state of the processor can be either ARM or thumb. That depends on the state of bit 5 of CPSR.
Rules for state change can be summarized in the following figure taken from this paper:
However, Thumb-2 instructions have confused me a bit. For instance, let's inspect the encoding of instruction ADC which can be found in section A8.8.2 of the ARMv7-A/R reference manual. Basically, the same instruction has 3 distinct encodings 16 bit (Thumb), 32 bit (Thumb2), and 32 bit (ARM).
Here are my questions:
Do the 32-bit Thumb-2 instructions execute in ARM or Thumb mode of the processor? (I'm assuming it's the latter but not sure)
Some resources mention that ARM/Thumb instructions can be "freely" intermixed in thumb-2. Does that mean explicit state change using bx, blx, ldm or ldr doesn't need to happen?
Final note, this is the closest question to mine, however, I'm focusing on interworking.

Choosing a mode
so far I believe that I've got a good understanding for the basics of ARM/Thumb interworking.
Well, that is useful, but it is really part of an older story. Originally, there were only 32-bit ARM instructions (1980s to mid-1990s). Then ARM made a mode that was like a compression front-end that expanded strictly 16-bit opcodes to 32 bits. This was thumb mode (mid-1990s to ~2005). Then ARM came out with thumb2 (which is somewhat nebulous), mainly typified by a mix of both 16-bit and 32-bit instructions (~2005 to current).
The concept of interworking is only useful for a CPU with thumb (old) and ARM functions. If you have a thumb2 CPU and a good compiler with normal memory (1+ wait states), then the thumb2 is almost always the best choice.
Thumb2 intermixing
In a thumb2 capable processor, you do not need interworking! I.e., you don't change modes. You can use the thumb 16bit encodings, and if you ask for a mnemonic where this is not possible, the assembler emits a 32bit version. The Cortex-M CPUs only have a thumb2 mode (really thumb mode with instruction extensions).
Disassembling
There are not really three types of opcodes but two with one extension.
Original 32 bit ARM opcodes.
16 bit only thumb encodings.
the thumb2 extension with all thumb opcodes plus more.
As the thumb opcodes are more dense, it is not possible to do all types of operations. So the thumb ADC is limited compared to the ARM one. However, for most instructions ARM Holdings updated thumb2 (the only mode in the CPU is thumb; thumb2 is extra instructions/opcodes) to have all the capabilities of the ARM mode ADC.
There are discussions on recognizing the mode in a binary elsewhere. Assuming the code is not trying to obfuscate and people made rational choices, you will only have two types of disassembly.
ARM 32 bit
thumb2
A thumb2 disassembler should work with pure thumb code. Most people do not use interworking. If they do, a large part of the binary will be thumb mode, with a small performance critical section in ARM mode.
A difficulty with thumb2 is the mixed 16/32 bit can lead a disassembler to mis-interpret an instruction stream if it decodes a 32bit encoding mid stream.
Final note, this is the closest question to mine, however, I'm focusing on interworking.
Interworking makes no sense on a thumb2 CPU. Since your question is tagged disassembling, I tried to answer with that focus, versus the other question, which is mainly about what the modes are. For ELF disassembly, the disassembler should have no trouble locating major function entry points and should be able to disassemble without much trouble.
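To make the mixed 16/32-bit decoding concrete, here is a minimal sketch of how a disassembler can tell the two sizes apart. It relies on the ARMv7 encoding rule that a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 is the first half of a 32-bit Thumb instruction. The function names are hypothetical, and real disassemblers do considerably more.

```c
#include <stdint.h>

/* Size in bytes of the Thumb instruction whose first halfword is hw1.
   Top five bits 0b11101, 0b11110 or 0b11111 start a 32-bit encoding;
   anything else is a complete 16-bit instruction. */
static unsigned thumb_insn_size(uint16_t hw1)
{
    unsigned top5 = hw1 >> 11;
    return (top5 == 0x1D || top5 == 0x1E || top5 == 0x1F) ? 4u : 2u;
}

/* Count the instructions in a buffer of halfwords. Decoding must begin at
   a known instruction boundary: starting on the second half of a 32-bit
   encoding gives the mis-decoding described above. A 32-bit instruction
   truncated at the end of the buffer is still counted as one. */
static unsigned count_thumb_insns(const uint16_t *code, unsigned n_halfwords)
{
    unsigned i = 0, count = 0;
    while (i < n_halfwords) {
        i += thumb_insn_size(code[i]) / 2;
        count++;
    }
    return count;
}
```

For example, the halfwords 0xB510, 0xF000, 0xF800, 0xBD10 decode as three instructions, because 0xF000 and 0xF800 together form one 32-bit BL.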

Do the 32-bit Thumb-2 instructions execute in ARM or Thumb mode of the processor?
Thumb-2 instructions are accessible as were Thumb instructions when the processor is in Thumb state, that is, the T bit in the CPSR is 1 and the J bit in the CPSR is 0. (source)
Some resources mention that ARM/Thumb instructions can be "freely" intermixed in thumb-2. Does that mean explicit state change using bx, blx, ldm or ldr doesn't need to happen?
No state change needs to happen, since Thumb-2 instructions and ordinary Thumb instructions execute in the same state. As for how this fits with the instruction encoding, the ARM Architecture Reference Manual : Thumb-2 Supplement says this:
The new 32-bit Thumb instructions are added in the space previously occupied by the Thumb BL and BLX instructions. This is made possible by treating the BL and BLX instructions as 32-bit instructions, instead of treating them as two 16-bit instructions.
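As a small illustration of the state rules discussed in this thread, the sketch below decodes the instruction set state from the CPSR (T is bit 5, J is bit 24) and shows how bit 0 of a BX/BLX register target selects the state. The type and function names are invented for the example; this models the architectural rule only and is not code that runs on the core to switch state.

```c
#include <stdint.h>

enum isa_state { ISA_ARM, ISA_THUMB, ISA_JAZELLE, ISA_THUMBEE };

#define CPSR_T (1u << 5)   /* Thumb state bit */
#define CPSR_J (1u << 24)  /* Jazelle state bit */

/* J=0,T=0: ARM; J=0,T=1: Thumb (16-bit and 32-bit Thumb-2 encodings alike). */
static enum isa_state isa_from_cpsr(uint32_t cpsr)
{
    int t = (cpsr & CPSR_T) != 0;
    int j = (cpsr & CPSR_J) != 0;
    if (!j)
        return t ? ISA_THUMB : ISA_ARM;
    return t ? ISA_THUMBEE : ISA_JAZELLE;
}

/* State selected by BX/BLX <register>: bit 0 of the target address. */
static enum isa_state isa_after_bx(uint32_t target)
{
    return (target & 1u) ? ISA_THUMB : ISA_ARM;
}

/* Address actually branched to: bit 0 is a state flag, not part of the address. */
static uint32_t bx_branch_address(uint32_t target)
{
    return target & ~1u;
}
```

Note that nothing in the CPSR distinguishes "Thumb" from "Thumb-2": both run with T=1, which is why no state change is needed between them.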

ARM Thumb/Thumb-2 performance

I am working on an ARM Cortex-M3 controller which has the Thumb-2 instruction set.
Thumb mode is used to compress the instruction to a 16-bit size.
So size of code is reduced. But with normal Thumb mode, why is it said that performance is reduced?
In case of Thumb-2, it is said performance is improved as per these two links:
Wikipedia
Arm.com
Improve performance in cases where a single 16-bit instruction restricts functions available to the compiler.
A stated aim for Thumb-2 was to achieve code density similar to Thumb with performance similar to the ARM instruction set on 32-bit memory.
What exactly is this performance? Can someone give a few examples related to it?

When compared against the ARM 32 bit instruction set, the thumb 16 bit instruction set (not talking about thumb2 extensions yet) takes less space because the instructions are half the size, but there is a performance drop, in general, because it takes more instructions to do the same thing as on arm. There are fewer features in the instruction set, and most instructions only operate on registers r0-r7. In an apples-to-apples comparison, needing more instructions to do the same thing is slower.
Now thumb2 extensions take formerly undefined thumb instructions and create 32 bit thumb instructions. Understand that there is more than one set of thumb2 extensions. ARMv6m adds a couple dozen perhaps. ARMv7m adds something like 150 instructions to the thumb instruction set; I don't know what ARMv8 or the future hold. So assuming ARMv7m, they have bridged the gap between what you can do in thumb and what you can do in ARM. So thumb2 is a reduced ARM instruction set as thumb is, but not as reduced. So it might still take more instructions to do the same thing in thumb2 (assume plus thumb) compared to ARM doing the same thing.
This gives a taste of the issue, a single instruction in arm and its equivalent in thumb.
ARM
and r8,r9,r10
THUMB
push {r0,r1}
mov r0,r9
mov r1,r10
and r0,r1
mov r8,r0
pop {r0,r1}
Now a compiler wouldn't do that; the compiler would know it is targeting thumb and do things differently by choosing other registers. You still have fewer registers and fewer features per instruction:
mov r0,r1
and r0,r2
It still takes two instructions/execution cycles to AND two registers together without modifying the operands and put the result in a third register. Thumb2 has a three-register AND, so you are back to a single instruction using the thumb2 extensions. And that thumb2 instruction allows r0-r15 for any of those three registers, where thumb is limited to r0-r7.
Look at the ARMv5 Architectural Reference Manual: under each thumb instruction it shows you the equivalent ARM instruction. Then go to that ARM instruction and compare what you can do with it that you can't do with the thumb instruction. It is a one-way path: the thumb instructions (not thumb2) have a one-to-one relationship with an ARM instruction. All thumb instructions have an equivalent ARM instruction, but not all ARM instructions have an equivalent thumb instruction. You should be able to see from this exercise the limitations on compilers when using the thumb instruction set. Then get the ARMv7m Architectural Reference Manual, look at the instruction set, and compare the "all thumb variants" encodings (the ones that include ARMv4T) with the ones that are limited to ARMv6 and/or v7, to see the expansion of features between thumb and thumb2 as well as the thumb2-only instructions that have no thumb counterpart. This should clarify what the compilers have to work with between thumb and thumb2. You can then go so far as to compare thumb+thumb2 with the full-blown ARM instructions (ARMv7 AR, is that what it is called?) and see that thumb2 gets a lot closer to ARM, but you lose, for example, conditionals on every instruction, so conditional execution in thumb becomes comparisons with branching over code, where in ARM you can sometimes have an if-then-else without branching...

Thumb-2 introduced variable length instructions to the original Thumb; now instructions can be a mixture of 16-bit and 32-bit. That means you retain the size advantage of the original Thumb in everyday code, but now have access to almost the full ARM feature-set in more complex code, but without the ARM-interworking overhead previously incurred by Thumb.
Aside from the aforementioned access to the full register set from all register operations, Thumb-2 added back branchless conditional execution in the form of the IF-THEN (IT) block. The original Thumb removed the trademark ARM feature of conditional execution on nearly all instructions; this is now achieved in Thumb-2 by prepending the IT instruction with conditions for up to four succeeding instructions.
In addition, the instruction set itself has been vastly expanded; for example, the Cortex-M4F implements the DSP extension as well as the FPv4-SP floating point extension. In fact, I believe even NEON can be encoded in Thumb2.

ARM 32bit
ARM is a 32bit instruction set. All opcodes are 32bits. The leading bits denote conditional execution. This is generally wasteful as 90+% of code executes unconditionally. ARM mode supports 16 nearly symmetric registers (with some special cases for PC, LR and SP).
Most instructions accept an 's' suffix to set the condition codes.
Thumb 16bit
The original thumb is 16bit only opcodes. It does not support conditional execution and access was mainly restricted to the lower eight registers. All arithmetic instructions set condition codes. Some instructions could retrieve data from the higher registers. It can be looked at as a compression engine on the instruction decode.
For some algorithms and memory topology, thumb can be faster than ARM. However it is fairly rare and needs slow (non-zero wait state) instruction memory for this to be the case.
As a practical example, some 'Game Boy Advance' code would mainly execute in thumb mode, but would jump to zero wait state RAM and transition to ARM mode for a performance critical routine.
Thumb2 mixed mode
Thumb2 extended the thumb ISA but allows for both 16bit and 32bit opcodes. Almost the entire original ARM instruction set functionality can be achieved with Thumb2. Since the instruction stream is more dense, it is higher performance than the original ARM in almost every case due to lower instruction fetch overhead.
Thumb2 allows conditional execution of up to four instructions with an 'if-then/else' (IT) opcode. It allows use of all 16 registers, and unified-syntax (.syntax unified) code can be written to produce either ARM 32bit or mixed Thumb2 code.
Unified code will generally be faster when Thumb2 is selected. There are fairly rare ARM sequences that cannot be encoded directly in Thumb2. Those few snippets could be faster in ARM. But generally, for any large code base, Thumb2 is faster.
This mode can be confusing with loop unrolling and jump tables. It is something that an x86 programmer would naturally think of. I.e., there are '.n'/narrow/16bit and '.w'/wide/32bit encodings of identical instructions. So if you treat code as an 'array' of tasks, the computations can be more complex. You also have the possibility of transferring control to mid-instruction.
As an example of 'un-encodeable' Thumb2 ARM code,
movlo r0,#1
moveq r0,#0
movhi r0,#-1
The above is only possible in ARM mode. However, such sequences are very rare and would only matter if you are porting assembler code from ARM to Thumb2. If you are selecting a compiler mode, Thumb2 should always produce better code (faster and smaller).
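For readers less used to ARM condition codes, here is what that three-instruction sequence computes, written as C. This assumes the sequence follows an unsigned compare such as cmp a,b, which the snippet does not show; the function name is hypothetical. In ARM mode each mov carries its own condition, whereas a Thumb2 IT block only covers one condition and its inverse, so three different conditions cannot share a single IT block.

```c
#include <stdint.h>

/* C equivalent of: cmp a,b ; movlo r0,#1 ; moveq r0,#0 ; movhi r0,#-1
   (the cmp is an assumption, it is not part of the quoted sequence). */
static int three_way(uint32_t a, uint32_t b)
{
    if (a < b)          /* lo: unsigned lower  */
        return 1;
    if (a == b)         /* eq: equal           */
        return 0;
    return -1;          /* hi: unsigned higher */
}
```

A compiler targeting Thumb2 is free to pick a different instruction sequence for this function, which is why the limitation mostly matters when porting hand-written assembler.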
Summary
Each mode has variations on available opcodes depending on CPU model. However, the general concepts of each mode and performance are as stated.

ARM Cortex-A8: Whats the difference between VFP and NEON

In the ARM Cortex-A8 processor, I understand what NEON is: it is a SIMD co-processor.
But does the VFP (Vector Floating Point) unit, which is also a co-processor, work as a SIMD processor? If so, which one is better to use?
I read a few links, such as:
Link1
Link2.
But it is not really clear what they mean. They say that VFP was never intended to be used for SIMD, but on Wiki I read the following - "The VFP architecture also supports execution of short vector instructions but these operate on each vector element sequentially and thus do not offer the performance of true SIMD (Single Instruction Multiple Data) parallelism."
It is not so clear what to believe; can anyone elaborate more on this topic?

There are quite a few differences between the two. Neon is a SIMD (Single Instruction Multiple Data) accelerator that is part of the ARM core. It means that during the execution of one instruction the same operation will occur on up to 16 data sets in parallel. Since there is parallelism inside Neon, you can get more MIPS or FLOPS out of Neon than you can out of a standard SISD processor running at the same clock rate.
The biggest benefit of Neon comes when you want to operate on vectors, e.g. in video encoding/decoding. It can also perform single precision floating point (float) operations in parallel.
VFP is a classic floating point hardware accelerator. It is not a parallel architecture like Neon. Basically it performs one operation on one set of inputs and returns one output. Its purpose is to speed up floating point calculations. It supports single and double precision floating point.
You have three ways to use Neon:
use intrinsic functions via #include "arm_neon.h"
write inline assembly code
let gcc do the optimizations for you by passing -mfpu=neon as an argument (gcc 4.5 is good at this)
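As a minimal sketch of the first option, the function below adds two float arrays four lanes at a time with NEON intrinsics. The function name is made up for illustration, and the scalar fallback is an addition so that the same file still builds on targets without NEON (the __ARM_NEON macro is defined by the compiler when NEON is available).

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* dst[i] = a[i] + b[i] for n floats. */
static void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Four single-precision lanes per iteration. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail, or the whole array when NEON is not available. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Data is loaded from memory straight into NEON registers and stored straight back, so nothing has to be moved between the core registers and the NEON registers.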

For armv7 ISA (and variants)
The NEON is a SIMD and parallel data processing unit for integer and floating point data and the VFP is a fully IEEE-754 compatible floating point unit. In particular on the A8, the NEON unit is much faster for just about everything, even if you don't have highly parallel data, since the VFP is non-pipelined.
So why would you ever use the VFP?!
The biggest difference is that the VFP provides double precision floating point.
Secondly, there are some specialized instructions that the VFP offers for which there are no equivalent implementations in the NEON unit. SQRT comes to mind, perhaps some type conversions.
But the most important difference not mentioned in Cosmin's answer is that the NEON floating point pipeline is not entirely IEEE-754 compliant. The best description of the differences is in the FPSCR Register Description.
Because it is not IEEE-754 compliant, a compiler cannot generate these instructions unless you tell the compiler that you are not interested in full compliance. This can be done in several ways.
Using an intrinsic function to force NEON usage, for example see the GCC Neon Intrinsic Function List.
Ask the compiler, very nicely. Even newer GCC versions with -mfpu=neon will not generate floating point NEON instructions unless you also specify -funsafe-math-optimizations.
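As an illustration of the second route, a loop like the one below is the kind of code the compiler can turn into NEON floating point instructions by itself. The function is a made-up example; whether it is actually vectorized depends on the compiler version and on options such as -O3 -mfpu=neon -funsafe-math-optimizations, so it is worth checking the generated assembly.

```c
#include <stddef.h>

/* y[i] += a * x[i]. The restrict qualifiers promise the compiler that
   x and y do not overlap, which makes vectorizing the loop easier. */
static void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

The source stays plain C, so the same function still builds for the VFP or for a non-ARM target, just without the NEON speed-up.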
For armv8+ ISA (and variants) [Update]
NEON is now fully IEEE-754 compliant, and from a programmer's (and compiler's) point of view, there is actually not too much difference. Double precision has been vectorized. From a micro-architecture point of view I kind of doubt they are even different hardware units. ARM does document scalar and vector instructions separately but both are part of "Advanced SIMD."

Architecturally, VFP (it wasn't called Vector Floating Point for nothing) indeed has a provision for operating on a floating-point vector in a single instruction. I don't think it ever actually executes multiple operations simultaneously (like true SIMD), but it could save some code size. However, if you read the ARM Architecture Reference Manual in the Shark help (as I describe in my introduction to NEON, link 1 in the question), you'll see at section A2.6 that the vector feature of VFP is deprecated in ARMv7 (which is what the Cortex A8 implements), and software should use Advanced SIMD for floating-point vector operations.
Worse yet, in the Cortex A8 implementation, VFP is implemented with a VFP Lite execution unit (read lite as occupying a smaller silicon surface, not as having fewer features), which means that it's actually slower than on the ARM11, for instance! Fortunately, most single-precision VFP instructions get executed by the NEON unit, but I'm not sure vector VFP operations do; and even if they do, they certainly execute slower than with NEON instructions.
Hope that clears things up!

IIRC, the VFP is a floating point coprocessor which works sequentially.
This means that you can use an instruction on a vector of floats for SIMD-like behaviour, but internally, the instruction is performed on each element of the vector in sequence.
While the overall time required for the instruction is reduced by this because of the single load instruction, the VFP still needs time to process all elements of the vector.
True SIMD will gain more net floating point performance, but using the VFP with vectors is still faster than using it purely sequentially.
