I am working on an application that ports SSE code to NEON.
I see the intrinsics _mm_mullo_epi32 and _mm_mul_epu32 in SSE.
Is there a NEON equivalent for each of these?
The equivalent of _mm_mullo_epi32 is vmulq_s32.
_mm_mul_epu32 is a bit trickier; no single NEON instruction does the job.
Still, the workaround is not that bad and only needs 3 instructions: two vmovn_u64 instructions to discard every second 32-bit lane of the arguments, followed by vmull_u32 to multiply the remaining 32-bit lanes into 64-bit products.
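For illustration, here is a minimal C sketch of both replacements (the function names are mine, and it assumes little-endian lane ordering so that vmovn_u64 keeps the even 32-bit lanes, which are the ones _mm_mul_epu32 multiplies):

    #include <arm_neon.h>

    /* _mm_mullo_epi32 equivalent: lane-wise low 32-bit product. */
    static inline int32x4_t mullo_epi32_neon(int32x4_t a, int32x4_t b)
    {
        return vmulq_s32(a, b);
    }

    /* _mm_mul_epu32 equivalent: multiply the even 32-bit lanes (0 and 2)
       into two 64-bit products. */
    static inline uint64x2_t mul_epu32_neon(uint32x4_t a, uint32x4_t b)
    {
        uint32x2_t a_even = vmovn_u64(vreinterpretq_u64_u32(a)); /* keep lanes 0 and 2 */
        uint32x2_t b_even = vmovn_u64(vreinterpretq_u64_u32(b));
        return vmull_u32(a_even, b_even); /* widening 32x32 -> 64 multiply */
    }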
I am a bit confused about instruction sets. There are Thumb, ARM and Thumb-2. From what I have read, Thumb instructions are all 16-bit, but the ARMv7-M user manual (page vi) mentions both Thumb 16-bit and Thumb 32-bit instructions.
Now I have to overcome this confusion. It is said that Thumb-2 supports 16-bit and 32-bit instructions. So does ARMv7-M in fact support Thumb-2 instructions and not just Thumb?
One more thing: can I say that Thumb (32-bit) is the same as the ARM instructions, which are also 32-bit?
Oh, ARM and their silly naming...
It's a common misconception, but officially there's no such thing as a "Thumb-2 instruction set".
Ignoring ARMv8 (where everything is renamed and AArch64 complicates things), from ARMv4T to ARMv7-A there are two instruction sets: ARM and Thumb. They are both "32-bit" in the sense that they operate on up-to-32-bit-wide data in 32-bit-wide registers with 32-bit addresses. In fact, where they overlap they represent the exact same instructions - it is only the instruction encoding which differs, and the CPU effectively just has two different decode front-ends to its pipeline which it can switch between. For clarity, I shall now deliberately avoid the terms "32-bit" and "16-bit"...
ARM instructions have fixed-width 4-byte encodings which require 4-byte alignment. Thumb instructions have variable-length (2 or 4-byte, now known as "narrow" and "wide") encodings requiring 2-byte alignment - most instructions have 2-byte encodings, but bl and blx have always had 4-byte encodings*. The really confusing bit came in ARMv6T2, which introduced "Thumb-2 Technology". Thumb-2 encompassed not just adding a load more instructions to Thumb (mostly with 4-byte encodings) to bring it almost to parity with ARM, but also extending the execution state to allow for conditional execution of most Thumb instructions, and finally introducing a whole new assembly syntax (UAL, "Unified Assembly Language") which replaced the previous separate ARM and Thumb syntaxes and allowed writing code once and assembling it to either instruction set without modification.
The Cortex-M architectures only implement the Thumb instruction set - ARMv7-M (Cortex-M3/M4/M7) supports most of "Thumb-2 Technology", including conditional execution and encodings for VFP instructions, whereas ARMv6-M (Cortex-M0/M0+) only uses Thumb-2 in the form of a handful of 4-byte system instructions.
Thus, the new 4-byte encodings (and those added later in ARMv7 revisions) are still Thumb instructions - the "Thumb-2" aspect of them is that they can have 4-byte encodings, that they can (mostly) be executed conditionally via IT blocks, and, I suppose, that their mnemonics are only defined in UAL.
* Before ARMv6T2, it was actually a complicated implementation detail as to whether bl (or blx) was executed as a 4-byte instruction or as a pair of 2-byte instructions. The architectural definition was the latter, but since they could only ever be executed as a pair in sequence, there was little to lose (other than the ability to take an interrupt halfway through) by fusing them into a single instruction for performance reasons. ARMv6T2 just redefined things in terms of the fused single-instruction execution.
In addition to Notlikethat's answer, and as it hints at, ARMv8 introduces some new terminology to try to reduce the confusion (of course adding even more new terminology):
There is a 32-bit execution state (AArch32) and a 64-bit execution state (AArch64).
The 32-bit execution state supports two different instruction sets: T32 ("Thumb") and A32 ("ARM"). The 64-bit execution state supports only one instruction set - A64.
All A64 instructions, like all A32 instructions, are 32 bits (4 bytes) in size and require 4-byte alignment.
Many/most A64 instructions can operate on both 32-bit and 64-bit registers (or arguably 32-bit or 64-bit views of the same underlying 64-bit register).
All ARMv8 processors (like all ARMv7 processors) that implement AArch32 support Thumb-2 instructions in the T32 instruction set.
Not all ARMv8-A processors implement AArch32, and some don't implement AArch64. Some processors support both, but only support AArch32 at lower exception levels.
Thumb: 16-bit instruction set.
ARM: 32-bit wide instruction set, hence more flexible instructions but lower code density.
Thumb-2 (mixed 16/32-bit): a compromise between ARM and Thumb (16-bit), mixing the two to get both the performance/flexibility of ARM and the code density of Thumb. So a Thumb-2 instruction can be either an ARM-like 32-bit-wide instruction (covering only a subset of ARM) or a 16-bit-wide Thumb instruction.
It was confusing to me that the Cortex-M3 has 4-byte instructions yet does not execute ARM instructions, while other CPUs have both 2-byte and 4-byte opcodes and can execute ARM instructions too. So I read a book about Arm and now I understand it slightly better. Still, the naming and the overlap are confusing to me. I thought it would be interesting to compare a few CPUs first and then talk about the ISAs.
To compare a few CPUs and what they can do and how they overlap:
Cortex M0/M0+/M1/M23 are considered Thumb (Thumb-1) and can execute the 2-byte opcodes, which are limited compared to the others. However, some instructions such as mrs, msr, bl, dmb, dsb and isb are from Thumb-2 and are 4-byte. The Cortex M0/M0+/M1 are ARMv6, while the Cortex M23 is ARMv8. The Thumb-1 instruction set was extended in ARMv7, so it can be said that the ARMv8 Cortex M23 supports a fuller Thumb-1 (except the it instruction), while the ARMv6 Cortex M0/M0+ support only a subset of the ISA (they are specifically missing the it, cbz and cbnz instructions). I might be wrong (please correct me if this is not right), but I noticed something funny: the only CPUs I see that support Thumb-1 fully are CPUs that already support Thumb-2 as well; I do not know of a Thumb-1-only CPU that supports 100% of Thumb-1. I think this is because of it, which could be seen as a Thumb-2 opcode that is 2-byte and was in essence added to Thumb-1. On the Thumb-1 CPUs the 4-byte opcodes could instead be seen as two 2-byte opcodes that together represent the 4-byte instruction.
Cortex M3/M4/M7/M33/M35P/M55 can execute both 2-byte and 4-byte opcodes, support both Thumb-1 and Thumb-2, and implement the full set of these ISAs. Their 2-byte and 4-byte opcodes are mixed more evenly, while the Cortex M0/M0+/M1/M23 above are biased towards 2-byte opcodes most of the time. Cortex M3/M4/M7 are ARMv7, while Cortex M33/M35P/M55 are ARMv8.
Cortex A/R CPUs accept both ARM and Thumb opcodes and therefore have both 2-byte and 4-byte encodings. To switch between the modes, the branch target address is made deliberately odd (forcefully unaligned); this can be done for example with the branch instruction bx, which sets the T bit of the CPSR and switches the mode depending on the lowest bit of the address. This works well: when calling a subroutine the PC (and its mode) is saved, inside the subroutine the CPU can switch to Thumb mode, and on return the PC (and its T bit) is restored, switching back to whatever the caller was using (ARM or Thumb mode) without any issue.
The ARM7 only supports the ARMv3 4-byte ISA.
The ARM7T supports both the Thumb-1 and ARM ISAs (2-byte and 4-byte).
The ARM11 (ARMv6, ARMv6T2, ARMv6Z, ARMv6K) supports the Thumb-1, Thumb-2 and ARM ISAs.
The book I referenced stated that in ARMv7 and newer the architecture switched from Von Neumann (data and instructions sharing a bus) to Harvard (dedicated buses) to get better performance. However, the absolute phrase "and newer" is not true, because ARMv8 is newer, yet the ARMv8 Cortex M23 is Von Neumann.
The ISAs are:
ARM has 16 registers (R0-R12, SP, LR, PC) and only 4-byte opcodes. There are revisions to the ISA, but they use only 4-byte opcodes.
Thumb (aka Thumb-1) splits the 16 registers into a lower set (R0-R7) and a higher set (R8-R12, SP, LR, PC); most instructions can access only the lower set, while just some can access the higher set. It has only 2-byte opcodes. Low-end devices with a 16-bit bus (which have to do 32-bit word accesses in two steps) perform better when executing 2-byte opcodes, as that matches their bus width. The naming confuses me: "Thumb" can be used as a family term covering both Thumb-1 and Thumb-2, or sometimes for Thumb-1 only. I think "Thumb-1" is not an official Arm term, just something I have seen people use to make the distinction between the Thumb family of both ISAs and the first Thumb ISA clearer. Instructions in ARM can have the optional s suffix to update the CPSR register (for example the ands, orrs, movs, adds, subs instructions), while in Thumb-1 the s is always on and the CPSR is updated all the time. In some older toolchains the implicit s did not need to be written, but with the Unified Assembly Language (UAL) it is now a requirement to explicitly specify the s even when there is no option not to use it.
Thumb-2 is an extension to Thumb; it can access all registers like ARM does and has 4-byte opcodes with some differences compared to ARM. In assembly, the Thumb-1 2-byte narrow opcode and the Thumb-2 4-byte wide opcode can be forced with the .n and .w suffixes (for example orr.w). The ARM and Thumb-2 opcode formats/encodings are different, and their capabilities differ too. Conditional execution of instructions can be used, but only when an it (if-then) instruction/block is prepended. This can be done explicitly or implied (and done by the toolchain behind the user's back).

The confusion might actually be deliberate in a good way: Arm (the company) wanted the two to be similar, and a lot of effort went into the Unified Assembly Language (UAL) so that assembly files written for ARM could be assembled as Thumb-2 without change. If I understand this correctly, that can't be 100% guaranteed, and edge cases can probably be constructed where ARM assembly can't assemble as Thumb-2, so this is another absolute statement that is not fully true. For example, the ARM7 bl instruction can address +-32MB, while on the Cortex M3 it can only reach +-16MB. The situation should still be much better than with Thumb-1, where ARM assembly most likely has to be rewritten to target Thumb-1; an ARM to Thumb-2 rewrite is much less likely to be needed.

Another difference is the data-processing instructions. Both ARM and Thumb-2 support 8-bit immediates, but ARM can rotate them only to the right and only by an even number of bits, while Thumb-2 can rotate to the left as well, by an even or odd number of bits, and on top of that allows repetitive byte patterns such as 0xXYXYXYXY, 0x00XY00XY or 0xXY00XY00. Because these are rotations rather than plain shifts, a shift in one direction can be achieved by "overflowing" in the other: rotating so far one way that it is effectively a rotation the opposite way, e.g. a rotate left by n is the same as a rotate right by 32 - n.
So in conclusion some Arm CPUs can do:
only 4-byte opcode instructions, which are the pure ARM ISA
2-byte/4-byte Thumb-1/Thumb-2 ISAs with a focus on using 2-byte opcodes most of the time and only a few 4-byte opcodes; these are often labeled as Thumb (Thumb-1) 2-byte opcode CPUs (and the few 4-byte opcodes are sometimes not mentioned)
2-byte/4-byte Thumb-1/Thumb-2 ISAs with a more even mix of 2-byte and 4-byte opcodes, often labeled as Thumb-2
2-byte/4-byte opcodes by switching between ARM and Thumb modes
Reference for this information: ARM Assembly Language Programming & Architecture, Muhammad Ali Mazidi et al., 2016. The book was written before the company name change from ARM to Arm, so it was sometimes confusing whether it was referring to the company Arm or to the ARM ISA.
Please refer to https://developer.arm.com/documentation/ddi0344/c/programmer-s-model/thumb-2-instruction-set
It explains the enhancements of the Thumb-2 instruction set in detail, and in doing so it also implicitly covers the ARM, Thumb and Thumb-2 instruction sets.
As per my understanding, after referring to many links on ARM's site, the Cortex-M7 doesn't support NEON instructions, but the host (Cortex-M7) processor that we are using in our organization is specified as "ARM Cortex-M7 with single precision floating point and SIMD operations". Now I am thoroughly confused.
Is there any difference between SIMD and NEON instructions? Can anyone explain in detail?
Thanks in advance for a good explanation.
There are some instructions in the basic instruction set that can add and subtract 32-bit wide vectors of 8 or 16 bit integer values and in the ARM marketing material they are referred to as SIMD. NEON on the other hand is a much more capable SIMD implementation that works on 64 or 128 bit wide vectors of 8, 16, or 32 bit integer values and single or double precision floats. In the marketing material NEON is often referred to as "advanced SIMD".
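As a rough illustration of the difference (a sketch under my own assumptions, not taken from any Cortex-M7 datasheet; the function name is mine), the basic SIMD instructions operate on bytes packed into an ordinary 32-bit core register, for example UADD8 from the ARMv7E-M DSP extension:

    #include <stdint.h>

    /* Four unsigned byte additions at once, using the basic "SIMD"
       instruction UADD8. Inline assembly is used here; a CMSIS-style
       intrinsic would do the same job. */
    static inline uint32_t add4_u8(uint32_t a, uint32_t b)
    {
        uint32_t r;
        __asm__ ("uadd8 %0, %1, %2" : "=r"(r) : "r"(a), "r"(b));
        return r;
    }

NEON ("Advanced SIMD"), by contrast, works on dedicated 64/128-bit vector registers (e.g. vaddq_u8 adds sixteen bytes at once) and is simply not present on the Cortex-M7.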
The Intel Xeon Phi "Knights Landing" processor will be the first to support AVX-512, but it will only support "F" (like SSE without SSE2, or AVX without AVX2), so floating-point stuff mainly.
I'm writing software that operates on bytes and words (8- and 16-bit) using up to SSE4.1 instructions via intrinsics.
I am confused whether there will be EVEX-encoded versions of all/most SSE4.1 instructions in AVX-512F, and whether this means I can expect my SSE code to automatically gain EVEX-extended instructions and map to all new registers.
Wikipedia says this:
The width of the SIMD register file is increased from 256 bits to 512 bits, with a total of 32 registers ZMM0-ZMM31. These registers can be addressed as 256 bit YMM registers from AVX extensions and 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16-XMM31 and YMM16-YMM31 when using EVEX encoded form.
This unfortunately does not clarify whether compiling SSE4 code with AVX-512 enabled will lead to the same (awesome) speedup that compiling it for AVX2 provides (VEX coding of legacy instructions).
Anybody know what will happen when SSE2/4 code (C intrinsics) are compiled for AVX-512F? Could one expect a speed bump like with AVX1's VEX coding of the byte and word instructions?
Okay, I think I've pieced together enough information to make a decent answer. Here goes.
What will happen when native SSE2/4 code is run on Knights Landing (KNL)?
The code will run in the bottom fourth of the registers on a single VPU (called the compatibility layer) within a core. According to a pre-release webinar from Colfax, this means occupying only 1/4 to 1/8 of the total register space available to a core and running in legacy mode.
What happens if the same code is recompiled with compiler flags for AVX-512F?
SSE2/4 code will be generated with the VEX prefix. That means pshufb becomes vpshufb and works with other AVX code in ymm registers. Instructions will NOT be promoted to AVX-512's native EVEX encoding or allowed to address the new zmm registers specifically. Instructions can only be promoted to EVEX with AVX-512VL, in which case they gain the ability to directly address (renamed) zmm registers. It is unknown whether register sharing is possible at this point, but pipelining on AVX2 has demonstrated similar throughput with half-width AVX2 (AVX-128) as with full 256-bit AVX2 code in many cases.
Most importantly, how do I get my SSE2/4/AVX128 byte/word size code running on AVX512F?
You'll have to load 128-bit chunks into xmm, sign/zero extend those bytes/words into 32-bit in zmm, and operate as if they were always larger integers. Then when finished, convert back to bytes/words.
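A hedged sketch of that widen/operate/narrow pattern (my own function and variable names, assuming plain AVX-512F without BW), adding two byte arrays 16 elements at a time:

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    void add_bytes_avx512f(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
    {
        for (size_t i = 0; i + 16 <= n; i += 16) {
            __m128i a8  = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i b8  = _mm_loadu_si128((const __m128i *)(b + i));
            __m512i a32 = _mm512_cvtepu8_epi32(a8);    /* zero-extend 16 bytes -> 16 dwords */
            __m512i b32 = _mm512_cvtepu8_epi32(b8);
            __m512i s32 = _mm512_add_epi32(a32, b32);  /* operate at 32-bit width */
            __m128i s8  = _mm512_cvtepi32_epi8(s32);   /* truncate back to bytes (VPMOVDB) */
            _mm_storeu_si128((__m128i *)(out + i), s8);
        }
    }

The final truncation makes the byte additions wrap modulo 256, which matches what _mm_add_epi8 would have done on the original 128-bit data.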
Is this fast?
According to material published on Larrabee (Knights Landing's prototype), type conversions of any integer width are free from xmm to zmm and vice versa, so long as registers are available. Additionally, after calculations are performed, the 32-bit results can be truncated on the fly down to byte/word length and written (packed) to unaligned memory in 128-bit chunks, potentially saving an xmm register.
On KNL, each core has 2 VPUs that seem to be capable of talking to each other. Hence, 32-way 32-bit lookups are possible in a single vperm*2d instruction of presumably reasonable throughput. This is not possible even with AVX2, which can only permute within 128-bit lanes (or between lanes for the 32-bit vpermd only, which is inapplicable to byte/word instructions). Combined with free type conversions, the ability to use masks implicitly with AVX512 (sparing the costly and register-intensive use of blendv or explicit mask generation), and the presence of more comparators (native NOT, unsigned/signed lt/gt, etc), it may provide a reasonable performance boost to rewrite SSE2/4 byte/word code for AVX512F after all. At least on KNL.
Don't worry, I'll test the moment I get my hands on mine. ;-)
NEON can do SIMD operations on 32-bit floating-point numbers, but it does not do SIMD operations on 64-bit floating-point numbers.
The VFP unit is not SIMD: it can do 32-bit or 64-bit floating-point operations only on one element at a time.
Does ARM support SIMD operations for 64-bit floating-point numbers?
This is only possible on processors supporting ARMv8, and only when running the AArch64 instruction set; it is not possible in the AArch32 instruction set.
However, most processors support 32-bit and 64-bit scalar floating-point operations (i.e. a floating-point unit).
ARMv8
In ARMv8, it is possible:
fadd v2.2d, v0.2d, v1.2d
Minimal runnable example with an assert and QEMU user setup.
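The same operation is also exposed through C intrinsics when targeting AArch64; a minimal sketch (assuming a compiler with arm_neon.h and the GCC/Clang vector initializer syntax):

    #include <arm_neon.h>
    #include <assert.h>

    int main(void)
    {
        /* float64x2_t and vaddq_f64 exist only when targeting AArch64. */
        float64x2_t a = {1.5, 2.5};
        float64x2_t b = {3.0, 4.0};
        float64x2_t c = vaddq_f64(a, b); /* packed double-precision add (FADD .2D) */
        assert(vgetq_lane_f64(c, 0) == 4.5);
        assert(vgetq_lane_f64(c, 1) == 6.5);
        return 0;
    }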
The analogous ARMv7 does not work:
vadd.f64 q2, q0, q1
assembly fails with:
bad type in Neon instruction -- `vadd.f64 q2,q0,q1'
Minimal runnable 32-bit float v7 code for comparison.
Manual
https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf A1.5 "Advanced SIMD and floating-point support" says:
The SIMD instructions provide packed Single Instruction Multiple Data (SIMD) and single-element scalar operations, and support:
Single-precision and double-precision arithmetic in AArch64 state.
For ARMv7, F6.1.27 "VADD (floating-point)" says:
<dt> Is the data type for the elements of the vectors, encoded in the "sz" field. It can have the following values:
F32 when sz = 0
F16 when sz = 1
but there is no F64, which suggests that it is not possible.
I've heard that the 128-bit integer data types like __int128_t provided by GCC are emulated and therefore slow. However, I understand that the various SSE instruction sets (SSE, SSE2, ..., AVX) introduced at least some instructions for 128-bit registers. I don't know very much about SSE or assembly / machine code, so I was wondering if someone could explain to me whether arithmetic with __int128_t is emulated or not using modern versions of GCC.
The reason I'm asking this is because I'm wondering if it makes sense to expect big differences in __int128_t performance between different versions of GCC, depending on what SSE instructions are taken advantage of.
So, what parts of __int128_t arithmetic are emulated by GCC, and what parts are implemented with SSE instructions (if any)?
I was confusing two different things in my question.
Firstly, as PaulR explained in the comments: "There are no 128 bit arithmetic operations in SSE or AVX (apart from bitwise operations)". Considering this, 128-bit arithmetic has to be emulated on modern x86-64 based processors (e.g. AMD Family 10 or Intel Core architecture). This has nothing to do with GCC.
The second part of the question is whether or not 128-bit arithmetic emulation in GCC benefits from SSE/AVX instructions or registers. As implied in PaulR's comments, there isn't much in SSE/AVX that's going to allow you to do 128-bit arithmetic more easily; most likely x86-64 instructions will be used for this. The code I'm interested in can't compile with -mno-sse, but it compiles fine with -mno-sse2 -mno-sse3 -mno-ssse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2 -mno-avx -mno-avx2 and performance isn't affected. So my code doesn't benefit from modern SSE instructions.
SSE2-AVX instructions are available for 8-, 16-, 32- and 64-bit integer data types. They mostly operate on packed data; for example, a 128-bit register may contain four 32-bit integers, and so on.
Although SSE/AVX/AVX-512/etc. have no 128-bit mode (their vector elements are strictly 64-bit max, and operations will simply overflow), as Paul R has implied, the main CPU does support limited 128-bit operations, by using a pair of registers.
When multiplying two regular 64-bit numbers, MUL/IMUL outputs its 128-bit result in the RDX:RAX register pair.
Conversely, when dividing, DIV/IDIV takes its input from the RDX:RAX pair to divide a 128-bit number by a 64-bit divisor (and outputs a 64-bit quotient plus a 64-bit remainder).
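For example, a compiler can lower the following C directly onto those instructions (a sketch; the function names are mine, and whether a library helper is emitted instead depends on the compiler):

    #include <stdint.h>

    /* Typically compiles to a single MUL, with the full 128-bit product
       delivered in RDX:RAX. */
    unsigned __int128 widening_mul(uint64_t a, uint64_t b)
    {
        return (unsigned __int128)a * b;
    }

    /* May use DIV with RDX:RAX as the 128-bit dividend, although compilers
       often fall back to a library routine when they cannot prove the
       quotient fits in 64 bits. */
    uint64_t narrow_div(unsigned __int128 x, uint64_t d)
    {
        return (uint64_t)(x / d);
    }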
Of course the CPU's ALU is 64-bit, so - as the Intel docs imply - these extra upper 64 bits come at the cost of extra micro-ops in the microcode. This is more dramatic for divisions (more than 3x more), which already require lots of micro-ops to be processed.
Still, that means that under some circumstances (like using a rule of three to scale a value), it is possible for a compiler to emit regular CPU instructions and not have to do any 128-bit emulation by itself.
This has been available for a long time:
since the 80386, 32-bit CPUs could do 64-bit multiplication/division using the EDX:EAX pair
since the 8086/88, 16-bit CPUs could do 32-bit multiplication/division using the DX:AX pair
(As for addition and subtraction: thanks to the support for carry, it is completely trivial to do additions/subtractions of numbers of any arbitrary length that fits your storage.)
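A minimal sketch of that last point (the function name is mine):

    #include <stdint.h>

    /* 128-bit addition needs no "emulation" beyond a carry chain: on x86-64
       this typically becomes an ADD of the low halves followed by an ADC of
       the high halves, with no library call involved. */
    unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b)
    {
        return a + b;
    }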