Integer SIMD Instruction AVX in C

I am trying to run SIMD instruction over data types int, float and double.
I need multiply, add and load operations.
For float and double I successfully managed to make those instructions work:
_mm256_add_ps, _mm256_mul_ps and _mm256_load_ps (ending *pd for double).
(Direct FMADD operation isn't supported)
But for integers I couldn't find a working instruction. All of the ones shown in the Intel AVX manual give a similar error from GCC 4.7, like "‘_mm256_mul_epu32’ was not declared in this scope".
For loading integers I use _mm256_set_epi32 and that's fine for GCC. I don't know why those other instructions aren't declared. Do I need to update something?
I am including all of these: <pmmintrin.h>, <immintrin.h>, <x86intrin.h>.
My processor is an Intel core i5 3570k (Ivy Bridge).

256-bit integer operations were only added in AVX2, so you'll have to use 128-bit __m128i vectors for integer intrinsics if you only have AVX1.
AVX1 does have integer loads/stores, and intrinsics like _mm256_set_epi32 can be implemented with FP shuffles or a simple load of a compile-time constant.
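For example (a sketch, with a made-up helper name), with only AVX1 you can still process 256 bits of int32 data by splitting it into two 128-bit halves and using the VEX-encoded SSE4.1 integer intrinsics:

    #include <immintrin.h>

    // Split a 256-bit vector into 128-bit halves, do integer math there,
    // then recombine. Compile with -mavx; only AVX1 + SSE4.1 ops are used.
    static inline __m256i mul_add_epi32_avx1(__m256i a, __m256i b)
    {
        __m128i alo = _mm256_castsi256_si128(a);       // low 128 bits (free)
        __m128i ahi = _mm256_extractf128_si256(a, 1);  // high 128 bits (AVX1 vextractf128)
        __m128i blo = _mm256_castsi256_si128(b);
        __m128i bhi = _mm256_extractf128_si256(b, 1);

        __m128i lo = _mm_add_epi32(_mm_mullo_epi32(alo, blo), alo);  // example op: a*b + a per element
        __m128i hi = _mm_add_epi32(_mm_mullo_epi32(ahi, bhi), ahi);

        return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
    }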
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2
Advanced Vector Extensions 2 (AVX2), also known as Haswell New Instructions,[2] is an expansion of the AVX instruction set introduced in Intel's Haswell microarchitecture. AVX2 makes the following additions:
expansion of most vector integer SSE and AVX instructions to 256 bits
three-operand general-purpose bit manipulation and multiply
three-operand fused multiply-accumulate support (FMA3)
Gather support, enabling vector elements to be loaded from non-contiguous memory locations
DWORD- and QWORD-granularity any-to-any permutes
vector shifts.
FMA3 is actually a separate feature; AMD Piledriver/Steamroller have it but not AVX2.
Nevertheless, if the int value range fits in 24 bits, you can use float instead. However, note that if you need the exact result or the low bits of the result, you'll have to convert float to double, because a 24x24 multiplication produces a 48-bit result which can only be stored exactly in a double. At that point you still only have 4 elements per vector, and might have been better off with XMM vectors of int32. (But note that FMA throughput is typically better than integer multiply throughput.)
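A sketch of that float fallback (the function name is mine; results are exact only while the products and sums stay within 24 bits):

    #include <immintrin.h>

    // int32 -> float, FP math with AVX1 ymm instructions, then back to int32.
    // vcvtdq2ps / vcvtps2dq on ymm registers are AVX1; no AVX2 needed.
    static inline __m256i mul_add_small_ints_via_float(__m256i a, __m256i b, __m256i c)
    {
        __m256 af = _mm256_cvtepi32_ps(a);
        __m256 bf = _mm256_cvtepi32_ps(b);
        __m256 cf = _mm256_cvtepi32_ps(c);
        __m256 rf = _mm256_add_ps(_mm256_mul_ps(af, bf), cf);  // a*b + c
        return _mm256_cvtps_epi32(rf);  // rounds to nearest; exact only if everything fits in 24 bits
    }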
AVX1 has VEX encodings of 128-bit integer operations so you can use them in the same function as 256-bit FP intrinsics without causing SSE-AVX transition stalls. (In C you generally don't have to worry about that; your compiler will take care of using vzeroupper where needed.)
You could try to simulate an integer addition with AVX bitwise instructions like VANDPS and VXORPS, but without a bitwise left shift for ymm vectors it won't work.
If you're sure FTZ / DAZ are not set, you can use small integers as denormal / subnormal float values, where the bits outside the mantissa are all zero. Then FP addition and integer addition are the same bitwise operation. (And VADDPS doesn't need a microcode assist on Intel hardware when the inputs and result are both denormal.)
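A sketch of that trick (the helper name is mine):

    #include <immintrin.h>

    // Reinterpret the integer bits as floats and add with VADDPS.
    // Valid only if FTZ/DAZ are off and every lane (inputs and sums) stays
    // below 2^23, so only mantissa bits are ever set.
    static inline __m256i add_epi32_via_subnormals(__m256i a, __m256i b)
    {
        __m256 af = _mm256_castsi256_ps(a);   // bit-cast, no conversion
        __m256 bf = _mm256_castsi256_ps(b);
        return _mm256_castps_si256(_mm256_add_ps(af, bf));
    }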

Related

Is there any Intrinsic in AVX2 Architecture similar to _mm_min_round_ss in avx512?

I'm a beginner working on the AVX2 architecture, and I would like to use an intrinsic that provides the same functionality as _mm_min_round_ss in AVX-512. Is there any intrinsic similar to this?
Rounding-mode override and FP-exception suppression (with per-instruction overrides) are unique to AVX-512. (These are the ..._round_... versions of scalar and 512-bit intrinsics; packed 128-bit and 256-bit vector instructions don't have room to encode the SAE stuff in the EVEX prefix, they need some of those bits to signal the narrower vector length.)
Does the rounding mode ever make a difference for vminps? I think no, since it's a compare, not actually rounding a new result. I guess suppressing exceptions can, in case you're going to check fenv later to see if anything set the denormal or invalid flags or something? The Intrinsics guide only mentions _MM_FROUND_NO_EXC as relevant, not overrides to floor/ceil/trunc rounding.
If you don't need exception suppression, just use the normal scalar or packed ..._min_ps / ss intrinsic, e.g. _mm256_min_ps (8 floats in a __m256 vector) or _mm_min_ss (scalar, just the low element of a __m128 vector, leaving others unmodified).
See What is the instruction that gives branchless FP min and max on x86? for details on exact FP semantics (not symmetric wrt. NaN), and the fact that until quite recently, GCC treated the intrinsic as commutative even though the instruction isn't. (Other compilers, and current GCC, only do that with -ffast-math)
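For reference, a minimal usage sketch of the normal (non-_round_) intrinsics:

    #include <immintrin.h>

    // vminps / minss: 8-wide packed min, and scalar min of the low elements.
    // If either input is NaN, the second operand (b) is returned.
    __m256 min_packed(__m256 a, __m256 b) { return _mm256_min_ps(a, b); }
    __m128 min_low(__m128 a, __m128 b)    { return _mm_min_ss(a, b); }  // upper 3 elements copied from a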

Dividing packed 16-bit integer with mask using AVX512 or SVML intrinsics

I am looking for a solution for dividing packed 16-bit integers with a mask (__mmask16 for example). The _mm512_mask_div_epi32 intrinsics seem good; however, they only support packed 32-bit integers, which unnecessarily forces me to widen my packed 16-bit integers to packed 32-bit before use.
_mm512_mask_div_epi32 isn't a real intrinsic; it's an Intel SVML function. x86 doesn't have SIMD integer division, only SIMD FP double and float.
If your divisor vectors are compile-time constants (or reused for multiple dividends), see https://libdivide.com/ for exact division using a multiplicative inverse.
Otherwise probably your best bet is to convert to single-precision FP which can exactly represent every 16-bit integer. If _mm512_mask_div_epi32 does any extra work to deal with the fact that FP32 can't exactly represent every possible int32_t, that's wasted for your use case.
(Some future CPUs may have support for some kind of 16-bit FP in the IA cores, not just the GPU, but for now the best way to take advantage of the high-throughput hardware div/sqrt SIMD execution unit is via conversion to float. Like one __m256 per 5 clock cycles for Skylake vdivps ymm with a single uop, or one per 10 clock cycles for __m512 with a 3-uop vdivps zmm)
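A sketch of the conversion approach (the helper name is mine; the final 16-bit blend needs AVX-512BW+VL on top of AVX-512F):

    #include <immintrin.h>

    // Divide 16 packed int16 elements via float, which represents every
    // 16-bit value exactly. Lanes not selected by k keep the value from src.
    static inline __m256i mask_div_epi16_via_float(__m256i src, __mmask16 k,
                                                   __m256i a, __m256i b)
    {
        __m512 af = _mm512_cvtepi32_ps(_mm512_cvtepi16_epi32(a));  // sign-extend, then to float
        __m512 bf = _mm512_cvtepi32_ps(_mm512_cvtepi16_epi32(b));
        __m512i q = _mm512_cvttps_epi32(_mm512_div_ps(af, bf));    // truncate toward zero
        __m256i q16 = _mm512_cvtepi32_epi16(q);                    // narrow back to 16 bits
        return _mm256_mask_blend_epi16(k, src, q16);               // AVX-512BW + VL
    }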

Will Knights Landing CPU (Xeon Phi) accelerate byte/word integer code?

The Intel Xeon Phi "Knights Landing" processor will be the first to support AVX-512, but it will only support "F" (like SSE without SSE2, or AVX without AVX2), so floating-point stuff mainly.
I'm writing software that operates on bytes and words (8- and 16-bit) using up to SSE4.1 instructions via intrinsics.
I am confused whether there will be EVEX-encoded versions of all/most SSE4.1 instructions in AVX-512F, and whether this means I can expect my SSE code to automatically gain EVEX-extended instructions and map to all new registers.
Wikipedia says this:
The width of the SIMD register file is increased from 256 bits to 512 bits, with a total of 32 registers ZMM0-ZMM31. These registers can be addressed as 256 bit YMM registers from AVX extensions and 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16-XMM31 and YMM16-YMM31 when using EVEX encoded form.
This unfortunately does not clarify whether compiling SSE4 code with AVX-512 enabled will lead to the same (awesome) speedup that compiling it for AVX2 provides (VEX encoding of legacy instructions).
Does anybody know what will happen when SSE2/4 code (C intrinsics) is compiled for AVX-512F? Could one expect a speed bump like with AVX1's VEX encoding of the byte and word instructions?
Okay, I think I've pieced together enough information to make a decent answer. Here goes.
What will happen when native SSE2/4 code is run on Knights Landing (KNL)?
The code will run in the bottom fourth of the registers on a single VPU (called the compatibility layer) within a core. According to a pre-release webinar from Colfax, this means occupying only 1/4 to 1/8 of the total register space available to a core and running in legacy mode.
What happens if the same code is recompiled with compiler flags for AVX-512F?
SSE2/4 code will be generated with the VEX prefix. That means pshufb becomes vpshufb and works with other AVX code in ymm registers. Instructions will NOT be promoted to AVX-512's native EVEX encoding or allowed to address the new zmm registers specifically. Instructions can only be promoted to EVEX with AVX512VL, in which case they gain the ability to directly address the extended register file (XMM16-31 / YMM16-31). It is unknown whether register sharing is possible at this point, but pipelining on AVX2 has demonstrated similar throughput with half-width AVX2 (AVX-128) as with full 256-bit AVX2 code in many cases.
Most importantly, how do I get my SSE2/4/AVX128 byte/word size code running on AVX512F?
You'll have to load 128-bit chunks into xmm, sign/zero extend those bytes/words into 32-bit in zmm, and operate as if they were always larger integers. Then when finished, convert back to bytes/words.
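Something along these lines (a sketch; the helper name is mine, and only AVX-512F instructions are used):

    #include <immintrin.h>

    // Zero-extend 16 bytes to dword lanes in a zmm register, do the math at
    // 32-bit width, then truncate back down to bytes.
    static inline __m128i add_bytes_avx512f(__m128i a, __m128i b)
    {
        __m512i a32 = _mm512_cvtepu8_epi32(a);     // 16 x uint8 -> 16 x int32
        __m512i b32 = _mm512_cvtepu8_epi32(b);
        __m512i sum = _mm512_add_epi32(a32, b32);  // dword arithmetic
        return _mm512_cvtepi32_epi8(sum);          // vpmovdb: truncate each dword to a byte
    }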
Is this fast?
According to material published on Larrabee (Knights Landing's prototype), type conversions of any integer width are free from xmm to zmm and vice versa, so long as registers are available. Additionally, after calculations are performed, the 32-bit results can be truncated on the fly down to byte/word length and written (packed) to unaligned memory in 128-bit chunks, potentially saving an xmm register.
On KNL, each core has 2 VPUs that seem to be capable of talking to each other. Hence, 32-way 32-bit lookups are possible in a single vperm*2d instruction of presumably reasonable throughput. This is not possible even with AVX2, which can only permute within 128-bit lanes (or between lanes for the 32-bit vpermd only, which is inapplicable to byte/word instructions). Combined with free type conversions, the ability to use masks implicitly with AVX512 (sparing the costly and register-intensive use of blendv or explicit mask generation), and the presence of more comparators (native NOT, unsigned/signed lt/gt, etc), it may provide a reasonable performance boost to rewrite SSE2/4 byte/word code for AVX512F after all. At least on KNL.
Don't worry, I'll test the moment I get my hands on mine. ;-)

Does ARM support SIMD operations for 64 bit floating point numbers?

NEON can do SIMD operations on 32-bit floating-point numbers, but it does not do SIMD operations on 64-bit floats.
VFP is not SIMD; it can do 32-bit or 64-bit floating-point operations only on one element at a time.
Does ARM support SIMD operations for 64 bit floating point numbers?
This is only possible on processors supporting ARMv8, and only when running the AArch64 instruction set. It is not possible in the AArch32 instruction set.
However, most processors support 32-bit and 64-bit scalar floating-point operations (i.e. a floating-point unit).
ARMv8
In ARMv8, it is possible:
fadd v2.2d, v0.2d, v1.2d
Minimal runnable example with an assert and QEMU user setup.
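The same thing from C intrinsics (a sketch; it only compiles when targeting AArch64):

    #include <arm_neon.h>

    // float64x2_t holds two doubles; vaddq_f64 maps to fadd v.2d.
    float64x2_t add_pairs(float64x2_t a, float64x2_t b)
    {
        return vaddq_f64(a, b);
    }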
The analogous ARMv7 does not work:
vadd.f64 q2, q0, q1
assembly fails with:
bad type in Neon instruction -- `vadd.f64 q2,q0,q1'
Minimal runnable 32-bit float v7 code for comparison.
Manual
https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf A1.5 "Advanced SIMD and floating-point support" says:
The SIMD instructions provide packed Single Instruction Multiple Data (SIMD) and single-element scalar operations, and support:
Single-precision and double-precision arithmetic in AArch64 state.
For ARMv7, F6.1.27 "VADD (floating-point)" says:
<dt> Is the data type for the elements of the vectors, encoded in the "sz" field. It can have the following values:
F32 when sz = 0
F16 when sz = 1
but there is no F64, which suggests that it is not possible.

Is __int128_t arithmetic emulated by GCC, even with SSE?

I've heard that the 128-bit integer data-types like __int128_t provided by GCC are emulated and therefore slow. However, I understand that the various SSE instruction sets (SSE, SSE2, ..., AVX) introduced at least some instructions for 128-bit registers. I don't know very much about SSE or assembly / machine code, so I was wondering if someone could explain to me whether arithmetic with __int128_t is emulated or not using modern versions of GCC.
The reason I'm asking this is because I'm wondering if it makes sense to expect big differences in __int128_t performance between different versions of GCC, depending on what SSE instructions are taken advantage of.
So, what parts of __int128_t arithmetic are emulated by GCC, and what parts are implemented with SSE instructions (if any)?
I was confusing two different things in my question.
Firstly, as PaulR explained in the comments: "There are no 128 bit arithmetic operations in SSE or AVX (apart from bitwise operations)". Considering this, 128-bit arithmetic has to be emulated on modern x86-64 based processors (e.g. AMD Family 10 or Intel Core architecture). This has nothing to do with GCC.
The second part of the question is whether or not 128-bit arithmetic emulation in GCC benefits from SSE/AVX instructions or registers. As implied in PaulR's comments, there isn't much in SSE/AVX that's going to allow you to do 128-bit arithmetic more easily; most likely x86-64 instructions will be used for this. The code I'm interested in can't compile with -mno-sse, but it compiles fine with -mno-sse2 -mno-sse3 -mno-ssse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2 -mno-avx -mno-avx2 and performance isn't affected. So my code doesn't benefit from modern SSE instructions.
SSE2-AVX instructions are available for 8-, 16-, 32- and 64-bit integer data types. They are mostly intended to operate on packed data; for example, a 128-bit register may contain four 32-bit integers, and so on.
Although SSE/AVX/AVX-512/etc. have no 128-bit mode (their vector elements are strictly 64-bit max, and operations will simply overflow), as Paul R has implied, the main CPU does support limited 128-bit operations, by using a pair of registers.
When multiplying two regular 64-bit numbers, MUL/IMUL outputs its 128-bit result in the RDX:RAX register pair.
Inversely, when dividing, DIV/IDIV takes its input from the RDX:RAX pair to divide a 128-bit number by a 64-bit divisor (and outputs a 64-bit quotient plus a 64-bit remainder).
Of course the CPU's ALU is 64-bit, thus - as the Intel docs imply - the extra upper 64 bits come at the cost of extra micro-ops in the microcode. This is more dramatic for divisions (> 3x more), which already require lots of micro-ops to be processed.
Still, that means that under some circumstances (like using a rule of three to scale a value), it's possible for a compiler to emit regular CPU instructions and not have to do any 128-bit emulation by itself.
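For example (a sketch), GCC compiles a widening 64x64 multiply to a single MUL with the product in RDX:RAX, and a full 128-bit add to an ADD/ADC pair; no SSE registers are involved:

    #include <stdint.h>

    // 64x64 -> 128-bit widening multiply: compiles to one MUL instruction,
    // result in RDX:RAX.
    unsigned __int128 widening_mul(uint64_t a, uint64_t b)
    {
        return (unsigned __int128)a * b;
    }

    // 128-bit addition: compiles to ADD + ADC (add-with-carry) on the halves.
    unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b)
    {
        return a + b;
    }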
This has been available for a long time:
since the 80386, 32-bit CPUs could do 64-bit multiplication/division using the EDX:EAX pair
since the 8086/88, 16-bit CPUs could do 32-bit multiplication/division using the DX:AX pair
(As for additions and subtractions: thanks to carry support, it's completely trivial to do additions/subtractions of numbers of any arbitrary length that can fit in your storage.)
