Masked Load/Store - arm

I am writing a function using NEON intrinsics to optimize some matrix operations, and I need to handle special cases (like reaching the end of an array whose size is not a multiple of the register width).
I would prefer a SIMD instruction in NEON that does something like this (like the masked loads/stores in AVX).
Is there any way you can do a masked load/store using NEON intrinsics?

Related

_mm_mul_epu32 and _mm_mullo_epi32 on arm neon

I am working on an application to port SSE code to NEON.
I see the intrinsics _mm_mullo_epi32 and _mm_mul_epu32 in SSE.
Is there a NEON equivalent for these?
The equivalent of _mm_mullo_epi32 is vmulq_s32
_mm_mul_epu32 is a bit trickier: no single NEON instruction does the job.
Still, the workaround is not that bad; it only needs 3 instructions: two vmovn_u64 instructions to discard the odd lanes of the arguments, followed by vmull_u32 to multiply the remaining 32-bit lanes into 64-bit ones.
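For reference, a minimal sketch of that workaround as NEON intrinsics (assuming the usual little-endian configuration; the function name is only illustrative):

#include <arm_neon.h>

/* Sketch of an _mm_mul_epu32 equivalent: the vreinterpret casts are free,
   only the two vmovn and the vmull generate instructions. */
static inline uint64x2_t mul_epu32_neon(uint32x4_t a, uint32x4_t b)
{
    uint32x2_t a_even = vmovn_u64(vreinterpretq_u64_u32(a)); /* keep lanes 0 and 2 */
    uint32x2_t b_even = vmovn_u64(vreinterpretq_u64_u32(b)); /* keep lanes 0 and 2 */
    return vmull_u32(a_even, b_even); /* widening 32x32 -> 64-bit multiply */
}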

Will Knights Landing CPU (Xeon Phi) accelerate byte/word integer code?

The Intel Xeon Phi "Knights Landing" processor will be the first to support AVX-512, but it will only support "F" (like SSE without SSE2, or AVX without AVX2), so floating-point stuff mainly.
I'm writing software that operates on bytes and words (8- and 16-bit) using up to SSE4.1 instructions via intrinsics.
I am confused whether there will be EVEX-encoded versions of all/most SSE4.1 instructions in AVX-512F, and whether this means I can expect my SSE code to automatically gain EVEX-extended instructions and map to all new registers.
Wikipedia says this:
The width of the SIMD register file is increased from 256 bits to 512 bits, with a total of 32 registers ZMM0-ZMM31. These registers can be addressed as 256 bit YMM registers from AVX extensions and 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16-XMM31 and YMM16-YMM31 when using EVEX encoded form.
This unfortunately does not clarify whether compiling SSE4 code with AVX-512 enabled will lead to the same (awesome) speedup that compiling it for AVX2 provides (VEX encoding of legacy instructions).
Anybody know what will happen when SSE2/4 code (C intrinsics) is compiled for AVX-512F? Could one expect a speed bump like with AVX1's VEX coding of the byte and word instructions?
Okay, I think I've pieced together enough information to make a decent answer. Here goes.
What will happen when native SSE2/4 code is run on Knights Landing (KNL)?
The code will run in the bottom fourth of the registers on a single VPU (called the compatibility layer) within a core. According to a pre-release webinar from Colfax, this means occupying only 1/4 to 1/8 of the total register space available to a core and running in legacy mode.
What happens if the same code is recompiled with compiler flags for AVX-512F?
SSE2/4 code will be generated with VEX prefixes. That means pshufb becomes vpshufb and works with other AVX code in ymm registers. Instructions will NOT be promoted to AVX-512's native EVEX encoding or allowed to address the new zmm registers specifically. Instructions can only be promoted to EVEX with AVX512-VL, in which case they gain the ability to directly address (renamed) zmm registers. It is unknown whether register sharing is possible at this point, but on AVX2 hardware, half-width 128-bit (AVX-128) code has demonstrated throughput similar to full 256-bit AVX2 code in many cases.
Most importantly, how do I get my SSE2/4/AVX128 byte/word size code running on AVX512F?
You'll have to load 128-bit chunks into xmm registers, sign- or zero-extend those bytes/words into 32-bit elements in zmm, and operate as if they were always the larger integers. Then, when finished, convert back to bytes/words.
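For concreteness, a minimal sketch of that widen-operate-narrow pattern, assuming AVX-512F only and a hypothetical byte-addition kernel (function name and tail handling are mine):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Add two arrays of unsigned bytes, 16 at a time, by working in 32-bit lanes. */
void add_u8_avx512f(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    size_t i;
    for (i = 0; i + 16 <= n; i += 16) {
        __m512i va  = _mm512_cvtepu8_epi32(_mm_loadu_si128((const __m128i *)(a + i)));
        __m512i vb  = _mm512_cvtepu8_epi32(_mm_loadu_si128((const __m128i *)(b + i)));
        __m512i sum = _mm512_add_epi32(va, vb);          /* 32-bit adds (AVX-512F) */
        _mm_storeu_si128((__m128i *)(out + i),
                         _mm512_cvtepi32_epi8(sum));     /* truncate back to bytes */
    }
    /* the remaining n % 16 elements would need scalar or masked handling */
}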
Is this fast?
According to material published on Larrabee (Knights Landing's prototype), type conversions of any integer width are free from xmm to zmm and vice versa, so long as registers are available. Additionally, after calculations are performed, the 32-bit results can be truncated on the fly down to byte/word length and written (packed) to unaligned memory in 128-bit chunks, potentially saving an xmm register.
On KNL, each core has 2 VPUs that seem to be capable of talking to each other. Hence, 32-way 32-bit lookups are possible in a single vperm*2d instruction of presumably reasonable throughput. This is not possible even with AVX2, which can only permute within 128-bit lanes (or between lanes for the 32-bit vpermd only, which is inapplicable to byte/word instructions). Combined with free type conversions, the ability to use masks implicitly with AVX512 (sparing the costly and register-intensive use of blendv or explicit mask generation), and the presence of more comparators (native NOT, unsigned/signed lt/gt, etc), it may provide a reasonable performance boost to rewrite SSE2/4 byte/word code for AVX512F after all. At least on KNL.
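For illustration, a hedged sketch of such a lookup with AVX-512F intrinsics (the helper name is mine):

#include <immintrin.h>

/* Sixteen parallel lookups into a 32-entry table of 32-bit values held in two
   zmm registers, via a single vpermt2d/vpermi2d. */
static inline __m512i lut32_epi32(__m512i idx, __m512i table_lo, __m512i table_hi)
{
    /* each of the 16 indices (0..31) selects a dword from {table_lo, table_hi} */
    return _mm512_permutex2var_epi32(table_lo, idx, table_hi);
}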
Don't worry, I'll test the moment I get my hands on mine. ;-)

Macro for generating immediates for AVX shuffle intrinsics

In AVX, is there any special macro that helps to construct the immediate constant for _mm256_shuffle_* intrinsics, like _MM_SHUFFLE(..) for its SSE counterpart? I can't find any.
You still use _MM_SHUFFLE() for shuffles that take the control input as an 8-bit immediate, e.g. _mm256_shuffle_epi32 (vpshufd), which does the same shuffle in both 128-bit lanes.
_MM_SHUFFLE(dd,cc,bb,aa) just packs the low 2 bits of each arg into an 8-bit value, 0bddccbbaa.
You can write _MM_SHUFFLE(1,1,1,1) (broadcast element 1) as 0b01010101, i.e. 0x55.
You can use C++14 digit separators to write it as 0b01'01'01'01 for better human readability, especially in cases where each element is different.
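For example (assuming AVX2 for the 256-bit integer shuffle; the helper name is only illustrative):

#include <immintrin.h>

/* Broadcast element 1 within each 128-bit lane; the control byte is 0x55 either way. */
static inline __m256i bcast_elem1_per_lane(__m256i v)
{
    return _mm256_shuffle_epi32(v, _MM_SHUFFLE(1, 1, 1, 1)); /* == 0b01'01'01'01 */
}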

Integer SIMD Instruction AVX in C

I am trying to use SIMD instructions on the data types int, float and double.
I need multiply, add and load operations.
For float and double I successfully managed to make those instructions work:
_mm256_add_ps, _mm256_mul_ps and _mm256_load_ps (and the *_pd versions for double).
(Direct FMADD operation isn't supported)
But for integers I couldn't find a working instruction. All of the ones shown in the Intel AVX manual give a similar error with GCC 4.7, like "'_mm256_mul_epu32' was not declared in this scope".
For loading integers I use _mm256_set_epi32 and GCC is fine with that. I don't know why those other instructions aren't defined. Do I need to update something?
I am including <pmmintrin.h>, <immintrin.h> and <x86intrin.h>.
My processor is an Intel core i5 3570k (Ivy Bridge).
256-bit integer operations were only added in AVX2, so you'll have to use 128-bit __m128i vectors for integer intrinsics if you only have AVX1.
AVX1 does have integer loads/stores, and intrinsics like _mm256_set_epi32 can be implemented with FP shuffles or a simple load of a compile-time constant.
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2
Advanced Vector Extensions 2 (AVX2), also known as Haswell New Instructions,[2] is an expansion of the AVX instruction set introduced in Intel's Haswell microarchitecture. AVX2 makes the following additions:
expansion of most vector integer SSE and AVX instructions to 256 bits
three-operand general-purpose bit manipulation and multiply
three-operand fused multiply-accumulate support (FMA3)
Gather support, enabling vector elements to be loaded from non-contiguous memory locations
DWORD- and QWORD-granularity any-to-any permutes
vector shifts.
FMA3 is actually a separate feature; AMD Piledriver/Steamroller have it but not AVX2.
Nevertheless, if the int value range fits in 24 bits, then you can use float instead. However, note that if you need the exact result or the low bits of the result, then you'll have to convert float to double, because a 24x24-bit multiplication produces a 48-bit result which can only be stored exactly in a double. At that point you still only have 4 elements per vector, and might have been better off with XMM vectors of int32. (But note that FMA throughput is typically better than integer multiply throughput.)
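A minimal sketch of that exact-product case, assuming AVX1 only and inputs already in 128-bit integer vectors (the helper name is mine):

#include <immintrin.h>

/* Exact products of int32 values known to fit in 24 bits: a 48-bit product
   fits in the 53-bit mantissa of a double, so the result is exact. */
static inline __m256d mul_small_int32_exact(__m128i a, __m128i b)
{
    __m256d da = _mm256_cvtepi32_pd(a);   /* 4 x int32 -> 4 x double */
    __m256d db = _mm256_cvtepi32_pd(b);
    return _mm256_mul_pd(da, db);         /* still only 4 elements per vector */
}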
AVX1 has VEX encodings of 128-bit integer operations so you can use them in the same function as 256-bit FP intrinsics without causing SSE-AVX transition stalls. (In C you generally don't have to worry about that; your compiler will take care of using vzeroupper where needed.)
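A hedged sketch of that mix, assuming an AVX1-only CPU such as Ivy Bridge (function and variable names are illustrative):

#include <immintrin.h>

/* 256-bit vectors for the float work, 128-bit __m128i vectors (VEX-encoded
   SSE4.1) for the integer work, in the same function, with no transition stalls. */
void scale_floats_and_square_ints(const float *f, float *fout,
                                  const int *a, int *aout)
{
    __m256 vf = _mm256_loadu_ps(f);
    _mm256_storeu_ps(fout, _mm256_mul_ps(vf, vf));               /* 8 floats */

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    _mm_storeu_si128((__m128i *)aout, _mm_mullo_epi32(va, va));  /* 4 ints   */
}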
You could try to simulate an integer addition with AVX bitwise instructions like VANDPS and VXORPS, but without a bitwise left shift for ymm vectors it won't work.
If you're sure FTZ / DAZ are not set, you can use small integers as denormal / subnormal float values, where the bits outside the mantissa are all zero. Then FP addition and integer addition are the same bitwise operation. (And VADDPS doesn't need a microcode assist on Intel hardware when the inputs and result are both denormal.)
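A hedged illustration of that trick, assuming FTZ/DAZ are off and all values are non-negative integers small enough that every sum stays below 2^23 (i.e. inside the mantissa field, so the inputs and results remain subnormal):

#include <immintrin.h>

/* Under the above assumptions, the float add is bit-for-bit identical to an
   integer add; the casts are free reinterpretations. */
static inline __m256i add_tiny_uints_via_fp(__m256i a, __m256i b)
{
    __m256 fa = _mm256_castsi256_ps(a);
    __m256 fb = _mm256_castsi256_ps(b);
    return _mm256_castps_si256(_mm256_add_ps(fa, fb));
}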

assembly intrinsic to do a masked load

#include <stdio.h>

int main()
{
    const int STRIDE = 2, SIZE = 8192;
    int i = 0;
    double u[SIZE][STRIDE];
    /* write only the second element of each pair (stride-2 access) */
    #pragma vector aligned
    for (i = 0; i < SIZE; i++)
    {
        u[i][STRIDE - 1] = i;
    }
    printf("%lf\n", u[7][STRIDE - 1]);
    return 0;
}
The compiler uses xmm registers here. There is stride-2 access, and I want the compiler to ignore this, do a regular load of memory, and then mask the alternate elements, so I would be using 50% of the SIMD registers. I need intrinsics which can be used to load and then mask the register bitwise before storing back to memory.
P.S.: I have never done assembly coding before.
What I want is a masked store with a mask value of 0xAA (10101010).
You can't do a masked load (only a masked store). The easiest alternative would be to do a load and then mask it yourself (e.g. using intrinsics).
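For example, a minimal sketch of that load/merge/store approach with SSE2 intrinsics, applied to the array above (the helper name is mine; unaligned load/store is used to avoid assuming alignment):

#include <emmintrin.h>

/* Read both doubles of u[i], replace only the second lane, write the pair back. */
static inline void store_second_lane(double u[][2], int i, double val)
{
    __m128d pair   = _mm_loadu_pd(u[i]);                /* { u[i][0], u[i][1] } */
    __m128d newval = _mm_set1_pd(val);
    __m128d merged = _mm_shuffle_pd(pair, newval, _MM_SHUFFLE2(1, 0)); /* keep lane 0, replace lane 1 */
    _mm_storeu_pd(u[i], merged);
}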
A potentially better alternative would be to change your array to "double u[STRIDE][SIZE];" so that you don't need to mask anything and don't end up with half an XMM register wasted/masked.
Without AVX, half a SIMD register is only one double anyway, so there seems little wrong with regular 64-bit stores.
If you want to use masked stores (MASKMOVDQU/MASKMOVQ), note that they write directly to DRAM, just like non-temporal stores such as MOVNTPS. This may or may not be what you want. If the data fits in cache and you plan to read it soon, it is likely better not to use them.
Certain AMD processors can do a 64-bit non-temporal store from an XMM register using MOVNTSD; this may simplify things slightly compared to MASKMOVDQU.
