How do you load 3 floats using NEON intrinsics (ARM)?

I'm trying to convert this neon code to intrinsics:
vld1.32 {d0}, [%[pInVertex1]]
flds s2, [%[pInVertex1], #8]
This loads 3 32-bit floats from the variable pInVertex1 into the d0 and d1 registers.
I can't find any equivalent version for intrinsics. There is vld1q_f32, but that only works for 4 floats. Anyone know of an efficient way of doing this (I mean without extra copying)?

The only instruction that loads exactly three 32-bit floats in AArch32 is a multiple-load instruction:
r0 holds the address of the structure
FLDMIAS r0, {s0-s2}
This can be used in either VFP or NEON code.
I do not know of a corresponding intrinsic.

In DirectXMath I implemented the ARM-NEON version of XMLoadFloat3 as:
float32x2_t x = vld1_f32( reinterpret_cast<const float*>(pSource) );
float32x2_t zero = vdup_n_f32(0);
float32x2_t y = vld1_lane_f32( reinterpret_cast<const float*>(pSource)+2, zero, 0 );
return vcombine_f32( x, y );
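For reference, here is that load wrapped into a self-contained function; the wrapper name and the Float3 struct are illustrative, not part of DirectXMath:
#include <arm_neon.h>

struct Float3 { float x, y, z; };

// Sketch: load 3 packed floats into a float32x4_t, zeroing the fourth lane.
static inline float32x4_t load_float3(const Float3* pSource)
{
    float32x2_t xy = vld1_f32(reinterpret_cast<const float*>(pSource));        // x, y
    float32x2_t z  = vld1_lane_f32(reinterpret_cast<const float*>(pSource) + 2,
                                   vdup_n_f32(0), 0);                          // z in lane 0, 0 in lane 1
    return vcombine_f32(xy, z);                                                // {x, y, z, 0}
}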

Related

NEON: Swap 4 scalars in float32x4

I used the following code to swap 4 scalars in a float32x4_t vector.
{1,2,3,4} -> {4,3,2,1}
float32x4_t Vec = {1,2,3,4};
float32x4_t Rev = vrev64q_f32 (Vec); //{2,1,4,3}
float32x2_t High = vget_high_f32 (Rev); //{4,3}
float32x2_t Low = vget_low_f32 (Rev); //{1,2}
float32x4_t Swap = vcombine_f32 (High, Low); //{4,3,2,1}
Can you suggest faster code?
Thank you,
Zvika
That is possibly as good as it gets.
The reverse-engineered code (for AArch64, gcc/clang -O3) would be
vec = vrev64q_f32(vec);
return vextq_f32(vec,vec,2);
On armv7 (gcc 11.2) your original version compiles to
vrev64.32 q0, q0
vswp d0, d1
whereas the more compact version compiles to
vrev64.32 q0, q0
vext.32 q0, q0, q0, #2
If you prefer the vswp approach (only on armv7) keep your code as is, since there are no intrinsics for swaps.
On armv7 you could also use
float32x2_t lo = vrev64_f32(vget_high_f32(vec));
float32x2_t hi = vrev64_f32(vget_low_f32(vec));
return vcombine_f32(lo, hi);
When inlined and when the result can be produced in another register, this can compile to just two instructions with no dependency between them. Permutations on Cortex-A7 are typically 1 cycle per 64 bits, with 4-cycle latency, so this could be twice as fast as the other approaches.
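As a standalone function, the compact version looks like this (a sketch; the function name is mine):
#include <arm_neon.h>

// Reverse all four lanes: {1,2,3,4} -> {4,3,2,1}
static inline float32x4_t reverse_f32x4(float32x4_t vec)
{
    vec = vrev64q_f32(vec);        // {2,1,4,3}
    return vextq_f32(vec, vec, 2); // rotate by two lanes -> {4,3,2,1}
}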

Perform a horizontal logical/bitwise AND operation across all lanes of uint8x8 Neon vector

I have a uint8x8 Neon vector which is a result of some operation. I need to perform a logical AND operation across all the lanes to get my final result. Each element is either 0xff (TRUE) or 0x00 (FALSE). How do I perform it in Neon?
Since each lane is either 0xff or 0x00, you can simply do a bitwise negation and check whether the 64-bit result is 0.
vmvn d0, d0
vpaddl.u32 d0, d0 // 64bit vceq isn't possible.
vceq.i32 d0, d0, #0
You now have the desired result in d0.
If you are working on AArch64, a 64-bit cmeq is possible:
mvn v0.16b, v0.16b
cmeq v0.2d, v0.2d, #0
The best thing about this algorithm is that you don't need any other registers, because zero is the only immediate value accepted by the compare instructions.
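In intrinsics, that check can be written roughly as follows; this is my sketch, not from the answer, it returns a scalar flag rather than a mask in d0, and the compiler may or may not emit exactly the instructions above:
#include <arm_neon.h>

// Returns non-zero if every lane of v is 0xff.
static inline int all_lanes_true_mvn(uint8x8_t v)
{
    uint8x8_t inv = vmvn_u8(v);                              // all-0xff input becomes all-zero
    return vget_lane_u64(vreinterpret_u64_u8(inv), 0) == 0;
}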
Simple/obvious method (pseudo-code):
v = VAND(v, v >> 8)
v = VAND(v, v >> 16)
v = VAND(v, v >> 32)
3 x shifts and 3 x bitwise ANDs = 6 instructions.
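Rendered with NEON intrinsics, that reduction could look like this (a sketch; the helper name is mine):
#include <arm_neon.h>

// Fold the 8 lanes together with three shift+AND steps; assumes each lane is 0x00 or 0xff.
static inline int all_lanes_true_shift(uint8x8_t v)
{
    uint64x1_t x = vreinterpret_u64_u8(v);
    x = vand_u64(x, vshr_n_u64(x, 8));
    x = vand_u64(x, vshr_n_u64(x, 16));
    x = vand_u64(x, vshr_n_u64(x, 32));
    return (uint8_t)vget_lane_u64(x, 0) == 0xff;   // low byte now holds the AND of all lanes
}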
Possibly more efficient method: do a horizontal sum of all elements, then return TRUE if sum == -8, otherwise return FALSE.
Possibly simpler method: just compare vector with a vector of all 1s.
return v == 0xffffffffffffffff;
Doing this efficiently is left as an exercise for the reader (may require 2 x 32 bit compares ?).
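A different way to get the same result (my addition, not one of the answers above) is a horizontal minimum: because every lane is either 0x00 or 0xff, the minimum across lanes equals the AND across lanes:
#include <arm_neon.h>

// Horizontal AND via horizontal minimum; assumes lanes are strictly 0x00 or 0xff.
static inline int all_lanes_true_min(uint8x8_t v)
{
#if defined(__aarch64__)
    return vminv_u8(v) == 0xff;      // single UMINV on AArch64
#else
    uint8x8_t t = vpmin_u8(v, v);    // pairwise min: 8 lanes -> 4 meaningful lanes
    t = vpmin_u8(t, t);              // -> 2
    t = vpmin_u8(t, t);              // -> 1
    return vget_lane_u8(t, 0) == 0xff;
#endif
}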

The best way to shift a __m128i?

I need to shift a __m128i variable, (say v), by m bits, in such a way that bits move through all of the variable (So, the resulting variable represents v*2^m).
What is the best way to do this?!
Note that _mm_slli_epi64 shifts v0 and v1 separately:
r0 := v0 << count
r1 := v1 << count
so the high bits of v0 are lost, but I want those bits carried into r1.
Edit:
I'm looking for code faster than this (for m < 64):
r0 = v0 << m;
r1 = v0 >> (64-m);
r1 ^= v1 << m;
r2 = v1 >> (64-m);
For compile-time constant shift counts, you can get fairly good results. Otherwise not really.
This is just an SSE implementation of the r0 / r1 code from your question, since there's no other obvious way to do it. Variable-count shifts are only available for bit-shifts within vector elements, not for byte-shifts of the whole register. So we just carry the low 64bits up to the high 64 and use a variable-count shift to put them in the right place.
// untested
#include <immintrin.h>
/* some compilers might choke on slli / srli with non-compile-time-constant args
 * gcc generates the xmm, imm8 form with constants,
 * and the xmm, xmm form otherwise (with a movd to get the count into an xmm)
 */
// doesn't optimize for the special-case where count%8 = 0
// could maybe do that in gcc with if(__builtin_constant_p(count)) { if (count % 8 == 0) return ...; }
__m128i mm_bitshift_left(__m128i x, unsigned count)
{
__m128i carry = _mm_bslli_si128(x, 8); // old compilers only have the confusingly named _mm_slli_si128 synonym
if (count >= 64)
return _mm_slli_epi64(carry, count-64); // the non-carry part is all zero, so return early
// else
carry = _mm_srli_epi64(carry, 64-count); // After bslli shifted left by 64b
x = _mm_slli_epi64(x, count);
return _mm_or_si128(x, carry);
}
__m128i mm_bitshift_left_3(__m128i x) { // by a specific constant, to see inlined constant version
return mm_bitshift_left(x, 3);
}
// by a specific constant, to see inlined constant version
__m128i mm_bitshift_left_100(__m128i x) { return mm_bitshift_left(x, 100); }
I thought this was going to be less convenient than it turned out to be. _mm_slli_epi64 works on gcc/clang/icc even when the count is not a compile-time constant (generating a movd from integer reg to xmm reg). There is a _mm_sll_epi64 (__m128i a, __m128i count) (note the lack of i), but at least these days, the i intrinsic can generate either form of psllq.
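For reference, the explicit xmm-count form could be used like this (a sketch, separate from the function above; the helper name is mine):
#include <emmintrin.h>

// Shift both 64-bit halves by a runtime count using the xmm-count form (PSLLQ xmm, xmm).
static inline __m128i sll_epi64_runtime(__m128i x, unsigned count)
{
    __m128i c = _mm_cvtsi32_si128((int)count);  // MOVD: put the count in the low bits of an xmm
    return _mm_sll_epi64(x, c);
}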
The compile-time-constant count versions are fairly efficient, compiling to 4 instructions (or 5 without AVX):
mm_bitshift_left_3(long long __vector(2)):
vpslldq xmm1, xmm0, 8
vpsrlq xmm1, xmm1, 61
vpsllq xmm0, xmm0, 3
vpor xmm0, xmm0, xmm1
ret
Performance:
This has 3 cycle latency (vpslldq(1) -> vpsrlq(1) -> vpor(1)) on Intel SnB/IvB/Haswell, with throughput limited to one per 2 cycles (saturating the vector shift unit on port 0). Byte-shift runs on the shuffle unit on a different port. Immediate-count vector shifts are all single-uop instructions, so this is only 4 fused-domain uops taking up pipeline space when mixed in with other code. (Variable-count vector shifts are 2 uop, 2 cycle latency, so the variable-count version of this function is worse than it looks from counting instructions.)
Or for counts >= 64:
mm_bitshift_left_100(long long __vector(2)):
vpslldq xmm0, xmm0, 8
vpsllq xmm0, xmm0, 36
ret
If your shift-count is not a compile-time constant, you have to branch on count >= 64 to figure out whether to left- or right-shift the carry. I believe the shift count is interpreted as an unsigned integer, so a negative count is impossible.
It also takes extra instructions to get the int count and 64-count into vector registers. Doing this in a branchless fashion with vector compares and a blend instruction might be possible, but a branch is probably a good idea.
The variable-count version for __uint128_t in GP registers looks fairly good; better than the SSE version. Clang does a slightly better job than gcc, emitting fewer mov instructions, but it still uses two cmov instructions for the count >= 64 case. (Because x86 integer shift instructions mask the count, instead of saturating.)
__uint128_t leftshift_int128(__uint128_t x, unsigned count) {
return x << count; // undefined if count >= 128
}
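If the value is coming from or going back to memory anyway, one way to reuse that scalar path from SSE code is to bounce through __uint128_t (a sketch; assumes gcc/clang, which provide __uint128_t, and the function name is mine):
#include <immintrin.h>

// Route a variable-count 128-bit shift through the GP-register integer type.
static inline __m128i mm_bitshift_left_via_u128(__m128i x, unsigned count)
{
    __uint128_t v;
    _mm_storeu_si128((__m128i*)&v, x);
    v <<= (count & 127);                        // mask: shifting by >= 128 is undefined in C
    return _mm_loadu_si128((const __m128i*)&v);
}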
In SSE4a the instructions insrq and extrq can be used to shift (and rotate) through an __m128i 1-64 bits at a time. Unlike the 8/16/32/64-bit counterparts pextrN/pinsrN, these instructions select or insert m bits (between 1 and 64) at any bit offset from 0 to 127. The caveat is that the sum of length and offset must not exceed 128.

Testing equality between two __m128i variables

If I want to do a bitwise equality test between two __m128i variables, am I required to use an SSE instruction or can I use ==? If not, which SSE instruction should I use?
Although using _mm_movemask_epi8 is one solution, if you have a processor with SSE4.1 I think a better solution is to use an instruction which sets the zero or carry flag in the FLAGS register. This saves a test or cmp instruction.
To do this, you could write:
if(_mm_test_all_ones(_mm_cmpeq_epi8(v1,v2))) {
//v1 == v2
}
Edit: as Paul R pointed out _mm_test_all_ones generates two instructions: pcmpeqd and ptest. With _mm_cmpeq_epi8 that's three instructions total. Here's a better solution which only uses two instructions in total:
__m128i neq = _mm_xor_si128(v1,v2);
if(_mm_test_all_zeros(neq,neq)) {
//v1 == v2
}
This generates
pxor %xmm1, %xmm0
ptest %xmm0, %xmm0
You can use a compare and then extract a mask from the comparison result:
__m128i vcmp = _mm_cmpeq_epi8(v0, v1); // PCMPEQB
uint16_t vmask = _mm_movemask_epi8(vcmp); // PMOVMSKB
if (vmask == 0xffff)
{
// v0 == v1
}
This works with SSE2 and later.
As noted by @Zboson, if you have SSE 4.1 then you can do it like this, which may be slightly more efficient, as it's two SSE instructions and then a test on a flag (ZF):
__m128i vcmp = _mm_xor_si128(v0, v1); // PXOR
if (_mm_testz_si128(vcmp, vcmp)) // PTEST (requires SSE 4.1)
{
// v0 == v1
}
FWIW I just benchmarked both of these implementations on a Haswell Core i7 using clang to compile the test harness and the timing results were very similar - the SSE4 implementation appears to be very slightly faster but it's hard to measure the difference.
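Putting the two approaches together, a small helper might look like this (the wrapper name is mine, not from the answers):
#include <emmintrin.h>
#ifdef __SSE4_1__
#include <smmintrin.h>
#endif

// Returns non-zero if the two 128-bit values are bitwise equal.
static inline int m128i_equal(__m128i a, __m128i b)
{
#ifdef __SSE4_1__
    __m128i neq = _mm_xor_si128(a, b);       // non-zero bits wherever a != b
    return _mm_testz_si128(neq, neq);        // PTEST: ZF set iff all bits are zero
#else
    __m128i eq = _mm_cmpeq_epi8(a, b);       // 0xff in each byte that matches
    return _mm_movemask_epi8(eq) == 0xFFFF;  // all 16 bytes matched
#endif
}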
Consider using the SSE4.1 instruction ptest via _mm_testc_si128:
if(_mm_testc_si128(v0, v1)) {if equal}
else {if not}
_mm_testc_si128 computes the bitwise AND of the 128 bits in v1 with the complement of the 128 bits in v0, and returns 1 if the result is zero, otherwise 0. Note that this only verifies that every set bit of v1 is also set in v0, so on its own it is not a full bitwise equality test.

Addition in neon register

Suppose I have a 64-bit d register in NEON. Let's say it stores the values A,B,C,D,E,F,G,H.
Now I want to add A+E, B+F, C+G, D+H and so on. Is there any intrinsic that makes such an operation possible?
I looked at the documentation but didn't find anything suitable.
If you want the addition to be carried out in 16 bits, i.e. produce a uint16x4 result, you can use vmovl to widen the input vector from uint8x8 to uint16x8, then use vadd to add the lower and upper halves. Expressed in NEON intrinsics, this is achieved by
const uint16x8_t t = vmovl_u8(input);
const uint16x4_t r = vadd_u16(vget_low_u16(t), vget_high_u16(t));
This should compile to the following assembly (d0 is the 64-bit input register, d1 is the 64-bit output register). Note that vget_low_u16 and vget_high_u16 don't produce any instructions - these intrinsics are implemented by suitable register allocation, exploiting the fact that Q registers are just a convenient way to name two consecutive D registers: Q{n} refers to the pair (D{2n}, D{2n+1}).
VMOVL.U8 q1, d0
VADD.I16 d1, d2
If you want the operation to be carried out in 8 bits, and saturate in case of an overflow, do
const uint8x8_t t = vreinterpret_u8_u64(vshr_n_u64(vreinterpret_u64_u8(input), 32));
const uint8x8_t r = vqadd_u8(input, t);
This compiles to (d0 is the input again, output in d1)
VSHR.U64 d1, d0, #32
VQADD.I8 d1, d0
By replacing VQADD with just VADD, the results will wrap around on overflow instead of saturating at 0xff.
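For completeness, here is the widening variant as a compilable function with an example input (the function name and the values are mine):
#include <arm_neon.h>

// {A,B,C,D,E,F,G,H} -> {A+E, B+F, C+G, D+H} as 16-bit sums.
static inline uint16x4_t add_halves_u16(uint8x8_t input)
{
    uint16x8_t t = vmovl_u8(input);
    return vadd_u16(vget_low_u16(t), vget_high_u16(t));
}

// Example usage:
// uint8_t data[8] = {1, 2, 3, 4, 10, 20, 30, 40};
// uint16x4_t sums = add_halves_u16(vld1_u8(data));   // {11, 22, 33, 44}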
