NEON: Swap 4 scalars in float32x4 - arm

I used the following code to swap the 4 scalars in a float32x4_t vector.
{1,2,3,4} -> {4,3,2,1}
float32x4_t Vec = {1, 2, 3, 4};
float32x4_t Rev = vrev64q_f32(Vec);          // {2,1,4,3}
float32x2_t High = vget_high_f32(Rev);       // {4,3}
float32x2_t Low = vget_low_f32(Rev);         // {2,1}
float32x4_t Swap = vcombine_f32(High, Low);  // {4,3,2,1}
Can you suggest faster code?
Thank you,
Zvika

That is possibly as good as it gets.
The code reverse-engineered from the compiler output (for aarch64, gcc/clang -O3) would be
vec = vrev64q_f32(vec);         // {1,2,3,4} -> {2,1,4,3}
return vextq_f32(vec, vec, 2);  // rotate by two lanes -> {4,3,2,1}
On armv7 (gcc 11.2) your original version compiles to
vrev64.32 q0, q0
vswp d0, d1
whereas the other, more compact version compiles to
vrev64.32 q0, q0
vext.32 q0, q0, q0, #2
If you prefer the vswp approach (armv7 only), keep your code as is, since there are no intrinsics for swaps.
On armv7 you could also use
float32x2_t lo = vrev64_f32(vget_high_f32(vec)); // {4,3} - becomes the low half
float32x2_t hi = vrev64_f32(vget_low_f32(vec));  // {2,1} - becomes the high half
return vcombine_f32(lo, hi);                     // {4,3,2,1}
When inlined, and when the result can be produced in another register, this can compile to just two instructions with no dependency between them. Permutations on Cortex-A7 typically run at 1 cycle per 64 bits with 4-cycle latency, so this could be twice as fast as the other approaches.
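For reference, the vrev64q + vext variant can be wrapped into a self-contained function like this (a minimal sketch; the function name is mine):
#include <arm_neon.h>

static inline float32x4_t reverse_f32x4(float32x4_t vec)
{
    float32x4_t rev = vrev64q_f32(vec);  // {1,2,3,4} -> {2,1,4,3}
    return vextq_f32(rev, rev, 2);       // rotate the 64-bit halves -> {4,3,2,1}
}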

Related

In-quadword-vector Shuffle with ARM NEON

I want to swap the two middle elements stored in a 128-bit (quadword) NEON register.
[a3, a2, a1, a0] --> [a3, a1, a2, a0]
After reading GNU's "ARM NEON Intrinsics" page and the ARM ACLE, it seems it can be done as:
// qr0 is the input vector of type float32x4_t, with lanes {a0, a1, a2, a3}
float32x2_t lo = vget_low_f32(qr0);              // {a0, a1}
float32x2_t hi = vget_high_f32(qr0);             // {a2, a3}
float32x2x2_t qr0_z = vzip_f32(lo, hi);          // val[0] = {a0, a2}, val[1] = {a1, a3}; vtrn_f32 gives the same result here
qr0 = vcombine_f32(qr0_z.val[0], qr0_z.val[1]);  // lanes {a0, a2, a1, a3} == [a3, a1, a2, a0]
My question is: is there any better way to do this via intrinsics? Thank you for reading this.
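For what it's worth, the sequence above can be wrapped into a self-contained function (a sketch; the function name swap_middle_f32x4 is mine):
#include <arm_neon.h>

// Swap the two middle lanes: {a0, a1, a2, a3} -> {a0, a2, a1, a3},
// i.e. [a3, a2, a1, a0] -> [a3, a1, a2, a0]
static inline float32x4_t swap_middle_f32x4(float32x4_t qr0)
{
    float32x2x2_t z = vzip_f32(vget_low_f32(qr0), vget_high_f32(qr0));
    return vcombine_f32(z.val[0], z.val[1]);
}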

How do you load 3 floats using NEON intrinsics

I'm trying to convert this NEON code to intrinsics:
vld1.32 {d0}, [%[pInVertex1]]
flds s2, [%[pInVertex1], #8]
This loads three 32-bit floats from the address in pInVertex1 into d0 and the low half of d1 (s2).
I can't find any equivalent using intrinsics. There is vld1q_f32, but that only works for 4 floats. Does anyone know of an efficient way of doing this (I mean without extra copying)?
The only AArch32 instruction that loads exactly three 32-bit floats is a multiple-load instruction:
@ r0 holds the address of the structure
FLDMIAS r0, {s0-s2}
This can be used either in VFP or Neon code.
I do not know about the corresponding intrinsic.
In DirectXMath I implemented the ARM-NEON version of XMLoadFloat3 as:
float32x2_t x = vld1_f32( reinterpret_cast<const float*>(pSource) );                 // {x, y}
float32x2_t zero = vdup_n_f32(0);
float32x2_t y = vld1_lane_f32( reinterpret_cast<const float*>(pSource)+2, zero, 0 ); // {z, 0}
return vcombine_f32( x, y );                                                          // {x, y, z, 0}
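As a self-contained C sketch of the same idea (the Float3 struct and load_float3 name are mine, not the DirectXMath API):
#include <arm_neon.h>

typedef struct { float x, y, z; } Float3;

static inline float32x4_t load_float3(const Float3* pSource)
{
    float32x2_t xy = vld1_f32(&pSource->x);                        // {x, y}
    float32x2_t z0 = vld1_lane_f32(&pSource->z, vdup_n_f32(0), 0); // {z, 0}
    return vcombine_f32(xy, z0);                                   // {x, y, z, 0}
}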

Intel FMA Instructions Offer Zero Performance Advantage

Consider the following instruction sequence using Haswell's FMA instructions:
__m256 r1 = _mm256_xor_ps (r1, r1);
r1 = _mm256_fmadd_ps (rp1, m6, r1);
r1 = _mm256_fmadd_ps (rp2, m7, r1);
r1 = _mm256_fmadd_ps (rp3, m8, r1);
__m256 r2 = _mm256_xor_ps (r2, r2);
r2 = _mm256_fmadd_ps (rp1, m3, r2);
r2 = _mm256_fmadd_ps (rp2, m4, r2);
r2 = _mm256_fmadd_ps (rp3, m5, r2);
__m256 r3 = _mm256_xor_ps (r3, r3);
r3 = _mm256_fmadd_ps (rp1, m0, r3);
r3 = _mm256_fmadd_ps (rp2, m1, r3);
r3 = _mm256_fmadd_ps (rp3, m2, r3);
The same computation can be expressed using non-FMA instructions as follows:
__m256 i1 = _mm256_mul_ps (rp1, m6);
__m256 i2 = _mm256_mul_ps (rp2, m7);
__m256 i3 = _mm256_mul_ps (rp3, m8);
__m256 r1 = _mm256_xor_ps (r1, r1);
r1 = _mm256_add_ps (i1, i2);
r1 = _mm256_add_ps (r1, i3);
i1 = _mm256_mul_ps (rp1, m3);
i2 = _mm256_mul_ps (rp2, m4);
i3 = _mm256_mul_ps (rp3, m5);
__m256 r2 = _mm256_xor_ps (r2, r2);
r2 = _mm256_add_ps (i1, i2);
r2 = _mm256_add_ps (r2, i3);
i1 = _mm256_mul_ps (rp1, m0);
i2 = _mm256_mul_ps (rp2, m1);
i3 = _mm256_mul_ps (rp3, m2);
__m256 r3 = _mm256_xor_ps (r3, r3);
r3 = _mm256_add_ps (i1, i2);
r3 = _mm256_add_ps (r3, i3);
One would expect the FMA version to provide some performance advantage over the non-FMA version.
But unfortunately, in this case, there is zero (0) performance improvement.
Can anyone help me understand why?
I measured both approaches on a core i7-4790 based machine.
UPDATE:
So I analyzed the generated machine code and determined that the MSFT VS2013 C++ compiler was generating machine code such that the dependency chains of r1 and r2 could dispatch in parallel, since Haswell has 2 FMA pipes.
r3 must dispatch after r1, so in this case the second FMA pipe is idle.
I thought that if I unrolled the loop to do 6 sets of FMAs instead of 3, I could keep all the FMA pipes busy on every iteration.
Unfortunately, when I checked the assembly dump in this case, the MSFT compiler did not choose register assignments that would have allowed the kind of parallel dispatch I was looking for, and I verified that I didn't get the performance increase I was hoping for.
Is there a way I can change my C code (using intrinsics) to enable the compiler to generate better code?
You didn't provide a full code sample that includes the surrounding loop (presumably there is one), so it is hard to answer definitively, but the main problem I see is that the latency of the dependency chains of your FMA code is considerably longer than that of your multiply + addition code.
Each of the three blocks in your FMA code is doing the same independent operation:
TOTAL += A1 * B1;
TOTAL += A2 * B2;
TOTAL += A3 * B3;
As it is structured, each operation depends on the previous one, since each reads and writes TOTAL. So the latency of this chain of operations is 3 ops x 5 cycles/FMA = 15 cycles.
In your re-written version without FMA, the dependency chain on TOTAL is now broken, since you've done:
TOTAL_1 = A1 * B1; # 1
TOTAL_2 = A2 * B2; # 2
TOTAL_3 = A3 * B3; # 3
TOTAL_1_2 = TOTAL_1 + TOTAL_2; # 4, depends on 1,2
TOTAL = TOTAL_1_2 + TOTAL_3; # 5, depends on 3,4
The first three MUL instructions can execute independently since they don't have any dependencies. The two add instructions are serially dependent on the multiplications. The latency of this sequence is thus 5 + 3 + 3 = 11.
So the latency of the second method is lower, even though it uses more CPU resources (5 total instructions issued). It is certainly possible then, depending on how the overall loop is structured, that the lower latency cancels out the throughput advantages of FMA for this code - if it is at least partly latency bound.
For a more comprehensive static analysis, I highly recommend Intel's IACA - which can take a loop iteration like the above, and tell you exactly what the bottleneck is, at least in the best case scenario. It can identify the critical paths in the loop, whether you are latency bound, etc.
Another possibility is that you are memory bound (latency or throughput), in which case you'll also see similar behavior for FMA vs MUL + ADD.
re: your edit: Your code has three dependency chains (r1, r2, and r3), so it can keep three FMAs in flight at once. FMA on Haswell is 5c latency, one per 0.5c throughput, so the machine can sustain 10 FMAs in flight.
If your code is in a loop, and the inputs to one iteration aren't generated by the previous iteration, then you could be getting 10 FMAs in flight that way. (i.e. no loop-carried dependency chain involving the FMAs). But since you don't see a perf gain, there's probably a dep chain causing throughput to be limited by latency.
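For example, if there is a single accumulator carried across loop iterations, a common fix is to split the sum over several accumulators and combine them at the end. A hedged sketch (not the original poster's code; the dot-product loop and names are mine):
#include <immintrin.h>
#include <stddef.h>

// Hide FMA latency by keeping several independent accumulators in flight.
// Handles n in multiples of 32 floats; a remainder loop is omitted for brevity.
static __m256 dot8_sketch(const float* a, const float* b, size_t n)
{
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    for (size_t i = 0; i + 32 <= n; i += 32) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),      _mm256_loadu_ps(b + i),      acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),  _mm256_loadu_ps(b + i + 8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), acc3);
    }
    // Combine the independent partial sums only once, outside the loop.
    return _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
}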
You didn't post the ASM you're getting from MSVC, but you claim something about register assignments. xorps same,same is a recognized zeroing idiom that starts a new dependency chain, just like using a register as a write-only operand (e.g. the destination of a non-FMA AVX instruction).
It's highly unlikely that the code could be correct but still contain a dependency of r3 on r1. Make sure you understand that out-of-order execution with register renaming allows separate dependency chains to use the same register.
BTW, instead of __m256 r1 = _mm256_xor_ps (r1, r1);, you should use __m256 r1 = _mm256_setzero_ps();. You should avoid using the variable you're declaring in its own initializer! Compilers sometimes make silly code when you use uninitialized vectors, e.g. loading garbage from stack memory, or doing an extra xorps.
Even better would be:
__m256 r1 = _mm256_mul_ps (rp1, m6);
r1 = _mm256_fmadd_ps (rp2, m7, r1);
r1 = _mm256_fmadd_ps (rp3, m8, r1);
This avoids needing an xorps to zero a reg for the accumulator.
On Broadwell, mulps has lower latency than FMA.
On Skylake, FMA/mul/add are all 4c latency, one per 0.5c throughput. They dropped the separate adder from port1 and do it on the FMA unit. They shaved a cycle of latency off the FMA unit.

Testing equality between two __m128i variables

If I want to do a bitwise equality test between two __m128i variables, am I required to use an SSE instruction or can I use ==? If not, which SSE instruction should I use?
Although using _mm_movemask_epi8 is one solution, if you have a processor with SSE4.1 I think a better solution is to use an instruction which sets the zero or carry flag in the FLAGS register. This saves a test or cmp instruction.
To do this, you could write:
if (_mm_test_all_ones(_mm_cmpeq_epi8(v0, v1))) {
    // v0 == v1
}
Edit: as Paul R pointed out _mm_test_all_ones generates two instructions: pcmpeqd and ptest. With _mm_cmpeq_epi8 that's three instructions total. Here's a better solution which only uses two instructions in total:
__m128i neq = _mm_xor_si128(v0, v1);
if (_mm_test_all_zeros(neq, neq)) {
    // v0 == v1
}
This generates
pxor %xmm1, %xmm0
ptest %xmm0, %xmm0
You can use a compare and then extract a mask from the comparison result:
__m128i vcmp = _mm_cmpeq_epi8(v0, v1); // PCMPEQB
uint16_t vmask = _mm_movemask_epi8(vcmp); // PMOVMSKB
if (vmask == 0xffff)
{
// v0 == v1
}
This works with SSE2 and later.
As noted by @Zboson, if you have SSE 4.1 then you can do it like this, which may be slightly more efficient, as it's two SSE instructions and then a test on a flag (ZF):
__m128i vcmp = _mm_xor_si128(v0, v1); // PXOR
if (_mm_testz_si128(vcmp, vcmp)) // PTEST (requires SSE 4.1)
{
// v0 == v1
}
FWIW I just benchmarked both of these implementations on a Haswell Core i7 using clang to compile the test harness and the timing results were very similar - the SSE4 implementation appears to be very slightly faster but it's hard to measure the difference.
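Wrapped up as a small helper (a sketch; the function name is mine), the SSE2 version looks like:
#include <emmintrin.h>  // SSE2
#include <stdbool.h>

// Returns true if all 128 bits of a and b are equal (SSE2 only).
static inline bool m128i_equal(__m128i a, __m128i b)
{
    __m128i cmp = _mm_cmpeq_epi8(a, b);       // 0xFF in each byte that matches
    return _mm_movemask_epi8(cmp) == 0xFFFF;  // all 16 bytes matched
}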
Consider the SSE4.1 instruction ptest, but note the semantics of _mm_testc_si128 carefully:
if (_mm_testc_si128(v0, v1)) { /* every bit set in v1 is also set in v0 */ }
else { /* some bit of v1 is not set in v0 */ }
_mm_testc_si128 computes the bitwise AND of the second argument with the complement of the first and returns 1 if the result is all zeros (the carry flag from ptest). That is a subset test, not a general equality test; for equality, use the XOR + _mm_testz_si128 approach shown above.

How to reorder a quadword vector data using Neon Intrinsics?

The question is related to ARM NEON intrinsics.
I am using ARM NEON intrinsics for an FIR implementation.
I want to reorder a quadword vector data.
For example,
There are four 32-bit elements in a NEON register - say, Q0 - which is 128 bits wide.
A3 A2 A1 A0
I want to reorder Q0 as A0 A1 A2 A3.
Is there any option to do this?
Reading http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html together with the ARM infocenter, I think the following would do what you ask:
uint32x2_t dvec_h = vget_high_u32(qvec); // {A2, A3}
uint32x2_t dvec_l = vget_low_u32(qvec);  // {A0, A1}
dvec_h = vrev64_u32(dvec_h);             // {A3, A2}
dvec_l = vrev64_u32(dvec_l);             // {A1, A0}
qvec = vcombine_u32(dvec_h, dvec_l);     // lanes {A3, A2, A1, A0}
In assembly, this could be written simply as:
VSWP d0, d1
VREV64.32 q0, q0
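For completeness, the intrinsic version can be packaged as a self-contained function (a sketch; the function name is mine):
#include <arm_neon.h>

// Reverse the four 32-bit lanes: {A0, A1, A2, A3} -> {A3, A2, A1, A0}
static inline uint32x4_t reverse_u32x4(uint32x4_t qvec)
{
    uint32x2_t hi = vrev64_u32(vget_high_u32(qvec)); // {A3, A2}
    uint32x2_t lo = vrev64_u32(vget_low_u32(qvec));  // {A1, A0}
    return vcombine_u32(hi, lo);                     // {A3, A2, A1, A0}
}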
