average operation ARM NEON

I need to compute the same operation as the SSE one:
__m128i result1=_mm_avg_epu8 (upper, lower);
With NEON I do the following:
uint8x16_t result1=vhaddq_u8(upper, lower);
The results should be the same but with the SSE instruction I obtain:
91cb c895 aaa3 b0d4 cfc0 c1b0 aac7 b9b9
whereas with the NEON instruction I obtain:
91ca c894 a9a2 b0d3 cec0 c1af aac7 b8b8
I don't understand why the two results are different. Can you help me?

The Neon "halving add" operation vhadd works like this:
A = (B + C) >> 1
whereas the SSE average intrinsic _mm_avg_epu8 does this:
A = (B + C + 1) >> 1
In other words, NEON does a truncating average with its "halving add" operation, whereas SSE correctly rounds the result. For example, with B = 5 and C = 6, vhadd gives (5 + 6) >> 1 = 5 while _mm_avg_epu8 gives (5 + 6 + 1) >> 1 = 6, which accounts for the off-by-one differences above.
Fortunately there is a Neon instruction which rounds in the same way as SSE's _mm_avg_epu8 - it's called vrhadd - Vector Rounding Halving Add.

You could use vrhadd [1] [2].
Vector rounding halving add: vrhadd -> Vr[i]:=(Va[i]+Vb[i]+1)>>1
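A minimal sketch of the drop-in replacement for the question's line (wrapper name is my own; assumes the same upper/lower operands):
#include <arm_neon.h>
// Rounding halving add: matches SSE _mm_avg_epu8, i.e. (a + b + 1) >> 1 per byte
uint8x16_t average_u8(uint8x16_t upper, uint8x16_t lower) {
    return vrhaddq_u8(upper, lower);
}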

Related

Why don't GCC and Clang optimize multiplication by 2^n with a float to integer PADDD of the exponent, even with -ffast-math?

Considering this function,
float mulHalf(float x) {
    return x * 0.5f;
}
the following function produces the same result with normal input/output.
#include <immintrin.h>
float mulHalf_opt(float x) {
    __m128i e = _mm_set1_epi32(-1 << 23);
    __asm__ ("paddd\t%0, %1" : "+x"(x) : "xm"(e));
    return x;
}
This is the assembly output with -O3 -ffast-math.
mulHalf:
        mulss   xmm0, DWORD PTR .LC0[rip]
        ret
mulHalf_opt:
        paddd   xmm0, XMMWORD PTR .LC1[rip]
        ret
-ffast-math enables -ffinite-math-only which "assumes that arguments and results are not NaNs or +-Infs" [1].
So, with -ffast-math on, the compiled output of mulHalf might as well use paddd if doing so produces faster code within the tolerance of -ffast-math.
I got the following tables from the Intel Intrinsics Guide.
(MULSS)
Architecture   Latency   Throughput (CPI)
Skylake        4         0.5
Broadwell      3         0.5
Haswell        5         0.5
Ivy Bridge     5         1
(PADDD)
Architecture   Latency   Throughput (CPI)
Skylake        1         0.33
Broadwell      1         0.5
Haswell        1         0.5
Ivy Bridge     1         0.5
Clearly, paddd is a faster instruction. Then I thought maybe it's because of the bypass delay between integer and floating-point units.
This answer shows a table from Agner Fog.
Processor                      Bypass delay, clock cycles
Intel Core 2 and earlier       1
Intel Nehalem                  2
Intel Sandy Bridge and later   0-1
Intel Atom                     0
AMD                            2
VIA Nano                       2-3
Seeing this, paddd still seems like a winner, especially on CPUs later than Sandy Bridge, but specifying -march for recent CPUs just changes mulss to vmulss, which has a similar latency/throughput.
Why don't GCC and Clang optimize multiplication by 2^n with a float to paddd even with -ffast-math?
This fails for an input of 0.0f, which -ffast-math doesn't rule out. (Even though technically that's a special case of a subnormal that just happens to also have a zero mantissa.)
Integer subtraction would wrap to an all-ones exponent field, and flip the sign bit, so you'd get 0.0f * 0.5f producing -Inf, which is simply not acceptable.
@chtz points out that the +0.0f case can be repaired by using psubusw, but that still fails for -0.0f -> +Inf. So unfortunately that's not usable either, even with -ffast-math allowing the "wrong" sign of zero. But being fully wrong for infinities and NaNs is also undesirable even with fast-math.
Other than that, yes I think this would work, and pay for itself in bypass latency vs. ALU latency on CPUs other than Nehalem, even if used between other FP instructions.
The 0.0 behaviour is a showstopper. Besides that, the underflow behaviour is a lot less desirable than with FP multiply for other inputs, e.g. producing a subnormal even when FTZ (flush to zero on output) is set. Code that reads it with DAZ set (denormals are zero) would still handle it properly, but the FP bit-pattern might also be wrong for a number with the minimum normalized exponent (encoded as 1) and a non-zero mantissa: e.g. you could get a bit-pattern of 0x00000001 as a result of multiplying a normalized number by 0.5f.
Even if not for the 0.0f showstopper, this weirdness might be more than GCC would be willing to inflict on people. So I wouldn't expect it even for cases where GCC can prove non-zero, unless it could also prove far from FLT_MIN. That may be rare enough not to be worth looking for.
You can certainly do it manually when you know it's safe, although much more convenient with SIMD intrinsics. I'd expect rather bad asm from scalar type-punning, probably 2x movd around integer sub, instead of keeping it in an XMM for paddd when you only want the low scalar FP element.
See Godbolt for several attempts, including straightforward intrinsics, which clang compiles to just a memory-source paddd like we hoped. Clang's shuffle optimizer sees that the upper elements are "dead" (_mm_cvtss_f32 only reads the bottom one), and is able to treat them as "don't care".
#include <immintrin.h>

// clang compiles this fully efficiently
// others waste an instruction or more on _mm_set_ss to zero the upper XMM elements
float mulHalf_opt_intrinsics(float x) {
    __m128i e = _mm_set1_epi32(-1u << 23);
    __m128 vx = _mm_set_ss(x);
    vx = _mm_castsi128_ps( _mm_add_epi32(_mm_castps_si128(vx), e) );
    return _mm_cvtss_f32(vx);
}
And a plain scalar version. I haven't tested to see if it can auto-vectorize, but it might conceivably do so. Without that, GCC and clang do both movd/add/movd (or sub) to bounce the value to a GP-integer register.
#include <stdint.h>
#include <string.h>

float mulHalf_opt_memcpy_scalar(float x) {
    uint32_t xi;
    memcpy(&xi, &x, sizeof(x));
    xi += -1u << 23;
    memcpy(&x, &xi, sizeof(x));
    return x;
}
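To make the 0.0f showstopper concrete, a small test of my own (a sketch, reusing the two functions defined above): adding 0xFF800000 to the all-zero bit pattern yields exactly the -Inf encoding described above, while normal inputs come out right.
#include <stdio.h>
int main(void) {
    printf("%f\n", mulHalf(0.0f));                    // 0.000000
    printf("%f\n", mulHalf_opt_memcpy_scalar(0.0f));  // -inf: 0x00000000 + 0xFF800000 = 0xFF800000
    printf("%f\n", mulHalf(3.0f));                    // 1.500000
    printf("%f\n", mulHalf_opt_memcpy_scalar(3.0f));  // 1.500000: exponent trick works for normal inputs
    return 0;
}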

Are there ARM Neon instructions for the round function?

I am trying to implement the round function using ARM NEON intrinsics.
The function looks like this:
float roundf(float x) {
    return signbit(x) ? ceil(x - 0.5) : floor(x + 0.5);
}
Is there a way to do this using Neon intrinsics? If not, how to use Neon intrinsics to implement this function?
Edit: After computing the product of two floats, I call roundf (on ARMv7 and ARMv8).
My compiler is clang.
This can be done with vrndaq_f32 for ARMv8: https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:#navigationhierarchiessimdisa=[Neon]&q=vrndaq_f32
How can I do this on ARMv7?
Edit: My implementation:
// input: float32x4_t arg
float32x4_t vector_zero = vdupq_n_f32(0.f);
float32x4_t neg_half = vdupq_n_f32(-0.5f);
float32x4_t pos_half = vdupq_n_f32(0.5f);
uint32x4_t mask = vcgeq_f32(arg, vector_zero);                            // lanes where arg >= 0
uint32x4_t mask_neg = vbicq_u32(vreinterpretq_u32_f32(neg_half), mask);   // -0.5f where arg < 0
uint32x4_t mask_pos = vandq_u32(vreinterpretq_u32_f32(pos_half), mask);   // +0.5f where arg >= 0
arg = vaddq_f32(arg, vreinterpretq_f32_u32(mask_pos));
arg = vaddq_f32(arg, vreinterpretq_f32_u32(mask_neg));
int32x4_t arg_int32 = vcvtq_s32_f32(arg);                                 // truncate toward zero
arg = vcvtq_f32_s32(arg_int32);
Is there a better way to implement this?
It's important that you define which form of rounding you really want. See Wikipedia for a sense of how many rounding choices there are.
From your code snippet, you are asking for commercial or symmetric rounding, which rounds ties away from zero. For ARMv8 / ARM64, vrndaq_f32 should do that.
The SSE4 _mm_round_ps and ARMv8 NEON vrndnq_f32 do banker's rounding, i.e. round-to-nearest-even.
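A quick illustration of the difference on a tie (my own sketch, not from the answer; assumes the ARMv8 rounding intrinsics are available):
#include <arm_neon.h>
void tie_example(void) {
    float32x4_t v = vdupq_n_f32(2.5f);
    float32x4_t away = vrndaq_f32(v);  // 3.0f in every lane: ties rounded away from zero (like roundf)
    float32x4_t even = vrndnq_f32(v);  // 2.0f in every lane: ties rounded to nearest even (banker's rounding)
    (void)away; (void)even;
}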
Your solution is VERY expensive, both in cycle counts and register utilization.
Provided -(2^30) <= arg < (2^30), you can do the following:
int32x4_t argi = vcvtq_n_s32_f32(arg, 1);   // to fixed point with 1 fractional bit (arg * 2), truncating toward zero
argi = vsraq_n_s32(argi, argi, 31);         // shift-right-accumulate the sign bit: subtracts 1 for negative lanes
argi = vrshrq_n_s32(argi, 1);               // rounding shift right by 1: (x + 1) >> 1
arg = vcvtq_f32_s32(argi);                  // back to float
It doesn't require any register other than arg itself, and it is done with 4 inexpensive instructions. It works on both AArch32 and AArch64.
Godbolt link
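A minimal sketch of my own (wrapper name assumed, untested) that picks the ARMv8 instruction when available and falls back to the fixed-point trick above on ARMv7:
#include <arm_neon.h>

static inline float32x4_t round_half_away_f32(float32x4_t arg) {
#if defined(__aarch64__)
    return vrndaq_f32(arg);                    // round to nearest, ties away from zero
#else
    int32x4_t argi = vcvtq_n_s32_f32(arg, 1);  // valid for -(2^30) <= arg < 2^30
    argi = vsraq_n_s32(argi, argi, 31);
    argi = vrshrq_n_s32(argi, 1);
    return vcvtq_f32_s32(argi);
#endif
}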

pairwise addition in neon

I want to add the values at indices 0 and 1 of an int64x2_t vector in NEON.
I am not able to find any pairwise-add instruction that does this.
int64x2_t sum_64_2;
// I am expecting the result to be:
// int64_t result = sum_64_2[0] + sum_64_2[1];
Is there any instruction in NEON to do this?
You can write it in two ways. This one explicitly uses the NEON VADD.I64 instruction:
int64x1_t f(int64x2_t v)
{
    return vadd_s64(vget_high_s64(v), vget_low_s64(v));
}
and the following one relies on the compiler to correctly select between using the NEON and general integer instruction sets. GCC 4.9 does the right thing in this case, but other compilers may not.
int64x1_t g(int64x2_t v)
{
    int64x1_t r;
    r = vset_lane_s64(vgetq_lane_s64(v, 0) + vgetq_lane_s64(v, 1), r, 0);
    return r;
}
When targeting ARM, the code generation is efficient. For AArch64, extra instructions are used, but the compiler could do better.
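On AArch64 specifically, there is also an add-across-vector intrinsic that sums the two lanes in one go (typically a single ADDP); a sketch of my own, not part of the answer above and AArch64-only:
#include <arm_neon.h>

#if defined(__aarch64__)
int64_t sum_lanes(int64x2_t v) {
    return vaddvq_s64(v);   // add across vector: v[0] + v[1]
}
#endif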

converting SSE code to AVX - cost of _mm256_and_ps

I'm converting SSE2 sine and cosine functions (from Julien Pommier's sse_mathfun.h; based on the CEPHES sinf function) to use AVX in order to accept 8 float vectors or 4 doubles.
So, Julien's function sin_ps becomes sin_ps8 (for 8 floats) and sin_pd4 for 4 doubles. (The "advanced" editor here fails to accept my code, so please visit http://arstechnica.com/civis/viewtopic.php?f=20&t=1227375 to see it.)
Testing with clang 3.3 under Mac OS X 10.6.8 running on a 2011 Core2 i7 @ 2.7 GHz, benchmarking results look like this:
sinf    .. -> 27.7 millions of vector evaluations/second over 5.56e+07 iters (standard, scalar sinf() function)
sin_ps  .. -> 41.0 millions of vector evaluations/second over 8.22e+07 iters
sin_pd4 .. -> 40.2 millions of vector evaluations/second over 8.06e+07 iters
sin_ps8 .. ->  2.5 millions of vector evaluations/second over 5.1e+06 iters
The cost of sin_ps8 is downright frightening, and it seems to be due to the use of _mm256_castsi256_ps. In fact, commenting out the line "poly_mask = _mm256_castsi256_ps(emmm2);" results in more normal performance.
sin_pd4 uses _mm_castsi128_pd, but it appears it is not (just) the mix of SSE and AVX instructions that is biting me in sin_ps8: when I emulate the _mm256_castsi256_ps calls with 2 calls to _mm_castsi128_ps, performance doesn't improve. emm2 and emm0 are pointers to emmm2 and emmm0, both v8si instances and thus (a priori) correctly aligned to 32-byte boundaries.
See sse_mathfun.h and sse_mathfun_test.c for compilable code.
Is there a(n easy) way to avoid the penalty I'm seeing?
Transferring stuff out of registers into memory isn't usually a good idea. You are doing this every time you store into a pointer.
Instead of this:
{ ALIGN32_BEG v4sf *yy ALIGN32_END = (v4sf*) &y;
emm2[0] = _mm_and_si128(_mm_add_epi32( _mm_cvttps_epi32( yy[0] ), _v4si_pi32_1), _v4si_pi32_inv1),
emm2[1] = _mm_and_si128(_mm_add_epi32( _mm_cvttps_epi32( yy[1] ), _v4si_pi32_1), _v4si_pi32_inv1);
yy[0] = _mm_cvtepi32_ps(emm2[0]),
yy[1] = _mm_cvtepi32_ps(emm2[1]);
}
/* get the swap sign flag */
emm0[0] = _mm_slli_epi32(_mm_and_si128(emm2[0], _v4si_pi32_4), 29),
emm0[1] = _mm_slli_epi32(_mm_and_si128(emm2[1], _v4si_pi32_4), 29);
/* get the polynom selection mask
there is one polynom for 0 <= x <= Pi/4
and another one for Pi/4<x<=Pi/2
Both branches will be computed.
*/
emm2[0] = _mm_cmpeq_epi32(_mm_and_si128(emm2[0], _v4si_pi32_2), _mm_setzero_si128()),
emm2[1] = _mm_cmpeq_epi32(_mm_and_si128(emm2[1], _v4si_pi32_2), _mm_setzero_si128());
((v4sf*)&poly_mask)[0] = _mm_castsi128_ps(emm2[0]);
((v4sf*)&poly_mask)[1] = _mm_castsi128_ps(emm2[1]);
swap_sign_bit = _mm256_castsi256_ps(emmm0);
Try something like this:
__m128i emm2a = _mm_and_si128(_mm_add_epi32( _mm_cvttps_epi32(_mm256_castps256_ps128(y)), _v4si_pi32_1), _v4si_pi32_inv1);
__m128i emm2b = _mm_and_si128(_mm_add_epi32( _mm_cvttps_epi32(_mm256_extractf128_ps(y, 1)), _v4si_pi32_1), _v4si_pi32_inv1);
y = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_cvtepi32_ps(emm2a)), _mm_cvtepi32_ps(emm2b), 1);
/* get the swap sign flag */
__m128i emm0a = _mm_slli_epi32(_mm_and_si128(emm2a, _v4si_pi32_4), 29);
__m128i emm0b = _mm_slli_epi32(_mm_and_si128(emm2b, _v4si_pi32_4), 29);
swap_sign_bit = _mm256_castsi256_ps(_mm256_insertf128_si256(_mm256_castsi128_si256(emm0a), emm0b, 1));
/* get the polynom selection mask
there is one polynom for 0 <= x <= Pi/4
and another one for Pi/4<x<=Pi/2
Both branches will be computed.
*/
emm2a = _mm_cmpeq_epi32(_mm_and_si128(emm2a, _v4si_pi32_2), _mm_setzero_si128()),
emm2b = _mm_cmpeq_epi32(_mm_and_si128(emm2b, _v4si_pi32_2), _mm_setzero_si128());
poly_mask = _mm256_castsi256_ps(_mm256_insertf128_si256(_mm256_castsi128_si256(emm2a), emm2b, 1));
As mentioned in comments, cast intrinsics are purely compile-time and emit no instructions.
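The general pattern being used here, pulled out into a standalone helper of my own (a sketch; AVX1 has no 256-bit integer instructions, so the integer work is done on the two 128-bit halves and the results recombined):
#include <immintrin.h>

// Example: 8 x 32-bit integer add on AVX1-only hardware
static inline __m256i avx1_add_epi32(__m256i a, __m256i b) {
    __m128i lo = _mm_add_epi32(_mm256_castsi256_si128(a),
                               _mm256_castsi256_si128(b));
    __m128i hi = _mm_add_epi32(_mm256_extractf128_si256(a, 1),
                               _mm256_extractf128_si256(b, 1));
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}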
Maybe you could compare your code to the already working AVX extension of Julien Pommier's SSE math functions?
http://software-lisc.fbk.eu/avx_mathfun/
This code works in GCC but not MSVC and only supports floats (float8) but I think you could easily extend it to use doubles (double4) as well. A quick comparison of your sin function shows that they are quite similar except for the SSE2 integer part.

128-bit rotation using ARM Neon intrinsics

I'm trying to optimize my code using Neon intrinsics. I have a 24-bit rotation over a 128-bit array (8 uint16_t elements).
Here is my c code:
uint16_t rotated[8];
uint16_t temp[8];
uint16_t j;
for(j = 0; j < 8; j++)
{
    // Rotation <<< 24 over 128 bits: (x << shift) | (x >> (16 - shift))
    rotated[j] = ((temp[(j+1) % 8] << 8) & 0xffff) | ((temp[(j+2) % 8] >> 8) & 0x00ff);
}
I've checked the GCC documentation about NEON intrinsics, and there is no instruction for vector rotation. Moreover, I've tried to do this using vshlq_n_u16(temp, 8), but all the bits shifted outside a uint16_t word are lost.
How can I achieve this using NEON intrinsics? By the way, is there better documentation about GCC NEON intrinsics?
After some reading on Arm Community Blogs, I've found this :
VEXT: Extract
VEXT extracts a new vector of bytes from a pair of existing vectors. The bytes in the new vector are from the top of the first operand, and the bottom of the second operand. This allows you to produce a new vector containing elements that straddle a pair of existing vectors. VEXT can be used to implement a moving window on data from two vectors, useful in FIR filters. For permutation, it can also be used to simulate a byte-wise rotate operation, when using the same vector for both input operands.
The following NEON GCC intrinsic does the same as the assembly shown there:
uint16x8_t vextq_u16 (uint16x8_t, uint16x8_t, const int)
So the 24-bit rotation over a full 128-bit vector (not over each element) can be done as follows:
uint16x8_t input;
uint16x8_t t0;
uint16x8_t t1;
uint16x8_t rotated;
t0 = vextq_u16(input, input, 1);   // t0[j] = input[(j+1) % 8]
t0 = vshlq_n_u16(t0, 8);
t1 = vextq_u16(input, input, 2);   // t1[j] = input[(j+2) % 8]
t1 = vshrq_n_u16(t1, 8);
rotated = vorrq_u16(t0, t1);
Use vext.8 to concat a vector with itself and give you the 16-byte window that you want (in this case offset by 3 bytes).
Doing this with intrinsics requires casting to keep the compiler happy, but it's still a single instruction:
#include <arm_neon.h>
uint16x8_t byterotate3(uint16x8_t input) {
    uint8x16_t tmp = vreinterpretq_u8_u16(input);
    uint8x16_t rotated = vextq_u8(tmp, tmp, 16-3);
    return vreinterpretq_u16_u8(rotated);
}
g++5.4 -O3 -march=armv7-a -mfloat-abi=hard -mfpu=neon (on Godbolt) compiles it to this:
byterotate3(__simd128_uint16_t):
        vext.8  q0, q0, q0, #13
        bx      lr
A count of 16-3 means we left-rotate by 3 bytes. (It means we take 13 bytes from the left vector and 3 bytes from the right vector, so it's also a right-rotate by 13).
Related: x86 also has an instruction that takes a sliding window into the concatenation of two registers: palignr (added in SSSE3).
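For comparison, a sketch of my own of the same 3-byte rotate with the SSSE3 intrinsic _mm_alignr_epi8 (function name assumed):
#include <immintrin.h>

// Take a 16-byte sliding window into x:x, 13 bytes in, mirroring vext.8 #13 above.
__m128i byterotate3_x86(__m128i x) {
    return _mm_alignr_epi8(x, x, 16 - 3);
}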
Maybe I'm missing something about NEON, but I don't understand why the OP's self-answer is using vext.16 (vextq_u16), which has 16-bit granularity. It's not even a different instruction, just an alias for vext.8 which makes it impossible to use an odd-numbered count, requiring extra instructions. The manual for vext.8 says:
VEXT pseudo-instruction
You can specify a datatype of 16, 32, or 64 instead of 8. In this
case, #imm refers to halfwords, words, or doublewords instead of
referring to bytes, and the permitted ranges are correspondingly
reduced.
I'm not 100% sure but I don't think NEON has rotate instructions.
You can compose the rotation operation you require from a left shift, a right shift, and an OR, e.g.:
uint8_t ror(uint8_t in, int rotation)
{
    return (in >> rotation) | (in << (8 - rotation));
}
Just do the same with the NEON intrinsics for left shift, right shift, and OR.
uint16x8_t temp;
enum { rot = 3 };   // the _n_ shift intrinsics require a compile-time constant count
uint16x8_t rotated = vorrq_u16( vshlq_n_u16(temp, rot), vshrq_n_u16(temp, 16 - rot) );
See http://en.wikipedia.org/wiki/Circular_shift "Implementing circular shifts."
This will rotate the values inside the lanes. If you want to rotate the lanes themselves use VEXT as described in the other answer.
