I want to switch the two elements stored in a 128-bit (quadword) NEON register.
[a3, a2, a1, a0] --> [a3, a1, a2, a0]
After reading GNU's "ARM NEON Intrinsics" and the ARM ACLE, it seems it can be done as:
// qr0 being the input vector variable of type float32x4_t
float32x2_t hi = vget_high_f32(qr0);
float32x2_t lo = vget_low_f32(qr0);
float32x2x2_t qr0_z = vzip_f32(hi, lo); //may also use transpose
qr0 = vcombine_f32(qr0_z.val[0], qr0_z.val[1]);
My question is: is there any better way to do this via intrinsics? Thank you for reading this.
I used the following code to reverse the 4 scalars in a float32x4_t vector.
{1,2,3,4} -> {4,3,2,1}
float32x4_t Vec = {1,2,3,4};
float32x4_t Rev = vrev64q_f32(Vec);         // {2,1,4,3}
float32x2_t High = vget_high_f32(Rev);      // {4,3}
float32x2_t Low = vget_low_f32(Rev);        // {2,1}
float32x4_t Swap = vcombine_f32(High, Low); // {4,3,2,1}
Can you suggest faster code?
Thank you,
Zvika
That is possibly as good as it gets.
The reverse-engineered code (for aarch64, gcc/clang -O3) would be:
vec = vrev64q_f32(vec);        // {1,2,3,4} -> {2,1,4,3}
return vextq_f32(vec, vec, 2); // {2,1,4,3} -> {4,3,2,1}
On armv7 (gcc 11.2) your original version compiles to
vrev64.32 q0, q0
vswp d0, d1
whereas the other, more compact version compiles to
vrev64.32 q0, q0
vext.32 q0, q0, q0, #2
If you prefer the vswp approach (only on armv7) keep your code as is, since there are no intrinsics for swaps.
On armv7 you could also use
float32x2_t lo = vrev64_f32(vget_high_f32(vec)); // {3,4} -> {4,3}
float32x2_t hi = vrev64_f32(vget_low_f32(vec));  // {1,2} -> {2,1}
return vcombine_f32(lo, hi);                     // {4,3,2,1}
When inlined, and when the result can be produced on another register, this can compile to just two instructions with no dependency between them. Permutations on Cortex-A7 are typically 1 cycle per 64 bits, with 4-cycle latency, so this could be twice as fast as the other approaches.
I am trying to implement round function using ARM Neon intrinsics.
This function looks like this:
float roundf(float x) {
    return signbit(x) ? ceil(x - 0.5) : floor(x + 0.5);
}
Is there a way to do this using Neon intrinsics? If not, how to use Neon intrinsics to implement this function?
Edit:
After multiplying two floats, I call roundf (on both armv7 and armv8).
My compiler is clang.
This can be done with vrndaq_f32 (https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:#navigationhierarchiessimdisa=[Neon]&q=vrndaq_f32) for armv8.
How to do this on armv7?
Edit:
My implementation:
// input: float32x4_t arg
float32x4_t vector_zero = vdupq_n_f32(0.f);
float32x4_t neg_half = vdupq_n_f32(-0.5f);
float32x4_t pos_half = vdupq_n_f32(0.5f);
uint32x4_t mask = vcgeq_f32(arg, vector_zero);                           // all-ones where arg >= 0
uint32x4_t mask_pos = vandq_u32(mask, vreinterpretq_u32_f32(pos_half));  // +0.5f where arg >= 0
uint32x4_t mask_neg = vbicq_u32(vreinterpretq_u32_f32(neg_half), mask);  // -0.5f where arg < 0
arg = vaddq_f32(arg, vreinterpretq_f32_u32(mask_pos));
arg = vaddq_f32(arg, vreinterpretq_f32_u32(mask_neg));
int32x4_t arg_int32 = vcvtq_s32_f32(arg); // truncate toward zero
arg = vcvtq_f32_s32(arg_int32);
Is there a better way to implement this?
It's important that you define which form of rounding you really want. See Wikipedia for a sense of how many rounding choices there are.
From your code snippet, you are asking for commercial or symmetric rounding, which rounds ties away from zero. For ARMv8 / ARM64, vrndaq_f32 should do that.
The SSE4 _mm_round_ps and ARMv8 NEON vrndnq_f32 do banker's rounding, i.e. round-to-nearest (even).
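On ARMv8 the whole function therefore collapses to a single intrinsic; a minimal sketch (the wrapper name round_away_f32x4 is mine):
#include <arm_neon.h>

// ARMv8 / AArch64: vrndaq_f32 maps to FRINTA, round to nearest with ties
// away from zero, matching the scalar roundf() in the question.
static inline float32x4_t round_away_f32x4(float32x4_t x) {
    return vrndaq_f32(x);
}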
Your solution is VERY expensive, both in cycle counts and register utilization.
Provided -(2^30) <= arg < (2^30), you can do the following:
int32x4_t argi = vcvtq_n_s32_f32(arg, 1); // to fixed-point with 1 fractional bit, i.e. trunc(arg * 2)
argi = vsraq_n_s32(argi, argi, 31);       // subtract 1 from negative values (argi += argi >> 31)
argi = vrshrq_n_s32(argi, 1);             // rounding shift right by 1: (argi + 1) >> 1
arg = vcvtq_f32_s32(argi);                // back to float
It doesn't require any register other than arg itself, and it is done with 4 inexpensive instructions. It works on both aarch32 and aarch64.
Godbolt link
I'm trying to find a more efficient way to "rotate" or shift the 32-bit floating point values within an AVX __m256 vector to the right or left by one place.
Such that:
a7, a6, a5, a4, a3, a2, a1, a0
becomes
0, a7, a6, a5, a4, a3, a2, a1
(I don't mind if the data gets lost, as I replace the cell anyway.)
I've already taken a look at this thread: Emulating shifts on 32 bytes with AVX,
but I don't really understand what is going on, and it doesn't explain what _MM_SHUFFLE(0, 0, 3, 0) does as an input parameter.
I'm trying to optimise this code:
_mm256_store_ps(temp, array[POS(ii, jj)]);
_mm256_store_ps(left, array[POS(ii, jj-1)]);
tmp_array[POS(ii, jj)] = _mm256_set_ps(left[0], temp[7], temp[6], temp[5], temp[4], temp[3], temp[2], temp[1]);
I know once a shift is in place, I can use an insert to replace the remaining cell. I feel this will be more efficient than unpacking into a float[8] array and reconstructing.
-- I'd also like to be able to shift both left and right, as I need to perform a similar operation elsewhere.
Any help is greatly appreciated! Thanks!
For AVX2:
Use VPERMPS to do it in one lane-crossing shuffle instruction.
rotated_right = _mm256_permutevar8x32_ps(src, _mm256_set_epi32(0,7,6,5,4,3,2,1)); // dst[i] = src[idx[i]], idx = {1,2,3,4,5,6,7,0}
For AVX (without AVX2)
Since you say the data is coming from memory already, this might be good:
use an unaligned load to get the 7 elements to the right place, solving all the lane-crossing problems.
Then blend the element that wrapped around into that vector of the other 7.
To get the wrapped-around element in place for the blend, maybe use a broadcast-load to get it to the high position. AVX can broadcast-load in one VBROADCASTPS instruction (so set1() is cheap), although it does need the shuffle port on Intel SnB and IvB (the only two Intel microarchitectures with AVX but not AVX2). (See perf links in the x86 tag wiki.)
INSERTPS only works on XMM destinations, and can't reach the upper lane.
You could maybe use VINSERTF128 to do another unaligned load that ends up putting the element you want as the high element in the upper lane (with some don't-care vector in the low lane).
This compiles, but isn't tested.
#include <immintrin.h>

__m256 load_rotr(float *src)
{
#ifdef __AVX2__
    __m256 orig = _mm256_loadu_ps(src);
    __m256 rotated_right = _mm256_permutevar8x32_ps(orig, _mm256_set_epi32(0,7,6,5,4,3,2,1));
    return rotated_right;
#else
    __m256 shifted = _mm256_loadu_ps(src + 1);
    __m256 bcast = _mm256_set1_ps(*src);
    return _mm256_blend_ps(shifted, bcast, 0b10000000);
#endif
}
See the code + asm on Godbolt
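For completeness, a hedged sketch of the VINSERTF128 idea from above (the name load_rotr_insertf128 is mine, and it assumes the three floats just below src are also readable, which the broadcast+blend version does not require):
#include <immintrin.h>

__m256 load_rotr_insertf128(const float *src)
{
    __m256 shifted = _mm256_loadu_ps(src + 1);  // src[1..8]; the top element is a don't-care
    __m128 lo = _mm_loadu_ps(src - 3);          // src[-3..0]: src[0] lands in element 3
    __m256 wrap = _mm256_insertf128_ps(_mm256_undefined_ps(), lo, 1); // src[0] now in element 7
    return _mm256_blend_ps(shifted, wrap, 0b10000000);                // take element 7 from wrap
}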
I'm trying to optimize my code using NEON intrinsics. I need a 24-bit rotation of a 128-bit value stored as an array of 8 uint16_t.
Here is my C code:
uint16_t rotated[8];
uint16_t temp[8];
uint16_t j;
for (j = 0; j < 8; j++)
{
    // Rotation <<< 24 over 128 bits: (x << shift) | (x >> (16 - shift))
    rotated[j] = ((temp[(j+1) % 8] << 8) & 0xffff) | ((temp[(j+2) % 8] >> 8) & 0x00ff);
}
I've checked the GCC documentation on NEON intrinsics and it doesn't have an intrinsic for vector rotations. Moreover, I've tried to do this using vshlq_n_u16(temp, 8), but all the bits shifted outside a uint16_t lane are lost.
How can I achieve this using NEON intrinsics? By the way, is there better documentation for the GCC NEON intrinsics?
After some reading on Arm Community Blogs, I've found this:
VEXT: Extract
VEXT extracts a new vector of bytes from a pair of existing vectors. The bytes in the new vector are from the top of the first operand, and the bottom of the second operand. This allows you to produce a new vector containing elements that straddle a pair of existing vectors. VEXT can be used to implement a moving window on data from two vectors, useful in FIR filters. For permutation, it can also be used to simulate a byte-wise rotate operation, when using the same vector for both input operands.
The following GCC NEON intrinsic maps to that VEXT instruction:
uint16x8_t vextq_u16 (uint16x8_t, uint16x8_t, const int)
So the 24-bit rotation over the full 128-bit vector (not within each element) can be done as follows:
uint16x8_t input;
uint16x8_t t0;
uint16x8_t t1;
uint16x8_t rotated;
t0 = vextq_u16(input, input, 1); // t0[j] = input[(j+1) % 8]
t0 = vshlq_n_u16(t0, 8);         // (input[(j+1) % 8] << 8)
t1 = vextq_u16(input, input, 2); // t1[j] = input[(j+2) % 8]
t1 = vshrq_n_u16(t1, 8);         // (input[(j+2) % 8] >> 8)
rotated = vorrq_u16(t0, t1);     // OR the halves together, as in the scalar loop
Use vext.8 to concat a vector with itself and give you the 16-byte window that you want (in this case rotated by 3 bytes).
Doing this with intrinsics requires casting to keep the compiler happy, but it's still a single instruction:
#include <arm_neon.h>
uint16x8_t byterotate3(uint16x8_t input) {
uint8x16_t tmp = vreinterpretq_u8_u16(input);
uint8x16_t rotated = vextq_u8(tmp, tmp, 16-3);
return vreinterpretq_u16_u8(rotated);
}
g++5.4 -O3 -march=armv7-a -mfloat-abi=hard -mfpu=neon (on Godbolt) compiles it to this:
byterotate3(__simd128_uint16_t):
vext.8 q0, q0, q0, #13
bx lr
A count of 16-3 means we left-rotate by 3 bytes. (The result takes the top 3 bytes of the first source vector and the bottom 13 bytes of the second, so it's also a right-rotate by 13 bytes.)
Related: x86 also has an instruction that takes a sliding window into the concatenation of two registers: palignr (added in SSSE3).
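A hedged sketch of the x86 version for comparison (the name byterotate3_sse is mine; it does the same left-rotate by 3 bytes as the NEON function above):
#include <tmmintrin.h> // SSSE3

__m128i byterotate3_sse(__m128i input) {
    // palignr takes a 16-byte window from the concatenation of two registers;
    // using the same register twice with a count of 16-3 rotates left by 3 bytes.
    return _mm_alignr_epi8(input, input, 16 - 3);
}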
Maybe I'm missing something about NEON, but I don't understand why the OP's self-answer is using vext.16 (vextq_u16), which has 16-bit granularity. It's not even a different instruction, just an alias for vext.8 which makes it impossible to use an odd-numbered count, requiring extra instructions. The manual for vext.8 says:
VEXT pseudo-instruction
You can specify a datatype of 16, 32, or 64 instead of 8. In this
case, #imm refers to halfwords, words, or doublewords instead of
referring to bytes, and the permitted ranges are correspondingly
reduced.
I'm not 100% sure but I don't think NEON has rotate instructions.
You can compose the rotation operation you require from a left shift, a right shift and an OR, e.g.:
uint8_t ror(uint8_t in, int rotation)
{
    return (in >> rotation) | (in << (8 - rotation));
}
Just do the same with the NEON intrinsics for left shift, right shift and OR:
uint16x8_t temp;
enum { rot = 8 }; // the shift count must be a compile-time constant for the _n_ intrinsics
uint16x8_t rotated = vorrq_u16(vshlq_n_u16(temp, rot), vshrq_n_u16(temp, 16 - rot));
See http://en.wikipedia.org/wiki/Circular_shift "Implementing circular shifts."
This will rotate the values inside the lanes. If you want to rotate the lanes themselves use VEXT as described in the other answer.
This question is related to ARM NEON intrinsics.
I am using ARM NEON intrinsics for an FIR implementation.
I want to reorder a quadword vector data.
For example,
There are four 32-bit elements in a NEON register - say, Q0 - which is 128 bits wide.
A3 A2 A1 A0
I want to reorder Q0 as A0 A1 A2 A3.
Is there any option to do this?
Reading http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html together with the ARM infocenter, I think the following would do what you ask:
uint32x2_t dvec_h = vget_high_u32(qvec); // high half: A3 A2
uint32x2_t dvec_l = vget_low_u32(qvec);  // low half:  A1 A0
dvec_h = vrev64_u32(dvec_h);             // A2 A3
dvec_l = vrev64_u32(dvec_l);             // A0 A1
qvec = vcombine_u32(dvec_h, dvec_l);     // A0 A1 A2 A3 (dvec_h becomes the new low half)
In assembly, this could be written simply as:
VSWP d0, d1
VREV64.32 q0, q0