Neon equivalent of _mm_madd_epi16 and _mm_maddubs_epi16

I am trying to port some SSE code to Neon.
I could not find equivalent intrinsics for _mm_maddubs_epi16 and _mm_madd_epi16.
Any insights on how to implement these with Neon?

You might want to look at the implementations in SIMDe for _mm_madd_epi16 and _mm_maddubs_epi16 (note to future readers: you might want to check the latest version of those files since implementations in SIMDe get improved sometimes and it's very unlikely I'll remember to update this answer). The implementations below are just copied from there.
If you're on AArch64, for _mm_madd_epi16 you probably want a vmull_s16 + vget_low_s16 for the low half and a vmull_high_s16 for the high half, then vpaddq_s32 to add them together into a 128-bit result. Without AArch64 you'll need two vmull_s16 calls (one with vget_low_s16 and one with vget_high_s16), and since vpaddq_s32 isn't supported you'll need two vpadd_s32 calls plus a vcombine_s32:
#if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
int32x4_t pl = vmull_s16(vget_low_s16(a_.neon_i16), vget_low_s16(b_.neon_i16));
int32x4_t ph = vmull_high_s16(a_.neon_i16, b_.neon_i16);
r_.neon_i32 = vpaddq_s32(pl, ph);
#elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
int32x4_t pl = vmull_s16(vget_low_s16(a_.neon_i16), vget_low_s16(b_.neon_i16));
int32x4_t ph = vmull_s16(vget_high_s16(a_.neon_i16), vget_high_s16(b_.neon_i16));
int32x2_t rl = vpadd_s32(vget_low_s32(pl), vget_high_s32(pl));
int32x2_t rh = vpadd_s32(vget_low_s32(ph), vget_high_s32(ph));
r_.neon_i32 = vcombine_s32(rl, rh);
#endif
For _mm_maddubs_epi16 it's a little more complicated, but I don't think an AArch64-specific version will do much good:
/* Zero extend a */
int16x8_t a_odd = vreinterpretq_s16_u16(vshrq_n_u16(a_.neon_u16, 8));
int16x8_t a_even = vreinterpretq_s16_u16(vbicq_u16(a_.neon_u16, vdupq_n_u16(0xff00)));
/* Sign extend by shifting left then shifting right. */
int16x8_t b_even = vshrq_n_s16(vshlq_n_s16(b_.neon_i16, 8), 8);
int16x8_t b_odd = vshrq_n_s16(b_.neon_i16, 8);
/* multiply */
int16x8_t prod1 = vmulq_s16(a_even, b_even);
int16x8_t prod2 = vmulq_s16(a_odd, b_odd);
/* saturated add */
r_.neon_i16 = vqaddq_s16(prod1, prod2);
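For reference outside of SIMDe's wrapper types, here is a self-contained sketch of both operations adapted from the snippets above (the function names are mine; treat this as illustrative rather than a drop-in replacement):
#include <arm_neon.h>

/* Sketch of an _mm_madd_epi16 equivalent: 16-bit multiplies, pairwise-added
   into 32-bit lanes. */
static inline int32x4_t madd_s16(int16x8_t a, int16x8_t b)
{
    int32x4_t pl = vmull_s16(vget_low_s16(a), vget_low_s16(b));
#if defined(__aarch64__)
    int32x4_t ph = vmull_high_s16(a, b);
    return vpaddq_s32(pl, ph);
#else
    int32x4_t ph = vmull_s16(vget_high_s16(a), vget_high_s16(b));
    return vcombine_s32(vpadd_s32(vget_low_s32(pl), vget_high_s32(pl)),
                        vpadd_s32(vget_low_s32(ph), vget_high_s32(ph)));
#endif
}

/* Sketch of an _mm_maddubs_epi16 equivalent: a holds unsigned bytes, b signed
   bytes; adjacent products are summed with signed saturation into 16-bit lanes. */
static inline int16x8_t maddubs_u8s8(uint8x16_t a, int8x16_t b)
{
    uint16x8_t a16 = vreinterpretq_u16_u8(a);
    int16x8_t  b16 = vreinterpretq_s16_s8(b);
    int16x8_t a_odd  = vreinterpretq_s16_u16(vshrq_n_u16(a16, 8));                 /* zero-extend odd bytes */
    int16x8_t a_even = vreinterpretq_s16_u16(vbicq_u16(a16, vdupq_n_u16(0xff00))); /* zero-extend even bytes */
    int16x8_t b_even = vshrq_n_s16(vshlq_n_s16(b16, 8), 8);                        /* sign-extend even bytes */
    int16x8_t b_odd  = vshrq_n_s16(b16, 8);                                        /* sign-extend odd bytes */
    return vqaddq_s16(vmulq_s16(a_even, b_even), vmulq_s16(a_odd, b_odd));
}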

Related

Is there an ARM Neon instruction for the round function?

I am trying to implement the round function using ARM Neon intrinsics.
The function looks like this:
float roundf(float x) {
    return signbit(x) ? ceil(x - 0.5) : floor(x + 0.5);
}
Is there a way to do this using Neon intrinsics? If not, how to use Neon intrinsics to implement this function?
Edit: After calculating the product of two floats, I call roundf (on armv7 and armv8).
My compiler is clang.
On armv8 this can be done with vrndaq_f32: https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:#navigationhierarchiessimdisa=[Neon]&q=vrndaq_f32
How can this be done on armv7?
Edit: my implementation:
// input: float32x4_t arg
float32x4_t vector_zero = vdupq_n_f32(0.f);
float32x4_t neg_half = vdupq_n_f32(-0.5f);
float32x4_t pos_half = vdupq_n_f32(0.5f);
uint32x4_t mask = vcgeq_f32(arg, vector_zero);
uint32x4_t mask_neg = vbicq_u32(vreinterpretq_u32_f32(neg_half), mask); // -0.5 where arg < 0
uint32x4_t mask_pos = vandq_u32(mask, vreinterpretq_u32_f32(pos_half)); // +0.5 where arg >= 0
arg = vaddq_f32(arg, vreinterpretq_f32_u32(mask_pos));
arg = vaddq_f32(arg, vreinterpretq_f32_u32(mask_neg));
int32x4_t arg_int32 = vcvtq_s32_f32(arg);
arg = vcvtq_f32_s32(arg_int32);
Is there a better way to implement this?
It's important that you define which form of rounding you really want. See Wikipedia for a sense of how many rounding choices there are.
From your code snippet, you are asking for commercial or symmetric rounding, i.e. rounding away from zero on ties. For ARMv8 / ARM64, vrndaq_f32 should do that.
The SSE4 _mm_round_ps and the ARMv8 NEON vrndnq_f32 do banker's rounding, i.e. round-to-nearest-even.
Your solution is VERY expensive, both in cycle counts and register utilization.
Provided -(2^30) <= arg < (2^30), you can do the following:
int32x4_t argi = vcvtq_n_s32_f32(arg, 1);
argi = vsraq_n_s32(argi, argi, 31);
argi = vrshrq_n_s32(argi, 1);
arg = vcvtq_f32_s32(argi);
It doesn't require any register other than arg itself, and it is done in 4 inexpensive instructions. It works for both aarch32 and aarch64.
godbolt link
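Wrapped up as a helper, a minimal sketch combining both paths might look like this (the function name is mine, and the armv7 branch is only valid for -(2^30) <= x < (2^30)):
#include <arm_neon.h>

static inline float32x4_t round_half_away_f32(float32x4_t x)
{
#if defined(__aarch64__)
    return vrndaq_f32(x);                  /* round to nearest, ties away from zero */
#else
    int32x4_t xi = vcvtq_n_s32_f32(x, 1);  /* trunc(x * 2) as Q1 fixed point */
    xi = vsraq_n_s32(xi, xi, 31);          /* subtract 1 when negative */
    xi = vrshrq_n_s32(xi, 1);              /* rounding shift right by 1 */
    return vcvtq_f32_s32(xi);
#endif
}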

How to use fused multiply and add in AVX for 16 bit packed integers

I know it is possible to do multiply-and-add using a single instruction in AVX2. I want to use a multiply-and-add instruction where each 256-bit AVX2 variable is packed with 16 16-bit variables. For instance, consider the example below,
res = a0*b0 + a1*b1 + a2*b2 + a3*b3
here each of res, a0, a1, a2, a3, b0, b1, b2, b3 is a 16-bit variable.
I have closely followed the discussion. Please find my code below to calculate the example shown above,
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <immintrin.h>
#include <time.h>
#include "cpucycles.c"

#pragma STDC FP_CONTRACT ON

#define AVX_LEN 16

inline __m256i mul_add(__m256i a, __m256i b, __m256i c) {
    return _mm256_add_epi16(_mm256_mullo_epi16(a, b), c);
}

void fill_random(int16_t *a, int32_t len){ //to fill up the random array
    int32_t i;
    for(i=0;i<len;i++){
        a[i]=(int16_t)rand()&0xffff;
    }
}

int main(void){
    int16_t a0[16*AVX_LEN], b0[16*AVX_LEN];
    int16_t a1[16*AVX_LEN], b1[16*AVX_LEN];
    int16_t a2[16*AVX_LEN], b2[16*AVX_LEN];
    int16_t a3[16*AVX_LEN], b3[16*AVX_LEN];
    int16_t res[16*AVX_LEN];

    __m256i a0_avx[AVX_LEN], b0_avx[AVX_LEN];
    __m256i a1_avx[AVX_LEN], b1_avx[AVX_LEN];
    __m256i a2_avx[AVX_LEN], b2_avx[AVX_LEN];
    __m256i a3_avx[AVX_LEN], b3_avx[AVX_LEN];
    __m256i res_avx[AVX_LEN];
    int16_t res_avx_check[16*AVX_LEN];

    int32_t i,j;

    uint64_t mask_ar[4]; //for unloading AVX variables
    mask_ar[0]=~(0UL); mask_ar[1]=~(0UL); mask_ar[2]=~(0UL); mask_ar[3]=~(0UL);
    __m256i mask;
    mask = _mm256_loadu_si256((__m256i const *)mask_ar);

    time_t t;
    srand((unsigned) time(&t));

    int32_t repeat=100000;
    uint64_t clock1, clock2, fma_clock;
    clock1=clock2=fma_clock=0;

    for(j=0;j<repeat;j++){
        printf("j : %d\n",j);
        fill_random(a0,16*AVX_LEN); // Generate random data
        fill_random(a1,16*AVX_LEN);
        fill_random(a2,16*AVX_LEN);
        fill_random(a3,16*AVX_LEN);
        fill_random(b0,16*AVX_LEN);
        fill_random(b1,16*AVX_LEN);
        fill_random(b2,16*AVX_LEN);
        fill_random(b3,16*AVX_LEN);

        for(i=0;i<AVX_LEN;i++){ //Load values in AVX variables
            a0_avx[i] = _mm256_loadu_si256((__m256i const *) (&a0[i*16]));
            a1_avx[i] = _mm256_loadu_si256((__m256i const *) (&a1[i*16]));
            a2_avx[i] = _mm256_loadu_si256((__m256i const *) (&a2[i*16]));
            a3_avx[i] = _mm256_loadu_si256((__m256i const *) (&a3[i*16]));
            b0_avx[i] = _mm256_loadu_si256((__m256i const *) (&b0[i*16]));
            b1_avx[i] = _mm256_loadu_si256((__m256i const *) (&b1[i*16]));
            b2_avx[i] = _mm256_loadu_si256((__m256i const *) (&b2[i*16]));
            b3_avx[i] = _mm256_loadu_si256((__m256i const *) (&b3[i*16]));
        }
        for(i=0;i<AVX_LEN;i++){
            res_avx[i] = _mm256_set_epi64x(0, 0, 0, 0);
        }

        //to calculate a0*b0 + a1*b1 + a2*b2 + a3*b3

        //----standard calculation----
        for(i=0;i<16*AVX_LEN;i++){
            res[i]=a0[i]*b0[i] + a1[i]*b1[i] + a2[i]*b2[i] + a3[i]*b3[i];
        }

        //-----AVX-----
        clock1=cpucycles();
        for(i=0;i<AVX_LEN;i++){ //simple approach
            a0_avx[i]=_mm256_mullo_epi16(a0_avx[i], b0_avx[i]);
            res_avx[i]=_mm256_add_epi16(a0_avx[i], res_avx[i]);
            a1_avx[i]=_mm256_mullo_epi16(a1_avx[i], b1_avx[i]);
            res_avx[i]=_mm256_add_epi16(a1_avx[i], res_avx[i]);
            a2_avx[i]=_mm256_mullo_epi16(a2_avx[i], b2_avx[i]);
            res_avx[i]=_mm256_add_epi16(a2_avx[i], res_avx[i]);
            a3_avx[i]=_mm256_mullo_epi16(a3_avx[i], b3_avx[i]);
            res_avx[i]=_mm256_add_epi16(a3_avx[i], res_avx[i]);
        }
        /*
        for(i=0;i<AVX_LEN;i++){ //FMA approach
            res_avx[i]=mul_add(a0_avx[i], b0_avx[i], res_avx[i]);
            res_avx[i]=mul_add(a1_avx[i], b1_avx[i], res_avx[i]);
            res_avx[i]=mul_add(a2_avx[i], b2_avx[i], res_avx[i]);
            res_avx[i]=mul_add(a3_avx[i], b3_avx[i], res_avx[i]);
        }
        */
        clock2=cpucycles();
        fma_clock = fma_clock + (clock2-clock1);

        //-----Check----
        for(i=0;i<AVX_LEN;i++){ //store avx results for comparison
            _mm256_maskstore_epi64((long long *)(res_avx_check + i*16), mask, res_avx[i]);
        }
        for(i=0;i<16*AVX_LEN;i++){
            if(res[i]!=res_avx_check[i]){
                printf("\n--ERROR--\n");
                return 1;
            }
        }
    }
    printf("Total time taken is :%llu\n", fma_clock/repeat);
    return 0;
}
The cpucycles code is from ECRYPT and given below,
#include "cpucycles.h"
long long cpucycles(void)
{
unsigned long long result;
asm volatile(".byte 15;.byte 49;shlq $32,%%rdx;orq %%rdx,%%rax"
: "=a" (result) :: "%rdx");
return result;
}
My gcc --version returns:
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
I am using
Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
When I run this on my computer, I get the following cycle counts for the FMA approach and the simple approach respectively:
FMA approach : Total time taken is :109
Simple approach : Total time taken is :141
As you can see, the FMA approach is slightly faster, but I expected it to be even faster. I understand that my sample code has many memory accesses, which might be the reason for the degraded performance. But,
When I dump the assembly, I see almost identical instructions for both approaches. I do not see any fma instructions in the FMA version, and I don't understand why. Is it because of the _mm256_mullo_epi16 instructions?
Is my approach correct?
Can you please help me to fix this?
I am new to AVX2 programming, so it is quite possible that I have done something non-standard, but I will be glad to clarify anything that is not clear.
Thank you all in advance for your help.
x86 doesn't have SIMD-integer FMA / MAC (multiply-accumulate) other than horizontal pmaddubsw / pmaddwd, which add horizontal pairs into wider integers. (Until AVX512IFMA _mm_madd52lo_epu64 or AVX512_4VNNIW _mm512_4dpwssd_epi32(__m512i, __m512i x4, __m128i *).)
FP-contract and -ffast-math options have nothing to do with SIMD-integer stuff; integer math is always exact.
I think your "simple" approach is slower because you're also modifying the input arrays and this doesn't get optimized away, e.g.
a0_avx[i] = _mm256_mullo_epi16(a0_avx[i], b0_avx[i]);
as well as updating res_avx[i].
If the compiler doesn't optimize that away, those extra stores might be exactly why it's slower than your mul_add function. rdtsc without a serializing instruction doesn't even have to wait for earlier instructions to execute, let alone retire or commit stores to L1d cache, but extra uops for the front-end are still more to chew through. At only 1 store per clock throughput, that could easily be a new bottleneck.
FYI, you don't need to copy your inputs to arrays of __m256i. Normally you'd just use SIMD loads on regular data. That's no slower than indexing arrays of __m256i. Your arrays are too big for the compiler to fully unroll and keep everything in registers (like it would for scalar __m256i variables).
If you'd just used __m256i a0 = _mm256_loadu_si256(...) inside the loop then you could have updated a0 without slowing your code down because it would just be a single local variable that can be kept in a YMM reg.
But I find it's good style to use new named tmp vars for most steps to make code more self-documenting. Like __m256i ab = ... or sum = .... You could reuse the same sum temporary for each a0+b0 and a1+b1.
You might also use a temporary for the result vector instead of making the compiler optimize away the updates of memory at res_avx[i] until the final one.
You can use alignas(32) int16_t a0[...]; to make plain arrays aligned for _mm256_load instead of loadu.
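Putting those suggestions together, a hedged sketch of how the kernel could look with plain loads and local temporaries (illustrative only; it uses unaligned loads so it works for arbitrary pointers):
#include <immintrin.h>
#include <stdint.h>

/* res[i] = a0[i]*b0[i] + a1[i]*b1[i] + a2[i]*b2[i] + a3[i]*b3[i] in 16-bit lanes,
   processing 'blocks' groups of 16 elements. */
void madd16_blocks(const int16_t *a0, const int16_t *b0,
                   const int16_t *a1, const int16_t *b1,
                   const int16_t *a2, const int16_t *b2,
                   const int16_t *a3, const int16_t *b3,
                   int16_t *res, int blocks)
{
    for (int i = 0; i < blocks; i++) {
        __m256i sum = _mm256_mullo_epi16(_mm256_loadu_si256((const __m256i *)&a0[i*16]),
                                         _mm256_loadu_si256((const __m256i *)&b0[i*16]));
        sum = _mm256_add_epi16(sum, _mm256_mullo_epi16(_mm256_loadu_si256((const __m256i *)&a1[i*16]),
                                                       _mm256_loadu_si256((const __m256i *)&b1[i*16])));
        sum = _mm256_add_epi16(sum, _mm256_mullo_epi16(_mm256_loadu_si256((const __m256i *)&a2[i*16]),
                                                       _mm256_loadu_si256((const __m256i *)&b2[i*16])));
        sum = _mm256_add_epi16(sum, _mm256_mullo_epi16(_mm256_loadu_si256((const __m256i *)&a3[i*16]),
                                                       _mm256_loadu_si256((const __m256i *)&b3[i*16])));
        _mm256_storeu_si256((__m256i *)&res[i*16], sum);  /* only the final result is stored */
    }
}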
Your cpucycles() RDTSC function doesn't need to use inline asm. Use __rdtsc() instead.
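For example, a minimal sketch of the same timestamp read without inline asm (GCC/clang; still not a serializing, calibrated benchmark harness):
#include <x86intrin.h>

static inline unsigned long long cpucycles(void)
{
    return __rdtsc();  /* reads the TSC, like the rdtsc bytes in the asm above */
}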

How to extract 8 integers from a 256 vector using intel intrinsics?

I'm trying to enhance the performance of my code by using 256-bit vectors (Intel intrinsics - AVX).
I have a 4th-generation i7 (Haswell architecture) processor supporting SSE1 to SSE4.2 and the AVX/AVX2 extensions.
This is the code snippet that I'm trying to enhance:
/* code snippet */
kfac1 = kfac + factor; /* 7 cycles for 7 additions */
kfac2 = kfac1 + factor;
kfac3 = kfac2 + factor;
kfac4 = kfac3 + factor;
kfac5 = kfac4 + factor;
kfac6 = kfac5 + factor;
kfac7 = kfac6 + factor;
k1fac1 = k1fac + factor1; /* 7 cycles for 7 additions */
k1fac2 = k1fac1 + factor1;
k1fac3 = k1fac2 + factor1;
k1fac4 = k1fac3 + factor1;
k1fac5 = k1fac4 + factor1;
k1fac6 = k1fac5 + factor1;
k1fac7 = k1fac6 + factor1;
k2fac1 = k2fac + factor2; /* 7 cycles for 7 additions */
k2fac2 = k2fac1 + factor2;
k2fac3 = k2fac2 + factor2;
k2fac4 = k2fac3 + factor2;
k2fac5 = k2fac4 + factor2;
k2fac6 = k2fac5 + factor2;
k2fac7 = k2fac6 + factor2;
/* code snippet */
From the Intel manuals, I found that:
an integer addition ADD takes 1 cycle (latency);
an addition of a vector of 8 integers (32-bit) also takes 1 cycle.
So I've tried to make it this way:
fac = _mm256_set1_epi32 (factor )
fac1 = _mm256_set1_epi32 (factor1)
fac2 = _mm256_set1_epi32 (factor2)
v1 = _mm256_set_epi32 (0,kfac6,kfac5,kfac4,kfac3,kfac2,kfac1,kfac)
v2 = _mm256_set_epi32 (0,k1fac6,k1fac5,k1fac4,k1fac3,k1fac2,k1fac1,k1fac)
v3 = _mm256_set_epi32 (0,k2fac6,k2fac5,k2fac4,k2fac3,k2fac2,k2fac1,k2fac)
res1 = _mm256_add_epi32 (v1,fac)  ////////////////////
res2 = _mm256_add_epi32 (v2,fac1) // just 3 cycles //
res3 = _mm256_add_epi32 (v3,fac2) ////////////////////
But the problem is that these factors are going to be used as table indexes (table[kfac] ...). So I have to extract the factors as separate integers again.
Is there any efficient way to do this?
A smart compiler could get table+factor into a register and use indexed addressing modes to get table+factor+k1fac6 as an address. Check the asm, and if the compiler doesn't do this for you, try changing the source to hand-hold the compiler:
const int *tf = table + factor;
const int *tf2 = table + factor2; // could be lea rdx, [rax+rcx*4] or something.
...
foo = tf[kfac2];
bar = tf2[k2fac6]; // could be mov r12, [rdx + rdi*4]
But to answer the question you asked:
Latency isn't a big deal when you have that many independent adds happening. The throughput of 4 scalar add instructions per clock on Haswell is much more relevant.
If k1fac2 and so on are already in contiguous memory, then using SIMD is possibly worth it. Otherwise all the shuffling and data transfer to get them in/out of vector regs makes it definitely not worth it (i.e. the stuff the compiler emits to implement _mm256_set_epi32(0, kfac6, kfac5, kfac4, kfac3, kfac2, kfac1, kfac)).
You could avoid needing to get the indices back into integer registers by using an AVX2 gather for the table loads. But gather is slow on Haswell, so probably not worth it. Maybe worth it on Broadwell.
On Skylake, gather is fast so it could be good if you can SIMD whatever you do with the LUT results. If you need to extract all the gather results back to separate integer registers, it's probably not worth it.
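For illustration only, a hedged sketch of what a gather-based table lookup could look like (assuming table is an int32_t array and reusing v1 and fac from the question):
/* Compute the 8 indices, then gather table[index] for all of them at once.
   Scale 4 = sizeof(int32_t). */
__m256i idx = _mm256_add_epi32(v1, fac);
__m256i lut = _mm256_i32gather_epi32(table, idx, 4);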
If you did need to extract 8x 32-bit integers from a __m256i into integer registers, you have three main choices of strategy:
Vector store to a tmp array and scalar loads
ALU shuffle instructions like pextrd (_mm_extract_epi32). Use _mm256_extracti128_si256 to get the high lane into a separate __m128i.
A mix of both strategies (e.g. store the high 128 to memory while using ALU stuff on the low half).
Depending on the surrounding code, any of these three could be optimal on Haswell.
pextrd r32, xmm, imm8 is 2 uops on Haswell, with one of them needing the shuffle unit on port5. That's a lot of shuffle uops, so a pure ALU strategy is only going to be good if your code is bottlenecked on L1d cache throughput. (Not the same thing as memory bandwidth). movd r32, xmm is only 1 uop, and compilers do know to use that when compiling _mm_extract_epi32(vec, 0), but you can also write int foo = _mm_cvtsi128_si32(vec) to make it explicit and remind yourself that the bottom element can be accessed more efficiently.
Store/reload has good throughput. Intel SnB-family CPUs including Haswell can run two loads per clock, and IIRC store-forwarding works from an aligned 32-byte store to any 4-byte element of it. But make sure it's an aligned store, e.g. into _Alignas(32) int tmp[8], or into a union between an __m256i and an int array. You could still store into the int array instead of the __m256i member to avoid union type-punning while still having the array aligned, but it's easiest to just use C++11 alignas or C11 _Alignas.
_Alignas(32) int tmp[8];
_mm256_store_si256((__m256i*)tmp, vec);
...
foo2 = tmp[2];
However, the problem with store/reload is latency. Even the first result won't be ready for 6 cycles after the store-data is ready.
A mixed strategy gives you the best of both worlds: ALU to extract the first 2 or 3 elements lets execution get started on whatever code uses them, hiding the store-forwarding latency of the store/reload.
_Alignas(32) int tmp[8];
_mm256_store_si256((__m256i*)tmp, vec);
__m128i lo = _mm256_castsi256_si128(vec); // This is free, no instructions
int foo0 = _mm_cvtsi128_si32(lo);
int foo1 = _mm_extract_epi32(lo, 1);
foo2 = tmp[2];
// rest of foo3..foo7 also loaded from tmp[]
// Then use foo0..foo7
You might find that it's optimal to do the first 4 elements with pextrd, in which case you only need to store/reload the upper lane. Use vextracti128 [mem], ymm, 1:
_Alignas(16) int tmp[4];
_mm_store_si128((__m128i*)tmp, _mm256_extracti128_si256(vec, 1));
// movd / pextrd for foo0..foo3
int foo4 = tmp[0];
...
With fewer larger elements (e.g. 64-bit integers), a pure ALU strategy is more attractive. 6-cycle vector-store / integer-reload latency is longer than it would take to get all of the results with ALU ops, but store/reload could still be good if there's a lot of instruction-level parallelism and you bottleneck on ALU throughput instead of latency.
With more smaller elements (8 or 16-bit), store/reload is definitely attractive. Extracting the first 2 to 4 elements with ALU instructions is still good. And maybe even vmovd r32, xmm and then picking that apart with integer shift/mask instructions is good.
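For example, a hedged sketch for byte elements (assuming vec is the __m256i holding the data): grab the low 32 bits with one vmovd, then peel them apart with scalar shifts and masks instead of several pextrb instructions:
uint32_t lo4 = (uint32_t)_mm_cvtsi128_si32(_mm256_castsi256_si128(vec));  /* vmovd */
int e0 = lo4 & 0xFF;          /* element 0 */
int e1 = (lo4 >> 8) & 0xFF;   /* element 1 */
int e2 = (lo4 >> 16) & 0xFF;  /* element 2 */
int e3 = (lo4 >> 24) & 0xFF;  /* element 3 */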
Your cycle-counting for the vector version is also bogus. The three _mm256_add_epi32 operations are independent, and Haswell can run two vpaddd instructions in parallel. (Skylake can run all three in a single cycle, each with 1 cycle latency.)
Superscalar pipelined out-of-order execution means there's a big difference between latency and throughput, and keeping track of dependency chains matters a lot. See http://agner.org/optimize/, and other links in the x86 tag wiki for more optimization guides.

Add all elements in a lane

Is there an intrinsic which allows one to add all of the elements in a lane? I am using Neon to multiply 8 numbers together, and I need to sum the result. Here is some paraphrased code to show what I'm currently doing (this could probably be optimised):
int16_t p[8], q[8], r[8];
int32_t sum;
int16x8_t pneon, qneon, result;
p[0] = some_number;
p[1] = some_other_number;
//etc etc
pneon = vld1q_s16(p);
q[0] = some_other_other_number;
q[1] = some_other_other_other_number;
//etc etc
qneon = vld1q_s16(q);
result = vmulq_s16(pneon, qneon);
vst1q_s16(r,result);
sum = ((int32_t) r[0] + (int32_t) r[1] + ... //etc );
Is there a "better" way to do this?
If you're targeting the newer 64-bit ARM architecture (AArch64), then ADDV is just the right instruction for you.
Here's how your code will look with it.
qneon = vld1q_s16(q);
result = vmulq_s16(pneon, qneon);
sum = vaddvq_s16(result);
That's it. Just one instruction to sum up all of the lanes in the vector register.
Sadly, this instruction isn't available in the older 32-bit ARM architecture.
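Note that vaddvq_s16 accumulates in 16 bits; if the sum of the eight products can overflow an int16_t (the question stores the sum in an int32_t), a widening reduction is a possible alternative (also AArch64-only):
sum = vaddlvq_s16(result);  /* widening sum: 8 x int16 -> int32 */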
Something like this should be close to optimal (caution: not tested):
const int16x4_t result_low = vget_low_s16(result); // Extract low 4 elements
const int16x4_t result_high = vget_high_s16(result); // Extract high 4 elements
const int32x4_t twopartsum = vaddl_s16(result_low, result_high); // Extend to 32 bits and add (4 partial 32-bit sums are formed)
const int32x2_t twopartsum_low = vget_low_s32(twopartsum); // Extract 2 low 32-bit partial sums
const int32x2_t twopartsum_high = vget_high_s32(twopartsum); // Extract 2 high 32-bit partial sums
const int32x2_t fourpartsum = vadd_s32(twopartsum_low, twopartsum_high); // Add partial sums (2 partial 32-bit sum are formed)
const int32x2_t eightpartsum = vpadd_s32(fourpartsum, fourpartsum); // Final reduction
const int32_t sum = vget_lane_s32(eightpartsum, 0); // Move to general-purpose registers
The same two-step reduction also works for a float32x4_t, e.g.:
temp = vadd_f32(vget_high_f32(variance_n), vget_low_f32(variance_n));
sum = vget_lane_f32(vpadd_f32(temp, temp), 0);

SSE _mm_movemask_epi8 equivalent method for ARM NEON

I decided to continue with the FAST corners optimisation and got stuck at the
_mm_movemask_epi8 SSE instruction. How can I rewrite it for ARM Neon with a uint8x16_t input?
I know this post is quite outdated but I found it useful to give my (validated) solution. It assumes all ones/all zeroes in every lane of the Input argument.
const uint8_t __attribute__ ((aligned (16))) _Powers[16]=
{ 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };
// Set the powers of 2 (do it once for all, if applicable)
uint8x16_t Powers= vld1q_u8(_Powers);
// Compute the mask from the input
uint64x2_t Mask= vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(vandq_u8(Input, Powers))));
// Get the resulting bytes
uint16_t Output;
vst1q_lane_u8((uint8_t*)&Output + 0, vreinterpretq_u8_u64(Mask), 0);
vst1q_lane_u8((uint8_t*)&Output + 1, vreinterpretq_u8_u64(Mask), 8);
(Mind http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47553, anyway.)
Similarly to Michael's answer, the trick is to form the powers of two corresponding to the indexes of the non-null entries, and to sum them pairwise three times. This must be done with increasing data size to double the stride on every addition. You reduce from 2 x 8 8-bit entries to 2 x 4 16-bit, then 2 x 2 32-bit and 2 x 1 64-bit. The low byte of these two numbers gives the solution. I don't think there is an easy way to pack them together to form a single short value using NEON.
Takes 6 NEON instructions if the input is in the suitable form and the powers can be preloaded.
The obvious solution seems to be completely missed here.
// Use shifts to collect all of the sign bits.
// I'm not sure if this works on big endian, but big endian NEON is very
// rare.
int vmovmaskq_u8(uint8x16_t input)
{
    // Example input (half scale):
    // 0x89 FF 1D C0 00 10 99 33

    // Shift out everything but the sign bits
    // 0x01 01 00 01 00 00 01 00
    uint16x8_t high_bits = vreinterpretq_u16_u8(vshrq_n_u8(input, 7));

    // Merge the even lanes together with vsra. The '??' bytes are garbage.
    // vsri could also be used, but it is slightly slower on aarch64.
    // 0x??03 ??02 ??00 ??01
    uint32x4_t paired16 = vreinterpretq_u32_u16(
        vsraq_n_u16(high_bits, high_bits, 7));

    // Repeat with wider lanes.
    // 0x??????0B ??????04
    uint64x2_t paired32 = vreinterpretq_u64_u32(
        vsraq_n_u32(paired16, paired16, 14));

    // 0x??????????????4B
    uint8x16_t paired64 = vreinterpretq_u8_u64(
        vsraq_n_u64(paired32, paired32, 28));

    // Extract the low 8 bits from each lane and join.
    // 0x4B
    return vgetq_lane_u8(paired64, 0) | ((int)vgetq_lane_u8(paired64, 8) << 8);
}
This question deserves a newer answer for aarch64. The addition of new capabilities to Armv8 allows the same function to be implemented in fewer instructions. Here's my version:
uint32_t _mm_movemask_aarch64(uint8x16_t input)
{
    const int8_t __attribute__ ((aligned (16))) ucShift[] = {-7,-6,-5,-4,-3,-2,-1,0,-7,-6,-5,-4,-3,-2,-1,0};
    int8x16_t vshift = vld1q_s8(ucShift);
    uint8x16_t vmask = vandq_u8(input, vdupq_n_u8(0x80));
    uint32_t out;

    vmask = vshlq_u8(vmask, vshift);
    out = vaddv_u8(vget_low_u8(vmask));
    out += (vaddv_u8(vget_high_u8(vmask)) << 8);

    return out;
}
After some tests, it looks like the following code works correctly:
int32_t _mm_movemask_epi8_neon(uint8x16_t input)
{
    const int8_t __attribute__ ((aligned (16))) xr[8] = {-7,-6,-5,-4,-3,-2,-1,0};
    uint8x8_t mask_and = vdup_n_u8(0x80);
    int8x8_t mask_shift = vld1_s8(xr);

    uint8x8_t lo = vget_low_u8(input);
    uint8x8_t hi = vget_high_u8(input);

    lo = vand_u8(lo, mask_and);
    lo = vshl_u8(lo, mask_shift);

    hi = vand_u8(hi, mask_and);
    hi = vshl_u8(hi, mask_shift);

    lo = vpadd_u8(lo, lo);
    lo = vpadd_u8(lo, lo);
    lo = vpadd_u8(lo, lo);

    hi = vpadd_u8(hi, hi);
    hi = vpadd_u8(hi, hi);
    hi = vpadd_u8(hi, hi);

    return ((hi[0] << 8) | (lo[0] & 0xFF));
}
Note that I haven't tested any of this, but something like this might work:
X := the vector that you want to create the mask from
A := 0x808080808080...
B := 0x00FFFEFDFCFB... (i.e. 0,-1,-2,-3,...)
X = vand_u8(X, A); // Keep d7 of each byte in X
X = vshl_u8(X, B); // X[7]>>=0; X[6]>>=1; X[5]>>=2; ...
// Each byte of X now contains its msb shifted 7-N bits to the right, where N
// is the byte index.
// Do 3 pairwise adds in order to pack all these into X[0]
X = vpadd_u8(X, X);
X = vpadd_u8(X, X);
X = vpadd_u8(X, X);
// X[0] should now contain the mask. Clear the remaining bytes if necessary
This would need to be repeated once to process a 128-bit vector, since vpadd only works on 64-bit vectors.
I know this question has been here for 8 years already, but let me give you an answer which might solve all the performance problems with emulation. It's based on the blog post Bit twiddling with Arm Neon: beating SSE movemasks, counting bits and more.
Most uses of movemask come from comparisons, where every one of the 16 bytes of the vector is either 0xFF or 0x00. The mask is then typically used to check whether none/all match, to find the leading/trailing match, or to iterate over the set bits.
If that is the case (and it often is), you can use the shrn reg1, reg2, #4 instruction (shift right and narrow). It reduces the 128-bit byte mask to a 64-bit nibble mask (each input byte contributes one nibble to the result), which can then be extracted to a 64-bit general-purpose register.
const uint16x8_t equalMask = vreinterpretq_u16_u8(vceqq_u8(chunk, vdupq_n_u8(tag)));
const uint8x8_t res = vshrn_n_u16(equalMask, 4);
const uint64_t matches = vget_lane_u64(vreinterpret_u64_u8(res), 0);
return matches;
After that you can use all the bit operations you typically use on x86, with very minor tweaks such as shifting by 2 or doing a scalar AND.
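For example, finding the index of the first matching byte is a count-trailing-zeros divided by 4, since each input byte owns one nibble of the mask (sketch, assuming matches is non-zero):
int first_match = __builtin_ctzll(matches) >> 2;  /* bit index / 4 = byte index */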

Resources