Comparison with zero using NEON instructions - ARM

I have the code below:
if (value == 0)
{
    value = 1;
}
I need to perform the above using NEON vectorized instructions. How do I compare a NEON register with 0 for equality, 4 elements at a time, and change an element to 1 if it is zero?

If you want to check if any element of a vector is non-zero and branch on that:
You can get the min/max across vector lanes.
if (vmaxvq_u32(value) == 0) {   // max value across quad vector, equals zero?
    value = vmovq_n_u32(1);     // set all lanes to 1
}
For double (64-bit) vectors:
if (vmaxv_u32(value) == 0) {    // max value across double vector, equals zero?
    value = vmov_n_u32(1);      // set all lanes to 1
}
Notice the only difference is the 'q', which indicates a quad (128-bit) vector; without it the intrinsic operates on a 64-bit double vector. The compiler will use a mov instruction to transfer the result from a NEON register to a general-purpose ARM register to do the comparison.

Assuming integer data, then thanks to NEON having specific "compare against zero" instructions, and the bitwise way comparison results work, there's a really cheeky way to do this using just one spare register. In generalised pseudo-assembly:
VCEQ.type mask, data, #0    # generate a bitmask vector with all bits set in elements
                            # corresponding to zero elements in the data
VSUB.type data, data, mask  # interpret "mask" as a vector of 0s and -1s, with the
                            # result of incrementing just the zero elements of "data"
                            # (thanks to two's complement underflow)
This trick doesn't work for floating-point data, as the bit-patterns for nonzero values are more complicated, and neither does it work if the replacement value is to be anything other than 1 (or -1), so in those cases you would need to construct a separate vector containing the appropriate replacement elements and do a conditional select using the comparison mask, as per @Ermlg's answer.
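In intrinsics, the same trick might look something like this (a minimal sketch, assuming uint32x4_t data; the same idea works for the other integer element widths):
#include <arm_neon.h>

// Replace zero elements with 1 using the compare-and-subtract trick.
uint32x4_t replace_zeros_with_one(uint32x4_t data)
{
    uint32x4_t mask = vceqq_u32(data, vdupq_n_u32(0)); // all-ones (-1) in lanes where data == 0
    return vsubq_u32(data, mask);                      // subtracting -1 adds 1 to just those lanes
}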

Maybe it will look something like this:
uint32x4_t value = {7, 0, 0, 3};
uint32x4_t zero = {0, 0, 0, 0};
uint32x4_t one = {1, 1, 1, 1};
uint32x4_t mask = vceqq_u32(value, zero);
value = vbslq_u32(mask, one, value);


Find the INDEX of element having max. absolute value using AVX512 instructions

I'm a newbie at coding with AVX512 instructions. My machine is an Intel KNL 7250. I am trying to use AVX512 instructions to find the INDEX of the element having the maximum absolute value; the data is double precision and the array size is a multiple of 8. But it prints index = 0 every time. I don't know where the problem is, please help me.
Also, how do I use printf for the __m512i type?
Thanks.
Code:
void main()
{
    int i;
    int N = 160;
    double vec[N];
    for (i = 0; i < N; i++)
    {
        vec[i] = (double)(-i);
        if (i == 10)
        {
            vec[i] = -1127;
        }
    }
    int max = avxmax_(N, vec);
    printf("maxindex=%d\n", max);
}
int avxmax_(int N, double *X)
{
    // return the index of element having maximum absolute value.
    int maxindex, ix, i, M;
    register __m512i increment, indices, maxindices, maxvalues, absmax, max_values_tmp, abs_max_tmp, tmp;
    register __mmask8 gt;
    double values_X[8];
    double indices_X[8];
    double maxvalue;
    maxindex = 1;
    if (N == 1) return (maxindex);
    M = N % 8;
    if (M == 0)
    {
        increment = _mm512_set1_epi64(8);   // [8,8,8,8,8,8,8,8]
        indices = _mm512_setr_epi64(0, 1, 2, 3, 4, 5, 6, 7);
        maxindices = indices;
        maxvalues = _mm512_loadu_si512(&X[0]);
        absmax = _mm512_abs_epi64(maxvalues);
        for (i = 8; i < N; i += 8)
        {
            // advance scalar indices: indices[0] + 8, indices[1] + 8, ..., indices[7] + 8
            indices = _mm512_add_epi64(indices, increment);
            // compare
            max_values_tmp = _mm512_loadu_si512(&X[i]);
            abs_max_tmp = _mm512_abs_epi64(max_values_tmp);
            gt = _mm512_cmpgt_epi64_mask(abs_max_tmp, absmax);
            // update
            maxindices = _mm512_mask_blend_epi64(gt, maxindices, indices);
            absmax = _mm512_max_epi64(absmax, abs_max_tmp);
        }
        // scalar part
        _mm512_storeu_si512((__m512i*)values_X, absmax);
        _mm512_storeu_si512((__m512i*)indices_X, maxindices);
        maxindex = indices_X[0];
        maxvalue = values_X[0];
        for (i = 1; i < 8; i++)
        {
            if (values_X[i] > maxvalue)
            {
                maxvalue = values_X[i];
                maxindex = indices_X[i];
            }
        }
        return (maxindex);
    }
}
Your function returns 0 because you're treating the int64 index as the bit-pattern for a double, and converting that (tiny) number to an integer. double indices_X[8]; is the bug; should be uint64_t. There are other bugs, see below.
This bug is easier to spot if you declare variables as you use them, C99 style, not obsolete C89 style.
You _mm512_storeu_si512 the vector of int64_t indices into double indices_X[8], type-punning it to double, then in plain C do int maxindex = indices_X[0];. This is implicit type-conversion, converting that subnormal double to an integer.
(I noticed a mysterious vcvttsd2si FP->int conversion in the asm https://godbolt.org/z/zsfc36 while converting the code to C99 style variable declarations next to initializers. That was a clue: there should be no FP->int conversion in this function. I noticed that around the same time I was moving the double indices_X[8]; declaration down into the block that uses it, and noticing it had type double.)
It is actually possible to use integer operations on FP bit-patterns
But only if you use the right ones! IEEE754 exponent biases mean that the encoding / bit-pattern can be compared as a sign/magnitude integer. So you can do abs / min / max and compare on it, but not of course integer add / sub (unless you're implementing nextafter).
_mm512_abs_epi64 is 2's complement abs, not sign-magnitude. Instead, you must just mask off the sign bit. Then you're all set to treat the result as an unsigned integer or signed-2's-complement. (Either works because the high bit is clear.)
Using integer max has the interesting property that NaNs will compare higher than any other value, Inf below that, then finite values. So we get a NaN-propagating max-finder basically for free.
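As a quick scalar illustration of that ordering property (purely illustrative, not part of the fixed function below): for non-NaN doubles with the sign bit cleared, an unsigned compare of the raw bit-patterns orders the same way as an FP compare.
#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    double a = 1.5, b = 1127.0;
    uint64_t ba, bb;
    memcpy(&ba, &a, 8);                // grab the IEEE754 bit-patterns
    memcpy(&bb, &b, 8);
    ba &= 0x7FFFFFFFFFFFFFFFULL;       // "abs" = clear the sign bit
    bb &= 0x7FFFFFFFFFFFFFFFULL;
    printf("%d %d\n", a < b, ba < bb); // prints: 1 1
    return 0;
}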
On KNL (Knight's Landing), FP vmaxpd and vcmppd have the same performance as their integer equivalents: 2 cycle latency, 0.5c throughput. (https://agner.org/optimize/). So your way has zero advantage on KNL, but it's a neat trick for mainstream Intel, like Skylake-X and IceLake.
Bugfixed optimized version:
Use size_t for return type and loop counters / indices to handle potentially huge arrays, instead of a random mix of int and 64-bit vector elements. (uint64_t for the temp array that collects the horizontal-max: it's always 64-bit even in a build with 32-bit pointers / size_t.)
bugfix: return 0 on N==1, not 1: the index of the only element is 0.
bugfix: return -1 on N%8 != 0, instead of falling off the end of the non-void function. (Undefined behaviour if the caller uses the result in C, or as soon as execution falls off the end in C++).
bugfix: abs of an FP value = clear the sign bit, not 2's complement abs on the bit-pattern
sort of bugfix: use unsigned integer compare and max, so it would work for 2's complement integers with _mm512_abs_epi64 (which produces an unsigned result; remember that -LONG_MIN overflows to LONG_MIN if you keep treating it as signed).
style improvement: if (N%8 != 0) return -1; instead of putting most of the body in an if block.
style improvement: declare vars when they're first used, and removed some unused ones that were pure noise. This is idiomatic for C since C99, which was standardized over 20 years ago.
style improvement: use simpler names for tmp vector vars that just hold a load result. Sometimes you just need a tmp var because intrinsic names are so long that you don't want to type _mm...load... as an arg for another intrinsics. A name like v scoped to the inner loop is a clear sign it's just a placeholder, not used later. (This style works best when you're declaring it as you init it, so it's easy to see it can't be used in an outer scope.)
optimization: reduce 8 -> 4 elements after the loop with SIMD: extract the high half, combine with existing low half. (Same as you would for a simpler horizontal reduction like sum or max). Inconvenient when we need instructions that only AVX512 has, but KNL doesn't have AVX512VL, so we have to use the 512-bit version and ignore the high garbage. But KNL does have AVX1 / AVX2 so we can still store 256-bit vectors and do some things.
Using a merge-masking _mm512_mask_extracti64x4_epi64 extract to blend the high half directly into the low half of the same vector is a cool trick which compilers don't find if you use a 512-bit mask-blend. :P
sort of bugfix: in C, main has a return type of int in hosted implementations (running under an OS).
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

// bugfix: indices can be larger than an int
size_t avxmax_(size_t N, double *X)
{
    // return the index of element having maximum absolute value.
    if (N == 1)
        return 0;     // bugfix: 0 is the only valid element in this case, not 1
    if (N % 8 != 0)   // [[unlikely]] // C++20
        return -1;    // bugfix: don't fall off the end of the function in this case

    const __m512i fp_absmask = _mm512_set1_epi64(0x7FFFFFFFFFFFFFFF);
    __m512i indices = _mm512_setr_epi64(0, 1, 2, 3, 4, 5, 6, 7);
    __m512i maxindices = indices;
    __m512i v = _mm512_loadu_si512(&X[0]);
    __m512i absmax = _mm512_and_si512(v, fp_absmask);

    for (size_t i = 8; i < N; i += 8)   // [[likely]] // C++20
    {
        // advance indices by 8 each.
        indices = _mm512_add_epi64(indices, _mm512_set1_epi64(8));
        // compare
        v = _mm512_loadu_si512(&X[i]);
        __m512i vabs = _mm512_and_si512(v, fp_absmask);
        // vabs = _mm512_abs_epi64(v);   // for actual integers, not FP bit patterns
        __mmask8 gt = _mm512_cmpgt_epu64_mask(vabs, absmax);
        // update
        maxindices = _mm512_mask_blend_epi64(gt, maxindices, indices);
        absmax = _mm512_max_epu64(absmax, vabs);
    }

    // reduce 512->256; KNL doesn't have AVX512VL so some ops require 512-bit vectors
    __m256i absmax_hi = _mm512_extracti64x4_epi64(absmax, 1);
    __m512i absmax_hi512 = _mm512_castsi256_si512(absmax_hi);   // free
    __mmask8 gt = _mm512_cmpgt_epu64_mask(absmax_hi512, absmax);
    __m256i abs256 = _mm512_castsi512_si256(_mm512_max_epu64(absmax_hi512, absmax));   // reduced to low 4 elements
    // extract with merge-masking = blend
    __m256i maxindices256 = _mm512_mask_extracti64x4_epi64(
                                _mm512_castsi512_si256(maxindices), gt, maxindices, 1);

    // scalar part
    double values_X[4];
    uint64_t indices_X[4];
    _mm256_storeu_si256((__m256i*)values_X, abs256);
    _mm256_storeu_si256((__m256i*)indices_X, maxindices256);

    size_t maxindex = indices_X[0];
    double maxvalue = values_X[0];
    for (int i = 1; i < 4; i++)
    {
        if (values_X[i] > maxvalue)
        {
            maxvalue = values_X[i];
            maxindex = indices_X[i];
        }
    }
    return maxindex;
}
On Godbolt: the main loop from GCC10.2 -O3 -march=knl is 8 instructions. So even if (best case) KNL could decode and run it at 2/clock, it's still taking 4 cycles per vector. You can run the program on Godbolt; it runs on Skylake-X servers so it can run AVX512 code. You can see it prints 10.
.L4:
    vpandd      zmm2, zmm5, ZMMWORD PTR [rsi+rax*8]   # load, folded into AND
    add         rax, 8
    vpcmpuq     k1, zmm2, zmm0, 6
    vpaddq      zmm1, zmm1, zmm4          # increment current indices
    cmp         rdi, rax
    vmovdqa64   zmm3{k1}, zmm1            # blend maxidx using merge-masking
    vpmaxuq     zmm0, zmm0, zmm2
    ja          .L4

    vmovapd     zmm1, zmm3                # silly missed optimization related to the case where the loop runs 0 times.
.L3:
    vextracti64x4   ymm2, zmm0, 0x1       # high half of absmax
    vpcmpuq         k1, zmm2, zmm0, 6     # compare high and low
    vpmaxuq         zmm0, zmm0, zmm2
#   vunpckhpd       xmm2, xmm0, xmm0      # setting up for unrolled scalar loop
    vextracti64x4   ymm1{k1}, zmm3, 0x1   # masked extract of indices
Another option for the loop would be a masked vpbroadcastq zmm3{k1}, rax, adding the [0..7] per-element offsets after the loop. That would actually save the vpaddq in the loop, and we have the right i in a register if GCC is going to use an indexed addressing-mode anyway. (That's not good on Skylake-X; defeats micro-fusion of the memory-source vpandd.)
Agner Fog doesn't list performance for GP->vector broadcasts, but hopefully it's only single-uop on KNL at least. (And https://uops.info/ doesn't have KNL or KNM results).
Branchy strategy: when a new max is very rare
If you expect finding a new max to be very rare (e.g. array is large and uniformly distributed, or at least not trending upwards), it could be even faster to broadcast the current max and branch on finding any greater vector element.
Finding a new max means branching out of the loop (which probably mispredicts, so that's slow) and broadcasting that element (probably with a tzcnt to find the element index, then a broadcast-load, and updating the index).
Especially with KNL's 4-way SMT to hide branch miss costs, this could be an overall throughput win for large arrays; fewer instructions per element on average.
But probably significantly worse for inputs that do trend upwards, so a new max is found O(n) times on average, not sqrt(n) or log(n) or whatever frequency a uniform distribution would give us.
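A rough sketch of that branchy strategy (illustrative only and untested; it uses a simple scalar rescan of the offending block rather than the tzcnt + broadcast-load described above, and assumes N % 8 == 0):
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

size_t absmax_branchy(size_t N, const double *X)
{
    const __m512i fp_absmask = _mm512_set1_epi64(0x7FFFFFFFFFFFFFFF);
    size_t maxindex = 0;
    uint64_t curmax;                           // abs bit-pattern of the current max
    memcpy(&curmax, &X[0], 8);
    curmax &= 0x7FFFFFFFFFFFFFFFULL;
    __m512i vmax = _mm512_set1_epi64(curmax);  // current max broadcast to all lanes

    for (size_t i = 0; i < N; i += 8)
    {
        __m512i vabs = _mm512_and_si512(_mm512_loadu_si512(&X[i]), fp_absmask);
        __mmask8 gt = _mm512_cmpgt_epu64_mask(vabs, vmax);
        if (gt)                                // rare case: some lane beats the current max
        {
            for (int j = 0; j < 8; j++)        // scalar rescan of this block
            {
                uint64_t bits;
                memcpy(&bits, &X[i + j], 8);
                bits &= 0x7FFFFFFFFFFFFFFFULL;
                if (bits > curmax) { curmax = bits; maxindex = i + j; }
            }
            vmax = _mm512_set1_epi64(curmax);  // re-broadcast the new max
        }
    }
    return maxindex;
}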
PS: to print vectors, store to an array and reload the elements (see also: print a __m128i variable).
Or use a debugger to show you their elements.
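For example, a minimal helper along those lines (assuming you want to view the 64-bit lanes; the function name is just illustrative):
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

static void print_m512i_u64(__m512i v)
{
    uint64_t tmp[8];
    _mm512_storeu_si512((__m512i*)tmp, v);   // store the vector to a plain array
    for (int i = 0; i < 8; i++)
        printf("%llu ", (unsigned long long)tmp[i]);
    printf("\n");
}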

Testing NEON SIMD registers for equality over all lanes

I'm using Neon Instrinics with clang.
I want to test two uint32x4_t SIMD values for equality over all lanes.
So not 4 test results, but one single result that tells me if A and B are equal for all lanes.
On Intel AVX, I would use something like:
_mm256_testz_si256( _mm256_xor_si256( A, B ), _mm256_set1_epi64x( -1 ) )
What would be a good way to perform an all-lane equality test for NEON SIMD?
I am assuming I will need intrinsics that operate across lanes.
Does ARM Neon have those features?
Try this:
uint16x4_t t = vqmovn_u32(veorq_u32(a, b));
vget_lane_u64(vreinterpret_u64_u16(t), 0) == 0
I expect the compiler to find target-specific optimizations when implementing that test.
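Wrapped up as a helper, that might look like this (a minimal sketch; the function name is just illustrative):
#include <arm_neon.h>
#include <stdbool.h>

// True when a and b are equal in all four 32-bit lanes.
static inline bool all_lanes_equal_u32(uint32x4_t a, uint32x4_t b)
{
    // XOR gives 0 only in lanes that match; the saturating narrow preserves
    // "zero vs non-zero", so one 64-bit scalar compare covers all four lanes.
    uint16x4_t t = vqmovn_u32(veorq_u32(a, b));
    return vget_lane_u64(vreinterpret_u64_u16(t), 0) == 0;
}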
I just realised something handy...
If you want to test that all lanes are less than some power of two, you can do this by replacing vqmovn_u32() with vqshrn_n_u32(); and I believe this can be extended to being within +/- a power of two (including the lower bound, excluding the upper bound) for signed types using vqrshrn_n_s32(). For example, you should be able to accept both -1 and 0 in a single test using vqrshrn_n_s32(x, 1).
If you just want to know whether two vectors are equal or not, try the following code:
uint32x4_t result = vceqq_u32(a, b);
if (vminvq_u32(result) != 0xffffffff) {
    // not equal
} else {
    // equal
}
See ARM's manual: CMEQ and UMINV

Efficient comparison of small integer vectors

I have small vectors. Each of them is made of 10 integers which are between 0 and 15. This means that every element in a vector can be written using 4 bits. Hence I can concatenate my vector elements and store the whole vector in a single long type (in C, C++, Java...).
Vector v1 dominates vector v2 if for each i in 0,...,9, v1[i] >= v2[i].
I want to write a method compare(long v1, long v2) that returns 0 if neither of the vectors dominates the other, 1 if the first one dominates, and -1 if the second one dominates.
Is there any efficient way to implement compare other than getting every i component and doing 10 times the normal integer comparison?
EDIT
if v1 is exactly the same as v2, returning either 1 or -1 is fine
It's possible to do this using bit-manipulation. Space your values out so that each takes up 5 bits, with 4 bits for the value and an empty 0 in the most significant position as a kind of spacing bit.
Placing a spacing bit between each value stops borrows/carries from propagating between adjacent values and means you can do certain SIMD-like arithmetic operations on the vector just by using regular integer addition or subtraction. We can use subtraction to do a vector comparison.
To do the test you can set all the spacing bits to 1 in one of the vectors and then subtract the second one. If the value in the 4 bits below a spacing bit is greater in the second vector, the subtraction will borrow from the spacing bit and clear it in the result; if not, the spacing bit will remain a one (the first value is greater than or equal to the second). If the first vector dominates the second then all the spacing bits will be one after the subtraction.
Simple demonstration using ints:
#define SPACING_BITS ((1<<4)|(1<<9)|(1<<14)|(1<<19))

int createVector(int v0, int v1, int v2, int v3)
{
    return v0 | (v1 << 5) | (v2 << 10) | (v3 << 15);
}

int vectorDominates(int vectorA, int vectorB)
{
    // returns 1 if vectorA dominates vectorB:
    return (((vectorA | SPACING_BITS) - vectorB) & SPACING_BITS) == SPACING_BITS;
}

int compare(int vectorA, int vectorB)
{
    if (vectorDominates(vectorA, vectorB))
        return 1;
    else if (vectorDominates(vectorB, vectorA))
        return -1;
    return 0;
}
You can extend it to use 64 bit values using 50 bits to store the 10 values. You can also inline the calls to vectorDominates in the compare function.
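For example, the 64-bit variant might look something like this (a sketch following the same pattern; names are illustrative and it is untested):
#include <stdint.h>

// 10 elements, 5 bits each (4 value bits + 1 spacing bit), in the low 50 bits.
static const uint64_t SPACING_BITS_64 =
    (1ULL << 4)  | (1ULL << 9)  | (1ULL << 14) | (1ULL << 19) | (1ULL << 24) |
    (1ULL << 29) | (1ULL << 34) | (1ULL << 39) | (1ULL << 44) | (1ULL << 49);

uint64_t createVector64(const int v[10])
{
    uint64_t r = 0;
    for (int i = 0; i < 10; i++)
        r |= (uint64_t)v[i] << (5 * i);   // each value gets its own 5-bit slot
    return r;
}

int vectorDominates64(uint64_t a, uint64_t b)
{
    // a dominates b if no per-element subtraction borrows through a spacing bit
    return (((a | SPACING_BITS_64) - b) & SPACING_BITS_64) == SPACING_BITS_64;
}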
Well, in C you can likely leverage vectorization to do this. I don't think it's directly possible to compare on 4-bit operands, so you're going to have to re-pack (either on the fly or just keep your data in a more suitable format) up to 8-bit before doing the comparison. Since 10 * 8 = 80 which is more than 64, you're going to need 128-bit vector instructions.
Not sure if Java VMs support that yet, but this question suggests that JNI is the answer, i.e. call C code from Java.

SSE intrinsic over int16[8] to extract the sign of each element

I'm working with SSE intrinsic functions. I have an __m128i representing an array of 8 signed short (16 bit) values.
Is there a function to get the sign of each element?
EDIT1:
something that can be used like this:
short tmpVec[8];
__m128i tmp, sgn;
for (i = 0; i < 8; i++)
    tmp.m128i_i16[i] = tmpVec[i];
sgn = _mm_sign_epi16(tmp);
of course "_mm_sign_epi16" doesn't exist, so that's what I'm looking for.
How slow is it to do it element by element?
EDIT2:
desired behaviour: 1 for positive values, 0 for zero, and -1 for negative values.
thanks
You can use min/max operations to get the desired result, e.g.
inline __m128i _mm_sgn_epi16(__m128i v)
{
    v = _mm_min_epi16(v, _mm_set1_epi16(1));
    v = _mm_max_epi16(v, _mm_set1_epi16(-1));
    return v;
}
This is probably a little more efficient than explicitly comparing with zero + shifting + combining results.
Note that there is already an _mm_sign_epi16 intrinsic in SSSE3 (PSIGNW - see tmmintrin.h), which behaves somewhat differently, so I changed the name for the required function to _mm_sgn_epi16. Using _mm_sign_epi16 might be more efficient when SSSE3 is available however, so you could do something like this:
inline __m128i _mm_sgn_epi16(__m128i v)
{
#ifdef __SSSE3__
    v = _mm_sign_epi16(_mm_set1_epi16(1), v);   // use PSIGNW on SSSE3 and later
#else
    v = _mm_min_epi16(v, _mm_set1_epi16(1));    // use PMINSW/PMAXSW on SSE2/SSE3
    v = _mm_max_epi16(v, _mm_set1_epi16(-1));
#endif
    return v;
}
Fill a register with zeros, and compare your register against it, first with "greater than", then with "lower than" (or invert the order of the operands in the "greater than" instruction).
http://msdn.microsoft.com/en-us/library/xd43yfsa%28v=vs.90%29.aspx
http://msdn.microsoft.com/en-us/library/t863edb2%28v=vs.90%29.aspx
The problem at this point is that the "true" value is represented as 0xffff, which happens to be -1: the correct result for negative numbers but not for positive ones. However, as pointed out by Raymond Chen in the comments, 0x0000 - 0xffff = 0x0001, so it's enough now to subtract the result of "greater than" from the result of "lower than".
http://msdn.microsoft.com/en-us/library/y25yya27%28v=vs.90%29.aspx
Of course Paul R's answer is preferable, as it uses only 2 instructions.
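Put together, the idea looks roughly like this (a sketch; only SSE2 is required, and the function name is just illustrative):
#include <emmintrin.h>

static inline __m128i sgn_epi16_sse2(__m128i v)
{
    __m128i zero = _mm_setzero_si128();
    __m128i gt = _mm_cmpgt_epi16(v, zero);   // 0xFFFF (-1) where v > 0
    __m128i lt = _mm_cmplt_epi16(v, zero);   // 0xFFFF (-1) where v < 0
    return _mm_sub_epi16(lt, gt);            // 0 - (-1) = +1, (-1) - 0 = -1, 0 - 0 = 0
}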
You can shift all 8 shorts at once using _mm_srai_epi16(tmp, 15), which will return eight 16-bit integers, each being all ones (i.e. -1) if the input was negative, or all zeros (i.e. 0) if it was non-negative.

How to properly add/subtract a 128-bit number (as two uint64_t)?

I'm working in C and need to add and subtract a 64-bit number and a 128-bit number. The result will be held in the 128-bit number. I am using an integer array to store the upper and lower halves of the 128-bit number (i.e. uint64_t bigNum[2], where bigNum[0] is the least significant).
Can anybody help with an addition and subtraction function that can take in bigNum and add/subtract a uint64_t to it?
I have seen many incorrect examples on the web, so consider this:
bigNum[0] = 0;
bigNum[1] = 1;
subtract(&bigNum, 1);
At this point bigNum[0] should have all bits set, while bigNum[1] should have no bits set.
On many architectures it's very easy to add/subtract arbitrarily-long integers because there's a carry flag and add/sub-with-carry instructions. For example on x86, rdx:rax += r8:r9 can be done like this:
add rax, r9 # add the low parts and store the carry
adc rdx, r8 # add the high parts with carry
In C there's no way to access this carry flag, so you must calculate the carry on your own. The easiest way is to check whether the unsigned sum is less than either of the operands. For example, to do a += b we'll do
aL += bL;
aH += bH + (aL < bL);
This is exactly how multi-word add is done in architectures that don't have a flag register. For example in MIPS it's done like this
# alow = blow + clow
addu alow, blow, clow
# set tmp = 1 if alow < clow, else 0
sltu tmp, alow, clow
addu ahigh, bhigh, chigh
addu ahigh, ahigh, tmp
Here's some example assembly output
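Spelled out in C for the question's layout (bigNum[0] = low 64 bits, bigNum[1] = high 64 bits), a sketch might look like this (function names are illustrative):
#include <stdint.h>

void bignum_add(uint64_t num[2], uint64_t x)
{
    num[0] += x;
    num[1] += (num[0] < x);         // carry out of the low word
}

void bignum_sub(uint64_t num[2], uint64_t x)
{
    uint64_t borrow = (x > num[0]); // low word is about to wrap around
    num[0] -= x;
    num[1] -= borrow;
}
With the question's test case ({0, 1}, i.e. 2^64), bignum_sub(bigNum, 1) leaves bigNum[0] with all bits set and bigNum[1] zero, as required.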
This should work for the subtraction:
typedef u_int64_t bigNum[2];

// Note: this answer treats element [1] as the low (least significant) half,
// the opposite of the layout described in the question.
void subtract(bigNum *a, u_int64_t b)
{
    const u_int64_t borrow = b > (*a)[1];
    (*a)[1] -= b;
    (*a)[0] -= borrow;
}
Addition is very similar. The above could of course be expressed with an explicit test, too, but I find it cleaner to always do the borrowing. Optimization left as an exercise.
For a bigNum equal to { 0, 1 }, subtracting two would make it equal { ~0UL, ~0UL }, which is the proper bit pattern to represent -1. Here, UL is assumed to promote an integer to 64 bits, which is compiler-dependent of course.
In grade 1 or 2, you learnt how to break an addition down into parts by splitting it into separate additions of tens and units. When dealing with big numbers, the same principles can be applied to compute arithmetic operations on arbitrarily large numbers, by realizing that your units are now units of 2^bits, your "tens" are 2^bits larger, and so on.
If the value that you are subtracting is less than or equal to bignum[0], you don't have to touch bignum[1].
If it isn't, you subtract it from bignum[0] anyhow. This operation will wrap around, but that is the behaviour you need here. In addition you'd then have to subtract 1 from bignum[1].
Most compilers support a __int128 type intrinsically.
Try it and you might be lucky.
