I'm working with SSE intrinsic functions. I have an __m128i representing an array of 8 signed short (16 bit) values.
Is there a function to get the sign of each element?
EDIT1:
something that can be used like this:
short tmpVec[8];
__m128i tmp, sgn;
for (i = 0; i < 8; i++)
    tmp.m128i_i16[i] = tmpVec[i];
sgn = _mm_sign_epi16(tmp);
of course "_mm_sign_epi16" doesn't exist, so that's what I'm looking for.
How slow would it be to do it element by element?
EDIT2:
desired behaviour: 1 for positive values, 0 for zero, and -1 for negative values.
thanks
You can use min/max operations to get the desired result, e.g.
inline __m128i _mm_sgn_epi16(__m128i v)
{
v = _mm_min_epi16(v, _mm_set1_epi16(1));
v = _mm_max_epi16(v, _mm_set1_epi16(-1));
return v;
}
This is probably a little more efficient than explicitly comparing with zero + shifting + combining results.
Note that there is already an _mm_sign_epi16 intrinsic in SSSE3 (PSIGNW - see tmmintrin.h), which behaves somewhat differently, so I changed the name of the required function to _mm_sgn_epi16. However, using _mm_sign_epi16 may be more efficient when SSSE3 is available, so you could do something like this:
inline __m128i _mm_sgn_epi16(__m128i v)
{
#ifdef __SSSE3__
v = _mm_sign_epi16(_mm_set1_epi16(1), v); // use PSIGNW on SSSE3 and later
#else
v = _mm_min_epi16(v, _mm_set1_epi16(1)); // use PMINSW/PMAXSW on SSE2/SSE3.
v = _mm_max_epi16(v, _mm_set1_epi16(-1));
#endif
return v;
}
Fill a register with zeros and compare it with your register, first with "greater than", then with "less than" (or invert the order of the operands in the "greater than" instruction).
http://msdn.microsoft.com/en-us/library/xd43yfsa%28v=vs.90%29.aspx
http://msdn.microsoft.com/en-us/library/t863edb2%28v=vs.90%29.aspx
The problem at this point is that the true value is represented as 0xffff, which happens to be -1, correct result for the negative number but not for the positive. However, as pointed out by Raymond Chen in the comments, 0x0000 - 0xffff = 0x0001, so it's enough now to subtract the result of "greater than" from the result of "lower than".
http://msdn.microsoft.com/en-us/library/y25yya27%28v=vs.90%29.aspx
Of course Paul R's answer is preferable, as it uses only two instructions.
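A sketch of the compare-and-subtract approach with SSE2 intrinsics (the function name here is just illustrative):

```c
#include <emmintrin.h>

/* Compare-and-subtract signum sketch (SSE2).
   _mm_cmpgt_epi16 produces 0xFFFF (i.e. -1) per lane where the
   comparison is true, so "less than" minus "greater than" yields
   -1, 0 or +1 per lane. */
static inline __m128i sgn_epi16_cmp(__m128i v)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i lt = _mm_cmpgt_epi16(zero, v); /* -1 where v < 0 */
    __m128i gt = _mm_cmpgt_epi16(v, zero); /* -1 where v > 0 */
    return _mm_sub_epi16(lt, gt);          /* 0 - (-1) = +1 for positive */
}
```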
You can shift all 8 shorts at once using _mm_srai_epi16(tmp, 15), which returns eight 16-bit integers, each being all ones (i.e. -1) if the input was negative, or all zeros (i.e. 0) if it was non-negative.
Related
I have this piece of code and I would like to eventually implement a modified version of the bitmask evaluation algorithm(s) from this paper - Adapting Tree Structures for Processing with SIMD Instructions.
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>
#include <immintrin.h>
#include <stdalign.h>
int main(void)
{
    __m256d avx_creg, res, avx_sreg;
    int bitmask;
    uint64_t key = 503;
    avx_sreg = _mm256_castsi256_pd(_mm256_set1_epi64x(key));
    alignas(32) uint64_t v[4];
    _mm256_store_pd((double *)v, avx_sreg);
    printf("v2_u64: %" PRIu64 " %" PRIu64 " %" PRIu64 " %" PRIu64 "\n", v[0], v[1], v[2], v[3]);
    uint64_t b[4] = {500, 505, 510, 515};
    avx_creg = _mm256_castsi256_pd(
        _mm256_loadu_si256((__m256i const *)&b));
    alignas(32) uint64_t v1[4];
    _mm256_store_pd((double *)v1, avx_creg);
    printf("v2_u64: %" PRIu64 " %" PRIu64 " %" PRIu64 " %" PRIu64 "\n", v1[0], v1[1], v1[2], v1[3]);
    res = _mm256_cmp_pd(avx_sreg, avx_creg, 30); /* 30 == _CMP_GT_OQ */
    bitmask = _mm256_movemask_pd(res);
    int mmask = __builtin_popcount(bitmask);
    printf("mmask is %d\n", mmask);
    return 0;
}
The above code prints the value of mmask as 1. So here is where I am not clear at all. Am I supposed to interpret the number "1" as the array index, where the array element is greater than the input key, or does it refer to the number of bits that are set ?
If for instance I change the key to 499 the mmask prints as 0.
Finally if I change the key to 517 the value of mmask is 4.
Can somebody clarify ? I also had a second question and I can ask this as a separate question if it is suggested. Is it possible to get all the values that are greater than the given input key from AVX intrinsics?
movemask produces an integer bitmap by taking the high bit of each element from the vector. Print it as hex or base-2 to see it better.
If you only care about 0 vs. non-zero counts, just check if(bitmask != 0)
Or if(bitmask == 0x0f) to check if they're all true. (4 bits for a 4-element vector).
Use popcount to find out how many were true. __builtin_popcount counts the number of set bits in its input.
Use __builtin_ctz to find the position of the first element where the comparison was true. (Counting from low to high memory address, if the vectors were loaded from memory). Beware that __builtin_ctz is only meaningful for non-zero inputs. e.g. in a memchr loop, you'd use ctz only after breaking out of the search loop on _mm256_movemask_epi8(cmp_result) == 0 to establish that there was a match in this vector. (epi8 because I'm talking about a byte-search loop, unlike your packed-double compares).
You might want to use BMI1 _lzcnt_u32(bitmask) to get a well-defined result (32 leading zeros) on bitmask=0, if you're already requiring AVX2. (Because I think all AVX2 CPUs have BMI1.)
To iterate over the matches, you could use a clear-lowest-set-bit operation, and if there are still any bits set, then ctz to find out which one. See Clearing the lowest set bit of a number.
x & (x-1) will efficiently compile to a BMI1 blsr instruction if you compile with BMI1 enabled, e.g. with -march=haswell.
(for this to work well, you definitely want a movemask that matches your vector element size, so for 64-bit integer, cast your vector to _pd so you can use _mm256_movemask_pd.)
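The iteration loop above can be sketched in portable C (GCC/Clang builtins; out receives the matching element indices):

```c
/* Iterate over set bits of a movemask result, lowest bit first.
   bitmask & (bitmask - 1) clears the lowest set bit; with BMI1
   enabled (e.g. -march=haswell) it compiles to a single blsr. */
static int match_indices(unsigned bitmask, int *out)
{
    int n = 0;
    while (bitmask) {
        out[n++] = __builtin_ctz(bitmask); /* position of lowest set bit */
        bitmask &= bitmask - 1;            /* clear it */
    }
    return n;
}
```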
I have the below code
if(value == 0)
{
value = 1;
}
Using NEON vectorized instructions I need to perform the above. How do I compare a NEON register against zero for equality, 4 elements at a time, and set an element to 1 if it is zero?
If you want to check if any element of a vector is non-zero and branch on that:
You can use get min/max across vector lanes.
if(vmaxvq_u32(value) == 0) { // Max value across quad vector, equals zero?
value = vmovq_n_u32(1); // Set all lanes to 1
}
For double vectors
if(vmaxv_u32(value) == 0) { // Max value across double vector, equals zero?
value = vmov_n_u32(1); // Set all lanes to 1
}
Notice the only difference is the 'q', which indicates a quad 128-bit vector; without it, the intrinsic operates on a 64-bit double vector. The compiler will use a mov instruction to transfer the result from a NEON register to an ARM general-purpose register to do the comparison.
Assuming integer data, then thanks to NEON having specific "compare against zero" instructions, and the bitwise way comparison results work, there's a really cheeky way to do this using just one spare register. In generalised pseudo-assembly:
VCEQ.type mask, data, #0 # Generate bitmask vector with all bits set in elements
# corresponding to zero elements in the data
VSUB.type data, data, mask # Interpret "mask" as a vector of 0s and -1s, with the
# result of incrementing just the zero elements of "data"
# (thanks to twos complement underflow)
This trick doesn't work for floating-point data as the bit-patterns for nonzero values are more complicated, and neither does it work if the replacement value is to be anything other than 1 (or -1), so in those cases you would need to construct a separate vector containing the appropriate replacement elements and do a conditional select using the comparison mask as per @Ermlg's answer.
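The same trick modeled in portable scalar C for a single lane (the NEON version simply does four of these at once):

```c
#include <stdint.h>

/* Scalar model of VCEQ #0 followed by VSUB: the mask is all ones
   (i.e. -1) exactly where the element is zero, so subtracting the
   mask increments only the zero elements. */
static uint32_t replace_zero_with_one(uint32_t x)
{
    uint32_t mask = (x == 0) ? UINT32_MAX : 0; /* VCEQ x, #0 */
    return x - mask;                           /* VSUB: x - (-1) = x + 1 */
}
```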
Maybe it will look something like this:
uint32x4_t value = {7, 0, 0, 3};
uint32x4_t zero = {0, 0, 0, 0};
uint32x4_t one = {1, 1, 1, 1};
uint32x4_t mask = vceqq_u32(value, zero);
value = vbslq_u32(mask, one, value);
To get more information see here.
I'm experimenting with a cross-platform SIMD library ala ecmascript_simd aka SIMD.js, and part of this is providing a few "horizontal" SIMD operations. In particular, the API that library offers includes any(<boolN x M>) -> bool and all(<boolN x M>) -> bool functions, where <T x K> is a vector of K elements of type T and boolN is an N-bit boolean, i.e. all ones or all zeros, as SSE and NEON return for their comparison operations.
For example, let v be a <bool32 x 4> (a 128-bit vector), it could be the result of VCLT.S32 or something. I'd like to compute all(v) = v[0] && v[1] && v[2] && v[3] and any(v) = v[0] || v[1] || v[2] || v[3].
This is easy with SSE, e.g. movmskps will extract the high bit of each element, so all for the type above becomes (with C intrinsics):
#include <xmmintrin.h>
int all(__m128 x) {
return _mm_movemask_ps(x) == 8 + 4 + 2 + 1;
}
and similarly for any.
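For reference, a minimal any under the same convention (true if any lane's high bit is set):

```c
#include <xmmintrin.h>

int any(__m128 x) {
    return _mm_movemask_ps(x) != 0; /* any lane's sign bit set? */
}
```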
I'm struggling to find obvious/nice/efficient ways to implement this with NEON, which doesn't support an instruction like movmskps. There's the approach of simply extracting each element and computing with scalars. E.g. there's the naive method but there's also the approach of using the "horizontal" operations NEON supports, like VPMAX and VPMIN.
#include <arm_neon.h>
int all_naive(uint32x4_t v) {
return v[0] && v[1] && v[2] && v[3];
}
int all_horiz(uint32x4_t v) {
uint32x2_t x = vpmin_u32(vget_low_u32(v),
vget_high_u32(v));
uint32x2_t y = vpmin_u32(x, x);
return y[0] != 0;
}
(One can do a similar thing for the latter with VPADD, which may be faster, but it's fundamentally the same idea.)
Are there are other tricks one can use to implement this?
Yes, I know that horizontal operations are not great with SIMD vector units. But sometimes it is useful, e.g. many SIMD implementations of Mandelbrot will operate on 4 points at once and bail out of the inner loop when all of them are out of range... which requires doing a comparison and then a horizontal AND.
This is my current solution that is implemented in eve library.
If your backend has C++20 support, you can just use the library: it has implementations for arm-v7, arm-v8 (only little endian at the moment) and all x86 from sse2 to avx-512. It's open source and MIT licensed. In beta at the moment. Feel free to reach out (for example with an issue) if you are trying out the library.
Take everything with a grain of salt - I don't yet have the arm benchmarks set up.
NOTE: On top of basic all and any we also have a movemask equivalent to do more complex operations like first_true. That wasn't part of the question and it's not amazing but the code can be found here
ARM-V7, 8 bytes register
Now, arm-v7 is a 32-bit architecture, so we try to get to 32-bit elements where we can.
any
Use pairwise 32 bit max. If any element is true, the max is true.
// cast to dwords
dwords = vpmax_u32(dwords, dwords);
return vget_lane_u32(dwords, 0);
all
Pairwise min instead of max. Also what you test against changes.
If you have a 4-byte element, just test for true. For shorts or chars, you need to test for -1.
// cast to dwords
dwords = vpmin_u32(dwords, dwords);
std::uint32_t combined = vget_lane_u32(dwords, 0);
// Assuming T is your scalar type
if constexpr ( sizeof(T) >= 4 ) return combined;
// I decided that !~ is better than -1, compiler will figure it out.
return !~combined;
ARM-V7, 16 bytes register
For anything bigger than chars, just do a conversion to a 64 bit one. Here is the list of vector narrow integer conversions.
For chars, the best I found is to reinterpret as uint32 and do an extra check: compare for == -1 for all and > 0 for any. This produced nicer asm than splitting into two 8-byte registers.
Then just do all/any on that dword register.
ARM-v8, 8 byte
ARM-v8 has 64 bit support, so you can just get a 64 bit lane. That one is trivially testable.
ARM-v8, 16 byte
We use vmaxvq_u32 since there is not a 64 bit one for any and vminvq_u32, vminvq_u16 or vminvq_u8 for all depending on the element size.
(Which is similar to glibc strlen)
Conclusion
The lack of benchmarks definitely makes me worried; some instructions are occasionally problematic and I wouldn't know about it.
Regardless, that's the best I've got, so far at least.
NOTE: first time looking at arm today, I might be wrong about things.
UPD: Removed ARM-V7 and will write up what we ended up doing in a separate answer
ARM-V8.
For ARM-V8, have a look at this strlen implementation from glibc:
https://code.woboq.org/userspace/glibc/sysdeps/aarch64/multiarch/strlen_asimd.S.html
ARM-V8 introduced reductions across registers. Here they use min to compare with 0
uminv datab2, datav.16b
mov tmp1, datav2.d[0]
cbnz tmp1, L(main_loop)
Find the smallest byte; if it is nonzero (no NUL found), take the next 16 bytes.
There are a few other reductions in ARM-V8 like vaddvq_u8.
I'm pretty sure you can do most of the things you'd want from movemask and alike with this.
Another interesting thing here is how they find the first_true
/* Set the NULL byte as 0xff and the rest as 0x00, move the data into a
pair of scalars and then compute the length from the earliest NULL
byte. */
cmeq datav.16b, datav.16b, #0
mov data1, datav.d[0]
mov data2, datav.d[1]
cmp data1, 0
csel data1, data1, data2, ne
sub len, src, srcin
rev data1, data1
add tmp2, len, 8
clz tmp1, data1
csel len, len, tmp2, ne
add len, len, tmp1, lsr 3
Looks a bit intimidating, but my understanding is:
they narrow it down to a 64-bit number just by doing an if/else (if the first half doesn't have the zero, the second half does);
then they use count leading zeroes to find the position (I didn't quite understand all of the endianness handling here, but it's libc, so this should be the correct one).
So - if you only need V8 - there is a solution.
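On a little-endian target the position computation can be modeled in portable C with a count-trailing-zeros instead of the rev + clz pair (the helper name here is made up for illustration):

```c
#include <stdint.h>

/* Hypothetical helper: given the two 64-bit halves of a 16-byte
   compare result (each matching byte is 0xff, others 0x00), return
   the index of the first matching byte, or 16 if there is none.
   glibc byte-reverses and uses clz; on a little-endian value,
   ctz / 8 gives the byte index directly. */
static int first_true_byte(uint64_t lo, uint64_t hi)
{
    if (lo) return __builtin_ctzll(lo) / 8;
    if (hi) return 8 + __builtin_ctzll(hi) / 8;
    return 16;
}
```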
I'm trying to limit arithmetic operations before they are executed to the result of at most 32 bit integers, specifically for addition.
This loop will find the bit position:
size_t highestOneBitPosition(uint32_t a) {
size_t bits=0;
while (a!=0) {
++bits;
a>>=1;
};
return bits;
}
This function effectively limits multiplication:
bool multiplication_is_safe(uint32_t a, uint32_t b) {
size_t a_bits=highestOneBitPosition(a), b_bits=highestOneBitPosition(b);
return (a_bits+b_bits<=32);
}
However, I'm unsure how to do this with addition. Something like this:
bool addition_is_safe(uint32_t a, uint32_t b) {
size_t a_bits=highestOneBitPosition(a), b_bits=highestOneBitPosition(b);
return (a_bits<32 && b_bits<32);
}
However, this will not limit the result to 32 bits (or 0x7FFFFFFF for signed); it only makes sure each operand has at most that many bit positions.
Mathematically, if you add two numbers, you have at most a carry of 1 into the place beyond the longest. So if you add a 4 digit number to a 3 digit number (or anything 4 digits or less), you have at most a 5 digit number. Except, when you have two with the same, you can end up with more (99 * 99 = 9801) so then it would be the same concept as in multiplication (a_bits+b_bits <=32)
What I would have to do is determine the longest operand, then add 1 and make sure that it's not exceeding 32 bit positions. I am entirely unsure how to do this with a function. My question is how can I modify addition_is_safe(uint32_t a, uint32_t b) to limit the result to <=32 as it is in multiplication_is_safe. I definitely want to utilize the HighestOneBit Position with this.
First of all, I don't even think that this function is correct:
bool multiplication_is_safe(uint32_t a, uint32_t b) {
size_t a_bits=highestOneBitPosition(a), b_bits=highestOneBitPosition(b);
return (a_bits+b_bits<=32);
}
It does return false when the multiply will overflow, but it also returns false when the multiply doesn't overflow. For example, given a = 0x10000 and b = 0x8000, this function returns false even though the result of a*b is 0x80000000 which fits in 32 bits. But if you change a and b to 0x1ffff and 0xffff (which have the same "highest one bit positions") then the multiply actually does overflow. But you couldn't tell by just using the highest bit position. You would need to look at more than just the top bit to figure out whether the multiplication will overflow. In fact, you would need to do part or all of the actual multiplication to figure out the right answer.
Similarly, you could construct a function addition_is_safe that detects "possible overflows" (both in the positive and negative direction) using bit positions. But you can't detect "actual overflow" unless you do part or all of the actual addition.
I believe that in the worst case, you will be forced to do the full multiplication or addition, so I'm not sure you will be saving anything by not letting the machine just do the full multiplication/addition for you.
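The 0x10000 / 0x8000 case from above can be checked directly, reusing the question's functions:

```c
#include <stdint.h>
#include <stddef.h>

static size_t highestOneBitPosition(uint32_t a)
{
    size_t bits = 0;
    while (a != 0) { ++bits; a >>= 1; }
    return bits;
}

static int multiplication_is_safe(uint32_t a, uint32_t b)
{
    return highestOneBitPosition(a) + highestOneBitPosition(b) <= 32;
}
```

Here multiplication_is_safe(0x10000, 0x8000) returns false (17 + 16 = 33 bits), yet the product 0x80000000 fits in 32 bits: a false negative.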
Mathematically, if you add two numbers, you have at most a carry of 1
into the place beyond the longest.
That's absolutely correct (for unsigned binary numbers) without exception; you just got lost in your further consideration. So, the addition_is_safe condition based on the summands' numbers of bits is: the highest number of bits of the summands has to be smaller than the available number of bits.
bool addition_is_safe(uint32_t a, uint32_t b)
{
size_t a_bits=highestOneBitPosition(a), b_bits=highestOneBitPosition(b);
return (a_bits<b_bits?b_bits:a_bits)<32;
}
Surely you are aware that a false return from that function doesn't always mean overflow would occur, but a true return means overflow cannot occur.
You can check for overflow from adding two unsigned integers with (a + b) < a || (a + b) < b.
Overflow will either make the value negative (for signed integers) or leave a smaller mod-2^32 remainder (for unsigned).
A positive added to a negative will never overflow.
Two negatives behave similarly to two positives.
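A sketch of the exact unsigned test (note the second comparison is redundant: if the sum wrapped, it is smaller than both operands):

```c
#include <stdint.h>

/* Exact overflow test for unsigned 32-bit addition: the sum wraps
   mod 2^32, so it is less than the operands iff overflow occurred. */
static int addition_overflows(uint32_t a, uint32_t b)
{
    return (uint32_t)(a + b) < a;
}
```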
I'm working in C and need to add and subtract a 64-bit number and a 128-bit number. The result will be held in the 128-bit number. I am using an integer array to store the upper and lower halves of the 128-bit number (i.e. uint64_t bigNum[2], where bigNum[0] is the least significant).
Can anybody help with an addition and subtraction function that can take in bigNum and add/subtract a uint64_t to it?
I have seen many incorrect examples on the web, so consider this:
bigNum[0] = 0;
bigNum[1] = 1;
subtract(&bigNum, 1);
At this point bigNum[0] should have all bits set, while bigNum[1] should have no bits set.
In many architectures it's very easy to add/subtract any arbitrarily-long integers because there's a carry flag and add/sub-with-flag instruction. For example on x86 rdx:rax += r8:r9 can be done like this
add rax, r9 # add the low parts and store the carry
adc rdx, r8 # add the high parts with carry
In C there's no way to access this carry flag, so you must compute the carry yourself. The easiest way is to check whether the unsigned sum is less than one of the operands. For example, to do a += b we'll do
aL += bL;
aH += bH + (aL < bL);
This is exactly how multi-word add is done in architectures that don't have a flag register. For example in MIPS it's done like this
# alow = blow + clow
addu alow, blow, clow
# set tmp = 1 if alow < clow, else 0
sltu tmp, alow, clow
addu ahigh, bhigh, chigh
addu ahigh, ahigh, tmp
Here's some example assembly output
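Written out as C for the layout in the question (index 0 least significant), the add and the matching subtract look like this:

```c
#include <stdint.h>

/* num += b; num[0] is the least-significant 64 bits */
static void add128(uint64_t num[2], uint64_t b)
{
    num[0] += b;
    num[1] += (num[0] < b); /* carry out of the low word */
}

/* num -= b */
static void sub128(uint64_t num[2], uint64_t b)
{
    uint64_t borrow = (b > num[0]); /* would the low word underflow? */
    num[0] -= b;
    num[1] -= borrow;
}
```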
This should work for the subtraction:
typedef uint64_t bigNum[2]; /* bigNum[0] is the least significant half */
void subtract(bigNum *a, uint64_t b)
{
    const uint64_t borrow = b > (*a)[0];
    (*a)[0] -= b;
    (*a)[1] -= borrow;
}
Addition is very similar. The above could of course be expressed with an explicit test, too, but I find it cleaner to always do the borrowing. Optimization left as an exercise.
For a bigNum equal to { 1, 0 } (i.e. the value 1), subtracting two would make it equal { ~(uint64_t)0, ~(uint64_t)0 }, which is the proper bit pattern to represent -1. (Writing the constant as ~(uint64_t)0 rather than ~0UL avoids depending on the width of unsigned long, which is compiler-dependent.)
In grade 1 or 2, you should have learned how to break down the addition of 1 and 10 into parts, by splitting it into separate additions of tens and units. When dealing with big numbers, the same principles can be applied to compute arithmetic operations on arbitrarily large numbers, by realizing your units are now units of 2^bits, your "tens" are 2^bits larger, and so on.
In the case where the value you are subtracting is less than or equal to bignum[0], you don't have to touch bignum[1].
If it isn't, you subtract it from bignum[0] anyhow. This operation will wrap around, but that is the behavior you need here. You then also have to subtract 1 from bignum[1].
Most compilers support a __int128 type intrinsically.
Try it and you might be lucky.
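If your compiler does support it (GCC and Clang spell it unsigned __int128 on 64-bit targets; this is an extension, not standard C), the whole exercise reduces to ordinary arithmetic:

```c
#include <stdint.h>

/* GCC/Clang extension: 128-bit arithmetic directly */
typedef unsigned __int128 u128;

static u128 make_u128(uint64_t lo, uint64_t hi)
{
    return ((u128)hi << 64) | lo;
}
```

Addition and subtraction then need no manual carry or borrow handling at all.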