How to add all int32 element in a lane using neon intrinsic - c

Here is my code for adding all int16x4 element in a lane:
#include <arm_neon.h>
...
int16x4_t acc = vdup_n_s16(1);
int32x2_t acc1;
int64x1_t acc2;
int32_t sum;
acc1 = vpaddl_s16(acc);
acc2 = vpaddl_s32(acc1);
sum = (int)vget_lane_s64(acc2, 0);
printf("%d\n", sum);// 4
And I tried to add all int32x4 element in a lane.
but my code looks inefficient:
#include <arm_neon.h>
...
int32x4_t accl = vdupq_n_s32(1);
int64x2_t accl_1;
int64_t temp;
int64_t temp2;
int32_t sum1;
accl_1=vpaddlq_s32(accl);
temp = (int)vgetq_lane_s64(accl_1,0);
temp2 = (int)vgetq_lane_s64(accl_1,1);
sum1=temp+temp2;
printf("%d\n", sum);// 4
Is there simply and clearly way to do this? I hope the LLVM assembly code is simply and clearly after compile it. and I also hope the final type of sum is 32 bits.
I used ellcc cross-compiler base on LLVM compiler infrastructure to compile it.
I saw the similar question(Add all elements in a lane) on stackoverflow, but the intrinsic addv doesn't work on my host.

If you only want a 32-bit result, presumably either intermediate overflow is unlikely or you simply don't care about it, in which case you could just stay 32-bit all the way:
int32x2_t temp = vadd_s32(vget_high_s32(accl), vget_low_s32(accl));
int32x2_t temp2 = vpadd_s32(temp, temp);
int32_t sum1 = vget_lane_s32(temp2, 0);
However, using 64-bit accumulation isn't actually any more hassle, and can also be done without dropping out of NEON - it's just a different order of operations:
int64x2_t temp = vpaddlq_s32(accl);
int64x1_t temp2 = vadd_s64(vget_high_s64(temp), vget_low_s64(temp));
int32_t sum1 = vget_lane_s32(temp2, 0);
Either of those boils down to just 3 NEON instructions and no scalar arithmetic. The crucial trick on 32-bit ARM is that a pairwise add of two halves of a Q register is simply a normal add of two D registers - that doesn't apply to AArch64 where the SIMD register layout is different, but then AArch64 has the aforementioned horizontal addv anyway.
Now, how horrible any of that looks in LLVM IR I don't know - I suppose it depends on how it treats vector types and operations internally - but in terms of the final ARM machine code both could be considered optimal.

Related

ARM Neon in C: How to combine different 128bit data types while using intrinsics?

TLTR
For arm intrinsics, how do you feed a 128bit variable of type uint8x16_t into a function expecting uint16x8_t?
EXTENDED VERSION
Context: I have a greyscale image, 1 byte per pixel. I want to downscale it by a factor 2x. For each 2x2 input box, I want to take the minimum pixel. In plain C, the code will look like this:
for (int y = 0; y < rows; y += 2) {
uint8_t* p_out = outBuffer + (y / 2) * outStride;
uint8_t* p_in = inBuffer + y * inStride;
for (int x = 0; x < cols; x += 2) {
*p_out = min(min(p_in[0],p_in[1]),min(p_in[inStride],p_in[inStride + 1]) );
p_out++;
p_in+=2;
}
}
Where both rows and cols are multiple of 2. I call "stride" the step in bytes that takes to go from one pixel to the pixel immediately below in the image.
Now I want to vectorize this. The idea is:
take 2 consecutive rows of pixels
load 16 bytes in a from the top row, and load the 16 bytes immediately below in b
compute the minimum byte by byte between a and b. Store in a.
create a copy of a shifting it right by 1 byte (8 bits). Store it in b.
compute the minimum byte by byte between a and b. Store in a.
store every second byte of a in the output image (discards half of the bytes)
I want to write this using Neon intrinsics. The good news is, for each step there exists an intrinsic that match it.
For example, at point 3 one can use (from here):
uint8x16_t vminq_u8(uint8x16_t a, uint8x16_t b);
And at point 4 one can use one of the following using a shift of 8 bits (from here):
uint16x8_t vrshrq_n_u16(uint16x8_t a, __constrange(1,16) int b);
uint32x4_t vrshrq_n_u32(uint32x4_t a, __constrange(1,32) int b);
uint64x2_t vrshrq_n_u64(uint64x2_t a, __constrange(1,64) int b);
That's because I do not care what happens to byte 1,3,5,7,9,11,13,15 because anyway they will be discarded from the final result. (The correctness of this has been verified and it's not the point of the question.)
HOWEVER, the output of vminq_u8 is of type uint8x16_t, and it is NOT compatible with the shift intrinsics that I would like to use. In C++ I addressed the problem with this templated data structure, while I have been told that the problem cannot be reliably addressed using union (Edit: although that answer refer to C++, and in fact in C type punning IS allowed), nor by using pointers to cast, because this will break the strict aliasing rule.
What is the way to combine different data types while using ARM Neon intrinsics?
For this kind of problem, arm_neon.h provides the vreinterpret{q}_dsttype_srctype casting operator.
In some situations, you might want to treat a vector as having a
different type, without changing its value. A set of intrinsics is
provided to perform this type of conversion.
So, assuming a and b are declared as:
uint8x16_t a, b;
Your point 4 can be written as(*):
b = vreinterpretq_u8_u16(vrshrq_n_u16(vreinterpretq_u16_u8(a), 8) );
However, note that unfortunately this does not address data types using an array of vector types, see ARM Neon: How to convert from uint8x16_t to uint8x8x2_t?
(*) It should be said, this is much more cumbersome of the equivalent (in this specific context) SSE code, as SSE has only one 128 bit integer data type (namely __m128i):
__m128i b = _mm_srli_si128(a,1);

Optimizing horizontal boolean reduction in ARM NEON

I'm experimenting with a cross-platform SIMD library ala ecmascript_simd aka SIMD.js, and part of this is providing a few "horizontal" SIMD operations. In particular, the API that library offers includes any(<boolN x M>) -> bool and all(<boolN x M>) -> bool functions, where <T x K> is a vector of K elements of type T and boolN is an N-bit boolean, i.e. all ones or all zeros, as SSE and NEON return for their comparison operations.
For example, let v be a <bool32 x 4> (a 128-bit vector), it could be the result of VCLT.S32 or something. I'd like to compute all(v) = v[0] && v[1] && v[2] && v[3] and any(v) = v[0] || v[1] || v[2] || v[3].
This is easy with SSE, e.g. movmskps will extract the high bit of each element, so all for the type above becomes (with C intrinsics):
#include<xmmintrin.h>
int all(__m128 x) {
return _mm_movemask_ps(x) == 8 + 4 + 2 + 1;
}
and similarly for any.
I'm struggling to find obvious/nice/efficient ways to implement this with NEON, which doesn't support an instruction like movmskps. There's the approach of simply extracting each element and computing with scalars. E.g. there's the naive method but there's also the approach of using the "horizontal" operations NEON supports, like VPMAX and VPMIN.
#include<arm_neon.h>
int all_naive(uint32x4_t v) {
return v[0] && v[1] && v[2] && v[3];
}
int all_horiz(uint32x4_t v) {
uint32x2_t x = vpmin_u32(vget_low_u32(v),
vget_high_u32(v));
uint32x2_t y = vpmin_u32(x, x);
return x[0] != 0;
}
(One can do a similar thing for the latter with VPADD, which may be faster, but it's fundamentally the same idea.)
Are there are other tricks one can use to implement this?
Yes, I know that horizontal operations are not great with SIMD vector units. But sometimes it is useful, e.g. many SIMD implementations of mandlebrot will operate on 4 points at once, and bail out of the inner loop when all of them are out of range... which requires doing a comparison and then a horizontal and.
This is my current solution that is implemented in eve library.
If your backend has C++20 support, you can just use the library: it has implementations for arm-v7, arm-v8 (only little endian at the moment) and all x86 from sse2 to avx-512. It's open source and MIT licensed. In beta at the moment. Feel free to reach out (for example with an issue) if you are trying out the library.
Take everything with a grain of salt - I don't yet have the arm benchmarks set up.
NOTE: On top of basic all and any we also have a movemask equivalent to do more complex operations like first_true. That wasn't part of the question and it's not amazing but the code can be found here
ARM-V7, 8 bytes register
Now, arm-v7 is 32 bit architecture, so we try to get to 32 bit elements where we can.
any
Use pairwise 32 bit max. If any element is true, the max is true.
// cast to dwords
dwords = vpmax_u32(dwords, dwords);
return vget_lane_u32(dwords, 0);
all
Pairwise min instead of max. Also what you test against changes.
If you have 4 byte element - just test for true. If shorts or chars - you need to test for -1;
// cast to dwords
dwords = vpmin_u32(dwords, dwords);
std::uint32_t combined = vget_lane_u32(dwords, 0);
// Assuming T is your scalar type
if constexpr ( sizeof(T) >= 4 ) return combined;
// I decided that !~ is better than -1, compiler will figure it out.
return !~combined;
ARM-V7, 16 bytes register
For anything bigger than chars, just do a conversion to a 64 bit one. Here is the list of vector narrow integer conversions.
For chars, the best I found is to reinterpret as uint32 and do an extra check.
So compare for == -1 for all and > 0 for any.
Seemed nicer asm the split in two 8 byte registers.
Then just do all/any on that dword register.
ARM-v8, 8 byte
ARM-v8 has 64 bit support, so you can just get a 64 bit lane. That one is trivially testable.
ARM-v8, 16 byte
We use vmaxvq_u32 since there is not a 64 bit one for any and vminvq_u32, vminvq_u16 or vminvq_u8 for all depending on the element size.
(Which is similar to glibc strlen)
Conclusion
Lack of benchmarks definitely makes me worried, some instructions are problematic sometimes and I don't know about it.
Regardless, that's the best I've got, so far at least.
NOTE: first time looking at arm today, I might be wrong about things.
UPD: Removed ARM-V7 and will write up what we ended up doing in a separate answer
ARM-V8.
For ARM-V8, have a look at this strlen implementation from glibc:
https://code.woboq.org/userspace/glibc/sysdeps/aarch64/multiarch/strlen_asimd.S.html
ARM-V8 introduced reductions across registers. Here they use min to compare with 0
uminv datab2, datav.16b
mov tmp1, datav2.d[0]
cbnz tmp1, L(main_loop)
Find the smallest char, compare with 0 - take the next 16 bytes.
There are a few other reductions in ARM-V8 like vaddvq_u8.
I'm pretty sure you can do most of the things you'd want from movemask and alike with this.
Another interesting thing here is how they find the first_true
/* Set te NULL byte as 0xff and the rest as 0x00, move the data into a
pair of scalars and then compute the length from the earliest NULL
byte. */
cmeq datav.16b, datav.16b, #0
mov data1, datav.d[0]
mov data2, datav.d[1]
cmp data1, 0
csel data1, data1, data2, ne
sub len, src, srcin
rev data1, data1
add tmp2, len, 8
clz tmp1, data1
csel len, len, tmp2, ne
add len, len, tmp1, lsr 3
Looks a bit intimidating, but my understanding is:
they narrow it down to a 64 bit number just by doing if/else (if the first half doesn't have the zero - the second half does.
use count leading zeroes to find the position (didn't quite understand all of the endianness stuff here but it's libc - so this is the correct one).
So - if you only need V8 - there is a solution.

Finding the instances of the number in a vector array in KNC (Xeon Phi)

I am trying to exploit the SIMD 512 offered by knc (Xeon Phi) to improve performance of the below C code using intel intrinsics. However, my intrinsic embedded code runs slower than auto-vectorized code
C Code
int64_t match=0;
int *myArray __attribute__((align(64)));
myArray = (int*) malloc (sizeof(int)*SIZE); //SIZE is array size taken from user
radomize(myArray); //to fill some random data
int searchVal=24;
#pragma vector always
for(int i=0;i<SIZE;i++) {
if (myArray[i]==searchVal) match++;
return match;
Intrinsic embedded code:
In the below code I am first loading the array and comparing it with search key. Intrinsics return 16bit mask values that is reduced using _mm512_mask_reduce_add_epi32().
register int64_t match=0;
int *myArray __attribute__((align(64)));
myArray = (int*) malloc (sizeof(int)*SIZE); //SIZE is array size taken from user
const int values[16]=\
{ 1,1,1,1,\
1,1,1,1,\
1,1,1,1,\
1,1,1,1,\
};
__m512i const flag = _mm512_load_epi32((void*) values);
__mmask16 countMask;
__m512i searchVal = _mm512_set1_epi32(16);
__m512i kV = _mm512_setzero_epi32();
for (int i=0;i<SIZE;i+=16)
{
// kV = _mm512_setzero_epi32();
kV = _mm512_loadunpacklo_epi32(kV,(void* )(&myArray[i]));
kV = _mm512_loadunpackhi_epi32(kV,(void* )(&myArray[i + 16]));
countMask = _mm512_cmpeq_epi32_mask(kV, searchVal);
match += _mm512_mask_reduce_add_epi32(countMask,flag);
}
return match;
I believe I have some how introduced extra cycles in this code and hence it is running slowly compared to the auto-vectorized code. Unlike SIMD128 which directly returns the value of the compare in 128bit register, SIMD512 returns the values in mask register which is adding more complexity to my code. Am I missing something here, there must be a way out to directly compare and keep count of successful search rather than using masks such as XOR ops.
Finally, please suggest me the ways to increase the performance of this code using intrinsics. I believe I can juice out more performance using intrinsics. This was at least true for SIMD128 where in using intrinsics allowed me to gain 25% performance.
I suggest the following optimizations:
Use prefetching. Your code performs very little computations, and almost surely bandwidth-bound. Xeon Phi has hardware prefetching only for L2 cache, so for optimal performance you need to insert prefetching instructions manually.
Use aligned read _mm512_load_epi32 as hinted by #PaulR. Use memalign function instead of malloc to guarantee that the array is really aligned on 64 bytes. And in case you will ever need misaligned instructions, use _mm512_undefined_epi32() as the source for the first misaligned load, as it breaks dependency on kV (in your current code) and lets the compiler do additional optimizations.
Unroll the array by 2 or use at least two threads to hide instruction latency.
Avoid using int variable as an index. unsigned int, size_t or ssize_t are better options.

_mm_shuffle_ps() equivalent for integer vectors (__m128i)?

The _mm_shuffle_ps() intrinsic allows one to interleave float inputs into low 2 floats and high 2 floats of the output.
For example:
R = _mm_shuffle_ps(L1, H1, _MM_SHUFFLE(3,2,3,2))
will result in:
R[0] = L1[2];
R[1] = L1[3];
R[2] = H1[2];
R[3] = H1[3]
I wanted to know if there was a similar intrinsic available for the integer data type? Something that took two __m128i variables and a mask for interleaving?
The _mm_shuffle_epi32() intrinsic, takes just one 128-bit vector instead of two.
Nope, there is no integer equivalent to this. So you have to either emulate it, or cheat.
One method is to use _mm_shuffle_epi32() on A and B. Then mask out the desired terms and OR them back together.
That tends to be messy and has around 5 instructions. (Or 3 if you use the SSE4.1 blend instructions.)
Here's the SSE4.1 solution with 3 instructions:
__m128i A = _mm_set_epi32(13,12,11,10);
__m128i B = _mm_set_epi32(23,22,21,20);
A = _mm_shuffle_epi32(A,2*1 + 3*4 + 2*16 + 3*64);
B = _mm_shuffle_epi32(B,2*1 + 3*4 + 2*16 + 3*64);
__m128i C = _mm_blend_epi16(A,B,0xf0);
The method that I prefer is to actually cheat - and floating-point shuffle like this:
__m128i Ai,Bi,Ci;
__m128 Af,Bf,Cf;
Af = _mm_castsi128_ps(Ai);
Bf = _mm_castsi128_ps(Bi);
Cf = _mm_shuffle_ps(Af,Bf,_MM_SHUFFLE(3,2,3,2));
Ci = _mm_castps_si128(Cf);
What this does is to convert the datatype to floating-point so that it can use the float-shuffle. Then convert it back.
Note that these "conversions" are bitwise conversions (aka reinterpretations). No conversion is actually done and they don't map to any instructions. In the assembly, there is no distinction between an integer or a floating-point SSE register. These cast intrinsics are just to get around the type-safety imposed by C/C++.
However, be aware that this approach incurs extra latency for moving data back-and-forth between the integer and floating-point SIMD execution units. So it will be more expensive than just the shuffle instruction.

Why is floor() so slow?

I wrote some code recently (ISO/ANSI C), and was surprised at the poor performance it achieved. Long story short, it turned out that the culprit was the floor() function. Not only it was slow, but it did not vectorize (with Intel compiler, aka ICL).
Here are some benchmarks for performing floor for all cells in a 2D matrix:
VC: 0.10
ICL: 0.20
Compare that to a simple cast:
VC: 0.04
ICL: 0.04
How can floor() be that much slower than a simple cast?! It does essentially the same thing (apart for negative numbers).
2nd question: Does someone know of a super-fast floor() implementation?
PS: Here is the loop that I was benchmarking:
void Floor(float *matA, int *intA, const int height, const int width, const int width_aligned)
{
float *rowA=NULL;
int *intRowA=NULL;
int row, col;
for(row=0 ; row<height ; ++row){
rowA = matA + row*width_aligned;
intRowA = intA + row*width_aligned;
#pragma ivdep
for(col=0 ; col<width; ++col){
/*intRowA[col] = floor(rowA[col]);*/
intRowA[col] = (int)(rowA[col]);
}
}
}
A couple of things make floor slower than a cast and prevent vectorization.
The most important one:
floor can modify the global state. If you pass a value that is too huge to be represented as an integer in float format, the errno variable gets set to EDOM. Special handling for NaNs is done as well. All this behavior is for applications that want to detect the overflow case and handle the situation somehow (don't ask me how).
Detecting these problematic conditions is not simple and makes up more than 90% of the execution time of floor. The actual rounding is cheap and could be inlined/vectorized. Also It's a lot of code, so inlining the whole floor-function would make your program run slower.
Some compilers have special compiler flags that allow the compiler to optimize away some of the rarely used c-standard rules. For example GCC can be told that you're not interested in errno at all. To do so pass -fno-math-errno or -ffast-math. ICC and VC may have similar compiler flags.
Btw - You can roll your own floor-function using simple casts. You just have to handle the negative and positive cases differently. That may be a lot faster if you don't need the special handling of overflows and NaNs.
If you are going to convert the result of the floor() operation to an int, and if you aren't worried about overflow, then the following code is much faster than (int)floor(x):
inline int int_floor(double x)
{
int i = (int)x; /* truncate */
return i - ( i > x ); /* convert trunc to floor */
}
Branch-less Floor and Ceiling (better utilize the pipiline) no error check
int f(double x)
{
return (int) x - (x < (int) x); // as dgobbi above, needs less than for floor
}
int c(double x)
{
return (int) x + (x > (int) x);
}
or using floor
int c(double x)
{
return -(f(-x));
}
The actual fastest implementation for a large array on modern x86 CPUs would be
change the MXCSR FP rounding mode to round towards -Infinity (aka floor). In C, this should be possible with fenv stuff, or _mm_getcsr / _mm_setcsr.
loop over the array doing _mm_cvtps_epi32 on SIMD vectors, converting 4 floats to 32-bit integer using the current rounding mode. (And storing the result vectors to the destination.)
cvtps2dq xmm0, [rdi] is a single micro-fused uop on any Intel or AMD CPU since K10 or Core 2. (https://agner.org/optimize/) Same for the 256-bit AVX version, with YMM vectors.
restore the current rounding mode to the normal IEEE default mode, using the original value of the MXCSR. (round-to-nearest, with even as a tiebreak)
This allows loading + converting + storing 1 SIMD vector of results per clock cycle, just as fast as with truncation. (SSE2 has a special FP->int conversion instruction for truncation, exactly because it's very commonly needed by C compilers. In the bad old days with x87, even (int)x required changing the x87 rounding mode to truncation and then back. cvttps2dq for packed float->int with truncation (note the extra t in the mnemonic). Or for scalar, going from XMM to integer registers, cvttss2si or cvttsd2si for scalar double to scalar integer.
With some loop unrolling and/or good optimization, this should be possible without bottlenecking on the front-end, just 1-per-clock store throughput assuming no cache-miss bottlenecks. (And on Intel before Skylake, also bottlenecked on 1-per-clock packed-conversion throughput.) i.e. 16, 32, or 64 bytes per cycle, using SSE2, AVX, or AVX512.
Without changing the current rounding mode, you need SSE4.1 roundps to round a float to the nearest integer float using your choice of rounding modes. Or you could use one of the tricks shows in other answers that work for floats with small enough magnitude to fit in a signed 32-bit integer, since that's your ultimate destination format anyway.)
(With the right compiler options, like -fno-math-errno, and the right -march or -msse4 options, compilers can inline floor using roundps, or the scalar and/or double-precision equivalent, e.g. roundsd xmm1, xmm0, 1, but this costs 2 uops and has 1 per 2 clock throughput on Haswell for scalar or vectors. Actually, gcc8.2 will inline roundsd for floor even without any fast-math options, as you can see on the Godbolt compiler explorer. But that's with -march=haswell. It's unfortunately not baseline for x86-64, so you need to enable it if your machine supports it.)
Yes, floor() is extremely slow on all platforms since it has to implement a lot of behaviour from the IEEE fp spec. You can't really use it in inner loops.
I sometimes use a macro to approximate floor():
#define PSEUDO_FLOOR( V ) ((V) >= 0 ? (int)(V) : (int)((V) - 1))
It does not behave exactly as floor(): for example, floor(-1) == -1 but PSEUDO_FLOOR(-1) == -2, but it's close enough for most uses.
An actually branchless version that requires a single conversion between floating point and integer domains would shift the value x to all positive or all negative range, then cast/truncate and shift it back.
long fast_floor(double x)
{
const unsigned long offset = ~(ULONG_MAX >> 1);
return (long)((unsigned long)(x + offset) - offset);
}
long fast_ceil(double x) {
const unsigned long offset = ~(ULONG_MAX >> 1);
return (long)((unsigned long)(x - offset) + offset );
}
As pointed in the comments, this implementation relies on the temporary value x +- offset not overflowing.
On 64-bit platforms, the original code using int64_t intermediate value will result in three instruction kernel, the same available for int32_t reduced range floor/ceil, where |x| < 0x40000000 --
inline int floor_x64(double x) {
return (int)((int64_t)(x + 0x80000000UL) - 0x80000000LL);
}
inline int floor_x86_reduced_range(double x) {
return (int)(x + 0x40000000) - 0x40000000;
}
They do not do the same thing. floor() is a function. Therefore, using it incurs a function call, allocating a stack frame, copying of parameters and retrieving the result.
Casting is not a function call, so it uses faster mechanisms (I believe that it may use registers to process the values).
Probably floor() is already optimized.
Can you squeeze more performance out of your algorithm? Maybe switching rows and columns may help? Can you cache common values? Are all your compiler's optimizations on? Can you switch an operating system? a compiler?
Jon Bentley's Programming Pearls has a great review of possible optimizations.
Fast double round
double round(double x)
{
return double((x>=0.5)?(int(x)+1):int(x));
}
Terminal log
test custom_1 8.3837
test native_1 18.4989
test custom_2 8.36333
test native_2 18.5001
test custom_3 8.37316
test native_3 18.5012
Test
void test(char* name, double (*f)(double))
{
int it = std::numeric_limits<int>::max();
clock_t begin = clock();
for(int i=0; i<it; i++)
{
f(double(i)/1000.0);
}
clock_t end = clock();
cout << "test " << name << " " << double(end - begin) / CLOCKS_PER_SEC << endl;
}
int main(int argc, char **argv)
{
test("custom_1",round);
test("native_1",std::round);
test("custom_2",round);
test("native_2",std::round);
test("custom_3",round);
test("native_3",std::round);
return 0;
}
Result
Type casting and using your brain is ~3 times faster than using native functions.

Resources