_mm_shuffle_ps() equivalent for integer vectors (__m128i)? - c

The _mm_shuffle_ps() intrinsic allows one to interleave float inputs into low 2 floats and high 2 floats of the output.
For example:
R = _mm_shuffle_ps(L1, H1, _MM_SHUFFLE(3,2,3,2))
will result in:
R[0] = L1[2];
R[1] = L1[3];
R[2] = H1[2];
R[3] = H1[3];
I wanted to know if there was a similar intrinsic available for the integer data type? Something that took two __m128i variables and a mask for interleaving?
The _mm_shuffle_epi32() intrinsic takes just one 128-bit vector instead of two.

Nope, there is no integer equivalent to this. So you have to either emulate it, or cheat.
One method is to use _mm_shuffle_epi32() on A and B. Then mask out the desired terms and OR them back together.
That tends to be messy and has around 5 instructions. (Or 3 if you use the SSE4.1 blend instructions.)
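A sketch of that mask-and-OR emulation (the select-mask constant and the temporary names are mine, not from any particular codebase):
__m128i lo = _mm_shuffle_epi32(A, _MM_SHUFFLE(3,2,3,2));  // {A2,A3,A2,A3}
__m128i hi = _mm_shuffle_epi32(B, _MM_SHUFFLE(3,2,3,2));  // {B2,B3,B2,B3}
__m128i keep_lo = _mm_set_epi32(0, 0, -1, -1);            // all-ones in elements 0,1
__m128i C = _mm_or_si128(_mm_and_si128(keep_lo, lo),      // A2,A3 in the low half
                         _mm_andnot_si128(keep_lo, hi));  // B2,B3 in the high half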
Here's the SSE4.1 solution with 3 instructions:
__m128i A = _mm_set_epi32(13,12,11,10);
__m128i B = _mm_set_epi32(23,22,21,20);
A = _mm_shuffle_epi32(A,2*1 + 3*4 + 2*16 + 3*64);   // control = 0xEE = _MM_SHUFFLE(3,2,3,2): A -> {A2,A3,A2,A3}
B = _mm_shuffle_epi32(B,2*1 + 3*4 + 2*16 + 3*64);   // B -> {B2,B3,B2,B3}
__m128i C = _mm_blend_epi16(A,B,0xf0);              // low 4 words from A, high 4 words from B -> {A2,A3,B2,B3}
The method that I prefer is to actually cheat - and floating-point shuffle like this:
__m128i Ai,Bi,Ci;
__m128 Af,Bf,Cf;
Af = _mm_castsi128_ps(Ai);
Bf = _mm_castsi128_ps(Bi);
Cf = _mm_shuffle_ps(Af,Bf,_MM_SHUFFLE(3,2,3,2));
Ci = _mm_castps_si128(Cf);
What this does is to convert the datatype to floating-point so that it can use the float-shuffle. Then convert it back.
Note that these "conversions" are bitwise conversions (aka reinterpretations). No conversion is actually done and they don't map to any instructions. In the assembly, there is no distinction between an integer or a floating-point SSE register. These cast intrinsics are just to get around the type-safety imposed by C/C++.
However, be aware that this approach incurs extra latency for moving data back-and-forth between the integer and floating-point SIMD execution units. So it will be more expensive than just the shuffle instruction.

Related

SSE interleave/merge/combine 2 vectors using a mask, per-element conditional move?

Essentially I am trying to implement a ternary-like operation on 2 SSE (__m128) vectors.
The mask is another __m128 vector obtained from _mm_cmplt_ps.
What I want to achieve is to select the element of vector a when the corresponding element of the mask is 0xffffffff, and the element of b when the mask's element is 0.
Example of the desired operation (in semi-pseudocode):
const __m128i mask = {0xffffffff, 0, 0xffffffff, 0}; // e.g. a compare result
const __m128 a = {1.0, 1.1, 1.2, 1.3};
const __m128 b = {2.0, 2.1, 2.2, 2.3};
const __m128 c = interleave(a, b, mask); // c contains {1.0, 2.1, 1.2, 2.3}
I am having trouble implementing this operation in SIMD (SSE) intrinsics.
My original idea was to mix a and b using moves and then shuffle the elements using the mask; however, _mm_shuffle_ps takes an immediate int mask made of four 2-bit indices, not an __m128 mask.
Another idea was to use something akin to a conditional move, but there does not seem to be a conditional move in SSE (or at least I did not manage to find it in Intel's guide).
How is this normally done in SSE?
That's called a "blend".
Intel's intrinsics guide groups blend instructions under the "swizzle" category, along with shuffles.
You're looking for SSE4.1 blendvps (intrinsic _mm_blendv_ps). The other element sizes are _mm_blendv_pd and _mm_blendv_epi8. These use the high bit of the corresponding element as the control, so you can use a float directly as the control (without needing a compare such as _mm_cmplt_ps) if its sign bit is the condition you care about.
__m128 mask = _mm_cmplt_ps(x, y); // compare result: all-ones (-1) / all-zero bit patterns per element
__m128 c = _mm_blendv_ps(b, a, mask); // copy the element from the 2nd operand (a) where the mask is set
Note that I reversed a, b to b, a because SSE blends take the element from the 2nd operand in positions where the mask was set, like a conditional-move which copies when the condition is true. If you name your constants / variables accordingly, you can write blend(a,b, mask) instead of having them backwards. Or give them meaningful names like ones and twos.
In other cases where your control operand is a constant, there's also _mm_blend_ps / pd / _mm_blend_epi16 (an 8-bit immediate operand can only control 8 separate elements, so 8x 2-byte.)
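For instance, with the question's example data and a compile-time-constant mask, a sketch of the immediate form would be:
// bit i of the immediate selects element i from the 2nd operand:
// 0xA = 0b1010 -> take elements 1 and 3 from b, elements 0 and 2 from a
__m128 c = _mm_blend_ps(a, b, 0xA);   // {1.0, 2.1, 1.2, 2.3}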
Performance
blendps xmm, xmm, imm8 is a single-uop instruction for any vector ALU port on Intel CPUs, as cheap as andps. (https://uops.info/). pblendw is also single-uop, but only runs on port 5 on Intel, competing with shuffles. AVX2 vpblendd blends with dword granularity, an integer version of vblendps, and with the same very good efficiency. (It's an integer-SIMD instruction; unlike shuffles, blends have extra bypass latency on Intel CPUs if you mix integer and FP SIMD.)
But variable blendvps is 2 uops on Intel before Skylake (and only for port 5). And the AVX version (vblendvps) is unfortunately still 2 uops on Intel (3 on Alder Lake-P, 4 on Alder Lake-E). Although the uops can at least run on any of 3 vector ALU ports.
The vblendvps version is funky in asm because it has 4 operands, not overwriting any of the input registers. (The non-AVX version overwrites one input, and uses XMM0 implicitly as the mask input.) Intel uops apparently can't handle 4 separate registers, only 3 for stuff like FMA, adc, and cmov. (And AVX-512 vpternlogd, which can do a bitwise blend as a single uop.)
AMD has fully efficient handling of vblendvps, single uop (except for YMM on Zen1) with 2/clock throughput.
Without SSE4.1, you can emulate with ANDN/AND/OR
(x&~mask) | (y&mask) is equivalent to _mm_blendv_ps(x,y,mask), except it's pure bitwise so all the bits of each mask element should match the top bit. (e.g. a compare result, or broadcast the top bit with _mm_srai_epi32(mask, 31).)
Compilers know this trick and will use it when auto-vectorizing scalar code if you compile without any arch options like -march=haswell or whatever. (SSE4.1 was new in 2nd-gen Core 2, so it's increasingly widespread but not universal.)
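A sketch of that fallback with intrinsics (same x, y, mask as the expression above):
__m128 blended = _mm_or_ps(_mm_andnot_ps(mask, x),  // x where the mask is clear
                           _mm_and_ps(mask, y));    // y where the mask is set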
For constant / loop-invariant a^b without SSE4.1
x ^ ((x ^ y) & mask) saves one operation if you can reuse x ^ y (suggested in comments by Aki). Otherwise this is worse: longer critical-path latency and no instruction-level parallelism.
Without AVX non-destructive 3-operand instructions, this way would need a movaps xmm,xmm register-copy to save b, but it can choose to destroy the mask instead of a. The AND/ANDN/OR way would normally destroy its 2nd operand, the one you use with y&mask, and destroy the mask with ANDN (~mask & x).
With AVX, vblendvps is guaranteed available. Although if you're targeting Intel (especially Haswell) and don't care about AMD, you might still choose an AND/XOR if a^b can be pre-computed.
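A sketch of that xor-blend, assuming x ^ y can be hoisted out of the loop (variable names follow the expression above):
__m128 x_xor_y = _mm_xor_ps(x, y);                         // loop-invariant
__m128 blended = _mm_xor_ps(x, _mm_and_ps(x_xor_y, mask)); // y where mask set, else x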
Blending with 0: just AND[N]
(Applies to integer and FP; the bit-pattern for 0.0f and 0.0 is all-zeros, same as integer 0.)
You don't need to copy a zero from anywhere, just x & mask, or x & ~mask.
(The (x & ~mask) | (y & mask) expression reduces to this for x=0 or y=0; that term becomes zero, and z|=0 is a no-op.)
For example, to implement x = mask ? x+y : x, which would put the latency of an add and a blend on the critical path, you simplify to x += (y or zero according to mask), i.e. x += y & mask. Or to do the opposite, x += ~mask & y using _mm_andnot_ps(mask, vy).
This has an ADD and an AND operation (so already cheaper than blend on some CPUs, and you don't need a 0.0 source operand in another register). Also, the dependency chain through x now only includes the += operation, if you were doing this in a loop with loop-carried x but independent y & mask. e.g. summing only matching elements of an array, sum += A[i]>=thresh ? A[i] : 0.0f;
For an example of an extra slowdown due to lengthening the critical path unnecessarily, see gcc optimization flag -O3 makes code slower than -O2 where GCC's scalar asm using cmov has that flaw, doing cmov as part of the loop-carried dependency chain instead of to prepare a 0 or arr[i] input for it.
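A minimal sketch of that masked sum, assuming n is a multiple of 4; the function name and parameters are placeholders, not from any particular codebase:
#include <stddef.h>
#include <xmmintrin.h>

float sum_ge_thresh(const float *A, size_t n, float thresh)
{
    __m128 vsum = _mm_setzero_ps();
    __m128 vthresh = _mm_set1_ps(thresh);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(A + i);
        __m128 keep = _mm_cmpge_ps(v, vthresh);       /* all-ones where A[i] >= thresh */
        vsum = _mm_add_ps(vsum, _mm_and_ps(v, keep)); /* add A[i] or 0.0f              */
    }
    /* horizontal sum of the 4 lanes */
    float lanes[4];
    _mm_storeu_ps(lanes, vsum);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}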
Clamping to a MIN or MAX
If you want something like a < upper ? a : upper, you can do that clamping in one instruction with _mm_min_ps instead of cmpps / blendvps. (Similarly _mm_max_ps, and _mm_min_pd / _mm_max_pd.)
See What is the instruction that gives branchless FP min and max on x86? for details on their exact semantics, including a longstanding (but recently fixed) GCC bug where the FP intrinsics didn't provide the expected strict-FP semantics of which operand would be the one to keep if one was NaN.
Or for integer, SSE2 is highly non-orthogonal (signed min/max for int16_t, unsigned min/max for uint8_t). Similar for saturating pack instructions. SSE4.1 fills in the missing operand-size and signedness combinations.
Signed: SSE2 _mm_max_epi16 (and corresponding mins for all of these)
SSE4.1 _mm_max_epi32 / _mm_max_epi8; AVX-512 _mm_max_epi64
Unsigned: SSE2 _mm_max_epu8
SSE4.1 _mm_max_epu16 / _mm_max_epu32; AVX-512 _mm_max_epu64
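As a sketch, clamping a to an arbitrary [lower, upper] range (the vector bounds lower and upper are assumptions) is just two instructions, ignoring the NaN subtleties discussed above; the integer variants listed above work the same way:
__m128 clamped = _mm_min_ps(_mm_max_ps(a, lower), upper);  // lower <= result <= upper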
AVX-512 makes masking/blending a first-class operation
AVX-512 compares into a mask register, k0..k7 (intrinsic types __mmask16 and so on). Merge-masking or zero-masking can be part of most ALU instructions. There is also a dedicated blend instruction that blends according to a mask.
I won't go into the details here, suffice it to say if you have a lot of conditional stuff to do, AVX-512 is great (even if you only use 256-bit vectors to avoid the turbo clock speed penalties and so on.) And you'll want to read up on the details for AVX-512 specifically.
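A minimal sketch of compare-into-mask plus blend, assuming AVX-512F + AVX-512VL (e.g. -march=skylake-avx512); x, y, a, b are the same hypothetical vectors as above:
#include <immintrin.h>
__mmask8 k = _mm_cmp_ps_mask(x, y, _CMP_LT_OQ);  // per-element x < y, result in a k register
__m128 c  = _mm_mask_blend_ps(k, b, a);          // a where the mask bit is set, else b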
As suggested by Peter Cordes in the comments to the question, the blendvps instruction (_mm_blendv_* intrinsics) is used to perform the interleave/conditional-move operation.
It should be noted that the _mm_blendv_* family selects the left-hand element where the mask element is 0 rather than 0xffffffff, so a and b should be passed in reverse order.
The implementation would then look like this:
const __m128i mask = _mm_setr_epi32(-1, 0, -1, 0); // e.g. a compare result (all-ones / all-zero elements)
const __m128 m_ps = _mm_castsi128_ps(mask);
const __m128 a = _mm_setr_ps(1.0f, 1.1f, 1.2f, 1.3f);
const __m128 b = _mm_setr_ps(2.0f, 2.1f, 2.2f, 2.3f);
#ifdef __SSE4_1__ // _mm_blendv_ps requires SSE4.1
const __m128 c = _mm_blendv_ps(b, a, m_ps);
#else
const __m128 c = _mm_or_ps(_mm_and_ps(m_ps, a), _mm_andnot_ps(m_ps, b));
#endif
// c contains {1.0, 2.1, 1.2, 2.3}

ARM Neon in C: How to combine different 128bit data types while using intrinsics?

TL;DR
For ARM intrinsics, how do you feed a 128-bit variable of type uint8x16_t into a function expecting uint16x8_t?
EXTENDED VERSION
Context: I have a greyscale image, 1 byte per pixel. I want to downscale it by a factor of 2. For each 2x2 input box, I want to take the minimum pixel. In plain C, the code looks like this:
for (int y = 0; y < rows; y += 2) {
    uint8_t* p_out = outBuffer + (y / 2) * outStride;
    uint8_t* p_in = inBuffer + y * inStride;
    for (int x = 0; x < cols; x += 2) {
        *p_out = min(min(p_in[0], p_in[1]), min(p_in[inStride], p_in[inStride + 1]));
        p_out++;
        p_in += 2;
    }
}
Where both rows and cols are multiples of 2. I call "stride" the step in bytes it takes to go from one pixel to the pixel immediately below it in the image.
Now I want to vectorize this. The idea is:
1. take 2 consecutive rows of pixels
2. load 16 bytes in a from the top row, and load the 16 bytes immediately below in b
3. compute the minimum byte by byte between a and b. Store in a.
4. create a copy of a shifting it right by 1 byte (8 bits). Store it in b.
5. compute the minimum byte by byte between a and b. Store in a.
6. store every second byte of a in the output image (discards half of the bytes)
I want to write this using Neon intrinsics. The good news is, for each step there exists an intrinsic that matches it.
For example, at point 3 one can use (from here):
uint8x16_t vminq_u8(uint8x16_t a, uint8x16_t b);
And at point 4 one can use one of the following using a shift of 8 bits (from here):
uint16x8_t vrshrq_n_u16(uint16x8_t a, __constrange(1,16) int b);
uint32x4_t vrshrq_n_u32(uint32x4_t a, __constrange(1,32) int b);
uint64x2_t vrshrq_n_u64(uint64x2_t a, __constrange(1,64) int b);
That's because I do not care what happens to bytes 1,3,5,7,9,11,13,15, since they will be discarded from the final result anyway. (The correctness of this has been verified and it's not the point of the question.)
HOWEVER, the output of vminq_u8 is of type uint8x16_t, and it is NOT compatible with the shift intrinsics that I would like to use. In C++ I addressed the problem with this templated data structure, while I have been told that the problem cannot be reliably addressed using a union (Edit: although that answer refers to C++, and in fact in C type punning IS allowed), nor by using pointer casts, because this would break the strict aliasing rule.
What is the way to combine different data types while using ARM Neon intrinsics?
For this kind of problem, arm_neon.h provides the vreinterpret{q}_dsttype_srctype casting operator.
In some situations, you might want to treat a vector as having a different type, without changing its value. A set of intrinsics is provided to perform this type of conversion.
So, assuming a and b are declared as:
uint8x16_t a, b;
Your point 4 can be written as(*):
b = vreinterpretq_u8_u16(vrshrq_n_u16(vreinterpretq_u16_u8(a), 8) );
However, note that unfortunately this does not address data types using an array of vector types, see ARM Neon: How to convert from uint8x16_t to uint8x8x2_t?
(*) It should be said, this is much more cumbersome than the equivalent (in this specific context) SSE code, as SSE has only one 128-bit integer data type (namely __m128i):
__m128i b = _mm_srli_si128(a,1);
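A sketch of one row pair of the 2x2-min downscale using the vreinterpretq casts described above. It assumes cols is a multiple of 16 and uses the plain shift vshrq_n_u16 rather than the rounding vrshrq_n_u16; the function name and signature are mine, while the pointer and stride names follow the scalar code:
#include <stdint.h>
#include <arm_neon.h>

static void downscale_row_pair(const uint8_t *p_in, uint8_t *p_out,
                               int inStride, int cols)
{
    for (int x = 0; x < cols; x += 16) {
        uint8x16_t a = vld1q_u8(p_in + x);             /* top row      */
        uint8x16_t b = vld1q_u8(p_in + x + inStride);  /* row below    */
        a = vminq_u8(a, b);                            /* vertical min */
        /* shift each 16-bit lane right by 8 bits: the odd byte of each pair
         * lands on top of the even byte */
        b = vreinterpretq_u8_u16(vshrq_n_u16(vreinterpretq_u16_u8(a), 8));
        a = vminq_u8(a, b);                            /* horizontal min */
        /* keep every second byte: narrowing takes the low byte of each 16-bit lane */
        vst1_u8(p_out + x / 2, vmovn_u16(vreinterpretq_u16_u8(a)));
    }
}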

How to add all int32 elements in a lane using NEON intrinsics

Here is my code for adding all int16x4 elements in a lane:
#include <arm_neon.h>
...
int16x4_t acc = vdup_n_s16(1);
int32x2_t acc1;
int64x1_t acc2;
int32_t sum;
acc1 = vpaddl_s16(acc);
acc2 = vpaddl_s32(acc1);
sum = (int)vget_lane_s64(acc2, 0);
printf("%d\n", sum);// 4
And I tried to add all int32x4 elements in a lane, but my code looks inefficient:
#include <arm_neon.h>
...
int32x4_t accl = vdupq_n_s32(1);
int64x2_t accl_1;
int64_t temp;
int64_t temp2;
int32_t sum1;
accl_1 = vpaddlq_s32(accl);
temp = vgetq_lane_s64(accl_1, 0);
temp2 = vgetq_lane_s64(accl_1, 1);
sum1 = temp + temp2;
printf("%d\n", sum1); // 4
Is there a simpler and clearer way to do this? I'd like the generated LLVM/assembly code to be simple and clear after compiling it, and I also want the final type of the sum to be 32 bits.
I used the ELLCC cross-compiler, which is based on the LLVM compiler infrastructure, to compile it.
I saw the similar question (Add all elements in a lane) on Stack Overflow, but the addv intrinsic doesn't work on my target.
If you only want a 32-bit result, presumably either intermediate overflow is unlikely or you simply don't care about it, in which case you could just stay 32-bit all the way:
int32x2_t temp = vadd_s32(vget_high_s32(accl), vget_low_s32(accl));
int32x2_t temp2 = vpadd_s32(temp, temp);
int32_t sum1 = vget_lane_s32(temp2, 0);
However, using 64-bit accumulation isn't actually any more hassle, and can also be done without dropping out of NEON - it's just a different order of operations:
int64x2_t temp = vpaddlq_s32(accl);
int64x1_t temp2 = vadd_s64(vget_high_s64(temp), vget_low_s64(temp));
int32_t sum1 = vget_lane_s32(temp2, 0);
Either of those boils down to just 3 NEON instructions and no scalar arithmetic. The crucial trick on 32-bit ARM is that a pairwise add of two halves of a Q register is simply a normal add of two D registers - that doesn't apply to AArch64 where the SIMD register layout is different, but then AArch64 has the aforementioned horizontal addv anyway.
Now, how horrible any of that looks in LLVM IR I don't know - I suppose it depends on how it treats vector types and operations internally - but in terms of the final ARM machine code both could be considered optimal.
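For completeness, a sketch of the AArch64 route mentioned above, only valid when compiling for AArch64; accl is the int32x4_t from the question:
// AArch64 only: addv does the whole horizontal add in one instruction.
int32_t sum1 = vaddvq_s32(accl);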

Fast float to int conversion (truncate)

I'm looking for a way to truncate a float into an int in a fast and portable (IEEE 754) way. The reason is that 50% of the time in this function is spent in the cast:
float fm_sinf(float x) {
    const float a = 0.00735246819687011731341356165096815f;
    const float b = -0.16528911397014738207016302002888890f;
    const float c = 0.99969198629596757779830113868360584f;
    float r, x2;
    int k;
    /* bring x in range */
    k = (int) (F_1_PI * x + copysignf(0.5f, x)); /* <-- 50% of time is spent in cast */
    x -= k * F_PI;
    /* if x is in an odd pi count we must flip */
    r = 1 - 2 * (k & 1); /* trick for r = (k % 2) == 0 ? 1 : -1; */
    x2 = x * x;
    return r * x * (c + x2*(b + a*x2));
}
The slowness of float->int casts mainly occurs when using x87 FPU instructions on x86. To do the truncation, the rounding mode in the FPU control word needs to be changed to round-to-zero and back, which tends to be very slow.
When using SSE instead of x87 instructions, a truncation is available without control word changes. You can do this using compiler options (like -mfpmath=sse -msse -msse2 in GCC) or by compiling the code as 64-bit.
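A sketch of that truncating SSE conversion (cvttss2si) expressed directly as an intrinsic, independent of compiler flags; f is a hypothetical float:
#include <xmmintrin.h>
int k = _mm_cvttss_si32(_mm_set_ss(f));  /* truncate toward zero, no control-word change */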
The SSE3 instruction set has the FISTTP instruction to convert to integer with truncation without changing the control word. A compiler may generate this instruction if instructed to assume SSE3.
Alternatively, the C99 lrint() function will convert to integer with the current rounding mode (round-to-nearest unless you changed it). You can use this if you remove the copysignf term. Unfortunately, this function is still not ubiquitous after more than ten years.
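A sketch of the lrint() route inside fm_sinf(): with the default round-to-nearest mode, lrintf() does the rounding itself, so the copysignf(0.5f, x) term is dropped (F_1_PI is the constant from the original code):
#include <math.h>
k = (int) lrintf(F_1_PI * x);  /* round to nearest using the current rounding mode */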
I found a fast truncate method by Sree Kotay which provides exactly the optimization that I needed.
To be portable you would have to add some directives and learn a couple of assembly languages, but you could theoretically use some inline assembly to move portions of the floating-point register into eax/rax or ebx/rbx and convert what you need by hand. The floating-point specification is a pain to work with, but I am pretty certain that doing it in assembly will be much faster, as your needs are very specific while the system method is more generic and therefore less efficient for your purpose.
You could skip the conversion to int altogether by using frexpf to get the mantissa and exponent, and inspect the raw mantissa (use a union) at the appropriate bit position (calculated using the exponent) to determine (the quadrant dependent) r.

How to map a long integer number to a N-dimensional vector of smaller integers (and fast inverse)?

Given a N-dimensional vector of small integers is there any simple way to map it with one-to-one correspondence to a large integer number?
Say, we have N=3 vector space. Can we represent a vector X=[(int16)x1,(int16)x2,(int16)x3] using an integer (int48)y? The obvious answer is "Yes, we can". But the question is: "What is the fastest way to do this and its inverse operation?"
Will this new 1-dimensional space possess some very special useful properties?
For the above example, if each component were a full 32-bit integer you would have 3 * 32 = 96 bits of information, so without any a priori knowledge you would need 96 bits for the equivalent long integer.
However, if you know that your x1, x2, x3 values will always fit within, say, 16 bits each, then you can pack them all into a 48-bit integer.
In either case the technique is very simple: you just use shift, mask and bitwise OR operations to pack/unpack the values.
Just to make this concrete, if you have a 3-dimensional vector of 8-bit numbers, like this:
uint8_t vector[3] = { 1, 2, 3 };
then you can join them into a single (24-bit number) like so:
uint32_t all = (vector[0] << 16) | (vector[1] << 8) | vector[2];
This number would, if printed using this statement:
printf("the vector was packed into %06x", (unsigned int) all);
produce the output
the vector was packed into 010203
The reverse operation would look like this:
uint8_t v2[3];
v2[0] = (all >> 16) & 0xff;
v2[1] = (all >> 8) & 0xff;
v2[2] = all & 0xff;
Of course this all depends on the size of the individual numbers in the vector and the length of the vector together not exceeding the size of an available integer type, otherwise you can't represent the "packed" vector as a single number.
If you have sets Si, i=1..n of size Ci = |Si|, then the Cartesian product set S = S1 x S2 x ... x Sn has size C = C1 * C2 * ... * Cn.
This motivates an obvious way to do the packing one-to-one. If you have elements e1,...,en from each set, each in the range 0 to Ci-1, then you give the element e=(e1,...,en) the value e1 + C1*(e2 + C2*(e3 + C3*(... + C(n-1)*en ...))).
You can do any permutation of this packing if you feel like it, but unless the values are perfectly correlated, the size of the full set must be the product of the sizes of the component sets.
In the particular case of three 32 bit integers, if they can take on any value, you should treat them as one 96 bit integer.
If you particularly want to, you can map small values to small values through any number of means (e.g. filling out spheres with the L1 norm), but you have to specify what properties you want to have.
(For example, one can map (n,m) to (max(n,m)-1)^2 + k where k=n if n<=m and k=n+m if n>m; you can draw this as a picture of filling in a square like so:
1 2 5 | draw along the edge of the square this way
4 3 6 v
8 9 7
if you start counting from 1 and only worry about positive values; for integers, you can spiral around the origin.)
I'm writing this without having time to check details, but I suspect the best way is to represent your long integer via modular arithmetic, using k different integers which are mutually prime. The original integer can then be reconstructed using the Chinese remainder theorem. Sorry this is a bit sketchy, but hope it helps.
To expand on Rex Kerr's generalised form, in C you can pack the numbers like so:
X = e[n];
X *= MAX_E[n-1] + 1;
X += e[n-1];
/* ... */
X *= MAX_E[0] + 1;
X += e[0];
And unpack them with:
e[0] = X % (MAX_E[0] + 1);
X /= (MAX_E[0] + 1);
e[1] = X % (MAX_E[1] + 1);
X /= (MAX_E[1] + 1);
/* ... */
e[n] = X;
(Where MAX_E[n] is the greatest value that e[n] can have). Note that these maximum values are likely to be constants, and may be the same for every e, which will simplify things a little.
The shifting / masking implementations given in the other answers are a special case of this, for cases where the MAX_E + 1 values are powers of 2 (and thus the multiplication and division can be done with a shift, the addition with a bitwise OR and the modulus with a bitwise AND).
There are some totally non-portable ways to make this really fast using packed unions and direct memory access, though it's questionable whether you really need this kind of speed. Methods using shifts and masks should be fast enough for most purposes. If not, consider using specialized processors like GPUs, for which vector support is optimized (parallel).
This naive storage does not possess any useful property that I can foresee, except that you can perform some computations (add, sub, bitwise logical operators) on the three coordinates at once, as long as you use positive integers only and you don't overflow for add and sub.
You'd better be quite sure you won't overflow (or won't go negative for sub) or the vector will become garbage.
#include <stdint.h> // for uint8_t
long x;
uint8_t *p = (uint8_t *) &x;
or
union X {
    long L;
    uint8_t A[sizeof(long)/sizeof(uint8_t)];
};
works if you don't care about the endianness. In my experience compilers generate better code with the union because it doesn't set off their "you took the address of this, so I must keep it in RAM" rules as quickly. These rules get set off if you try to index the array with something the compiler can't optimize away.
If you do care about the endian then you need to mask and shift.
I think what you want can be solved using multi-dimensional space filling curves. The link gives a lot of references on this, which in turn give different methods and insights. Here's a specific example of an invertible mapping. It works for any dimension N.
As for useful properties, these mappings are related to Gray codes.
Hard to say whether this was what you were looking for, or whether the "pack 3 16-bit ints into a 48-bit int" does the trick for you.
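For illustration, a sketch of one concrete invertible mapping of this kind, a Morton (Z-order) code that interleaves the bits of three 16-bit coordinates into one 48-bit value; this is a plain bit-by-bit version for clarity, not the fastest formulation, and not necessarily the exact mapping used by the linked references:
#include <stdint.h>

static uint64_t morton3_encode(uint16_t x, uint16_t y, uint16_t z)
{
    uint64_t code = 0;
    for (int i = 0; i < 16; i++) {
        /* bit i of x, y, z goes to bit 3*i, 3*i+1, 3*i+2 of the code */
        code |= ((uint64_t)((x >> i) & 1)) << (3 * i);
        code |= ((uint64_t)((y >> i) & 1)) << (3 * i + 1);
        code |= ((uint64_t)((z >> i) & 1)) << (3 * i + 2);
    }
    return code;
}

static void morton3_decode(uint64_t code, uint16_t *x, uint16_t *y, uint16_t *z)
{
    *x = *y = *z = 0;
    for (int i = 0; i < 16; i++) {
        *x |= (uint16_t)(((code >> (3 * i))     & 1) << i);
        *y |= (uint16_t)(((code >> (3 * i + 1)) & 1) << i);
        *z |= (uint16_t)(((code >> (3 * i + 2)) & 1) << i);
    }
}
The useful property of this particular mapping is locality: points that are close in 3D space tend to get nearby codes, which is what makes Z-order curves handy for spatial indexing.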
