128-bit rotation using ARM Neon intrinsics

128-bit rotation using ARM Neon intrinsics - c

I'm trying to optimize my code using Neon intrinsics. I have a 24-bit rotation over a 128-bit array (8 each uint16_t).
Here is my c code:
uint16_t rotated[8];
uint16_t temp[8];
uint16_t j;
for(j = 0; j < 8; j++)
{
//Rotation <<< 24 over 128 bits (x << shift) | (x >> (16 - shift)
rotated[j] = ((temp[(j+1) % 8] << 8) & 0xffff) | ((temp[(j+2) % 8] >> 8) & 0x00ff);
}
I've checked the gcc documentation about Neon Intrinsics and it doesn't have instruction for vector rotations. Moreover, I've tried to do this using vshlq_n_u16(temp, 8) but all the bits shifted outside a uint16_t word are lost.
How to achieve this using neon intrinsics ? By the way is there a better documentation about GCC Neon Intrinsics ?

After some reading on Arm Community Blogs, I've found this :
VEXT: Extract
VEXT extracts a new vector of bytes from a pair of existing vectors. The bytes in the new vector are from the top of the first operand, and the bottom of the second operand. This allows you to produce a new vector containing elements that straddle a pair of existing vectors. VEXT can be used to implement a moving window on data from two vectors, useful in FIR filters. For permutation, it can also be used to simulate a byte-wise rotate operation, when using the same vector for both input operands.
The following Neon GCC Intrinsic does the same as the assembly provided in the picture :
uint16x8_t vextq_u16 (uint16x8_t, uint16x8_t, const int)
So the the 24bit rotation over a full 128bit vector (not over each element) could be done by the following:
uint16x8_t input;
uint16x8_t t0;
uint16x8_t t1;
uint16x8_t rotated;
t0 = vextq_u16(input, input, 1);
t0 = vshlq_n_u16(t0, 8);
t1 = vextq_u16(input, input, 2);
t1 = vshrq_n_u16(t1, 8);
rotated = vorrq_u16(t0, t1);

Use vext.8 to concat a vector with itself and give you the 16-byte window that you want (in this case offset by 3 bytes).
Doing this with intrinsics requires casting to keep the compiler happy, but it's still a single instruction:
#include <arm_neon.h>
uint16x8_t byterotate3(uint16x8_t input) {
uint8x16_t tmp = vreinterpretq_u8_u16(input);
uint8x16_t rotated = vextq_u8(tmp, tmp, 16-3);
return vreinterpretq_u16_u8(rotated);
}
g++5.4 -O3 -march=armv7-a -mfloat-abi=hard -mfpu=neon (on Godbolt) compiles it to this:
byterotate3(__simd128_uint16_t):
vext.8 q0, q0, q0, #13
bx lr
A count of 16-3 means we left-rotate by 3 bytes. (It means we take 13 bytes from the left vector and 3 bytes from the right vector, so it's also a right-rotate by 13).
Related: x86 also has instruction that takes a sliding window into the concatenation of two registers: palignr (added in SSSE3).
Maybe I'm missing something about NEON, but I don't understand why the OP's self-answer is using vext.16 (vextq_u16), which has 16-bit granularity. It's not even a different instruction, just an alias for vext.8 which makes it impossible to use an odd-numbered count, requiring extra instructions. The manual for vext.8 says:
VEXT pseudo-instruction
You can specify a datatype of 16, 32, or 64 instead of 8. In this
case, #imm refers to halfwords, words, or doublewords instead of
referring to bytes, and the permitted ranges are correspondingly
reduced.

I'm not 100% sure but I don't think NEON has rotate instructions.
You can compose the rotation operation you require with a left shift, a right shit and an or, e.g.:
uint8_t ror(uint8_t in, int rotation)
{
return (in >> rotation) | (in << (8-rotation));
}
Just do the same with the Neon intrinsics for left shift, right shit and or.
uint16x8_t temp;
uint8_t rot;
uint16x8_t rotated = vorrq_u16 ( vshlq_n_u16(temp, rot) , vshrq_n_u16(temp, 16 - rot) );
See http://en.wikipedia.org/wiki/Circular_shift "Implementing circular shifts."
This will rotate the values inside the lanes. If you want to rotate the lanes themselves use VEXT as described in the other answer.

Related

How can I generate a 256 bit mask

I have an array of uint64_t[4], and I need to generate a mask,
such that the array, if it were a 256-bit integer, equals
(1 << w) - 1, where w goes from 1 to 256.
The best thing I have come up with is branchless, but it takes MANY instructions. It is in Zig because Clang doesn't seem to expose llvm's saturating subtraction. http://localhost:10240/z/g8h1rV
Is there a better way to do this?
var mask: [4]u64 = undefined;
for (mask) |_, i|
mask[i] = 0xffffffffffffffff;
mask[3] ^= ((u64(1) << #intCast(u6, (inner % 64) + 1)) - 1) << #intCast(u6, 64 - (inner % 64));
mask[2] ^= ((u64(1) << #intCast(u6, (#satSub(u32, inner, 64) % 64) + 1)) - 1) << #intCast(u6, 64 - (inner % 64));
mask[1] ^= ((u64(1) << #intCast(u6, (#satSub(u32, inner, 128) % 64) + 1)) - 1) << #intCast(u6, 64 - (inner % 64));
mask[0] ^= ((u64(1) << #intCast(u6, (#satSub(u32, inner, 192) % 64) + 1)) - 1) << #intCast(u6, 64 - (inner % 64));

Are you targeting x86-64 with AVX2 for 256-bit vectors? I thought that was an interesting case to answer for.
If so, you can do this in a few instructions using saturating subtraction and a variable count shift.
x86 SIMD shifts like vpsrlvq saturate the shift count, shifting all the bits out when the count is >= element width. Unlike integer shifts the shift count is masked (and thus wraps around).
For the lowest u64 element, starting with all-ones we need to leave it unmodified for bitpos >= 64. Or for smaller bit positions, right-shift it by 64-bitpos. Unsigned saturating subtraction looks like the way to go here, as you observed, to create a shift count of 0 for larger bitpos. But x86 only has SIMD saturating subtraction, and only for byte or word elements. But if we don't care about bitpos > 256, that's fine we can use 16-bit elements at the bottom of each u64, and let a 0-0 happen in the rest of the u64.
Your code looks pretty overcomplicated, creating (1<<n) - 1 and XORing. I think it's a lot easier to just use a variable-count shift on the 0xFFFF...FF elements directly.
I don't know Zig, so do whatever you have to to get it to emit asm like this. Hopefully this is useful because you tagged this assembly; should be easy to translate to intrinsics for C, or Zig if it has them.
default rel
section .rodata
shift_offsets: dw 64, 128, 192, 256 ; 16-bit elements, to be loaded with zero-extension to 64
section .text
pos_to_mask256:
vpmovzxwq ymm2, [shift_offsets] ; _mm256_set1_epi64x(256, 192, 128, 64)
vpcmpeqd ymm1, ymm1,ymm1 ; ymm1 = all-ones
; set up vector constants, can be hoisted
vmovd xmm0, edi
vpbroadcastq ymm0, xmm0 ; ymm0 = _mm256_set1_epi64(bitpos)
vpsubusw ymm0, ymm2, ymm0 ; ymm0 = {256,192,128,64}-bitpos with unsigned saturation
vpsrlvq ymm0, ymm1, ymm0 ; mask[i] >>= count, where counts >= 64 create 0s.
ret
If the input integer starts in memory, you can of course efficiently broadcast-load it into a ymm register directly.
The shift-offsets vector can of course be hoisted out of a loop, as can the all-ones.
With input = 77, the high 2 elements are zeroed by shifts of 256-77=179, and 192-77=115 bits. Tested with NASM + GDB for EDI=77, and the result is
(gdb) p /x $ymm0.v4_int64
{0xffffffffffffffff, 0x1fff, 0x0, 0x0}
GDB prints low element first, opposite of Intel notation / diagrams. This vector is actually 0, 0, 0x1fff, 0xffffffffffffffff, i.e. 64+13 = 77 one bits, and the rest all zeros. Other test cases
edi=0: mask = all-zero
edi=1: mask = 1
... : mask = edi one bits at the bottom, then zeros
edi=255: mask = all ones except for the top bit of the top element
edi=256: mask = all ones
edi>256: mask = all ones. (unsigned subtraction saturates to 0 everywhere.)
You need AVX2 for the variable-count shifts. psubusb/w is SSE2, so you could consider doing that part with SIMD and then go back to scalar integer for the shifts, or maybe just use SSE2 shifts for one element at a time. Like psrlq xmm1, xmm0 which takes the low 64 bits of xmm0 as the shift count for all elements of xmm1.
Most ISAs don't have saturating scalar subtraction. Some ARM CPUs do for scalar integer, I think, but x86 doesn't. IDK what you're using.
On x86 (and many other ISAs) you have 2 problems:
keep all-ones for low elements (either modify the shift result, or saturate shift count to 0)
produce 0 for high elements above the one containing the top bit of the mask. x86 scalar shifts can't do this at all, so you might feed the shift an input of 0 for that case. Maybe using cmov to create it based on flags set by sub for 192-w or something.
count = 192-w;
shift_input = count<0 ? 0 : ~0ULL;
shift_input >>= count & 63; // mask to avoid UB in C. Optimizes away on x86 where shr does this anyway.
Hmm, this doesn't handle saturating the subtraction to 0 to keep the all-ones, though.
If tuning for ISAs other than x86, maybe look at some other options. Or maybe there's something better on x86 as well. Creating the all-ones or all-zeros with sar reg,63 is an interesting option (broadcast the sign bit), but we actually need all-ones when 192-count has sign bit = 0.

Here's some Zig code that compiles and runs:
const std = #import("std");
noinline fn thing(x: u256) bool {
return x > 0xffffffffffffffff;
}
pub fn main() anyerror!void {
var num: u256 = 0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff;
while (thing(num)) {
num /= 2;
std.debug.print(".", .{});
}
std.debug.print("done\n", .{});
}
Zig master generates relatively clean x86 assembler from that.

SIMD (AVX2) mask store and pack [duplicate]

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?
I've seen in SSE where it was done like this:
(From:https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf)
__m128i LeftPack_SSSE3(__m128 mask, __m128 val)
{
// Move 4 sign bits of mask to 4-bit integer value.
int mask = _mm_movemask_ps(mask);
// Select shuffle control data
__m128i shuf_ctrl = _mm_load_si128(&shufmasks[mask]);
// Permute to move valid values to front of SIMD register
__m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl);
return packed;
}
This seems fine for SSE which is 4 wide, and thus only needs a 16 entry LUT, but for AVX which is 8 wide, the LUT becomes quite large(256 entries, each 32 bytes, or 8k).
I'm surprised that AVX doesn't appear to have an instruction for simplifying this process, such as a masked store with packing.
I think with some bit shuffling to count the # of sign bits set to the left you could generate the necessary permutation table, and then call _mm256_permutevar8x32_ps. But this is also quite a few instructions I think..
Does anyone know of any tricks to do this with AVX2? Or what is the most efficient method?
Here is an illustration of the Left Packing Problem from the above document:
Thanks

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)
We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.
We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.
Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).
AMD before Zen2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)
For integer vectors with 32-bit or wider elements: Either 1) _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)).
Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (Not tested). Either way, use the shuffle mask with VPERMD instead of VPERMPS.
For 64-bit integer or double elements, everything still Just Works; The compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)
For 16-bit elements, you might be able to adapt this with 128-bit vectors.
For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.
The algorithm:
Start with a constant of packed 3 bit indices, with each position holding its own index. i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide. 0b111'110'101'...'010'001'000.
Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000'...'111'000'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000'...'010'000. (i.e. [ ... 2 0 ])
See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.
Now we're in the same boat as the compressed-LUT: unpack up to 8 packed indices.
By the time you put all the pieces together, there are three total pext/pdeps. I worked backwards from what I wanted, so it's probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)
We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.
See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.
#include <stdint.h>
#include <immintrin.h>
// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101); // unpack each bit to a byte
expanded_mask *= 0xFF; // mask |= mask<<1 | mask<<2 | ... | mask<<7;
// ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte
const uint64_t identity_indices = 0x0706050403020100; // the identity shuffle for vpermps, packed to one index per byte
uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);
__m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
__m256i shufmask = _mm256_cvtepu8_epi32(bytevec);
return _mm256_permutevar8x32_ps(src, shufmask);
}
This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).
# clang 3.7.1 -std=gnu++14 -O3 -march=haswell
mov eax, edi # just to zero extend: goes away when inlining
movabs rcx, 72340172838076673 # The constants are hoisted after inlining into a loop
pdep rax, rax, rcx # ABC -> 0000000A0000000B....
imul rax, rax, 255 # 0000000A0000000B.. -> AAAAAAAABBBBBBBB..
movabs rcx, 506097522914230528
pext rax, rcx, rax
vmovq xmm1, rax
vpmovzxbd ymm1, xmm1 # 3c latency since this is lane-crossing
vpermps ymm0, ymm1, ymm0
ret
(Later clang compiles like GCC, with mov/shl/sub instead of imul, see below.)
So, according to Agner Fog's numbers and https://uops.info/, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it's 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There's no instruction-level parallelism. In a loop where this isn't part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.
This can maybe manage a throughput of one per 4 cycles, bottlenecked on port1 for pdep/pext/imul plus popcnt in the loop. Of course, with loads/stores and other loop overhead (including the compare and movmsk), total uop throughput can easily be an issue, too.
e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we're lucky.
clang 6 and earlier created a loop-carried dependency with popcnt's false dependency on its output, so it will bottleneck on 3/5ths of the latency of the compress256 function. clang 7.0 and later use xor-zeroing to break the false dependency (instead of just using popcnt edx,edx or something like GCC does :/).
gcc (and later clang) does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub, instead of imul by 255. This takes 3 total uops vs. 1 for the front-end, but the latency is only 2 cycles, down from 3. (Haswell handles mov at register-rename stage with zero latency.) Most significantly for this, imul can only run on port 1, competing with pdep/pext/popcnt, so it's probably good to avoid that bottleneck.
Since all hardware that supports AVX2 also supports BMI2, there's probably no point providing a version for AVX2 without BMI2.
If you need to do this in a very long loop, the LUT is probably worth it if the initial cache-misses are amortized over enough iterations with the lower overhead of just unpacking the LUT entry. You still need to movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pext.
You can unpack LUT entries with the same integer sequence I used, but #Froglegs's set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn't need to go into integer registers in the first place. (A 32bit broadcast-load doesn't need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).

See my other answer for AVX2+BMI2 with no LUT.
Since you mention a concern about scalability to AVX512: don't worry, there's an AVX512F instruction for exactly this:
VCOMPRESSPS — Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory. (There are also versions for double, and 32 or 64bit integer elements (vpcompressq), but not byte or word (16bit)). It's like BMI2 pdep / pext, but for vector elements instead of bits in an integer reg.
The destination can be a vector register or a memory operand, while the source is a vector and a mask register. With a register dest, it can merge or zero the upper bits. With a memory dest, "Only the contiguous vector is written to the destination memory location".
To figure out how far to advance your pointer for the next vector, popcnt the mask.
Let's say you want to filter out everything but values >= 0 from an array:
#include <stdint.h>
#include <immintrin.h>
size_t filter_non_negative(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
const float *endp = src+len;
float *dst_start = dst;
do {
__m512 sv = _mm512_loadu_ps(src);
__mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ); // true for src >= 0.0, false for unordered and src < 0.0
_mm512_mask_compressstoreu_ps(dst, keep, sv); // clang is missing this intrinsic, which can't be emulated with a separate store
src += 16;
dst += _mm_popcnt_u64(keep); // popcnt_u64 instead of u32 helps gcc avoid a wasted movsx, but is potentially slower on some CPUs
} while (src < endp);
return dst - dst_start;
}
This compiles (with gcc4.9 or later) to (Godbolt Compiler Explorer):
# Output from gcc6.1, with -O3 -march=haswell -mavx512f. Same with other gcc versions
lea rcx, [rsi+rdx*4] # endp
mov rax, rdi
vpxord zmm1, zmm1, zmm1 # vpxor xmm1, xmm1,xmm1 would save a byte, using VEX instead of EVEX
.L2:
vmovups zmm0, ZMMWORD PTR [rsi]
add rsi, 64
vcmpps k1, zmm0, zmm1, 29 # AVX512 compares have mask regs as a destination
kmovw edx, k1 # There are some insns to add/or/and mask regs, but not popcnt
movzx edx, dx # gcc is dumb and doesn't know that kmovw already zero-extends to fill the destination.
vcompressps ZMMWORD PTR [rax]{k1}, zmm0
popcnt rdx, rdx
## movsx rdx, edx # with _popcnt_u32, gcc is dumb. No casting can get gcc to do anything but sign-extend. You'd expect (unsigned) would mov to zero-extend, but no.
lea rax, [rax+rdx*4] # dst += ...
cmp rcx, rsi
ja .L2
sub rax, rdi
sar rax, 2 # address math -> element count
ret
Performance: 256-bit vectors may be faster on Skylake-X / Cascade Lake
In theory, a loop that loads a bitmap and filters one array into another should run at 1 vector per 3 clocks on SKX / CSLX, regardless of vector width, bottlenecked on port 5. (kmovb/w/d/q k1, eax runs on p5, and vcompressps into memory is 2p5 + a store, according to IACA and to testing by http://uops.info/).
#ZachB reports in comments that in practice, that a loop using ZMM _mm512_mask_compressstoreu_ps is slightly slower than _mm256_mask_compressstoreu_ps on real CSLX hardware. (I'm not sure if that was a microbenchmark that would allow the 256-bit version to get out of "512-bit vector mode" and clock higher, or if there was surrounding 512-bit code.)
I suspect misaligned stores are hurting the 512-bit version. vcompressps probably effectively does a masked 256 or 512-bit vector store, and if that crosses a cache line boundary then it has to do extra work. Since the output pointer is usually not a multiple of 16 elements, a full-line 512-bit store will almost always be misaligned.
Misaligned 512-bit stores may be worse than cache-line-split 256-bit stores for some reason, as well as happening more often; we already know that 512-bit vectorization of other things seems to be more alignment sensitive. That may just be from running out of split-load buffers when they happen every time, or maybe the fallback mechanism for handling cache-line splits is less efficient for 512-bit vectors.
It would be interesting to benchmark vcompressps into a register, with separate full-vector overlapping stores. That's probably the same uops, but the store can micro-fuse when it's a separate instruction. And if there's some difference between masked stores vs. overlapping stores, this would reveal it.
Another idea discussed in comments below was using vpermt2ps to build up full vectors for aligned stores. This would be hard to do branchlessly, and branching when we fill a vector will probably mispredict unless the bitmask has a pretty regular pattern, or big runs of all-0 and all-1.
A branchless implementation with a loop-carried dependency chain of 4 or 6 cycles through the vector being constructed might be possible, with a vpermt2ps and a blend or something to replace it when it's "full". With an aligned vector store every iteration, but only moving the output pointer when the vector is full.
This is likely slower than vcompressps with unaligned stores on current Intel CPUs.

If you are targeting AMD Zen this method may be preferred, due to the very slow pdepand pext on ryzen (18 cycles each).
I came up with this method, which uses a compressed LUT, which is 768(+1 padding) bytes, instead of 8k. It requires a broadcast of a single scalar value, which is then shifted by a different amount in each lane, then masked to the lower 3 bits, which provides a 0-7 LUT.
Here is the intrinsics version, along with code to build LUT.
//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(u32 moveMask) {
u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
__m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr));//lower 24 bits has our LUT
// __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));
//now shift it right to get 3 bits at bottom
//__m256i shufmask = _mm256_srli_epi32(m, 29);
//Simplified version suggested by wim
//shift each lane so desired 3 bits are a bottom
//There is leftover data in the lane, but _mm256_permutevar8x32_ps only examines the first 3 bits so this is ok
__m256i shufmask = _mm256_srlv_epi32 (indices, _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21));
return shufmask;
}
u32 get_nth_bits(int a) {
u32 out = 0;
int c = 0;
for (int i = 0; i < 8; ++i) {
auto set = (a >> i) & 1;
if (set) {
out |= (i << (c * 3));
c++;
}
}
return out;
}
u8 g_pack_left_table_u8x3[256 * 3 + 1];
void BuildPackMask() {
for (int i = 0; i < 256; ++i) {
*reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
}
}
Here is the assembly generated by MSVC:
lea ecx, DWORD PTR [rcx+rcx*2]
lea rax, OFFSET FLAT:unsigned char * g_pack_left_table_u8x3 ; g_pack_left_table_u8x3
vpbroadcastd ymm0, DWORD PTR [rcx+rax]
vpsrlvd ymm0, ymm0, YMMWORD PTR __ymm#00000015000000120000000f0000000c00000009000000060000000300000000

Will add more information to a great answer from #PeterCordes : https://stackoverflow.com/a/36951611/5021064.
I did the implementations of std::remove from C++ standard for integer types with it. The algorithm, once you can do compress, is relatively simple: load a register, compress, store. First I'm going to show the variations and then benchmarks.
I ended up with two meaningful variations on the proposed solution:
__m128i registers, any element type, using _mm_shuffle_epi8 instruction
__m256i registers, element type of at least 4 bytes, using _mm256_permutevar8x32_epi32
When the types are smaller then 4 bytes for 256 bit register, I split them in two 128 bit registers and compress/store each one separately.
Link to compiler explorer where you can see complete assembly (there is a using type and width (in elements per pack) in the bottom, which you can plug in to get different variations) : https://gcc.godbolt.org/z/yQFR2t
NOTE: my code is in C++17 and is using a custom simd wrappers, so I do not know how readable it is. If you want to read my code -> most of it is behind the link in the top include on godbolt. Alternatively, all of the code is on github.
Implementations of #PeterCordes answer for both cases
Note: together with the mask, I also compute the number of elements remaining using popcount. Maybe there is a case where it's not needed, but I have not seen it yet.
Mask for _mm_shuffle_epi8
Write an index for each byte into a half byte: 0xfedcba9876543210
Get pairs of indexes into 8 shorts packed into __m128i
Spread them out using x << 4 | x & 0x0f0f
Example of spreading the indexes. Let's say 7th and 6th elements are picked.
It means that the corresponding short would be: 0x00fe. After << 4 and | we'd get 0x0ffe. And then we clear out the second f.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of `_mm_movemask_epi8`,
// `uint16_t` - there are at most 16 bits with values for __m128i.
inline std::pair<__m128i, std::uint8_t> mask128(std::uint16_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x1111111111111111) * 0xf;
const std::uint8_t offset =
static_cast<std::uint8_t>(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes =
_pext_u64(0xfedcba9876543210, mmask_expanded); // Do the #PeterCordes answer
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0...0|compressed_indexes
const __m128i as_16bit = _mm_cvtepu8_epi16(as_lower_8byte); // From bytes to shorts over the whole register
const __m128i shift_by_4 = _mm_slli_epi16(as_16bit, 4); // x << 4
const __m128i combined = _mm_or_si128(shift_by_4, as_16bit); // | x
const __m128i filter = _mm_set1_epi16(0x0f0f); // 0x0f0f
const __m128i res = _mm_and_si128(combined, filter); // & 0x0f0f
return {res, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m128i, std::uint8_t> compress_mask_for_shuffle_epi8(std::uint32_t mmask) {
auto res = _compress_mask::mask128(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Mask for _mm256_permutevar8x32_epi32
This is almost one for one #PeterCordes solution - the only difference is _pdep_u64 bit (he suggests this as a note).
The mask that I chose is 0x5555'5555'5555'5555. The idea is - I have 32 bits of mmask, 4 bits for each of 8 integers. I have 64 bits that I want to get => I need to convert each bit of 32 bits into 2 => therefore 0101b = 5.The multiplier also changes from 0xff to 3 because I will get 0x55 for each integer, not 1.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of _mm256_movemask_epi8
inline std::pair<__m256i, std::uint8_t> mask256_epi32(std::uint32_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x5555'5555'5555'5555) * 3;
const std::uint8_t offset = static_cast<std::uint8_t(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes = _pext_u64(0x0706050403020100, mmask_expanded); // Do the #PeterCordes answer
// Every index was one byte => we need to make them into 4 bytes
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0000|compressed indexes
const __m256i expanded = _mm256_cvtepu8_epi32(as_lower_8byte); // spread them out
return {expanded, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m256i, std::uint8_t> compress_mask_for_permutevar8x32(std::uint32_t mmask) {
static_assert(sizeof(T) >= 4); // You cannot permute shorts/chars with this.
auto res = _compress_mask::mask256_epi32(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Benchmarks
Processor: Intel Core i7 9700K (a modern consumer level CPU, no AVX-512 support)
Compiler: clang, build from trunk near the version 10 release
Compiler options: --std=c++17 --stdlib=libc++ -g -Werror -Wall -Wextra -Wpedantic -O3 -march=native -mllvm -align-all-functions=7
Micro-benchmarking library: google benchmark
Controlling for code alignment:
If you are not familiar with the concept, read this or watch this
All functions in the benchmark's binary are aligned to 128 byte boundary. Each benchmarking function is duplicated 64 times, with a different noop slide in the beginning of the function (before entering the loop). The main numbers I show is min per each measurement. I think this works since the algorithm is inlined. I'm also validated by the fact that I get very different results. At the very bottom of the answer I show the impact of code alignment.
Note: benchmarking code. BENCH_DECL_ATTRIBUTES is just noinline
Benchmark removes some percentage of 0s from an array. I test arrays with {0, 5, 20, 50, 80, 95, 100} percent of zeroes.
I test 3 sizes: 40 bytes (to see if this is usable for really small arrays), 1000 bytes and 10'000 bytes. I group by size because of SIMD depends on the size of the data and not a number of elements. The element count can be derived from an element size (1000 bytes is 1000 chars but 500 shorts and 250 ints). Since time it takes for non simd code depends mostly on the element count, the wins should be bigger for chars.
Plots: x - percentage of zeroes, y - time in nanoseconds. padding : min indicates that this is minimum among all alignments.
40 bytes worth of data, 40 chars
For 40 bytes this does not make sense even for chars - my implementation gets about 8-10 times slower when using 128 bit registers over non-simd code. So, for example, compiler should be careful doing this.
1000 bytes worth of data, 1000 chars
Apparently the non-simd version is dominated by branch prediction: when we get small amount of zeroes we get a smaller speed up: for no 0s - about 3 times, for 5% zeroes - about 5-6 times speed up. For when the branch predictor can't help the non-simd version - there is about a 27 times speed up. It's an interesting property of simd code that it's performance tends to be much less dependent on of data. Using 128 vs 256 register shows practically no difference, since most of the work is still split into 2 128 registers.
1000 bytes worth of data, 500 shorts
Similar results for shorts except with a much smaller gain - up to 2 times.
I don't know why shorts do that much better than chars for non-simd code: I'd expect shorts to be two times faster, since there are only 500 shorts, but the difference is actually up to 10 times.
1000 bytes worth of data, 250 ints
For a 1000 only 256 bit version makes sense - 20-30% win excluding no 0s to remove what's so ever (perfect branch prediction, no removing for non-simd code).
10'000 bytes worth of data, 10'000 chars
The same order of magnitude wins as as for a 1000 chars: from 2-6 times faster when branch predictor is helpful to 27 times when it's not.
Same plots, only simd versions:
Here we can see about a 10% win from using 256 bit registers and splitting them in 2 128 bit ones: about 10% faster. In size it grows from 88 to 129 instructions, which is not a lot, so might make sense depending on your use-case. For base-line - non-simd version is 79 instructions (as far as I know - these are smaller then SIMD ones though).
10'000 bytes worth of data, 5'000 shorts
From 20% to 9 times win, depending on the data distributions. Not showing the comparison between 256 and 128 bit registers - it's almost the same assembly as for chars and the same win for 256 bit one of about 10%.
10'000 bytes worth of data, 2'500 ints
Seems to make a lot of sense to use 256 bit registers, this version is about 2 times faster compared to 128 bit registers. When comparing with non-simd code - from a 20% win with a perfect branch prediction to 3.5 - 4 times as soon as it's not.
Conclusion: when you have a sufficient amount of data (at least 1000 bytes) this can be a very worthwhile optimisation for a modern processor without AVX-512
PS:
On percentage of elements to remove
On one hand it's uncommon to filter half of your elements. On the other hand a similar algorithm can be used in partition during sorting => that is actually expected to have ~50% branch selection.
Code alignment impact
The question is: how much worth it is, if the code happens to be poorly aligned
(generally speaking - there is very little one can do about it).
I'm only showing for 10'000 bytes.
The plots have two lines for min and for max for each percentage point (meaning - it's not one best/worst code alignment - it's the best code alignment for a given percentage).
Code alignment impact - non-simd
Chars:
From 15-20% for poor branch prediction to 2-3 times when branch prediction helped a lot. (branch predictor is known to be affected by code alignment).
Shorts:
For some reason - the 0 percent is not affected at all. It can be explained by std::remove first doing linear search to find the first element to remove. Apparently linear search for shorts is not affected.
Other then that - from 10% to 1.6-1.8 times worth
Ints:
Same as for shorts - no 0s is not affected. As soon as we go into remove part it goes from 1.3 times to 5 times worth then the best case alignment.
Code alignment impact - simd versions
Not showing shorts and ints 128, since it's almost the same assembly as for chars
Chars - 128 bit register
About 1.2 times slower
Chars - 256 bit register
About 1.1 - 1.24 times slower
Ints - 256 bit register
1.25 - 1.35 times slower
We can see that for simd version of the algorithm, code alignment has significantly less impact compared to non-simd version. I suspect that this is due to practically not having branches.

In case anyone is interested here is a solution for SSE2 which uses an instruction LUT instead of a data LUT aka a jump table. With AVX this would need 256 cases though.
Each time you call LeftPack_SSE2 below it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases don't need to modify the vector.
static inline __m128 LeftPack_SSE2(__m128 val, int mask) {
switch(mask) {
case 0:
case 1: return val;
case 2: return _mm_shuffle_ps(val,val,0x01);
case 3: return val;
case 4: return _mm_shuffle_ps(val,val,0x02);
case 5: return _mm_shuffle_ps(val,val,0x08);
case 6: return _mm_shuffle_ps(val,val,0x09);
case 7: return val;
case 8: return _mm_shuffle_ps(val,val,0x03);
case 9: return _mm_shuffle_ps(val,val,0x0c);
case 10: return _mm_shuffle_ps(val,val,0x0d);
case 11: return _mm_shuffle_ps(val,val,0x34);
case 12: return _mm_shuffle_ps(val,val,0x0e);
case 13: return _mm_shuffle_ps(val,val,0x38);
case 14: return _mm_shuffle_ps(val,val,0x39);
case 15: return val;
}
}
__m128 foo(__m128 val, __m128 maskv) {
int mask = _mm_movemask_ps(maskv);
return LeftPack_SSE2(val, mask);
}

This is perhaps a bit late though I recently ran into this exact problem and found an alternative solution which used a strictly AVX implementation. If you don't care if unpacked elements are swapped with the last elements of each vector, this could work as well. The following is an AVX version:
inline __m128 left_pack(__m128 val, __m128i mask) noexcept
{
const __m128i shiftMask0 = _mm_shuffle_epi32(mask, 0xA4);
const __m128i shiftMask1 = _mm_shuffle_epi32(mask, 0x54);
const __m128i shiftMask2 = _mm_shuffle_epi32(mask, 0x00);
__m128 v = val;
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask0);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask1);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask2);
return v;
}
Essentially, each element in val is shifted once to the left using the bitfield, 0xF9 for blending with it's unshifted variant. Next, both shifted and unshifted versions are blended against the input mask (which has the first non-zero element broadcast across the remaining elements 3 and 4). Repeat this process two more times, broadcasting the second and third elements of mask to its subsequent elements on each iteration and this should provide an AVX version of the _pdep_u32() BMI2 instruction.
If you don't have AVX, you can easily swap out each _mm_permute_ps() with _mm_shuffle_ps() for an SSE4.1-compatible version.
And if you're using double-precision, here's an additional version for AVX2:
inline __m256 left_pack(__m256d val, __m256i mask) noexcept
{
const __m256i shiftMask0 = _mm256_permute4x64_epi64(mask, 0xA4);
const __m256i shiftMask1 = _mm256_permute4x64_epi64(mask, 0x54);
const __m256i shiftMask2 = _mm256_permute4x64_epi64(mask, 0x00);
__m256d v = val;
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask0);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask1);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask2);
return v;
}
Additionally _mm_popcount_u32(_mm_movemask_ps(val)) can be used to determine the number of elements which remained after the left-packing.

The best way to shift a __m128i?

I need to shift a __m128i variable, (say v), by m bits, in such a way that bits move through all of the variable (So, the resulting variable represents v*2^m).
What is the best way to do this?!
Note that _mm_slli_epi64 shifts v0 and v1 seperately:
r0 := v0 << count
r1 := v1 << count
so the last bits of v0 missed, but I want to move those bits to r1.
Edit:
I looking for a code, faster than this (m<64):
r0 = v0 << m;
r1 = v0 >> (64-m);
r1 ^= v1 << m;
r2 = v1 >> (64-m);

For compile-time constant shift counts, you can get fairly good results. Otherwise not really.
This is just an SSE implementation of the r0 / r1 code from your question, since there's no other obvious way to do it. Variable-count shifts are only available for bit-shifts within vector elements, not for byte-shifts of the whole register. So we just carry the low 64bits up to the high 64 and use a variable-count shift to put them in the right place.
// untested
#include <immintrin.h>
/* some compilers might choke on slli / srli with non-compile-time-constant args
* gcc generates the xmm, imm8 form with constants,
* and generates the xmm, xmm form with otherwise. (With movd to get the count in an xmm)
*/
// doesn't optimize for the special-case where count%8 = 0
// could maybe do that in gcc with if(__builtin_constant_p(count)) { if (!count%8) return ...; }
__m128i mm_bitshift_left(__m128i x, unsigned count)
{
__m128i carry = _mm_bslli_si128(x, 8); // old compilers only have the confusingly named _mm_slli_si128 synonym
if (count >= 64)
return _mm_slli_epi64(carry, count-64); // the non-carry part is all zero, so return early
// else
carry = _mm_srli_epi64(carry, 64-count); // After bslli shifted left by 64b
x = _mm_slli_epi64(x, count);
return _mm_or_si128(x, carry);
}
__m128i mm_bitshift_left_3(__m128i x) { // by a specific constant, to see inlined constant version
return mm_bitshift_left(x, 3);
}
// by a specific constant, to see inlined constant version
__m128i mm_bitshift_left_100(__m128i x) { return mm_bitshift_left(x, 100); }
I thought this was going to be less convenient than it turned out to be. _mm_slli_epi64 works on gcc/clang/icc even when the count is not a compile-time constant (generating a movd from integer reg to xmm reg). There is a _mm_sll_epi64 (__m128i a, __m128i count) (note the lack of i), but at least these days, the i intrinsic can generate either form of psllq.
The compile-time-constant count versions are fairly efficient, compiling to 4 instructions (or 5 without AVX):
mm_bitshift_left_3(long long __vector(2)):
vpslldq xmm1, xmm0, 8
vpsrlq xmm1, xmm1, 61
vpsllq xmm0, xmm0, 3
vpor xmm0, xmm0, xmm1
ret
Performance:
This has 3 cycle latency (vpslldq(1) -> vpsrlq(1) -> vpor(1)) on Intel SnB/IvB/Haswell, with throughput limited to one per 2 cycles (saturating the vector shift unit on port 0). Byte-shift runs on the shuffle unit on a different port. Immediate-count vector shifts are all single-uop instructions, so this is only 4 fused-domain uops taking up pipeline space when mixed in with other code. (Variable-count vector shifts are 2 uop, 2 cycle latency, so the variable-count version of this function is worse than it looks from counting instructions.)
Or for counts >= 64:
mm_bitshift_left_100(long long __vector(2)):
vpslldq xmm0, xmm0, 8
vpsllq xmm0, xmm0, 36
ret
If your shift-count is not a compile-time constant, you have to branch on count > 64 to figure out whether to left or right shift the carry. I believe the shift count is interpreted as an unsigned integer, so a negative count is impossible.
It also takes extra instructions to get the int count and 64-count into vector registers. Doing this in a branchless fashion with vector compares and a blend instruction might be possible, but a branch is probably a good idea.
The variable-count version for __uint128_t in GP registers looks fairly good; better than the SSE version. Clang does a slightly better job than gcc, emitting fewer mov instructions, but it still uses two cmov instructions for the count >= 64 case. (Because x86 integer shift instructions mask the count, instead of saturating.)
__uint128_t leftshift_int128(__uint128_t x, unsigned count) {
return x << count; // undefined if count >= 128
}

In SSE4.A the instructions insrq and extrq can be used to shift (and rotate) through __mm128i 1-64 bits at a time. Unlike the 8/16/32/64 bit counterparts pextrN/pinsrX, these instructions select or insert m bits (between 1 and 64) at any bit offset from 0 to 127. The caveat is that the sum of lenght and offset must not exceed 128.

Comparing two pairs of 4 variables and returning the number of matches?

Given the following struct:
struct four_points {
uint32_t a, b, c, d;
}
What would be the absolute fastest way to compare two such structures and return the number of variables that match (in any position)?
For example:
four_points s1 = {0, 1, 2, 3};
four_points s2 = {1, 2, 3, 4};
I'd be looking for a result of 3, since three numbers match between the two structs. However, given the following:
four_points s1 = {1, 0, 2, 0};
four_points s2 = {0, 1, 9, 7};
Then I'd expect a result of only 2, because only two variables match between either struct (despite there being two zeros in the first).
I've figured out a few rudimentary systems for performing the comparison, but this is something that is going to be called a couple million times in a short time span and needs to be relatively quick. My current best attempt was to use a sorting network to sort all four values for either input, then loop over the sorted values and keep a tally of the values that are equal, advancing the current index of either input accordingly.
Is there any kind of technique that might be able to perform better then a sort & iteration?

On modern CPUs, sometimes brute force applied properly is the way to go. The trick is writing code that isn't limited by instruction latencies, just throughput.
Are duplicates common? If they're very rare, or have a pattern, using a branch to handle them makes the common case faster. If they're really unpredictable, it's better to do something branchless. I was thinking about using a branch to check for duplicates between positions where they're rare, and going branchless for the more common place.
Benchmarking is tricky because a version with branches will shine when tested with the same data a million times, but will have lots of branch mispredicts in real use.
I haven't benchmarked anything yet, but I have come up with a version that skips duplicates by using OR instead of addition to combine found-matches. It compiles to nice-looking x86 asm that gcc fully unrolls. (no conditional branches, not even loops).
Here it is on godbolt. (g++ is dumb and uses 32bit operations on the output of x86 setcc, which only sets the low 8 bits. This partial-register access will produce slowdowns. And I'm not even sure it ever zeroes the upper 24bits at all... Anyway, the code from gcc 4.9.2 looks good, and so does clang on godbolt)
// 8-bit types used because x86's setcc instruction only sets the low 8 of a register
// leaving the other bits unmodified.
// Doing a 32bit add from that creates a partial register slowdown on Intel P6 and Sandybridge CPU families
// Also, compilers like to insert movzx (zero-extend) instructions
// because I guess they don't realize the previous high bits are all zero.
// (Or they're tuning for pre-sandybridge Intel, where the stall is worse than SnB inserting the extra uop itself).
// The return type is 8bit because otherwise clang decides it should generate
// things as 32bit in the first place, and does zero-extension -> 32bit adds.
int8_t match4_ordups(const four_points *s1struct, const four_points *s2struct)
{
const int32_t *s1 = &s1struct->a; // TODO: check if this breaks aliasing rules
const int32_t *s2 = &s2struct->a;
// ignore duplicates by combining with OR instead of addition
int8_t matches = 0;
for (int j=0 ; j<4 ; j++) {
matches |= (s1[0] == s2[j]);
}
for (int i=1; i<4; i++) { // i=0 iteration is broken out above
uint32_t s1i = s1[i];
int8_t notdup = 1; // is s1[i] a duplicate of s1[0.. i-1]?
for (int j=0 ; j<i ; j++) {
notdup &= (uint8_t) (s1i != s1[j]); // like dup |= (s1i == s1[j]); but saves a NOT
}
int8_t mi = // match this iteration?
(s1i == s2[0]) |
(s1i == s2[1]) |
(s1i == s2[2]) |
(s1i == s2[3]);
// gcc and clang insist on doing 3 dependent OR insns regardless of parens, not that it matters
matches += mi & notdup;
}
return matches;
}
// see the godbolt link for a main() simple test harness.
On a machine with 128b vectors that can work with 4 packed 32bit integers (e.g. x86 with SSE2), you can broadcast each element of s1 to its own vector, deduplicate, and then do 4 packed-compares. icc does something like this to autovectorize my match4_ordups function (check it out on godbolt.)
Store the compare results back to integer registers with movemask, to get a bitmap of which elements compared equal. Popcount those bitmaps, and add the results.
This led me to a better idea: Getting all the compares done with only 3 shuffles with element-wise rotation:
{ 1d 1c 1b 1a }
== == == == packed-compare with
{ 2d 2c 2b 2a }
{ 1a 1d 1c 1b }
== == == == packed-compare with
{ 2d 2c 2b 2a }
{ 1b 1a 1d 1c } # if dups didn't matter: do this shuffle on s2
== == == == packed-compare with
{ 2d 2c 2b 2a }
{ 1c 1b 1a 1d } # if dups didn't matter: this result from { 1a ... }
== == == == packed-compare with
{ 2d 2c 2b 2a } { 2b ...
That's only 3 shuffles, and still does all 16 comparisons. The trick is combining them with ORs where we need to merge duplicates, and then being able to count them efficiently. A packed-compare outputs a vector with each element = zero or -1 (all bits set), based on the comparison between the two elements in that position. It's designed to make a useful operand to AND or XOR to mask off some vector elements, e.g. to make v1 += v2 & mask conditional on a per-element basis. It also works as just a boolean truth value.
All 16 compares with only 2 shuffles is possible by rotating one vector by two, and the other vector by one, and then comparing between the four shifted and unshifted vectors. This would be great if we didn't need to eliminate dups, but since we do, it matters which results are where. We're not just adding all 16 comparison results.
OR together the packed-compare results down to one vector. Each element will be set based on whether that element of s2 had any matches in s1. int _mm_movemask_ps (__m128 a) to turn the vector into a bitmap, then popcount the bitmap. (Nehalem or newer CPU required for popcnt, otherwise fall back to a version with a 4-bit lookup table.)
The vertical ORs take care of duplicates in s1, but duplicates in s2 is a less obvious extension, and would take more work. I did eventually think of a way that was less than twice as slow (see below).
#include <stdint.h>
#include <immintrin.h>
typedef struct four_points {
int32_t a, b, c, d;
} four_points;
//typedef uint32_t four_points[4];
// small enough to inline, only 62B of x86 instructions (gcc 4.9.2)
static inline int match4_sse_noS2dup(const four_points *s1pointer, const four_points *s2pointer)
{
__m128i s1 = _mm_loadu_si128((__m128i*)s1pointer);
__m128i s2 = _mm_loadu_si128((__m128i*)s2pointer);
__m128i s1b= _mm_shuffle_epi32(s1, _MM_SHUFFLE(0, 3, 2, 1));
// no shuffle needed for first compare
__m128i match = _mm_cmpeq_epi32(s1 , s2); //{s1.d==s2.d?-1:0, 1c==2c, 1b==2b, 1a==2a }
__m128i s1c= _mm_shuffle_epi32(s1, _MM_SHUFFLE(1, 0, 3, 2));
s1b = _mm_cmpeq_epi32(s1b, s2);
match = _mm_or_si128(match, s1b); // merge dups by ORing instead of adding
// note that we shuffle the original vector every time
// multiple short dependency chains are better than one long one.
__m128i s1d= _mm_shuffle_epi32(s1, _MM_SHUFFLE(2, 1, 0, 3));
s1c = _mm_cmpeq_epi32(s1c, s2);
match = _mm_or_si128(match, s1c);
s1d = _mm_cmpeq_epi32(s1d, s2);
match = _mm_or_si128(match, s1d); // match = { s2.a in s1?, s2.b in s1?, etc. }
// turn the the high bit of each 32bit element into a bitmap of s2 elements that have matches anywhere in s1
// use float movemask because integer movemask does 8bit elements.
int matchmask = _mm_movemask_ps (_mm_castsi128_ps(match));
return _mm_popcnt_u32(matchmask); // or use a 4b lookup table for CPUs with SSE2 but not popcnt
}
See the version that eliminates duplicates in s2 for the same code with lines in a more readable order. I tried to schedule instructions in case the CPU was only just barely decoding instructions ahead of what was executing, but gcc puts the instructions in the same order regardless of what order you put the intrinsics in.
This is extremely fast, if there isn't a store-forwarding stall in the 128b loads. If you just wrote the struct with four 32bit stores, running this function within the next several clock cycles will produce a stall when it tries to load the whole struct with a 128b load. See Agner Fog's site. If calling code already has many of the 8 values in registers, the scalar version could be a win, even though it'll be slower for a microbenchmark test that only reads the structs from memory.
I got lazy on cycle-counting for this, since dup-handling isn't done yet. IACA says Haswell can run it with a throughput of one iteration per 4.05 clock cycles, and latency of 17 cycles (Not sure if that's including the memory latency of the loads. There's a lot of instruction-level parallelism available, and all the instructions have single-cycle latency, except for movmsk(2) and popcnt(3)). It's slightly slower without AVX, because gcc chooses a worse instruction ordering, and still wastes a movdqa instruction copying a vector register.
With AVX2, this could do two match4 operations in parallel, in 256b vectors. AVX2 usually works as two 128b lanes, rather than full 256b vectors. Setting up your code to be able to take advantage of 2 or 4 (AVX-512) match4 operations in parallel will give you gains when you can compile for those CPUs. It's not essential for both the s1s or s2s to be stored contiguously so a single 32B load can get two structs. AVX2 has a fairly fast load 128b to the upper lane of a register.
Handling duplicates in s2
Maybe compare s2 to a shifted instead of rotated version of itself.
#### comparing S2 with itself to mask off duplicates
{ 0 2d 2c 2b }
{ 2d 2c 2b 2a } == == ==
{ 0 0 2d 2c }
{ 2d 2c 2b 2a } == ==
{ 0 0 0 2d }
{ 2d 2c 2b 2a } ==
Hmm, if zero can occur as a regular element, we may need to byte-shift after the compare as well, to turn potential false positives into zeros. If there was a sentinel value that couldn't appear in s1, you could shift in elements of that, instead of 0. (SSE has PALIGNR, which gives you any contiguous 16B window you want of the contents of two registers appended. Named for the use-case of simulating an unaligned load from two aligned loads. So you'd have a constant vector of that element.)
update: I thought of a nice trick that avoids the need for an identity element. We can actually get all 6 necessary s2 vs. s2 comparisons to happen with just two vector compares, and then combine the results.
Doing the same compare in the same place in two vectors lets you OR two results together without having to mask before the OR. (Works around the lack of a sentinel value).
Shuffling the output of the compares instead of extra shuffle&compare of S2. This means we can get d==a done next to the other compares.
Notice that we aren't limited to shuffling whole elements around. Byte-wise shuffle to get bytes from different compare results into a single vector element, and compare that against zero. (This saves less than I'd hoped, see below).
Checking for duplicates is a big slowdown (esp. in throughput, not so much in latency). So you're still best off arranging for a sentinel value in s2 that will never match any s1 element, which you say is possible. I only present this because I thought it was interesting. (And gives you an option in case you need a version that doesn't require sentinels sometime.)
static inline
int match4_sse(const four_points *s1pointer, const four_points *s2pointer)
{
// IACA_START
__m128i s1 = _mm_loadu_si128((__m128i*)s1pointer);
__m128i s2 = _mm_loadu_si128((__m128i*)s2pointer);
// s1a = unshuffled = s1.a in the low element
__m128i s1b= _mm_shuffle_epi32(s1, _MM_SHUFFLE(0, 3, 2, 1));
__m128i s1c= _mm_shuffle_epi32(s1, _MM_SHUFFLE(1, 0, 3, 2));
__m128i s1d= _mm_shuffle_epi32(s1, _MM_SHUFFLE(2, 1, 0, 3));
__m128i match = _mm_cmpeq_epi32(s1 , s2); //{s1.d==s2.d?-1:0, 1c==2c, 1b==2b, 1a==2a }
s1b = _mm_cmpeq_epi32(s1b, s2);
match = _mm_or_si128(match, s1b); // merge dups by ORing instead of adding
s1c = _mm_cmpeq_epi32(s1c, s2);
match = _mm_or_si128(match, s1c);
s1d = _mm_cmpeq_epi32(s1d, s2);
match = _mm_or_si128(match, s1d);
// match = { s2.a in s1?, s2.b in s1?, etc. }
// s1 vs s2 all done, now prepare a mask for it based on s2 dups
/*
* d==b c==a b==a d==a #s2b
* d==c c==b b==a d==a #s2c
* OR together -> s2bc
* d==abc c==ba b==a 0 pshufb(s2bc) (packed as zero or non-zero bytes within the each element)
* !(d==abc) !(c==ba) !(b==a) !0 pcmpeq setzero -> AND mask for s1_vs_s2 match
*/
__m128i s2b = _mm_shuffle_epi32(s2, _MM_SHUFFLE(1, 0, 0, 3));
__m128i s2c = _mm_shuffle_epi32(s2, _MM_SHUFFLE(2, 1, 0, 3));
s2b = _mm_cmpeq_epi32(s2b, s2);
s2c = _mm_cmpeq_epi32(s2c, s2);
__m128i s2bc= _mm_or_si128(s2b, s2c);
s2bc = _mm_shuffle_epi8(s2bc, _mm_set_epi8(-1,-1,0,12, -1,-1,-1,8, -1,-1,-1,4, -1,-1,-1,-1));
__m128i dupmask = _mm_cmpeq_epi32(s2bc, _mm_setzero_si128());
// see below for alternate insn sequences that can go here.
match = _mm_and_si128(match, dupmask);
// turn the the high bit of each 32bit element into a bitmap of s2 matches
// use float movemask because integer movemask does 8bit elements.
int matchmask = _mm_movemask_ps (_mm_castsi128_ps(match));
int ret = _mm_popcnt_u32(matchmask); // or use a 4b lookup table for CPUs with SSE2 but not popcnt
// IACA_END
return ret;
}
This requires SSSE3 for pshufb. It and a pcmpeq (and a pxor to generate a constant) are replacing a shuffle (bslli(s2bc, 12)), an OR, and an AND.
d==bc c==ab b==a a==d = s2b|s2c
d==a 0 0 0 = byte-shift-left(s2b) = s2d0
d==abc c==ab b==a a==d = s2abc
d==abc c==ab b==a 0 = mask(s2abc). Maybe use PBLENDW or MOVSS from s2d0 (which we know has zeros) to save loading a 16B mask.
__m128i s2abcd = _mm_or_si128(s2b, s2c);
//s2bc = _mm_shuffle_epi8(s2bc, _mm_set_epi8(-1,-1,0,12, -1,-1,-1,8, -1,-1,-1,4, -1,-1,-1,-1));
//__m128i dupmask = _mm_cmpeq_epi32(s2bc, _mm_setzero_si128());
__m128i s2d0 = _mm_bslli_si128(s2b, 12); // d==a 0 0 0
s2abcd = _mm_or_si128(s2abcd, s2d0);
__m128i dupmask = _mm_blend_epi16(s2abcd, s2d0, 0 | (2 | 1));
//__m128i dupmask = _mm_and_si128(s2abcd, _mm_set_epi32(-1, -1, -1, 0));
match = _mm_andnot_si128(dupmask, match); // ~dupmask & match; first arg is the one that's inverted
I can't recommend MOVSS; it will incur extra latency on AMD because it runs in the FP domain. PBLENDW is SSE4.1. popcnt is available on AMD K10, but PBLENDW isn't (some Barcelona-core PhenomII CPUs are probably still in use). Actually, K10 doesn't have PSHUFB either, so just require SSE4.1 and POPCNT, and use PBLENDW. (Or use the PSHUFB version, unless it's going to cache-miss a lot.)
Another option to avoid a loading a vector constant from memory is to movemask s2bc, and use integer instead of vector ops. However, it looks like that'll be slower, because the extra movemask isn't free, and integer ANDN isn't usable. BMI1 didn't appear until Haswell, and even Skylake Celerons and Pentiums won't have it. (Very annoying, IMO. It means compilers can't start using BMI for even longer.)
unsigned int dupmask = _mm_movemask_ps(cast(s2bc));
dupmask |= dupmask << 3; // bit3 = d==abc. garbage in bits 4-6, careful if using AVX2 to do two structs at once
// only 2 instructions. compiler can use lea r2, [r1*8] to copy and scale
dupmask &= ~1; // clear the low bit
unsigned int matchmask = _mm_movemask_ps(cast(match));
matchmask &= ~dupmask; // ANDN is in BMI1 (Haswell), so this will take 2 instructions
return _mm_popcnt_u32(matchmask);
AMD XOP's VPPERM (pick bytes from any element of two source registers) would let the byte-shuffle replace the OR that merges s2b and s2c as well.
Hmm, pshufb isn't saving me as much as I thought, because it requires a pcmpeqd, and a pxor to zero a register. It's also loading its shuffle mask from a constant in memory, which can miss in the D-cache. It is the fastest version I've come up with, though.
If inlined into a loop, the same zeroed register could be used, saving one instruction. However, OR and AND can run on port0 (Intel CPUs), which can't run shuffle or compare instructions. The PXOR doesn't use any execution ports, though (on Intel SnB-family microarchitecture).
I haven't run real benchmarks of any of these, only IACA.
The PBLENDW and PSHUFB versions have the same latency (22 cycles, compiled for non-AVX), but the PSHUFB version has better throughput (one per 7.1c, vs. one per 7.4c, because PBLENDW needs the shuffle port, and there's already a lot of contention for it.) IACA says the version using PANDN with a constant instead of PBLENDW is also one-per-7.4c throughput, disappointingly. Port0 isn't saturated, so IDK why it's as slow as PBLENDW.
Old ideas that didn't pan out.
Leaving them in for the benefit of people looking for things to try when using vectors for related things.
Dup-checking s2 with vectors is more work than checking s2 vs. s1, because one compare is as expensive as 4 if done with vectors. The shuffling or masking needed after the compare, to remove false positives if there's no sentinel value, is annoying.
Ideas so far:
Shift s2 over by an element, and compare it to itself. Mask off false positives from shifting in 0. Vertically OR these together, and use it to ANDN the s1 vs s2 vector.
scalar code to do the smaller number of s2 vs. itself comparisons, building a bitmask to use before popcnt.
Broadcast s2.d and check it against s2 (all positions). But that puts the results horizontally in one vector, instead of vertically in 3 vectors. To use that, maybe PTEST / SETCC to make a mask for the bitmap (to apply before popcount). (PTEST with a mask of _mm_setr_epi32(0, -1, -1, -1), to only test the c,b,a, not d==d). Do (c==a | c==b) and b==a with scalar code, and combine that into a mask. Intel Haswell and later have 4 ALU execution ports, but only 3 of them can run vector instructions, so some scalar code in the mix could fill port6. AMD has even more separation between vector and integer execution resources.
shuffle s2 to get all the necessary comparisons done somehow, then shuffle the outputs. Maybe use movemask -> 4-bit lookup table for something?

SSE _mm_movemask_epi8 equivalent method for ARM NEON

I decided to continue Fast corners optimisation and stucked at
_mm_movemask_epi8 SSE instruction. How can i rewrite it for ARM Neon with uint8x16_t input?

I know this post is quite outdated but I found it useful to give my (validated) solution. It assumes all ones/all zeroes in every lane of the Input argument.
const uint8_t __attribute__ ((aligned (16))) _Powers[16]=
{ 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };
// Set the powers of 2 (do it once for all, if applicable)
uint8x16_t Powers= vld1q_u8(_Powers);
// Compute the mask from the input
uint64x2_t Mask= vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(vandq_u8(Input, Powers))));
// Get the resulting bytes
uint16_t Output;
vst1q_lane_u8((uint8_t*)&Output + 0, (uint8x16_t)Mask, 0);
vst1q_lane_u8((uint8_t*)&Output + 1, (uint8x16_t)Mask, 8);
(Mind http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47553, anyway.)
Similarly to Michael, the trick is to form the powers of the indexes of the non-null entries, and to sum them pairwise three times. This must be done with increasing data size to double the stride on every addition. You reduce from 2 x 8 8-bit entries to 2 x 4 16-bit, then 2 x 2 32-bit and 2 x 1 64-bit. The low byte of these two numbers gives the solution. I don't think there is an easy way to pack them together to form a single short value using NEON.
Takes 6 NEON instructions if the input is in the suitable form and the powers can be preloaded.

The obvious solution seems to be completely missed here.
// Use shifts to collect all of the sign bits.
// I'm not sure if this works on big endian, but big endian NEON is very
// rare.
int vmovmaskq_u8(uint8x16_t input)
{
// Example input (half scale):
// 0x89 FF 1D C0 00 10 99 33
// Shift out everything but the sign bits
// 0x01 01 00 01 00 00 01 00
uint16x8_t high_bits = vreinterpretq_u16_u8(vshrq_n_u8(input, 7));
// Merge the even lanes together with vsra. The '??' bytes are garbage.
// vsri could also be used, but it is slightly slower on aarch64.
// 0x??03 ??02 ??00 ??01
uint32x4_t paired16 = vreinterpretq_u32_u16(
vsraq_n_u16(high_bits, high_bits, 7));
// Repeat with wider lanes.
// 0x??????0B ??????04
uint64x2_t paired32 = vreinterpretq_u64_u32(
vsraq_n_u32(paired16, paired16, 14));
// 0x??????????????4B
uint8x16_t paired64 = vreinterpretq_u8_u64(
vsraq_n_u64(paired32, paired32, 28));
// Extract the low 8 bits from each lane and join.
// 0x4B
return vgetq_lane_u8(paired64, 0) | ((int)vgetq_lane_u8(paired64, 8) << 8);
}

This question deserves a newer answer for aarch64. The addition of new capabilities to Armv8 allows the same function to be implemented in fewer instructions. Here's my version:
uint32_t _mm_movemask_aarch64(uint8x16_t input)
{
const uint8_t __attribute__ ((aligned (16))) ucShift[] = {-7,-6,-5,-4,-3,-2,-1,0,-7,-6,-5,-4,-3,-2,-1,0};
uint8x16_t vshift = vld1q_u8(ucShift);
uint8x16_t vmask = vandq_u8(input, vdupq_n_u8(0x80));
uint32_t out;
vmask = vshlq_u8(vmask, vshift);
out = vaddv_u8(vget_low_u8(vmask));
out += (vaddv_u8(vget_high_u8(vmask)) << 8);
return out;
}

after some tests it looks like following code works correct:
int32_t _mm_movemask_epi8_neon(uint8x16_t input)
{
const int8_t __attribute__ ((aligned (16))) xr[8] = {-7,-6,-5,-4,-3,-2,-1,0};
uint8x8_t mask_and = vdup_n_u8(0x80);
int8x8_t mask_shift = vld1_s8(xr);
uint8x8_t lo = vget_low_u8(input);
uint8x8_t hi = vget_high_u8(input);
lo = vand_u8(lo, mask_and);
lo = vshl_u8(lo, mask_shift);
hi = vand_u8(hi, mask_and);
hi = vshl_u8(hi, mask_shift);
lo = vpadd_u8(lo,lo);
lo = vpadd_u8(lo,lo);
lo = vpadd_u8(lo,lo);
hi = vpadd_u8(hi,hi);
hi = vpadd_u8(hi,hi);
hi = vpadd_u8(hi,hi);
return ((hi[0] << 8) | (lo[0] & 0xFF));
}

Note that I haven't tested any of this, but something like this might work:
X := the vector that you want to create the mask from
A := 0x808080808080...
B := 0x00FFFEFDFCFB... (i.e. 0,-1,-2,-3,...)
X = vand_u8(X, A); // Keep d7 of each byte in X
X = vshl_u8(X, B); // X[7]>>=0; X[6]>>=1; X[5]>>=2; ...
// Each byte of X now contains its msb shifted 7-N bits to the right, where N
// is the byte index.
// Do 3 pairwise adds in order to pack all these into X[0]
X = vpadd_u8(X, X);
X = vpadd_u8(X, X);
X = vpadd_u8(X, X);
// X[0] should now contain the mask. Clear the remaining bytes if necessary
This would need to be repeated once to process a 128-bit vector, since vpadd only works on 64-bit vectors.

I know this question is here for 8 years already but let me give you the answer which might solve all performance problems with emulation. It's based on the blog Bit twiddling with Arm Neon: beating SSE movemasks, counting bits and more.
Most usages of movemask instructions are coming from comparisons where the vectors have 0xFF or 0x00 values from the result of every 16 bytes. After that most cases to use movemasks are to check if none/all match, find leading/trailing or iterate over bits.
If this is the case which often is, then you can use shrn reg1, reg2, #4 instruction. This instruction called Shift-Right-then-Narrow instruction can reduce a 128-bit byte mask to a 64-bit nibble mask (by alternating low and high nibbles to the result). This allows the mask to be extracted to a 64-bit general purpose register.
const uint16x8_t equalMask = vreinterpretq_u16_u8(vceqq_u8(chunk, vdupq_n_u8(tag)));
const uint8x8_t res = vshrn_n_u16(equalMask, 4);
const uint64_t matches = vget_lane_u64(vreinterpret_u64_u8(res), 0);
return matches;
After that you can use all bit operations you typically use on x86 with very minor tweaks like shifting by 2 or doing a scalar AND.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight