Efficient way of rotating a byte inside an AVX register

Efficient way of rotating a byte inside an AVX register - c

Summary/tl;dr: Is there any way to rotate a byte in an YMM register bitwise (using AVX), other than doing 2x shifts and blending the results together?
For each 8 bytes in an YMM register, I need to left-rotate 7 bytes in it. Each byte needs to be rotated one bit more to the left than the former. Thus the 1 byte should be rotated 0 bits and the seventh should be rotated 6 bits.
Currently, I have made an implementation that does this by [I use the 1-bit rotate as an example here] shifting the register 1 bit to the left, and 7 to the right individually. I then use the blend operation (intrinsic operation _mm256_blend_epi16) to chose the correct bits from the first and second temporary result to get my final rotated byte.
This costs a total of 2 shift operations and 1 blend operation per byte, and 6 bytes needs to be rotated, thus 18 operations per byte (shift and blend has just about the same performance).
There must be a faster way to do this than by using 18 operations to rotate a single byte!
Furthermore, I need to assemble all the bytes afterwards in the new register. I do this by loading 7 masks with the "set" instruction into registers, so I can extract the correct byte from each register. I AND these mask with the registers to extract the correct byte from them. Afterwards I XOR the single byte registers together to get the new register with all the bytes.
This takes a total of 7+7+6 operations, so another 20 operations (per register).
I could use the extract intrinsic (_mm256_extract_epi8) to get the single bytes, and then use _mm256_set_epi8 to assemble the new registers, but I don't know yet whether that would be faster. (There is no listed performance for these functions in the Intel intrinsics guide, so maybe I am misunderstanding something here.)
This gives a total of 38 operations per register, which seems less than optimal for rotating 6 bytes differently inside a register.
I hope someone more proficient in AVX/SIMD can guide me here—whether I am going about this the wrong way—as I feel I might be doing just that right now.

The XOP instruction set does provide _mm_rot_epi8() (which is NOT Microsoft-specific; it is also available in GCC since 4.4 or earlier, and should be available in recent clang, too). It can be used to perform the desired task in 128-bit units. Unfortunately, I don't have a CPU with XOP support, so I cannot test that.
On AVX2, splitting the 256-bit register into two halves, one containing even bytes, and the other odd bytes shifted right 8 bits, allows a 16-bit vector multiply to do the trick. Given constants (using GCC 64-bit component array format)
static const __m256i epi16_highbyte = { 0xFF00FF00FF00FF00ULL,
0xFF00FF00FF00FF00ULL,
0xFF00FF00FF00FF00ULL,
0xFF00FF00FF00FF00ULL };
static const __m256i epi16_lowbyte = { 0x00FF00FF00FF00FFULL,
0x00FF00FF00FF00FFULL,
0x00FF00FF00FF00FFULL,
0x00FF00FF00FF00FFULL };
static const __m256i epi16_oddmuls = { 0x4040101004040101ULL,
0x4040101004040101ULL,
0x4040101004040101ULL,
0x4040101004040101ULL };
static const __m256i epi16_evenmuls = { 0x8080202008080202ULL,
0x8080202008080202ULL,
0x8080202008080202ULL,
0x8080202008080202ULL };
the rotation operation can be written as
__m256i byteshift(__m256i value)
{
return _mm256_or_si256(_mm256_srli_epi16(_mm256_mullo_epi16(_mm256_and_si256(value, epi16_lowbyte), epi16_oddmuls), 8),
_mm256_and_si256(_mm256_mullo_epi16(_mm256_and_si256(_mm256_srai_epi16(value, 8), epi16_lowbyte), epi16_evenmuls), epi16_highbyte));
}
This has been verified to yield correct results on Intel Core i5-4200U using GCC-4.8.4. As an example, the input vector (as a single 256-bit hexadecimal number)
88 87 86 85 84 83 82 81 38 37 36 35 34 33 32 31 28 27 26 25 24 23 22 21 FF FE FD FC FB FA F9 F8
gets rotated into
44 E1 D0 58 24 0E 05 81 1C CD C6 53 A1 CC 64 31 14 C9 C4 52 21 8C 44 21 FF BF BF CF DF EB F3 F8
where the leftmost octet is rotated left by 7 bits, next 6 bits, and so on; seventh octet is unchanged, eighth octet is rotated by 7 bits, and so on, for all 32 octets.
I am not sure if the above function definition compiles to optimal machine code -- that depends on the compiler --, but I'm certainly happy with its performance.
Since you probably dislike the above concise format for the function, here it is in procedural, expanded form:
static __m256i byteshift(__m256i value)
{
__m256i low, high;
high = _mm256_srai_epi16(value, 8);
low = _mm256_and_si256(value, epi16_lowbyte);
high = _mm256_and_si256(high, epi16_lowbyte);
low = _mm256_mullo_epi16(low, epi16_lowmuls);
high = _mm256_mullo_epi16(high, epi16_highmuls);
low = _mm256_srli_epi16(low, 8);
high = _mm256_and_si256(high, epi16_highbyte);
return _mm256_or_si256(low, high);
}
In a comment, Peter Cordes suggested replacing the srai+and with an srli, and possibly the final and+or with a blendv. The former makes a lot of sense, as it is purely an optimization, but the latter may not (yet, on current Intel CPUs!) actually be faster.
I tried some microbenchmarking, but was unable to get reliable results. I typically use the TSC on x86-64, and take the median of a few hundred thousand tests using inputs and outputs stored to an array.
I think it is most useful if I will just list the variants here, so any user requiring such a function can make some benchmarks on their real-world workloads, and test to see if there is any measurable difference.
I also agree with his suggestion to use odd and even instead of high and low, but note that since the first element in a vector is numbered element 0, the first element is even, the second odd, and so on.
#include <immintrin.h>
static const __m256i epi16_oddmask = { 0xFF00FF00FF00FF00ULL,
0xFF00FF00FF00FF00ULL,
0xFF00FF00FF00FF00ULL,
0xFF00FF00FF00FF00ULL };
static const __m256i epi16_evenmask = { 0x00FF00FF00FF00FFULL,
0x00FF00FF00FF00FFULL,
0x00FF00FF00FF00FFULL,
0x00FF00FF00FF00FFULL };
static const __m256i epi16_evenmuls = { 0x4040101004040101ULL,
0x4040101004040101ULL,
0x4040101004040101ULL,
0x4040101004040101ULL };
static const __m256i epi16_oddmuls = { 0x8080202008080202ULL,
0x8080202008080202ULL,
0x8080202008080202ULL,
0x8080202008080202ULL };
/* Original version suggested by Nominal Animal. */
__m256i original(__m256i value)
{
return _mm256_or_si256(_mm256_srli_epi16(_mm256_mullo_epi16(_mm256_and_si256(value, epi16_evenmask), epi16_evenmuls), 8),
_mm256_and_si256(_mm256_mullo_epi16(_mm256_and_si256(_mm256_srai_epi16(value, 8), epi16_evenmask), epi16_oddmuls), epi16_oddmask));
}
/* Optimized as suggested by Peter Cordes, without blendv */
__m256i no_blendv(__m256i value)
{
return _mm256_or_si256(_mm256_srli_epi16(_mm256_mullo_epi16(_mm256_and_si256(value, epi16_evenmask), epi16_evenmuls), 8),
_mm256_and_si256(_mm256_mullo_epi16(_mm256_srli_epi16(value, 8), epi16_oddmuls), epi16_oddmask));
}
/* Optimized as suggested by Peter Cordes, with blendv.
* This is the recommended version. */
__m256i optimized(__m256i value)
{
return _mm256_blendv_epi8(_mm256_srli_epi16(_mm256_mullo_epi16(_mm256_and_si256(value, epi16_evenmask), epi16_evenmuls), 8),
_mm256_mullo_epi16(_mm256_srli_epi16(value, 8), epi16_oddmuls), epi16_oddmask);
}
Here are the same functions written in a way that shows the individual operations. Although it does not affect sane compilers at all, I've marked the function parameter and each temporary value const, so that it is obvious how you can insert each into a subsequent expression, to simplify the functions to their above concise forms.
__m256i original_verbose(const __m256i value)
{
const __m256i odd1 = _mm256_srai_epi16(value, 8);
const __m256i even1 = _mm256_and_si256(value, epi16_evenmask);
const __m256i odd2 = _mm256_and_si256(odd1, epi16_evenmask);
const __m256i even2 = _mm256_mullo_epi16(even1, epi16_evenmuls);
const __m256i odd3 = _mm256_mullo_epi16(odd3, epi16_oddmuls);
const __m256i even3 = _mm256_srli_epi16(even3, 8);
const __m256i odd4 = _mm256_and_si256(odd3, epi16_oddmask);
return _mm256_or_si256(even3, odd4);
}
__m256i no_blendv_verbose(const __m256i value)
{
const __m256i even1 = _mm256_and_si256(value, epi16_evenmask);
const __m256i odd1 = _mm256_srli_epi16(value, 8);
const __m256i even2 = _mm256_mullo_epi16(even1, epi16_evenmuls);
const __m256i odd2 = _mm256_mullo_epi16(odd1, epi16_oddmuls);
const __m256i even3 = _mm256_srli_epi16(even2, 8);
const __m256i odd3 = _mm256_and_si256(odd2, epi16_oddmask);
return _mm256_or_si256(even3, odd3);
}
__m256i optimized_verbose(const __m256i value)
{
const __m256i even1 = _mm256_and_si256(value, epi16_evenmask);
const __m256i odd1 = _mm256_srli_epi16(value, 8);
const __m256i even2 = _mm256_mullo_epi16(even1, epi16_evenmuls);
const __m256i odd2 = _mm256_mullo_epi16(odd1, epi16_oddmuls);
const __m256i even3 = _mm256_srli_epi16(even2, 8);
return _mm256_blendv_epi8(even3, odd2, epi16_oddmask);
}
I personally do write my test functions initially in their above verbose forms, as forming the concise version is a trivial set of copy-pasting. I do, however, testing both versions to verify against introducing any errors, and keeping the verbose version accessible (as a comment or so), because the concise versions are basically write-only. It is much easier to edit the verbose version, then simplify it to the concise form, than trying to edit the concise version.

[Based on the first comment and some edits, the resulting solution is a little different. I will present that first, then leave the original thought below]
The main idea here is using multiplication by powers of 2 to accomplish the shifting, since these constants can vary across the vector. #harold pointed out the next idea, which is that multiplication of two duplicated bytes will automatically do the "rotation" of the shifted-out bits back into the lower bits.
Unpack and duplicate bytes into 16-bit values [... d c b a] -> [... dd cc bb aa]
Generate a 16-bit constant [128 64 32 16 8 4 2 1]
Multiply
The byte you want is the top eight bits of each 16-bit value, so right-shift and repack
Assuming __m128i source (you only have 8 bytes, right?):
__m128i duped = _mm_unpacklo_epi8(src, src);
__m128i res = _mm_mullo_epi16(duped, power_of_two_vector);
__m128i repacked = _mm_packus_epi16(_mm_srli_epi16(res, 8), __mm_setzero_si128());
[saving this original idea for comparison]
What about this: Use multiplication by powers of 2 to accomplish the shifts, using 16-bit products. Then OR the upper and lower halves of the product to accomplish the rotation.
Unpack the bytes into 16-bit words.
Generate a 16-bit [ 128 64 32 16 8 4 2 1 ]
Multiply the 16-bit words
Re-pack the 16-bit into two eight-bit vectors, a high byte vector and a low-byte vector
OR those two vectors to accomplish the rotate.
I'm a little fuzzy on the available multiply options and your instruction set limitation, but ideal would be an 8-bit by 8-bit multiply that produces 16-bit products. As far as I know it doesn't exist, which is why I suggest unpacking first, but I've seen other neat algorithms for doing this.

Related

Extract 10bits words from bitstream

I need to extract all 10-bit words from a raw bitstream whitch is built as ABACABACABAC...
It already works with a naive C implementation like
for(uint8_t *ptr = in_packet; ptr < max; ptr += 5){
const uint64_t val =
(((uint64_t)(*(ptr + 4))) << 32) |
(((uint64_t)(*(ptr + 3))) << 24) |
(((uint64_t)(*(ptr + 2))) << 16) |
(((uint64_t)(*(ptr + 1))) << 8) |
(((uint64_t)(*(ptr + 0))) << 0) ;
*a_ptr++ = (val >> 0);
*b_ptr++ = (val >> 10);
*a_ptr++ = (val >> 20);
*c_ptr++ = (val >> 30);
}
But performance is inadequate for my application so I would like to improve this using some AVX2 optimisations.
I visited the website https://software.intel.com/sites/landingpage/IntrinsicsGuide/# to find any functions that can help but it seems there is nothing to works with 10-bit words, only 8 or 16-bit. That seems logical since 10-bit is not native for a processor, but it make things hard for me.
Is there any way to use AVX2 to solve this problem?

Your scalar loop does not compile efficiently. Compilers do it as 5 separate byte loads. You can express an unaligned 8-byte load in C++ with memcpy:
#include <stdint.h>
#include <string.h>
// do an 8-byte load that spans the 5 bytes we want
// clang auto-vectorizes using an AVX2 gather for 4 qwords. Looks pretty clunky but not terrible
void extract_10bit_fields_v2calar(const uint8_t *__restrict src,
uint16_t *__restrict a_ptr, uint16_t *__restrict b_ptr, uint16_t *__restrict c_ptr,
const uint8_t *max)
{
for(const uint8_t *ptr = src; ptr < max; ptr += 5){
uint64_t val;
memcpy(&val, ptr, sizeof(val));
const unsigned mask = (1U<<10) - 1; // unused in original source!?!
*a_ptr++ = (val >> 0) & mask;
*b_ptr++ = (val >> 10) & mask;
*a_ptr++ = (val >> 20) & mask;
*c_ptr++ = (val >> 30) & mask;
}
}
ICC and clang auto-vectorize your 1-byte version, but do a very bad job (lots of insert/extract of single bytes). Here's your original and this function on Godbolt (with gcc and clang -O3 -march=skylake)
None of those 3 compilers are really close to what we can do manually.
Manual vectorization
My current AVX2 version of this answer forgot a detail: there are only 3 kinds of fields ABAC, not ABCD like 10-bit RGBA pixels. So I have a version of this which unpacks to 4 separate output streams (which I'll leave in because of the packed-RGBA use-case if I ever add a dedicated version for the ABAC interleave).
The existing version can use vpunpcklwd to interleave the two A parts instead of storing with separate vmovq should work for your case. There might be something more efficient, IDK.
BTW, I find it easier to remember and type instruction mnemonics, not intrinsic names. Intel's online intrinsics guide is searchable by instruction mnemonic.
Observations about your layout:
Each field spans one byte boundary, never two, so it's possible to assemble any 4 pairs of bytes in a qword that hold 4 complete fields.
Or with a byte shuffle, to create 2-byte words that each have a whole field at some offset. (e.g. for AVX512BW vpsrlvw, or for AVX2 2x vpsrld + word-blend.) A word shuffle like AVX512 vpermw would not be sufficient: some individual bytes need to be duplicated with the start of one field and end of another. I.e the source positions aren't all aligned words, especially when you have 2x 5 bytes inside the same 16-byte "lane" of a vector.
00-07|08-15|16-23|24-31|32-39 byte boundaries (8-bit)
00...09|10..19|20...29|30..39 field boundaries (10-bit)
Luckily 8 and 10 have a GCD of 2 which is >= 10-8=2. 8*5 = 4*10 so we don't get all possible start positions, e.g. never a field starting at the last bit of 1 byte, spanning another byte, and including the first bit of a 3rd byte.
Possible AVX2 strategy: unaligned 32-byte load that leave 2x 5 bytes at the top of the low lane, and 2x 5 bytes at the bottom of the high lane. Then vpshufb in-lane shuffle to set up for 2x vpsrlvd variable-count shifts, and a blend.
Quick summary of a new idea I haven't expanded yet.
Given an input of xxx a0B0A0C0 a1B1A1C1 | a2B2A2C2 a3B3A3C3 from our unaligned load, we can get a result of
a0 A0 a1 A1 B0 B1 C0 C1 | a2 A2 a3 A3 B2 B3 C2 C3 with the right choice of vpshufb control.
Then a vpermd can put all of those 32-bit groups into the right order, with all the A elements in the high half (ready for a vextracti128 to memory), and the B and C in the low half (ready for vmovq / vmovhps stores).
Use different vpermd shuffles for adjacent pairs so we can vpblendd to merge them for 128-bit B and C stores.
Old version, probably worse than unaligned load + vpshufb.
With AVX2, one option is to broadcast the containing 64-bit element to all positions in a vector and then use variable-count right shifts to get the bits to the bottom of a dword element.
You probably want to do a separate 64-bit broadcast-load for each group (thus partially overlapping with the previous), instead of trying to pick apart a __m256i of contiguous bits. (Broadcast-loads are cheap, shuffling is expensive.)
After _mm256_srlvd_epi64, then AND to isolate the low 10 bits in each qword.
Repeat that 4 times for 4 vectors of input, then use _mm256_packus_epi32 to do in-lane packing down to 32-bit then 16-bit elements.
That's the simple version. Optimizations of the interleaving are possible, e.g. by using left or right shifts to set up for vpblendd instead of a 2-input shuffle like vpackusdw or vshufps. _mm256_blend_epi32 is very efficient on existing CPUs, running on any port.
This also allows delaying the AND until after the first packing step because we don't need to avoid saturation from high garbage.
Design notes:
shown as 32-bit chunks after variable-count shifts
[0 d0 0 c0 | 0 b0 0 a0] # after an AND mask
[0 d1 0 c1 | 0 b1 0 a1]
[0 d1 0 c1 0 d0 0 c0 | 0 b1 0 a1 0 b0 0 a0] # vpackusdw
shown as 16-bit elements but actually the same as what vshufps can do
---------
[X d0 X c0 | X b0 X a0] even the top element is only garbage right shifted by 30, not quite zero
[X d1 X c1 | X b1 X a1]
[d1 c1 d0 c0 | b1 a1 b0 a0 ] vshufps (can't do d1 d0 c1 c0 unfortunately)
---------
[X d0 X c0 | X b0 X a0] variable-count >> qword
[d1 X c1 X | b1 X a1 0] variable-count << qword
[d1 d0 c1 c0 | b1 b0 a1 a0] vpblendd
This last trick extends to vpblendw, allowing us to do everything with interleaving blends, no shuffle instructions at all, resulting in the outputs we want contiguous and in the right order in qwords of a __m256i.
x86 SIMD variable-count shifts can only be left or right for all elements, so we need to make sure that all the data is either left or right of the desired position, not some of each within the same vector. We could use an immediate-count shift to set up for this, but even better is to just adjust the byte-address we load from. For loads after the first, we know it's safe to load some of the bytes before the first bitfield we want (without touching an unmapped page).
# as 16-bit elements
[X X X d0 X X X c0 | ...] variable-count >> qword
[X X d1 X X X c1 X | ...] variable-count >> qword from an offset load that started with the 5 bytes we want all to the left of these positions
[X d2 X X X c2 X X | ...] variable-count << qword
[d3 X X X c3 X X X | ...] variable-count << qword
[X d2 X d0 X c2 X c0 | ...] vpblendd
[d3 X d1 X c3 X c1 X | ...] vpblendd
[d3 d2 d1 d0 c3 c2 c1 c0 | ...] vpblendw (Same behaviour in both high and low lane)
Then mask off the high garbage inside each 16-bit word
Note: this does 4 separate outputs, like ABCD or RGBA->planar, not ABAC.
// potentially unaligned 64-bit broadcast-load, hopefully vpbroadcastq. (clang: yes, gcc: no)
// defeats gcc/clang folding it into an AVX512 broadcast memory source
// but vpsllvq's ymm/mem operand is the shift count, not data
static inline
__m256i bcast_load64(const uint8_t *p) {
// hopefully safe with strict-aliasing since the deref is inside an intrinsic?
__m256i bcast = _mm256_castpd_si256( _mm256_broadcast_sd( (const double*)p ) );
return bcast;
}
// UNTESTED
// unpack 10-bit fields from 4x 40-bit chunks into 16-bit dst arrays
// overreads past the end of the last chunk by 1 byte
// for ABCD repeating, not ABAC, e.g. packed 10-bit RGBA
void extract_10bit_fields_4output(const uint8_t *__restrict src,
uint16_t *__restrict da, uint16_t *__restrict db, uint16_t *__restrict dc, uint16_t *__restrict dd,
const uint8_t *max)
{
// FIXME: cleanup loop for non-whole-vectors at the end
while( src<max ){
__m256i bcast = bcast_load64(src); // data we want is from bits [0 to 39], last starting at 30
__m256i ext0 = _mm256_srlv_epi64(bcast, _mm256_set_epi64x(30, 20, 10, 0)); // place at bottome of each qword
bcast = bcast_load64(src+5-2); // data we want is from bits [16 to 55], last starting at 30+16 = 46
__m256i ext1 = _mm256_srlv_epi64(bcast, _mm256_set_epi64x(30, 20, 10, 0)); // place it at bit 16 in each qword element
bcast = bcast_load64(src+10); // data we want is from bits [0 to 39]
__m256i ext2 = _mm256_sllv_epi64(bcast, _mm256_set_epi64x(2, 12, 22, 32)); // place it at bit 32 in each qword element
bcast = bcast_load64(src+15-2); // data we want is from bits [16 to 55], last field starting at 46
__m256i ext3 = _mm256_sllv_epi64(bcast, _mm256_set_epi64x(2, 12, 22, 32)); // place it at bit 48 in each qword element
__m256i blend20 = _mm256_blend_epi32(ext0, ext2, 0b10101010); // X d2 X d0 X c2 X c0 | X b2 ...
__m256i blend31 = _mm256_blend_epi32(ext1, ext3, 0b10101010); // d3 X d1 X c3 X c1 X | b3 X ...
__m256i blend3210 = _mm256_blend_epi16(blend20, blend31, 0b10101010); // d3 d2 d1 d0 c3 c2 c1 c0
__m256i res = _mm256_and_si256(blend3210, _mm256_set1_epi16((1U<<10) - 1) );
__m128i lo = _mm256_castsi256_si128(res);
__m128i hi = _mm256_extracti128_si256(res, 1);
_mm_storel_epi64((__m128i*)da, lo); // movq store of the lowest 64 bits
_mm_storeh_pi((__m64*)db, _mm_castsi128_ps(lo)); // movhps store of the high half of the low 128. Efficient: no shuffle uop needed on Intel CPUs
_mm_storel_epi64((__m128i*)dc, hi);
_mm_storeh_pi((__m64*)dd, _mm_castsi128_ps(hi)); // clang pessmizes this to vpextrq :(
da += 4;
db += 4;
dc += 4;
dd += 4;
src += 4*5;
}
}
This compiles (Godbolt) to about 21 front-end uops (on Skylake) in the loop per 4 groups of 4 fields. (Including has a useless register copy for _mm256_castsi256_si128 instead of just using the low half of ymm0 = xmm0). This will be very good on Skylake. There's a good balance of uops for different ports, and variable-count shift is 1 uop for either p0 or p1 on SKL (vs. more expensive previously). The bottleneck might be just the front-end limit of 4 fused-domain uops per clock.
Replays of cache-line-split loads will happen because the unaligned loads will sometimes cross a 64-byte cache-line boundary. But that's just in the back-end, and we have a few spare cycles on ports 2 and 3 because of the front-end bottleneck (4 loads and 4 stores per set of results, with indexed stores which thus can't use port 7). If dependent ALU uops have to get replayed as well, we might start seeing back-end bottlenecks.
Despite the indexed addressing modes, there won't be unlamination because Haswell and later can keep indexed stores micro-fused, and the broadcast loads are a single pure uop anyway, not micro-fused ALU+load.
On Skylake, it can maybe come close to 4x 40-bit groups per 5 clock cycles, if memory bandwidth isn't a bottleneck. (e.g. with good cache blocking.) Once you factor in overhead and cost of cache-line-split loads causing occasional stalls, maybe 1.5 cycles per 40 bits of input, i.e. 6 cycles per 20 bytes of input on Skylake.
On other CPUs (Haswell and Ryzen), the variable-count shifts will be a bottleneck, but you can't really do anything about that. I don't think there's anything better. On HSW it's 3 uops: p5 + 2p0. On Ryzen it's only 1 uop, but it only has 1 per 2 clock throughput (for the 128-bit version), or per 4 clocks for the 256-bit version which costs 2 uops.
Beware that clang pessmizes the _mm_storeh_pi store to vpextrq [mem], xmm, 1: 2 uops, shuffle + store. (Instead of vmovhps : pure store on Intel, no ALU). GCC compiles it as written.
I used _mm256_broadcast_sd even though I really want vpbroadcastq just because there's an intrinsic that takes a pointer operand instead of __m256i (because with AVX1, only the memory-source version existed. But with AVX2, register-source versions of all the broadcast instructions exist). To use _mm256_set1_epi64, I'd have to write pure C that didn't violate strict aliasing (e.g. with memcpy) to do an unaligned uint64_t load. I don't think it will hurt performance to use an FP broadcast load on current CPUs, though.
I'm hoping _mm256_broadcast_sd allows its source operand to alias anything without C++ strict-aliasing undefined behaviour, the same way _mm256_loadu_ps does. Either way it will work in practice if it doesn't inline into a function that stores into *src, and maybe even then. So maybe a memcpy unaligned load would have made more sense!
I've had bad results in the past with getting compilers to emit pmovzxdw xmm0, [mem] from code like _mm_cvtepu16_epi32( _mm_loadu_si64(ptr) ); you often get an actual movq load + reg-reg pmovzx. That's why I didn't try that _mm256_broadcastq_epi64(__m128i).
Old idea; if we already need a byte shuffle we might as well use plain word shifts instead of vpmultishift.
With AVX512VBMI (IceLake, CannonLake), you might want vpmultishiftqb. Instead of broadcasting / shifting one group at a time, we can do all the work for a whole vector of groups after putting the right bytes in the right places first.
You'd still need/want a version for CPUs with some AVX512 but not AVX512VBMI (e.g. Skylake-avx512). Probably vpermd + vpshufb can get the bytes we need into the 128-bit lanes we want.
I don't think we can get away with using only dword-granularity shifts to allow merge-masking instead of dword blend after qword shift. We might be able to merge-mask a vpblendw though, saving a vpblendd
IceLake has 1/clock vpermw and vpermb, single-uop. (It has a 2nd shuffle unit on another port that handles some shuffle uops). So we can load a full vector that contains 4 or 8 groups of 4 elements and shuffle every byte into place efficiently. I think every CPU that has vpermb has it single-uop. (But that's only Ice Lake and the limited-release Cannon Lake).
vpermt2w (to combine 16-bit element from 2 vectors into any order) is one per 2 clock throughput. (InstLatx64 for IceLake-Y), so unfortunately it's not as efficient as the one-vector shuffles.
Anyway, you might use it like this:
64-byte / 512-bit load (includes some over-read at the end from 8x 8-byte groups instead of 8x 5-byte groups. Optionally use a zero-masked load to make this safe near the end of an array thanks to fault suppression)
vpermb to put the 2 bytes containing each field into desired final destination position.
vpsrlvw + vpandq to extract each 10-bit field into a 16-bit word
That's about 4 uops, not including the stores.
You probably want the high half containing the A elements for a contiguous vextracti64x4 and the low half containing the B and C elements for vmovdqu and vextracti128 stores.
Or for 2x vpblenddd to set up for 256-bit stores. (Use 2 different vpermb vectors to create 2 different layouts.)
You shouldn't need vpermt2w or vpermt2d to combine adjacent vectors for wider stores.
Without AVX512VBMI, probably a vpermd + vpshufb can get all the necessary bytes into each 128-bit chunk instead of vpermb. The rest of it only requires AVX512BW which Skylake-X has.

SIMD (AVX2) mask store and pack [duplicate]

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?
I've seen in SSE where it was done like this:
(From:https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf)
__m128i LeftPack_SSSE3(__m128 mask, __m128 val)
{
// Move 4 sign bits of mask to 4-bit integer value.
int mask = _mm_movemask_ps(mask);
// Select shuffle control data
__m128i shuf_ctrl = _mm_load_si128(&shufmasks[mask]);
// Permute to move valid values to front of SIMD register
__m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl);
return packed;
}
This seems fine for SSE which is 4 wide, and thus only needs a 16 entry LUT, but for AVX which is 8 wide, the LUT becomes quite large(256 entries, each 32 bytes, or 8k).
I'm surprised that AVX doesn't appear to have an instruction for simplifying this process, such as a masked store with packing.
I think with some bit shuffling to count the # of sign bits set to the left you could generate the necessary permutation table, and then call _mm256_permutevar8x32_ps. But this is also quite a few instructions I think..
Does anyone know of any tricks to do this with AVX2? Or what is the most efficient method?
Here is an illustration of the Left Packing Problem from the above document:
Thanks

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)
We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.
We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.
Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).
AMD before Zen2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)
For integer vectors with 32-bit or wider elements: Either 1) _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)).
Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (Not tested). Either way, use the shuffle mask with VPERMD instead of VPERMPS.
For 64-bit integer or double elements, everything still Just Works; The compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)
For 16-bit elements, you might be able to adapt this with 128-bit vectors.
For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.
The algorithm:
Start with a constant of packed 3 bit indices, with each position holding its own index. i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide. 0b111'110'101'...'010'001'000.
Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000'...'111'000'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000'...'010'000. (i.e. [ ... 2 0 ])
See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.
Now we're in the same boat as the compressed-LUT: unpack up to 8 packed indices.
By the time you put all the pieces together, there are three total pext/pdeps. I worked backwards from what I wanted, so it's probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)
We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.
See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.
#include <stdint.h>
#include <immintrin.h>
// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101); // unpack each bit to a byte
expanded_mask *= 0xFF; // mask |= mask<<1 | mask<<2 | ... | mask<<7;
// ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte
const uint64_t identity_indices = 0x0706050403020100; // the identity shuffle for vpermps, packed to one index per byte
uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);
__m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
__m256i shufmask = _mm256_cvtepu8_epi32(bytevec);
return _mm256_permutevar8x32_ps(src, shufmask);
}
This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).
# clang 3.7.1 -std=gnu++14 -O3 -march=haswell
mov eax, edi # just to zero extend: goes away when inlining
movabs rcx, 72340172838076673 # The constants are hoisted after inlining into a loop
pdep rax, rax, rcx # ABC -> 0000000A0000000B....
imul rax, rax, 255 # 0000000A0000000B.. -> AAAAAAAABBBBBBBB..
movabs rcx, 506097522914230528
pext rax, rcx, rax
vmovq xmm1, rax
vpmovzxbd ymm1, xmm1 # 3c latency since this is lane-crossing
vpermps ymm0, ymm1, ymm0
ret
(Later clang compiles like GCC, with mov/shl/sub instead of imul, see below.)
So, according to Agner Fog's numbers and https://uops.info/, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it's 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There's no instruction-level parallelism. In a loop where this isn't part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.
This can maybe manage a throughput of one per 4 cycles, bottlenecked on port1 for pdep/pext/imul plus popcnt in the loop. Of course, with loads/stores and other loop overhead (including the compare and movmsk), total uop throughput can easily be an issue, too.
e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we're lucky.
clang 6 and earlier created a loop-carried dependency with popcnt's false dependency on its output, so it will bottleneck on 3/5ths of the latency of the compress256 function. clang 7.0 and later use xor-zeroing to break the false dependency (instead of just using popcnt edx,edx or something like GCC does :/).
gcc (and later clang) does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub, instead of imul by 255. This takes 3 total uops vs. 1 for the front-end, but the latency is only 2 cycles, down from 3. (Haswell handles mov at register-rename stage with zero latency.) Most significantly for this, imul can only run on port 1, competing with pdep/pext/popcnt, so it's probably good to avoid that bottleneck.
Since all hardware that supports AVX2 also supports BMI2, there's probably no point providing a version for AVX2 without BMI2.
If you need to do this in a very long loop, the LUT is probably worth it if the initial cache-misses are amortized over enough iterations with the lower overhead of just unpacking the LUT entry. You still need to movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pext.
You can unpack LUT entries with the same integer sequence I used, but #Froglegs's set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn't need to go into integer registers in the first place. (A 32bit broadcast-load doesn't need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).

See my other answer for AVX2+BMI2 with no LUT.
Since you mention a concern about scalability to AVX512: don't worry, there's an AVX512F instruction for exactly this:
VCOMPRESSPS — Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory. (There are also versions for double, and 32 or 64bit integer elements (vpcompressq), but not byte or word (16bit)). It's like BMI2 pdep / pext, but for vector elements instead of bits in an integer reg.
The destination can be a vector register or a memory operand, while the source is a vector and a mask register. With a register dest, it can merge or zero the upper bits. With a memory dest, "Only the contiguous vector is written to the destination memory location".
To figure out how far to advance your pointer for the next vector, popcnt the mask.
Let's say you want to filter out everything but values >= 0 from an array:
#include <stdint.h>
#include <immintrin.h>
size_t filter_non_negative(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
const float *endp = src+len;
float *dst_start = dst;
do {
__m512 sv = _mm512_loadu_ps(src);
__mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ); // true for src >= 0.0, false for unordered and src < 0.0
_mm512_mask_compressstoreu_ps(dst, keep, sv); // clang is missing this intrinsic, which can't be emulated with a separate store
src += 16;
dst += _mm_popcnt_u64(keep); // popcnt_u64 instead of u32 helps gcc avoid a wasted movsx, but is potentially slower on some CPUs
} while (src < endp);
return dst - dst_start;
}
This compiles (with gcc4.9 or later) to (Godbolt Compiler Explorer):
# Output from gcc6.1, with -O3 -march=haswell -mavx512f. Same with other gcc versions
lea rcx, [rsi+rdx*4] # endp
mov rax, rdi
vpxord zmm1, zmm1, zmm1 # vpxor xmm1, xmm1,xmm1 would save a byte, using VEX instead of EVEX
.L2:
vmovups zmm0, ZMMWORD PTR [rsi]
add rsi, 64
vcmpps k1, zmm0, zmm1, 29 # AVX512 compares have mask regs as a destination
kmovw edx, k1 # There are some insns to add/or/and mask regs, but not popcnt
movzx edx, dx # gcc is dumb and doesn't know that kmovw already zero-extends to fill the destination.
vcompressps ZMMWORD PTR [rax]{k1}, zmm0
popcnt rdx, rdx
## movsx rdx, edx # with _popcnt_u32, gcc is dumb. No casting can get gcc to do anything but sign-extend. You'd expect (unsigned) would mov to zero-extend, but no.
lea rax, [rax+rdx*4] # dst += ...
cmp rcx, rsi
ja .L2
sub rax, rdi
sar rax, 2 # address math -> element count
ret
Performance: 256-bit vectors may be faster on Skylake-X / Cascade Lake
In theory, a loop that loads a bitmap and filters one array into another should run at 1 vector per 3 clocks on SKX / CSLX, regardless of vector width, bottlenecked on port 5. (kmovb/w/d/q k1, eax runs on p5, and vcompressps into memory is 2p5 + a store, according to IACA and to testing by http://uops.info/).
#ZachB reports in comments that in practice, that a loop using ZMM _mm512_mask_compressstoreu_ps is slightly slower than _mm256_mask_compressstoreu_ps on real CSLX hardware. (I'm not sure if that was a microbenchmark that would allow the 256-bit version to get out of "512-bit vector mode" and clock higher, or if there was surrounding 512-bit code.)
I suspect misaligned stores are hurting the 512-bit version. vcompressps probably effectively does a masked 256 or 512-bit vector store, and if that crosses a cache line boundary then it has to do extra work. Since the output pointer is usually not a multiple of 16 elements, a full-line 512-bit store will almost always be misaligned.
Misaligned 512-bit stores may be worse than cache-line-split 256-bit stores for some reason, as well as happening more often; we already know that 512-bit vectorization of other things seems to be more alignment sensitive. That may just be from running out of split-load buffers when they happen every time, or maybe the fallback mechanism for handling cache-line splits is less efficient for 512-bit vectors.
It would be interesting to benchmark vcompressps into a register, with separate full-vector overlapping stores. That's probably the same uops, but the store can micro-fuse when it's a separate instruction. And if there's some difference between masked stores vs. overlapping stores, this would reveal it.
Another idea discussed in comments below was using vpermt2ps to build up full vectors for aligned stores. This would be hard to do branchlessly, and branching when we fill a vector will probably mispredict unless the bitmask has a pretty regular pattern, or big runs of all-0 and all-1.
A branchless implementation with a loop-carried dependency chain of 4 or 6 cycles through the vector being constructed might be possible, with a vpermt2ps and a blend or something to replace it when it's "full". With an aligned vector store every iteration, but only moving the output pointer when the vector is full.
This is likely slower than vcompressps with unaligned stores on current Intel CPUs.

If you are targeting AMD Zen this method may be preferred, due to the very slow pdepand pext on ryzen (18 cycles each).
I came up with this method, which uses a compressed LUT, which is 768(+1 padding) bytes, instead of 8k. It requires a broadcast of a single scalar value, which is then shifted by a different amount in each lane, then masked to the lower 3 bits, which provides a 0-7 LUT.
Here is the intrinsics version, along with code to build LUT.
//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(u32 moveMask) {
u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
__m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr));//lower 24 bits has our LUT
// __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));
//now shift it right to get 3 bits at bottom
//__m256i shufmask = _mm256_srli_epi32(m, 29);
//Simplified version suggested by wim
//shift each lane so desired 3 bits are a bottom
//There is leftover data in the lane, but _mm256_permutevar8x32_ps only examines the first 3 bits so this is ok
__m256i shufmask = _mm256_srlv_epi32 (indices, _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21));
return shufmask;
}
u32 get_nth_bits(int a) {
u32 out = 0;
int c = 0;
for (int i = 0; i < 8; ++i) {
auto set = (a >> i) & 1;
if (set) {
out |= (i << (c * 3));
c++;
}
}
return out;
}
u8 g_pack_left_table_u8x3[256 * 3 + 1];
void BuildPackMask() {
for (int i = 0; i < 256; ++i) {
*reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
}
}
Here is the assembly generated by MSVC:
lea ecx, DWORD PTR [rcx+rcx*2]
lea rax, OFFSET FLAT:unsigned char * g_pack_left_table_u8x3 ; g_pack_left_table_u8x3
vpbroadcastd ymm0, DWORD PTR [rcx+rax]
vpsrlvd ymm0, ymm0, YMMWORD PTR __ymm#00000015000000120000000f0000000c00000009000000060000000300000000

Will add more information to a great answer from #PeterCordes : https://stackoverflow.com/a/36951611/5021064.
I did the implementations of std::remove from C++ standard for integer types with it. The algorithm, once you can do compress, is relatively simple: load a register, compress, store. First I'm going to show the variations and then benchmarks.
I ended up with two meaningful variations on the proposed solution:
__m128i registers, any element type, using _mm_shuffle_epi8 instruction
__m256i registers, element type of at least 4 bytes, using _mm256_permutevar8x32_epi32
When the types are smaller then 4 bytes for 256 bit register, I split them in two 128 bit registers and compress/store each one separately.
Link to compiler explorer where you can see complete assembly (there is a using type and width (in elements per pack) in the bottom, which you can plug in to get different variations) : https://gcc.godbolt.org/z/yQFR2t
NOTE: my code is in C++17 and is using a custom simd wrappers, so I do not know how readable it is. If you want to read my code -> most of it is behind the link in the top include on godbolt. Alternatively, all of the code is on github.
Implementations of #PeterCordes answer for both cases
Note: together with the mask, I also compute the number of elements remaining using popcount. Maybe there is a case where it's not needed, but I have not seen it yet.
Mask for _mm_shuffle_epi8
Write an index for each byte into a half byte: 0xfedcba9876543210
Get pairs of indexes into 8 shorts packed into __m128i
Spread them out using x << 4 | x & 0x0f0f
Example of spreading the indexes. Let's say 7th and 6th elements are picked.
It means that the corresponding short would be: 0x00fe. After << 4 and | we'd get 0x0ffe. And then we clear out the second f.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of `_mm_movemask_epi8`,
// `uint16_t` - there are at most 16 bits with values for __m128i.
inline std::pair<__m128i, std::uint8_t> mask128(std::uint16_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x1111111111111111) * 0xf;
const std::uint8_t offset =
static_cast<std::uint8_t>(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes =
_pext_u64(0xfedcba9876543210, mmask_expanded); // Do the #PeterCordes answer
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0...0|compressed_indexes
const __m128i as_16bit = _mm_cvtepu8_epi16(as_lower_8byte); // From bytes to shorts over the whole register
const __m128i shift_by_4 = _mm_slli_epi16(as_16bit, 4); // x << 4
const __m128i combined = _mm_or_si128(shift_by_4, as_16bit); // | x
const __m128i filter = _mm_set1_epi16(0x0f0f); // 0x0f0f
const __m128i res = _mm_and_si128(combined, filter); // & 0x0f0f
return {res, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m128i, std::uint8_t> compress_mask_for_shuffle_epi8(std::uint32_t mmask) {
auto res = _compress_mask::mask128(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Mask for _mm256_permutevar8x32_epi32
This is almost one for one #PeterCordes solution - the only difference is _pdep_u64 bit (he suggests this as a note).
The mask that I chose is 0x5555'5555'5555'5555. The idea is - I have 32 bits of mmask, 4 bits for each of 8 integers. I have 64 bits that I want to get => I need to convert each bit of 32 bits into 2 => therefore 0101b = 5.The multiplier also changes from 0xff to 3 because I will get 0x55 for each integer, not 1.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of _mm256_movemask_epi8
inline std::pair<__m256i, std::uint8_t> mask256_epi32(std::uint32_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x5555'5555'5555'5555) * 3;
const std::uint8_t offset = static_cast<std::uint8_t(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes = _pext_u64(0x0706050403020100, mmask_expanded); // Do the #PeterCordes answer
// Every index was one byte => we need to make them into 4 bytes
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0000|compressed indexes
const __m256i expanded = _mm256_cvtepu8_epi32(as_lower_8byte); // spread them out
return {expanded, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m256i, std::uint8_t> compress_mask_for_permutevar8x32(std::uint32_t mmask) {
static_assert(sizeof(T) >= 4); // You cannot permute shorts/chars with this.
auto res = _compress_mask::mask256_epi32(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Benchmarks
Processor: Intel Core i7 9700K (a modern consumer level CPU, no AVX-512 support)
Compiler: clang, build from trunk near the version 10 release
Compiler options: --std=c++17 --stdlib=libc++ -g -Werror -Wall -Wextra -Wpedantic -O3 -march=native -mllvm -align-all-functions=7
Micro-benchmarking library: google benchmark
Controlling for code alignment:
If you are not familiar with the concept, read this or watch this
All functions in the benchmark's binary are aligned to 128 byte boundary. Each benchmarking function is duplicated 64 times, with a different noop slide in the beginning of the function (before entering the loop). The main numbers I show is min per each measurement. I think this works since the algorithm is inlined. I'm also validated by the fact that I get very different results. At the very bottom of the answer I show the impact of code alignment.
Note: benchmarking code. BENCH_DECL_ATTRIBUTES is just noinline
Benchmark removes some percentage of 0s from an array. I test arrays with {0, 5, 20, 50, 80, 95, 100} percent of zeroes.
I test 3 sizes: 40 bytes (to see if this is usable for really small arrays), 1000 bytes and 10'000 bytes. I group by size because of SIMD depends on the size of the data and not a number of elements. The element count can be derived from an element size (1000 bytes is 1000 chars but 500 shorts and 250 ints). Since time it takes for non simd code depends mostly on the element count, the wins should be bigger for chars.
Plots: x - percentage of zeroes, y - time in nanoseconds. padding : min indicates that this is minimum among all alignments.
40 bytes worth of data, 40 chars
For 40 bytes this does not make sense even for chars - my implementation gets about 8-10 times slower when using 128 bit registers over non-simd code. So, for example, compiler should be careful doing this.
1000 bytes worth of data, 1000 chars
Apparently the non-simd version is dominated by branch prediction: when we get small amount of zeroes we get a smaller speed up: for no 0s - about 3 times, for 5% zeroes - about 5-6 times speed up. For when the branch predictor can't help the non-simd version - there is about a 27 times speed up. It's an interesting property of simd code that it's performance tends to be much less dependent on of data. Using 128 vs 256 register shows practically no difference, since most of the work is still split into 2 128 registers.
1000 bytes worth of data, 500 shorts
Similar results for shorts except with a much smaller gain - up to 2 times.
I don't know why shorts do that much better than chars for non-simd code: I'd expect shorts to be two times faster, since there are only 500 shorts, but the difference is actually up to 10 times.
1000 bytes worth of data, 250 ints
For a 1000 only 256 bit version makes sense - 20-30% win excluding no 0s to remove what's so ever (perfect branch prediction, no removing for non-simd code).
10'000 bytes worth of data, 10'000 chars
The same order of magnitude wins as as for a 1000 chars: from 2-6 times faster when branch predictor is helpful to 27 times when it's not.
Same plots, only simd versions:
Here we can see about a 10% win from using 256 bit registers and splitting them in 2 128 bit ones: about 10% faster. In size it grows from 88 to 129 instructions, which is not a lot, so might make sense depending on your use-case. For base-line - non-simd version is 79 instructions (as far as I know - these are smaller then SIMD ones though).
10'000 bytes worth of data, 5'000 shorts
From 20% to 9 times win, depending on the data distributions. Not showing the comparison between 256 and 128 bit registers - it's almost the same assembly as for chars and the same win for 256 bit one of about 10%.
10'000 bytes worth of data, 2'500 ints
Seems to make a lot of sense to use 256 bit registers, this version is about 2 times faster compared to 128 bit registers. When comparing with non-simd code - from a 20% win with a perfect branch prediction to 3.5 - 4 times as soon as it's not.
Conclusion: when you have a sufficient amount of data (at least 1000 bytes) this can be a very worthwhile optimisation for a modern processor without AVX-512
PS:
On percentage of elements to remove
On one hand it's uncommon to filter half of your elements. On the other hand a similar algorithm can be used in partition during sorting => that is actually expected to have ~50% branch selection.
Code alignment impact
The question is: how much worth it is, if the code happens to be poorly aligned
(generally speaking - there is very little one can do about it).
I'm only showing for 10'000 bytes.
The plots have two lines for min and for max for each percentage point (meaning - it's not one best/worst code alignment - it's the best code alignment for a given percentage).
Code alignment impact - non-simd
Chars:
From 15-20% for poor branch prediction to 2-3 times when branch prediction helped a lot. (branch predictor is known to be affected by code alignment).
Shorts:
For some reason - the 0 percent is not affected at all. It can be explained by std::remove first doing linear search to find the first element to remove. Apparently linear search for shorts is not affected.
Other then that - from 10% to 1.6-1.8 times worth
Ints:
Same as for shorts - no 0s is not affected. As soon as we go into remove part it goes from 1.3 times to 5 times worth then the best case alignment.
Code alignment impact - simd versions
Not showing shorts and ints 128, since it's almost the same assembly as for chars
Chars - 128 bit register
About 1.2 times slower
Chars - 256 bit register
About 1.1 - 1.24 times slower
Ints - 256 bit register
1.25 - 1.35 times slower
We can see that for simd version of the algorithm, code alignment has significantly less impact compared to non-simd version. I suspect that this is due to practically not having branches.

In case anyone is interested here is a solution for SSE2 which uses an instruction LUT instead of a data LUT aka a jump table. With AVX this would need 256 cases though.
Each time you call LeftPack_SSE2 below it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases don't need to modify the vector.
static inline __m128 LeftPack_SSE2(__m128 val, int mask) {
switch(mask) {
case 0:
case 1: return val;
case 2: return _mm_shuffle_ps(val,val,0x01);
case 3: return val;
case 4: return _mm_shuffle_ps(val,val,0x02);
case 5: return _mm_shuffle_ps(val,val,0x08);
case 6: return _mm_shuffle_ps(val,val,0x09);
case 7: return val;
case 8: return _mm_shuffle_ps(val,val,0x03);
case 9: return _mm_shuffle_ps(val,val,0x0c);
case 10: return _mm_shuffle_ps(val,val,0x0d);
case 11: return _mm_shuffle_ps(val,val,0x34);
case 12: return _mm_shuffle_ps(val,val,0x0e);
case 13: return _mm_shuffle_ps(val,val,0x38);
case 14: return _mm_shuffle_ps(val,val,0x39);
case 15: return val;
}
}
__m128 foo(__m128 val, __m128 maskv) {
int mask = _mm_movemask_ps(maskv);
return LeftPack_SSE2(val, mask);
}

This is perhaps a bit late though I recently ran into this exact problem and found an alternative solution which used a strictly AVX implementation. If you don't care if unpacked elements are swapped with the last elements of each vector, this could work as well. The following is an AVX version:
inline __m128 left_pack(__m128 val, __m128i mask) noexcept
{
const __m128i shiftMask0 = _mm_shuffle_epi32(mask, 0xA4);
const __m128i shiftMask1 = _mm_shuffle_epi32(mask, 0x54);
const __m128i shiftMask2 = _mm_shuffle_epi32(mask, 0x00);
__m128 v = val;
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask0);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask1);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask2);
return v;
}
Essentially, each element in val is shifted once to the left using the bitfield, 0xF9 for blending with it's unshifted variant. Next, both shifted and unshifted versions are blended against the input mask (which has the first non-zero element broadcast across the remaining elements 3 and 4). Repeat this process two more times, broadcasting the second and third elements of mask to its subsequent elements on each iteration and this should provide an AVX version of the _pdep_u32() BMI2 instruction.
If you don't have AVX, you can easily swap out each _mm_permute_ps() with _mm_shuffle_ps() for an SSE4.1-compatible version.
And if you're using double-precision, here's an additional version for AVX2:
inline __m256 left_pack(__m256d val, __m256i mask) noexcept
{
const __m256i shiftMask0 = _mm256_permute4x64_epi64(mask, 0xA4);
const __m256i shiftMask1 = _mm256_permute4x64_epi64(mask, 0x54);
const __m256i shiftMask2 = _mm256_permute4x64_epi64(mask, 0x00);
__m256d v = val;
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask0);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask1);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask2);
return v;
}
Additionally _mm_popcount_u32(_mm_movemask_ps(val)) can be used to determine the number of elements which remained after the left-packing.

Comparing two pairs of 4 variables and returning the number of matches?

Given the following struct:
struct four_points {
uint32_t a, b, c, d;
}
What would be the absolute fastest way to compare two such structures and return the number of variables that match (in any position)?
For example:
four_points s1 = {0, 1, 2, 3};
four_points s2 = {1, 2, 3, 4};
I'd be looking for a result of 3, since three numbers match between the two structs. However, given the following:
four_points s1 = {1, 0, 2, 0};
four_points s2 = {0, 1, 9, 7};
Then I'd expect a result of only 2, because only two variables match between either struct (despite there being two zeros in the first).
I've figured out a few rudimentary systems for performing the comparison, but this is something that is going to be called a couple million times in a short time span and needs to be relatively quick. My current best attempt was to use a sorting network to sort all four values for either input, then loop over the sorted values and keep a tally of the values that are equal, advancing the current index of either input accordingly.
Is there any kind of technique that might be able to perform better then a sort & iteration?

On modern CPUs, sometimes brute force applied properly is the way to go. The trick is writing code that isn't limited by instruction latencies, just throughput.
Are duplicates common? If they're very rare, or have a pattern, using a branch to handle them makes the common case faster. If they're really unpredictable, it's better to do something branchless. I was thinking about using a branch to check for duplicates between positions where they're rare, and going branchless for the more common place.
Benchmarking is tricky because a version with branches will shine when tested with the same data a million times, but will have lots of branch mispredicts in real use.
I haven't benchmarked anything yet, but I have come up with a version that skips duplicates by using OR instead of addition to combine found-matches. It compiles to nice-looking x86 asm that gcc fully unrolls. (no conditional branches, not even loops).
Here it is on godbolt. (g++ is dumb and uses 32bit operations on the output of x86 setcc, which only sets the low 8 bits. This partial-register access will produce slowdowns. And I'm not even sure it ever zeroes the upper 24bits at all... Anyway, the code from gcc 4.9.2 looks good, and so does clang on godbolt)
// 8-bit types used because x86's setcc instruction only sets the low 8 of a register
// leaving the other bits unmodified.
// Doing a 32bit add from that creates a partial register slowdown on Intel P6 and Sandybridge CPU families
// Also, compilers like to insert movzx (zero-extend) instructions
// because I guess they don't realize the previous high bits are all zero.
// (Or they're tuning for pre-sandybridge Intel, where the stall is worse than SnB inserting the extra uop itself).
// The return type is 8bit because otherwise clang decides it should generate
// things as 32bit in the first place, and does zero-extension -> 32bit adds.
int8_t match4_ordups(const four_points *s1struct, const four_points *s2struct)
{
const int32_t *s1 = &s1struct->a; // TODO: check if this breaks aliasing rules
const int32_t *s2 = &s2struct->a;
// ignore duplicates by combining with OR instead of addition
int8_t matches = 0;
for (int j=0 ; j<4 ; j++) {
matches |= (s1[0] == s2[j]);
}
for (int i=1; i<4; i++) { // i=0 iteration is broken out above
uint32_t s1i = s1[i];
int8_t notdup = 1; // is s1[i] a duplicate of s1[0.. i-1]?
for (int j=0 ; j<i ; j++) {
notdup &= (uint8_t) (s1i != s1[j]); // like dup |= (s1i == s1[j]); but saves a NOT
}
int8_t mi = // match this iteration?
(s1i == s2[0]) |
(s1i == s2[1]) |
(s1i == s2[2]) |
(s1i == s2[3]);
// gcc and clang insist on doing 3 dependent OR insns regardless of parens, not that it matters
matches += mi & notdup;
}
return matches;
}
// see the godbolt link for a main() simple test harness.
On a machine with 128b vectors that can work with 4 packed 32bit integers (e.g. x86 with SSE2), you can broadcast each element of s1 to its own vector, deduplicate, and then do 4 packed-compares. icc does something like this to autovectorize my match4_ordups function (check it out on godbolt.)
Store the compare results back to integer registers with movemask, to get a bitmap of which elements compared equal. Popcount those bitmaps, and add the results.
This led me to a better idea: Getting all the compares done with only 3 shuffles with element-wise rotation:
{ 1d 1c 1b 1a }
== == == == packed-compare with
{ 2d 2c 2b 2a }
{ 1a 1d 1c 1b }
== == == == packed-compare with
{ 2d 2c 2b 2a }
{ 1b 1a 1d 1c } # if dups didn't matter: do this shuffle on s2
== == == == packed-compare with
{ 2d 2c 2b 2a }
{ 1c 1b 1a 1d } # if dups didn't matter: this result from { 1a ... }
== == == == packed-compare with
{ 2d 2c 2b 2a } { 2b ...
That's only 3 shuffles, and still does all 16 comparisons. The trick is combining them with ORs where we need to merge duplicates, and then being able to count them efficiently. A packed-compare outputs a vector with each element = zero or -1 (all bits set), based on the comparison between the two elements in that position. It's designed to make a useful operand to AND or XOR to mask off some vector elements, e.g. to make v1 += v2 & mask conditional on a per-element basis. It also works as just a boolean truth value.
All 16 compares with only 2 shuffles is possible by rotating one vector by two, and the other vector by one, and then comparing between the four shifted and unshifted vectors. This would be great if we didn't need to eliminate dups, but since we do, it matters which results are where. We're not just adding all 16 comparison results.
OR together the packed-compare results down to one vector. Each element will be set based on whether that element of s2 had any matches in s1. int _mm_movemask_ps (__m128 a) to turn the vector into a bitmap, then popcount the bitmap. (Nehalem or newer CPU required for popcnt, otherwise fall back to a version with a 4-bit lookup table.)
The vertical ORs take care of duplicates in s1, but duplicates in s2 is a less obvious extension, and would take more work. I did eventually think of a way that was less than twice as slow (see below).
#include <stdint.h>
#include <immintrin.h>
typedef struct four_points {
int32_t a, b, c, d;
} four_points;
//typedef uint32_t four_points[4];
// small enough to inline, only 62B of x86 instructions (gcc 4.9.2)
static inline int match4_sse_noS2dup(const four_points *s1pointer, const four_points *s2pointer)
{
__m128i s1 = _mm_loadu_si128((__m128i*)s1pointer);
__m128i s2 = _mm_loadu_si128((__m128i*)s2pointer);
__m128i s1b= _mm_shuffle_epi32(s1, _MM_SHUFFLE(0, 3, 2, 1));
// no shuffle needed for first compare
__m128i match = _mm_cmpeq_epi32(s1 , s2); //{s1.d==s2.d?-1:0, 1c==2c, 1b==2b, 1a==2a }
__m128i s1c= _mm_shuffle_epi32(s1, _MM_SHUFFLE(1, 0, 3, 2));
s1b = _mm_cmpeq_epi32(s1b, s2);
match = _mm_or_si128(match, s1b); // merge dups by ORing instead of adding
// note that we shuffle the original vector every time
// multiple short dependency chains are better than one long one.
__m128i s1d= _mm_shuffle_epi32(s1, _MM_SHUFFLE(2, 1, 0, 3));
s1c = _mm_cmpeq_epi32(s1c, s2);
match = _mm_or_si128(match, s1c);
s1d = _mm_cmpeq_epi32(s1d, s2);
match = _mm_or_si128(match, s1d); // match = { s2.a in s1?, s2.b in s1?, etc. }
// turn the the high bit of each 32bit element into a bitmap of s2 elements that have matches anywhere in s1
// use float movemask because integer movemask does 8bit elements.
int matchmask = _mm_movemask_ps (_mm_castsi128_ps(match));
return _mm_popcnt_u32(matchmask); // or use a 4b lookup table for CPUs with SSE2 but not popcnt
}
See the version that eliminates duplicates in s2 for the same code with lines in a more readable order. I tried to schedule instructions in case the CPU was only just barely decoding instructions ahead of what was executing, but gcc puts the instructions in the same order regardless of what order you put the intrinsics in.
This is extremely fast, if there isn't a store-forwarding stall in the 128b loads. If you just wrote the struct with four 32bit stores, running this function within the next several clock cycles will produce a stall when it tries to load the whole struct with a 128b load. See Agner Fog's site. If calling code already has many of the 8 values in registers, the scalar version could be a win, even though it'll be slower for a microbenchmark test that only reads the structs from memory.
I got lazy on cycle-counting for this, since dup-handling isn't done yet. IACA says Haswell can run it with a throughput of one iteration per 4.05 clock cycles, and latency of 17 cycles (Not sure if that's including the memory latency of the loads. There's a lot of instruction-level parallelism available, and all the instructions have single-cycle latency, except for movmsk(2) and popcnt(3)). It's slightly slower without AVX, because gcc chooses a worse instruction ordering, and still wastes a movdqa instruction copying a vector register.
With AVX2, this could do two match4 operations in parallel, in 256b vectors. AVX2 usually works as two 128b lanes, rather than full 256b vectors. Setting up your code to be able to take advantage of 2 or 4 (AVX-512) match4 operations in parallel will give you gains when you can compile for those CPUs. It's not essential for both the s1s or s2s to be stored contiguously so a single 32B load can get two structs. AVX2 has a fairly fast load 128b to the upper lane of a register.
Handling duplicates in s2
Maybe compare s2 to a shifted instead of rotated version of itself.
#### comparing S2 with itself to mask off duplicates
{ 0 2d 2c 2b }
{ 2d 2c 2b 2a } == == ==
{ 0 0 2d 2c }
{ 2d 2c 2b 2a } == ==
{ 0 0 0 2d }
{ 2d 2c 2b 2a } ==
Hmm, if zero can occur as a regular element, we may need to byte-shift after the compare as well, to turn potential false positives into zeros. If there was a sentinel value that couldn't appear in s1, you could shift in elements of that, instead of 0. (SSE has PALIGNR, which gives you any contiguous 16B window you want of the contents of two registers appended. Named for the use-case of simulating an unaligned load from two aligned loads. So you'd have a constant vector of that element.)
update: I thought of a nice trick that avoids the need for an identity element. We can actually get all 6 necessary s2 vs. s2 comparisons to happen with just two vector compares, and then combine the results.
Doing the same compare in the same place in two vectors lets you OR two results together without having to mask before the OR. (Works around the lack of a sentinel value).
Shuffling the output of the compares instead of extra shuffle&compare of S2. This means we can get d==a done next to the other compares.
Notice that we aren't limited to shuffling whole elements around. Byte-wise shuffle to get bytes from different compare results into a single vector element, and compare that against zero. (This saves less than I'd hoped, see below).
Checking for duplicates is a big slowdown (esp. in throughput, not so much in latency). So you're still best off arranging for a sentinel value in s2 that will never match any s1 element, which you say is possible. I only present this because I thought it was interesting. (And gives you an option in case you need a version that doesn't require sentinels sometime.)
static inline
int match4_sse(const four_points *s1pointer, const four_points *s2pointer)
{
// IACA_START
__m128i s1 = _mm_loadu_si128((__m128i*)s1pointer);
__m128i s2 = _mm_loadu_si128((__m128i*)s2pointer);
// s1a = unshuffled = s1.a in the low element
__m128i s1b= _mm_shuffle_epi32(s1, _MM_SHUFFLE(0, 3, 2, 1));
__m128i s1c= _mm_shuffle_epi32(s1, _MM_SHUFFLE(1, 0, 3, 2));
__m128i s1d= _mm_shuffle_epi32(s1, _MM_SHUFFLE(2, 1, 0, 3));
__m128i match = _mm_cmpeq_epi32(s1 , s2); //{s1.d==s2.d?-1:0, 1c==2c, 1b==2b, 1a==2a }
s1b = _mm_cmpeq_epi32(s1b, s2);
match = _mm_or_si128(match, s1b); // merge dups by ORing instead of adding
s1c = _mm_cmpeq_epi32(s1c, s2);
match = _mm_or_si128(match, s1c);
s1d = _mm_cmpeq_epi32(s1d, s2);
match = _mm_or_si128(match, s1d);
// match = { s2.a in s1?, s2.b in s1?, etc. }
// s1 vs s2 all done, now prepare a mask for it based on s2 dups
/*
* d==b c==a b==a d==a #s2b
* d==c c==b b==a d==a #s2c
* OR together -> s2bc
* d==abc c==ba b==a 0 pshufb(s2bc) (packed as zero or non-zero bytes within the each element)
* !(d==abc) !(c==ba) !(b==a) !0 pcmpeq setzero -> AND mask for s1_vs_s2 match
*/
__m128i s2b = _mm_shuffle_epi32(s2, _MM_SHUFFLE(1, 0, 0, 3));
__m128i s2c = _mm_shuffle_epi32(s2, _MM_SHUFFLE(2, 1, 0, 3));
s2b = _mm_cmpeq_epi32(s2b, s2);
s2c = _mm_cmpeq_epi32(s2c, s2);
__m128i s2bc= _mm_or_si128(s2b, s2c);
s2bc = _mm_shuffle_epi8(s2bc, _mm_set_epi8(-1,-1,0,12, -1,-1,-1,8, -1,-1,-1,4, -1,-1,-1,-1));
__m128i dupmask = _mm_cmpeq_epi32(s2bc, _mm_setzero_si128());
// see below for alternate insn sequences that can go here.
match = _mm_and_si128(match, dupmask);
// turn the the high bit of each 32bit element into a bitmap of s2 matches
// use float movemask because integer movemask does 8bit elements.
int matchmask = _mm_movemask_ps (_mm_castsi128_ps(match));
int ret = _mm_popcnt_u32(matchmask); // or use a 4b lookup table for CPUs with SSE2 but not popcnt
// IACA_END
return ret;
}
This requires SSSE3 for pshufb. It and a pcmpeq (and a pxor to generate a constant) are replacing a shuffle (bslli(s2bc, 12)), an OR, and an AND.
d==bc c==ab b==a a==d = s2b|s2c
d==a 0 0 0 = byte-shift-left(s2b) = s2d0
d==abc c==ab b==a a==d = s2abc
d==abc c==ab b==a 0 = mask(s2abc). Maybe use PBLENDW or MOVSS from s2d0 (which we know has zeros) to save loading a 16B mask.
__m128i s2abcd = _mm_or_si128(s2b, s2c);
//s2bc = _mm_shuffle_epi8(s2bc, _mm_set_epi8(-1,-1,0,12, -1,-1,-1,8, -1,-1,-1,4, -1,-1,-1,-1));
//__m128i dupmask = _mm_cmpeq_epi32(s2bc, _mm_setzero_si128());
__m128i s2d0 = _mm_bslli_si128(s2b, 12); // d==a 0 0 0
s2abcd = _mm_or_si128(s2abcd, s2d0);
__m128i dupmask = _mm_blend_epi16(s2abcd, s2d0, 0 | (2 | 1));
//__m128i dupmask = _mm_and_si128(s2abcd, _mm_set_epi32(-1, -1, -1, 0));
match = _mm_andnot_si128(dupmask, match); // ~dupmask & match; first arg is the one that's inverted
I can't recommend MOVSS; it will incur extra latency on AMD because it runs in the FP domain. PBLENDW is SSE4.1. popcnt is available on AMD K10, but PBLENDW isn't (some Barcelona-core PhenomII CPUs are probably still in use). Actually, K10 doesn't have PSHUFB either, so just require SSE4.1 and POPCNT, and use PBLENDW. (Or use the PSHUFB version, unless it's going to cache-miss a lot.)
Another option to avoid a loading a vector constant from memory is to movemask s2bc, and use integer instead of vector ops. However, it looks like that'll be slower, because the extra movemask isn't free, and integer ANDN isn't usable. BMI1 didn't appear until Haswell, and even Skylake Celerons and Pentiums won't have it. (Very annoying, IMO. It means compilers can't start using BMI for even longer.)
unsigned int dupmask = _mm_movemask_ps(cast(s2bc));
dupmask |= dupmask << 3; // bit3 = d==abc. garbage in bits 4-6, careful if using AVX2 to do two structs at once
// only 2 instructions. compiler can use lea r2, [r1*8] to copy and scale
dupmask &= ~1; // clear the low bit
unsigned int matchmask = _mm_movemask_ps(cast(match));
matchmask &= ~dupmask; // ANDN is in BMI1 (Haswell), so this will take 2 instructions
return _mm_popcnt_u32(matchmask);
AMD XOP's VPPERM (pick bytes from any element of two source registers) would let the byte-shuffle replace the OR that merges s2b and s2c as well.
Hmm, pshufb isn't saving me as much as I thought, because it requires a pcmpeqd, and a pxor to zero a register. It's also loading its shuffle mask from a constant in memory, which can miss in the D-cache. It is the fastest version I've come up with, though.
If inlined into a loop, the same zeroed register could be used, saving one instruction. However, OR and AND can run on port0 (Intel CPUs), which can't run shuffle or compare instructions. The PXOR doesn't use any execution ports, though (on Intel SnB-family microarchitecture).
I haven't run real benchmarks of any of these, only IACA.
The PBLENDW and PSHUFB versions have the same latency (22 cycles, compiled for non-AVX), but the PSHUFB version has better throughput (one per 7.1c, vs. one per 7.4c, because PBLENDW needs the shuffle port, and there's already a lot of contention for it.) IACA says the version using PANDN with a constant instead of PBLENDW is also one-per-7.4c throughput, disappointingly. Port0 isn't saturated, so IDK why it's as slow as PBLENDW.
Old ideas that didn't pan out.
Leaving them in for the benefit of people looking for things to try when using vectors for related things.
Dup-checking s2 with vectors is more work than checking s2 vs. s1, because one compare is as expensive as 4 if done with vectors. The shuffling or masking needed after the compare, to remove false positives if there's no sentinel value, is annoying.
Ideas so far:
Shift s2 over by an element, and compare it to itself. Mask off false positives from shifting in 0. Vertically OR these together, and use it to ANDN the s1 vs s2 vector.
scalar code to do the smaller number of s2 vs. itself comparisons, building a bitmask to use before popcnt.
Broadcast s2.d and check it against s2 (all positions). But that puts the results horizontally in one vector, instead of vertically in 3 vectors. To use that, maybe PTEST / SETCC to make a mask for the bitmap (to apply before popcount). (PTEST with a mask of _mm_setr_epi32(0, -1, -1, -1), to only test the c,b,a, not d==d). Do (c==a | c==b) and b==a with scalar code, and combine that into a mask. Intel Haswell and later have 4 ALU execution ports, but only 3 of them can run vector instructions, so some scalar code in the mix could fill port6. AMD has even more separation between vector and integer execution resources.
shuffle s2 to get all the necessary comparisons done somehow, then shuffle the outputs. Maybe use movemask -> 4-bit lookup table for something?

SSE _mm_movemask_epi8 equivalent method for ARM NEON

I decided to continue Fast corners optimisation and stucked at
_mm_movemask_epi8 SSE instruction. How can i rewrite it for ARM Neon with uint8x16_t input?

I know this post is quite outdated but I found it useful to give my (validated) solution. It assumes all ones/all zeroes in every lane of the Input argument.
const uint8_t __attribute__ ((aligned (16))) _Powers[16]=
{ 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };
// Set the powers of 2 (do it once for all, if applicable)
uint8x16_t Powers= vld1q_u8(_Powers);
// Compute the mask from the input
uint64x2_t Mask= vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(vandq_u8(Input, Powers))));
// Get the resulting bytes
uint16_t Output;
vst1q_lane_u8((uint8_t*)&Output + 0, (uint8x16_t)Mask, 0);
vst1q_lane_u8((uint8_t*)&Output + 1, (uint8x16_t)Mask, 8);
(Mind http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47553, anyway.)
Similarly to Michael, the trick is to form the powers of the indexes of the non-null entries, and to sum them pairwise three times. This must be done with increasing data size to double the stride on every addition. You reduce from 2 x 8 8-bit entries to 2 x 4 16-bit, then 2 x 2 32-bit and 2 x 1 64-bit. The low byte of these two numbers gives the solution. I don't think there is an easy way to pack them together to form a single short value using NEON.
Takes 6 NEON instructions if the input is in the suitable form and the powers can be preloaded.

The obvious solution seems to be completely missed here.
// Use shifts to collect all of the sign bits.
// I'm not sure if this works on big endian, but big endian NEON is very
// rare.
int vmovmaskq_u8(uint8x16_t input)
{
// Example input (half scale):
// 0x89 FF 1D C0 00 10 99 33
// Shift out everything but the sign bits
// 0x01 01 00 01 00 00 01 00
uint16x8_t high_bits = vreinterpretq_u16_u8(vshrq_n_u8(input, 7));
// Merge the even lanes together with vsra. The '??' bytes are garbage.
// vsri could also be used, but it is slightly slower on aarch64.
// 0x??03 ??02 ??00 ??01
uint32x4_t paired16 = vreinterpretq_u32_u16(
vsraq_n_u16(high_bits, high_bits, 7));
// Repeat with wider lanes.
// 0x??????0B ??????04
uint64x2_t paired32 = vreinterpretq_u64_u32(
vsraq_n_u32(paired16, paired16, 14));
// 0x??????????????4B
uint8x16_t paired64 = vreinterpretq_u8_u64(
vsraq_n_u64(paired32, paired32, 28));
// Extract the low 8 bits from each lane and join.
// 0x4B
return vgetq_lane_u8(paired64, 0) | ((int)vgetq_lane_u8(paired64, 8) << 8);
}

This question deserves a newer answer for aarch64. The addition of new capabilities to Armv8 allows the same function to be implemented in fewer instructions. Here's my version:
uint32_t _mm_movemask_aarch64(uint8x16_t input)
{
const uint8_t __attribute__ ((aligned (16))) ucShift[] = {-7,-6,-5,-4,-3,-2,-1,0,-7,-6,-5,-4,-3,-2,-1,0};
uint8x16_t vshift = vld1q_u8(ucShift);
uint8x16_t vmask = vandq_u8(input, vdupq_n_u8(0x80));
uint32_t out;
vmask = vshlq_u8(vmask, vshift);
out = vaddv_u8(vget_low_u8(vmask));
out += (vaddv_u8(vget_high_u8(vmask)) << 8);
return out;
}

after some tests it looks like following code works correct:
int32_t _mm_movemask_epi8_neon(uint8x16_t input)
{
const int8_t __attribute__ ((aligned (16))) xr[8] = {-7,-6,-5,-4,-3,-2,-1,0};
uint8x8_t mask_and = vdup_n_u8(0x80);
int8x8_t mask_shift = vld1_s8(xr);
uint8x8_t lo = vget_low_u8(input);
uint8x8_t hi = vget_high_u8(input);
lo = vand_u8(lo, mask_and);
lo = vshl_u8(lo, mask_shift);
hi = vand_u8(hi, mask_and);
hi = vshl_u8(hi, mask_shift);
lo = vpadd_u8(lo,lo);
lo = vpadd_u8(lo,lo);
lo = vpadd_u8(lo,lo);
hi = vpadd_u8(hi,hi);
hi = vpadd_u8(hi,hi);
hi = vpadd_u8(hi,hi);
return ((hi[0] << 8) | (lo[0] & 0xFF));
}

Note that I haven't tested any of this, but something like this might work:
X := the vector that you want to create the mask from
A := 0x808080808080...
B := 0x00FFFEFDFCFB... (i.e. 0,-1,-2,-3,...)
X = vand_u8(X, A); // Keep d7 of each byte in X
X = vshl_u8(X, B); // X[7]>>=0; X[6]>>=1; X[5]>>=2; ...
// Each byte of X now contains its msb shifted 7-N bits to the right, where N
// is the byte index.
// Do 3 pairwise adds in order to pack all these into X[0]
X = vpadd_u8(X, X);
X = vpadd_u8(X, X);
X = vpadd_u8(X, X);
// X[0] should now contain the mask. Clear the remaining bytes if necessary
This would need to be repeated once to process a 128-bit vector, since vpadd only works on 64-bit vectors.

I know this question is here for 8 years already but let me give you the answer which might solve all performance problems with emulation. It's based on the blog Bit twiddling with Arm Neon: beating SSE movemasks, counting bits and more.
Most usages of movemask instructions are coming from comparisons where the vectors have 0xFF or 0x00 values from the result of every 16 bytes. After that most cases to use movemasks are to check if none/all match, find leading/trailing or iterate over bits.
If this is the case which often is, then you can use shrn reg1, reg2, #4 instruction. This instruction called Shift-Right-then-Narrow instruction can reduce a 128-bit byte mask to a 64-bit nibble mask (by alternating low and high nibbles to the result). This allows the mask to be extracted to a 64-bit general purpose register.
const uint16x8_t equalMask = vreinterpretq_u16_u8(vceqq_u8(chunk, vdupq_n_u8(tag)));
const uint8x8_t res = vshrn_n_u16(equalMask, 4);
const uint64_t matches = vget_lane_u64(vreinterpret_u64_u8(res), 0);
return matches;
After that you can use all bit operations you typically use on x86 with very minor tweaks like shifting by 2 or doing a scalar AND.

Addition in neon register

suppose I have a 64 bit d register in neon. lets say it stores the value ABCDEFGH.
Now I want o add A&E, B&F, C&G, D&H and so on.. Is here any intrinsic by which it is possible to so such an operation
I looked at the documentation but didn't find something suitable.

If you want the addition to be carried out in 16 bits, i.e. produce an uint16x4 result, you can use vmovl to promote the input vector from uint8x8 to uint8x16, then use vadd to add the lower and higher halves. Expressed in NEON intrinsics, this is achieved by
const int16x8_t t = vmovl_u8(input);
const int16x4_t r = vadd_u16(vget_low(t), vget_high(t))
This should compile to the following assembly (d0 is the 64-bit input register, d1 is the 64-bit output register). Note the vget_low and vget_high don't produce any instructions - these intrinsics are implemented by suitable register allocation, by exploiting that Q registers are just a convenient way to name two consecutive D register. Q{n} refers to the pair (D{2n}, D{2n+1}).
VMOVL.U8 q1, d0
VADD.I16 d1, d2
If you want the operation to be carried out in 8 bits, and saturate in case of an overflow, do
const int8x8_t t = vreinterpret_u8_u64(vshr_n_u64(vreinterpret_u64_u8(input), 32));
const int8x8_t r = vqadd_u8(input, t);
This compiles to (d0 is the input again, output in d1)
VSHR.U64 d1, d0, #32
VQADD.I8 d1, d0
By replacing VQADD with just VADD, the results will wrap-around on overflow instead of being saturated to 0xff.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight