converting SSE code to AVX - cost of _mm256_and_ps

converting SSE code to AVX - cost of _mm256_and_ps - c

I'm converting SSE2 sine and cosine functions (from Julien Pommier's sse_mathfun.h; based on the CEPHES sinf function) to use AVX in order to accept 8 float vectors or 4 doubles.
So, Julien's function sin_ps becomes sin_ps8 (for 8 floats) and sin_pd4 for 4 doubles. (The "advanced" editor here fails to accept my code, so please visit http://arstechnica.com/civis/viewtopic.php?f=20&t=1227375 to see it.)
Testing with clang 3.3 under Mac OS X 10.6.8 running on a 2011 Core2 i7 # 2.7Ghz, benchmarking results look like this:
sinf .. -> 27.7 millions of vector evaluations/second over 5.56e+07
iters (standard, scalar sinf() function)
sin_ps .. -> 41.0 millions of vector evaluations/second over
8.22e+07 iters
sin_pd4 .. -> 40.2 millions of vector evaluations/second over
8.06e+07 iters
sin_ps8 .. -> 2.5 millions of vector evaluations/second over
5.1e+06 iters
The cost of sin_ps8 is downright frightening, and it seems it is due to the use of _mm256_castsi256_ps . In fact, commenting out the line "poly_mask = _mm256_castsi256_ps(emmm2);" results in a more normal performance.
sin_pd4 uses _mm_castsi128_pd, but it appears that is not (just) the mix of SSE and AVX instructions that is biting me in sin_ps8: when I emulate the _mm256_castsi256_ps calls with 2 calls to _mm_castsi128_ps, performance doesn't improve. emm2 and emm0 are pointers to emmm2 and emmm0, both v8si instances and thus (a priori) correctly aligned to 32 bits boundaries.
See sse_mathfun.h and sse_mathfun_test.c for compilable code.
Is there a(n easy) way to avoid the penalty I'm seeing?

Transferring stuff out of registers into memory isn't usually a good idea. You are doing this every time you store into a pointer.
Instead of this:
{ ALIGN32_BEG v4sf *yy ALIGN32_END = (v4sf*) &y;
emm2[0] = _mm_and_si128(_mm_add_epi32( _mm_cvttps_epi32( yy[0] ), _v4si_pi32_1), _v4si_pi32_inv1),
emm2[1] = _mm_and_si128(_mm_add_epi32( _mm_cvttps_epi32( yy[1] ), _v4si_pi32_1), _v4si_pi32_inv1);
yy[0] = _mm_cvtepi32_ps(emm2[0]),
yy[1] = _mm_cvtepi32_ps(emm2[1]);
}
/* get the swap sign flag */
emm0[0] = _mm_slli_epi32(_mm_and_si128(emm2[0], _v4si_pi32_4), 29),
emm0[1] = _mm_slli_epi32(_mm_and_si128(emm2[1], _v4si_pi32_4), 29);
/* get the polynom selection mask
there is one polynom for 0 <= x <= Pi/4
and another one for Pi/4<x<=Pi/2
Both branches will be computed.
*/
emm2[0] = _mm_cmpeq_epi32(_mm_and_si128(emm2[0], _v4si_pi32_2), _mm_setzero_si128()),
emm2[1] = _mm_cmpeq_epi32(_mm_and_si128(emm2[1], _v4si_pi32_2), _mm_setzero_si128());
((v4sf*)&poly_mask)[0] = _mm_castsi128_ps(emm2[0]);
((v4sf*)&poly_mask)[1] = _mm_castsi128_ps(emm2[1]);
swap_sign_bit = _mm256_castsi256_ps(emmm0);
Try something like this:
__m128i emm2a = _mm_and_si128(_mm_add_epi32( _mm256_castps256_ps128(y), _v4si_pi32_1), _v4si_pi32_inv1);
__m128i emm2b = _mm_and_si128(_mm_add_epi32( _mm256_extractf128_ps(y, 1), _v4si_pi32_1), _v4si_pi32_inv1);
y = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_cvtepi32_ps(emm2a)), _mm_cvtepi32_ps(emm2b), 1);
/* get the swap sign flag */
__m128i emm0a = _mm_slli_epi32(_mm_and_si128(emm2a, _v4si_pi32_4), 29),
__m128i emm0b = _mm_slli_epi32(_mm_and_si128(emm2b, _v4si_pi32_4), 29);
swap_sign_bit = _mm256_castsi256_ps(_mm256_insertf128_si256(_mm256_castsi128_si256(emm0a), emm0b, 1));
/* get the polynom selection mask
there is one polynom for 0 <= x <= Pi/4
and another one for Pi/4<x<=Pi/2
Both branches will be computed.
*/
emm2a = _mm_cmpeq_epi32(_mm_and_si128(emm2a, _v4si_pi32_2), _mm_setzero_si128()),
emm2b = _mm_cmpeq_epi32(_mm_and_si128(emm2b, _v4si_pi32_2), _mm_setzero_si128());
poly_mask = _mm256_castsi256_ps(_mm256_insertf128_si256(_mm256_castsi128_si256(emm2a), emm2b, 1));
As mentioned in comments, cast intrinsics are purely compile-time and emit no instructions.

Maybe you could compare your code to the already working AVX extension of Julien Pommier SSE math functions?
http://software-lisc.fbk.eu/avx_mathfun/
This code works in GCC but not MSVC and only supports floats (float8) but I think you could easily extend it to use doubles (double4) as well. A quick comparison of your sin function shows that they are quite similar except for the SSE2 integer part.

Related

Low pass shape curve with low cycle count

Hi, I want to achieve the above curve in software without using a dsp function. I was hoping to use a fast and low cycle arm function like multiply-accumulate.
Is there any fast way of doing this in C on an embedded arm processor?

The curve shown is that of the simplest possible first-order filter charactarised by a 3dB cut-off frequency fc> and a 6dB/Octave or 20dB/Decade roll-off. As an analogue filter it could be implemented as a simple passive RC filter thus:
In the digital domain such a filter would be implemented by:
yn = a0 xn + b1 yn-1
Where y are input samples and x output samples. Or in code:
void lowPassFilter( const tSample* x, tSample* y, size_t sample_count )
{
static tSample y_1 = 0 ;
for( int i = 0; i < n; ++i)
{
y[i] = a0 * x[i] + b1 * y_1 ;
y_1 = y[i];
}
}
The filter is characterised by the coefficients:
a0 = 1 - x
b1 = x
where x is a value between 0 and 1 (I'll address the eradication of the implied floating point operations in due course):
x = e-2πfc
Where fc is the desired -3dB cut-off frequency expressed as a fraction of the sample rate. So for a sample rate 32Ksps and a cut-off frequency of 1KHz, fc = 1000/32000 = 0.03125, so:
b1 = x = e-2πfc = 0.821725
a0 = 1 - x = 0.178275
Now naïvely plugging those constants into the lowPassFilter() will result in generation of floating point code and on an MCU without an FPU that might be prohibitive and even with an FPU might be be sub-optimal. So in this case we might use a fixed-point representation. Since all the real values are less than one, and the machine is 32bit, a UQ0.16 representation would be appropriate, as intermediate multiplication results will not then overflow a 32 bit machine word. This does require the sample width to be 16bit or less (or scaled accordingly). So using fixed-point the code might look like:
typedef uint16_t tSample ;
#define b1 53852 // 0.821725 * 65535
#define a0 (1 - b1)
#define FIXED_MUL( x, y ) (((x)*(y))>>16))
void lowPassFilter( const tSample* x, tSample* y, size_t sample_count )
{
static tSample y_1 = 0 ;
for( int i = 0; i < n; ++i)
{
y[i] = FIXED_MUL(a0, x[i]) + FIXED_MUL(b1, y_1) ;
y_1 = y[i];
}
}
Now that is not a significant amount of processing for most ARM processors at 32ksps suggested in this example. Obviously it depends what other demands are on the processor, but on its own this would not be a significant load, even without applying compiler optimisation. As with any optimisation, you should implement it, measure it and improve it if necessary.
As a first stab I'd trust the compiler optimiser to generate code that in most cases will meet requirements or at least be as good as you might achieve with handwritten assembler. Whether or not it would choose to use a multiply-accumulate instruction is out of your hands, but if it didn't the chances are that it is because there s no advantage.
Bare in mind that ARM Cortex-M4 and M7 for example include DSP instructions not supported in M0 or M3 ports. The compiler may or may not utilise these, but the simplest way to guarantee that without resorting to assembler would be to use the CMSIS DSP Library whether or not that provided greater performance or better fidelity than the above, you would have to test.
Worth noting that the function lowPassFilter() retains state staically so can be called iteratively for "blocks" of samples (from ADC DMA transfer for example), so you might have:
int dma_buffer_n = 0
for(;;)
{
waitEvent( DMA_BUFFER_READY ) ;
lowPassFilter( dma_buffer[dma_buffer_n], output_buffer, DMA_BLOCK_SIZE ) ;
dma_buffer_n = dma_buffer_n == 0 ? 1 : 0 ; // Flip buffers
}
The use of DMA double-buffering is likely to be far more important to performance than the filter function implementation. I have worked on a DSP application sampling two channels at 48ksps on a 72MHz Cortex-M3 with far more complex DSP requirements than this with each channel having a high pass IIR, an 18 coefficient FIR and a Viterbi decoder, so I really do think that your assumption that this simple filter will not be fast enough is somewhat premature.

_mm_stream_si128 is 2000% slower than _mm_store_si128 [duplicate]

I have code that looks like this (simple load, modify, store) (I've simplified it to make it more readable):
__asm__ __volatile__ ( "vzeroupper" : : : );
while(...) {
__m128i in = _mm_loadu_si128(inptr);
__m128i out = in; // real code does more than this, but I've simplified it
_mm_stream_si12(outptr,out);
inptr += 12;
outptr += 16;
}
This code runs about 5 times faster on our older Sandy Bridge Haswell hardware compared to our newer Skylake machines. For example, if the while loop runs about 16e9 iterations, it takes 14 seconds on Sandy Bridge Haswell and 70 seconds on Skylake.
We upgraded to the lasted microcode on the Skylake,
and also stuck in vzeroupper commands to avoid any AVX issues. Both fixes had no effect.
outptr is aligned to 16 bytes, so the stream command should be writing to aligned addresses. (I put in checks to verify this statement). inptr is not aligned by design. Commenting out the loads doesn't make any effect, the limiting commands are the stores. outptr and inptr are pointing to different memory regions, there is no overlap.
If I replace the _mm_stream_si128 with _mm_storeu_si128, the code runs way faster on both machines, about 2.9 seconds.
So the two questions are
1) why is there such a big difference between Sandy Bridge Haswell and Skylake when writing using the _mm_stream_si128 intrinsic?
2) why does the _mm_storeu_si128 run 5x faster than the streaming equivalent?
I'm a newbie when it comes to intrinsics.
Addendum - test case
Here is the entire test case: https://godbolt.org/z/toM2lB
Here is a summary of the benchmarks I took on two difference processors, E5-2680 v3 (Haswell) and 8180 (Skylake).
// icpc -std=c++14 -msse4.2 -O3 -DNDEBUG ../mre.cpp -o mre
// The following benchmark times were observed on a Intel(R) Xeon(R) Platinum 8180 CPU # 2.50GHz
// and Intel(R) Xeon(R) CPU E5-2680 v3 # 2.50GHz.
// The command line was
// perf stat ./mre 100000
//
// STORER time (seconds)
// E5-2680 8180
// ---------------------------------------------------
// _mm_stream_si128 1.65 7.29
// _mm_storeu_si128 0.41 0.40
The ratio of stream to store is 4x or 18x, respectively.
I'm relying on the default new allocator to align my data to 16 bytes. I'm getting luck here that it is aligned. I have tested that this is true, and in my production application, I use an aligned allocator to make absolutely sure it is, as well as checks on the address, but I left that off of the example because I don't think it matters.
Second edit - 64B aligned output
The comment from #Mystical made me check that the outputs were all cache aligned. The writes to the Tile structures are done in 64-B chunks, but the Tiles themselves were not 64-B aligned (only 16-B aligned).
So changed my test code like this:
#if 0
std::vector<Tile> tiles(outputPixels/32);
#else
std::vector<Tile, boost::alignment::aligned_allocator<Tile,64>> tiles(outputPixels/32);
#endif
and now the numbers are quite different:
// STORER time (seconds)
// E5-2680 8180
// ---------------------------------------------------
// _mm_stream_si128 0.19 0.48
// _mm_storeu_si128 0.25 0.52
So everything is much faster. But the Skylake is still slower than Haswell by a factor of 2.
Third Edit. Purposely misalignment
I tried the test suggested by #HaidBrais. I purposely allocated my vector class aligned to 64 bytes, then added 16 bytes or 32 bytes inside the allocator such that the allocation was either 16 Byte or 32 Byte aligned, but NOT 64 byte aligned. I also increased the number of loops to 1,000,000, and ran the test 3 times and picked the smallest time.
perf stat ./mre1 1000000
To reiterate, an alignment of 2^N means it is NOT aligned to 2^(N+1) or 2^(N+2).
// STORER alignment time (seconds)
// byte E5-2680 8180
// ---------------------------------------------------
// _mm_storeu_si128 16 3.15 2.69
// _mm_storeu_si128 32 3.16 2.60
// _mm_storeu_si128 64 1.72 1.71
// _mm_stream_si128 16 14.31 72.14
// _mm_stream_si128 32 14.44 72.09
// _mm_stream_si128 64 1.43 3.38
So it is clear that cache alignment gives the best results, but _mm_stream_si128 is better only on the 2680 processor and suffers some sort of penalty on the 8180 that I can't explain.
For furture use, here is the misaligned allocator I used (I did not templatize the misalignment, you'll have to edit the 32 and change to 0 or 16 as needed):
template <class T >
struct Mallocator {
typedef T value_type;
Mallocator() = default;
template <class U> constexpr Mallocator(const Mallocator<U>&) noexcept
{}
T* allocate(std::size_t n) {
if(n > std::size_t(-1) / sizeof(T)) throw std::bad_alloc();
uint8_t* p1 = static_cast<uint8_t*>(aligned_alloc(64, (n+1)*sizeof(T)));
if(! p1) throw std::bad_alloc();
p1 += 32; // misalign on purpose
return reinterpret_cast<T*>(p1);
}
void deallocate(T* p, std::size_t) noexcept {
uint8_t* p1 = reinterpret_cast<uint8_t*>(p);
p1 -= 32;
std::free(p1); }
};
template <class T, class U>
bool operator==(const Mallocator<T>&, const Mallocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const Mallocator<T>&, const Mallocator<U>&) { return false; }
...
std::vector<Tile, Mallocator<Tile>> tiles(outputPixels/32);

The simplified code doesn't really show the actual structure of your benchmark. I don't think the simplified code will exhibit the slowness you've mentioned.
The actual loop from your godbolt code is:
while (count > 0)
{
// std::cout << std::hex << (void*) ptr << " " << (void*) tile <<std::endl;
__m128i value0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 0 * diffBytes));
__m128i value1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 1 * diffBytes));
__m128i value2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 2 * diffBytes));
__m128i value3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 3 * diffBytes));
__m128i tileVal0 = value0;
__m128i tileVal1 = value1;
__m128i tileVal2 = value2;
__m128i tileVal3 = value3;
STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 0), tileVal0);
STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 1), tileVal1);
STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 2), tileVal2);
STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 3), tileVal3);
ptr += diffBytes * 4;
count -= diffBytes * 4;
tile += diffPixels * 4;
ipixel += diffPixels * 4;
if (ipixel == 32)
{
// go to next tile
ipixel = 0;
tileIter++;
tile = reinterpret_cast<uint16_t*>(tileIter->pixels);
}
}
Note the if (ipixel == 32) part. This jumps to a different tile every time ipixel reaches 32. Since diffPixels is 8, this happens every iteration. Hence you are only making 4 streaming stores (64 bytes) per tile. Unless each tile happens to be 64-byte aligned, which is unlikely to happen by chance and cannot be relied on, this means that every write writes to only part of two different cache lines. That's a known anti-pattern for streaming stores: for effective use of streaming stores you need to write out the full line.
On to the performance differences: streaming stores have widely varying performance on different hardware. These stores always occupy a line fill buffer for some time, but how long varies: on lots of client chips it seems to only occupy a buffer for about the L3 latency. I.e., once the streaming store reaches the L3 it can be handed off (the L3 will track the rest of the work) and the LFB can be freed on the core. Server chips often have much longer latency. Especially multi-socket hosts.
Evidently, the performance of NT stores is worse on the SKX box, and much worse for partial line writes. The overall worse performance is probably related to the redesign of the L3 cache.

How to extract 8 integers from a 256 vector using intel intrinsics?

I'm trying to enhance the performance of my code by using the 256bit vector (Intel intrinsics - AVX).
I have an I7 Gen.4 (Haswell architecture) processor supporting SSE1 to SSE4.2 and AVX/AVX2 Extensions.
This is the code snippet that I'm trying to enhance:
/* code snipet */
kfac1 = kfac + factor; /* 7 cycles for 7 additions */
kfac2 = kfac1 + factor;
kfac3 = kfac2 + factor;
kfac4 = kfac3 + factor;
kfac5 = kfac4 + factor;
kfac6 = kfac5 + factor;
kfac7 = kfac6 + factor;
k1fac1 = k1fac + factor1; /* 7 cycles for 7 additions */
k1fac2 = k1fac1 + factor1;
k1fac3 = k1fac2 + factor1;
k1fac4 = k1fac3 + factor1;
k1fac5 = k1fac4 + factor1;
k1fac6 = k1fac5 + factor1;
k1fac7 = k1fac6 + factor1;
k2fac1 = k2fac + factor2; /* 7 cycles for 7 additions */
k2fac2 = k2fac1 + factor2;
k2fac3 = k2fac2 + factor2;
k2fac4 = k2fac3 + factor2;
k2fac5 = k2fac4 + factor2;
k2fac6 = k2fac5 + factor2;
k2fac7 = k2fac6 + factor2;
/* code snipet */
From the Intel Manuals, I found this.
an integer addition ADD takes 1 cycle (latency).
a vector of 8 integers (32 bit) takes 1 cycle also.
So I've tried ton make it this way:
fac = _mm256_set1_epi32 (factor )
fac1 = _mm256_set1_epi32 (factor1)
fac2 = _mm256_set1_epi32 (factor2)
v1 = _mm256_set_epi32 (0,kfac6,kfac5,kfac4,kfac3,kfac2,kfac1,kfac)
v2 = _mm256_set_epi32 (0,k1fac6,k1fac5,k1fac4,k1fac3,k1fac2,k1fac1,k1fac)
v3 = _mm256_set_epi32 (0,k2fac6,k2fac5,k2fac4,k2fac3,k2fac2,k2fac1,k2fac)
res1 = _mm256_add_epi32 (v1,fac) ////////////////////
res2 = _mm256_add_epi32 (v2,fa1) // just 3 cycles //
res3 = _mm256_add_epi32 (v3,fa2) ////////////////////
But the problem is that these factors are going to be used as tables indexes ( table[kfac] ... ). So i have to extract the factor as seperate integers again.
I wonder if there is any possible way to do it??

A smart compiler could get table+factor into a register and use indexed addressing modes to get table+factor+k1fac6 as an address. Check the asm, and if the compiler doesn't do this for you, try changing the source to hand-hold the compiler:
const int *tf = table + factor;
const int *tf2 = table + factor2; // could be lea rdx, [rax+rcx*4] or something.
...
foo = tf[kfac2];
bar = tf2[k2fac6]; // could be mov r12, [rdx + rdi*4]
But to answer the question you asked:
Latency isn't a big deal when you have that many independent adds happening. The throughput of 4 scalar add instructions per clock on Haswell is much more relevant.
If k1fac2 and so on are already in contiguous memory, then using SIMD is possibly worth it. Otherwise all the shuffling and data transfer to get them in/out of vector regs makes it definitely not worth it. (i.e. the stuff compiler emits to implement _mm256_set_epi32 (0,kfac6,kfac5,kfac4,kfac3,kfac2,kfac1,kfac).
You could avoid needing to get the indices back into integer registers by using an AVX2 gather for the table loads. But gather is slow on Haswell, so probably not worth it. Maybe worth it on Broadwell.
On Skylake, gather is fast so it could be good if you can SIMD whatever you do with the LUT results. If you need to extract all the gather results back to separate integer registers, it's probably not worth it.
If you did need to extract 8x 32-bit integers from a __m256i into integer registers, you have three main choices of strategy:
Vector store to a tmp array and scalar loads
ALU shuffle instructions like pextrd (_mm_extract_epi32). Use _mm256_extracti128_si256 to get the high lane into a separate __m128i.
A mix of both strategies (e.g. store the high 128 to memory while using ALU stuff on the low half).
Depending on the surrounding code, any of these three could be optimal on Haswell.
pextrd r32, xmm, imm8 is 2 uops on Haswell, with one of them needing the shuffle unit on port5. That's a lot of shuffle uops, so a pure ALU strategy is only going to be good if your code is bottlenecked on L1d cache throughput. (Not the same thing as memory bandwidth). movd r32, xmm is only 1 uop, and compilers do know to use that when compiling _mm_extract_epi32(vec, 0), but you can also write int foo = _mm_cvtsi128_si32(vec) to make it explicit and remind yourself that the bottom element can be accessed more efficiently.
Store/reload has good throughput. Intel SnB-family CPUs including Haswell can run two loads per clock, and IIRC store-forwarding works from an aligned 32-byte store to any 4-byte element of it. But make sure it's an aligned store, e.g. into _Alignas(32) int tmp[8], or into a union between an __m256i and an int array. You could still store into the int array instead of the __m256i member to avoid union type-punning while still having the array aligned, but it's easiest to just use C++11 alignas or C11 _Alignas.
_Alignas(32) int tmp[8];
_mm256_store_si256((__m256i*)tmp, vec);
...
foo2 = tmp[2];
However, the problem with store/reload is latency. Even the first result won't be ready for 6 cycles after the store-data is ready.
A mixed strategy gives you the best of both worlds: ALU to extract the first 2 or 3 elements lets execution get started on whatever code uses them, hiding the store-forwarding latency of the store/reload.
_Alignas(32) int tmp[8];
_mm256_store_si256((__m256i*)tmp, vec);
__m128i lo = _mm256_castsi256_si128(vec); // This is free, no instructions
int foo0 = _mm_cvtsi128_si32(lo);
int foo1 = _mm_extract_epi32(lo, 1);
foo2 = tmp[2];
// rest of foo3..foo7 also loaded from tmp[]
// Then use foo0..foo7
You might find that it's optimal to do the first 4 elements with pextrd, in which case you only need to store/reload the upper lane. Use vextracti128 [mem], ymm, 1:
_Alignas(16) int tmp[4];
_mm_store_si128((__m128i*)tmp, _mm256_extracti128_si256(vec, 1));
// movd / pextrd for foo0..foo3
int foo4 = tmp[0];
...
With fewer larger elements (e.g. 64-bit integers), a pure ALU strategy is more attractive. 6-cycle vector-store / integer-reload latency is longer than it would take to get all of the results with ALU ops, but store/reload could still be good if there's a lot of instruction-level parallelism and you bottleneck on ALU throughput instead of latency.
With more smaller elements (8 or 16-bit), store/reload is definitely attractive. Extracting the first 2 to 4 elements with ALU instructions is still good. And maybe even vmovd r32, xmm and then picking that apart with integer shift/mask instructions is good.
Your cycle-counting for the vector version is also bogus. The three _mm256_add_epi32 operations are independent, and Haswell can run two vpaddd instructions in parallel. (Skylake can run all three in a single cycle, each with 1 cycle latency.)
Superscalar pipelined out-of-order execution means there's a big difference between latency and throughput, and keeping track of dependency chains matters a lot. See http://agner.org/optimize/, and other links in the x86 tag wiki for more optimization guides.

SIMD (AVX2) mask store and pack [duplicate]

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?
I've seen in SSE where it was done like this:
(From:https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf)
__m128i LeftPack_SSSE3(__m128 mask, __m128 val)
{
// Move 4 sign bits of mask to 4-bit integer value.
int mask = _mm_movemask_ps(mask);
// Select shuffle control data
__m128i shuf_ctrl = _mm_load_si128(&shufmasks[mask]);
// Permute to move valid values to front of SIMD register
__m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl);
return packed;
}
This seems fine for SSE which is 4 wide, and thus only needs a 16 entry LUT, but for AVX which is 8 wide, the LUT becomes quite large(256 entries, each 32 bytes, or 8k).
I'm surprised that AVX doesn't appear to have an instruction for simplifying this process, such as a masked store with packing.
I think with some bit shuffling to count the # of sign bits set to the left you could generate the necessary permutation table, and then call _mm256_permutevar8x32_ps. But this is also quite a few instructions I think..
Does anyone know of any tricks to do this with AVX2? Or what is the most efficient method?
Here is an illustration of the Left Packing Problem from the above document:
Thanks

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)
We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.
We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.
Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).
AMD before Zen2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)
For integer vectors with 32-bit or wider elements: Either 1) _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)).
Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (Not tested). Either way, use the shuffle mask with VPERMD instead of VPERMPS.
For 64-bit integer or double elements, everything still Just Works; The compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)
For 16-bit elements, you might be able to adapt this with 128-bit vectors.
For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.
The algorithm:
Start with a constant of packed 3 bit indices, with each position holding its own index. i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide. 0b111'110'101'...'010'001'000.
Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000'...'111'000'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000'...'010'000. (i.e. [ ... 2 0 ])
See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.
Now we're in the same boat as the compressed-LUT: unpack up to 8 packed indices.
By the time you put all the pieces together, there are three total pext/pdeps. I worked backwards from what I wanted, so it's probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)
We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.
See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.
#include <stdint.h>
#include <immintrin.h>
// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101); // unpack each bit to a byte
expanded_mask *= 0xFF; // mask |= mask<<1 | mask<<2 | ... | mask<<7;
// ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte
const uint64_t identity_indices = 0x0706050403020100; // the identity shuffle for vpermps, packed to one index per byte
uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);
__m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
__m256i shufmask = _mm256_cvtepu8_epi32(bytevec);
return _mm256_permutevar8x32_ps(src, shufmask);
}
This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).
# clang 3.7.1 -std=gnu++14 -O3 -march=haswell
mov eax, edi # just to zero extend: goes away when inlining
movabs rcx, 72340172838076673 # The constants are hoisted after inlining into a loop
pdep rax, rax, rcx # ABC -> 0000000A0000000B....
imul rax, rax, 255 # 0000000A0000000B.. -> AAAAAAAABBBBBBBB..
movabs rcx, 506097522914230528
pext rax, rcx, rax
vmovq xmm1, rax
vpmovzxbd ymm1, xmm1 # 3c latency since this is lane-crossing
vpermps ymm0, ymm1, ymm0
ret
(Later clang compiles like GCC, with mov/shl/sub instead of imul, see below.)
So, according to Agner Fog's numbers and https://uops.info/, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it's 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There's no instruction-level parallelism. In a loop where this isn't part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.
This can maybe manage a throughput of one per 4 cycles, bottlenecked on port1 for pdep/pext/imul plus popcnt in the loop. Of course, with loads/stores and other loop overhead (including the compare and movmsk), total uop throughput can easily be an issue, too.
e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we're lucky.
clang 6 and earlier created a loop-carried dependency with popcnt's false dependency on its output, so it will bottleneck on 3/5ths of the latency of the compress256 function. clang 7.0 and later use xor-zeroing to break the false dependency (instead of just using popcnt edx,edx or something like GCC does :/).
gcc (and later clang) does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub, instead of imul by 255. This takes 3 total uops vs. 1 for the front-end, but the latency is only 2 cycles, down from 3. (Haswell handles mov at register-rename stage with zero latency.) Most significantly for this, imul can only run on port 1, competing with pdep/pext/popcnt, so it's probably good to avoid that bottleneck.
Since all hardware that supports AVX2 also supports BMI2, there's probably no point providing a version for AVX2 without BMI2.
If you need to do this in a very long loop, the LUT is probably worth it if the initial cache-misses are amortized over enough iterations with the lower overhead of just unpacking the LUT entry. You still need to movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pext.
You can unpack LUT entries with the same integer sequence I used, but #Froglegs's set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn't need to go into integer registers in the first place. (A 32bit broadcast-load doesn't need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).

See my other answer for AVX2+BMI2 with no LUT.
Since you mention a concern about scalability to AVX512: don't worry, there's an AVX512F instruction for exactly this:
VCOMPRESSPS — Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory. (There are also versions for double, and 32 or 64bit integer elements (vpcompressq), but not byte or word (16bit)). It's like BMI2 pdep / pext, but for vector elements instead of bits in an integer reg.
The destination can be a vector register or a memory operand, while the source is a vector and a mask register. With a register dest, it can merge or zero the upper bits. With a memory dest, "Only the contiguous vector is written to the destination memory location".
To figure out how far to advance your pointer for the next vector, popcnt the mask.
Let's say you want to filter out everything but values >= 0 from an array:
#include <stdint.h>
#include <immintrin.h>
size_t filter_non_negative(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
const float *endp = src+len;
float *dst_start = dst;
do {
__m512 sv = _mm512_loadu_ps(src);
__mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ); // true for src >= 0.0, false for unordered and src < 0.0
_mm512_mask_compressstoreu_ps(dst, keep, sv); // clang is missing this intrinsic, which can't be emulated with a separate store
src += 16;
dst += _mm_popcnt_u64(keep); // popcnt_u64 instead of u32 helps gcc avoid a wasted movsx, but is potentially slower on some CPUs
} while (src < endp);
return dst - dst_start;
}
This compiles (with gcc4.9 or later) to (Godbolt Compiler Explorer):
# Output from gcc6.1, with -O3 -march=haswell -mavx512f. Same with other gcc versions
lea rcx, [rsi+rdx*4] # endp
mov rax, rdi
vpxord zmm1, zmm1, zmm1 # vpxor xmm1, xmm1,xmm1 would save a byte, using VEX instead of EVEX
.L2:
vmovups zmm0, ZMMWORD PTR [rsi]
add rsi, 64
vcmpps k1, zmm0, zmm1, 29 # AVX512 compares have mask regs as a destination
kmovw edx, k1 # There are some insns to add/or/and mask regs, but not popcnt
movzx edx, dx # gcc is dumb and doesn't know that kmovw already zero-extends to fill the destination.
vcompressps ZMMWORD PTR [rax]{k1}, zmm0
popcnt rdx, rdx
## movsx rdx, edx # with _popcnt_u32, gcc is dumb. No casting can get gcc to do anything but sign-extend. You'd expect (unsigned) would mov to zero-extend, but no.
lea rax, [rax+rdx*4] # dst += ...
cmp rcx, rsi
ja .L2
sub rax, rdi
sar rax, 2 # address math -> element count
ret
Performance: 256-bit vectors may be faster on Skylake-X / Cascade Lake
In theory, a loop that loads a bitmap and filters one array into another should run at 1 vector per 3 clocks on SKX / CSLX, regardless of vector width, bottlenecked on port 5. (kmovb/w/d/q k1, eax runs on p5, and vcompressps into memory is 2p5 + a store, according to IACA and to testing by http://uops.info/).
#ZachB reports in comments that in practice, that a loop using ZMM _mm512_mask_compressstoreu_ps is slightly slower than _mm256_mask_compressstoreu_ps on real CSLX hardware. (I'm not sure if that was a microbenchmark that would allow the 256-bit version to get out of "512-bit vector mode" and clock higher, or if there was surrounding 512-bit code.)
I suspect misaligned stores are hurting the 512-bit version. vcompressps probably effectively does a masked 256 or 512-bit vector store, and if that crosses a cache line boundary then it has to do extra work. Since the output pointer is usually not a multiple of 16 elements, a full-line 512-bit store will almost always be misaligned.
Misaligned 512-bit stores may be worse than cache-line-split 256-bit stores for some reason, as well as happening more often; we already know that 512-bit vectorization of other things seems to be more alignment sensitive. That may just be from running out of split-load buffers when they happen every time, or maybe the fallback mechanism for handling cache-line splits is less efficient for 512-bit vectors.
It would be interesting to benchmark vcompressps into a register, with separate full-vector overlapping stores. That's probably the same uops, but the store can micro-fuse when it's a separate instruction. And if there's some difference between masked stores vs. overlapping stores, this would reveal it.
Another idea discussed in comments below was using vpermt2ps to build up full vectors for aligned stores. This would be hard to do branchlessly, and branching when we fill a vector will probably mispredict unless the bitmask has a pretty regular pattern, or big runs of all-0 and all-1.
A branchless implementation with a loop-carried dependency chain of 4 or 6 cycles through the vector being constructed might be possible, with a vpermt2ps and a blend or something to replace it when it's "full". With an aligned vector store every iteration, but only moving the output pointer when the vector is full.
This is likely slower than vcompressps with unaligned stores on current Intel CPUs.

If you are targeting AMD Zen this method may be preferred, due to the very slow pdepand pext on ryzen (18 cycles each).
I came up with this method, which uses a compressed LUT, which is 768(+1 padding) bytes, instead of 8k. It requires a broadcast of a single scalar value, which is then shifted by a different amount in each lane, then masked to the lower 3 bits, which provides a 0-7 LUT.
Here is the intrinsics version, along with code to build LUT.
//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(u32 moveMask) {
u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
__m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr));//lower 24 bits has our LUT
// __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));
//now shift it right to get 3 bits at bottom
//__m256i shufmask = _mm256_srli_epi32(m, 29);
//Simplified version suggested by wim
//shift each lane so desired 3 bits are a bottom
//There is leftover data in the lane, but _mm256_permutevar8x32_ps only examines the first 3 bits so this is ok
__m256i shufmask = _mm256_srlv_epi32 (indices, _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21));
return shufmask;
}
u32 get_nth_bits(int a) {
u32 out = 0;
int c = 0;
for (int i = 0; i < 8; ++i) {
auto set = (a >> i) & 1;
if (set) {
out |= (i << (c * 3));
c++;
}
}
return out;
}
u8 g_pack_left_table_u8x3[256 * 3 + 1];
void BuildPackMask() {
for (int i = 0; i < 256; ++i) {
*reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
}
}
Here is the assembly generated by MSVC:
lea ecx, DWORD PTR [rcx+rcx*2]
lea rax, OFFSET FLAT:unsigned char * g_pack_left_table_u8x3 ; g_pack_left_table_u8x3
vpbroadcastd ymm0, DWORD PTR [rcx+rax]
vpsrlvd ymm0, ymm0, YMMWORD PTR __ymm#00000015000000120000000f0000000c00000009000000060000000300000000

Will add more information to a great answer from #PeterCordes : https://stackoverflow.com/a/36951611/5021064.
I did the implementations of std::remove from C++ standard for integer types with it. The algorithm, once you can do compress, is relatively simple: load a register, compress, store. First I'm going to show the variations and then benchmarks.
I ended up with two meaningful variations on the proposed solution:
__m128i registers, any element type, using _mm_shuffle_epi8 instruction
__m256i registers, element type of at least 4 bytes, using _mm256_permutevar8x32_epi32
When the types are smaller then 4 bytes for 256 bit register, I split them in two 128 bit registers and compress/store each one separately.
Link to compiler explorer where you can see complete assembly (there is a using type and width (in elements per pack) in the bottom, which you can plug in to get different variations) : https://gcc.godbolt.org/z/yQFR2t
NOTE: my code is in C++17 and is using a custom simd wrappers, so I do not know how readable it is. If you want to read my code -> most of it is behind the link in the top include on godbolt. Alternatively, all of the code is on github.
Implementations of #PeterCordes answer for both cases
Note: together with the mask, I also compute the number of elements remaining using popcount. Maybe there is a case where it's not needed, but I have not seen it yet.
Mask for _mm_shuffle_epi8
Write an index for each byte into a half byte: 0xfedcba9876543210
Get pairs of indexes into 8 shorts packed into __m128i
Spread them out using x << 4 | x & 0x0f0f
Example of spreading the indexes. Let's say 7th and 6th elements are picked.
It means that the corresponding short would be: 0x00fe. After << 4 and | we'd get 0x0ffe. And then we clear out the second f.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of `_mm_movemask_epi8`,
// `uint16_t` - there are at most 16 bits with values for __m128i.
inline std::pair<__m128i, std::uint8_t> mask128(std::uint16_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x1111111111111111) * 0xf;
const std::uint8_t offset =
static_cast<std::uint8_t>(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes =
_pext_u64(0xfedcba9876543210, mmask_expanded); // Do the #PeterCordes answer
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0...0|compressed_indexes
const __m128i as_16bit = _mm_cvtepu8_epi16(as_lower_8byte); // From bytes to shorts over the whole register
const __m128i shift_by_4 = _mm_slli_epi16(as_16bit, 4); // x << 4
const __m128i combined = _mm_or_si128(shift_by_4, as_16bit); // | x
const __m128i filter = _mm_set1_epi16(0x0f0f); // 0x0f0f
const __m128i res = _mm_and_si128(combined, filter); // & 0x0f0f
return {res, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m128i, std::uint8_t> compress_mask_for_shuffle_epi8(std::uint32_t mmask) {
auto res = _compress_mask::mask128(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Mask for _mm256_permutevar8x32_epi32
This is almost one for one #PeterCordes solution - the only difference is _pdep_u64 bit (he suggests this as a note).
The mask that I chose is 0x5555'5555'5555'5555. The idea is - I have 32 bits of mmask, 4 bits for each of 8 integers. I have 64 bits that I want to get => I need to convert each bit of 32 bits into 2 => therefore 0101b = 5.The multiplier also changes from 0xff to 3 because I will get 0x55 for each integer, not 1.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of _mm256_movemask_epi8
inline std::pair<__m256i, std::uint8_t> mask256_epi32(std::uint32_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x5555'5555'5555'5555) * 3;
const std::uint8_t offset = static_cast<std::uint8_t(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes = _pext_u64(0x0706050403020100, mmask_expanded); // Do the #PeterCordes answer
// Every index was one byte => we need to make them into 4 bytes
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0000|compressed indexes
const __m256i expanded = _mm256_cvtepu8_epi32(as_lower_8byte); // spread them out
return {expanded, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m256i, std::uint8_t> compress_mask_for_permutevar8x32(std::uint32_t mmask) {
static_assert(sizeof(T) >= 4); // You cannot permute shorts/chars with this.
auto res = _compress_mask::mask256_epi32(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Benchmarks
Processor: Intel Core i7 9700K (a modern consumer level CPU, no AVX-512 support)
Compiler: clang, build from trunk near the version 10 release
Compiler options: --std=c++17 --stdlib=libc++ -g -Werror -Wall -Wextra -Wpedantic -O3 -march=native -mllvm -align-all-functions=7
Micro-benchmarking library: google benchmark
Controlling for code alignment:
If you are not familiar with the concept, read this or watch this
All functions in the benchmark's binary are aligned to 128 byte boundary. Each benchmarking function is duplicated 64 times, with a different noop slide in the beginning of the function (before entering the loop). The main numbers I show is min per each measurement. I think this works since the algorithm is inlined. I'm also validated by the fact that I get very different results. At the very bottom of the answer I show the impact of code alignment.
Note: benchmarking code. BENCH_DECL_ATTRIBUTES is just noinline
Benchmark removes some percentage of 0s from an array. I test arrays with {0, 5, 20, 50, 80, 95, 100} percent of zeroes.
I test 3 sizes: 40 bytes (to see if this is usable for really small arrays), 1000 bytes and 10'000 bytes. I group by size because of SIMD depends on the size of the data and not a number of elements. The element count can be derived from an element size (1000 bytes is 1000 chars but 500 shorts and 250 ints). Since time it takes for non simd code depends mostly on the element count, the wins should be bigger for chars.
Plots: x - percentage of zeroes, y - time in nanoseconds. padding : min indicates that this is minimum among all alignments.
40 bytes worth of data, 40 chars
For 40 bytes this does not make sense even for chars - my implementation gets about 8-10 times slower when using 128 bit registers over non-simd code. So, for example, compiler should be careful doing this.
1000 bytes worth of data, 1000 chars
Apparently the non-simd version is dominated by branch prediction: when we get small amount of zeroes we get a smaller speed up: for no 0s - about 3 times, for 5% zeroes - about 5-6 times speed up. For when the branch predictor can't help the non-simd version - there is about a 27 times speed up. It's an interesting property of simd code that it's performance tends to be much less dependent on of data. Using 128 vs 256 register shows practically no difference, since most of the work is still split into 2 128 registers.
1000 bytes worth of data, 500 shorts
Similar results for shorts except with a much smaller gain - up to 2 times.
I don't know why shorts do that much better than chars for non-simd code: I'd expect shorts to be two times faster, since there are only 500 shorts, but the difference is actually up to 10 times.
1000 bytes worth of data, 250 ints
For a 1000 only 256 bit version makes sense - 20-30% win excluding no 0s to remove what's so ever (perfect branch prediction, no removing for non-simd code).
10'000 bytes worth of data, 10'000 chars
The same order of magnitude wins as as for a 1000 chars: from 2-6 times faster when branch predictor is helpful to 27 times when it's not.
Same plots, only simd versions:
Here we can see about a 10% win from using 256 bit registers and splitting them in 2 128 bit ones: about 10% faster. In size it grows from 88 to 129 instructions, which is not a lot, so might make sense depending on your use-case. For base-line - non-simd version is 79 instructions (as far as I know - these are smaller then SIMD ones though).
10'000 bytes worth of data, 5'000 shorts
From 20% to 9 times win, depending on the data distributions. Not showing the comparison between 256 and 128 bit registers - it's almost the same assembly as for chars and the same win for 256 bit one of about 10%.
10'000 bytes worth of data, 2'500 ints
Seems to make a lot of sense to use 256 bit registers, this version is about 2 times faster compared to 128 bit registers. When comparing with non-simd code - from a 20% win with a perfect branch prediction to 3.5 - 4 times as soon as it's not.
Conclusion: when you have a sufficient amount of data (at least 1000 bytes) this can be a very worthwhile optimisation for a modern processor without AVX-512
PS:
On percentage of elements to remove
On one hand it's uncommon to filter half of your elements. On the other hand a similar algorithm can be used in partition during sorting => that is actually expected to have ~50% branch selection.
Code alignment impact
The question is: how much worth it is, if the code happens to be poorly aligned
(generally speaking - there is very little one can do about it).
I'm only showing for 10'000 bytes.
The plots have two lines for min and for max for each percentage point (meaning - it's not one best/worst code alignment - it's the best code alignment for a given percentage).
Code alignment impact - non-simd
Chars:
From 15-20% for poor branch prediction to 2-3 times when branch prediction helped a lot. (branch predictor is known to be affected by code alignment).
Shorts:
For some reason - the 0 percent is not affected at all. It can be explained by std::remove first doing linear search to find the first element to remove. Apparently linear search for shorts is not affected.
Other then that - from 10% to 1.6-1.8 times worth
Ints:
Same as for shorts - no 0s is not affected. As soon as we go into remove part it goes from 1.3 times to 5 times worth then the best case alignment.
Code alignment impact - simd versions
Not showing shorts and ints 128, since it's almost the same assembly as for chars
Chars - 128 bit register
About 1.2 times slower
Chars - 256 bit register
About 1.1 - 1.24 times slower
Ints - 256 bit register
1.25 - 1.35 times slower
We can see that for simd version of the algorithm, code alignment has significantly less impact compared to non-simd version. I suspect that this is due to practically not having branches.

In case anyone is interested here is a solution for SSE2 which uses an instruction LUT instead of a data LUT aka a jump table. With AVX this would need 256 cases though.
Each time you call LeftPack_SSE2 below it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases don't need to modify the vector.
static inline __m128 LeftPack_SSE2(__m128 val, int mask) {
switch(mask) {
case 0:
case 1: return val;
case 2: return _mm_shuffle_ps(val,val,0x01);
case 3: return val;
case 4: return _mm_shuffle_ps(val,val,0x02);
case 5: return _mm_shuffle_ps(val,val,0x08);
case 6: return _mm_shuffle_ps(val,val,0x09);
case 7: return val;
case 8: return _mm_shuffle_ps(val,val,0x03);
case 9: return _mm_shuffle_ps(val,val,0x0c);
case 10: return _mm_shuffle_ps(val,val,0x0d);
case 11: return _mm_shuffle_ps(val,val,0x34);
case 12: return _mm_shuffle_ps(val,val,0x0e);
case 13: return _mm_shuffle_ps(val,val,0x38);
case 14: return _mm_shuffle_ps(val,val,0x39);
case 15: return val;
}
}
__m128 foo(__m128 val, __m128 maskv) {
int mask = _mm_movemask_ps(maskv);
return LeftPack_SSE2(val, mask);
}

This is perhaps a bit late though I recently ran into this exact problem and found an alternative solution which used a strictly AVX implementation. If you don't care if unpacked elements are swapped with the last elements of each vector, this could work as well. The following is an AVX version:
inline __m128 left_pack(__m128 val, __m128i mask) noexcept
{
const __m128i shiftMask0 = _mm_shuffle_epi32(mask, 0xA4);
const __m128i shiftMask1 = _mm_shuffle_epi32(mask, 0x54);
const __m128i shiftMask2 = _mm_shuffle_epi32(mask, 0x00);
__m128 v = val;
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask0);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask1);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask2);
return v;
}
Essentially, each element in val is shifted once to the left using the bitfield, 0xF9 for blending with it's unshifted variant. Next, both shifted and unshifted versions are blended against the input mask (which has the first non-zero element broadcast across the remaining elements 3 and 4). Repeat this process two more times, broadcasting the second and third elements of mask to its subsequent elements on each iteration and this should provide an AVX version of the _pdep_u32() BMI2 instruction.
If you don't have AVX, you can easily swap out each _mm_permute_ps() with _mm_shuffle_ps() for an SSE4.1-compatible version.
And if you're using double-precision, here's an additional version for AVX2:
inline __m256 left_pack(__m256d val, __m256i mask) noexcept
{
const __m256i shiftMask0 = _mm256_permute4x64_epi64(mask, 0xA4);
const __m256i shiftMask1 = _mm256_permute4x64_epi64(mask, 0x54);
const __m256i shiftMask2 = _mm256_permute4x64_epi64(mask, 0x00);
__m256d v = val;
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask0);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask1);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask2);
return v;
}
Additionally _mm_popcount_u32(_mm_movemask_ps(val)) can be used to determine the number of elements which remained after the left-packing.

Comparing two pairs of 4 variables and returning the number of matches?

Given the following struct:
struct four_points {
uint32_t a, b, c, d;
}
What would be the absolute fastest way to compare two such structures and return the number of variables that match (in any position)?
For example:
four_points s1 = {0, 1, 2, 3};
four_points s2 = {1, 2, 3, 4};
I'd be looking for a result of 3, since three numbers match between the two structs. However, given the following:
four_points s1 = {1, 0, 2, 0};
four_points s2 = {0, 1, 9, 7};
Then I'd expect a result of only 2, because only two variables match between either struct (despite there being two zeros in the first).
I've figured out a few rudimentary systems for performing the comparison, but this is something that is going to be called a couple million times in a short time span and needs to be relatively quick. My current best attempt was to use a sorting network to sort all four values for either input, then loop over the sorted values and keep a tally of the values that are equal, advancing the current index of either input accordingly.
Is there any kind of technique that might be able to perform better then a sort & iteration?

On modern CPUs, sometimes brute force applied properly is the way to go. The trick is writing code that isn't limited by instruction latencies, just throughput.
Are duplicates common? If they're very rare, or have a pattern, using a branch to handle them makes the common case faster. If they're really unpredictable, it's better to do something branchless. I was thinking about using a branch to check for duplicates between positions where they're rare, and going branchless for the more common place.
Benchmarking is tricky because a version with branches will shine when tested with the same data a million times, but will have lots of branch mispredicts in real use.
I haven't benchmarked anything yet, but I have come up with a version that skips duplicates by using OR instead of addition to combine found-matches. It compiles to nice-looking x86 asm that gcc fully unrolls. (no conditional branches, not even loops).
Here it is on godbolt. (g++ is dumb and uses 32bit operations on the output of x86 setcc, which only sets the low 8 bits. This partial-register access will produce slowdowns. And I'm not even sure it ever zeroes the upper 24bits at all... Anyway, the code from gcc 4.9.2 looks good, and so does clang on godbolt)
// 8-bit types used because x86's setcc instruction only sets the low 8 of a register
// leaving the other bits unmodified.
// Doing a 32bit add from that creates a partial register slowdown on Intel P6 and Sandybridge CPU families
// Also, compilers like to insert movzx (zero-extend) instructions
// because I guess they don't realize the previous high bits are all zero.
// (Or they're tuning for pre-sandybridge Intel, where the stall is worse than SnB inserting the extra uop itself).
// The return type is 8bit because otherwise clang decides it should generate
// things as 32bit in the first place, and does zero-extension -> 32bit adds.
int8_t match4_ordups(const four_points *s1struct, const four_points *s2struct)
{
const int32_t *s1 = &s1struct->a; // TODO: check if this breaks aliasing rules
const int32_t *s2 = &s2struct->a;
// ignore duplicates by combining with OR instead of addition
int8_t matches = 0;
for (int j=0 ; j<4 ; j++) {
matches |= (s1[0] == s2[j]);
}
for (int i=1; i<4; i++) { // i=0 iteration is broken out above
uint32_t s1i = s1[i];
int8_t notdup = 1; // is s1[i] a duplicate of s1[0.. i-1]?
for (int j=0 ; j<i ; j++) {
notdup &= (uint8_t) (s1i != s1[j]); // like dup |= (s1i == s1[j]); but saves a NOT
}
int8_t mi = // match this iteration?
(s1i == s2[0]) |
(s1i == s2[1]) |
(s1i == s2[2]) |
(s1i == s2[3]);
// gcc and clang insist on doing 3 dependent OR insns regardless of parens, not that it matters
matches += mi & notdup;
}
return matches;
}
// see the godbolt link for a main() simple test harness.
On a machine with 128b vectors that can work with 4 packed 32bit integers (e.g. x86 with SSE2), you can broadcast each element of s1 to its own vector, deduplicate, and then do 4 packed-compares. icc does something like this to autovectorize my match4_ordups function (check it out on godbolt.)
Store the compare results back to integer registers with movemask, to get a bitmap of which elements compared equal. Popcount those bitmaps, and add the results.
This led me to a better idea: Getting all the compares done with only 3 shuffles with element-wise rotation:
{ 1d 1c 1b 1a }
== == == == packed-compare with
{ 2d 2c 2b 2a }
{ 1a 1d 1c 1b }
== == == == packed-compare with
{ 2d 2c 2b 2a }
{ 1b 1a 1d 1c } # if dups didn't matter: do this shuffle on s2
== == == == packed-compare with
{ 2d 2c 2b 2a }
{ 1c 1b 1a 1d } # if dups didn't matter: this result from { 1a ... }
== == == == packed-compare with
{ 2d 2c 2b 2a } { 2b ...
That's only 3 shuffles, and still does all 16 comparisons. The trick is combining them with ORs where we need to merge duplicates, and then being able to count them efficiently. A packed-compare outputs a vector with each element = zero or -1 (all bits set), based on the comparison between the two elements in that position. It's designed to make a useful operand to AND or XOR to mask off some vector elements, e.g. to make v1 += v2 & mask conditional on a per-element basis. It also works as just a boolean truth value.
All 16 compares with only 2 shuffles is possible by rotating one vector by two, and the other vector by one, and then comparing between the four shifted and unshifted vectors. This would be great if we didn't need to eliminate dups, but since we do, it matters which results are where. We're not just adding all 16 comparison results.
OR together the packed-compare results down to one vector. Each element will be set based on whether that element of s2 had any matches in s1. int _mm_movemask_ps (__m128 a) to turn the vector into a bitmap, then popcount the bitmap. (Nehalem or newer CPU required for popcnt, otherwise fall back to a version with a 4-bit lookup table.)
The vertical ORs take care of duplicates in s1, but duplicates in s2 is a less obvious extension, and would take more work. I did eventually think of a way that was less than twice as slow (see below).
#include <stdint.h>
#include <immintrin.h>
typedef struct four_points {
int32_t a, b, c, d;
} four_points;
//typedef uint32_t four_points[4];
// small enough to inline, only 62B of x86 instructions (gcc 4.9.2)
static inline int match4_sse_noS2dup(const four_points *s1pointer, const four_points *s2pointer)
{
__m128i s1 = _mm_loadu_si128((__m128i*)s1pointer);
__m128i s2 = _mm_loadu_si128((__m128i*)s2pointer);
__m128i s1b= _mm_shuffle_epi32(s1, _MM_SHUFFLE(0, 3, 2, 1));
// no shuffle needed for first compare
__m128i match = _mm_cmpeq_epi32(s1 , s2); //{s1.d==s2.d?-1:0, 1c==2c, 1b==2b, 1a==2a }
__m128i s1c= _mm_shuffle_epi32(s1, _MM_SHUFFLE(1, 0, 3, 2));
s1b = _mm_cmpeq_epi32(s1b, s2);
match = _mm_or_si128(match, s1b); // merge dups by ORing instead of adding
// note that we shuffle the original vector every time
// multiple short dependency chains are better than one long one.
__m128i s1d= _mm_shuffle_epi32(s1, _MM_SHUFFLE(2, 1, 0, 3));
s1c = _mm_cmpeq_epi32(s1c, s2);
match = _mm_or_si128(match, s1c);
s1d = _mm_cmpeq_epi32(s1d, s2);
match = _mm_or_si128(match, s1d); // match = { s2.a in s1?, s2.b in s1?, etc. }
// turn the the high bit of each 32bit element into a bitmap of s2 elements that have matches anywhere in s1
// use float movemask because integer movemask does 8bit elements.
int matchmask = _mm_movemask_ps (_mm_castsi128_ps(match));
return _mm_popcnt_u32(matchmask); // or use a 4b lookup table for CPUs with SSE2 but not popcnt
}
See the version that eliminates duplicates in s2 for the same code with lines in a more readable order. I tried to schedule instructions in case the CPU was only just barely decoding instructions ahead of what was executing, but gcc puts the instructions in the same order regardless of what order you put the intrinsics in.
This is extremely fast, if there isn't a store-forwarding stall in the 128b loads. If you just wrote the struct with four 32bit stores, running this function within the next several clock cycles will produce a stall when it tries to load the whole struct with a 128b load. See Agner Fog's site. If calling code already has many of the 8 values in registers, the scalar version could be a win, even though it'll be slower for a microbenchmark test that only reads the structs from memory.
I got lazy on cycle-counting for this, since dup-handling isn't done yet. IACA says Haswell can run it with a throughput of one iteration per 4.05 clock cycles, and latency of 17 cycles (Not sure if that's including the memory latency of the loads. There's a lot of instruction-level parallelism available, and all the instructions have single-cycle latency, except for movmsk(2) and popcnt(3)). It's slightly slower without AVX, because gcc chooses a worse instruction ordering, and still wastes a movdqa instruction copying a vector register.
With AVX2, this could do two match4 operations in parallel, in 256b vectors. AVX2 usually works as two 128b lanes, rather than full 256b vectors. Setting up your code to be able to take advantage of 2 or 4 (AVX-512) match4 operations in parallel will give you gains when you can compile for those CPUs. It's not essential for both the s1s or s2s to be stored contiguously so a single 32B load can get two structs. AVX2 has a fairly fast load 128b to the upper lane of a register.
Handling duplicates in s2
Maybe compare s2 to a shifted instead of rotated version of itself.
#### comparing S2 with itself to mask off duplicates
{ 0 2d 2c 2b }
{ 2d 2c 2b 2a } == == ==
{ 0 0 2d 2c }
{ 2d 2c 2b 2a } == ==
{ 0 0 0 2d }
{ 2d 2c 2b 2a } ==
Hmm, if zero can occur as a regular element, we may need to byte-shift after the compare as well, to turn potential false positives into zeros. If there was a sentinel value that couldn't appear in s1, you could shift in elements of that, instead of 0. (SSE has PALIGNR, which gives you any contiguous 16B window you want of the contents of two registers appended. Named for the use-case of simulating an unaligned load from two aligned loads. So you'd have a constant vector of that element.)
update: I thought of a nice trick that avoids the need for an identity element. We can actually get all 6 necessary s2 vs. s2 comparisons to happen with just two vector compares, and then combine the results.
Doing the same compare in the same place in two vectors lets you OR two results together without having to mask before the OR. (Works around the lack of a sentinel value).
Shuffling the output of the compares instead of extra shuffle&compare of S2. This means we can get d==a done next to the other compares.
Notice that we aren't limited to shuffling whole elements around. Byte-wise shuffle to get bytes from different compare results into a single vector element, and compare that against zero. (This saves less than I'd hoped, see below).
Checking for duplicates is a big slowdown (esp. in throughput, not so much in latency). So you're still best off arranging for a sentinel value in s2 that will never match any s1 element, which you say is possible. I only present this because I thought it was interesting. (And gives you an option in case you need a version that doesn't require sentinels sometime.)
static inline
int match4_sse(const four_points *s1pointer, const four_points *s2pointer)
{
// IACA_START
__m128i s1 = _mm_loadu_si128((__m128i*)s1pointer);
__m128i s2 = _mm_loadu_si128((__m128i*)s2pointer);
// s1a = unshuffled = s1.a in the low element
__m128i s1b= _mm_shuffle_epi32(s1, _MM_SHUFFLE(0, 3, 2, 1));
__m128i s1c= _mm_shuffle_epi32(s1, _MM_SHUFFLE(1, 0, 3, 2));
__m128i s1d= _mm_shuffle_epi32(s1, _MM_SHUFFLE(2, 1, 0, 3));
__m128i match = _mm_cmpeq_epi32(s1 , s2); //{s1.d==s2.d?-1:0, 1c==2c, 1b==2b, 1a==2a }
s1b = _mm_cmpeq_epi32(s1b, s2);
match = _mm_or_si128(match, s1b); // merge dups by ORing instead of adding
s1c = _mm_cmpeq_epi32(s1c, s2);
match = _mm_or_si128(match, s1c);
s1d = _mm_cmpeq_epi32(s1d, s2);
match = _mm_or_si128(match, s1d);
// match = { s2.a in s1?, s2.b in s1?, etc. }
// s1 vs s2 all done, now prepare a mask for it based on s2 dups
/*
* d==b c==a b==a d==a #s2b
* d==c c==b b==a d==a #s2c
* OR together -> s2bc
* d==abc c==ba b==a 0 pshufb(s2bc) (packed as zero or non-zero bytes within the each element)
* !(d==abc) !(c==ba) !(b==a) !0 pcmpeq setzero -> AND mask for s1_vs_s2 match
*/
__m128i s2b = _mm_shuffle_epi32(s2, _MM_SHUFFLE(1, 0, 0, 3));
__m128i s2c = _mm_shuffle_epi32(s2, _MM_SHUFFLE(2, 1, 0, 3));
s2b = _mm_cmpeq_epi32(s2b, s2);
s2c = _mm_cmpeq_epi32(s2c, s2);
__m128i s2bc= _mm_or_si128(s2b, s2c);
s2bc = _mm_shuffle_epi8(s2bc, _mm_set_epi8(-1,-1,0,12, -1,-1,-1,8, -1,-1,-1,4, -1,-1,-1,-1));
__m128i dupmask = _mm_cmpeq_epi32(s2bc, _mm_setzero_si128());
// see below for alternate insn sequences that can go here.
match = _mm_and_si128(match, dupmask);
// turn the the high bit of each 32bit element into a bitmap of s2 matches
// use float movemask because integer movemask does 8bit elements.
int matchmask = _mm_movemask_ps (_mm_castsi128_ps(match));
int ret = _mm_popcnt_u32(matchmask); // or use a 4b lookup table for CPUs with SSE2 but not popcnt
// IACA_END
return ret;
}
This requires SSSE3 for pshufb. It and a pcmpeq (and a pxor to generate a constant) are replacing a shuffle (bslli(s2bc, 12)), an OR, and an AND.
d==bc c==ab b==a a==d = s2b|s2c
d==a 0 0 0 = byte-shift-left(s2b) = s2d0
d==abc c==ab b==a a==d = s2abc
d==abc c==ab b==a 0 = mask(s2abc). Maybe use PBLENDW or MOVSS from s2d0 (which we know has zeros) to save loading a 16B mask.
__m128i s2abcd = _mm_or_si128(s2b, s2c);
//s2bc = _mm_shuffle_epi8(s2bc, _mm_set_epi8(-1,-1,0,12, -1,-1,-1,8, -1,-1,-1,4, -1,-1,-1,-1));
//__m128i dupmask = _mm_cmpeq_epi32(s2bc, _mm_setzero_si128());
__m128i s2d0 = _mm_bslli_si128(s2b, 12); // d==a 0 0 0
s2abcd = _mm_or_si128(s2abcd, s2d0);
__m128i dupmask = _mm_blend_epi16(s2abcd, s2d0, 0 | (2 | 1));
//__m128i dupmask = _mm_and_si128(s2abcd, _mm_set_epi32(-1, -1, -1, 0));
match = _mm_andnot_si128(dupmask, match); // ~dupmask & match; first arg is the one that's inverted
I can't recommend MOVSS; it will incur extra latency on AMD because it runs in the FP domain. PBLENDW is SSE4.1. popcnt is available on AMD K10, but PBLENDW isn't (some Barcelona-core PhenomII CPUs are probably still in use). Actually, K10 doesn't have PSHUFB either, so just require SSE4.1 and POPCNT, and use PBLENDW. (Or use the PSHUFB version, unless it's going to cache-miss a lot.)
Another option to avoid a loading a vector constant from memory is to movemask s2bc, and use integer instead of vector ops. However, it looks like that'll be slower, because the extra movemask isn't free, and integer ANDN isn't usable. BMI1 didn't appear until Haswell, and even Skylake Celerons and Pentiums won't have it. (Very annoying, IMO. It means compilers can't start using BMI for even longer.)
unsigned int dupmask = _mm_movemask_ps(cast(s2bc));
dupmask |= dupmask << 3; // bit3 = d==abc. garbage in bits 4-6, careful if using AVX2 to do two structs at once
// only 2 instructions. compiler can use lea r2, [r1*8] to copy and scale
dupmask &= ~1; // clear the low bit
unsigned int matchmask = _mm_movemask_ps(cast(match));
matchmask &= ~dupmask; // ANDN is in BMI1 (Haswell), so this will take 2 instructions
return _mm_popcnt_u32(matchmask);
AMD XOP's VPPERM (pick bytes from any element of two source registers) would let the byte-shuffle replace the OR that merges s2b and s2c as well.
Hmm, pshufb isn't saving me as much as I thought, because it requires a pcmpeqd, and a pxor to zero a register. It's also loading its shuffle mask from a constant in memory, which can miss in the D-cache. It is the fastest version I've come up with, though.
If inlined into a loop, the same zeroed register could be used, saving one instruction. However, OR and AND can run on port0 (Intel CPUs), which can't run shuffle or compare instructions. The PXOR doesn't use any execution ports, though (on Intel SnB-family microarchitecture).
I haven't run real benchmarks of any of these, only IACA.
The PBLENDW and PSHUFB versions have the same latency (22 cycles, compiled for non-AVX), but the PSHUFB version has better throughput (one per 7.1c, vs. one per 7.4c, because PBLENDW needs the shuffle port, and there's already a lot of contention for it.) IACA says the version using PANDN with a constant instead of PBLENDW is also one-per-7.4c throughput, disappointingly. Port0 isn't saturated, so IDK why it's as slow as PBLENDW.
Old ideas that didn't pan out.
Leaving them in for the benefit of people looking for things to try when using vectors for related things.
Dup-checking s2 with vectors is more work than checking s2 vs. s1, because one compare is as expensive as 4 if done with vectors. The shuffling or masking needed after the compare, to remove false positives if there's no sentinel value, is annoying.
Ideas so far:
Shift s2 over by an element, and compare it to itself. Mask off false positives from shifting in 0. Vertically OR these together, and use it to ANDN the s1 vs s2 vector.
scalar code to do the smaller number of s2 vs. itself comparisons, building a bitmask to use before popcnt.
Broadcast s2.d and check it against s2 (all positions). But that puts the results horizontally in one vector, instead of vertically in 3 vectors. To use that, maybe PTEST / SETCC to make a mask for the bitmap (to apply before popcount). (PTEST with a mask of _mm_setr_epi32(0, -1, -1, -1), to only test the c,b,a, not d==d). Do (c==a | c==b) and b==a with scalar code, and combine that into a mask. Intel Haswell and later have 4 ALU execution ports, but only 3 of them can run vector instructions, so some scalar code in the mix could fill port6. AMD has even more separation between vector and integer execution resources.
shuffle s2 to get all the necessary comparisons done somehow, then shuffle the outputs. Maybe use movemask -> 4-bit lookup table for something?

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight