The best way to shift a __m128i?

The best way to shift a __m128i? - c

I need to shift a __m128i variable, (say v), by m bits, in such a way that bits move through all of the variable (So, the resulting variable represents v*2^m).
What is the best way to do this?!
Note that _mm_slli_epi64 shifts v0 and v1 seperately:
r0 := v0 << count
r1 := v1 << count
so the last bits of v0 missed, but I want to move those bits to r1.
Edit:
I looking for a code, faster than this (m<64):
r0 = v0 << m;
r1 = v0 >> (64-m);
r1 ^= v1 << m;
r2 = v1 >> (64-m);

For compile-time constant shift counts, you can get fairly good results. Otherwise not really.
This is just an SSE implementation of the r0 / r1 code from your question, since there's no other obvious way to do it. Variable-count shifts are only available for bit-shifts within vector elements, not for byte-shifts of the whole register. So we just carry the low 64bits up to the high 64 and use a variable-count shift to put them in the right place.
// untested
#include <immintrin.h>
/* some compilers might choke on slli / srli with non-compile-time-constant args
* gcc generates the xmm, imm8 form with constants,
* and generates the xmm, xmm form with otherwise. (With movd to get the count in an xmm)
*/
// doesn't optimize for the special-case where count%8 = 0
// could maybe do that in gcc with if(__builtin_constant_p(count)) { if (!count%8) return ...; }
__m128i mm_bitshift_left(__m128i x, unsigned count)
{
__m128i carry = _mm_bslli_si128(x, 8); // old compilers only have the confusingly named _mm_slli_si128 synonym
if (count >= 64)
return _mm_slli_epi64(carry, count-64); // the non-carry part is all zero, so return early
// else
carry = _mm_srli_epi64(carry, 64-count); // After bslli shifted left by 64b
x = _mm_slli_epi64(x, count);
return _mm_or_si128(x, carry);
}
__m128i mm_bitshift_left_3(__m128i x) { // by a specific constant, to see inlined constant version
return mm_bitshift_left(x, 3);
}
// by a specific constant, to see inlined constant version
__m128i mm_bitshift_left_100(__m128i x) { return mm_bitshift_left(x, 100); }
I thought this was going to be less convenient than it turned out to be. _mm_slli_epi64 works on gcc/clang/icc even when the count is not a compile-time constant (generating a movd from integer reg to xmm reg). There is a _mm_sll_epi64 (__m128i a, __m128i count) (note the lack of i), but at least these days, the i intrinsic can generate either form of psllq.
The compile-time-constant count versions are fairly efficient, compiling to 4 instructions (or 5 without AVX):
mm_bitshift_left_3(long long __vector(2)):
vpslldq xmm1, xmm0, 8
vpsrlq xmm1, xmm1, 61
vpsllq xmm0, xmm0, 3
vpor xmm0, xmm0, xmm1
ret
Performance:
This has 3 cycle latency (vpslldq(1) -> vpsrlq(1) -> vpor(1)) on Intel SnB/IvB/Haswell, with throughput limited to one per 2 cycles (saturating the vector shift unit on port 0). Byte-shift runs on the shuffle unit on a different port. Immediate-count vector shifts are all single-uop instructions, so this is only 4 fused-domain uops taking up pipeline space when mixed in with other code. (Variable-count vector shifts are 2 uop, 2 cycle latency, so the variable-count version of this function is worse than it looks from counting instructions.)
Or for counts >= 64:
mm_bitshift_left_100(long long __vector(2)):
vpslldq xmm0, xmm0, 8
vpsllq xmm0, xmm0, 36
ret
If your shift-count is not a compile-time constant, you have to branch on count > 64 to figure out whether to left or right shift the carry. I believe the shift count is interpreted as an unsigned integer, so a negative count is impossible.
It also takes extra instructions to get the int count and 64-count into vector registers. Doing this in a branchless fashion with vector compares and a blend instruction might be possible, but a branch is probably a good idea.
The variable-count version for __uint128_t in GP registers looks fairly good; better than the SSE version. Clang does a slightly better job than gcc, emitting fewer mov instructions, but it still uses two cmov instructions for the count >= 64 case. (Because x86 integer shift instructions mask the count, instead of saturating.)
__uint128_t leftshift_int128(__uint128_t x, unsigned count) {
return x << count; // undefined if count >= 128
}

In SSE4.A the instructions insrq and extrq can be used to shift (and rotate) through __mm128i 1-64 bits at a time. Unlike the 8/16/32/64 bit counterparts pextrN/pinsrX, these instructions select or insert m bits (between 1 and 64) at any bit offset from 0 to 127. The caveat is that the sum of lenght and offset must not exceed 128.

Related

AVX512 - How to move all set bits to the right?

How can I move all set bits of mask register to right? (To the bottom, least-significant position).
For example:
__mmask16 mask = _mm512_cmpeq_epi32_mask(vload, vlimit); // mask = 1101110111011101
If we move all set bits to the right, we will get: 1101110111011101 -> 0000111111111111
How can I achieve this efficiently?
Below you can see how I tried to get the same result, but it's inefficient:
__mmask16 mask = 56797;
// mask: 1101110111011101
__m512i vbrdcast = _mm512_maskz_broadcastd_epi32(mask, _mm_set1_epi32(~0));
// vbrdcast: -1 0 -1 -1 -1 0 -1 -1 -1 0 -1 -1 -1 0 -1 -1
__m512i vcompress = _mm512_maskz_compress_epi32(mask, vbrdcast);
// vcompress:-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 0 0
__mmask16 right_packed_mask = _mm512_movepi32_mask(vcompress);
// right_packed_mask: 0000111111111111
What is the best way to do this?

BMI2 pext is the scalar bitwise equivalent of v[p]compressd/q/ps/pd.
Use it on your mask value to left-pack them to the bottom of the value.
mask = _pext_u32(-1U, mask); // or _pext_u64(-1ULL, mask64) for __mmask64
// costs 3 asm instructions (kmov + pext + kmov) if you need to use the result as a mask
// not including putting -1 in a register.
Implicit conversion between __mmask16 (aka uint16_t in GCC) and uint32_t works.
Use _cvtu32_mask16 and _cvtu32_mask16 to make the KMOVW explicit if you like.
See How to unset N right-most set bits for more about using pext/pdep in ways like this.
All current CPUs with AVX-512 also have fast BMI2 pext (including Xeon Phi), same performance as popcnt. AMD had slow pext until Zen 3, but if/when AMD ever introduces an AVX-512 CPU it should have fast pext/pdep.
For earlier AMD without AVX512, you might want (1ULL << __builtin_popcount(mask)) - 1, but be careful of overflow if all bits are set. 1ULL << 64 is undefined behaviour, and likely to produce 1 not 0 when compiled for x86-64.
If you were going to use vpcompressd, note that the source vector can simply be all-ones _mm512_set1_epi32(-1); compress doesn't care about elements where the mask was zero, they don't need to already be zero.
(It doesn't matter which -1s you pack; once you're working with boolean values, there's no difference between a true that came from your original bitmask vs. a constant true that was just sitting there which you generated more cheaply, without a dependency on your input mask. Same reasoning applies for pext, why you can use -1U as the source data instead of a pdep. i.e. a -1 or set bit doesn't have an identity; it's the same as any other -1 or set bit).
So let's try both ways and see how good/bad the asm is.
inline
__mmask16 leftpack_k(__mmask16 mask){
return _pdep_u32(-1U, mask);
}
inline
__mmask16 leftpack_comp(__mmask16 mask) {
__m512i v = _mm512_maskz_compress_epi32(mask, _mm512_set1_epi32(-1));
return _mm512_movepi32_mask(v);
}
Looking at stand-alone versions of these isn't useful because __mmask16 is a typedef for unsigned short, and is thus passed/returned in integer registers, not k registers. That makes the pext version look very good, of course, but we want to see how it inlines into a case where we generate and use the mask with AVX-512 intrinsics.
// not a useful function, just something that compiles to asm in an obvious way
void use_leftpack_compress(void *dst, __m512i v){
__mmask16 m = _mm512_test_epi32_mask(v,v);
m = leftpack_comp(m);
_mm512_mask_storeu_epi32(dst, m, v);
}
Commenting out the m = pack(m), this is just a simple 2 instructions that generate and then use a mask.
use_mask_nocompress(void*, long long __vector(8)):
vptestmd k1, zmm0, zmm0
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
So any extra instructions will be due to left-packing (compressing) the mask. GCC and clang make the same asm as each other, differing only in clang avoiding kmovw in favour of always kmovd. Godbolt
# GCC10.3 -O3 -march=skylake-avx512
use_leftpack_k(void*, long long __vector(8)):
vptestmd k0, zmm0, zmm0
mov eax, -1 # could be hoisted out of a loop
kmovd edx, k0
pdep eax, eax, edx
kmovw k1, eax
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
use_leftpack_compress(void*, long long __vector(8)):
vptestmd k1, zmm0, zmm0
vpternlogd zmm2, zmm2, zmm2, 0xFF # set1(-1) could be hoisted out of a loop
vpcompressd zmm1{k1}{z}, zmm2
vpmovd2m k1, zmm1
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
So the non-hoistable parts are
kmov r,k (port 0) / pext (port 1) / kmov k,r (port 5) = 3 uops, one for each execution port. (Including port 1, which has its vector ALUs shut down while 512-bit uops are in flight). The kmov/kmov round trip has 4 cycle latency on SKX, and pext is 3 cycle latency, for a total of 7 cycle latency.
vpcompressd zmm{k}{z}, z (2 p5) / vpmovd2m (port 0) = 3 uops, two for port 5. vpmovd2m has 3 cycle latency on SKX / ICL, and vpcompressd-zeroing-into-zmm has 6 cycle from the k input to the zmm output (SKX and ICL). So a total of 9 cycle latency, and worse port distribution for the uops.
Also, the hoistable part is generally worse (vpternlogd is longer and competes for fewer ports than mov r32, imm32), unless your function already needs an all-ones vector for something but not an all-ones register.
Conclusion: the BMI2 pext way is not worse in any way, and better in several. (Unless surrounding code heavily bottlenecked on port 1 uops, which is very unlikely if using 512-bit vectors because in that case it can only be running scalar integer uops like 3-cycle LEA, IMUL, LZCNT, and of course simple 1-cycle integer stuff like add/sub/and/or).

How can I generate a 256 bit mask

I have an array of uint64_t[4], and I need to generate a mask,
such that the array, if it were a 256-bit integer, equals
(1 << w) - 1, where w goes from 1 to 256.
The best thing I have come up with is branchless, but it takes MANY instructions. It is in Zig because Clang doesn't seem to expose llvm's saturating subtraction. http://localhost:10240/z/g8h1rV
Is there a better way to do this?
var mask: [4]u64 = undefined;
for (mask) |_, i|
mask[i] = 0xffffffffffffffff;
mask[3] ^= ((u64(1) << #intCast(u6, (inner % 64) + 1)) - 1) << #intCast(u6, 64 - (inner % 64));
mask[2] ^= ((u64(1) << #intCast(u6, (#satSub(u32, inner, 64) % 64) + 1)) - 1) << #intCast(u6, 64 - (inner % 64));
mask[1] ^= ((u64(1) << #intCast(u6, (#satSub(u32, inner, 128) % 64) + 1)) - 1) << #intCast(u6, 64 - (inner % 64));
mask[0] ^= ((u64(1) << #intCast(u6, (#satSub(u32, inner, 192) % 64) + 1)) - 1) << #intCast(u6, 64 - (inner % 64));

Are you targeting x86-64 with AVX2 for 256-bit vectors? I thought that was an interesting case to answer for.
If so, you can do this in a few instructions using saturating subtraction and a variable count shift.
x86 SIMD shifts like vpsrlvq saturate the shift count, shifting all the bits out when the count is >= element width. Unlike integer shifts the shift count is masked (and thus wraps around).
For the lowest u64 element, starting with all-ones we need to leave it unmodified for bitpos >= 64. Or for smaller bit positions, right-shift it by 64-bitpos. Unsigned saturating subtraction looks like the way to go here, as you observed, to create a shift count of 0 for larger bitpos. But x86 only has SIMD saturating subtraction, and only for byte or word elements. But if we don't care about bitpos > 256, that's fine we can use 16-bit elements at the bottom of each u64, and let a 0-0 happen in the rest of the u64.
Your code looks pretty overcomplicated, creating (1<<n) - 1 and XORing. I think it's a lot easier to just use a variable-count shift on the 0xFFFF...FF elements directly.
I don't know Zig, so do whatever you have to to get it to emit asm like this. Hopefully this is useful because you tagged this assembly; should be easy to translate to intrinsics for C, or Zig if it has them.
default rel
section .rodata
shift_offsets: dw 64, 128, 192, 256 ; 16-bit elements, to be loaded with zero-extension to 64
section .text
pos_to_mask256:
vpmovzxwq ymm2, [shift_offsets] ; _mm256_set1_epi64x(256, 192, 128, 64)
vpcmpeqd ymm1, ymm1,ymm1 ; ymm1 = all-ones
; set up vector constants, can be hoisted
vmovd xmm0, edi
vpbroadcastq ymm0, xmm0 ; ymm0 = _mm256_set1_epi64(bitpos)
vpsubusw ymm0, ymm2, ymm0 ; ymm0 = {256,192,128,64}-bitpos with unsigned saturation
vpsrlvq ymm0, ymm1, ymm0 ; mask[i] >>= count, where counts >= 64 create 0s.
ret
If the input integer starts in memory, you can of course efficiently broadcast-load it into a ymm register directly.
The shift-offsets vector can of course be hoisted out of a loop, as can the all-ones.
With input = 77, the high 2 elements are zeroed by shifts of 256-77=179, and 192-77=115 bits. Tested with NASM + GDB for EDI=77, and the result is
(gdb) p /x $ymm0.v4_int64
{0xffffffffffffffff, 0x1fff, 0x0, 0x0}
GDB prints low element first, opposite of Intel notation / diagrams. This vector is actually 0, 0, 0x1fff, 0xffffffffffffffff, i.e. 64+13 = 77 one bits, and the rest all zeros. Other test cases
edi=0: mask = all-zero
edi=1: mask = 1
... : mask = edi one bits at the bottom, then zeros
edi=255: mask = all ones except for the top bit of the top element
edi=256: mask = all ones
edi>256: mask = all ones. (unsigned subtraction saturates to 0 everywhere.)
You need AVX2 for the variable-count shifts. psubusb/w is SSE2, so you could consider doing that part with SIMD and then go back to scalar integer for the shifts, or maybe just use SSE2 shifts for one element at a time. Like psrlq xmm1, xmm0 which takes the low 64 bits of xmm0 as the shift count for all elements of xmm1.
Most ISAs don't have saturating scalar subtraction. Some ARM CPUs do for scalar integer, I think, but x86 doesn't. IDK what you're using.
On x86 (and many other ISAs) you have 2 problems:
keep all-ones for low elements (either modify the shift result, or saturate shift count to 0)
produce 0 for high elements above the one containing the top bit of the mask. x86 scalar shifts can't do this at all, so you might feed the shift an input of 0 for that case. Maybe using cmov to create it based on flags set by sub for 192-w or something.
count = 192-w;
shift_input = count<0 ? 0 : ~0ULL;
shift_input >>= count & 63; // mask to avoid UB in C. Optimizes away on x86 where shr does this anyway.
Hmm, this doesn't handle saturating the subtraction to 0 to keep the all-ones, though.
If tuning for ISAs other than x86, maybe look at some other options. Or maybe there's something better on x86 as well. Creating the all-ones or all-zeros with sar reg,63 is an interesting option (broadcast the sign bit), but we actually need all-ones when 192-count has sign bit = 0.

Here's some Zig code that compiles and runs:
const std = #import("std");
noinline fn thing(x: u256) bool {
return x > 0xffffffffffffffff;
}
pub fn main() anyerror!void {
var num: u256 = 0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff;
while (thing(num)) {
num /= 2;
std.debug.print(".", .{});
}
std.debug.print("done\n", .{});
}
Zig master generates relatively clean x86 assembler from that.

Creating a mask with N least significant bits set

I would like to create a macro or function1 mask(n) which given a number n returns an unsigned integer with its n least significant bits set. Although this seems like it should be a basic primitive with heavily discussed implementations which compile efficiently - this doesn't seem to be the case.
Of course, various implementations may have different sizes for the primitive integral types like unsigned int, so let's assume for the sake of concreteness that we are talking returning a uint64_t specifically although of course an acceptable solutions would work (with different definitions) for any unsigned integral type. In particular, the solution should be efficient when the type returned is equal to or smaller than the platform's native width.
Critically, this must work for all n in [0, 64]. In particular mask(0) == 0 and mask(64) == (uint64_t)-1. Many "obvious" solutions don't work for one of these two cases.
The most important criteria is correctness: only correct solutions which don't rely on undefined behavior are interesting.
The second most important criteria is performance: the idiom should ideally compile to approximately the most efficient platform-specific way to do this on common platforms.
A solution that sacrifices simplicity in the name of performance, e.g., that uses different implementations on different platforms, is fine.
1 The most general case is a function, but ideally it would also work as a macro, without re-evaluating any of its arguments more than once.

Try
unsigned long long mask(const unsigned n)
{
assert(n <= 64);
return (n == 64) ? 0xFFFFFFFFFFFFFFFFULL :
(1ULL << n) - 1ULL;
}
There are several great, clever answers that avoid conditionals, but a modern compiler can generate code for this that doesn’t branch.
Your compiler can probably figure out to inline this, but you might be able to give it a hint with inline or, in C++, constexpr.
The unsigned long long int type is guaranteed to be at least 64 bits wide and present on every implementation, which uint64_t is not.
If you need a macro (because you need something that works as a compile-time constant), that might be:
#define mask(n) ((64U == (n)) ? 0xFFFFFFFFFFFFFFFFULL : (1ULL << (unsigned)(n)) - 1ULL)
As several people correctly reminded me in the comments, 1ULL << 64U is potential undefined behavior! So, insert a check for that special case.
You could replace 64U with CHAR_BITS*sizeof(unsigned long long) if it is important to you to support the full range of that type on an implementation where it is wider than 64 bits.
You could similarly generate this from an unsigned right shift, but you would still need to check n == 64 as a special case, since right-shifting by the width of the type is undefined behavior.
ETA:
The relevant portion of the (N1570 Draft) standard says, of both left and right bit shifts:
If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
This tripped me up. Thanks again to everyone in the comments who reviewed my code and pointed the bug out to me.

Another solution without branching
unsigned long long mask(unsigned n)
{
return ((1ULL << (n & 0x3F)) & -(n != 64)) - 1;
}
n & 0x3F keeps the shift amount to maximum 63 in order to avoid UB. In fact most modern architectures will just grab the lower bits of the shift amount, so no and instruction is needed for this.
The checking condition for 64 can be changed to -(n < 64) to make it return all ones for n ⩾ 64, which is equivalent to _bzhi_u64(-1ULL, (uint8_t)n) if your CPU supports BMI2.
The output from Clang looks better than gcc. As it happens gcc emits conditional instructions for MIPS64 and ARM64 but not for x86-64, resulting in longer output
The condition can also be simplified to n >> 6, utilizing the fact that it'll be one if n = 64. And we can subtract that from the result instead of creating a mask like above
return (1ULL << (n & 0x3F)) - (n == 64) - 1; // or n >= 64
return (1ULL << (n & 0x3F)) - (n >> 6) - 1;
gcc compiles the latter to
mov eax, 1
shlx rax, rax, rdi
shr edi, 6
dec rax
sub rax, rdi
ret
Some more alternatives
return ~((~0ULL << (n & 0x3F)) << (n == 64));
return ((1ULL << (n & 0x3F)) - 1) | (((uint64_t)n >> 6) << 63);
return (uint64_t)(((__uint128_t)1 << n) - 1); // if a 128-bit type is available
A similar question for 32 bits: Set last `n` bits in unsigned int

Here's one that is portable and conditional-free:
unsigned long long mask(unsigned n)
{
assert (n <= sizeof(unsigned long long) * CHAR_BIT);
return (1ULL << (n/2) << (n-(n/2))) - 1;
}

This is not an answer to the exact question. It only works if 0 isn't a required output, but is more efficient.
2n+1 - 1 computed without overflow. i.e. an integer with the low n bits set, for n = 0 .. all_bits
Possibly using this inside a ternary for cmov could be a more efficient solution to the full problem in the question. Perhaps based on a left-rotate of a number with the MSB set, instead of a left-shift of 1, to take care of the difference in counting for this vs. the question for the pow2 calculation.
// defined for n=0 .. sizeof(unsigned long long)*CHAR_BIT
unsigned long long setbits_upto(unsigned n) {
unsigned long long pow2 = 1ULL << n;
return pow2*2 - 1; // one more shift, and subtract 1.
}
Compiler output suggests an alternate version, good on some ISAs if you're not using gcc/clang (which already do this): bake in an extra shift count so it is possible for the initial shift to shift out all the bits, leaving 0 - 1 = all bits set.
unsigned long long setbits_upto2(unsigned n) {
unsigned long long pow2 = 2ULL << n; // bake in the extra shift count
return pow2 - 1;
}
The table of inputs / outputs for a 32-bit version of this function is:
n -> 1<<n -> *2 - 1
0 -> 1 -> 1 = 2 - 1
1 -> 2 -> 3 = 4 - 1
2 -> 4 -> 7 = 8 - 1
3 -> 8 -> 15 = 16 - 1
...
30 -> 0x40000000 -> 0x7FFFFFFF = 0x80000000 - 1
31 -> 0x80000000 -> 0xFFFFFFFF = 0 - 1
You could slap a cmov after it, or other way of handling an input that has to produce zero.
On x86, we can efficiently compute this with 3 single-uop instructions: (Or 2 uops for BTS on Ryzen).
xor eax, eax
bts rax, rdi ; rax = 1<<(n&63)
lea rax, [rax + rax - 1] ; one more left shift, and subtract
(3-component LEA has 3 cycle latency on Intel, but I believe this is optimal for uop count and thus throughput in many cases.)
In C this compiles nicely for all 64-bit ISAs except x86 Intel SnB-family
C compilers unfortunately are dumb and miss using bts even when tuning for Intel CPUs without BMI2 (where shl reg,cl is 3 uops).
e.g. gcc and clang both do this (with dec or add -1), on Godbolt
# gcc9.1 -O3 -mtune=haswell
setbits_upto(unsigned int):
mov ecx, edi
mov eax, 2 ; bake in the extra shift by 1.
sal rax, cl
dec rax
ret
MSVC starts with n in ECX because of the Windows x64 calling convention, but modulo that, it and ICC do the same thing:
# ICC19
setbits_upto(unsigned int):
mov eax, 1 #3.21
mov ecx, edi #2.39
shl rax, cl #2.39
lea rax, QWORD PTR [-1+rax+rax] #3.21
ret #3.21
With BMI2 (-march=haswell), we get optimal-for-AMD code from gcc/clang with -march=haswell
mov eax, 2
shlx rax, rax, rdi
add rax, -1
ICC still uses a 3-component LEA, so if you target MSVC or ICC use the 2ULL << n version in the source whether or not you enable BMI2, because you're not getting BTS either way. And this avoids the worst of both worlds; slow-LEA and a variable-count shift instead of BTS.
On non-x86 ISAs (where presumably variable-count shifts are efficient because they don't have the x86 tax of leaving flags unmodified if the count happens to be zero, and can use any register as the count), this compiles just fine.
e.g. AArch64. And of course this can hoist the constant 2 for reuse with different n, like x86 can with BMI2 shlx.
setbits_upto(unsigned int):
mov x1, 2
lsl x0, x1, x0
sub x0, x0, #1
ret
Basically the same on PowerPC, RISC-V, etc.

#include <stdint.h>
uint64_t mask_n_bits(const unsigned n){
uint64_t ret = n < 64;
ret <<= n&63; //the &63 is typically optimized away
ret -= 1;
return ret;
}
Results:
mask_n_bits:
xor eax, eax
cmp edi, 63
setbe al
shlx rax, rax, rdi
dec rax
ret
Returns expected results and if passed a constant value it will be optimized to a constant mask in clang and gcc as well as icc at -O2 (but not -Os) .
Explanation:
The &63 gets optimized away, but ensures the shift is <=64.
For values less than 64 it just sets the first n bits using (1<<n)-1. 1<<n sets the nth bit (equivalent pow(2,n)) and subtracting 1 from a power of 2 sets all bits less than that.
By using the conditional to set the initial 1 to be shifted, no branch is created, yet it gives you a 0 for all values >=64 because left shifting a 0 will always yield 0. Therefore when we subtract 1, we get all bits set for values of 64 and larger (because of 2s complement representation for -1).
Caveats:
1s complement systems must die - requires special casing if you have one
some compilers may not optimize the &63 away

When the input N is between 1 and 64, we can use -uint64_t(1) >> (64-N & 63).
The constant -1 has 64 set bits and we shift 64-N of them away, so we're left with N set bits.
When N=0, we can make the constant zero before shifting:
uint64_t mask(unsigned N)
{
return -uint64_t(N != 0) >> (64-N & 63);
}
This compiles to five instructions in x64 clang:
neg sets the carry flag to N != 0.
sbb turns the carry flag into 0 or -1.
shr rax,N already has an implicit N & 63, so 64-N & 63 was optimized to -N.
mov rcx,rdi
neg rcx
sbb rax,rax
shr rax,cl
ret
With the BMI2 extension, it's only four instructions (the shift length can stay in rdi):
neg edi
sbb rax,rax
shrx rax,rax,rdi
ret

SIMD (AVX2) mask store and pack [duplicate]

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?
I've seen in SSE where it was done like this:
(From:https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf)
__m128i LeftPack_SSSE3(__m128 mask, __m128 val)
{
// Move 4 sign bits of mask to 4-bit integer value.
int mask = _mm_movemask_ps(mask);
// Select shuffle control data
__m128i shuf_ctrl = _mm_load_si128(&shufmasks[mask]);
// Permute to move valid values to front of SIMD register
__m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl);
return packed;
}
This seems fine for SSE which is 4 wide, and thus only needs a 16 entry LUT, but for AVX which is 8 wide, the LUT becomes quite large(256 entries, each 32 bytes, or 8k).
I'm surprised that AVX doesn't appear to have an instruction for simplifying this process, such as a masked store with packing.
I think with some bit shuffling to count the # of sign bits set to the left you could generate the necessary permutation table, and then call _mm256_permutevar8x32_ps. But this is also quite a few instructions I think..
Does anyone know of any tricks to do this with AVX2? Or what is the most efficient method?
Here is an illustration of the Left Packing Problem from the above document:
Thanks

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)
We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.
We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.
Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).
AMD before Zen2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)
For integer vectors with 32-bit or wider elements: Either 1) _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)).
Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (Not tested). Either way, use the shuffle mask with VPERMD instead of VPERMPS.
For 64-bit integer or double elements, everything still Just Works; The compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)
For 16-bit elements, you might be able to adapt this with 128-bit vectors.
For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.
The algorithm:
Start with a constant of packed 3 bit indices, with each position holding its own index. i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide. 0b111'110'101'...'010'001'000.
Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000'...'111'000'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000'...'010'000. (i.e. [ ... 2 0 ])
See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.
Now we're in the same boat as the compressed-LUT: unpack up to 8 packed indices.
By the time you put all the pieces together, there are three total pext/pdeps. I worked backwards from what I wanted, so it's probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)
We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.
See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.
#include <stdint.h>
#include <immintrin.h>
// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101); // unpack each bit to a byte
expanded_mask *= 0xFF; // mask |= mask<<1 | mask<<2 | ... | mask<<7;
// ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte
const uint64_t identity_indices = 0x0706050403020100; // the identity shuffle for vpermps, packed to one index per byte
uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);
__m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
__m256i shufmask = _mm256_cvtepu8_epi32(bytevec);
return _mm256_permutevar8x32_ps(src, shufmask);
}
This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).
# clang 3.7.1 -std=gnu++14 -O3 -march=haswell
mov eax, edi # just to zero extend: goes away when inlining
movabs rcx, 72340172838076673 # The constants are hoisted after inlining into a loop
pdep rax, rax, rcx # ABC -> 0000000A0000000B....
imul rax, rax, 255 # 0000000A0000000B.. -> AAAAAAAABBBBBBBB..
movabs rcx, 506097522914230528
pext rax, rcx, rax
vmovq xmm1, rax
vpmovzxbd ymm1, xmm1 # 3c latency since this is lane-crossing
vpermps ymm0, ymm1, ymm0
ret
(Later clang compiles like GCC, with mov/shl/sub instead of imul, see below.)
So, according to Agner Fog's numbers and https://uops.info/, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it's 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There's no instruction-level parallelism. In a loop where this isn't part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.
This can maybe manage a throughput of one per 4 cycles, bottlenecked on port1 for pdep/pext/imul plus popcnt in the loop. Of course, with loads/stores and other loop overhead (including the compare and movmsk), total uop throughput can easily be an issue, too.
e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we're lucky.
clang 6 and earlier created a loop-carried dependency with popcnt's false dependency on its output, so it will bottleneck on 3/5ths of the latency of the compress256 function. clang 7.0 and later use xor-zeroing to break the false dependency (instead of just using popcnt edx,edx or something like GCC does :/).
gcc (and later clang) does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub, instead of imul by 255. This takes 3 total uops vs. 1 for the front-end, but the latency is only 2 cycles, down from 3. (Haswell handles mov at register-rename stage with zero latency.) Most significantly for this, imul can only run on port 1, competing with pdep/pext/popcnt, so it's probably good to avoid that bottleneck.
Since all hardware that supports AVX2 also supports BMI2, there's probably no point providing a version for AVX2 without BMI2.
If you need to do this in a very long loop, the LUT is probably worth it if the initial cache-misses are amortized over enough iterations with the lower overhead of just unpacking the LUT entry. You still need to movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pext.
You can unpack LUT entries with the same integer sequence I used, but #Froglegs's set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn't need to go into integer registers in the first place. (A 32bit broadcast-load doesn't need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).

See my other answer for AVX2+BMI2 with no LUT.
Since you mention a concern about scalability to AVX512: don't worry, there's an AVX512F instruction for exactly this:
VCOMPRESSPS — Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory. (There are also versions for double, and 32 or 64bit integer elements (vpcompressq), but not byte or word (16bit)). It's like BMI2 pdep / pext, but for vector elements instead of bits in an integer reg.
The destination can be a vector register or a memory operand, while the source is a vector and a mask register. With a register dest, it can merge or zero the upper bits. With a memory dest, "Only the contiguous vector is written to the destination memory location".
To figure out how far to advance your pointer for the next vector, popcnt the mask.
Let's say you want to filter out everything but values >= 0 from an array:
#include <stdint.h>
#include <immintrin.h>
size_t filter_non_negative(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
const float *endp = src+len;
float *dst_start = dst;
do {
__m512 sv = _mm512_loadu_ps(src);
__mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ); // true for src >= 0.0, false for unordered and src < 0.0
_mm512_mask_compressstoreu_ps(dst, keep, sv); // clang is missing this intrinsic, which can't be emulated with a separate store
src += 16;
dst += _mm_popcnt_u64(keep); // popcnt_u64 instead of u32 helps gcc avoid a wasted movsx, but is potentially slower on some CPUs
} while (src < endp);
return dst - dst_start;
}
This compiles (with gcc4.9 or later) to (Godbolt Compiler Explorer):
# Output from gcc6.1, with -O3 -march=haswell -mavx512f. Same with other gcc versions
lea rcx, [rsi+rdx*4] # endp
mov rax, rdi
vpxord zmm1, zmm1, zmm1 # vpxor xmm1, xmm1,xmm1 would save a byte, using VEX instead of EVEX
.L2:
vmovups zmm0, ZMMWORD PTR [rsi]
add rsi, 64
vcmpps k1, zmm0, zmm1, 29 # AVX512 compares have mask regs as a destination
kmovw edx, k1 # There are some insns to add/or/and mask regs, but not popcnt
movzx edx, dx # gcc is dumb and doesn't know that kmovw already zero-extends to fill the destination.
vcompressps ZMMWORD PTR [rax]{k1}, zmm0
popcnt rdx, rdx
## movsx rdx, edx # with _popcnt_u32, gcc is dumb. No casting can get gcc to do anything but sign-extend. You'd expect (unsigned) would mov to zero-extend, but no.
lea rax, [rax+rdx*4] # dst += ...
cmp rcx, rsi
ja .L2
sub rax, rdi
sar rax, 2 # address math -> element count
ret
Performance: 256-bit vectors may be faster on Skylake-X / Cascade Lake
In theory, a loop that loads a bitmap and filters one array into another should run at 1 vector per 3 clocks on SKX / CSLX, regardless of vector width, bottlenecked on port 5. (kmovb/w/d/q k1, eax runs on p5, and vcompressps into memory is 2p5 + a store, according to IACA and to testing by http://uops.info/).
#ZachB reports in comments that in practice, that a loop using ZMM _mm512_mask_compressstoreu_ps is slightly slower than _mm256_mask_compressstoreu_ps on real CSLX hardware. (I'm not sure if that was a microbenchmark that would allow the 256-bit version to get out of "512-bit vector mode" and clock higher, or if there was surrounding 512-bit code.)
I suspect misaligned stores are hurting the 512-bit version. vcompressps probably effectively does a masked 256 or 512-bit vector store, and if that crosses a cache line boundary then it has to do extra work. Since the output pointer is usually not a multiple of 16 elements, a full-line 512-bit store will almost always be misaligned.
Misaligned 512-bit stores may be worse than cache-line-split 256-bit stores for some reason, as well as happening more often; we already know that 512-bit vectorization of other things seems to be more alignment sensitive. That may just be from running out of split-load buffers when they happen every time, or maybe the fallback mechanism for handling cache-line splits is less efficient for 512-bit vectors.
It would be interesting to benchmark vcompressps into a register, with separate full-vector overlapping stores. That's probably the same uops, but the store can micro-fuse when it's a separate instruction. And if there's some difference between masked stores vs. overlapping stores, this would reveal it.
Another idea discussed in comments below was using vpermt2ps to build up full vectors for aligned stores. This would be hard to do branchlessly, and branching when we fill a vector will probably mispredict unless the bitmask has a pretty regular pattern, or big runs of all-0 and all-1.
A branchless implementation with a loop-carried dependency chain of 4 or 6 cycles through the vector being constructed might be possible, with a vpermt2ps and a blend or something to replace it when it's "full". With an aligned vector store every iteration, but only moving the output pointer when the vector is full.
This is likely slower than vcompressps with unaligned stores on current Intel CPUs.

If you are targeting AMD Zen this method may be preferred, due to the very slow pdepand pext on ryzen (18 cycles each).
I came up with this method, which uses a compressed LUT, which is 768(+1 padding) bytes, instead of 8k. It requires a broadcast of a single scalar value, which is then shifted by a different amount in each lane, then masked to the lower 3 bits, which provides a 0-7 LUT.
Here is the intrinsics version, along with code to build LUT.
//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(u32 moveMask) {
u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
__m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr));//lower 24 bits has our LUT
// __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));
//now shift it right to get 3 bits at bottom
//__m256i shufmask = _mm256_srli_epi32(m, 29);
//Simplified version suggested by wim
//shift each lane so desired 3 bits are a bottom
//There is leftover data in the lane, but _mm256_permutevar8x32_ps only examines the first 3 bits so this is ok
__m256i shufmask = _mm256_srlv_epi32 (indices, _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21));
return shufmask;
}
u32 get_nth_bits(int a) {
u32 out = 0;
int c = 0;
for (int i = 0; i < 8; ++i) {
auto set = (a >> i) & 1;
if (set) {
out |= (i << (c * 3));
c++;
}
}
return out;
}
u8 g_pack_left_table_u8x3[256 * 3 + 1];
void BuildPackMask() {
for (int i = 0; i < 256; ++i) {
*reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
}
}
Here is the assembly generated by MSVC:
lea ecx, DWORD PTR [rcx+rcx*2]
lea rax, OFFSET FLAT:unsigned char * g_pack_left_table_u8x3 ; g_pack_left_table_u8x3
vpbroadcastd ymm0, DWORD PTR [rcx+rax]
vpsrlvd ymm0, ymm0, YMMWORD PTR __ymm#00000015000000120000000f0000000c00000009000000060000000300000000

Will add more information to a great answer from #PeterCordes : https://stackoverflow.com/a/36951611/5021064.
I did the implementations of std::remove from C++ standard for integer types with it. The algorithm, once you can do compress, is relatively simple: load a register, compress, store. First I'm going to show the variations and then benchmarks.
I ended up with two meaningful variations on the proposed solution:
__m128i registers, any element type, using _mm_shuffle_epi8 instruction
__m256i registers, element type of at least 4 bytes, using _mm256_permutevar8x32_epi32
When the types are smaller then 4 bytes for 256 bit register, I split them in two 128 bit registers and compress/store each one separately.
Link to compiler explorer where you can see complete assembly (there is a using type and width (in elements per pack) in the bottom, which you can plug in to get different variations) : https://gcc.godbolt.org/z/yQFR2t
NOTE: my code is in C++17 and is using a custom simd wrappers, so I do not know how readable it is. If you want to read my code -> most of it is behind the link in the top include on godbolt. Alternatively, all of the code is on github.
Implementations of #PeterCordes answer for both cases
Note: together with the mask, I also compute the number of elements remaining using popcount. Maybe there is a case where it's not needed, but I have not seen it yet.
Mask for _mm_shuffle_epi8
Write an index for each byte into a half byte: 0xfedcba9876543210
Get pairs of indexes into 8 shorts packed into __m128i
Spread them out using x << 4 | x & 0x0f0f
Example of spreading the indexes. Let's say 7th and 6th elements are picked.
It means that the corresponding short would be: 0x00fe. After << 4 and | we'd get 0x0ffe. And then we clear out the second f.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of `_mm_movemask_epi8`,
// `uint16_t` - there are at most 16 bits with values for __m128i.
inline std::pair<__m128i, std::uint8_t> mask128(std::uint16_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x1111111111111111) * 0xf;
const std::uint8_t offset =
static_cast<std::uint8_t>(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes =
_pext_u64(0xfedcba9876543210, mmask_expanded); // Do the #PeterCordes answer
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0...0|compressed_indexes
const __m128i as_16bit = _mm_cvtepu8_epi16(as_lower_8byte); // From bytes to shorts over the whole register
const __m128i shift_by_4 = _mm_slli_epi16(as_16bit, 4); // x << 4
const __m128i combined = _mm_or_si128(shift_by_4, as_16bit); // | x
const __m128i filter = _mm_set1_epi16(0x0f0f); // 0x0f0f
const __m128i res = _mm_and_si128(combined, filter); // & 0x0f0f
return {res, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m128i, std::uint8_t> compress_mask_for_shuffle_epi8(std::uint32_t mmask) {
auto res = _compress_mask::mask128(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Mask for _mm256_permutevar8x32_epi32
This is almost one for one #PeterCordes solution - the only difference is _pdep_u64 bit (he suggests this as a note).
The mask that I chose is 0x5555'5555'5555'5555. The idea is - I have 32 bits of mmask, 4 bits for each of 8 integers. I have 64 bits that I want to get => I need to convert each bit of 32 bits into 2 => therefore 0101b = 5.The multiplier also changes from 0xff to 3 because I will get 0x55 for each integer, not 1.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of _mm256_movemask_epi8
inline std::pair<__m256i, std::uint8_t> mask256_epi32(std::uint32_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x5555'5555'5555'5555) * 3;
const std::uint8_t offset = static_cast<std::uint8_t(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes = _pext_u64(0x0706050403020100, mmask_expanded); // Do the #PeterCordes answer
// Every index was one byte => we need to make them into 4 bytes
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0000|compressed indexes
const __m256i expanded = _mm256_cvtepu8_epi32(as_lower_8byte); // spread them out
return {expanded, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m256i, std::uint8_t> compress_mask_for_permutevar8x32(std::uint32_t mmask) {
static_assert(sizeof(T) >= 4); // You cannot permute shorts/chars with this.
auto res = _compress_mask::mask256_epi32(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Benchmarks
Processor: Intel Core i7 9700K (a modern consumer level CPU, no AVX-512 support)
Compiler: clang, build from trunk near the version 10 release
Compiler options: --std=c++17 --stdlib=libc++ -g -Werror -Wall -Wextra -Wpedantic -O3 -march=native -mllvm -align-all-functions=7
Micro-benchmarking library: google benchmark
Controlling for code alignment:
If you are not familiar with the concept, read this or watch this
All functions in the benchmark's binary are aligned to 128 byte boundary. Each benchmarking function is duplicated 64 times, with a different noop slide in the beginning of the function (before entering the loop). The main numbers I show is min per each measurement. I think this works since the algorithm is inlined. I'm also validated by the fact that I get very different results. At the very bottom of the answer I show the impact of code alignment.
Note: benchmarking code. BENCH_DECL_ATTRIBUTES is just noinline
Benchmark removes some percentage of 0s from an array. I test arrays with {0, 5, 20, 50, 80, 95, 100} percent of zeroes.
I test 3 sizes: 40 bytes (to see if this is usable for really small arrays), 1000 bytes and 10'000 bytes. I group by size because of SIMD depends on the size of the data and not a number of elements. The element count can be derived from an element size (1000 bytes is 1000 chars but 500 shorts and 250 ints). Since time it takes for non simd code depends mostly on the element count, the wins should be bigger for chars.
Plots: x - percentage of zeroes, y - time in nanoseconds. padding : min indicates that this is minimum among all alignments.
40 bytes worth of data, 40 chars
For 40 bytes this does not make sense even for chars - my implementation gets about 8-10 times slower when using 128 bit registers over non-simd code. So, for example, compiler should be careful doing this.
1000 bytes worth of data, 1000 chars
Apparently the non-simd version is dominated by branch prediction: when we get small amount of zeroes we get a smaller speed up: for no 0s - about 3 times, for 5% zeroes - about 5-6 times speed up. For when the branch predictor can't help the non-simd version - there is about a 27 times speed up. It's an interesting property of simd code that it's performance tends to be much less dependent on of data. Using 128 vs 256 register shows practically no difference, since most of the work is still split into 2 128 registers.
1000 bytes worth of data, 500 shorts
Similar results for shorts except with a much smaller gain - up to 2 times.
I don't know why shorts do that much better than chars for non-simd code: I'd expect shorts to be two times faster, since there are only 500 shorts, but the difference is actually up to 10 times.
1000 bytes worth of data, 250 ints
For a 1000 only 256 bit version makes sense - 20-30% win excluding no 0s to remove what's so ever (perfect branch prediction, no removing for non-simd code).
10'000 bytes worth of data, 10'000 chars
The same order of magnitude wins as as for a 1000 chars: from 2-6 times faster when branch predictor is helpful to 27 times when it's not.
Same plots, only simd versions:
Here we can see about a 10% win from using 256 bit registers and splitting them in 2 128 bit ones: about 10% faster. In size it grows from 88 to 129 instructions, which is not a lot, so might make sense depending on your use-case. For base-line - non-simd version is 79 instructions (as far as I know - these are smaller then SIMD ones though).
10'000 bytes worth of data, 5'000 shorts
From 20% to 9 times win, depending on the data distributions. Not showing the comparison between 256 and 128 bit registers - it's almost the same assembly as for chars and the same win for 256 bit one of about 10%.
10'000 bytes worth of data, 2'500 ints
Seems to make a lot of sense to use 256 bit registers, this version is about 2 times faster compared to 128 bit registers. When comparing with non-simd code - from a 20% win with a perfect branch prediction to 3.5 - 4 times as soon as it's not.
Conclusion: when you have a sufficient amount of data (at least 1000 bytes) this can be a very worthwhile optimisation for a modern processor without AVX-512
PS:
On percentage of elements to remove
On one hand it's uncommon to filter half of your elements. On the other hand a similar algorithm can be used in partition during sorting => that is actually expected to have ~50% branch selection.
Code alignment impact
The question is: how much worth it is, if the code happens to be poorly aligned
(generally speaking - there is very little one can do about it).
I'm only showing for 10'000 bytes.
The plots have two lines for min and for max for each percentage point (meaning - it's not one best/worst code alignment - it's the best code alignment for a given percentage).
Code alignment impact - non-simd
Chars:
From 15-20% for poor branch prediction to 2-3 times when branch prediction helped a lot. (branch predictor is known to be affected by code alignment).
Shorts:
For some reason - the 0 percent is not affected at all. It can be explained by std::remove first doing linear search to find the first element to remove. Apparently linear search for shorts is not affected.
Other then that - from 10% to 1.6-1.8 times worth
Ints:
Same as for shorts - no 0s is not affected. As soon as we go into remove part it goes from 1.3 times to 5 times worth then the best case alignment.
Code alignment impact - simd versions
Not showing shorts and ints 128, since it's almost the same assembly as for chars
Chars - 128 bit register
About 1.2 times slower
Chars - 256 bit register
About 1.1 - 1.24 times slower
Ints - 256 bit register
1.25 - 1.35 times slower
We can see that for simd version of the algorithm, code alignment has significantly less impact compared to non-simd version. I suspect that this is due to practically not having branches.

In case anyone is interested here is a solution for SSE2 which uses an instruction LUT instead of a data LUT aka a jump table. With AVX this would need 256 cases though.
Each time you call LeftPack_SSE2 below it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases don't need to modify the vector.
static inline __m128 LeftPack_SSE2(__m128 val, int mask) {
switch(mask) {
case 0:
case 1: return val;
case 2: return _mm_shuffle_ps(val,val,0x01);
case 3: return val;
case 4: return _mm_shuffle_ps(val,val,0x02);
case 5: return _mm_shuffle_ps(val,val,0x08);
case 6: return _mm_shuffle_ps(val,val,0x09);
case 7: return val;
case 8: return _mm_shuffle_ps(val,val,0x03);
case 9: return _mm_shuffle_ps(val,val,0x0c);
case 10: return _mm_shuffle_ps(val,val,0x0d);
case 11: return _mm_shuffle_ps(val,val,0x34);
case 12: return _mm_shuffle_ps(val,val,0x0e);
case 13: return _mm_shuffle_ps(val,val,0x38);
case 14: return _mm_shuffle_ps(val,val,0x39);
case 15: return val;
}
}
__m128 foo(__m128 val, __m128 maskv) {
int mask = _mm_movemask_ps(maskv);
return LeftPack_SSE2(val, mask);
}

This is perhaps a bit late though I recently ran into this exact problem and found an alternative solution which used a strictly AVX implementation. If you don't care if unpacked elements are swapped with the last elements of each vector, this could work as well. The following is an AVX version:
inline __m128 left_pack(__m128 val, __m128i mask) noexcept
{
const __m128i shiftMask0 = _mm_shuffle_epi32(mask, 0xA4);
const __m128i shiftMask1 = _mm_shuffle_epi32(mask, 0x54);
const __m128i shiftMask2 = _mm_shuffle_epi32(mask, 0x00);
__m128 v = val;
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask0);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask1);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask2);
return v;
}
Essentially, each element in val is shifted once to the left using the bitfield, 0xF9 for blending with it's unshifted variant. Next, both shifted and unshifted versions are blended against the input mask (which has the first non-zero element broadcast across the remaining elements 3 and 4). Repeat this process two more times, broadcasting the second and third elements of mask to its subsequent elements on each iteration and this should provide an AVX version of the _pdep_u32() BMI2 instruction.
If you don't have AVX, you can easily swap out each _mm_permute_ps() with _mm_shuffle_ps() for an SSE4.1-compatible version.
And if you're using double-precision, here's an additional version for AVX2:
inline __m256 left_pack(__m256d val, __m256i mask) noexcept
{
const __m256i shiftMask0 = _mm256_permute4x64_epi64(mask, 0xA4);
const __m256i shiftMask1 = _mm256_permute4x64_epi64(mask, 0x54);
const __m256i shiftMask2 = _mm256_permute4x64_epi64(mask, 0x00);
__m256d v = val;
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask0);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask1);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask2);
return v;
}
Additionally _mm_popcount_u32(_mm_movemask_ps(val)) can be used to determine the number of elements which remained after the left-packing.

Handling zeroes in _mm256_rsqrt_ps()

Given that _mm256_sqrt_ps() is relatively slow, and that the values I am generating are immediately truncated with _mm256_floor_ps(), looking around it seems that doing:
_mm256_mul_ps(_mm256_rsqrt_ps(eightFloats),
eightFloats);
Is the way to go for that extra bit of performance and avoiding a pipeline stall.
Unfortunately, with zero values, I of course get a crash calculating 1/sqrt(0). What is the best way around this? I have tried this (which works and is faster), but is there a better way, or am I going to run into problems under certain conditions?
_mm256_mul_ps(_mm256_rsqrt_ps(_mm256_max_ps(eightFloats,
_mm256_set1_ps(0.1))),
eightFloats);
My code is for a vertical application, so I can assume that it will be running on a Haswell CPU (i7-4810MQ), so FMA/AVX2 can be used. The original code is approximately:
float vals[MAX];
int sum = 0;
for (int i = 0; i < MAX; i++)
{
int thisSqrt = (int) floor(sqrt(vals[i]));
sum += min(thisSqrt, 0x3F);
}
All the values of vals should be integer values. (Why everything isn't just int is a different question...)

tl;dr: See the end for code that compiles and should work.
To just solve the 0.0 problem, you could also special-case inputs of 0.0 with an FP compare of the source against 0.0. Use the compare result as a mask to zero out any NaNs resulting from 0 * +Infinity in sqrt(x) = x * rsqrt(x)). Clang does this when autovectorizing. (But it uses blendps with the zeroed vector, instead of using the compare mask with andnps directly to zero or preserve elements.)
It would also be possible to use sqrt(x) ~= recip(rsqrt(x)), as suggested by njuffa. rsqrt(0) = +Inf. recip(+Inf) = 0. However, using two approximations would compound the relative error, which is a problem.
The thing you're missing:
Truncating to integer (instead of rounding) requires an accurate sqrt result when the input is a perfect square. If the result for 25*rsqrt(25) is 4.999999 or something (instead of 5.00001), you'll add 4 instead of 5.
Even with a Newton-Raphson iteration, rsqrtps isn't perfectly accurate the way sqrtps is, so it might still give 5.0 - 1ulp. (1ulp = one unit in the last place = lowest bit of the mantissa).
Also:
Newton Raphson formula explained
Newton Raphson SSE implementation performance (latency/throughput). Note that we care more about throughput than latency, since we're using it in a loop that doesn't do much else. sqrt isn't part of the loop-carried dep chain, so different iterations can have their sqrt calcs in flight at once.
It might be possible to kill 2 birds with one stone by adding a small constant before doing the (x+offset)*approx_rsqrt(x+offset) and then truncating to integer. Large enough to overcome the max relative error of 1.5*2-12, but small enough not to bump sqrt_approx(63*63-1+offset) up to 63 (the most sensitive case).
63*1.5*2^(-12) == 0.023071...
approx_sqrt(63*63-1) == 62.99206... +/- 0.023068..
Actually, we're screwed without a Newton iteration even without adding anything. approx_sqrt(63*63-1) could come out above 63.0 all by itself. n=36 is the largest value where the relative error in sqrt(n*n-1) + error is less than sqrt(n*n). GNU Calc:
define f(n) { local x=sqrt(n*n-1); local e=x*1.5*2^(-12); print x; print e, x+e; }
; f(36)
35.98610843089316319413
~0.01317850650545403926 ~35.99928693739861723339
; f(37)
36.9864840178138587015
~0.01354485498699237990 ~37.00002887280085108140
Does your source data have any properties that mean you don't have to worry about it being just below a large perfect square? e.g. is it always perfect squares?
You could check all possible input values, since the important domain is very small (integer FP values from 0..63*63) to see if the error in practice is small enough on Intel Haswell, but that would be a brittle optimization that could make your code break on AMD CPUs, or even on future Intel CPUs. Unfortunately, just coding to the ISA spec's guarantee that the relative error is up to 1.5*2-12 requires more instructions. I don't see any tricks a NR iteration.
If your upper limit was smaller (like 20), you could just do isqrt = static_cast<int> ((x+0.5)*approx_rsqrt(x+0.5)). You'd get 20 for 20*20, but always 19 for 20*20-1.
; define test_approx_sqrt(x, off) { local s=x*x+off; local sq=s/sqrt(s); local sq_1=(s-1)/sqrt(s-1); local e=1.5*2^(-12); print sq, sq_1; print sq*e, sq_1*e; }
; test_approx_sqrt(20, 0.5)
~20.01249609618950056874 ~19.98749609130668473087 # (x+0.5)/sqrt(x+0.5)
~0.00732879495710064718 ~0.00731963968187500662 # relative error
Note that val * (x +/- err) = val*x +/- val*err. IEEE FP mul produces results that are correctly rounded to 0.5ulp, so this should work for FP relative errors.
Anyway, I think you need one Newton-Raphson iteration.
The best bet is to add 0.5 to your input before doing an approx_sqrt using rsqrt. That sidesteps the 0/0 = NaN problem, and pushes the +/- error range all to one side of the whole number cut point (for numbers in the range we care about).
FP min/max instructions have the same performance as FP add, and will be on the critical path either way. Using an add instead of a max also solves the problem of results for perfect squares potentially being a few ulp below the correct result.
Compiler output: a decent starting point
I get pretty good autovectorization results from clang 3.7.1 with sum_int, with -fno-math-errno -funsafe-math-optimizations. -ffinite-math-only is not required (but even with the full -ffast-math, clang avoids sqrt(0) = NaN when using rsqrtps).
sum_fp doesn't auto-vectorize, even with the full -ffast-math.
However clang's version suffers from the same problem as your idea: truncating an inexact result from rsqrt + NR, potentially giving the wrong integer. IDK if this is why gcc doesn't auto-vectorize, because it could have used sqrtps for a big speedup without changing the results. (At least, as long as all the floats are between 0 and INT_MAX2, otherwise converting back to integer will give the "indefinite" result of INT_MIN. (sign bit set, all other bits cleared). This is a case where -ffast-math breaks your program, unless you use -mrecip=none or something.
See the asm output on godbolt from:
// autovectorizes with clang, but has rounding problems.
// Note the use of sqrtf, and that floorf before truncating to int is redundant. (removed because clang doesn't optimize away the roundps)
int sum_int(float vals[]){
int sum = 0;
for (int i = 0; i < MAX; i++) {
int thisSqrt = (int) sqrtf(vals[i]);
sum += std::min(thisSqrt, 0x3F);
}
return sum;
}
To manually vectorize with intrinsics, we can look at the asm output from -fno-unroll-loops (to keep things simple). I was going to include this in the answer, but then realized that it had problems.
putting it together:
I think converting to int inside the loop is better than using floorf and then addps. roundps is a 2-uop instruction (6c latency) on Haswell (1uop in SnB/IvB). Worse, both uops require port1, so they compete with FP add / mul. cvttps2dq is a 1-uop instruction for port1, with 3c latency, and then we can use integer min and add to clamp and accumulate, so port5 gets something to do. Using an integer vector accumulator also means the loop-carried dependency chain is 1 cycle, so we don't need to unroll or use multiple accumulators to keep multiple iterations in flight. Smaller code is always better for the big picture (uop cache, L1 I-cache, branch predictors).
As long as we aren't in danger of overflowing 32bit accumulators, this seems to be the best choice. (Without having benchmarked anything or even tested it).
I'm not using the sqrt(x) ~= approx_recip(approx_sqrt(x)) method, because I don't know how to do a Newton iteration to refine it (probably it would involve a division). And because the compounded error is larger.
Horizontal sum from this answer.
Complete but untested version:
#include <immintrin.h>
#define MAX 4096
// 2*sqrt(x) ~= 2*x*approx_rsqrt(x), with a Newton-Raphson iteration
// dividing by 2 is faster in the integer domain, so we don't do it
__m256 approx_2sqrt_ps256(__m256 x) {
// clang / gcc usually use -3.0 and -0.5. We could do the same by using fnmsub_ps (add 3 = subtract -3), so we can share constants
__m256 three = _mm256_set1_ps(3.0f);
//__m256 half = _mm256_set1_ps(0.5f); // we omit the *0.5 step
__m256 nr = _mm256_rsqrt_ps( x ); // initial approximation for Newton-Raphson
// 1/sqrt(x) ~= nr * (3 - x*nr * nr) * 0.5 = nr*(1.5 - x*0.5*nr*nr)
// sqrt(x) = x/sqrt(x) ~= (x*nr) * (3 - x*nr * nr) * 0.5
// 2*sqrt(x) ~= (x*nr) * (3 - x*nr * nr)
__m256 xnr = _mm256_mul_ps( x, nr );
__m256 three_minus_muls = _mm256_fnmadd_ps( xnr, nr, three ); // -(xnr*nr) + 3
return _mm256_mul_ps( xnr, three_minus_muls );
}
// packed int32_t: correct results for inputs from 0 to well above 63*63
__m256i isqrt256_ps(__m256 x) {
__m256 offset = _mm256_set1_ps(0.5f); // or subtract -0.5, to maybe share constants with compiler-generated Newton iterations.
__m256 xoff = _mm256_add_ps(x, offset); // avoids 0*Inf = NaN, and rounding error before truncation
__m256 approx_2sqrt_xoff = approx_2sqrt_ps256(xoff);
__m256i i2sqrtx = _mm256_cvttps_epi32(approx_2sqrt_xoff);
return _mm256_srli_epi32(i2sqrtx, 1); // divide by 2 with truncation
// alternatively, we could mask the low bit to zero and divide by two outside the loop, but that has no advantage unless port0 turns out to be the bottleneck
}
__m256i isqrt256_ps_simple_exact(__m256 x) {
__m256 sqrt_x = _mm256_sqrt_ps(x);
__m256i isqrtx = _mm256_cvttps_epi32(sqrt_x);
return isqrtx;
}
int hsum_epi32_avx(__m256i x256){
__m128i xhi = _mm256_extracti128_si256(x256, 1);
__m128i xlo = _mm256_castsi256_si128(x256);
__m128i x = _mm_add_epi32(xlo, xhi);
__m128i hl = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
hl = _mm_add_epi32(hl, x);
x = _mm_shuffle_epi32(hl, _MM_SHUFFLE(2, 3, 0, 1));
hl = _mm_add_epi32(hl, x);
return _mm_cvtsi128_si32(hl);
}
int sum_int_avx(float vals[]){
__m256i sum = _mm256_setzero_si256();
__m256i upperlimit = _mm256_set1_epi32(0x3F);
for (int i = 0; i < MAX; i+=8) {
__m256 v = _mm256_loadu_ps(vals+i);
__m256i visqrt = isqrt256_ps(v);
// assert visqrt == isqrt256_ps_simple_exact(v) or something
visqrt = _mm256_min_epi32(visqrt, upperlimit);
sum = _mm256_add_epi32(sum, visqrt);
}
return hsum_epi32_avx(sum);
}
Compiles on godbolt to nice code, but I haven't tested it. clang makes slightly nicer code that gcc: clang uses broadcast-loads from 4B locations for the set1 constants, instead of repeating them at compile time into 32B constants. gcc also has a bizarre movdqa to copy a register.
Anyway, the whole loop winds up being only 9 vector instructions, compared to 12 for the compiler-generated sum_int version. It probably didn't notice the x*initial_guess(x) common-subexpressions that occur in the Newton-Raphson iteration formula when you're multiplying the result by x, or something like that. It also does an extra mulps instead of a psrld because it does the *0.5 before converting to int. So that's where the extra two mulps instructions come from, and there's the cmpps/blendvps.
sum_int_avx(float*):
vpxor ymm3, ymm3, ymm3
xor eax, eax
vbroadcastss ymm0, dword ptr [rip + .LCPI4_0] ; set1(0.5)
vbroadcastss ymm1, dword ptr [rip + .LCPI4_1] ; set1(3.0)
vpbroadcastd ymm2, dword ptr [rip + .LCPI4_2] ; set1(63)
LBB4_1: ; latencies
vaddps ymm4, ymm0, ymmword ptr [rdi + 4*rax] ; 3c
vrsqrtps ymm5, ymm4 ; 7c
vmulps ymm4, ymm4, ymm5 ; x*nr ; 5c
vfnmadd213ps ymm5, ymm4, ymm1 ; 5c
vmulps ymm4, ymm4, ymm5 ; 5c
vcvttps2dq ymm4, ymm4 ; 3c
vpsrld ymm4, ymm4, 1 ; 1c this would be a mulps (but not on the critical path) if we did this in the FP domain
vpminsd ymm4, ymm4, ymm2 ; 1c
vpaddd ymm3, ymm4, ymm3 ; 1c
; ... (those 9 insns repeated: loop unrolling)
add rax, 16
cmp rax, 4096
jl .LBB4_1
;... horizontal sum
IACA thinks that with no unroll, Haswell can sustain a throughput of one iteration per 4.15 cycles, bottlenecking on ports 0 and 1. So potentially you could shave a cycle by accumulating sqrt(x)*2 (with truncation to even numbers using _mm256_and_si256), and only divide by two outside the loop.
Also according to IACA, the latency of a single iteration is 38 cycles on Haswell. I only get 31c, so probably it's including L1 load-use latency or something. Anyway, this means that to saturate the execution units, operations from 8 iterations have to be in flight at once. That's 8 * ~14 unfused-domain uops = 112 unfused-uops (or less with clang's unroll) that have to be in flight at once. Haswell's scheduler is actually only 60 entries, but the ROB is 192 entries. The early uops from early iterations will already have executed, so they only need to be tracked in the ROB, not also in the scheduler. Many of the slow uops are at the beginning of each iteration, though. Still, there's reason to hope that this will come close-ish to saturating ports 0 and 1. Unless data is hot in L1 cache, cache/memory bandwidth will probably be the bottleneck.
Interleaving operations from multiple dep chains would also be better. When clang unrolls, it puts all 9 instructions for one iteration ahead of all 9 instructions for another iteration. It uses a surprisingly small number of registers, so it would be possible to have instructions for 2 or 4 iterations mixed together. This is the sort of thing compilers are supposed to be good at, but which is cumbersome for humans. :/
It would also be slightly more efficient if the compiler chose a one-register addressing mode, so the load could micro-fuse with the vaddps. gcc does this.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight