Trick to divide a constant (power of two) by an integer

Trick to divide a constant (power of two) by an integer - c

NOTE This is a theoretical question. I'm happy with the performance of my actual code as it is. I'm just curious about whether there is an alternative.
Is there a trick to do an integer division of a constant value, which is itself an integer power of two, by an integer variable value, without having to use do an actual divide operation?
// The fixed value of the numerator
#define SIGNAL_PULSE_COUNT 0x4000UL
// The division that could use a neat trick.
uint32_t signalToReferenceRatio(uint32_t referenceCount)
{
// Promote the numerator to a 64 bit value, shift it left by 32 so
// the result has an adequate number of bits of precision, and divide
// by the numerator.
return (uint32_t)((((uint64_t)SIGNAL_PULSE_COUNT) << 32) / referenceCount);
}
I've found several (lots) of references for tricks to do division by a constant, both integer and floating point. For example, the question What's the fastest way to divide an integer by 3? has a number of good answers including references to other academic and community materials.
Given that the numerator is constant, and it's an integer power of two, is there a neat trick that could be used in place of doing an actual 64 bit division; some kind of bit-wise operation (shifts, AND, XOR, that kind of stuff) or similar?
I don't want any loss of precision (beyond a possible half bit due to integer rounding) greater than that of doing the actual division, as the precision of the instrument relies on the precision of this measurement.
"Let the compiler decide" is not an answer, because I want to know if there is a trick.
Extra, Contextual Information
I'm developing a driver on a 16 bit data, 24 bit instruction word micro-controller. The driver does some magic with the peripheral modules to obtain a pulse count of a reference frequency for a fixed number of pulses of a signal frequency. The required result is a ratio of the signal pulses to the reference pulse, expressed as an unsigned 32 bit value. The arithmetic for the function is defined by the manufacturer of the device for which I'm developing the driver, and the result is processed further to obtain a floating point real-world value, but that's outside the scope of this question.
The micro-controller I'm using has a Digital Signal Processor that has a number of division operations that I could use, and I'm not afraid to do so if necessary. There would be some minor challenges to overcome with this approach, beyond the putting together the assembly instructions to make it work, such as the DSP being used to do a PID function in a BLDC driver ISR, but nothing I can't manage.

You cannot use clever mathematical tricks to not do a division, but you can of course still use programming tricks if you know the range of your reference count:
Nothing beats a pre-computed lookup table in terms of speed.
There are fast approximate square root algorithms (probably already in your DSP), and you can improve the approximation by one or two Newton-Raphson iterations. If doing the computation with floating-point numbers is accurate enough for you, you can probably beat a 64bit integer division in terms of speed (but not in clarity of code).
You mentioned that the result will be converted to floating-point later, it might be beneficial to not compute the integer division at all, but use your floating point hardware.

I worked out a Matlab version, using fixed point arithmetic.
This method assumes that a integer version of log2(x) can be calculated efficiently, which is true for dsPIC30/33F and TI C6000 that have instruction to detect the most significant 1 of an integer.
For this reason, this code has strong ISA depency and can not be written in portable/standard C and can be improved using instructions like multiply-and-add, multiply-and-shift, so I won't try translating it to C.
nrdiv.m
function [ y ] = nrdiv( q, x, lut)
% assume q>31, lut = 2^31/[1,1,2,...255]
p2 = ceil(log2(x)); % available in TI C6000, instruction LMBD
% available in Microchip dsPIC30F/33F, instruction FF1L
if p2<8
pre_shift=0;
else
pre_shift=p2-8;
end % shr = (p2-8)>0?(p2-8):0;
xn = shr(x, pre_shift); % xn = x>>pre_shift;
y = shr(lut(xn), pre_shift); % y = lut[xn]>pre_shift;
y = shr(y * (2^32 - y*x), 30); % basic iteration
% step up from q31 to q32
y = shr(y * (2^33 - y*x), (64-q)); % step up from q32 to desired q
if q>39
y = shr(y * (2^(1+q) - y*x), (q)); % when q>40, additional
% iteration is required,
end % no step up is performed
end
function y = shr(x, r)
y=floor(x./2^r); % simulate operator >>
end
test.m
test_number = (2^22-12345);
test_q = 48;
lut_q31 = round(2^31 ./ [1,[1:1:255]]);
display(sprintf('tested 2^%d/%d, diff=%f\n',test_q, test_number,...
nrdiv( 39, (2^22-5), lut_q31) - 2^39/(2^22-5)));
sample output
tested 2^48/4181959, diff=-0.156250
reference:
Newton–Raphson division

A little late but here is my solution.
First some assumptions:
Problem:
X=N/D where N is a constant ans a power of 2.
All 32 bit unsigned integers.
X is unknown but we have a good estimate
(previous but no longer accurate solution).
An exact solution is not required.
Note: due to integer truncation this is not an accurate algorithm!
An iterative solution is okay (improves with each loop).
Division is much more expensive than multiplication:
For 32bit unsigned integer for Arduino UNO:
'+/-' ~0.75us
'*' ~3.5us
'/' ~36us 4 We seek to replace the Basically lets start with Newton's method:
Xnew=Xold-f(x)/(f`(x)
where f(x)=0 for the solution we seek.
Solving this I get:
Xnew=XNew*(C-X*D)/N
where C=2*N
First trick:
Now that the Numerator (constant) is now a Divisor (constant) then one solution here (which does not require the N to be a power of 2) is:
Xnew=XNew*(C-X*D)*A>>M
where C=2*N, A and M are constants (look for dividing by a constant tricks).
or (staying with Newtons method):
Xnew=XNew*(C-X*D)>>M
where C=2>>M where M is the power.
So I have 2 '*' (7.0us), a '-' (0.75us) and a '>>' (0.75us?) or 8.5us total (rather than 36us), excluding other overheads.
Limitations:
As the data type is 32 bit unsigned, 'M' should not exceed 15 else there will be problems with overflow (you can probably get around this using a 64bit intermediate data type).
N>D (else the algorithm blows up! at least with unsigned integer)
Obviously the algorithm will work with signed and float data types)
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
int main(void)
{
unsigned long c,d,m,x;
// x=n/d where n=1<<m
m=15;
c=2<<m;
d=10;
x=10;
while (true)
{
x=x*(c-d*x)>>m;
printf("%ld",x);
getchar();
}
return(0);
}

Having tried many alternatives, I ended up doing normal binary long division in assembly language. However, the routine does use a few optimisations that bring the execution time down to an acceptable level.
/*
* Converts the reference frequency count for a specific signal frequency
* to a ratio.
* Xs = Ns * 2^32 / Nr
* Where:
* 2^32 is a constant scaling so that the maximum accuracy can be achieved.
* Ns is the number of signal counts (fixed at 0x4000 by hardware).
* Nr is the number of reference counts, passed in W1:W0.
* #param W1:W0 The number of reference frequency pulses.
* #return W1:W0 The scaled ratio.
*/
.align 2
.global _signalToReferenceRatio
.type _signalToReferenceRatio, #function
; This is the position of the most significant bit of the fixed Ns (0x4000).
.equ LOG2_DIVIDEND, 14
.equ DIVISOR_LIMIT, LOG2_DIVIDEND+1
.equ WORD_SIZE, 16
_signalToReferenceRatio:
; Create a dividend, MSB-aligned with the divisor, in W2:W3 and place the
; number of iterations required for the MSW in [W14] and the LSW in [W14+2].
LNK #4
MUL.UU W2, #0, W2
FF1L W1, W4
; If MSW is zero the argument is out of range.
BRA C, .returnZero
SUBR W4, #WORD_SIZE, W4
; Find the number of quotient MSW loops.
; This is effectively 1 + log2(dividend) - log2(divisor).
SUBR W4, #DIVISOR_LIMIT, [W14]
BRA NC, .returnZero
; Since the SUBR above is always non-negative and the C flag set, use this
; to set bit W3<W5> and the dividend in W2:W3 = 2^(16+W5) = 2^log2(divisor).
BSW.C W3, W4
; Use 16 quotient LSW loops.
MOV #WORD_SIZE, W4
MOV W4, [W14+2]
; Set up W4:W5 to hold the divisor and W0:W1 to hold the result.
MOV.D W0, W4
MUL.UU W0, #0, W0
.checkLoopCount:
; While the bit count is non-negative ...
DEC [W14], [W14]
BRA NC, .nextWord
.alignQuotient:
; Shift the current quotient word up by one bit.
SL W0, W0
; Subtract divisor from the current dividend part.
SUB W2, W4, W6
SUBB W3, W5, W7
; Check if the dividend part was less than the divisor.
BRA NC, .didNotDivide
; It did divide, so set the LSB of the quotient.
BSET W0, #0
; Shift the remainder up by one bit, with the next zero in the LSB.
SL W7, W3
BTSC W6, #15
BSET W3, #0
SL W6, W2
BRA .checkLoopCount
.didNotDivide:
; Shift the next (zero) bit of the dividend into the LSB of the remainder.
SL W3, W3
BTSC W2, #15
BSET W3, #0
SL W2, W2
BRA .checkLoopCount
.nextWord:
; Test if there are any LSW bits left to calculate.
MOV [++W14], W6
SUB W6, #WORD_SIZE, [W14--]
BRA NC, .returnQ
; Decrement the remaining bit counter before writing it back.
DEC W6, [W14]
; Move the working part of the quotient up into the MSW of the result.
MOV W0, W1
BRA .alignQuotient
.returnQ:
; Return the quotient in W0:W1.
ULNK
RETURN
.returnZero:
MUL.UU W0, #0, W0
ULNK
RETURN
.size _signalToReferenceRatio, .-_signalToReferenceRatio

Related

Creating a mask with N least significant bits set

I would like to create a macro or function1 mask(n) which given a number n returns an unsigned integer with its n least significant bits set. Although this seems like it should be a basic primitive with heavily discussed implementations which compile efficiently - this doesn't seem to be the case.
Of course, various implementations may have different sizes for the primitive integral types like unsigned int, so let's assume for the sake of concreteness that we are talking returning a uint64_t specifically although of course an acceptable solutions would work (with different definitions) for any unsigned integral type. In particular, the solution should be efficient when the type returned is equal to or smaller than the platform's native width.
Critically, this must work for all n in [0, 64]. In particular mask(0) == 0 and mask(64) == (uint64_t)-1. Many "obvious" solutions don't work for one of these two cases.
The most important criteria is correctness: only correct solutions which don't rely on undefined behavior are interesting.
The second most important criteria is performance: the idiom should ideally compile to approximately the most efficient platform-specific way to do this on common platforms.
A solution that sacrifices simplicity in the name of performance, e.g., that uses different implementations on different platforms, is fine.
1 The most general case is a function, but ideally it would also work as a macro, without re-evaluating any of its arguments more than once.

Try
unsigned long long mask(const unsigned n)
{
assert(n <= 64);
return (n == 64) ? 0xFFFFFFFFFFFFFFFFULL :
(1ULL << n) - 1ULL;
}
There are several great, clever answers that avoid conditionals, but a modern compiler can generate code for this that doesn’t branch.
Your compiler can probably figure out to inline this, but you might be able to give it a hint with inline or, in C++, constexpr.
The unsigned long long int type is guaranteed to be at least 64 bits wide and present on every implementation, which uint64_t is not.
If you need a macro (because you need something that works as a compile-time constant), that might be:
#define mask(n) ((64U == (n)) ? 0xFFFFFFFFFFFFFFFFULL : (1ULL << (unsigned)(n)) - 1ULL)
As several people correctly reminded me in the comments, 1ULL << 64U is potential undefined behavior! So, insert a check for that special case.
You could replace 64U with CHAR_BITS*sizeof(unsigned long long) if it is important to you to support the full range of that type on an implementation where it is wider than 64 bits.
You could similarly generate this from an unsigned right shift, but you would still need to check n == 64 as a special case, since right-shifting by the width of the type is undefined behavior.
ETA:
The relevant portion of the (N1570 Draft) standard says, of both left and right bit shifts:
If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
This tripped me up. Thanks again to everyone in the comments who reviewed my code and pointed the bug out to me.

Another solution without branching
unsigned long long mask(unsigned n)
{
return ((1ULL << (n & 0x3F)) & -(n != 64)) - 1;
}
n & 0x3F keeps the shift amount to maximum 63 in order to avoid UB. In fact most modern architectures will just grab the lower bits of the shift amount, so no and instruction is needed for this.
The checking condition for 64 can be changed to -(n < 64) to make it return all ones for n ⩾ 64, which is equivalent to _bzhi_u64(-1ULL, (uint8_t)n) if your CPU supports BMI2.
The output from Clang looks better than gcc. As it happens gcc emits conditional instructions for MIPS64 and ARM64 but not for x86-64, resulting in longer output
The condition can also be simplified to n >> 6, utilizing the fact that it'll be one if n = 64. And we can subtract that from the result instead of creating a mask like above
return (1ULL << (n & 0x3F)) - (n == 64) - 1; // or n >= 64
return (1ULL << (n & 0x3F)) - (n >> 6) - 1;
gcc compiles the latter to
mov eax, 1
shlx rax, rax, rdi
shr edi, 6
dec rax
sub rax, rdi
ret
Some more alternatives
return ~((~0ULL << (n & 0x3F)) << (n == 64));
return ((1ULL << (n & 0x3F)) - 1) | (((uint64_t)n >> 6) << 63);
return (uint64_t)(((__uint128_t)1 << n) - 1); // if a 128-bit type is available
A similar question for 32 bits: Set last `n` bits in unsigned int

Here's one that is portable and conditional-free:
unsigned long long mask(unsigned n)
{
assert (n <= sizeof(unsigned long long) * CHAR_BIT);
return (1ULL << (n/2) << (n-(n/2))) - 1;
}

This is not an answer to the exact question. It only works if 0 isn't a required output, but is more efficient.
2n+1 - 1 computed without overflow. i.e. an integer with the low n bits set, for n = 0 .. all_bits
Possibly using this inside a ternary for cmov could be a more efficient solution to the full problem in the question. Perhaps based on a left-rotate of a number with the MSB set, instead of a left-shift of 1, to take care of the difference in counting for this vs. the question for the pow2 calculation.
// defined for n=0 .. sizeof(unsigned long long)*CHAR_BIT
unsigned long long setbits_upto(unsigned n) {
unsigned long long pow2 = 1ULL << n;
return pow2*2 - 1; // one more shift, and subtract 1.
}
Compiler output suggests an alternate version, good on some ISAs if you're not using gcc/clang (which already do this): bake in an extra shift count so it is possible for the initial shift to shift out all the bits, leaving 0 - 1 = all bits set.
unsigned long long setbits_upto2(unsigned n) {
unsigned long long pow2 = 2ULL << n; // bake in the extra shift count
return pow2 - 1;
}
The table of inputs / outputs for a 32-bit version of this function is:
n -> 1<<n -> *2 - 1
0 -> 1 -> 1 = 2 - 1
1 -> 2 -> 3 = 4 - 1
2 -> 4 -> 7 = 8 - 1
3 -> 8 -> 15 = 16 - 1
...
30 -> 0x40000000 -> 0x7FFFFFFF = 0x80000000 - 1
31 -> 0x80000000 -> 0xFFFFFFFF = 0 - 1
You could slap a cmov after it, or other way of handling an input that has to produce zero.
On x86, we can efficiently compute this with 3 single-uop instructions: (Or 2 uops for BTS on Ryzen).
xor eax, eax
bts rax, rdi ; rax = 1<<(n&63)
lea rax, [rax + rax - 1] ; one more left shift, and subtract
(3-component LEA has 3 cycle latency on Intel, but I believe this is optimal for uop count and thus throughput in many cases.)
In C this compiles nicely for all 64-bit ISAs except x86 Intel SnB-family
C compilers unfortunately are dumb and miss using bts even when tuning for Intel CPUs without BMI2 (where shl reg,cl is 3 uops).
e.g. gcc and clang both do this (with dec or add -1), on Godbolt
# gcc9.1 -O3 -mtune=haswell
setbits_upto(unsigned int):
mov ecx, edi
mov eax, 2 ; bake in the extra shift by 1.
sal rax, cl
dec rax
ret
MSVC starts with n in ECX because of the Windows x64 calling convention, but modulo that, it and ICC do the same thing:
# ICC19
setbits_upto(unsigned int):
mov eax, 1 #3.21
mov ecx, edi #2.39
shl rax, cl #2.39
lea rax, QWORD PTR [-1+rax+rax] #3.21
ret #3.21
With BMI2 (-march=haswell), we get optimal-for-AMD code from gcc/clang with -march=haswell
mov eax, 2
shlx rax, rax, rdi
add rax, -1
ICC still uses a 3-component LEA, so if you target MSVC or ICC use the 2ULL << n version in the source whether or not you enable BMI2, because you're not getting BTS either way. And this avoids the worst of both worlds; slow-LEA and a variable-count shift instead of BTS.
On non-x86 ISAs (where presumably variable-count shifts are efficient because they don't have the x86 tax of leaving flags unmodified if the count happens to be zero, and can use any register as the count), this compiles just fine.
e.g. AArch64. And of course this can hoist the constant 2 for reuse with different n, like x86 can with BMI2 shlx.
setbits_upto(unsigned int):
mov x1, 2
lsl x0, x1, x0
sub x0, x0, #1
ret
Basically the same on PowerPC, RISC-V, etc.

#include <stdint.h>
uint64_t mask_n_bits(const unsigned n){
uint64_t ret = n < 64;
ret <<= n&63; //the &63 is typically optimized away
ret -= 1;
return ret;
}
Results:
mask_n_bits:
xor eax, eax
cmp edi, 63
setbe al
shlx rax, rax, rdi
dec rax
ret
Returns expected results and if passed a constant value it will be optimized to a constant mask in clang and gcc as well as icc at -O2 (but not -Os) .
Explanation:
The &63 gets optimized away, but ensures the shift is <=64.
For values less than 64 it just sets the first n bits using (1<<n)-1. 1<<n sets the nth bit (equivalent pow(2,n)) and subtracting 1 from a power of 2 sets all bits less than that.
By using the conditional to set the initial 1 to be shifted, no branch is created, yet it gives you a 0 for all values >=64 because left shifting a 0 will always yield 0. Therefore when we subtract 1, we get all bits set for values of 64 and larger (because of 2s complement representation for -1).
Caveats:
1s complement systems must die - requires special casing if you have one
some compilers may not optimize the &63 away

When the input N is between 1 and 64, we can use -uint64_t(1) >> (64-N & 63).
The constant -1 has 64 set bits and we shift 64-N of them away, so we're left with N set bits.
When N=0, we can make the constant zero before shifting:
uint64_t mask(unsigned N)
{
return -uint64_t(N != 0) >> (64-N & 63);
}
This compiles to five instructions in x64 clang:
neg sets the carry flag to N != 0.
sbb turns the carry flag into 0 or -1.
shr rax,N already has an implicit N & 63, so 64-N & 63 was optimized to -N.
mov rcx,rdi
neg rcx
sbb rax,rax
shr rax,cl
ret
With the BMI2 extension, it's only four instructions (the shift length can stay in rdi):
neg edi
sbb rax,rax
shrx rax,rax,rdi
ret

SIMD (AVX2) mask store and pack [duplicate]

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?
I've seen in SSE where it was done like this:
(From:https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf)
__m128i LeftPack_SSSE3(__m128 mask, __m128 val)
{
// Move 4 sign bits of mask to 4-bit integer value.
int mask = _mm_movemask_ps(mask);
// Select shuffle control data
__m128i shuf_ctrl = _mm_load_si128(&shufmasks[mask]);
// Permute to move valid values to front of SIMD register
__m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl);
return packed;
}
This seems fine for SSE which is 4 wide, and thus only needs a 16 entry LUT, but for AVX which is 8 wide, the LUT becomes quite large(256 entries, each 32 bytes, or 8k).
I'm surprised that AVX doesn't appear to have an instruction for simplifying this process, such as a masked store with packing.
I think with some bit shuffling to count the # of sign bits set to the left you could generate the necessary permutation table, and then call _mm256_permutevar8x32_ps. But this is also quite a few instructions I think..
Does anyone know of any tricks to do this with AVX2? Or what is the most efficient method?
Here is an illustration of the Left Packing Problem from the above document:
Thanks

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)
We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.
We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.
Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).
AMD before Zen2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)
For integer vectors with 32-bit or wider elements: Either 1) _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)).
Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (Not tested). Either way, use the shuffle mask with VPERMD instead of VPERMPS.
For 64-bit integer or double elements, everything still Just Works; The compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)
For 16-bit elements, you might be able to adapt this with 128-bit vectors.
For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.
The algorithm:
Start with a constant of packed 3 bit indices, with each position holding its own index. i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide. 0b111'110'101'...'010'001'000.
Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000'...'111'000'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000'...'010'000. (i.e. [ ... 2 0 ])
See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.
Now we're in the same boat as the compressed-LUT: unpack up to 8 packed indices.
By the time you put all the pieces together, there are three total pext/pdeps. I worked backwards from what I wanted, so it's probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)
We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.
See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.
#include <stdint.h>
#include <immintrin.h>
// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101); // unpack each bit to a byte
expanded_mask *= 0xFF; // mask |= mask<<1 | mask<<2 | ... | mask<<7;
// ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte
const uint64_t identity_indices = 0x0706050403020100; // the identity shuffle for vpermps, packed to one index per byte
uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);
__m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
__m256i shufmask = _mm256_cvtepu8_epi32(bytevec);
return _mm256_permutevar8x32_ps(src, shufmask);
}
This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).
# clang 3.7.1 -std=gnu++14 -O3 -march=haswell
mov eax, edi # just to zero extend: goes away when inlining
movabs rcx, 72340172838076673 # The constants are hoisted after inlining into a loop
pdep rax, rax, rcx # ABC -> 0000000A0000000B....
imul rax, rax, 255 # 0000000A0000000B.. -> AAAAAAAABBBBBBBB..
movabs rcx, 506097522914230528
pext rax, rcx, rax
vmovq xmm1, rax
vpmovzxbd ymm1, xmm1 # 3c latency since this is lane-crossing
vpermps ymm0, ymm1, ymm0
ret
(Later clang compiles like GCC, with mov/shl/sub instead of imul, see below.)
So, according to Agner Fog's numbers and https://uops.info/, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it's 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There's no instruction-level parallelism. In a loop where this isn't part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.
This can maybe manage a throughput of one per 4 cycles, bottlenecked on port1 for pdep/pext/imul plus popcnt in the loop. Of course, with loads/stores and other loop overhead (including the compare and movmsk), total uop throughput can easily be an issue, too.
e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we're lucky.
clang 6 and earlier created a loop-carried dependency with popcnt's false dependency on its output, so it will bottleneck on 3/5ths of the latency of the compress256 function. clang 7.0 and later use xor-zeroing to break the false dependency (instead of just using popcnt edx,edx or something like GCC does :/).
gcc (and later clang) does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub, instead of imul by 255. This takes 3 total uops vs. 1 for the front-end, but the latency is only 2 cycles, down from 3. (Haswell handles mov at register-rename stage with zero latency.) Most significantly for this, imul can only run on port 1, competing with pdep/pext/popcnt, so it's probably good to avoid that bottleneck.
Since all hardware that supports AVX2 also supports BMI2, there's probably no point providing a version for AVX2 without BMI2.
If you need to do this in a very long loop, the LUT is probably worth it if the initial cache-misses are amortized over enough iterations with the lower overhead of just unpacking the LUT entry. You still need to movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pext.
You can unpack LUT entries with the same integer sequence I used, but #Froglegs's set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn't need to go into integer registers in the first place. (A 32bit broadcast-load doesn't need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).

See my other answer for AVX2+BMI2 with no LUT.
Since you mention a concern about scalability to AVX512: don't worry, there's an AVX512F instruction for exactly this:
VCOMPRESSPS — Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory. (There are also versions for double, and 32 or 64bit integer elements (vpcompressq), but not byte or word (16bit)). It's like BMI2 pdep / pext, but for vector elements instead of bits in an integer reg.
The destination can be a vector register or a memory operand, while the source is a vector and a mask register. With a register dest, it can merge or zero the upper bits. With a memory dest, "Only the contiguous vector is written to the destination memory location".
To figure out how far to advance your pointer for the next vector, popcnt the mask.
Let's say you want to filter out everything but values >= 0 from an array:
#include <stdint.h>
#include <immintrin.h>
size_t filter_non_negative(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
const float *endp = src+len;
float *dst_start = dst;
do {
__m512 sv = _mm512_loadu_ps(src);
__mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ); // true for src >= 0.0, false for unordered and src < 0.0
_mm512_mask_compressstoreu_ps(dst, keep, sv); // clang is missing this intrinsic, which can't be emulated with a separate store
src += 16;
dst += _mm_popcnt_u64(keep); // popcnt_u64 instead of u32 helps gcc avoid a wasted movsx, but is potentially slower on some CPUs
} while (src < endp);
return dst - dst_start;
}
This compiles (with gcc4.9 or later) to (Godbolt Compiler Explorer):
# Output from gcc6.1, with -O3 -march=haswell -mavx512f. Same with other gcc versions
lea rcx, [rsi+rdx*4] # endp
mov rax, rdi
vpxord zmm1, zmm1, zmm1 # vpxor xmm1, xmm1,xmm1 would save a byte, using VEX instead of EVEX
.L2:
vmovups zmm0, ZMMWORD PTR [rsi]
add rsi, 64
vcmpps k1, zmm0, zmm1, 29 # AVX512 compares have mask regs as a destination
kmovw edx, k1 # There are some insns to add/or/and mask regs, but not popcnt
movzx edx, dx # gcc is dumb and doesn't know that kmovw already zero-extends to fill the destination.
vcompressps ZMMWORD PTR [rax]{k1}, zmm0
popcnt rdx, rdx
## movsx rdx, edx # with _popcnt_u32, gcc is dumb. No casting can get gcc to do anything but sign-extend. You'd expect (unsigned) would mov to zero-extend, but no.
lea rax, [rax+rdx*4] # dst += ...
cmp rcx, rsi
ja .L2
sub rax, rdi
sar rax, 2 # address math -> element count
ret
Performance: 256-bit vectors may be faster on Skylake-X / Cascade Lake
In theory, a loop that loads a bitmap and filters one array into another should run at 1 vector per 3 clocks on SKX / CSLX, regardless of vector width, bottlenecked on port 5. (kmovb/w/d/q k1, eax runs on p5, and vcompressps into memory is 2p5 + a store, according to IACA and to testing by http://uops.info/).
#ZachB reports in comments that in practice, that a loop using ZMM _mm512_mask_compressstoreu_ps is slightly slower than _mm256_mask_compressstoreu_ps on real CSLX hardware. (I'm not sure if that was a microbenchmark that would allow the 256-bit version to get out of "512-bit vector mode" and clock higher, or if there was surrounding 512-bit code.)
I suspect misaligned stores are hurting the 512-bit version. vcompressps probably effectively does a masked 256 or 512-bit vector store, and if that crosses a cache line boundary then it has to do extra work. Since the output pointer is usually not a multiple of 16 elements, a full-line 512-bit store will almost always be misaligned.
Misaligned 512-bit stores may be worse than cache-line-split 256-bit stores for some reason, as well as happening more often; we already know that 512-bit vectorization of other things seems to be more alignment sensitive. That may just be from running out of split-load buffers when they happen every time, or maybe the fallback mechanism for handling cache-line splits is less efficient for 512-bit vectors.
It would be interesting to benchmark vcompressps into a register, with separate full-vector overlapping stores. That's probably the same uops, but the store can micro-fuse when it's a separate instruction. And if there's some difference between masked stores vs. overlapping stores, this would reveal it.
Another idea discussed in comments below was using vpermt2ps to build up full vectors for aligned stores. This would be hard to do branchlessly, and branching when we fill a vector will probably mispredict unless the bitmask has a pretty regular pattern, or big runs of all-0 and all-1.
A branchless implementation with a loop-carried dependency chain of 4 or 6 cycles through the vector being constructed might be possible, with a vpermt2ps and a blend or something to replace it when it's "full". With an aligned vector store every iteration, but only moving the output pointer when the vector is full.
This is likely slower than vcompressps with unaligned stores on current Intel CPUs.

If you are targeting AMD Zen this method may be preferred, due to the very slow pdepand pext on ryzen (18 cycles each).
I came up with this method, which uses a compressed LUT, which is 768(+1 padding) bytes, instead of 8k. It requires a broadcast of a single scalar value, which is then shifted by a different amount in each lane, then masked to the lower 3 bits, which provides a 0-7 LUT.
Here is the intrinsics version, along with code to build LUT.
//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(u32 moveMask) {
u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
__m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr));//lower 24 bits has our LUT
// __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));
//now shift it right to get 3 bits at bottom
//__m256i shufmask = _mm256_srli_epi32(m, 29);
//Simplified version suggested by wim
//shift each lane so desired 3 bits are a bottom
//There is leftover data in the lane, but _mm256_permutevar8x32_ps only examines the first 3 bits so this is ok
__m256i shufmask = _mm256_srlv_epi32 (indices, _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21));
return shufmask;
}
u32 get_nth_bits(int a) {
u32 out = 0;
int c = 0;
for (int i = 0; i < 8; ++i) {
auto set = (a >> i) & 1;
if (set) {
out |= (i << (c * 3));
c++;
}
}
return out;
}
u8 g_pack_left_table_u8x3[256 * 3 + 1];
void BuildPackMask() {
for (int i = 0; i < 256; ++i) {
*reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
}
}
Here is the assembly generated by MSVC:
lea ecx, DWORD PTR [rcx+rcx*2]
lea rax, OFFSET FLAT:unsigned char * g_pack_left_table_u8x3 ; g_pack_left_table_u8x3
vpbroadcastd ymm0, DWORD PTR [rcx+rax]
vpsrlvd ymm0, ymm0, YMMWORD PTR __ymm#00000015000000120000000f0000000c00000009000000060000000300000000

Will add more information to a great answer from #PeterCordes : https://stackoverflow.com/a/36951611/5021064.
I did the implementations of std::remove from C++ standard for integer types with it. The algorithm, once you can do compress, is relatively simple: load a register, compress, store. First I'm going to show the variations and then benchmarks.
I ended up with two meaningful variations on the proposed solution:
__m128i registers, any element type, using _mm_shuffle_epi8 instruction
__m256i registers, element type of at least 4 bytes, using _mm256_permutevar8x32_epi32
When the types are smaller then 4 bytes for 256 bit register, I split them in two 128 bit registers and compress/store each one separately.
Link to compiler explorer where you can see complete assembly (there is a using type and width (in elements per pack) in the bottom, which you can plug in to get different variations) : https://gcc.godbolt.org/z/yQFR2t
NOTE: my code is in C++17 and is using a custom simd wrappers, so I do not know how readable it is. If you want to read my code -> most of it is behind the link in the top include on godbolt. Alternatively, all of the code is on github.
Implementations of #PeterCordes answer for both cases
Note: together with the mask, I also compute the number of elements remaining using popcount. Maybe there is a case where it's not needed, but I have not seen it yet.
Mask for _mm_shuffle_epi8
Write an index for each byte into a half byte: 0xfedcba9876543210
Get pairs of indexes into 8 shorts packed into __m128i
Spread them out using x << 4 | x & 0x0f0f
Example of spreading the indexes. Let's say 7th and 6th elements are picked.
It means that the corresponding short would be: 0x00fe. After << 4 and | we'd get 0x0ffe. And then we clear out the second f.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of `_mm_movemask_epi8`,
// `uint16_t` - there are at most 16 bits with values for __m128i.
inline std::pair<__m128i, std::uint8_t> mask128(std::uint16_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x1111111111111111) * 0xf;
const std::uint8_t offset =
static_cast<std::uint8_t>(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes =
_pext_u64(0xfedcba9876543210, mmask_expanded); // Do the #PeterCordes answer
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0...0|compressed_indexes
const __m128i as_16bit = _mm_cvtepu8_epi16(as_lower_8byte); // From bytes to shorts over the whole register
const __m128i shift_by_4 = _mm_slli_epi16(as_16bit, 4); // x << 4
const __m128i combined = _mm_or_si128(shift_by_4, as_16bit); // | x
const __m128i filter = _mm_set1_epi16(0x0f0f); // 0x0f0f
const __m128i res = _mm_and_si128(combined, filter); // & 0x0f0f
return {res, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m128i, std::uint8_t> compress_mask_for_shuffle_epi8(std::uint32_t mmask) {
auto res = _compress_mask::mask128(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Mask for _mm256_permutevar8x32_epi32
This is almost one for one #PeterCordes solution - the only difference is _pdep_u64 bit (he suggests this as a note).
The mask that I chose is 0x5555'5555'5555'5555. The idea is - I have 32 bits of mmask, 4 bits for each of 8 integers. I have 64 bits that I want to get => I need to convert each bit of 32 bits into 2 => therefore 0101b = 5.The multiplier also changes from 0xff to 3 because I will get 0x55 for each integer, not 1.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of _mm256_movemask_epi8
inline std::pair<__m256i, std::uint8_t> mask256_epi32(std::uint32_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x5555'5555'5555'5555) * 3;
const std::uint8_t offset = static_cast<std::uint8_t(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes = _pext_u64(0x0706050403020100, mmask_expanded); // Do the #PeterCordes answer
// Every index was one byte => we need to make them into 4 bytes
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0000|compressed indexes
const __m256i expanded = _mm256_cvtepu8_epi32(as_lower_8byte); // spread them out
return {expanded, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m256i, std::uint8_t> compress_mask_for_permutevar8x32(std::uint32_t mmask) {
static_assert(sizeof(T) >= 4); // You cannot permute shorts/chars with this.
auto res = _compress_mask::mask256_epi32(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Benchmarks
Processor: Intel Core i7 9700K (a modern consumer level CPU, no AVX-512 support)
Compiler: clang, build from trunk near the version 10 release
Compiler options: --std=c++17 --stdlib=libc++ -g -Werror -Wall -Wextra -Wpedantic -O3 -march=native -mllvm -align-all-functions=7
Micro-benchmarking library: google benchmark
Controlling for code alignment:
If you are not familiar with the concept, read this or watch this
All functions in the benchmark's binary are aligned to 128 byte boundary. Each benchmarking function is duplicated 64 times, with a different noop slide in the beginning of the function (before entering the loop). The main numbers I show is min per each measurement. I think this works since the algorithm is inlined. I'm also validated by the fact that I get very different results. At the very bottom of the answer I show the impact of code alignment.
Note: benchmarking code. BENCH_DECL_ATTRIBUTES is just noinline
Benchmark removes some percentage of 0s from an array. I test arrays with {0, 5, 20, 50, 80, 95, 100} percent of zeroes.
I test 3 sizes: 40 bytes (to see if this is usable for really small arrays), 1000 bytes and 10'000 bytes. I group by size because of SIMD depends on the size of the data and not a number of elements. The element count can be derived from an element size (1000 bytes is 1000 chars but 500 shorts and 250 ints). Since time it takes for non simd code depends mostly on the element count, the wins should be bigger for chars.
Plots: x - percentage of zeroes, y - time in nanoseconds. padding : min indicates that this is minimum among all alignments.
40 bytes worth of data, 40 chars
For 40 bytes this does not make sense even for chars - my implementation gets about 8-10 times slower when using 128 bit registers over non-simd code. So, for example, compiler should be careful doing this.
1000 bytes worth of data, 1000 chars
Apparently the non-simd version is dominated by branch prediction: when we get small amount of zeroes we get a smaller speed up: for no 0s - about 3 times, for 5% zeroes - about 5-6 times speed up. For when the branch predictor can't help the non-simd version - there is about a 27 times speed up. It's an interesting property of simd code that it's performance tends to be much less dependent on of data. Using 128 vs 256 register shows practically no difference, since most of the work is still split into 2 128 registers.
1000 bytes worth of data, 500 shorts
Similar results for shorts except with a much smaller gain - up to 2 times.
I don't know why shorts do that much better than chars for non-simd code: I'd expect shorts to be two times faster, since there are only 500 shorts, but the difference is actually up to 10 times.
1000 bytes worth of data, 250 ints
For a 1000 only 256 bit version makes sense - 20-30% win excluding no 0s to remove what's so ever (perfect branch prediction, no removing for non-simd code).
10'000 bytes worth of data, 10'000 chars
The same order of magnitude wins as as for a 1000 chars: from 2-6 times faster when branch predictor is helpful to 27 times when it's not.
Same plots, only simd versions:
Here we can see about a 10% win from using 256 bit registers and splitting them in 2 128 bit ones: about 10% faster. In size it grows from 88 to 129 instructions, which is not a lot, so might make sense depending on your use-case. For base-line - non-simd version is 79 instructions (as far as I know - these are smaller then SIMD ones though).
10'000 bytes worth of data, 5'000 shorts
From 20% to 9 times win, depending on the data distributions. Not showing the comparison between 256 and 128 bit registers - it's almost the same assembly as for chars and the same win for 256 bit one of about 10%.
10'000 bytes worth of data, 2'500 ints
Seems to make a lot of sense to use 256 bit registers, this version is about 2 times faster compared to 128 bit registers. When comparing with non-simd code - from a 20% win with a perfect branch prediction to 3.5 - 4 times as soon as it's not.
Conclusion: when you have a sufficient amount of data (at least 1000 bytes) this can be a very worthwhile optimisation for a modern processor without AVX-512
PS:
On percentage of elements to remove
On one hand it's uncommon to filter half of your elements. On the other hand a similar algorithm can be used in partition during sorting => that is actually expected to have ~50% branch selection.
Code alignment impact
The question is: how much worth it is, if the code happens to be poorly aligned
(generally speaking - there is very little one can do about it).
I'm only showing for 10'000 bytes.
The plots have two lines for min and for max for each percentage point (meaning - it's not one best/worst code alignment - it's the best code alignment for a given percentage).
Code alignment impact - non-simd
Chars:
From 15-20% for poor branch prediction to 2-3 times when branch prediction helped a lot. (branch predictor is known to be affected by code alignment).
Shorts:
For some reason - the 0 percent is not affected at all. It can be explained by std::remove first doing linear search to find the first element to remove. Apparently linear search for shorts is not affected.
Other then that - from 10% to 1.6-1.8 times worth
Ints:
Same as for shorts - no 0s is not affected. As soon as we go into remove part it goes from 1.3 times to 5 times worth then the best case alignment.
Code alignment impact - simd versions
Not showing shorts and ints 128, since it's almost the same assembly as for chars
Chars - 128 bit register
About 1.2 times slower
Chars - 256 bit register
About 1.1 - 1.24 times slower
Ints - 256 bit register
1.25 - 1.35 times slower
We can see that for simd version of the algorithm, code alignment has significantly less impact compared to non-simd version. I suspect that this is due to practically not having branches.

In case anyone is interested here is a solution for SSE2 which uses an instruction LUT instead of a data LUT aka a jump table. With AVX this would need 256 cases though.
Each time you call LeftPack_SSE2 below it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases don't need to modify the vector.
static inline __m128 LeftPack_SSE2(__m128 val, int mask) {
switch(mask) {
case 0:
case 1: return val;
case 2: return _mm_shuffle_ps(val,val,0x01);
case 3: return val;
case 4: return _mm_shuffle_ps(val,val,0x02);
case 5: return _mm_shuffle_ps(val,val,0x08);
case 6: return _mm_shuffle_ps(val,val,0x09);
case 7: return val;
case 8: return _mm_shuffle_ps(val,val,0x03);
case 9: return _mm_shuffle_ps(val,val,0x0c);
case 10: return _mm_shuffle_ps(val,val,0x0d);
case 11: return _mm_shuffle_ps(val,val,0x34);
case 12: return _mm_shuffle_ps(val,val,0x0e);
case 13: return _mm_shuffle_ps(val,val,0x38);
case 14: return _mm_shuffle_ps(val,val,0x39);
case 15: return val;
}
}
__m128 foo(__m128 val, __m128 maskv) {
int mask = _mm_movemask_ps(maskv);
return LeftPack_SSE2(val, mask);
}

This is perhaps a bit late though I recently ran into this exact problem and found an alternative solution which used a strictly AVX implementation. If you don't care if unpacked elements are swapped with the last elements of each vector, this could work as well. The following is an AVX version:
inline __m128 left_pack(__m128 val, __m128i mask) noexcept
{
const __m128i shiftMask0 = _mm_shuffle_epi32(mask, 0xA4);
const __m128i shiftMask1 = _mm_shuffle_epi32(mask, 0x54);
const __m128i shiftMask2 = _mm_shuffle_epi32(mask, 0x00);
__m128 v = val;
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask0);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask1);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask2);
return v;
}
Essentially, each element in val is shifted once to the left using the bitfield, 0xF9 for blending with it's unshifted variant. Next, both shifted and unshifted versions are blended against the input mask (which has the first non-zero element broadcast across the remaining elements 3 and 4). Repeat this process two more times, broadcasting the second and third elements of mask to its subsequent elements on each iteration and this should provide an AVX version of the _pdep_u32() BMI2 instruction.
If you don't have AVX, you can easily swap out each _mm_permute_ps() with _mm_shuffle_ps() for an SSE4.1-compatible version.
And if you're using double-precision, here's an additional version for AVX2:
inline __m256 left_pack(__m256d val, __m256i mask) noexcept
{
const __m256i shiftMask0 = _mm256_permute4x64_epi64(mask, 0xA4);
const __m256i shiftMask1 = _mm256_permute4x64_epi64(mask, 0x54);
const __m256i shiftMask2 = _mm256_permute4x64_epi64(mask, 0x00);
__m256d v = val;
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask0);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask1);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask2);
return v;
}
Additionally _mm_popcount_u32(_mm_movemask_ps(val)) can be used to determine the number of elements which remained after the left-packing.

Handling zeroes in _mm256_rsqrt_ps()

Given that _mm256_sqrt_ps() is relatively slow, and that the values I am generating are immediately truncated with _mm256_floor_ps(), looking around it seems that doing:
_mm256_mul_ps(_mm256_rsqrt_ps(eightFloats),
eightFloats);
Is the way to go for that extra bit of performance and avoiding a pipeline stall.
Unfortunately, with zero values, I of course get a crash calculating 1/sqrt(0). What is the best way around this? I have tried this (which works and is faster), but is there a better way, or am I going to run into problems under certain conditions?
_mm256_mul_ps(_mm256_rsqrt_ps(_mm256_max_ps(eightFloats,
_mm256_set1_ps(0.1))),
eightFloats);
My code is for a vertical application, so I can assume that it will be running on a Haswell CPU (i7-4810MQ), so FMA/AVX2 can be used. The original code is approximately:
float vals[MAX];
int sum = 0;
for (int i = 0; i < MAX; i++)
{
int thisSqrt = (int) floor(sqrt(vals[i]));
sum += min(thisSqrt, 0x3F);
}
All the values of vals should be integer values. (Why everything isn't just int is a different question...)

tl;dr: See the end for code that compiles and should work.
To just solve the 0.0 problem, you could also special-case inputs of 0.0 with an FP compare of the source against 0.0. Use the compare result as a mask to zero out any NaNs resulting from 0 * +Infinity in sqrt(x) = x * rsqrt(x)). Clang does this when autovectorizing. (But it uses blendps with the zeroed vector, instead of using the compare mask with andnps directly to zero or preserve elements.)
It would also be possible to use sqrt(x) ~= recip(rsqrt(x)), as suggested by njuffa. rsqrt(0) = +Inf. recip(+Inf) = 0. However, using two approximations would compound the relative error, which is a problem.
The thing you're missing:
Truncating to integer (instead of rounding) requires an accurate sqrt result when the input is a perfect square. If the result for 25*rsqrt(25) is 4.999999 or something (instead of 5.00001), you'll add 4 instead of 5.
Even with a Newton-Raphson iteration, rsqrtps isn't perfectly accurate the way sqrtps is, so it might still give 5.0 - 1ulp. (1ulp = one unit in the last place = lowest bit of the mantissa).
Also:
Newton Raphson formula explained
Newton Raphson SSE implementation performance (latency/throughput). Note that we care more about throughput than latency, since we're using it in a loop that doesn't do much else. sqrt isn't part of the loop-carried dep chain, so different iterations can have their sqrt calcs in flight at once.
It might be possible to kill 2 birds with one stone by adding a small constant before doing the (x+offset)*approx_rsqrt(x+offset) and then truncating to integer. Large enough to overcome the max relative error of 1.5*2-12, but small enough not to bump sqrt_approx(63*63-1+offset) up to 63 (the most sensitive case).
63*1.5*2^(-12) == 0.023071...
approx_sqrt(63*63-1) == 62.99206... +/- 0.023068..
Actually, we're screwed without a Newton iteration even without adding anything. approx_sqrt(63*63-1) could come out above 63.0 all by itself. n=36 is the largest value where the relative error in sqrt(n*n-1) + error is less than sqrt(n*n). GNU Calc:
define f(n) { local x=sqrt(n*n-1); local e=x*1.5*2^(-12); print x; print e, x+e; }
; f(36)
35.98610843089316319413
~0.01317850650545403926 ~35.99928693739861723339
; f(37)
36.9864840178138587015
~0.01354485498699237990 ~37.00002887280085108140
Does your source data have any properties that mean you don't have to worry about it being just below a large perfect square? e.g. is it always perfect squares?
You could check all possible input values, since the important domain is very small (integer FP values from 0..63*63) to see if the error in practice is small enough on Intel Haswell, but that would be a brittle optimization that could make your code break on AMD CPUs, or even on future Intel CPUs. Unfortunately, just coding to the ISA spec's guarantee that the relative error is up to 1.5*2-12 requires more instructions. I don't see any tricks a NR iteration.
If your upper limit was smaller (like 20), you could just do isqrt = static_cast<int> ((x+0.5)*approx_rsqrt(x+0.5)). You'd get 20 for 20*20, but always 19 for 20*20-1.
; define test_approx_sqrt(x, off) { local s=x*x+off; local sq=s/sqrt(s); local sq_1=(s-1)/sqrt(s-1); local e=1.5*2^(-12); print sq, sq_1; print sq*e, sq_1*e; }
; test_approx_sqrt(20, 0.5)
~20.01249609618950056874 ~19.98749609130668473087 # (x+0.5)/sqrt(x+0.5)
~0.00732879495710064718 ~0.00731963968187500662 # relative error
Note that val * (x +/- err) = val*x +/- val*err. IEEE FP mul produces results that are correctly rounded to 0.5ulp, so this should work for FP relative errors.
Anyway, I think you need one Newton-Raphson iteration.
The best bet is to add 0.5 to your input before doing an approx_sqrt using rsqrt. That sidesteps the 0/0 = NaN problem, and pushes the +/- error range all to one side of the whole number cut point (for numbers in the range we care about).
FP min/max instructions have the same performance as FP add, and will be on the critical path either way. Using an add instead of a max also solves the problem of results for perfect squares potentially being a few ulp below the correct result.
Compiler output: a decent starting point
I get pretty good autovectorization results from clang 3.7.1 with sum_int, with -fno-math-errno -funsafe-math-optimizations. -ffinite-math-only is not required (but even with the full -ffast-math, clang avoids sqrt(0) = NaN when using rsqrtps).
sum_fp doesn't auto-vectorize, even with the full -ffast-math.
However clang's version suffers from the same problem as your idea: truncating an inexact result from rsqrt + NR, potentially giving the wrong integer. IDK if this is why gcc doesn't auto-vectorize, because it could have used sqrtps for a big speedup without changing the results. (At least, as long as all the floats are between 0 and INT_MAX2, otherwise converting back to integer will give the "indefinite" result of INT_MIN. (sign bit set, all other bits cleared). This is a case where -ffast-math breaks your program, unless you use -mrecip=none or something.
See the asm output on godbolt from:
// autovectorizes with clang, but has rounding problems.
// Note the use of sqrtf, and that floorf before truncating to int is redundant. (removed because clang doesn't optimize away the roundps)
int sum_int(float vals[]){
int sum = 0;
for (int i = 0; i < MAX; i++) {
int thisSqrt = (int) sqrtf(vals[i]);
sum += std::min(thisSqrt, 0x3F);
}
return sum;
}
To manually vectorize with intrinsics, we can look at the asm output from -fno-unroll-loops (to keep things simple). I was going to include this in the answer, but then realized that it had problems.
putting it together:
I think converting to int inside the loop is better than using floorf and then addps. roundps is a 2-uop instruction (6c latency) on Haswell (1uop in SnB/IvB). Worse, both uops require port1, so they compete with FP add / mul. cvttps2dq is a 1-uop instruction for port1, with 3c latency, and then we can use integer min and add to clamp and accumulate, so port5 gets something to do. Using an integer vector accumulator also means the loop-carried dependency chain is 1 cycle, so we don't need to unroll or use multiple accumulators to keep multiple iterations in flight. Smaller code is always better for the big picture (uop cache, L1 I-cache, branch predictors).
As long as we aren't in danger of overflowing 32bit accumulators, this seems to be the best choice. (Without having benchmarked anything or even tested it).
I'm not using the sqrt(x) ~= approx_recip(approx_sqrt(x)) method, because I don't know how to do a Newton iteration to refine it (probably it would involve a division). And because the compounded error is larger.
Horizontal sum from this answer.
Complete but untested version:
#include <immintrin.h>
#define MAX 4096
// 2*sqrt(x) ~= 2*x*approx_rsqrt(x), with a Newton-Raphson iteration
// dividing by 2 is faster in the integer domain, so we don't do it
__m256 approx_2sqrt_ps256(__m256 x) {
// clang / gcc usually use -3.0 and -0.5. We could do the same by using fnmsub_ps (add 3 = subtract -3), so we can share constants
__m256 three = _mm256_set1_ps(3.0f);
//__m256 half = _mm256_set1_ps(0.5f); // we omit the *0.5 step
__m256 nr = _mm256_rsqrt_ps( x ); // initial approximation for Newton-Raphson
// 1/sqrt(x) ~= nr * (3 - x*nr * nr) * 0.5 = nr*(1.5 - x*0.5*nr*nr)
// sqrt(x) = x/sqrt(x) ~= (x*nr) * (3 - x*nr * nr) * 0.5
// 2*sqrt(x) ~= (x*nr) * (3 - x*nr * nr)
__m256 xnr = _mm256_mul_ps( x, nr );
__m256 three_minus_muls = _mm256_fnmadd_ps( xnr, nr, three ); // -(xnr*nr) + 3
return _mm256_mul_ps( xnr, three_minus_muls );
}
// packed int32_t: correct results for inputs from 0 to well above 63*63
__m256i isqrt256_ps(__m256 x) {
__m256 offset = _mm256_set1_ps(0.5f); // or subtract -0.5, to maybe share constants with compiler-generated Newton iterations.
__m256 xoff = _mm256_add_ps(x, offset); // avoids 0*Inf = NaN, and rounding error before truncation
__m256 approx_2sqrt_xoff = approx_2sqrt_ps256(xoff);
__m256i i2sqrtx = _mm256_cvttps_epi32(approx_2sqrt_xoff);
return _mm256_srli_epi32(i2sqrtx, 1); // divide by 2 with truncation
// alternatively, we could mask the low bit to zero and divide by two outside the loop, but that has no advantage unless port0 turns out to be the bottleneck
}
__m256i isqrt256_ps_simple_exact(__m256 x) {
__m256 sqrt_x = _mm256_sqrt_ps(x);
__m256i isqrtx = _mm256_cvttps_epi32(sqrt_x);
return isqrtx;
}
int hsum_epi32_avx(__m256i x256){
__m128i xhi = _mm256_extracti128_si256(x256, 1);
__m128i xlo = _mm256_castsi256_si128(x256);
__m128i x = _mm_add_epi32(xlo, xhi);
__m128i hl = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
hl = _mm_add_epi32(hl, x);
x = _mm_shuffle_epi32(hl, _MM_SHUFFLE(2, 3, 0, 1));
hl = _mm_add_epi32(hl, x);
return _mm_cvtsi128_si32(hl);
}
int sum_int_avx(float vals[]){
__m256i sum = _mm256_setzero_si256();
__m256i upperlimit = _mm256_set1_epi32(0x3F);
for (int i = 0; i < MAX; i+=8) {
__m256 v = _mm256_loadu_ps(vals+i);
__m256i visqrt = isqrt256_ps(v);
// assert visqrt == isqrt256_ps_simple_exact(v) or something
visqrt = _mm256_min_epi32(visqrt, upperlimit);
sum = _mm256_add_epi32(sum, visqrt);
}
return hsum_epi32_avx(sum);
}
Compiles on godbolt to nice code, but I haven't tested it. clang makes slightly nicer code that gcc: clang uses broadcast-loads from 4B locations for the set1 constants, instead of repeating them at compile time into 32B constants. gcc also has a bizarre movdqa to copy a register.
Anyway, the whole loop winds up being only 9 vector instructions, compared to 12 for the compiler-generated sum_int version. It probably didn't notice the x*initial_guess(x) common-subexpressions that occur in the Newton-Raphson iteration formula when you're multiplying the result by x, or something like that. It also does an extra mulps instead of a psrld because it does the *0.5 before converting to int. So that's where the extra two mulps instructions come from, and there's the cmpps/blendvps.
sum_int_avx(float*):
vpxor ymm3, ymm3, ymm3
xor eax, eax
vbroadcastss ymm0, dword ptr [rip + .LCPI4_0] ; set1(0.5)
vbroadcastss ymm1, dword ptr [rip + .LCPI4_1] ; set1(3.0)
vpbroadcastd ymm2, dword ptr [rip + .LCPI4_2] ; set1(63)
LBB4_1: ; latencies
vaddps ymm4, ymm0, ymmword ptr [rdi + 4*rax] ; 3c
vrsqrtps ymm5, ymm4 ; 7c
vmulps ymm4, ymm4, ymm5 ; x*nr ; 5c
vfnmadd213ps ymm5, ymm4, ymm1 ; 5c
vmulps ymm4, ymm4, ymm5 ; 5c
vcvttps2dq ymm4, ymm4 ; 3c
vpsrld ymm4, ymm4, 1 ; 1c this would be a mulps (but not on the critical path) if we did this in the FP domain
vpminsd ymm4, ymm4, ymm2 ; 1c
vpaddd ymm3, ymm4, ymm3 ; 1c
; ... (those 9 insns repeated: loop unrolling)
add rax, 16
cmp rax, 4096
jl .LBB4_1
;... horizontal sum
IACA thinks that with no unroll, Haswell can sustain a throughput of one iteration per 4.15 cycles, bottlenecking on ports 0 and 1. So potentially you could shave a cycle by accumulating sqrt(x)*2 (with truncation to even numbers using _mm256_and_si256), and only divide by two outside the loop.
Also according to IACA, the latency of a single iteration is 38 cycles on Haswell. I only get 31c, so probably it's including L1 load-use latency or something. Anyway, this means that to saturate the execution units, operations from 8 iterations have to be in flight at once. That's 8 * ~14 unfused-domain uops = 112 unfused-uops (or less with clang's unroll) that have to be in flight at once. Haswell's scheduler is actually only 60 entries, but the ROB is 192 entries. The early uops from early iterations will already have executed, so they only need to be tracked in the ROB, not also in the scheduler. Many of the slow uops are at the beginning of each iteration, though. Still, there's reason to hope that this will come close-ish to saturating ports 0 and 1. Unless data is hot in L1 cache, cache/memory bandwidth will probably be the bottleneck.
Interleaving operations from multiple dep chains would also be better. When clang unrolls, it puts all 9 instructions for one iteration ahead of all 9 instructions for another iteration. It uses a surprisingly small number of registers, so it would be possible to have instructions for 2 or 4 iterations mixed together. This is the sort of thing compilers are supposed to be good at, but which is cumbersome for humans. :/
It would also be slightly more efficient if the compiler chose a one-register addressing mode, so the load could micro-fuse with the vaddps. gcc does this.

The best way to shift a __m128i?

I need to shift a __m128i variable, (say v), by m bits, in such a way that bits move through all of the variable (So, the resulting variable represents v*2^m).
What is the best way to do this?!
Note that _mm_slli_epi64 shifts v0 and v1 seperately:
r0 := v0 << count
r1 := v1 << count
so the last bits of v0 missed, but I want to move those bits to r1.
Edit:
I looking for a code, faster than this (m<64):
r0 = v0 << m;
r1 = v0 >> (64-m);
r1 ^= v1 << m;
r2 = v1 >> (64-m);

For compile-time constant shift counts, you can get fairly good results. Otherwise not really.
This is just an SSE implementation of the r0 / r1 code from your question, since there's no other obvious way to do it. Variable-count shifts are only available for bit-shifts within vector elements, not for byte-shifts of the whole register. So we just carry the low 64bits up to the high 64 and use a variable-count shift to put them in the right place.
// untested
#include <immintrin.h>
/* some compilers might choke on slli / srli with non-compile-time-constant args
* gcc generates the xmm, imm8 form with constants,
* and generates the xmm, xmm form with otherwise. (With movd to get the count in an xmm)
*/
// doesn't optimize for the special-case where count%8 = 0
// could maybe do that in gcc with if(__builtin_constant_p(count)) { if (!count%8) return ...; }
__m128i mm_bitshift_left(__m128i x, unsigned count)
{
__m128i carry = _mm_bslli_si128(x, 8); // old compilers only have the confusingly named _mm_slli_si128 synonym
if (count >= 64)
return _mm_slli_epi64(carry, count-64); // the non-carry part is all zero, so return early
// else
carry = _mm_srli_epi64(carry, 64-count); // After bslli shifted left by 64b
x = _mm_slli_epi64(x, count);
return _mm_or_si128(x, carry);
}
__m128i mm_bitshift_left_3(__m128i x) { // by a specific constant, to see inlined constant version
return mm_bitshift_left(x, 3);
}
// by a specific constant, to see inlined constant version
__m128i mm_bitshift_left_100(__m128i x) { return mm_bitshift_left(x, 100); }
I thought this was going to be less convenient than it turned out to be. _mm_slli_epi64 works on gcc/clang/icc even when the count is not a compile-time constant (generating a movd from integer reg to xmm reg). There is a _mm_sll_epi64 (__m128i a, __m128i count) (note the lack of i), but at least these days, the i intrinsic can generate either form of psllq.
The compile-time-constant count versions are fairly efficient, compiling to 4 instructions (or 5 without AVX):
mm_bitshift_left_3(long long __vector(2)):
vpslldq xmm1, xmm0, 8
vpsrlq xmm1, xmm1, 61
vpsllq xmm0, xmm0, 3
vpor xmm0, xmm0, xmm1
ret
Performance:
This has 3 cycle latency (vpslldq(1) -> vpsrlq(1) -> vpor(1)) on Intel SnB/IvB/Haswell, with throughput limited to one per 2 cycles (saturating the vector shift unit on port 0). Byte-shift runs on the shuffle unit on a different port. Immediate-count vector shifts are all single-uop instructions, so this is only 4 fused-domain uops taking up pipeline space when mixed in with other code. (Variable-count vector shifts are 2 uop, 2 cycle latency, so the variable-count version of this function is worse than it looks from counting instructions.)
Or for counts >= 64:
mm_bitshift_left_100(long long __vector(2)):
vpslldq xmm0, xmm0, 8
vpsllq xmm0, xmm0, 36
ret
If your shift-count is not a compile-time constant, you have to branch on count > 64 to figure out whether to left or right shift the carry. I believe the shift count is interpreted as an unsigned integer, so a negative count is impossible.
It also takes extra instructions to get the int count and 64-count into vector registers. Doing this in a branchless fashion with vector compares and a blend instruction might be possible, but a branch is probably a good idea.
The variable-count version for __uint128_t in GP registers looks fairly good; better than the SSE version. Clang does a slightly better job than gcc, emitting fewer mov instructions, but it still uses two cmov instructions for the count >= 64 case. (Because x86 integer shift instructions mask the count, instead of saturating.)
__uint128_t leftshift_int128(__uint128_t x, unsigned count) {
return x << count; // undefined if count >= 128
}

In SSE4.A the instructions insrq and extrq can be used to shift (and rotate) through __mm128i 1-64 bits at a time. Unlike the 8/16/32/64 bit counterparts pextrN/pinsrX, these instructions select or insert m bits (between 1 and 64) at any bit offset from 0 to 127. The caveat is that the sum of lenght and offset must not exceed 128.

Where can I find soft-multiply and divide algorithms?

I'm working on a micro-controller without hardware multiply and divide. I need to cook up software algorithms for these basic operations that are a nice balance of compact size and efficiency. My C compiler port will employ these algos, not the the C developers themselves.
My google-fu is so far turning up mostly noise on this topic.
Can anyone point me to something informative? I can use add/sub and shift instructions. Table lookup based algos might also work for me, but I'm a bit worried about cramming so much into the compiler's back-end...um, so to speak.

Here's a simple multiplication algorithm:
Start with rightmost bit of multiplier.
If bit in multiplier is 1, add multiplicand
Shift multiplicand by 1
Move to next bit in multiplier and go back to step 2.
And here's a division algorithm:
If divisor is larger than dividend, stop.
While divisor register is less than dividend register, shift left.
Shift divisor register right by 1.
Subtract divisor register from dividend register and change the bit to 1 in the result register at the bit that corresponds with the total number of shifts done to the divisor register.
Start over at step 1 with divisor register in original state.
Of course you'll need to put in a check for dividing by 0, but it should work.
These algorithms, of course, are only for integers.

My favorite reference for things like this, available in book form:
http://www.hackersdelight.org/
Also you can't go wrong with TAoCP: http://www-cs-faculty.stanford.edu/~uno/taocp.html

Here's a division algorithm: http://www.prasannatech.net/2009/01/division-without-division-operator_24.html
I assume we're talking about ints?
If there's no hardware support, you'll have to implement your own divide-by-zero exception.
(I didn't have much luck quickly finding a multiplication algorithm, but I'll keep looking if someone else doesn't find one).

One simple and fairly performant multiplication algorithm for integers is Russian Peasant Multiplication.
For rationals, you could try a binary quote notation, for which division is easier than usual.

It turns out that I still have some old 68000 assembler code for long multiplication and long division. 68000 code is pretty clean and simple, so should be easy to translate for your chip.
The 68000 had multiply and divide instructions IIRC - I think these were written as a learning exercise.
Decided to just put the code here. Added comments and, in the process, fixed a problem.
;
; Purpose : division of longword by longword to give longword
; : all values signed.
; Requires : d0.L == Value to divide
; : d1.L == Value to divide by
; Changes : d0.L == Remainder
; : d2.L == Result
; : corrupts d1, d3, d4
;
section text
ldiv: move #0,d3 ; Convert d0 -ve to +ve - d3 records original sign
tst.l d0
bpl.s lib5a
neg.l d0
not d3
lib5a: tst.l d1 ; Convert d1 -ve to +ve - d3 records result sign
bpl.s lib5b
neg.l d1
not d3
lib5b: tst.l d1 ; Detect division by zero (not really handled well)
bne.s lib3a
rts
lib3a: moveq.l #0,d2 ; Init working result d2
moveq.l #1,d4 ; Init d4
lib3b: cmp.l d0,d1 ; while d0 < d1 {
bhi.s lib3c
asl.l #1,d1 ; double d1 and d4
asl.l #1,d4
bra.s lib3b ; }
lib3c: asr.l #1,d1 ; halve d1 and d4
asr.l #1,d4
bcs.s lib3d ; stop when d4 reaches zero
cmp.l d0,d1 ; do subtraction if appropriate
bhi.s lib3c
or.l d4,d2 ; update result
sub.l d1,d0
bne.s lib3c
lib3d: ; fix the result and remainder signs
; and.l #$7fffffff,d2 ; don't know why this is here
tst d3
beq.s lib3e
neg.l d2
neg.l d0
lib3e: rts
;
; Purpose : Multiply long by long to give long
; Requires : D0.L == Input 1
; : D1.L == Input 2
; Changes : D2.L == Result
; : D3.L is corrupted
;
lmul: move #0,d3 ; d0 -ve to +ve, original sign in d3
tst.l d0
bpl.s lib4c
neg.l d0
not d3
lib4c: tst.l d1 ; d1 -ve to +ve, result sign in d3
bpl.s lib4d
neg.l d1
not d3
lib4d: moveq.l #0,d2 ; init d2 as working result
lib4a: asr.l #1,d0 ; shift d0 right
bcs.s lib4b ; if a bit fell off, update result
asl.l #1,d1 ; either way, shift left d1
tst.l d0
bne.s lib4a ; if d0 non-zero, continue
tst.l d3 ; basically done - apply sign?
beq.s lib4e ; was broken! now fixed
neg.l d2
lib4e: rts
lib4b: add.l d1,d2 ; main loop body - update result
asl.l #1,d1
bra.s lib4a
By the way - I never did figure out whether it was necessary to convert everything to positive at the start. If you're careful with the shift operations, that may be avoidable overhead.

To multiply, add partial products from the shifted multiplicand to an accumulator iff the corresponding bit in the multiplier is set. Shift multiplicand and multiplier at end of loop, testing multiplier & 1 to see if addition should be done.

The Microchip PICmicro 16Fxxx series chips do not have a multiply or divide instruction.
Perhaps some of the soft multiply and soft divide routines for it can be ported to your MCU.
PIC Microcontroller Basic Math Multiplication Methods
PIC Microcontroller Basic Math Division Methods
Also check out "Newton's method" for division.
I think that method gives the smallest executable size of any division algorithm I've ever seen, although the explanation makes it sound more complicated than it really is.
I hear that some early Cray supercomputers used Newton's method for division.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight