Creating a mask with N least significant bits set - c

I would like to create a macro or function1 mask(n) which given a number n returns an unsigned integer with its n least significant bits set. Although this seems like it should be a basic primitive with heavily discussed implementations which compile efficiently - this doesn't seem to be the case.
Of course, various implementations may have different sizes for the primitive integral types like unsigned int, so let's assume for the sake of concreteness that we are talking returning a uint64_t specifically although of course an acceptable solutions would work (with different definitions) for any unsigned integral type. In particular, the solution should be efficient when the type returned is equal to or smaller than the platform's native width.
Critically, this must work for all n in [0, 64]. In particular mask(0) == 0 and mask(64) == (uint64_t)-1. Many "obvious" solutions don't work for one of these two cases.
The most important criteria is correctness: only correct solutions which don't rely on undefined behavior are interesting.
The second most important criteria is performance: the idiom should ideally compile to approximately the most efficient platform-specific way to do this on common platforms.
A solution that sacrifices simplicity in the name of performance, e.g., that uses different implementations on different platforms, is fine.
1 The most general case is a function, but ideally it would also work as a macro, without re-evaluating any of its arguments more than once.

Try
unsigned long long mask(const unsigned n)
{
assert(n <= 64);
return (n == 64) ? 0xFFFFFFFFFFFFFFFFULL :
(1ULL << n) - 1ULL;
}
There are several great, clever answers that avoid conditionals, but a modern compiler can generate code for this that doesn’t branch.
Your compiler can probably figure out to inline this, but you might be able to give it a hint with inline or, in C++, constexpr.
The unsigned long long int type is guaranteed to be at least 64 bits wide and present on every implementation, which uint64_t is not.
If you need a macro (because you need something that works as a compile-time constant), that might be:
#define mask(n) ((64U == (n)) ? 0xFFFFFFFFFFFFFFFFULL : (1ULL << (unsigned)(n)) - 1ULL)
As several people correctly reminded me in the comments, 1ULL << 64U is potential undefined behavior! So, insert a check for that special case.
You could replace 64U with CHAR_BITS*sizeof(unsigned long long) if it is important to you to support the full range of that type on an implementation where it is wider than 64 bits.
You could similarly generate this from an unsigned right shift, but you would still need to check n == 64 as a special case, since right-shifting by the width of the type is undefined behavior.
ETA:
The relevant portion of the (N1570 Draft) standard says, of both left and right bit shifts:
If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
This tripped me up. Thanks again to everyone in the comments who reviewed my code and pointed the bug out to me.

Another solution without branching
unsigned long long mask(unsigned n)
{
return ((1ULL << (n & 0x3F)) & -(n != 64)) - 1;
}
n & 0x3F keeps the shift amount to maximum 63 in order to avoid UB. In fact most modern architectures will just grab the lower bits of the shift amount, so no and instruction is needed for this.
The checking condition for 64 can be changed to -(n < 64) to make it return all ones for n ⩾ 64, which is equivalent to _bzhi_u64(-1ULL, (uint8_t)n) if your CPU supports BMI2.
The output from Clang looks better than gcc. As it happens gcc emits conditional instructions for MIPS64 and ARM64 but not for x86-64, resulting in longer output
The condition can also be simplified to n >> 6, utilizing the fact that it'll be one if n = 64. And we can subtract that from the result instead of creating a mask like above
return (1ULL << (n & 0x3F)) - (n == 64) - 1; // or n >= 64
return (1ULL << (n & 0x3F)) - (n >> 6) - 1;
gcc compiles the latter to
mov eax, 1
shlx rax, rax, rdi
shr edi, 6
dec rax
sub rax, rdi
ret
Some more alternatives
return ~((~0ULL << (n & 0x3F)) << (n == 64));
return ((1ULL << (n & 0x3F)) - 1) | (((uint64_t)n >> 6) << 63);
return (uint64_t)(((__uint128_t)1 << n) - 1); // if a 128-bit type is available
A similar question for 32 bits: Set last `n` bits in unsigned int

Here's one that is portable and conditional-free:
unsigned long long mask(unsigned n)
{
assert (n <= sizeof(unsigned long long) * CHAR_BIT);
return (1ULL << (n/2) << (n-(n/2))) - 1;
}

This is not an answer to the exact question. It only works if 0 isn't a required output, but is more efficient.
2n+1 - 1 computed without overflow. i.e. an integer with the low n bits set, for n = 0 .. all_bits
Possibly using this inside a ternary for cmov could be a more efficient solution to the full problem in the question. Perhaps based on a left-rotate of a number with the MSB set, instead of a left-shift of 1, to take care of the difference in counting for this vs. the question for the pow2 calculation.
// defined for n=0 .. sizeof(unsigned long long)*CHAR_BIT
unsigned long long setbits_upto(unsigned n) {
unsigned long long pow2 = 1ULL << n;
return pow2*2 - 1; // one more shift, and subtract 1.
}
Compiler output suggests an alternate version, good on some ISAs if you're not using gcc/clang (which already do this): bake in an extra shift count so it is possible for the initial shift to shift out all the bits, leaving 0 - 1 = all bits set.
unsigned long long setbits_upto2(unsigned n) {
unsigned long long pow2 = 2ULL << n; // bake in the extra shift count
return pow2 - 1;
}
The table of inputs / outputs for a 32-bit version of this function is:
n -> 1<<n -> *2 - 1
0 -> 1 -> 1 = 2 - 1
1 -> 2 -> 3 = 4 - 1
2 -> 4 -> 7 = 8 - 1
3 -> 8 -> 15 = 16 - 1
...
30 -> 0x40000000 -> 0x7FFFFFFF = 0x80000000 - 1
31 -> 0x80000000 -> 0xFFFFFFFF = 0 - 1
You could slap a cmov after it, or other way of handling an input that has to produce zero.
On x86, we can efficiently compute this with 3 single-uop instructions: (Or 2 uops for BTS on Ryzen).
xor eax, eax
bts rax, rdi ; rax = 1<<(n&63)
lea rax, [rax + rax - 1] ; one more left shift, and subtract
(3-component LEA has 3 cycle latency on Intel, but I believe this is optimal for uop count and thus throughput in many cases.)
In C this compiles nicely for all 64-bit ISAs except x86 Intel SnB-family
C compilers unfortunately are dumb and miss using bts even when tuning for Intel CPUs without BMI2 (where shl reg,cl is 3 uops).
e.g. gcc and clang both do this (with dec or add -1), on Godbolt
# gcc9.1 -O3 -mtune=haswell
setbits_upto(unsigned int):
mov ecx, edi
mov eax, 2 ; bake in the extra shift by 1.
sal rax, cl
dec rax
ret
MSVC starts with n in ECX because of the Windows x64 calling convention, but modulo that, it and ICC do the same thing:
# ICC19
setbits_upto(unsigned int):
mov eax, 1 #3.21
mov ecx, edi #2.39
shl rax, cl #2.39
lea rax, QWORD PTR [-1+rax+rax] #3.21
ret #3.21
With BMI2 (-march=haswell), we get optimal-for-AMD code from gcc/clang with -march=haswell
mov eax, 2
shlx rax, rax, rdi
add rax, -1
ICC still uses a 3-component LEA, so if you target MSVC or ICC use the 2ULL << n version in the source whether or not you enable BMI2, because you're not getting BTS either way. And this avoids the worst of both worlds; slow-LEA and a variable-count shift instead of BTS.
On non-x86 ISAs (where presumably variable-count shifts are efficient because they don't have the x86 tax of leaving flags unmodified if the count happens to be zero, and can use any register as the count), this compiles just fine.
e.g. AArch64. And of course this can hoist the constant 2 for reuse with different n, like x86 can with BMI2 shlx.
setbits_upto(unsigned int):
mov x1, 2
lsl x0, x1, x0
sub x0, x0, #1
ret
Basically the same on PowerPC, RISC-V, etc.

#include <stdint.h>
uint64_t mask_n_bits(const unsigned n){
uint64_t ret = n < 64;
ret <<= n&63; //the &63 is typically optimized away
ret -= 1;
return ret;
}
Results:
mask_n_bits:
xor eax, eax
cmp edi, 63
setbe al
shlx rax, rax, rdi
dec rax
ret
Returns expected results and if passed a constant value it will be optimized to a constant mask in clang and gcc as well as icc at -O2 (but not -Os) .
Explanation:
The &63 gets optimized away, but ensures the shift is <=64.
For values less than 64 it just sets the first n bits using (1<<n)-1. 1<<n sets the nth bit (equivalent pow(2,n)) and subtracting 1 from a power of 2 sets all bits less than that.
By using the conditional to set the initial 1 to be shifted, no branch is created, yet it gives you a 0 for all values >=64 because left shifting a 0 will always yield 0. Therefore when we subtract 1, we get all bits set for values of 64 and larger (because of 2s complement representation for -1).
Caveats:
1s complement systems must die - requires special casing if you have one
some compilers may not optimize the &63 away

When the input N is between 1 and 64, we can use -uint64_t(1) >> (64-N & 63).
The constant -1 has 64 set bits and we shift 64-N of them away, so we're left with N set bits.
When N=0, we can make the constant zero before shifting:
uint64_t mask(unsigned N)
{
return -uint64_t(N != 0) >> (64-N & 63);
}
This compiles to five instructions in x64 clang:
neg sets the carry flag to N != 0.
sbb turns the carry flag into 0 or -1.
shr rax,N already has an implicit N & 63, so 64-N & 63 was optimized to -N.
mov rcx,rdi
neg rcx
sbb rax,rax
shr rax,cl
ret
With the BMI2 extension, it's only four instructions (the shift length can stay in rdi):
neg edi
sbb rax,rax
shrx rax,rax,rdi
ret

Related

AVX512 - How to move all set bits to the right?

How can I move all set bits of mask register to right? (To the bottom, least-significant position).
For example:
__mmask16 mask = _mm512_cmpeq_epi32_mask(vload, vlimit); // mask = 1101110111011101
If we move all set bits to the right, we will get: 1101110111011101 -> 0000111111111111
How can I achieve this efficiently?
Below you can see how I tried to get the same result, but it's inefficient:
__mmask16 mask = 56797;
// mask: 1101110111011101
__m512i vbrdcast = _mm512_maskz_broadcastd_epi32(mask, _mm_set1_epi32(~0));
// vbrdcast: -1 0 -1 -1 -1 0 -1 -1 -1 0 -1 -1 -1 0 -1 -1
__m512i vcompress = _mm512_maskz_compress_epi32(mask, vbrdcast);
// vcompress:-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 0 0
__mmask16 right_packed_mask = _mm512_movepi32_mask(vcompress);
// right_packed_mask: 0000111111111111
What is the best way to do this?
BMI2 pext is the scalar bitwise equivalent of v[p]compressd/q/ps/pd.
Use it on your mask value to left-pack them to the bottom of the value.
mask = _pext_u32(-1U, mask); // or _pext_u64(-1ULL, mask64) for __mmask64
// costs 3 asm instructions (kmov + pext + kmov) if you need to use the result as a mask
// not including putting -1 in a register.
Implicit conversion between __mmask16 (aka uint16_t in GCC) and uint32_t works.
Use _cvtu32_mask16 and _cvtu32_mask16 to make the KMOVW explicit if you like.
See How to unset N right-most set bits for more about using pext/pdep in ways like this.
All current CPUs with AVX-512 also have fast BMI2 pext (including Xeon Phi), same performance as popcnt. AMD had slow pext until Zen 3, but if/when AMD ever introduces an AVX-512 CPU it should have fast pext/pdep.
For earlier AMD without AVX512, you might want (1ULL << __builtin_popcount(mask)) - 1, but be careful of overflow if all bits are set. 1ULL << 64 is undefined behaviour, and likely to produce 1 not 0 when compiled for x86-64.
If you were going to use vpcompressd, note that the source vector can simply be all-ones _mm512_set1_epi32(-1); compress doesn't care about elements where the mask was zero, they don't need to already be zero.
(It doesn't matter which -1s you pack; once you're working with boolean values, there's no difference between a true that came from your original bitmask vs. a constant true that was just sitting there which you generated more cheaply, without a dependency on your input mask. Same reasoning applies for pext, why you can use -1U as the source data instead of a pdep. i.e. a -1 or set bit doesn't have an identity; it's the same as any other -1 or set bit).
So let's try both ways and see how good/bad the asm is.
inline
__mmask16 leftpack_k(__mmask16 mask){
return _pdep_u32(-1U, mask);
}
inline
__mmask16 leftpack_comp(__mmask16 mask) {
__m512i v = _mm512_maskz_compress_epi32(mask, _mm512_set1_epi32(-1));
return _mm512_movepi32_mask(v);
}
Looking at stand-alone versions of these isn't useful because __mmask16 is a typedef for unsigned short, and is thus passed/returned in integer registers, not k registers. That makes the pext version look very good, of course, but we want to see how it inlines into a case where we generate and use the mask with AVX-512 intrinsics.
// not a useful function, just something that compiles to asm in an obvious way
void use_leftpack_compress(void *dst, __m512i v){
__mmask16 m = _mm512_test_epi32_mask(v,v);
m = leftpack_comp(m);
_mm512_mask_storeu_epi32(dst, m, v);
}
Commenting out the m = pack(m), this is just a simple 2 instructions that generate and then use a mask.
use_mask_nocompress(void*, long long __vector(8)):
vptestmd k1, zmm0, zmm0
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
So any extra instructions will be due to left-packing (compressing) the mask. GCC and clang make the same asm as each other, differing only in clang avoiding kmovw in favour of always kmovd. Godbolt
# GCC10.3 -O3 -march=skylake-avx512
use_leftpack_k(void*, long long __vector(8)):
vptestmd k0, zmm0, zmm0
mov eax, -1 # could be hoisted out of a loop
kmovd edx, k0
pdep eax, eax, edx
kmovw k1, eax
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
use_leftpack_compress(void*, long long __vector(8)):
vptestmd k1, zmm0, zmm0
vpternlogd zmm2, zmm2, zmm2, 0xFF # set1(-1) could be hoisted out of a loop
vpcompressd zmm1{k1}{z}, zmm2
vpmovd2m k1, zmm1
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
So the non-hoistable parts are
kmov r,k (port 0) / pext (port 1) / kmov k,r (port 5) = 3 uops, one for each execution port. (Including port 1, which has its vector ALUs shut down while 512-bit uops are in flight). The kmov/kmov round trip has 4 cycle latency on SKX, and pext is 3 cycle latency, for a total of 7 cycle latency.
vpcompressd zmm{k}{z}, z (2 p5) / vpmovd2m (port 0) = 3 uops, two for port 5. vpmovd2m has 3 cycle latency on SKX / ICL, and vpcompressd-zeroing-into-zmm has 6 cycle from the k input to the zmm output (SKX and ICL). So a total of 9 cycle latency, and worse port distribution for the uops.
Also, the hoistable part is generally worse (vpternlogd is longer and competes for fewer ports than mov r32, imm32), unless your function already needs an all-ones vector for something but not an all-ones register.
Conclusion: the BMI2 pext way is not worse in any way, and better in several. (Unless surrounding code heavily bottlenecked on port 1 uops, which is very unlikely if using 512-bit vectors because in that case it can only be running scalar integer uops like 3-cycle LEA, IMUL, LZCNT, and of course simple 1-cycle integer stuff like add/sub/and/or).

How can I generate a 256 bit mask

I have an array of uint64_t[4], and I need to generate a mask,
such that the array, if it were a 256-bit integer, equals
(1 << w) - 1, where w goes from 1 to 256.
The best thing I have come up with is branchless, but it takes MANY instructions. It is in Zig because Clang doesn't seem to expose llvm's saturating subtraction. http://localhost:10240/z/g8h1rV
Is there a better way to do this?
var mask: [4]u64 = undefined;
for (mask) |_, i|
mask[i] = 0xffffffffffffffff;
mask[3] ^= ((u64(1) << #intCast(u6, (inner % 64) + 1)) - 1) << #intCast(u6, 64 - (inner % 64));
mask[2] ^= ((u64(1) << #intCast(u6, (#satSub(u32, inner, 64) % 64) + 1)) - 1) << #intCast(u6, 64 - (inner % 64));
mask[1] ^= ((u64(1) << #intCast(u6, (#satSub(u32, inner, 128) % 64) + 1)) - 1) << #intCast(u6, 64 - (inner % 64));
mask[0] ^= ((u64(1) << #intCast(u6, (#satSub(u32, inner, 192) % 64) + 1)) - 1) << #intCast(u6, 64 - (inner % 64));
Are you targeting x86-64 with AVX2 for 256-bit vectors? I thought that was an interesting case to answer for.
If so, you can do this in a few instructions using saturating subtraction and a variable count shift.
x86 SIMD shifts like vpsrlvq saturate the shift count, shifting all the bits out when the count is >= element width. Unlike integer shifts the shift count is masked (and thus wraps around).
For the lowest u64 element, starting with all-ones we need to leave it unmodified for bitpos >= 64. Or for smaller bit positions, right-shift it by 64-bitpos. Unsigned saturating subtraction looks like the way to go here, as you observed, to create a shift count of 0 for larger bitpos. But x86 only has SIMD saturating subtraction, and only for byte or word elements. But if we don't care about bitpos > 256, that's fine we can use 16-bit elements at the bottom of each u64, and let a 0-0 happen in the rest of the u64.
Your code looks pretty overcomplicated, creating (1<<n) - 1 and XORing. I think it's a lot easier to just use a variable-count shift on the 0xFFFF...FF elements directly.
I don't know Zig, so do whatever you have to to get it to emit asm like this. Hopefully this is useful because you tagged this assembly; should be easy to translate to intrinsics for C, or Zig if it has them.
default rel
section .rodata
shift_offsets: dw 64, 128, 192, 256 ; 16-bit elements, to be loaded with zero-extension to 64
section .text
pos_to_mask256:
vpmovzxwq ymm2, [shift_offsets] ; _mm256_set1_epi64x(256, 192, 128, 64)
vpcmpeqd ymm1, ymm1,ymm1 ; ymm1 = all-ones
; set up vector constants, can be hoisted
vmovd xmm0, edi
vpbroadcastq ymm0, xmm0 ; ymm0 = _mm256_set1_epi64(bitpos)
vpsubusw ymm0, ymm2, ymm0 ; ymm0 = {256,192,128,64}-bitpos with unsigned saturation
vpsrlvq ymm0, ymm1, ymm0 ; mask[i] >>= count, where counts >= 64 create 0s.
ret
If the input integer starts in memory, you can of course efficiently broadcast-load it into a ymm register directly.
The shift-offsets vector can of course be hoisted out of a loop, as can the all-ones.
With input = 77, the high 2 elements are zeroed by shifts of 256-77=179, and 192-77=115 bits. Tested with NASM + GDB for EDI=77, and the result is
(gdb) p /x $ymm0.v4_int64
{0xffffffffffffffff, 0x1fff, 0x0, 0x0}
GDB prints low element first, opposite of Intel notation / diagrams. This vector is actually 0, 0, 0x1fff, 0xffffffffffffffff, i.e. 64+13 = 77 one bits, and the rest all zeros. Other test cases
edi=0: mask = all-zero
edi=1: mask = 1
... : mask = edi one bits at the bottom, then zeros
edi=255: mask = all ones except for the top bit of the top element
edi=256: mask = all ones
edi>256: mask = all ones. (unsigned subtraction saturates to 0 everywhere.)
You need AVX2 for the variable-count shifts. psubusb/w is SSE2, so you could consider doing that part with SIMD and then go back to scalar integer for the shifts, or maybe just use SSE2 shifts for one element at a time. Like psrlq xmm1, xmm0 which takes the low 64 bits of xmm0 as the shift count for all elements of xmm1.
Most ISAs don't have saturating scalar subtraction. Some ARM CPUs do for scalar integer, I think, but x86 doesn't. IDK what you're using.
On x86 (and many other ISAs) you have 2 problems:
keep all-ones for low elements (either modify the shift result, or saturate shift count to 0)
produce 0 for high elements above the one containing the top bit of the mask. x86 scalar shifts can't do this at all, so you might feed the shift an input of 0 for that case. Maybe using cmov to create it based on flags set by sub for 192-w or something.
count = 192-w;
shift_input = count<0 ? 0 : ~0ULL;
shift_input >>= count & 63; // mask to avoid UB in C. Optimizes away on x86 where shr does this anyway.
Hmm, this doesn't handle saturating the subtraction to 0 to keep the all-ones, though.
If tuning for ISAs other than x86, maybe look at some other options. Or maybe there's something better on x86 as well. Creating the all-ones or all-zeros with sar reg,63 is an interesting option (broadcast the sign bit), but we actually need all-ones when 192-count has sign bit = 0.
Here's some Zig code that compiles and runs:
const std = #import("std");
noinline fn thing(x: u256) bool {
return x > 0xffffffffffffffff;
}
pub fn main() anyerror!void {
var num: u256 = 0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff;
while (thing(num)) {
num /= 2;
std.debug.print(".", .{});
}
std.debug.print("done\n", .{});
}
Zig master generates relatively clean x86 assembler from that.

Working inline assembly in C for bit parity?

I'm trying to compute the bit parity of a large number of uint64's. By bit parity I mean a function that accepts a uint64 and outputs 0 if the number of set bits is even, and 1 otherwise.
Currently I'm using the following function (by #Troyseph, found here):
uint parity64(uint64 n){
n ^= n >> 1;
n ^= n >> 2;
n = (n & 0x1111111111111111) * 0x1111111111111111;
return (n >> 60) & 1;
}
The same SO page has the following assembly routine (by #papadp):
.code
; bool CheckParity(size_t Result)
CheckParity PROC
mov rax, 0
add rcx, 0
jnp jmp_over
mov rax, 1
jmp_over:
ret
CheckParity ENDP
END
which takes advantage of the machine's parity flag. But I cannot get it to work with my C program (I know next to no assembly).
Question. How can I include the above (or similar) code as inline assembly in my C source file, so that the parity64() function runs that instead?
(I'm using GCC with 64-bit Ubuntu 14 on an Intel Xeon Haswell)
In case it's of any help, the parity64() function is called inside the following routine:
uint bindot(uint64* a, uint64* b, uint64 entries){
uint parity = 0;
for(uint i=0; i<entries; ++i)
parity ^= parity64(a[i] & b[i]); // Running sum!
return parity;
}
(This is supposed to be the "dot product" of two vectors over the field Z/2Z, aka. GF(2).)
This may sound a bit harsh, but I believe it needs to be said. Please don't take it personally; I don't mean it as an insult, especially since you already admitted that you "know next to no assembly." But if you think code like this:
CheckParity PROC
mov rax, 0
add rcx, 0
jnp jmp_over
mov rax, 1
jmp_over:
ret
CheckParity ENDP
will beat what a C compiler generates, then you really have no business using inline assembly. In just those 5 lines of code, I see 2 instructions that are glaringly sub-optimal. It could be optimized by just rewriting it slightly:
xor eax, eax
test ecx, ecx ; logically, should use RCX, but see below for behavior of PF
jnp jmp_over
mov eax, 1 ; or possibly even "inc eax"; would need to verify
jmp_over:
ret
Or, if you have random input values that are likely to foil the branch predictor (i.e., there is no predictable pattern to the parity of the input values), then it would be faster yet to remove the branch, writing it as:
xor eax, eax
test ecx, ecx
setp al
ret
Or perhaps the equivalent (which will be faster on certain processors, but not necessarily all):
xor eax, eax
test ecx, ecx
mov ecx, 1
cmovp eax, ecx
ret
And these are just the improvements I could see off the top of my head, given my existing knowledge of the x86 ISA and previous benchmarks that I have conducted. But lest anyone be fooled, this is undoubtedly not the fastest code, because (borrowing from Michael Abrash), "there ain't no such thing as the fastest code"—someone can virtually always make it faster yet.
There are enough problems with using inline assembly when you're an expert assembly-language programmer and a wizard when it comes to the intricacies of the x86 ISA. Optimizers are pretty darn good nowadays, which means it's hard enough for a true guru to produce better code (though certainly not impossible). It also takes trustworthy benchmarks that will verify your assumptions and confirm that your optimized inline assembly is actually faster. Never commit yourself to using inline assembly to outsmart the compiler's optimizer without running a good benchmark. I see no evidence in your question that you've done anything like this. I'm speculating here, but it looks like you saw that the code was written in assembly and assumed that meant it would be faster. That is rarely the case. C compilers ultimately emit assembly language code, too, and it is often more optimal than what us humans are capable of producing, given a finite amount of time and resources, much less limited expertise.
In this particular case, there is a notion that inline assembly will be faster than the C compiler's output, since the C compiler won't be able to intelligently use the x86 architecture's built-in parity flag (PF) to its benefit. And you might be right, but it's a pretty shaky assumption, far from universalizable. As I've said, optimizing compilers are pretty smart nowadays, and they do optimize to a particular architecture (assuming you specify the right options), so it would not at all surprise me that an optimizer would emit code that used PF. You'd have to look at the disassembly to see for sure.
As an example of what I mean, consider the highly specialized BSWAP instruction that x86 provides. You might naïvely think that inline assembly would be required to take advantage of it, but it isn't. The following C code compiles to a BSWAP instruction on almost all major compilers:
uint32 SwapBytes(uint32 x)
{
return ((x << 24) & 0xff000000 ) |
((x << 8) & 0x00ff0000 ) |
((x >> 8) & 0x0000ff00 ) |
((x >> 24) & 0x000000ff );
}
The performance will be equivalent, if not better, because the optimizer has more knowledge about what the code does. In fact, a major benefit this form has over inline assembly is that the compiler can perform constant folding with this code (i.e., when called with a compile-time constant). Plus, the code is more readable (at least, to a C programmer), much less error-prone, and considerably easier to maintain than if you'd used inline assembly. Oh, and did I mention it's reasonably portable if you ever wanted to target an architecture other than x86?
I know I'm making a big deal of this, and I want you to understand that I say this as someone who enjoys the challenge of writing highly-tuned assembly code that beats the compiler's optimizer in performance. But every time I do it, it's just that: a challenge, which comes with sacrifices. It isn't a panacea, and you need to remember to check your assumptions, including:
Is this code actually a bottleneck in my application, such that optimizing it would even make any perceptible difference?
Is the optimizer actually emitting sub-optimal machine language instructions for the code that I have written?
Am I wrong in what I naïvely think is sub-optimal? Maybe the optimizer knows more than I do about the target architecture, and what looks like slow or sub-optimal code is actually faster. (Remember that less code is not necessarily faster.)
Have I tested it in a meaningful, real-world benchmark, and proven that the compiler-generated code is slow and that my inline assembly is actually faster?
Is there absolutely no way that I can tweak the C code to persuade the optimizer to emit better machine code that is close, equal to, or even superior to the performance of my inline assembly?
In an attempt to answer some of these questions, I set up a little benchmark. (Using MSVC, because that's what I have handy; if you're targeting GCC, it's best to use that compiler, but we can still get a general idea. I use and recommend Google's benchmarking library.) And I immediately ran into problems. See, I first run my benchmarks in "debugging" mode, with assertions compiled in that verify that my "tweaked"/"optimized" code is actually producing the same results for all test cases as the original code (that is presumably known to be working/correct). In this case, an assertion immediately fired. It turns out that the CheckParity routine written in assembly language does not return identical results to the parity64 routine written in C! Uh-oh. Well, that's another bullet we need to add to the above list:
Have I ensured that my "optimized" code is returning the correct results?
This one is especially critical, because it's easy to make something faster if you also make it wrong. :-) I jest, but not entirely, because I've done this many times in the pursuit of faster code.
I believe Michael Petch has already pointed out the reason for the discrepancy: in the x86 implementation, the parity flag (PF) only concerns itself with the bits in the low byte, not the entire value. If that's all you need, then great. But even then, we can go back to the C code and further optimize it to do less work, which will make it faster—perhaps faster than the assembly code, eliminating the one advantage that inline assembly ever had.
For now, let's assume that you need the parity of the full value, since that's the original implementation you had that was working, and you're just trying to make it faster without changing its behavior. Thus, we need to fix the assembly code's logic before we can even proceed with meaningfully benchmarking it. Fortunately, since I am writing this answer late, Ajay Brahmakshatriya (with collaboration from others) has already done that work, saving me the extra effort.
…except, not quite. When I first drafted this answer, my benchmark revealed that draft 9 of his "tweaked" code still did not produce the same result as the original C function, so it's unsuitable according to our test cases. You say in a comment that his code "works" for you, which means either (A) the original C code was doing extra work, making it needlessly slow, meaning that you can probably tweak it to beat the inline assembly at its own game, or worse, (B) you have insufficient test cases and the new "optimized" code is actually a bug lying in wait. Since that time, Ped7g suggested a couple of fixes, which both fixed the bug causing the incorrect result to be returned, and further improved the code. The amount of input required here, and the number of drafts that he has gone through, should serve as testament to the difficulty of writing correct inline assembly to beat the compiler. But we're not even done yet! His inline assembly remains incorrectly written. SETcc instructions require an 8-bit register as their operand, but his code doesn't use a register specifier to request that, meaning that the code either won't compile (because Clang is smart enough to detect this error) or will compile on GCC but won't execute properly because that instruction has an invalid operand.
Have I convinced you about the importance of testing yet? I'll take it on faith, and move on to the benchmarking part. The benchmark results use the final draft of Ajay's code, with Ped7g's improvements, and my additional tweaks. I also compare some of the other solutions from that question you linked, modified for 64-bit integers, plus a couple of my own invention. Here are my benchmark results (mobile Haswell i7-4850HQ):
Benchmark Time CPU Iterations
-------------------------------------------------------------------
Naive 36 ns 36 ns 19478261
OriginalCCode 4 ns 4 ns 194782609
Ajay_Brahmakshatriya_Tweaked 4 ns 4 ns 194782609
Shreyas_Shivalkar 37 ns 37 ns 17920000
TypeIA 5 ns 5 ns 154482759
TypeIA_Tweaked 4 ns 4 ns 160000000
has_even_parity 227 ns 229 ns 3200000
has_even_parity_Tweaked 36 ns 36 ns 19478261
GCC_builtin_parityll 4 ns 4 ns 186666667
PopCount 3 ns 3 ns 248888889
PopCount_Downlevel 5 ns 5 ns 100000000
Now, keep in mind that these are for randomly-generated 64-bit input values, which disrupts branch prediction. If your input values are biased in a predictable way, either towards parity or non-parity, then the branch predictor will work for you, rather than against you, and certain approaches may be faster. This underscores the importance of benchmarking against data that simulates real-world use cases. (That said, when I write general library functions, I tend to optimize for random inputs, balancing size and speed.)
Notice how the original C function compares to the others. I'm going to make the claim that optimizing it any further is probably a big fat waste of time. So hopefully you learned something more general from this answer, rather than just scrolled down to copy-paste the code snippets. :-)
The Naive function is a completely unoptimized sanity check to determine the parity, taken from here. I used it to validate even your original C code, and also to provide a baseline for the benchmarks. Since it loops through each bit, one-by-one, it is relatively slow, as expected:
unsigned int Naive(uint64 n)
{
bool parity = false;
while (n)
{
parity = !parity;
n &= (n - 1);
}
return parity;
}
OriginalCCode is exactly what it sounds like—it's the original C code that you had, as shown in the question. Notice how it posts up at exactly the same time as the tweaked/corrected version of Ajay Brahmakshatriya's inline assembly code! Now, since I ran this benchmark in MSVC, which doesn't support inline assembly for 64-bit builds, I had to use an external assembly module containing the function, and call it from there, which introduced some additional overhead. With GCC's inline assembly, the compiler probably would have been able to inline the code, thus eliding a function call. So on GCC, you might see the inline-assembly version be up to a nanosecond faster (or maybe not). Is that worth it? You be the judge. For reference, this is the code I tested for Ajay_Brahmakshatriya_Tweaked:
Ajay_Brahmakshatriya_Tweaked PROC
mov rax, rcx ; Windows 64-bit calling convention passes parameter in ECX (System V uses EDI)
shr rax, 32
xor rcx, rax
mov rax, rcx
shr rax, 16
xor rcx, rax
mov rax, rcx
shr rax, 8
xor eax, ecx ; Ped7g's TEST is redundant; XOR already sets PF
setnp al
movzx eax, al
ret
Ajay_Brahmakshatriya_Tweaked ENDP
The function named Shreyas_Shivalkar is from his answer here, which is just a variation on the loop-through-each-bit theme, and is, in keeping with expectations, slow:
Shreyas_Shivalkar PROC
; unsigned int parity = 0;
; while (x != 0)
; {
; parity ^= x;
; x >>= 1;
; }
; return (parity & 0x1);
xor eax, eax
test rcx, rcx
je SHORT Finished
Process:
xor eax, ecx
shr rcx, 1
jne SHORT Process
Finished:
and eax, 1
ret
Shreyas_Shivalkar ENDP
TypeIA and TypeIA_Tweaked are the code from this answer, modified to support 64-bit values, and my tweaked version. They parallelize the operation, resulting in a significant speed improvement over the loop-through-each-bit strategy. The "tweaked" version is based on an optimization originally suggested by Mathew Hendry to Sean Eron Anderson's Bit Twiddling Hacks, and does net us a tiny speed-up over the original.
unsigned int TypeIA(uint64 n)
{
n ^= n >> 32;
n ^= n >> 16;
n ^= n >> 8;
n ^= n >> 4;
n ^= n >> 2;
n ^= n >> 1;
return !((~n) & 1);
}
unsigned int TypeIA_Tweaked(uint64 n)
{
n ^= n >> 32;
n ^= n >> 16;
n ^= n >> 8;
n ^= n >> 4;
n &= 0xf;
return ((0x6996 >> n) & 1);
}
has_even_parity is based on the accepted answer to that question, modified to support 64-bit values. I knew this would be slow, since it's yet another loop-through-each-bit strategy, but obviously someone thought it was a good approach. It's interesting to see just how slow it actually is, even compared to what I termed the "naïve" approach, which does essentially the same thing, but faster, with less-complicated code.
unsigned int has_even_parity(uint64 n)
{
uint64 count = 0;
uint64 b = 1;
for (uint64 i = 0; i < 64; ++i)
{
if (n & (b << i)) { ++count; }
}
return (count % 2);
}
has_even_parity_Tweaked is an alternate version of the above that saves a branch by taking advantage of the fact that Boolean values are implicitly convertible into 0 and 1. It is substantially faster than the original, clocking in at a time comparable to the "naïve" approach:
unsigned int has_even_parity_Tweaked(uint64 n)
{
uint64 count = 0;
uint64 b = 1;
for (uint64 i = 0; i < 64; ++i)
{
count += static_cast<int>(static_cast<bool>(n & (b << i)));
}
return (count % 2);
}
Now we get into the good stuff. The function GCC_builtin_parityll consists of the assembly code that GCC would emit if you used its __builtin_parityll intrinsic. Several others have suggested that you use this intrinsic, and I must echo their endorsement. Its performance is on par with the best we've seen so far, and it has a couple of additional advantages: (1) it keeps the code simple and readable (simpler than the C version); (2) it is portable to different architectures, and can be expected to remain fast there, too; (3) as GCC improves its implementation, your code may get faster with a simple recompile. You get all the benefits of inline assembly, without any of the drawbacks.
GCC_builtin_parityll PROC ; GCC's __builtin_parityll
mov edx, ecx
shr rcx, 32
xor edx, ecx
mov eax, edx
shr edx, 16
xor eax, edx
xor al, ah
setnp al
movzx eax, al
ret
GCC_builtin_parityll ENDP
PopCount is an optimized implementation of my own invention. To come up with this, I went back and considered what we were actually trying to do. The definition of "parity" is an even number of set bits. Therefore, it can be calculated simply by counting the number of set bits and testing to see if that count is even or odd. That's two logical operations. As luck would have it, on recent generations of x86 processors (Intel Nehalem or AMD Barcelona, and newer), there is an instruction that counts the number of set bits—POPCNT (population count, or Hamming weight)—which allows us to write assembly code that does this in two operations.
(Okay, actually three instructions, because there is a bug in the implementation of POPCNT on certain microarchitectures that creates a false dependency on its destination register, and to ensure we get maximum throughput from the code, we need to break this dependency by pre-clearing the destination register. Fortunately, this a very cheap operation, one that can generally be handled for "free" by register renaming.)
PopCount PROC
xor eax, eax ; break false dependency
popcnt rax, rcx
and eax, 1
ret
PopCount ENDP
In fact, as it turns out, GCC knows to emit exactly this code for the __builtin_parityll intrinsic when you target a microarchitecture that supports POPCNT (otherwise, it uses the fallback implementation shown below). As you can see from the benchmarks, this is the fastest code yet. It isn't a major difference, so it's unlikely to matter unless you're doing this repeatedly within a tight loop, but it is a measurable difference and presumably you wouldn't be optimizing this so heavily unless your profiler indicated that this was a hot-spot.
But the POPCNT instruction does have the drawback of not being available on older processors, so I also measured a "fallback" version of the code that does a population count with a sequence of universally-supported instructions. That is the PopCount_Downlevel function, taken from my private library, originally adapted from this answer and other sources.
PopCount_Downlevel PROC
mov rax, rcx
shr rax, 1
mov rdx, 5555555555555555h
and rax, rdx
sub rcx, rax
mov rax, 3333333333333333h
mov rdx, rcx
and rcx, rax
shr rdx, 2
and rdx, rax
add rdx, rcx
mov rcx, 0FF0F0F0F0F0F0F0Fh
mov rax, rdx
shr rax, 4
add rax, rdx
mov rdx, 0FF01010101010101h
and rax, rcx
imul rax, rdx
shr rax, 56
and eax, 1
ret
PopCount_Downlevel ENDP
As you can see from the benchmarks, all of the bit-twiddling instructions that are required here exact a cost in performance. It is slower than POPCNT, but supported on all systems and still reasonably quick. If you needed a bit count anyway, this would be the best solution, especially since it can be written in pure C without the need to resort to inline assembly, potentially yielding even more speed:
unsigned int PopCount_Downlevel(uint64 n)
{
uint64 temp = n - ((n >> 1) & 0x5555555555555555ULL);
temp = (temp & 0x3333333333333333ULL) + ((temp >> 2) & 0x3333333333333333ULL);
temp = (temp + (temp >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
temp = (temp * 0x0101010101010101ULL) >> 56;
return (temp & 1);
}
But run your own benchmarks to see if you wouldn't be better off with one of the other implementations, like OriginalCCode, which simplifies the operation and thus requires fewer total instructions. Fun fact: Intel's compiler (ICC) always uses a population count-based algorithm to implement __builtin_parityll; it emits a POPCNT instruction if the target architecture supports it, or otherwise, it simulates it using essentially the same code as I've shown here.
Or, better yet, just forget the whole complicated mess and let your compiler deal with it. That's what built-ins are for, and there's one for precisely this purpose.
Because C sucks when handling bit operations, I suggest using gcc built in functions, in this case __builtin_parityl(). See:
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
You will have to use extended inline assembly (which is a gcc extension) to get the similar effect.
Your parity64 function can be changed as follows -
uint parity64_unsafe_and_broken(uint64 n){
uint result = 0;
__asm__("addq $0, %0" : : "r"(n) :);
// editor's note: compiler-generated instructions here can destroy EFLAGS
// Don't depending on FLAGS / regs surviving between asm statements
// also, jumping out of an asm statement safely requires asm goto
__asm__("jnp 1f");
__asm__("movl $1, %0" : "=r"(result) : : );
__asm__("1:");
return result;
}
But as commented by #MichaelPetch the parity flag is computed only on the lower 8 bits. So this will work for your if your n is less than 255. For bigger numbers you will have to use the code you mentioned in your question.
To get it working for 64 bits you can collapse the parity of the 32 bit integer into single byte by doing
n = (n >> 32) ^ n;
n = (n >> 16) ^ n;
n = (n >> 8) ^ n;
This code will have to be just at the start of the function before the assembly.
You will have to check how it affects the performance.
The most optimized I could get it is
uint parity64(uint64 n){
unsigned char result = 0;
n = (n >> 32) ^ n;
n = (n >> 16) ^ n;
n = (n >> 8) ^ n;
__asm__("test %1, %1 \n\t"
"setp %0"
: "+r"(result)
: "r"(n)
:
);
return result;
}
How can I include the above (or similar) code as inline assembly in my C source file, so that the parity64() function runs that instead?
This is an XY problem... You think you need to inline that assembly to gain from its benefits, so you asked about how to inline it... but you don't need to inline it.
You shouldn't include assembly into your C source code, because in this case you don't need to, and the better alternative (in terms of portability and maintainability) is to keep the two pieces of source code separate, compile them separately and use the linker to link them.
In parity64.c you should have your portable version (with a wrapper named bool CheckParity(size_t result)), which you can default to in non-x86/64 situations.
You can compile this to an object file like so: gcc -c parity64.c -o parity64.o
... and then link the object code generated from assembly, with the C code: gcc bindot.c parity64.o -o bindot
In parity64_x86.s you might have the following assembly code from your question:
.code
; bool CheckParity(size_t Result)
CheckParity PROC
mov rax, 0
add rcx, 0
jnp jmp_over
mov rax, 1
jmp_over:
ret
CheckParity ENDP
END
You can compile this to an alternative parity64.o object file object code using gcc with this command: gcc -c parity64_x86.s -o parity64.o
... and then link the object code generated like so: gcc bindot.c parity64.o -o bindot
Similarly, if you wanted to use __builtin_parityl instead (as suggested by hdantes answer, you could (and should) once again keep that code separate (in the same place you keep other gcc/x86 optimisations) from your portable code. In parity64_x86.c you might have:
bool CheckParity(size_t result) {
return __builtin_parityl(result);
}
To compile this, your command would be: gcc -c parity64_x86.c -o parity64.o
... and then link the object code generated like so: gcc bindot.c parity64.o -o bindot
On a side-note, if you'd like to inspect the assembly gcc would produce from this: gcc -S parity64_x86.c
Comments in your assembly indicate that the equivalent function prototype in C would be bool CheckParity(size_t Result), so with that in mind, here's what bindot.c might look like:
extern bool CheckParity(size_t Result);
uint64_t bindot(uint64_t *a, uint64_t *b, size_t entries){
uint64_t parity = 0;
for(size_t i = 0; i < entries; ++i)
parity ^= a[i] & b[i]; // Running sum!
return CheckParity(parity);
}
You can build this and link it to any of the above parity64.o versions like so: gcc bindot.c parity64.o -o bindot...
I highly recommend reading the manual for your compiler, when you have the time...

GCC compiles leading zero count poorly unless Haswell specified

GCC supports the __builtin_clz(int x) builtin, which counts the number of number of leading zeros (consecutive most-significant zeros) in the argument.
Among other things0, this is great for efficiently implementing the lg(unsigned int x) function, which takes the base-2 logarithm of x, rounding down1:
/** return the base-2 log of x, where x > 0 */
unsigned lg(unsigned x) {
return 31U - (unsigned)__builtin_clz(x);
}
This works in the straightforward way - in particular consider the case x == 1 and clz(x) == 31 - then x == 2^0 so lg(x) == 0 and 31 - 31 == 0 and we get the correct result. Higher values of x work similarly.
Assuming the builtin is efficiently implemented, this ends much better than the alternate pure C solutions.
Now as it happens, the count leading zeros operation is essentially the dual of the bsr instruction in x86. That returns the index of the most-significant 1-bit2 in the argument. So if there are 10 leading zeros, the first 1-bit is in bit 21 of the argument. In general we have 31 - clz(x) == bsr(x) and in so bsr in fact directly implements our desired lg() function, without the superfluous 31U - ... part.
In fact, you can read between the line and see that the __builtin_clz function was implemented with bsr in mind: it is defined as undefined behavior if the argument is zero, when of course the "leading zeros" operation is perfectly well-defined as 32 (or whatever the bit-size of int is) with a zero argument. So __builtin_clz was certainly implemented with the idea of being efficiently mapped to a bsr instruction on x86.
However, looking at what GCC actually does, at -O3 with otherwise default options: it adds a ton of extra junk:
lg(unsigned int):
bsr edi, edi
mov eax, 31
xor edi, 31
sub eax, edi
ret
The xor edi,31 line is effectively a not edi for the bottom 4 bits that actually matter, that's off-by-one3 from neg edi which turns the result of bsr into clz. Then the 31 - clz(x) is carried out.
However with -mtune=haswell, the code simplifies into exactly the expected single bsr instruction:
lg(unsigned int):
bsr eax, edi
ret
Why that is the case is very unclear to me. The bsr instruction has been around for a couple decades before Haswell, and the behavior is, AFAIK, unchanged. It's not just an issue of tuning for a particular arch, since bsr + a bunch of extra instructions isn't going to be faster than a plain bsr and furthermore using -mtune=haswell still results in the slower code.
The situation for 64-bit inputs and outputs is even slightly worse: there is an extra movsx in the critical path which seems to do nothing since the result from clz will never be negative. Again, the -march=haswell variant is optimal with a single bsr instruction.
Finally, let's check the big players in the non-Windows compiler space, icc and clang. icc just does a bad job and adds redundant stuff like neg eax; add eax, 31; neg eax; add eax, 31; - wtf? clang does a good job regardless of -march.
0 Such as scanning a bitmap for the first set bit.
1 The logarithm of 0 is undefinited, and so calling our function with a 0 argument is undefined behavior.
2 Here, the LSB is the 0th bit and the MSB is the 31st.
3 Recall that -x == ~x + 1 in twos-complement.
This looks like a known issue with gcc: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168

The best way to shift a __m128i?

I need to shift a __m128i variable, (say v), by m bits, in such a way that bits move through all of the variable (So, the resulting variable represents v*2^m).
What is the best way to do this?!
Note that _mm_slli_epi64 shifts v0 and v1 seperately:
r0 := v0 << count
r1 := v1 << count
so the last bits of v0 missed, but I want to move those bits to r1.
Edit:
I looking for a code, faster than this (m<64):
r0 = v0 << m;
r1 = v0 >> (64-m);
r1 ^= v1 << m;
r2 = v1 >> (64-m);
For compile-time constant shift counts, you can get fairly good results. Otherwise not really.
This is just an SSE implementation of the r0 / r1 code from your question, since there's no other obvious way to do it. Variable-count shifts are only available for bit-shifts within vector elements, not for byte-shifts of the whole register. So we just carry the low 64bits up to the high 64 and use a variable-count shift to put them in the right place.
// untested
#include <immintrin.h>
/* some compilers might choke on slli / srli with non-compile-time-constant args
* gcc generates the xmm, imm8 form with constants,
* and generates the xmm, xmm form with otherwise. (With movd to get the count in an xmm)
*/
// doesn't optimize for the special-case where count%8 = 0
// could maybe do that in gcc with if(__builtin_constant_p(count)) { if (!count%8) return ...; }
__m128i mm_bitshift_left(__m128i x, unsigned count)
{
__m128i carry = _mm_bslli_si128(x, 8); // old compilers only have the confusingly named _mm_slli_si128 synonym
if (count >= 64)
return _mm_slli_epi64(carry, count-64); // the non-carry part is all zero, so return early
// else
carry = _mm_srli_epi64(carry, 64-count); // After bslli shifted left by 64b
x = _mm_slli_epi64(x, count);
return _mm_or_si128(x, carry);
}
__m128i mm_bitshift_left_3(__m128i x) { // by a specific constant, to see inlined constant version
return mm_bitshift_left(x, 3);
}
// by a specific constant, to see inlined constant version
__m128i mm_bitshift_left_100(__m128i x) { return mm_bitshift_left(x, 100); }
I thought this was going to be less convenient than it turned out to be. _mm_slli_epi64 works on gcc/clang/icc even when the count is not a compile-time constant (generating a movd from integer reg to xmm reg). There is a _mm_sll_epi64 (__m128i a, __m128i count) (note the lack of i), but at least these days, the i intrinsic can generate either form of psllq.
The compile-time-constant count versions are fairly efficient, compiling to 4 instructions (or 5 without AVX):
mm_bitshift_left_3(long long __vector(2)):
vpslldq xmm1, xmm0, 8
vpsrlq xmm1, xmm1, 61
vpsllq xmm0, xmm0, 3
vpor xmm0, xmm0, xmm1
ret
Performance:
This has 3 cycle latency (vpslldq(1) -> vpsrlq(1) -> vpor(1)) on Intel SnB/IvB/Haswell, with throughput limited to one per 2 cycles (saturating the vector shift unit on port 0). Byte-shift runs on the shuffle unit on a different port. Immediate-count vector shifts are all single-uop instructions, so this is only 4 fused-domain uops taking up pipeline space when mixed in with other code. (Variable-count vector shifts are 2 uop, 2 cycle latency, so the variable-count version of this function is worse than it looks from counting instructions.)
Or for counts >= 64:
mm_bitshift_left_100(long long __vector(2)):
vpslldq xmm0, xmm0, 8
vpsllq xmm0, xmm0, 36
ret
If your shift-count is not a compile-time constant, you have to branch on count > 64 to figure out whether to left or right shift the carry. I believe the shift count is interpreted as an unsigned integer, so a negative count is impossible.
It also takes extra instructions to get the int count and 64-count into vector registers. Doing this in a branchless fashion with vector compares and a blend instruction might be possible, but a branch is probably a good idea.
The variable-count version for __uint128_t in GP registers looks fairly good; better than the SSE version. Clang does a slightly better job than gcc, emitting fewer mov instructions, but it still uses two cmov instructions for the count >= 64 case. (Because x86 integer shift instructions mask the count, instead of saturating.)
__uint128_t leftshift_int128(__uint128_t x, unsigned count) {
return x << count; // undefined if count >= 128
}
In SSE4.A the instructions insrq and extrq can be used to shift (and rotate) through __mm128i 1-64 bits at a time. Unlike the 8/16/32/64 bit counterparts pextrN/pinsrX, these instructions select or insert m bits (between 1 and 64) at any bit offset from 0 to 127. The caveat is that the sum of lenght and offset must not exceed 128.

Resources