How can I generate a 256 bit mask - c

I have an array of uint64_t[4], and I need to generate a mask,
such that the array, if it were a 256-bit integer, equals
(1 << w) - 1, where w goes from 1 to 256.
The best thing I have come up with is branchless, but it takes MANY instructions. It is in Zig because Clang doesn't seem to expose llvm's saturating subtraction. http://localhost:10240/z/g8h1rV
Is there a better way to do this?
var mask: [4]u64 = undefined;
for (mask) |_, i|
    mask[i] = 0xffffffffffffffff;
mask[3] ^= ((u64(1) << @intCast(u6, (inner % 64) + 1)) - 1) << @intCast(u6, 64 - (inner % 64));
mask[2] ^= ((u64(1) << @intCast(u6, (@satSub(u32, inner, 64) % 64) + 1)) - 1) << @intCast(u6, 64 - (inner % 64));
mask[1] ^= ((u64(1) << @intCast(u6, (@satSub(u32, inner, 128) % 64) + 1)) - 1) << @intCast(u6, 64 - (inner % 64));
mask[0] ^= ((u64(1) << @intCast(u6, (@satSub(u32, inner, 192) % 64) + 1)) - 1) << @intCast(u6, 64 - (inner % 64));

Are you targeting x86-64 with AVX2 for 256-bit vectors? I thought that was an interesting case to answer for.
If so, you can do this in a few instructions using saturating subtraction and a variable count shift.
x86 SIMD shifts like vpsrlvq saturate the shift count, shifting all the bits out when the count is >= the element width. (Scalar integer shifts, by contrast, mask the count, so it wraps around.)
For the lowest u64 element, starting with all-ones, we need to leave it unmodified for bitpos >= 64, or for smaller bit positions right-shift it by 64-bitpos. Unsigned saturating subtraction looks like the way to go here, as you observed, to create a shift count of 0 for larger bitpos. But x86 only has SIMD saturating subtraction, and only for byte or word elements. If we don't care about bitpos > 256, that's fine: we can use 16-bit elements at the bottom of each u64 and let a 0 - 0 happen in the rest of the u64.
Your code looks pretty overcomplicated, creating (1<<n) - 1 and XORing. I think it's a lot easier to just use a variable-count shift on the 0xFFFF...FF elements directly.
I don't know Zig, so do whatever you have to do to get it to emit asm like this. Hopefully this is useful since you tagged this assembly; it should be easy to translate to intrinsics for C, or for Zig if it has them.
default rel
section .rodata
shift_offsets: dw 64, 128, 192, 256 ; 16-bit elements, to be loaded with zero-extension to 64
section .text
pos_to_mask256:
vpmovzxwq ymm2, [shift_offsets] ; _mm256_set_epi64x(256, 192, 128, 64)
vpcmpeqd ymm1, ymm1,ymm1 ; ymm1 = all-ones
; set up vector constants, can be hoisted
vmovd xmm0, edi
vpbroadcastq ymm0, xmm0 ; ymm0 = _mm256_set1_epi64x(bitpos)
vpsubusw ymm0, ymm2, ymm0 ; ymm0 = {256,192,128,64}-bitpos with unsigned saturation
vpsrlvq ymm0, ymm1, ymm0 ; mask[i] >>= count, where counts >= 64 create 0s.
ret
If the input integer starts in memory, you can of course efficiently broadcast-load it into a ymm register directly.
The shift-offsets vector can of course be hoisted out of a loop, as can the all-ones.
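For reference, here's roughly what that looks like with C intrinsics (an untested sketch using the same constants as the asm; it relies on bitpos <= 256 so the subtraction only matters in the low 16 bits of each qword):
#include <immintrin.h>
__m256i pos_to_mask256(unsigned bitpos)
{
    __m256i offsets  = _mm256_set_epi64x(256, 192, 128, 64);   // hoistable constant
    __m256i all_ones = _mm256_set1_epi64x(-1);                  // hoistable constant
    __m256i vpos     = _mm256_set1_epi64x(bitpos);
    // vpsubusw: {64,128,192,256} - bitpos with unsigned saturation; the upper
    // 16-bit lanes of each qword are 0 - 0 = 0, so each qword holds just the count
    __m256i counts   = _mm256_subs_epu16(offsets, vpos);
    // vpsrlvq: counts >= 64 shift out all the bits, producing 0
    return _mm256_srlv_epi64(all_ones, counts);
}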
With input = 77, the high 2 elements are zeroed by shifts of 256-77=179, and 192-77=115 bits. Tested with NASM + GDB for EDI=77, and the result is
(gdb) p /x $ymm0.v4_int64
{0xffffffffffffffff, 0x1fff, 0x0, 0x0}
GDB prints the low element first, opposite of Intel notation / diagrams. This vector is actually 0, 0, 0x1fff, 0xffffffffffffffff, i.e. 64+13 = 77 one bits and the rest all zeros. Other test cases:
edi=0: mask = all-zero
edi=1: mask = 1
... : mask = edi one bits at the bottom, then zeros
edi=255: mask = all ones except for the top bit of the top element
edi=256: mask = all ones
edi>256: mask = all ones. (unsigned subtraction saturates to 0 everywhere.)
You need AVX2 for the variable-count shifts. psubusb/w is SSE2, so you could consider doing that part with SIMD and then go back to scalar integer for the shifts, or maybe just use SSE2 shifts for one element at a time. Like psrlq xmm1, xmm0 which takes the low 64 bits of xmm0 as the shift count for all elements of xmm1.
Most ISAs don't have saturating scalar subtraction. Some ARM CPUs do for scalar integer, I think, but x86 doesn't. IDK what you're using.
On x86 (and many other ISAs) you have 2 problems:
keep all-ones for low elements (either modify the shift result, or saturate shift count to 0)
produce 0 for high elements above the one containing the top bit of the mask. x86 scalar shifts can't do this at all, so you might feed the shift an input of 0 for that case. Maybe using cmov to create it based on flags set by sub for 192-w or something.
count = 192-w;
shift_input = count<0 ? 0 : ~0ULL;
shift_input >>= count & 63; // mask to avoid UB in C. Optimizes away on x86 where shr does this anyway.
Hmm, this doesn't handle saturating the subtraction to 0 to keep the all-ones, though.
If tuning for ISAs other than x86, maybe look at some other options. Or maybe there's something better on x86 as well. Creating the all-ones or all-zeros with sar reg,63 is an interesting option (broadcast the sign bit), but we actually need all-ones when 192-count has sign bit = 0.
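For a portable scalar baseline, here's my own untested sketch (not from the answer; it assumes mask[0] is the least-significant limb). It clamps the per-limb bit count to [0, 64] and uses 2ULL << (n-1) to avoid the undefined 1 << 64; compilers typically turn the ternaries into cmov/csel, in the spirit of the cmov idea above.
#include <stdint.h>
void mask256_scalar(uint64_t mask[4], unsigned w)   // w in [0, 256]
{
    for (unsigned i = 0; i < 4; i++) {
        unsigned lo = i * 64;
        unsigned n = (w > lo) ? (w - lo) : 0;        // bits to set in this limb, before clamping
        if (n > 64) n = 64;
        mask[i] = n ? ((2ULL << (n - 1)) - 1) : 0;   // n == 64 wraps to 0, and 0 - 1 = all-ones
    }
}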

Here's some Zig code that compiles and runs:
const std = @import("std");
noinline fn thing(x: u256) bool {
return x > 0xffffffffffffffff;
}
pub fn main() anyerror!void {
var num: u256 = 0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff;
while (thing(num)) {
num /= 2;
std.debug.print(".", .{});
}
std.debug.print("done\n", .{});
}
Zig master generates relatively clean x86 assembler from that.

Related

AVX512 - How to move all set bits to the right?

How can I move all set bits of mask register to right? (To the bottom, least-significant position).
For example:
__mmask16 mask = _mm512_cmpeq_epi32_mask(vload, vlimit); // mask = 1101110111011101
If we move all set bits to the right, we will get: 1101110111011101 -> 0000111111111111
How can I achieve this efficiently?
Below you can see how I tried to get the same result, but it's inefficient:
__mmask16 mask = 56797;
// mask: 1101110111011101
__m512i vbrdcast = _mm512_maskz_broadcastd_epi32(mask, _mm_set1_epi32(~0));
// vbrdcast: -1 0 -1 -1 -1 0 -1 -1 -1 0 -1 -1 -1 0 -1 -1
__m512i vcompress = _mm512_maskz_compress_epi32(mask, vbrdcast);
// vcompress:-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 0 0
__mmask16 right_packed_mask = _mm512_movepi32_mask(vcompress);
// right_packed_mask: 0000111111111111
What is the best way to do this?
BMI2 pext is the scalar bitwise equivalent of v[p]compressd/q/ps/pd.
Use it on your mask value to left-pack them to the bottom of the value.
mask = _pext_u32(-1U, mask); // or _pext_u64(-1ULL, mask64) for __mmask64
// costs 3 asm instructions (kmov + pext + kmov) if you need to use the result as a mask
// not including putting -1 in a register.
Implicit conversion between __mmask16 (aka uint16_t in GCC) and uint32_t works.
Use _cvtu32_mask16 and _cvtmask16_u32 to make the KMOVW explicit if you like.
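For reference, a minimal sketch of the explicit-conversion version (untested; the function name is just for illustration):
#include <immintrin.h>
static inline __mmask16 leftpack_k_explicit(__mmask16 mask)
{
    unsigned m = _cvtmask16_u32(mask);           // kmovw r32, k
    return _cvtu32_mask16(_pext_u32(-1U, m));    // pext, then kmovw k, r32
}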
See How to unset N right-most set bits for more about using pext/pdep in ways like this.
All current CPUs with AVX-512 also have fast BMI2 pext (including Xeon Phi), same performance as popcnt. AMD had slow pext until Zen 3, but if/when AMD ever introduces an AVX-512 CPU it should have fast pext/pdep.
For earlier AMD without AVX512, you might want (1ULL << __builtin_popcount(mask)) - 1, but be careful of overflow if all bits are set. 1ULL << 64 is undefined behaviour, and likely to produce 1 not 0 when compiled for x86-64.
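A hedged sketch of that popcount fallback which sidesteps the 1ULL << 64 problem (my own formulation, using the GCC/clang builtin):
#include <stdint.h>
static inline uint64_t leftpack_popcnt64(uint64_t mask64)
{
    unsigned n = (unsigned)__builtin_popcountll(mask64);
    // 2ULL << (n-1) never shifts by 64; for n == 64 it wraps to 0, and 0 - 1 = all-ones
    return n ? ((2ULL << (n - 1)) - 1) : 0;
}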
If you were going to use vpcompressd, note that the source vector can simply be all-ones _mm512_set1_epi32(-1); compress doesn't care about elements where the mask was zero, they don't need to already be zero.
(It doesn't matter which -1s you pack; once you're working with boolean values, there's no difference between a true that came from your original bitmask vs. a constant true that was just sitting there, which you generated more cheaply without a dependency on your input mask. The same reasoning applies for pext, and is why you can use -1U as the source data rather than a pdep result: a -1 or set bit doesn't have an identity; it's the same as any other -1 or set bit.)
So let's try both ways and see how good/bad the asm is.
inline
__mmask16 leftpack_k(__mmask16 mask){
return _pext_u32(-1U, mask);
}
inline
__mmask16 leftpack_comp(__mmask16 mask) {
__m512i v = _mm512_maskz_compress_epi32(mask, _mm512_set1_epi32(-1));
return _mm512_movepi32_mask(v);
}
Looking at stand-alone versions of these isn't useful because __mmask16 is a typedef for unsigned short, and is thus passed/returned in integer registers, not k registers. That makes the pext version look very good, of course, but we want to see how it inlines into a case where we generate and use the mask with AVX-512 intrinsics.
// not a useful function, just something that compiles to asm in an obvious way
void use_leftpack_compress(void *dst, __m512i v){
__mmask16 m = _mm512_test_epi32_mask(v,v);
m = leftpack_comp(m);
_mm512_mask_storeu_epi32(dst, m, v);
}
Commenting out the m = pack(m) line, this is just 2 simple instructions that generate and then use a mask.
use_mask_nocompress(void*, long long __vector(8)):
vptestmd k1, zmm0, zmm0
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
So any extra instructions will be due to left-packing (compressing) the mask. GCC and clang make the same asm as each other, differing only in clang avoiding kmovw in favour of always kmovd. Godbolt
# GCC10.3 -O3 -march=skylake-avx512
use_leftpack_k(void*, long long __vector(8)):
vptestmd k0, zmm0, zmm0
mov eax, -1 # could be hoisted out of a loop
kmovd edx, k0
pext eax, eax, edx
kmovw k1, eax
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
use_leftpack_compress(void*, long long __vector(8)):
vptestmd k1, zmm0, zmm0
vpternlogd zmm2, zmm2, zmm2, 0xFF # set1(-1) could be hoisted out of a loop
vpcompressd zmm1{k1}{z}, zmm2
vpmovd2m k1, zmm1
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
So the non-hoistable parts are
kmov r,k (port 0) / pext (port 1) / kmov k,r (port 5) = 3 uops, one for each execution port. (Including port 1, which has its vector ALUs shut down while 512-bit uops are in flight). The kmov/kmov round trip has 4 cycle latency on SKX, and pext is 3 cycle latency, for a total of 7 cycle latency.
vpcompressd zmm{k}{z}, z (2 p5) / vpmovd2m (port 0) = 3 uops, two for port 5. vpmovd2m has 3 cycle latency on SKX / ICL, and vpcompressd-zeroing-into-zmm has 6 cycle from the k input to the zmm output (SKX and ICL). So a total of 9 cycle latency, and worse port distribution for the uops.
Also, the hoistable part is generally worse (vpternlogd is longer and competes for fewer ports than mov r32, imm32), unless your function already needs an all-ones vector for something but not an all-ones register.
Conclusion: the BMI2 pext way is not worse in any way, and better in several. (Unless surrounding code heavily bottlenecked on port 1 uops, which is very unlikely if using 512-bit vectors because in that case it can only be running scalar integer uops like 3-cycle LEA, IMUL, LZCNT, and of course simple 1-cycle integer stuff like add/sub/and/or).

Extract 10bits words from bitstream

I need to extract all 10-bit words from a raw bitstream which is built as ABACABACABAC...
It already works with a naive C implementation like
for(uint8_t *ptr = in_packet; ptr < max; ptr += 5){
    const uint64_t val =
        (((uint64_t)(*(ptr + 4))) << 32) |
        (((uint64_t)(*(ptr + 3))) << 24) |
        (((uint64_t)(*(ptr + 2))) << 16) |
        (((uint64_t)(*(ptr + 1))) << 8) |
        (((uint64_t)(*(ptr + 0))) << 0) ;
    *a_ptr++ = (val >> 0);
    *b_ptr++ = (val >> 10);
    *a_ptr++ = (val >> 20);
    *c_ptr++ = (val >> 30);
}
But performance is inadequate for my application so I would like to improve this using some AVX2 optimisations.
I visited the website https://software.intel.com/sites/landingpage/IntrinsicsGuide/# to find any functions that could help, but it seems there is nothing that works with 10-bit words, only 8- or 16-bit. That seems logical since 10-bit is not native for a processor, but it makes things hard for me.
Is there any way to use AVX2 to solve this problem?
Your scalar loop does not compile efficiently. Compilers do it as 5 separate byte loads. You can express an unaligned 8-byte load in C++ with memcpy:
#include <stdint.h>
#include <string.h>

// do an 8-byte load that spans the 5 bytes we want
// clang auto-vectorizes using an AVX2 gather for 4 qwords. Looks pretty clunky but not terrible
void extract_10bit_fields_v2calar(const uint8_t *__restrict src,
                                  uint16_t *__restrict a_ptr, uint16_t *__restrict b_ptr, uint16_t *__restrict c_ptr,
                                  const uint8_t *max)
{
    for(const uint8_t *ptr = src; ptr < max; ptr += 5){
        uint64_t val;
        memcpy(&val, ptr, sizeof(val));

        const unsigned mask = (1U<<10) - 1; // unused in original source!?!
        *a_ptr++ = (val >> 0) & mask;
        *b_ptr++ = (val >> 10) & mask;
        *a_ptr++ = (val >> 20) & mask;
        *c_ptr++ = (val >> 30) & mask;
    }
}
ICC and clang auto-vectorize your 1-byte version, but do a very bad job (lots of insert/extract of single bytes). Here's your original and this function on Godbolt (with gcc and clang -O3 -march=skylake)
None of those 3 compilers are really close to what we can do manually.
Manual vectorization
My current AVX2 version of this answer forgot a detail: there are only 3 kinds of fields ABAC, not ABCD like 10-bit RGBA pixels. So I have a version of this which unpacks to 4 separate output streams (which I'll leave in because of the packed-RGBA use-case if I ever add a dedicated version for the ABAC interleave).
The existing version could use vpunpcklwd to interleave the two A parts instead of storing them with separate vmovq instructions; that should work for your case. There might be something more efficient, IDK.
BTW, I find it easier to remember and type instruction mnemonics, not intrinsic names. Intel's online intrinsics guide is searchable by instruction mnemonic.
Observations about your layout:
Each field spans one byte boundary, never two, so it's possible to assemble any 4 pairs of bytes in a qword that hold 4 complete fields.
Or with a byte shuffle, to create 2-byte words that each have a whole field at some offset. (e.g. for AVX512BW vpsrlvw, or for AVX2 2x vpsrld + word-blend.) A word shuffle like AVX512 vpermw would not be sufficient: some individual bytes need to be duplicated with the start of one field and end of another. I.e the source positions aren't all aligned words, especially when you have 2x 5 bytes inside the same 16-byte "lane" of a vector.
00-07|08-15|16-23|24-31|32-39 byte boundaries (8-bit)
00...09|10..19|20...29|30..39 field boundaries (10-bit)
Luckily 8 and 10 have a GCD of 2 which is >= 10-8=2. 8*5 = 4*10 so we don't get all possible start positions, e.g. never a field starting at the last bit of 1 byte, spanning another byte, and including the first bit of a 3rd byte.
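To make that concrete, here's a scalar helper (my own illustration, not part of the answer's SIMD strategy) showing that every field can be pulled out of two adjacent bytes with a shift of 0, 2, 4 or 6:
#include <stdint.h>
#include <stddef.h>
static inline unsigned field10(const uint8_t *p, size_t i)
{
    size_t bit = 10 * i;                                // bit offset of field i
    unsigned lo = p[bit / 8], hi = p[bit / 8 + 1];      // the two bytes containing it
    return ((lo | (hi << 8)) >> (bit % 8)) & 0x3FF;     // bit % 8 is always 0, 2, 4 or 6
}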
Possible AVX2 strategy: an unaligned 32-byte load that leaves 2x 5 bytes at the top of the low lane, and 2x 5 bytes at the bottom of the high lane. Then a vpshufb in-lane shuffle to set up for 2x vpsrlvd variable-count shifts, and a blend.
Quick summary of a new idea I haven't expanded yet.
Given an input of xxx a0B0A0C0 a1B1A1C1 | a2B2A2C2 a3B3A3C3 from our unaligned load, we can get a result of
a0 A0 a1 A1 B0 B1 C0 C1 | a2 A2 a3 A3 B2 B3 C2 C3 with the right choice of vpshufb control.
Then a vpermd can put all of those 32-bit groups into the right order, with all the A elements in the high half (ready for a vextracti128 to memory), and the B and C in the low half (ready for vmovq / vmovhps stores).
Use different vpermd shuffles for adjacent pairs so we can vpblendd to merge them for 128-bit B and C stores.
Old version, probably worse than unaligned load + vpshufb.
With AVX2, one option is to broadcast the containing 64-bit element to all positions in a vector and then use variable-count right shifts to get the bits to the bottom of a dword element.
You probably want to do a separate 64-bit broadcast-load for each group (thus partially overlapping with the previous), instead of trying to pick apart a __m256i of contiguous bits. (Broadcast-loads are cheap, shuffling is expensive.)
After _mm256_srlv_epi64, then AND to isolate the low 10 bits in each qword.
Repeat that 4 times for 4 vectors of input, then use _mm256_packus_epi32 to do in-lane packing down to 32-bit then 16-bit elements.
That's the simple version. Optimizations of the interleaving are possible, e.g. by using left or right shifts to set up for vpblendd instead of a 2-input shuffle like vpackusdw or vshufps. _mm256_blend_epi32 is very efficient on existing CPUs, running on any port.
This also allows delaying the AND until after the first packing step because we don't need to avoid saturation from high garbage.
Design notes:
shown as 32-bit chunks after variable-count shifts
[0 d0 0 c0 | 0 b0 0 a0] # after an AND mask
[0 d1 0 c1 | 0 b1 0 a1]
[0 d1 0 c1 0 d0 0 c0 | 0 b1 0 a1 0 b0 0 a0] # vpackusdw
shown as 16-bit elements but actually the same as what vshufps can do
---------
[X d0 X c0 | X b0 X a0] even the top element is only garbage right shifted by 30, not quite zero
[X d1 X c1 | X b1 X a1]
[d1 c1 d0 c0 | b1 a1 b0 a0 ] vshufps (can't do d1 d0 c1 c0 unfortunately)
---------
[X d0 X c0 | X b0 X a0] variable-count >> qword
[d1 X c1 X | b1 X a1 0] variable-count << qword
[d1 d0 c1 c0 | b1 b0 a1 a0] vpblendd
This last trick extends to vpblendw, allowing us to do everything with interleaving blends, no shuffle instructions at all, resulting in the outputs we want contiguous and in the right order in qwords of a __m256i.
x86 SIMD variable-count shifts can only be left or right for all elements, so we need to make sure that all the data is either left or right of the desired position, not some of each within the same vector. We could use an immediate-count shift to set up for this, but even better is to just adjust the byte-address we load from. For loads after the first, we know it's safe to load some of the bytes before the first bitfield we want (without touching an unmapped page).
# as 16-bit elements
[X X X d0 X X X c0 | ...] variable-count >> qword
[X X d1 X X X c1 X | ...] variable-count >> qword from an offset load that started with the 5 bytes we want all to the left of these positions
[X d2 X X X c2 X X | ...] variable-count << qword
[d3 X X X c3 X X X | ...] variable-count << qword
[X d2 X d0 X c2 X c0 | ...] vpblendd
[d3 X d1 X c3 X c1 X | ...] vpblendd
[d3 d2 d1 d0 c3 c2 c1 c0 | ...] vpblendw (Same behaviour in both high and low lane)
Then mask off the high garbage inside each 16-bit word
Note: this does 4 separate outputs, like ABCD or RGBA->planar, not ABAC.
// potentially unaligned 64-bit broadcast-load, hopefully vpbroadcastq. (clang: yes, gcc: no)
// defeats gcc/clang folding it into an AVX512 broadcast memory source
// but vpsllvq's ymm/mem operand is the shift count, not data
static inline
__m256i bcast_load64(const uint8_t *p) {
// hopefully safe with strict-aliasing since the deref is inside an intrinsic?
__m256i bcast = _mm256_castpd_si256( _mm256_broadcast_sd( (const double*)p ) );
return bcast;
}
// UNTESTED
// unpack 10-bit fields from 4x 40-bit chunks into 16-bit dst arrays
// overreads past the end of the last chunk by 1 byte
// for ABCD repeating, not ABAC, e.g. packed 10-bit RGBA
void extract_10bit_fields_4output(const uint8_t *__restrict src,
uint16_t *__restrict da, uint16_t *__restrict db, uint16_t *__restrict dc, uint16_t *__restrict dd,
const uint8_t *max)
{
// FIXME: cleanup loop for non-whole-vectors at the end
while( src<max ){
__m256i bcast = bcast_load64(src); // data we want is from bits [0 to 39], last starting at 30
__m256i ext0 = _mm256_srlv_epi64(bcast, _mm256_set_epi64x(30, 20, 10, 0)); // place at bottom of each qword
bcast = bcast_load64(src+5-2); // data we want is from bits [16 to 55], last starting at 30+16 = 46
__m256i ext1 = _mm256_srlv_epi64(bcast, _mm256_set_epi64x(30, 20, 10, 0)); // place it at bit 16 in each qword element
bcast = bcast_load64(src+10); // data we want is from bits [0 to 39]
__m256i ext2 = _mm256_sllv_epi64(bcast, _mm256_set_epi64x(2, 12, 22, 32)); // place it at bit 32 in each qword element
bcast = bcast_load64(src+15-2); // data we want is from bits [16 to 55], last field starting at 46
__m256i ext3 = _mm256_sllv_epi64(bcast, _mm256_set_epi64x(2, 12, 22, 32)); // place it at bit 48 in each qword element
__m256i blend20 = _mm256_blend_epi32(ext0, ext2, 0b10101010); // X d2 X d0 X c2 X c0 | X b2 ...
__m256i blend31 = _mm256_blend_epi32(ext1, ext3, 0b10101010); // d3 X d1 X c3 X c1 X | b3 X ...
__m256i blend3210 = _mm256_blend_epi16(blend20, blend31, 0b10101010); // d3 d2 d1 d0 c3 c2 c1 c0
__m256i res = _mm256_and_si256(blend3210, _mm256_set1_epi16((1U<<10) - 1) );
__m128i lo = _mm256_castsi256_si128(res);
__m128i hi = _mm256_extracti128_si256(res, 1);
_mm_storel_epi64((__m128i*)da, lo); // movq store of the lowest 64 bits
_mm_storeh_pi((__m64*)db, _mm_castsi128_ps(lo)); // movhps store of the high half of the low 128. Efficient: no shuffle uop needed on Intel CPUs
_mm_storel_epi64((__m128i*)dc, hi);
_mm_storeh_pi((__m64*)dd, _mm_castsi128_ps(hi)); // clang pessimizes this to vpextrq :(
da += 4;
db += 4;
dc += 4;
dd += 4;
src += 4*5;
}
}
This compiles (Godbolt) to about 21 front-end uops (on Skylake) in the loop, per 4 groups of 4 fields. (That includes a useless register copy for _mm256_castsi256_si128 instead of just using the low half of ymm0 = xmm0.) This will be very good on Skylake. There's a good balance of uops for different ports, and variable-count shift is 1 uop for either p0 or p1 on SKL (vs. more expensive previously). The bottleneck might be just the front-end limit of 4 fused-domain uops per clock.
Replays of cache-line-split loads will happen because the unaligned loads will sometimes cross a 64-byte cache-line boundary. But that's just in the back-end, and we have a few spare cycles on ports 2 and 3 because of the front-end bottleneck (4 loads and 4 stores per set of results, with indexed stores which thus can't use port 7). If dependent ALU uops have to get replayed as well, we might start seeing back-end bottlenecks.
Despite the indexed addressing modes, there won't be unlamination because Haswell and later can keep indexed stores micro-fused, and the broadcast loads are a single pure uop anyway, not micro-fused ALU+load.
On Skylake, it can maybe come close to 4x 40-bit groups per 5 clock cycles, if memory bandwidth isn't a bottleneck. (e.g. with good cache blocking.) Once you factor in overhead and cost of cache-line-split loads causing occasional stalls, maybe 1.5 cycles per 40 bits of input, i.e. 6 cycles per 20 bytes of input on Skylake.
On other CPUs (Haswell and Ryzen), the variable-count shifts will be a bottleneck, but you can't really do anything about that. I don't think there's anything better. On HSW it's 3 uops: p5 + 2p0. On Ryzen it's only 1 uop, but it only has 1 per 2 clock throughput (for the 128-bit version), or per 4 clocks for the 256-bit version which costs 2 uops.
Beware that clang pessimizes the _mm_storeh_pi store to vpextrq [mem], xmm, 1: 2 uops, shuffle + store. (Instead of vmovhps: a pure store on Intel, no ALU.) GCC compiles it as written.
I used _mm256_broadcast_sd even though I really want vpbroadcastq, just because there's an intrinsic that takes a pointer operand instead of a __m128i source (with AVX1, only the memory-source version existed; with AVX2, register-source versions of all the broadcast instructions exist). To use _mm256_set1_epi64x, I'd have to write pure C that didn't violate strict aliasing (e.g. with memcpy) to do an unaligned uint64_t load. I don't think it will hurt performance to use an FP broadcast load on current CPUs, though.
I'm hoping _mm256_broadcast_sd allows its source operand to alias anything without C++ strict-aliasing undefined behaviour, the same way _mm256_loadu_ps does. Either way it will work in practice if it doesn't inline into a function that stores into *src, and maybe even then. So maybe a memcpy unaligned load would have made more sense!
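For what it's worth, a memcpy-based broadcast load would look something like this (my own untested sketch); compilers will usually fold the set1 back into a single vpbroadcastq load:
#include <stdint.h>
#include <string.h>
#include <immintrin.h>
static inline __m256i bcast_load64_memcpy(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof(v));                // well-defined unaligned load, no aliasing UB
    return _mm256_set1_epi64x((long long)v);
}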
I've had bad results in the past with getting compilers to emit pmovzxwd xmm0, [mem] from code like _mm_cvtepu16_epi32( _mm_loadu_si64(ptr) ); you often get an actual movq load + reg-reg pmovzx. That's why I didn't try that _mm256_broadcastq_epi64(__m128i).
Old idea; if we already need a byte shuffle we might as well use plain word shifts instead of vpmultishift.
With AVX512VBMI (IceLake, CannonLake), you might want vpmultishiftqb. Instead of broadcasting / shifting one group at a time, we can do all the work for a whole vector of groups after putting the right bytes in the right places first.
You'd still need/want a version for CPUs with some AVX512 but not AVX512VBMI (e.g. Skylake-avx512). Probably vpermd + vpshufb can get the bytes we need into the 128-bit lanes we want.
I don't think we can get away with using only dword-granularity shifts to allow merge-masking instead of dword blend after qword shift. We might be able to merge-mask a vpblendw though, saving a vpblendd
IceLake has 1/clock vpermw and vpermb, single-uop. (It has a 2nd shuffle unit on another port that handles some shuffle uops). So we can load a full vector that contains 4 or 8 groups of 4 elements and shuffle every byte into place efficiently. I think every CPU that has vpermb has it single-uop. (But that's only Ice Lake and the limited-release Cannon Lake).
vpermt2w (to combine 16-bit element from 2 vectors into any order) is one per 2 clock throughput. (InstLatx64 for IceLake-Y), so unfortunately it's not as efficient as the one-vector shuffles.
Anyway, you might use it like this:
64-byte / 512-bit load (includes some over-read at the end from 8x 8-byte groups instead of 8x 5-byte groups. Optionally use a zero-masked load to make this safe near the end of an array thanks to fault suppression)
vpermb to put the 2 bytes containing each field into desired final destination position.
vpsrlvw + vpandq to extract each 10-bit field into a 16-bit word
That's about 4 uops, not including the stores.
You probably want the high half containing the A elements for a contiguous vextracti64x4 and the low half containing the B and C elements for vmovdqu and vextracti128 stores.
Or for 2x vpblenddd to set up for 256-bit stores. (Use 2 different vpermb vectors to create 2 different layouts.)
You shouldn't need vpermt2w or vpermt2d to combine adjacent vectors for wider stores.
Without AVX512VBMI, probably a vpermd + vpshufb can get all the necessary bytes into each 128-bit chunk instead of vpermb. The rest of it only requires AVX512BW which Skylake-X has.

Creating a mask with N least significant bits set

I would like to create a macro or function1 mask(n) which given a number n returns an unsigned integer with its n least significant bits set. Although this seems like it should be a basic primitive with heavily discussed implementations which compile efficiently - this doesn't seem to be the case.
Of course, various implementations may have different sizes for the primitive integral types like unsigned int, so let's assume for the sake of concreteness that we are talking about returning a uint64_t specifically, although of course acceptable solutions would work (with different definitions) for any unsigned integral type. In particular, the solution should be efficient when the type returned is equal to or smaller than the platform's native width.
Critically, this must work for all n in [0, 64]. In particular mask(0) == 0 and mask(64) == (uint64_t)-1. Many "obvious" solutions don't work for one of these two cases.
The most important criteria is correctness: only correct solutions which don't rely on undefined behavior are interesting.
The second most important criteria is performance: the idiom should ideally compile to approximately the most efficient platform-specific way to do this on common platforms.
A solution that sacrifices simplicity in the name of performance, e.g., that uses different implementations on different platforms, is fine.
1 The most general case is a function, but ideally it would also work as a macro, without re-evaluating any of its arguments more than once.
Try
#include <assert.h>

unsigned long long mask(const unsigned n)
{
    assert(n <= 64);
    return (n == 64) ? 0xFFFFFFFFFFFFFFFFULL :
                       (1ULL << n) - 1ULL;
}
There are several great, clever answers that avoid conditionals, but a modern compiler can generate code for this that doesn’t branch.
Your compiler can probably figure out to inline this, but you might be able to give it a hint with inline or, in C++, constexpr.
The unsigned long long int type is guaranteed to be at least 64 bits wide and present on every implementation, which uint64_t is not.
If you need a macro (because you need something that works as a compile-time constant), that might be:
#define mask(n) ((64U == (n)) ? 0xFFFFFFFFFFFFFFFFULL : (1ULL << (unsigned)(n)) - 1ULL)
As several people correctly reminded me in the comments, 1ULL << 64U is potential undefined behavior! So, insert a check for that special case.
You could replace 64U with CHAR_BIT*sizeof(unsigned long long) if it is important to you to support the full range of that type on an implementation where it is wider than 64 bits.
You could similarly generate this from an unsigned right shift of all-ones by 64 - n, but you would then need to check n == 0 as a special case, since shifting by the width of the type is likewise undefined behavior.
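A minimal sketch of that right-shift variant (same assert convention as above, with n == 0 as the special case):
unsigned long long mask_rshift(const unsigned n)
{
    assert(n <= 64);
    return (n == 0) ? 0 : (0xFFFFFFFFFFFFFFFFULL >> (64 - n));
}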
ETA:
The relevant portion of the (N1570 Draft) standard says, of both left and right bit shifts:
If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
This tripped me up. Thanks again to everyone in the comments who reviewed my code and pointed the bug out to me.
Another solution without branching
unsigned long long mask(unsigned n)
{
return ((1ULL << (n & 0x3F)) & -(n != 64)) - 1;
}
n & 0x3F keeps the shift amount to maximum 63 in order to avoid UB. In fact most modern architectures will just grab the lower bits of the shift amount, so no and instruction is needed for this.
The checking condition for 64 can be changed to -(n < 64) to make it return all ones for n ⩾ 64, which is equivalent to _bzhi_u64(-1ULL, (uint8_t)n) if your CPU supports BMI2.
The output from Clang looks better than gcc. As it happens gcc emits conditional instructions for MIPS64 and ARM64 but not for x86-64, resulting in longer output
The condition can also be simplified to n >> 6, utilizing the fact that it'll be one if n = 64. And we can subtract that from the result instead of creating a mask like above
return (1ULL << (n & 0x3F)) - (n == 64) - 1; // or n >= 64
return (1ULL << (n & 0x3F)) - (n >> 6) - 1;
gcc compiles the latter to
mov eax, 1
shlx rax, rax, rdi
shr edi, 6
dec rax
sub rax, rdi
ret
Some more alternatives
return ~((~0ULL << (n & 0x3F)) << (n == 64));
return ((1ULL << (n & 0x3F)) - 1) | (((uint64_t)n >> 6) << 63);
return (uint64_t)(((__uint128_t)1 << n) - 1); // if a 128-bit type is available
A similar question for 32 bits: Set last `n` bits in unsigned int
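As an aside, the _bzhi_u64 form mentioned above can be wrapped directly; a minimal sketch assuming BMI2 is enabled (e.g. -mbmi2 or a suitable -march):
#include <stdint.h>
#include <immintrin.h>
static inline uint64_t mask_bzhi(unsigned n)
{
    // bzhi leaves the source unchanged when n >= 64, so mask_bzhi(64) is all-ones
    return _bzhi_u64(~0ULL, n);
}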
Here's one that is portable and conditional-free:
unsigned long long mask(unsigned n)
{
    assert(n <= sizeof(unsigned long long) * CHAR_BIT);
    // each shift count is at most 32, so neither shift can be UB;
    // for n == 64 the value wraps to 0, and 0 - 1 gives all-ones
    return (1ULL << (n/2) << (n-(n/2))) - 1;
}
This is not an answer to the exact question. It only works if 0 isn't a required output, but is more efficient.
2^(n+1) - 1 computed without overflow, i.e. an integer with the low n+1 bits set, for n = 0 .. all_bits-1.
Possibly using this inside a ternary for cmov could be a more efficient solution to the full problem in the question. Perhaps based on a left-rotate of a number with the MSB set, instead of a left-shift of 1, to take care of the difference in counting for this vs. the question for the pow2 calculation.
// defined for n=0 .. sizeof(unsigned long long)*CHAR_BIT
unsigned long long setbits_upto(unsigned n) {
unsigned long long pow2 = 1ULL << n;
return pow2*2 - 1; // one more shift, and subtract 1.
}
Compiler output suggests an alternate version, good on some ISAs if you're not using gcc/clang (which already do this): bake in an extra shift count so it is possible for the initial shift to shift out all the bits, leaving 0 - 1 = all bits set.
unsigned long long setbits_upto2(unsigned n) {
unsigned long long pow2 = 2ULL << n; // bake in the extra shift count
return pow2 - 1;
}
The table of inputs / outputs for a 32-bit version of this function is:
n -> 1<<n -> *2 - 1
0 -> 1 -> 1 = 2 - 1
1 -> 2 -> 3 = 4 - 1
2 -> 4 -> 7 = 8 - 1
3 -> 8 -> 15 = 16 - 1
...
30 -> 0x40000000 -> 0x7FFFFFFF = 0x80000000 - 1
31 -> 0x80000000 -> 0xFFFFFFFF = 0 - 1
You could slap a cmov after it, or other way of handling an input that has to produce zero.
On x86, we can efficiently compute this with 3 single-uop instructions: (Or 2 uops for BTS on Ryzen).
xor eax, eax
bts rax, rdi ; rax = 1<<(n&63)
lea rax, [rax + rax - 1] ; one more left shift, and subtract
(3-component LEA has 3 cycle latency on Intel, but I believe this is optimal for uop count and thus throughput in many cases.)
In C this compiles nicely for all 64-bit ISAs except x86 Intel SnB-family
C compilers unfortunately are dumb and miss using bts even when tuning for Intel CPUs without BMI2 (where shl reg,cl is 3 uops).
e.g. gcc and clang both do this (with dec or add -1), on Godbolt
# gcc9.1 -O3 -mtune=haswell
setbits_upto(unsigned int):
mov ecx, edi
mov eax, 2 ; bake in the extra shift by 1.
sal rax, cl
dec rax
ret
MSVC starts with n in ECX because of the Windows x64 calling convention, but modulo that, it and ICC do the same thing:
# ICC19
setbits_upto(unsigned int):
mov eax, 1 #3.21
mov ecx, edi #2.39
shl rax, cl #2.39
lea rax, QWORD PTR [-1+rax+rax] #3.21
ret #3.21
With BMI2 (e.g. -march=haswell), we get optimal-for-AMD code from gcc/clang:
mov eax, 2
shlx rax, rax, rdi
add rax, -1
ICC still uses a 3-component LEA, so if you target MSVC or ICC use the 2ULL << n version in the source whether or not you enable BMI2, because you're not getting BTS either way. And this avoids the worst of both worlds; slow-LEA and a variable-count shift instead of BTS.
On non-x86 ISAs (where presumably variable-count shifts are efficient because they don't have the x86 tax of leaving flags unmodified if the count happens to be zero, and can use any register as the count), this compiles just fine.
e.g. AArch64. And of course this can hoist the constant 2 for reuse with different n, like x86 can with BMI2 shlx.
setbits_upto(unsigned int):
mov x1, 2
lsl x0, x1, x0
sub x0, x0, #1
ret
Basically the same on PowerPC, RISC-V, etc.
#include <stdint.h>

uint64_t mask_n_bits(const unsigned n){
    uint64_t ret = n < 64;
    ret <<= n&63; //the &63 is typically optimized away
    ret -= 1;
    return ret;
}
Results:
mask_n_bits:
xor eax, eax
cmp edi, 63
setbe al
shlx rax, rax, rdi
dec rax
ret
It returns the expected results, and if passed a constant value it will be optimized to a constant mask in clang and gcc, as well as icc, at -O2 (but not -Os).
Explanation:
The &63 gets optimized away, but ensures the shift count stays below 64 (avoiding UB).
For values less than 64 it just sets the first n bits using (1<<n)-1. 1<<n sets the nth bit (equivalent to pow(2,n)), and subtracting 1 from a power of 2 sets all bits below it.
By using the conditional to set the initial 1 to be shifted, no branch is created, yet it gives you a 0 for all values >= 64, because left-shifting a 0 always yields 0. Therefore when we subtract 1, we get all bits set for values of 64 and larger (thanks to the wraparound / 2's-complement representation of -1).
Caveats:
1s complement systems must die - requires special casing if you have one
some compilers may not optimize the &63 away
When the input N is between 1 and 64, we can use -uint64_t(1) >> (64-N & 63).
The constant -1 has 64 set bits and we shift 64-N of them away, so we're left with N set bits.
When N=0, we can make the constant zero before shifting:
uint64_t mask(unsigned N)
{
return -uint64_t(N != 0) >> (64-N & 63);
}
This compiles to five instructions in x64 clang:
neg sets the carry flag to N != 0.
sbb turns the carry flag into 0 or -1.
shr rax,N already has an implicit N & 63, so 64-N & 63 was optimized to -N.
mov rcx,rdi
neg rcx
sbb rax,rax
shr rax,cl
ret
With the BMI2 extension, it's only four instructions (the shift length can stay in rdi):
neg edi
sbb rax,rax
shrx rax,rax,rdi
ret

The best way to shift a __m128i?

I need to shift a __m128i variable (say v) by m bits, in such a way that bits move through all of the variable (so the resulting value represents v*2^m).
What is the best way to do this?!
Note that _mm_slli_epi64 shifts v0 and v1 separately:
r0 := v0 << count
r1 := v1 << count
so the high bits of v0 are lost, but I want to move those bits into r1.
Edit:
I'm looking for code faster than this (m < 64):
r0 = v0 << m;
r1 = v0 >> (64-m);
r1 ^= v1 << m;
r2 = v1 >> (64-m);
For compile-time constant shift counts, you can get fairly good results. Otherwise not really.
This is just an SSE implementation of the r0 / r1 code from your question, since there's no other obvious way to do it. Variable-count shifts are only available for bit-shifts within vector elements, not for byte-shifts of the whole register. So we just carry the low 64bits up to the high 64 and use a variable-count shift to put them in the right place.
// untested
#include <immintrin.h>

/* some compilers might choke on slli / srli with non-compile-time-constant args
 * gcc generates the xmm, imm8 form with constants,
 * and generates the xmm, xmm form otherwise. (With a movd to get the count into an xmm)
 */

// doesn't optimize for the special-case where count%8 = 0
// could maybe do that in gcc with if(__builtin_constant_p(count)) { if (!count%8) return ...; }
__m128i mm_bitshift_left(__m128i x, unsigned count)
{
    __m128i carry = _mm_bslli_si128(x, 8);       // old compilers only have the confusingly named _mm_slli_si128 synonym
    if (count >= 64)
        return _mm_slli_epi64(carry, count-64);  // the non-carry part is all zero, so return early
    // else
    carry = _mm_srli_epi64(carry, 64-count);     // After bslli shifted left by 64b
    x = _mm_slli_epi64(x, count);
    return _mm_or_si128(x, carry);
}

__m128i mm_bitshift_left_3(__m128i x) { // by a specific constant, to see the inlined constant version
    return mm_bitshift_left(x, 3);
}

// by a specific constant, to see the inlined constant version
__m128i mm_bitshift_left_100(__m128i x) { return mm_bitshift_left(x, 100); }
I thought this was going to be less convenient than it turned out to be. _mm_slli_epi64 works on gcc/clang/icc even when the count is not a compile-time constant (generating a movd from integer reg to xmm reg). There is a _mm_sll_epi64 (__m128i a, __m128i count) (note the lack of i), but at least these days, the i intrinsic can generate either form of psllq.
The compile-time-constant count versions are fairly efficient, compiling to 4 instructions (or 5 without AVX):
mm_bitshift_left_3(long long __vector(2)):
vpslldq xmm1, xmm0, 8
vpsrlq xmm1, xmm1, 61
vpsllq xmm0, xmm0, 3
vpor xmm0, xmm0, xmm1
ret
Performance:
This has 3 cycle latency (vpslldq(1) -> vpsrlq(1) -> vpor(1)) on Intel SnB/IvB/Haswell, with throughput limited to one per 2 cycles (saturating the vector shift unit on port 0). Byte-shift runs on the shuffle unit on a different port. Immediate-count vector shifts are all single-uop instructions, so this is only 4 fused-domain uops taking up pipeline space when mixed in with other code. (Variable-count vector shifts are 2 uop, 2 cycle latency, so the variable-count version of this function is worse than it looks from counting instructions.)
Or for counts >= 64:
mm_bitshift_left_100(long long __vector(2)):
vpslldq xmm0, xmm0, 8
vpsllq xmm0, xmm0, 36
ret
If your shift-count is not a compile-time constant, you have to branch on count > 64 to figure out whether to left or right shift the carry. I believe the shift count is interpreted as an unsigned integer, so a negative count is impossible.
It also takes extra instructions to get the int count and 64-count into vector registers. Doing this in a branchless fashion with vector compares and a blend instruction might be possible, but a branch is probably a good idea.
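One branchless possibility (my own untested sketch, not the compare-and-blend approach; it relies instead on the fact that SSE per-element shifts produce 0 for counts >= 64): compute both the count<64 and count>=64 results unconditionally and OR them together.
#include <immintrin.h>
__m128i mm_bitshift_left_branchless(__m128i x, unsigned count)
{
    __m128i carry = _mm_bslli_si128(x, 8);   // low qword moved up into the high qword
    // count < 64 path: (x << count) | (carry >> (64-count)); both terms become 0 when count >= 64
    __m128i lo = _mm_or_si128(_mm_sll_epi64(x,     _mm_cvtsi32_si128((int)count)),
                              _mm_srl_epi64(carry, _mm_cvtsi32_si128(64 - (int)count)));
    // count >= 64 path: carry << (count-64); for count < 64 the count wraps to a huge value, giving 0
    __m128i hi = _mm_sll_epi64(carry, _mm_cvtsi32_si128((int)count - 64));
    return _mm_or_si128(lo, hi);
}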
The variable-count version for __uint128_t in GP registers looks fairly good; better than the SSE version. Clang does a slightly better job than gcc, emitting fewer mov instructions, but it still uses two cmov instructions for the count >= 64 case. (Because x86 integer shift instructions mask the count, instead of saturating.)
__uint128_t leftshift_int128(__uint128_t x, unsigned count) {
return x << count; // undefined if count >= 128
}
In SSE4a the instructions insrq and extrq can be used to shift (and rotate) through a __m128i 1-64 bits at a time. Unlike the 8/16/32/64-bit counterparts pextrN/pinsrN, these instructions select or insert m bits (between 1 and 64) at any bit offset from 0 to 127. The caveat is that the sum of length and offset must not exceed 128.

128-bit rotation using ARM Neon intrinsics

I'm trying to optimize my code using NEON intrinsics. I need to do a 24-bit rotation over a 128-bit value held as an array of 8 uint16_t.
Here is my C code:
uint16_t rotated[8];
uint16_t temp[8];
uint16_t j;

for(j = 0; j < 8; j++)
{
    // Rotation <<< 24 over 128 bits: (x << shift) | (x >> (16 - shift))
    rotated[j] = ((temp[(j+1) % 8] << 8) & 0xffff) | ((temp[(j+2) % 8] >> 8) & 0x00ff);
}
I've checked the gcc documentation about NEON intrinsics and it doesn't list an instruction for vector rotation. Moreover, I've tried to do this using vshlq_n_u16(temp, 8), but all the bits shifted outside each uint16_t lane are lost.
How can I achieve this using NEON intrinsics? By the way, is there better documentation about GCC NEON intrinsics?
After some reading on Arm Community Blogs, I've found this :
VEXT: Extract
VEXT extracts a new vector of bytes from a pair of existing vectors. The bytes in the new vector are from the top of the first operand, and the bottom of the second operand. This allows you to produce a new vector containing elements that straddle a pair of existing vectors. VEXT can be used to implement a moving window on data from two vectors, useful in FIR filters. For permutation, it can also be used to simulate a byte-wise rotate operation, when using the same vector for both input operands.
The following NEON GCC intrinsic does the same as the assembly shown in that blog post:
uint16x8_t vextq_u16 (uint16x8_t, uint16x8_t, const int)
So the 24-bit rotation over a full 128-bit vector (not within each element) can be done with the following:
uint16x8_t input;
uint16x8_t t0;
uint16x8_t t1;
uint16x8_t rotated;
t0 = vextq_u16(input, input, 1);
t0 = vshlq_n_u16(t0, 8);
t1 = vextq_u16(input, input, 2);
t1 = vshrq_n_u16(t1, 8);
rotated = vorrq_u16(t0, t1);
Use vext.8 to concat a vector with itself and give you the 16-byte window that you want (in this case offset by 3 bytes).
Doing this with intrinsics requires casting to keep the compiler happy, but it's still a single instruction:
#include <arm_neon.h>
uint16x8_t byterotate3(uint16x8_t input) {
uint8x16_t tmp = vreinterpretq_u8_u16(input);
uint8x16_t rotated = vextq_u8(tmp, tmp, 16-3);
return vreinterpretq_u16_u8(rotated);
}
g++5.4 -O3 -march=armv7-a -mfloat-abi=hard -mfpu=neon (on Godbolt) compiles it to this:
byterotate3(__simd128_uint16_t):
vext.8 q0, q0, q0, #13
bx lr
A count of 16-3 means we left-rotate by 3 bytes. (It means we take 13 bytes from the left vector and 3 bytes from the right vector, so it's also a right-rotate by 13).
Related: x86 also has an instruction that takes a sliding window into the concatenation of two registers: palignr (added in SSSE3).
Maybe I'm missing something about NEON, but I don't understand why the OP's self-answer is using vext.16 (vextq_u16), which has 16-bit granularity. It's not even a different instruction, just an alias for vext.8 which makes it impossible to use an odd-numbered count, requiring extra instructions. The manual for vext.8 says:
VEXT pseudo-instruction
You can specify a datatype of 16, 32, or 64 instead of 8. In this
case, #imm refers to halfwords, words, or doublewords instead of
referring to bytes, and the permitted ranges are correspondingly
reduced.
I'm not 100% sure but I don't think NEON has rotate instructions.
You can compose the rotation operation you require with a left shift, a right shift and an OR, e.g.:
uint8_t ror(uint8_t in, int rotation)
{
    return (in >> rotation) | (in << (8-rotation));
}
Just do the same with the NEON intrinsics for left shift, right shift and OR (note that the _n_ intrinsics need a compile-time-constant shift count; rot = 8 is used here as an example):
uint16x8_t temp;
enum { rot = 8 };   // must be a compile-time constant
uint16x8_t rotated = vorrq_u16( vshlq_n_u16(temp, rot), vshrq_n_u16(temp, 16 - rot) );
See http://en.wikipedia.org/wiki/Circular_shift "Implementing circular shifts."
This will rotate the values inside the lanes. If you want to rotate the lanes themselves use VEXT as described in the other answer.
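If the per-lane rotate count is only known at run time, the _n_ immediate forms won't work; NEON's register-count shift (vshlq_u16, i.e. VSHL with a vector of signed counts, where a negative count shifts right) can be used instead. A hedged sketch, assuming 0 < rot < 16:
#include <arm_neon.h>
uint16x8_t rotl_lanes_u16(uint16x8_t v, int rot)
{
    int16x8_t lcount = vdupq_n_s16((int16_t)rot);          // left shift by rot
    int16x8_t rcount = vdupq_n_s16((int16_t)(rot - 16));   // negative count: right shift by 16-rot
    return vorrq_u16(vshlq_u16(v, lcount), vshlq_u16(v, rcount));
}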

Resources