Extracting edges of AVX2 16x16 bitmatrix

Extracting edges of AVX2 16x16 bitmatrix - c

Is there a relatively cheap way to extract the four edges (rows 0 and 15, and columns 0 and 15) of a 16x16 bitmatrix stored in a __m256i into four 16b lanes of a __m256i? I don't care which lanes the output is to, or if there is garbage in the rest of the register. Mild preference for all of them to be in the low half, but only mild.
Extracting the 'top' and 'bottom' are easy - it's just the first and last 16b elements of the vector, done - but the sides are another matter. You need the first and last bits of each 16b element, which gets complicated.
You can do it with a full bit-transpose, like so:
// Full bit-transpose of input viewed as a 16x16 bitmatrix.
extern __m256i transpose(__m256i m);
__m256i get_edges(__m256i m) {
__m256i t = transpose(m);
// We only care about first and last u16 of each
// m = [abcdefghijklmnop]
// t = [ABCDEFGHIJKLMNOP]
m = _mm256_permutevar8x32_epi32(m, _mm256_set_epi32(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7, 0x0));
// m = [............a..p]
t = _mm256_permutevar8x32_epi32(t, _mm256_set_epi32(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7, 0x0));
// m = [............A..P]
__m256i r = _mm256_unpacklo_epi16(t, m);
// r = [........aA....pP]
return r; // output in low and high dwords of low half
}
... but that just reduces one surprisingly annoying problem to another surprisingly annoying problem - I can't see how to cheaply do a full bit-transpose of a __m256i.
Ditto, there might be something _mm256_movemask_epi8-esque that could do the trick - but nothing jumps out at me.
Is there a better approach?

With fast BMI2 pext (Haswell or Zen 3 and later), that's one option if you start with vpmovmskb + shift + vpmovmskb to get the bits of the edges (interleaved with garbage bits, since we want every 16th but we get every 8th).
9 uops for the front-end, 6 of them needing port 5 on Intel Skylake-family. (Not counting the integer constant setup, assuming you'd do this in a loop. If not, that also counts against this.)
__m128i edges_zen3_intel(__m256i v)
{
__m128i vtop_bottom = _mm256_castsi256_si128(
_mm256_permute4x64_epi64(v, _MM_SHUFFLE(0,0, 3, 0)) );
// vpermq: 3 uops on Zen1, 2 on Zen2&3, 1 on Zen4 and Intel.
// side bits interleaved with garbage
// without AVX-512 we can only extract a bit per byte, dword, or qword
unsigned left = _mm256_movemask_epi8(v); // high bit of each element
unsigned right = _mm256_movemask_epi8( _mm256_slli_epi16(v, 15) ); // low<<15
// left = _pext_u32(left, 0xAAAAAAAAul); // take every other bit starting with #1
// right = _pext_u32(right, 0xAAAAAAAAul);
// then combine or do whatever
uint64_t lr = ((uint64_t)left << 32) | right;
lr = _pext_u64(lr, 0xAAAAAAAAAAAAAAAAull);
//__m128i vsides = _mm_cvtsi32_si128(lr);
__m128i vtblr = _mm_insert_epi32(vtop_bottom, lr, 1); // into an unused space
// u16 elems: [ top | x | x | x | left | right | x | bottom ]
return vtblr;
}
This compiles to 10 uops for Intel CPUs (and Zen 4), including getting everything back into one SIMD vector. The movabs can be hoisted out of loops. SHL/OR don't compete for SIMD execution-port throughput (able to run on port 6 on Intel), but do compete for the front-end. Godbolt
# Haswell/Sklake uop counts
edges_zen3_intel(long long __vector(4)):
vpsllw ymm2, ymm0, 15 # p0 (or p01 on Skylake)
vpmovmskb eax, ymm0 # p0
vpermq ymm1, ymm0, 12 # p5
vpmovmskb edx, ymm2 # p0
sal rax, 32 # p06
or rax, rdx # p0156
movabs rdx, -6148914691236517206 # p0156 (and can be hoisted out of loops)
pext rax, rax, rdx # p1
vpinsrd xmm0, xmm1, eax, 1 # 2 p5. On Intel, both uops compete with shuffles
ret
As a variation, we could maybe get left and right edges together for one vpmovmskb, if we can left-shift the odd bytes but not the evens? Probably not, _mm256_maddubs_epi16 with _mm256_set1_epi16(0x0180) can't do that, it adds horizontal pairs, and a left-shift of 7 (0x80 = 1<<7) isn't enough, we'd need 8 to get the top bit back to the top.
Or if we vpsllw + vpacksswb, then use the right masks to group bits, like 0x00ff00ff. But that's getting closer to my non-pext idea, maybe it's better even if we do have fast pext
Without fast BMI2 pext - saturating pack the vector to reduce to 8-bit elements
This might be faster even if pext is fast.
Packing with signed saturation always preserves the sign bit, so you can narrow 16 to 8-bit without losing the information you want to keep. We want to do this to the high and low bit of each word (16-bit element), so a 2:1 pack with the original and v<<15 is perfect.
Except for the fact that AVX2 vpacksswb ymm is two separate in-lane pack operations, so we end up with 8-element chunks interleaved. We could fix that up right after packing with vpermq, but it's multiple uops on Zen 1 through Zen 3, and we can instead shuffle bytes after getting the movemask result back into a vector register. (The same vpshufb can move around the high and low elements.)
// avoiding PEXT because it's slow on Zen 2 and Zen 1 (and Excavator)
// This might be good on Intel and Zen 3, maybe comparable to using PEXT
__m128i edges_no_pext(__m256i v)
{
__m128i vhi = _mm256_extract_si128(v, 1); // contains top, as vhi.u16[7]
__m128i vlo = _mm256_castsi256_si128(v); // contains bottom, as vlo.u16[0], contiguous if concatenated the right way
__m128i bottom_top = _mm_alignr_epi8(vhi, vlo, 12); // rotate bottom :top down to the 2nd dword [ x | x | bottom:top | x]
// vpermq ymm, ymm, imm would also work to get them into the low 128
// but that's 3 uops on Zen1, 2 on Zen2&3, 1 on Zen4 and Intel.
// and would need a slightly more expensive vpinsrd instead of vmovd+vpblendd
// On Intel CPUs (and Zen4) vpermq is better; we pshufb later so we can get the bytes where we want them.
// A compromise is to use vextracti128+vpblendd here, vpinsrd later
// __m128i bottom_top = _mm_blend_epi32(vhi, vlo, 0b0001);
// [ hi | x | x | x | x | x | x | lo ]
__m256i vright = _mm256_slli_epi16(v, 15);
__m256i vpacked = _mm256_packs_epi16(v, vright); // pack now, shuffle bytes later.
unsigned bits = _mm256_extract_epi8(vpacked); // [ left_hi | right_hi | left_lo | right_lo ]
__m128i vsides = _mm_cvtsi32_si128(bits);
__m128i vtblr = _mm_blend_epi32(top_bottom, vsides, 0b0001); // vpinsrd xmm0, eax, 0 but the merge can run on more ports
__m128i shuffle = _mm_set_epi8(-1,-1,-1,-1, -1,-1,-1,-1,
7,6,5,4, 3,1, 2,0);
// swap middle 2 bytes of the low dword, fixing up the in-lane pack
vtblr = _mm_shuffle_epi8(vtblr, shuffle);
return vtblr; // low 4 u16 elements are (MSB) top | bottom | left | right (LSB)
}
This compiles pretty nicely (see earlier Godbolt link), although GCC4.9 and later (and clang) pessimize my vmovd+vpblendd into vpinsrd, even with -march=haswell or Skylake where it's 2 uops for port 5 (https://uops.info/) when most of the other instructions in the function are also shuffles that only run on port 5. (This is much more shuffle-heavy for Intel CPUs.)
Using vpblendd instead of vpalignr would make it less bad for Intel, like __m128i bottom_top = _mm_blend_epi32(vhi, vlo, 0b0001);, to get to the same situation as in the vpermq version below with 2 uops even on Zen 1. But this is just saving 1 uop on Zen 1 and is equal or worse everywhere else.
# GCC12 -O3 -march=haswell
# uop counts for Skylake
edges_no_pext:
vextracti128 xmm1, ymm0, 0x1 # p5
vpsllw ymm2, ymm0, 15 # p01
vpalignr xmm1, xmm1, xmm0, 12 # p5
vpacksswb ymm0, ymm0, ymm2 # p5
vpmovmskb eax, ymm0 # p0
vpinsrd xmm0, xmm1, eax, 0 # 2 p5
vpshufb xmm0, xmm0, XMMWORD PTR .LC0[rip] # p5
ret
So that's 6 uops for port 5 on Intel, a throughput bottleneck of 1 per 6 cycles. vs. the PEXT version being 3 uops that need port 0, 3 that need port 5. But this is only 8 total uops for the front-end, vs. 9 for the pext version. And the vpermq version saves one more on Intel, assuming GCC doesn't waste the vmovdqa after inlining.
If you didn't care about zeroing the upper 8 bytes of the output vector, the shuffle constant could be loaded with vmovq and just be 8 bytes instead of 16 (if you made the upper 0 bytes all zeros). But compilers will probably not spot that optimization.
Since compilers insist on pessimizing to vpinsrd, on CPUs with fast vpermq (Intel and Zen4), we might as well use that:
If you're only going to have one non-GFNI AVX2 version, this is probably a good tradeoff
vpermq being 3 uops on Zen 1 isn't much worse than emulating what we need from it using 2 instruction, and is worse on Intel CPUs. And probably about break-even on Zen 2 and Zen 3, modulo differences in back-end port usage.
// for fast vpermq, especially if compilers are going to pessimize vmovd(p5)+vpblendd (p015) into vpinsrd (2p5).
// good on Intel and Zen 4, maybe also Zen 3 and not bad on Zen 2.
__m128i edges_no_pext_fast_vpermq(__m256i v)
{
__m128i vtop_bottom = _mm256_castsi256_si128(
_mm256_permute4x64_epi64(v, _MM_SHUFFLE(0,0, 3, 0)) );
// 3 uops on Zen1, 2 on Zen2&3, 1 on Zen4 and Intel.
__m256i vright = _mm256_slli_epi16(v, 15);
__m256i vpacked = _mm256_packs_epi16(v, vright); // pack now, shuffle bytes later.
unsigned bits = _mm256_movemask_epi8(vpacked); // [ left_hi | right_hi | left_lo | right_lo ]
__m128i vtblr = _mm_insert_epi32(vtop_bottom, bits, 1); // into an unused space
// u16 elems: [ top | x | x | x | lh:rh | ll:rl | x | bottom ]
__m128i shuffle = _mm_set_epi8(-1,-1,-1,-1, -1,-1,-1,-1,
15,14, 1,0, 7,5, 6,4);
vtblr = _mm_shuffle_epi8(vtblr, shuffle);
return vtblr; // low 4 u16 elements are (MSB) top | bottom | left | right (LSB)
}
# GCC12.2 -O3 -march=haswell clang is similar but has vzeroupper despite the caller passing a YMM, but no wasted vmovdqa
edges_no_pext_fast_vpermq(long long __vector(4)):
vmovdqa ymm1, ymm0
vpermq ymm0, ymm0, 12
vpsllw ymm2, ymm1, 15
vpacksswb ymm1, ymm1, ymm2
vpmovmskb eax, ymm1
vpinsrd xmm0, xmm0, eax, 1
vpshufb xmm0, xmm0, XMMWORD PTR .LC1[rip]
ret
On Intel Haswell/Skylake, this is 5 uops for port 5, plus a shift (p01) and vpmovmskb (p0). So 7 total uops. (Not counting the ret or the wasted vmovdqa that should go away with inlining.)
On Ice Lake and later, one of the uops from vpinsrd can run on p15, relieving one uop of pressure on that port if you're doing this in a loop. vpinsrd is single-uop on Alder Lake E-cores.
Ice Lake (and later) can also run vpshufb on p1/p5, further reducing port 5 pressure, down to 3 of the 7 uops. Port 5 can handle any shuffle, port 1 can handle some but not all shuffle uops. It may be hooked up to the upper half of the 512-bit shuffle unit to give extra throughput for some 256-bit and narrower shuffles, like how the p0/p1 FMA units work as a single 512-bit FMA unit on p0. It doesn't handle vpermq or vpacksswb; those are still p5 only on Ice/Alder Lake.
So this version is pretty reasonable on current-generation and future Intel CPUs. Alder Lake E-cores run vpermq ymm as 2 uops with 7 cycle latency. But if they can hide that latency with their more limited out-of-order scheduling (big ROB, but queues for each port aren't as long), running vpinsrd as a single uop helps make up the front-end throughput.
256-bit instructions like vpsllw ymm and vpacksswb ymm are also 2 uops each on Alder Lake E-cores, but vpmovmskb eax,ymm is 1 uop (but maybe high-ish latency). So even if we wanted to make a version optimized for Zen1 / Alder E, we probably can't save total uops on them by using more 128-bit instructions after vextracti128; we still need to do stuff to both halves of the input vector.
I had looked at packing into the right order for vpmovmskb xmm to get each 16-bit group in the right order, but separately. I had considered doing this with vperm2i128, but that's quite slow on Zen 1.
// __m256i vcombined = _mm256_permute2x128_si256(v, vright, 0x10); // or something? Takes two shuffles to get them ordered the right way for pack
Zen 1 has very fast vextracti128 - is single-uop for any port, and 128-bit vector ops are 1 uop vs. 2 for __m256i operations. And where we're already doing that extract to get the top and bottom together.
But it still leads to more scalar work, especially if you want the result combined in a vector. 2x vpinsrw or and extra SHL/OR before vmovd is worse.
#if 0
// Zen 1 has slow vperm2i128, but I didn't end up using it even if it's fast
__m128i hi = _mm256_extract_si128(v, 1); // vextracti128 - very cheap on Zen1
__m128i lo = _mm256_castsi256_si128(v); // no cost
__m128i vleft = _mm_packs_epi16(lo, hi); // vpacksswb signed saturation, high bit of each word becomes high bit of byte
// then shift 2 halves separately and pack again?
#endif
Vector packing to set up for vpmovmskb is probably the best bet; before thinking of that, I was looking at using vpmovmskb on the input directly and using scalar bithacks to take odd or even bits:
How to efficiently de-interleave bits (inverse Morton)
How to de-interleave bits (UnMortonizing?)
But those take more operations, so they're slower unless you're bottlenecked on SIMD ALUs specifically, not overall front-end throughput (or execution-port throughput on Intel where SIMD and scalar ALUs share ports).
AVX-512 and/or GFNI
There are two interesting strategies here:
vpmovw2m and/or vptestmw or mb as a more convenient vpmovmskb. Only requires AVX-512BW (Skylake-avx512)
Pack 8 bits to the bottom of each qword, then shuffle. Probably only good with GFNI + AVX512VBMI, like Ice Lake / Zen4 and later. Maybe just GFNI + AVX2 as in crippled Alder Lake (no AVX-512).
Extracting bits to a mask:
With one vptestmb with set1_epi8(0x8001), we can get all the bits we want into one mask, but then we need to deinterleave, probably with scalar pext (which is fast on all AVX-512 CPUs except maybe Knight's Landing, but it doesn't have AVX-512BW).
So probably better to extract two masks and concatenate. Except wait a minute, I don't see a great way to get a 32-bit mask into a vector register (without expanding it to a vector of 0 / -1 elements). For 8 and 16-bit masks, there's mask-to-vector broadcasts like vpbroadcastmw2d x/y/zmm, k. They don't support masking, so you can't merge-mask into another register. That's single-uop on Zen 4, but on Intel it costs 2 uops, same as kmov eax, k / vpbroadcastd x/y/zmm, eax, which is what you should do instead so you can merge-mask into the vector with the top and bottom edges.
vpmovw2m k1, ymm0 # left = 16 mask bits from high bits of 16 elements
vptestmw k2, ymm0, set1_epi16(0x0001) # right. pseudocode constant
kunpckwd k1, k1, k2 # left:right
# there's no vpbroadcastmd2d only byte/word mask to dword or qword element!
mov ecx, 0b0010
kmovb k7, ecx # hoist this constant setup out of loops. If not looping, maybe do something else, like bcast to another register and vpblendd.
kmovd eax, k1
vpbroadcastd xmm0{k7}, eax # put left:right into the 2nd element of XMM0
# leaving other unchanged (merge-masking)
Where xmm0 could have been set by vpermq to have top:bottom in the low 16 bytes; all CPUs with AVX-512 have efficient vpermq. So that's 1 more uop on top of the 5 from my hand-written asm (which should be straightforward to write with intrinsics, I just didn't feel like taking the extra step of looking up the right intrinsics after finding the available asm instructions.)
Packing bits within qwords then shuffling: GFNI and probably AVX-512VBMI for vpermb
(Requiring AVX512VBMI means Ice Lake or Zen 4, so vpermb will be single-uop. Unless some future Intel CPU with an E-core supports a slower AVX-512, but still vpermb ymm hopefully wouldn't be too bad.)
Probably pack in left:right order (1 nibble each), then byte shuffle. If we can do left:right and right:left in alternating bytes, a byte shuffle (like vpermb or vpermt2b) should be able to set up for a vprolw to rotate within each 16-bit word to group 8 "left" bits in the right order.
Moving bits within a qword: Harold's answer on bitpack ascii string into 7-bit binary blob using SIMD shows _mm256_gf2p8affine_epi64_epi8 putting 1 bit from each byte at the top of each qword. (And packing the remaining 7-bit fields, which was the goal in that answer.)
If this is doable, it'll probably be fewer uops and significantly better latency than going to masks and back.
With Alder Lake (GFNI but AVX-512 disabled unless you manage to avoid Intel's efforts to cripple this amazing CPU), this might still be useful, since it has AVX+GFNI for _mm256_gf2p8affine_epi64_epi8. vpshufb + vpermd can substitute for vpermb. But you won't have word rotates; still, shuffling bytes like ABAB will let you use a plain left shift to get the window you wanted, and then shuffle again.

Related

bitpack ascii string into 7-bit binary blob using SIMD

Related: bitpack ascii string into 7-bit binary blob using ARM-v8 Neon SIMD - same question specialized for AArch64 intrinsics. This question covers portable C and x86-64 intrinsics.
I would like to encode a char string as a 7-bit blob to gain a 12.5% reduction in memory.
I want to do it as fast a possible, i.e. with minimal latency when encoding large strings.
Here is the plain implementation of the algo:
void ascii_pack(const char* ascii, size_t len, uint8_t* bin) {
uint64_t val;
const char* end = ascii + len;
while (ascii + 8 <= end) {
memcpy(&val, ascii, 8);
uint64_t dest = (val & 0xFF);
// Compiler will perform loop unrolling
for (unsigned i = 1; i <= 7; ++i) {
val >>= 1;
dest |= (val & (0x7FUL << 7 * i));
}
memcpy(bin, &dest, 7);
bin += 7;
ascii += 8;
}
// epilog - we do not pack since we have less than 8 bytes.
while (ascii < end) {
*bin++ = *ascii++;
}
}
now, I would like to speed it up with SIMD. I came with SSE2 algo below.
My question:
is it possible to optimize the internal loop that is sequential?
will it improve the throughput when running on large strings?
// The algo - do in parallel what ascii_pack does on two uint64_t integers
void ascii_pack_simd(const char* ascii, size_t len, uint8_t* bin) {
__m128i val;
__m128i mask = _mm_set1_epi64x(0x7FU); // two uint64_t masks
// I leave out 16 bytes in addition to 16 that we load in the loop
// because we store into "bin" full 16 bytes instead of 14. To prevent out of bound
// writes we finish one iteration earlier.
const char* end = ascii + len - 32;
while (ascii <= end) {
val = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ascii));
__m128i dest = _mm_and_si128(val, mask);
// Compiler unrolls it
for (unsigned i = 1; i <= 7; ++i) {
val = _mm_srli_epi64(val, 1); // shift right both integers
__m128i shmask = _mm_slli_epi64(mask, 7 * i); // mask both
dest = _mm_or_si128(dest, _mm_and_si128(val, shmask)); // add another 7bit part.
}
// dest contains two 7 byte blobs. Lets copy them to bin.
_mm_storeu_si128(reinterpret_cast<__m128i*>(bin), dest);
memmove(bin + 7, bin + 8, 7);
bin += 14;
ascii += 16;
}
end += 32; // Bring back end.
DCHECK(ascii < end);
ascii_pack(ascii, end - ascii, bin);
}

The scalar trick (without requiring PEXT) which I referred to in the comments could be implemented like this:
uint64_t compress8x7bit(uint64_t x)
{
x = ((x & 0x7F007F007F007F00) >> 1) | (x & 0x007F007F007F007F);
x = ((x & 0x3FFF00003FFF0000) >> 2) | (x & 0x00003FFF00003FFF);
x = ((x & 0x0FFFFFFF00000000) >> 4) | (x & 0x000000000FFFFFFF);
return x;
}
The idea here is to concatenate together adjacent pairs, first concatenate 7-bit elements into 14-bit elements, then concatenate them into 28-bit elements, and finally concatenate them into one 56-bit chunk (which is the result).
With SSSE3, you could use pshufb to concatenate two of those 56-bit parts (before storing them) too.
SSE2 (and AVX2) can do the same thing as that scalar code with 64-bit elements, but this approach does not take advantage of any techniques that may be possible with special operations (which SSE2+ has plenty of, more with every version), there are probably better things to do than just implementing the scalar trick in SIMD.
For example just to throw something wild out there, gf2p8affineqb(0x8040201008040201, x) would put all the "discarded" bits in one place (namely the top byte of the result) and makes a solid 56-bit chunk out of the bits that we want to keep. But the bits do end up in a strange order (the first byte would contain bits 56, 48, 40, 32, 24, 16, 8, 0, in that order, listing the least significant bit first).
That order, strange as it is, can be easily unpacked using pshufb to reverse the bytes (you can also use this to insert the two zeroes) and then gf2p8affineqb(0x0102040810204080, reversedBytes) shuffles the bits back into the original order.
Here's a sketch of how that could work with actual AVX2+GFNI intrinsics. I'm not bothering to handle the extra parts at the end here, just the "main" loop, so the input text had better be a multiple of 32 bytes. Works on my PC ✔️
void compress8x7bit(const char* ascii, size_t len, uint8_t* bin)
{
const char* end = ascii + len;
while (ascii + 31 < end) {
__m256i text = _mm256_loadu_si256((__m256i*)ascii);
__m256i transposed = _mm256_gf2p8affine_epi64_epi8(_mm256_set1_epi64x(0x8040201008040201), text, 0);
__m256i compressed = _mm256_shuffle_epi8(transposed,
_mm256_set_epi8(-1, -1, 14, 13, 12, 11, 10, 9, 8, 6, 5, 4, 3, 2, 1, 0,
-1, -1, 14, 13, 12, 11, 10, 9, 8, 6, 5, 4, 3, 2, 1, 0));
_mm_storeu_si128((__m128i*)bin, _mm256_castsi256_si128(compressed));
_mm_storeu_si128((__m128i*)(bin + 14), _mm256_extracti128_si256(compressed, 1));
bin += 28;
ascii += 32;
}
}
void uncompress8x7bit(char* ascii, size_t len, const uint8_t* bin)
{
const char* end = ascii + len;
while (ascii + 31 < end) {
__m256i raw = _mm256_inserti128_si256(_mm256_castsi128_si256(_mm_loadu_si128((__m128i*)bin)), _mm_loadu_si128((__m128i*)(bin + 14)), 1);
__m256i rev_with_zeroes = _mm256_shuffle_epi8(raw,
_mm256_set_epi8(7, 8, 9, 10, 11, 12, 13, -1, 0, 1, 2, 3, 4, 5, 6, -1,
7, 8, 9, 10, 11, 12, 13, -1, 0, 1, 2, 3, 4, 5, 6, -1));
__m256i decompressed = _mm256_gf2p8affine_epi64_epi8(_mm256_set1_epi64x(0x0102040810204080), rev_with_zeroes, 0);
_mm256_storeu_si256((__m256i*)ascii, decompressed);
bin += 28;
ascii += 32;
}
}
Perhaps there is a nicer solution than using two 128-bit stores in the compressor and two 128-bit loads in the uncompressor. With AVX512 that would be easy since it has full-register byte-granular permutes, but AVX2 has vpshufb, which is not able to move bytes between the two 128-bit halves that make up a 256-bit vector. The uncompressor could do a funny load that starts 2 bytes before the start of the data it wants, like this: _mm256_loadu_si256((__m256i*)(bin - 2)) (and a slightly different shuffle vector), at the cost of having to avoid a potential out-of-bounds error with either padding or a special first iteration, but the compressor cannot (not cheaply) use a trick like that with a store that start 2 bytes earlier (that would destroy two bytes of the result).
By the way I have some test code here that you can use to verify that your bit-compression functions do the right thing (well sort of - as long as the function is a bit-permutation where some of the bits may be zeroed this works as a check, but this would not detect every possible bug in general):
uint64_t bitindex[7];
bitindex[6] = compress8x7bit(0xFFFFFFFFFFFFFFFF);
bitindex[5] = compress8x7bit(0xFFFFFFFF00000000);
bitindex[4] = compress8x7bit(0xFFFF0000FFFF0000);
bitindex[3] = compress8x7bit(0xFF00FF00FF00FF00);
bitindex[2] = compress8x7bit(0xF0F0F0F0F0F0F0F0);
bitindex[1] = compress8x7bit(0xCCCCCCCCCCCCCCCC);
bitindex[0] = compress8x7bit(0xAAAAAAAAAAAAAAAA);
for (size_t i = 0; i < 64; i++)
{
if (i != 0)
std::cout << ", ";
if (bitindex[6] & (1uLL << i))
{
int index = 0;
for (size_t j = 0; j < 6; j++)
{
if (bitindex[j] & (1uLL << i))
index |= 1 << j;
}
std::cout << index;
}
else
std::cout << "_";
}
std::cout << "\n";

You can improve the solution by #harold, if you replace the first two mask and shift steps by a vpmaddubsw and vpmaddwd (each using 1 instead of 4 uops) and the next step can be replaced by shifting every other 32bit element 4 to the left and afterwords shifting all 64bit elements 4 to the right. Of course, by using AVX2 instead of SSE, you can again double the throughput.
The final step of joining the lower and upper lane is likely most efficiently done by two separate stores which extract each lane directly to memory.
void ascii_pack32(char const* ascii, char* bin)
{
const __m256i control = _mm256_set_epi8(-1, -1, 14, 13, 12, 11, 10, 9, 8, 6, 5, 4, 3, 2, 1, 0,
-1, -1, 14, 13, 12, 11, 10, 9, 8, 6, 5, 4, 3, 2, 1, 0);
__m256i input = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(ascii));
// only necessary if high bits of input might be set:
input = _mm256_and_si256(input, _mm256_set1_epi8(0x7f));
__m256i t1 = _mm256_maddubs_epi16(_mm256_set1_epi16(0x8001), input);
__m256i t2 = _mm256_madd_epi16(_mm256_set1_epi32(0x40000001), t1);
__m256i t3 = _mm256_srli_epi64(_mm256_sllv_epi32(t2, _mm256_set1_epi64x(4)), 4);
__m256i val = _mm256_shuffle_epi8(t3, control);
_mm_storeu_si128(reinterpret_cast<__m128i*>(bin), _mm256_castsi256_si128(val));
_mm_storeu_si128(reinterpret_cast<__m128i*>(bin+14), _mm256_extracti128_si256(val, 1));
}
Godbolt link with short testcode:
https://godbolt.org/z/hs7477h5W

SIMD unpack can benefit from blend instructions instead of and/andn/or because we can blend at dword / word / byte boundaries. We only need to AND once at the end to clear the high bit of each byte.
#include <immintrin.h>
static inline
__m128i ascii_unpack7x8_sse4(__m128i v)
{
__m128i separate_7B_halves = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, -1,
7, 8, 9,10,11,12,13, -1);
v = _mm_shuffle_epi8(v, separate_7B_halves);
// separate each u64 qword into 2 u32 halves, with the good bits at the bottom
__m128i shifted = _mm_slli_epi64(v, 4);
#ifdef __AVX2__
v = _mm_blend_epi32(v, shifted, 0b1010); // vpblendd is very efficient, 1 uop any port
#else
v = _mm_castps_si128(_mm_blend_ps( // blendps has extra bypass latency between integer insns, but is single-uop
_mm_castsi128_ps(v), _mm_castsi128_ps(shifted), 0b1010) );
#endif
// Separate each u32 into u16
shifted = _mm_slli_epi32(v, 2);
v = _mm_blend_epi16(v, shifted, 0b10101010); // high halves of pairs from shifted
// Separate each u16 into bytes, with one of two strategies
#if 0 // this strategy is simpler but worse
// shifted = _mm_add_epi16(v, v); // v<<1
// v = _mm_blendv_epi8(v, shifted, _mm_set1_epi16(0xff00));
// v = _mm_and_si128(v, _mm_set1_epi8(0x7f)); // clear garbage from high bits
#else
__m128i hi = _mm_and_si128(v, _mm_set1_epi16(0x3f80)); // isolate hi half
v = _mm_and_si128(v, _mm_set1_epi16(0x007f)); // clear high garbage
v = _mm_add_epi16(v, hi); // high halves left 1 (x+=x), low halves stay (x+=0)
// both ways need two vector constants and 3 instructions, but pblendvb can be slower and has an awkward requirement of having the control vector in XMM0
#endif
return v;
}
With AVX2 available, clang compiles it to this nice asm. Godbolt
# clang -O3 -march=x86-64-v3 (implies AVX2+BMI2, basically Haswell with generic tuning)
ascii_unpack7x8_sse4(long long __vector(2)):
vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = xmm0[0,1,2,3,4,5,6],zero,xmm0[7,8,9,10,11,12,13],zero
vpsllq xmm1, xmm0, 4
vpblendd xmm0, xmm0, xmm1, 10 # xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
vpslld xmm1, xmm0, 2
vpblendw xmm0, xmm0, xmm1, 170 # xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3],xmm0[4],xmm1[5],xmm0[6],xmm1[7]
vpand xmm1, xmm0, xmmword ptr [rip + .LCPI0_1] # in a loop, these constants would be in registers
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
vpaddw xmm0, xmm0, xmm1
ret
With just SSE4.1, compilers need several movdqa instructions, as in GCC's output. And out-of-order exec will have an extra 1 or 2 cycles of latency to hide because of bypass-forwarding delays for integer shifts forwarding to an FP blendps, on Intel CPUs. (https://agner.org/optimize/). But that's fine, we're doing this in a loop over an array, modern CPUs have deep enough out-of-order exec.
# gcc -O3 -march=x86-64-v2 # SSE4.2, Nehalem. Actually only using SSE4.1
ascii_unpack7x8_sse4(long long __vector(2)):
movdqa xmm1, xmm0 # silly compiler wastes a MOV
pshufb xmm1, XMMWORD PTR .LC0[rip]
movdqa xmm0, xmm1 # save unshifted v
psllq xmm0, 4
blendps xmm1, xmm0, 10 # 0b1010 = 0xA
movdqa xmm0, xmm1
pslld xmm0, 2
pblendw xmm1, xmm0, 170 # 0b10101010 = 0xAA
movdqa xmm0, XMMWORD PTR .LC1[rip] # after inlining, probably a reg-copy
pand xmm0, xmm1 # and two PAND xmm,xmm
pand xmm1, XMMWORD PTR .LC2[rip]
paddw xmm0, xmm1
ret
If AVX2 is available, an __m256i version of this is straightforward and wouldn't need the blendps fallback. That may be better than scalar pdep (BMI2). AVX2 vpsrlvd or q (per-element shift counts) seem like they should help, but we find ourselves needing to move bits across dword boundaries, and it can only be left or right, not alternating directions. (AVX512 has variable-count rotates (32 and 64-bit), and 16-bit variable-count shifts. Rotates let you go right or left with the same instruction.)
The shift element size could be 64 each time; our blends drop bits that would get shifted into the low element of a pair. For the final step, paddw is 1 byte smaller than psllw/d/q because it has no immediate operand. And can run on more ports on most CPUs. Especially Haswell, where shifts can only run on port 0, but paddw can run on port 1 or 5. (This code has no instruction-level parallelism within one iteration, so we rely on out-of-order exec to overlap execution of multiple iterations.)
Skylake through Alder Lake run SIMD shifts on p01, SIMD inter adds on p015, blendps on p015, pblendw on p5 (p15 for Alder Lake), pblendvb as 1 uop for p015. (Only the non-AVX encoding; vpblendvb is 2 uops for p015). Zen 3 for example has plenty of throughput for all of these.
The final step avoiding _mm_blendv_epi8 has several advantages:
Both ways need two vector constants and 3 instructions. (And no difference in the minimum number of movdqa register-copies a compiler has to invent without non-destructive AVX instructions.)
The AND/AND/ADD version has better ILP; two ANDs in parallel.
SSE4.1 pblendvb can be slower (e.g. Haswell runs it as 2 uops for port 5) and has an awkward requirement of having the control vector in XMM0. Some compilers may waste instructions with hard-reg constraints. (Maybe even when inlining into a loop, unlike when we look at how this helper function would compile on its own.)
vpblendvb (the AVX encoding of it) is 2 uops (for any port) on newer Intel, or 3 on Alder Lake, presumably as the price for having 4 operands (3 inputs and a separate output). Also the AVX version is slow on Alder Lake E-cores (4 uops, 3.23 cycle throughput) https://uops.info/.
AMD CPUs don't have this problem; for example Zen 3 runs vpblendvb as 1 uop for either of two ports.
The only possible upside to the blend version is that the constants are easier to construct on the fly. GCC12 has started preferring to construct some constants on the fly when AVX is available, but does a rather bad job of it, using 10-byte mov r64, imm64 / vmovq / vpunpcklqdq instead of 5-byte mov reg, imm32 / ... / vpbroadcastd or pshufd v,v,0. Or instead of starting with an all-ones vector and shifting.
Actually, the constants for the non-blend way can be generated from an all-ones vector with psrlw xmm, 9 to get 0x007f, and then left shifting that 7-bit mask left by 7. So with AVX, 3 total instructions for both masks, without memory access. Unfortunately compilers don't know how to do this optimization so it's a moot point.
AVX-512F / BW, without AVX-512VBMI / AVX-512GFNI
If you have Ice Lake / Zen4 features, you want #Harold's answer; as I commented there, it's slightly better than AVX-512 vpmultishiftqb (parallel bitfield-extract within a qword).
But if not, with Skylake-X / Cascade Lake features (AVX-512BW and F) you have have masking and variable-count rotates. This saves 2 instructions vs. the SSE4 version (built with AVX2); it feels like there should be room to save more, especially at the final step within 16-bit elements. But masking has byte granularity, and there is no vprolvw, and still no byte shift, unlike AArch64 which can shift elements in 2 directions at byte granularity.
Splitting things apart and doing different things, then merging with a merge-masking vmovdqa could work, but I don't think would help.
#ifdef __AVX512BW__
// pre-Ice Lake, without AVX-512VBMI or AVX512-GFNI
__m128i ascii_unpack7x8_avx512bw(__m128i v)
{
// for YMM or ZMM, use VPERMW, or VPERMB if we have AVX512VBMI since unfortunately VPERMW isn't single-uop on Intel CPUs that support both.
__m128i separate_7B_halves = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, -1,
7, 8, 9,10,11,12,13, -1);
v = _mm_shuffle_epi8(v, separate_7B_halves);
v = _mm_slli_epi64(v, 4); // 00000HGFE | DCBA0000 // dword halves
v = _mm_rolv_epi32(v, _mm_set_epi32(2, 32-2, 2, 32-2));
// 00HG|FE00 | 00DC|BA00 // u16 chunks of a u64
v = _mm_mask_srli_epi16(v, (__mmask8)0b0101'0101, v, 2); // 00HG | 00FE | 00DC | 00BA
// Separate each u16 into bytes
__m128i hi = _mm_and_si128(v, _mm_set1_epi16(0x3f80)); // isolate hi half
v = _mm_add_epi16(v, hi); // high halves left 1 (x+=x), low halves stay (x+=0)
// 0H0G | 0F0E | 0D0C | 0B0A in each qword.
return v;
}
#endif
Clang (Godbolt) optimizes the masked right-shift to a variable-count right shift, which is a good idea for a stand-alone function not in a loop especially when we're loading other constants.
This uses more non-immediate constants, but fewer uops. A wider version of this using vpermw to unpack 14-byte chunks to 16-byte lanes might have to do something to introduce zero bits where they're needed, perhaps using zero-masking on the shuffle. But I think we'd still need vpshufb within lanes, so it can zero those high bits.
Having those known zeros that we move around with shifts and rotates is what lets us only use one and and add at the end, unlike the blending version where elements end up with high garbage so we need to mask both ways.
# clang -O3 -march=x86-64-v4
ascii_unpack7x8_avx512bw(long long __vector(2)):
vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI1_0] # xmm0 = xmm0[0,1,2,3,4,5,6],zero,xmm0[7,8,9,10,11,12,13],zero
vpsllq xmm0, xmm0, 4
vprolvd xmm0, xmm0, xmmword ptr [rip + .LCPI1_1]
vpsrlvw xmm0, xmm0, xmmword ptr [rip + .LCPI1_2]
vpand xmm1, xmm0, xmmword ptr [rip + .LCPI1_3]
vpaddw xmm0, xmm1, xmm0
ret
These constants would of course be loaded into registers.
Just 6 uops; shifts run on port 0 or 1, shuffles on port 5, on Skylake, with VPAND and VPADD able to run on any of the 3 vector ALU ports. So it's a good balance, not running into back-end throughput bottlenecks on a specific port. (vs. 8 uops with clang's AVX build of the SSE4 version)
GCC using masking as requested, again the constant init will get hoisted out of loops, including k1.
# gcc -O3 -march=x86-64-v4
ascii_unpack7x8_avx512bw(long long __vector(2)):
vpshufb xmm0, xmm0, XMMWORD PTR .LC0[rip]
mov eax, 85 # 0x55
vpsllq xmm0, xmm0, 4
kmovb k1, eax
movabs rax, 4575727041462157184 # 0x3F803F803F803F80 silly to use a 64-bit immediate
vprolvd xmm0, xmm0, XMMWORD PTR .LC3[rip]
vpbroadcastq xmm1, rax
vpsrlw xmm0{k1}, xmm0, 2
vpand xmm1, xmm0, xmm1
vpaddw xmm0, xmm0, xmm1
ret
Same instructions doing the work, just setting up constants differently. (Except for vpsrlw xmm0{k1}, xmm0, 2 to shift some elements but not others.)

Backporting my arm64 answer to SSE2, we can simulate variadic shifts by mullo_epu16 and mulhi_epu16; first pack adjacent 7+7-bit values as consecutive:
// 0b'0aaaaaaa'0bbbbbbb + 0bbbbbbb = 0b'0aaaaaaa'bbbbbbb0
a0 = _mm_add_epi16(a, _mm_and_epi16(a, _mm_set1_epi16(0x7f)));
a0 = 0aaaaaaabbbbbbb0'0cccccccddddddd0'0eeeeeeefffffff0'0ggggggghhhhhhh0
a1 = 00000000aaaaaaab'000000cccccccddd'0000eeeeeeefffff'00ggggggghhhhhhh
a2 = bbbbbb0000000000'dddd000000000000'ff00000000000000'0000000000000000
a3 = 0000000000000000'bbbbbb0000000000'dddd000000000000'ff00000000000000
a1 = _mm_mulhi_epu16(a0, kShift); // 1 << {9,11,13,15}
a2 = _mm_mullo_epu16(a0, kShift); // 1 << {9,11,13,15}
a3 = _mm_bsrli_si128(a2, 2);
return _mm_or_si128(a1,a3);

Fastest way to find 16bit match in a 4 element short array?

I may confirm by using nanobench. Today I don't feel clever and can't think of an easy way
I have a array, short arr[]={0x1234, 0x5432, 0x9090, 0xFEED};. I know I can use SIMD to compare all elements at once, using movemask+tzcnt to find the index of a match. However since it's only 64 bits I was wondering if there's a faster way?
First I thought maybe I can build a 64-bit int by writing target|(target<<16)|(target<<32)|(target<<48) but then realized both an AND and SUB isn't the same as a compare since the low 16 can affect the higher 16. Then I thought instead of a plain loop I can write index=tzcnt((target==arr[0]?1:0)... | target==arr[3]?8:0
Can anyone think of something more clever? I suspect using the ternary method would give me best results since it's branchless?

For SWAR compare-for-equality, the operation you want is XOR, which like SUB produces all-zero on equal inputs, but unlike SUB doesn't propagate carry sideways.
But then you need to detect a contiguous 16 0 bits. Unlike pcmpeqw, you'll have some zero bits in the other elements.
So it's probably about the same as https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord but with wider mask patterns to operate on 16-bit instead of 8-bit chunks.
There is yet a faster method — use hasless(v, 1), which is defined below; it works in 4 operations and requires no subsquent verification. It simplifies to
#define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
The subexpression (v - 0x01010101UL), evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80). Finally, by ANDing these two sub-expressions the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second.
This bithack was originally by Alan Mycroft in 1987.
So it could look like this (untested):
#include <stdint.h>
#include <string.h>
// returns 0 / non-zero status.
uint64_t hasmatch_16in64(uint16_t needle, const uint16_t haystack[4])
{
uint64_t vneedle = 0x0001000100010001ULL * needle; // broadcast
uint64_t vbuf;
memcpy(&vbuf, haystack, sizeof(vbuf)); // aliasing-safe unaligned load
//static_assert(sizeof(vbuf) == 4*sizeof(haystack[0]));
uint64_t match = vbuf ^ vneedle;
uint64_t any_zeros = (match - 0x0001000100010001ULL) & ~match & 0x8000800080008000ULL;
return any_zeros;
// unsigned matchpos = _tzcnt_u32(any_zeros) >> 4; // I think.
}
Godbolt with GCC and clang, also including a SIMD intrinsics version.
# gcc12.2 -O3 -march=x86-64-v3 -mtune=znver1
# x86-64-v3 is the Haswell/Zen1 baseline: AVX2+FMA+BMI2, but with tune=generic
# without tune=haswell or whatever, GCC uses shl/add /shl/add instead of imul, despite still needing the same constant
hasmatch_16in64:
movabs rax, 281479271743489 # 0x1000100010001
movzx edi, di # zero-extend to 64-bit
imul rdi, rax # vneedle
xor rdi, QWORD PTR [rsi] # match
# then the bithack
mov rdx, rdi
sub rdx, rax
andn rax, rdi, rdx # BMI1
movabs rdx, -9223231297218904064 # 0x8000800080008000
and rax, rdx
ret
Clang unfortunately adds 0xFFFEFFFEFFFEFFFF instead of reusing the multiplier constant, so it has three 64-bit immediate constants.
AArch64 can do repeating-pattern constants like this as immediates for bitwise ops, and doesn't have as convenient SIMD movemask, so this might be more of a win there, especially if you can guarantee alignment of your array of shorts.
Match position
If you need to know where the match is, I think that bithack has a 1 in the high bit of each zero byte or u16, and nowhere else. (The lowest-precendence / last operations are bitwise AND involving 0x80008000...).
So maybe tzcnt(any_zeros) >> 4 to go from bit-index to u16-index, rounding down. e.g. if the second one is zero, the tzcnt result will be 31. 31 >> 4 = 1.
If that doesn't work, then yeah AVX2 or AVX-512 vpbroadcastw xmm0, edi / vmovq / vpcmeqw / vpmovmskb / tzcnt will work well, too, with smaller code-size and fewer uops, but maybe higher latency. Or maybe less. (To get a byte offset, right shift if you need an index of which short.)
Actually just SSE2 pshuflw can broadcast a word to the low qword of an XMM register. Same for MMX, which would actually allow a memory-source pcmpeqw mm0, [rsi] since it has no alignment requirement and is only 64-bit, not 128.
If you can use SIMD intrinsics, especially if you have efficient word broadcast from AVX2, definitely have a look at it.
#include <immintrin.h>
// note the unsigned function arg, not uint16_t;
// we only use the low 16, but GCC doesn't realize that and wastes an instruction in the non-AVX2 version
int hasmatch_SIMD(unsigned needle, const uint16_t haystack[4])
{
#ifdef __AVX2__ // or higher
__m128i vneedle = _mm_set1_epi16(needle);
#else
__m128i vneedle = _mm_cvtsi32_si128(needle); // movd
vneedle = _mm_shufflelo_epi16(vneedle, 0); // broadcast to low half
#endif
__m128i vbuf = _mm_loadl_epi64((void*)haystack); // alignment and aliasing safe
unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi16(vneedle, vbuf));
//return _tzcnt_u32(mask) >> 1;
return mask;
}
# clang expects narrow integer args to already be zero- or sign-extended to 32
hasmatch_SIMD:
movd xmm0, edi
pshuflw xmm0, xmm0, 0 # xmm0 = xmm0[0,0,0,0,4,5,6,7]
movq xmm1, qword ptr [rsi] # xmm1 = mem[0],zero
pcmpeqw xmm1, xmm0
pmovmskb eax, xmm1
ret
AXV-512 gives us vpbroadcastw xmm0, edi, replacing vmovd + vpbroadcastw xmm,xmm or movd + pshuflw, saving a shuffle uop.
With AVX2, this is 5 single-uop instructions, vs. 7 (or 9 counting the constants) for the SWAR bithack. Or 6 or 8 not counting the zero-extension of the "needle". So SIMD is better for front-end throughput. (https://agner.org/optimize/ / https://uops.info/)
There are limits to which ports some of these instructions can run on (vs. the bithack instructions mostly being any integer ALU port), but presumably you're not doing this in a loop over many such 4-element arrays. Or else SIMD is an obvious win; checking two 4-element arrays at once in the low and high halves of a __m128i. So probably we do need to consider the front-end costs of setting up those constants.
I didn't add up the latencies; it's probably a bit higher even on Intel CPUs which generally have good latency between integer and SIMD units.
GCC unfortunately fails to optimize away the movzx edi, di from the SIMD version if compiled without AVX2; only clang realizes the upper 16 of _mm_cvtsi32_si128(needle) is discarded by the later shuffle. Maybe better to make the function arg unsigned, not explicitly a narrow 16-bit type.

Clang with -O2 or -O3 and GCC with -O3 compile a simple search loop into branchless instructions:
int indexOf(short target, short* arr) {
int index = -1;
for (int i = 0; i < 4; ++i) {
if (target == arr[i]) {
index = i;
}
}
return index;
}
Demo
I doubt you can get much better without SIMD. In other words, write simple and understandable code to help the compiler produce efficient code.
Side note: for some reason, neither Clang nor GCC use conditional moves on this very similar code:
int indexOf(short target, short* arr) {
for (int i = 0; i < 4; ++i) {
if (target == arr[i]) {
return i;
}
}
return -1;
}

Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2

I am looking for an optimal method to calculate sum of all packed 32-bit integers in a __m256i or __m512i. To calculate sum of n elements, I ofter use log2(n) vpaddd and vpermd function, then extract the final result. Howerver, it is not the best option I think.
Edit: best/optimal in term of speed/cycle reduction.

Related: if you're looking for the non-existant _mm512_reduce_add_epu8, see Summing 8-bit integers in __m512i with AVX intrinsics vpsadbw as an hsum within qwords is much more efficient than shuffling.
Without AVX512, see hsum_8x32(__m256i) below for AVX2 without Intel's reduce_add helper function. reduce_add doesn't necessarily compile optimally anyway with AVX512.
There is a int _mm512_reduce_add_epi32(__m512i) inline function in immintrin.h. You might as well use it. (It compiles to shuffle and add instructions, but more efficient ones than vpermd, like I describe below.) AVX512 didn't introduce any new hardware support for horizontal sums, just this new helper function. It's still something to avoid or sink out of loops whenever possible.
GCC 9.2 -O3 -march=skylake-avx512 compiles a wrapper that calls it as follows:
vextracti64x4 ymm1, zmm0, 0x1
vpaddd ymm1, ymm1, ymm0
vextracti64x2 xmm0, ymm1, 0x1 # silly compiler, vextracti128 would be shorter
vpaddd xmm1, xmm0, xmm1
vpshufd xmm0, xmm1, 78
vpaddd xmm0, xmm0, xmm1
vmovd edx, xmm0
vpextrd eax, xmm0, 1 # 2x xmm->integer to feed scalar add.
add eax, edx
ret
Extracting twice to feed scalar add is questionable; it needs uops for p0 and p5 so it's equivalent to a regular shuffle + a movd.
Clang doesn't do that; it does one more step of shuffle / SIMD add to reduce down to a single scalar for vmovd. See below for perf analysis of the two.
There is a VPHADDD but you should never use it with both inputs the same. (Unless you're optimizing for code-size over speed). It can be useful to transpose-and-sum multiple vectors, resulting in some vectors of results. You do that by feeding phadd with 2 different inputs. (Except it gets messy with 256 and 512-bit because vphadd is still only in-lane.)
Yes, you need log2(vector_width) shuffles and vpaddd instructions. (So this isn't very efficient; avoid horizontal sums inside inner loops. Accumulate vertically until the end of a loop, for example).
General strategy for all SSE / AVX / AVX512
You want to successively narrow from 512 -> 256, then 256 -> 128, then shuffle within __m128i until you're down to one scalar element. Presumably some future AMD CPU will decode 512-bit instructions to two 256-bit uops, so reducing width is a big win there. And narrower instructions presumably cost slightly less power.
Your shuffles can take immediate control operands, not vectors for vpermd. e.g. VEXTRACTI32x8, vextracti128, and vpshufd. (Or vpunpckhqdq to save code size for the immediate constant.)
See Fastest way to do horizontal SSE vector sum (or other reduction) (my answer also includes some integer versions).
This general strategy is appropriate for all element types: float, double, and any size integer
Special cases:
8-bit integer: start with vpsadbw, more efficient and avoids overflow, but then continue as for 64-bit integers.
16-bit integer: start by widening to 32 with pmaddwd (_mm256_madd_epi16 with set1_epi16(1)) : SIMD: Accumulate Adjacent Pairs - fewer uops even if you don't care about the avoiding-overflow benefit, except on AMD before Zen2 where 256-bit instructions cost at least 2 uops. But then you continue as for 32-bit integer.
32-bit integer can be done manually like this, with an SSE2 function called by the AVX2 function after reducing to __m128i, in turn called by the AVX512 function after reducing to __m256i. The calls will of course inline in practice.
#include <immintrin.h>
#include <stdint.h>
// from my earlier answer, with tuning for non-AVX CPUs removed
// static inline
uint32_t hsum_epi32_avx(__m128i x)
{
__m128i hi64 = _mm_unpackhi_epi64(x, x); // 3-operand non-destructive AVX lets us save a byte without needing a movdqa
__m128i sum64 = _mm_add_epi32(hi64, x);
__m128i hi32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1)); // Swap the low two elements
__m128i sum32 = _mm_add_epi32(sum64, hi32);
return _mm_cvtsi128_si32(sum32); // movd
}
// only needs AVX2
uint32_t hsum_8x32(__m256i v)
{
__m128i sum128 = _mm_add_epi32(
_mm256_castsi256_si128(v),
_mm256_extracti128_si256(v, 1)); // silly GCC uses a longer AXV512VL instruction if AVX512 is enabled :/
return hsum_epi32_avx(sum128);
}
// AVX512
uint32_t hsum_16x32(__m512i v)
{
__m256i sum256 = _mm256_add_epi32(
_mm512_castsi512_si256(v), // low half
_mm512_extracti64x4_epi64(v, 1)); // high half. AVX512F. 32x8 version is AVX512DQ
return hsum_8x32(sum256);
}
Notice that this uses __m256i hsum as a building block for __m512i; there's nothing to be gained by doing in-lane operations first.
Well possibly a very tiny advantage: in-lane shuffles have lower latency than lane-crossing, so they could execute 2 cycles earlier and leave the RS earlier, and similarly retire from the ROB slightly earlier. But the higher-latency shuffles are coming just a couple instructions later even if you did that. So you might get a handful of some independent instructions into the back-end 2 cycles earlier if this hsum was on the critical path (blocking retirement).
But reducing to a narrower vector width sooner is generally good, maybe getting 512-bit uops out of the system sooner so the CPU can re-activate the SIMD execution units on port 1, if you aren't doing more 512-bit work right away.
Compiles on Godbolt to these instructions, with GCC9.2 -O3 -march=skylake-avx512
hsum_16x32(long long __vector(8)):
vextracti64x4 ymm1, zmm0, 0x1
vpaddd ymm0, ymm1, ymm0
vextracti64x2 xmm1, ymm0, 0x1 # silly compiler uses a longer EVEX instruction when its available (AVX512VL)
vpaddd xmm0, xmm0, xmm1
vpunpckhqdq xmm1, xmm0, xmm0
vpaddd xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 177
vpaddd xmm0, xmm1, xmm0
vmovd eax, xmm0
ret
P.S.: perf analysis of GCC's _mm512_reduce_add_epi32 vs. clang's (which is equivalent to my version), using data from https://uops.info/ and/or Agner Fog's instruction tables:
After inlining into a caller that does something with the result, it could allow optimizations like adding a constant as well using lea eax, [rax + rdx + 123] or something.
But other than that it seems almost always worse than the the shuffle / vpadd / vmovd at the end of my implementation, on Skylake-X:
total uops: reduce: 4. Mine: 3
ports: reduce: 2p0, p5 (part of vpextrd), p0156 (scalar add)
ports: mine: p5, p015 (vpadd on SKX), p0 (vmod)
Latency is equal at 4 cycles, assuming no resource conflicts:
shuffle 1 cycle -> SIMD add 1 cycle -> vmovd 2 cycles
vpextrd 3 cycles (in parallel with 2 cycle vmovd) -> add 1 cycle.

Encoding 3 base-6 digits in 8 bits for unpacking performance

I'm looking for an efficient-to-unpack (in terms of small number of basic ALU ops in the generated code) way of encoding 3 base-6 digits (i.e. 3 numbers in the range [0,5]) in 8 bits. Only one is needed at a time, so approaches that need to decode all three in order to access one are probably not good unless the cost of decoding all three is very low.
The obvious method is of course:
x = b%6; // 8 insns
y = b/6%6; // 13 insns
z = b/36; // 5 insns
The instruction counts are measured on x86_64 with gcc>=4.8 which knows how to avoid divs.
Another method (using a different encoding) is:
b *= 6
x = b>>8;
b &= 255;
b *= 6
y = b>>8;
b &= 255;
b *= 6
z = b>>8;
This encoding has more than one representation for many tuples (it uses the whole 8bit range rather than just [0,215]) and appears more efficient if you want all 3 outputs, but wasteful if you only want one.
Are there better approaches?
Target language is C but I've tagged this assembly as well since answering requires some consideration of the instructions that would be generated.

As discussed in comments, a LUT would be excellent if it stays hot in cache. uint8_t LUT[3][256] would need the selector scaled by 256, which takes an extra instruction if it's not a compile-time constant. Scaling by 216 to pack the LUT better is only 1 or 2 instructions more expensive. struct3 LUT[216] is nice, where the struct has a 3-byte array member. On x86, this compiles extremely well in position-dependent code where the LUT base can be a 32-bit absolute as part of the addressing mode (if the table is static):
struct { uint8_t vals[3]; } LUT[216];
unsigned decode_LUT(uint8_t b, unsigned selector) {
return LUT[b].vals[selector];
}
gcc7 -O3 on Godbolt for x86-64 and AArch64
movzx edi, dil
mov esi, esi # zero-extension to 64-bit: goes away when inlining.
lea rax, LUT[rdi+rdi*2] # multiply by 3 and add the base
movzx eax, BYTE PTR [rax+rsi] # then index by selector
ret
Silly gcc used a 3-component LEA (3 cycle latency and runs on fewer ports) instead of using LUT as a disp32 for the actual load (no extra latency for an indexed addressing mode, I think).
This layout has the added advantage of locality if you ever need to decode multiple components of the same byte.
In PIC / PIE code, this costs 2 extra instructions, unfortunately:
movzx edi, dil
lea rax, LUT[rip] # RIP-relative LEA instead of absolute as part of another addressing mode
mov esi, esi
lea rdx, [rdi+rdi*2]
add rax, rdx
movzx eax, BYTE PTR [rax+rsi]
ret
But that's still cheap, and all the ALU instructions are single-cycle latency.
Your 2nd ALU unpacking strategy is promising. I thought at first we could use a single 64-bit multiply to get b*6, b*6*6, and b*6*6*6 in different positions of the same 64-bit integer. (b * ((6ULL*6*6<<32) + (36<<16) + 6)
But the upper byte of each multiply result does depend on masking back to 8-bit after each multiply by 6. (If you can think of a way to not require that, one multiple and shift would be very cheap, especially on 64-bit ISAs where the entire 64-bit multiply result is in one register).
Still, x86 and ARM can multiply by 6 and mask in 3 cycles of latency, the same or better latency than a multiply, or less on Intel CPUs with zero-latency movzx r32, r8, if the compiler avoids using parts of the same register for movzx.
add eax, eax ; *2
lea eax, [rax + rax*2] ; *3
movzx ecx, al ; 0 cycle latency on Intel
.. repeat for next steps
ARM / AArch64 is similarly good, with add r0, r0, r0 lsl #1 for multiply by 3.
As a branchless way to select one of the three, you could consider storing (from ah / ch / ... to get the shift for free) to an array, then loading with the selector as the index. This costs store/reload latency (~5 cycles), but is cheap for throughput and avoids branch misses. (Possibly a 16-bit store and then a byte reload would be good, scaling the selector in the load address and adding 1 to get the high byte, saving an extract instruction before each store on ARM).
This is in fact what gcc emits if you write it this way:
unsigned decode_ALU(uint8_t b, unsigned selector) {
uint8_t decoded[3];
uint32_t tmp = b * 6;
decoded[0] = tmp >> 8;
tmp = 6 * (uint8_t)tmp;
decoded[1] = tmp >> 8;
tmp = 6 * (uint8_t)tmp;
decoded[2] = tmp >> 8;
return decoded[selector];
}
movzx edi, dil
mov esi, esi
lea eax, [rdi+rdi*2]
add eax, eax
mov BYTE PTR -3[rsp], ah # store high half of mul-by-6
movzx eax, al # costs 1 cycle: gcc doesn't know about zero-latency movzx?
lea eax, [rax+rax*2]
add eax, eax
mov BYTE PTR -2[rsp], ah
movzx eax, al
lea eax, [rax+rax*2]
shr eax, 7
mov BYTE PTR -1[rsp], al
movzx eax, BYTE PTR -3[rsp+rsi]
ret
The first store's data is ready 4 cycles after the input to the first movzx, or 5 if you include the extra 1c of latency for reading ah when it's not renamed separately on Intel HSW/SKL. The next 2 stores are 3 cycles apart.
So the total latency is ~10 cycles from b input to result output, if selector=0. Otherwise 13 or 16 cycles.

Measuring a number of different approaches in-place in the function that needs to do this, the practical answer is really boring: it doesn't matter. They're all running at about 50ns per call, and other work is dominating. So for my purposes, the approach that pollutes the cache and branch predictors the least is probably the best. That seems to be:
(b * (int[]){2048,342,57}[i] >> 11) % 6;
where b is the byte containing the packed values and i is the index of the value wanted. The magic constants 342 and 57 are just the multiplicative constants GCC generates for division by 6 and 36, respectively, scaled to a common shift of 11. The final %6 is spurious in the /36 case (i==2) but branching to avoid it does not seem worthwhile.
On the other hand, if doing this same work in a context where there wasn't an interface constraint to have the surrounding function call overhead per lookup, I think an approach like Peter's would be preferable.

Demultiplex an AVX register into four registers each containing identical values [duplicate]

In a piece of C++ code that does something similar to (but not exactly) matrix multiplication, I load 4 contiguous doubles into 4 YMM registers like this:
# a is a 64-byte aligned array of double
__m256d b0 = _mm256_broadcast_sd(&b[4*k+0]);
__m256d b1 = _mm256_broadcast_sd(&b[4*k+1]);
__m256d b2 = _mm256_broadcast_sd(&b[4*k+2]);
__m256d b3 = _mm256_broadcast_sd(&b[4*k+3]);
I compiled the code with gcc-4.8.2 on a Sandy Bridge machine. Hardware event counters (Intel PMU) suggests that the CPU actually issue 4 separate load from the L1 cache. Although at this point I'm not limited by L1 latency or bandwidth, I'm very interested to know if there is a way to load the 4 doubles with one 256-bit load (or two 128-bit loads) and shuffle them into 4 YMM registers. I looked through the Intel Intrinsics Guide but couldn't find a way to accomplish the shuffling required. Is that possible?
(If the premise that the CPU doesn't combine the 4 consecutive loads is actually wrong, please let me know.)

TL;DR: It's almost always best to just do four broadcast-loads using _mm256_set1_pd(). This very good on Haswell and later, where vbroadcastsd ymm,[mem] doesn't require an ALU shuffle operation, and usually also the best option for Sandybridge/Ivybridge (where it's a 2-uop load + shuffle instruction).
It also means you don't need to care about alignment at all, beyond natural alignment for a double.
The first vector is ready sooner than if you did a two-step load + shuffle, so out-of-order execution can potentially get started on the code using these vectors while the first one is still loading. AVX512 can even fold broadcast-loads into memory operands for ALU instructions, so doing it this way will allow a recompile to take slight advantage of AVX512 with 256b vectors.
(It's usually best to use set1(x), not _mm256_broadcast_sd(&x); If the AVX2-only register-source form of vbroadcastsd isn't available, the compiler can choose to store -> broadcast-load or to do two shuffles. You never know when inlining will mean your code will run on inputs that are already in registers.)
If you're really bottlenecked on load-port resource-conflicts or throughput, not total uops or ALU / shuffle resources, it might help to replace a pair of 64->256b broadcasts with a 16B->32B broadcast-load (vbroadcastf128/_mm256_broadcast_pd) and two in-lane shuffles (vpermilpd or vunpckl/hpd (_mm256_shuffle_pd)).
Or with AVX2: load 32B and use 4 _mm256_permute4x64_pd shuffles to broadcast each element into a separate vector.
Source Agner Fog's insn tables (and microarch pdf):
Intel Haswell and later:
vbroadcastsd ymm,[mem] and other broadcast-load insns are 1uop instructions that are handled entirely by a load port (the broadcast happens "for free").
The total cost of doing four broadcast-loads this way is 4 instructions. fused-domain: 4uops. unfused-domain: 4 uops for p2/p3. Throughput: two vectors per cycle.
Haswell only has one shuffle unit, on port5. Doing all your broadcast-loads with load+shuffle will bottleneck on p5.
Maximum broadcast throughput is probably with a mix of vbroadcastsd ymm,m64 and shuffles:
## Haswell maximum broadcast throughput with AVX1
vbroadcastsd ymm0, [rsi]
vbroadcastsd ymm1, [rsi+8]
vbroadcastf128 ymm2, [rsi+16] # p23 only on Haswell, also p5 on SnB/IvB
vunpckhpd ymm3, ymm2,ymm2
vunpcklpd ymm2, ymm2,ymm2
vbroadcastsd ymm4, [rsi+32] # or vaddpd ymm0, [rdx+something]
#add rsi, 40
Any of these addressing modes can be two-register indexed addressing modes, because they don't need to micro-fuse to be a single uop.
AVX1: 5 vectors per 2 cycles, saturating p2/p3 and p5. (Ignoring cache-line splits on the 16B load). 6 fused-domain uops, leaving only 2 uops per 2 cycles to use the 5 vectors... Real code would probably use some of the load throughput to load something else (e.g. a non-broadcast 32B load from another array, maybe as a memory operand to an ALU instruction), or to leave room for stores to steal p23 instead of using p7.
## Haswell maximum broadcast throughput with AVX2
vmovups ymm3, [rsi]
vbroadcastsd ymm0, xmm3 # special-case for the low element; compilers should generate this from _mm256_permute4x64_pd(v, 0)
vpermpd ymm1, ymm3, 0b01_01_01_01 # NASM syntax for 0x99
vpermpd ymm2, ymm3, 0b10_10_10_10
vpermpd ymm3, ymm3, 0b11_11_11_11
vbroadcastsd ymm4, [rsi+32]
vbroadcastsd ymm5, [rsi+40]
vbroadcastsd ymm6, [rsi+48]
vbroadcastsd ymm7, [rsi+56]
vbroadcastsd ymm8, [rsi+64]
vbroadcastsd ymm9, [rsi+72]
vbroadcastsd ymm10,[rsi+80] # or vaddpd ymm0, [rdx + whatever]
#add rsi, 88
AVX2: 11 vectors per 4 cycles, saturating p23 and p5. (Ignoring cache-line splits for the 32B load...). Fused-domain: 12 uops, leaving 2 uops per 4 cycles beyond this.
I think 32B unaligned loads are a bit more fragile in terms of performance than unaligned 16B loads like vbroadcastf128.
Intel SnB/IvB:
vbroadcastsd ymm, m64 is 2 fused-domain uops: p5 (shuffle) and p23 (load).
vbroadcastss xmm, m32 and movddup xmm, m64 are single-uop load-port-only. Interestingly, vmovddup ymm, m256 is also a single-uop load-port-only instruction, but like all 256b loads, it occupies a load port for 2 cycles. It can still generate a store-address in the 2nd cycle. This uarch doesn't deal well with cache-line splits for unaligned 32B-loads, though. gcc defaults to using movups / vinsertf128 for unaligned 32B loads with -mtune=sandybridge / -mtune=ivybridge.
4x broadcast-load: 8 fused-domain uops: 4 p5 and 4 p23. Throughput: 4 vectors per 4 cycles, bottlenecking on port 5. Multiple loads from the same cache line in the same cycle don't cause a cache-bank conflict, so this is nowhere near saturating the load ports (also needed for store-address generation). That only happens on the same bank of two different cache lines in the same cycle.
Multiple 2-uop instructions with no other instructions between is the worst case for the decoders if the uop-cache is cold, but a good compiler would mix in single-uop instructions between them.
SnB has 2 shuffle units, but only the one on p5 can handle shuffles that have a 256b version in AVX. Using a p1 integer-shuffle uop to broadcast a double to both elements of an xmm register doesn't get us anywhere, since vinsertf128 ymm,ymm,xmm,i takes a p5 shuffle uop.
## Sandybridge maximum broadcast throughput: AVX1
vbroadcastsd ymm0, [rsi]
add rsi, 8
one per clock, saturating p5 but only using half the capacity of p23.
We can save one load uop at the cost of 2 more shuffle uops, throughput = two results per 3 clocks:
vbroadcastf128 ymm2, [rsi+16] # 2 uops: p23 + p5 on SnB/IvB
vunpckhpd ymm3, ymm2,ymm2 # 1 uop: p5
vunpcklpd ymm2, ymm2,ymm2 # 1 uop: p5
Doing a 32B load and unpacking it with 2x vperm2f128 -> 4x vunpckh/lpd might help if stores are part of what's competing for p23.

In my matrix multiplication code I only have to use the broadcast once per kernel code but if you really want to load four doubles in one instruction and then broadcast them to four registers you can do it like this
#include <stdio.h>
#include <immintrin.h>
int main() {
double in[] = {1,2,3,4};
double out[4];
__m256d x4 = _mm256_loadu_pd(in);
__m256d t1 = _mm256_permute2f128_pd(x4, x4, 0x0);
__m256d t2 = _mm256_permute2f128_pd(x4, x4, 0x11);
__m256d broad1 = _mm256_permute_pd(t1,0);
__m256d broad2 = _mm256_permute_pd(t1,0xf);
__m256d broad3 = _mm256_permute_pd(t2,0);
__m256d broad4 = _mm256_permute_pd(t2,0xf);
_mm256_storeu_pd(out,broad1);
printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
_mm256_storeu_pd(out,broad2);
printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
_mm256_storeu_pd(out,broad3);
printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
_mm256_storeu_pd(out,broad4);
printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
}
Edit: Here is another solution based on Paul R's suggestion.
__m256 t1 = _mm256_broadcast_pd((__m128d*)&b[4*k+0]);
__m256 t2 = _mm256_broadcast_pd((__m128d*)&b[4*k+2]);
__m256d broad1 = _mm256_permute_pd(t1,0);
__m256d broad2 = _mm256_permute_pd(t1,0xf);
__m256d broad3 = _mm256_permute_pd(t2,0);
__m256d broad4 = _mm256_permute_pd(t2,0xf);

Here is a variant built upon Z Boson's original answer (before edit), using two 128-bit loads instead of one 256-bit load.
__m256d b01 = _mm256_castpd128_pd256(_mm_load_pd(&b[4*k+0]));
__m256d b23 = _mm256_castpd128_pd256(_mm_load_pd(&b[4*k+2]));
__m256d b0101 = _mm256_permute2f128_pd(b01, b01, 0);
__m256d b2323 = _mm256_permute2f128_pd(b23, b23, 0);
__m256d b0000 = _mm256_permute_pd(b0101, 0);
__m256d b1111 = _mm256_permute_pd(b0101, 0xf);
__m256d b2222 = _mm256_permute_pd(b2323, 0);
__m256d b3333 = _mm256_permute_pd(b2323, 0xf);
In my case this is slightly faster than using one 256-bit load, possibly because the first permute can start before the second 128-bit load completes.
Edit: gcc compiles the two loads and the first 2 permutes into
vmovapd (%rdi),%xmm8
vmovapd 0x10(%rdi),%xmm4
vperm2f128 $0x0,%ymm8,%ymm8,%ymm1
vperm2f128 $0x0,%ymm4,%ymm4,%ymm2
Paul R's suggestion of using _mm256_broadcast_pd() can be written as:
__m256d b0101 = _mm256_broadcast_pd((__m128d*)&b[4*k+0]);
__m256d b2323 = _mm256_broadcast_pd((__m128d*)&b[4*k+2]);
which compiles into
vbroadcastf128 (%rdi),%ymm6
vbroadcastf128 0x10(%rdi),%ymm11
and is faster than doing two vmovapd+vperm2f128 (tested).
In my code, which is bound by vector execution ports instead of L1 cache accesses, this is still slightly slower than 4 _mm256_broadcast_sd(), but I imagine that L1 bandwidth-constrained code can benefit greatly from this.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight