bitpack ascii string into 7-bit binary blob using SIMD

bitpack ascii string into 7-bit binary blob using SIMD - c

Related: bitpack ascii string into 7-bit binary blob using ARM-v8 Neon SIMD - same question specialized for AArch64 intrinsics. This question covers portable C and x86-64 intrinsics.
I would like to encode a char string as a 7-bit blob to gain a 12.5% reduction in memory.
I want to do it as fast a possible, i.e. with minimal latency when encoding large strings.
Here is the plain implementation of the algo:
void ascii_pack(const char* ascii, size_t len, uint8_t* bin) {
uint64_t val;
const char* end = ascii + len;
while (ascii + 8 <= end) {
memcpy(&val, ascii, 8);
uint64_t dest = (val & 0xFF);
// Compiler will perform loop unrolling
for (unsigned i = 1; i <= 7; ++i) {
val >>= 1;
dest |= (val & (0x7FUL << 7 * i));
}
memcpy(bin, &dest, 7);
bin += 7;
ascii += 8;
}
// epilog - we do not pack since we have less than 8 bytes.
while (ascii < end) {
*bin++ = *ascii++;
}
}
now, I would like to speed it up with SIMD. I came with SSE2 algo below.
My question:
is it possible to optimize the internal loop that is sequential?
will it improve the throughput when running on large strings?
// The algo - do in parallel what ascii_pack does on two uint64_t integers
void ascii_pack_simd(const char* ascii, size_t len, uint8_t* bin) {
__m128i val;
__m128i mask = _mm_set1_epi64x(0x7FU); // two uint64_t masks
// I leave out 16 bytes in addition to 16 that we load in the loop
// because we store into "bin" full 16 bytes instead of 14. To prevent out of bound
// writes we finish one iteration earlier.
const char* end = ascii + len - 32;
while (ascii <= end) {
val = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ascii));
__m128i dest = _mm_and_si128(val, mask);
// Compiler unrolls it
for (unsigned i = 1; i <= 7; ++i) {
val = _mm_srli_epi64(val, 1); // shift right both integers
__m128i shmask = _mm_slli_epi64(mask, 7 * i); // mask both
dest = _mm_or_si128(dest, _mm_and_si128(val, shmask)); // add another 7bit part.
}
// dest contains two 7 byte blobs. Lets copy them to bin.
_mm_storeu_si128(reinterpret_cast<__m128i*>(bin), dest);
memmove(bin + 7, bin + 8, 7);
bin += 14;
ascii += 16;
}
end += 32; // Bring back end.
DCHECK(ascii < end);
ascii_pack(ascii, end - ascii, bin);
}

The scalar trick (without requiring PEXT) which I referred to in the comments could be implemented like this:
uint64_t compress8x7bit(uint64_t x)
{
x = ((x & 0x7F007F007F007F00) >> 1) | (x & 0x007F007F007F007F);
x = ((x & 0x3FFF00003FFF0000) >> 2) | (x & 0x00003FFF00003FFF);
x = ((x & 0x0FFFFFFF00000000) >> 4) | (x & 0x000000000FFFFFFF);
return x;
}
The idea here is to concatenate together adjacent pairs, first concatenate 7-bit elements into 14-bit elements, then concatenate them into 28-bit elements, and finally concatenate them into one 56-bit chunk (which is the result).
With SSSE3, you could use pshufb to concatenate two of those 56-bit parts (before storing them) too.
SSE2 (and AVX2) can do the same thing as that scalar code with 64-bit elements, but this approach does not take advantage of any techniques that may be possible with special operations (which SSE2+ has plenty of, more with every version), there are probably better things to do than just implementing the scalar trick in SIMD.
For example just to throw something wild out there, gf2p8affineqb(0x8040201008040201, x) would put all the "discarded" bits in one place (namely the top byte of the result) and makes a solid 56-bit chunk out of the bits that we want to keep. But the bits do end up in a strange order (the first byte would contain bits 56, 48, 40, 32, 24, 16, 8, 0, in that order, listing the least significant bit first).
That order, strange as it is, can be easily unpacked using pshufb to reverse the bytes (you can also use this to insert the two zeroes) and then gf2p8affineqb(0x0102040810204080, reversedBytes) shuffles the bits back into the original order.
Here's a sketch of how that could work with actual AVX2+GFNI intrinsics. I'm not bothering to handle the extra parts at the end here, just the "main" loop, so the input text had better be a multiple of 32 bytes. Works on my PC ✔️
void compress8x7bit(const char* ascii, size_t len, uint8_t* bin)
{
const char* end = ascii + len;
while (ascii + 31 < end) {
__m256i text = _mm256_loadu_si256((__m256i*)ascii);
__m256i transposed = _mm256_gf2p8affine_epi64_epi8(_mm256_set1_epi64x(0x8040201008040201), text, 0);
__m256i compressed = _mm256_shuffle_epi8(transposed,
_mm256_set_epi8(-1, -1, 14, 13, 12, 11, 10, 9, 8, 6, 5, 4, 3, 2, 1, 0,
-1, -1, 14, 13, 12, 11, 10, 9, 8, 6, 5, 4, 3, 2, 1, 0));
_mm_storeu_si128((__m128i*)bin, _mm256_castsi256_si128(compressed));
_mm_storeu_si128((__m128i*)(bin + 14), _mm256_extracti128_si256(compressed, 1));
bin += 28;
ascii += 32;
}
}
void uncompress8x7bit(char* ascii, size_t len, const uint8_t* bin)
{
const char* end = ascii + len;
while (ascii + 31 < end) {
__m256i raw = _mm256_inserti128_si256(_mm256_castsi128_si256(_mm_loadu_si128((__m128i*)bin)), _mm_loadu_si128((__m128i*)(bin + 14)), 1);
__m256i rev_with_zeroes = _mm256_shuffle_epi8(raw,
_mm256_set_epi8(7, 8, 9, 10, 11, 12, 13, -1, 0, 1, 2, 3, 4, 5, 6, -1,
7, 8, 9, 10, 11, 12, 13, -1, 0, 1, 2, 3, 4, 5, 6, -1));
__m256i decompressed = _mm256_gf2p8affine_epi64_epi8(_mm256_set1_epi64x(0x0102040810204080), rev_with_zeroes, 0);
_mm256_storeu_si256((__m256i*)ascii, decompressed);
bin += 28;
ascii += 32;
}
}
Perhaps there is a nicer solution than using two 128-bit stores in the compressor and two 128-bit loads in the uncompressor. With AVX512 that would be easy since it has full-register byte-granular permutes, but AVX2 has vpshufb, which is not able to move bytes between the two 128-bit halves that make up a 256-bit vector. The uncompressor could do a funny load that starts 2 bytes before the start of the data it wants, like this: _mm256_loadu_si256((__m256i*)(bin - 2)) (and a slightly different shuffle vector), at the cost of having to avoid a potential out-of-bounds error with either padding or a special first iteration, but the compressor cannot (not cheaply) use a trick like that with a store that start 2 bytes earlier (that would destroy two bytes of the result).
By the way I have some test code here that you can use to verify that your bit-compression functions do the right thing (well sort of - as long as the function is a bit-permutation where some of the bits may be zeroed this works as a check, but this would not detect every possible bug in general):
uint64_t bitindex[7];
bitindex[6] = compress8x7bit(0xFFFFFFFFFFFFFFFF);
bitindex[5] = compress8x7bit(0xFFFFFFFF00000000);
bitindex[4] = compress8x7bit(0xFFFF0000FFFF0000);
bitindex[3] = compress8x7bit(0xFF00FF00FF00FF00);
bitindex[2] = compress8x7bit(0xF0F0F0F0F0F0F0F0);
bitindex[1] = compress8x7bit(0xCCCCCCCCCCCCCCCC);
bitindex[0] = compress8x7bit(0xAAAAAAAAAAAAAAAA);
for (size_t i = 0; i < 64; i++)
{
if (i != 0)
std::cout << ", ";
if (bitindex[6] & (1uLL << i))
{
int index = 0;
for (size_t j = 0; j < 6; j++)
{
if (bitindex[j] & (1uLL << i))
index |= 1 << j;
}
std::cout << index;
}
else
std::cout << "_";
}
std::cout << "\n";

You can improve the solution by #harold, if you replace the first two mask and shift steps by a vpmaddubsw and vpmaddwd (each using 1 instead of 4 uops) and the next step can be replaced by shifting every other 32bit element 4 to the left and afterwords shifting all 64bit elements 4 to the right. Of course, by using AVX2 instead of SSE, you can again double the throughput.
The final step of joining the lower and upper lane is likely most efficiently done by two separate stores which extract each lane directly to memory.
void ascii_pack32(char const* ascii, char* bin)
{
const __m256i control = _mm256_set_epi8(-1, -1, 14, 13, 12, 11, 10, 9, 8, 6, 5, 4, 3, 2, 1, 0,
-1, -1, 14, 13, 12, 11, 10, 9, 8, 6, 5, 4, 3, 2, 1, 0);
__m256i input = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(ascii));
// only necessary if high bits of input might be set:
input = _mm256_and_si256(input, _mm256_set1_epi8(0x7f));
__m256i t1 = _mm256_maddubs_epi16(_mm256_set1_epi16(0x8001), input);
__m256i t2 = _mm256_madd_epi16(_mm256_set1_epi32(0x40000001), t1);
__m256i t3 = _mm256_srli_epi64(_mm256_sllv_epi32(t2, _mm256_set1_epi64x(4)), 4);
__m256i val = _mm256_shuffle_epi8(t3, control);
_mm_storeu_si128(reinterpret_cast<__m128i*>(bin), _mm256_castsi256_si128(val));
_mm_storeu_si128(reinterpret_cast<__m128i*>(bin+14), _mm256_extracti128_si256(val, 1));
}
Godbolt link with short testcode:
https://godbolt.org/z/hs7477h5W

SIMD unpack can benefit from blend instructions instead of and/andn/or because we can blend at dword / word / byte boundaries. We only need to AND once at the end to clear the high bit of each byte.
#include <immintrin.h>
static inline
__m128i ascii_unpack7x8_sse4(__m128i v)
{
__m128i separate_7B_halves = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, -1,
7, 8, 9,10,11,12,13, -1);
v = _mm_shuffle_epi8(v, separate_7B_halves);
// separate each u64 qword into 2 u32 halves, with the good bits at the bottom
__m128i shifted = _mm_slli_epi64(v, 4);
#ifdef __AVX2__
v = _mm_blend_epi32(v, shifted, 0b1010); // vpblendd is very efficient, 1 uop any port
#else
v = _mm_castps_si128(_mm_blend_ps( // blendps has extra bypass latency between integer insns, but is single-uop
_mm_castsi128_ps(v), _mm_castsi128_ps(shifted), 0b1010) );
#endif
// Separate each u32 into u16
shifted = _mm_slli_epi32(v, 2);
v = _mm_blend_epi16(v, shifted, 0b10101010); // high halves of pairs from shifted
// Separate each u16 into bytes, with one of two strategies
#if 0 // this strategy is simpler but worse
// shifted = _mm_add_epi16(v, v); // v<<1
// v = _mm_blendv_epi8(v, shifted, _mm_set1_epi16(0xff00));
// v = _mm_and_si128(v, _mm_set1_epi8(0x7f)); // clear garbage from high bits
#else
__m128i hi = _mm_and_si128(v, _mm_set1_epi16(0x3f80)); // isolate hi half
v = _mm_and_si128(v, _mm_set1_epi16(0x007f)); // clear high garbage
v = _mm_add_epi16(v, hi); // high halves left 1 (x+=x), low halves stay (x+=0)
// both ways need two vector constants and 3 instructions, but pblendvb can be slower and has an awkward requirement of having the control vector in XMM0
#endif
return v;
}
With AVX2 available, clang compiles it to this nice asm. Godbolt
# clang -O3 -march=x86-64-v3 (implies AVX2+BMI2, basically Haswell with generic tuning)
ascii_unpack7x8_sse4(long long __vector(2)):
vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = xmm0[0,1,2,3,4,5,6],zero,xmm0[7,8,9,10,11,12,13],zero
vpsllq xmm1, xmm0, 4
vpblendd xmm0, xmm0, xmm1, 10 # xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
vpslld xmm1, xmm0, 2
vpblendw xmm0, xmm0, xmm1, 170 # xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3],xmm0[4],xmm1[5],xmm0[6],xmm1[7]
vpand xmm1, xmm0, xmmword ptr [rip + .LCPI0_1] # in a loop, these constants would be in registers
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
vpaddw xmm0, xmm0, xmm1
ret
With just SSE4.1, compilers need several movdqa instructions, as in GCC's output. And out-of-order exec will have an extra 1 or 2 cycles of latency to hide because of bypass-forwarding delays for integer shifts forwarding to an FP blendps, on Intel CPUs. (https://agner.org/optimize/). But that's fine, we're doing this in a loop over an array, modern CPUs have deep enough out-of-order exec.
# gcc -O3 -march=x86-64-v2 # SSE4.2, Nehalem. Actually only using SSE4.1
ascii_unpack7x8_sse4(long long __vector(2)):
movdqa xmm1, xmm0 # silly compiler wastes a MOV
pshufb xmm1, XMMWORD PTR .LC0[rip]
movdqa xmm0, xmm1 # save unshifted v
psllq xmm0, 4
blendps xmm1, xmm0, 10 # 0b1010 = 0xA
movdqa xmm0, xmm1
pslld xmm0, 2
pblendw xmm1, xmm0, 170 # 0b10101010 = 0xAA
movdqa xmm0, XMMWORD PTR .LC1[rip] # after inlining, probably a reg-copy
pand xmm0, xmm1 # and two PAND xmm,xmm
pand xmm1, XMMWORD PTR .LC2[rip]
paddw xmm0, xmm1
ret
If AVX2 is available, an __m256i version of this is straightforward and wouldn't need the blendps fallback. That may be better than scalar pdep (BMI2). AVX2 vpsrlvd or q (per-element shift counts) seem like they should help, but we find ourselves needing to move bits across dword boundaries, and it can only be left or right, not alternating directions. (AVX512 has variable-count rotates (32 and 64-bit), and 16-bit variable-count shifts. Rotates let you go right or left with the same instruction.)
The shift element size could be 64 each time; our blends drop bits that would get shifted into the low element of a pair. For the final step, paddw is 1 byte smaller than psllw/d/q because it has no immediate operand. And can run on more ports on most CPUs. Especially Haswell, where shifts can only run on port 0, but paddw can run on port 1 or 5. (This code has no instruction-level parallelism within one iteration, so we rely on out-of-order exec to overlap execution of multiple iterations.)
Skylake through Alder Lake run SIMD shifts on p01, SIMD inter adds on p015, blendps on p015, pblendw on p5 (p15 for Alder Lake), pblendvb as 1 uop for p015. (Only the non-AVX encoding; vpblendvb is 2 uops for p015). Zen 3 for example has plenty of throughput for all of these.
The final step avoiding _mm_blendv_epi8 has several advantages:
Both ways need two vector constants and 3 instructions. (And no difference in the minimum number of movdqa register-copies a compiler has to invent without non-destructive AVX instructions.)
The AND/AND/ADD version has better ILP; two ANDs in parallel.
SSE4.1 pblendvb can be slower (e.g. Haswell runs it as 2 uops for port 5) and has an awkward requirement of having the control vector in XMM0. Some compilers may waste instructions with hard-reg constraints. (Maybe even when inlining into a loop, unlike when we look at how this helper function would compile on its own.)
vpblendvb (the AVX encoding of it) is 2 uops (for any port) on newer Intel, or 3 on Alder Lake, presumably as the price for having 4 operands (3 inputs and a separate output). Also the AVX version is slow on Alder Lake E-cores (4 uops, 3.23 cycle throughput) https://uops.info/.
AMD CPUs don't have this problem; for example Zen 3 runs vpblendvb as 1 uop for either of two ports.
The only possible upside to the blend version is that the constants are easier to construct on the fly. GCC12 has started preferring to construct some constants on the fly when AVX is available, but does a rather bad job of it, using 10-byte mov r64, imm64 / vmovq / vpunpcklqdq instead of 5-byte mov reg, imm32 / ... / vpbroadcastd or pshufd v,v,0. Or instead of starting with an all-ones vector and shifting.
Actually, the constants for the non-blend way can be generated from an all-ones vector with psrlw xmm, 9 to get 0x007f, and then left shifting that 7-bit mask left by 7. So with AVX, 3 total instructions for both masks, without memory access. Unfortunately compilers don't know how to do this optimization so it's a moot point.
AVX-512F / BW, without AVX-512VBMI / AVX-512GFNI
If you have Ice Lake / Zen4 features, you want #Harold's answer; as I commented there, it's slightly better than AVX-512 vpmultishiftqb (parallel bitfield-extract within a qword).
But if not, with Skylake-X / Cascade Lake features (AVX-512BW and F) you have have masking and variable-count rotates. This saves 2 instructions vs. the SSE4 version (built with AVX2); it feels like there should be room to save more, especially at the final step within 16-bit elements. But masking has byte granularity, and there is no vprolvw, and still no byte shift, unlike AArch64 which can shift elements in 2 directions at byte granularity.
Splitting things apart and doing different things, then merging with a merge-masking vmovdqa could work, but I don't think would help.
#ifdef __AVX512BW__
// pre-Ice Lake, without AVX-512VBMI or AVX512-GFNI
__m128i ascii_unpack7x8_avx512bw(__m128i v)
{
// for YMM or ZMM, use VPERMW, or VPERMB if we have AVX512VBMI since unfortunately VPERMW isn't single-uop on Intel CPUs that support both.
__m128i separate_7B_halves = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, -1,
7, 8, 9,10,11,12,13, -1);
v = _mm_shuffle_epi8(v, separate_7B_halves);
v = _mm_slli_epi64(v, 4); // 00000HGFE | DCBA0000 // dword halves
v = _mm_rolv_epi32(v, _mm_set_epi32(2, 32-2, 2, 32-2));
// 00HG|FE00 | 00DC|BA00 // u16 chunks of a u64
v = _mm_mask_srli_epi16(v, (__mmask8)0b0101'0101, v, 2); // 00HG | 00FE | 00DC | 00BA
// Separate each u16 into bytes
__m128i hi = _mm_and_si128(v, _mm_set1_epi16(0x3f80)); // isolate hi half
v = _mm_add_epi16(v, hi); // high halves left 1 (x+=x), low halves stay (x+=0)
// 0H0G | 0F0E | 0D0C | 0B0A in each qword.
return v;
}
#endif
Clang (Godbolt) optimizes the masked right-shift to a variable-count right shift, which is a good idea for a stand-alone function not in a loop especially when we're loading other constants.
This uses more non-immediate constants, but fewer uops. A wider version of this using vpermw to unpack 14-byte chunks to 16-byte lanes might have to do something to introduce zero bits where they're needed, perhaps using zero-masking on the shuffle. But I think we'd still need vpshufb within lanes, so it can zero those high bits.
Having those known zeros that we move around with shifts and rotates is what lets us only use one and and add at the end, unlike the blending version where elements end up with high garbage so we need to mask both ways.
# clang -O3 -march=x86-64-v4
ascii_unpack7x8_avx512bw(long long __vector(2)):
vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI1_0] # xmm0 = xmm0[0,1,2,3,4,5,6],zero,xmm0[7,8,9,10,11,12,13],zero
vpsllq xmm0, xmm0, 4
vprolvd xmm0, xmm0, xmmword ptr [rip + .LCPI1_1]
vpsrlvw xmm0, xmm0, xmmword ptr [rip + .LCPI1_2]
vpand xmm1, xmm0, xmmword ptr [rip + .LCPI1_3]
vpaddw xmm0, xmm1, xmm0
ret
These constants would of course be loaded into registers.
Just 6 uops; shifts run on port 0 or 1, shuffles on port 5, on Skylake, with VPAND and VPADD able to run on any of the 3 vector ALU ports. So it's a good balance, not running into back-end throughput bottlenecks on a specific port. (vs. 8 uops with clang's AVX build of the SSE4 version)
GCC using masking as requested, again the constant init will get hoisted out of loops, including k1.
# gcc -O3 -march=x86-64-v4
ascii_unpack7x8_avx512bw(long long __vector(2)):
vpshufb xmm0, xmm0, XMMWORD PTR .LC0[rip]
mov eax, 85 # 0x55
vpsllq xmm0, xmm0, 4
kmovb k1, eax
movabs rax, 4575727041462157184 # 0x3F803F803F803F80 silly to use a 64-bit immediate
vprolvd xmm0, xmm0, XMMWORD PTR .LC3[rip]
vpbroadcastq xmm1, rax
vpsrlw xmm0{k1}, xmm0, 2
vpand xmm1, xmm0, xmm1
vpaddw xmm0, xmm0, xmm1
ret
Same instructions doing the work, just setting up constants differently. (Except for vpsrlw xmm0{k1}, xmm0, 2 to shift some elements but not others.)

Backporting my arm64 answer to SSE2, we can simulate variadic shifts by mullo_epu16 and mulhi_epu16; first pack adjacent 7+7-bit values as consecutive:
// 0b'0aaaaaaa'0bbbbbbb + 0bbbbbbb = 0b'0aaaaaaa'bbbbbbb0
a0 = _mm_add_epi16(a, _mm_and_epi16(a, _mm_set1_epi16(0x7f)));
a0 = 0aaaaaaabbbbbbb0'0cccccccddddddd0'0eeeeeeefffffff0'0ggggggghhhhhhh0
a1 = 00000000aaaaaaab'000000cccccccddd'0000eeeeeeefffff'00ggggggghhhhhhh
a2 = bbbbbb0000000000'dddd000000000000'ff00000000000000'0000000000000000
a3 = 0000000000000000'bbbbbb0000000000'dddd000000000000'ff00000000000000
a1 = _mm_mulhi_epu16(a0, kShift); // 1 << {9,11,13,15}
a2 = _mm_mullo_epu16(a0, kShift); // 1 << {9,11,13,15}
a3 = _mm_bsrli_si128(a2, 2);
return _mm_or_si128(a1,a3);

Related

Extracting edges of AVX2 16x16 bitmatrix

Is there a relatively cheap way to extract the four edges (rows 0 and 15, and columns 0 and 15) of a 16x16 bitmatrix stored in a __m256i into four 16b lanes of a __m256i? I don't care which lanes the output is to, or if there is garbage in the rest of the register. Mild preference for all of them to be in the low half, but only mild.
Extracting the 'top' and 'bottom' are easy - it's just the first and last 16b elements of the vector, done - but the sides are another matter. You need the first and last bits of each 16b element, which gets complicated.
You can do it with a full bit-transpose, like so:
// Full bit-transpose of input viewed as a 16x16 bitmatrix.
extern __m256i transpose(__m256i m);
__m256i get_edges(__m256i m) {
__m256i t = transpose(m);
// We only care about first and last u16 of each
// m = [abcdefghijklmnop]
// t = [ABCDEFGHIJKLMNOP]
m = _mm256_permutevar8x32_epi32(m, _mm256_set_epi32(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7, 0x0));
// m = [............a..p]
t = _mm256_permutevar8x32_epi32(t, _mm256_set_epi32(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7, 0x0));
// m = [............A..P]
__m256i r = _mm256_unpacklo_epi16(t, m);
// r = [........aA....pP]
return r; // output in low and high dwords of low half
}
... but that just reduces one surprisingly annoying problem to another surprisingly annoying problem - I can't see how to cheaply do a full bit-transpose of a __m256i.
Ditto, there might be something _mm256_movemask_epi8-esque that could do the trick - but nothing jumps out at me.
Is there a better approach?

With fast BMI2 pext (Haswell or Zen 3 and later), that's one option if you start with vpmovmskb + shift + vpmovmskb to get the bits of the edges (interleaved with garbage bits, since we want every 16th but we get every 8th).
9 uops for the front-end, 6 of them needing port 5 on Intel Skylake-family. (Not counting the integer constant setup, assuming you'd do this in a loop. If not, that also counts against this.)
__m128i edges_zen3_intel(__m256i v)
{
__m128i vtop_bottom = _mm256_castsi256_si128(
_mm256_permute4x64_epi64(v, _MM_SHUFFLE(0,0, 3, 0)) );
// vpermq: 3 uops on Zen1, 2 on Zen2&3, 1 on Zen4 and Intel.
// side bits interleaved with garbage
// without AVX-512 we can only extract a bit per byte, dword, or qword
unsigned left = _mm256_movemask_epi8(v); // high bit of each element
unsigned right = _mm256_movemask_epi8( _mm256_slli_epi16(v, 15) ); // low<<15
// left = _pext_u32(left, 0xAAAAAAAAul); // take every other bit starting with #1
// right = _pext_u32(right, 0xAAAAAAAAul);
// then combine or do whatever
uint64_t lr = ((uint64_t)left << 32) | right;
lr = _pext_u64(lr, 0xAAAAAAAAAAAAAAAAull);
//__m128i vsides = _mm_cvtsi32_si128(lr);
__m128i vtblr = _mm_insert_epi32(vtop_bottom, lr, 1); // into an unused space
// u16 elems: [ top | x | x | x | left | right | x | bottom ]
return vtblr;
}
This compiles to 10 uops for Intel CPUs (and Zen 4), including getting everything back into one SIMD vector. The movabs can be hoisted out of loops. SHL/OR don't compete for SIMD execution-port throughput (able to run on port 6 on Intel), but do compete for the front-end. Godbolt
# Haswell/Sklake uop counts
edges_zen3_intel(long long __vector(4)):
vpsllw ymm2, ymm0, 15 # p0 (or p01 on Skylake)
vpmovmskb eax, ymm0 # p0
vpermq ymm1, ymm0, 12 # p5
vpmovmskb edx, ymm2 # p0
sal rax, 32 # p06
or rax, rdx # p0156
movabs rdx, -6148914691236517206 # p0156 (and can be hoisted out of loops)
pext rax, rax, rdx # p1
vpinsrd xmm0, xmm1, eax, 1 # 2 p5. On Intel, both uops compete with shuffles
ret
As a variation, we could maybe get left and right edges together for one vpmovmskb, if we can left-shift the odd bytes but not the evens? Probably not, _mm256_maddubs_epi16 with _mm256_set1_epi16(0x0180) can't do that, it adds horizontal pairs, and a left-shift of 7 (0x80 = 1<<7) isn't enough, we'd need 8 to get the top bit back to the top.
Or if we vpsllw + vpacksswb, then use the right masks to group bits, like 0x00ff00ff. But that's getting closer to my non-pext idea, maybe it's better even if we do have fast pext
Without fast BMI2 pext - saturating pack the vector to reduce to 8-bit elements
This might be faster even if pext is fast.
Packing with signed saturation always preserves the sign bit, so you can narrow 16 to 8-bit without losing the information you want to keep. We want to do this to the high and low bit of each word (16-bit element), so a 2:1 pack with the original and v<<15 is perfect.
Except for the fact that AVX2 vpacksswb ymm is two separate in-lane pack operations, so we end up with 8-element chunks interleaved. We could fix that up right after packing with vpermq, but it's multiple uops on Zen 1 through Zen 3, and we can instead shuffle bytes after getting the movemask result back into a vector register. (The same vpshufb can move around the high and low elements.)
// avoiding PEXT because it's slow on Zen 2 and Zen 1 (and Excavator)
// This might be good on Intel and Zen 3, maybe comparable to using PEXT
__m128i edges_no_pext(__m256i v)
{
__m128i vhi = _mm256_extract_si128(v, 1); // contains top, as vhi.u16[7]
__m128i vlo = _mm256_castsi256_si128(v); // contains bottom, as vlo.u16[0], contiguous if concatenated the right way
__m128i bottom_top = _mm_alignr_epi8(vhi, vlo, 12); // rotate bottom :top down to the 2nd dword [ x | x | bottom:top | x]
// vpermq ymm, ymm, imm would also work to get them into the low 128
// but that's 3 uops on Zen1, 2 on Zen2&3, 1 on Zen4 and Intel.
// and would need a slightly more expensive vpinsrd instead of vmovd+vpblendd
// On Intel CPUs (and Zen4) vpermq is better; we pshufb later so we can get the bytes where we want them.
// A compromise is to use vextracti128+vpblendd here, vpinsrd later
// __m128i bottom_top = _mm_blend_epi32(vhi, vlo, 0b0001);
// [ hi | x | x | x | x | x | x | lo ]
__m256i vright = _mm256_slli_epi16(v, 15);
__m256i vpacked = _mm256_packs_epi16(v, vright); // pack now, shuffle bytes later.
unsigned bits = _mm256_extract_epi8(vpacked); // [ left_hi | right_hi | left_lo | right_lo ]
__m128i vsides = _mm_cvtsi32_si128(bits);
__m128i vtblr = _mm_blend_epi32(top_bottom, vsides, 0b0001); // vpinsrd xmm0, eax, 0 but the merge can run on more ports
__m128i shuffle = _mm_set_epi8(-1,-1,-1,-1, -1,-1,-1,-1,
7,6,5,4, 3,1, 2,0);
// swap middle 2 bytes of the low dword, fixing up the in-lane pack
vtblr = _mm_shuffle_epi8(vtblr, shuffle);
return vtblr; // low 4 u16 elements are (MSB) top | bottom | left | right (LSB)
}
This compiles pretty nicely (see earlier Godbolt link), although GCC4.9 and later (and clang) pessimize my vmovd+vpblendd into vpinsrd, even with -march=haswell or Skylake where it's 2 uops for port 5 (https://uops.info/) when most of the other instructions in the function are also shuffles that only run on port 5. (This is much more shuffle-heavy for Intel CPUs.)
Using vpblendd instead of vpalignr would make it less bad for Intel, like __m128i bottom_top = _mm_blend_epi32(vhi, vlo, 0b0001);, to get to the same situation as in the vpermq version below with 2 uops even on Zen 1. But this is just saving 1 uop on Zen 1 and is equal or worse everywhere else.
# GCC12 -O3 -march=haswell
# uop counts for Skylake
edges_no_pext:
vextracti128 xmm1, ymm0, 0x1 # p5
vpsllw ymm2, ymm0, 15 # p01
vpalignr xmm1, xmm1, xmm0, 12 # p5
vpacksswb ymm0, ymm0, ymm2 # p5
vpmovmskb eax, ymm0 # p0
vpinsrd xmm0, xmm1, eax, 0 # 2 p5
vpshufb xmm0, xmm0, XMMWORD PTR .LC0[rip] # p5
ret
So that's 6 uops for port 5 on Intel, a throughput bottleneck of 1 per 6 cycles. vs. the PEXT version being 3 uops that need port 0, 3 that need port 5. But this is only 8 total uops for the front-end, vs. 9 for the pext version. And the vpermq version saves one more on Intel, assuming GCC doesn't waste the vmovdqa after inlining.
If you didn't care about zeroing the upper 8 bytes of the output vector, the shuffle constant could be loaded with vmovq and just be 8 bytes instead of 16 (if you made the upper 0 bytes all zeros). But compilers will probably not spot that optimization.
Since compilers insist on pessimizing to vpinsrd, on CPUs with fast vpermq (Intel and Zen4), we might as well use that:
If you're only going to have one non-GFNI AVX2 version, this is probably a good tradeoff
vpermq being 3 uops on Zen 1 isn't much worse than emulating what we need from it using 2 instruction, and is worse on Intel CPUs. And probably about break-even on Zen 2 and Zen 3, modulo differences in back-end port usage.
// for fast vpermq, especially if compilers are going to pessimize vmovd(p5)+vpblendd (p015) into vpinsrd (2p5).
// good on Intel and Zen 4, maybe also Zen 3 and not bad on Zen 2.
__m128i edges_no_pext_fast_vpermq(__m256i v)
{
__m128i vtop_bottom = _mm256_castsi256_si128(
_mm256_permute4x64_epi64(v, _MM_SHUFFLE(0,0, 3, 0)) );
// 3 uops on Zen1, 2 on Zen2&3, 1 on Zen4 and Intel.
__m256i vright = _mm256_slli_epi16(v, 15);
__m256i vpacked = _mm256_packs_epi16(v, vright); // pack now, shuffle bytes later.
unsigned bits = _mm256_movemask_epi8(vpacked); // [ left_hi | right_hi | left_lo | right_lo ]
__m128i vtblr = _mm_insert_epi32(vtop_bottom, bits, 1); // into an unused space
// u16 elems: [ top | x | x | x | lh:rh | ll:rl | x | bottom ]
__m128i shuffle = _mm_set_epi8(-1,-1,-1,-1, -1,-1,-1,-1,
15,14, 1,0, 7,5, 6,4);
vtblr = _mm_shuffle_epi8(vtblr, shuffle);
return vtblr; // low 4 u16 elements are (MSB) top | bottom | left | right (LSB)
}
# GCC12.2 -O3 -march=haswell clang is similar but has vzeroupper despite the caller passing a YMM, but no wasted vmovdqa
edges_no_pext_fast_vpermq(long long __vector(4)):
vmovdqa ymm1, ymm0
vpermq ymm0, ymm0, 12
vpsllw ymm2, ymm1, 15
vpacksswb ymm1, ymm1, ymm2
vpmovmskb eax, ymm1
vpinsrd xmm0, xmm0, eax, 1
vpshufb xmm0, xmm0, XMMWORD PTR .LC1[rip]
ret
On Intel Haswell/Skylake, this is 5 uops for port 5, plus a shift (p01) and vpmovmskb (p0). So 7 total uops. (Not counting the ret or the wasted vmovdqa that should go away with inlining.)
On Ice Lake and later, one of the uops from vpinsrd can run on p15, relieving one uop of pressure on that port if you're doing this in a loop. vpinsrd is single-uop on Alder Lake E-cores.
Ice Lake (and later) can also run vpshufb on p1/p5, further reducing port 5 pressure, down to 3 of the 7 uops. Port 5 can handle any shuffle, port 1 can handle some but not all shuffle uops. It may be hooked up to the upper half of the 512-bit shuffle unit to give extra throughput for some 256-bit and narrower shuffles, like how the p0/p1 FMA units work as a single 512-bit FMA unit on p0. It doesn't handle vpermq or vpacksswb; those are still p5 only on Ice/Alder Lake.
So this version is pretty reasonable on current-generation and future Intel CPUs. Alder Lake E-cores run vpermq ymm as 2 uops with 7 cycle latency. But if they can hide that latency with their more limited out-of-order scheduling (big ROB, but queues for each port aren't as long), running vpinsrd as a single uop helps make up the front-end throughput.
256-bit instructions like vpsllw ymm and vpacksswb ymm are also 2 uops each on Alder Lake E-cores, but vpmovmskb eax,ymm is 1 uop (but maybe high-ish latency). So even if we wanted to make a version optimized for Zen1 / Alder E, we probably can't save total uops on them by using more 128-bit instructions after vextracti128; we still need to do stuff to both halves of the input vector.
I had looked at packing into the right order for vpmovmskb xmm to get each 16-bit group in the right order, but separately. I had considered doing this with vperm2i128, but that's quite slow on Zen 1.
// __m256i vcombined = _mm256_permute2x128_si256(v, vright, 0x10); // or something? Takes two shuffles to get them ordered the right way for pack
Zen 1 has very fast vextracti128 - is single-uop for any port, and 128-bit vector ops are 1 uop vs. 2 for __m256i operations. And where we're already doing that extract to get the top and bottom together.
But it still leads to more scalar work, especially if you want the result combined in a vector. 2x vpinsrw or and extra SHL/OR before vmovd is worse.
#if 0
// Zen 1 has slow vperm2i128, but I didn't end up using it even if it's fast
__m128i hi = _mm256_extract_si128(v, 1); // vextracti128 - very cheap on Zen1
__m128i lo = _mm256_castsi256_si128(v); // no cost
__m128i vleft = _mm_packs_epi16(lo, hi); // vpacksswb signed saturation, high bit of each word becomes high bit of byte
// then shift 2 halves separately and pack again?
#endif
Vector packing to set up for vpmovmskb is probably the best bet; before thinking of that, I was looking at using vpmovmskb on the input directly and using scalar bithacks to take odd or even bits:
How to efficiently de-interleave bits (inverse Morton)
How to de-interleave bits (UnMortonizing?)
But those take more operations, so they're slower unless you're bottlenecked on SIMD ALUs specifically, not overall front-end throughput (or execution-port throughput on Intel where SIMD and scalar ALUs share ports).
AVX-512 and/or GFNI
There are two interesting strategies here:
vpmovw2m and/or vptestmw or mb as a more convenient vpmovmskb. Only requires AVX-512BW (Skylake-avx512)
Pack 8 bits to the bottom of each qword, then shuffle. Probably only good with GFNI + AVX512VBMI, like Ice Lake / Zen4 and later. Maybe just GFNI + AVX2 as in crippled Alder Lake (no AVX-512).
Extracting bits to a mask:
With one vptestmb with set1_epi8(0x8001), we can get all the bits we want into one mask, but then we need to deinterleave, probably with scalar pext (which is fast on all AVX-512 CPUs except maybe Knight's Landing, but it doesn't have AVX-512BW).
So probably better to extract two masks and concatenate. Except wait a minute, I don't see a great way to get a 32-bit mask into a vector register (without expanding it to a vector of 0 / -1 elements). For 8 and 16-bit masks, there's mask-to-vector broadcasts like vpbroadcastmw2d x/y/zmm, k. They don't support masking, so you can't merge-mask into another register. That's single-uop on Zen 4, but on Intel it costs 2 uops, same as kmov eax, k / vpbroadcastd x/y/zmm, eax, which is what you should do instead so you can merge-mask into the vector with the top and bottom edges.
vpmovw2m k1, ymm0 # left = 16 mask bits from high bits of 16 elements
vptestmw k2, ymm0, set1_epi16(0x0001) # right. pseudocode constant
kunpckwd k1, k1, k2 # left:right
# there's no vpbroadcastmd2d only byte/word mask to dword or qword element!
mov ecx, 0b0010
kmovb k7, ecx # hoist this constant setup out of loops. If not looping, maybe do something else, like bcast to another register and vpblendd.
kmovd eax, k1
vpbroadcastd xmm0{k7}, eax # put left:right into the 2nd element of XMM0
# leaving other unchanged (merge-masking)
Where xmm0 could have been set by vpermq to have top:bottom in the low 16 bytes; all CPUs with AVX-512 have efficient vpermq. So that's 1 more uop on top of the 5 from my hand-written asm (which should be straightforward to write with intrinsics, I just didn't feel like taking the extra step of looking up the right intrinsics after finding the available asm instructions.)
Packing bits within qwords then shuffling: GFNI and probably AVX-512VBMI for vpermb
(Requiring AVX512VBMI means Ice Lake or Zen 4, so vpermb will be single-uop. Unless some future Intel CPU with an E-core supports a slower AVX-512, but still vpermb ymm hopefully wouldn't be too bad.)
Probably pack in left:right order (1 nibble each), then byte shuffle. If we can do left:right and right:left in alternating bytes, a byte shuffle (like vpermb or vpermt2b) should be able to set up for a vprolw to rotate within each 16-bit word to group 8 "left" bits in the right order.
Moving bits within a qword: Harold's answer on bitpack ascii string into 7-bit binary blob using SIMD shows _mm256_gf2p8affine_epi64_epi8 putting 1 bit from each byte at the top of each qword. (And packing the remaining 7-bit fields, which was the goal in that answer.)
If this is doable, it'll probably be fewer uops and significantly better latency than going to masks and back.
With Alder Lake (GFNI but AVX-512 disabled unless you manage to avoid Intel's efforts to cripple this amazing CPU), this might still be useful, since it has AVX+GFNI for _mm256_gf2p8affine_epi64_epi8. vpshufb + vpermd can substitute for vpermb. But you won't have word rotates; still, shuffling bytes like ABAB will let you use a plain left shift to get the window you wanted, and then shuffle again.

Fastest way to find 16bit match in a 4 element short array?

I may confirm by using nanobench. Today I don't feel clever and can't think of an easy way
I have a array, short arr[]={0x1234, 0x5432, 0x9090, 0xFEED};. I know I can use SIMD to compare all elements at once, using movemask+tzcnt to find the index of a match. However since it's only 64 bits I was wondering if there's a faster way?
First I thought maybe I can build a 64-bit int by writing target|(target<<16)|(target<<32)|(target<<48) but then realized both an AND and SUB isn't the same as a compare since the low 16 can affect the higher 16. Then I thought instead of a plain loop I can write index=tzcnt((target==arr[0]?1:0)... | target==arr[3]?8:0
Can anyone think of something more clever? I suspect using the ternary method would give me best results since it's branchless?

For SWAR compare-for-equality, the operation you want is XOR, which like SUB produces all-zero on equal inputs, but unlike SUB doesn't propagate carry sideways.
But then you need to detect a contiguous 16 0 bits. Unlike pcmpeqw, you'll have some zero bits in the other elements.
So it's probably about the same as https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord but with wider mask patterns to operate on 16-bit instead of 8-bit chunks.
There is yet a faster method — use hasless(v, 1), which is defined below; it works in 4 operations and requires no subsquent verification. It simplifies to
#define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
The subexpression (v - 0x01010101UL), evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80). Finally, by ANDing these two sub-expressions the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second.
This bithack was originally by Alan Mycroft in 1987.
So it could look like this (untested):
#include <stdint.h>
#include <string.h>
// returns 0 / non-zero status.
uint64_t hasmatch_16in64(uint16_t needle, const uint16_t haystack[4])
{
uint64_t vneedle = 0x0001000100010001ULL * needle; // broadcast
uint64_t vbuf;
memcpy(&vbuf, haystack, sizeof(vbuf)); // aliasing-safe unaligned load
//static_assert(sizeof(vbuf) == 4*sizeof(haystack[0]));
uint64_t match = vbuf ^ vneedle;
uint64_t any_zeros = (match - 0x0001000100010001ULL) & ~match & 0x8000800080008000ULL;
return any_zeros;
// unsigned matchpos = _tzcnt_u32(any_zeros) >> 4; // I think.
}
Godbolt with GCC and clang, also including a SIMD intrinsics version.
# gcc12.2 -O3 -march=x86-64-v3 -mtune=znver1
# x86-64-v3 is the Haswell/Zen1 baseline: AVX2+FMA+BMI2, but with tune=generic
# without tune=haswell or whatever, GCC uses shl/add /shl/add instead of imul, despite still needing the same constant
hasmatch_16in64:
movabs rax, 281479271743489 # 0x1000100010001
movzx edi, di # zero-extend to 64-bit
imul rdi, rax # vneedle
xor rdi, QWORD PTR [rsi] # match
# then the bithack
mov rdx, rdi
sub rdx, rax
andn rax, rdi, rdx # BMI1
movabs rdx, -9223231297218904064 # 0x8000800080008000
and rax, rdx
ret
Clang unfortunately adds 0xFFFEFFFEFFFEFFFF instead of reusing the multiplier constant, so it has three 64-bit immediate constants.
AArch64 can do repeating-pattern constants like this as immediates for bitwise ops, and doesn't have as convenient SIMD movemask, so this might be more of a win there, especially if you can guarantee alignment of your array of shorts.
Match position
If you need to know where the match is, I think that bithack has a 1 in the high bit of each zero byte or u16, and nowhere else. (The lowest-precendence / last operations are bitwise AND involving 0x80008000...).
So maybe tzcnt(any_zeros) >> 4 to go from bit-index to u16-index, rounding down. e.g. if the second one is zero, the tzcnt result will be 31. 31 >> 4 = 1.
If that doesn't work, then yeah AVX2 or AVX-512 vpbroadcastw xmm0, edi / vmovq / vpcmeqw / vpmovmskb / tzcnt will work well, too, with smaller code-size and fewer uops, but maybe higher latency. Or maybe less. (To get a byte offset, right shift if you need an index of which short.)
Actually just SSE2 pshuflw can broadcast a word to the low qword of an XMM register. Same for MMX, which would actually allow a memory-source pcmpeqw mm0, [rsi] since it has no alignment requirement and is only 64-bit, not 128.
If you can use SIMD intrinsics, especially if you have efficient word broadcast from AVX2, definitely have a look at it.
#include <immintrin.h>
// note the unsigned function arg, not uint16_t;
// we only use the low 16, but GCC doesn't realize that and wastes an instruction in the non-AVX2 version
int hasmatch_SIMD(unsigned needle, const uint16_t haystack[4])
{
#ifdef __AVX2__ // or higher
__m128i vneedle = _mm_set1_epi16(needle);
#else
__m128i vneedle = _mm_cvtsi32_si128(needle); // movd
vneedle = _mm_shufflelo_epi16(vneedle, 0); // broadcast to low half
#endif
__m128i vbuf = _mm_loadl_epi64((void*)haystack); // alignment and aliasing safe
unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi16(vneedle, vbuf));
//return _tzcnt_u32(mask) >> 1;
return mask;
}
# clang expects narrow integer args to already be zero- or sign-extended to 32
hasmatch_SIMD:
movd xmm0, edi
pshuflw xmm0, xmm0, 0 # xmm0 = xmm0[0,0,0,0,4,5,6,7]
movq xmm1, qword ptr [rsi] # xmm1 = mem[0],zero
pcmpeqw xmm1, xmm0
pmovmskb eax, xmm1
ret
AXV-512 gives us vpbroadcastw xmm0, edi, replacing vmovd + vpbroadcastw xmm,xmm or movd + pshuflw, saving a shuffle uop.
With AVX2, this is 5 single-uop instructions, vs. 7 (or 9 counting the constants) for the SWAR bithack. Or 6 or 8 not counting the zero-extension of the "needle". So SIMD is better for front-end throughput. (https://agner.org/optimize/ / https://uops.info/)
There are limits to which ports some of these instructions can run on (vs. the bithack instructions mostly being any integer ALU port), but presumably you're not doing this in a loop over many such 4-element arrays. Or else SIMD is an obvious win; checking two 4-element arrays at once in the low and high halves of a __m128i. So probably we do need to consider the front-end costs of setting up those constants.
I didn't add up the latencies; it's probably a bit higher even on Intel CPUs which generally have good latency between integer and SIMD units.
GCC unfortunately fails to optimize away the movzx edi, di from the SIMD version if compiled without AVX2; only clang realizes the upper 16 of _mm_cvtsi32_si128(needle) is discarded by the later shuffle. Maybe better to make the function arg unsigned, not explicitly a narrow 16-bit type.

Clang with -O2 or -O3 and GCC with -O3 compile a simple search loop into branchless instructions:
int indexOf(short target, short* arr) {
int index = -1;
for (int i = 0; i < 4; ++i) {
if (target == arr[i]) {
index = i;
}
}
return index;
}
Demo
I doubt you can get much better without SIMD. In other words, write simple and understandable code to help the compiler produce efficient code.
Side note: for some reason, neither Clang nor GCC use conditional moves on this very similar code:
int indexOf(short target, short* arr) {
for (int i = 0; i < 4; ++i) {
if (target == arr[i]) {
return i;
}
}
return -1;
}

Why vector length SIMD code is slower than plain C

Why is my SIMD vector4 length function 3x slower than a naive vector length method?
SIMD vector4 length function:
__extern_always_inline float vec4_len(const float *v) {
__m128 vec1 = _mm_load_ps(v);
__m128 xmm1 = _mm_mul_ps(vec1, vec1);
__m128 xmm2 = _mm_hadd_ps(xmm1, xmm1);
__m128 xmm3 = _mm_hadd_ps(xmm2, xmm2);
return sqrtf(_mm_cvtss_f32(xmm3));
}
Naive implementation:
sqrtf(V[0] * V[0] + V[1] * V[1] + V[2] * V[2] + V[3] * V[3])
The SIMD version took 16110ms to iterate 1000000000 times. The naive version was ~3 times faster, it takes only 4746ms.
#include <math.h>
#include <time.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>
static float vec4_len(const float *v) {
__m128 vec1 = _mm_load_ps(v);
__m128 xmm1 = _mm_mul_ps(vec1, vec1);
__m128 xmm2 = _mm_hadd_ps(xmm1, xmm1);
__m128 xmm3 = _mm_hadd_ps(xmm2, xmm2);
return sqrtf(_mm_cvtss_f32(xmm3));
}
int main() {
float A[4] __attribute__((aligned(16))) = {3, 4, 0, 0};
struct timespec t0 = {};
clock_gettime(CLOCK_MONOTONIC, &t0);
double sum_len = 0;
for (uint64_t k = 0; k < 1000000000; ++k) {
A[3] = k;
sum_len += vec4_len(A);
// sum_len += sqrtf(A[0] * A[0] + A[1] * A[1] + A[2] * A[2] + A[3] * A[3]);
}
struct timespec t1 = {};
clock_gettime(CLOCK_MONOTONIC, &t1);
fprintf(stdout, "%f\n", sum_len);
fprintf(stdout, "%ldms\n", (((t1.tv_sec - t0.tv_sec) * 1000000000) + (t1.tv_nsec - t0.tv_nsec)) / 1000000);
return 0;
}
I run with the following command on an Intel(R) Core(TM) i7-8550U CPU. First with the vec4_len version then with the plain C.
I compile with GCC (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0:
gcc -Wall -Wextra -O3 -msse -msse3 sse.c -lm && ./a.out
SSE version output:
499999999500000128.000000
13458ms
Plain C version output:
499999999500000128.000000
4441ms

The most obvious problem is using an inefficient dot-product (with haddps which costs 2x shuffle uops + 1x add uop) instead of shuffle + add. See Fastest way to do horizontal float vector sum on x86 for what to do after _mm_mul_ps that doesn't suck as much. But still this is just not something x86 can do very efficiently.
But anyway, the real problem is your benchmark loop.
A[3] = k; and then using _mm_load_ps(A) creates a store-forwarding stall, if it compiles naively instead of to a vector shuffle. A store + reload can be efficiently forwarded with ~5 cycles of latency if the load only loads data from a single store instruction, and no data outside that. Otherwise it has to do a slower scan of the whole store buffer to assemble bytes. This adds about 10 cycles of latency to the store-forwarding.
I'm not sure how much impact this has on throughput, but could be enough to stop out-of-order exec from overlapping enough loop iterations to hide the latency and only bottleneck on sqrtss shuffle throughput.
(Your Coffee Lake CPU has 1 per 3 cycle sqrtss throughput, so surprisingly SQRT throughput is not your bottleneck.1 Instead it will be shuffle throughput or something else.)
See Agner Fog's microarch guide and/or optimization manual.
What does "store-buffer forwarding" mean in the Intel developer's manual?
How does store to load forwarding happens in case of unaligned memory access?
Can modern x86 implementations store-forward from more than one prior store?
Why would a compiler generate this assembly? quotes Intel's optimization manual re: store forwarding. (In that question, and old gcc version stored the 2 dword halves of an 8-byte struct separately, then copied the struct with a qword load/store. Super braindead.)
Plus you're biasing this even more against SSE by letting the compiler hoist the computation of V[0] * V[0] + V[1] * V[1] + V[2] * V[2] out of the loop.
That part of the expression is loop-invariant, so the compiler only has to do (float)k squared, add, and a scalar sqrt every loop iteration. (And convert that to double to add to your accumulator).
(#StaceyGirl's deleted answer pointed this out; looking over the code of the inner loops in it was a great start on writing this answer.)
Extra inefficiency in A[3] = k in the vector version
GCC9.1's inner loop from Kamil's Godbolt link looks terrible, and seems to include a loop-carried store/reload to merge a new A[3] into the 8-byte A[2..3] pair, further limiting the CPU's ability to overlap multiple iterations.
I'm not sure why gcc thought this was a good idea. It would maybe help on CPUs that split vector loads into 8-byte halves (like Pentium M or Bobcat) to avoid store-forwarding stalls. But that's not a sane tuning for "generic" modern x86-64 CPUs.
.L18:
pxor xmm4, xmm4
mov rdx, QWORD PTR [rsp+8] ; reload A[2..3]
cvtsi2ss xmm4, rbx
mov edx, edx ; truncate RDX to 32-bit
movd eax, xmm4 ; float bit-pattern of (float)k
sal rax, 32
or rdx, rax ; merge the float bit-pattern into A[3]
mov QWORD PTR [rsp+8], rdx ; store A[2..3] again
movaps xmm0, XMMWORD PTR [rsp] ; vector load: store-forwarding stall
mulps xmm0, xmm0
haddps xmm0, xmm0
haddps xmm0, xmm0
ucomiss xmm3, xmm0
movaps xmm1, xmm0
sqrtss xmm1, xmm1
ja .L21 ; call sqrtf to set errno if needed; flags set by ucomiss.
.L17:
add rbx, 1
cvtss2sd xmm1, xmm1
addsd xmm2, xmm1 ; total += (double)sqrtf
cmp rbx, 1000000000
jne .L18 ; }while(k<1000000000);
This insanity isn't present in the scalar version.
Either way, gcc did manage to avoid the inefficiency of a full uint64_t -> float conversion (which x86 doesn't have in hardware until AVX512). It was presumably able to prove that using a signed 64-bit -> float conversion would always work because the high bit can't be set.
Footnote 1: But sqrtps has the same 1 per 3 cycle throughput as scalar, so you're only getting 1/4 of your CPU's sqrt throughput capability by doing 1 vector at a time horizontally, instead of doing 4 lengths for 4 vectors in parallel.

A better 8x8 bytes matrix transpose with SSE?

I found this post that explains how to transpose an 8x8 bytes matrix with 24 operations, and a few scrolls later there's the code that implements the transpose. However, this method does not exploit the fact that we can block the 8x8 transpose into four 4x4 transposes, and each one can be done in one shuffle instruction only (this post is the reference). So I came out with this solution:
__m128i transpose4x4mask = _mm_set_epi8(15, 11, 7, 3, 14, 10, 6, 2, 13, 9, 5, 1, 12, 8, 4, 0);
__m128i shuffle8x8Mask = _mm_setr_epi8(0, 1, 2, 3, 8, 9, 10, 11, 4, 5, 6, 7, 12, 13, 14, 15);
void TransposeBlock8x8(uint8_t *src, uint8_t *dst, int srcStride, int dstStride) {
__m128i load0 = _mm_set_epi64x(*(uint64_t*)(src + 1 * srcStride), *(uint64_t*)(src + 0 * srcStride));
__m128i load1 = _mm_set_epi64x(*(uint64_t*)(src + 3 * srcStride), *(uint64_t*)(src + 2 * srcStride));
__m128i load2 = _mm_set_epi64x(*(uint64_t*)(src + 5 * srcStride), *(uint64_t*)(src + 4 * srcStride));
__m128i load3 = _mm_set_epi64x(*(uint64_t*)(src + 7 * srcStride), *(uint64_t*)(src + 6 * srcStride));
__m128i shuffle0 = _mm_shuffle_epi8(load0, shuffle8x8Mask);
__m128i shuffle1 = _mm_shuffle_epi8(load1, shuffle8x8Mask);
__m128i shuffle2 = _mm_shuffle_epi8(load2, shuffle8x8Mask);
__m128i shuffle3 = _mm_shuffle_epi8(load3, shuffle8x8Mask);
__m128i block0 = _mm_unpacklo_epi64(shuffle0, shuffle1);
__m128i block1 = _mm_unpackhi_epi64(shuffle0, shuffle1);
__m128i block2 = _mm_unpacklo_epi64(shuffle2, shuffle3);
__m128i block3 = _mm_unpackhi_epi64(shuffle2, shuffle3);
__m128i transposed0 = _mm_shuffle_epi8(block0, transpose4x4mask);
__m128i transposed1 = _mm_shuffle_epi8(block1, transpose4x4mask);
__m128i transposed2 = _mm_shuffle_epi8(block2, transpose4x4mask);
__m128i transposed3 = _mm_shuffle_epi8(block3, transpose4x4mask);
__m128i store0 = _mm_unpacklo_epi32(transposed0, transposed2);
__m128i store1 = _mm_unpackhi_epi32(transposed0, transposed2);
__m128i store2 = _mm_unpacklo_epi32(transposed1, transposed3);
__m128i store3 = _mm_unpackhi_epi32(transposed1, transposed3);
*((uint64_t*)(dst + 0 * dstStride)) = _mm_extract_epi64(store0, 0);
*((uint64_t*)(dst + 1 * dstStride)) = _mm_extract_epi64(store0, 1);
*((uint64_t*)(dst + 2 * dstStride)) = _mm_extract_epi64(store1, 0);
*((uint64_t*)(dst + 3 * dstStride)) = _mm_extract_epi64(store1, 1);
*((uint64_t*)(dst + 4 * dstStride)) = _mm_extract_epi64(store2, 0);
*((uint64_t*)(dst + 5 * dstStride)) = _mm_extract_epi64(store2, 1);
*((uint64_t*)(dst + 6 * dstStride)) = _mm_extract_epi64(store3, 0);
*((uint64_t*)(dst + 7 * dstStride)) = _mm_extract_epi64(store3, 1);
}
Excluding load/store operations this procedure consists of only 16 instructions instead of 24.
What am I missing?

Apart from the loads, stores and pinsrq-s to read from and write to memory, with possibly a stride not equal to 8 bytes,
you can do the transpose with only 12 instructions (this code can easily be used in combination with Z boson's test code):
void tran8x8b_SSE_v2(char *A, char *B) {
__m128i pshufbcnst = _mm_set_epi8(15,11,7,3, 14,10,6,2, 13,9,5,1, 12,8,4,0);
__m128i B0, B1, B2, B3, T0, T1, T2, T3;
B0 = _mm_loadu_si128((__m128i*)&A[ 0]);
B1 = _mm_loadu_si128((__m128i*)&A[16]);
B2 = _mm_loadu_si128((__m128i*)&A[32]);
B3 = _mm_loadu_si128((__m128i*)&A[48]);
T0 = _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(B0),_mm_castsi128_ps(B1),0b10001000));
T1 = _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(B2),_mm_castsi128_ps(B3),0b10001000));
T2 = _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(B0),_mm_castsi128_ps(B1),0b11011101));
T3 = _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(B2),_mm_castsi128_ps(B3),0b11011101));
B0 = _mm_shuffle_epi8(T0,pshufbcnst);
B1 = _mm_shuffle_epi8(T1,pshufbcnst);
B2 = _mm_shuffle_epi8(T2,pshufbcnst);
B3 = _mm_shuffle_epi8(T3,pshufbcnst);
T0 = _mm_unpacklo_epi32(B0,B1);
T1 = _mm_unpackhi_epi32(B0,B1);
T2 = _mm_unpacklo_epi32(B2,B3);
T3 = _mm_unpackhi_epi32(B2,B3);
_mm_storeu_si128((__m128i*)&B[ 0], T0);
_mm_storeu_si128((__m128i*)&B[16], T1);
_mm_storeu_si128((__m128i*)&B[32], T2);
_mm_storeu_si128((__m128i*)&B[48], T3);
}
Here we use the 32 bit floating point shuffle which is more flexible than the epi32 shuffle.
The casts do not generate extra instructions (code generated with gcc 5.4):
tran8x8b_SSE_v2:
.LFB4885:
.cfi_startproc
vmovdqu 48(%rdi), %xmm5
vmovdqu 32(%rdi), %xmm2
vmovdqu 16(%rdi), %xmm0
vmovdqu (%rdi), %xmm1
vshufps $136, %xmm5, %xmm2, %xmm4
vshufps $221, %xmm5, %xmm2, %xmm2
vmovdqa .LC6(%rip), %xmm5
vshufps $136, %xmm0, %xmm1, %xmm3
vshufps $221, %xmm0, %xmm1, %xmm1
vpshufb %xmm5, %xmm3, %xmm3
vpshufb %xmm5, %xmm1, %xmm0
vpshufb %xmm5, %xmm4, %xmm4
vpshufb %xmm5, %xmm2, %xmm1
vpunpckldq %xmm4, %xmm3, %xmm5
vpunpckldq %xmm1, %xmm0, %xmm2
vpunpckhdq %xmm4, %xmm3, %xmm3
vpunpckhdq %xmm1, %xmm0, %xmm0
vmovups %xmm5, (%rsi)
vmovups %xmm3, 16(%rsi)
vmovups %xmm2, 32(%rsi)
vmovups %xmm0, 48(%rsi)
ret
.cfi_endproc
On some, but not all, older cpus there might be a small bypass delay (between 0 and 2 cycles) for moving data between the
integer and the floating point units. This increases the latency of the function, but it does not necessarily affect the
throughput of the code.
A simple latency test with 1e9 tranpositions:
for (int i=0;i<500000000;i++){
tran8x8b_SSE(A,C);
tran8x8b_SSE(C,A);
}
print8x8b(A);
This takes about 5.5 seconds (19.7e9 cycles) with tran8x8b_SSE and 4.5 seconds (16.0e9 cycles) with tran8x8b_SSE_v2 (Intel core i5-6500). Note that
the load and stores were not eliminated by the compiler, although the functions were inlined in the for loop.
Update: AVX2-128 / SSE 4.1 solution with blends.
The 'shuffles' (unpack, shuffle) are handled by port 5, with 1 instruction per cpu cycle on modern cpus.
Sometimes it pays off to replace one 'shuffle' with two blends. On Skylake the 32 bit blend instructions can run on either port 0, 1 or 5.
Unfortunately, _mm_blend_epi32 is only AVX2-128. An efficient SSE 4.1 alternative is _mm_blend_ps in combination
with a few casts (which are usually free). The 12 'shuffles' are replaced by
8 shuffles in combination with 8 blends.
The simple latency test now runs in about 3.6 seconds (13e9 cpu cycles), which is 18 % faster than the results with tran8x8b_SSE_v2.
Code:
/* AVX2-128 version, sse 4.1 version see ----------------> SSE 4.1 version of tran8x8b_AVX2_128() */
void tran8x8b_AVX2_128(char *A, char *B) { /* void tran8x8b_SSE4_1(char *A, char *B) { */
__m128i pshufbcnst_0 = _mm_set_epi8(15, 7,11, 3,
13, 5, 9, 1, 14, 6,10, 2, 12, 4, 8, 0); /* __m128i pshufbcnst_0 = _mm_set_epi8(15, 7,11, 3, 13, 5, 9, 1, 14, 6,10, 2, 12, 4, 8, 0); */
__m128i pshufbcnst_1 = _mm_set_epi8(13, 5, 9, 1,
15, 7,11, 3, 12, 4, 8, 0, 14, 6,10, 2); /* __m128i pshufbcnst_1 = _mm_set_epi8(13, 5, 9, 1, 15, 7,11, 3, 12, 4, 8, 0, 14, 6,10, 2); */
__m128i pshufbcnst_2 = _mm_set_epi8(11, 3,15, 7,
9, 1,13, 5, 10, 2,14, 6, 8, 0,12, 4); /* __m128i pshufbcnst_2 = _mm_set_epi8(11, 3,15, 7, 9, 1,13, 5, 10, 2,14, 6, 8, 0,12, 4); */
__m128i pshufbcnst_3 = _mm_set_epi8( 9, 1,13, 5,
11, 3,15, 7, 8, 0,12, 4, 10, 2,14, 6); /* __m128i pshufbcnst_3 = _mm_set_epi8( 9, 1,13, 5, 11, 3,15, 7, 8, 0,12, 4, 10, 2,14, 6); */
__m128i B0, B1, B2, B3, T0, T1, T2, T3; /* __m128 B0, B1, B2, B3, T0, T1, T2, T3; */
/* */
B0 = _mm_loadu_si128((__m128i*)&A[ 0]); /* B0 = _mm_loadu_ps((float*)&A[ 0]); */
B1 = _mm_loadu_si128((__m128i*)&A[16]); /* B1 = _mm_loadu_ps((float*)&A[16]); */
B2 = _mm_loadu_si128((__m128i*)&A[32]); /* B2 = _mm_loadu_ps((float*)&A[32]); */
B3 = _mm_loadu_si128((__m128i*)&A[48]); /* B3 = _mm_loadu_ps((float*)&A[48]); */
/* */
B1 = _mm_shuffle_epi32(B1,0b10110001); /* B1 = _mm_shuffle_ps(B1,B1,0b10110001); */
B3 = _mm_shuffle_epi32(B3,0b10110001); /* B3 = _mm_shuffle_ps(B3,B3,0b10110001); */
T0 = _mm_blend_epi32(B0,B1,0b1010); /* T0 = _mm_blend_ps(B0,B1,0b1010); */
T1 = _mm_blend_epi32(B2,B3,0b1010); /* T1 = _mm_blend_ps(B2,B3,0b1010); */
T2 = _mm_blend_epi32(B0,B1,0b0101); /* T2 = _mm_blend_ps(B0,B1,0b0101); */
T3 = _mm_blend_epi32(B2,B3,0b0101); /* T3 = _mm_blend_ps(B2,B3,0b0101); */
/* */
B0 = _mm_shuffle_epi8(T0,pshufbcnst_0); /* B0 = _mm_castsi128_ps(_mm_shuffle_epi8(_mm_castps_si128(T0),pshufbcnst_0)); */
B1 = _mm_shuffle_epi8(T1,pshufbcnst_1); /* B1 = _mm_castsi128_ps(_mm_shuffle_epi8(_mm_castps_si128(T1),pshufbcnst_1)); */
B2 = _mm_shuffle_epi8(T2,pshufbcnst_2); /* B2 = _mm_castsi128_ps(_mm_shuffle_epi8(_mm_castps_si128(T2),pshufbcnst_2)); */
B3 = _mm_shuffle_epi8(T3,pshufbcnst_3); /* B3 = _mm_castsi128_ps(_mm_shuffle_epi8(_mm_castps_si128(T3),pshufbcnst_3)); */
/* */
T0 = _mm_blend_epi32(B0,B1,0b1010); /* T0 = _mm_blend_ps(B0,B1,0b1010); */
T1 = _mm_blend_epi32(B0,B1,0b0101); /* T1 = _mm_blend_ps(B0,B1,0b0101); */
T2 = _mm_blend_epi32(B2,B3,0b1010); /* T2 = _mm_blend_ps(B2,B3,0b1010); */
T3 = _mm_blend_epi32(B2,B3,0b0101); /* T3 = _mm_blend_ps(B2,B3,0b0101); */
T1 = _mm_shuffle_epi32(T1,0b10110001); /* T1 = _mm_shuffle_ps(T1,T1,0b10110001); */
T3 = _mm_shuffle_epi32(T3,0b10110001); /* T3 = _mm_shuffle_ps(T3,T3,0b10110001); */
/* */
_mm_storeu_si128((__m128i*)&B[ 0], T0); /* _mm_storeu_ps((float*)&B[ 0], T0); */
_mm_storeu_si128((__m128i*)&B[16], T1); /* _mm_storeu_ps((float*)&B[16], T1); */
_mm_storeu_si128((__m128i*)&B[32], T2); /* _mm_storeu_ps((float*)&B[32], T2); */
_mm_storeu_si128((__m128i*)&B[48], T3); /* _mm_storeu_ps((float*)&B[48], T3); */
} /* } */

Posting this as an answer. I'm also going to change the title of the question from "... with SSE" to "... with SIMD" due to some answers and comments received so far.
I succeeded in transposing the matrix with AVX2 in 8 instructions only, 10 including load/store (excluding masks loads). EDIT: I found a shorter version. See below. This is the case where the matrices are all contiguous in memory, so direct load/store can be used.
Here's the C code:
void tran8x8b_AVX2(char *src, char *dst) {
__m256i perm = _mm256_set_epi8(
0, 0, 0, 7,
0, 0, 0, 5,
0, 0, 0, 3,
0, 0, 0, 1,
0, 0, 0, 6,
0, 0, 0, 4,
0, 0, 0, 2,
0, 0, 0, 0
);
__m256i tm = _mm256_set_epi8(
15, 11, 7, 3,
14, 10, 6, 2,
13, 9, 5, 1,
12, 8, 4, 0,
15, 11, 7, 3,
14, 10, 6, 2,
13, 9, 5, 1,
12, 8, 4, 0
);
__m256i load0 = _mm256_loadu_si256((__m256i*)&src[ 0]);
__m256i load1 = _mm256_loadu_si256((__m256i*)&src[32]);
__m256i perm0 = _mm256_permutevar8x32_epi32(load0, perm);
__m256i perm1 = _mm256_permutevar8x32_epi32(load1, perm);
__m256i transpose0 = _mm256_shuffle_epi8(perm0, tm);
__m256i transpose1 = _mm256_shuffle_epi8(perm1, tm);
__m256i unpack0 = _mm256_unpacklo_epi32(transpose0, transpose1);
__m256i unpack1 = _mm256_unpackhi_epi32(transpose0, transpose1);
perm0 = _mm256_castps_si256(_mm256_permute2f128_ps(_mm256_castsi256_ps(unpack0), _mm256_castsi256_ps(unpack1), 32));
perm1 = _mm256_castps_si256(_mm256_permute2f128_ps(_mm256_castsi256_ps(unpack0), _mm256_castsi256_ps(unpack1), 49));
_mm256_storeu_si256((__m256i*)&dst[ 0], perm0);
_mm256_storeu_si256((__m256i*)&dst[32], perm1);
}
GCC was smart enough to perform a permutation during AVX load, saving two instructions. Here's the compiler output:
tran8x8b_AVX2(char*, char*):
vmovdqa ymm1, YMMWORD PTR .LC0[rip]
vmovdqa ymm2, YMMWORD PTR .LC1[rip]
vpermd ymm0, ymm1, YMMWORD PTR [rdi]
vpermd ymm1, ymm1, YMMWORD PTR [rdi+32]
vpshufb ymm0, ymm0, ymm2
vpshufb ymm1, ymm1, ymm2
vpunpckldq ymm2, ymm0, ymm1
vpunpckhdq ymm0, ymm0, ymm1
vinsertf128 ymm1, ymm2, xmm0, 1
vperm2f128 ymm0, ymm2, ymm0, 49
vmovdqu YMMWORD PTR [rsi], ymm1
vmovdqu YMMWORD PTR [rsi+32], ymm0
vzeroupper
ret
It emitted the vzerupper instruction with -O3, but going down to -O1 removes this.
In case of my original problem (a large matrix and I'm zooming in to an 8x8 part of it), handling strides destroys the output in a pretty bad way:
void tran8x8b_AVX2(char *src, char *dst, int srcStride, int dstStride) {
__m256i load0 = _mm256_set_epi64x(*(uint64_t*)(src + 3 * srcStride), *(uint64_t*)(src + 2 * srcStride), *(uint64_t*)(src + 1 * srcStride), *(uint64_t*)(src + 0 * srcStride));
__m256i load1 = _mm256_set_epi64x(*(uint64_t*)(src + 7 * srcStride), *(uint64_t*)(src + 6 * srcStride), *(uint64_t*)(src + 5 * srcStride), *(uint64_t*)(src + 4 * srcStride));
// ... the same as before, however we can skip the final permutations because we need to handle the destination stride...
*((uint64_t*)(dst + 0 * dstStride)) = _mm256_extract_epi64(unpack0, 0);
*((uint64_t*)(dst + 1 * dstStride)) = _mm256_extract_epi64(unpack0, 1);
*((uint64_t*)(dst + 2 * dstStride)) = _mm256_extract_epi64(unpack1, 0);
*((uint64_t*)(dst + 3 * dstStride)) = _mm256_extract_epi64(unpack1, 1);
*((uint64_t*)(dst + 4 * dstStride)) = _mm256_extract_epi64(unpack0, 2);
*((uint64_t*)(dst + 5 * dstStride)) = _mm256_extract_epi64(unpack0, 3);
*((uint64_t*)(dst + 6 * dstStride)) = _mm256_extract_epi64(unpack1, 2);
*((uint64_t*)(dst + 7 * dstStride)) = _mm256_extract_epi64(unpack1, 3);
}
Here's the compiler output:
tran8x8b_AVX2(char*, char*, int, int):
movsx rdx, edx
vmovq xmm5, QWORD PTR [rdi]
lea r9, [rdi+rdx]
vmovdqa ymm3, YMMWORD PTR .LC0[rip]
movsx rcx, ecx
lea r11, [r9+rdx]
vpinsrq xmm0, xmm5, QWORD PTR [r9], 1
lea r10, [r11+rdx]
vmovq xmm4, QWORD PTR [r11]
vpinsrq xmm1, xmm4, QWORD PTR [r10], 1
lea r8, [r10+rdx]
lea rax, [r8+rdx]
vmovq xmm7, QWORD PTR [r8]
vmovq xmm6, QWORD PTR [rax+rdx]
vpinsrq xmm2, xmm7, QWORD PTR [rax], 1
vinserti128 ymm1, ymm0, xmm1, 0x1
vpinsrq xmm0, xmm6, QWORD PTR [rax+rdx*2], 1
lea rax, [rsi+rcx]
vpermd ymm1, ymm3, ymm1
vinserti128 ymm0, ymm2, xmm0, 0x1
vmovdqa ymm2, YMMWORD PTR .LC1[rip]
vpshufb ymm1, ymm1, ymm2
vpermd ymm0, ymm3, ymm0
vpshufb ymm0, ymm0, ymm2
vpunpckldq ymm2, ymm1, ymm0
vpunpckhdq ymm0, ymm1, ymm0
vmovdqa xmm1, xmm2
vmovq QWORD PTR [rsi], xmm1
vpextrq QWORD PTR [rax], xmm1, 1
vmovdqa xmm1, xmm0
add rax, rcx
vextracti128 xmm0, ymm0, 0x1
vmovq QWORD PTR [rax], xmm1
add rax, rcx
vpextrq QWORD PTR [rax], xmm1, 1
add rax, rcx
vextracti128 xmm1, ymm2, 0x1
vmovq QWORD PTR [rax], xmm1
add rax, rcx
vpextrq QWORD PTR [rax], xmm1, 1
vmovq QWORD PTR [rax+rcx], xmm0
vpextrq QWORD PTR [rax+rcx*2], xmm0, 1
vzeroupper
ret
However, this seems not a big deal if compared against the output my original code.
EDIT: I found a shorter version. 4 instructions in total, 8 counting both load/stores. This is possible because I read the matrix in a different way, hiding some "shuffles" in the "gather" instruction during load. Also, note that the final permutation is needed to perform the store because AVX2 doesn't have a "scatter" instruction. Having a scatter instruction would bring down everything to 2 instructions only. Also, note that I can handle without hassles the src stride by changing the content of the vindex vector.
Unfortunately this AVX_v2 seems to be slower than the previous one. Here's the code:
void tran8x8b_AVX2_v2(char *src1, char *dst1) {
__m256i tm = _mm256_set_epi8(
15, 11, 7, 3,
14, 10, 6, 2,
13, 9, 5, 1,
12, 8, 4, 0,
15, 11, 7, 3,
14, 10, 6, 2,
13, 9, 5, 1,
12, 8, 4, 0
);
__m256i vindex = _mm256_setr_epi32(0, 8, 16, 24, 32, 40, 48, 56);
__m256i perm = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);
__m256i load0 = _mm256_i32gather_epi32((int*)src1, vindex, 1);
__m256i load1 = _mm256_i32gather_epi32((int*)(src1 + 4), vindex, 1);
__m256i transpose0 = _mm256_shuffle_epi8(load0, tm);
__m256i transpose1 = _mm256_shuffle_epi8(load1, tm);
__m256i final0 = _mm256_permutevar8x32_epi32(transpose0, perm);
__m256i final1 = _mm256_permutevar8x32_epi32(transpose1, perm);
_mm256_storeu_si256((__m256i*)&dst1[ 0], final0);
_mm256_storeu_si256((__m256i*)&dst1[32], final1);
}
And here's the output of the compiler:
tran8x8b_AVX2_v2(char*, char*):
vpcmpeqd ymm3, ymm3, ymm3
vmovdqa ymm2, YMMWORD PTR .LC0[rip]
vmovdqa ymm4, ymm3
vpgatherdd ymm0, DWORD PTR [rdi+4+ymm2*8], ymm3
vpgatherdd ymm1, DWORD PTR [rdi+ymm2*8], ymm4
vmovdqa ymm2, YMMWORD PTR .LC1[rip]
vpshufb ymm1, ymm1, ymm2
vpshufb ymm0, ymm0, ymm2
vmovdqa ymm2, YMMWORD PTR .LC2[rip]
vpermd ymm1, ymm2, ymm1
vpermd ymm0, ymm2, ymm0
vmovdqu YMMWORD PTR [rsi], ymm1
vmovdqu YMMWORD PTR [rsi+32], ymm0
vzeroupper
ret

A simplified one
void tp128_8x8(char *A, char *B) {
__m128i sv = _mm_set_epi8(15, 7, 14, 6, 13, 5, 12, 4, 11, 3, 10, 2, 9, 1, 8, 0);
__m128i iv[4], ov[4];
ov[0] = _mm_shuffle_epi8(_mm_loadu_si128((__m128i*)A), sv);
ov[1] = _mm_shuffle_epi8(_mm_loadu_si128((__m128i*)(A+16)), sv);
ov[2] = _mm_shuffle_epi8(_mm_loadu_si128((__m128i*)(A+32)), sv);
ov[3] = _mm_shuffle_epi8(_mm_loadu_si128((__m128i*)(A+48)), sv);
iv[0] = _mm_unpacklo_epi16(ov[0], ov[1]);
iv[1] = _mm_unpackhi_epi16(ov[0], ov[1]);
iv[2] = _mm_unpacklo_epi16(ov[2], ov[3]);
iv[3] = _mm_unpackhi_epi16(ov[2], ov[3]);
_mm_storeu_si128((__m128i*)B, _mm_unpacklo_epi32(iv[0], iv[2]));
_mm_storeu_si128((__m128i*)(B+16), _mm_unpackhi_epi32(iv[0], iv[2]));
_mm_storeu_si128((__m128i*)(B+32), _mm_unpacklo_epi32(iv[1], iv[3]));
_mm_storeu_si128((__m128i*)(B+48), _mm_unpackhi_epi32(iv[1], iv[3]));
}
Benchmark:i5-5300U 2.3GHz (cycles per byte)
tran8x8b : 2.140
tran8x8b_SSE : 1.602
tran8x8b_SSE_v2 : 1.551
tp128_8x8 : 1.535
tran8x8b_AVX2 : 1.563
tran8x8b_AVX2_v2 : 1.731

Normally when load and store instructions are not counted it's because the code is working with a matrix in register e.g. doing multiple operations in addition to the transpose in a loop. The loads and stores in this case are not counted because they are not part of the main loop.
But in your code the loads and stores (or rather sets and extracts) are doing part of the transpose.
GCC implements _mm_set_epi64x for SSE4.1 in your code with _mm_insert_epi64 and _mm_loadl_epi64. The insert instruction is doing part of the transpose i.e. the transpose starts at load0,1,2,3 not at shuffle0,1,2,3. And then your final store0,1,2,3 values don't contain the transpose either. You have to use eight _mm_extract_epi64 instructions to finish the transpose in memory. So it does not really make sense to not count the set and extract intrinsics.
In any case, it turns out you can do the transpose from register with only 16 instructions using only SSSE3 like this:
//__m128i B0, __m128i B1, __m128i B2, __m128i B3
__m128i mask = _mm_setr_epi8(0x0,0x04,0x01,0x05, 0x02,0x06,0x03,0x07, 0x08,0x0c,0x09,0x0d, 0x0a,0x0e,0x0b,0x0f);
__m128i T0, T1, T2, T3;
T0 = _mm_unpacklo_epi8(B0,B1);
T1 = _mm_unpackhi_epi8(B0,B1);
T2 = _mm_unpacklo_epi8(B2,B3);
T3 = _mm_unpackhi_epi8(B2,B3);
B0 = _mm_unpacklo_epi16(T0,T2);
B1 = _mm_unpackhi_epi16(T0,T2);
B2 = _mm_unpacklo_epi16(T1,T3);
B3 = _mm_unpackhi_epi16(T1,T3);
T0 = _mm_unpacklo_epi32(B0,B2);
T1 = _mm_unpackhi_epi32(B0,B2);
T2 = _mm_unpacklo_epi32(B1,B3);
T3 = _mm_unpackhi_epi32(B1,B3);
B0 = _mm_shuffle_epi8(T0,mask);
B1 = _mm_shuffle_epi8(T1,mask);
B2 = _mm_shuffle_epi8(T2,mask);
B3 = _mm_shuffle_epi8(T3,mask);
I'm not sure if it makes sense to exclude the loads and store here either because I'm not sure how convenient it is to work with a 8x8 byte matrix in four 128-bit registers.
Here is code testing this:
#include <stdio.h>
#include <x86intrin.h>
void print8x8b(char *A) {
for(int i=0; i<8; i++) {
for(int j=0; j<8; j++) {
printf("%2d ", A[i*8+j]);
} puts("");
} puts("");
}
void tran8x8b(char *A, char *B) {
for(int i=0; i<8; i++) {
for(int j=0; j<8; j++) {
B[j*8+i] = A[i*8+j];
}
}
}
void tran8x8b_SSE(char *A, char *B) {
__m128i mask = _mm_setr_epi8(0x0,0x04,0x01,0x05, 0x02,0x06,0x03,0x07, 0x08,0x0c,0x09,0x0d, 0x0a,0x0e,0x0b,0x0f);
__m128i B0, B1, B2, B3, T0, T1, T2, T3;
B0 = _mm_loadu_si128((__m128i*)&A[ 0]);
B1 = _mm_loadu_si128((__m128i*)&A[16]);
B2 = _mm_loadu_si128((__m128i*)&A[32]);
B3 = _mm_loadu_si128((__m128i*)&A[48]);
T0 = _mm_unpacklo_epi8(B0,B1);
T1 = _mm_unpackhi_epi8(B0,B1);
T2 = _mm_unpacklo_epi8(B2,B3);
T3 = _mm_unpackhi_epi8(B2,B3);
B0 = _mm_unpacklo_epi16(T0,T2);
B1 = _mm_unpackhi_epi16(T0,T2);
B2 = _mm_unpacklo_epi16(T1,T3);
B3 = _mm_unpackhi_epi16(T1,T3);
T0 = _mm_unpacklo_epi32(B0,B2);
T1 = _mm_unpackhi_epi32(B0,B2);
T2 = _mm_unpacklo_epi32(B1,B3);
T3 = _mm_unpackhi_epi32(B1,B3);
B0 = _mm_shuffle_epi8(T0,mask);
B1 = _mm_shuffle_epi8(T1,mask);
B2 = _mm_shuffle_epi8(T2,mask);
B3 = _mm_shuffle_epi8(T3,mask);
_mm_storeu_si128((__m128i*)&B[ 0], B0);
_mm_storeu_si128((__m128i*)&B[16], B1);
_mm_storeu_si128((__m128i*)&B[32], B2);
_mm_storeu_si128((__m128i*)&B[48], B3);
}
int main(void) {
char A[64], B[64], C[64];
for(int i=0; i<64; i++) A[i] = i;
print8x8b(A);
tran8x8b(A,B);
print8x8b(B);
tran8x8b_SSE(A,C);
print8x8b(C);
}

This was really interesting to me, and I was looking to do exactly this, but for various reasons, I ended up needing to do it in Go, instead of C, and I didn't have vector intrinsics, so I thought "well, I'll just write something and see how it does".
My reported times, on a ~3.6GHz CPU, are about 28ns per 64-byte block transposed for a naive implementation, and about 19ns each for one done using bit shifts. I used perf to confirm the numbers, which seemed a bit unlikely to me, and they seem to add up. The fancy bit shift implementation is a bit over 250 instructions, and gets about 3.6 instructions per cycle, so it comes out to about 69-70 cycles per operation.
This is Go, but honestly it should be trivial to implement; it's just treating the input array of 64 bytes as 8 uint64_t.
You can get another nanosecond or so with declaring some of these things as new variables to hint to the register allocator.
import (
"unsafe"
)
const (
hi16 = uint64(0xFFFF0000FFFF0000)
lo16 = uint64(0x0000FFFF0000FFFF)
hi8 = uint64(0xFF00FF00FF00FF00)
lo8 = uint64(0x00FF00FF00FF00FF)
)
// Okay, this might take some explaining. We are working on a logical
// 8x8 matrix of bytes, which we get as a 64-byte array. We want to transpose
// it (row/column).
//
// start:
// [[00 08 16 24 32 40 48 56]
// [01 09 17 25 33 41 49 57]
// [02 10 18 26 34 42 50 58]
// [03 11 19 27 35 43 51 59]
// [04 12 20 28 36 44 52 60]
// [05 13 21 29 37 45 53 61]
// [06 14 22 30 38 46 54 62]
// [07 15 23 31 39 47 55 63]]
//
// First, let's make sure everything under 32 is in the top four rows,
// and everything over 32 is in the bottom four rows. We do this by
// swapping pairs of 32-bit words.
// swap32:
// [[00 08 16 24 04 12 20 28]
// [01 09 17 25 05 13 21 29]
// [02 10 18 26 06 14 22 30]
// [03 11 19 27 07 15 23 31]
// [32 40 48 56 36 44 52 60]
// [33 41 49 57 37 45 53 61]
// [34 42 50 58 38 46 54 62]
// [35 43 51 59 39 47 55 63]]
//
// Next, let's make sure everything over 16 or 48 is in the bottom two
// rows of the two four-row sections, and everything under 16 or 48 is
// in the top two rows of the section. We do this by swapping masked
// pairs in much the same way:
// swap16:
// [[00 08 02 10 04 12 06 14]
// [01 09 03 11 05 13 07 15]
// [16 24 18 26 20 28 22 30]
// [17 25 19 27 21 29 23 31]
// [32 40 34 42 36 44 38 46]
// [33 41 35 43 37 45 39 47]
// [48 56 50 58 52 60 54 62]
// [49 57 51 59 53 61 55 63]]
//
// Now, we will do the same thing to each pair -- but because of
// clever choices in the specific arrange ment leading up to this, that's
// just one more byte swap, where each 2x2 block has its upper right
// and lower left corners swapped, and that turns out to be an easy
// shift and mask.
func UnswizzleLazy(m *[64]uint8) {
// m32 treats the 8x8 array as a 2x8 array, because
// it turns out we only need to swap a handful of the
// bits...
m32 := (*[16]uint32)(unsafe.Pointer(&m[0]))
m32[1], m32[8] = m32[8], m32[1]
m32[3], m32[10] = m32[10], m32[3]
m32[5], m32[12] = m32[12], m32[5]
m32[7], m32[14] = m32[14], m32[7]
m64 := (*[8]uint64)(unsafe.Pointer(&m[0]))
// we're now at the state described above as "swap32"
tmp0, tmp1, tmp2, tmp3 :=
(m64[0]&lo16)|(m64[2]&lo16)<<16,
(m64[1]&lo16)|(m64[3]&lo16)<<16,
(m64[0]&hi16)>>16|(m64[2]&hi16),
(m64[1]&hi16)>>16|(m64[3]&hi16)
tmp4, tmp5, tmp6, tmp7 :=
(m64[4]&lo16)|(m64[6]&lo16)<<16,
(m64[5]&lo16)|(m64[7]&lo16)<<16,
(m64[4]&hi16)>>16|(m64[6]&hi16),
(m64[5]&hi16)>>16|(m64[7]&hi16)
// now we're at "swap16".
lo8 := lo8
hi8 := hi8
m64[0], m64[1] = (tmp0&lo8)|(tmp1&lo8)<<8, (tmp0&hi8)>>8|tmp1&hi8
m64[2], m64[3] = (tmp2&lo8)|(tmp3&lo8)<<8, (tmp2&hi8)>>8|tmp3&hi8
m64[4], m64[5] = (tmp4&lo8)|(tmp5&lo8)<<8, (tmp4&hi8)>>8|tmp5&hi8
m64[6], m64[7] = (tmp6&lo8)|(tmp7&lo8)<<8, (tmp6&hi8)>>8|tmp7&hi8
}
What this is doing is, I hope, reasonably obvious: shuffle the half-words around, so the first four words have all the values that belong in them, and the last four have all the values that belong in them. Then do a similar thing to each set of four words, so you end up with the things that belong in the top two words in the top two, etcetera.
I wasn't going to comment until I realized that, if the cycles/byte numbers above are right, this actually outperforms the shuffle/unpack solution.
(Note that this is an in-place transpose, but it's easy to use temps for the intermediate steps and do the final store somewhere else. It's actually probably faster.)
UPDATE: I originally described my algorithm slightly incorrectly, then I realized that I could actually do what I'd described. This one's running about 65.7 cycles per 64 bits.
EDIT #2: Tried one of the above AVX versions on this machine. On my hardware (Xeon E3-1505M, nominally 3GHz), I get a little over 10 cycles per 64-byte block, so, about 6 bytes per cycle. That seems a lot more reasonable to me than 1.5 cycles per byte did.
EDIT #3: Got down a bit further, about 45 cycles per 64 bits, by just writing the first part as shifts and masks on uint64 instead of trying to be "smart" and just move the 32 bits I cared about.

AVX512VBMI (Cascade Lake / Ice Lake)
AVX512VBMI introduces vpermb, a 64-byte lane-crossing shuffle with byte granularity.
_mm512_permutexvar_epi8( __m512i idx, __m512i a);
Existing CPUs that support it run it as a single uop, with 1/clock throughput. (https://www.uops.info/html-tp/CNL/VPERMB_ZMM_ZMM_M512-Measurements.html)
That trivializes the problem, making it possible with 1 instruction (at least for the stride=8 case where the whole 8x8 block is contiguous). Otherwise you should look at vpermt2b to shuffle together bytes from 2 sources. But that's 3 uops on CannonLake.
// TODO: strided loads / stores somehow for stride != 8
// AVX512VBMI
void TransposeBlock8x8_contiguous(uint8_t *src, uint8_t *dst)
{
const __m512i trans8x8shuf = _mm512_set_epi8(
63, 63-8*1, 63-8*2, 63-8*3, 63-8*4, ...
...
57, 49, 41, 33, 25, 17, 9, 1,
56, 48, 40, 32, 24, 16, 8, 0
);
__m512i vsrc = _mm512_loadu_si512(src);
__m512i shuffled = _mm512_permutexvar_epi8(trans8x8shuf, vsrc);
_mm512_storeu_si512(dst, shuffled);
}
https://godbolt.org/z/wrfyy3
Apparently _mm512_setr_epi8 doesn't exist for gcc/clang (only the 256 and 128 versions) so you have to define the constant in last-to-first order, opposite of C array initializer order.
vpermb even works with the data as a memory source operand, so it can load+shuffle in a single instruction. But according to https://uops.info/, it doesn't micro-fuse on CannonLake: unlike vpermd zmm, zmm, [r14] which decodes to 1 fused-domain uop (note "retire_slots: 1.0")
vpermd zmm, zmm, [r14] decodes to 2 separate uops for the front-end / fused-domain: "retire_slots: 2.0"). This from experimental testing with perf counters on a real CannonLake CPU. uops.info doesn't have a Cascade Lake or Ice Lake available yet, so it's possible it will be even more efficient there.
The uops.info tables uselessly count total number of unfused-domain uops, so you have to click on an instruction to see if it micro-fuses or not.
Source or dst strided, not contiguous, data
I guess you'd want to do qword (8-byte) loads into XMM registers and shuffle together pairs of inputs, or concatenate them with movhps or pinsrq. It's possibly worth it to use a qword-gather load with strided indices, but that's often not worth it.
I'm not sure if it's worth combining as far as YMM registers, let alone ZMM, or if it's best to only get as wide as XMM registers so we can efficiently scatter qwords back to memory manually with vmovq and vmovhps (which don't need a shuffle uop, just a store, on Intel CPUs). If the dst is contiguous, merging a non-contiguous strided src makes a lot more sense.
AVX512VBMI vpermt2b ymm looks like it would be useful for shuffle+merge like a lane-crossing punpcklbw, selecting any 32 bytes from the concatenation of two other 32-byte YMM registers. (Or 64 from 2x 64-byte regs for the ZMM version). But unfortunately on CannonLake it costs 3 uops, like vpermt2w on Skylake-X and Cannon Lake.
If we can worry about bytes later, vpermt2d is efficient on CPUs that support it (single uop)! Skylake-X and later.
Ice Lake has one per 2-cycle throughput for vpermt2b (instlat), perhaps because it has an extra shuffle unit that can run some (but not all) shuffle uops. Notice for example that vpshufb xmm and ymm is 0.5c throughput, but vpshufb zmm is 1c throughput. But vpermb is always just 1c throughput.
I wonder if we can take advantage of merge-masking? Like maybe vpmovzxbq to zero-extend input bytes to qwords. (one 8-byte row -> 64-byte ZMM register). Then maybe dword left-shift with merge-masking into another register? No, that doesn't help, the useful data is in the same dword elements for both inputs unless you do something to one register first, defeating the purpose.
Overlapped byte-masked stores (vmovdqu8 [rdi + 0..7]{k1}, zmm0..7) of vpmovzxbq load results are also possible, but probably not efficient. All but one of them would be misaligned, at best. The store buffer and/or cache hardware might be able to efficiently commit 8x masked stores, though.
A hybrid strategy doing some moving around in registers and some masked-stores might be interesting to balance shuffle/blend vs. store work for a contiguous dst. Especially if all the stores can be aligned, which would require moving data around in each vector so it's in the right place.
Ice Lake has 2 store execution units. (IDK if L1d cache commit can keep up with that, or if merging in the store buffer usually helps, or if that just helps with bursts of work.)

Most answers here use a combination of different sized shuffles and permutations using _mm_shuffle_epi8, which is available only at SSSE3 and above.
A pure SSE2 implementation with 12* instruction kernel can be formed from interleaving the first 32 elements with the last 32 elements three times in a row:
void systolic_kernel(__m128i a[4]) {
__m128i a0 = _mm_unpacklo_epi8(a[0], a[2]);
__m128i a1 = _mm_unpackhi_epi8(a[0], a[2]);
__m128i a2 = _mm_unpacklo_epi8(a[1], a[3]);
__m128i a3 = _mm_unpackhi_epi8(a[1], a[3]);
a[0] = a0;
a[1] = a1;
a[2] = a2;
a[3] = a3;
}
void transpose(__m128i a[4]) {
systolic_kernel(a);
systolic_kernel(a);
systolic_kernel(a);
}
*without VEX encoding (for three operand instructions), there will be 6 potentially zero cost movdqa instructions added.
The same strategy can be more easily applied to 4x4, 16x16 transposes and more, as the calculation of the indices to be permuted and the block sizes is factored out from the equation.

Extract non-zero values from _m128i register with SSE

I have to extract non-zero values of an __m128i register.
For example I have a vector with eight unsigned shorts.
__m128i vector {40, 0, 22, 0, 0, 0, 0, 8}
I want to extract the 40, 22 and 8 with a minimal amount of SSE instructions.
The non-zero values will then be stored in an array of non zero values.
{40, 22, 8, more values from different vectors ... }
Is it possible to shuffle them or is there a good intrinsic to extract and store?

If you look at this paper, the authors describe how to use _mm_cmpestrm instruction to do basically want you want. The core of their algorithm is this (which I've modified slightly to do what you want, instead of what they want):
__m128i res_v = _mm_cmpestrm(
vector,
8,
mm_setzero_si128(),
8,
_SIDD_UWORD_OPS|_SIDD_CMP_EQUAL_ANY|_SIDD_BIT_MASK|_SIDD_NEGATIVE_POLARITY);
int r = _mm_extract_epi32(res_v, 0);
__m128i p = _mm_shuffle_epi8(vector, sh_mask[r]);
If you build the look-up table sh_mask as described in the paper, then p should have the non-zero elements (without any re-ordering) followed by the zero elements. The number of bits set in r will tell you the number of non-zero elements.
_mm_cmpestrm is in SSE4, unfortunately.

Based on anjruu's answer, here's an SSSE3 version that has not been tested in any way:
; xmm0 = input
pxor xmm1, xmm1
pcmpeqb xmm1, xmm0
pmovmskb eax, xmm1
shl eax, 4
pshufb xmm0, [table + eax]
The table is different of course, but not that hard to work out, just keep in mind that the index is "inverted" - eg index 0 corresponds to having no zeros and 0xFFFF corresponds to all zeros.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight