SIMD unpack 12-bit fields to 16-bit - c

I need to unpack two 16-bit values from each 24 bits of input. (3 bytes -> 4 bytes). I already did it the naïve way but I'm not happy with the performance.
For example, InBuffer is __m128i:
value1 = (uint16_t)InBuffer[0:11] // bit-ranges
value2 = (uint16_t)InBuffer[12:24]
value3 = (uint16_t)InBuffer[25:36]
value4 = (uint16_t)InBuffer[37:48]
... for all the 128 bits.
After the unpacking, The values should be stored in __m256i variable.
How can I solve this with AVX2? Probably using unpack / shuffle / permute intrinsics?

I'm assuming you're doing this in a loop over a large array. If you only used __m128i loads, you'd have 15 useful bytes, which would only produce 20 output bytes in your __m256i output. (Well, I guess the 21st byte of output would be present, as the 16th byte of the input vector, the first 8 bytes of a new bitfield. But then your next vector would need to shuffle differently.)
Much better to use 24 bytes of input, producing 32 bytes of output. Ideally with a load that splits down the middle, so the low 12 bytes are in the low 128-bit "lane", avoiding the need for a lane-crossing shuffle like _mm256_permutexvar_epi32. Instead you can just _mm256_shuffle_epi8 to put bytes where you want them, setting up for some shift/and.
// uses 24 bytes starting at p by doing a 32-byte load from p-4.
// Don't use this for the first vector of a page-aligned array, or the last
inline
__m256i unpack12to16(const char *p)
{
__m256i v = _mm256_loadu_si256( (const __m256i*)(p-4) );
// v= [ x H G F E | D C B A x ] where each letter is a 3-byte pair of two 12-bit fields, and x is 4 bytes of garbage we load but ignore
const __m256i bytegrouping =
_mm256_setr_epi8(4,5, 5,6, 7,8, 8,9, 10,11, 11,12, 13,14, 14,15, // low half uses last 12B
0,1, 1,2, 3,4, 4,5, 6, 7, 7, 8, 9,10, 10,11); // high half uses first 12B
v = _mm256_shuffle_epi8(v, bytegrouping);
// each 16-bit chunk has the bits it needs, but not in the right position
// in each chunk of 8 nibbles (4 bytes): [ f e d c | d c b a ]
__m256i hi = _mm256_srli_epi16(v, 4); // [ 0 f e d | xxxx ]
__m256i lo = _mm256_and_si256(v, _mm256_set1_epi32(0x00000FFF)); // [ 0000 | 0 c b a ]
return _mm256_blend_epi16(lo, hi, 0b10101010);
// nibbles in each pair of epi16: [ 0 f e d | 0 c b a ]
}
// Untested: I *think* I got my shuffle and blend controls right, but didn't check.
It compiles like this (Godbolt) with clang -O3 -march=znver2. Of course an inline version would load the vector constants once, outside a loop.
unpack12to16(char const*): # #unpack12to16(char const*)
vmovdqu ymm0, ymmword ptr [rdi - 4]
vpshufb ymm0, ymm0, ymmword ptr [rip + .LCPI0_0] # ymm0 = ymm0[4,5,5,6,7,8,8,9,10,11,11,12,13,14,14,15,16,17,17,18,19,20,20,21,22,23,23,24,25,26,26,27]
vpsrlw ymm1, ymm0, 4
vpand ymm0, ymm0, ymmword ptr [rip + .LCPI0_1]
vpblendw ymm0, ymm0, ymm1, 170 # ymm0 = ymm0[0],ymm1[1],ymm0[2],ymm1[3],ymm0[4],ymm1[5],ymm0[6],ymm1[7],ymm0[8],ymm1[9],ymm0[10],ymm1[11],ymm0[12],ymm1[13],ymm0[14],ymm1[15]
ret
On Intel CPUs (before Ice Lake) vpblendw only runs on port 5 (https://uops.info/), competing with vpshufb (...shuffle_epi8). But it's a single uop (unlike vpblendvb variable-blend) with an immediate control. Still, that means a back-end ALU bottleneck of at best one vector per 2 cycles on Intel. If your src and dst are hot in L2 cache (or maybe only L1d), that might be the bottleneck, but this is already 5 uops for the front end, so with loop overhead and a store you're already close to a front-end bottleneck.
Blending with another vpand / vpor would cost more front-end uops but would mitigate the back-end bottleneck on Intel (before Ice Lake). It would be worse on AMD, where vpblendw can run on any of the 4 FP execution ports, and worse on Ice Lake where vpblendw can run on p1 or p5. And like I said, cache load/store throughput might be a bigger bottleneck than port 5 anyway, so fewer front-end uops are definitely better to let out-of-order exec see farther.
This may not be optimal; perhaps there's some way to set up for vpunpcklwd by getting the even (low) and odd (high) bit fields into the bottom 8 bytes of two separate input vectors even more cheaply? Or set up so we can blend with OR instead of needing to clear garbage in one input with vpblendw which only runs on port 5 on Skylake?
Or something we can do with vpsrlvd? (But not vpsrlvw - that would require AVX-512).
If you have AVX512VBMI, vpmultishiftqb is a parallel bitfield-extract. You'd just need to shuffle the right 3-byte pairs into the right 64-bit SIMD elements, then one _mm256_multishift_epi64_epi8 to put the good bits where you want them, and a _mm256_and_si256 to zero the high 4 bits of each 16-bit field will do the trick. (Can't quite take care of everything with 0-masking, or shuffling some zeros into the input for multishift, because there won't be any contiguous with the low 12-bit field.) Or you could set up for just an srli_epi16 that works for both low and high, instead of needing an AND constant, by having the multishift bitfield-extract line up both output fields with the bits you want at the top of the 16-bit element.
This may also allow a shuffle with larger granularity than bytes, although vpermb is actually fast on CPUs with AVX512VBMI, and unfortunately Ice Lake's vpermw is slower than vpermb.
With AVX-512 but not AVX512VBMI, working in 256-bit chunks lets us do the same thing as AVX2 but avoiding the blend. Instead, use merge-masking for the right shift, or vpsrlvw with a control vector to only shift the odd elements. For 256-bit vectors, this is probably as good as vpmultishiftqb.

Related

Extract 10bits words from bitstream

I need to extract all 10-bit words from a raw bitstream whitch is built as ABACABACABAC...
It already works with a naive C implementation like
for(uint8_t *ptr = in_packet; ptr < max; ptr += 5){
const uint64_t val =
(((uint64_t)(*(ptr + 4))) << 32) |
(((uint64_t)(*(ptr + 3))) << 24) |
(((uint64_t)(*(ptr + 2))) << 16) |
(((uint64_t)(*(ptr + 1))) << 8) |
(((uint64_t)(*(ptr + 0))) << 0) ;
*a_ptr++ = (val >> 0);
*b_ptr++ = (val >> 10);
*a_ptr++ = (val >> 20);
*c_ptr++ = (val >> 30);
}
But performance is inadequate for my application so I would like to improve this using some AVX2 optimisations.
I visited the website https://software.intel.com/sites/landingpage/IntrinsicsGuide/# to find any functions that can help but it seems there is nothing to works with 10-bit words, only 8 or 16-bit. That seems logical since 10-bit is not native for a processor, but it make things hard for me.
Is there any way to use AVX2 to solve this problem?
Your scalar loop does not compile efficiently. Compilers do it as 5 separate byte loads. You can express an unaligned 8-byte load in C++ with memcpy:
#include <stdint.h>
#include <string.h>
// do an 8-byte load that spans the 5 bytes we want
// clang auto-vectorizes using an AVX2 gather for 4 qwords. Looks pretty clunky but not terrible
void extract_10bit_fields_v2calar(const uint8_t *__restrict src,
uint16_t *__restrict a_ptr, uint16_t *__restrict b_ptr, uint16_t *__restrict c_ptr,
const uint8_t *max)
{
for(const uint8_t *ptr = src; ptr < max; ptr += 5){
uint64_t val;
memcpy(&val, ptr, sizeof(val));
const unsigned mask = (1U<<10) - 1; // unused in original source!?!
*a_ptr++ = (val >> 0) & mask;
*b_ptr++ = (val >> 10) & mask;
*a_ptr++ = (val >> 20) & mask;
*c_ptr++ = (val >> 30) & mask;
}
}
ICC and clang auto-vectorize your 1-byte version, but do a very bad job (lots of insert/extract of single bytes). Here's your original and this function on Godbolt (with gcc and clang -O3 -march=skylake)
None of those 3 compilers are really close to what we can do manually.
Manual vectorization
My current AVX2 version of this answer forgot a detail: there are only 3 kinds of fields ABAC, not ABCD like 10-bit RGBA pixels. So I have a version of this which unpacks to 4 separate output streams (which I'll leave in because of the packed-RGBA use-case if I ever add a dedicated version for the ABAC interleave).
The existing version can use vpunpcklwd to interleave the two A parts instead of storing with separate vmovq should work for your case. There might be something more efficient, IDK.
BTW, I find it easier to remember and type instruction mnemonics, not intrinsic names. Intel's online intrinsics guide is searchable by instruction mnemonic.
Observations about your layout:
Each field spans one byte boundary, never two, so it's possible to assemble any 4 pairs of bytes in a qword that hold 4 complete fields.
Or with a byte shuffle, to create 2-byte words that each have a whole field at some offset. (e.g. for AVX512BW vpsrlvw, or for AVX2 2x vpsrld + word-blend.) A word shuffle like AVX512 vpermw would not be sufficient: some individual bytes need to be duplicated with the start of one field and end of another. I.e the source positions aren't all aligned words, especially when you have 2x 5 bytes inside the same 16-byte "lane" of a vector.
00-07|08-15|16-23|24-31|32-39 byte boundaries (8-bit)
00...09|10..19|20...29|30..39 field boundaries (10-bit)
Luckily 8 and 10 have a GCD of 2 which is >= 10-8=2. 8*5 = 4*10 so we don't get all possible start positions, e.g. never a field starting at the last bit of 1 byte, spanning another byte, and including the first bit of a 3rd byte.
Possible AVX2 strategy: unaligned 32-byte load that leave 2x 5 bytes at the top of the low lane, and 2x 5 bytes at the bottom of the high lane. Then vpshufb in-lane shuffle to set up for 2x vpsrlvd variable-count shifts, and a blend.
Quick summary of a new idea I haven't expanded yet.
Given an input of xxx a0B0A0C0 a1B1A1C1 | a2B2A2C2 a3B3A3C3 from our unaligned load, we can get a result of
a0 A0 a1 A1 B0 B1 C0 C1 | a2 A2 a3 A3 B2 B3 C2 C3 with the right choice of vpshufb control.
Then a vpermd can put all of those 32-bit groups into the right order, with all the A elements in the high half (ready for a vextracti128 to memory), and the B and C in the low half (ready for vmovq / vmovhps stores).
Use different vpermd shuffles for adjacent pairs so we can vpblendd to merge them for 128-bit B and C stores.
Old version, probably worse than unaligned load + vpshufb.
With AVX2, one option is to broadcast the containing 64-bit element to all positions in a vector and then use variable-count right shifts to get the bits to the bottom of a dword element.
You probably want to do a separate 64-bit broadcast-load for each group (thus partially overlapping with the previous), instead of trying to pick apart a __m256i of contiguous bits. (Broadcast-loads are cheap, shuffling is expensive.)
After _mm256_srlvd_epi64, then AND to isolate the low 10 bits in each qword.
Repeat that 4 times for 4 vectors of input, then use _mm256_packus_epi32 to do in-lane packing down to 32-bit then 16-bit elements.
That's the simple version. Optimizations of the interleaving are possible, e.g. by using left or right shifts to set up for vpblendd instead of a 2-input shuffle like vpackusdw or vshufps. _mm256_blend_epi32 is very efficient on existing CPUs, running on any port.
This also allows delaying the AND until after the first packing step because we don't need to avoid saturation from high garbage.
Design notes:
shown as 32-bit chunks after variable-count shifts
[0 d0 0 c0 | 0 b0 0 a0] # after an AND mask
[0 d1 0 c1 | 0 b1 0 a1]
[0 d1 0 c1 0 d0 0 c0 | 0 b1 0 a1 0 b0 0 a0] # vpackusdw
shown as 16-bit elements but actually the same as what vshufps can do
---------
[X d0 X c0 | X b0 X a0] even the top element is only garbage right shifted by 30, not quite zero
[X d1 X c1 | X b1 X a1]
[d1 c1 d0 c0 | b1 a1 b0 a0 ] vshufps (can't do d1 d0 c1 c0 unfortunately)
---------
[X d0 X c0 | X b0 X a0] variable-count >> qword
[d1 X c1 X | b1 X a1 0] variable-count << qword
[d1 d0 c1 c0 | b1 b0 a1 a0] vpblendd
This last trick extends to vpblendw, allowing us to do everything with interleaving blends, no shuffle instructions at all, resulting in the outputs we want contiguous and in the right order in qwords of a __m256i.
x86 SIMD variable-count shifts can only be left or right for all elements, so we need to make sure that all the data is either left or right of the desired position, not some of each within the same vector. We could use an immediate-count shift to set up for this, but even better is to just adjust the byte-address we load from. For loads after the first, we know it's safe to load some of the bytes before the first bitfield we want (without touching an unmapped page).
# as 16-bit elements
[X X X d0 X X X c0 | ...] variable-count >> qword
[X X d1 X X X c1 X | ...] variable-count >> qword from an offset load that started with the 5 bytes we want all to the left of these positions
[X d2 X X X c2 X X | ...] variable-count << qword
[d3 X X X c3 X X X | ...] variable-count << qword
[X d2 X d0 X c2 X c0 | ...] vpblendd
[d3 X d1 X c3 X c1 X | ...] vpblendd
[d3 d2 d1 d0 c3 c2 c1 c0 | ...] vpblendw (Same behaviour in both high and low lane)
Then mask off the high garbage inside each 16-bit word
Note: this does 4 separate outputs, like ABCD or RGBA->planar, not ABAC.
// potentially unaligned 64-bit broadcast-load, hopefully vpbroadcastq. (clang: yes, gcc: no)
// defeats gcc/clang folding it into an AVX512 broadcast memory source
// but vpsllvq's ymm/mem operand is the shift count, not data
static inline
__m256i bcast_load64(const uint8_t *p) {
// hopefully safe with strict-aliasing since the deref is inside an intrinsic?
__m256i bcast = _mm256_castpd_si256( _mm256_broadcast_sd( (const double*)p ) );
return bcast;
}
// UNTESTED
// unpack 10-bit fields from 4x 40-bit chunks into 16-bit dst arrays
// overreads past the end of the last chunk by 1 byte
// for ABCD repeating, not ABAC, e.g. packed 10-bit RGBA
void extract_10bit_fields_4output(const uint8_t *__restrict src,
uint16_t *__restrict da, uint16_t *__restrict db, uint16_t *__restrict dc, uint16_t *__restrict dd,
const uint8_t *max)
{
// FIXME: cleanup loop for non-whole-vectors at the end
while( src<max ){
__m256i bcast = bcast_load64(src); // data we want is from bits [0 to 39], last starting at 30
__m256i ext0 = _mm256_srlv_epi64(bcast, _mm256_set_epi64x(30, 20, 10, 0)); // place at bottome of each qword
bcast = bcast_load64(src+5-2); // data we want is from bits [16 to 55], last starting at 30+16 = 46
__m256i ext1 = _mm256_srlv_epi64(bcast, _mm256_set_epi64x(30, 20, 10, 0)); // place it at bit 16 in each qword element
bcast = bcast_load64(src+10); // data we want is from bits [0 to 39]
__m256i ext2 = _mm256_sllv_epi64(bcast, _mm256_set_epi64x(2, 12, 22, 32)); // place it at bit 32 in each qword element
bcast = bcast_load64(src+15-2); // data we want is from bits [16 to 55], last field starting at 46
__m256i ext3 = _mm256_sllv_epi64(bcast, _mm256_set_epi64x(2, 12, 22, 32)); // place it at bit 48 in each qword element
__m256i blend20 = _mm256_blend_epi32(ext0, ext2, 0b10101010); // X d2 X d0 X c2 X c0 | X b2 ...
__m256i blend31 = _mm256_blend_epi32(ext1, ext3, 0b10101010); // d3 X d1 X c3 X c1 X | b3 X ...
__m256i blend3210 = _mm256_blend_epi16(blend20, blend31, 0b10101010); // d3 d2 d1 d0 c3 c2 c1 c0
__m256i res = _mm256_and_si256(blend3210, _mm256_set1_epi16((1U<<10) - 1) );
__m128i lo = _mm256_castsi256_si128(res);
__m128i hi = _mm256_extracti128_si256(res, 1);
_mm_storel_epi64((__m128i*)da, lo); // movq store of the lowest 64 bits
_mm_storeh_pi((__m64*)db, _mm_castsi128_ps(lo)); // movhps store of the high half of the low 128. Efficient: no shuffle uop needed on Intel CPUs
_mm_storel_epi64((__m128i*)dc, hi);
_mm_storeh_pi((__m64*)dd, _mm_castsi128_ps(hi)); // clang pessmizes this to vpextrq :(
da += 4;
db += 4;
dc += 4;
dd += 4;
src += 4*5;
}
}
This compiles (Godbolt) to about 21 front-end uops (on Skylake) in the loop per 4 groups of 4 fields. (Including has a useless register copy for _mm256_castsi256_si128 instead of just using the low half of ymm0 = xmm0). This will be very good on Skylake. There's a good balance of uops for different ports, and variable-count shift is 1 uop for either p0 or p1 on SKL (vs. more expensive previously). The bottleneck might be just the front-end limit of 4 fused-domain uops per clock.
Replays of cache-line-split loads will happen because the unaligned loads will sometimes cross a 64-byte cache-line boundary. But that's just in the back-end, and we have a few spare cycles on ports 2 and 3 because of the front-end bottleneck (4 loads and 4 stores per set of results, with indexed stores which thus can't use port 7). If dependent ALU uops have to get replayed as well, we might start seeing back-end bottlenecks.
Despite the indexed addressing modes, there won't be unlamination because Haswell and later can keep indexed stores micro-fused, and the broadcast loads are a single pure uop anyway, not micro-fused ALU+load.
On Skylake, it can maybe come close to 4x 40-bit groups per 5 clock cycles, if memory bandwidth isn't a bottleneck. (e.g. with good cache blocking.) Once you factor in overhead and cost of cache-line-split loads causing occasional stalls, maybe 1.5 cycles per 40 bits of input, i.e. 6 cycles per 20 bytes of input on Skylake.
On other CPUs (Haswell and Ryzen), the variable-count shifts will be a bottleneck, but you can't really do anything about that. I don't think there's anything better. On HSW it's 3 uops: p5 + 2p0. On Ryzen it's only 1 uop, but it only has 1 per 2 clock throughput (for the 128-bit version), or per 4 clocks for the 256-bit version which costs 2 uops.
Beware that clang pessmizes the _mm_storeh_pi store to vpextrq [mem], xmm, 1: 2 uops, shuffle + store. (Instead of vmovhps : pure store on Intel, no ALU). GCC compiles it as written.
I used _mm256_broadcast_sd even though I really want vpbroadcastq just because there's an intrinsic that takes a pointer operand instead of __m256i (because with AVX1, only the memory-source version existed. But with AVX2, register-source versions of all the broadcast instructions exist). To use _mm256_set1_epi64, I'd have to write pure C that didn't violate strict aliasing (e.g. with memcpy) to do an unaligned uint64_t load. I don't think it will hurt performance to use an FP broadcast load on current CPUs, though.
I'm hoping _mm256_broadcast_sd allows its source operand to alias anything without C++ strict-aliasing undefined behaviour, the same way _mm256_loadu_ps does. Either way it will work in practice if it doesn't inline into a function that stores into *src, and maybe even then. So maybe a memcpy unaligned load would have made more sense!
I've had bad results in the past with getting compilers to emit pmovzxdw xmm0, [mem] from code like _mm_cvtepu16_epi32( _mm_loadu_si64(ptr) ); you often get an actual movq load + reg-reg pmovzx. That's why I didn't try that _mm256_broadcastq_epi64(__m128i).
Old idea; if we already need a byte shuffle we might as well use plain word shifts instead of vpmultishift.
With AVX512VBMI (IceLake, CannonLake), you might want vpmultishiftqb. Instead of broadcasting / shifting one group at a time, we can do all the work for a whole vector of groups after putting the right bytes in the right places first.
You'd still need/want a version for CPUs with some AVX512 but not AVX512VBMI (e.g. Skylake-avx512). Probably vpermd + vpshufb can get the bytes we need into the 128-bit lanes we want.
I don't think we can get away with using only dword-granularity shifts to allow merge-masking instead of dword blend after qword shift. We might be able to merge-mask a vpblendw though, saving a vpblendd
IceLake has 1/clock vpermw and vpermb, single-uop. (It has a 2nd shuffle unit on another port that handles some shuffle uops). So we can load a full vector that contains 4 or 8 groups of 4 elements and shuffle every byte into place efficiently. I think every CPU that has vpermb has it single-uop. (But that's only Ice Lake and the limited-release Cannon Lake).
vpermt2w (to combine 16-bit element from 2 vectors into any order) is one per 2 clock throughput. (InstLatx64 for IceLake-Y), so unfortunately it's not as efficient as the one-vector shuffles.
Anyway, you might use it like this:
64-byte / 512-bit load (includes some over-read at the end from 8x 8-byte groups instead of 8x 5-byte groups. Optionally use a zero-masked load to make this safe near the end of an array thanks to fault suppression)
vpermb to put the 2 bytes containing each field into desired final destination position.
vpsrlvw + vpandq to extract each 10-bit field into a 16-bit word
That's about 4 uops, not including the stores.
You probably want the high half containing the A elements for a contiguous vextracti64x4 and the low half containing the B and C elements for vmovdqu and vextracti128 stores.
Or for 2x vpblenddd to set up for 256-bit stores. (Use 2 different vpermb vectors to create 2 different layouts.)
You shouldn't need vpermt2w or vpermt2d to combine adjacent vectors for wider stores.
Without AVX512VBMI, probably a vpermd + vpshufb can get all the necessary bytes into each 128-bit chunk instead of vpermb. The rest of it only requires AVX512BW which Skylake-X has.

How to square two complex doubles with 256-bit AVX vectors?

Matt Scarpino gives a good explanation (although he admits he's not sure it's the optimal algorithm, I offer him my gratitude) for how to multiply two complex doubles with Intel's AVX intrinsics. Here's his method, which I've verified:
__m256d vec1 = _mm256_setr_pd(4.0, 5.0, 13.0, 6.0);
__m256d vec2 = _mm256_setr_pd(9.0, 3.0, 6.0, 7.0);
__m256d neg = _mm256_setr_pd(1.0, -1.0, 1.0, -1.0);
/* Step 1: Multiply vec1 and vec2 */
__m256d vec3 = _mm256_mul_pd(vec1, vec2);
/* Step 2: Switch the real and imaginary elements of vec2 */
vec2 = _mm256_permute_pd(vec2, 0x5);
/* Step 3: Negate the imaginary elements of vec2 */
vec2 = _mm256_mul_pd(vec2, neg);
/* Step 4: Multiply vec1 and the modified vec2 */
__m256d vec4 = _mm256_mul_pd(vec1, vec2);
/* Horizontally subtract the elements in vec3 and vec4 */
vec1 = _mm256_hsub_pd(vec3, vec4);
/* Display the elements of the result vector */
double* res = (double*)&vec1;
printf("%lf %lf %lf %lf\n", res[0], res[1], res[2], res[3]);
My problem is that I want to square two complex doubles. I tried to use Matt's technique like so:
struct cmplx a;
struct cmplx b;
a.r = 2.5341;
a.i = 1.843;
b.r = 1.3941;
b.i = 0.93;
__m256d zzs = squareZ(a, b);
double* res = (double*) &zzs;
printf("\nA: %f + %f, B: %f + %f\n", res[0], res[1], res[2], res[3]);
Using Haskell's complex arithmetic, I have verified the results are correct except, as you can see, the real part of B:
A: 3.025014 + 9.340693, B: 0.000000 + 2.593026
So I have two questions really: is there a better (simpler and/or faster) way to square two complex doubles with AVX intrinsics? If not, how can I modify Matt's code to do it?
This answer covers the general case of multiplying two arrays of complex numbers
Ideally, store your data in separate real and imaginary arrays, so you can just load contiguous vectors of real and imaginary parts. That makes it free to do the cross-multiplying (just use different registers / variables) instead of having to shuffle things around within a vector.
You can convert between interleaved double complex style and SIMD-friendly separate-vectors style on the fly fairly cheaply, subject to the vagaries of AVX in-lane shuffles. e.g. very cheaply with unpacklo / unpackhi shuffles to de-interleave or to re-interleave within a lane, if you don't care about the actual order of the data within the temporary vector.
It's actually so cheap to do this shuffle that doing it on the fly for a single complex multiply comes out somewhat ahead of (even a tweaked version of) Matt's code, especially on CPUs that support FMA. This requires producing results in groups of 4 complex doubles (2 result vectors).
If you need to produce only one result vector at a time, I also came up with an alternative to Matt's algorithm that can use FMA (actually FMADDSUB) and avoid the separate sign-change insn.
gcc auto-vectorizes simple complex multiply scalar loop to pretty good code, as long as you use -ffast-math. It deinterleaves like I suggested.
#include <complex.h>
// even with -ffast-math -ffp-contract=fast, clang doesn't manage to use vfmaddsubpd, instead using vmulpd and vaddsubpd :(
// gcc does use FMA though.
// auto-vectorizes with a lot of extra shuffles
void cmul(double complex *restrict dst,
const double complex *restrict A, const double complex *restrict B)
{ // clang and gcc change strategy slightly for i<1 or i<2, vs. i<4
for (int i=0; i<4 ; i++) {
dst[i] = A[i] * B[i];
}
}
See the asm on the Godbolt compiler explorer. I'm not sure how good clang's asm is; it uses a lot of 64b->128b VMODDDUP broadcast-loads. This form is handled purely in the load ports on Intel CPUs (see Agner Fog's insn tables), but it's still a lot of operations. As mentioned earlier, gcc uses 4 VPERMPD shuffles to reorder within lanes before multiplying / FMA, then another 4 VPERMPD to reorder the results before combining them with VSHUFPD. This is 8 extra shuffles for 4 complex multiplies.
Converting gcc's version back to intrinsics and removing the redundant shuffles gives optimal code. (gcc apparently wants its temporaries to be in A B C D order instead of the A C B D order resulting from the in-lane behaviour of VUNPCKLPD (_mm256_unpacklo_pd)).
I put the code on Godbolt, along with a tweaked version of Matt's code. So you can play around with different compiler options, and also different compiler versions.
// multiplies 4 complex doubles each from A and B, storing the result in dst[0..3]
void cmul_manualvec(double complex *restrict dst,
const double complex *restrict A, const double complex *restrict B)
{
// low element first, little-endian style
__m256d A0 = _mm256_loadu_pd((double*)A); // [A0r A0i A1r A1i ] // [a b c d ]
__m256d A2 = _mm256_loadu_pd((double*)(A+2)); // [e f g h ]
__m256d realA = _mm256_unpacklo_pd(A0, A2); // [A0r A2r A1r A3r ] // [a e c g ]
__m256d imagA = _mm256_unpackhi_pd(A0, A2); // [A0i A2i A1i A3i ] // [b f d h ]
// the in-lane behaviour of this interleaving is matched by the same in-lane behaviour when we recombine.
__m256d B0 = _mm256_loadu_pd((double*)B); // [m n o p]
__m256d B2 = _mm256_loadu_pd((double*)(B+2)); // [q r s t]
__m256d realB = _mm256_unpacklo_pd(B0, B2); // [m q o s]
__m256d imagB = _mm256_unpackhi_pd(B0, B2); // [n r p t]
// desired: real=rArB - iAiB, imag=rAiB + rBiA
__m256d realprod = _mm256_mul_pd(realA, realB);
__m256d imagprod = _mm256_mul_pd(imagA, imagB);
__m256d rAiB = _mm256_mul_pd(realA, imagB);
__m256d rBiA = _mm256_mul_pd(realB, imagA);
// gcc and clang will contract these nto FMA. (clang needs -ffp-contract=fast)
// Doing it manually would remove the option to compile for non-FMA targets
__m256d real = _mm256_sub_pd(realprod, imagprod); // [D0r D2r | D1r D3r]
__m256d imag = _mm256_add_pd(rAiB, rBiA); // [D0i D2i | D1i D3i]
// interleave the separate real and imaginary vectors back into packed format
__m256d dst0 = _mm256_shuffle_pd(real, imag, 0b0000); // [D0r D0i | D1r D1i]
__m256d dst2 = _mm256_shuffle_pd(real, imag, 0b1111); // [D2r D2i | D3r D3i]
_mm256_storeu_pd((double*)dst, dst0);
_mm256_storeu_pd((double*)(dst+2), dst2);
}
Godbolt asm output: gcc6.2 -O3 -ffast-math -ffp-contract=fast -march=haswell
vmovupd ymm0, YMMWORD PTR [rsi+32]
vmovupd ymm3, YMMWORD PTR [rsi]
vmovupd ymm1, YMMWORD PTR [rdx]
vunpcklpd ymm5, ymm3, ymm0
vunpckhpd ymm3, ymm3, ymm0
vmovupd ymm0, YMMWORD PTR [rdx+32]
vunpcklpd ymm4, ymm1, ymm0
vunpckhpd ymm1, ymm1, ymm0
vmulpd ymm2, ymm1, ymm3
vmulpd ymm0, ymm4, ymm3
vfmsub231pd ymm2, ymm4, ymm5 # separate mul/sub contracted into FMA
vfmadd231pd ymm0, ymm1, ymm5
vunpcklpd ymm1, ymm2, ymm0
vunpckhpd ymm0, ymm2, ymm0
vmovupd YMMWORD PTR [rdi], ymm1
vmovupd YMMWORD PTR [rdi+32], ymm0
vzeroupper
ret
For 4 complex multiplies (of 2 pairs of input vectors), my code uses:
4 loads (32B each)
2 stores (32B each)
6 in-lane shuffles (one for each input vector, one for each output)
2 VMULPD
2 VFMA...something
(only 4 shuffles if we can use the results in separated real and imag vectors, or 0 shuffles if the inputs are already in this format, too)
latency on Intel Skylake (not counting loads/stores): 14 cycles = 4c for 4 shuffles until the second VMULPD can start + 4 cycles (second VMULPD) + 4c (second vfmadd231pd) + 1c (shuffle first result vector ready 1c earlier) + 1c (shuffle second result vector)
So for throughput, this completely bottlenecks on the shuffle port. (1 shuffle per clock throughput, vs. 2 total MUL/FMA/ADD per clock on Intel Haswell and later). This is why packed storage is horrible: shuffles have limited throughput, and spending more instructions shuffling than on doing math is not good.
Matt Scarpino's code with my minor tweaks (repeated to do 4 complex multiplies). (See below for my rewrite of producing one vector at a time more efficiently).
the same 6 loads/stores
6 in-lane shuffles (HSUBPD is 2 shuffles and a subtract on current Intel and AMD CPUs)
4 multiplies
2 subtracts (which can't combine with the muls into FMAs)
An extra instruction (+ a constant) to flip the sign of the imaginary elements. Matt chose to multiply by 1.0 or -1.0, but the efficient choice is to XOR the sign bit (i.e. XORPD with -0.0).
latency on Intel Skylake for the first result vector: 11 cycles. 1c(vpermilpd and vxorpd in the same cycle) + 4c(second vmulpd) + 6c(vhsubpd). The first vmulpd overlaps with other ops, starting in the same cycle as the shuffle and vxorpd. Computation of a second result vector should interleave pretty nicely.
The major advantage of Matt's code is that it works with just one vector-width of complex multiplies at once, instead of requiring you to have 4 input vectors of data. It has somewhat lower latency. But note that my version doesn't need the 2 pairs of input vectors to be from contiguous memory, or related to each other at all. They get mixed together while processing, but the result is 2 separate 32B vectors.
My tweaked version of Matt's code is nearly as good (as the 4-at-a-time version) on CPUs without FMA (just costing an extra VXORPD), but significantly worse when it stops us from taking advantage of FMA. Also, it never has the results available in non-packed form, so you can't use the separated form as input to another multiply and skip the shuffling.
One vector result at a time, with FMA:
Don't use this if you're sometimes squaring, instead of multiplying two different complex numbers. This is like Matt's algorithm in that common subexpression elimination doesn't simplify.
I haven't typed in the C intrinsics for this, just worked out the data movement. Since all the shuffles are in-lane, I'll only show the low lane. Use the 256b versions of the relevant instructions to do the same shuffle in both lanes. They stay separate.
// MULTIPLY: for each AVX lane: am-bn, an+bm
r i r i
a b c d // input vectors: a + b*i, etc.
m n o p
Algorithm:
create bm bn with movshdup(a b) + mulpd
create bn bm with shufpd on the previous result. (or create n m with a shuffle before the mul)
create a a with movsldup(a b)
use fmaddsubpd to produce the final result: [a|a]*[m|n] -/+ [bn|bm].
Yes, SSE/AVX has ADDSUBPD to do alternating subtract/add in even/odd elements (in that order, presumably because of this use-case). FMA includes FMADDSUB132PD which subtracts and adds, (and the reverse, FMSUBADD which adds and subtracts).
Per 4 results: 6x shuffle, 2x mul, 2xfmaddsub. So unless I got something wrong, it's as efficient as the deinterleave method (when not squaring the same number). Skylake latency = 10c = 1+4+1 to create bn bm (overlapping with 1 cycle to create a a), + 4 (FMA). So it's one cycle lower latency than Matt's.
On Bulldozer-family, it would be a win to shuffle both inputs to the first mul, so the mul->fmaddsub critical path stays inside the FMA domain (1 cycle lower latency). Doing it the other way helps stop silly compilers from making resource conflicts by doing the movsldup(a b) too early, and delaying the mulpd. (In a loop, though, many iterations will be in flight and bottleneck on the shuffle port.)
This is still better than Matt's for squaring (still save the XOR, and can use FMA), but we don't save any shuffles:
// SQUARING: for each AVX lane: aa-bb, 2*ab
// ab bb // movshdup + mul
// bb ab // ^ -> shufpd
// a a // movsldup
// aa-bb ab+ab // fmaddsubpd : [a|a]*[a|b] -/+ [bb|ab]
// per 4 results: 6x shuffle, 2x mul, 2xfmaddsub
I also played around with some possibilities like (a+b) * (a+b) = aa+2ab+bb, or (r-i)*(r+i) = rr - ii but didn't get anywhere. Rounding between steps means that FP math doesn't cancel perfectly, so doing something like this wouldn't even produce exactly identical results.
See my other answer for the general case of multiplying different complex numbers, not squaring.
TL:DR: just use the code in my other answer with both inputs the same. Compilers do a good job with the redundancy.
Squaring simplifies the math slightly: instead of needing two different cross products, rAiB and rBiA are the same. But it still needs to get doubled, so basically we end up with 2 mul + 1 FMA + 1 add, instead of 2 mul + 2 FMA.
With the SIMD-unfriendly interleaved storage format, it gives a big boost to the deinterleave method, since there's only one input to shuffle. Matt's method doesn't benefit at all, since it calculates both cross products with the same vector multiply.
Using the cmul_manualvec() from my other answer:
// squares 4 complex doubles from A[0..3], storing the result in dst[0..3]
void csquare_manual(double complex *restrict dst,
const double complex *restrict A) {
cmul_manualvec(dst, A, A);
}
gcc and clang are smart enough to optimize away the redundancy of using the same input twice, so there's no need to make a custom version with intrinsics. clang does a bad job on the scalar auto-vectorizing version, so don't use that. I don't see anything to be gained over this asm output (from Godbolt):
clang3.9 -O3 -ffast-math -ffp-contract=fast -march=haswell
vmovupd ymm0, ymmword ptr [rsi]
vmovupd ymm1, ymmword ptr [rsi + 32]
vunpcklpd ymm2, ymm0, ymm1
vunpckhpd ymm0, ymm0, ymm1 # doing this shuffle first would let the first multiply start a cycle earlier. Silly compiler.
vmulpd ymm1, ymm0, ymm0 # imag*imag
vfmsub231pd ymm1, ymm2, ymm2 # real*real - imag*imag
vaddpd ymm0, ymm0, ymm0 # imag+imag = 2*imag
vmulpd ymm0, ymm2, ymm0 # 2*imag * real
vunpcklpd ymm2, ymm1, ymm0
vunpckhpd ymm0, ymm1, ymm0
vmovupd ymmword ptr [rdi], ymm2
vmovupd ymmword ptr [rdi + 32], ymm0
vzeroupper
ret
Possibly a different instruction ordering would have been better, to maybe reduce resource conflicts. e.g. double the real vector, since it's unpacked first, so the VADDPD could start a cycle sooner, before the imag*imag VMULPD. But reordering lines in the C source doesn't usually translate directly to asm reordering, because modern compilers are complex beasts. (IIRC, gcc doesn't particularly try to schedule instructions for x86, because out-of-order execution mostly hides those effects.)
Anyway, per 4 complex squares:
2 loads (down from 4) + 2 stores, for obvious reasons
4 shuffles (down from 6), again obvious
2 VMULPD (same)
1 FMA + 1 VADDPD (down from 2 FMA. VADDPD is lower latency than FMA on Haswell/Broadwell, same on Skylake).
Matt's version would still be 6 shuffles, and same everything else.

SIMD (AVX2) mask store and pack [duplicate]

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?
I've seen in SSE where it was done like this:
(From:https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf)
__m128i LeftPack_SSSE3(__m128 mask, __m128 val)
{
// Move 4 sign bits of mask to 4-bit integer value.
int mask = _mm_movemask_ps(mask);
// Select shuffle control data
__m128i shuf_ctrl = _mm_load_si128(&shufmasks[mask]);
// Permute to move valid values to front of SIMD register
__m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl);
return packed;
}
This seems fine for SSE which is 4 wide, and thus only needs a 16 entry LUT, but for AVX which is 8 wide, the LUT becomes quite large(256 entries, each 32 bytes, or 8k).
I'm surprised that AVX doesn't appear to have an instruction for simplifying this process, such as a masked store with packing.
I think with some bit shuffling to count the # of sign bits set to the left you could generate the necessary permutation table, and then call _mm256_permutevar8x32_ps. But this is also quite a few instructions I think..
Does anyone know of any tricks to do this with AVX2? Or what is the most efficient method?
Here is an illustration of the Left Packing Problem from the above document:
Thanks
AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)
We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.
We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.
Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).
AMD before Zen2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)
For integer vectors with 32-bit or wider elements: Either 1) _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)).
Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (Not tested). Either way, use the shuffle mask with VPERMD instead of VPERMPS.
For 64-bit integer or double elements, everything still Just Works; The compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)
For 16-bit elements, you might be able to adapt this with 128-bit vectors.
For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.
The algorithm:
Start with a constant of packed 3 bit indices, with each position holding its own index. i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide. 0b111'110'101'...'010'001'000.
Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000'...'111'000'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000'...'010'000. (i.e. [ ... 2 0 ])
See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.
Now we're in the same boat as the compressed-LUT: unpack up to 8 packed indices.
By the time you put all the pieces together, there are three total pext/pdeps. I worked backwards from what I wanted, so it's probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)
We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.
See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.
#include <stdint.h>
#include <immintrin.h>
// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101); // unpack each bit to a byte
expanded_mask *= 0xFF; // mask |= mask<<1 | mask<<2 | ... | mask<<7;
// ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte
const uint64_t identity_indices = 0x0706050403020100; // the identity shuffle for vpermps, packed to one index per byte
uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);
__m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
__m256i shufmask = _mm256_cvtepu8_epi32(bytevec);
return _mm256_permutevar8x32_ps(src, shufmask);
}
This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).
# clang 3.7.1 -std=gnu++14 -O3 -march=haswell
mov eax, edi # just to zero extend: goes away when inlining
movabs rcx, 72340172838076673 # The constants are hoisted after inlining into a loop
pdep rax, rax, rcx # ABC -> 0000000A0000000B....
imul rax, rax, 255 # 0000000A0000000B.. -> AAAAAAAABBBBBBBB..
movabs rcx, 506097522914230528
pext rax, rcx, rax
vmovq xmm1, rax
vpmovzxbd ymm1, xmm1 # 3c latency since this is lane-crossing
vpermps ymm0, ymm1, ymm0
ret
(Later clang compiles like GCC, with mov/shl/sub instead of imul, see below.)
So, according to Agner Fog's numbers and https://uops.info/, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it's 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There's no instruction-level parallelism. In a loop where this isn't part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.
This can maybe manage a throughput of one per 4 cycles, bottlenecked on port1 for pdep/pext/imul plus popcnt in the loop. Of course, with loads/stores and other loop overhead (including the compare and movmsk), total uop throughput can easily be an issue, too.
e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we're lucky.
clang 6 and earlier created a loop-carried dependency with popcnt's false dependency on its output, so it will bottleneck on 3/5ths of the latency of the compress256 function. clang 7.0 and later use xor-zeroing to break the false dependency (instead of just using popcnt edx,edx or something like GCC does :/).
gcc (and later clang) does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub, instead of imul by 255. This takes 3 total uops vs. 1 for the front-end, but the latency is only 2 cycles, down from 3. (Haswell handles mov at register-rename stage with zero latency.) Most significantly for this, imul can only run on port 1, competing with pdep/pext/popcnt, so it's probably good to avoid that bottleneck.
Since all hardware that supports AVX2 also supports BMI2, there's probably no point providing a version for AVX2 without BMI2.
If you need to do this in a very long loop, the LUT is probably worth it if the initial cache-misses are amortized over enough iterations with the lower overhead of just unpacking the LUT entry. You still need to movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pext.
You can unpack LUT entries with the same integer sequence I used, but #Froglegs's set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn't need to go into integer registers in the first place. (A 32bit broadcast-load doesn't need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).
See my other answer for AVX2+BMI2 with no LUT.
Since you mention a concern about scalability to AVX512: don't worry, there's an AVX512F instruction for exactly this:
VCOMPRESSPS — Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory. (There are also versions for double, and 32 or 64bit integer elements (vpcompressq), but not byte or word (16bit)). It's like BMI2 pdep / pext, but for vector elements instead of bits in an integer reg.
The destination can be a vector register or a memory operand, while the source is a vector and a mask register. With a register dest, it can merge or zero the upper bits. With a memory dest, "Only the contiguous vector is written to the destination memory location".
To figure out how far to advance your pointer for the next vector, popcnt the mask.
Let's say you want to filter out everything but values >= 0 from an array:
#include <stdint.h>
#include <immintrin.h>
size_t filter_non_negative(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
const float *endp = src+len;
float *dst_start = dst;
do {
__m512 sv = _mm512_loadu_ps(src);
__mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ); // true for src >= 0.0, false for unordered and src < 0.0
_mm512_mask_compressstoreu_ps(dst, keep, sv); // clang is missing this intrinsic, which can't be emulated with a separate store
src += 16;
dst += _mm_popcnt_u64(keep); // popcnt_u64 instead of u32 helps gcc avoid a wasted movsx, but is potentially slower on some CPUs
} while (src < endp);
return dst - dst_start;
}
This compiles (with gcc4.9 or later) to (Godbolt Compiler Explorer):
# Output from gcc6.1, with -O3 -march=haswell -mavx512f. Same with other gcc versions
lea rcx, [rsi+rdx*4] # endp
mov rax, rdi
vpxord zmm1, zmm1, zmm1 # vpxor xmm1, xmm1,xmm1 would save a byte, using VEX instead of EVEX
.L2:
vmovups zmm0, ZMMWORD PTR [rsi]
add rsi, 64
vcmpps k1, zmm0, zmm1, 29 # AVX512 compares have mask regs as a destination
kmovw edx, k1 # There are some insns to add/or/and mask regs, but not popcnt
movzx edx, dx # gcc is dumb and doesn't know that kmovw already zero-extends to fill the destination.
vcompressps ZMMWORD PTR [rax]{k1}, zmm0
popcnt rdx, rdx
## movsx rdx, edx # with _popcnt_u32, gcc is dumb. No casting can get gcc to do anything but sign-extend. You'd expect (unsigned) would mov to zero-extend, but no.
lea rax, [rax+rdx*4] # dst += ...
cmp rcx, rsi
ja .L2
sub rax, rdi
sar rax, 2 # address math -> element count
ret
Performance: 256-bit vectors may be faster on Skylake-X / Cascade Lake
In theory, a loop that loads a bitmap and filters one array into another should run at 1 vector per 3 clocks on SKX / CSLX, regardless of vector width, bottlenecked on port 5. (kmovb/w/d/q k1, eax runs on p5, and vcompressps into memory is 2p5 + a store, according to IACA and to testing by http://uops.info/).
#ZachB reports in comments that in practice, that a loop using ZMM _mm512_mask_compressstoreu_ps is slightly slower than _mm256_mask_compressstoreu_ps on real CSLX hardware. (I'm not sure if that was a microbenchmark that would allow the 256-bit version to get out of "512-bit vector mode" and clock higher, or if there was surrounding 512-bit code.)
I suspect misaligned stores are hurting the 512-bit version. vcompressps probably effectively does a masked 256 or 512-bit vector store, and if that crosses a cache line boundary then it has to do extra work. Since the output pointer is usually not a multiple of 16 elements, a full-line 512-bit store will almost always be misaligned.
Misaligned 512-bit stores may be worse than cache-line-split 256-bit stores for some reason, as well as happening more often; we already know that 512-bit vectorization of other things seems to be more alignment sensitive. That may just be from running out of split-load buffers when they happen every time, or maybe the fallback mechanism for handling cache-line splits is less efficient for 512-bit vectors.
It would be interesting to benchmark vcompressps into a register, with separate full-vector overlapping stores. That's probably the same uops, but the store can micro-fuse when it's a separate instruction. And if there's some difference between masked stores vs. overlapping stores, this would reveal it.
Another idea discussed in comments below was using vpermt2ps to build up full vectors for aligned stores. This would be hard to do branchlessly, and branching when we fill a vector will probably mispredict unless the bitmask has a pretty regular pattern, or big runs of all-0 and all-1.
A branchless implementation with a loop-carried dependency chain of 4 or 6 cycles through the vector being constructed might be possible, with a vpermt2ps and a blend or something to replace it when it's "full". With an aligned vector store every iteration, but only moving the output pointer when the vector is full.
This is likely slower than vcompressps with unaligned stores on current Intel CPUs.
If you are targeting AMD Zen this method may be preferred, due to the very slow pdepand pext on ryzen (18 cycles each).
I came up with this method, which uses a compressed LUT, which is 768(+1 padding) bytes, instead of 8k. It requires a broadcast of a single scalar value, which is then shifted by a different amount in each lane, then masked to the lower 3 bits, which provides a 0-7 LUT.
Here is the intrinsics version, along with code to build LUT.
//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(u32 moveMask) {
u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
__m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr));//lower 24 bits has our LUT
// __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));
//now shift it right to get 3 bits at bottom
//__m256i shufmask = _mm256_srli_epi32(m, 29);
//Simplified version suggested by wim
//shift each lane so desired 3 bits are a bottom
//There is leftover data in the lane, but _mm256_permutevar8x32_ps only examines the first 3 bits so this is ok
__m256i shufmask = _mm256_srlv_epi32 (indices, _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21));
return shufmask;
}
u32 get_nth_bits(int a) {
u32 out = 0;
int c = 0;
for (int i = 0; i < 8; ++i) {
auto set = (a >> i) & 1;
if (set) {
out |= (i << (c * 3));
c++;
}
}
return out;
}
u8 g_pack_left_table_u8x3[256 * 3 + 1];
void BuildPackMask() {
for (int i = 0; i < 256; ++i) {
*reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
}
}
Here is the assembly generated by MSVC:
lea ecx, DWORD PTR [rcx+rcx*2]
lea rax, OFFSET FLAT:unsigned char * g_pack_left_table_u8x3 ; g_pack_left_table_u8x3
vpbroadcastd ymm0, DWORD PTR [rcx+rax]
vpsrlvd ymm0, ymm0, YMMWORD PTR __ymm#00000015000000120000000f0000000c00000009000000060000000300000000
Will add more information to a great answer from #PeterCordes : https://stackoverflow.com/a/36951611/5021064.
I did the implementations of std::remove from C++ standard for integer types with it. The algorithm, once you can do compress, is relatively simple: load a register, compress, store. First I'm going to show the variations and then benchmarks.
I ended up with two meaningful variations on the proposed solution:
__m128i registers, any element type, using _mm_shuffle_epi8 instruction
__m256i registers, element type of at least 4 bytes, using _mm256_permutevar8x32_epi32
When the types are smaller then 4 bytes for 256 bit register, I split them in two 128 bit registers and compress/store each one separately.
Link to compiler explorer where you can see complete assembly (there is a using type and width (in elements per pack) in the bottom, which you can plug in to get different variations) : https://gcc.godbolt.org/z/yQFR2t
NOTE: my code is in C++17 and is using a custom simd wrappers, so I do not know how readable it is. If you want to read my code -> most of it is behind the link in the top include on godbolt. Alternatively, all of the code is on github.
Implementations of #PeterCordes answer for both cases
Note: together with the mask, I also compute the number of elements remaining using popcount. Maybe there is a case where it's not needed, but I have not seen it yet.
Mask for _mm_shuffle_epi8
Write an index for each byte into a half byte: 0xfedcba9876543210
Get pairs of indexes into 8 shorts packed into __m128i
Spread them out using x << 4 | x & 0x0f0f
Example of spreading the indexes. Let's say 7th and 6th elements are picked.
It means that the corresponding short would be: 0x00fe. After << 4 and | we'd get 0x0ffe. And then we clear out the second f.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of `_mm_movemask_epi8`,
// `uint16_t` - there are at most 16 bits with values for __m128i.
inline std::pair<__m128i, std::uint8_t> mask128(std::uint16_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x1111111111111111) * 0xf;
const std::uint8_t offset =
static_cast<std::uint8_t>(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes =
_pext_u64(0xfedcba9876543210, mmask_expanded); // Do the #PeterCordes answer
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0...0|compressed_indexes
const __m128i as_16bit = _mm_cvtepu8_epi16(as_lower_8byte); // From bytes to shorts over the whole register
const __m128i shift_by_4 = _mm_slli_epi16(as_16bit, 4); // x << 4
const __m128i combined = _mm_or_si128(shift_by_4, as_16bit); // | x
const __m128i filter = _mm_set1_epi16(0x0f0f); // 0x0f0f
const __m128i res = _mm_and_si128(combined, filter); // & 0x0f0f
return {res, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m128i, std::uint8_t> compress_mask_for_shuffle_epi8(std::uint32_t mmask) {
auto res = _compress_mask::mask128(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Mask for _mm256_permutevar8x32_epi32
This is almost one for one #PeterCordes solution - the only difference is _pdep_u64 bit (he suggests this as a note).
The mask that I chose is 0x5555'5555'5555'5555. The idea is - I have 32 bits of mmask, 4 bits for each of 8 integers. I have 64 bits that I want to get => I need to convert each bit of 32 bits into 2 => therefore 0101b = 5.The multiplier also changes from 0xff to 3 because I will get 0x55 for each integer, not 1.
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of _mm256_movemask_epi8
inline std::pair<__m256i, std::uint8_t> mask256_epi32(std::uint32_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x5555'5555'5555'5555) * 3;
const std::uint8_t offset = static_cast<std::uint8_t(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes = _pext_u64(0x0706050403020100, mmask_expanded); // Do the #PeterCordes answer
// Every index was one byte => we need to make them into 4 bytes
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0000|compressed indexes
const __m256i expanded = _mm256_cvtepu8_epi32(as_lower_8byte); // spread them out
return {expanded, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m256i, std::uint8_t> compress_mask_for_permutevar8x32(std::uint32_t mmask) {
static_assert(sizeof(T) >= 4); // You cannot permute shorts/chars with this.
auto res = _compress_mask::mask256_epi32(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Benchmarks
Processor: Intel Core i7 9700K (a modern consumer level CPU, no AVX-512 support)
Compiler: clang, build from trunk near the version 10 release
Compiler options: --std=c++17 --stdlib=libc++ -g -Werror -Wall -Wextra -Wpedantic -O3 -march=native -mllvm -align-all-functions=7
Micro-benchmarking library: google benchmark
Controlling for code alignment:
If you are not familiar with the concept, read this or watch this
All functions in the benchmark's binary are aligned to 128 byte boundary. Each benchmarking function is duplicated 64 times, with a different noop slide in the beginning of the function (before entering the loop). The main numbers I show is min per each measurement. I think this works since the algorithm is inlined. I'm also validated by the fact that I get very different results. At the very bottom of the answer I show the impact of code alignment.
Note: benchmarking code. BENCH_DECL_ATTRIBUTES is just noinline
Benchmark removes some percentage of 0s from an array. I test arrays with {0, 5, 20, 50, 80, 95, 100} percent of zeroes.
I test 3 sizes: 40 bytes (to see if this is usable for really small arrays), 1000 bytes and 10'000 bytes. I group by size because of SIMD depends on the size of the data and not a number of elements. The element count can be derived from an element size (1000 bytes is 1000 chars but 500 shorts and 250 ints). Since time it takes for non simd code depends mostly on the element count, the wins should be bigger for chars.
Plots: x - percentage of zeroes, y - time in nanoseconds. padding : min indicates that this is minimum among all alignments.
40 bytes worth of data, 40 chars
For 40 bytes this does not make sense even for chars - my implementation gets about 8-10 times slower when using 128 bit registers over non-simd code. So, for example, compiler should be careful doing this.
1000 bytes worth of data, 1000 chars
Apparently the non-simd version is dominated by branch prediction: when we get small amount of zeroes we get a smaller speed up: for no 0s - about 3 times, for 5% zeroes - about 5-6 times speed up. For when the branch predictor can't help the non-simd version - there is about a 27 times speed up. It's an interesting property of simd code that it's performance tends to be much less dependent on of data. Using 128 vs 256 register shows practically no difference, since most of the work is still split into 2 128 registers.
1000 bytes worth of data, 500 shorts
Similar results for shorts except with a much smaller gain - up to 2 times.
I don't know why shorts do that much better than chars for non-simd code: I'd expect shorts to be two times faster, since there are only 500 shorts, but the difference is actually up to 10 times.
1000 bytes worth of data, 250 ints
For a 1000 only 256 bit version makes sense - 20-30% win excluding no 0s to remove what's so ever (perfect branch prediction, no removing for non-simd code).
10'000 bytes worth of data, 10'000 chars
The same order of magnitude wins as as for a 1000 chars: from 2-6 times faster when branch predictor is helpful to 27 times when it's not.
Same plots, only simd versions:
Here we can see about a 10% win from using 256 bit registers and splitting them in 2 128 bit ones: about 10% faster. In size it grows from 88 to 129 instructions, which is not a lot, so might make sense depending on your use-case. For base-line - non-simd version is 79 instructions (as far as I know - these are smaller then SIMD ones though).
10'000 bytes worth of data, 5'000 shorts
From 20% to 9 times win, depending on the data distributions. Not showing the comparison between 256 and 128 bit registers - it's almost the same assembly as for chars and the same win for 256 bit one of about 10%.
10'000 bytes worth of data, 2'500 ints
Seems to make a lot of sense to use 256 bit registers, this version is about 2 times faster compared to 128 bit registers. When comparing with non-simd code - from a 20% win with a perfect branch prediction to 3.5 - 4 times as soon as it's not.
Conclusion: when you have a sufficient amount of data (at least 1000 bytes) this can be a very worthwhile optimisation for a modern processor without AVX-512
PS:
On percentage of elements to remove
On one hand it's uncommon to filter half of your elements. On the other hand a similar algorithm can be used in partition during sorting => that is actually expected to have ~50% branch selection.
Code alignment impact
The question is: how much worth it is, if the code happens to be poorly aligned
(generally speaking - there is very little one can do about it).
I'm only showing for 10'000 bytes.
The plots have two lines for min and for max for each percentage point (meaning - it's not one best/worst code alignment - it's the best code alignment for a given percentage).
Code alignment impact - non-simd
Chars:
From 15-20% for poor branch prediction to 2-3 times when branch prediction helped a lot. (branch predictor is known to be affected by code alignment).
Shorts:
For some reason - the 0 percent is not affected at all. It can be explained by std::remove first doing linear search to find the first element to remove. Apparently linear search for shorts is not affected.
Other then that - from 10% to 1.6-1.8 times worth
Ints:
Same as for shorts - no 0s is not affected. As soon as we go into remove part it goes from 1.3 times to 5 times worth then the best case alignment.
Code alignment impact - simd versions
Not showing shorts and ints 128, since it's almost the same assembly as for chars
Chars - 128 bit register
About 1.2 times slower
Chars - 256 bit register
About 1.1 - 1.24 times slower
Ints - 256 bit register
1.25 - 1.35 times slower
We can see that for simd version of the algorithm, code alignment has significantly less impact compared to non-simd version. I suspect that this is due to practically not having branches.
In case anyone is interested here is a solution for SSE2 which uses an instruction LUT instead of a data LUT aka a jump table. With AVX this would need 256 cases though.
Each time you call LeftPack_SSE2 below it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases don't need to modify the vector.
static inline __m128 LeftPack_SSE2(__m128 val, int mask) {
switch(mask) {
case 0:
case 1: return val;
case 2: return _mm_shuffle_ps(val,val,0x01);
case 3: return val;
case 4: return _mm_shuffle_ps(val,val,0x02);
case 5: return _mm_shuffle_ps(val,val,0x08);
case 6: return _mm_shuffle_ps(val,val,0x09);
case 7: return val;
case 8: return _mm_shuffle_ps(val,val,0x03);
case 9: return _mm_shuffle_ps(val,val,0x0c);
case 10: return _mm_shuffle_ps(val,val,0x0d);
case 11: return _mm_shuffle_ps(val,val,0x34);
case 12: return _mm_shuffle_ps(val,val,0x0e);
case 13: return _mm_shuffle_ps(val,val,0x38);
case 14: return _mm_shuffle_ps(val,val,0x39);
case 15: return val;
}
}
__m128 foo(__m128 val, __m128 maskv) {
int mask = _mm_movemask_ps(maskv);
return LeftPack_SSE2(val, mask);
}
This is perhaps a bit late though I recently ran into this exact problem and found an alternative solution which used a strictly AVX implementation. If you don't care if unpacked elements are swapped with the last elements of each vector, this could work as well. The following is an AVX version:
inline __m128 left_pack(__m128 val, __m128i mask) noexcept
{
const __m128i shiftMask0 = _mm_shuffle_epi32(mask, 0xA4);
const __m128i shiftMask1 = _mm_shuffle_epi32(mask, 0x54);
const __m128i shiftMask2 = _mm_shuffle_epi32(mask, 0x00);
__m128 v = val;
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask0);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask1);
v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask2);
return v;
}
Essentially, each element in val is shifted once to the left using the bitfield, 0xF9 for blending with it's unshifted variant. Next, both shifted and unshifted versions are blended against the input mask (which has the first non-zero element broadcast across the remaining elements 3 and 4). Repeat this process two more times, broadcasting the second and third elements of mask to its subsequent elements on each iteration and this should provide an AVX version of the _pdep_u32() BMI2 instruction.
If you don't have AVX, you can easily swap out each _mm_permute_ps() with _mm_shuffle_ps() for an SSE4.1-compatible version.
And if you're using double-precision, here's an additional version for AVX2:
inline __m256 left_pack(__m256d val, __m256i mask) noexcept
{
const __m256i shiftMask0 = _mm256_permute4x64_epi64(mask, 0xA4);
const __m256i shiftMask1 = _mm256_permute4x64_epi64(mask, 0x54);
const __m256i shiftMask2 = _mm256_permute4x64_epi64(mask, 0x00);
__m256d v = val;
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask0);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask1);
v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask2);
return v;
}
Additionally _mm_popcount_u32(_mm_movemask_ps(val)) can be used to determine the number of elements which remained after the left-packing.

Handling zeroes in _mm256_rsqrt_ps()

Given that _mm256_sqrt_ps() is relatively slow, and that the values I am generating are immediately truncated with _mm256_floor_ps(), looking around it seems that doing:
_mm256_mul_ps(_mm256_rsqrt_ps(eightFloats),
eightFloats);
Is the way to go for that extra bit of performance and avoiding a pipeline stall.
Unfortunately, with zero values, I of course get a crash calculating 1/sqrt(0). What is the best way around this? I have tried this (which works and is faster), but is there a better way, or am I going to run into problems under certain conditions?
_mm256_mul_ps(_mm256_rsqrt_ps(_mm256_max_ps(eightFloats,
_mm256_set1_ps(0.1))),
eightFloats);
My code is for a vertical application, so I can assume that it will be running on a Haswell CPU (i7-4810MQ), so FMA/AVX2 can be used. The original code is approximately:
float vals[MAX];
int sum = 0;
for (int i = 0; i < MAX; i++)
{
int thisSqrt = (int) floor(sqrt(vals[i]));
sum += min(thisSqrt, 0x3F);
}
All the values of vals should be integer values. (Why everything isn't just int is a different question...)
tl;dr: See the end for code that compiles and should work.
To just solve the 0.0 problem, you could also special-case inputs of 0.0 with an FP compare of the source against 0.0. Use the compare result as a mask to zero out any NaNs resulting from 0 * +Infinity in sqrt(x) = x * rsqrt(x)). Clang does this when autovectorizing. (But it uses blendps with the zeroed vector, instead of using the compare mask with andnps directly to zero or preserve elements.)
It would also be possible to use sqrt(x) ~= recip(rsqrt(x)), as suggested by njuffa. rsqrt(0) = +Inf. recip(+Inf) = 0. However, using two approximations would compound the relative error, which is a problem.
The thing you're missing:
Truncating to integer (instead of rounding) requires an accurate sqrt result when the input is a perfect square. If the result for 25*rsqrt(25) is 4.999999 or something (instead of 5.00001), you'll add 4 instead of 5.
Even with a Newton-Raphson iteration, rsqrtps isn't perfectly accurate the way sqrtps is, so it might still give 5.0 - 1ulp. (1ulp = one unit in the last place = lowest bit of the mantissa).
Also:
Newton Raphson formula explained
Newton Raphson SSE implementation performance (latency/throughput). Note that we care more about throughput than latency, since we're using it in a loop that doesn't do much else. sqrt isn't part of the loop-carried dep chain, so different iterations can have their sqrt calcs in flight at once.
It might be possible to kill 2 birds with one stone by adding a small constant before doing the (x+offset)*approx_rsqrt(x+offset) and then truncating to integer. Large enough to overcome the max relative error of 1.5*2-12, but small enough not to bump sqrt_approx(63*63-1+offset) up to 63 (the most sensitive case).
63*1.5*2^(-12) == 0.023071...
approx_sqrt(63*63-1) == 62.99206... +/- 0.023068..
Actually, we're screwed without a Newton iteration even without adding anything. approx_sqrt(63*63-1) could come out above 63.0 all by itself. n=36 is the largest value where the relative error in sqrt(n*n-1) + error is less than sqrt(n*n). GNU Calc:
define f(n) { local x=sqrt(n*n-1); local e=x*1.5*2^(-12); print x; print e, x+e; }
; f(36)
35.98610843089316319413
~0.01317850650545403926 ~35.99928693739861723339
; f(37)
36.9864840178138587015
~0.01354485498699237990 ~37.00002887280085108140
Does your source data have any properties that mean you don't have to worry about it being just below a large perfect square? e.g. is it always perfect squares?
You could check all possible input values, since the important domain is very small (integer FP values from 0..63*63) to see if the error in practice is small enough on Intel Haswell, but that would be a brittle optimization that could make your code break on AMD CPUs, or even on future Intel CPUs. Unfortunately, just coding to the ISA spec's guarantee that the relative error is up to 1.5*2-12 requires more instructions. I don't see any tricks a NR iteration.
If your upper limit was smaller (like 20), you could just do isqrt = static_cast<int> ((x+0.5)*approx_rsqrt(x+0.5)). You'd get 20 for 20*20, but always 19 for 20*20-1.
; define test_approx_sqrt(x, off) { local s=x*x+off; local sq=s/sqrt(s); local sq_1=(s-1)/sqrt(s-1); local e=1.5*2^(-12); print sq, sq_1; print sq*e, sq_1*e; }
; test_approx_sqrt(20, 0.5)
~20.01249609618950056874 ~19.98749609130668473087 # (x+0.5)/sqrt(x+0.5)
~0.00732879495710064718 ~0.00731963968187500662 # relative error
Note that val * (x +/- err) = val*x +/- val*err. IEEE FP mul produces results that are correctly rounded to 0.5ulp, so this should work for FP relative errors.
Anyway, I think you need one Newton-Raphson iteration.
The best bet is to add 0.5 to your input before doing an approx_sqrt using rsqrt. That sidesteps the 0/0 = NaN problem, and pushes the +/- error range all to one side of the whole number cut point (for numbers in the range we care about).
FP min/max instructions have the same performance as FP add, and will be on the critical path either way. Using an add instead of a max also solves the problem of results for perfect squares potentially being a few ulp below the correct result.
Compiler output: a decent starting point
I get pretty good autovectorization results from clang 3.7.1 with sum_int, with -fno-math-errno -funsafe-math-optimizations. -ffinite-math-only is not required (but even with the full -ffast-math, clang avoids sqrt(0) = NaN when using rsqrtps).
sum_fp doesn't auto-vectorize, even with the full -ffast-math.
However clang's version suffers from the same problem as your idea: truncating an inexact result from rsqrt + NR, potentially giving the wrong integer. IDK if this is why gcc doesn't auto-vectorize, because it could have used sqrtps for a big speedup without changing the results. (At least, as long as all the floats are between 0 and INT_MAX2, otherwise converting back to integer will give the "indefinite" result of INT_MIN. (sign bit set, all other bits cleared). This is a case where -ffast-math breaks your program, unless you use -mrecip=none or something.
See the asm output on godbolt from:
// autovectorizes with clang, but has rounding problems.
// Note the use of sqrtf, and that floorf before truncating to int is redundant. (removed because clang doesn't optimize away the roundps)
int sum_int(float vals[]){
int sum = 0;
for (int i = 0; i < MAX; i++) {
int thisSqrt = (int) sqrtf(vals[i]);
sum += std::min(thisSqrt, 0x3F);
}
return sum;
}
To manually vectorize with intrinsics, we can look at the asm output from -fno-unroll-loops (to keep things simple). I was going to include this in the answer, but then realized that it had problems.
putting it together:
I think converting to int inside the loop is better than using floorf and then addps. roundps is a 2-uop instruction (6c latency) on Haswell (1uop in SnB/IvB). Worse, both uops require port1, so they compete with FP add / mul. cvttps2dq is a 1-uop instruction for port1, with 3c latency, and then we can use integer min and add to clamp and accumulate, so port5 gets something to do. Using an integer vector accumulator also means the loop-carried dependency chain is 1 cycle, so we don't need to unroll or use multiple accumulators to keep multiple iterations in flight. Smaller code is always better for the big picture (uop cache, L1 I-cache, branch predictors).
As long as we aren't in danger of overflowing 32bit accumulators, this seems to be the best choice. (Without having benchmarked anything or even tested it).
I'm not using the sqrt(x) ~= approx_recip(approx_sqrt(x)) method, because I don't know how to do a Newton iteration to refine it (probably it would involve a division). And because the compounded error is larger.
Horizontal sum from this answer.
Complete but untested version:
#include <immintrin.h>
#define MAX 4096
// 2*sqrt(x) ~= 2*x*approx_rsqrt(x), with a Newton-Raphson iteration
// dividing by 2 is faster in the integer domain, so we don't do it
__m256 approx_2sqrt_ps256(__m256 x) {
// clang / gcc usually use -3.0 and -0.5. We could do the same by using fnmsub_ps (add 3 = subtract -3), so we can share constants
__m256 three = _mm256_set1_ps(3.0f);
//__m256 half = _mm256_set1_ps(0.5f); // we omit the *0.5 step
__m256 nr = _mm256_rsqrt_ps( x ); // initial approximation for Newton-Raphson
// 1/sqrt(x) ~= nr * (3 - x*nr * nr) * 0.5 = nr*(1.5 - x*0.5*nr*nr)
// sqrt(x) = x/sqrt(x) ~= (x*nr) * (3 - x*nr * nr) * 0.5
// 2*sqrt(x) ~= (x*nr) * (3 - x*nr * nr)
__m256 xnr = _mm256_mul_ps( x, nr );
__m256 three_minus_muls = _mm256_fnmadd_ps( xnr, nr, three ); // -(xnr*nr) + 3
return _mm256_mul_ps( xnr, three_minus_muls );
}
// packed int32_t: correct results for inputs from 0 to well above 63*63
__m256i isqrt256_ps(__m256 x) {
__m256 offset = _mm256_set1_ps(0.5f); // or subtract -0.5, to maybe share constants with compiler-generated Newton iterations.
__m256 xoff = _mm256_add_ps(x, offset); // avoids 0*Inf = NaN, and rounding error before truncation
__m256 approx_2sqrt_xoff = approx_2sqrt_ps256(xoff);
__m256i i2sqrtx = _mm256_cvttps_epi32(approx_2sqrt_xoff);
return _mm256_srli_epi32(i2sqrtx, 1); // divide by 2 with truncation
// alternatively, we could mask the low bit to zero and divide by two outside the loop, but that has no advantage unless port0 turns out to be the bottleneck
}
__m256i isqrt256_ps_simple_exact(__m256 x) {
__m256 sqrt_x = _mm256_sqrt_ps(x);
__m256i isqrtx = _mm256_cvttps_epi32(sqrt_x);
return isqrtx;
}
int hsum_epi32_avx(__m256i x256){
__m128i xhi = _mm256_extracti128_si256(x256, 1);
__m128i xlo = _mm256_castsi256_si128(x256);
__m128i x = _mm_add_epi32(xlo, xhi);
__m128i hl = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
hl = _mm_add_epi32(hl, x);
x = _mm_shuffle_epi32(hl, _MM_SHUFFLE(2, 3, 0, 1));
hl = _mm_add_epi32(hl, x);
return _mm_cvtsi128_si32(hl);
}
int sum_int_avx(float vals[]){
__m256i sum = _mm256_setzero_si256();
__m256i upperlimit = _mm256_set1_epi32(0x3F);
for (int i = 0; i < MAX; i+=8) {
__m256 v = _mm256_loadu_ps(vals+i);
__m256i visqrt = isqrt256_ps(v);
// assert visqrt == isqrt256_ps_simple_exact(v) or something
visqrt = _mm256_min_epi32(visqrt, upperlimit);
sum = _mm256_add_epi32(sum, visqrt);
}
return hsum_epi32_avx(sum);
}
Compiles on godbolt to nice code, but I haven't tested it. clang makes slightly nicer code that gcc: clang uses broadcast-loads from 4B locations for the set1 constants, instead of repeating them at compile time into 32B constants. gcc also has a bizarre movdqa to copy a register.
Anyway, the whole loop winds up being only 9 vector instructions, compared to 12 for the compiler-generated sum_int version. It probably didn't notice the x*initial_guess(x) common-subexpressions that occur in the Newton-Raphson iteration formula when you're multiplying the result by x, or something like that. It also does an extra mulps instead of a psrld because it does the *0.5 before converting to int. So that's where the extra two mulps instructions come from, and there's the cmpps/blendvps.
sum_int_avx(float*):
vpxor ymm3, ymm3, ymm3
xor eax, eax
vbroadcastss ymm0, dword ptr [rip + .LCPI4_0] ; set1(0.5)
vbroadcastss ymm1, dword ptr [rip + .LCPI4_1] ; set1(3.0)
vpbroadcastd ymm2, dword ptr [rip + .LCPI4_2] ; set1(63)
LBB4_1: ; latencies
vaddps ymm4, ymm0, ymmword ptr [rdi + 4*rax] ; 3c
vrsqrtps ymm5, ymm4 ; 7c
vmulps ymm4, ymm4, ymm5 ; x*nr ; 5c
vfnmadd213ps ymm5, ymm4, ymm1 ; 5c
vmulps ymm4, ymm4, ymm5 ; 5c
vcvttps2dq ymm4, ymm4 ; 3c
vpsrld ymm4, ymm4, 1 ; 1c this would be a mulps (but not on the critical path) if we did this in the FP domain
vpminsd ymm4, ymm4, ymm2 ; 1c
vpaddd ymm3, ymm4, ymm3 ; 1c
; ... (those 9 insns repeated: loop unrolling)
add rax, 16
cmp rax, 4096
jl .LBB4_1
;... horizontal sum
IACA thinks that with no unroll, Haswell can sustain a throughput of one iteration per 4.15 cycles, bottlenecking on ports 0 and 1. So potentially you could shave a cycle by accumulating sqrt(x)*2 (with truncation to even numbers using _mm256_and_si256), and only divide by two outside the loop.
Also according to IACA, the latency of a single iteration is 38 cycles on Haswell. I only get 31c, so probably it's including L1 load-use latency or something. Anyway, this means that to saturate the execution units, operations from 8 iterations have to be in flight at once. That's 8 * ~14 unfused-domain uops = 112 unfused-uops (or less with clang's unroll) that have to be in flight at once. Haswell's scheduler is actually only 60 entries, but the ROB is 192 entries. The early uops from early iterations will already have executed, so they only need to be tracked in the ROB, not also in the scheduler. Many of the slow uops are at the beginning of each iteration, though. Still, there's reason to hope that this will come close-ish to saturating ports 0 and 1. Unless data is hot in L1 cache, cache/memory bandwidth will probably be the bottleneck.
Interleaving operations from multiple dep chains would also be better. When clang unrolls, it puts all 9 instructions for one iteration ahead of all 9 instructions for another iteration. It uses a surprisingly small number of registers, so it would be possible to have instructions for 2 or 4 iterations mixed together. This is the sort of thing compilers are supposed to be good at, but which is cumbersome for humans. :/
It would also be slightly more efficient if the compiler chose a one-register addressing mode, so the load could micro-fuse with the vaddps. gcc does this.

When the compiler reorders AVX instructions on Sandy, does it affect performance?

Please do not say this is premature microoptimization. I want to understand, as much as it is possible given my limited knowledge, how the described SB feature and assembly works, and make sure that my code makes use of this architectural feature. Thank you for understanding.
I've started to learn intrinsics a few days ago so the answer may seem obvious to some, but I don't have a reliable source of information to figure this out.
I need to optimize some code for a Sandy Bridge CPU (this is a requirement). Now I know that it can do one AVX multiply and one AVX add per cycle, and read this paper:
http://research.colfaxinternational.com/file.axd?file=2012%2F7%2FColfax_CPI.pdf
which shows how it can be done in C++. So, the problem is that my code won't get auto-vectorized using Intel's compiler (which is another requirement for the task), so I decided to implement it manually using intrinsics like this:
__sum1 = _mm256_setzero_pd();
__sum2 = _mm256_setzero_pd();
__sum3 = _mm256_setzero_pd();
sum = 0;
for(kk = k; kk < k + BS && kk < aW; kk+=12)
{
const double *a_addr = &A[i * aW + kk];
const double *b_addr = &newB[jj * aW + kk];
__aa1 = _mm256_load_pd((a_addr));
__bb1 = _mm256_load_pd((b_addr));
__sum1 = _mm256_add_pd(__sum1, _mm256_mul_pd(__aa1, __bb1));
__aa2 = _mm256_load_pd((a_addr + 4));
__bb2 = _mm256_load_pd((b_addr + 4));
__sum2 = _mm256_add_pd(__sum2, _mm256_mul_pd(__aa2, __bb2));
__aa3 = _mm256_load_pd((a_addr + 8));
__bb3 = _mm256_load_pd((b_addr + 8));
__sum3 = _mm256_add_pd(__sum3, _mm256_mul_pd(__aa3, __bb3));
}
__sum1 = _mm256_add_pd(__sum1, _mm256_add_pd(__sum2, __sum3));
_mm256_store_pd(&vsum[0], __sum1);
The reason I manually unroll the loop like this is explained here:
Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell
They say you need to unroll by a factor of 3 to achieve the best performance on Sandy. My naive testing confirms that this indeed runs better than without unrolling or 4-fold unrolling.
OK, so here is the problem. The icl compiler from Intel Parallel Studio 15 generates this:
$LN149:
movsxd r14, r14d ;78.49
$LN150:
vmovupd ymm3, YMMWORD PTR [r11+r14*8] ;80.48
$LN151:
vmovupd ymm5, YMMWORD PTR [32+r11+r14*8] ;84.49
$LN152:
vmulpd ymm4, ymm3, YMMWORD PTR [r8+r14*8] ;82.56
$LN153:
vmovupd ymm3, YMMWORD PTR [64+r11+r14*8] ;88.49
$LN154:
vmulpd ymm15, ymm5, YMMWORD PTR [32+r8+r14*8] ;86.56
$LN155:
vaddpd ymm2, ymm2, ymm4 ;82.34
$LN156:
vmulpd ymm4, ymm3, YMMWORD PTR [64+r8+r14*8] ;90.56
$LN157:
vaddpd ymm0, ymm0, ymm15 ;86.34
$LN158:
vaddpd ymm1, ymm1, ymm4 ;90.34
$LN159:
add r14d, 12 ;76.57
$LN160:
cmp r14d, ebx ;76.42
$LN161:
jb .B1.19 ; Prob 82% ;76.42
To me, this looks like a mess, where the correct order (add next to multiply required to use the handy SB feature) is broken.
Question:
Will this assembly code leverage the Sandy Bridge feature I am referring to?
If not, what do I need to do in order to utilize the feature and prevent the code from becoming "tangled" like this?
Also, when there is only one loop iteration, the order is nice and clean, i.e. load, multiply, add, as it should be.
With x86 CPUs many people expect to get the maximum FLOPS from the dot product
for(int i=0; i<n; i++) sum += a[i]*b[i];
but this turns out not to be the case.
What can give the maximum FLOPS is this
for(int i=0; i<n; i++) sum += k*a[i];
where k is a constant. Why is the CPU not optimized for the dot product? I can speculate. One of the things CPUs are optimized for is BLAS. BLAS is considering a building block of many other routines.
The Level-1 and Level-2 BLAS routines become memory bandwidth bound as n increases. It's only the Level-3 routines (e.g. Matrix Multiplication) which are capable of being compute bound. This is because the Level-3 computations go as n^3 and the reads as n^2. So the CPU is optimized for the Level-3 routines. The Level-3 routines don't need to optimize for a single dot product. They only need to read from one matrix per iteration (sum += k*a[i]).
From this we can conclude that the number of bits needed to be read each cycle to get the maximum FLOPS for the Level-3 routines is
read_size = SIMD_WIDTH * num_MAC
where num_MAC is the number of multiply–accumulate operations that can be done each cycle.
SIMD_WIDTH (bits) num_MAC read_size (bits) ports used
Nehalem 128 1 128 128-bits on port 2
Sandy Bridge 256 1 256 128-bits port 2 and 3
Haswell 256 2 512 256-bits port 2 and 3
Skylake 512 2 1024 ?
For Nehalem-Haswell this agrees with what the hardware is capable of. I don't actually know that Skylake will be able to read 1024-bits per clock cycle but if it can't AVX512 won't be very interesting so I'm confident in my guess. A nice plot for Nahalem, Sandy Bridge, and Haswell for each port can be found at http://www.anandtech.com/show/6355/intels-haswell-architecture/8
So far I have ignored latency and dependency chains. To really get the maximum FLOPS you need to unroll the loop at least three times on Sandy Bridge (I use four because I find it inconvenient to work with multiples of three)
The best way to answer your question about performance is to find the theoretic best performance you expect for your operation and then compare how close your code get to this. I call this the efficiency. Doing this you will find that despite the reordering of the instructions you see in the assembly the performance is still good. But there are many other subtle issues you may need to consider. Here are three issues I encountered:
l1-memory-bandwidth-50-drop-in-efficiency-using-addresses-which-differ-by-4096.
obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62%
difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp.
I also suggest you consider using IACA to study the performance.

Resources