How can I clear the upper 128 bits of m2:
__m256i m2 = _mm256_set1_epi32(2);
__m128i m1 = _mm_set1_epi32(1);
m2 = _mm256_castsi128_si256(_mm256_castsi256_si128(m2));
m2 = _mm256_castsi128_si256(m1);
don't work -- Intel’s documentation for the _mm256_castsi128_si256 intrinsic says that “the upper bits of the resulting vector are undefined”.
At the same time I can easily do it in assembly:
VMOVDQA xmm2, xmm2 //zeros upper ymm2
VMOVDQA xmm2, xmm1
Of course I'd not like to use "and" or _mm256_insertf128_si256() and such.
A new intrinsic function has been added for solving this problem:
m2 = _mm256_zextsi128_si256(m1);
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_zextsi128_si256&expand=6177,6177
This function doesn't produce any code if the upper half is already known to be zero, it just makes sure the upper half is not treated as undefined.
Update: there's now a __m128i _mm256_zextsi128_si256(__m128i) intrinsic; see Agner Fog's answer. The rest of the answer below is only relevant for old compilers that don't support this intrinsic, and where there's no efficient, portable solution.
Unfortunately, the ideal solution will depend on which compiler you are using, and on some of them, there is no ideal solution.
There are several basic ways that we could write this:
Version A:
ymm = _mm256_set_m128i(_mm_setzero_si128(), _mm256_castsi256_si128(ymm));
Version B:
ymm = _mm256_blend_epi32(_mm256_setzero_si256(),
ymm,
_MM_SHUFFLE(0, 0, 3, 3));
Version C:
ymm = _mm256_inserti128_si256(_mm256_setzero_si256(),
_mm256_castsi256_si128(ymm),
0);
Each of these do precisely what we want, clearing the upper 128 bits of a 256-bit YMM register, so any of them could safely be used. But which is the most optimal? Well, that depends on which compiler you are using...
GCC:
Version A: Not supported at all because GCC lacks the _mm256_set_m128i intrinsic. (Could be simulated, of course, but that would be done using one of the forms in "B" or "C".)
Version B: Compiled to inefficient code. Idiom is not recognized and intrinsics are translated very literally to machine-code instructions. A temporary YMM register is zeroed using VPXOR, and then that is blended with the input YMM register using VPBLENDD.
Version C: Ideal. Although the code looks kind of scary and inefficient, all versions of GCC that support AVX2 code generation recognize this idiom. You get the expected VMOVDQA xmm?, xmm? instruction, which implicitly clears the upper bits.
Prefer Version C!
Clang:
Version A: Compiled to inefficient code. A temporary YMM register is zeroed using VPXOR, and then that is inserted into the temporary YMM register using VINSERTI128 (or the floating-point forms, depending on version and options).
Version B & C: Also compiled to inefficient code. A temporary YMM register is again zeroed, but here, it is blended with the input YMM register using VPBLENDD.
Nothing ideal!
ICC:
Version A: Ideal. Produces the expected VMOVDQA xmm?, xmm? instruction.
Version B: Compiled to inefficient code. Zeros a temporary YMM register, and then blends zeros with the input YMM register (VPBLENDD).
Version C: Also compiled to inefficient code. Zeros a temporary YMM register, and then uses VINSERTI128 to insert zeros into the temporary YMM register.
Prefer Version A!
MSVC:
Version A and C: Compiled to inefficient code. Zeroes a temporary YMM register, and then uses VINSERTI128 (A) or VINSERTF128 (C) to insert zeros into the temporary YMM register.
Version B: Also compiled to inefficient code. Zeros a temporary YMM register, and then blends this with the input YMM register using VPBLENDD.
Nothing ideal!
In conclusion, then, it is possible to get GCC and ICC to emit the ideal VMOVDQA instruction, if you use the right code sequence. But, I can't see any way to get either Clang or MSVC to safely emit a VMOVDQA instruction. These compilers are missing the optimization opportunity.
So, on Clang and MSVC, we have the choice between XOR+blend and XOR+insert. Which is better? We turn to Agner Fog's instruction tables (spreadsheet version also available):
On AMD's Ryzen architecture: (Bulldozer-family is similar for the AVX __m256 equivalents of these, and for AVX2 on Excavator):
Instruction | Ops | Latency | Reciprocal Throughput | Execution Ports
---------------|-----|---------|-----------------------|---------------------
VMOVDQA | 1 | 0 | 0.25 | 0 (renamed)
VPBLENDD | 2 | 1 | 0.67 | 3
VINSERTI128 | 2 | 1 | 0.67 | 3
Agner Fog seems to have missed some AVX2 instructions in the Ryzen section of his tables. See this AIDA64 InstLatX64 result for confirmation that VPBLENDD ymm performs the same as VPBLENDW ymm on Ryzen, rather than being the same as VBLENDPS ymm (1c throughput from 2 uops that can run on 2 ports).
See also an Excavator / Carrizo InstLatX64 showing that VPBLENDD and VINSERTI128 have equal performance there (2 cycle latency, 1 per cycle throughput). Same for VBLENDPS/VINSERTF128.
On Intel architectures (Haswell, Broadwell, and Skylake):
Instruction | Ops | Latency | Reciprocal Throughput | Execution Ports
---------------|-----|---------|-----------------------|---------------------
VMOVDQA | 1 | 0-1 | 0.33 | 3 (may be renamed)
VPBLENDD | 1 | 1 | 0.33 | 3
VINSERTI128 | 1 | 3 | 1.00 | 1
Obviously, VMOVDQA is optimal on both AMD and Intel, but we already knew that, and it doesn't seem to be an option on either Clang or MSVC until their code generators are improved to recognize one of the above idioms or an additional intrinsic is added for this precise purpose.
Luckily, VPBLENDD is at least as good as VINSERTI128 on both AMD and Intel CPUs. On Intel processors, VPBLENDD is a significant improvement over VINSERTI128. (In fact, it's nearly as good as VMOVDQA in the rare case where the latter cannot be renamed, except for needing an all-zero vector constant.) Prefer the sequence of intrinsics that results in a VPBLENDD instruction if you can't coax your compiler to use VMOVDQA.
If you need a floating-point __m256 or __m256d version of this, the choice is more difficult. On Ryzen, VBLENDPS has 1c throughput, but VINSERTF128 has 0.67c. On all other CPUs (including AMD Bulldozer-family), VBLENDPS is equal or better. It's much better on Intel (same as for integer). If you're optimizing specifically for AMD, you may need to do more tests to see which variant is fastest in your particular sequence of code, otherwise blend. It's only a tiny bit worse on Ryzen.
In summary, then, targeting generic x86 and supporting as many different compilers as possible, we can do:
#if (defined _MSC_VER)
ymm = _mm256_blend_epi32(_mm256_setzero_si256(),
ymm,
_MM_SHUFFLE(0, 0, 3, 3));
#elif (defined __INTEL_COMPILER)
ymm = _mm256_set_m128i(_mm_setzero_si128(), _mm256_castsi256_si128(ymm));
#elif (defined __GNUC__)
// Intended to cover GCC and Clang.
ymm = _mm256_inserti128_si256(_mm256_setzero_si256(),
_mm256_castsi256_si128(ymm),
0);
#else
#error "Unsupported compiler: need to figure out optimal sequence for this compiler."
#endif
See this and versions A,B, and C separately on the Godbolt compiler explorer.
Perhaps you could build on this to define your own macro-based intrinsic until something better comes down the pike.
See what your compiler generates for this:
__m128i m1 = _mm_set1_epi32(1);
__m256i m2 = _mm256_set_m128i(_mm_setzero_si128(), m1);
or alternatively this:
__m128i m1 = _mm_set1_epi32(1);
__m256i m2 = _mm256_setzero_si256();
m2 = _mm256_inserti128_si256 (m2, m1, 0);
The version of clang I have here seems to generate the same code for either (vxorps + vinsertf128), but YMMV.
Related
AVX512 introduced opmask feature for its arithmetic commands. A simple example: godbolt.org.
#include <immintrin.h>
__m512i add(__m512i a, __m512i b) {
__m512i sum;
asm(
"mov ebx, 0xAAAAAAAA; \n\t"
"kmovw k1, ebx; \n\t"
"vpaddd %[SUM] %{k1%}%{z%}, %[A], %[B]; # conditional add "
: [SUM] "=v"(sum)
: [A] "v" (a),
[B] "v" (b)
: "ebx", "k1" // clobbers
);
return sum;
}
-march=skylake-avx512 -masm=intel -O3
mov ebx,0xaaaaaaaa
kmovw k1,ebx
vpaddd zmm0{k1}{z},zmm0,zmm1
The problem is that k1 has to be specified.
Is there an input constraint like "r" for integers except that it picks a k register instead of a general-purpose register?
__mmask16 is literally a typedef for unsigned short (and other mask types for other plain integer types), so we just need a constraint for passing it in a k register.
We have to go digging in the gcc sources config/i386/constraints.md to find it:
The constraint for any mask register is "k". Or use "Yk" for k1..k7 (which can be used as a predicate, unlike k0). You'd use an "=k" operand as the destination for a compare-into-mask, for example.
Obviously you can use "=Yk"(tmp) with a __mmask16 tmp to get the compiler to do register allocation for you, instead of just declaring clobbers on whichever "k" registers you decide to use.
Prefer intrinsics like _mm512_maskz_add_epi32
First of all, https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. Understanding asm is great, but use that to read compiler output and/or figure out what would be optimal, then write intrinsics that can compile the way you want. Performance tuning info like https://agner.org/optimize/ and https://uops.info/ list things by asm mnemonic, and they're shorter / easier to remember than intrinsics, but you can search by mnemonic to find intrinsics on https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Intrinsics will also let the compiler fold loads into memory source operands for other instructions; with AVX512 those can even be broadcast loads! Your inline asm forces the compiler to use a separate load instruction. Even a "vm" input won't let the compiler pick a broadcast-load as the memory source, because it wouldn't know the broadcast element width of the instruction(s) you were using it with.
Use _mm512_mask_add_epi32 or _mm512_maskz_add_epi32 especially if you're already using __m512i types from <immintrin.h>.
Also, your asm has a bug: you're using {k1} merge-masking not {k1}{z} zero-masking, but you used uninitialized __m512i sum; with an output-only "=v" constraint as the merge destination! As a stand-alone function, it happens to merge into a because the calling convention has ZMM0 = first input = return value register. But when inlining into other functions, you definitely can't assume that sum will pick the same register as a. Your best bet is to use a read/write operand for "+v"(a) and use is as the destination and first source.
Merge-masking only makes sense with a "+v" read/write operand. (Or in an asm statement with multiple instructions where you've already written an output once, and want to merge another result into it.)
Intrinsics would stop you from making this mistake; the merge-masking version has an extra input for the merge-target. (The asm destination operand).
Example using "Yk"
// works with -march=skylake-avx512 or -march=knl
// or just -mavx512f but don't do that.
// also needed: -masm=intel
#include <immintrin.h>
__m512i add_zmask(__m512i a, __m512i b) {
__m512i sum;
asm(
"vpaddd %[SUM] %{%[mask]%}%{z%}, %[A], %[B]; # conditional add "
: [SUM] "=v"(sum)
: [A] "v" (a),
[B] "v" (b),
[mask] "Yk" ((__mmask16)0xAAAA)
// no clobbers needed, unlike your question which I fixed with an edit
);
return sum;
}
Note that all the { and } are escaped with % (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Special-format-strings), so they're not parsed as dialect-alternatives {AT&T | Intel-syntax}.
This compiles with gcc as early as 4.9, but don't actually do that because it doesn't understand -march=skylake-avx512, or even have tuning settings for Skylake or KNL. Use a more recent GCC that knows about your CPU for best results.
Godbolt compiler explorer:
# gcc8.3 -O3 -march=skylake-avx512 or -march=knl (and -masm=intel)
add(long long __vector, long long __vector):
mov eax, -21846
kmovw k1, eax # compiler-generated
# inline asm starts
vpaddd zmm0 {k1}{z}, zmm0, zmm1; # conditional add
# inline asm ends
ret
-mavx512bw (implied by -march=skylake-avx512 but not knl) is required for "Yk" to work on an int. If you're compiling with -march=knl, integer literals need a cast to __mmask16 or __mask8, because unsigned int = __mask32 isn't available for masks.
[mask] "Yk" (0xAAAA) requires AVX512BW even though the constant does fit in 16 bits, just because bare integer literals always have type int. (vpaddd zmm has 16 elements per vector, so I shortened your constant to 16-bit.) With AVX512BW, you can pass wider constants or leave out the cast for narrow ones.
gcc6 and later support -march=skylake-avx512. Use that to set tuning as well as enabling everything. Preferably gcc8 or at least gcc7. Newer compilers generate less clunky code with new ISA extensions like AVX512 if you're ever using it outside of inline asm.
gcc5 supports -mavx512f -mavx512bw but doesn't know about Skylake.
gcc4.9 doesn't support -mavx512bw.
"Yk" is unfortunately not yet documented in https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html.
I knew where to look in the GCC source thanks to Ross's answer on In GNU C inline asm, what are the size-override modifiers for xmm/ymm/zmm for a single operand?
While it is undocumented, looking here we see:
(define_register_constraint "Yk" "TARGET_AVX512F ? MASK_REGS :
NO_REGS" "#internal Any mask register that can be used as predicate,
i.e. k1-k7.")
Editing your godbolt to this:
asm(
"vpaddd %[SUM] %{%[k]}, %[A], %[B]"
: [SUM] "=v"(sum)
: [A] "v" (a), [B] "v" (b), [k] "Yk" (0xaaaaaaaa) );
seems to produce the correct output.
That said, I usually try to discourage people from using inline asm (and undocumented features). Can you use _mm512_mask_add_epi32?
I don't have a particular use-case in mind; I'm asking if this is really a design flaw / limitation in Intel's intrinsics or if I'm just missing something.
If you want to combine a scalar float with an existing vector, there doesn't seem to be a way to do it without high-element-zeroing or broadcasting the scalar into a vector, using Intel intrinsics. I haven't investigated GNU C native vector extensions and the associated builtins.
This wouldn't be too bad if the extra intrinsic optimized away, but it doesn't with gcc (5.4 or 6.2). There's also no nice way to use pmovzx or insertps as loads, for the related reason that their intrinsics only take vector args. (And gcc doesn't fold a scalar->vector load into the asm instruction.)
__m128 replace_lower_two_elements(__m128 v, float x) {
__m128 xv = _mm_set_ss(x); // WANTED: something else for this step, some compilers actually compile this to a separate insn
return _mm_shuffle_ps(v, xv, 0); // lower 2 elements are both x, and the garbage is gone
}
gcc 5.3 -march=nehalem -O3 output, to enable SSE4.1 and tune for that Intel CPU: (It's even worse without SSE4.1; multiple instructions to zero the upper elements).
insertps xmm1, xmm1, 0xe # pointless zeroing of upper elements. shufps only reads the low element of xmm1
shufps xmm0, xmm1, 0 # The function *should* just compile to this.
ret
TL:DR: the rest of this question is just asking if you can actually do this efficiently, and if not why not.
clang's shuffle-optimizer gets this right, and doesn't waste instructions on zeroing high elements (_mm_set_ss(x)), or duplicating the scalar into them (_mm_set1_ps(x)). Instead of writing something the compiler has to optimize away, shouldn't there be a way to write it "efficiently" in C in the first place? Even very recent gcc doesn't optimize it away, so this is a real (but minor) problem.
This would be possible if there was a scalar->128b equivalent of __m256 _mm256_castps128_ps256 (__m128 a). i.e. produce a __m128 with undefined garbage in upper elements, and the float in the low element, compiling to zero asm instructions if the scalar float/double was already in an xmm register.
None of the following intrinsics exist, but they should.
a scalar->__m128 equivalent of _mm256_castps128_ps256 as described above. The most general solution for the scalar-already-in-register case.
__m128 _mm_move_ss_scalar (__m128 a, float s): replace low element of vector a with scalar s. This isn't actually necessary if there's a general-purpose scalar->__m128 (previous bullet point). (The reg-reg form of movss merges, unlike the load form which zeros, and unlike movd which zeros upper elements in both cases. To copy a register holding a scalar float without false dependencies, use movaps).
__m128i _mm_loadzxbd (const uint8_t *four_bytes) and other sizes of PMOVZX / PMOVSX: AFAICT, there's no good safe way to use the PMOVZX intrinsics as a load, because the inconvenient safe way doesn't optimize away with gcc.
__m128 _mm_insertload_ps (__m128 a, float *s, const int imm8). INSERTPS behaves differently as a load: the upper 2 bits of the imm8 are ignored, and it always takes the scalar at the effective address (instead of an element from a vector in memory). This lets it work with addresses that aren't 16B-aligned, and work even without faulting if the float right before an unmapped page.
Like with PMOVZX, gcc fails to fold an upper-element-zeroing _mm_load_ss() into a memory operand for INSERTPS. (Note that if the upper 2 bits of the imm8 aren't both zero, then _mm_insert_ps(xmm0, _mm_load_ss(), imm8) can compile to insertps xmm0,xmm0,foo, with a different imm8 that zeros elements in vec as-if the src element was actually a zero produced by MOVSS from memory. Clang actually uses XORPS/BLENDPS in that case)
Are there any viable workarounds to emulate any of those that are both safe (don't break at -O0 by e.g. loading 16B that might touch the next page and segfault), and efficient (no wasted instructions at -O3 with current gcc and clang at least, preferably also other major compilers)? Preferably also in a readable way, but if necessary it could be put behind an inline wrapper function like __m128 float_to_vec(float a){ something(a); }.
Is there any good reason for Intel not to introduce intrinsics like that? They could have added a float->__m128 with undefined upper elements at the same time as adding _mm256_castps128_ps256. Is this a matter of compiler internals making it hard to implement? Perhaps specifically ICC internals?
The major calling conventions on x86-64 (SysV or MS __vectorcall) take the first FP arg in xmm0 and return scalar FP args in xmm0, with upper elements undefined. (See the x86 tag wiki for ABI docs). This means it's not uncommon for the compiler to have a scalar float/double in a register with unknown upper elements. This will be rare in a vectorized inner loop, so I think avoiding these useless instructions will mostly just save a bit of code size.
The pmovzx case is more serious: that is something you might use in an inner loop (e.g. for a LUT of VPERMD shuffle masks, saving a factor of 4 in cache footprint vs. storing each index padded to 32 bits in memory).
The pmovzx-as-a-load issue has been bothering me for a while now, and the original version of this question got me thinking about the related issue of using a scalar float in an xmm register. There are probably more use-cases for pmovzx as a load than for scalar->__m128.
It's doable with GNU C inline asm, but this is ugly and defeats many optimizations, including constant-propagation (https://gcc.gnu.org/wiki/DontUseInlineAsm). This will not be the accepted answer. I'm adding this as an answer instead of part of the question so the question stays short isn't huge.
// don't use this: defeating optimizations is probably worse than an extra instruction
#ifdef __GNUC__
__m128 float_to_vec_inlineasm(float x) {
__m128 retval;
asm ("" : "=x"(retval) : "0"(x)); // matching constraint: provide x in the same xmm reg as retval
return retval;
}
#endif
This does compile to a single ret, as desired, and will inline to let you shufps a scalar into a vector:
gcc5.3
float_to_vec_and_shuffle_asm(float __vector(4), float):
shufps xmm0, xmm1, 0 # tmp93, xv,
ret
See this code on the Godbolt compiler explorer.
This is obviously trivial in pure assembly language, where you don't have to fight with a compiler to get it not to emit instructions you don't want or need.
I haven't found any real way to write a __m128 float_to_vec(float a){ something(a); } that compiles to just a ret instruction. An attempt for double using _mm_undefined_pd() and _mm_move_sd() actually makes worse code with gcc (see the Godbolt link above). None of the existing float->__m128 intrinsics help.
Off-topic: actual _mm_set_ss() code-gen strategies: When you do write code that has to zero upper elements, compilers pick from an interesting range of strategies. Some good, some weird. The strategies also differ between double and float on the same compiler (gcc or clang), as you can see on the Godbolt link above.
One example: __m128 float_to_vec(float x){ return _mm_set_ss(x); } compiles to:
# gcc5.3 -march=core2
movd eax, xmm0 # movd xmm0,xmm0 would work; IDK why gcc doesn't do that
movd xmm0, eax
ret
# gcc5.3 -march=nehalem
insertps xmm0, xmm0, 0xe
ret
# clang3.8 -march=nehalem
xorps xmm1, xmm1
blendps xmm0, xmm1, 14 # xmm0 = xmm0[0],xmm1[1,2,3]
ret
I'm looking for some C code for signed saturated 64-bit addition that compiles to efficient x86-64 code with the gcc optimizer. Portable code would be ideal, although an asm solution could be used if necessary.
static const int64 kint64max = 0x7fffffffffffffffll;
static const int64 kint64min = 0x8000000000000000ll;
int64 signed_saturated_add(int64 x, int64 y) {
bool x_is_negative = (x & kint64min) != 0;
bool y_is_negative = (y & kint64min) != 0;
int64 sum = x+y;
bool sum_is_negative = (sum & kint64min) != 0;
if (x_is_negative != y_is_negative) return sum; // can't overflow
if (x_is_negative && !sum_is_negative) return kint64min;
if (!x_is_negative && sum_is_negative) return kint64max;
return sum;
}
The function as written produces a fairly lengthy assembly output with several branches. Any tips on optimization? Seems like it ought to be be implementable with just an ADD with a few CMOV instructions but I'm a little bit rusty with this stuff.
This may be optimized further but here is a portable solution. It does not invoked undefined behavior and it checks for integer overflow before it could occur.
#include <stdint.h>
int64_t sadd64(int64_t a, int64_t b)
{
if (a > 0) {
if (b > INT64_MAX - a) {
return INT64_MAX;
}
} else if (b < INT64_MIN - a) {
return INT64_MIN;
}
return a + b;
}
This is a solution that continues in the vein that had been given in one of the comments, and has been used in ouah's solution, too. here the generated code should be without conditional jumps
int64_t signed_saturated_add(int64_t x, int64_t y) {
// determine the lower or upper bound of the result
int64_t ret = (x < 0) ? INT64_MIN : INT64_MAX;
// this is always well defined:
// if x < 0 this adds a positive value to INT64_MIN
// if x > 0 this subtracts a positive value from INT64_MAX
int64_t comp = ret - x;
// the condition is equivalent to
// ((x < 0) && (y > comp)) || ((x >=0) && (y <= comp))
if ((x < 0) == (y > comp)) ret = x + y;
return ret;
}
The first looks as if there would be a conditional move to do, but because of the special values my compiler gets off with an addition: in 2's complement INT64_MIN is INT64_MAX+1.
There is then only one conditional move for the assignment of the sum, in case anything is fine.
All of this has no UB, because in the abstract state machine the sum is only done if there is no overflow.
Related: unsigned saturation is much easier, and efficiently possible in pure ISO C: How to do unsigned saturating addition in C?
Compilers are terrible at all of the pure C options proposed so far.
They don't see that they can use the signed-overflow flag result from an add instruction to detect that saturation to INT64_MIN/MAX is necessary. AFAIK there's no pure C pattern that compilers recognize as reading the OF flag result of an add.
Inline asm is not a bad idea here, but we can avoid that with GCC's builtins that expose UB-safe wrapping signed addition with a boolean overflow result. https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html
(If you were going to use GNU C inline asm, that would limit you just as much as these GNU C builtins. And these builtins aren't arch-specific. They do require gcc5 or newer, but gcc4.9 and older are basically obsolete. https://gcc.gnu.org/wiki/DontUseInlineAsm - it defeats constant propagation and is hard to maintain.)
This version uses the fact that INT64_MIN = INT64_MAX + 1ULL (for 2's complement) to select INT64_MIN/MAX based on the sign of b. Signed-overflow UB is avoided by using uint64_t for that add, and GNU C defines the behaviour of converting an unsigned integer to a signed type that can't represent its value (bit-pattern used unchanged). Current gcc/clang benefit from this hand-holding because they don't figure out this trick from a ternary like (b<0) ? INT64_MIN : INT64_MAX. (See below for the alternate version using that). I haven't checked the asm on 32-bit architectures.
GCC only supports 2's complement integer types, so a function using __builtin_add_overflow doesn't have to care about portability to C implementations that use 1's complement (where the same identity holds) or sign/magnitude (where it doesn't), even if you made a version for long or int instead of int64_t.
#include <stdint.h>
#ifndef __cplusplus
#include <stdbool.h>
#endif
// static inline
int64_t signed_sat_add64_gnuc_v2(int64_t a, int64_t b) {
long long res;
bool overflow = __builtin_saddll_overflow(a, b, &res);
if (overflow) {
// overflow is only possible in one direction depending on the sign bit
return ((uint64_t)b >> 63) + INT64_MAX;
// INT64_MIN = INT64_MAX + 1 wraparound, done with unsigned
}
return res;
}
Another option is (b>>63) ^ INT64_MAX which might be useful if manually vectorizing where SIMD XOR can run on more ports than SIMD ADD, like on Intel before Skylake. (But x86 doesn't have SIMD 64-bit arithmetic right shift, only logical, so this would only help for an int32_t version, and you'd need to efficiently detect overflow in the first place. Or you might use a variable blend on the sign bit, like blendvpd) See Add saturate 32-bit signed ints intrinsics? with x86 SIMD intrinsics (SSE2/SSE4)
On Godbolt with gcc9 and clang8 (along with the other implementations from other answers):
# gcc9.1 -O3 (clang chooses branchless with cmov)
signed_sat_add64_gnuc_v2:
add rdi, rsi # the actual add
jo .L3 # jump on signed overflow
mov rax, rdi # retval = the non-overflowing add result
ret
.L3:
movabs rax, 9223372036854775807 # INT64_MAX
shr rsi, 63 # b is still available after the ADD
add rax, rsi
ret
When inlining into a loop, the mov imm64 can be hoisted. If b is needed afterwards then we might need an extra mov, otherwise shr/add can destroy b, leaving the INT64_MAX constant in a register undamaged. Or if the compiler wants to use cmov (like clang does), it has to mov/shr because it has to get the saturation constant ready before the add, preserving both operands.
Notice that the critical path for the non-overflowing case only includes an add and a not-taken jo. These can't macro-fuse into a single uop even on Sandybridge-family, but the jo only costs throughput not latency thanks to branch prediction + speculative execution. (When inlining, the mov will go away.)
If saturation is actually not rare and branch prediction is a problem, compile with profile-guided optimization and gcc will hopefully do if-conversion and use a cmovno instead of jo, like clang does. This puts the MIN/MAX selection on the critical path, as well as the CMOV itself. The MIN/MAX selection can run in parallel with the add.
You could use a<0 instead. I used b because I think most people would write x = sadd(x, 123) instead of x = sadd(123, x), and having a compile-time-constant allows the b<0 to optimize away. For maximal optimization opportunity, you could use if (__builtin_constant_p(a)) to ask the compiler if a was a compile-time constant. That works for gcc, but clang evaluates the const-ness way too early, before inlining, so it's useless except in macros with clang. (Related: ICC19 doesn't do constant propagation through __builtin_saddll_overflow: it puts both inputs in registers and still does the add. GCC and Clang just return a constant.)
This optimization is especially valuable inside a loop with the MIN/MAX selection hoisted, leaving only add + cmovo. (Or add + jo to a mov.)
cmov is a 2 uop instruction on Intel P6-family and SnB-family before Broadwell because it has 3 inputs. On other x86 CPUs (Broadwell / Skylake, and AMD), it's a single-uop instruction. On most such CPUs it has 1 cycle latency. It's a simple ALU select operation; the only complication is 3 inputs (2 regs + FLAGS). But on KNL it's still 2-cycle latency.
Unfortunately gcc for AArch64 fails to use adds to set flags and check the V (overflow) flag result, so it spends several instructions deciding whether to branch.
Clang does a great job, and AArch64's immediate encodings can represent INT64_MAX as an operand to eor or add.
// clang8.0 -O3 -target aarch64
signed_sat_add64_gnuc:
orr x9, xzr, #0x7fffffffffffffff // mov constant = OR with zero reg
adds x8, x0, x1 // add and set flags
add x9, x9, x1, lsr #63 // sat = (b shr 63) + MAX
csel x0, x9, x8, vs // conditional-select, condition = VS = oVerflow flag Set
ret
Optimizing MIN vs. MAX selection
As noted above, return (b<0) ? INT64_MIN : INT64_MAX; doesn't compile optimally with most versions of gcc/clang; they generate both constant in registers and cmov to select, or something similar on other ISAs.
We can assume 2's complement because GCC only supports 2's complement integer types, and because the ISO C optional int64_t type is guaranteed to be 2's complement if it exists. (Signed overflow of int64_t is still UB, this allows it to be a simple typedef of long or long long).
(On a sign/magnitude C implementation that supported some equivalent of __builtin_add_overflow, a version of this function for long long or int couldn't use the SHR / ADD trick. For extreme portability you'd probably just use the simple ternary, or for sign/magnitude specifically you could return (b&0x800...) | 0x7FFF... to OR the sign bit of b into a max-magnitude number.)
For two's complement, the bit-patterns for MIN and MAX are 0x8000... (just the high bit set) and 0x7FFF... (all other bits set). They have a couple interesting properties: MIN = MAX + 1 (if computed with unsigned on the bit-pattern), and MIN = ~MAX: their bit-patterns are bitwise inverses, aka one's complement of each other.
The MIN = ~MAX property follows from ~x = -x - 1 (a re-arrangement of the standard -x = ~x + 1 2's complement identity) and the fact that MIN = -MAX - 1. The +1 property is unrelated, and follows from simple rollover from most-positive to most-negative and applies to the one's complement encoding of signed integer as well. (But not sign/magnitude; you'd get -0 where the unsigned magnitude ).
The above function uses the MIN = MAX + 1 trick. The MIN = ~MAX trick is also usable by broadcasting the sign bit to all positions with an arithmetic right shift (creating 0 or 0xFF...), and XORing with that.
GNU C guarantees that signed right shifts are arithmetic (sign-extension), so (b>>63) ^ INT64_MAX is equivalent to (b<0) ? INT64_MIN : INT64_MAX in GNU C.
ISO C leaves signed right shifts implementation-defined, but we could use a ternary of b<0 ? ~0ULL : 0ULL. Compilers will optimize the following to sar / xor, or equivalent instruction(s), but it has no implementation-defined behaviour. AArch64 can use a shifted input operand for eor just as well as it can for add.
// an earlier version of this answer used this
int64_t mask = (b<0) ? ~0ULL : 0; // compiles to sar with good compilers, but is not implementation-defined.
return mask ^ INT64_MAX;
Fun fact: AArch64 has a csinv instruction: conditional-select inverse. And it can put INT64_MIN into a register with a single 32-bit mov instruction, thanks to its powerful immediate encodings for simple bit-patterns. AArch64 GCC was already using csinv for the MIN = ~MAX trick for the original return (b<0) ? INT64_MIN : INT64_MAX; version.
clang 6.0 and earlier on Godbolt were using shr/add for the plain (b<0) ? INT64_MIN : INT64_MAX; version. It looks more efficient than what clang7/8 do, so that's a regression / missed-optimization bug I think. (And it's the whole point of this section and why I wrote a 2nd version.)
I chose the MIN = MAX + 1 version because it could possible auto-vectorize better: x86 has 64-bit SIMD logical right shifts but only 16 and 32-bit SIMD arithmetic right shifts until AVX512F. Of course, signed-overflow detection with SIMD probably makes it not worth it until AVX512 for 64-bit integers. Well maybe AVX2. And if it's part of some larger calculation that can otherwise vectorize efficiently, then unpacking to scalar and back sucks.
For scalar it's truly a wash; neither way compiles any better, and sar/shr perform identically, and so do add/xor, on all CPUs that Agner Fog has tested. (https://agner.org/optimize/).
But + can sometimes optimize into other things, though, so you could imagine gcc folding a later + or - of a constant into the overflow branch. Or possibly using LEA for that add instead of ADD to copy-and-add. The difference in power from a simpler ALU execution unit for XOR vs. ADD is going to be lost in the noise from the cost of all the power it takes to do out-of-order execution and other stuff; all x86 CPUs have single-cycle scalar ADD and XOR, even for 64-bit integers, even on P4 Prescott/Nocona with its exotic adders.
Also #chqrlie suggested a compact readable way to write it in C without UB that looks nicer than the super-portable int mask = ternary thing.
The earlier "simpler" version of this function
Doesn't depend on any special property of MIN/MAX, so maybe useful for saturating to other boundaries with other overflow-detection conditions. Or in case a compiler does something better with this version.
int64_t signed_sat_add64_gnuc(int64_t a, int64_t b) {
long long res;
bool overflow = __builtin_saddll_overflow(a, b, &res);
if (overflow) {
// overflow is only possible in one direction for a given `b`
return (b<0) ? INT64_MIN : INT64_MAX;
}
return res;
}
which compiles as follows
# gcc9.1 -O3 (clang chooses branchless)
signed_sat_add64_gnuc:
add rdi, rsi # the actual add
jo .L3 # jump on signed overflow
mov rax, rdi # retval = the non-overflowing add result
ret
.L3:
movabs rdx, 9223372036854775807
test rsi, rsi # one of the addends is still available after
movabs rax, -9223372036854775808 # missed optimization: lea rdx, [rax+1]
cmovns rax, rdx # branchless selection of which saturation limit
ret
This is basically what #drwowe's inline asm does, but with a test replacing one cmov. (And of course different conditions on the cmov.)
Another downside to this vs. the _v2 with shr/add is that this needs 2 constants. In a loop, this would tie up an extra register. (Again unless b is a compile-time constant.)
clang uses cmov instead of a branch, and does spot the lea rax, [rcx + 1] trick to avoid a 2nd 10-byte mov r64, imm64 instruction. (Or clang6.0 and earlier use the shr 63/add trick instead of that cmov.)
The first version of this answer put int64_t sat = (b<0) ? MIN : MAX outside the if(), but gcc missed the optimization of moving that inside the branch so it's not run at all for the non-overflow case. That's even better than running it off the critical path. (And doesn't matter if the compiler decides to go branchless).
But when I put it outside the if and after the __builtin_saddll_overflow, gcc was really dumb and saved the bool result in an integer, then did the test/cmov, then used test on the saddll_overflow result again to put it back in FLAGS. Reordering the source fixed that.
I'm still looking for a decent portable solution, but this is as good as I've come up with so far:
Suggestions for improvements?
int64 saturated_add(int64 x, int64 y) {
#if __GNUC__ && __X86_64__
asm("add %1, %0\n\t"
"jno 1f\n\t"
"cmovge %3, %0\n\t"
"cmovl %2, %0\n"
"1:" : "+r"(x) : "r"(y), "r"(kint64min), "r"(kint64max));
return x;
#else
return portable_saturated_add(x, y);
#endif
}
I have learned that some Intel/AMD CPUs can do simultanous multiply and add with SSE/AVX: FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2.
I like to know how to do this best in code and I also want to know how it's done internally in the CPU. I mean with the super-scalar architecture. Let's say I want to do a long sum such as the following in SSE:
//sum = a1*b1 + a2*b2 + a3*b3 +... where a is a scalar and b is a SIMD vector (e.g. from matrix multiplication)
sum = _mm_set1_ps(0.0f);
a1 = _mm_set1_ps(a[0]);
b1 = _mm_load_ps(&b[0]);
sum = _mm_add_ps(sum, _mm_mul_ps(a1, b1));
a2 = _mm_set1_ps(a[1]);
b2 = _mm_load_ps(&b[4]);
sum = _mm_add_ps(sum, _mm_mul_ps(a2, b2));
a3 = _mm_set1_ps(a[2]);
b3 = _mm_load_ps(&b[8]);
sum = _mm_add_ps(sum, _mm_mul_ps(a3, b3));
...
My question is how does this get converted to simultaneous multiply and add? Can the data be dependent? I mean can the CPU do _mm_add_ps(sum, _mm_mul_ps(a1, b1)) simultaneously or do the registers used in the multiplication and add have to be independent?
Lastly how does this apply to FMA (with Haswell)? Is _mm_add_ps(sum, _mm_mul_ps(a1, b1)) automatically converted to a single FMA instruction or micro-operation?
The compiler is allowed to fuse a separated add and multiply, even though this changes the final result (by making it more accurate).
An FMA has only one rounding (it effectively keeps infinite precision for the internal temporary multiply result), while an ADD + MUL has two.
The IEEE and C standards allow this when #pragma STDC FP_CONTRACT ON is in effect, and compilers are allowed to have it ON by default (but not all do). Gcc contracts into FMA by default (with the default -std=gnu*, but not -std=c*, e.g. -std=c++14). For Clang, it's only enabled with -ffp-contract=fast. (With just the #pragma enabled, only within a single expression like a+b*c, not across separate C++ statements.).
This is different from strict vs. relaxed floating point (or in gcc terms, -ffast-math vs. -fno-fast-math) that would allow other kinds of optimizations that could increase the rounding error depending on the input values. This one is special because of the infinite precision of the FMA internal temporary; if there was any rounding at all in the internal temporary, this wouldn't be allowed in strict FP.
Even if you enable relaxed floating-point, the compiler might still choose not to fuse since it might expect you to know what you're doing if you're already using intrinsics.
So the best way to make sure you actually get the FMA instructions you want is you actually use the provided intrinsics for them:
FMA3 Intrinsics: (AVX2 - Intel Haswell)
_mm_fmadd_pd(), _mm256_fmadd_pd()
_mm_fmadd_ps(), _mm256_fmadd_ps()
and about a gazillion other variations...
FMA4 Intrinsics: (XOP - AMD Bulldozer)
_mm_macc_pd(), _mm256_macc_pd()
_mm_macc_ps(), _mm256_macc_ps()
and about a gazillion other variations...
I tested the following code in GCC 5.3, Clang 3.7, ICC 13.0.1 and MSVC 2015 (compiler version 19.00).
float mul_add(float a, float b, float c) {
return a*b + c;
}
__m256 mul_addv(__m256 a, __m256 b, __m256 c) {
return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}
With the right compiler options (see below) every compiler will generate a vfmadd instruction (e.g. vfmadd213ss) from mul_add. However, only MSVC fails to contract mul_addv to a single vfmadd instruction (e.g. vfmadd213ps).
The following compiler options are sufficient to generate vfmadd instructions (except with mul_addv with MSVC).
GCC: -O2 -mavx2 -mfma
Clang: -O1 -mavx2 -mfma -ffp-contract=fast
ICC: -O1 -march=core-avx2
MSVC: /O1 /arch:AVX2 /fp:fast
GCC 4.9 will not contract mul_addv to a single fma instruction but since at least GCC 5.1 it does. I don't know when the other compilers started doing this.
I would like to combine two __m128 values to one __m256.
Something like this:
__m128 a = _mm_set_ps(1, 2, 3, 4);
__m128 b = _mm_set_ps(5, 6, 7, 8);
to something like:
__m256 c = { 1, 2, 3, 4, 5, 6, 7, 8 };
are there any intrinsics that I can use to do this?
This should do what you want:
__m128 a = _mm_set_ps(1,2,3,4);
__m128 b = _mm_set_ps(5,6,7,8);
__m256 c = _mm256_castps128_ps256(a);
c = _mm256_insertf128_ps(c,b,1);
If the order is reversed from what you want, then just switch a and b.
The intrinsic of interest is _mm256_insertf128_ps which will let you insert a 128-bit register into either lower or upper half of a 256-bit AVX register:
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_avx_insertf128_ps.htm
The complete family of them is here:
_mm256_insertf128_pd()
_mm256_insertf128_ps()
_mm256_insertf128_si256()
Intel documents __m256 _mm256_set_m128(__m128 hi, __m128 lo) and _mm256_setr_m128(lo, hi) as intrinsics for the vinsertf128 instruction, which is what you want1. (Of course there are also __m256d and __m256i versions, which use the same instruction. The __m256i version may use vinserti128 if AVX2 is available, otherwise it'll use f128 as well.)
These days, those intrinsics are supported by current versions of all 4 major x86 compilers (gcc, clang, MSVC, and ICC). But not by older versions; like some other helper intrinsics that Intel documents, widespread implementation has been slow. (Often GCC or clang are the last hold-out to not have something you wish you could use portably.)
Use it if you don't need portability to old GCC versions: it's the most readable way to express what you want, following the well known _mm_set and _mm_setr patterns.
Performance-wise, it's of course just as efficient as manual cast + vinsertf128 intrinsics (#Mysticial's answer), and for gcc at least that's literally how the internal .h actually implements _mm256_set_m128.
Compiler version support for _mm256_set_m128 / _mm256_setr_m128:
clang: 3.6 and newer. (Mainline, IDK about Apple)
GCC: 8.x and newer, not present as recently as GCC7!
ICC: since at least ICC13, the earliest on Godbolt.
MSVC: since at least 19.14 and 19.10 (WINE) VS2015, the earliest on Godbolt.
https://godbolt.org/z/1na1qr has test cases for all 4 compilers.
__m256 combine_testcase(__m128 hi, __m128 lo) {
return _mm256_set_m128(hi, lo);
}
They all compile this function to one vinsertf128, except MSVC where even the latest version wastes a vmovups xmm2, xmm1 copying a register. (I used -O2 -Gv -arch:AVX to use the vectorcall convention so args would be in registers to make an efficient non-inlined function definition possible for MSVC.) Presumably MSVC would be ok inlining into a larger function if it could write the result to a 3rd register, instead of the calling convention forcing it to read xmm0 and write ymm0.
Footnote 1:
vinsertf128 is very efficient on Zen1, and as efficient as vperm2f128 on other CPUs with 256-bit-wide shuffle units. It can also take the high half from memory in case the compiler spilled it or is folding a _mm_loadu_ps into it, instead of needing to separately do a 128-bit load into a register; vperm2f128's memory operand would be a 256-bit load which you don't want.
https://uops.info/ / https://agner.org/optimize/
Even this one will work:
__m128 a = _mm_set_ps(1,2,3,4);
__m128 b = _mm_set_ps(5,6,7,8);
__m256 c = _mm256_insertf128_ps(c,a,0);
c = _mm256_insertf128_ps(c,b,1);
You will get a warning as c is not initialized but you can ignore it and if you're looking for performances this solution will use less clock cycle then the other one.
Can also use permute intrinsic:
__m128 a = _mm_set_ps(1,2,3,4);
__m128 b = _mm_set_ps(5,6,7,8);
__m256 c = _mm256_permute2f128_ps(_mm256_castps128_ps256(a), _mm256_castps128_ps256(b), 0x20);
I don't know which way is faster.
I believe this is the simplest:
#define _mm256_set_m128(/* __m128 */ hi, /* __m128 */ lo) \ _mm256_insertf128_ps(_mm256_castps128_ps256(lo), (hi), 0x1)
__m256 c = _mm256_set_m128(a, b);
Do note __mm256_set_m128 is already defined in msvc 2019 if you #include "immintrin.h"