Intrinsics for 128 multiplication and division - c

In x86_64 I know that the mul and div opcodes support 128-bit integers by putting the lower 64 bits in the rax register and the upper 64 bits in rdx. I was looking for some sort of intrinsic to do this in the Intel intrinsics guide and I could not find one. I am writing a big number library where the word size is 64 bits. Right now I am doing division by a single word like this:
int ubi_div_i64(ubigint_t* a, ubi_i64_t b, ubi_i64_t* rem)
{
if(b == 0)
return UBI_MATH_ERR;
ubi_i64_t r = 0;
for(size_t i = a->used; i-- > 0;)
{
ubi_i64_t out;
__asm__("\t"
"div %[d] \n\t"
: "=a"(out), "=d"(r)
: "a"(a->data[i]), "d"(r), [d]"r"(b)
: "cc");
a->data[i] = out;
//ubi_i128_t top = (r << 64) + a->data[i];
//r = top % b;
//a->data[i] = top / b;
}
if(rem)
*rem = r;
return ubi_strip_leading_zeros(a);
}
It would be nice if I could use something in the x86intrinsics.h header instead of inline asm.

GCC has the __int128 and unsigned __int128 types (also available as __int128_t and __uint128_t).
Arithmetic with them should be using the right assembly instructions when they exist; I've used them in the past to get the upper 64 bits of a product, although I've never used it for division. If it's not using the right ones, submit a bug report / feature request as appropriate.
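As a rough sketch of the approach described above (assuming GCC or Clang on a 64-bit target; the helper names mul_hi64 and div_128_by_64 are made up for illustration), the 128-bit types can express both operations. Whether the compiler emits a single mul or div is not guaranteed. For the division in particular it may call a library helper, since it cannot know that the quotient fits in 64 bits.

```c
#include <stdint.h>

/* Hypothetical helper: high 64 bits of the full 64x64 -> 128-bit product. */
static inline uint64_t mul_hi64(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Hypothetical helper: divide the 128-bit value hi:lo by d.
 * The caller must ensure d != 0 and hi < d, so the quotient fits in 64 bits
 * (the same precondition the hardware div instruction has). */
static inline uint64_t div_128_by_64(uint64_t hi, uint64_t lo, uint64_t d,
                                     uint64_t *rem) {
    unsigned __int128 n = ((unsigned __int128)hi << 64) | lo;
    *rem = (uint64_t)(n % d);
    return (uint64_t)(n / d);
}
```

In a loop like the one in the question, the running remainder is always smaller than the divisor, so the hi < d precondition holds by construction.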

Last I looked into it, the intrinsics were in a state of flux. The main reason for the intrinsics in this case appears to be that MSVC in 64-bit mode does not allow inline assembly.
With MSVC (and I think ICC) you can use _umul128 for mul and _mulx_u64 for mulx. These don't work in GCC , at least not GCC 4.9 (_umul128 is much older than GCC 4.9). I don't know if GCC plans to support these since you can get mul and mulx indirectly through __int128 (depending on your compile options) or directly through inline assembly.
__int128 works fine until you need a larger type and a 128-bit carry. Then you need adc, adcx, or adox, and these are even more of a problem with intrinsics. Intel's documentation disagrees with MSVC, and the compilers don't seem to produce adox yet with these intrinsics. See this question: _addcarry_u64 and _addcarryx_u64 with MSVC and ICC.
Inline assembly is probably the best solution with GCC (and probably even ICC).
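For the carry problem specifically, one alternative to inline asm is the GNU C overflow builtins (GCC 5 or newer, and Clang). The following is only a sketch with a hypothetical helper name, add_carry_u64. Whether a chain of these calls compiles to adc depends on the compiler and its version, so check the generated asm.

```c
#include <stdint.h>

/* Hypothetical helper: *sum = a + b + carry_in (mod 2^64), returns carry-out.
 * carry_in is expected to be 0 or 1. */
static inline unsigned add_carry_u64(uint64_t a, uint64_t b,
                                     unsigned carry_in, uint64_t *sum) {
    uint64_t t;
    /* At most one of the two additions can wrap, so OR-ing the flags is safe. */
    unsigned c1 = __builtin_add_overflow(a, b, &t);
    unsigned c2 = __builtin_add_overflow(t, (uint64_t)carry_in, sum);
    return c1 | c2;
}
```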

Related

GNU C inline asm input constraint for AVX512 mask registers (k1...k7)?

AVX512 introduced opmask feature for its arithmetic commands. A simple example: godbolt.org.
#include <immintrin.h>
__m512i add(__m512i a, __m512i b) {
__m512i sum;
asm(
"mov ebx, 0xAAAAAAAA; \n\t"
"kmovw k1, ebx; \n\t"
"vpaddd %[SUM] %{k1%}%{z%}, %[A], %[B]; # conditional add "
: [SUM] "=v"(sum)
: [A] "v" (a),
[B] "v" (b)
: "ebx", "k1" // clobbers
);
return sum;
}
-march=skylake-avx512 -masm=intel -O3
mov ebx,0xaaaaaaaa
kmovw k1,ebx
vpaddd zmm0{k1}{z},zmm0,zmm1
The problem is that k1 has to be specified.
Is there an input constraint like "r" for integers except that it picks a k register instead of a general-purpose register?
__mmask16 is literally a typedef for unsigned short (and other mask types for other plain integer types), so we just need a constraint for passing it in a k register.
We have to go digging in the gcc sources config/i386/constraints.md to find it:
The constraint for any mask register is "k". Or use "Yk" for k1..k7 (which can be used as a predicate, unlike k0). You'd use an "=k" operand as the destination for a compare-into-mask, for example.
Obviously you can use "=Yk"(tmp) with a __mmask16 tmp to get the compiler to do register allocation for you, instead of just declaring clobbers on whichever "k" registers you decide to use.
Prefer intrinsics like _mm512_maskz_add_epi32
First of all, https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. Understanding asm is great, but use that to read compiler output and/or figure out what would be optimal, then write intrinsics that can compile the way you want. Performance tuning info like https://agner.org/optimize/ and https://uops.info/ list things by asm mnemonic, and they're shorter / easier to remember than intrinsics, but you can search by mnemonic to find intrinsics on https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Intrinsics will also let the compiler fold loads into memory source operands for other instructions; with AVX512 those can even be broadcast loads! Your inline asm forces the compiler to use a separate load instruction. Even a "vm" input won't let the compiler pick a broadcast-load as the memory source, because it wouldn't know the broadcast element width of the instruction(s) you were using it with.
Use _mm512_mask_add_epi32 or _mm512_maskz_add_epi32 especially if you're already using __m512i types from <immintrin.h>.
Also, your asm has a bug: you're using {k1} merge-masking not {k1}{z} zero-masking, but you used uninitialized __m512i sum; with an output-only "=v" constraint as the merge destination! As a stand-alone function, it happens to merge into a because the calling convention has ZMM0 = first input = return value register. But when inlining into other functions, you definitely can't assume that sum will pick the same register as a. Your best bet is to use a read/write operand for "+v"(a) and use it as the destination and first source.
Merge-masking only makes sense with a "+v" read/write operand. (Or in an asm statement with multiple instructions where you've already written an output once, and want to merge another result into it.)
Intrinsics would stop you from making this mistake; the merge-masking version has an extra input for the merge-target. (The asm destination operand).
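A minimal sketch of the intrinsic route, for comparison with the inline asm. The function name maskz_add_epi32_demo and the pointer-based interface are assumptions made for illustration. The target attribute (a GNU C extension) lets the file compile without -mavx512f, but the function must still only run on a CPU that supports AVX512F.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical demo: adds 16 int32 elements, zeroing every even-indexed
 * result, using zero-masking with the constant mask 0xAAAA. */
__attribute__((target("avx512f")))
void maskz_add_epi32_demo(const int32_t *a, const int32_t *b, int32_t *out) {
    __m512i va = _mm512_loadu_si512(a);
    __m512i vb = _mm512_loadu_si512(b);
    /* Mask bit i controls element i: set bits keep a[i] + b[i],
     * clear bits produce 0. The compiler allocates the k register. */
    __m512i sum = _mm512_maskz_add_epi32((__mmask16)0xAAAA, va, vb);
    _mm512_storeu_si512(out, sum);
}
```

The merge-masking variant, _mm512_mask_add_epi32, takes the merge source as an explicit first argument, which is what makes the uninitialized-destination mistake hard to write.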
Example using "Yk"
// works with -march=skylake-avx512 or -march=knl
// or just -mavx512f but don't do that.
// also needed: -masm=intel
#include <immintrin.h>
__m512i add_zmask(__m512i a, __m512i b) {
__m512i sum;
asm(
"vpaddd %[SUM] %{%[mask]%}%{z%}, %[A], %[B]; # conditional add "
: [SUM] "=v"(sum)
: [A] "v" (a),
[B] "v" (b),
[mask] "Yk" ((__mmask16)0xAAAA)
// no clobbers needed, unlike your question which I fixed with an edit
);
return sum;
}
Note that all the { and } are escaped with % (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Special-format-strings), so they're not parsed as dialect-alternatives {AT&T | Intel-syntax}.
This compiles with gcc as early as 4.9, but don't actually do that because it doesn't understand -march=skylake-avx512, or even have tuning settings for Skylake or KNL. Use a more recent GCC that knows about your CPU for best results.
Godbolt compiler explorer:
# gcc8.3 -O3 -march=skylake-avx512 or -march=knl (and -masm=intel)
add(long long __vector, long long __vector):
mov eax, -21846
kmovw k1, eax # compiler-generated
# inline asm starts
vpaddd zmm0 {k1}{z}, zmm0, zmm1; # conditional add
# inline asm ends
ret
-mavx512bw (implied by -march=skylake-avx512 but not knl) is required for "Yk" to work on an int. If you're compiling with -march=knl, integer literals need a cast to __mmask16 or __mmask8, because unsigned int = __mmask32 isn't available for masks.
[mask] "Yk" (0xAAAA) requires AVX512BW even though the constant does fit in 16 bits, just because bare integer literals always have type int. (vpaddd zmm has 16 elements per vector, so I shortened your constant to 16-bit.) With AVX512BW, you can pass wider constants or leave out the cast for narrow ones.
gcc6 and later support -march=skylake-avx512. Use that to set tuning as well as enabling everything. Preferably gcc8 or at least gcc7. Newer compilers generate less clunky code with new ISA extensions like AVX512 if you're ever using it outside of inline asm.
gcc5 supports -mavx512f -mavx512bw but doesn't know about Skylake.
gcc4.9 doesn't support -mavx512bw.
"Yk" is unfortunately not yet documented in https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html.
I knew where to look in the GCC source thanks to Ross's answer on In GNU C inline asm, what are the size-override modifiers for xmm/ymm/zmm for a single operand?
While it is undocumented, looking here we see:
(define_register_constraint "Yk" "TARGET_AVX512F ? MASK_REGS :
NO_REGS" "#internal Any mask register that can be used as predicate,
i.e. k1-k7.")
Editing your godbolt to this:
asm(
"vpaddd %[SUM] %{%[k]}, %[A], %[B]"
: [SUM] "=v"(sum)
: [A] "v" (a), [B] "v" (b), [k] "Yk" (0xaaaaaaaa) );
seems to produce the correct output.
That said, I usually try to discourage people from using inline asm (and undocumented features). Can you use _mm512_mask_add_epi32?

Inline assembly size mismatch for 8-bit rotate

I am trying to write the rotate left operation in C using inline assembly, like so:
byte rotate_left(byte a) {
__asm__("rol %0, $1": "=a" (a) : "a" (a));
return a;
}
(Where byte is typedefed as unsigned char).
This raises the error
/tmp/ccKYcEHR.s:363: Error: operand size mismatch for `rol'.
What is the problem here?
AT&T syntax uses the opposite order from Intel syntax. The rotate count has to be first, not last: rol $1, %0.
Also, you don't need and shouldn't use inline asm for this: https://gcc.gnu.org/wiki/DontUseInlineAsm
As described in Best practices for circular shift (rotate) operations in C++, GNU C has intrinsics for narrow rotates, because the rotate-idiom recognition code fails to optimize away an and of the rotate count. x86 shifts/rotates mask the count with count & 31 even for 8-bit and 16-bit, but rotates still wrap around. It does matter for shifts though.
Anyway, gcc has a builtin function for narrow rotates to avoid any overhead. There's a __rolb wrapper for it in x86intrin.h, but MSVC uses its own __rotr8 and so on from its intrin.h. Anyway, clang doesn't support either the __builtin or the x86intrin.h wrappers for rotates, but gcc and ICC do.
#include <stdint.h>
uint8_t rotate_left_byte_by1(uint8_t a) {
return __builtin_ia32_rolqi(a, 1); // qi = quarter-integer
}
I used uint8_t from stdint.h like a normal person instead of defining a byte type.
This doesn't compile at all with clang, but it compiles as you'd hope with gcc7.2:
rotate_left_byte_by1:
movl %edi, %eax
rolb %al
ret
This gives you a function that compiles as efficiently as your inline asm ever could, but which can optimize away completely for compile-time constants, and the compiler knows how it works / what it does and can optimize accordingly.
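If portability beyond gcc and ICC matters, a plain C rotate idiom is another option. This is a sketch with a hypothetical name, rotl8. Masking the count keeps every shift in range, but whether a given compiler turns the expression into a single rol instruction is something to confirm in its asm output.

```c
#include <stdint.h>

/* Hypothetical helper: rotate an 8-bit value left by n (taken mod 8). */
static inline uint8_t rotl8(uint8_t x, unsigned n) {
    n &= 7;
    /* x is promoted to int, so the left shift cannot overflow;
     * (-n) & 7 is the complementary count and is 0 when n is 0. */
    return (uint8_t)((x << n) | (x >> ((-n) & 7)));
}
```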

Signed saturated add of 64-bit ints?

I'm looking for some C code for signed saturated 64-bit addition that compiles to efficient x86-64 code with the gcc optimizer. Portable code would be ideal, although an asm solution could be used if necessary.
static const int64 kint64max = 0x7fffffffffffffffll;
static const int64 kint64min = 0x8000000000000000ll;
int64 signed_saturated_add(int64 x, int64 y) {
bool x_is_negative = (x & kint64min) != 0;
bool y_is_negative = (y & kint64min) != 0;
int64 sum = x+y;
bool sum_is_negative = (sum & kint64min) != 0;
if (x_is_negative != y_is_negative) return sum; // can't overflow
if (x_is_negative && !sum_is_negative) return kint64min;
if (!x_is_negative && sum_is_negative) return kint64max;
return sum;
}
The function as written produces a fairly lengthy assembly output with several branches. Any tips on optimization? Seems like it ought to be implementable with just an ADD with a few CMOV instructions but I'm a little bit rusty with this stuff.
This may be optimized further but here is a portable solution. It does not invoke undefined behavior and it checks for integer overflow before it can occur.
#include <stdint.h>
int64_t sadd64(int64_t a, int64_t b)
{
if (a > 0) {
if (b > INT64_MAX - a) {
return INT64_MAX;
}
} else if (b < INT64_MIN - a) {
return INT64_MIN;
}
return a + b;
}
This is a solution that continues in the vein that had been given in one of the comments, and has been used in ouah's solution, too. Here the generated code should be free of conditional jumps:
int64_t signed_saturated_add(int64_t x, int64_t y) {
// determine the lower or upper bound of the result
int64_t ret = (x < 0) ? INT64_MIN : INT64_MAX;
// this is always well defined:
// if x < 0 this adds a positive value to INT64_MIN
// if x > 0 this subtracts a positive value from INT64_MAX
int64_t comp = ret - x;
// the condition is equivalent to
// ((x < 0) && (y > comp)) || ((x >=0) && (y <= comp))
if ((x < 0) == (y > comp)) ret = x + y;
return ret;
}
The first looks as if there would be a conditional move to do, but because of the special values my compiler gets off with an addition: in 2's complement INT64_MIN is INT64_MAX+1.
There is then only one conditional move for the assignment of the sum, in case everything is fine.
All of this has no UB, because in the abstract state machine the sum is only done if there is no overflow.
Related: unsigned saturation is much easier, and efficiently possible in pure ISO C: How to do unsigned saturating addition in C?
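For reference, the unsigned case can be sketched in a few lines of ISO C (the name sat_add_u64 is made up here). Unsigned addition wraps by definition, so a result smaller than one of the operands means a carry occurred.

```c
#include <stdint.h>

/* Hypothetical helper: unsigned saturating add in pure ISO C. */
static inline uint64_t sat_add_u64(uint64_t a, uint64_t b) {
    uint64_t sum = a + b;                  /* well-defined wraparound */
    return (sum < a) ? UINT64_MAX : sum;   /* wrapped iff sum < a */
}
```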
Compilers are terrible at all of the pure C options proposed so far.
They don't see that they can use the signed-overflow flag result from an add instruction to detect that saturation to INT64_MIN/MAX is necessary. AFAIK there's no pure C pattern that compilers recognize as reading the OF flag result of an add.
Inline asm is not a bad idea here, but we can avoid that with GCC's builtins that expose UB-safe wrapping signed addition with a boolean overflow result. https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html
(If you were going to use GNU C inline asm, that would limit you just as much as these GNU C builtins. And these builtins aren't arch-specific. They do require gcc5 or newer, but gcc4.9 and older are basically obsolete. https://gcc.gnu.org/wiki/DontUseInlineAsm - it defeats constant propagation and is hard to maintain.)
This version uses the fact that INT64_MIN = INT64_MAX + 1ULL (for 2's complement) to select INT64_MIN/MAX based on the sign of b. Signed-overflow UB is avoided by using uint64_t for that add, and GNU C defines the behaviour of converting an unsigned integer to a signed type that can't represent its value (bit-pattern used unchanged). Current gcc/clang benefit from this hand-holding because they don't figure out this trick from a ternary like (b<0) ? INT64_MIN : INT64_MAX. (See below for the alternate version using that). I haven't checked the asm on 32-bit architectures.
GCC only supports 2's complement integer types, so a function using __builtin_add_overflow doesn't have to care about portability to C implementations that use 1's complement (where the same identity holds) or sign/magnitude (where it doesn't), even if you made a version for long or int instead of int64_t.
#include <stdint.h>
#ifndef __cplusplus
#include <stdbool.h>
#endif
// static inline
int64_t signed_sat_add64_gnuc_v2(int64_t a, int64_t b) {
long long res;
bool overflow = __builtin_saddll_overflow(a, b, &res);
if (overflow) {
// overflow is only possible in one direction depending on the sign bit
return ((uint64_t)b >> 63) + INT64_MAX;
// INT64_MIN = INT64_MAX + 1 wraparound, done with unsigned
}
return res;
}
Another option is (b>>63) ^ INT64_MAX which might be useful if manually vectorizing where SIMD XOR can run on more ports than SIMD ADD, like on Intel before Skylake. (But x86 doesn't have SIMD 64-bit arithmetic right shift, only logical, so this would only help for an int32_t version, and you'd need to efficiently detect overflow in the first place. Or you might use a variable blend on the sign bit, like blendvpd) See Add saturate 32-bit signed ints intrinsics? with x86 SIMD intrinsics (SSE2/SSE4)
On Godbolt with gcc9 and clang8 (along with the other implementations from other answers):
# gcc9.1 -O3 (clang chooses branchless with cmov)
signed_sat_add64_gnuc_v2:
add rdi, rsi # the actual add
jo .L3 # jump on signed overflow
mov rax, rdi # retval = the non-overflowing add result
ret
.L3:
movabs rax, 9223372036854775807 # INT64_MAX
shr rsi, 63 # b is still available after the ADD
add rax, rsi
ret
When inlining into a loop, the mov imm64 can be hoisted. If b is needed afterwards then we might need an extra mov, otherwise shr/add can destroy b, leaving the INT64_MAX constant in a register undamaged. Or if the compiler wants to use cmov (like clang does), it has to mov/shr because it has to get the saturation constant ready before the add, preserving both operands.
Notice that the critical path for the non-overflowing case only includes an add and a not-taken jo. These can't macro-fuse into a single uop even on Sandybridge-family, but the jo only costs throughput not latency thanks to branch prediction + speculative execution. (When inlining, the mov will go away.)
If saturation is actually not rare and branch prediction is a problem, compile with profile-guided optimization and gcc will hopefully do if-conversion and use a cmovno instead of jo, like clang does. This puts the MIN/MAX selection on the critical path, as well as the CMOV itself. The MIN/MAX selection can run in parallel with the add.
You could use a<0 instead. I used b because I think most people would write x = sadd(x, 123) instead of x = sadd(123, x), and having a compile-time-constant allows the b<0 to optimize away. For maximal optimization opportunity, you could use if (__builtin_constant_p(a)) to ask the compiler if a was a compile-time constant. That works for gcc, but clang evaluates the const-ness way too early, before inlining, so it's useless except in macros with clang. (Related: ICC19 doesn't do constant propagation through __builtin_saddll_overflow: it puts both inputs in registers and still does the add. GCC and Clang just return a constant.)
This optimization is especially valuable inside a loop with the MIN/MAX selection hoisted, leaving only add + cmovo. (Or add + jo to a mov.)
cmov is a 2 uop instruction on Intel P6-family and SnB-family before Broadwell because it has 3 inputs. On other x86 CPUs (Broadwell / Skylake, and AMD), it's a single-uop instruction. On most such CPUs it has 1 cycle latency. It's a simple ALU select operation; the only complication is 3 inputs (2 regs + FLAGS). But on KNL it's still 2-cycle latency.
Unfortunately gcc for AArch64 fails to use adds to set flags and check the V (overflow) flag result, so it spends several instructions deciding whether to branch.
Clang does a great job, and AArch64's immediate encodings can represent INT64_MAX as an operand to eor or add.
// clang8.0 -O3 -target aarch64
signed_sat_add64_gnuc:
orr x9, xzr, #0x7fffffffffffffff // mov constant = OR with zero reg
adds x8, x0, x1 // add and set flags
add x9, x9, x1, lsr #63 // sat = (b shr 63) + MAX
csel x0, x9, x8, vs // conditional-select, condition = VS = oVerflow flag Set
ret
Optimizing MIN vs. MAX selection
As noted above, return (b<0) ? INT64_MIN : INT64_MAX; doesn't compile optimally with most versions of gcc/clang; they generate both constants in registers and cmov to select, or something similar on other ISAs.
We can assume 2's complement because GCC only supports 2's complement integer types, and because the ISO C optional int64_t type is guaranteed to be 2's complement if it exists. (Signed overflow of int64_t is still UB, this allows it to be a simple typedef of long or long long).
(On a sign/magnitude C implementation that supported some equivalent of __builtin_add_overflow, a version of this function for long long or int couldn't use the SHR / ADD trick. For extreme portability you'd probably just use the simple ternary, or for sign/magnitude specifically you could return (b&0x800...) | 0x7FFF... to OR the sign bit of b into a max-magnitude number.)
For two's complement, the bit-patterns for MIN and MAX are 0x8000... (just the high bit set) and 0x7FFF... (all other bits set). They have a couple interesting properties: MIN = MAX + 1 (if computed with unsigned on the bit-pattern), and MIN = ~MAX: their bit-patterns are bitwise inverses, aka one's complement of each other.
The MIN = ~MAX property follows from ~x = -x - 1 (a re-arrangement of the standard -x = ~x + 1 2's complement identity) and the fact that MIN = -MAX - 1. The +1 property is unrelated, and follows from simple rollover from most-positive to most-negative and applies to the one's complement encoding of signed integers as well. (But not sign/magnitude; there you'd get -0.)
The above function uses the MIN = MAX + 1 trick. The MIN = ~MAX trick is also usable by broadcasting the sign bit to all positions with an arithmetic right shift (creating 0 or 0xFF...), and XORing with that.
GNU C guarantees that signed right shifts are arithmetic (sign-extension), so (b>>63) ^ INT64_MAX is equivalent to (b<0) ? INT64_MIN : INT64_MAX in GNU C.
ISO C leaves signed right shifts implementation-defined, but we could use a ternary of b<0 ? ~0ULL : 0ULL. Compilers will optimize the following to sar / xor, or equivalent instruction(s), but it has no implementation-defined behaviour. AArch64 can use a shifted input operand for eor just as well as it can for add.
// an earlier version of this answer used this
int64_t mask = (b<0) ? ~0ULL : 0; // compiles to sar with good compilers, but is not implementation-defined.
return mask ^ INT64_MAX;
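The two selection tricks can be put side by side as small helpers. This is a sketch: the names sat_limit_add and sat_limit_xor are hypothetical, and the final conversion from uint64_t back to int64_t is implementation-defined in ISO C (GCC and Clang keep the bit-pattern unchanged).

```c
#include <stdint.h>

/* MIN = MAX + 1: add the sign bit of b to INT64_MAX, using unsigned math. */
static inline int64_t sat_limit_add(int64_t b) {
    return (int64_t)(((uint64_t)b >> 63) + (uint64_t)INT64_MAX);
}

/* MIN = ~MAX: flip every bit of INT64_MAX when b is negative. */
static inline int64_t sat_limit_xor(int64_t b) {
    uint64_t mask = (b < 0) ? ~0ULL : 0ULL;
    return (int64_t)(mask ^ (uint64_t)INT64_MAX);
}
```

Both return INT64_MIN for negative b and INT64_MAX otherwise, matching the plain ternary.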
Fun fact: AArch64 has a csinv instruction: conditional-select inverse. And it can put INT64_MIN into a register with a single 32-bit mov instruction, thanks to its powerful immediate encodings for simple bit-patterns. AArch64 GCC was already using csinv for the MIN = ~MAX trick for the original return (b<0) ? INT64_MIN : INT64_MAX; version.
clang 6.0 and earlier on Godbolt were using shr/add for the plain (b<0) ? INT64_MIN : INT64_MAX; version. It looks more efficient than what clang7/8 do, so that's a regression / missed-optimization bug I think. (And it's the whole point of this section and why I wrote a 2nd version.)
I chose the MIN = MAX + 1 version because it could possibly auto-vectorize better: x86 has 64-bit SIMD logical right shifts but only 16 and 32-bit SIMD arithmetic right shifts until AVX512F. Of course, signed-overflow detection with SIMD probably makes it not worth it until AVX512 for 64-bit integers. Well maybe AVX2. And if it's part of some larger calculation that can otherwise vectorize efficiently, then unpacking to scalar and back sucks.
For scalar it's truly a wash; neither way compiles any better, and sar/shr perform identically, and so do add/xor, on all CPUs that Agner Fog has tested. (https://agner.org/optimize/).
But + can sometimes optimize into other things, though, so you could imagine gcc folding a later + or - of a constant into the overflow branch. Or possibly using LEA for that add instead of ADD to copy-and-add. The difference in power from a simpler ALU execution unit for XOR vs. ADD is going to be lost in the noise from the cost of all the power it takes to do out-of-order execution and other stuff; all x86 CPUs have single-cycle scalar ADD and XOR, even for 64-bit integers, even on P4 Prescott/Nocona with its exotic adders.
Also @chqrlie suggested a compact readable way to write it in C without UB that looks nicer than the super-portable int mask = ternary thing.
The earlier "simpler" version of this function
Doesn't depend on any special property of MIN/MAX, so maybe useful for saturating to other boundaries with other overflow-detection conditions. Or in case a compiler does something better with this version.
int64_t signed_sat_add64_gnuc(int64_t a, int64_t b) {
long long res;
bool overflow = __builtin_saddll_overflow(a, b, &res);
if (overflow) {
// overflow is only possible in one direction for a given `b`
return (b<0) ? INT64_MIN : INT64_MAX;
}
return res;
}
which compiles as follows
# gcc9.1 -O3 (clang chooses branchless)
signed_sat_add64_gnuc:
add rdi, rsi # the actual add
jo .L3 # jump on signed overflow
mov rax, rdi # retval = the non-overflowing add result
ret
.L3:
movabs rdx, 9223372036854775807
test rsi, rsi # one of the addends is still available after
movabs rax, -9223372036854775808 # missed optimization: lea rdx, [rax+1]
cmovns rax, rdx # branchless selection of which saturation limit
ret
This is basically what @drwowe's inline asm does, but with a test replacing one cmov. (And of course different conditions on the cmov.)
Another downside to this vs. the _v2 with shr/add is that this needs 2 constants. In a loop, this would tie up an extra register. (Again unless b is a compile-time constant.)
clang uses cmov instead of a branch, and does spot the lea rax, [rcx + 1] trick to avoid a 2nd 10-byte mov r64, imm64 instruction. (Or clang6.0 and earlier use the shr 63/add trick instead of that cmov.)
The first version of this answer put int64_t sat = (b<0) ? MIN : MAX outside the if(), but gcc missed the optimization of moving that inside the branch so it's not run at all for the non-overflow case. That's even better than running it off the critical path. (And doesn't matter if the compiler decides to go branchless).
But when I put it outside the if and after the __builtin_saddll_overflow, gcc was really dumb and saved the bool result in an integer, then did the test/cmov, then used test on the saddll_overflow result again to put it back in FLAGS. Reordering the source fixed that.
I'm still looking for a decent portable solution, but this is as good as I've come up with so far:
Suggestions for improvements?
int64 saturated_add(int64 x, int64 y) {
#if defined(__GNUC__) && defined(__x86_64__)
asm("add %1, %0\n\t"
"jno 1f\n\t"
"cmovge %3, %0\n\t"
"cmovl %2, %0\n"
"1:" : "+r"(x) : "r"(y), "r"(kint64min), "r"(kint64max));
return x;
#else
return portable_saturated_add(x, y);
#endif
}

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX: FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2.
I'd like to know how to do this best in code and I also want to know how it's done internally in the CPU. I mean with the super-scalar architecture. Let's say I want to do a long sum such as the following in SSE:
//sum = a1*b1 + a2*b2 + a3*b3 +... where a is a scalar and b is a SIMD vector (e.g. from matrix multiplication)
sum = _mm_set1_ps(0.0f);
a1 = _mm_set1_ps(a[0]);
b1 = _mm_load_ps(&b[0]);
sum = _mm_add_ps(sum, _mm_mul_ps(a1, b1));
a2 = _mm_set1_ps(a[1]);
b2 = _mm_load_ps(&b[4]);
sum = _mm_add_ps(sum, _mm_mul_ps(a2, b2));
a3 = _mm_set1_ps(a[2]);
b3 = _mm_load_ps(&b[8]);
sum = _mm_add_ps(sum, _mm_mul_ps(a3, b3));
...
My question is how does this get converted to simultaneous multiply and add? Can the data be dependent? I mean can the CPU do _mm_add_ps(sum, _mm_mul_ps(a1, b1)) simultaneously or do the registers used in the multiplication and add have to be independent?
Lastly how does this apply to FMA (with Haswell)? Is _mm_add_ps(sum, _mm_mul_ps(a1, b1)) automatically converted to a single FMA instruction or micro-operation?
The compiler is allowed to fuse a separated add and multiply, even though this changes the final result (by making it more accurate).
An FMA has only one rounding (it effectively keeps infinite precision for the internal temporary multiply result), while an ADD + MUL has two.
The IEEE and C standards allow this when #pragma STDC FP_CONTRACT ON is in effect, and compilers are allowed to have it ON by default (but not all do). Gcc contracts into FMA by default (with the default -std=gnu*, but not -std=c*, e.g. -std=c++14). For Clang, it's only enabled with -ffp-contract=fast. (With just the #pragma enabled, only within a single expression like a+b*c, not across separate C++ statements.).
This is different from strict vs. relaxed floating point (or in gcc terms, -ffast-math vs. -fno-fast-math) that would allow other kinds of optimizations that could increase the rounding error depending on the input values. This one is special because of the infinite precision of the FMA internal temporary; if there was any rounding at all in the internal temporary, this wouldn't be allowed in strict FP.
Even if you enable relaxed floating-point, the compiler might still choose not to fuse since it might expect you to know what you're doing if you're already using intrinsics.
So the best way to make sure you actually get the FMA instructions you want is to use the provided intrinsics for them:
FMA3 Intrinsics: (AVX2 - Intel Haswell)
_mm_fmadd_pd(), _mm256_fmadd_pd()
_mm_fmadd_ps(), _mm256_fmadd_ps()
and about a gazillion other variations...
FMA4 Intrinsics: (XOP - AMD Bulldozer)
_mm_macc_pd(), _mm256_macc_pd()
_mm_macc_ps(), _mm256_macc_ps()
and about a gazillion other variations...
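As an illustration, the sum from the question could be written with the FMA3 intrinsic roughly like this. It is a sketch under assumptions: the function name fma_accumulate and its pointer-based interface are invented here, and the target attribute (a GNU C extension) is used so the file compiles without -mfma. The function must only be called on a CPU that supports FMA.

```c
#include <immintrin.h>

/* Hypothetical demo: out[0..3] = sum over i of a[i] * b[4*i .. 4*i+3],
 * with each step done as one fused multiply-add (a single rounding). */
__attribute__((target("fma")))
void fma_accumulate(const float *a, const float *b, int n, float *out) {
    __m128 sum = _mm_setzero_ps();
    for (int i = 0; i < n; i++) {
        __m128 ai = _mm_set1_ps(a[i]);        /* broadcast the scalar */
        __m128 bi = _mm_loadu_ps(&b[4 * i]);  /* unaligned 4-float load */
        sum = _mm_fmadd_ps(ai, bi, sum);      /* sum = ai * bi + sum */
    }
    _mm_storeu_ps(out, sum);
}
```

Note that each iteration depends on the previous sum, so this loop is bound by FMA latency. Using several independent accumulators is the usual way to keep the FMA units busy.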
I tested the following code in GCC 5.3, Clang 3.7, ICC 13.0.1 and MSVC 2015 (compiler version 19.00).
float mul_add(float a, float b, float c) {
return a*b + c;
}
__m256 mul_addv(__m256 a, __m256 b, __m256 c) {
return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}
With the right compiler options (see below) every compiler will generate a vfmadd instruction (e.g. vfmadd213ss) from mul_add. However, only MSVC fails to contract mul_addv to a single vfmadd instruction (e.g. vfmadd213ps).
The following compiler options are sufficient to generate vfmadd instructions (except with mul_addv with MSVC).
GCC: -O2 -mavx2 -mfma
Clang: -O1 -mavx2 -mfma -ffp-contract=fast
ICC: -O1 -march=core-avx2
MSVC: /O1 /arch:AVX2 /fp:fast
GCC 4.9 will not contract mul_addv to a single fma instruction but since at least GCC 5.1 it does. I don't know when the other compilers started doing this.

How to combine two __m128 values to __m256?

I would like to combine two __m128 values to one __m256.
Something like this:
__m128 a = _mm_set_ps(1, 2, 3, 4);
__m128 b = _mm_set_ps(5, 6, 7, 8);
to something like:
__m256 c = { 1, 2, 3, 4, 5, 6, 7, 8 };
are there any intrinsics that I can use to do this?
This should do what you want:
__m128 a = _mm_set_ps(1,2,3,4);
__m128 b = _mm_set_ps(5,6,7,8);
__m256 c = _mm256_castps128_ps256(a);
c = _mm256_insertf128_ps(c,b,1);
If the order is reversed from what you want, then just switch a and b.
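To see which order you actually get, it can help to store the result to memory. The sketch below wraps the snippet above in a hypothetical function, combine_demo, and uses the GNU C target attribute so it compiles without -mavx (it must still only run on a CPU with AVX). Keep in mind that _mm_set_ps takes its arguments from the highest element down to the lowest.

```c
#include <immintrin.h>

/* Hypothetical demo: combine two __m128 values and store the 8 floats. */
__attribute__((target("avx")))
void combine_demo(float out[8]) {
    __m128 a = _mm_set_ps(1, 2, 3, 4);     /* element 0 is 4, element 3 is 1 */
    __m128 b = _mm_set_ps(5, 6, 7, 8);     /* element 0 is 8, element 3 is 5 */
    __m256 c = _mm256_castps128_ps256(a);  /* a becomes the low 128 bits */
    c = _mm256_insertf128_ps(c, b, 1);     /* b becomes the high 128 bits */
    _mm256_storeu_ps(out, c);              /* memory order: 4 3 2 1 8 7 6 5 */
}
```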
The intrinsic of interest is _mm256_insertf128_ps which will let you insert a 128-bit register into either lower or upper half of a 256-bit AVX register:
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_avx_insertf128_ps.htm
The complete family of them is here:
_mm256_insertf128_pd()
_mm256_insertf128_ps()
_mm256_insertf128_si256()
Intel documents __m256 _mm256_set_m128(__m128 hi, __m128 lo) and _mm256_setr_m128(lo, hi) as intrinsics for the vinsertf128 instruction, which is what you want (see Footnote 1). (Of course there are also __m256d and __m256i versions, which use the same instruction. The __m256i version may use vinserti128 if AVX2 is available, otherwise it'll use f128 as well.)
These days, those intrinsics are supported by current versions of all 4 major x86 compilers (gcc, clang, MSVC, and ICC). But not by older versions; like some other helper intrinsics that Intel documents, widespread implementation has been slow. (Often GCC or clang are the last hold-out to not have something you wish you could use portably.)
Use it if you don't need portability to old GCC versions: it's the most readable way to express what you want, following the well known _mm_set and _mm_setr patterns.
Performance-wise, it's of course just as efficient as manual cast + vinsertf128 intrinsics (@Mysticial's answer), and for gcc at least that's literally how the internal .h actually implements _mm256_set_m128.
Compiler version support for _mm256_set_m128 / _mm256_setr_m128:
clang: 3.6 and newer. (Mainline, IDK about Apple)
GCC: 8.x and newer, not present as recently as GCC7!
ICC: since at least ICC13, the earliest on Godbolt.
MSVC: since at least 19.14 and 19.10 (WINE) VS2015, the earliest on Godbolt.
https://godbolt.org/z/1na1qr has test cases for all 4 compilers.
__m256 combine_testcase(__m128 hi, __m128 lo) {
return _mm256_set_m128(hi, lo);
}
They all compile this function to one vinsertf128, except MSVC where even the latest version wastes a vmovups xmm2, xmm1 copying a register. (I used -O2 -Gv -arch:AVX to use the vectorcall convention so args would be in registers to make an efficient non-inlined function definition possible for MSVC.) Presumably MSVC would be ok inlining into a larger function if it could write the result to a 3rd register, instead of the calling convention forcing it to read xmm0 and write ymm0.
Footnote 1:
vinsertf128 is very efficient on Zen1, and as efficient as vperm2f128 on other CPUs with 256-bit-wide shuffle units. It can also take the high half from memory in case the compiler spilled it or is folding a _mm_loadu_ps into it, instead of needing to separately do a 128-bit load into a register; vperm2f128's memory operand would be a 256-bit load which you don't want.
https://uops.info/ / https://agner.org/optimize/
Even this one will work:
__m128 a = _mm_set_ps(1,2,3,4);
__m128 b = _mm_set_ps(5,6,7,8);
__m256 c = _mm256_insertf128_ps(c,a,0);
c = _mm256_insertf128_ps(c,b,1);
You will get a warning because c is not initialized, but you can ignore it, and if you're looking for performance this solution will use fewer clock cycles than the other one.
Can also use permute intrinsic:
__m128 a = _mm_set_ps(1,2,3,4);
__m128 b = _mm_set_ps(5,6,7,8);
__m256 c = _mm256_permute2f128_ps(_mm256_castps128_ps256(a), _mm256_castps128_ps256(b), 0x20);
I don't know which way is faster.
I believe this is the simplest:
#define _mm256_set_m128(/* __m128 */ hi, /* __m128 */ lo) \
    _mm256_insertf128_ps(_mm256_castps128_ps256(lo), (hi), 0x1)
__m256 c = _mm256_set_m128(a, b);
Do note that _mm256_set_m128 is already defined in MSVC 2019 if you #include "immintrin.h".

Resources