gcc intrinsic for extended division/multiplication

Modern CPUs can perform extended multiplication between two native-size words and store the low and high result in separate registers. Similarly, when performing division, they store the quotient and the remainder in two different registers instead of discarding the unwanted part.
Is there some sort of portable gcc intrinsic which would take the following signature:
void extmul(size_t a, size_t b, size_t *lo, size_t *hi);
Or something like that, and for division:
void extdiv(size_t a, size_t b, size_t *q, size_t *r);
I know I could do it myself with inline assembly and shoehorn portability into it by throwing #ifdef's in the code, or I could emulate the multiplication part using partial sums (which would be significantly slower) but I would like to avoid that for readability. Surely there exists some built-in function to do this?

For gcc since version 4.6 you can use __int128. This works on most 64 bit hardware.
To get the 128 bit result of a 64x64 bit multiplication, just use
void extmul(size_t a, size_t b, size_t *lo, size_t *hi) {
    /* unsigned: a full 64x64 product can exceed the range of signed __int128 */
    unsigned __int128 result = (unsigned __int128)a * (unsigned __int128)b;
    *lo = (size_t)result;
    *hi = (size_t)(result >> 64);
}
On x86_64 gcc is smart enough to compile this to
0: 48 89 f8 mov %rdi,%rax
3: 49 89 d0 mov %rdx,%r8
6: 48 f7 e6 mul %rsi
9: 49 89 00 mov %rax,(%r8)
c: 48 89 11 mov %rdx,(%rcx)
f: c3 retq
No native 128 bit support or similar required, and after inlining only the mul instruction remains.
Edit: On a 32 bit arch this works in a similar way; you need to replace the 128-bit type with uint64_t and the shift width with 32. The optimization works even on older gccs.
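A minimal sketch of that 32-bit variant, assuming a C99 compiler with <stdint.h>. The name extmul32 is made up for illustration and is not part of any library:

```c
#include <stdint.h>

/* 32-bit analogue of extmul: widen to 64 bits, multiply, split the halves.
   extmul32 is a hypothetical name used only for this sketch. */
void extmul32(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi)
{
    uint64_t result = (uint64_t)a * b;   /* full 64-bit product, cannot overflow */
    *lo = (uint32_t)result;              /* low 32 bits */
    *hi = (uint32_t)(result >> 32);      /* high 32 bits */
}
```

Only one operand needs the explicit cast; the usual arithmetic conversions widen the other one before the multiplication.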

For those wondering about the other half of the question (division), gcc does not provide an intrinsic for that because the processor division instructions don't conform to the standard.
This is true both with 128-bit dividends on 64-bit x86 targets and 64-bit dividends on 32-bit x86 targets. The problem is that DIV will cause divide overflow exceptions in cases where the standard says the result should be truncated. For example (unsigned long long) (((unsigned __int128) 1 << 64) / 1) should evaluate to 0, but would cause divide overflow exception if evaluated with DIV.
(Thanks to @ross-ridge for this info)
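For the narrower case that the question's extdiv signature describes, where the dividend is a single machine word, plain C is probably enough. The sketch below assumes an optimizing compiler; when the quotient and remainder of the same operands are both used, GCC and Clang on x86 typically emit one DIV and read both result registers, but that is an optimization and not a guarantee:

```c
#include <stddef.h>

/* Sketch matching the signature from the question; b must be nonzero.
   The dividend is one word wide here, so the overflow problem described
   above (a quotient that does not fit) cannot arise. */
void extdiv(size_t a, size_t b, size_t *q, size_t *r)
{
    *q = a / b;   /* quotient */
    *r = a % b;   /* remainder of the same operands */
}
```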

Related

Why are interrupts not generated by C code but easily generated by assembly instructions?

I am programming a little kernel and implementing an IDT and interrupts.
This C code in my little kernel does not generate any interrupt:
int x = 5/0;
int f[4];
f[5] = 8;
But this assembly code does generate an interrupt:
asm("int $0");
(and handlers work right).
Help me understand why this happens.
I also tried this:
int a = 3;
int b = 3;
int c = a-b;
int x = a/c;
Nothing I try in C code generates an exception for me.
Even this did not work:
int div_by_0(int a, int b){return a/b;}
int x = div_by_0(5, 0);

void fun ( void )
{
int a = 3;
int b = 3;
int c = a-b;
int x = a/c;
}
Disassembly of section .text:
0000000000000000 <fun>:
0: f3 c3 repz retq
There is no divide to trigger a divide by zero; it is all dead code.
And none of this has anything to do with the int instruction, these are completely separate topics.
As mentioned in the comments test it without using dead code.
int fun0 ( int x )
{
return(5/x);
}
int fun1 ( void )
{
return(fun0(0));
}
but understand that it still may not have the desired effect:
Disassembly of section .text:
0000000000000000 <fun0>:
0: b8 05 00 00 00 mov $0x5,%eax
5: 99 cltd
6: f7 ff idiv %edi
8: c3 retq
9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
0000000000000010 <fun1>:
10: 0f 0b ud2
because the optimizer for fun1 could see the fun0 function. You want to have the code under test in a separate optimization domain. In fun0 above, the idiv would then generate the divide by zero. It then becomes an operating system issue as to how that is handled and whether it is visible to you.
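One way to get a similar effect inside a single source file is to read the divisor through a volatile object, so the optimizer cannot prove its value. This is only a sketch: divide_runtime is a made-up name, and dividing by zero is still undefined behaviour in C, so the result is what x86 compilers tend to do in practice and nothing the language promises.

```c
/* Reading the divisor through a volatile object forces a real load at run
   time, so the compiler cannot constant-fold the division or discard it as
   provably undefined. divide_runtime is a hypothetical helper name. */
int divide_runtime(int a, const volatile int *divisor)
{
    return a / *divisor;
}
```

In the kernel, calling it with a pointer to a volatile int that holds 0 should then reach an actual idiv instruction and raise #DE.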

The problem you are seeing is because division by 0 is undefined behaviour in C/C++. The compiler has managed to do enough optimization at compile time to realize you are dividing by zero. The compiler is free to do anything from things like halting and catching fire to making the result 0. Some compilers will emit a ud2 instruction to raise a CPU exception. The result is undefined.
You have a couple of options. Write your division in assembly and call that function from C/C++. Since you are using GCC (this works for Clang as well), you can also use inline assembly to generate a division by zero with something like:
#include <stdint.h> /* or replace uint16_t with unsigned short int */
void div_by_0 (void)
{
asm ("div %b0" :: "a"((uint16_t)0));
return;
}
This sets AX to 0 then divides AX by AL with the DIV instruction. 0/0 is undefined and will raise a Division Exception (#DE). This inline assembly should work with 16, 32, and 64-bit code.
In protected mode or long mode using int $# (Where # is the vector number) to trigger an exception is not always the same as getting a CPU generated exception. Some exceptions generated by the CPU push an error code on the stack after the return address that needs to be cleaned up by an interrupt handler. If you were to use int $0x0d from ring 0 to cause a #GP exception the interrupt handler would likely fault as it returns from the interrupt because using int to generate an exception never places an error code on the stack. This isn't a problem with int $0 because #DE doesn't have an error code placed on the stack by the CPU.

It turned out to be due to optimization flags. Because of some confusion in my Makefiles, the -O2 flag was in effect. If you enable the -O0 flag, exceptions work directly from C, and even this simple code raises an exception:
int x = 5/0;

MSVC Inline ASM to GCC

I'm trying to handle both MSVC and GCC compilers while updating this code base to work on GCC. But I'm unsure exactly how GCC's inline ASM works. Now I'm not great at translating ASM to C, else I would just use C instead of ASM.
SLONG Div16(signed long a, signed long b)
{
signed long v;
#ifdef __GNUC__ // GCC doesnt work.
__asm() {
#else // MSVC
__asm {
#endif
mov edx, a
mov ebx, b
mov eax, edx
shl eax, 16
sar edx, 16
idiv ebx
mov v, eax
}
return v;
}
signed long ROR13(signed long val)
{
_asm{
ror val, 13
}
}
I assume ROR13 works something like (val << 13) | (val >> (32 - 13)) but the code doesn't produce the same output.
What is the proper way to translate this inline ASM to GCC, and/or what's the C translation of this code?

GCC uses a completely different syntax for inline assembly than MSVC does, so it's quite a bit of work to maintain both forms. It's not an especially good idea, either. There are many problems with inline assembly. People often use it because they think it'll make their code run faster, but it usually has quite the opposite effect. Unless you're an expert in both assembly language and the compiler's code-generation strategies, you are far better off letting the compiler's optimizer generate the code.
When you try to do that, you will have to be a bit careful here, though: signed right shifts are implementation-defined in C, so if you care about portability, you need to cast the value to an equivalent unsigned type:
#include <limits.h> // for CHAR_BIT
signed long ROR13(signed long val)
{
return ((unsigned long)val >> 13) |
((unsigned long)val << ((sizeof(val) * CHAR_BIT) - 13));
}
(See also Best practices for circular shift (rotate) operations in C++).
This will have the same semantics as your original code: ROR val, 13. In fact, MSVC will generate precisely that object code, as will GCC. (Clang, interestingly, will do ROL val, 19, which produces the same result, given the way that rotations work. ICC 17 generates an extended shift instead: SHLD val, val, 19. I'm not sure why; maybe that's faster than rotation on certain Intel processors, or maybe it's the same on Intel but slower on AMD.)
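The equivalence between a right rotation by 13 and a left rotation by 19 can be checked with fixed-width types. The helpers below are a sketch with names chosen for illustration; they use uint32_t so the width does not depend on the platform's long:

```c
#include <stdint.h>

/* Rotate right by n bits. Masking the counts keeps both shifts in the
   range 0..31, so n == 0 does not produce an undefined 32-bit shift. */
static inline uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> (n & 31)) | (x << (-n & 31));
}

/* Rotate left by n bits, using the same masking idiom. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << (n & 31)) | (x >> (-n & 31));
}
```

With these, rotr32(x, 13) and rotl32(x, 19) give the same value for any x, because 13 + 19 = 32.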
To implement Div16 in pure C, you want:
signed long Div16(signed long a, signed long b)
{
return ((long long)a << 16) / b;
}
On a 64-bit architecture that can do native 64-bit division, (assuming long is still a 32-bit type like on Windows) this will be transformed into:
movsxd rax, a # sign-extend from 32 to 64, if long wasn't already 64-bit
shl rax, 16
cqo # sign-extend rax into rdx:rax
movsxd rcx, b
idiv rcx # or idiv b if the inputs were already 64-bit
ret
Unfortunately, on 32-bit x86, the code isn't nearly as good. Compilers emit a call into their internal library function that provides extended 64-bit division, because they can't prove that using a single 64b/32b => 32b idiv instruction won't fault. (It will raise a #DE exception if the quotient doesn't fit in eax, rather than just truncating)
In other words, transforming:
int32_t Divide(int64_t a, int32_t b)
{
return (a / b);
}
into:
mov eax, a_low
mov edx, a_high
idiv b # will fault if a/b is outside [-2^31, 2^31-1]
ret
is not a legal optimization: the compiler is unable to emit this code. The language standard says that a 64/32 division is promoted to a 64/64 division, which always produces a 64-bit result. That you later cast or coerce that 64-bit result to a 32-bit value is irrelevant to the semantics of the division operation itself. Faulting for some combinations of a and b would violate the as-if rule, unless the compiler can prove that those combinations of a and b are impossible. (For example, if b was known to be greater than 1<<16, this could be a legal optimization for a = (int32_t)input; a <<= 16; But even though this would produce the same behaviour as the C abstract machine for all inputs, gcc and clang currently don't do that optimization.)
There simply isn't a good way to override the rules imposed by the language standard and force the compiler to emit the desired object code. MSVC doesn't offer an intrinsic for it (although there is a Windows API function, MulDiv, it's not fast, and just uses inline assembly for its own implementation—and with a bug in a certain case, now cemented thanks to the need for backwards compatibility). You essentially have no choice but to resort to assembly, either inline or linked in from an external module.
So, you get into ugliness. It looks like this:
signed long Div16(signed long a, signed long b)
{
#ifdef __GNUC__ // A GNU-style compiler (e.g., GCC, Clang, etc.)
signed long quotient;
signed long remainder; // (unused, but necessary to signal clobbering)
__asm__("idivl %[divisor]"
: "=a" (quotient),
"=d" (remainder)
: "0" ((unsigned long)a << 16),
"1" (a >> 16),
[divisor] "rm" (b)
:
);
return quotient;
#elif _MSC_VER // A Microsoft-style compiler (i.e., MSVC)
__asm
{
mov eax, DWORD PTR [a]
mov edx, eax
shl eax, 16
sar edx, 16
idiv DWORD PTR [b]
// leave result in EAX, where it will be returned
}
#else
#error "Unsupported compiler"
#endif
}
This results in the desired output on both Microsoft and GNU-style compilers.
Well, mostly. For some reason, when you use the rm constraint, which gives the compiler the freedom to choose whether to treat the divisor as a memory operand or load it into a register, Clang generates worse object code than if you just use r (which forces it to load it into a register). This doesn't affect GCC or ICC. If you care about the quality of output on Clang, you'll probably just want to use r, since this will give equally good object code on all compilers.
Live Demo on Godbolt Compiler Explorer
(Note: GCC uses the SAL mnemonic in its output, instead of the SHL mnemonic. These are identical instructions—the difference only matters for right shifts—and all sane assembly programmers use SHL. I have no idea why GCC emits SAL, but you can just convert it mentally into SHL.)

Why is memcmp(a, b, 4) only sometimes optimized to a uint32 comparison?

Given this code:
#include <string.h>
int equal4(const char* a, const char* b)
{
return memcmp(a, b, 4) == 0;
}
int less4(const char* a, const char* b)
{
return memcmp(a, b, 4) < 0;
}
GCC 7 on x86_64 introduced an optimization for the first case (Clang has done it for a long time):
mov eax, DWORD PTR [rsi]
cmp DWORD PTR [rdi], eax
sete al
movzx eax, al
But the second case still calls memcmp():
sub rsp, 8
mov edx, 4
call memcmp
add rsp, 8
shr eax, 31
Could a similar optimization be applied to the second case? What's the best assembly for this, and is there any clear reason why it isn't being done (by GCC or Clang)?
See it on Godbolt's Compiler Explorer: https://godbolt.org/g/jv8fcf

If you generate code for a little-endian platform, optimizing a four-byte memcmp used for an ordering test (< 0 or > 0) to a single DWORD comparison is invalid.
When memcmp compares individual bytes it goes from low-addressed bytes to high-addressed bytes, regardless of the platform.
In order for memcmp to return zero all four bytes must be identical. Hence, the order of comparison does not matter. Therefore, DWORD optimization is valid, because you ignore the sign of the result.
However, when memcmp returns a nonzero value, its sign is decided by the first differing byte, so byte ordering matters. Hence, implementing the same comparison using 32-bit DWORD comparison requires a specific endianness: the platform must be big-endian, otherwise the result of comparison would be incorrect.

Endianness is the problem here. Consider this input:
a = 01 00 00 03
b = 02 00 00 02
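If you compare these two arrays by treating them as 32-bit integers on a little-endian machine, then you'll find that a is larger (because 0x03000001 > 0x02000002), whereas memcmp reports a as smaller because its first byte, 01, is less than 02. On a big-endian machine, the integer comparison would work as expected.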
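The example can be reproduced on any host by assembling the integers from the bytes explicitly instead of relying on the machine's own byte order. This is a sketch; the helper names as_le32 and as_be32 are made up for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Interpret 4 bytes as a little-endian integer (first byte least significant). */
static uint32_t as_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Interpret 4 bytes as a big-endian integer (first byte most significant). */
static uint32_t as_be32(const unsigned char *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] << 8 | (uint32_t)p[3];
}

/* The two inputs from the example above. */
static const unsigned char bytes_a[4] = {0x01, 0x00, 0x00, 0x03};
static const unsigned char bytes_b[4] = {0x02, 0x00, 0x00, 0x02};
```

Here memcmp(bytes_a, bytes_b, 4) is negative and as_be32 agrees with it, while as_le32 orders the two inputs the other way round.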

As discussed in other answers/comments, using memcmp(a,b,4) < 0 is equivalent to an unsigned comparison between big-endian integers. It couldn't inline as efficiently as == 0 on little-endian x86.
More importantly, the current version of this behaviour in gcc7/8 only looks for memcmp() == 0 or != 0. Even on a big-endian target where this could inline just as efficiently for < or >, gcc won't do it. (Godbolt's newest big-endian compilers are PowerPC 64 gcc6.3, and MIPS/MIPS64 gcc5.4. mips is big-endian MIPS, while mipsel is little-endian MIPS.) If testing this with future gcc, use a = __builtin_assume_aligned(a, 4) to make sure gcc doesn't have to worry about unaligned-load performance/correctness on non-x86. (Or just use const int32_t* instead of const char*.)
If/when gcc learns to inline memcmp for cases other than EQ/NE, maybe gcc will do it on little-endian x86 when its heuristics tell it the extra code size will be worth it. e.g. in a hot loop when compiling with -fprofile-use (profile-guided optimization).
If you want compilers to do a good job for this case, you should probably assign to a uint32_t and use an endian-conversion function like ntohl. But make sure you pick one that can actually inline; apparently Windows has an ntohl that compiles to a DLL call. See other answers on that question for some portable-endian stuff, and also someone's imperfect attempt at a portable_endian.h, and this fork of it. I was working on a version for a while, but never finished/tested it or posted it.
Pointer-casting to const uint32_t* would be Undefined Behaviour, if the bytes were written as anything but aligned uint32_t or through char*. If you're not sure about strict-aliasing and/or alignment, memcpy into abytes or use GNU C attributes: see another Q&A about alignment and strict-aliasing for workarounds. Most compilers are good at optimizing away small fixed-size memcpy.
// I know the question just wonders why gcc does what it does,
// not asking for how to write it differently.
// Beware of alignment performance or even fault issues outside of x86.
#include <endian.h>
#include <stdint.h>
static inline
uint32_t load32_native_endian(const void *vp){
typedef uint32_t unaligned_aliasing_u32 __attribute__((aligned(1),may_alias));
const unaligned_aliasing_u32 *up = vp;
return *up; // #ifndef __GNUC__ then use memcpy
}
int equal4_optim(const char* a, const char* b) {
uint32_t abytes = load32_native_endian(a);
uint32_t bbytes = load32_native_endian(b);
return abytes == bbytes;
}
int less4_optim(const char* a, const char* b) {
uint32_t a_native = be32toh(load32_native_endian(a));
uint32_t b_native = be32toh(load32_native_endian(b));
return a_native < b_native;
}
I checked on Godbolt, and that compiles to efficient code (basically identical to what I wrote in asm below), especially on big-endian platforms, even with old gcc. It also makes much better code than ICC17, which inlines memcmp but only to a byte-compare loop (even for the == 0 case).
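For compilers without the GNU C attributes, the memcpy fallback mentioned in the code comment above could look like the following sketch. The names load32_memcpy and equal4_memcpy are made up for illustration; the load has no alignment or strict-aliasing requirements, and mainstream compilers usually turn the fixed-size memcpy into a single load:

```c
#include <stdint.h>
#include <string.h>

/* Portable unaligned, aliasing-safe 32-bit load in native byte order. */
static inline uint32_t load32_memcpy(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);   /* fixed size 4, typically inlined as one load */
    return v;
}

/* Equality does not depend on byte order, so no endian conversion is needed. */
int equal4_memcpy(const char *a, const char *b)
{
    return load32_memcpy(a) == load32_memcpy(b);
}
```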
I think this hand-crafted sequence is an optimal implementation of less4() (for the x86-64 SystemV calling convention, like used in the question, with const char *a in rdi and b in rsi).
less4:
mov edi, [rdi]
mov esi, [rsi]
bswap edi
bswap esi
# data loaded and byte-swapped to native unsigned integers
xor eax,eax # solves the same problem as gcc's movzx, see below
cmp edi, esi
setb al # eax=1 if *a was Below(unsigned) *b, else 0
ret
Those are all single-uop instructions on Intel and AMD CPUs since K8 and Core2 (http://agner.org/optimize/).
Having to bswap both operands has an extra code-size cost vs. the == 0 case: we can't fold one of the loads into a memory operand for cmp. (That saves code size, and uops thanks to micro-fusion.) This is on top of the two extra bswap instructions.
On CPUs that support movbe, it can save code size: movbe ecx, [rsi] is a load + bswap. On Haswell, it's 2 uops, so presumably it decodes to the same uops as mov ecx, [rsi] / bswap ecx. On Atom/Silvermont, it's handled right in the load ports, so it's fewer uops as well as smaller code-size.
See the setcc part of my xor-zeroing answer for more about why xor/cmp/setcc (which clang uses) is better than cmp/setcc/movzx (typical for gcc).
In the usual case where this inlines into code that branches on the result, the setcc + zero-extend are replaced with a jcc; the compiler optimizes away creating a boolean return value in a register. This is yet another advantage of inlining: the library memcmp does have to create an integer boolean return value which the caller tests, because no x86 ABI/calling convention allows for returning boolean conditions in flags. (I don't know of any non-x86 calling conventions that do that either). For most library memcmp implementations, there's also significant overhead in choosing a strategy depending on length, and maybe alignment checking. That can be pretty cheap, but for size 4 it's going to be more than the cost of all the real work.

Endianness is one problem, but signed char is another. For example, consider that the four bytes you compare are 0x207f2020 and 0x20802020. The 80 as signed char is -128, the 7f as signed char is +127. But if you compare the four bytes, no comparison will give you the right order.
Of course you can do an xor with 0x80808080 and then you can just use an unsigned compare.

Signed saturated add of 64-bit ints?

I'm looking for some C code for signed saturated 64-bit addition that compiles to efficient x86-64 code with the gcc optimizer. Portable code would be ideal, although an asm solution could be used if necessary.
static const int64 kint64max = 0x7fffffffffffffffll;
static const int64 kint64min = 0x8000000000000000ll;
int64 signed_saturated_add(int64 x, int64 y) {
bool x_is_negative = (x & kint64min) != 0;
bool y_is_negative = (y & kint64min) != 0;
int64 sum = x+y;
bool sum_is_negative = (sum & kint64min) != 0;
if (x_is_negative != y_is_negative) return sum; // can't overflow
if (x_is_negative && !sum_is_negative) return kint64min;
if (!x_is_negative && sum_is_negative) return kint64max;
return sum;
}
The function as written produces a fairly lengthy assembly output with several branches. Any tips on optimization? Seems like it ought to be implementable with just an ADD with a few CMOV instructions but I'm a little bit rusty with this stuff.

This may be optimized further but here is a portable solution. It does not invoke undefined behavior and it checks for integer overflow before it could occur.
#include <stdint.h>
int64_t sadd64(int64_t a, int64_t b)
{
if (a > 0) {
if (b > INT64_MAX - a) {
return INT64_MAX;
}
} else if (b < INT64_MIN - a) {
return INT64_MIN;
}
return a + b;
}

This is a solution that continues in the vein of one of the comments, which ouah's solution uses too. Here the generated code should be free of conditional jumps:
int64_t signed_saturated_add(int64_t x, int64_t y) {
// determine the lower or upper bound of the result
int64_t ret = (x < 0) ? INT64_MIN : INT64_MAX;
// this is always well defined:
// if x < 0 this adds a positive value to INT64_MIN
// if x >= 0 this subtracts a non-negative value from INT64_MAX
int64_t comp = ret - x;
// the condition is equivalent to
// ((x < 0) && (y > comp)) || ((x >=0) && (y <= comp))
if ((x < 0) == (y > comp)) ret = x + y;
return ret;
}
The first looks as if there would be a conditional move to do, but because of the special values my compiler gets off with an addition: in 2's complement INT64_MIN is INT64_MAX+1.
There is then only one conditional move for the assignment of the sum, in case everything is fine.
All of this has no UB, because in the abstract machine the sum is only done if there is no overflow.

Related: unsigned saturation is much easier, and efficiently possible in pure ISO C: How to do unsigned saturating addition in C?
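As a point of comparison, a minimal sketch of the unsigned case in pure ISO C. The function name is chosen for illustration; it relies only on the fact that unsigned addition wraps modulo 2^64:

```c
#include <stdint.h>

/* Unsigned addition is well defined and wraps modulo 2^64. The sum wrapped
   exactly when it ends up smaller than one of the operands. */
uint64_t unsigned_saturated_add(uint64_t a, uint64_t b)
{
    uint64_t sum = a + b;
    return (sum < a) ? UINT64_MAX : sum;
}
```

The signed case has no equally direct test, because signed overflow is undefined behaviour and must be detected before or instead of the plain addition.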
Compilers are terrible at all of the pure C options for the signed case proposed so far.
They don't see that they can use the signed-overflow flag result from an add instruction to detect that saturation to INT64_MIN/MAX is necessary. AFAIK there's no pure C pattern that compilers recognize as reading the OF flag result of an add.
Inline asm is not a bad idea here, but we can avoid that with GCC's builtins that expose UB-safe wrapping signed addition with a boolean overflow result. https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html
(If you were going to use GNU C inline asm, that would limit you just as much as these GNU C builtins. And these builtins aren't arch-specific. They do require gcc5 or newer, but gcc4.9 and older are basically obsolete. https://gcc.gnu.org/wiki/DontUseInlineAsm - it defeats constant propagation and is hard to maintain.)
This version uses the fact that INT64_MIN = INT64_MAX + 1ULL (for 2's complement) to select INT64_MIN/MAX based on the sign of b. Signed-overflow UB is avoided by using uint64_t for that add, and GNU C defines the behaviour of converting an unsigned integer to a signed type that can't represent its value (bit-pattern used unchanged). Current gcc/clang benefit from this hand-holding because they don't figure out this trick from a ternary like (b<0) ? INT64_MIN : INT64_MAX. (See below for the alternate version using that). I haven't checked the asm on 32-bit architectures.
GCC only supports 2's complement integer types, so a function using __builtin_add_overflow doesn't have to care about portability to C implementations that use 1's complement (where the same identity holds) or sign/magnitude (where it doesn't), even if you made a version for long or int instead of int64_t.
#include <stdint.h>
#ifndef __cplusplus
#include <stdbool.h>
#endif
// static inline
int64_t signed_sat_add64_gnuc_v2(int64_t a, int64_t b) {
long long res;
bool overflow = __builtin_saddll_overflow(a, b, &res);
if (overflow) {
// overflow is only possible in one direction depending on the sign bit
return ((uint64_t)b >> 63) + INT64_MAX;
// INT64_MIN = INT64_MAX + 1 wraparound, done with unsigned
}
return res;
}
Another option is (b>>63) ^ INT64_MAX which might be useful if manually vectorizing where SIMD XOR can run on more ports than SIMD ADD, like on Intel before Skylake. (But x86 doesn't have SIMD 64-bit arithmetic right shift, only logical, so this would only help for an int32_t version, and you'd need to efficiently detect overflow in the first place. Or you might use a variable blend on the sign bit, like blendvpd) See Add saturate 32-bit signed ints intrinsics? with x86 SIMD intrinsics (SSE2/SSE4)
On Godbolt with gcc9 and clang8 (along with the other implementations from other answers):
# gcc9.1 -O3 (clang chooses branchless with cmov)
signed_sat_add64_gnuc_v2:
add rdi, rsi # the actual add
jo .L3 # jump on signed overflow
mov rax, rdi # retval = the non-overflowing add result
ret
.L3:
movabs rax, 9223372036854775807 # INT64_MAX
shr rsi, 63 # b is still available after the ADD
add rax, rsi
ret
When inlining into a loop, the mov imm64 can be hoisted. If b is needed afterwards then we might need an extra mov, otherwise shr/add can destroy b, leaving the INT64_MAX constant in a register undamaged. Or if the compiler wants to use cmov (like clang does), it has to mov/shr because it has to get the saturation constant ready before the add, preserving both operands.
Notice that the critical path for the non-overflowing case only includes an add and a not-taken jo. These can't macro-fuse into a single uop even on Sandybridge-family, but the jo only costs throughput not latency thanks to branch prediction + speculative execution. (When inlining, the mov will go away.)
If saturation is actually not rare and branch prediction is a problem, compile with profile-guided optimization and gcc will hopefully do if-conversion and use a cmovno instead of jo, like clang does. This puts the MIN/MAX selection on the critical path, as well as the CMOV itself. The MIN/MAX selection can run in parallel with the add.
You could use a<0 instead. I used b because I think most people would write x = sadd(x, 123) instead of x = sadd(123, x), and having a compile-time-constant allows the b<0 to optimize away. For maximal optimization opportunity, you could use if (__builtin_constant_p(a)) to ask the compiler if a was a compile-time constant. That works for gcc, but clang evaluates the const-ness way too early, before inlining, so it's useless except in macros with clang. (Related: ICC19 doesn't do constant propagation through __builtin_saddll_overflow: it puts both inputs in registers and still does the add. GCC and Clang just return a constant.)
This optimization is especially valuable inside a loop with the MIN/MAX selection hoisted, leaving only add + cmovo. (Or add + jo to a mov.)
cmov is a 2 uop instruction on Intel P6-family and SnB-family before Broadwell because it has 3 inputs. On other x86 CPUs (Broadwell / Skylake, and AMD), it's a single-uop instruction. On most such CPUs it has 1 cycle latency. It's a simple ALU select operation; the only complication is 3 inputs (2 regs + FLAGS). But on KNL it's still 2-cycle latency.
Unfortunately gcc for AArch64 fails to use adds to set flags and check the V (overflow) flag result, so it spends several instructions deciding whether to branch.
Clang does a great job, and AArch64's immediate encodings can represent INT64_MAX as an operand to eor or add.
// clang8.0 -O3 -target aarch64
signed_sat_add64_gnuc:
orr x9, xzr, #0x7fffffffffffffff // mov constant = OR with zero reg
adds x8, x0, x1 // add and set flags
add x9, x9, x1, lsr #63 // sat = (b shr 63) + MAX
csel x0, x9, x8, vs // conditional-select, condition = VS = oVerflow flag Set
ret
Optimizing MIN vs. MAX selection
As noted above, return (b<0) ? INT64_MIN : INT64_MAX; doesn't compile optimally with most versions of gcc/clang; they generate both constant in registers and cmov to select, or something similar on other ISAs.
We can assume 2's complement because GCC only supports 2's complement integer types, and because the ISO C optional int64_t type is guaranteed to be 2's complement if it exists. (Signed overflow of int64_t is still UB; this allows it to be a simple typedef of long or long long.)
(On a sign/magnitude C implementation that supported some equivalent of __builtin_add_overflow, a version of this function for long long or int couldn't use the SHR / ADD trick. For extreme portability you'd probably just use the simple ternary, or for sign/magnitude specifically you could return (b&0x800...) | 0x7FFF... to OR the sign bit of b into a max-magnitude number.)
For two's complement, the bit-patterns for MIN and MAX are 0x8000... (just the high bit set) and 0x7FFF... (all other bits set). They have a couple interesting properties: MIN = MAX + 1 (if computed with unsigned on the bit-pattern), and MIN = ~MAX: their bit-patterns are bitwise inverses, aka one's complement of each other.
The MIN = ~MAX property follows from ~x = -x - 1 (a re-arrangement of the standard -x = ~x + 1 2's complement identity) and the fact that MIN = -MAX - 1. The +1 property is unrelated, and follows from simple rollover from most-positive to most-negative and applies to the one's complement encoding of signed integer as well. (But not sign/magnitude; there the bit-pattern 0x8000... has the sign bit set and a magnitude of zero, so you'd get -0.)
The above function uses the MIN = MAX + 1 trick. The MIN = ~MAX trick is also usable by broadcasting the sign bit to all positions with an arithmetic right shift (creating 0 or 0xFF...), and XORing with that.
GNU C guarantees that signed right shifts are arithmetic (sign-extension), so (b>>63) ^ INT64_MAX is equivalent to (b<0) ? INT64_MIN : INT64_MAX in GNU C.
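The three ways of picking the saturation limit can be put side by side as a sketch. The function names are made up for illustration, and the second and third variants assume GNU C semantics: an out-of-range unsigned-to-signed conversion keeps the bit-pattern, and a signed right shift is arithmetic.

```c
#include <stdint.h>

/* Plain ternary: valid in any ISO C implementation that has int64_t. */
static int64_t sat_limit_ternary(int64_t b)
{
    return (b < 0) ? INT64_MIN : INT64_MAX;
}

/* MIN = MAX + 1 trick: add the sign bit (0 or 1) to MAX in unsigned
   arithmetic. Assumes GNU C for the conversion back to int64_t. */
static int64_t sat_limit_add(int64_t b)
{
    return (int64_t)(((uint64_t)b >> 63) + (uint64_t)INT64_MAX);
}

/* MIN = ~MAX trick: broadcast the sign bit to all positions (0 or -1),
   then XOR with MAX. Assumes GNU C's arithmetic signed right shift. */
static int64_t sat_limit_xor(int64_t b)
{
    return (b >> 63) ^ INT64_MAX;
}
```

All three return INT64_MIN for negative b and INT64_MAX otherwise when built with GCC or Clang.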
ISO C leaves signed right shifts implementation-defined, but we could use a ternary of b<0 ? ~0ULL : 0ULL. Compilers will optimize the following to sar / xor, or equivalent instruction(s), but it has no implementation-defined behaviour. AArch64 can use a shifted input operand for eor just as well as it can for add.
// an earlier version of this answer used this
int64_t mask = (b<0) ? ~0ULL : 0; // compiles to sar with good compilers, but is not implementation-defined.
return mask ^ INT64_MAX;
Fun fact: AArch64 has a csinv instruction: conditional-select inverse. And it can put INT64_MIN into a register with a single 32-bit mov instruction, thanks to its powerful immediate encodings for simple bit-patterns. AArch64 GCC was already using csinv for the MIN = ~MAX trick for the original return (b<0) ? INT64_MIN : INT64_MAX; version.
clang 6.0 and earlier on Godbolt were using shr/add for the plain (b<0) ? INT64_MIN : INT64_MAX; version. It looks more efficient than what clang7/8 do, so that's a regression / missed-optimization bug I think. (And it's the whole point of this section and why I wrote a 2nd version.)
I chose the MIN = MAX + 1 version because it could possibly auto-vectorize better: x86 has 64-bit SIMD logical right shifts but only 16 and 32-bit SIMD arithmetic right shifts until AVX512F. Of course, signed-overflow detection with SIMD probably makes it not worth it until AVX512 for 64-bit integers. Well maybe AVX2. And if it's part of some larger calculation that can otherwise vectorize efficiently, then unpacking to scalar and back sucks.
For scalar it's truly a wash; neither way compiles any better, and sar/shr perform identically, and so do add/xor, on all CPUs that Agner Fog has tested. (https://agner.org/optimize/).
But + can sometimes optimize into other things, so you could imagine gcc folding a later + or - of a constant into the overflow branch. Or possibly using LEA for that add instead of ADD to copy-and-add. The difference in power from a simpler ALU execution unit for XOR vs. ADD is going to be lost in the noise from the cost of all the power it takes to do out-of-order execution and other stuff; all x86 CPUs have single-cycle scalar ADD and XOR, even for 64-bit integers, even on P4 Prescott/Nocona with its exotic adders.
Also @chqrlie suggested a compact readable way to write it in C without UB that looks nicer than the super-portable int64_t mask = ternary thing.
The earlier "simpler" version of this function
Doesn't depend on any special property of MIN/MAX, so maybe useful for saturating to other boundaries with other overflow-detection conditions. Or in case a compiler does something better with this version.
int64_t signed_sat_add64_gnuc(int64_t a, int64_t b) {
long long res;
bool overflow = __builtin_saddll_overflow(a, b, &res);
if (overflow) {
// overflow is only possible in one direction for a given `b`
return (b<0) ? INT64_MIN : INT64_MAX;
}
return res;
}
which compiles as follows
# gcc9.1 -O3 (clang chooses branchless)
signed_sat_add64_gnuc:
add rdi, rsi # the actual add
jo .L3 # jump on signed overflow
mov rax, rdi # retval = the non-overflowing add result
ret
.L3:
movabs rdx, 9223372036854775807
test rsi, rsi # one of the addends is still available after
movabs rax, -9223372036854775808 # missed optimization: lea rdx, [rax+1]
cmovns rax, rdx # branchless selection of which saturation limit
ret
This is basically what @drwowe's inline asm does, but with a test replacing one cmov. (And of course different conditions on the cmov.)
Another downside to this vs. the _v2 with shr/add is that this needs 2 constants. In a loop, this would tie up an extra register. (Again unless b is a compile-time constant.)
clang uses cmov instead of a branch, and does spot the lea rax, [rcx + 1] trick to avoid a 2nd 10-byte mov r64, imm64 instruction. (Or clang6.0 and earlier use the shr 63/add trick instead of that cmov.)
The first version of this answer put int64_t sat = (b<0) ? MIN : MAX outside the if(), but gcc missed the optimization of moving that inside the branch so it's not run at all for the non-overflow case. That's even better than running it off the critical path. (And doesn't matter if the compiler decides to go branchless).
But when I put it outside the if and after the __builtin_saddll_overflow, gcc was really dumb and saved the bool result in an integer, then did the test/cmov, then used test on the saddll_overflow result again to put it back in FLAGS. Reordering the source fixed that.

I'm still looking for a decent portable solution, but this is as good as I've come up with so far:
Suggestions for improvements?
int64 saturated_add(int64 x, int64 y) {
#if defined(__GNUC__) && defined(__x86_64__)  // GCC predefines the lowercase __x86_64__
asm("add %1, %0\n\t"
"jno 1f\n\t"
"cmovge %3, %0\n\t"
"cmovl %2, %0\n"
"1:" : "+r"(x) : "r"(y), "r"(kint64min), "r"(kint64max));
return x;
#else
return portable_saturated_add(x, y);
#endif
}
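The fallback portable_saturated_add called above isn't shown in the answer. A minimal sketch of what it could look like in plain C, assuming int64 is a typedef for int64_t; the body is my assumption, not the author's code. It checks the bounds before adding, so the signed addition itself never overflows.

```c
#include <stdint.h>

/* Hypothetical fallback: only the name comes from the answer above,
 * the body is an assumption. */
int64_t portable_saturated_add(int64_t x, int64_t y) {
    /* positive y can only overflow upward, negative y only downward */
    if (y > 0 && x > INT64_MAX - y) return INT64_MAX;
    if (y < 0 && x < INT64_MIN - y) return INT64_MIN;
    return x + y; /* safe: the checks above ruled out overflow */
}
```

This trades the single add/jo of the asm version for two compares, which may or may not matter in your case.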

Read flag register from C program

For the sake of curiosity I'm trying to read the flag register and print it out in a nice way.
I've tried reading it using gcc's asm keyword, but I can't get it to work. Any hints on how to do it? I'm running an Intel Core 2 Duo and Mac OS X. The following code is what I have. I hoped it would tell me if an overflow happened:
#include <stdio.h>
int main (void){
int a=10, b=0, bold=0;
printf("%d\n",b);
while(1){
a++;
__asm__ ("pushf\n\t"
"movl 4(%%esp), %%eax\n\t"
"movl %%eax , %0\n\t"
:"=r"(b)
:
:"%eax"
);
if(b!=bold){
printf("register changed \n %d\t to\t %d",bold , b);
}
bold = b;
}
}
This gives a segmentation fault. When I run gdb on it I get this:
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x000000005fbfee5c
0x0000000100000eaf in main () at asm.c:9
9 asm ("pushf \n\t"

You can use the PUSHF/PUSHFD/PUSHFQ instruction (see http://siyobik.info/main/reference/instruction/PUSHF%2FPUSHFD for details) to push the flag register onto the stack. From there on you can interpret it in C. Otherwise you can test directly (against the carry flag for unsigned arithmetic or the overflow flag for signed arithmetic) and branch.
(to be specific, to test for the overflow bit you can use JO (jump if set) and JNO (jump if not set) to branch -- it's bit #11 (0-based) in the register)
About the EFLAGS bit layout: http://en.wikibooks.org/wiki/X86_Assembly/X86_Architecture#EFLAGS_Register
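Once the flags value is in a C variable, interpreting it is plain bit masking. A small sketch, assuming the standard EFLAGS positions (CF = bit 0, ZF = bit 6, SF = bit 7, OF = bit 11); the struct and function names are made up for illustration.

```c
#include <stdint.h>

/* Illustrative names; only the bit positions come from the EFLAGS layout. */
struct flag_bits { int cf, zf, sf, of; };

struct flag_bits decode_flags(uint32_t flags) {
    struct flag_bits f;
    f.cf = (flags >> 0) & 1;   /* carry: unsigned overflow */
    f.zf = (flags >> 6) & 1;   /* zero */
    f.sf = (flags >> 7) & 1;   /* sign */
    f.of = (flags >> 11) & 1;  /* overflow: signed overflow */
    return f;
}
```

For example, a value of 0xa96 decodes to OF = 1 and SF = 1 with CF and ZF clear, which is what you'd expect right after incrementing INT_MAX.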
A very crude Visual C syntax test (just wham-bam / some jumps to debug flow), since I don't know about the GCC syntax:
int test2 = 2147483647; // max 32-bit signed int (0x7fffffff)
unsigned int flags_w_overflow, flags_wo_overflow;
__asm
{
mov ebx, test2 // ebx = test value
// test for no overflow
xor eax, eax // eax = 0
add eax, ebx // add ebx
jno no_overflow // jump if no overflow
testoverflow:
// test for overflow
xor ecx, ecx // ecx = 0
inc ecx // ecx = 1
add ecx, ebx // overflow!
pushfd // store flags (32 bits)
jo overflow // jump if overflow
jmp done // jump if not overflown :(
no_overflow:
pushfd // store flags (32 bits)
pop edx // edx = flags w/o overflow
jmp testoverflow // back to next test
overflow:
jmp done // yeah we're done here :)
done:
pop eax // eax = flags w/overflow
mov flags_w_overflow, eax // store
mov flags_wo_overflow, edx // store
}
if (flags_w_overflow & (1 << 11)) __asm int 0x3 // overflow bit set correctly
if (flags_wo_overflow & (1 << 11)) __asm int 0x3 // overflow bit set incorrectly
return 0;

This may be a case of the XY problem. To check for overflow you don't need to read the hardware overflow flag as you think, because the flag can be calculated easily from the sign bits.
An illustrative example is what happens if we add 127 and 127 using 8-bit registers. 127+127 is 254, but using 8-bit arithmetic the result would be 1111 1110 binary, which is -2 in two's complement, and thus negative. A negative result out of positive operands (or vice versa) is an overflow. The overflow flag would then be set so the program can be aware of the problem and mitigate this or signal an error. The overflow flag is thus set when the most significant bit (here considered the sign bit) is changed by adding two numbers with the same sign (or subtracting two numbers with opposite signs). Overflow never occurs when the sign of two addition operands are different (or the sign of two subtraction operands are the same).
Internally, the overflow flag is usually generated by an exclusive or of the internal carry into and out of the sign bit. As the sign bit is the same as the most significant bit of a number considered unsigned, the overflow flag is "meaningless" and normally ignored when unsigned numbers are added or subtracted.
https://en.wikipedia.org/wiki/Overflow_flag
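The "carry into XOR carry out of the sign bit" rule from the quote can be tried out on the 8-bit case directly. This is only a sketch to illustrate the rule; the function name is hypothetical and it models what the hardware does rather than being something you'd use in real code.

```c
#include <stdint.h>

/* Models the overflow flag of an 8-bit add (illustrative helper). */
int overflow_flag_8bit(uint8_t a, uint8_t b) {
    /* carry from bit 6 into bit 7 (the sign bit) */
    unsigned carry_in  = (((a & 0x7Fu) + (b & 0x7Fu)) >> 7) & 1u;
    /* carry out of bit 7 */
    unsigned carry_out = (((unsigned)a + (unsigned)b) >> 8) & 1u;
    return (int)(carry_in ^ carry_out);
}
```

With 127 + 127 there is a carry into the sign bit but none out of it, so the function returns 1. With 0xFF + 0xFF (that is, -1 + -1) both carries happen and it returns 0.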
So the C implementation is
int add(int a, int b, int* overflowed)
{
// do the addition in unsigned arithmetic to avoid UB from signed overflow
unsigned int r = (unsigned int)a + (unsigned int)b;
// if a and b have the same sign and the result's sign differs from theirs,
// then the addition overflowed
*overflowed = !!((~(a ^ b) & (a ^ r)) & 0x80000000);
return (int)r;
}
This works portably on any architecture, unlike your solution, which only works on x86. Smart compilers may recognize the pattern and use the overflow flag where one exists. Most RISC architectures like MIPS or RISC-V have no flags register, so all signed/unsigned overflow must be checked in software by analyzing the sign bits like this.
Some compilers have intrinsics for checking overflow, like __builtin_add_overflow in Clang and GCC. With that intrinsic you can also easily see how the overflow is calculated without using flags. For example, the ARM (AArch64) output below does it like this:
add w3, w0, w1 # r = a + b
eon w0, w0, w1 # a = a ^ ~b
eor w1, w3, w1 # b = b ^ r
str w3, [x2] # store sum ([x2] = r)
and w0, w1, w0 # a = a & b = (a ^ ~b) & (b ^ r)
lsr w0, w0, 31 # overflowed = a >> 31
ret
which is just a variation of what I've written above
See also
Checking overflow in C
Detecting signed overflow in C/C++
Is it possible to access the overflow flag register in a CPU with C++?
Very detailed explanation of Overflow and Carry flags evaluation techniques
For unsigned int it's much easier:
unsigned int a, b, result = a + b;
int overflowed = (result < a);
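Wrapped into a function, the unsigned check could look like the sketch below. The function name and the bool out-parameter are my own choices, not part of the answer. Unsigned arithmetic wraps modulo 2^N by definition, so there is no UB here, and the sum is smaller than an operand exactly when it wrapped.

```c
#include <stdbool.h>

/* Illustrative wrapper around the (result < a) test. */
unsigned int add_unsigned(unsigned int a, unsigned int b, bool *carried) {
    unsigned int result = a + b;   /* well-defined wraparound */
    *carried = (result < a);       /* wrapped iff the sum went "backwards" */
    return result;
}
```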

The compiler can reorder instructions, so you cannot rely on your lahf being next to the increment. In fact, there may not be an increment at all. In your code, you don't use the value of a, so the compiler can completely optimize it out.
So, either write the increment + check in assembler, or write it in C.
Also, lahf loads only ah (8 bits) from eflags, and the Overflow flag is outside of that. Better use pushf; pop %eax.
Some tests:
#include <stdio.h>
int main (void){
int a=2147483640, b=0, bold=0;
printf("%d\n",b);
while(1){
a++;
__asm__ __volatile__ ("pushf \n\t"
"pop %%eax\n\t"
"movl %%eax, %0\n\t"
:"=r"(b)
:
:"%eax"
);
if((b & 0x800) != (bold & 0x800)){
printf("register changed \n %x\t to\t %x\n",bold , b);
}
bold = b;
}
}
$ gcc -Wall -o ex2 ex2.c
$ ./ex2 # Works by sheer luck
0
register changed
200206 to 200a96
register changed
200a96 to 200282
$ gcc -Wall -O -o ex2 ex2.c
$ ./ex2 # Doesn't work, the compiler hasn't even optimized yet!
0

You can't assume anything about how GCC implemented the a++ operation, or whether it even did the computation before your inline asm, or before a function call.
You could make a an (unused) input to your inline asm, but gcc could still have chosen to use lea to copy-and-add instead of inc or add, or constant-propagation after inlining could have turned it into a mov-immediate.
And of course gcc could have done some other computation that writes FLAGS right before your inline asm.
There is no way to make a++; asm(...) safe for this
Stop now, you're on the wrong track. If you insist on using asm, you need to do the add or inc inside the asm so you can read the flags output. If you only care about the overflow flag, use SETCC, specifically seto %0, to create an 8-bit output value. Or better, use GCC6 flag-output syntax to tell the compiler that a boolean output result is in the OF condition in FLAGS at the end of your inline asm.
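As a rough sketch of that flag-output approach: the add happens inside the asm statement, and the "=@cco" output tells the compiler that the result is the OF condition when the asm ends. The function name is hypothetical. I'm assuming the __GCC_ASM_FLAG_OUTPUTS__ macro to detect support, and falling back to the builtin everywhere else so the snippet still builds on other targets.

```c
#include <stdbool.h>

/* add_with_of is an illustrative name. Returns the wrapped sum and
 * reports signed overflow through *of. */
static inline int add_with_of(int a, int b, bool *of) {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GCC_ASM_FLAG_OUTPUTS__)
    int flag;
    __asm__("addl %[b], %[a]"            /* AT&T syntax: a += b, sets OF */
            : [a] "+r"(a), "=@cco"(flag) /* flag = OF after the add */
            : [b] "r"(b));
    *of = (flag != 0);
    return a;
#else
    /* fallback (an assumption, for non-x86 or older compilers) */
    int res;
    *of = __builtin_add_overflow(a, b, &res);
    return res;
#endif
}
```

Because the add and the flag read are in the same asm statement, the compiler can't schedule anything between them. In practice the builtin is still the better choice, since the compiler can optimize through it.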
Also, signed overflow in C is undefined behaviour, so actually causing overflow in a++ is already a bug. It usually won't manifest itself if you somehow detect it after the fact, but if you use a as an array index or something gcc may have widened it to 64-bit to avoid redoing sign-extension.
GCC has builtins for add with overflow detection, since gcc5
There are builtins for signed/unsigned add, sub, and mul, see the GCC manual, that avoid signed-overflow UB and tell you if there was overflow.
bool __builtin_add_overflow (type1 a, type2 b, type3 *res) is the generic version
bool __builtin_sadd_overflow (int a, int b, int *res) is the signed int version
bool __builtin_saddll_overflow (long long int a, long long int b, long long int *res) is the signed 64-bit long long version.
The compiler will attempt to use hardware instructions to implement these built-in functions where possible, like conditional jump on overflow after addition, conditional jump on carry etc.
There's a saddl version in case you want the operation for whatever size long is on the target platform. (For x86-64 gcc, int is always 32-bit, long long is always 64-bit, but long depends on Windows vs. non-Windows. For platforms like AVR, int would be 16-bit, and only long would be 32-bit.)
int checked_add_int(int a, int b, bool *of) {
int result;
*of = __builtin_sadd_overflow(a, b, &result);
return result;
}
compiles with gcc -O3 for x86-64 System V to this asm, on Godbolt
checked_add_int:
mov eax, edi
add eax, esi # can't use the normal lea eax, [rdi+rsi]
seto BYTE PTR [rdx]
and BYTE PTR [rdx], 1 # silly compiler, it's already 0/1
ret
ICC19 uses setcc into an integer register and then stores that, same difference as far as uops, but worse code-size.
After inlining to a caller that did if(of) {} it should just jo or jno instead of actually using setcc to create an integer 0/1; in general this should inline efficiently.
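A sketch of the kind of caller meant here, which branches on the builtin's result directly instead of storing a bool. The function is hypothetical, and whether a given compiler version really emits a bare jo for it is something to check on your own build rather than take from me.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical caller: sums an array, giving up at the first signed overflow. */
bool checked_sum(const int *v, size_t n, int *total) {
    int acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* operands are passed by value, so writing back into acc is fine */
        if (__builtin_sadd_overflow(acc, v[i], &acc))
            return false;
    }
    *total = acc;
    return true;
}
```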
Also, since gcc7, there's a builtin to ask if an addition (after promotion to a given type) would overflow, without returning the value.
#include <stdbool.h>
int overflows(int a, int b) {
bool of = __builtin_add_overflow_p(a, b, (int)0);
return of;
}
compiles with gcc -O3 for x86-64 System V to this asm, also on Godbolt
overflows:
xor eax, eax
add edi, esi
seto al
ret
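The type of the third argument is what decides the range being tested, so the same builtin can ask whether a sum would fit in a narrower type. A sketch with a made-up function name; the _p builtins are GCC-only as far as I know, so other compilers take a fallback path here that uses __builtin_add_overflow with a dummy result.

```c
#include <stdbool.h>

/* would_overflow_i8 is an illustrative name: does a + b fit in signed char? */
bool would_overflow_i8(int a, int b) {
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 7
    /* only the type of the third argument matters, not its value */
    return __builtin_add_overflow_p(a, b, (signed char)0);
#else
    signed char tmp;  /* assumed fallback for compilers without the _p builtins */
    return __builtin_add_overflow(a, b, &tmp);
#endif
}
```

The builtins compute the sum as if in infinite precision and then test it against the target type, so 100 + 28 reports overflow for signed char even though the int addition itself is fine.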
See also Detecting signed overflow in C/C++

Others have offered good alternate code and reasons why what you're trying to do probably doesn't give the result you want, but the actual bug in your code is that you corrupted the stack state by pushing without popping. I would rewrite the asm as:
pushf
pop %0
Or you could just add $4,%%esp at the end of your asm to fix the stack pointer if you prefer the inefficient way.
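Put into a complete function, a balanced push/pop could look like the sketch below. The function name is mine. One assumption worth flagging: on x86-64 System V the compiler may keep locals in the red zone below the stack pointer, so the 64-bit path moves the stack pointer past it first (lea doesn't modify FLAGS). On non-x86 targets the sketch just returns 0 so that it still compiles.

```c
#include <stdint.h>

/* read_flags is an illustrative name. Returns the FLAGS image,
 * or 0 on non-x86 targets (a placeholder, not a real value). */
static inline uintptr_t read_flags(void) {
#if defined(__x86_64__)
    uintptr_t flags;
    __asm__ __volatile__(
        "lea -128(%%rsp), %%rsp\n\t"  /* skip the red zone; lea leaves FLAGS alone */
        "pushfq\n\t"
        "pop %0\n\t"                  /* every push is matched by a pop */
        "lea 128(%%rsp), %%rsp"
        : "=r"(flags)
        :
        : "memory");
    return flags;
#elif defined(__i386__)
    uintptr_t flags;
    __asm__ __volatile__("pushfl\n\tpopl %0" : "=r"(flags) : : "memory");
    return flags;
#else
    return 0;
#endif
}
```

This only fixes the crash. As the other answers explain, the value you get still says nothing reliable about a preceding C statement.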

The following C program will read the FLAGS register when compiled with GCC on any x86 or x86_64 machine following a calling convention in which integers are returned in %eax. You may need to pass the -zexecstack argument to the compiler.
#include<stdio.h>
#include<stdlib.h>
int(*f)()=(void*)L"\xc3589c";
int main( int argc, char **argv ) {
if( argc < 3 ) {
printf( "Usage: %s <augend> <addend>\n", *argv );
return 0;
}
int a=atoi(argv[1])+atoi(argv[2]);
int b=f();
printf("%d CF %d PF %d AF %d ZF %d SF %d TF %d IF %d DF %d OF %d IOPL %d NT %d RF %d VM %d AC %d VIF %d VIP %d ID %d\n", a, b&1, b/4&1, b>>4&1, b>>6&1, b>>7&1, b>>8&1, b>>9&1, b>>10&1, b>>11&1, b>>12&3, b>>14&1, b>>16&1, b>>17&1, b>>18&1, b>>19&1, b>>20&1, b>>21&1 );
}
The funny looking string literal disassembles to
0x0000000000000000: 9C pushfq
0x0000000000000001: 58 pop rax
0x0000000000000002: C3 ret
