MSVC Inline ASM to GCC

I'm trying to handle both MSVC and GCC compilers while updating this code base to work on GCC, but I'm unsure exactly how GCC's inline ASM works. I'm not great at translating ASM to C, or else I would just use C instead of ASM.
SLONG Div16(signed long a, signed long b)
{
signed long v;
#ifdef __GNUC__ // GCC doesnt work.
__asm() {
#else // MSVC
__asm {
#endif
mov edx, a
mov ebx, b
mov eax, edx
shl eax, 16
sar edx, 16
idiv ebx
mov v, eax
}
return v;
}
signed long ROR13(signed long val)
{
_asm{
ror val, 13
}
}
I assume ROR13 works something like (val << 13) | (val >> (32 - 13)) but the code doesn't produce the same output.
What is the proper way to translate this inline ASM to GCC, and/or what's the C translation of this code?

GCC uses a completely different syntax for inline assembly than MSVC does, so it's quite a bit of work to maintain both forms. It's not an especially good idea, either. There are many problems with inline assembly. People often use it because they think it'll make their code run faster, but it usually has quite the opposite effect. Unless you're an expert in both assembly language and the compiler's code-generation strategies, you are far better off letting the compiler's optimizer generate the code.
When you write the rotate in C, you do have to be a bit careful, though: right-shifting a negative signed value is implementation-defined in C, so if you care about portability, you need to cast the value to an equivalent unsigned type:
#include <limits.h> // for CHAR_BIT
signed long ROR13(signed long val)
{
return ((unsigned long)val >> 13) |
((unsigned long)val << ((sizeof(val) * CHAR_BIT) - 13));
}
(See also Best practices for circular shift (rotate) operations in C++).
This will have the same semantics as your original code: ROR val, 13. In fact, MSVC will generate precisely that object code, as will GCC. (Clang, interestingly, will do ROL val, 19, which produces the same result, given the way that rotations work. ICC 17 generates an extended shift instead: SHLD val, val, 19. I'm not sure why; maybe that's faster than rotation on certain Intel processors, or maybe it's the same on Intel but slower on AMD.)
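As a minimal sketch of the difference between the two rotate directions, the functions below pin the width to 32 bits with uint32_t (an assumption: the version above rotates across the full width of long, which is 64 bits on LP64 platforms). The names ror13_u32 and rol13_u32 are made up for this illustration. The expression from the question, (val << 13) | (val >> (32 - 13)), is the rotate-left form, which would explain why its output did not match ROR.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by 13 bits, like "ror val, 13". */
static inline uint32_t ror13_u32(uint32_t x)
{
    return (x >> 13) | (x << (32 - 13));
}

/* Rotate a 32-bit value left by 13 bits. This is what the expression
   (val << 13) | (val >> (32 - 13)) from the question computes. */
static inline uint32_t rol13_u32(uint32_t x)
{
    return (x << 13) | (x >> (32 - 13));
}
```

Because both functions work on an unsigned type, neither shift is implementation-defined, and rotating right after rotating left returns the original value.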
To implement Div16 in pure C, you want:
signed long Div16(signed long a, signed long b)
{
return ((long long)a << 16) / b;
}
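To make the arithmetic concrete, here is a hedged sketch of the same idea with fixed-width types, reading the operands as 16.16 fixed-point numbers. The name div16_fixed is hypothetical, and the sketch assumes b is nonzero and the quotient fits in 32 bits. It multiplies by 65536 instead of shifting, because left-shifting a negative value is undefined in ISO C.

```c
#include <stdint.h>

/* 16.16 fixed-point divide: computes (a * 2^16) / b using a 64-bit
   intermediate. Assumes b != 0 and that the quotient fits in int32_t. */
static inline int32_t div16_fixed(int32_t a, int32_t b)
{
    /* Multiply rather than shift so that negative a stays well defined. */
    return (int32_t)(((int64_t)a * 65536) / b);
}
```

For example, dividing 3.0 (3 << 16) by 2.0 (2 << 16) yields 0x18000, which is 1.5 in 16.16 format.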
On a 64-bit architecture that can do native 64-bit division (assuming long is still a 32-bit type, as on Windows), this will be transformed into:
movsxd rax, a # sign-extend from 32 to 64, if long wasn't already 64-bit
shl rax, 16
cqo # sign-extend rax into rdx:rax
movsxd rcx, b
idiv rcx # or idiv b if the inputs were already 64-bit
ret
Unfortunately, on 32-bit x86, the code isn't nearly as good. Compilers emit a call into their internal library function that provides extended 64-bit division, because they can't prove that using a single 64b/32b => 32b idiv instruction won't fault. (It will raise a #DE exception if the quotient doesn't fit in eax, rather than just truncating)
In other words, transforming:
int32_t Divide(int64_t a, int32_t b)
{
return (a / b);
}
into:
mov eax, a_low
mov edx, a_high
idiv b # will fault if a/b is outside [-2^31, 2^31-1]
ret
is not a legal optimization: the compiler is unable to emit this code. The language standard says that a 64/32 division is promoted to a 64/64 division, which always produces a 64-bit result. That you later cast or coerce that 64-bit result to a 32-bit value is irrelevant to the semantics of the division operation itself. Faulting for some combinations of a and b would violate the as-if rule, unless the compiler can prove that those combinations of a and b are impossible. (For example, if b was known to be greater than 1<<16, this could be a legal optimization for a = (int32_t)input; a <<= 16; but even though this would produce the same behaviour as the C abstract machine for all inputs, gcc and clang currently don't do that optimization.)
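The condition the compiler would have to rule out can be written down in plain C. The following is only an illustrative sketch (the name idiv32_would_fault is invented here); it reports whether a single 64b/32b => 32b idiv would raise #DE for the given operands, either because the divisor is zero or because the quotient does not fit in 32 bits.

```c
#include <stdint.h>

/* Returns 1 if a single 32-bit idiv (64-bit dividend, 32-bit divisor)
   would fault for these operands, 0 if the quotient fits in int32_t. */
static int idiv32_would_fault(int64_t a, int32_t b)
{
    if (b == 0)
        return 1;                      /* divide by zero */
    if (a == INT64_MIN && b == -1)
        return 1;                      /* quotient 2^63 fits nowhere; also avoids UB below */
    int64_t q = a / b;                 /* the 64/64 division the C standard specifies */
    return q < INT32_MIN || q > INT32_MAX;
}
```

Unless the compiler can prove this function returns 0 for every reachable pair of operands, it has to fall back to the full 64-bit division.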
There simply isn't a good way to override the rules imposed by the language standard and force the compiler to emit the desired object code. MSVC doesn't offer an intrinsic for it (although there is a Windows API function, MulDiv, it's not fast, and just uses inline assembly for its own implementation—and with a bug in a certain case, now cemented thanks to the need for backwards compatibility). You essentially have no choice but to resort to assembly, either inline or linked in from an external module.
So, you get into ugliness. It looks like this:
signed long Div16(signed long a, signed long b)
{
#ifdef __GNUC__ // A GNU-style compiler (e.g., GCC, Clang, etc.)
signed long quotient;
signed long remainder; // (unused, but necessary to signal clobbering)
__asm__("idivl %[divisor]"
: "=a" (quotient),
"=d" (remainder)
: "0" ((unsigned long)a << 16),
"1" (a >> 16),
[divisor] "rm" (b)
:
);
return quotient;
#elif _MSC_VER // A Microsoft-style compiler (i.e., MSVC)
__asm
{
mov eax, DWORD PTR [a]
mov edx, eax
shl eax, 16
sar edx, 16
idiv DWORD PTR [b]
// leave result in EAX, where it will be returned
}
#else
#error "Unsupported compiler"
#endif
}
This results in the desired output on both Microsoft and GNU-style compilers.
Well, mostly. For some reason, when you use the rm constraint, which gives the compiler the freedom to choose whether to treat the divisor as a memory operand or load it into a register, Clang generates worse object code than if you just use r (which forces it to load it into a register). This doesn't affect GCC or ICC. If you care about the quality of output on Clang, you'll probably just want to use r, since this will give equally good object code on all compilers.
Live Demo on Godbolt Compiler Explorer
(Note: GCC uses the SAL mnemonic in its output, instead of the SHL mnemonic. These are identical instructions—the difference only matters for right shifts—and all sane assembly programmers use SHL. I have no idea why GCC emits SAL, but you can just convert it mentally into SHL.)

Related

How to get bits of specific xmm registers?

So I want to get the value or state of specific xmm registers. This is primarily for a crash log or just to see the state of the registers for debugging. I tried this, but it doesn't seem to work:
#include <x86intrin.h>
#include <stdio.h>
int main(void) {
register __m128i my_val __asm__("xmm0");
__asm__ ("" :"=r"(my_val));
printf("%llu %llu\n", my_val & 0xFFFFFFFFFFFFFFFF, my_val << 63);
return 0;
}
As far as I know, the store related intrinsics would not treat the __m128i as a POD data type but rather as a reference to one of the xmm registers.
How do I get and access the bits stored in the __m128i as 64 bit integers? Or does my __asm__ above work?

How do I get and access the bits stored in the __m128i as 64 bit integers?
You will have to convert the __m128i vector to a pair of uint64_t variables. You can do that with conversion intrinsics:
uint64_t lo = _mm_cvtsi128_si64(my_val);
uint64_t hi = _mm_cvtsi128_si64(_mm_unpackhi_epi64(my_val, my_val));
...or through memory:
uint64_t buf[2];
_mm_storeu_si128((__m128i*)buf, my_val);
uint64_t lo = buf[0];
uint64_t hi = buf[1];
The latter may be worse in terms of performance, but if you intend to use it only for debugging, it would do. It is also trivial to adapt to differently sized elements, if you need that.
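As a rough sketch of the "differently sized elements" adaptation, the same store can target an array of four 32-bit elements instead of two 64-bit ones. The helper name m128i_to_u32 is hypothetical, and the sketch assumes an x86 target with SSE2 available.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_storeu_si128 */
#include <stdint.h>

/* Copy the 128-bit vector into four 32-bit elements.
   out[0] receives the lowest 32 bits, out[3] the highest. */
static inline void m128i_to_u32(__m128i v, uint32_t out[4])
{
    _mm_storeu_si128((__m128i *)out, v);  /* unaligned store, no alignment requirement */
}
```

The same pattern works for uint16_t[8] or uint8_t[16]; only the destination array type changes.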
Or does my __asm__ above work?
No, it doesn't. The "=r" output constraint only allows general-purpose registers; it does not allow vector registers such as xmm0, which you pass as an output. No general-purpose register is 128 bits wide, so that asm statement makes no sense.
Also, I should note that my_val << 63 shifts the value in the wrong way. If you wanted to output the high half of the hypothetical 128-bit value then you should've shifted right, not left. And besides that, shifts on vectors are either not implemented or act on each element of the vector rather than the vector as a whole, depending on the compiler. But this part is moot, as with the code above you don't need any shifts to output the two halves.

If you really want to know about register values, rather than __m128i C variable values, I'd suggest using a debugger like GDB. print /x $xmm0.v2_int64 when stopped at a breakpoint.
Capturing a register at the top of a function is a pretty flaky and unreliable thing to attempt (it smells like you've already gone down the wrong design path; see footnote 1). But you're on the right track with a register-asm local var. However, xmm0 can't match an "=r" constraint, only "=x". See Reading a register value into a C variable for more about using an empty asm template to tell the compiler you want a C variable to be what was in a register.
You do need the asm volatile("" : "=x"(var)); statement, though; GNU C register-asm local vars have no guarantees whatsoever except when used as operands to asm statements. (GCC will often keep your var in that register anyway, but IIRC clang won't.)
There's not a lot of guarantee about where this will be ordered wrt. other code (asm volatile may help some, or for stronger ordering also use a "memory" clobber). Also no guarantee that GCC won't use the register for something else first. (Especially a call-clobbered register like any xmm reg.) But it does at least happen to work in the version I tested.
print a __m128i variable shows how to print a __m128i as two 64-bit halves once you have it, or as other element sizes. The compiler will often optimize _mm_store_si128 / reload into shuffles, and this is for printing anyway so keep it simple.
Using an unsigned __int128 tmp; would also be an option in GNU C on x86-64.
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#ifndef __cplusplus
#include <stdalign.h>
#endif
// If you need this, you're probably doing something wrong.
// There's no guarantee about what a compiler will have in XMM0 at any point
void foo() {
register __m128i xmm0 __asm__("xmm0");
__asm__ volatile ("" :"=x"(xmm0));
alignas(16) uint64_t buf[2];
_mm_store_si128((__m128i*)buf, xmm0);
printf("%llu %llu\n", (unsigned long long)buf[1], (unsigned long long)buf[0]); // I'd normally use hex, like %#llx
}
This prints the high half first (most significant), so reading left to right across both elements we get each byte in descending order of memory address within buf.
It compiles to the asm we want with both GCC and clang (Godbolt), not stepping on xmm0 before reading it.
# GCC10.2 -O3
foo:
movhlps xmm1, xmm0
movq rdx, xmm0 # low half -> RDX
mov edi, OFFSET FLAT:.LC0
xor eax, eax
movq rsi, xmm1 # high half -> RSI
jmp printf
Footnote 1:
If you make sure your function doesn't inline, you could take advantage of the calling convention to get the incoming values of xmm0..7 (for x86-64 System V), or xmm0..3 if you have no integer args (Windows x64).
__attribute__((noinline))
void foo(__m128i xmm0, __m128i xmm1, __m128i xmm2, etc.) {
// do whatever you want with the xmm0..7 args
}
If you want to provide a different prototype for the function for callers to use (which omits the __m128i args), that can maybe work. It's of course Undefined Behaviour in ISO C, but if you truly stop inlining, the effects depend on the calling convention, as long as you make sure it's noinline so link-time optimization doesn't do cross-file inlining.
Of course, the mere fact of inserting a function call will change register allocation in the caller, so this only helps for a function you were going to call anyway.

Why is memcmp(a, b, 4) only sometimes optimized to a uint32 comparison?

Given this code:
#include <string.h>
int equal4(const char* a, const char* b)
{
return memcmp(a, b, 4) == 0;
}
int less4(const char* a, const char* b)
{
return memcmp(a, b, 4) < 0;
}
GCC 7 on x86_64 introduced an optimization for the first case (Clang has done it for a long time):
mov eax, DWORD PTR [rsi]
cmp DWORD PTR [rdi], eax
sete al
movzx eax, al
But the second case still calls memcmp():
sub rsp, 8
mov edx, 4
call memcmp
add rsp, 8
shr eax, 31
Could a similar optimization be applied to the second case? What's the best assembly for this, and is there any clear reason why it isn't being done (by GCC or Clang)?
See it on Godbolt's Compiler Explorer: https://godbolt.org/g/jv8fcf

If you generate code for a little-endian platform, optimizing a four-byte memcmp used for ordering (< or >) into a single DWORD comparison is invalid.
When memcmp compares individual bytes it goes from low-addressed bytes to high-addressed bytes, regardless of the platform.
In order for memcmp to return zero all four bytes must be identical. Hence, the order of comparison does not matter. Therefore, DWORD optimization is valid, because you ignore the sign of the result.
However, when memcmp returns a nonzero result, byte ordering matters: the sign is determined by the first differing byte, starting from the lowest address. Hence, implementing the same comparison using a 32-bit DWORD comparison requires a specific endianness: the platform must be big-endian, otherwise the result of the comparison would be incorrect.

Endianness is the problem here. Consider this input:
a = 01 00 00 03
b = 02 00 00 02
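If you compare these two arrays by treating them as 32-bit integers on a little-endian machine, then you'll find that a is larger (because 0x03000001 > 0x02000002), even though memcmp reports a as smaller because its first byte, 01, is less than 02. On a big-endian machine, the unsigned integer comparison gives the same answer as memcmp.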
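The example can be checked without depending on the host's byte order by assembling the integers explicitly. This is only a sketch; read_le32, read_be32, example_a and example_b are names invented for the illustration.

```c
#include <stdint.h>
#include <string.h>

/* Interpret 4 bytes as a little-endian integer: p[0] is least significant. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Interpret 4 bytes as a big-endian integer: p[0] is most significant. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* The two arrays from the example above. */
static const unsigned char example_a[4] = {0x01, 0x00, 0x00, 0x03};
static const unsigned char example_b[4] = {0x02, 0x00, 0x00, 0x02};
```

With these inputs, memcmp(example_a, example_b, 4) is negative and read_be32 agrees with it, while read_le32 orders the two values the other way round.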

As discussed in other answers/comments, using memcmp(a,b,4) < 0 is equivalent to an unsigned comparison between big-endian integers. It couldn't inline as efficiently as == 0 on little-endian x86.
More importantly, the current version of this behaviour in gcc7/8 only looks for memcmp() == 0 or != 0. Even on a big-endian target where this could inline just as efficiently for < or >, gcc won't do it. (Godbolt's newest big-endian compilers are PowerPC 64 gcc6.3, and MIPS/MIPS64 gcc5.4. mips is big-endian MIPS, while mipsel is little-endian MIPS.) If testing this with future gcc, use a = __builtin_assume_aligned(a, 4) to make sure gcc doesn't have to worry about unaligned-load performance/correctness on non-x86. (Or just use const int32_t* instead of const char*.)
If/when gcc learns to inline memcmp for cases other than EQ/NE, maybe gcc will do it on little-endian x86 when its heuristics tell it the extra code size will be worth it. e.g. in a hot loop when compiling with -fprofile-use (profile-guided optimization).
If you want compilers to do a good job for this case, you should probably assign to a uint32_t and use an endian-conversion function like ntohl. But make sure you pick one that can actually inline; apparently Windows has an ntohl that compiles to a DLL call. See other answers on that question for some portable-endian stuff, and also someone's imperfect attempt at a portable_endian.h, and this fork of it. I was working on a version for a while, but never finished/tested it or posted it.
Pointer-casting to const uint32_t* would be Undefined Behaviour, if the bytes were written as anything but aligned uint32_t or through char*. If you're not sure about strict-aliasing and/or alignment, memcpy into abytes or use GNU C attributes: see another Q&A about alignment and strict-aliasing for workarounds. Most compilers are good at optimizing away small fixed-size memcpy.
// I know the question just wonders why gcc does what it does,
// not asking for how to write it differently.
// Beware of alignment performance or even fault issues outside of x86.
#include <endian.h>
#include <stdint.h>
static inline
uint32_t load32_native_endian(const void *vp){
typedef uint32_t unaligned_aliasing_u32 __attribute__((aligned(1),may_alias));
const unaligned_aliasing_u32 *up = vp;
return *up; // #ifndef __GNUC__ then use memcpy
}
int equal4_optim(const char* a, const char* b) {
uint32_t abytes = load32_native_endian(a);
uint32_t bbytes = load32_native_endian(b);
return abytes == bbytes;
}
int less4_optim(const char* a, const char* b) {
uint32_t a_native = be32toh(load32_native_endian(a));
uint32_t b_native = be32toh(load32_native_endian(b));
return a_native < b_native;
}
I checked on Godbolt, and that compiles to efficient code (basically identical to what I wrote in asm below), especially on big-endian platforms, even with old gcc. It also makes much better code than ICC17, which inlines memcmp but only to a byte-compare loop (even for the == 0 case).
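For compilers without the GNU attributes, the comment in load32_native_endian suggests falling back to memcpy. A minimal sketch of that fallback for the equality case might look like this; the names load32_memcpy and equal4_memcpy are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Load 4 bytes in native byte order. memcpy has no alignment or
   strict-aliasing requirements, and compilers typically turn a small
   fixed-size copy like this into a single load. */
static inline uint32_t load32_memcpy(const void *vp)
{
    uint32_t v;
    memcpy(&v, vp, sizeof v);
    return v;
}

/* Equality does not depend on byte order, so a native-endian load suffices. */
int equal4_memcpy(const char *a, const char *b)
{
    return load32_memcpy(a) == load32_memcpy(b);
}
```

The ordering case still needs a big-endian conversion on top of the load, as in less4_optim above.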
I think this hand-crafted sequence is an optimal implementation of less4() (for the x86-64 SystemV calling convention, like used in the question, with const char *a in rdi and b in rsi).
less4:
mov edi, [rdi]
mov esi, [rsi]
bswap edi
bswap esi
# data loaded and byte-swapped to native unsigned integers
xor eax,eax # solves the same problem as gcc's movzx, see below
cmp edi, esi
setb al # eax=1 if *a was Below(unsigned) *b, else 0
ret
Those are all single-uop instructions on Intel and AMD CPUs since K8 and Core2 (http://agner.org/optimize/).
Having to bswap both operands has an extra code-size cost vs. the == 0 case: we can't fold one of the loads into a memory operand for cmp. (That saves code size, and uops thanks to micro-fusion.) This is on top of the two extra bswap instructions.
On CPUs that support movbe, it can save code size: movbe ecx, [rsi] is a load + bswap. On Haswell, it's 2 uops, so presumably it decodes to the same uops as mov ecx, [rsi] / bswap ecx. On Atom/Silvermont, it's handled right in the load ports, so it's fewer uops as well as smaller code-size.
See the setcc part of my xor-zeroing answer for more about why xor/cmp/setcc (which clang uses) is better than cmp/setcc/movzx (typical for gcc).
In the usual case where this inlines into code that branches on the result, the setcc + zero-extend are replaced with a jcc; the compiler optimizes away creating a boolean return value in a register. This is yet another advantage of inlining: the library memcmp does have to create an integer boolean return value which the caller tests, because no x86 ABI/calling convention allows for returning boolean conditions in flags. (I don't know of any non-x86 calling conventions that do that either). For most library memcmp implementations, there's also significant overhead in choosing a strategy depending on length, and maybe alignment checking. That can be pretty cheap, but for size 4 it's going to be more than the cost of all the real work.

Endianness is one problem, but signed char is another. For example, consider that the four bytes you compare are 0x207f2020 and 0x20802020. The 80 as signed char is -128, the 7f as signed char is +127. But if you compare the four bytes, no comparison will give you the right order.
Of course you can do an xor with 0x80808080 and then you can just use an unsigned compare.
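The XOR trick can be sketched as follows. Note that memcmp itself compares bytes as unsigned char, so this variant only applies if signed-char ordering is what you actually want. The function names are invented for this illustration, and the bytes are assembled in big-endian order explicitly so the sketch does not depend on the host.

```c
#include <stdint.h>

/* Assemble 4 bytes as a big-endian integer: p[0] is most significant. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 1 if the 4 bytes at a sort before the 4 bytes at b when each
   byte is ordered as a signed char (-128..127). Flipping the top bit of
   every byte maps that signed order onto unsigned order. */
int less4_signed_bytes(const void *a, const void *b)
{
    uint32_t x = read_be32(a) ^ UINT32_C(0x80808080);
    uint32_t y = read_be32(b) ^ UINT32_C(0x80808080);
    return x < y;
}
```

With the bytes from the example, 20 80 20 20 sorts before 20 7f 20 20 under this ordering, because 0x80 is -128 as a signed char.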

Signed saturated add of 64-bit ints?

I'm looking for some C code for signed saturated 64-bit addition that compiles to efficient x86-64 code with the gcc optimizer. Portable code would be ideal, although an asm solution could be used if necessary.
static const int64 kint64max = 0x7fffffffffffffffll;
static const int64 kint64min = 0x8000000000000000ll;
int64 signed_saturated_add(int64 x, int64 y) {
bool x_is_negative = (x & kint64min) != 0;
bool y_is_negative = (y & kint64min) != 0;
int64 sum = x+y;
bool sum_is_negative = (sum & kint64min) != 0;
if (x_is_negative != y_is_negative) return sum; // can't overflow
if (x_is_negative && !sum_is_negative) return kint64min;
if (!x_is_negative && sum_is_negative) return kint64max;
return sum;
}
The function as written produces a fairly lengthy assembly output with several branches. Any tips on optimization? Seems like it ought to be implementable with just an ADD and a few CMOV instructions, but I'm a little bit rusty with this stuff.

This may be optimized further, but here is a portable solution. It does not invoke undefined behavior, and it checks for integer overflow before it can occur.
#include <stdint.h>
int64_t sadd64(int64_t a, int64_t b)
{
if (a > 0) {
if (b > INT64_MAX - a) {
return INT64_MAX;
}
} else if (b < INT64_MIN - a) {
return INT64_MIN;
}
return a + b;
}

This is a solution that continues in the vein that had been given in one of the comments, and has been used in ouah's solution, too. Here the generated code should be without conditional jumps:
int64_t signed_saturated_add(int64_t x, int64_t y) {
// determine the lower or upper bound of the result
int64_t ret = (x < 0) ? INT64_MIN : INT64_MAX;
// this is always well defined:
// if x < 0 this adds a positive value to INT64_MIN
// if x > 0 this subtracts a positive value from INT64_MAX
int64_t comp = ret - x;
// the condition is equivalent to
// ((x < 0) && (y > comp)) || ((x >=0) && (y <= comp))
if ((x < 0) == (y > comp)) ret = x + y;
return ret;
}
The first line looks as if there would be a conditional move to do, but because of the special values my compiler gets away with an addition: in 2's complement INT64_MIN is INT64_MAX+1.
There is then only one conditional move for the assignment of the sum, in case everything is fine.
All of this has no UB, because in the abstract state machine the sum is only done if there is no overflow.

Related: unsigned saturation is much easier, and efficiently possible in pure ISO C: How to do unsigned saturating addition in C?
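For comparison, a minimal sketch of the unsigned case in pure ISO C is shown below (the function name is made up here). Unsigned addition wraps modulo 2^64 by definition, so the overflow check can simply compare the sum against one of the inputs.

```c
#include <stdint.h>

/* Unsigned saturating add: clamps to UINT64_MAX instead of wrapping. */
uint64_t unsigned_sat_add64(uint64_t a, uint64_t b)
{
    uint64_t sum = a + b;                  /* wraps modulo 2^64, well defined */
    return (sum < a) ? UINT64_MAX : sum;   /* a wrapped sum is smaller than either input */
}
```

No such simple carry-style test exists for the signed case, which is why the rest of this answer leans on the overflow builtins.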
Compilers are terrible at all of the pure C options proposed so far.
They don't see that they can use the signed-overflow flag result from an add instruction to detect that saturation to INT64_MIN/MAX is necessary. AFAIK there's no pure C pattern that compilers recognize as reading the OF flag result of an add.
Inline asm is not a bad idea here, but we can avoid that with GCC's builtins that expose UB-safe wrapping signed addition with a boolean overflow result. https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html
(If you were going to use GNU C inline asm, that would limit you just as much as these GNU C builtins. And these builtins aren't arch-specific. They do require gcc5 or newer, but gcc4.9 and older are basically obsolete. https://gcc.gnu.org/wiki/DontUseInlineAsm - it defeats constant propagation and is hard to maintain.)
This version uses the fact that INT64_MIN = INT64_MAX + 1ULL (for 2's complement) to select INT64_MIN/MAX based on the sign of b. Signed-overflow UB is avoided by using uint64_t for that add, and GNU C defines the behaviour of converting an unsigned integer to a signed type that can't represent its value (bit-pattern used unchanged). Current gcc/clang benefit from this hand-holding because they don't figure out this trick from a ternary like (b<0) ? INT64_MIN : INT64_MAX. (See below for the alternate version using that). I haven't checked the asm on 32-bit architectures.
GCC only supports 2's complement integer types, so a function using __builtin_add_overflow doesn't have to care about portability to C implementations that use 1's complement (where the same identity holds) or sign/magnitude (where it doesn't), even if you made a version for long or int instead of int64_t.
#include <stdint.h>
#ifndef __cplusplus
#include <stdbool.h>
#endif
// static inline
int64_t signed_sat_add64_gnuc_v2(int64_t a, int64_t b) {
long long res;
bool overflow = __builtin_saddll_overflow(a, b, &res);
if (overflow) {
// overflow is only possible in one direction depending on the sign bit
return ((uint64_t)b >> 63) + INT64_MAX;
// INT64_MIN = INT64_MAX + 1 wraparound, done with unsigned
}
return res;
}
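The type-generic __builtin_add_overflow mentioned above makes it straightforward to write the same function for another width. The following int32_t variant is only a sketch under the same GNU C assumptions (gcc5 or newer, or clang; 2's complement; unsigned-to-signed conversion keeps the bit pattern), and its name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* 32-bit signed saturating add using the type-generic GNU C builtin. */
int32_t signed_sat_add32_gnuc(int32_t a, int32_t b)
{
    int32_t res;
    if (__builtin_add_overflow(a, b, &res)) {
        /* Overflow direction follows the sign of b:
           INT32_MAX + 0 for b >= 0, INT32_MAX + 1 (wrapping to INT32_MIN) for b < 0. */
        return (int32_t)(((uint32_t)b >> 31) + (uint32_t)INT32_MAX);
    }
    return res;
}
```

The builtin infers the operation width from the type of the result pointer, so only the shift count and the MAX constant change between widths.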
Another option is (b>>63) ^ INT64_MAX which might be useful if manually vectorizing where SIMD XOR can run on more ports than SIMD ADD, like on Intel before Skylake. (But x86 doesn't have SIMD 64-bit arithmetic right shift, only logical, so this would only help for an int32_t version, and you'd need to efficiently detect overflow in the first place. Or you might use a variable blend on the sign bit, like blendvpd) See Add saturate 32-bit signed ints intrinsics? with x86 SIMD intrinsics (SSE2/SSE4)
On Godbolt with gcc9 and clang8 (along with the other implementations from other answers):
# gcc9.1 -O3 (clang chooses branchless with cmov)
signed_sat_add64_gnuc_v2:
add rdi, rsi # the actual add
jo .L3 # jump on signed overflow
mov rax, rdi # retval = the non-overflowing add result
ret
.L3:
movabs rax, 9223372036854775807 # INT64_MAX
shr rsi, 63 # b is still available after the ADD
add rax, rsi
ret
When inlining into a loop, the mov imm64 can be hoisted. If b is needed afterwards then we might need an extra mov, otherwise shr/add can destroy b, leaving the INT64_MAX constant in a register undamaged. Or if the compiler wants to use cmov (like clang does), it has to mov/shr because it has to get the saturation constant ready before the add, preserving both operands.
Notice that the critical path for the non-overflowing case only includes an add and a not-taken jo. These can't macro-fuse into a single uop even on Sandybridge-family, but the jo only costs throughput not latency thanks to branch prediction + speculative execution. (When inlining, the mov will go away.)
If saturation is actually not rare and branch prediction is a problem, compile with profile-guided optimization and gcc will hopefully do if-conversion and use a cmovno instead of jo, like clang does. This puts the MIN/MAX selection on the critical path, as well as the CMOV itself. The MIN/MAX selection can run in parallel with the add.
You could use a<0 instead. I used b because I think most people would write x = sadd(x, 123) instead of x = sadd(123, x), and having a compile-time-constant allows the b<0 to optimize away. For maximal optimization opportunity, you could use if (__builtin_constant_p(a)) to ask the compiler if a was a compile-time constant. That works for gcc, but clang evaluates the const-ness way too early, before inlining, so it's useless except in macros with clang. (Related: ICC19 doesn't do constant propagation through __builtin_saddll_overflow: it puts both inputs in registers and still does the add. GCC and Clang just return a constant.)
This optimization is especially valuable inside a loop with the MIN/MAX selection hoisted, leaving only add + cmovo. (Or add + jo to a mov.)
cmov is a 2 uop instruction on Intel P6-family and SnB-family before Broadwell because it has 3 inputs. On other x86 CPUs (Broadwell / Skylake, and AMD), it's a single-uop instruction. On most such CPUs it has 1 cycle latency. It's a simple ALU select operation; the only complication is 3 inputs (2 regs + FLAGS). But on KNL it's still 2-cycle latency.
Unfortunately gcc for AArch64 fails to use adds to set flags and check the V (overflow) flag result, so it spends several instructions deciding whether to branch.
Clang does a great job, and AArch64's immediate encodings can represent INT64_MAX as an operand to eor or add.
// clang8.0 -O3 -target aarch64
signed_sat_add64_gnuc:
orr x9, xzr, #0x7fffffffffffffff // mov constant = OR with zero reg
adds x8, x0, x1 // add and set flags
add x9, x9, x1, lsr #63 // sat = (b shr 63) + MAX
csel x0, x9, x8, vs // conditional-select, condition = VS = oVerflow flag Set
ret
Optimizing MIN vs. MAX selection
As noted above, return (b<0) ? INT64_MIN : INT64_MAX; doesn't compile optimally with most versions of gcc/clang; they generate both constant in registers and cmov to select, or something similar on other ISAs.
We can assume 2's complement because GCC only supports 2's complement integer types, and because the ISO C optional int64_t type is guaranteed to be 2's complement if it exists. (Signed overflow of int64_t is still UB; this allows it to be a simple typedef of long or long long.)
(On a sign/magnitude C implementation that supported some equivalent of __builtin_add_overflow, a version of this function for long long or int couldn't use the SHR / ADD trick. For extreme portability you'd probably just use the simple ternary, or for sign/magnitude specifically you could return (b&0x800...) | 0x7FFF... to OR the sign bit of b into a max-magnitude number.)
For two's complement, the bit-patterns for MIN and MAX are 0x8000... (just the high bit set) and 0x7FFF... (all other bits set). They have a couple interesting properties: MIN = MAX + 1 (if computed with unsigned on the bit-pattern), and MIN = ~MAX: their bit-patterns are bitwise inverses, aka one's complement of each other.
The MIN = ~MAX property follows from ~x = -x - 1 (a re-arrangement of the standard -x = ~x + 1 2's complement identity) and the fact that MIN = -MAX - 1. The +1 property is unrelated, and follows from simple rollover from most-positive to most-negative and applies to the one's complement encoding of signed integers as well. (But not sign/magnitude; there you'd get -0, where the magnitude is 0.)
The above function uses the MIN = MAX + 1 trick. The MIN = ~MAX trick is also usable by broadcasting the sign bit to all positions with an arithmetic right shift (creating 0 or 0xFF...), and XORing with that.
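Both identities can be written on the unsigned bit pattern, which sidesteps any signed overflow or signed shift. This is only an illustrative sketch; the helper names are invented, and the functions return the raw 64-bit pattern of the saturation limit rather than an int64_t.

```c
#include <stdint.h>

/* MIN = MAX + 1: add the sign bit of b (0 or 1) to the bit pattern of INT64_MAX. */
static inline uint64_t sat_limit_bits_add(int64_t b)
{
    return ((uint64_t)b >> 63) + (uint64_t)INT64_MAX;
}

/* MIN = ~MAX: build an all-ones or all-zero mask from the sign of b,
   then XOR it with the bit pattern of INT64_MAX. */
static inline uint64_t sat_limit_bits_xor(int64_t b)
{
    uint64_t mask = (b < 0) ? UINT64_MAX : 0;
    return mask ^ (uint64_t)INT64_MAX;
}
```

For every b the two helpers agree: 0x8000000000000000 for negative b and 0x7FFFFFFFFFFFFFFF otherwise.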
GNU C guarantees that signed right shifts are arithmetic (sign-extension), so (b>>63) ^ INT64_MAX is equivalent to (b<0) ? INT64_MIN : INT64_MAX in GNU C.
ISO C leaves signed right shifts implementation-defined, but we could use a ternary of b<0 ? ~0ULL : 0ULL. Compilers will optimize the following to sar / xor, or equivalent instruction(s), but it doesn't rely on an implementation-defined signed right shift. AArch64 can use a shifted input operand for eor just as well as it can for add.
// an earlier version of this answer used this
int64_t mask = (b<0) ? ~0ULL : 0; // compiles to sar with good compilers, without relying on an implementation-defined signed right shift.
return mask ^ INT64_MAX;
Fun fact: AArch64 has a csinv instruction: conditional-select inverse. And it can put INT64_MIN into a register with a single 32-bit mov instruction, thanks to its powerful immediate encodings for simple bit-patterns. AArch64 GCC was already using csinv for the MIN = ~MAX trick for the original return (b<0) ? INT64_MIN : INT64_MAX; version.
clang 6.0 and earlier on Godbolt were using shr/add for the plain (b<0) ? INT64_MIN : INT64_MAX; version. It looks more efficient than what clang7/8 do, so that's a regression / missed-optimization bug I think. (And it's the whole point of this section and why I wrote a 2nd version.)
I chose the MIN = MAX + 1 version because it could possibly auto-vectorize better: x86 has 64-bit SIMD logical right shifts but only 16 and 32-bit SIMD arithmetic right shifts until AVX512F. Of course, signed-overflow detection with SIMD probably makes it not worth it until AVX512 for 64-bit integers. Well, maybe AVX2. And if it's part of some larger calculation that can otherwise vectorize efficiently, then unpacking to scalar and back sucks.
For scalar it's truly a wash; neither way compiles any better, and sar/shr perform identically, and so do add/xor, on all CPUs that Agner Fog has tested. (https://agner.org/optimize/).
But + can sometimes optimize into other things, so you could imagine gcc folding a later + or - of a constant into the overflow branch. Or possibly using LEA for that add instead of ADD to copy-and-add. The difference in power from a simpler ALU execution unit for XOR vs. ADD is going to be lost in the noise from the cost of all the power it takes to do out-of-order execution and other stuff; all x86 CPUs have single-cycle scalar ADD and XOR, even for 64-bit integers, even on P4 Prescott/Nocona with its exotic adders.
Also @chqrlie suggested a compact readable way to write it in C without UB that looks nicer than the super-portable int mask = ternary thing.
The earlier "simpler" version of this function
It doesn't depend on any special property of MIN/MAX, so it may be useful for saturating to other boundaries with other overflow-detection conditions, or in case a compiler does something better with this version.
int64_t signed_sat_add64_gnuc(int64_t a, int64_t b) {
long long res;
bool overflow = __builtin_saddll_overflow(a, b, &res);
if (overflow) {
// overflow is only possible in one direction for a given `b`
return (b<0) ? INT64_MIN : INT64_MAX;
}
return res;
}
which compiles as follows
# gcc9.1 -O3 (clang chooses branchless)
signed_sat_add64_gnuc:
add rdi, rsi # the actual add
jo .L3 # jump on signed overflow
mov rax, rdi # retval = the non-overflowing add result
ret
.L3:
movabs rdx, 9223372036854775807
test rsi, rsi # one of the addends is still available after the add
movabs rax, -9223372036854775808 # missed optimization: lea rax, [rdx+1]
cmovns rax, rdx # branchless selection of which saturation limit
ret
This is basically what @drwowe's inline asm does, but with a test replacing one cmov. (And of course different conditions on the cmov.)
Another downside to this vs. the _v2 with shr/add is that this needs 2 constants. In a loop, this would tie up an extra register. (Again unless b is a compile-time constant.)
clang uses cmov instead of a branch, and does spot the lea rax, [rcx + 1] trick to avoid a 2nd 10-byte mov r64, imm64 instruction. (Or clang6.0 and earlier use the shr 63/add trick instead of that cmov.)
The first version of this answer put int64_t sat = (b<0) ? MIN : MAX outside the if(), but gcc missed the optimization of moving that inside the branch so it's not run at all for the non-overflow case. That's even better than running it off the critical path. (And doesn't matter if the compiler decides to go branchless).
But when I put it outside the if and after the __builtin_saddll_overflow, gcc was really dumb and saved the bool result in an integer, then did the test/cmov, then used test on the saddll_overflow result again to put it back in FLAGS. Reordering the source fixed that.

I'm still looking for a decent portable solution, but this is as good as I've come up with so far:
Suggestions for improvements?
int64 saturated_add(int64 x, int64 y) {
#if defined(__GNUC__) && defined(__x86_64__)
asm("add %1, %0\n\t"
"jno 1f\n\t"
"cmovge %3, %0\n\t"
"cmovl %2, %0\n"
"1:" : "+r"(x) : "r"(y), "r"(kint64min), "r"(kint64max));
return x;
#else
return portable_saturated_add(x, y);
#endif
}
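The portable_saturated_add fallback is not shown in the question. A minimal sketch of one way it could look, assuming int64 is a typedef for int64_t (the question does not define int64, kint64min or kint64max), is to compare against the limits before adding, so the final add can never overflow:

```c
#include <stdint.h>

/* Hypothetical fallback: the name comes from the question, the body is an
   assumption. The range checks run first, so x + y is only evaluated when
   the result is known to fit. */
int64_t portable_saturated_add(int64_t x, int64_t y)
{
    if (y > 0 && x > INT64_MAX - y)
        return INT64_MAX;   /* would overflow upward */
    if (y < 0 && x < INT64_MIN - y)
        return INT64_MIN;   /* would overflow downward */
    return x + y;           /* in range, no UB */
}
```

This costs two compares on the common path, so it is likely slower than the asm or builtin versions, but it relies on nothing beyond standard C.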

Read flag register from C program

For the sake of curiosity I'm trying to read the flag register and print it out in a nice way.
I've tried reading it using gcc's asm keyword, but I can't get it to work. Any hints on how to do it? I'm running an Intel Core 2 Duo and Mac OS X. The following code is what I have. I hoped it would tell me if an overflow happened:
#include <stdio.h>
int main (void){
int a=10, b=0, bold=0;
printf("%d\n",b);
while(1){
a++;
__asm__ ("pushf\n\t"
"movl 4(%%esp), %%eax\n\t"
"movl %%eax , %0\n\t"
:"=r"(b)
:
:"%eax"
);
if(b!=bold){
printf("register changed \n %d\t to\t %d",bold , b);
}
bold = b;
}
}
This gives a segmentation fault. When I run gdb on it I get this:
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x000000005fbfee5c
0x0000000100000eaf in main () at asm.c:9
9 asm ("pushf \n\t"

You can use the PUSHF/PUSHFD/PUSHFQ instruction (see http://siyobik.info/main/reference/instruction/PUSHF%2FPUSHFD for details) to push the flag register onto the stack. From there on you can interpret it in C. Otherwise you can test directly (against the carry flag for unsigned arithmetic or the overflow flag for signed arithmetic) and branch.
(To be specific, to test the overflow bit you can use JO (jump if set) and JNO (jump if not set) to branch; it's bit 11 (0-based) in the register.)
About the EFLAGS bit layout: http://en.wikibooks.org/wiki/X86_Assembly/X86_Architecture#EFLAGS_Register
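Once a flags value has been saved into a C variable, interpreting it is plain bit masking. A small sketch, with bit positions taken from the EFLAGS layout (the enum and helper names are made up for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit masks of the arithmetic status flags in EFLAGS. */
enum {
    FLAG_CF = 1 << 0,   /* carry */
    FLAG_PF = 1 << 2,   /* parity */
    FLAG_AF = 1 << 4,   /* auxiliary carry */
    FLAG_ZF = 1 << 6,   /* zero */
    FLAG_SF = 1 << 7,   /* sign */
    FLAG_OF = 1 << 11   /* overflow */
};

/* Hypothetical helper: true if the flag selected by mask is set
   in a previously saved EFLAGS value. */
bool flag_is_set(uint32_t eflags, uint32_t mask)
{
    return (eflags & mask) != 0;
}
```

For instance, a saved value of 0x200a96 has OF and SF set and ZF clear, while 0x200206 has OF clear.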
A very crude Visual C syntax test (just wham-bam / some jumps to debug flow), since I don't know about the GCC syntax:
int test2 = 2147483647; // max 32-bit signed int (0x7fffffff)
unsigned int flags_w_overflow, flags_wo_overflow;
__asm
{
mov ebx, test2 // ebx = test value
// test for no overflow
xor eax, eax // eax = 0
add eax, ebx // add ebx
jno no_overflow // jump if no overflow
testoverflow:
// test for overflow
xor ecx, ecx // ecx = 0
inc ecx // ecx = 1
add ecx, ebx // overflow!
pushfd // store flags (32 bits)
jo overflow // jump if overflow
jmp done // jump if not overflown :(
no_overflow:
pushfd // store flags (32 bits)
pop edx // edx = flags w/o overflow
jmp testoverflow // back to next test
overflow:
jmp done // yeah we're done here :)
done:
pop eax // eax = flags w/overflow
mov flags_w_overflow, eax // store
mov flags_wo_overflow, edx // store
}
if (flags_w_overflow & (1 << 11)) __asm int 0x3 // overflow bit set correctly
if (flags_wo_overflow & (1 << 11)) __asm int 0x3 // overflow bit set incorrectly
return 0;

This may be a case of the XY problem. To check for overflow you do not need to read the hardware overflow flag, because the flag can be calculated easily from the sign bits.
An illustrative example is what happens if we add 127 and 127 using 8-bit registers. 127+127 is 254, but using 8-bit arithmetic the result would be 1111 1110 binary, which is -2 in two's complement, and thus negative. A negative result out of positive operands (or vice versa) is an overflow. The overflow flag would then be set so the program can be aware of the problem and mitigate this or signal an error. The overflow flag is thus set when the most significant bit (here considered the sign bit) is changed by adding two numbers with the same sign (or subtracting two numbers with opposite signs). Overflow never occurs when the sign of two addition operands are different (or the sign of two subtraction operands are the same).
Internally, the overflow flag is usually generated by an exclusive or of the internal carry into and out of the sign bit. As the sign bit is the same as the most significant bit of a number considered unsigned, the overflow flag is "meaningless" and normally ignored when unsigned numbers are added or subtracted.
https://en.wikipedia.org/wiki/Overflow_flag
So the C implementation is
int add(int a, int b, int* overflowed)
{
// do an unsigned addition to prevent UB due to signed overflow
unsigned int r = (unsigned int)a + (unsigned int)b;
// if a and b have the same sign and the result's sign is different from a and b
// then the addition overflowed
*overflowed = !!((~(a ^ b) & (a ^ r)) & 0x80000000);
return (int)r;
}
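The quoted 127 + 127 example and the "carry into and out of the sign bit" description can be checked directly at 8 bits. This is only a sketch (the function name is invented here); it computes the two carries separately and compares them:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: signed 8-bit add overflows exactly when the carry
   into bit 7 differs from the carry out of bit 7. */
bool add8_overflows(int8_t a, int8_t b)
{
    uint8_t ua = (uint8_t)a, ub = (uint8_t)b;
    /* carry into the sign bit: add only the low 7 bits */
    unsigned carry_in  = (((ua & 0x7Fu) + (ub & 0x7Fu)) >> 7) & 1u;
    /* carry out of the sign bit: bit 8 of the full 9-bit sum */
    unsigned carry_out = (((unsigned)ua + (unsigned)ub) >> 8) & 1u;
    return carry_in != carry_out;
}
```

With 127 + 127 there is a carry into bit 7 but none out of it, so the function reports overflow; with -1 + 1 both carries are 1 and it does not.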
This way it works portably on any architecture, unlike your solution, which only works on x86. Smart compilers may recognize the pattern and use the overflow flag where possible. On most RISC architectures like MIPS or RISC-V there is no flag register, and all signed/unsigned overflow must be checked in software by analyzing the sign bits like this.
Some compilers have intrinsics for checking overflow, like __builtin_add_overflow in Clang and GCC. With that intrinsic you can also easily see how the overflow is calculated on non-flag architectures. For example on ARM it's done like this
add w3, w0, w1 # r = a + b
eon w0, w0, w1 # a = a ^ ~b
eor w1, w3, w1 # b = b ^ r
str w3, [x2] # store sum ([x2] = r)
and w0, w1, w0 # a = a & b = (a ^ ~b) & (b ^ r)
lsr w0, w0, 31 # overflowed = a >> 31
ret
which is just a variation of what I've written above
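The C source behind that listing isn't shown; presumably it was a thin wrapper like the first function below (the name add_checked is an assumption). The second function spells out the same (a ^ ~b) & (b ^ r) test by hand, assuming a 32-bit int:

```c
#include <stdbool.h>

/* Wrapper around the GCC/Clang builtin: stores the wrapped sum in *r and
   returns true if the exact result did not fit in int. */
bool add_checked(int a, int b, int *r)
{
    return __builtin_add_overflow(a, b, r);
}

/* The same check written with the sign-bit formula from the asm comments.
   Assumes int is 32 bits wide. */
bool add_checked_manual(int a, int b, int *r)
{
    unsigned ua = (unsigned)a, ub = (unsigned)b;
    unsigned ur = ua + ub;      /* unsigned add wraps, no UB */
    *r = (int)ur;               /* out-of-range conversion is implementation-defined;
                                   GCC and Clang wrap */
    return (((ua ^ ~ub) & (ub ^ ur)) >> 31) != 0;
}
```

Both return true for INT_MAX + 1 and false for 2 + 3.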
See also
Checking overflow in C
Detecting signed overflow in C/C++
Is it possible to access the overflow flag register in a CPU with C++?
Very detailed explanation of Overflow and Carry flags evaluation techniques
For unsigned int it's much easier
unsigned int a, b, result = a + b;
int overflowed = (result < a);
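Wrapped up as a function (the name is made up for illustration), the unsigned check could look like this sketch:

```c
#include <stdbool.h>

/* Hypothetical helper: true if a + b wraps around for unsigned int. */
bool unsigned_add_wraps(unsigned a, unsigned b)
{
    unsigned result = a + b;    /* well-defined: wraps modulo 2^N */
    return result < a;          /* a wrapped sum is smaller than either operand */
}
```

No cast tricks are needed here because unsigned arithmetic is defined to wrap.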

The compiler can reorder instructions, so you cannot rely on your lahf being next to the increment. In fact, there may not be an increment at all. In your code, you don't use the value of a, so the compiler can completely optimize it out.
So, either write the increment + check in assembler, or write it in C.
Also, lahf loads only ah (8 bits) from eflags, and the Overflow flag is outside of that. Better use pushf; pop %eax.
Some tests:
#include <stdio.h>
int main (void){
int a=2147483640, b=0, bold=0;
printf("%d\n",b);
while(1){
a++;
__asm__ __volatile__ ("pushf \n\t"
"pop %%eax\n\t"
"movl %%eax, %0\n\t"
:"=r"(b)
:
:"%eax"
);
if((b & 0x800) != (bold & 0x800)){
printf("register changed \n %x\t to\t %x\n",bold , b);
}
bold = b;
}
}
$ gcc -Wall -o ex2 ex2.c
$ ./ex2 # Works by sheer luck
0
register changed
200206 to 200a96
register changed
200a96 to 200282
$ gcc -Wall -O -o ex2 ex2.c
$ ./ex2 # Doesn't work, the compiler hasn't even optimized yet!
0

You can't assume anything about how GCC implemented the a++ operation, or whether it even did the computation before your inline asm, or before a function call.
You could make a an (unused) input to your inline asm, but gcc could still have chosen to use lea to copy-and-add instead of inc or add, or constant-propagation after inlining could have turned it into a mov-immediate.
And of course gcc could have done some other computation that writes FLAGS right before your inline asm.
There is no way to make a++; asm(...) safe for this
Stop now, you're on the wrong track. If you insist on using asm, you need to do the add or inc inside the asm so you can read the flags output. If you only care about the overflow flag, use SETCC, specifically seto %0, to create an 8-bit output value. Or better, use GCC6 flag-output syntax to tell the compiler that a boolean output result is in the OF condition in FLAGS at the end of your inline asm.
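A rough sketch of both options, with the add done inside the asm statement so the flag really comes from that instruction. The function name is invented; the flag-output path is taken only when the compiler defines __GCC_ASM_FLAG_OUTPUTS__, and non-x86 targets fall back to the builtin so the example still compiles:

```c
#include <stdbool.h>

/* Hypothetical helper: adds b to a in one asm statement and returns OF. */
bool add_sets_overflow(int a, int b)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    bool of;
#  ifdef __GCC_ASM_FLAG_OUTPUTS__
    /* Flag-output operand: the compiler reads OF itself, no setcc needed. */
    __asm__("addl %2, %0" : "+r"(a), "=@cco"(of) : "r"(b));
#  else
    /* Older compilers: materialize OF into a byte register with seto. */
    __asm__("addl %2, %0\n\t"
            "seto %1"
            : "+r"(a), "=q"(of) : "r"(b) : "cc");
#  endif
    return of;
#else
    /* Not x86: use the builtin instead of asm. */
    int sum;
    return __builtin_add_overflow(a, b, &sum);
#endif
}
```

With the flag-output form, a caller that branches on the result can compile to add followed directly by jo or jno.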
Also, signed overflow in C is undefined behaviour, so actually causing overflow in a++ is already a bug. It usually won't manifest itself if you somehow detect it after the fact, but if you use a as an array index or something gcc may have widened it to 64-bit to avoid redoing sign-extension.
GCC has builtins for add with overflow detection, since gcc5
There are builtins for signed/unsigned add, sub, and mul, see the GCC manual, that avoid signed-overflow UB and tell you if there was overflow.
bool __builtin_add_overflow (type1 a, type2 b, type3 *res) is the generic version
bool __builtin_sadd_overflow (int a, int b, int *res) is the signed int version
bool __builtin_saddll_overflow (long long int a, long long int b, long long int *res) is the signed 64-bit long long version.
The compiler will attempt to use hardware instructions to implement these built-in functions where possible, like conditional jump on overflow after addition, conditional jump on carry etc.
There's a saddl version in case you want the operation for whatever size long is on the target platform. (For x86-64 gcc, int is always 32-bit, long long is always 64-bit, but long depends on Windows vs. non-Windows. For platforms like AVR, int would be 16-bit, and only long would be 32-bit.)
#include <stdbool.h>
int checked_add_int(int a, int b, bool *of) {
int result;
*of = __builtin_sadd_overflow(a, b, &result);
return result;
}
compiles with gcc -O3 for x86-64 System V to this asm, on Godbolt
checked_add_int:
mov eax, edi
add eax, esi # can't use the normal lea eax, [rdi+rsi]
seto BYTE PTR [rdx]
and BYTE PTR [rdx], 1 # silly compiler, it's already 0/1
ret
ICC19 uses setcc into an integer register and then stores that, same difference as far as uops, but worse code-size.
After inlining to a caller that did if(of) {} it should just jo or jno instead of actually using setcc to create an integer 0/1; in general this should inline efficiently.
Also, since gcc7, there's a builtin to ask if an addition (after promotion to a given type) would overflow, without returning the value.
#include <stdbool.h>
int overflows(int a, int b) {
bool of = __builtin_add_overflow_p(a, b, (int)0);
return of;
}
compiles with gcc -O3 for x86-64 System V to this asm, also on Godbolt
overflows:
xor eax, eax
add edi, esi
seto al
ret
See also Detecting signed overflow in C/C++

Others have offered good alternate code and reasons why what you're trying to do probably doesn't give the result you want, but the actual bug in your code is that you corrupted the stack state by pushing without popping. I would rewrite the asm as:
pushf
pop %0
Or you could just add $4,%%esp at the end of your asm to fix the stack pointer if you prefer the inefficient way.
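If hand-written push/pop feels fragile (on x86-64 System V, a push inside inline asm can also step on the red zone), GCC and Clang provide __readeflags() in <x86intrin.h>, which leaves the stack handling to the compiler. A minimal sketch, assuming an x86 target (the wrapper name is made up):

```c
#include <x86intrin.h>   /* GCC/Clang, x86 targets only */

/* Hypothetical wrapper: returns the current EFLAGS/RFLAGS value.
   The arithmetic flags in it reflect whatever instruction the compiler
   happened to run last, not necessarily your preceding C statement. */
unsigned long long read_flags(void)
{
    return __readeflags();
}
```

This is fine for printing the register out of curiosity, but it is still no way to detect overflow of a preceding C expression.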

The following C program will read the FLAGS register when compiled with GCC on any x86 or x86_64 machine following a calling convention in which integers are returned in %eax. You may need to pass the -zexecstack argument to the compiler.
#include<stdio.h>
#include<stdlib.h>
int(*f)()=(void*)L"\xc3589c";
int main( int argc, char **argv ) {
if( argc < 3 ) {
printf( "Usage: %s <augend> <addend>\n", *argv );
return 0;
}
int a=atoi(argv[1])+atoi(argv[2]);
int b=f();
printf("%d CF %d PF %d AF %d ZF %d SF %d TF %d IF %d DF %d OF %d IOPL %d NT %d RF %d VM %d AC %d VIF %d VIP %d ID %d\n", a, b&1, b/4&1, b>>4&1, b>>6&1, b>>7&1, b>>8&1, b>>9&1, b>>10&1, b>>11&1, b>>12&3, b>>14&1, b>>16&1, b>>17&1, b>>18&1, b>>19&1, b>>20&1, b>>21&1 );
}
The funny looking string literal disassembles to
0x0000000000000000: 9C pushfq
0x0000000000000001: 58 pop rax
0x0000000000000002: C3 ret

What is this x86 inline assembly doing?

I came across this code and need to understand what it is doing. It just seems to be declaring two bytes and then doing nothing...
uint64_t x;
__asm__ __volatile__ (".byte 0x0f, 0x31" : "=A" (x));
Thanks!

This is generating two bytes (0F 31) directly into the code stream. This is an RDTSC instruction, which reads the time-stamp counter into EDX:EAX, which will then be copied to the variable 'x' by the output constraint "=A"(x)

0F 31 is the x86 opcode for the RDTSC (read time stamp counter) instruction; it places the value read into the EDX and EAX registers.
The __asm__ directive isn't just declaring two bytes; it's placing inline assembly into the C code. Presumably, the program has a way of using the value in those registers immediately afterwards.
http://en.wikipedia.org/wiki/Time_Stamp_Counter

It's inserting an 0F 31 opcode, which according to this site is:
0F 31 P1+ f2 RDTSC EAX EDX IA32_T... Read Time-Stamp Counter
Then it is storing the result in the x variable

It's inline asm for rdtsc, with the machine-code encoding written out to support really old assemblers that don't know the mnemonic.
Unfortunately, it only works correctly in 32-bit code, because "=A" doesn't split 64-bit operands in half in 64-bit code. (The gcc manual even uses rdtsc as an example to illustrate this.)
The safe way to write this, which compiles to optimal code with gcc -m32 or -m64, is:
#include <stdint.h>
uint64_t timestamp_safe(void)
{
unsigned long tsc_low, tsc_high; // not uint32_t: saves a zero-extend for -m64 (but not x32 :/)
asm volatile("rdtsc" : "=d"(tsc_high), "=a" (tsc_low));
return ((uint64_t)tsc_high << 32) | tsc_low;
}
In 32bit code, it's just rdtsc/ret, but in 64bit code it does the necessary shift/or to get both halves into rax for the return value.
See it on the Godbolt compiler explorer.
