Is using #define definitions in C inefficient?

Is using definitions in C like this:
#define C1 42
#define C2 10
#define finalC C1 * C2
inefficient, as this code will ALWAYS be run at run-time even though the calculation might not be needed?

as this code will ALWAYS be run at run-time even though the calculation might not be needed?
The multiplication will most likely not be executed at run-time. For example, this short function:
#define C1 42
#define C2 10
#define finalC C1 * C2
int foo() { return finalC; }
may compile into:
foo():
push rbp
mov rbp, rsp
mov eax, 420
pop rbp
ret
and that's without any optimization flags!
The macro expansion happens before the compiler processes the code, and the compiler is allowed to (and tends to) evaluate constant expressions during compilation.
PS - The compiler used was GCC 11.3 and the target architecture is x86_64.

No reasonable compiler will multiply 42 by 10 for this code at run-time. The product, 420, will be computed during compilation (unless optimized out because it is not even needed in the final program).
The macro definitions and replacements are evaluated during program translation, not while the program is running.
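A side note on the macro itself: because finalC expands textually, the missing parentheses can bind to neighboring operators. A small sketch of the pitfall (the variable x is just for illustration):
#define C1 42
#define C2 10
#define finalC (C1 * C2)   /* parenthesized, so it expands safely in any context */

/* Without the parentheses, 1000 / finalC would expand to
   1000 / 42 * 10, which is (1000 / 42) * 10 == 230, not 1000 / 420 == 2. */
int x = 1000 / finalC;     /* x == 2, still folded at compile time */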

Related

MSVC Inline ASM to GCC

I'm trying to handle both the MSVC and GCC compilers while updating this code base to work on GCC. But I'm unsure exactly how GCC's inline ASM works. I'm not great at translating ASM to C, or else I would just use C instead of ASM.
SLONG Div16(signed long a, signed long b)
{
signed long v;
#ifdef __GNUC__ // GCC doesn't work.
__asm() {
#else // MSVC
__asm {
#endif
mov edx, a
mov ebx, b
mov eax, edx
shl eax, 16
sar edx, 16
idiv ebx
mov v, eax
}
return v;
}
signed long ROR13(signed long val)
{
_asm{
ror val, 13
}
}
I assume ROR13 works something like (val << 13) | (val >> (32 - 13)) but the code doesn't produce the same output.
What is the proper way to translate this inline ASM to GCC and/or whats the C translation of this code?
GCC uses a completely different syntax for inline assembly than MSVC does, so it's quite a bit of work to maintain both forms. It's not an especially good idea, either. There are many problems with inline assembly. People often use it because they think it'll make their code run faster, but it usually has quite the opposite effect. Unless you're an expert in both assembly language and the compiler's code-generation strategies, you are far better off letting the compiler's optimizer generate the code.
When you try to do that, you will have to be a bit careful here, though: signed right shifts are implementation-defined in C, so if you care about portability, you need to cast the value to an equivalent unsigned type:
#include <limits.h> // for CHAR_BIT
signed long ROR13(signed long val)
{
return ((unsigned long)val >> 13) |
((unsigned long)val << ((sizeof(val) * CHAR_BIT) - 13));
}
(See also Best practices for circular shift (rotate) operations in C++).
This will have the same semantics as your original code: ROR val, 13. In fact, MSVC will generate precisely that object code, as will GCC. (Clang, interestingly, will do ROL val, 19, which produces the same result, given the way that rotations work. ICC 17 generates an extended shift instead: SHLD val, val, 19. I'm not sure why; maybe that's faster than rotation on certain Intel processors, or maybe it's the same on Intel but slower on AMD.)
To implement Div16 in pure C, you want:
signed long Div16(signed long a, signed long b)
{
return ((long long)a << 16) / b;
}
On a 64-bit architecture that can do native 64-bit division (assuming long is still a 32-bit type, as on Windows), this will be transformed into:
movsxd rax, a # sign-extend from 32 to 64, if long wasn't already 64-bit
shl rax, 16
cqo # sign-extend rax into rdx:rax
movsxd rcx, b
idiv rcx # or idiv b if the inputs were already 64-bit
ret
Unfortunately, on 32-bit x86, the code isn't nearly as good. Compilers emit a call into their internal library function that provides extended 64-bit division, because they can't prove that using a single 64b/32b => 32b idiv instruction won't fault. (It will raise a #DE exception if the quotient doesn't fit in eax, rather than just truncating)
In other words, transforming:
int32_t Divide(int64_t a, int32_t b)
{
return (a / b);
}
into:
mov eax, a_low
mov edx, a_high
idiv b # will fault if a/b is outside [-2^32, 2^32-1]
ret
is not a legal optimization—the compiler is unable to emit this code. The language standard says that a 64/32 division is promoted to a 64/64 division, which always produces a 64-bit result. That you later cast or coerce that 64-bit result to a 32-bit value is irrelevant to the semantics of the division operation itself. Faulting for some combinations of a and b would violate the as-if rule, unless the compiler can prove that those combinations of a and b are impossible. (For example, if b was known to be greater than 1<<16, this could be a legal optimization for a = (int32_t)input; a <<= 16;. But even though this would produce the same behaviour as the C abstract machine for all inputs, gcc and clang currently don't do that optimization.)
There simply isn't a good way to override the rules imposed by the language standard and force the compiler to emit the desired object code. MSVC doesn't offer an intrinsic for it (although there is a Windows API function, MulDiv, it's not fast, and just uses inline assembly for its own implementation—and with a bug in a certain case, now cemented thanks to the need for backwards compatibility). You essentially have no choice but to resort to assembly, either inline or linked in from an external module.
So, you get into ugliness. It looks like this:
signed long Div16(signed long a, signed long b)
{
#ifdef __GNUC__ // A GNU-style compiler (e.g., GCC, Clang, etc.)
signed long quotient;
signed long remainder; // (unused, but necessary to signal clobbering)
__asm__("idivl %[divisor]"
: "=a" (quotient),
"=d" (remainder)
: "0" ((unsigned long)a << 16),
"1" (a >> 16),
[divisor] "rm" (b)
:
);
return quotient;
#elif _MSC_VER // A Microsoft-style compiler (i.e., MSVC)
__asm
{
mov eax, DWORD PTR [a]
mov edx, eax
shl eax, 16
sar edx, 16
idiv DWORD PTR [b]
// leave result in EAX, where it will be returned
}
#else
#error "Unsupported compiler"
#endif
}
This results in the desired output on both Microsoft and GNU-style compilers.
Well, mostly. For some reason, when you use the rm constraint, which gives the compiler the freedom either to treat the divisor as a memory operand or to load it into a register, Clang generates worse object code than if you just use r (which forces it to load it into a register). This doesn't affect GCC or ICC. If you care about the quality of output on Clang, you'll probably just want to use r, since this will give equally good object code on all compilers.
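In other words, a sketch of the variant with the divisor forced into a register (only the constraint changes from the code above):
__asm__("idivl %[divisor]"
        : "=a" (quotient),
          "=d" (remainder)
        : "0" ((unsigned long)a << 16),
          "1" (a >> 16),
          [divisor] "r" (b)   /* "r" instead of "rm" */
       );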
Live Demo on Godbolt Compiler Explorer
(Note: GCC uses the SAL mnemonic in its output, instead of the SHL mnemonic. These are identical instructions—the difference only matters for right shifts—and all sane assembly programmers use SHL. I have no idea why GCC emits SAL, but you can just convert it mentally into SHL.)

Why is memcmp(a, b, 4) only sometimes optimized to a uint32 comparison?

Given this code:
#include <string.h>
int equal4(const char* a, const char* b)
{
return memcmp(a, b, 4) == 0;
}
int less4(const char* a, const char* b)
{
return memcmp(a, b, 4) < 0;
}
GCC 7 on x86_64 introduced an optimization for the first case (Clang has done it for a long time):
mov eax, DWORD PTR [rsi]
cmp DWORD PTR [rdi], eax
sete al
movzx eax, al
But the second case still calls memcmp():
sub rsp, 8
mov edx, 4
call memcmp
add rsp, 8
shr eax, 31
Could a similar optimization be applied to the second case? What's the best assembly for this, and is there any clear reason why it isn't being done (by GCC or Clang)?
See it on Godbolt's Compiler Explorer: https://godbolt.org/g/jv8fcf
If you generate code for a little-endian platform, optimizing a four-byte memcmp used for an ordering (less-than) comparison into a single DWORD comparison is invalid.
When memcmp compares individual bytes it goes from low-addressed bytes to high-addressed bytes, regardless of the platform.
In order for memcmp to return zero, all four bytes must be identical. Hence, the order of comparison does not matter, and the DWORD optimization is valid for equality tests, where the sign of the result is ignored.
However, when memcmp returns a positive number, byte ordering matters. Hence, implementing the same comparison using 32-bit DWORD comparison requires a specific endianness: the platform must be big-endian, otherwise the result of comparison would be incorrect.
Endianness is the problem here. Consider this input:
a = 01 00 00 03
b = 02 00 00 02
If you compare these two arrays by treating them as little-endian 32-bit integers, you'll find that a is larger (because 0x03000001 > 0x02000002), even though memcmp must report a as smaller (its first byte, 01, is less than 02). On a big-endian machine, the DWORD comparison would match memcmp's byte order and work as expected.
As discussed in other answers/comments, using memcmp(a,b,4) < 0 is equivalent to an unsigned comparison between big-endian integers. It couldn't inline as efficiently as == 0 on little-endian x86.
More importantly, the current version of this behaviour in gcc7/8 only looks for memcmp() == 0 or != 0. Even on a big-endian target where this could inline just as efficiently for < or >, gcc won't do it. (Godbolt's newest big-endian compilers are PowerPC 64 gcc6.3, and MIPS/MIPS64 gcc5.4. mips is big-endian MIPS, while mipsel is little-endian MIPS.) If testing this with future gcc, use a = __builtin_assume_aligned(a, 4) to make sure gcc doesn't have to worry about unaligned-load performance/correctness on non-x86. (Or just use const int32_t* instead of const char*.)
If/when gcc learns to inline memcmp for cases other than EQ/NE, maybe gcc will do it on little-endian x86 when its heuristics tell it the extra code size will be worth it. e.g. in a hot loop when compiling with -fprofile-use (profile-guided optimization).
If you want compilers to do a good job for this case, you should probably assign to a uint32_t and use an endian-conversion function like ntohl. But make sure you pick one that can actually inline; apparently Windows has an ntohl that compiles to a DLL call. See other answers on that question for some portable-endian stuff, and also someone's imperfect attempt at a portable_endian.h, and this fork of it. I was working on a version for a while, but never finished/tested it or posted it.
Pointer-casting to const uint32_t* would be Undefined Behaviour, if the bytes were written as anything but aligned uint32_t or through char*. If you're not sure about strict-aliasing and/or alignment, memcpy into abytes or use GNU C attributes: see another Q&A about alignment and strict-aliasing for workarounds. Most compilers are good at optimizing away small fixed-size memcpy.
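For instance, here is a minimal sketch of that memcpy-plus-byte-swap approach (my function name; assumes a POSIX ntohl that actually inlines, which, as noted above, isn't a given on Windows):
#include <arpa/inet.h>   // ntohl (POSIX; Windows would need winsock2.h)
#include <stdint.h>
#include <string.h>
int less4_ntohl(const char* a, const char* b) {
    uint32_t ua, ub;
    memcpy(&ua, a, 4);   // fixed-size memcpy typically optimizes to a single load
    memcpy(&ub, b, 4);
    return ntohl(ua) < ntohl(ub);   // compare in big-endian order, like memcmp
}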
// I know the question just wonders why gcc does what it does,
// not asking for how to write it differently.
// Beware of alignment performance or even fault issues outside of x86.
#include <endian.h>
#include <stdint.h>
static inline
uint32_t load32_native_endian(const void *vp){
typedef uint32_t unaligned_aliasing_u32 __attribute__((aligned(1),may_alias));
const unaligned_aliasing_u32 *up = vp;
return *up; // #ifndef __GNUC__ then use memcpy
}
int equal4_optim(const char* a, const char* b) {
uint32_t abytes = load32_native_endian(a);
uint32_t bbytes = load32_native_endian(b);
return abytes == bbytes;
}
int less4_optim(const char* a, const char* b) {
uint32_t a_native = be32toh(load32_native_endian(a));
uint32_t b_native = be32toh(load32_native_endian(b));
return a_native < b_native;
}
I checked on Godbolt, and that compiles to efficient code (basically identical to what I wrote in asm below), especially on big-endian platforms, even with old gcc. It also makes much better code than ICC17, which inlines memcmp but only to a byte-compare loop (even for the == 0 case).
I think this hand-crafted sequence is an optimal implementation of less4() (for the x86-64 SystemV calling convention, like used in the question, with const char *a in rdi and b in rsi).
less4:
mov edi, [rdi]
mov esi, [rsi]
bswap edi
bswap esi
# data loaded and byte-swapped to native unsigned integers
xor eax,eax # solves the same problem as gcc's movzx, see below
cmp edi, esi
setb al # eax=1 if *a was Below(unsigned) *b, else 0
ret
Those are all single-uop instructions on Intel and AMD CPUs since K8 and Core2 (http://agner.org/optimize/).
Having to bswap both operands has an extra code-size cost vs. the == 0 case: we can't fold one of the loads into a memory operand for cmp. (That saves code size, and uops thanks to micro-fusion.) This is on top of the two extra bswap instructions.
On CPUs that support movbe, it can save code size: movbe ecx, [rsi] is a load + bswap. On Haswell, it's 2 uops, so presumably it decodes to the same uops as mov ecx, [rsi] / bswap ecx. On Atom/Silvermont, it's handled right in the load ports, so it's fewer uops as well as smaller code-size.
See the setcc part of my xor-zeroing answer for more about why xor/cmp/setcc (which clang uses) is better than cmp/setcc/movzx (typical for gcc).
In the usual case where this inlines into code that branches on the result, the setcc + zero-extend are replaced with a jcc; the compiler optimizes away creating a boolean return value in a register. This is yet another advantage of inlining: the library memcmp does have to create an integer boolean return value which the caller tests, because no x86 ABI/calling convention allows for returning boolean conditions in flags. (I don't know of any non-x86 calling conventions that do that either). For most library memcmp implementations, there's also significant overhead in choosing a strategy depending on length, and maybe alignment checking. That can be pretty cheap, but for size 4 it's going to be more than the cost of all the real work.
Endianness is one problem, but signed char is another. For example, consider that the four bytes you compare are 0x207f2020 and 0x20802020. The 80 as signed char is -128, the 7f as signed char is +127. But if you compare the four bytes, no comparison will give you the right order.
Of course you can do an xor with 0x80808080 and then you can just use an unsigned compare.
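A minimal sketch of that trick (my naming; the operands are assumed to be already loaded and byte-swapped to big-endian order, as in less4_optim above):
#include <stdint.h>
/* Signed-byte ordering: flipping each byte's sign bit maps signed char
   -128..127 onto unsigned 0..255 in order, so an unsigned compare works. */
int less4_signed(uint32_t a_be, uint32_t b_be) {
    return (a_be ^ 0x80808080u) < (b_be ^ 0x80808080u);
}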

AVR GCC, assembly C stub functions, eor and the required constant value

I'm having this code:
uint16_t swap_bytes(uint16_t x)
{
asm volatile(
"eor, %A0, %B0" "\n\t"
"eor, %B0, %A0" "\n\t"
"eor, %A0, %B0" "\n\t"
: "=r" (x)
: "0" (x)
);
return x;
}
Which translates (by avr-gcc version 4.8.1 with -std=gnu99 -save-temps) to:
.global swap_bytes
.type swap_bytes, #function
swap_bytes:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
/* #APP */
; 43 "..\lib\own\ownlib.c" 1
eor, r24, r25
eor, r25, r24
eor, r24, r25
; 0 "" 2
/* #NOAPP */
ret
.size swap_bytes, .-swap_bytes
But then the build fails, with the assembler complaining:
|65|Error: constant value required|
|65|Error: garbage at end of line|
|66|Error: constant value required|
|66|Error: garbage at end of line|
|67|Error: constant value required|
|67|Error: garbage at end of line|
||=== Build failed: 6 error(s), 0 warning(s) (0 minute(s), 0 second(s)) ===|
The mentioned lines are the ones with the eor commands. Why does the compiler have problems with that? The registers are even upper (>= r16) ones where nearly all operations are possible. constant value required sounds to me like it expects a literal... I don't get it.
Just to clarify for future googlers:
eor, r24, r25
has an extra comma after the eor. This should be written as:
eor r24, r25
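With the commas removed, the statement assembles; a corrected sketch of the whole function (also dropping volatile, for the reason covered below):
#include <stdint.h>
uint16_t swap_bytes(uint16_t x)
{
    asm("eor %A0, %B0" "\n\t"   /* xor-swap the low (%A0) and high (%B0) bytes */
        "eor %B0, %A0" "\n\t"
        "eor %A0, %B0"
        : "=r" (x)
        : "0" (x)
    );
    return x;
}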
I would also encourage you (again) to consider using gcc's __builtin_bswap16. In case you are not familiar with the gcc 'builtin' functions, they are functions that are built into the compiler, and (despite looking like functions) are typically inlined. They have been written and optimized by people who understand all the ins and outs of the various processors and can take into account things you may not have considered.
I understand the desire to keep code as small as possible. And I accept that it is possible that (somehow) this builtin on your specific processor is producing sub-optimal code (I assume you have checked?). On the other hand, it may produce exactly the same code. Or it may use some even more clever trick to do this. Or it might interleave instructions from the surrounding code to take advantage of pipelining (or some other avr-specific thing that I have never heard of because I don't speak 'avr').
What's more, consider this code:
int main()
{
return __builtin_bswap16(12345);
}
Your code always takes 3 instructions to process a swap. However with builtins, the compiler can recognize that the arg is constant and compute the value at compile time instead of at run time. Hard to be more efficient than that.
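With optimization on, the whole function above might compile to something like this (illustrative only; the AVR ABI returns a 16-bit value in r25:r24, and __builtin_bswap16(0x3039) == 0x3930):
ldi r24, 0x30    ; low byte of the swapped constant
ldi r25, 0x39    ; high byte
ret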
I could also point out the benefits of "easier to support." Writing inline asm is HARD to do correctly. And future maintainers hate to touch it cuz they're never quite sure how it works. And of course, the builtin is going to be more cross-platform portable.
Still not convinced? My last pitch: Even after you fix the commas, your inline asm code still isn't quite right. Consider this code:
int main(int argc, char *argv[])
{
return swap_bytes(argc) + swap_bytes(argc);
}
Because of the way you have written swap_bytes (i.e., using volatile), gcc must compute the value twice (see the definition of volatile). Had you omitted volatile (or if you had used the builtin, which does this correctly), it would have realized that argc doesn't change and re-used the output from the first call. Did I mention that correctly writing inline asm is HARD?
I don't know your code, constraints, level of expertise or requirements. Maybe your solution really is the best. The most I can do is to encourage you to think long and hard before using inline asm in production code.

Short-circuiting on boolean operands without side effects

For the bounty: How can this behavior be disabled on a case-by-case basis without disabling or lowering the optimization level?
The following conditional expression was compiled on MinGW GCC 3.4.5, where a is of type signed long, and m is of type unsigned long.
if (!a && m > 0x002 && m < 0x111)
The CFLAGS used were -g -O2. Here is the corresponding GCC assembly output (dumped with objdump):
120: 8b 5d d0 mov ebx,DWORD PTR [ebp-0x30]
123: 85 db test ebx,ebx
125: 0f 94 c0 sete al
128: 31 d2 xor edx,edx
12a: 83 7d d4 02 cmp DWORD PTR [ebp-0x2c],0x2
12e: 0f 97 c2 seta dl
131: 85 c2 test edx,eax
133: 0f 84 1e 01 00 00 je 257 <_MyFunction+0x227>
139: 81 7d d4 10 01 00 00 cmp DWORD PTR [ebp-0x2c],0x110
140: 0f 87 11 01 00 00 ja 257 <_MyFunction+0x227>
120-131 can easily be traced as first evaluating !a, followed by the evaluation of m > 0x002. The first conditional jump does not occur until 133. By this time, both expressions have been evaluated, regardless of the outcome of the first expression, !a. If a was nonzero, the expression could (and, it would seem, should) be concluded immediately, which is not done here.
How does this relate to the C standard, which requires Boolean operators to short-circuit as soon as the outcome can be determined?
The C standard only specifies the behavior of an "abstract machine"; it does not specify the generation of assembly. As long as the observable behavior of a program matches that on the abstract machine, the implementation can use whatever physical mechanism it likes for implementing the language constructs. The relevant section in the standard (C99) is 5.1.2.3 Program execution.
It is probably a compiler optimization since comparing integral types has no side effects. You could try compiling without optimizations or using a function that has side effects instead of the comparison operator and see if it still does this.
For example, try
if (printf("a") || printf("b")) {
printf("c\n");
}
and it should print ac
As others have mentioned, this assembly output is a compiler optimization that doesn't affect program execution (as far as the compiler can tell). If you want to selectively disable this optimization, you need to tell the compiler that your variables should not be optimized across the sequence points in the code.
Sequence points occur at the end of control expressions (the evaluations in if, switch, while and do, and all three sections of for), after the first operand of logical OR and AND, the conditional (?:) and the comma operator, and at return statements.
To prevent compiler optimization across these points, you must declare your variable volatile. In your example, you can specify
volatile long a;
unsigned long m;
{...}
if (!a && m > 0x002 && m < 0x111) {...}
The reason that this works is that volatile is used to instruct the compiler that it can't predict the behavior of an equivalent machine with respect to the variable. Therefore, it must strictly obey the sequence points in your code.
The compiler is optimising: it loads a into EBX, tests it and sets AL (part of EAX) to !a, performs the second check into EDX, then branches once on the combined test of EAX and EDX. This saves a branch and leaves the code running faster, without making any difference at all in terms of side effects.
If you compile with -O0 rather than -O2, I imagine it will produce more naive assembly that more closely matches your expectations.
The code is behaving correctly (i.e., in accordance with the requirements of the language standard) either way.
It appears that you're trying to find a way to generate specific assembly code. Of two possible assembly code sequences, both of which behave the same way, you find one satisfactory and the other unsatisfactory.
The only really reliable way to guarantee the satisfactory assembly code sequence is to write the assembly code explicitly. gcc does support inline assembly.
C code specifies behavior. Assembly code specifies machine code.
But all this raises the question: why does it matter to you? (I'm not saying it shouldn't, I just don't understand why it should.)
EDIT: How exactly are a and m defined? If, as you suggest, they're related to memory-mapped devices, then they should be declared volatile -- and that might be exactly the solution to your problem. If they're just ordinary variables, then the compiler can do whatever it likes with them (as long as it doesn't affect the program's visible behavior) because you didn't ask it not to.

Reading a register value into a C variable [duplicate]

This question already has answers here: Why can't I get the value of asm registers in C? (2 answers)
I remember seeing a way to use extended gcc inline assembly to read a register value and store it into a C variable.
I cannot though for the life of me remember how to form the asm statement.
Editor's note: this way of using a local register-asm variable is now documented by GCC as "not supported". It still usually happens to work on GCC, but breaks with clang. (This wording in the documentation was added after this answer was posted, I think.)
The global fixed-register variable version has a large performance cost for 32-bit x86, which only has 7 GP-integer registers (not counting the stack pointer). This would reduce that to 6. Only consider this if you have a global variable that all of your code uses heavily.
Going in a different direction than other answers so far, since I'm not sure what you want.
GCC Manual § 5.40 Variables in Specified Registers
register int *foo asm ("a5");
Here a5 is the name of the register which should be used…
Naturally the register name is cpu-dependent, but this is not a problem, since specific registers are most often useful with explicit assembler instructions (see Extended Asm). Both of these things generally require that you conditionalize your program according to cpu type.
Defining such a register variable does not reserve the register; it remains available for other uses in places where flow control determines the variable's value is not live.
GCC Manual § 3.18 Options for Code Generation Conventions
-ffixed-reg
Treat the register named reg as a fixed register; generated code should never refer to it (except perhaps as a stack pointer, frame pointer or in some other fixed role).
This can replicate Richard's answer in a simpler way,
int main() {
register int i asm("ebx");
return i + 1;
}
although this is rather meaningless, as you have no idea what's in the ebx register.
If you combined these two, compiling this with gcc -ffixed-ebx,
#include <stdio.h>
register int counter asm("ebx");
void check(int n) {
if (!(n % 2 && n % 3 && n % 5)) counter++;
}
int main() {
int i;
counter = 0;
for (i = 1; i <= 100; i++) check(i);
printf("%d Hamming numbers between 1 and 100\n", counter);
return 0;
}
you can ensure that a C variable always resides in a register for speedy access and also will not get clobbered by other generated code. (Handily, ebx is callee-saved under usual x86 calling conventions, so even if it gets clobbered by calls to other functions compiled without -ffixed-*, it should get restored too.)
On the other hand, this definitely isn't portable, and usually isn't a performance benefit either, as you're restricting the compiler's freedom.
Here is a way to get ebx:
int main()
{
int i;
asm("\t movl %%ebx,%0" : "=r"(i));
return i + 1;
}
The result:
main:
subl $4, %esp
#APP
movl %ebx,%eax
#NO_APP
incl %eax
addl $4, %esp
ret
Edit:
The "=r"(i) is an output constraint, telling the compiler that the first output (%0) is a register that should be placed in the variable "i". At this optimization level (-O5) the variable i never gets stored to memory, but is held in the eax register, which also happens to be the return value register.
I don't know about gcc, but in VS this is how:
int data = 0;
__asm
{
mov ebx, 30
mov data, ebx
}
cout<<data;
Essentially, I moved the data in ebx to your variable data.
This will move the stack pointer register into the sp variable.
intptr_t sp;
asm ("movl %%esp, %0" : "=r" (sp) );
Just replace 'esp' with the actual register you are interested in (but make sure not to lose the %%) and 'sp' with your variable.
From the GCC docs itself: http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
#include <stdio.h>
void gav(){
//rgv_t argv = get();
register unsigned long long i asm("rax");
register unsigned long long ii asm("rbx");
printf("I`m gav - first arguman is: %s - 2th arguman is: %s\n", (char *)i, (char *)ii);
}
int main(void)
{
char *test = "I`m main";
char *test1 = "I`m main2";
printf("0x%llx\n", (unsigned long long)&gav);
asm("call %P0" : :"i"((unsigned long long)&gav), "a"(test), "b"(test1));
return 0;
}
You can't know what value compiler-generated code will have stored in any register when your inline asm statement runs, so the value is usually meaningless, and you'd be much better off using a debugger to look at register values when stopped at a breakpoint.
That being said, if you're going to do this strange task, you might as well do it efficiently.
On some targets (like x86) you can use specific-register output constraints to tell the compiler which register an output will be in. Use a specific-register output constraint with an empty asm template (zero instructions) to tell the compiler that your asm statement doesn't care about that register value on input, but afterward the given C variable will be in that register.
#include <stdint.h>
int foo() {
uint64_t rax_value; // type width determines register size
asm("" : "=a"(rax_value)); // =letter determines which register (or partial reg)
uint32_t ebx_value;
asm("" : "=b"(ebx_value));
uint16_t si_value;
asm("" : "=S"(si_value) );
uint8_t sil_value; // x86-64 required to use the low 8 of a reg other than a-d
// With -m32: error: unsupported size for integer register
asm("# Hi mom, my output constraint picked %0" : "=S"(sil_value) );
return sil_value + ebx_value;
}
Compiled with clang5.0 on Godbolt for x86-64. Notice that the 2 unused output values are optimized away: no #APP / #NO_APP compiler-generated asm-comment pairs appear for them (those comments switch the assembler out of / into fast-parsing mode, or at least used to, if that's still a thing). This is because I didn't use asm volatile, and since they have an output operand they're not implicitly volatile.
foo(): # #foo()
# BB#0:
push rbx
#APP
#NO_APP
#DEBUG_VALUE: foo:ebx_value <- %EBX
#APP
# Hi mom, my output constraint picked %sil
#NO_APP
#DEBUG_VALUE: foo:sil_value <- %SIL
movzx eax, sil
add eax, ebx
pop rbx
ret
# -- End function
# DW_AT_GNU_pubnames
# DW_AT_external
Notice the compiler-generated code to add two outputs together, directly from the registers specified. Also notice the push/pop of RBX, because RBX is a call-preserved register in the x86-64 System V calling convention. (And basically all 32 and 64-bit x86 calling conventions). But we've told the compiler that our asm statement writes a value there. (Using an empty asm statement is kind of a hack; there's no syntax to directly tell the compiler we just want to read a register, because like I said you don't know what the compiler was doing with the registers when your asm statement is inserted.)
The compiler will treat your asm statement as if it actually wrote that register, so if it needs the value for later, it will have copied it to another register (or spilled to memory) when your asm statement "runs".
The other x86 register constraints are b (bl/bx/ebx/rbx), c (.../rcx), d (.../rdx), S (sil/si/esi/rsi), D (.../rdi). There is no specific constraint for bpl/bp/ebp/rbp, even though it's not special in functions without a frame pointer. (Maybe because using it would make your code not compile with -fno-omit-frame-pointer.)
You can use register uint64_t rbp_var asm ("rbp"), in which case asm("" : "=r" (rbp_var)); guarantees that the "=r" constraint will pick rbp. Similarly for r8-r15, which don't have any explicit constraints either. On some architectures, like ARM, asm-register variables are the only way to specify which register you want for asm input/output constraints. (And note that asm constraints are the only supported use of register asm variables; there's no guarantee that the variable's value will be in that register any other time.)
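Putting that together, a sketch (my function name; this may fail to compile when a frame pointer is actually in use, since rbp is then reserved):
#include <stdint.h>
uint64_t read_rbp(void)
{
    register uint64_t rbp_var asm("rbp");   // local register-asm variable
    asm("" : "=r" (rbp_var));               // "=r" is forced to pick rbp
    return rbp_var;
}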
There's nothing to stop the compiler from placing these asm statements anywhere it wants within a function (or parent functions after inlining). So you have no control over where you're sampling the value of a register. asm volatile may avoid some reordering, but maybe only with respect to other volatile accesses. You could check the compiler-generated asm to see if you got what you wanted, but beware that it might have been by chance and could break later.
You can place an asm statement in the dependency chain for something else to control where the compiler places it. Use a "+rm" constraint to tell the compiler it modifies some other variable which is actually used for something that doesn't optimize away.
uint32_t ebx_value;
asm("" : "=b"(ebx_value), "+rm"(some_used_variable) );
where some_used_variable might be a return value from one function, and (after some processing) passed as an arg to another function. Or computed in a loop, and will be returned as the function's return value. In that case, the asm statement is guaranteed to come at some point after the end of the loop, and before any code that depends on the later value of that variable.
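For example, a sketch of that placement trick (function names are hypothetical):
#include <stdint.h>
uint32_t bar(uint32_t);                      // hypothetical external function

uint32_t sample_ebx_between(uint32_t x)
{
    uint32_t t = bar(x);                     // t is genuinely used below, so...
    uint32_t ebx_value;
    asm("" : "=b"(ebx_value), "+rm"(t));     // ...this is anchored after t exists
    return bar(t) + ebx_value;               // ...and before this use of t
}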
This will defeat optimizations like constant-propagation for that variable, though. https://gcc.gnu.org/wiki/DontUseInlineAsm. The compiler can't assume anything about the output value; it doesn't check that the asm statement has zero instructions.
This doesn't work for some registers that gcc won't let you use as output operands or clobbers, e.g. the stack pointer.
Reading the value into a C variable might make sense for a stack pointer, though, if your program does something special with stacks.
As an alternative to inline-asm, there's __builtin_frame_address(0) to get a stack address. (But IIRC, using it causes the function to make a full stack frame, even when -fomit-frame-pointer is enabled, as it is by default on x86.)
Still, in many functions that's nearly free (and making a stack frame can be good for code-size, because of smaller addressing modes for RBP-relative than RSP-relative access to local variables).
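For instance, a minimal sketch (prints the frame address, which is near, but not identical to, the stack pointer):
#include <stdio.h>
int main(void)
{
    void *frame = __builtin_frame_address(0);   // this function's frame address
    printf("frame address: %p\n", frame);
    return 0;
}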
Using a mov instruction in an asm statement would of course work, too.
Isn't this what you are looking for?
Syntax:
asm ("fsinx %1,%0" : "=f" (result) : "f" (angle));
