The difference between asm, asm volatile and clobbering memory - c

When implementing lock-free data structures and timing code it's often necessary to suppress the compiler's optimisations. Normally people do this using asm volatile with memory in the clobber list, but you sometimes see just asm volatile or just a plain asm clobbering memory.
What impact do these different statements have on code generation (particularly in GCC, as it's unlikely to be portable)?
Just for reference, these are the interesting variations:
asm (""); // presumably this has no effect on code generation
asm volatile ("");
asm ("" ::: "memory");
asm volatile ("" ::: "memory");

See the "Extended Asm" page in the GCC documentation.
You can prevent an asm instruction from being deleted by writing the keyword volatile after the asm. [...] The volatile keyword indicates that the instruction has important side-effects. GCC will not delete a volatile asm if it is reachable.
and
An asm instruction without any output operands will be treated identically to a volatile asm instruction.
None of your examples have output operands specified, so the asm and asm volatile forms behave identically: they create a point in the code which may not be deleted (unless it is proved to be unreachable).
This is not quite the same as doing nothing. See this question for an example of a dummy asm which changes code generation - in that example, code that goes round a loop 1000 times gets vectorised into code which calculates 16 iterations of the loop at once; but the presence of an asm inside the loop inhibits the optimisation (the asm must be reached 1000 times).
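For illustration, here is a minimal sketch of that effect (the function names are mine, not from the linked question): the empty asm has no outputs, so it is implicitly volatile and must be reached once per iteration, which inhibits vectorisation of the loop.

int sum_plain(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];                  // typically auto-vectorised at -O3
    return s;
}

int sum_with_asm(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++) {
        asm ("");                   // no outputs, so implicitly volatile; must be reached every iteration
        s += a[i];                  // vectorisation is inhibited
    }
    return s;
}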
The "memory" clobber makes GCC assume that any memory may be arbitrarily read or written by the asm block, so will prevent the compiler from reordering loads or stores across it:
This will cause GCC to not keep memory values cached in registers across the assembler instruction and not optimize stores or loads to that memory.
(That does not prevent a CPU from reordering loads and stores with respect to another CPU, though; you need real memory barrier instructions for that.)
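As a hedged sketch of what the clobber buys you (the variable and function names are mine): the compiler fence keeps the store to data ahead of the store to ready in the emitted code, but it is purely a compile-time constraint; on a weakly ordered CPU a real barrier instruction would still be needed.

int data;
int ready;

void publish(int value)
{
    data = value;
    asm volatile ("" ::: "memory");   // compiler may not sink the store to data below this point,
                                      // nor hoist the store to ready above it; no instruction is emitted
    ready = 1;
}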

asm ("") does nothing (or at least, it's not supposed to do anything.
asm volatile ("") also does nothing.
asm ("" ::: "memory") is a simple compiler fence.
asm volatile ("" ::: "memory") AFAIK is the same as the previous. The volatile keyword tells the compiler that it's not allowed to move this assembly block. For example, it may be hoisted out of a loop if the compiler decides that the input values are the same in every invocation. I'm not really sure under what conditions the compiler will decide that it understands enough about the assembly to try to optimize its placement, but the volatile keyword suppresses that entirely. That said, I would be very surprised if the compiler attempted to move an asm statement that had no declared inputs or outputs.
Incidentally, volatile also prevents the compiler from deleting the expression if it decides that the output values are unused. This can only happen if there are output values though, so it doesn't apply to asm ("" ::: "memory").
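A common illustration of why volatile matters once there are outputs (x86-specific, and only a sketch of mine): reading the time-stamp counter. Without volatile, two back-to-back calls could be CSE'd into one, or the asm could be dropped entirely if its result is unused.

unsigned long long read_tsc(void)
{
    unsigned int lo, hi;
    asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));   // has outputs, so volatile is needed to keep it
    return ((unsigned long long)hi << 32) | lo;
}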

Just for completeness on Lily Ballard's answer, Visual Studio 2010 offers _ReadBarrier(), _WriteBarrier() and _ReadWriteBarrier() to do the same (VS2010 doesn't allow inline assembly for 64-bit apps).
These don't generate any instructions but affect the behaviour of the compiler. A nice example is here.
MemoryBarrier() generates lock or DWORD PTR [rsp], 0
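A minimal MSVC-flavoured sketch (the function is mine): _ReadWriteBarrier() emits no instruction, it only stops the compiler from moving loads and stores across it, much like asm volatile ("" ::: "memory") in GCC; MemoryBarrier() is the one that emits a real fencing instruction.

#include <intrin.h>

void publish(volatile int *data, volatile int *ready, int value)
{
    *data = value;
    _ReadWriteBarrier();   // compiler-only fence; no machine instruction generated
    *ready = 1;
}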

Related

Why do read and write barrier for x86 in glibc not use __volatile asm?

I am studying glibc (version 2.32). As for memory barriers, the read, write and full barriers for x86 are as follows:
#define atomic_full_barrier() \
    __asm __volatile (LOCK_PREFIX "orl $0, (%%" SP_REG ")" ::: "memory")
#define atomic_read_barrier() __asm ("" ::: "memory")
#define atomic_write_barrier() __asm ("" ::: "memory")
As cppreference and this answer say, volatile
tells the compiler not to optimize away or reorder this instruction.
Why do write and read barrier not use __asm __volatile, while full barrier uses it?
An asm statement with no output operands is implicitly volatile (GCC manual).
So they're actually all volatile, which is necessary for them to not be removed by the optimizer.
(non-volatile asm is assumed to be a pure function with no side effects, run only if needed to produce the outputs. The clobbers are only clobbered if / when the optimizer decides it needs to run the asm statement).
Different authors chose to be more or less explicit. If you git blame, I expect you'll see those were written at different times and/or by different people.
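To see why the compiler-only fence is enough for the read/write barriers on x86 (this is my own illustrative sketch, not glibc code): x86 already keeps ordinary stores in program order, so the only thing that must be stopped is compiler reordering; only the full barrier, which also has to order stores against later loads, needs the real lock-prefixed instruction.

int payload;
int flag;

void producer(int v)
{
    payload = v;
    __asm__ ("" ::: "memory");   // same effect as atomic_write_barrier(); implicitly volatile
    flag = 1;                    // x86 hardware already keeps these two stores in order
}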

Why can't local variable be used in GNU C basic inline asm statements?

Why can't I use local variables from main in basic inline asm? It is only allowed in extended asm, but why so?
(I know local variables are on the stack after the return address (and therefore cannot be used once the function returns), but that should not be the reason not to use them.)
An example of basic asm:
#include <stdio.h>

int a = 10; //global a
int b = 20; //global b
int result;

int main() {
    asm ( "pusha\n\t"
          "movl a, %eax\n\t"
          "movl b, %ebx\n\t"
          "imull %ebx, %eax\n\t"
          "movl %eax, result\n\t"
          "popa");
    printf("the answer is %d\n", result);
    return 0;
}
An example of extended asm:
#include <stdio.h>

int main (void) {
    int data1 = 10; //local var - can be used in extended asm
    int data2 = 20;
    int result;
    asm ( "imull %%edx, %%ecx\n\t"
          "movl %%ecx, %%eax"
          : "=a"(result)
          : "d"(data1), "c"(data2));
    printf("The result is %d\n", result);
    return 0;
}
Compiled with:
gcc -m32 somefile.c
platform:
uname -a:
Linux 5.0.0-32-generic #34-Ubuntu SMP Wed Oct 2 02:06:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
You can use local variables in extended assembly, but you need to tell the extended assembly construct about them. Consider:
#include <stdio.h>

int main (void)
{
    int data1 = 10;
    int data2 = 20;
    int result;
    __asm__(
        " movl %[mydata1], %[myresult]\n"
        " imull %[mydata2], %[myresult]\n"
        : [myresult] "=&r" (result)
        : [mydata1] "r" (data1), [mydata2] "r" (data2));
    printf("The result is %d\n", result);
    return 0;
}
In this [myresult] "=&r" (result) says to select a register (r) that will be used as an output (=) value for the lvalue result, and that register will be referred to in the assembly as %[myresult] and must be different from the input registers (&). (You can use the same text in both places, result instead of myresult; I just made it different for illustration.)
Similarly [mydata1] "r" (data1) says to put the value of expression data1 into a register, and it will be referred to in the assembly as %[mydata1].
I modified the code in the assembly so that it only modifies the output register. Your original code modifies %ecx but does not tell the compiler it is doing that. You could have told the compiler that by putting "ecx" after a third :, which is where the list of “clobbered” registers goes. However, since my code lets the compiler assign a register, I would not have a specific register name to put in the clobber list. There may be a way to tell the compiler that one of the input registers will be modified but is not needed for output, but I do not know it. (Documentation is here.) For this task, a better solution is to tell the compiler to use the same register for one of the inputs as the output:
__asm__(
    " imull %[mydata1], %[myresult]\n"
    : [myresult] "=r" (result)
    : [mydata1] "r" (data1), [mydata2] "0" (data2));
In this, the 0 with data2 says to make it the same as operand 0. The operands are numbered in the order they appear, starting with 0 for the first output operand and continuing into the input operands. So, when the assembly code starts, %[myresult] will refer to some register that the value of data2 has been placed in, and the compiler will expect the new value of result to be in that register when the assembly is done.
When doing this, you have to match the constraint with how a thing will be used in assembly. For the r constraint, the compiler supplies some text that can be used in assembly language where a general processor register is accepted. Others include m for a memory reference, and i for an immediate operand.
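A small sketch of the other constraint kinds mentioned (the variable and operand names are mine): "m" hands the asm a memory reference, "i" an immediate constant, and "+m" marks the memory operand as read-modify-write.

int counter;

void add_five(void)
{
    __asm__ ("addl %[imm], %[mem]"
             : [mem] "+m" (counter)   // read-modify-write memory operand
             : [imm] "i" (5)          // immediate operand, substituted as $5 in AT&T syntax
             : "cc");                 // addl writes the flags
}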
There is little distinction between "Basic asm" and "Extended asm"; "basic asm" is just a special case where the __asm__ statement has no lists of outputs, inputs, or clobbers. The compiler does not do % substitution in the assembly string for Basic asm. If you want inputs or outputs you have to specify them, and then it's what people call "extended asm".
In practice, it may be possible to access external (or even file-scope static) objects from "basic asm". This is because these objects will (respectively may) have symbol names at the assembly level. However, to perform such access you need to be careful of whether it is position-independent (if your code will be linked into libraries or PIE executables) and meets other ABI constraints that might be imposed at linking time, and there are various considerations for compatibility with link-time optimization and other transformations the compiler may perform. In short, it's a bad idea because you can't tell the compiler that a basic asm statement modified memory. There's no way to make it safe.
A "memory" clobber (Extended asm) can make it safe to access static-storage variables by name from the asm template.
The use-case for basic asm is things that modify the machine state only, like asm("cli") in a kernel to disable interrupts, without reading or writing any C variables. (Even then, you'd often use a "memory" clobber to make sure the compiler had finished earlier memory operations before changing machine state.)
Local (automatic storage, not static ones) variables fundamentally never have symbol names, because they don't exist in a single instance; there's one object per live instance of the block they're declared in, at runtime. As such, the only possible way to access them is via input/output constraints.
Users coming from MSVC-land may find this surprising since MSVC's inline assembly scheme papers over the issue by transforming local variable references in their version of inline asm into stack-pointer-relative accesses, among other things. The version of inline asm it offers, however, is not compatible with an optimizing compiler, and little to no optimization can happen in functions using that type of inline asm. GCC, and the larger compiler world that grew alongside C out of Unix, do not do anything similar.
You can't safely use globals in Basic Asm statements either; it happens to work with optimization disabled but it's not safe and you're abusing the syntax.
There's very little reason to ever use Basic Asm. Even for machine-state control like asm("cli") to disable interrupts, you'd often want a "memory" clobber to order it wrt. loads / stores to globals. In fact, GCC's https://gcc.gnu.org/wiki/ConvertBasicAsmToExtended page recommends never using Basic Asm because it differs between compilers, and GCC might change to treating it as clobbering everything instead of nothing (because of existing buggy code that makes wrong assumptions). This would make a Basic Asm statement that uses push/pop even more inefficient if the compiler is also generating stores and reloads around it.
Basically the only use-case for Basic Asm is writing the body of an __attribute__((naked)) function, where data inputs/outputs / interaction with other code follows the ABI's calling convention, instead of whatever custom convention the constraints / clobbers describe for a truly inline block of code.
The design of GNU C inline asm is that it's text that you inject into the compiler's normal asm output (which is then fed to the assembler, as). Extended asm makes the string a template that it can substitute operands into. And the constraints describe how the asm fits into the data-flow of the program logic, as well as registers it clobbers.
Instead of parsing the string, there is syntax that you need to use to describe exactly what it does. Parsing the template for var names would only solve part of the language-design problem that operands need to solve, and would make the compiler's code more complicated. (It would have to know more about every instruction to know whether memory, register, or immediate was allowed, and stuff like that. Normally its machine-description files only need to know how to go from logical operation to asm, not the other direction.)
Your Basic asm block is broken because you modify C variables without telling the compiler about it. This could break with optimization enabled (maybe only with more complex surrounding code, but happening to work is not the same thing as actually safe. This is why merely testing GNU C inline asm code is not even close to sufficient for it to be future proof against new compilers and changes in surrounding code). There is no implicit "memory" clobber. (Basic asm is the same as Extended asm except for not doing % substitution on the string literal. So you don't need %% to get a literal % in the asm output. It's implicitly volatile like Extended asm with no outputs.)
Also note that if you were targeting i386 MacOS, you'd need _result in your asm. result only happens to work because the asm symbol name exactly matches the C variable name. Using Extended asm constraints would make it portable between GNU/Linux (no leading underscore) vs. other platforms that do use a leading _.
Your Extended asm is broken because you modify an input ("c") (without telling the compiler that register is also an output, e.g. an output operand using the same register).
It's also inefficient: if a mov is the first or last instruction of your template, you're almost always doing it wrong and should have used better constraints.
Instead, you can do:
asm ("imull %%edx, %%ecx\n\t"
: "=c"(result)
: "d"(data1), "c"(data2));
Or better, use "+r"(data2) and "r"(data1) operands to give the compiler free choice when doing register allocation instead of potentially forcing the compiler to emit unnecessary mov instructions. (See Eric's answer using named operands and "=r" and a matching "0" constraint; that's equivalent to "+r" but lets you use different C names for the input and output.)
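Sketched out, the "+r" version looks like this (the wrapper function is mine, for illustration only):

int mult(int data1, int data2)
{
    asm ("imull %1, %0"
         : "+r"(data2)    // read-write operand: holds data2 on entry, the product on exit
         : "r"(data1));
    return data2;
}

The compiler is free to pick whichever registers happen to be convenient, so no extra mov is forced.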
Look at the asm output of the compiler to see how code-gen happened around your asm statement, if you want to make sure it was efficient.
Since local vars don't have a symbol / label in the asm text (instead they live in registers or at some offset from the stack or frame pointer, i.e. automatic storage), it can't work to use symbol names for them in asm.
Even for global vars, you want the compiler to be able to optimize around your inline asm as much as possible, so you want to give the compiler the option of using a copy of a global var that's already in a register, instead of getting the value in memory in sync with a store just so your asm can reload that.
Having the compiler try to parse your asm and figure out which C local var names are inputs and outputs would have been possible. (But would be a complication.)
But if you want it to be efficient, you need to figure out when x in the asm can be a register like EAX, instead of doing something braindead like always storing x into memory before the asm statement, and then replacing x with 8(%rsp) or whatever. If you want to give the asm statement control over where inputs can be, you need constraints in some form. Doing it on a per-operand basis makes total sense, and means the inline-asm handling doesn't have to know that bts can take an immediate or register source but not memory, and other machine-specific details like that. (Remember: GCC is a portable compiler; baking a huge amount of per-machine info into the inline-asm parser would be bad.)
(MSVC forces all C vars in _asm{} blocks to be memory. It's impossible to use it to efficiently wrap a single instruction because the input has to bounce through memory, even if you wrap it in a function so you can use the officially-supported hack of leaving a value in EAX and falling off the end of a non-void function. What is the difference between 'asm', '__asm' and '__asm__'? And in practice MSVC's implementation was apparently pretty brittle and hard to maintain, so much so that they removed it for x86-64, and it was documented as not supported in functions with register args even in 32-bit mode! That's not the fault of the syntax design, though, just the actual implementation.)
Clang does support -fasm-blocks for _asm { ... } MSVC-style syntax where it parses the asm and you use C var names. It probably forces inputs and outputs into memory but I haven't checked.
Also note that GCC's inline asm syntax with constraints is designed around the same system of constraints that GCC-internals machine-description files use to describe the ISA to the compiler. (The .md files in the GCC source that tell the compiler about an instruction to add numbers that takes inputs in "r" registers, and has the text string for the mnemonic. Notice the "r" and "m" in some examples in https://gcc.gnu.org/onlinedocs/gccint/RTL-Template.html).
The design model of asm in GNU C is that it's a black-box for optimizer; you must fully describe the effects of the code (to the optimizer) using constraints. If you clobber a register, you have to tell the compiler. If you have an input operand that you want to destroy, you need to use a dummy output operand with a matching constraint, or a "+r" operand to update the corresponding C variable's value.
If you read or write memory pointed-to by a register input, you have to tell the compiler. How can I indicate that the memory *pointed* to by an inline ASM argument may be used?
If you use the stack, you have to tell the compiler (but you can't, so instead you have to avoid stepping on the red-zone :/ Using base pointer register in C++ inline asm) See also the inline-assembly tag wiki
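One way to express "this asm reads the memory the pointer points at" without a blanket "memory" clobber is a dummy memory operand; a hedged sketch (the helper function and the 4-element size are assumptions of mine):

int load_first(const int *p)
{
    int v;
    asm ("movl (%1), %0"
         : "=r"(v)
         : "r"(p),
           "m"(*(const int (*)[4])p));   // tells GCC the asm may read these 4 ints through p
    return v;
}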
GCC's design makes it possible for the compiler to give you an input in a register, and use the same register for a different output. (Use an early-clobber constraint if that's not ok; GCC's syntax is designed to efficiently wrap a single instruction that reads all its inputs before writing any of its outputs.)
If GCC could only infer all of these things from C var names appearing in asm source, I don't think that level of control would be possible. (At least not plausible.) And there'd probably be surprising effects all over the place, not to mention missed optimizations. You only ever use inline asm when you want maximum control over things, so the last thing you want is the compiler using a lot of complex opaque logic to figure out what to do.
(Inline asm is complex enough in its current design, and not used much compared to plain C, so a design that requires very complex compiler support would probably end up with a lot of compiler bugs.)
GNU C inline asm isn't designed for low-performance low-effort. If you want easy, just write in pure C or use intrinsics and let the compiler do its job. (And file missed-optimization bug reports if it makes sub-optimal code.)
This is because asm is a defined language which is common for all compilers on the same processor family. After using the __asm__ keyword, you can reliably use any good manual for the processor to then start writing useful code.
But it does not have a defined interface for C, and let's be honest, if you don't interface your assembler with your C code then why is it there?
Examples of useful very simple asm: generate a debug interrupt; set the floating point register mode (exceptions/accuracy);
Each compiler writer has invented their own mechanism to interface to C. For example in one old compiler you had to declare the variables you want to share as named registers in the C code. In GCC and clang they allow you to use their quite messy 2-step system to reference an input or output index, then associate that index with a local variable.
This mechanism is the "extension" to the asm standard.
Of course, asm is not really a standard. Change processor and your asm code is trash. When we talk in general about sticking to the C/C++ standards and not using extensions, we don't talk about asm, because you are already breaking every portability rule there is.
Then, on top of that, if you are going to call C functions, or your asm declares functions that are callable from C, then you will have to match the calling conventions of your compiler. These rules are implicit. They constrain the way you write your asm, but it will still be legal asm, by some criteria.
But if you were just writing your own asm functions, and calling them from asm, you may not be constrained so much by the C/C++ conventions: make up your own register argument rules; return values in any register you want; make stack frames, or don't; preserve the stack frame through exceptions - who cares?
Note that you might still be constrained by the platform's relocatable code conventions (these are not "C" conventions, but are often described using C syntax), but this is still one way that you can write a chunk of "portable" asm functions, then call them using "extended" embedded asm.

Inserting "marker" instructions into assembly without GCC reordering them

For purposes of doing performance analysis it is useful to be able to
tell which line of C code goes with which line of generated assembly
code. This can be very difficult once a sufficient number of
optimization passes get involved, and I devised the following scheme
to make it easier (though it has a lot of caveats). I figured I would
use in-line assembly to insert an instruction that is effectively a
nop, but that the compiler would rarely or never generate itself. Then
when I looked at the generated code I could infer that assembly code
that appears between the inserted marker instructions probably comes
from C code that lies between the in-line assembly statements.
I came up with these candidates:
// Force insertion of an instruction that will only clobber
// flags and that the compiler hardly ever uses itself. Lie and say
// that it alters memory to try to prevent the compiler from moving
// it around. Mark it volatile so the compiler can't remove it entirely.
#define ASSEMBLY_MARKER_0() \
    __asm__ volatile ("cld" : /* no outputs */ : /* no inputs */ : "memory", "cc")

#define ASSEMBLY_MARKER_1() \
    __asm__ volatile ("xorl %%eax,0" : /* no outputs */ : /* no inputs */ : "memory", "cc")
Then I decided to test whether the compiler would move instructions
across these boundaries. clang appears to do exactly what I want, but
GCC appears to not be deterred either by the memory clobbering or the
fact that this snippet is volatile. It reorders instructions anyway!
Is there any way to prevent this?
I know there are a lot of caveats to this method even if I get it to
work -- I may heavily influence generated code around the markers. But
I maintain that it would still be useful for finding things like
accidental implicit conversions between integer widths, and other
"wait that should never be necessary..." type problems.
You can see the difference between GCC and clang here: https://godbolt.org/z/ZtUPc9
C code:
int f(int x)
{
    __asm__ volatile ("xorl %%eax,0" : /* no outputs */ : /* no inputs */ : "memory", "cc");
    int j = x << 3;
    __asm__ volatile ("xorl %%eax,0" : /* no outputs */ : /* no inputs */ : "memory", "cc");
    return j;
}
GCC:
xorl %eax,0
xorl %eax,0
lea eax, [0+rdi*8]
ret
Clang:
xor dword ptr [0], eax
lea eax, [8*rdi]
xor dword ptr [0], eax
ret
Edits to answer questions in comments:
Why not nops? Because gcc inserts those itself often. The point is to stick out.
Why not move code into its own function? If you're doing this analysis on C++ template code, for example, there may be many layers of inlining that occur before producing the function that actually goes in the executable, and the code may be very different if you turn off the inlining (e.g. the code may have been written with the assumption that constant folding, dead code elimination etc. would get rid of trivial things).
Then I decided to test whether the compiler would move instructions across these boundaries. clang appears to do exactly what I want, but GCC appears to not be deterred either by the memory clobbering or the fact that this snippet is volatile. It reorders instructions anyway! Is there any way to prevent this?
Not really. Such barriers only prevent reordering of things the compiler is obliged to keep ordered with respect to them: volatile operations (like volatile accesses or asm volatile), memory accesses (because of the "memory" clobber), and, on x86 via the "cc" clobber, parts of def-use chains of the condition codes. They do nothing whatsoever to stop unrelated instructions from moving across them.
Sometimes it can be helpful to add the options -save-temps -fverbose-asm to better understand assembly code and its relation to C. New versions of GCC dump C/C++ code alongside the assembly code (dumped as *.s). When you inspect assembly (as opposed to disassembly) it's sufficient to inject asm comments to show where the inline asm is injected; there is no need to add actual instructions. Legibility of the assembly might be improved by disabling debug info (-g0).
To better understand the code, you can also disable passes that usually cause a great amount of instruction reordering, like instruction scheduling (-fno-schedule-insns, -fno-schedule-insns2), but that has a big performance impact, of course.
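A minimal sketch of the asm-comment idea (the macro name is mine): the comment survives into the .s file that -save-temps or -S produces, no instruction is generated, and codegen is barely perturbed, although, as explained above, it does not pin down unrelated instructions either.

#define ASM_MARK(tag) __asm__ volatile ("# MARK: " tag)

int f(int x)
{
    ASM_MARK("before shift");
    int j = x << 3;
    ASM_MARK("after shift");
    return j;
}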

How to tell the GCC compiler that code should be generated serial, i.e., without jumps

How to tell the GCC compiler that code should be generated serial, i.e., without jumps.
I'm working on a project that embeds inline assembly into a C source code (or LLVM IR).
My implementation depends on code between the inline assembly to be written into the executable as-is.
More formally, suppose I have the source code (C or LLVM IR):
.label_start: (inserted as inline assembly)
inline_assembly0
source_code0
source_code1
inline_assembly1
...
.label_end: (inserted as inline assembly)
...
Now, this should not be compiled as:
.jmp_target:
source_code1
inline_assembly1
...
.label_end: (inserted as inline assembly)
...
.label_start: (inserted as inline assembly)
inline_assembly0
source_code0
jmp jmp_target
I.e., the code should stay between the labels, without jumps reordering it relative to .label_start and .label_end.
Is there any way of telling GCC that everything between two inline assembly labels should stay "intact" without being reordered? My implementation depends on this.
If I understand your question, the GCC manual has a few words on this (emphasis added).
Note that the compiler can move even volatile asm instructions relative to other code, including across jump instructions. For example, on many targets there is a system register that controls the rounding mode of floating-point operations. Setting it with a volatile asm, as in the following PowerPC example, does not work reliably.
asm volatile("mtfsf 255, %0" : : "f" (fpenv));
sum = x + y;
The compiler may move the addition back before the volatile asm. To make it work as expected, add an artificial dependency to the asm by referencing a variable in the subsequent code, for example:
asm volatile("mtfsf 255,%1" : "=X" (sum) : "f" (fpenv));
sum = x + y;
Basically, you need a “dummy use” to prevent reordering.
We also use this sort of thing in Mono to extend the liveness of a reference in low-level GC code, ensuring it won’t be prematurely freed if the GC interrupts a routine:
static inline void dummy_use (void *v)
{
    __asm__ volatile ("" : "=r"(v) : "r"(v));
}
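A hedged sketch of the same trick applied to the label scenario from the question (the function and the arithmetic are mine): routing a live variable through each label asm as a "+r" operand ties the computation between the labels into their data flow, so it cannot be hoisted above .label_start or sunk below .label_end; unrelated code is still free to move.

int protected_region(int x)
{
    asm volatile (".label_start:" : "+r"(x));   // x now depends on this asm
    x = x * 3 + 1;                              // source code between the labels
    asm volatile (".label_end:"   : "+r"(x));   // and the result must flow into this one
    return x;
}

If the function can be inlined or cloned, the labels would need %= appended to stay unique.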
Try disabling optimization: gcc -O0 source.c

How to specify clobbered bottom of the x87 FPU stack with extended gcc assembly?

In a codebase of ours I found this snippet for fast, towards-negative-infinity1 rounding on x87:
inline int my_int(double x)
{
    int r;
#ifdef _GCC_
    asm ("fldl %1\n"
         "fistpl %0\n"
         : "=m"(r)
         : "m"(x));
#else
    // ...
#endif
    return r;
}
I'm not extremely familiar with GCC extended assembly syntax, but from what I gather from the documentation:
r must be a memory location, where I'm writing back the result;
x must be a memory location too, which is where the data comes from;
there's no clobber specification, so the compiler can rest assured that at the end of the snippet the registers are as it left them.
Now, to come to my question: it's true that in the end the FPU stack is balanced, but what if all the 8 locations were already in use and I'm overflowing it? How can the compiler know that it cannot trust ST(7) to be where it left it? Should some clobber be added?
Edit: I tried specifying st(7) in the clobber list and it seems to affect the codegen; now I'll wait for some confirmation of this.
As a side note: looking at the implementation of the barebones lrint both in glibc and in MinGW I see something like
__asm__ __volatile__ ("fistpl %0"
                      : "=m" (retval)
                      : "t" (x)
                      : "st");
where we are asking for the input to be placed directly in ST(0) (which avoids that potentially useless fldl); what is that "st" clobber? The docs seem to mention only t (i.e. the top of the stack).
1. Yes, it depends on the current rounding mode, which in our application should always be "towards negative infinity".
looking at the implementation of the barebones lrint both in glibc and in MinGW I see something like
__asm__ __volatile__ ("fistpl %0"
                      : "=m" (retval)
                      : "t" (x)
                      : "st");
where we are asking for the input to be placed directly in ST(0) (which avoids that potentially useless fldl)
This is actually the correct way to represent the code you want as inline assembly.
To get the most optimal possible code generated, you want to make use of the inputs and outputs. Rather than hard-coding the necessary load/store instructions, let the compiler generate them. Not only does this introduce the possibility of eliding potentially unnecessary instructions, it also means that the compiler can better schedule these instructions when they are required (that is, it can interleave the instruction within a prior sequence of code, often minimizing its cost).
what is that "st" clobber? The docs seems to mention only t (i.e. the top of the stack).
The "st" clobber refers to the st(0) register, i.e., the top of the x87 FPU stack. What Intel/MASM notation calls st(0), AT&T/GAS notation generally refers to as simply st. And, as per GCC's documentation for clobbers, the items in the clobber list are "either register names or the special clobbers" ("cc" (condition codes/flags) and "memory"). So this just means that the inline assembly clobbers (overwrites) the st(0) register. The reason why this clobber is necessary is that the fistpl instruction pops the top of the stack, thus clobbering the original contents of st(0).
The only thing that concerns me regarding this code is the following paragraph from the documentation:
Clobber descriptions may not in any way overlap with an input or output operand. For example, you may not have an operand describing a register class with one member when listing that register in the clobber list. Variables declared to live in specific registers (see Explicit Register Variables) and used as asm input or output operands must have no part mentioned in the clobber description. In particular, there is no way to specify that input operands get modified without also specifying them as output operands.
When the compiler selects which registers to use to represent input and output operands, it does not use any of the clobbered registers. As a result, clobbered registers are available for any use in the assembler code.
As you already know, the t constraint means the top of the x87 FPU stack. The problem is, this is the same as the st register, and the documentation very clearly said that we could not have a clobber that specifies the same register as one of the input/output operands. Furthermore, since the documentation states that the compiler is forbidden to use any of the clobbered registers to represent input/output operands, this inline assembly makes an impossible request—load this value at the top of the x87 FPU stack without putting it in st!
Now, I would assume that the authors of glibc know what they are doing and are more familiar with the compiler's implementation of inline assembly than you or I, so this code is probably legal and legitimate.
Actually, it seems that the unusual case of the x87's stack-like registers forces an exception to the normal interactions between clobbers and operands. The official documentation says:
On x86 targets, there are several rules on the usage of stack-like registers in the operands of an asm. These rules apply only to the operands that are stack-like registers:
Given a set of input registers that die in an asm, it is necessary to know which are implicitly popped by the asm, and which must be explicitly popped by GCC.
An input register that is implicitly popped by the asm must be explicitly clobbered, unless it is constrained to match an output operand.
That fits our case exactly.
Further confirmation is provided by an example appearing in the official documentation (bottom of the linked section):
This asm takes two inputs, which are popped by the fyl2xp1 opcode, and replaces them with one output. The st(1) clobber is necessary for the compiler to know that fyl2xp1 pops both inputs.
asm ("fyl2xp1" : "=t" (result) : "0" (x), "u" (y) : "st(1)");
Here, the clobber st(1) is the same as the input constraint u, which seems to violate the above-quoted documentation regarding clobbers, but is used and justified for precisely the same reason that "st" is used as the clobber in your original code, because fistpl pops the input.
All of that said, and now that you know how to correctly write the code in inline assembly, I have to echo previous commenters who suggested that the best solution would be not to use inline assembly at all. Just call lrint, which not only has the exact semantics that you want, but can also be better optimized by the compiler under certain circumstances (e.g., transforming it into a single cvtsd2si instruction when the target architecture supports SSE).
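For reference, a sketch of that suggestion (assuming the round-toward-negative-infinity mode is established once at startup, which the original code already relies on; the init function is mine):

#include <fenv.h>
#include <math.h>

void init_rounding(void)
{
    fesetround(FE_DOWNWARD);   // round toward negative infinity, as the original code assumes
}

long my_int(double x)
{
    return lrint(x);           // rounds according to the current rounding mode
}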

Resources