How to ignore output in inline assembly? - c

In inline assembly, the first : introduces the outputs and the second the inputs. What if I don't want to use the output? Can I leave it empty like this:
asm ("add $0, %rcx"
:
:"m"(Example) /* input */
);
Also, if I want to use only the output, can I delete the other :?

You can leave the output part empty and omit colons if all parts after the colons are empty.
Quote from Extended Asm (Using the GNU Compiler Collection (GCC)):
asm asm-qualifiers ( AssemblerTemplate
: OutputOperands
[ : InputOperands
[ : Clobbers ] ])
If there are no output operands but there are input operands, place two consecutive colons where the output operands would go:
__asm__ ("some instructions"
: /* No outputs. */
: "r" (Offset / 8));
Note that you normally need to tell the compiler about a register you modify, either with an output operand or a clobber. You can put multiple colons on the same line, for example
asm("" ::: "memory") is a common way to write a compiler barrier.

GCC determines which operands are output vs. input from the colons, not inferred from lack of "=" or "+". Yes this is redundant, but no you can't do anything about it.
You can put multiple colons on the same line, for example
asm("" ::: "memory") is a common way to write a compiler barrier. So it's not painful to include the necessary colons. You could write asm("..." :: "m"(input) ); if that would actually be safe without any outputs or clobbers.
But you normally need to tell the compiler about a register you modify, either with an output operand or a clobber, so you often need all three colons. Or if you don't write any registers, usually that's because you're wrapping some "system" instruction that has some kind of effect that you generally don't want the compiler to reorder memory accesses around. Like invlpg. Or for example:
asm("clflush %0" ::"m"(*ptr) : "memory");
should have a memory clobber to order the cache-line flush after any earlier stores (which might have been to the same line).
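As a minimal sketch of the colon rules above (the helper names are hypothetical, and GCC or clang is assumed), both statements below use an empty template, so they should compile for any target. `__asm__` is used instead of `asm` only so the sketch also builds in strict ISO C mode.

```c
/* Compiler barrier: no outputs and no inputs, only a clobber, so all three
   colons are needed to reach the clobber list. */
static inline void compiler_barrier(void)
{
    __asm__ volatile ("" ::: "memory");
}

/* Inputs but no outputs: the output section is simply left empty.
   The "memory" clobber tells the compiler the asm may read the pointed-to
   data, so earlier stores to it must be completed first. */
static inline void publish(const void *p)
{
    __asm__ volatile ("" : /* no outputs */ : "r" (p) : "memory");
}
```

Neither helper emits an instruction; they only constrain what the compiler may reorder or keep cached in registers.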
Your code has some syntax bugs, as well as correctness / UB problems, at least if you meant to add the memory operand, not a constant 0.
In GNU C Extended Asm, the template string is very much like a printf format string for the compiler to substitute in operands where you use %something. (And yes, it's just a dumb text substitution to create text to feed to the assembler, as. GCC doesn't "understand" your asm, that's why you have to describe it accurately to the compiler using input/output/clobber constraints).
When you want a literal %, like in register names in AT&T syntax such as %rcx, you have to actually write %%rcx.
(Normally it's best to avoid hard-coding registers; use dummy output operands to let the compiler pick which registers to use. You can even name them, like %[input])
I assume you meant to add the memory source operand to RCX. That would be %0.
$0 is an immediate 0, i.e. RCX += 0, so the instruction only actually modifies FLAGS.
Assuming you meant add %0, %%rcx, your code writes a register (RCX). You must tell the compiler about registers / memory you modify. Otherwise it might have a C variable in RCX, and expect to read its value after the asm statement. So you need either a (dummy) output or a clobber anyway.
(If you did actually mean "add $0, %%rcx", then the only architectural effect is to set FLAGS. Inline asm for i386 / amd64 already implicitly has a "cc" clobber so we don't have to tell the compiler about that side-effect.)
Your options for "add %0, %%rcx" to be safe include:
Use a clobber:
asm ("add %0, %%rcx" // %0 expands the first operand, $0 was an immediate
:
: "Irm"(Example) /* input, also allow reg or 32-bit immediate */
: "rcx"
);
Use a dummy output operand (and make it volatile, like it was implicitly when you had no output operand). Note that we get to omit the : clobbers part of the asm statement.
uint64_t dummy;
asm volatile ("add %0, %%rcx"
: "=c"(dummy) // "c" forces picking cl/cx/ecx/rcx based on size
: "Irm"(Example)
);
// without volatile, the asm statement can be optimized away if you don't read dummy later
Note that push %%rcx ; pop %%rcx around the add would not be safe: it modifies RSP which might affect the addressing mode the compiler picked for "m"(Example), and it steps on the red-zone below RSP.
See https://stackoverflow.com/tags/inline-assembly/info for more.
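The advice above about not hard-coding registers could look roughly like the sketch below. It assumes the goal is "add a value from memory to an accumulator"; the function name add_from_memory and the operand names acc and src are made up for illustration. On targets other than x86-64 it falls back to plain C so that it still compiles.

```c
#include <stdint.h>

/* Returns acc + *src. The compiler picks the register for acc ("+r" marks it
   as both read and written), so there is no hard-coded RCX and no register
   clobber to declare. */
static inline uint64_t add_from_memory(uint64_t acc, const uint64_t *src)
{
#if defined(__x86_64__)
    __asm__ ("add %[src], %[acc]"
             : [acc] "+r" (acc)     /* read/write operand, any register */
             : [src] "m" (*src)     /* memory source operand */
             : "cc");               /* implicit on x86, listed for clarity */
#else
    acc += *src;                    /* plain C fallback for other targets */
#endif
    return acc;
}
```

Because the result is a real output operand that the caller uses, no volatile is needed here: if the result is unused, the compiler is free to drop the statement.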

Related

x86-64 Zero Flag is clearing between inline calls (and another problem)

I am using the bsf x86-64 instruction described on page 210 of Intel's developer's manual. Essentially, if a least significant 1 bit is found, its bit index is stored in the destination operand.
Furthermore, the ZF flag is set to 1 if the source operand is 0; otherwise, the ZF flag is cleared.
I am compiling my C code with inline x86-64 assembly instructions. I have defined a C function which invokes the bsf instruction:
uint64_t bitScanForward(T_bitboard b) {
__asm__(
"bsf %rcx,%rax\n"
"leave\n"
"ret\n"
);
}
and also another C function which checks if the status of the ZF bit in the flag register:
uint64_t isZFSet() {
printf("\n"); // <- This is another problem I am having (see below)...
__asm__(
"jz true\n"
"movq $0,%rax\n"//return false
"jmp end\n"
"true:\n"
"movq $1,%rax\n"//return true
"end:\n"
"leave\n"
"ret\n"
);
}
I have tested these and found that the ZF flag is always cleared even when the bsf command is applied to the number zero, seemingly going against the specification.
//Calling function...
//Do stuff...
bitScanForward(0ULL);//ULL is 64 bit on my machine
if(isZFSet()){//ZF flag *should* be set here but its not
printf("ZF flag is set\n");
}
//More stuff...
I suspect the reason the ZF flag is clearing is due to entering and leaving one set of inline instructions to another.
How can I ensure that the flag in the above code is set as specified in the documentation? (I don't want to change much of my code or design)
My "other problem" is that if I don't include the printf statement in isZFSet, the function seemingly doesn't execute. Totally bizarre. Can anyone explain why?
You are treating an aggressively optimizing C compiler as if it were a macro assembler. That just plain isn't going to work. To get GCC to emit correct code in the presence of assembly inserts, you have to annotate the inserts with complete information about the registers and memory regions that are affected by the assembly code, and you have to use ancillary C statements to mesh them with the surrounding code. Even then, there are things the assembly insert cannot do at all. I urge you to scrap this entire mess and instead use the __builtin_ctzll intrinsic, as suggested in the comments on the question.
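A minimal sketch of the intrinsic-based approach, assuming GCC or clang; the function name count_trailing_zeros64 and the choice of 64 as the result for a zero input are illustrative, not taken from the question:

```c
#include <stdint.h>

/* Index of the lowest set bit, or 64 when b is zero.
   __builtin_ctzll(0) is undefined, so zero is handled explicitly. */
static inline unsigned count_trailing_zeros64(uint64_t b)
{
    return b ? (unsigned)__builtin_ctzll(b) : 64u;
}
```

With this there is no flag to carry between functions: the zero case is an ordinary C condition that the compiler can see and optimize.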
Now, to specifics. Your first function is incorrect because GCC does not support use of leave or ret inside an assembly insert. (More generally, assembly inserts may not alter the stack pointer, and may only jump to designated labels within the same function.) The correct way to use bsf from a GCC-style assembly insert is with "extended asm" with input and output operands:
uint64_t bitScanForward(uint64_t b) {
uint64_t ret;
asm ("bsf %1, %0" : "=r" (ret) : "r" (b));
return ret;
}
You must declare a C variable to receive the output of the operation, and explicitly return that variable; having bsf write to %rax would not work (unlike how it was in old MSVC). BSF accepts any two registers as operands, so there is no need to use constraints more specific than r.
Your second function is incorrect because you didn't tell GCC that the condition codes were meaningful after bitScanForward, and because GCC does not support using the condition-code register as an input to an assembly insert. In order to read the ZF output from bsf you must do so within the same assembly insert that invoked bsf:
uint64_t countTrailingZeroes(uint64_t b) {
uint64_t ret;
asm ("bsf %1, %0\n\t"
"cmove %2, %0"
: "=&r" (ret)
: "r" (b), "rm" ((uint64_t)64)); // 64-bit operand so cmove's operand sizes match
return ret;
}
This requires special care -- see how the constraint on operand 0 is now =&r instead of just =r? Without that, GCC is liable to think it can put operand 2 in the same register as operand 0.
Alternatively, you can specify that ZF is an output, which is supported (see the "flag output operands" section of the manual) and then supply a default value from C:
uint64_t countTrailingZeroes(uint64_t b) {
uint64_t ret;
int zf;
asm ("bsf %2, %0"
: "=r" (ret), "=@ccz" (zf) : "r" (b));
if (zf) ret = 64;
return ret;
}

Return values and constraints for asm (assembler code) using gnu Extended Asm

As a (beginners) exercise I want to implement a swap operation in C using GNU asm.
I'm confused about the constraints. My code is:
int c = 1;
int b = 2;
asm ("xchg %2,%3"
: "=r" (c), "=r" (b)
: "<X>" (c), "<Y>" (b)
);
return c+10*b;
Where <X> and <Y> are replaced by ("0", "0"), ("0", "r"), ("r", "r") and ("r", "0").
("0", "0") does not compile. (Why?)
("0", "r") returns 12, the expected result.
("r", "r") and ("r", "0") return 21, as if nothing happened. (Why?)
So what is wrong in those cases? Why does it fail?
You can use : "+r" (c), "+r" (b) to declare read/write operands (see footnote 1 and https://gcc.gnu.org/onlinedocs/gcc/Modifiers.html). In/out operands go in the first (output) group of constraints.
And of course you should use xchg %0, %1 to make sure the asm template only references operands you actually declared.
asm ("xchg %0,%1"
: "+r" (c), "+r" (b)
);
Footnote 1: Fun fact: GCC internally implements read/write operands by inventing matching input constraints for these outputs. So if you didn't fix the %2,%3 in the template, it would still compile. But you wouldn't have any guarantee of which order GCC chose to do the matching constraints in.
To find out what happened with your wrong constraints, make your asm template print the unused operand names as an asm comment, e.g. xchg %2,%3 # inputs= %0,%1 so you can look at the compiler's asm output (gcc -S or on https://godbolt.org), and see which registers it picked. If it picked opposite constraints you'd see no swap.
Or with "r", if it picked 2 registers that are different from the output operands, then you're stepping on the compiler's toes by modifying registers it expected to remain unchanged (because you told it those were inputs, and could be in separate registers from the outputs). So no, "r" can't be safe.
If you did want to use matching constraints manually, you'd of course use "0"(c), "1"(b) to force it to pick the same register for c as an input that it picked for c as the output #0, and same for b as input as output #1.
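Written out as a complete function, the manual matching-constraint version might look like this sketch. The wrapper name swap_and_combine is hypothetical, and a plain C swap stands in on non-x86 targets so the code still builds there.

```c
/* Swaps c and b with xchg, then combines them the same way the question
   does, so the effect of the swap is visible in the return value. */
static inline int swap_and_combine(int c, int b)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ ("xchg %0, %1"
             : "=r" (c), "=r" (b)   /* outputs #0 and #1 */
             : "0" (c), "1" (b));   /* each input shares its output's register */
#else
    int t = c; c = b; b = t;        /* plain C fallback for other targets */
#endif
    return c + 10 * b;
}
```

Calling swap_and_combine(1, 2) gives 12, the swapped result the question expected.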
Of course there's no need for a runtime instruction at all to just tell the compiler that C variables have each other's values.
asm ("nop # c input in %2, output in %0. b input in %3, output in %1"
: "=r" (c), "=r" (b)
: "1"(c), "0" (b) // opposite matching constraints
);
Even the nop is unnecessary, but https://godbolt.org/ by default filters comments so it's convenient to put a comment on a NOP.

How to understand this GNU C inline assembly macro for PowerPC stwbrx

This is basically to perform a swap on the buffers while transferring a message buffer. This statement left me puzzled (because of my unfamiliarity with embedded assembly code in C). This is a PowerPC instruction:
#define ASMSWAP32(dest_addr,data) __asm__ volatile ("stwbrx %0, 0, %1" : : "r" (data), "r" (dest_addr))
Besides being unsafe because of a bug, this macro is also less efficient than what the compiler will generate for you.
stwbrx = store word byte-reversed. The x stands for indexed.
You don't need inline asm for this in GNU C, where you can use __builtin_bswap32 and let the compiler emit this instruction for you.
void swapstore_asm(int a, int *p) {
ASMSWAP32(p, a);
}
void swapstore_c(int a, int *p) {
*p = __builtin_bswap32(a);
}
Compiled with gcc4.8.5 -O3 -mregnames, we get identical code from both functions (Godbolt compiler explorer):
swapstore_asm:
stwbrx %r3, 0, %r4
blr
swapstore_c:
stwbrx %r3,0,%r4
blr
But with a more complicated address (storing to p[off], where off is an integer function arg), the compiler knows how to use both register inputs, while your macro forces the compiler to have the address in a single register:
void swapstore_offset(int a, int *p, int off) {
p[off] = __builtin_bswap32(a);
}
swapstore_offset:
slwi %r5,%r5,2 # *4 = sizeof(int)
stwbrx %r3,%r4,%r5 # use an indexed addressing mode, with both registers non-zero
blr
swapstore_offset_asm:
slwi %r5,%r5,2
add %r4,%r4,%r5 # extra instruction forced by using the macro
stwbrx %r3, 0, %r4
blr
BTW, if you're having trouble understanding GNU C inline asm templates, looking at the compiler's asm output can be a useful way to see what gets substituted in. See How to remove "noise" from GCC/clang assembly output? for more about reading compiler asm output.
Also note that this macro is buggy: it's missing a "memory" clobber for the store. And yes, you still need that with asm volatile. The compiler doesn't assume that *dest_addr is modified unless you tell it, so it could hoist a non-volatile load of *dest_addr ahead of this insn, or more likely to be a real problem, sink a store after it. (e.g. if you zeroed a buffer before storing to it with this, the compiler might actually zero after this instruction.)
Instead of a "memory" clobber (and also leaving out volatile), you could tell the compiler which memory location you modify with a "=m" (*dest_addr) operand, either as a dummy operand or with a constraint on the addressing mode so you could use it as reg+reg. (IDK PPC well enough to know what "=m" usually expands to.)
In most cases this bug won't bite you, but it's still a bug. Upgrading your compiler version or using link-time optimization could maybe make your program buggy with no source-level changes.
This kind of thing is why https://gcc.gnu.org/wiki/DontUseInlineAsm
See also https://stackoverflow.com/tags/inline-assembly/info.
#define ASMSWAP32(dest_addr,data) ...
This part should be clear
__asm__ volatile ( ... : : "r" (data), "r" (dest_addr))
This is the actual inline assembly:
Two values are passed to the assembly code; no value is returned from the assembly code (this is what the two colons with nothing between them after the actual assembly code mean).
Both parameters are passed in registers ("r"). The expression %0 will be replaced by the register that contains the value of data while the expression %1 will be replaced by the register that contains the value of dest_addr (which will be a pointer in this case).
The volatile here means that the assembly code has to be executed at this point and cannot be moved to somewhere else.
So if you use the following code in the C source:
ASMSWAP32(&a, b);
... the following assembler code will be generated:
# write the address of a to register 5 (for example)
...
# write the value of b to register 6
...
stwbrx 6, 0, 5
So the first argument of the stwbrx instruction is the value of b and the last argument is the address of a.
stwbrx x, 0, y
This instruction writes the value in register x to the address stored in register y; however it writes the value in "reverse endian" (on a big-endian CPU it writes the value "little endian").
The following code:
uint32 a;
ASMSWAP32(&a, 0x12345678);
... should therefore result in a = 0x78563412.
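The same result can be checked on any target without PowerPC hardware by using the builtin mentioned earlier. This is only a sketch: store_byte_reversed is a hypothetical stand-in for the macro, and GCC or clang is assumed for __builtin_bswap32.

```c
#include <stdint.h>

/* Portable stand-in for ASMSWAP32: stores data to *dest with its four
   bytes reversed. On PowerPC the compiler may emit stwbrx for this. */
static inline void store_byte_reversed(uint32_t *dest, uint32_t data)
{
    *dest = __builtin_bswap32(data);
}
```

Because this is an ordinary C store, the compiler knows *dest is modified, so no "memory" clobber or volatile is involved.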

GCC inline - push address, not its value to stack

I'm experimenting with GCC's inline assembler (I use MinGW, my OS is Win7).
Right now I'm only getting some basic C stdlib functions to work. I'm generally familiar with the Intel syntax, but new to AT&T.
The following code works nice:
char localmsg[] = "my local message";
asm("leal %0, %%eax" : "=m" (localmsg));
asm("push %eax");
asm("call %0" : : "m" (puts));
asm("add $4,%esp");
That LEA seems redundant, however, as I can just push the value straight onto the stack. Well, due to what I believe is an AT&T peculiarity, doing this:
asm("push %0" : "=m" (localmsg));
will generate the following assembly code in the final executable:
PUSH DWORD PTR SS:[ESP+1F]
So instead of pushing the address to my string, its contents were pushed because the "pointer" was "dereferenced", in C terms. This obviously leads to a crash.
I believe this is just GAS's normal behavior, but I was unable to find any information on how to overcome this. I'd appreciate any help.
P.S. I know this is a trivial question to those who are experienced in the matter. I expect to be downvoted, but I've just spent 45 minutes looking for a solution and found nothing.
P.P.S. I realize the proper way to do this would be to call puts( ) in the C code. This is for purely educational/experimental reasons.
While inline asm is always a bit tricky, calling functions from it is particularly challenging. Not something I would suggest for a "getting to know inline asm" project. If you haven't already, I suggest looking through the very latest inline asm docs. A lot of work has been done to try to explain how inline asm works.
That said, here are some thoughts:
1) Using multiple asm statements like this is a bad idea. As the docs say: Do not expect a sequence of asm statements to remain perfectly consecutive after compilation. If certain instructions need to remain consecutive in the output, put them in a single multi-instruction asm statement.
2) Directly modifying registers (like you are doing with eax) without letting gcc know you are doing so is also a bad idea. You should either use register constraints (so gcc can pick its own registers) or clobbers to let gcc know you are stomping on them.
3) When a function (like puts) is called, while some registers must have their values restored before returning, some registers can be treated as scratch registers by the called function (ie modified and not restored before returning). As I mentioned in #2, having your asm modify registers without informing gcc is a very bad idea. If you know the ABI for the function you are calling, you can add its scratch registers to the asm's clobber list.
4) While in this specific example you are using a constant string, as a general rule, when passing asm pointers to strings, structs, arrays, etc, you are likely to need the "memory" clobber to ensure that any pending writes to memory are performed before starting to execute your asm.
5) Actually, the lea is doing something very important. The value of esp is not known at compile time, so it's not like you can perform push $12345. Someone needs to compute (esp + the offset of localmsg) before it can be pushed on the stack. Also, see second example below.
6) If you prefer intel format (and what right-thinking person wouldn't?), you can use -masm=intel.
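Points 1 and 2 can be illustrated with a small sketch that does not involve a function call. The function triple is a made-up example, not part of the question: it keeps several instructions in one asm statement and lets gcc choose every register. A plain C fallback is used on non-x86 targets.

```c
#include <stdint.h>

/* Computes 3 * x with three instructions that must stay together,
   so they live in a single asm statement. */
static inline uint32_t triple(uint32_t x)
{
    uint32_t out;
#if defined(__x86_64__) || defined(__i386__)
    __asm__ ("mov %1, %0\n\t"   /* out = x  */
             "add %1, %0\n\t"   /* out = 2x */
             "add %1, %0"       /* out = 3x */
             : "=&r" (out)      /* early-clobber: written while x is still needed */
             : "r" (x)          /* gcc picks the input register */
             : "cc");           /* add modifies the flags */
#else
    out = 3u * x;               /* plain C fallback for other targets */
#endif
    return out;
}
```

The "=&r" early-clobber matters here: without it gcc could assign out and x to the same register, and the first mov would then overwrite nothing but the later adds would double the wrong value.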
Given all this, my first cut at this code looks like this. Note that this does NOT clobber puts' scratch registers. That's left as an exercise...
#include <stdio.h>
int main()
{
const char localmsg[] = "my local message";
int result;
/* Use 'volatile' since 'result' is usually not going to get used,
which might tempt gcc to discard this asm statement as unneeded. */
asm volatile ("push %[msg] \n\t" /* Push the address of the string. */
"call %[puts] \n \t" /* Call the print function. */
"add $4,%%esp" /* Clean up the stack. */
: "=a" (result) /* The result code from puts. */
: [puts] "m" (puts), [msg] "r" (localmsg)
: "memory", "esp");
printf("%d\n", result);
}
True this doesn't avoid the lea due to #5. However, if that's really important, try this:
#include <stdio.h>
const char localmsg[] = "my local message";
int main()
{
int result;
/* Use 'volatile' since 'result' is usually not going to get used. */
asm volatile ("push %[msg] \n\t" /* Push the address of the string. */
"call %[puts] \n \t" /* Call the print function. */
"add $4,%%esp" /* Clean up the stack. */
: "=a" (result) /* The result code. */
: [puts] "m" (puts), [msg] "i" (localmsg)
: "memory", "esp");
printf("%d\n", result);
}
As a global, the address of localmsg is now knowable at compile time (ok, I'm simplifying a bit), so the asm produced looks like this:
push $__ZL8localmsg
call _puts
add $4,%esp
Tada.

Why does "memory" clobber need to be specified when using array element as an output memory constraint in GCC's inline asm?

I'm trying to do some asm-level (MIPS with some DSP extensions) optimizations of an audio codec. There is some DSP processing involved, after which the result needs to be stored back into an array. Here's the code I thought would do it:
asm(
eDSP_MFLO(8, 1) // move the accumulated result to $8
"sw $8, %0\n" // result => array
: "=m"(s[i])
:: "$8"
);
The problem is this code works or not (I get junk in the array when it doesn't), depending on its surrounding code, unless I add "memory" to the clobber list:
asm(
eDSP_MFLO(8, 1) // move the accumulated result to $8
"sw $8, %0\n" // result => array element s[i]
: "=m"(s[i])
:: "$8", "memory"
);
I'm having a hard time understanding why it is necessary. I wouldn't question it if I calculated the offset into the array in the asm block myself so that the compiler didn't know which memory addresses have been changed, but since GCC is doing these steps by itself, why does it require the extra "memory" clobber?
Take a look at the generated assembly code for the whole function both with and without the "memory" clobber (use gcc -S ...). It looks like gcc has a copy of s[i] in a register that was loaded from memory before the asm() statement, and it doesn't realize that that register contains out-of-date information after the asm() statement unless you add the "memory" clobber.

Resources