x86-64 Assembly "cmovge" to C code

x86-64 Assembly "cmovge" to C code - c

While I shouldn't list out the entire 4 line sample I'm given, (since this is a homework question) I'm confused how this should be read and translated into C.
cmovge %edi, %eax
What I understand so far is that the instruction is a conditional move for when the result is >=. It's comparing the first parameter of a function %edi to the integer register %eax (which was assigned the other parameter value %esi in the previous line of assembly code). However, I don't understand its result.
My problem is interpreting the optimized code. It doesn't manipulate the stack, and I'm not sure how to write this in C (or at least the gcc switch I could even use to generate the same result when compiling).
Could someone please give a few small examples of how the cmovge instruction might translate into C code? If it doesn't make sense as its own line of code, feel free to make something up with it.
This is in x86-64 assembly through a virtualized Linux operating system (CentOS 7).

I'm probably giving you the whole solution here:
int
doit(int a, int b) {
return a >= b ? a : b;
}
With gcc -O3 -masm=intel becomes:
doit:
.LFB0:
.cfi_startproc
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret
.cfi_endproc

Related

Why can't I get the value of asm registers in C?

I'm trying to get the values of the assembly registers rdi, rsi, rdx, rcx, r8, but I'm getting the wrong value, so I don't know if what I'm doing is taking those values or telling the compiler to write on these registers, and if that's the case how could I achieve what I'm trying to do (Put the value of assembly registers in C variables)?
When this code compiles (with gcc -S test.c)
#include <stdio.h>
void beautiful_function(int a, int b, int c, int d, int e) {
register long rdi asm("rdi");
register long rsi asm("rsi");
register long rdx asm("rdx");
register long rcx asm("rcx");
register long r8 asm("r8");
const long save_rdi = rdi;
const long save_rsi = rsi;
const long save_rdx = rdx;
const long save_rcx = rcx;
const long save_r8 = r8;
printf("%ld\n%ld\n%ld\n%ld\n%ld\n", save_rdi, save_rsi, save_rdx, save_rcx, save_r8);
}
int main(void) {
beautiful_function(1, 2, 3, 4, 5);
}
it outputs the following assembly code (before the function call):
movl $1, %edi
movl $2, %esi
movl $3, %edx
movl $4, %ecx
movl $5, %r8d
callq _beautiful_function
When I compile and execute it outputs this:
0
0
4294967296
140732705630496
140732705630520
(some undefined values)
What did I do wrong ? and how could I do this?

Your code didn't work because Specifying Registers for Local Variables explicitly tells you not to do what you did:
The only supported use for this feature is to specify registers for input and output operands when calling Extended asm (see Extended Asm).
Other than when invoking the Extended asm, the contents of the specified register are not guaranteed. For this reason, the following uses are explicitly not supported. If they appear to work, it is only happenstance, and may stop working as intended due to (seemingly) unrelated changes in surrounding code, or even minor changes in the optimization of a future version of gcc:
Passing parameters to or from Basic asm
Passing parameters to or from Extended asm without using input or output operands.
Passing parameters to or from routines written in assembler (or other languages) using non-standard calling conventions.
To put the value of registers in variables, you can use Extended asm, like this:
long rdi, rsi, rdx, rcx;
register long r8 asm("r8");
asm("" : "=D"(rdi), "=S"(rsi), "=d"(rdx), "=c"(rcx), "=r"(r8));
But note that even this might not do what you want: the compiler is within its rights to copy the function's parameters elsewhere and reuse the registers for something different before your Extended asm runs, or even to not pass the parameters at all if you never read them through the normal C variables. (And indeed, even what I posted doesn't work when optimizations are enabled.) You should strongly consider just writing your whole function in assembly instead of inline assembly inside of a C function if you want to do what you're doing.

Even if you had a valid way of doing this (which this isn't), it probably only makes sense at the top of a function which isn't inlined. So you'd probably need __attribute__((noinline, noclone)). (noclone is a GCC attribute that clang will warn about not recognizing; it means not to make an alternate version of the function with fewer actual args, to be called in the case where some of them are known constants that can get propagated into the clone.)
register ... asm local vars aren't guaranteed to do anything except when used as operands to Extended Asm statements. GCC does sometimes still read the named register if you leave it uninitialized, but clang doesn't. (And it looks like you're on a Mac, where the gcc command is actually clang, because so many build scripts use gcc instead of cc.)
So even without optimization, the stand-alone non-inlined version of your beautiful_function is just reading uninitialized stack space when it reads your rdi C variable in const long save_rdi = rdi;. (GCC does happen to do what you wanted here, even at -Os - optimizes but chooses not to inline your function. See clang and GCC (targeting Linux) on Godbolt, with asm + program output.).
Using an asm statement to make register asm do something
(This does what you say you want (reading registers), but because of other optimizations, still doesn't produce 1 2 3 4 5 with clang when the caller can see the definition. Only with actual GCC. There might be a clang option to disable some relevant IPA / IPO optimization, but I didn't find one.)
You can use an asm volatile() statement with an empty template string to tell the compiler that the values in those registers are now the values of those C variables. (The register ... asm declarations force it to pick the right register for the right variable)
#include <stdlib.h>
#include <stdio.h>
__attribute__((noinline,noclone))
void beautiful_function(int a, int b, int c, int d, int e) {
register long rdi asm("rdi");
register long rsi asm("rsi");
register long rdx asm("rdx");
register long rcx asm("rcx");
register long r8 asm("r8");
// "activate" the register-asm locals:
// associate register values with C vars here, at this point
asm volatile("nop # asm statement here" // can be empty, nop is just because Godbolt filters asm comments
: "=r"(rdi), "=r"(rsi), "=r"(rdx), "=r"(rcx), "=r"(r8) );
const long save_rdi = rdi;
const long save_rsi = rsi;
const long save_rdx = rdx;
const long save_rcx = rcx;
const long save_r8 = r8;
printf("%ld\n%ld\n%ld\n%ld\n%ld\n", save_rdi, save_rsi, save_rdx, save_rcx, save_r8);
}
int main(void) {
beautiful_function(1, 2, 3, 4, 5);
}
This makes asm in your beautiful_function that does capture the incoming values of your registers. (It doesn't inline, and the compiler happens not to have used any instructions before the asm statement that steps on any of those registers. The latter is not guaranteed in general.)
On Godbolt with clang -O3 and gcc -O3
gcc -O3 does actually work, printing what you expect. clang still prints garbage, because the caller sees that the args are unused, and decides not to set those registers. (If you'd hidden the definition from the caller, e.g. in another file without LTO, that wouldn't happen.)
(With GCC, noninline,noclone attributes are enough to disable this inter-procedural optimization, but not with clang. Not even compiling with -fPIC makes that possible. I guess the idea is that symbol-interposition to provide an alternate definition of beautiful_function that does use its args would violate the one definition rule in C. So if clang can see a definition for a function, it assumes that's how the function works, even if it isn't allowed to actually inline it.)
With clang:
main:
pushq %rax # align the stack
# arg-passing optimized away
callq beautiful_function#PLT
# indirect through the PLT because I compiled for Linux with -fPIC,
# and the function isn't "static"
xorl %eax, %eax
popq %rcx
retq
But the actual definition for beautiful_function does exactly what you want:
# clang -O3
beautiful_function:
pushq %r14
pushq %rbx
nop # asm statement here
movq %rdi, %r9 # copying all 5 register outputs to different regs
movq %rsi, %r10
movq %rdx, %r11
movq %rcx, %rbx
movq %r8, %r14
leaq .L.str(%rip), %rdi
xorl %eax, %eax
movq %r9, %rsi # then copying them to printf args
movq %r10, %rdx
movq %r11, %rcx
movq %rbx, %r8
movq %r14, %r9
popq %rbx
popq %r14
jmp printf#PLT # TAILCALL
GCC wastes fewer instructions, just for example starting with movq %r8, %r9 to move your r8 C var as the 6th arg to printf. Then movq %rcx, %r8 to set up the 5th arg, overwriting one of the output registers before it's read all of them. Something clang was over-cautious about. However, clang does still push/pop %r12 around the asm statement; I don't understand why. It ends by tailcalling printf, so it wasn't for alignment.
Related:
How to specify a specific register to assign the result of a C expression in inline assembly in GCC? - the opposite problem: materialize a C variable value in a specific register at a certain point.
Reading a register value into a C variable - the previous canonical Q&A which uses the now-unsupported register ... asm("regname") method like you were trying to. Or with a register-asm global variable, which hurts efficiency of all your code by leaving it otherwise untouched.
I forgot I'd answered that Q&A, making basically the same points as this. And some other points, e.g. that this doesn't work on registers like the stack pointer.

How do I translate an optimized x86-64 asm loop back to a C for loop?

I have the following:
foo:
movl $0, %eax //result = 0
cmpq %rsi, %rdi // rdi = x, rsi = y?
jle .L2
.L3:
addq %rdi, %rax //result = result + i?
subq $1, %rdi //decrement?
cmp %rdi, rsi
jl .L3
.L2
rep
ret
And I'm trying to translate it to:
long foo(long x, long y)
{
long i, result = 0;
for (i= ; ; ){
//??
}
return result;
}
I don't know what cmpq %rsi, %rdi mean.
Why isn't there another &eax for long i?
I would love some help in figuring this out. I don't know what I'm missing - I been going through my notes, textbook, and rest of the internet and I am stuck. It's a review question, and I've been at it for hours.

Assuming this is a function taking 2 parameters. Assuming this is using the gcc amd64 calling convention, it will pass the two parameters in rdi and rsi. In your C function you call these x and y.
long foo(long x /*rdi*/, long y /*rsi*/)
{
//movl $0, %eax
long result = 0; /* rax */
//cmpq %rsi, %rdi
//jle .L2
if (x > y) {
do {
//addq %rdi, %rax
result += x;
//subq $1, %rdi
--x;
//cmp %rdi, rsi
//jl .L3
} while (x > y);
}
return result;
}

I don't know what cmpq %rsi, %rdi mean
That's AT&T syntax for cmp rdi, rsi. https://www.felixcloutier.com/x86/CMP.html
You can look up the details of what a single instruction does in an ISA manual.
More importantly, cmp/jcc like cmp %rsi,%rdi/jl is like jump if rdi<rsi.
Assembly - JG/JNLE/JL/JNGE after CMP. If you go through all the details of how cmp sets flags, and which flags each jcc condition checks, you can verify that it's correct, but it's much easier to just use the semantic meaning of JL = Jump on Less-than (assuming flags were set by a cmp) to remember what they do.
(It's reversed because of AT&T syntax; jcc predicates have the right semantic meaning for Intel syntax. This is one of the major reasons I usually prefer Intel syntax, but you can get used to AT&T syntax.)
From the use of rdi and rsi as inputs (reading them without / before writing them), they're the arg-passing registers. So this is the x86-64 System V calling convention, where integer args are passed in RDI, RSI, RDX, RCX, R8, R9, then on the stack. (What are the calling conventions for UNIX & Linux system calls on i386 and x86-64 covers function calls as well as system calls). The other major x86-64 calling convention is Windows x64, which passes the first 2 args in RCX and RDX (if they're both integer types).
So yes, x=RDI and y=RSI. And yes, result=RAX. (writing to EAX zero-extends into RAX).
From the code structure (not storing/reloading every C variable to memory between statements), it's compiled with some level of optimization enabled, so the for() loop turned into a normal asm loop with the conditional branch at the bottom. Why are loops always compiled into "do...while" style (tail jump)? (#BrianWalker's answer shows the asm loop transliterated back to C, with no attempt to form it back into an idiomatic for loop.)
From the cmp/jcc ahead of the loop, we can tell that the compiler can't prove the loop runs a non-zero number of iterations. So whatever the for() loop condition is, it might be false the first time. (That's unsurprising given signed integers.)
Since we don't see a separate register being used for i, we can conclude that optimization reused another var's register for i. Like probably for(i=x;, and then with the original value of x being unused for the rest of the function, it's "dead" and the compiler can just use RDI as i, destroying the original value of x.
I guessed i=x instead of y because RDI is the arg register that's modified inside the loop. We expect that the C source modifies i and result inside the loop, and presumably doesn't modify it's input variables x and y. It would make no sense to do i=y and then do stuff like x--, although that would be another valid way of decompiling.
cmp %rdi, %rsi / jl .L3 means the loop condition to (re)enter the loop is rsi-rdi < 0 (signed), or i<y.
The cmp/jcc before the loop is checking the opposite condition; notice that the operands are reversed and it's checking jle, i.e. jng. So that makes sense, it really is same loop condition peeled out of the loop and implemented differently. Thus it's compatible with the C source being a plain for() loop with one condition.
sub $1, %rdi is obviously i-- or --i. We can do that inside the for(), or at the bottom of the loop body. The simplest and most idiomatic place to put it is in the 3rd section of the for(;;) statement.
addq %rdi, %rax is obviously adding i to result. We already know what RDI and RAX are in this function.
Putting the pieces together, we arrive at:
long foo(long x, long y)
{
long i, result = 0;
for (i= x ; i>y ; i-- ){
result += i;
}
return result;
}
Which compiler made this code?
From the .L3: label names, this looks like output from gcc. (Which somehow got corrupted, removing the : from .L2, and more importantly removing the % from %rsi in one cmp. Make sure you copy/paste code into SO questions to avoid this.)
So it may be possible with the right gcc version/options to get exactly this asm back out for some C input. It's probably gcc -O1, because movl $0, %eax rules out -O2 and higher (where GCC would look for the xor %eax,%eax peephole optimization for zeroing a register efficiently). But it's not -O0 because that would be storing/reloading the loop counter to memory. And -Og (optimize a bit, for debugging) likes to use a jmp to the loop condition instead of a separate cmp/jcc to skip the loop. This level of detail is basically irrelevant for simply decompiling to C that does the same thing.
The rep ret is another sign of gcc; gcc7 and earlier used this in their default tune=generic output for ret that's reached as a branch target or a fall-through from a jcc, because of AMD K8/K10 branch prediction. What does `rep ret` mean?
gcc8 and later will still use it with -mtune=k8 or -mtune=barcelona. But we can rule that out because that tuning option would use dec %rdi instead of subq $1, %rdi. (Only a few modern CPUs have any problems with inc/dec leaving CF unmodified, for register operands. INC instruction vs ADD 1: Does it matter?)
gcc4.8 and later put rep ret on the same line. gcc4.7 and earlier print it as you've shown, with the rep prefix on the line before.
gcc4.7 and later like to put the initial branch before the mov $0, %eax, which looks like a missed optimization. It means they need a separate return 0 path out of the function, which contains another mov $0, %eax.
gcc4.6.4 -O1 reproduces your output exactly, for the source shown above, on the Godbolt compiler explorer
# compiled with gcc4.6.4 -O1 -fverbose-asm
foo:
movl $0, %eax #, result
cmpq %rsi, %rdi # y, x
jle .L2 #,
.L3:
addq %rdi, %rax # i, result
subq $1, %rdi #, i
cmpq %rdi, %rsi # i, y
jl .L3 #,
.L2:
rep
ret
So does this other version which uses i=y. Of course there are many things we could add that would optimize away, like maybe i=y+1 and then having a loop condition like x>--i. (Signed overflow is undefined behaviour in C, so the compiler can assume it doesn't happen.)
// also the same asm output, using i=y but modifying x in the loop.
long foo2(long x, long y) {
long i, result = 0;
for (i= y ; x>i ; x-- ){
result += x;
}
return result;
}
In practice the way I actually reversed this:
I copy/pasted the C template into Godbolt (https://godbolt.org/). I could see right away (from the mov $0 instead of xor-zero, and from the label names) that it looked like gcc -O1 output, so I put in that command line option and picked an old-ish version of gcc like gcc6. (Turns out this asm was actually from a much older gcc).
I tried an initial guess like x<y based on the cmp/jcc, and i++ (before I'd actually read the rest of the asm carefully at all), because for loops often use i++. The trivial-looking infinite-loop asm output showed me that was obviously wrong :P
I guessed that i=x, but after taking a wrong turn with a version that did result += x but i--, I realized that i was a distraction and at first simplified by not using i at all. I just used x-- while first reversing it because obviously RDI=x. (I know the x86-64 System V calling convention well enough to see that instantly.)
After looking at the loop body, the result += x and x-- were totally obvious from the add and sub instructions.
cmp/jl was obviously a something < something loop condition involving the 2 input vars.
I wasn't sure I if it was x<y or y<x, and newer gcc versions were using jne as the loop condition. I think at that point I cheated and looked at Brian's answer to check it really was x > y, instead of taking a minute to work through the actual logic. But once I had figured out it was x--, only x>y made sense. The other one would be true until wraparound if it entered the loop at all, but signed overflow is undefined behaviour in C.
Then I looked at some older gcc versions to see if any made asm more like in the question.
Then I went back and replaced x with i inside the loop.
If this seems kind of haphazard and slapdash, that's because this loop is so tiny that I didn't expect to have any trouble figuring it out, and I was more interested in finding source + gcc version that exactly reproduced it, rather than the original problem of just reversing it at all.
(I'm not saying beginners should find it that easy, I'm just documenting my thought process in case anyone's curious.)

Converting x86 to Y86

I'm trying to figure out to convert this x86 assembly code to Y86 form:
Given the c program:
int sum(int x) {
if (x == 0 || x ==1) {
return 1;
} else {
return x + sum(x-1);
}
}
The following x86-64 assembly code is generated:
sum:
cmpl $1, %rdi
ja .L8
movl $1, %eax
ret
.L8:
pushq %rbx
movl %edi, %ebx
leal -1(%rdi), %edi
call sum
addl %ebx, %eax
popq %rbx
ret
How can I convert this to Y86-64 assembly code that does the same thing?
Thank you!

In this case, you can convert by replacing each instruction with a short sequence of y86 instructions which does exactly the same thing.
y86 is Turing complete, but very crippled, so in general you can't always easily convert. Some single x86 instructions might need an entire loop or very long function to implement, but that's not the case for any of your instructions. Each of them can be transliterated to one or a few y86 instructions. (Some might need a scratch register; I forget if y86 has compare with immediate or only mov-immediate to register.)
Your code doesn't have any multiplies, shifts, or bsf, or floating-point, or anything else that y86 doesn't have (and would need a loop to emulate).
Look up each x86 instruction in the instruction-set reference manual (like this online version, or this older one where not having AVX/AVX2 instructions means less to wade through. See also the x86 tag wiki for links to Intel and AMD's PDF manuals.) Look at the Operation section where pseudo-code describes the exact effect of the instruction on the architectural state. That's the behaviour you want to implement using y86 instructions.
As an example, I forget if y86 has push / pop, but if not you can always manipulate rsp directly and load/store. e.g. sub $8, %rsp ; movrm %rbx, (rsp) is push (except it clobbers flags where x86's push doesn't).

Understanding gcc output for if (a>=3)

I thought since condition is a >= 3, we should use jl (less).
But gcc used jle (less or equal).
It make no sense to me; why did the compiler do this?

You're getting mixed up by a transformation the compiler made on the way from the C source to the asm implementation. gcc's output implements your function this way:
a = 5;
if (a<=2) goto ret0;
return 1;
ret0:
return 0;
It's all clunky and redundant because you compiled with -O0, so it stores a to memory and then reloads it, so you could modify it with a debugger if you set a breakpoint and still have the code "work".
See also How to remove "noise" from GCC/clang assembly output?
Compilers generally prefer to reduce the magnitude of a comparison constant, so it's more likely to fit in a sign-extended 8-bit immediate instead of needing a 32-bit immediate in the machine code.
We can get some nice compact code by writing a function that takes an arg, so it won't optimize away when we enable optimizations.
int cmp(int a) {
return a>=128; // In C, a boolean converts to int as 0 or 1
}
gcc -O3 on Godbolt, targetting the x86-64 ABI (same as your code):
xorl %eax, %eax # whole RAX = 0
cmpl $127, %edi
setg %al # al = (edi>127) : 1 : 0
ret
So it transformed a >=128 into a >127 comparison. This saves 3 bytes of machine code, because cmp $127, %edi can use the cmp $imm8, r/m32 encoding (cmp r/m32, imm8 in Intel syntax in Intel's manual), but 128 would have to use cmp $imm32, r/m32.
BTW, comparisons and conditions make sense in Intel syntax, but are backwards in AT&T syntax. For example, cmp edi, 127 / jg is taken if edi > 127.
But in AT&T syntax, it's cmp $127, %edi, so you have to mentally reverse the operands or think of a > instead of <

The assembly code is comparing a to two, not three. That's why it uses jle. If a is less than or equal to two it logically follows that a IS NOT greater than or equal to 3, and therefore 0 should be returned.

How to associate assembly code to exact line in C program?

Here is an example found via an assembly website. This is the C code:
int main()
{
int a = 5;
int b = a + 6;
return 0;
}
Here is the associated assembly code:
(gdb) disassemble
Dump of assembler code for function main:
0x0000000100000f50 <main+0>: push %rbp
0x0000000100000f51 <main+1>: mov %rsp,%rbp
0x0000000100000f54 <main+4>: mov $0x0,%eax
0x0000000100000f59 <main+9>: movl $0x0,-0x4(%rbp)
0x0000000100000f60 <main+16>: movl $0x5,-0x8(%rbp)
0x0000000100000f67 <main+23>: mov -0x8(%rbp),%ecx
0x0000000100000f6a <main+26>: add $0x6,%ecx
0x0000000100000f70 <main+32>: mov %ecx,-0xc(%rbp)
0x0000000100000f73 <main+35>: pop %rbp
0x0000000100000f74 <main+36>: retq
End of assembler dump.
I can safely assume that this line of assembly code:
0x0000000100000f6a <main+26>: add $0x6,%ecx
correlates to this line of C:
int b = a + 6;
But is there a way to extract which lines of assembly are associated to the specific line of C code?
In this small sample it's not too difficult, but in larger programs and when debugging a larger amount of code it gets a bit cumbersome.

But is there a way to extract which lines of assembly are associated to the specific line of C code?
Yes, in principle - your compiler can probably do it (GCC option -fverbose-asm, for example). Alternatively, objdump -lSd or similar will disassemble a program or object file with source and line number annotations where available.
In general though, for a large optimized program, this can be very hard to follow.
Even with perfect annotation, you'll see the same source line mentioned multiple times as expressions and statements are split up, interleaved and reordered, and some instructions associated with multiple source expressions.
In this case, you just need to think about the relationship between your source and the assembly, but it takes some effort.

One of the best tools I've found for this is Matthew Godbolt's Compiler Explorer.
It features multiple compiler toolchains, auto-recompiles, and it immediately shows the assembly output with colored lines to show the corresponding line of source code.

First, you need to compile the program keeping inside its object file informations about the source code either via gdwarf or g flag or both. Next, if you want to debug it is important for the compiler to avoid optimizations, otherwise it is difficult to see a correspondence code<>assembly.
gcc -gdwarf -g3 -O0 prog.c -o out
Next, tell the disassembler to output the source code. The source flag involves the disassemble flag.
objdump --source out

#Useless is very right. Anyways, a trick to know where C has arrived in the machine code is to inject markers in it; for instance,
#define ASM_MARK do { asm __volatile__("nop; nop; nop;\n\t" :::); } while (0);
int main()
{
int a = 5;
ASM_MARK;
int b = a + 6;
ASM_MARK;
return 0;
}
You will see:
main:
pushq %rbp
movq %rsp, %rbp
movl $5, -4(%rbp)
nop; nop; nop;
movl -4(%rbp), %eax
addl $6, %eax
movl %eax, -8(%rbp)
nop; nop; nop;
movl $0, %eax
popq %rbp
ret
You need to use the __volatile__ keyword or equivalent in order to tell the compiler not to interfere and this is often compiler-specific (notice the __), as C does not
provide this kind of syntax.