In C, what's the assembly for "b++"?
I've seen two different results:
1) one instruction
addl $0x1,-4(%rbp)
2) three instructions
movl -4(%rbp), %eax
leal 1(%rax), %edx
movl %edx, -4(%rbp)
Are these two variants just different code-generation choices made by the compiler?
My code:
int main()
{
int ret = 0;
int i = 2;
ret = i++;
ret = ++i;
return ret;
}
The .s file (++i uses an addl instruction; i++ uses the three-instruction sequence):
.file "main.c"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $0, -8(%rbp) //ret
movl $2, -4(%rbp) //i
movl -4(%rbp), %eax
leal 1(%rax), %edx
movl %edx, -4(%rbp)
movl %eax, -8(%rbp)
addl $1, -4(%rbp)
movl -4(%rbp), %eax
movl %eax, -8(%rbp)
movl -8(%rbp), %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 5.3.1-14ubuntu2) 5.3.1 20160413"
.section .note.GNU-stack,"",@progbits
The ISO standard does not mandate at all what happens under the covers. It specifies an "abstract machine" that acts in a certain way given the C source you provide.
So, if your C compiler is implemented as a C-to-Dartmouth-Basic converter, b++ is just as likely to lead to 10 let b = b + 1 as anything else :-)
If you're compiling to common assembler code, then you're likely to see a difference depending on whether you use the result: b++; on its own, as opposed to a = b++, since the result of the former can be safely thrown away.
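A minimal pair illustrating that (a sketch; the function names are mine, and the exact instructions depend on compiler and flags):
int b;
void discard(void) { b++; }        /* result unused: can compile to a single addl $1, b(%rip) */
int  use(void)     { return b++; } /* old value needed: load b, compute b+1 separately, store */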
You're also likely to see massive differences based on optimisation level.
Bottom line: short of specifying all the things that can affect the output (including but not limited to compiler, compiler version, target platform, and optimisation level), there is no single answer to what b++ compiles to.
The first one is the output for ++i as part of ret = ++i. It doesn't need to keep the old value around, because it's doing ++i and then ret = i. Incrementing in memory and then reloading the result is an inefficient way to compile that, but you compiled with optimization disabled, so gcc isn't even trying to make good asm output.
The second one is the output for i++ as part of ret = i++. It needs to keep the old value of i around, so it loads i into a register and uses lea to compute i+1 in a different register. It could have stored the old value to ret and then incremented the register before storing back to i, but with optimization disabled gcc doesn't look for that.
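In C terms, ret = i++; behaves as if written like this (a sketch of the abstract-machine semantics, using the question's variables; it's exactly what the movl/leal/movl sequence implements):
int old = i;  /* movl -4(%rbp), %eax : keep the old value */
i = old + 1;  /* leal 1(%rax), %edx, then movl %edx, -4(%rbp) */
ret = old;    /* movl %eax, -8(%rbp) : ret gets the old value, not the incremented one */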
Previous answer to the earlier, vaguer version of the question, which had no source and bogus code:
The asm for a tiny expression like b++ totally depends on the surrounding code in the rest of the function (or with optimization disabled, at least the rest of the statement) and whether it's a global or local, and whether it's declared volatile.
And of course compiler optimization options have a massive impact; with optimization disabled, gcc makes a separate block of asm for every C statement so you can use the GDB jump command to go to a different source line and have the code still produce the same behaviour you'd expect from the C abstract machine. Obviously this highly constrains code-gen: nothing is kept in registers across statements. This is good for source-level debugging, but sucks to read by hand because of all the noise of store/reload.
For the choice of inc vs. add, see INC instruction vs ADD 1: Does it matter? clang -O3 -mtune=bdver2 uses inc for memory-destination increments, but with generic tuning or any Intel P6 or Sandybridge-family CPU it uses add $1, (mem) for better micro-fusion.
See How to remove "noise" from GCC/clang assembly output?, especially the link to Matt Godbolt's CppCon2017 talk about looking at and making sense of compiler asm output.
The 2nd version in your original question looks like mostly un-optimized compiler output for this weird source:
// inside some function
int b;
// leaq -4(%rbp), %rax // rax = &b
b++; // incl (%rax)
b = (int)&b; // mov %eax, -4(%rbp)
(The question has since been edited to different code; the original looks like it was mis-typed by hand, mixing an opcode from one line with an operand from another. I reproduce it here so all the comments about it being weird still make sense. For the updated code, see the first half of my answer: it depends on the surrounding code and on having optimization disabled. Using ret = i++ needs the old value of i, not the incremented value, hence the different asm.)
If that's not what your source does, then you must have left out some intervening instructions or something. Or else the compiler is re-using that stack slot for something else.
I'm curious which compiler you got that from, because gcc and clang wouldn't normally bother computing the address into a register first; I'd have expected just incl -4(%rbp).
Also that doesn't explain mov %eax, -4(%rbp). The compiler already used the address in %rax for inc, so why would a compiler revert to a 1-byte-longer RBP-relative addressing mode instead of mov %eax, (%rax)? Referencing fewer different registers that haven't been recently written is a good thing for Intel P6-family CPUs (up to Nehalem), to reduce register-read stalls. (Otherwise irrelevant.)
Using RBP as a frame pointer (and doing increments in memory instead of keeping simple variables in registers) looks like un-optimized code. But it can't be from gcc -O0, because it computes the address before the increment, and those have to be from two separate C statements.
b++ = &b; isn't valid because b++ isn't an lvalue. Well, actually the comma operator lets you do b++, b = (int)&b; in one statement, but gcc -O0 still evaluates it in order, rather than computing the address early.
Of course with optimization enabled, b would have to be volatile to explain incrementing in memory right before overwriting it.
clang is similar, but actually does compute that address early. For b++; b = (int)&b;, notice that clang 6.0 at -O0 does an LEA and keeps RAX around across the increment. I guess clang's code-gen doesn't support consistent debugging with GDB's jump command the way gcc's does.
leaq -4(%rbp), %rax
movl -4(%rbp), %ecx
addl $1, %ecx
movl %ecx, -4(%rbp)
movl %eax, %ecx # copy the LEA result
movl %ecx, -4(%rbp)
I wasn't able to get gcc or clang to emit the sequence of instructions you show in the question with unoptimized or optimized + volatile, on the Godbolt compiler explorer. I didn't try ICC or MSVC, though. (Although unless that's disassembly, it can't be MSVC because it doesn't have an option to emit AT&T syntax.)
Any good compiler will optimise b++ to ++b if the result of the expression is discarded. You see this particularly in increments in for loops.
That's what is happening in your "one instruction" case.
It's not typically instructive to look at un-optimized compiler output, since values (variables) will usually be updated using a load-modify-store paradigm. This might be useful initially when getting to grips with assembly, but it's not the output to expect from an optimizing compiler that maintains values, pointers, etc., in registers for frequent use. (see: locality of reference)
/* un-optimized logic: */
int i = 2;
ret = i++; /* assign ret <- i, and post-increment i (ret = i; i++ (i = 3)) */
ret = ++i; /* pre-increment i, and assign ret <- i (++i (i = 4); ret = i) */
i.e., any modern, optimising compiler can easily determine that the final value of ret is 4.
Removing all the extraneous directives, etc., gcc-7.3.0 on OS X gives me:
_main: /* Darwin x86-64 ABI adds leading underscores to symbols... */
movl $4, %eax
ret
Apple's native clang and the MacPorts clang-6.0 set up a basic stack frame, but still optimise the ret arithmetic away:
_main:
pushq %rbp
movq %rsp, %rbp
movl $4, %eax
popq %rbp
retq
Note that the Mach-O (OS X) ABI is very similar to the ELF ABI for user-space code. Just try compiling with at least -O2 to get a feel for 'real' (production) code.
Related
I have this simple piece of code in C:
#include <stdio.h>
void test() {}
int main()
{
if (2 < 3) {
int zz = 10;
}
return 0;
}
When I look at the assembly output of this code:
test():
pushq %rbp
movq %rsp, %rbp
nop
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
movl $10, -4(%rbp) // zz = 10 stored into its stack slot
movl $0, %eax
popq %rbp
ret
I got the assembly from here (default options).
I can't see the instruction for the conditional check anywhere. Where did it go?
You don't see it, because it isn't there. The compiler was able to analyze the code and see, rather easily, that this branch will always be entered.
Instead of emitting a check that would do nothing but waste CPU cycles, it emits the branch body directly.
A C program is not a sequence of instructions for the CPU to perform. That's what the emitted machine code is. A C program is a description of the behavior your compiled program should have. A compiler is free to translate it in almost any way it wants, so long as you get that behavior.
It's known as "the as-if rule".
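For example (a sketch; exact output varies by compiler and version), a loop whose result is computable at compile time can legally be replaced by a constant:
int sum10(void)
{
    int s = 0;
    for (int i = 1; i <= 10; i++)
        s += i;
    return s;   /* gcc/clang at -O2 typically emit just: movl $55, %eax ; ret */
}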
The interesting thing here is that gcc and clang optimize away the if() even at -O0, unlike some other compilers (ICC and MSVC).
gcc -O0 doesn't mean no optimization, it means no extra optimization beyond what's needed to compile at all. gcc has to transform the function logic through a couple of internal representations (GIMPLE and Register Transfer Language) before emitting asm; it doesn't have a special "dumb mode" where it slavishly transliterates every part of every C expression to asm.
Even a super-simple one-pass compiler like TCC does minor optimizations within an expression (or even a statement), like realizing that an always-true condition doesn't require branching.
gcc -O0 is the default, which you obviously used because the dead store to zz isn't optimized away.
gcc -O0 aims to compile quickly, and to give consistent debugging results.
Every C variable exists in memory, whether it's ever used or not.
Nothing is kept in registers across C statements (except variables declared register; -O0 is the only time that keyword does anything), so you can modify any C variable with a debugger while single-stepping. In other words, everything is spilled/reloaded between separate C statements. See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? (This is why benchmarking at -O0 is nonsense: writing the same logic with fewer, larger expressions is faster only at -O0, not with real settings like -O3.)
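To illustrate that benchmarking point (a sketch; the function names are mine):
/* At -O0 only, sumB is faster: fewer statements means fewer store/reload round trips.
   At -O2 both compile to the same couple of instructions. */
int sumA(int x, int y, int z) { int t = x + y; t = t + z; return t; }
int sumB(int x, int y, int z) { return x + y + z; }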
Other interesting consequences: constant-propagation doesn't work; see Why does integer division by -1 (negative one) result in FPE? for a case where gcc emits a runtime div for a variable set to a constant, vs. something simpler for a literal constant.
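A sketch of that div case (assuming gcc; exact output varies by version):
int f(int n)
{
    int d = -1;
    return n / d;   /* gcc -O0: no constant propagation, emits a real idiv
                       (so n == INT_MIN raises SIGFPE on x86) */
}

int g(int n)
{
    return n / -1;  /* literal constant: folded to a simple negation even at -O0 */
}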
Every statement is compiled independently, so you can even jump to a different source line (within the same function) using GDB and get consistent results. (Unlike in optimized code where that would be likely to crash or give nonsense, and definitely not match the C abstract machine).
Given all those requirements for gcc -O0 behaviour, if (2 < 3) can still be optimized to zero asm instructions. The behaviour doesn't depend on the value of any variable, and it's a single statement. There's no way it can ever be not-taken, so the simplest way to compile it is no instructions: fall-through into the { body } of the if.
Note that gcc -O0's rules / restrictions go far beyond the C as-if rule that the machine-code for a function merely has to implement all externally-visible behaviour of the C source. gcc -O3 optimizes the whole function down to just
main: # with optimization
xor eax, eax
ret
because it doesn't care about keeping asm for every C statement.
Other compilers:
See all 4 of the major x86 compilers on Godbolt.
clang is similar to gcc, but with a dead store of 0 to another spot on the stack, as well as the 10 for zz. clang -O0 is often closer to a transliteration of C into asm, for example it will use div for x / 2 instead of a shift, while gcc uses a multiplicative inverse for division by a constant even at -O0. But in this case, clang also decides that no instructions are sufficient for an always-true condition.
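For instance (a sketch of the difference just described; exact output varies by compiler version):
int half(int x)  { return x / 2;  }  /* clang -O0: idivl; gcc -O0: shift plus sign fixup */
int tenth(int x) { return x / 10; }  /* gcc -O0: multiply by the 0x66666667 inverse, no idiv */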
ICC and MSVC both emit asm for the branch, but instead of the mov $2, %ecx / cmp $3, %ecx you might expect, they both materialize the always-true condition as a comparison of 0 against 1, for no apparent reason:
# ICC18
pushq %rbp #6.1
movq %rsp, %rbp #6.1
subq $16, %rsp #6.1
movl $0, %eax #7.5
cmpl $1, %eax #7.5
je ..B1.3 # Prob 100% #7.5
movl $10, -16(%rbp) #9.16
..B1.3: # Preds ..B1.2 ..B1.1
movl $0, %eax #11.12
leave #11.12
ret #11.12
MSVC uses the xor-zeroing peephole optimization even without optimization enabled.
It's slightly interesting to look at which local / peephole optimizations compilers do even at -O0, but it doesn't tell you anything fundamental about C language rules or your code; it just tells you about compiler internals and the tradeoff the compiler devs chose between spending time looking for simple optimizations vs. compiling even faster in no-optimization mode.
The asm is never intended to faithfully represent the C source in any kind of way that would let a decompiler reconstruct it. Just to implement equivalent logic.
It's simple. It is not there. The compiler optimized it away.
Here is the assembly when compiling with gcc without optimization:
.file "k.c"
.text
.globl test
.type test, @function
test:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
nop
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size test, .-test
.globl main
.type main, @function
main:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $10, -4(%rbp)
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE1:
.size main, .-main
.ident "GCC: (Debian 6.3.0-18) 6.3.0 20170516"
.section .note.GNU-stack,"",@progbits
and here it is with optimization:
.file "k.c"
.text
.p2align 4,,15
.globl test
.type test, @function
test:
.LFB11:
.cfi_startproc
rep ret
.cfi_endproc
.LFE11:
.size test, .-test
.section .text.startup,"ax",@progbits
.p2align 4,,15
.globl main
.type main, @function
main:
.LFB12:
.cfi_startproc
xorl %eax, %eax
ret
.cfi_endproc
.LFE12:
.size main, .-main
.ident "GCC: (Debian 6.3.0-18) 6.3.0 20170516"
.section .note.GNU-stack,"",@progbits
As you can see, not only is the comparison optimized away: almost the whole of main is optimized away, since it doesn't produce anything visible. The variable zz is never used; the only observable thing your code does is return 0.
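If you want the store to survive, make it observable; for example (a sketch), declaring zz volatile forces the store to be emitted even at -O3, though the always-true branch itself still disappears:
int main(void)
{
    if (2 < 3) {
        volatile int zz = 10;   /* volatile: the compiler must emit this store */
    }
    return 0;
}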
2 is always less than 3, so since the compiler knows the result of 2 < 3 is always true, there is no need for an if decision in the assembly.
Optimization here means generating code that runs faster and/or is smaller.
if (2<3)
is always true, therefore the compiler emits no opcode for it.
The condition if (2<3) is always true, so a decent compiler would detect this and generate the code as if the condition didn't exist. In fact, if you optimize with -O3, godbolt.org generates just:
test():
rep ret
main:
xor eax, eax
ret
This is again valid, because a compiler is allowed to optimise and transform the code as long as the observable behaviour is preserved.
I'm trying to figure out how to convert this x86 assembly code to Y86 form.
Given the C program:
int sum(int x) {
if (x == 0 || x == 1) {
return 1;
} else {
return x + sum(x-1);
}
}
The following x86-64 assembly code is generated:
sum:
cmpl $1, %edi
ja .L8
movl $1, %eax
ret
.L8:
pushq %rbx
movl %edi, %ebx
leal -1(%rdi), %edi
call sum
addl %ebx, %eax
popq %rbx
ret
How can I convert this to Y86-64 assembly code that does the same thing?
Thank you!
In this case, you can convert by replacing each instruction with a short sequence of y86 instructions which does exactly the same thing.
y86 is Turing complete, but very crippled, so in general you can't always convert easily. Some single x86 instructions might need an entire loop or a very long function to implement, but that's not the case for any of your instructions: each can be transliterated to one or a few y86 instructions. (Some might need a scratch register; y86 has no compare-with-immediate, only mov-immediate-to-register, irmovq.)
Your code doesn't have any multiplies, shifts, bsf, floating-point, or anything else that y86 lacks (and would need a loop to emulate).
Look up each x86 instruction in the instruction-set reference manual (like this online version, or this older one where not having AVX/AVX2 instructions means less to wade through. See also the x86 tag wiki for links to Intel and AMD's PDF manuals.) Look at the Operation section where pseudo-code describes the exact effect of the instruction on the architectural state. That's the behaviour you want to implement using y86 instructions.
As an example, y86-64 does provide pushq / popq, but if it didn't, you could always manipulate %rsp directly and load/store: e.g. irmovq $8, %r8 / subq %r8, %rsp / rmmovq %rbx, (%rsp) would emulate pushq %rbx (except that it clobbers flags and a scratch register, which x86's push doesn't).
It seems state-of-the-art compilers treat arguments passed on the stack as read-only. Note that in the x86 calling convention, the caller pushes arguments onto the stack and the callee reads them from the stack. For example, the following C code:
extern int goo(int *x);
int foo(int x, int y) {
goo(&x);
return x;
}
is compiled by clang -O3 -c g.c -S -m32 in OS X 10.10 into:
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 10
.globl _foo
.align 4, 0x90
_foo: ## #foo
## BB#0:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl 8(%ebp), %eax
movl %eax, -4(%ebp)
leal -4(%ebp), %eax
movl %eax, (%esp)
calll _goo
movl -4(%ebp), %eax
addl $8, %esp
popl %ebp
retl
.subsections_via_symbols
Here, the parameter x (at 8(%ebp)) is first loaded into %eax and then stored to -4(%ebp); then the address -4(%ebp) is computed into %eax and passed to the function goo.
I wonder why Clang generates code that copies the value stored at 8(%ebp) to -4(%ebp), rather than just passing the address 8(%ebp) to the function goo. It would save memory operations and result in better performance. I observed similar behaviour in GCC too (under OS X). To be more specific, I wonder why compilers do not generate:
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 10
.globl _foo
.align 4, 0x90
_foo: ## #foo
## BB#0:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
leal 8(%ebp), %eax
movl %eax, (%esp)
calll _goo
movl 8(%ebp), %eax
addl $8, %esp
popl %ebp
retl
.subsections_via_symbols
I searched for documentation on whether the x86 calling convention demands that passed arguments be read-only, but I couldn't find anything on the issue. Does anybody have any thoughts on this?
The rules for C are that parameters must be passed by value. A compiler converts from one language (with one set of rules) to a different language (potentially with a completely different set of rules). The only limitation is that the behaviour remains the same. The rules of the C language do not apply to the target language (e.g. assembly).
What this means is that if a compiler feels like generating assembly language where parameters are passed by reference rather than by value, then this is perfectly legal (as long as the behaviour remains the same).
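The behaviour that has to stay the same is C's pass-by-value semantics; a minimal sketch:
int foo(int x)
{
    x++;               /* modifies foo's private copy only */
    return x;
}

int main(void)
{
    int y = 5;
    int r = foo(y);    /* r == 6; y must still be 5, however the ABI arranges it */
    return r - y;      /* always 1 */
}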
The real limitation has nothing to do with C at all. The real limitation is linking. So that different object files can be linked together, standards are needed to ensure that whatever the caller in one object file expects matches whatever the callee in another object file provides. This is what's known as the ABI. In some cases (e.g. 64-bit 80x86) there are multiple different ABIs for the exact same architecture.
You can even invent your own ABI that's radically different (and implement your own tools that support your own radically different ABI) and that's perfectly legal as far as the C standards go; even if your ABI requires "pass by reference" for everything (as long as the behaviour remains the same).
Actually, I just compiled this function using GCC:
int foo(int x)
{
goo(&x);
return x;
}
And it generated this code:
_foo:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
leal 8(%ebp), %eax
movl %eax, (%esp)
call _goo
movl 8(%ebp), %eax
leave
ret
This is using GCC 4.9.2 (on 32-bit cygwin if it matters), no optimizations. So in fact, GCC did exactly what you thought it should do and used the argument directly from where the caller pushed it on the stack.
The C programming language mandates that arguments are passed by value. So any modification of an argument (like an x++; as the first statement of your foo) is local to the function and does not propagate to the caller.
Hence, a general calling convention should require copying of arguments at every call site. Calling conventions have to be general enough for unknown calls, e.g. through a function pointer!
Of course, if you pass an address to some memory zone, the called function is free to dereference that pointer, e.g. as in
int goo(int *x) {
static int count;
*x = count++;
return count % 3;
}
BTW, you might use link-time optimizations (by compiling and linking with clang -flto -O2 or gcc -flto -O2) to perhaps enable the compiler to improve or inline some calls between translation units.
Notice that both Clang/LLVM and GCC are free software compilers. Feel free to propose an improvement patch to them if you want to (but since both are very complex pieces of software, you'll need to work some months to make that patch).
NB: when looking at the produced assembly code, pass -fverbose-asm to your compiler!
While fiddling with simple C code, I noticed something strange: why does ICC produce incl %eax in the assembly generated for an increment, instead of addl $1, %eax? GCC behaves as expected, using add.
Example code (-O3 used on both GCC and ICC)
int A, B, C, D, E;
void foo()
{
A = B + 1;
B = 0;
C++;
D++;
D++;
E += 2;
}
Result on ICC
L__routine_start_foo_0:
foo:
movl B(%rip), %eax #5.13
movl D(%rip), %edx #8.9
incl %eax #5.17
movl E(%rip), %ecx #10.9
addl $2, %edx #9.9
addl $2, %ecx #10.9
movl %eax, A(%rip) #5.9
movl $0, B(%rip) #6.9
incl C(%rip) #7.9
movl %edx, D(%rip) #9.9
movl %ecx, E(%rip) #10.9
ret
For example, see here.
As such, I'm wondering: is this an intended feature, a bug, or some quirk resulting from a specific setting? If add is (supposedly) better because of its flag-update behaviour or efficiency (which is the conclusion based on the links below), why does ICC use inc?
Related:
Relative performance of x86 inc vs. add instruction
Is ADD 1 really faster than INC ? x86
GCC doesn't make use of inc
Note:
I'm asking this question explicitly because none of the questions I found or was directed to on SO explains this behaviour. My previous question on the matter was closed as trivial and already answered; I don't find it trivial, and I didn't find an answer in any of the links and answers given. All of those questions explain why add is or could be better on modern x86 processors, or why GCC uses it, but none of them concerns ICC.
Any insight on ICC design choices would be also very welcome.
PS I don't consider "it does it because it does" a valid answer.
It is not unreasonable to assume at this point that incl was selected because its encoding is shorter: in 32-bit mode, inc %eax is a single byte (0x40) instead of three (0x83 0xC0 0x01) for addl $1, %eax. (In 64-bit mode the 0x40 byte was repurposed as a REX prefix, so inc %eax becomes the two-byte 0xFF 0xC0, still one byte shorter than add.)
I have the following piece of code that I wrote in C. It's fairly simple: it just right-shifts x on every iteration of the for loop.
int main() {
int x = 1;
for (int i = 0; i > -2; i++) {
x >> 2;
}
}
Now the strange thing is that when I compile it without any optimizations or with first-level optimization (-O), it runs just fine (I am timing the executable; it's about 1.4s with -O and 5.4s without any optimizations).
When I instead compile with the -O2 or -O3 switch and time the resulting executable, it doesn't stop (I have tested for up to 60s).
Any ideas on what might be causing this?
The optimized build produces an infinite loop because you are depending on signed integer overflow: the only way i > -2 could ever become false is for i to overflow past INT_MAX and wrap to a negative value. Signed integer overflow is undefined behavior in C and must not be relied on; not only can it confuse developers, the compiler may assume it never happens and optimize the check away, as it does here.
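A version of the loop that doesn't rely on signed overflow terminates at every optimization level (a sketch; it also actually updates x, since the original statement x >> 2; discards its result):
#include <limits.h>

int main(void)
{
    int x = 1;
    for (int i = 0; i < INT_MAX; i++) {  /* exit is reachable without overflow */
        x >>= 2;                         /* actually update x */
    }
    return x;
}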
Assembly (no optimizations): gcc -std=c99 -S -O0 main.c
_main:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
movl $1, -4(%rbp)
movl $0, -8(%rbp)
jmp L2
L3:
incl -8(%rbp)
L2:
cmpl $-2, -8(%rbp)
jg L3
movl $0, %eax
leave
ret
Assembly (optimized level 3): gcc -std=c99 -S -O3 main.c
_main:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
L2:
jmp L2 #<- infinite loop
You will get the definitive answer by looking at the binary that's produced (using objdump or something).
But as others have noted, this is probably because you're relying on undefined behaviour. Since signed overflow is undefined, the compiler is free to assume that i never becomes less than or equal to -2, so it can eliminate the conditional entirely and convert this into an infinite loop.
Also, your code has no observable side effects, so the compiler is also free to optimise the entire program away to nothing, if it likes.
Additional information about why signed integer overflow is undefined can be found here:
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
Search for the paragraph "Signed integer overflow".