Get Assembly code after every optimization GCC makes?

Get Assembly code after every optimization GCC makes? - c

From Optimization Compiler on Wikipedia,
Compiler optimization is generally implemented using a sequence of optimizing transformations, algorithms which take a program and transform it to produce a semantically equivalent output program that uses fewer resources.
and GCC has a lot of optimization options.
I'd like to study the generated assembly (the one -S gives) after each optimization GCC performs when compiling with different flags like -O1, -O2, -O3, etc.
How can I do this?
Edit: My input will be C code.

Intermediate representation can be saved to files using -fdump-tree-all switch.
There are more fine-grained -fdump switches awailable.
See gcc manual for details.
To be able to read these representations, take a look into GCC internals manual.

gcc -S (Capital S)
gives asm output, but the assembler can change things so I prefer to just make an object
gcc -c -o myfile.o myfile.c
then disassemble
objdump -D myfile.o
Understand that this is not linked so external branch destinations and other external addresses will have a placeholder instead of a real number. If you want to see the optimizations compile with no optimizations (-O0) then compile with -O1 then -O2 and -O3 and see what if anything changes. There are other optimzation flags as well you can play with. To see the difference you need to compile with and without the flags and compare the differences yourself.
diff won't work, you will see why (register allocation changes).

Whilst it is possible to take a small piece of code, compile it with -S and with a variety of options, the difficulty is understanding what actually changed. It only takes a small change to make code quite different - one variable going into a register means that register is no longer available for something, causing knock-on effects to all of the remaining code in the function.
I was comparing the same code from two nearly identical functions earlier today (to do with a question on C++), and there was ONE difference in the source code. One change in which variable was used for the termination condition inside one for-loop led to over lines of assembler code changing. Because the compiler decided to arrange the registers in a different way, using a different register for one of the main variables, and then everything else changed as a consequence.
I've seen cases where adding a small change to a function moves it from being inlined to not being inlined, which in turn makes big changes to ALL of the code in the program that calls that code.
So, yes, by all means, compile very simple code with different optimisations, and use -S to inspect the compiler generated code. Then compare the different variants to see what effect it has. But unless you are used to reading assembler code, and understand what you are actually looking for, it can often be hard to see the forest for the trees.
It is also worth considering that optimisation steps often work in conjunction - one step allows another step to do its work (inline leads to branch merging, register usage, and so on).

Compile with switch -S to get the assembly code. This should work for any level of optimization.
For instance, to get the assembly code generated in O2 mode, try:
g++/gcc -S -O2 input.cpp
a corresponding input.s will be generated, which contains the assembly code generated. Repeat this for any optimization level you want.

gcc/clang performs optimizations on the intermediate representations (IR), which can be printed after each optimization pass.
for gcc it is (-fdump-tree-all) 'http://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html'
with clang it is (-llvm -print-after-all).
Clang/gcc offers many more options to analyze the optimizations. It is easy to turn on/off an optimization from the command line (http://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/Optimize-Options.html, http://llvm.org/docs/Passes.html)
with clang-llvm you can also list the optimization passes which were performed using the command line option (-mllvm -debug-pass=Structure)

If you would like to study compiler optimization and are agnostic to the compiler, then take a look at the Clang/LLVM projects. Clang is a C compiler that can output LLVM IR and LLVM commands can apply specific optimization passes individually.
Output LLVM IR:
clang test.c -S -emit-llvm -o test.ll
Perform optimization pass:
opt test.ll -<optimization_pass> -S -o test_opt.ll
Compile to assembly:
llc test.ll -o test.s

Solution1:
gcc -O1 -S test.c (the capital O and capital S)
Solution2:
This site can also help you. You can use -O0, -O1, .. whatever suitable compiler options to get what you want.
Example from that site: (tested by both solutions)
void maxArray(double* x, double* y) {
for (int i = 0; i < 65536; i++) {
if (y[i] > x[i]) x[i] = y[i];
}
}
Compier Option -O0:
result:
maxArray(double*, double*):
pushq %rbp
movq %rsp, %rbp
movq %rdi, -24(%rbp)
movq %rsi, -32(%rbp)
movl $0, -4(%rbp)
jmp .L2
.L5:
movl -4(%rbp), %eax
cltq
leaq 0(,%rax,8), %rdx
movq -32(%rbp), %rax
addq %rdx, %rax
movsd (%rax), %xmm0
movl -4(%rbp), %eax
cltq
leaq 0(,%rax,8), %rdx
movq -24(%rbp), %rax
addq %rdx, %rax
movsd (%rax), %xmm1
ucomisd %xmm1, %xmm0
jbe .L3
movl -4(%rbp), %eax
cltq
leaq 0(,%rax,8), %rdx
movq -24(%rbp), %rax
addq %rax, %rdx
movl -4(%rbp), %eax
cltq
leaq 0(,%rax,8), %rcx
movq -32(%rbp), %rax
addq %rcx, %rax
movq (%rax), %rax
movq %rax, (%rdx)
.L3:
addl $1, -4(%rbp)
.L2:
cmpl $65535, -4(%rbp)
jle .L5
popq %rbp
ret
Compiler Option -O1:
result:
maxArray(double*, double*):
movl $0, %eax
.L5:
movsd (%rsi,%rax), %xmm0
ucomisd (%rdi,%rax), %xmm0
jbe .L2
movsd %xmm0, (%rdi,%rax)
.L2:
addq $8, %rax
cmpq $524288, %rax
jne .L5
rep; ret

Related

Trying to understand the gcc assembly output for printf()

I'm trying to learn how to understand assembly code so I've been studying the assembly output of GCC for some stupid programs. One of them was nothing but int i = 0;, the code of which I more or less fully understand now (the biggest struggle was understanding the GAS directives strewn about). Anyway, I went a step forward and added printf("%d\n", i); to see if I could understand that and suddenly the code is much more chaotic.
.file "helloworld.c"
.text
.section .rodata.str1.1,"aMS",#progbits,1
.LC0:
.string "%d\n"
.section .text.startup,"ax",#progbits
.p2align 4
.globl main
.type main, #function
main:
subq $8, %rsp
xorl %edx, %edx
leaq .LC0(%rip), %rsi
xorl %eax, %eax
movl $1, %edi
call __printf_chk#PLT
xorl %eax, %eax
addq $8, %rsp
ret
.size main, .-main
.ident "GCC: (Gentoo 10.2.0-r3 p4) 10.2.0"
.section .note.GNU-stack,"",#progbits
I'm compiling this with gcc -S -O3 -fno-asynchronous-unwind-tables to remove the .cfi directives, however -O2 produces the same code so -O3 is overkill. My understanding of assembly is quite limited but it seems to me like the compiler is doing a lot of unneccessary stuff here. Why subtract and then add 8 to rsp? Why is it performing so many xors? There's only one variable. What is movl $1, %edi doing? I thought maybe the compiler was doing something stupid in an attempt to optimize but as I said, it's not optimizing beyond -O2, also it performs all of these operations even at -O1. To be honest I don't understand the unoptimized code at all so I assume it's inefficient.
The only thing that comes to mind is that the call to printf uses these registers, otherwise they are unused and serve no purpose. Is that actually the case? If so, how is it possible to tell?
Thanks in advance. I'm reading a book on compiler design at the moment and I've read most of the GCC manual (I read the whole chapter on optimization) and I've read some introductory x86_64 asm material, if somebody could point me toward some other resources (besides the Intel x86 manual) for learning more I would also appreciate that.

For the compiler that you are using it looks like printf(...) is mapped to __printf_chk(1, ...)
To understand the code, you need to understand the parameter passing conventions for the platform (part of the ABI). Once you know that up to 4 params are passed in %rdi, %rsi, %rdx, %rcx, you can understand most of what is going on:
subq $8, %rsp ; allocate 8 bytes of stack
xorl %edx, %edx ; i = 0 ; put it in the 3rd parameter for __printf_chk
leaq .LC0(%rip), %rsi ; 2nd parameter for __printf_chk. The: "%d\n"
xorl %eax, %eax ; 0 variadic fp params
movl $1, %edi ; 1st parameter for __printf_chk
call __printf_chk#PLT ; call the runtime loader wrapper for __printf_chk
xorl %eax, %eax ; return 0 from main
addq $8, %rsp ; deallocate 8 bytes of stack.
ret
Nate points out in the comments that section 3.5.7 in the ABI explains the %eax = 0 (no floating point variadic parameters.)

The assembly of “b++”

In C language,what's the assemble of "b++".
I got two situations:
1) one instruction
addl $0x1,-4(%rbp)
2) three instructions
movl -4(%rbp), %eax
leal 1(%rax), %edx
movl %edx, -4(%rbp)
Are these two situations caused by the compiler?
my code:
int main()
{
int ret = 0;
int i = 2;
ret = i++;
ret = ++i;
return ret;
}
the .s file(++i use addl instrction, i++ use other)：
.file "main.c"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $0, -8(%rbp) //ret
movl $2, -4(%rbp) //i
movl -4(%rbp), %eax
leal 1(%rax), %edx
movl %edx, -4(%rbp)
movl %eax, -8(%rbp)
addl $1, -4(%rbp)
movl -4(%rbp), %eax
movl %eax, -8(%rbp)
movl -8(%rbp), %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 5.3.1-14ubuntu2) 5.3.1 20160413"
.section .note.GNU-stack,"",#progbits

The ISO standard does not mandate at all what happens under the covers. It specifies a "virtual machine" that acts in a certain way given the C instructions you provide to it.
So, if your C compiler is implemented as a C-to-Dartmouth-Basic converter, b++ is just as likely to lead to 10 let b = b + 1 as anything else :-)
If you're compiling to common assembler code, then you're likely to see a difference depending on whether you use the result, specifically b++; as opposed to a = b++ since the result of the former can be safely thrown away.
You're also likely to see massive differences based on optimisation level.
Bottom line, short of specifying all the things that can affect the output (including but not limited to compiler, target platform, and optimisation levels).

The first one is the output for ++i as part of ret = ++i. It doesn't need to keep the old value around, because it's doing ++i and then res=i. Incrementing in memory and then reloading that is a really stupid and inefficient way to compile that, but you compiled with optimization disabled so gcc isn't even trying to make good asm output.
The 2nd one is the output for i++ as part of ret = i++. It needs to keep the old value of i around, so it loads into a register and uses lea to calculate i+1 in a different register. It could have just stored to ret and then incremented the register before storing back to i, but I guess with optimizations disabled gcc doesn't notice that.
Previous answer to the previous vague question without source, and with bogus code:
The asm for a tiny expression like b++ totally depends on the surrounding code in the rest of the function (or with optimization disabled, at least the rest of the statement) and whether it's a global or local, and whether it's declared volatile.
And of course compiler optimization options have a massive impact; with optimization disabled, gcc makes a separate block of asm for every C statement so you can use the GDB jump command to go to a different source line and have the code still produce the same behaviour you'd expect from the C abstract machine. Obviously this highly constrains code-gen: nothing is kept in registers across statements. This is good for source-level debugging, but sucks to read by hand because of all the noise of store/reload.
For the choice of inc vs. add, see INC instruction vs ADD 1: Does it matter? clang -O3 -mtune=bdver2 uses inc for memory-destination increments, but with generic tuning or any Intel P6 or Sandybridge-family CPU it uses add $1, (mem) for better micro-fusion.
See How to remove "noise" from GCC/clang assembly output?, especially the link to Matt Godbolt's CppCon2017 talk about looking at and making sense of compiler asm output.
The 2nd version in your original question looks like mostly un-optimized compiler output for this weird source:
// inside some function
int b;
// leaq -4(%rbp), %rax // rax = &b
b++; // incl (%rax)
b = (int)&b; // mov %eax, -4(%rbp)
(The question has since been edited to different code; looks like the original was mis-typed by hand mixing an opcode from once line with an operand from another line. I reproduce it here so all the comments about it being weird still make sense. For the updated code, see the first half of my answer: it depends on surrounding code and having optimization disabled. Using res = b++ needs the old value of b, not the incremented value, hence different asm.)
If that's not what your source does, then you must have left out some intervening instructions or something. Or else the compiler is re-using that stack slot for something else.
I'm curious what compiler you got that from, because gcc and clang typically don't like to use results they just computed. I'd have expected incl -4(%rbp).
Also that doesn't explain mov %eax, -4(%rbp). The compiler already used the address in %rax for inc, so why would a compiler revert to a 1-byte-longer RBP-relative addressing mode instead of mov %eax, (%rax)? Referencing fewer different registers that haven't been recently written is a good thing for Intel P6-family CPUs (up to Nehalem), to reduce register-read stalls. (Otherwise irrelevant.)
Using RBP as a frame pointer (and doing increments in memory instead of keeping simple variables in registers) looks like un-optimized code. But it can't be from gcc -O0, because it computes the address before the increment, and those have to be from two separate C statements.
b++ = &b; isn't valid because b++ isn't an lvalue. Well actually the comma operator lets you do b++, b = &b; in one statement, but gcc -O0 still evaluates it in order, rather than computing the address early.
Of course with optimization enabled, b would have to be volatile to explain incrementing in memory right before overwriting it.
clang is similar, but actually does compute that address early. For b++; b = &b;, notice that clang6.0 -O0 does an LEA and keeps RAX around across the increment. I guess clang's code-gen doesn't support consistent debugging with GDB's jump the way gcc does.
leaq -4(%rbp), %rax
movl -4(%rbp), %ecx
addl $1, %ecx
movl %ecx, -4(%rbp)
movl %eax, %ecx # copy the LEA result
movl %ecx, -4(%rbp)
I wasn't able to get gcc or clang to emit the sequence of instructions you show in the question with unoptimized or optimized + volatile, on the Godbolt compiler explorer. I didn't try ICC or MSVC, though. (Although unless that's disassembly, it can't be MSVC because it doesn't have an option to emit AT&T syntax.)

Any good compiler will optimise b++ to ++b if the result of the expression is discarded. You see this particularly in increments in for loops.
That's what is happening in your "one instruction" case.

It's not typically instructive to look at un-optimized compiler output, since values (variables) will usually be updated using a load-modify-store paradigm. This might be useful initially when getting to grips with assembly, but it's not the output to expect from an optimizing compiler that maintains values, pointers, etc., in registers for frequent use. (see: locality of reference)
/* un-optimized logic: */
int i = 2;
ret = i++; /* assign ret <- i, and post-increment i (ret = i; i++ (i = 3)) */
ret = ++i; /* pre-increment i, and assign ret <- i (++i (i = 4); ret = i) */
i.e., any modern, optimising compiler can easily determine that the final value of ret is (4).
Removing all the extraneous directives, etc., gcc-7.3.0 on OS X gives me:
_main: /* Darwin x86-64 ABI adds leading underscores to symbols... */
movl $4, %eax
ret
Apple's native clang, and the MacPorts clang-6.0 set up basic stack frame, but still optimise the ret arithmetic away:
_main:
pushq %rbp
movq %rsp, %rbp
movl $4, %eax
popq %rbp
retq
Note that the Mach-O (OS X) ABI is very similar to the ELF ABI for user-space code. Just try compiling with at least -O2 to get a feel for 'real' (production) code.

gcc assembler not recognizing tzcntl or lzcntl though my processor does

If I compile this C function:
int function(unsigned u) {
int r = __builtin_clz(u) + __builtin_ctz(u);
return r;
}
in clang using -march=native, I get the following assembly:
pushq %rbp
movq %rsp, %rbp
lzcntl %edi, %ecx
tzcntl %edi, %eax
addl %ecx, %eax
popq %rbp
retq
and the compiled object code runs fine on my processor. (I'm on a Mac OS, so I query the CPU features using sysctl -a machdep.cpu. It reports the BMI1 capability which includes TZCNT, and also the LZCNT capability. Also if I link that into a binary and execute it I get the expected results.)
If I compile the same code using gcc using -march=native, I get similar assembly:
xorl %eax, %eax
lzcntl %edi, %eax
tzcntl %edi, %edi
addl %edi, %eax
ret
but the assembler refuses to compile it. I get no such instruction errors for both the lzcntl and the tzcntl. What do I need to do to get the assembler to cooperate? Is it feasible/reasonable to tell gcc to use a different assembler? If so, how do I do that?
I'm not trying to solve a specific problem, I'm just trying to explore in what circumstances I can use builtins for things like count-leading/trailing-zeros, popcount, and so on.

Why does gcc create redundant assembly code?

I wanted to look into how certain C/C++ features were translated into assembly and I created the following file:
struct foo {
int x;
char y[0];
};
char *bar(struct foo *f)
{
return f->y;
}
I then compiled this with gcc -S (and also tried with g++ -S) but when I looked at the assembly code, I was disappointed to find a trivial redundancy in the bar function that I thought gcc should be able to optimize away:
_bar:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movabsq $4, %rcx
addq %rcx, %rax
movq %rax, -24(%rbp)
movq -24(%rbp), %rax
movq %rax, -16(%rbp)
movq -16(%rbp), %rax
popq %rbp
ret
Leh_func_end1:
Among other things, the lines
movq %rax, -24(%rbp)
movq -24(%rbp), %rax
movq %rax, -16(%rbp)
movq -16(%rbp), %rax
seem pointlessly redundant. Is there any reason gcc (and possibly other compilers) cannot/does not optimize this away?

I thought gcc should be able to optimize away.
From the gcc manual:
Without any optimization option, the compiler's goal is to reduce the cost of compilation and to make debugging produce the expected results.
In other words, it doesn't optimize unless you ask it to. When I turn on optimizations using the -O3 flag, gcc 4.4.6 produces much more efficient code:
bar:
.LFB0:
.cfi_startproc
leaq 4(%rdi), %rax
ret
.cfi_endproc
For more details, see Options That Control Optimization in the manual.

The code the compiler generates without optimization is typically a straight instruction-by-instruction translation, and the instructions are not those of the program but those of an intermediate representation in which redundancy may have been introduced.
If you expect assembly without such redundant instructions, use gcc -O -S
The kind of optimization you were expecting is called peephole optimization. Compilers usually have plenty of these, because unlike more global optimizations, they are cheap to apply and (generally) do not risk making the code worse—if applied towards the end of the compilation, at least.
In this blog post, I provide an example where both GCC and Clang may go as far as generating shorter 32-bit instructions when the integer type in the source code is 64-bit but only the lowest 32-bit of the result matter.

Compiler optimization causing program to run slower

I have the following piece of code that I wrote in C. Its fairly simple as it just right bit-shifts x for every loop of for.
int main() {
int x = 1;
for (int i = 0; i > -2; i++) {
x >> 2;
}
}
Now the strange thing that is happening is that when I just compile it without any optimizations or with first level optimization (-O), it runs just fine (I am timing the executable and its about 1.4s with -O and 5.4s without any optimizations.
Now when I add -O2 or -O3 switch for compilation and time the resulting executable, it doesn't stop (I have tested for up to 60s).
Any ideas on what might be causing this?

The optimized loop is producing an infinite loop which is a result of you depending on signed integer overflow. Signed integer overflow is undefined behavior in C and should not be depended on. Not only can it confuse developers it may also be optimized out by the compiler.
Assembly (no optimizations): gcc -std=c99 -S -O0 main.c
_main:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
movl $1, -4(%rbp)
movl $0, -8(%rbp)
jmp L2
L3:
incl -8(%rbp)
L2:
cmpl $-2, -8(%rbp)
jg L3
movl $0, %eax
leave
ret
Assembly (optimized level 3): gcc -std=c99 -S -O3 main.c
_main:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
L2:
jmp L2 #<- infinite loop

You will get the definitive answer by looking at the binary that's produced (using objdump or something).
But as others have noted, this is probably because you're relying on undefined behaviour. One possible explanation is that the compiler is free to assume that i will never be less than -2, and so will eliminate the conditional entirely, and convert this into an infinite loop.
Also, your code has no observable side effects, so the compiler is also free to optimise the entire program away to nothing, if it likes.

Additional information about why integer overflows are undefined can be found here:
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
Search for the paragraph "Signed integer overflow".

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight