Why does gcc create redundant assembly code? - c

I wanted to look into how certain C/C++ features were translated into assembly and I created the following file:
struct foo {
int x;
char y[0];
};
char *bar(struct foo *f)
{
return f->y;
}
I then compiled this with gcc -S (and also tried with g++ -S) but when I looked at the assembly code, I was disappointed to find a trivial redundancy in the bar function that I thought gcc should be able to optimize away:
_bar:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movabsq $4, %rcx
addq %rcx, %rax
movq %rax, -24(%rbp)
movq -24(%rbp), %rax
movq %rax, -16(%rbp)
movq -16(%rbp), %rax
popq %rbp
ret
Leh_func_end1:
Among other things, the lines
movq %rax, -24(%rbp)
movq -24(%rbp), %rax
movq %rax, -16(%rbp)
movq -16(%rbp), %rax
seem pointlessly redundant. Is there any reason gcc (and possibly other compilers) cannot/does not optimize this away?

"...a trivial redundancy in the bar function that I thought gcc should be able to optimize away."
From the gcc manual:
Without any optimization option, the compiler's goal is to reduce the cost of compilation and to make debugging produce the expected results.
In other words, it doesn't optimize unless you ask it to. When I turn on optimizations using the -O3 flag, gcc 4.4.6 produces much more efficient code:
bar:
.LFB0:
.cfi_startproc
leaq 4(%rdi), %rax
ret
.cfi_endproc
For more details, see Options That Control Optimization in the manual.

The code the compiler generates without optimization is typically a straight instruction-by-instruction translation, not of the source program but of an intermediate representation, into which redundancy may have been introduced.
If you expect assembly without such redundant instructions, use gcc -O -S.
The kind of optimization you were expecting is called peephole optimization. Compilers usually have plenty of these, because unlike more global optimizations, they are cheap to apply and (generally) do not risk making the code worse—if applied towards the end of the compilation, at least.
In this blog post, I provide an example where both GCC and Clang may go as far as generating shorter 32-bit instructions when the integer type in the source code is 64-bit but only the lowest 32 bits of the result matter.

Related

gcc - structures as label for debugging

Context:
Linux, 64-bit.
I would like a way to tell gcc to keep structures as they are when generating assembly with gcc -O0 -S -g myprog.c
By that I mean: instead of referencing a structure by stack address, I would like it to be referenced by label. That would ease parsing the assembly without reading the source code again.
So, for example:
struct mystruct{
int32_t a;
char * b;
};
would become something like:
label_mystruct:
-4(label_mystruct)
-12(label_mystruct)
and for example, referenced by:
add $56, -4(label_mystruct)
Currently, it is referenced like
.globl _main
_main:
LFB13:
LM157:
pushq %rbp #
LCFI27:
movq %rsp, %rbp#,
LCFI28:
subq $80, %rsp#,
movl %edi,-68(%rbp) # argc, argc,
movq %rsi,-80(%rbp) # argv, argv
Next line is the culprit:
movq -56(%rbp), %rdx # list, D.3781
movq -16(%rbp), %rax # arr, D.3780
movq %rdx, %rsi # D.3781,
movq %rax, %rdi # D.3780,
call _myaddhu #
I would like it to be
label_mystruct:
-4(label_mystruct)
-12(label_mystruct)
.globl _main
_main:
LFB13:
LM157:
pushq %rbp #
LCFI27:
movq %rsp, %rbp#,
LCFI28:
subq $80, %rsp#,
movl %edi,-68(%rbp) # argc, argc,
movq %rsi,-80(%rbp) # argv, argv
Now it is fine:
movq label_mystruct, %rdx # list, D.3781
movq -16(%rbp), %rax # arr, D.3780
movq %rdx, %rsi # D.3781,
movq %rax, %rdi # D.3780,
call _myaddhu #
Question:
Is that possible with gcc and without using external tools?
I think it's not possible, and that this follows from the way GCC works.
The problem here is that the struct is stored on the stack, and you cannot really have a label referring to something on the stack. If the struct were not on the stack, it would have a label referring to it (for example, if it were a global variable).
What you get instead is that GCC generates debugging info describing where data lives while specific code is running. In your example it would in essence say "when executing this code, -56(%rbp) points to mystruct".
On the other hand if you would write assembler code by hand you could certainly have symbolic references to a variable. You could for example do:
#define MYSTRUCT -56(%rbp)
...
movq MYSTRUCT, %rdx
however, MYSTRUCT is expanded by the preprocessor, and that symbol is lost by the time the code is assembled. It would be of little help if GCC did this (except that the assembler code generated by -S might be more readable); in any case, GCC does not pass its generated assembler through the preprocessor.
You get that if you put your struct into static storage. This of course alters the meaning of the code. For example, this code
struct {
int a, b;
} test;
void settest(int a, int b) {
test.a = a;
test.b = b;
}
compiles to (cleaned up):
settest:
movl %edi, test(%rip)
movl %esi, test+4(%rip)
ret
.comm test,8,4
You could also try to pass the option -fverbose-asm to gcc which instructs gcc to add some annotations that might make the assembly easier to read.

Can recursive union find be optimized?

When implementing union-find, I would usually write the find function with path compression like this:
def find(x):
if x != par[x]:
par[x] = find(par[x])
return par[x]
This is easy to remember and arguably easy to read. This is also how many books and websites describe the algorithm.
However, naively compiled, this would use stack memory linear in the input size. In many languages and systems that would by default result in a stack overflow.
The only non-recursive way I know of writing find is this:
def find(x):
p = par[x]
while p != par[p]:
p = par[p]
while x != p:
x, par[x] = par[x], p
return p
It seems unlikely that many compilers would find that. (Maybe Haskell would?)
My question is: in what cases is it safe to use the former version of find? If no widely used language can remove the recursion, shouldn't we tell people to use the iterative version? And might there be a simpler iterative implementation?
There seem to be two separate questions here.
First - can optimizing compilers notice this and rewrite it? It's difficult to answer this question without testing all compilers and all versions. I tried this out using gcc 4.8.4 on the following code:
size_t find(size_t uf[], size_t index) {
if (index != uf[index]) {
uf[index] = find(uf, uf[index]);
}
return uf[index];
}
void link(size_t uf[], size_t i, size_t j) {
uf[find(uf, i)] = uf[find(uf, j)];
}
This doesn't implement the union-by-rank optimization, but does support path compression. I compiled this using optimization level -O3 and the assembly is shown here:
find:
.LFB23:
.cfi_startproc
pushq %r14
.cfi_def_cfa_offset 16
.cfi_offset 14, -16
pushq %r13
.cfi_def_cfa_offset 24
.cfi_offset 13, -24
pushq %r12
.cfi_def_cfa_offset 32
.cfi_offset 12, -32
pushq %rbp
.cfi_def_cfa_offset 40
.cfi_offset 6, -40
pushq %rbx
.cfi_def_cfa_offset 48
.cfi_offset 3, -48
leaq (%rdi,%rsi,8), %rbx
movq (%rbx), %rax
cmpq %rsi, %rax
je .L2
leaq (%rdi,%rax,8), %rbp
movq 0(%rbp), %rdx
cmpq %rdx, %rax
je .L3
leaq (%rdi,%rdx,8), %r12
movq %rdx, %rax
movq (%r12), %rcx
cmpq %rcx, %rdx
je .L4
leaq (%rdi,%rcx,8), %r13
movq %rcx, %rax
movq 0(%r13), %rdx
cmpq %rdx, %rcx
je .L5
leaq (%rdi,%rdx,8), %r14
movq %rdx, %rax
movq (%r14), %rsi
cmpq %rsi, %rdx
je .L6
call find // <--- Recursion!
movq %rax, (%r14)
.L6:
movq %rax, 0(%r13)
.L5:
movq %rax, (%r12)
.L4:
movq %rax, 0(%rbp)
.L3:
movq %rax, (%rbx)
.L2:
popq %rbx
.cfi_def_cfa_offset 40
popq %rbp
.cfi_def_cfa_offset 32
popq %r12
.cfi_def_cfa_offset 24
popq %r13
.cfi_def_cfa_offset 16
popq %r14
.cfi_def_cfa_offset 8
ret
.cfi_endproc
Given the recursive call in the middle, the recursion was not eliminated: gcc unrolled a few levels of it, but past that depth it still recurses. In fairness, the transformation you're describing is pretty nontrivial, so I'm not surprised gcc didn't find it. This doesn't mean that no optimizing compiler can find it, but it does mean that one major one won't.
Your second question is why we present the algorithm this way. As someone who teaches both algorithms and programming, I think it's extremely valuable to discuss algorithms using a presentation that's as simple as possible, even if it means abstracting away some implementation details. Here, the key idea behind the algorithm is to update the parent pointers of all the nodes encountered on the way up to the representative. Recursion happens to be a pretty clean way of describing that idea, even though, implemented naively, it risks a stack overflow. By expressing the pseudocode in that particular way, it's easier to describe and discuss the algorithm and to prove that it works as advertised. We could describe it the other way to avoid a stack overflow, but in Theoryland we usually don't worry about details like that, and the updated presentation, while more directly translatable into practice, would make it harder to see the key ideas.
When looking at pseudocode for more advanced algorithms and data structures, it's common to omit critically important implementation details and to handwave that certain tasks are possible to do in certain time bounds. When discussing algorithms or data structures that build on top of even more complex algorithms and data structures, it often becomes impossible to write out pseudocode for everything because you have layers on top of layers on top of layers of glossed-over details. From a theoretical perspective, this is fine - if the reader does want to implement it, they can fill in the blanks. On the other hand, if the reader is more interested in the key techniques from the paper and the theory (which in academic settings is common), they won't get bogged down in implementation details.

Setting up local stack according to x86-64 calling convention on linux

I am doing some extended assembly optimization on GNU C code running on 64-bit Linux. I wanted to print debugging messages from within the assembly code, and that's how I came across the following. I am hoping someone can explain what I am supposed to do in this situation.
Take a look at this sample function:
void test(int a, int b, int c, int d){
__asm__ volatile (
"movq $0, %%rax\n\t"
"pushq %%rax\n\t"
"popq %%rax\n\t"
:
:"m" (a)
:"cc", "%rax"
);
}
Since the four arguments to the function are of class INTEGER, they will be passed in registers and then saved on the stack. The strange thing to me is how gcc actually does it:
test:
pushq %rbp
movq %rsp, %rbp
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl %edx, -12(%rbp)
movl %ecx, -16(%rbp)
movq $0, %rax
pushq %rax
popq %rax
popq %rbp
ret
The passed arguments are saved on the stack, but the stack pointer is not decremented. Thus, when I do pushq %rax, the values of a and b are overwritten.
What I am wondering: is there a way to ask gcc to properly set up the local stack? Am I simply not supposed to use push and pop in function calls?
The x86-64 ABI provides a 128-byte red zone below the stack pointer, and the compiler decided to use it. You can turn that off with the -mno-red-zone option.

Get Assembly code after every optimization GCC makes?

From Optimization Compiler on Wikipedia,
Compiler optimization is generally implemented using a sequence of optimizing transformations, algorithms which take a program and transform it to produce a semantically equivalent output program that uses fewer resources.
and GCC has a lot of optimization options.
I'd like to study the generated assembly (the one -S gives) after each optimization GCC performs when compiling with different flags like -O1, -O2, -O3, etc.
How can I do this?
Edit: My input will be C code.
Intermediate representation can be saved to files using -fdump-tree-all switch.
There are more fine-grained -fdump switches available.
See gcc manual for details.
To be able to read these representations, take a look into GCC internals manual.
gcc -S (Capital S)
gives asm output, but the assembler can change things so I prefer to just make an object
gcc -c -o myfile.o myfile.c
then disassemble
objdump -D myfile.o
Understand that this is not linked, so external branch destinations and other external addresses will have a placeholder instead of a real number. If you want to see the optimizations, compile with no optimizations (-O0), then compile with -O1, -O2 and -O3 and see what, if anything, changes. There are other optimization flags as well that you can play with. To see the difference you need to compile with and without the flags and compare the output yourself.
diff won't work; you will see why (register allocation changes).
Whilst it is possible to take a small piece of code, compile it with -S and with a variety of options, the difficulty is understanding what actually changed. It only takes a small change to make code quite different - one variable going into a register means that register is no longer available for something, causing knock-on effects to all of the remaining code in the function.
I was comparing the same code from two nearly identical functions earlier today (to do with a question on C++), and there was ONE difference in the source code. One change in which variable was used for the termination condition inside one for-loop led to a large number of assembler lines changing, because the compiler decided to arrange the registers in a different way, using a different register for one of the main variables, and then everything else changed as a consequence.
I've seen cases where adding a small change to a function moves it from being inlined to not being inlined, which in turn makes big changes to ALL of the code in the program that calls that code.
So, yes, by all means, compile very simple code with different optimisations, and use -S to inspect the compiler generated code. Then compare the different variants to see what effect it has. But unless you are used to reading assembler code, and understand what you are actually looking for, it can often be hard to see the forest for the trees.
It is also worth considering that optimisation steps often work in conjunction - one step allows another step to do its work (inline leads to branch merging, register usage, and so on).
Compile with the -S switch to get the assembly code. This should work at any level of optimization.
For instance, to get the assembly code generated in O2 mode, try:
g++/gcc -S -O2 input.cpp
A corresponding input.s will be generated, containing the assembly code. Repeat this for any optimization level you want.
gcc and clang perform optimizations on their intermediate representations (IR), which can be printed after each optimization pass.
For gcc the option is -fdump-tree-all (http://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html);
for clang it is -mllvm -print-after-all.
Clang and gcc offer many more options to analyze the optimizations. It is easy to turn an optimization on or off from the command line (http://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/Optimize-Options.html, http://llvm.org/docs/Passes.html).
With clang you can also list the optimization passes that were performed, using the command-line option -mllvm -debug-pass=Structure.
If you would like to study compiler optimization and are agnostic to the compiler, then take a look at the Clang/LLVM projects. Clang is a C compiler that can output LLVM IR and LLVM commands can apply specific optimization passes individually.
Output LLVM IR:
clang test.c -S -emit-llvm -o test.ll
Perform optimization pass:
opt test.ll -<optimization_pass> -S -o test_opt.ll
Compile to assembly:
llc test.ll -o test.s
Solution1:
gcc -O1 -S test.c (the capital O and capital S)
Solution2:
This site can also help you. You can use -O0, -O1, or whatever compiler options are suitable to get what you want.
Example from that site: (tested by both solutions)
void maxArray(double* x, double* y) {
for (int i = 0; i < 65536; i++) {
if (y[i] > x[i]) x[i] = y[i];
}
}
Compiler Option -O0:
result:
maxArray(double*, double*):
pushq %rbp
movq %rsp, %rbp
movq %rdi, -24(%rbp)
movq %rsi, -32(%rbp)
movl $0, -4(%rbp)
jmp .L2
.L5:
movl -4(%rbp), %eax
cltq
leaq 0(,%rax,8), %rdx
movq -32(%rbp), %rax
addq %rdx, %rax
movsd (%rax), %xmm0
movl -4(%rbp), %eax
cltq
leaq 0(,%rax,8), %rdx
movq -24(%rbp), %rax
addq %rdx, %rax
movsd (%rax), %xmm1
ucomisd %xmm1, %xmm0
jbe .L3
movl -4(%rbp), %eax
cltq
leaq 0(,%rax,8), %rdx
movq -24(%rbp), %rax
addq %rax, %rdx
movl -4(%rbp), %eax
cltq
leaq 0(,%rax,8), %rcx
movq -32(%rbp), %rax
addq %rcx, %rax
movq (%rax), %rax
movq %rax, (%rdx)
.L3:
addl $1, -4(%rbp)
.L2:
cmpl $65535, -4(%rbp)
jle .L5
popq %rbp
ret
Compiler Option -O1:
result:
maxArray(double*, double*):
movl $0, %eax
.L5:
movsd (%rsi,%rax), %xmm0
ucomisd (%rdi,%rax), %xmm0
jbe .L2
movsd %xmm0, (%rdi,%rax)
.L2:
addq $8, %rax
cmpq $524288, %rax
jne .L5
rep; ret

Not getting Segmentation Fault in C

Here is the C code:
char **s;
s[334]=strdup("test");
printf("%s\n",s[334]);
I know that strdup allocates the storage for "test", but s[334], where we put the pointer to the string, is not itself allocated. However, this code works like a charm.
Your code exhibits undefined behavior. That does not mean it will crash. All it means is that you can't predict anything about what will happen.
A crash is rather likely, but not guaranteed at all, in this case.
"Undefined behaviour" doesn't mean you'll get a segfault, it means you might get a segfault. A conforming implementation might also decide to display ASCII art of a puppy.
You might like to check this code with a tool like Valgrind.
You don't always get a segmentation fault when you access uninitialized memory.
You do access uninitialized memory here.
The compiler is too smart for us! It knows that printf("%s\n", some_string) is exactly the same as puts(some_string), so it can simplify
char **s;
s[334]=strdup("test");
printf("%s\n",s[334]);
into
char **s;
s[334]=strdup("test");
puts(s[334]);
and then (assuming no UB) that is again equivalent to
puts(strdup("test"));
So, by chance, the segmentation fault didn't happen (this time).
I get a segfault without optimisations, but when compiled with optimisations, gcc doesn't bother with the s at all, it's eliminated as dead code.
gcc -Os -S:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $.LC0, %edi # .LC0 is where "test" is at
call strdup
addq $8, %rsp
.cfi_def_cfa_offset 8
movq %rax, %rdi
jmp puts
.cfi_endproc
gcc -S -O (same for -O2, -O3):
.LFB23:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $5, %edi
call malloc
movq %rax, %rdi
testq %rax, %rax
je .L2
movl $1953719668, (%rax)
movb $0, 4(%rax)
.L2:
call puts
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
