Why does LLVM add two extra instructions for the same program?

Why does LLVM add two extra instructions for the same program? - c

I am compiling this C program and comparing the generated assembly code:
int main(){ return 0; }
GCC gives this main function (cc hello.c -S):
_main:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
movl $0, %eax
leave
ret
LLVM gives this main function (clang hello.c -S):
_main:
Leh_func_begin0:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
movl $0, %eax
movl $0, -4(%rbp)
popq %rbp
ret
Leh_func_end0:
What are movl $0, -4(%rbp) and popq %rbp needed for? Moving something on the stack and popping it directly afterwards seems useless to me.

The movl $0, -4(%rbp) instruction is dead, because this is unoptimized code. Try passing in -O to both compilers to see what changes.

Actually, they're comparable. Leave is a high level instruction:
From the Intel manual:
16-bit: C9 LEAVE A Valid Valid Set SP to BP, then pop BP.
32-bit: C9 LEAVE A N.E. Valid Set ESP to EBP, then pop EBP.
64-bit: C9 LEAVE A Valid N.E. Set RSP to RBP, then pop RBP.
basically, leave is equivalent to
movq %rbp, %rsp
popq %rbp

It looks like LLVM is using a traditional function prolog/epilog, whereas GCC is taking advantage of the fact that the entry point doesn't need to clean up

Related

can't find assembly for pre-written functions

I am a huge fan of network protocols and libnet, which is why I've been trying to imitate some network protocols that are not included by libnet. Capturing packets, imitating headers etc works so far. Now I need a way to actually write these exact packets to my network card. I've tried libnet_adv_write_rawipv4() and -link(), both won't work. I can't cull the headers with libnet_adv_cull_header() because of the stupid errors and bugs. So I figured, that the problem could be solved with a little assembly: get the assembly code for the actual libnet_build() and libnet_write() call, alter some bytes and voila: raw bytes get written to the network card. So I have written a dummy program:
#include <stdio.h>
#include <stdlib.h>
#include <libnet.h>
int main() {
libnet_t *l;
l = libnet_init(LIBNET_RAW4, 0, NULL);
libnet_build_tcp(2000, 450, 0, 1234, TH_SYN, 254, 0, NULL, LIBNET_TCP_H + 5,
"aaaaa", 5, l, 0);
libnet_build_ipv4(LIBNET_TCP_H + LIBNET_IPV4_H + 5, 0, 1, 0, 64, 6, 0,
2186848448, 22587584, NULL, 0, l, 0);
libnet_write(l);
return 0;
}
Works so far. Now I got the assembly version of the program using
gcc -o program program.c -S
And this is where the actual problem starts:
.LC0:
.string "aaaaa"
.text
.globl main
.type main, #function
main:
.LFB2:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $0, %edx
movl $0, %esi
movl $1, %edi
call libnet_init
movq %rax, -8(%rbp)
subq $8, %rsp
pushq $0
pushq -8(%rbp)
pushq $5
pushq $.LC0
pushq $25
pushq $0
pushq $0
movl $254, %r9d
movl $2, %r8d
movl $1234, %ecx
movl $0, %edx
movl $450, %esi
movl $2000, %edi
call libnet_build_tcp
addq $64, %rsp
subq $8, %rsp
pushq $0
pushq -8(%rbp)
pushq $0
pushq $0
pushq $22587584
pushq $-2108118848
pushq $0
movl $6, %r9d
movl $64, %r8d
movl $0, %ecx
movl $1, %edx
movl $0, %esi
movl $45, %edi
call libnet_build_ipv4
addq $64, %rsp
movq -8(%rbp), %rax
movq %rax, %rdi
call libnet_write
movl $0, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE2:
.size main, .-main
See this?
call libnet_build_ipv4
I can't copy the assembly code of these build() or write() calls, because all there is is a reference to them. Now, where would I find the assembly code for these pre-written functions included in libnet-functions.h (libnet_build_ipv4(), libnet_build_tcp(), libnet_write()) ?????

GDB is your friend in situations like this. You don't say anything about what platform you're on, the following example works on Ubuntu, but should work similarly on other distributions.
First, make sure that you have debug-symbols for libnet installed:
sudo apt install libnet1-dbg
Find out where libnet is installed:
~$ dpkg -L libnet1 | grep \.so
/usr/lib/x86_64-linux-gnu/libnet.so.1.7.0
/usr/lib/x86_64-linux-gnu/libnet.so.1
Open it (or your own application) with GDB:
~$ gdb /usr/lib/x86_64-linux-gnu/libnet.so.1.7.0
Reading symbols from /usr/lib/x86_64-linux-gnu/libnet.so.1.7.0...Reading symbols from /usr/lib/debug//usr/lib/x86_64-linux-gnu/libnet.so.1.7.0...done.
done.
Use the disassemble command to inspect anything you like:
(gdb) disassemble libnet_build_ipv4
Dump of assembler code for function libnet_build_ipv4:
0x0000000000007d60 <+0>: push %r15
0x0000000000007d62 <+2>: push %r14
0x0000000000007d64 <+4>: push %r13
0x0000000000007d66 <+6>: push %r12
0x0000000000007d68 <+8>: push %rbp
0x0000000000007d69 <+9>: push %rbx
0x0000000000007d6a <+10>: sub $0x48,%rsp
0x0000000000007d6e <+14>: mov 0xa8(%rsp),%rbx
0x0000000000007d76 <+22>: mov %edx,0x8(%rsp)
0x0000000000007d7a <+26>: mov %fs:0x28,%rax
0x0000000000007d83 <+35>: mov %rax,0x38(%rsp)
0x0000000000007d88 <+40>: xor %eax,%eax
0x0000000000007d8a <+42>: mov %ecx,0x14(%rsp)
0x0000000000007d8e <+46>: mov 0x80(%rsp),%r14d
0x0000000000007d96 <+54>: test %rbx,%rbx
0x0000000000007d99 <+57>: mov 0x98(%rsp),%r15
0x0000000000007da1 <+65>: je 0x810a <libnet_build_ipv4+938>
0x0000000000007da7 <+71>: mov %esi,%r13d
0x0000000000007daa <+74>: mov 0xb0(%rsp),%esi
0x0000000000007db1 <+81>: mov %edi,%ebp
0x0000000000007db3 <+83>: mov $0xd,%ecx
0x0000000000007db8 <+88>: mov $0x14,%edx
0x0000000000007dbd <+93>: mov %rbx,%rdi
0x0000000000007dc0 <+96>: mov %r9d,0x1c(%rsp)
0x0000000000007dc5 <+101>: mov %r8d,0x18(%rsp)
0x0000000000007dca <+106>: callq 0xea10 <libnet_pblock_probe>
0x0000000000007dcf <+111>: test %rax,%rax
---Type <return> to continue, or q <return> to quit---q
Quit
(gdb)

Unexplained code generation for global register variables on mingw-w64

In our project we make use of global register variables. In particular, we use %r12, %r13, %r14 for 64-bit and %esi, %edi for 32-bit code.
For example:
register void * my_var asm ("r12");
These global vars are accessed from different modules (.c files).
According to the ABI (http://www.x86-64.org/documentation/abi.pdf), these regs “belong” to the calling function, and the called function is required to preserve their values.
For mingw64, we can see these regs are saved on the stack before any call are made, even if that call doesn't use these regs inside. However, this doesn't occur when we compile using gcc on linux. Has anyone run into this or understand why this may be?
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
subq $40, %rsp
movq %rcx, %rbx
xorl %ecx, %ecx
call my_func
testl %eax, %eax
je .L40
movq 168(%rbx), %rax
addq $40, %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
ret

Setting up local stack according to x86-64 calling convention on linux

I am doing some extended assembly optimization on gnu C code running on 64 bit linux. I wanted to print debugging messages from within the assembly code and that's how I came accross the following. I am hoping someone can explain what I am supposed to do in this situation.
Take a look at this sample function:
void test(int a, int b, int c, int d){
__asm__ volatile (
"movq $0, %%rax\n\t"
"pushq %%rax\n\t"
"popq %%rax\n\t"
:
:"m" (a)
:"cc", "%rax"
);
}
Since the four agruments to the function are of class INTEGER, they will be passed through registers and then pushed onto the stack. The strange thing to me is how gcc actually does it:
test:
pushq %rbp
movq %rsp, %rbp
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl %edx, -12(%rbp)
movl %ecx, -16(%rbp)
movq $0, %rax
pushq %rax
popq %rax
popq %rbp
ret
The passed arguments are pushed onto the stack, but the stack pointer is not decremented. Thus, when I do pushq %rax, the values of a and b are overwritten.
What I am wondering: is there a way to ask gcc to properly set up the local stack? Am I simply not supposed to use push and pop in function calls?

x86-64 abi provides a 128 byte red zone under the stack pointer, and the compiler decided to use that. You can turn that off using -mno-red-zone option.

Why GCC didn't optimize this tail call?

I have the code working with lined lists. I use tail calls. Unfortunately, GCC does not optimise the calls.
Here is C code of the function that recursively calculates length of the linked list:
size_t ll_length(const ll_t* list) {
return ll_length_rec(list, 0);
}
size_t ll_length_rec(const ll_t* list, size_t size_so_far)
{
if (list) {
return ll_length_rec(list->next, size_so_far + 1);
} else {
return size_so_far;
}
}
and here is the assembler code:
.globl _ll_length_rec
_ll_length_rec:
LFB8:
.loc 1 47 0
pushq %rbp
LCFI6:
movq %rsp, %rbp
LCFI7:
subq $32, %rsp
LCFI8:
movq %rdi, -8(%rbp)
movq %rsi, -16(%rbp)
.loc 1 48 0
cmpq $0, -8(%rbp)
je L8
.loc 1 49 0
movq -16(%rbp), %rsi
incq %rsi
movq -8(%rbp), %rax
movq 8(%rax), %rdi
call _ll_length_rec # < THIS SHOUD BE OPTIMIZED
movq %rax, -24(%rbp)
jmp L10
If GCC would optimize it, there would be no call in the asm. I compile it with:
gcc -S -fnested-functions -foptimize-sibling-calls \
-03 -g -Wall -o llist llist.c
and GCC version is:
i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5666) (dot 3)

If I add -O3 to your compilation line, it does not seem to generate the offending call, while without it, I get the unoptimised call. I don't know all gcc options in my head, but is -03 a typo for -O3 or intentional?
Ltmp2:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
jmp LBB1_1
.align 4, 0x90
LBB1_3:
addq $2, %rsi
Ltmp3:
movq (%rax), %rdi
Ltmp4:
LBB1_1:
Ltmp5:
testq %rdi, %rdi
je LBB1_5
Ltmp6:
movq (%rdi), %rax
testq %rax, %rax
jne LBB1_3
incq %rsi
LBB1_5:
movq %rsi, %rax
Ltmp7:
Ltmp8:
popq %rbp
ret

Most likely because neither of your functions are declared as static, which means that the symbols must be visible to the linker in case any other compilation units need them at link time. Try to compile with the -fwhole-program flag and see what happens.

Probably depends on the version of GCC and specific build. This is what I get from GCC 3.4.4 on Windows starting from -O2 and up
.globl _ll_length_rec
.def _ll_length_rec; .scl 2; .type 32; .endef
_ll_length_rec:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
movl 12(%ebp), %eax
jmp L3
.p2align 4,,7
L6:
movl (%edx), %edx
incl %eax
L3:
testl %edx, %edx
jne L6
popl %ebp
ret

Why doesn't the compiler allocate and deallocate local var with "sub" and "add" on the stack?

According to some textbooks, the compiler will use sub* to allocate memory for local variables.
For example, I write a Hello World program:
int main()
{
puts("hello world");
return 0;
}
I guess this will be compiled to some assembly code on the 64 bit OS:
subq $8, %rsp
movq $.LC0, (%rsp)
calq puts
addq $8, %rsp
The subq allocates 8 byte memory (size of a point) for the argument and the addq deallocates it.
But when I input gcc -S hello.c (I use the llvm-gcc on Mac OS X 10.8), I get some assembly code.
.section __TEXT,__text,regular,pure_instructions
.globl _main
.align 4, 0x90
_main:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
subq $16, %rsp
Ltmp2:
xorb %al, %al
leaq L_.str(%rip), %rcx
movq %rcx, %rdi
callq _puts
movl $0, -8(%rbp)
movl -8(%rbp), %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %eax
addq $16, %rsp
popq %rbp
ret
.......
L_.str:
.asciz "hello world!"
Around this callq without any addq and subq. Why? And what is the function of addq $16, %rsp?
Thanks for any input.

You don't have any local variables in your main(). All you may have in it is a pseudo-variable for the parameter passed to puts(), the address of the "hello world" string.
According to your last disassembly, the calling conventions appear to be such that the first parameter to puts() is passed in the rdi register and not on the stack, which is why there isn't any stack space allocated for this parameter.
However, since you're compiling your program with optimization disabled, you may encounter some unnecessary stack space allocations and reads and writes to and from that space.
This code illustrates it:
subq $16, %rsp ; allocate some space
...
movl $0, -8(%rbp) ; write to it
movl -8(%rbp), %eax ; read back from it
movl %eax, -4(%rbp) ; write to it
movl -4(%rbp), %eax ; read back from it
addq $16, %rsp
Those four mov instructions are equivalent to just one simple movl $0, %eax, no memory is needed to do that.
If you add an optimization switch like -O2 in your compile command, you'll see more meaningful code in the disassembly.
Also note that some space allocations may be needed solely for the purpose of keeping the stack pointer aligned, which improves performance or avoids issues with misaligned memory accesses (you could get the #AC exception on misaligned accesses if it's enabled).
The above code shows it too. See, those four mov instructions only use 8 bytes of memory, while the add and sub instructions grow and shrink the stack by 16.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Why does LLVM add two extra instructions for the same program? - c

The movl $0, -4(%rbp) instruction is dead, because this is unoptimized code. Try passing in -O to both compilers to see what changes.

It looks like LLVM is using a traditional function prolog/epilog, whereas GCC is taking advantage of the fact that the entry point doesn't need to clean up

Related

can't find assembly for pre-written functions

Unexplained code generation for global register variables on mingw-w64

Setting up local stack according to x86-64 calling convention on linux

Why GCC didn't optimize this tail call?

Why doesn't the compiler allocate and deallocate local var with "sub" and "add" on the stack?

Categories

Resources

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Why does LLVM add two extra instructions for the same program? - c

The movl $0, -4(%rbp) instruction is dead, because this is unoptimized code. Try passing in -O to both compilers to see what changes.

It looks like LLVM is using a traditional function prolog/epilog, whereas GCC is taking advantage of the fact that the entry point doesn't need to clean up

Related

can't find assembly for pre-written functions

Unexplained code generation for global register variables on mingw-w64

Setting up local stack according to x86-64 calling convention on linux

Why GCC didn't optimize this tail call?

Why doesn't the compiler allocate and deallocate local var with "sub*" and "add*" on the stack?

Categories

Resources

Why doesn't the compiler allocate and deallocate local var with "sub" and "add" on the stack?