Trying to understand the gcc assembly output for printf()

I'm trying to learn how to understand assembly code so I've been studying the assembly output of GCC for some stupid programs. One of them was nothing but int i = 0;, the code of which I more or less fully understand now (the biggest struggle was understanding the GAS directives strewn about). Anyway, I went a step forward and added printf("%d\n", i); to see if I could understand that and suddenly the code is much more chaotic.
.file "helloworld.c"
.text
.section .rodata.str1.1,"aMS",@progbits,1
.LC0:
.string "%d\n"
.section .text.startup,"ax",@progbits
.p2align 4
.globl main
.type main, @function
main:
subq $8, %rsp
xorl %edx, %edx
leaq .LC0(%rip), %rsi
xorl %eax, %eax
movl $1, %edi
call __printf_chk@PLT
xorl %eax, %eax
addq $8, %rsp
ret
.size main, .-main
.ident "GCC: (Gentoo 10.2.0-r3 p4) 10.2.0"
.section .note.GNU-stack,"",@progbits
I'm compiling this with gcc -S -O3 -fno-asynchronous-unwind-tables to remove the .cfi directives; -O2 produces the same code, so -O3 is overkill. My understanding of assembly is quite limited, but it seems to me that the compiler is doing a lot of unnecessary work here. Why subtract 8 from rsp and then add it back? Why is it performing so many xors? There's only one variable. What is movl $1, %edi doing? I thought maybe the compiler was doing something stupid in an attempt to optimize, but as I said, it's not optimizing beyond -O2, and it performs all of these operations even at -O1. To be honest, I don't understand the unoptimized code at all, so I assume it's inefficient.
The only thing that comes to mind is that the call to printf uses these registers, otherwise they are unused and serve no purpose. Is that actually the case? If so, how is it possible to tell?
Thanks in advance. I'm reading a book on compiler design at the moment and I've read most of the GCC manual (I read the whole chapter on optimization) and I've read some introductory x86_64 asm material, if somebody could point me toward some other resources (besides the Intel x86 manual) for learning more I would also appreciate that.

For the compiler that you are using, it looks like printf(...) is mapped to __printf_chk(1, ...).
To understand the code, you need to understand the parameter passing conventions for the platform (part of the ABI). Once you know that the first six integer or pointer parameters are passed in %rdi, %rsi, %rdx, %rcx, %r8 and %r9, in that order, you can understand most of what is going on:
subq $8, %rsp ; allocate 8 bytes of stack
xorl %edx, %edx ; i = 0 ; put it in the 3rd parameter for __printf_chk
leaq .LC0(%rip), %rsi ; 2nd parameter for __printf_chk. The: "%d\n"
xorl %eax, %eax ; 0 variadic fp params
movl $1, %edi ; 1st parameter for __printf_chk
call __printf_chk@PLT ; call __printf_chk through its PLT stub (resolved by the dynamic linker)
xorl %eax, %eax ; return 0 from main
addq $8, %rsp ; deallocate 8 bytes of stack.
ret
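Nate points out in the comments that section 3.5.7 in the ABI explains the %eax = 0: for a call to a variadic function, %al holds an upper bound on the number of vector registers used, and here there are no floating-point variadic parameters.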
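To see the register assignment and the %al rule from the C side, here is a minimal sketch. The function names print_int and print_double are made up for illustration, and the register comments assume the x86-64 System V ABI. The mapping of printf to __printf_chk only happens when the toolchain enables _FORTIFY_SOURCE, which the Gentoo compiler in the question appears to do.

```c
#include <stdio.h>

/* Hypothetical helper mirroring the question's call: printf("%d\n", i).
 * Under the x86-64 System V ABI the format pointer goes in %rdi and i in
 * %esi, and %al is 0 because no vector register carries an argument.
 * With _FORTIFY_SOURCE the call becomes __printf_chk(flag, fmt, i), which
 * shifts every argument one register along (%edi, %rsi, %edx). */
int print_int(int i)
{
    return printf("%d\n", i);
}

/* Same call shape with one double: d travels in %xmm0, so the compiler
 * sets %al to 1 before the call instead of zeroing it. */
int print_double(double d)
{
    return printf("%f\n", d);
}
```

Compiling this with gcc -S -O2 and comparing the two functions should show the xorl %eax, %eax of the first replaced by a load of 1 into %eax in the second.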

Related

GCC check if an expression will execute in a constant time

Let's say I have the (x << n) | (x >> (-n & 63)) expression.
There is nothing conditional in it.
So, to my understanding, it will be executed in constant time.
Indeed, when I compile the following code using gcc -O3 -S:
#include <stdint.h>
// rotate left x by n places assuming n < 64
uint64_t rotl64(uint64_t x, uint8_t n) {
return (x << n) | (x >> (-n & 63));
}
I get, on linux/amd64, the following output (which executes in constant time):
.file "test.c"
.text
.p2align 4
.globl rotl64
.type rotl64, @function
rotl64:
.LFB0:
.cfi_startproc
movq %rdi, %rax
movl %esi, %ecx
rolq %cl, %rax
ret
.cfi_endproc
.LFE0:
.size rotl64, .-rotl64
.ident "GCC: (Alpine 9.3.0) 9.3.0"
.section .note.GNU-stack,"",@progbits
However, on linux/386 I get an output that contains conditional jumps:
.file "test.c"
.text
.p2align 4
.globl rotl64
.type rotl64, @function
rotl64:
.LFB0:
.cfi_startproc
pushl %edi
.cfi_def_cfa_offset 8
.cfi_offset 7, -8
pushl %esi
.cfi_def_cfa_offset 12
.cfi_offset 6, -12
movl 12(%esp), %eax
movl 16(%esp), %edx
movzbl 20(%esp), %ecx
movl %eax, %esi
movl %edx, %edi
shldl %esi, %edi
sall %cl, %esi
testb $32, %cl
je .L4
movl %esi, %edi
xorl %esi, %esi
.L4:
negl %ecx
andl $63, %ecx
shrdl %edx, %eax
shrl %cl, %edx
testb $32, %cl
je .L5
movl %edx, %eax
xorl %edx, %edx
.L5:
orl %esi, %eax
orl %edi, %edx
popl %esi
.cfi_restore 6
.cfi_def_cfa_offset 8
popl %edi
.cfi_restore 7
.cfi_def_cfa_offset 4
ret
.cfi_endproc
.LFE0:
.size rotl64, .-rotl64
.ident "GCC: (Alpine 9.3.0) 9.3.0"
.section .note.GNU-stack,"",@progbits
From what I understand, the 64-bit operations have to be emulated here, hence the need for conditional jumps.
Does GCC provide a builtin function that indicates if an expression will be compiled with no jumps?
If it isn't the case, how can I know if an expression will be executed in constant time?
Is this a problem for timing sensitive applications like security?

Does GCC provide a builtin function that indicates if an expression will be compiled with no jumps?
No.
If it isn't the case, how can I know if an expression will be executed in constant time?
By looking at the generated assembly code.
Is this a problem for timing sensitive applications like security?
Yes. That is why, in these cases, you should not trust the compiler (or the porters and package builders who change compiler settings) and should implement the code in assembly instead.
Some libcs do provide constant-time functions, for example timingsafe_bcmp and timingsafe_memcmp in OpenBSD and FreeBSD. These are written in pure C, so their authors have to trust that packagers will not break them by changing compiler settings, which is what Debian or Ubuntu are assumed to do.
Many other such functions live in the various security libraries themselves, but even there you should not assume they are constant time without checking. In my view that certainly applies to OpenSSL and libsodium in many cases.
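As an illustration of what such a pure C routine looks like, here is a minimal sketch in the spirit of timingsafe_bcmp. The name ct_differs is made up, and this is not the OpenBSD implementation. It only shows the usual pattern: touch every byte, accumulate differences with bitwise operations, and avoid branching on secret data at the C level.

```c
#include <stddef.h>

/* Hypothetical helper in the spirit of timingsafe_bcmp: returns 0 when the
 * two buffers are equal and 1 otherwise. The loop always runs len times and
 * contains no data-dependent branch at the C level. Nothing in the C
 * standard stops a compiler from reintroducing one, so the generated
 * assembly still has to be inspected for each target and option set. */
int ct_differs(const void *a, const void *b, size_t len)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    unsigned char acc = 0;

    for (size_t i = 0; i < len; i++)
        acc |= (unsigned char)(pa[i] ^ pb[i]);  /* accumulate any difference */

    /* Collapse acc to 0 or 1 without branching on its value:
     * for any nonzero acc, either acc or its 8-bit negation has bit 7 set. */
    return (int)((acc | (unsigned char)-acc) >> 7);
}
```

Whether the compiled result is really free of data-dependent jumps still depends on the compiler, target and flags, which is exactly the caveat raised above.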

No, such a function does not exist.
And unless you are writing the compiler (you're not), you should not really care about the actual machine code being generated. The compiler is free to optimize that code any way it sees fit (as long as it is correct), depending on the options you pass in. And with -O3 you should get the fastest code, even with jumps.
If there were a function like the one you suggest, your code would be tied to a single version of a single compiler with a particular set of optimization options. In other words: bye bye portability.

Why does a simple C "hello world" program not work if gcc -O3 without "volatile"?

I have the following C program
int main() {
char string[] = "Hello, world.\r\n";
__asm__ volatile ("syscall;" :: "a" (1), "D" (0), "S" ((unsigned long) string), "d" (sizeof(string) - 1)); }
which I want to run under Linux on 64-bit x86. I call the "write" syscall with 0 as the fd argument; strictly speaking fd 0 is stdin and stdout is fd 1, but writing to fd 0 works when it refers to the terminal.
If I compile under gcc with -O3, it does not work. A look into the assembly code
.file "test_for_o3.c"
.text
.section .text.startup,"ax",@progbits
.p2align 4,,15
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
subq $40, %rsp
.cfi_def_cfa_offset 48
xorl %edi, %edi
movl $15, %edx
movq %fs:40, %rax
movq %rax, 24(%rsp)
xorl %eax, %eax
movq %rsp, %rsi
movl $1, %eax
#APP
# 5 "test_for_o3.c" 1
syscall;
# 0 "" 2
#NO_APP
movq 24(%rsp), %rcx
xorq %fs:40, %rcx
jne .L5
xorl %eax, %eax
addq $40, %rsp
.cfi_remember_state
.cfi_def_cfa_offset 8
ret
.L5:
.cfi_restore_state
call __stack_chk_fail@PLT
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0"
.section .note.GNU-stack,"",@progbits
tells us that gcc has simply not put the string data into the assembly code. Instead, if I declare "string" as "volatile", it works fine.
However, the idea of "volatile" is just to use it for variables that can change their values by (from the view of the executing function) unexpected events, isn't it? "volatile" can make code much slower, hence it should be avoided if possible.
I would have supposed that gcc must assume the content of "string" cannot be ignored, because the pointer "string" is used as an input parameter in the inline assembly (and gcc has no idea what the inline assembly code will do with it).
If this is "allowed" behaviour of gcc, where can I read more about all the formal constraints I have to be aware of when writing code for -O3?
A second question would be what exactly the "volatile" qualifier on the inline assembly statement does. I just got used to marking all inline assembly statements with "volatile" because, in some situations, they had not worked otherwise.
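A likely explanation, going by GCC's extended asm rules: passing a pointer as an input operand tells the compiler only that the pointer value is used, not that the asm reads the memory it points to. Without a "memory" clobber (or an "m" operand covering the buffer), the stores that initialise the array look dead and may be dropped. The sketch below shows one way to express the dependency. The names raw_write and write_hello are made up, fd 1 is used for stdout, and the fallback branch calling write() exists only so the sketch builds on other platforms.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes from buf to fd. */
static long raw_write(int fd, const void *buf, size_t len)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    /* "memory" tells the compiler that the asm may read or write memory,
     * so the buffer contents must be stored before the syscall.
     * The syscall instruction itself overwrites %rcx and %r11. */
    __asm__ volatile ("syscall"
                      : "=a" (ret)
                      : "a" (1L), "D" ((long) fd), "S" (buf), "d" (len)
                      : "rcx", "r11", "memory");
    return ret;
#else
    /* Assumption for portability of the sketch only. */
    return (long) write(fd, buf, len);
#endif
}

/* Returns the number of bytes written, or a negative value on failure. */
long write_hello(void)
{
    char string[] = "Hello, world.\r\n";
    return raw_write(1, string, sizeof(string) - 1);
}
```

With the clobber in place, the array itself does not need to be declared volatile. The volatile on the asm statement serves a different purpose: it keeps the compiler from deleting or hoisting an asm whose outputs it considers unused.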

How do I get rid of call __x86.get_pc_thunk.ax?

I tried to compile and convert a very simple C program to assembly language.
I am using Ubuntu and the OS type is 64 bit.
This is the C Program.
void add();
int main() {
add();
return 0;
}
If I use gcc -S -m32 -fno-asynchronous-unwind-tables -o simple.S simple.c, this is what my assembly source file should look like:
.file "main1.c"
.text
.globl main
.type main, @function
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
call add
movl $0, %eax
movl %ebp, %esp
popl %ebp
ret
.size main, .-main
.ident "GCC: (Debian 4.4.5-8) 4.4.5" // this part should say Ubuntu instead of Debian
.section .note.GNU-stack,"",@progbits
but instead it looks like this:
.file "main0.c"
.text
.globl main
.type main, @function
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ebx
pushl %ecx
call __x86.get_pc_thunk.ax
addl $_GLOBAL_OFFSET_TABLE_, %eax
movl %eax, %ebx
call add@PLT
movl $0, %eax
popl %ecx
popl %ebx
popl %ebp
leal -4(%ecx), %esp
ret
.size main, .-main
.section .text.__x86.get_pc_thunk.ax,"axG",@progbits,__x86.get_pc_thunk.ax,comdat
.globl __x86.get_pc_thunk.ax
.hidden __x86.get_pc_thunk.ax
.type __x86.get_pc_thunk.ax, @function
__x86.get_pc_thunk.ax:
movl (%esp), %eax
ret
.ident "GCC: (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406"
.section .note.GNU-stack,"",@progbits
At my university they told me to use the -m32 flag if I am using a 64-bit Linux version. Can somebody tell me what I am doing wrong?
Am I even using the correct flag?
Edit: this is the output after adding -fno-pie:
.file "main0.c"
.text
.globl main
.type main, @function
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $4, %esp
call add
movl $0, %eax
addl $4, %esp
popl %ecx
popl %ebp
leal -4(%ecx), %esp
ret
.size main, .-main
.ident "GCC: (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406"
.section .note.GNU-stack,"",@progbits
It looks better, but it's not exactly the same.
For example, what does leal mean?

As a general rule, you cannot expect two different compilers to generate the same assembly code for the same input, even if they have the same version number; they could have any number of extra "patches" to their code generation. As long as the observable behavior is the same, anything goes.
You should also know that GCC, in its default -O0 mode, generates intentionally bad code. It's tuned for ease of debugging and speed of compilation, not for either clarity or efficiency of the generated code. It is often easier to understand the code generated by gcc -O1 than the code generated by gcc -O0.
You should also know that the main function often needs to do extra setup and teardown that other functions do not need to do. The instruction leal 4(%esp),%ecx is part of that extra setup. If you only want to understand the machine code corresponding to the code you wrote, and not the nitty-gritty details of the ABI, name your test function something other than main.
(As pointed out in the comments, that setup code is not as tightly tuned as it could be, but it doesn't normally matter, because it's only executed once in the lifetime of the program.)
Now, to answer the question that was literally asked, the reason for the appearance of
call __x86.get_pc_thunk.ax
is because your compiler defaults to generating "position-independent" executables. Position-independent means the operating system can load the program's machine code at any address in (virtual) memory and it'll still work. This allows things like address space layout randomization, but to make it work, you have to take special steps to set up a "global pointer" at the beginning of every function that accesses global variables or calls another function (with some exceptions). It's actually easier to explain the code that's generated if you turn optimization on:
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ebx
pushl %ecx
This is all just setting up main's stack frame and saving registers that need to be saved. You can ignore it.
call __x86.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_, %ebx
The special function __x86.get_pc_thunk.bx loads its return address -- which is the address of the addl instruction that immediately follows -- into the EBX register. Then we add to that address the value of the magic constant _GLOBAL_OFFSET_TABLE_, which, in position-independent code, is the difference between the address of the instruction that uses _GLOBAL_OFFSET_TABLE_ and the address of the global offset table. Thus, EBX now points to the global offset table.
call add@PLT
Now we call add@PLT, which means call add, but jump through the "procedure linkage table" to do it. The PLT takes care of the possibility that add is defined in a shared library rather than the main executable. The code in the PLT uses the global offset table and assumes that you have already set EBX to point to it before calling an @PLT symbol. That's why main has to set up EBX even though nothing appears to use it. If you had instead written something like
extern int number;
int main(void) { return number; }
then you would see a direct use of the GOT, something like
call __x86.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_, %ebx
movl number@GOT(%ebx), %eax
movl (%eax), %eax
We load up EBX with the address of the GOT, then we can load the address of the global variable number from the GOT, and then we actually dereference the address to get the value of number.
If you compile 64-bit code instead, you'll see something different and much simpler:
movl number(%rip), %eax
Instead of all this mucking around with the GOT, we can just load number from a fixed offset from the program counter. PC-relative addressing was added along with the 64-bit extensions to the x86 architecture. Similarly, your original program, in 64-bit position-independent mode, will just say
call add@PLT
without setting up EBX first. The call still has to go through the PLT, but the PLT uses PC-relative addressing itself and doesn't need any help from its caller.
The only difference between __x86.get_pc_thunk.bx and __x86.get_pc_thunk.ax is which register they store their return address in: EBX for .bx, EAX for .ax. I have also seen GCC generate .cx and .dx variants. It's just a matter of which register it wants to use for the global pointer -- it must be EBX if there are going to be calls through the PLT, but if there aren't any then it can use any register, so it tries to pick one that isn't needed for anything else.
Why does it call a function to get the return address? Older compilers would do this instead:
call 1f
1: pop %ebx
but that screws up return-address prediction, so nowadays the compiler goes to a little extra trouble to make sure every call is paired with a ret.

The extra junk you're seeing is due to your version of GCC special-casing main to compensate for possibly-broken entry point code starting it with a misaligned stack. I'm not sure how to disable this or if it's even possible, but renaming the function to something other than main will suppress it for the sake of your reading.
After renaming to xmain I get:
xmain:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
call add
movl $0, %eax
leave
ret
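For reference, the renaming that both answers suggest looks like this in C. The name xmain comes from the listing above. The empty definition of add is an assumption added only so that the sketch links and runs on its own.

```c
/* Stand-in definition so that the sketch links and runs on its own. In the
 * question add() lives in another file; keep only the declaration here to
 * see a real external call in the listing. */
void add(void)
{
}

/* Any name other than main avoids the extra stack-realignment prologue
 * that GCC emits for main in 32-bit code. */
int xmain(void)
{
    add();
    return 0;
}
```

Compiling it with gcc -S -m32 -fno-pie -fno-asynchronous-unwind-tables should give a listing close to the xmain one above, though the exact instructions vary with the GCC version and distribution patches.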

memcopying data off the stack in C

I was putting together a C riddle for a couple of my friends when a friend drew my attention to the fact that the following snippet (which happens to be part of the riddle I'd been writing) ran differently when compiled and run on OSX
#include <stdio.h>
#include <string.h>
int main()
{
int a = 10;
volatile int b = 20;
volatile int c = 30;
int data[3];
memcpy(&data, &a, sizeof(data));
printf("%d %d %d\n", data[0], data[1], data[2]);
}
What you'd expect the output to be is 10 20 30, which happens to be the case under Linux, but when the code is built under OSX you'd get 10 followed by two random numbers. After some debugging and looking at the compiler-generated assembly I came to the conclusion that this is due to how the stack is built. I am by no means an assembly expert, but the assembly code generated on Linux seems pretty straightforward to understand while the one generated on OSX threw me off a little. Perhaps I could use some help from here.
This is the code that was generated on Linux:
.file "code.c"
.section .text.unlikely,"ax",@progbits
.LCOLDB0:
.section .text.startup,"ax",@progbits
.LHOTB0:
.p2align 4,,15
.globl main
.type main, @function
main:
.LFB23:
.cfi_startproc
movl $10, -12(%rsp)
xorl %eax, %eax
movl $20, -8(%rsp)
movl $30, -4(%rsp)
ret
.cfi_endproc
.LFE23:
.size main, .-main
.section .text.unlikely
.LCOLDE0:
.section .text.startup
.LHOTE0:
.ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609"
.section .note.GNU-stack,"",@progbits
And this is the code that was generated on OSX:
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 12
.globl _main
.p2align 4, 0x90
_main: ## @main
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp2:
.cfi_def_cfa_register %rbp
subq $16, %rsp
movl $20, -8(%rbp)
movl $30, -4(%rbp)
leaq L_.str(%rip), %rdi
movl $10, %esi
xorl %eax, %eax
callq _printf
xorl %eax, %eax
addq $16, %rsp
popq %rbp
retq
.cfi_endproc
.section __TEXT,__cstring,cstring_literals
L_.str: ## @.str
.asciz "%d %d %d\n"
.subsections_via_symbols
I'm really only interested in two questions here.
Why is this happening?
Are there any workarounds for this issue?
I know this is not a practical way to utilize the stack as I'm a professional C developer, which is really the only reason I found this problem interesting to invest some of my time into.

Accessing memory past the end of a declared variable is undefined behaviour - there is no guarantee as to what will happen when you try to do that. Because of how the compiler generated the assembly under Linux, you happened to get the 3 variables directly in a row on the stack, however that behaviour is just a coincidence - the compiler could legally add extra data in between the variables on the stack or really do anything - the result is not defined by the language standard. So in answer to your first question, it's happening because what you're doing is not part of the language by design. In answer to your second, there's no way to reliably get the same result from multiple compilers because the compilers are not programmed to reliably reproduce undefined behaviour.
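If the goal is simply to copy three adjacent ints, the well-defined way is to make them one object. The sketch below uses an array, whose elements the language guarantees to be contiguous. The name copy_three is made up for illustration.

```c
#include <string.h>

/* Hypothetical helper: fills out[0..2] with 10, 20, 30. The source values
 * live in one array object, so the 3 * sizeof(int) bytes that memcpy reads
 * are all inside that object and the copy is well defined everywhere. */
void copy_three(int out[3])
{
    const int src[3] = { 10, 20, 30 };
    memcpy(out, src, sizeof src);
}
```

This gives the same result with any conforming compiler, because nothing depends on how separate local variables happen to be laid out on the stack.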

This is undefined behavior. You can't expect to copy 10, 20, 30; the most you can hope for is not to segfault.
There is nothing to guarantee that a,b, and c are sequential memory addresses, which is your naive assumption. On Linux, the compiler happened to make them sequential. You can't even rely on gcc always doing that.

You already know that the behavior is undefined. A good reason for the behavior to differ between OS X and Linux is that these systems use different compilers, which generate different code:
When you run gcc on Linux, you invoke the installed version of the GNU C compiler.
When you run gcc on your version of OS X, you most likely invoke the installed version of clang.
Try gcc --version on both systems and amaze your friends.

Printing multiple values in Assembly

So I do have some assembly code which I wrote on my linux VM (Manjaro, x86_64). It looks like this:
.section .rodata
.LC0:
.string "The value of a is: %d, of b: %d"
.text
.globl main
.type main, @function
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl $15, -4(%rbp)
movl $20, -8(%rbp)
movl -8(%rbp), %edx
movl -4(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
leave
ret
Basically I want to put 2 values in registers, then somehow print them (formatted as in .LC0). Well, I got stuck, so I just wrote a C program and used gcc -S to see how it looks. It gave me something similar to the code above. I don't understand two things:
If I store 20 in %edx and 15 in %eax, then why does passing only %eax to %esi cause printf to print the values from both %eax and %edx?
Why do I have to put a zero constant every time before and after printf (as gcc does)?

Why do I have to put a zero constant every time before and after printf
These are two different issues.
The zero before printf is required by the x86-64 (a.k.a. AMD64) System V ABI: for a call to a variadic function, %al specifies an upper bound on the number of vector registers (XMMn, YMMn, ...) used to pass arguments.
The zero after printf is the return value of main (most likely from a return 0 at its end).
why passing only %eax to %esi causes printf to print the values both from %eax and %edx?
It does not.
The same ABI specifies that the first argument (the printf format string pointer) goes in %rdi, the second argument (the first variable argument) in %rsi, the third in %rdx, and so on. Your code loads 20 into %edx directly, so it is already in place as the third argument; only 15 has to be moved from %eax to %esi. The extra moves of arguments are an artifact of non-optimized (-O0) gcc output. If you add any optimization (even -Og), you'll see these pointless moves disappear.
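For comparison, this is roughly the C call that the hand-written assembly corresponds to, with the register assignment spelled out in comments. The function name print_pair is made up for illustration, and the register mapping assumes the x86-64 System V ABI.

```c
#include <stdio.h>

/* Hypothetical helper equivalent to the printf call in the question.
 * Per the x86-64 System V ABI, at the call site:
 *   format string -> %rdi   (first argument)
 *   a             -> %esi   (second argument)
 *   b             -> %edx   (third argument)
 *   %al = 0                 (no vector registers used by this variadic call)
 * printf returns the number of characters written. */
int print_pair(int a, int b)
{
    return printf("The value of a is: %d, of b: %d", a, b);
}
```

Calling print_pair(15, 20) passes the same values as the assembly in the question, and compiling it with gcc -S -Og gives a listing without the redundant moves to compare against.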