Why doesn't gcc reference the PLT for function calls? - c

I'm trying to learn assembly by compiling simple functions and looking at the output.
I'm looking at calling functions in other libraries. Here's a toy C function that calls a function defined elsewhere:
void give_me_a_ptr(void*);
void foo() {
give_me_a_ptr("foo");
}
Here's the assembly produced by gcc:
$ gcc -Wall -Wextra -g -O0 -c call_func.c
$ objdump -d call_func.o
call_func.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <foo>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: bf 00 00 00 00 mov $0x0,%edi
9: e8 00 00 00 00 callq e <foo+0xe>
e: 90 nop
f: 5d pop %rbp
10: c3 retq
I was expecting something like call <give_me_a_ptr@plt>. Why is this jumping to a relative position before it even knows where give_me_a_ptr is defined?
I'm also puzzled by mov $0, %edi. This looks like it's passing a null pointer -- surely mov $address_of_string, %rdi would be correct here?

You're not building with symbol-interposition enabled (a side-effect of -fPIC), so the call destination address can potentially be resolved at link time to an address in another object file that is being statically linked into the same executable. (e.g. gcc foo.o bar.o).
However, if the symbol is only found in a library that you're dynamically linking to (gcc foo.o -lbar), the call has to be indirected through the PLT, because the function's final address isn't known until the shared library is loaded at run time (and to support symbol interposition).
Now this is the tricky part: without -fPIC or -fPIE, gcc still emits asm that calls the function directly:
int puts(const char*); // puts exists in libc, so we can link this example
void call_puts(void) { puts("foo"); }
# gcc 5.3 -O3 (without -fPIC)
movl $.LC0, %edi # absolute 32bit addressing: slightly smaller code, because static data is known to be in the low 2GB, in the default "small" code model
jmp puts # tail-call optimization. Same as call puts/ret, except for stack alignment
But if you look at the linked binary:
(on this Godbolt compiler explorer link, click the "binary" button to toggle between gcc -S asm output and objdump -dr disassembly)
# disassembled linker output
mov $0x400654,%edi
jmpq 400490 <puts@plt>
During linking, the call to puts was "magically" replaced with indirection through puts@plt, and a puts@plt definition is present in the linked executable.
I don't know the details of how this works, but it's done at link time when linking to a shared library. Crucially, it doesn't require anything in the header files to mark the function prototype as being in a shared library. You get the same results from including <stdio.h> as you do from declaring puts yourself. (This is not recommended; it's probably legal for a C implementation to only work properly with the declarations in headers. It happens to work on Linux, though.)
When compiling a position-independent executable (with -fPIE), the linked binary jumps to puts through the PLT, just as it does without -fPIC. However, the compiler asm output is different (try it yourself on the godbolt link above):
call_puts: # compiled with -fPIE
leaq .LC0(%rip), %rdi # RIP-relative addressing for static data
jmp puts@PLT
The compiler forces indirection through the PLT for any calls to functions it can't see the definition for. I don't understand why. In PIE mode, we're compiling code for an executable, not a shared library. The linker should be able to link multiple object files into a position-independent executable with direct calls between functions defined in the executable. I'm testing on Linux (my desktop and godbolt), not OS X, where I assume gcc -fPIE is the default. It might be configured differently, IDK.
With -fPIC instead of -fPIE, things are even worse: even calls to global functions defined within the same compilation unit have to go through the PLT, to support symbol interposition. (e.g. LD_PRELOAD=intercept_some_functions.so ./a.out)
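For a concrete picture of what such an interposer might look like, here's a minimal sketch (the file name matches the hypothetical one above, and some_library_function is a made-up global function exported by a shared library the program uses). Built with -shared -fPIC and preloaded, this definition is found ahead of the library's own, and because the library's internal calls also go through the PLT under -fPIC, even those get intercepted:
/* intercept_some_functions.c -- minimal LD_PRELOAD interposer sketch.
   Build:  gcc -shared -fPIC intercept_some_functions.c -o intercept_some_functions.so -ldl
   Run:    LD_PRELOAD=./intercept_some_functions.so ./a.out                        */
#define _GNU_SOURCE
#include <stdio.h>
#include <dlfcn.h>

void some_library_function(void)
{
    /* Look up the next definition in search order, i.e. the original one. */
    void (*real_fn)(void) =
        (void (*)(void)) dlsym(RTLD_NEXT, "some_library_function");

    puts("some_library_function() intercepted");
    if (real_fn)
        real_fn();              /* forward to the library's own definition */
}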
The differences between -fPIC and -fPIE are mainly that PIE can assume no symbol interposition for functions in the same compilation unit, but PIC can't. OS X requires position-independent executables, as well as shared libraries, but there is a difference in what the compiler can do when making code for a library vs. making code for an executable.
This Godbolt example has some more functions that demonstrate stuff about PIC and PIE mode, e.g. that call_puts() can't inline into another function in PIC mode, only PIE.
See also: Shared object in Linux without symbol interposition, -fno-semantic-interposition error.
puzzled by mov $0, %edi
You're looking at disassembly output from the .o, where addresses are just placeholder 0s that will be replaced by the linker at link time, based on the relocation information in the ELF object file. That's why @Leandros suggested objdump -r.
Similarly, the relative displacement in the call machine code is all-zeros, because the linker hasn't filled it in yet.

I'm still studying this linking process myself, but wanted to restate something in my own words. The function calls routed through the PLT are not all resolved to their final addresses by the time execution starts; doing that eagerly could add a lot of startup time, and not every function reached through the PLT will necessarily be called at all. So, under 'lazy binding', the very first time a function is called through its PLT entry, the stub jumps to the dynamic linker's resolver. The resolver finds the real address of the function and writes it into that function's GOT slot (the slot the PLT stub jumps through), so every later call through the same PLT entry goes straight to the real function and the resolver is never involved again. This might be why the PLT entry looks odd at first blush: until the first call, it effectively points at the resolver rather than at the target function.

Related

Linking and calling printf from gas assembly

There are a few related questions to this which I've come across, such as Printf with gas assembly and Calling C printf from assembly but I'm hoping this is a bit different.
I have the following program:
.section .data
format:
.ascii "%d\n"
.section .text
.globl _start
_start:
// print "55"
mov $format, %rdi
mov $55, %rsi
mov $0, %eax
call printf # how to link?
// exit
mov $60, %eax
mov $0, %rdi
syscall
Two questions related to this:
Is it possible to use only as (gas) and ld to link this to the printf function, using _start as the entry point? If so, how could that be done?
If not, other than changing _start to main, what would be the gcc invocation to run things properly?
It is possible to use ld, but not recommended: if you use libc functions, you need to initialise the C runtime. That is done automatically if you let the C compiler provide _start and start your program as main. If you use the libc but not the C runtime initialisation code, it may seem to work, but it can also lead to strange spurious failures.
If you start your program from main (your second case) instead, it's as simple as doing gcc -o program program.s where program.s is your source file. On some Linux distributions you may also need to supply -no-pie as your program is not written in PIC style (don't worry about this for now).
Note also that I recommend not mixing libc calls with raw system calls. Instead of doing a raw exit system call, call the C library function exit. This lets the C runtime deinitialise itself correctly, including flushing any IO streams.
Now if you assemble and link your program as I said in the first paragraph, you'll notice that it might crash. This is because the stack needs to be aligned to a multiple of 16 bytes on calls to functions. You can ensure this alignment by pushing a qword of data on the stack at the beginning of each of your functions (remember to pop it back off at the end).

Static executable segfaults if location counter is initialized as too small or too large in linker script

I'm trying to generate a static executable for this program (with musl):
main.S:
.section .text
.global main
main:
mov $msg, %rdi
mov $0, %rax
call printf
mov %rax, %rdi
mov $60, %rax
syscall
msg:
.ascii "hello world from printf\n\0"
Compilation command:
clang -g -c main.S -o main.o
Linking command (musl libc is placed in musl directory (version 1.2.1)):
ld main.o musl/crt1.o -o sm -Tstatic.ld -static -lc -lm -Lmusl
Linker script (static.ld):
ENTRY(_start)
SECTIONS
{
. = 0x100e8;
}
This config results in a working executable, but if I change the location counter offset to 0x10000 or 0x20000, the resulting executable crashes during startup with a segfault. On debugging I found that the musl initialization code tries to read the program headers (whose location it receives in the aux vector), and for some reason the program-header address given in the aux vector is unmapped in our address space.
What is the cause of this behavior? What exactly is the counter offset in a linker script? How does it affect the linker output other than altering the load address?
Note: The segfault occurs when the musl initialization code tries to access the program headers.
There are a few issues here.
Your main.S has a stack alignment bug: on x86_64, you must realign the stack to a 16-byte boundary before calling any other function (you can assume 8-byte alignment on entry).
Without this, I get a crash inside printf due to movaps %xmm0,0x40(%rsp) with a misaligned %rsp.
Your link order is wrong: crt1.o should be linked before main.o.
When you don't leave SIZEOF_HEADERS == 0xe8 space before starting your .text section, you are leaving it up to the linker to put program headers elsewhere, and it does. The trouble is: musl (and a lot of other code) assumes that the file header and program headers are mapped in (but the ELF format doesn't require this). So they crash.
The right way to specify start address:
ENTRY(_start)
SECTIONS
{
. = 0x10000 + SIZEOF_HEADERS;
}
Update:
Why does the order matter?
Linkers (in general) will assemble initializers (constructors) left to right. When you call standard C library routines from main(), you expect the standard library to have initialized itself before main() was called. Code in crt1.o is responsible for performing such initialization.
If you link in the wrong order (crt1.o after main.o), construction may not happen correctly. Whether you'll be able to observe this depends on implementation details of the standard library, and exactly which parts of it you are using. So your binary may appear to work correctly. But it is still better to link the objects in the correct order.
I'm leaving 0x10000 space, isn't it enough for headers?
You are interfering with the built-in default linker script, and instead giving it an incomplete specification of how to lay out your program in memory. When you do so, you need to know how the linker will react. Different linkers react differently.
The binutils ld reacts by not emitting a LOAD segment covering the program headers. LLD (ld.lld) reacts differently -- it actually moves .text past the program headers.
The resulting binaries still crash though, because the binary layout is not what the kernel expects, and the kernel-supplied AT_PHDR address in the aux vector is wrong.
It looks like the kernel expects the first LOAD segment to be the one which contains program headers. Arguably that is a bug in the kernel -- nothing in the ELF spec requires this. But all normal binaries do have program headers in the first LOAD segment, so you'll just have to do the same (or convince kernel developers to add code to handle your weird binary layout).
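If you want to check what the kernel actually handed your process, a small sketch like this (a hypothetical helper, not part of the question, using getauxval() from glibc/musl) prints the aux-vector values; comparing AT_PHDR against /proc/self/maps shows whether the program headers really landed inside a mapped LOAD segment:
/* phdr_check.c -- hypothetical helper */
#include <stdio.h>
#include <elf.h>        /* AT_PHDR, AT_PHENT, AT_PHNUM */
#include <sys/auxv.h>   /* getauxval() */

int main(void)
{
    printf("AT_PHDR  = %#lx\n", getauxval(AT_PHDR));
    printf("AT_PHENT = %lu\n",  getauxval(AT_PHENT));
    printf("AT_PHNUM = %lu\n",  getauxval(AT_PHNUM));
    return 0;
}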

How do you call C functions from Assembly and how do you link it Statically?

I am playing around and trying to understand the low-level operation of computers and programs. To that end, I am experimenting with linking Assembly and C.
I have 2 program files:
Some C code here in "callee.c":
#include <unistd.h>
void my_c_func() {
write(1, "Hello, World!\n", 14);
return;
}
I also have some GAS x86_64 Assembly here in "caller.asm":
.text
.globl my_entry_pt
my_entry_pt:
# call my c function
call my_c_func # this function has no parameters and no return data
# make the 'exit' system call
mov $60, %rax # set the syscall to the index of 'exit' (60)
mov $0, %rdi # set the single parameter, the exit code to 0 for normal exit
syscall
I can build and execute the program like this:
$ as ./caller.asm -o ./caller.obj
$ gcc -c ./callee.c -o ./callee.obj
$ ld -e my_entry_pt -lc ./callee.obj ./caller.obj -o ./prog.out -dynamic-linker /lib64/ld-linux-x86-64.so.2
$ ldd ./prog.out
linux-vdso.so.1 (0x00007fffdb8fe000)
libc.so.6 => /lib64/libc.so.6 (0x00007f46c7756000)
/lib64/ld-linux-x86-64.so.2 (0x00007f46c7942000)
$ ./prog.out
Hello, World!
Along the way, I had some problems. If I don't set the -dynamic-linker option, it defaults to this:
$ ld -e my_entry_pt -lc ./callee.obj ./caller.obj -o ./prog.out
$ ldd ./prog.out
linux-vdso.so.1 (0x00007ffc771c5000)
libc.so.6 => /lib64/libc.so.6 (0x00007f8f2abe2000)
/lib/ld64.so.1 => /lib64/ld-linux-x86-64.so.2 (0x00007f8f2adce000)
$ ./prog.out
bash: ./prog.out: No such file or directory
Why is this? Is there a problem with the linker defaults on my system? How can/should I fix it?
Also, static linking doesn't work.
$ ld -static -e my_entry_pt -lc ./callee.obj ./caller.obj -o ./prog.out
ld: ./callee.obj: in function `my_c_func':
callee.c:(.text+0x16): undefined reference to `write'
Why is this? Shouldn't write() just be a C library wrapper for the syscall 'write'? How can I fix it?
Where can I find the documentation on the C function calling convention so I can read up on how parameters are passed back and forth, etc...?
Lastly, while this seems to work for this simple example, am I doing something wrong in my initialization of the C stack? I mean, right now, I'm doing nothing. Should I be allocing memory from the kernel for the stack, setting bounds, and setting %rsp and %rbp before I start trying to call functions. Or is the kernel loader taking care of all this for me? If so, will all architectures under a Linux kernel take care of it for me?
While the Linux kernel provides a syscall named write, that does not mean you automatically get a wrapper function of the same name you can call from C as write(). It is libc that defines those wrapper functions; if you're not using libc, you need inline assembly to make any syscalls from C.
Instead of explicitly linking your binaries with ld, let gcc do it for you. It can even assemble assembly files (internally executing a suitable version of as), if the source ends with a .s suffix. It looks like your linking problems are simply a disagreement between what GCC assumes and how you do it via LD yourself.
No, it's not a bug; the ld default path for ld.so isn't the one used on modern x86-64 GNU/Linux systems. (/lib/ld64.so.1 might have been used on early x86-64 GNU/Linux ports before the dust settled on where multi-arch systems would put everything to support both i386 and x86-64 versions of libraries installed at the same time. Modern systems use /lib64/ld-linux-x86-64.so.2)
Linux uses the System V ABI. The AMD64 Architecture Processor Supplement (PDF) describes the initial execution environment (when _start gets invoked), and the calling convention. Essentially, you have an initialized stack, with environment and command-line arguments stored in it.
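As a quick illustration of the calling convention that document describes (my own example, not taken from the ABI text): the first six integer or pointer arguments travel in rdi, rsi, rdx, rcx, r8 and r9, and the integer return value comes back in rax.
/* sum6.c -- compile with gcc -O2 -S sum6.c and read sum6.s to see the registers */
long sum6(long a, long b, long c, long d, long e, long f)
{       /*    rdi     rsi     rdx     rcx     r8      r9  */
    return a + b + c + d + e + f;   /* result returned in rax */
}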
Let's construct a fully working example, containing both C and assembly (AT&T syntax) sources, and a final static and dynamic binaries.
First, we need a Makefile to save typing long commands:
# SPDX-License-Identifier: CC0-1.0
CC := gcc
CFLAGS := -Wall -Wextra -O2 -march=x86-64 -mtune=generic -m64 \
-ffreestanding -nostdlib -nostartfiles
LDFLAGS :=
all: static-prog dynamic-prog
clean:
rm -f static-prog dynamic-prog *.o
%.o: %.c
$(CC) $(CFLAGS) $^ -c -o $@
%.o: %.s
$(CC) $(CFLAGS) $^ -c -o $@
dynamic-prog: main.o asm.o
$(CC) $(CFLAGS) $^ $(LDFLAGS) -o $@
static-prog: main.o asm.o
$(CC) -static $(CFLAGS) $^ $(LDFLAGS) -o $@
Makefiles are particular about their indentation, but SO converts tabs to spaces. So, after pasting the above, run sed -e 's|^ *|\t|' -i Makefile to fix the indentation back to tabs.
The SPDX License Identifier in the above Makefile and all following files tell you that these files are licensed under Creative Commons Zero license: that is, these are all dedicated to public domain.
Compilation flags used:
-Wall -Wextra: Enable all warnings. It is a good practice.
-O2: Optimize the code. This is a commonly used optimization level, usually considered sufficient and not too extreme.
-march=x86-64 -mtune=generic -m64: Compile to 64-bit x86-64 AKA AMD64 architecture. These are the defaults; you can use -march=native to optimize for your own system.
-ffreestanding: Compilation targets the freestanding C environment. Tells the compiler it can't assume that strlen or memcpy or other library functions are available, so it won't optimize a loop, struct copy, or array initialization into calls to strlen, memcpy, or memset, for example (see the sketch just after this list). If you do provide asm implementations of any functions gcc might want to invent calls to, you can leave this out. (Especially if you're writing a program that will run under an OS.)
-nostdlib -nostartfiles: Do not link in the standard C library or its startup files. (Actually, -nostdlib already "includes" -nostartfiles, so -nostdlib alone would suffice.)
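Here is the sketch referred to above for -ffreestanding. It is my own illustration of the point, consistent with the explanation of the flag rather than a guarantee about any particular GCC version: in a hosted build the optimizer is allowed to recognize this loop as memset() and call the library function, but in our freestanding, -nostdlib build there is no memset() to call, so we tell the compiler not to assume it exists.
/* zero.c -- illustration only */
void zero_buffer(char *buf, unsigned long len)
{
    /* In a hosted build, an optimizing compiler may turn this whole loop
       into a single call to memset(buf, 0, len).  With -ffreestanding we
       tell it not to assume such library functions are available. */
    for (unsigned long i = 0; i < len; i++)
        buf[i] = 0;
}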
Next, let's create a header file, nolib.h, that implements nolib_exit() and nolib_write() wrappers around the exit_group and write syscalls:
// SPDX-License-Identifier: CC0-1.0
/* Require Linux on x86-64 */
#if !defined(__linux__) || !defined(__x86_64__)
#error "This only works on Linux on x86-64."
#endif
/* Known syscall numbers, without depending on glibc or kernel headers */
#define SYS_write 1
#define SYS_exit_group 231
// Normally you'd use
// #include <asm/unistd.h> for __NR_write and __NR_exit_group
// or even #include <sys/syscall.h> for SYS_write
/* Inline assembly macro for a single-parameter no-return syscall */
#define SYSCALL1_NORET(nr, arg1) \
__asm__ volatile ( "syscall\n\t" : : "a" (nr), "D" (arg1) : "rcx", "r11", "memory")
/* Inline assembly macro for a three-parameter syscall */
#define SYSCALL3(retval, nr, arg1, arg2, arg3) \
__asm__ volatile ( "syscall\n\t" : "=a" (retval) : "a" (nr), "D" (arg1), "S" (arg2), "d" (arg3) : "rcx", "r11", "memory" )
/* exit() function */
static inline void nolib_exit(int retval)
{
SYSCALL1_NORET(SYS_exit_group, retval);
}
/* Some errno values */
#define EINTR 4 /* Interrupted system call */
#define EBADF 9 /* Bad file descriptor */
#define EINVAL 22 /* Invalid argument */
// or #include <asm/errno.h> to define these
/* write() syscall wrapper - returns negative errno if an error occurs */
static inline long nolib_write(int fd, const void *data, long len)
{
long retval;
if (fd == -1)
return -EBADF;
if (!data || len < 0)
return -EINVAL;
SYSCALL3(retval, SYS_write, fd, data, len);
return retval;
}
The reason nolib_exit() uses the exit_group syscall instead of the exit syscall is that exit_group ends the entire process. If you run a program under strace, you'll see that it, too, calls the exit_group syscall at the very end. (Syscall implementation of exit())
Next, we need some C code. main.c:
// SPDX-License-Identifier: CC0-1.0
#include "nolib.h"
const char *c_function(void)
{
return "C function";
}
static inline long nolib_put(const char *msg)
{
if (!msg) {
return nolib_write(1, "(null)", 6);
} else {
const char *end = msg;
while (*end)
end++; // strlen
if (end > msg)
return nolib_write(1, msg, (unsigned long)(end - msg));
else
return 0;
}
}
extern const char *asm_function(int);
void _start(void)
{
nolib_put("asm_function(0) returns '");
nolib_put(asm_function(0));
nolib_put("', and asm_function(1) returns '");
nolib_put(asm_function(1));
nolib_put("'.\n");
nolib_exit(0);
}
nolib_put() is just a wrapper around nolib_write(), that finds the end of the string to be written, and calculates the number of characters to be written based on that. If the parameter is a NULL pointer, it prints (null).
Because this is a freestanding environment, and the default name for the entry point is _start, this defines _start as a C function that never returns. (It must never return, because the ABI does not provide any return address; returning would just crash the process. Instead, an exit-type syscall must be made at the end.)
The C source declares and calls a function asm_function, that takes an integer parameter, and returns a pointer to a string. Obviously, we'll implement this in assembly.
The C source also declares a function c_function, that we can call from assembly.
Here's the assembly part, asm.s:
# SPDX-License-Identifier: CC0-1.0
.text
.section .rodata
.one:
.string "One" # includes zero terminator
.text
.p2align 4,,15
.globl asm_function #### visible to the linker
.type asm_function, @function
asm_function:
cmpl $1, %edi
jne .else
leaq .one(%rip), %rax
ret
.else:
subq $8, %rsp # 16B stack alignment for a call to C
call c_function
addq $8, %rsp
ret
.size asm_function, .-asm_function
We don't need to declare c_function as an extern because GNU as treats all unknown symbols as external symbols anyway. We could add Call Frame Information directives, at least .cfi_startproc and .cfi_endproc, but I left them out so it wouldn't be so obvious I just wrote the original code in C and let GCC compile it to assembly, and then prettified it just a bit. (Did I write that out aloud? Oops! But seriously, compiler output is often a good starting point for a hand-written asm implementation of something, unless it does a very bad job of optimizing.)
The subq $8, %rsp adjusts the stack so that it will be a multiple of 16 for the c_function. (On x86-64, stacks grow down, so to reserve 8 bytes of stack, you subtract 8 from the stack pointer.) After the call returns, addq $8, %rsp reverts the stack back to original.
With these four files, we're ready. To build the example binaries, run e.g.
reset ; make clean all
Running either ./static-prog or ./dynamic-prog will output
asm_function(0) returns 'C function', and asm_function(1) returns 'One'.
The two binaries are just 2 kB (static) and 6 kB (dynamic) in size or so, although you can make them even smaller by stripping unneeded stuff,
strip --strip-unneeded static-prog dynamic-prog
which removes about 0.5 kB to 1 kB of unneeded stuff from them – the exact amount varies depending on the version of GCC and Binutils you use.
On some other architectures, we'd need to also link against libgcc (via -lgcc), because some C features rely on internal GCC helper functions. 64-bit integer division on 32-bit architectures (via helpers named __udivdi3 or similar) is a typical example.
As mentioned in the comments, the first version of the above examples had a few issues that need to be addressed. They do not stop the example from executing or working as intended, and were overlooked because the examples were written from scratch for this answer (in the hopes that others finding this question later on via web searches might find this useful), and I'm not perfect. :)
memory clobber argument to the inline assembly, in the syscall preprocessor macros
Adding "memory" in the clobbered list tells the compiler that the inline assembly may access (read and/or write) memory other than those specified in the parameter lists. It is obviously needed for the write syscall, but it is actually important for all syscalls, because the kernel can deliver e.g. signals in the same thread before returning from the syscall, and signal delivery can/will access memory.
As the GCC documentation mentions, this clobber also behaves like a read/write memory barrier for the compiler (but NOT for the processor!). In other words, with the memory clobber, the compiler knows that it must write any changes in variables etc. in memory before the inline assembly, and that unrelated variables and other memory content (not explicitly listed in the inline assembly inputs, outputs, or clobbers) may also change, and will generate the code we actually want, without making incorrect assumptions.
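A small illustration of that last point (my own sketch, reusing the SYSCALL3 macro and SYS_write from nolib.h above): the bytes are written through ordinary C stores, and it's the "memory" clobber that obliges the compiler to actually commit those stores to memory before the syscall instruction, rather than treating them as dead or sinking them past the asm.
/* Illustration only -- assumes nolib.h (SYSCALL3, SYS_write) is included. */
static long say_hi(void)
{
    char msg[3];
    long ret;

    msg[0] = 'h';          /* plain C stores ...                          */
    msg[1] = 'i';
    msg[2] = '\n';

    /* ... which the "memory" clobber forces to be in memory by this point,
       because the kernel reads the buffer behind the compiler's back.     */
    SYSCALL3(ret, SYS_write, 1, msg, 3);
    return ret;
}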
-fPIC -pie: Omitted for simplicity
Position independent code is usually only relevant for shared libraries. In real projects' Makefiles, you will need to use a different set of compilation flags for objects that will be compiled as a dynamic library, static library, dynamically linked executable, or a static executable, as the desired properties (and therefore compiler/linker flags) vary.
In an example such as this one, it is better to try and avoid such extraneous things, as it is a reasonable question to ask on its own ("Which compiler options to use to achieve X, when needing Y ?"), and the answers depend on the required features and context.
In most modern distros, PIE is the default and you might want -fno-pie -no-pie to simplify debugging / disassembling. 32-bit absolute addresses no longer allowed in x86-64 Linux?
-nostdlib does imply (or "include") -nostartfiles
There are quite a few overall options and link options we can use to control how the code is compiled and linked.
Many of the options GCC supports are grouped. For example, -O2 is actually shorthand for a collection of optimization features that you can explicitly specify.
Here, the reason for keeping both is to remind human programmers of the expectations for the code: no standard library, and no start files/objects.
-march=x86-64 -mtune=generic -m64 is the default on x86-64
Again, this is kept more as a reminder of what the code expects. Without a specific architecture definition, one might get the wrong impression that the code should be compilable in general, because C typically is not architecture specific!
The nolib.h header file does contain preprocessor checks (using pre-defined compiler macros to detect the operating system and hardware architecture), halting the compilation with an error for other OSes and hardware architectures.
Most Linux distributions provide the syscall numbers in <asm/unistd.h>, as __NR_name.
These are derived from the actual kernel sources. However, for any given architecture, these are the stable userspace ABI, and will not change. New ones may be added. Only in some extraordinary circumstances (unfixable security holes, perhaps?) can a syscall be deprecated and stop functioning.
It is always better to use the syscall numbers from the kernel, preferably via the aforementioned header, but it's possible to build this program with only GCC, with no glibc or Linux kernel headers installed. Someone writing their own standard C library should include that header (it comes from the Linux kernel sources).
I do know that Debian derivatives (Ubuntu, Mint, et cetera) all do provide the <asm/unistd.h> file, but there are many, many other Linux distributions, and I just am not sure about all of them. I opted to only define the two (exit_group and write), to minimize the risk of problems.
(Editor's note: the file might be in a different place in the filesystem, but the <asm/unistd.h> include path should always work if the right header package is installed. It's part of the kernel's user-space C/asm API.)
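For completeness, this is roughly what the "preferred" variant mentioned above could look like in nolib.h, assuming the kernel header package is installed (the __NR_write and __NR_exit_group macro names are the standard Linux ones):
/* Alternative to the hard-coded numbers, when kernel headers are available: */
#include <asm/unistd.h>            /* __NR_write, __NR_exit_group, ...       */

#define SYS_write       __NR_write
#define SYS_exit_group  __NR_exit_group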
Compilation flag -g adds debug symbols, which helps greatly when debugging – for example, when running and examining the binary in gdb.
I omitted this and all related flags, because I did not want to expand the topic any further, and because this example is easily debugged at the asm level and examined even without. See GDB asm tips like layout reg at the bottom of the x86 tag wiki
The System V ABI requires that before a call to a function, the stack is aligned to 16 bytes. So at the top of the function, RSP+8 is 16-byte aligned, and if there are any stack args, they'll be aligned.
The call instruction pushes the current instruction pointer to the stack, and because this is a 64-bit architecture, that too is 64 bits = 8 bytes. So, to conform to the ABI, we really need to adjust the stack pointer by 8 before calling the function, to ensure it too gets a properly aligned stack pointer. These were initially omitted, but are now included in the assembly (asm.s file).
This matters, because on x86-64, SSE/AVX SIMD vectors have different instructions for aligned-to-16-bytes and unaligned accesses, with the aligned accesses being significantly faster on certain processors. (Why does System V / AMD64 ABI mandate a 16 byte stack alignment?). Using aligned SIMD instructions like movaps with unaligned addresses will cause the process to crash. (e.g. glibc scanf Segmentation faults when called from a function that doesn't align RSP is a real-life example of what happens when you get this wrong.)
However, when we do such stack manipulations, we really should add CFI (Call Frame Information) directives to ensure debugging and stack unwinding etc. works correctly. In this case, for general CFI, we prepend .cfi_startproc before the first instruction in an assembly function, and .cfi_endproc after the last instruction in an assembly function. For the Canonical Frame Address, CFA, we add .cfi_def_cfa_offset N after any instruction that modifies the stack pointer. Essentially, N is 8 at the beginning of the function, and increases as much as %rsp is decremented, and vice versa. See this article for more.
Internally, these directives produce information (metadata) stored in the .eh_frame and .eh_frame_hdr sections in the ELF object files and binaries, depending on other compilation flags.
So, in this case, the subq $8, %rsp should be followed by .cfi_def_cfa_offset 16, and the addq $8, %rsp by .cfi_def_cfa_offset 8, plus .cfi_startproc at the beginning of asm_function and .cfi_endproc after the final ret.
Note that you can often see rep ret instead of just ret in assembly sources. This is nothing but a workaround for certain processors having branch-prediction performance issues when jumping to, or falling through a conditional branch to, a ret instruction. The rep prefix does nothing, except that it avoids the penalty those processors might otherwise take on such a jump. Recent GCC versions stopped doing this by default, as the affected AMD CPUs are very old and not as relevant these days. What does `rep ret` mean?
The "key" option, -ffreestanding, is one that chooses a C "dialect"
The C programming language is actually separated into two different environments: hosted, and freestanding.
The hosted environment is one where the standard C library is available, and is used when you write programs, applications, or daemons in C.
The freestanding environment is one where the standard C library is not available. It is used when you write kernels, firmware for microcontrollers or embedded systems, implement (parts of) your own standard C library, or a "standard library" for some other C-derived language.
As an example, the Arduino programming environment is based on a subset of freestanding C++. The standard C++ library is not available, and many features of C++ like exceptions are not supported. In fact, it is very close to freestanding C with classes. The environment also uses a special pre-preprocessor, which for example automatically prepends declarations of functions without the user having to write them.
Probably the most well known example of freestanding C is the Linux kernel. Not only is the standard C library not available, but the kernel code must actually avoid floating-point operations as well, because of certain hardware considerations.
For a better understanding of what exactly the freestanding C environment looks like to a programmer, I think the best thing is to go look at the language standard itself. As of now (June 2020), the most recent standard is ISO C18. While the standard itself is not free, the final draft is; for C18, it is draft N2176 (PDF).
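As a rough summary of what the standard promises (paraphrasing the C11/C18 requirements, so treat this as a sketch rather than a quotation): a freestanding implementation only has to provide the handful of headers that define types and macros, not the library itself.
/* Headers a freestanding C11/C18 implementation must provide: */
#include <float.h>
#include <iso646.h>
#include <limits.h>
#include <stdalign.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdnoreturn.h>
/* Everything else -- printf, malloc, strlen, ... -- is up to you to supply. */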
The ld default path for ld.so (the ELF interpreter) isn't the one used on modern x86-64 GNU/Linux systems.
/lib/ld64.so.1 might have been used on early x86-64 GNU/Linux ports before the dust settled on where multi-arch systems would put everything to support both i386 and x86-64 versions of libraries installed at the same time. Modern systems use /lib64/ld-linux-x86-64.so.2.
There was never a good time to update the default in GNU binutils ld; when some systems were using the default, changing it would have broken them. Multi-arch systems had to configure their GCC to pass -dynamic-linker /some/path to ld, so they simply did that instead of asking and waiting for the ld default to change. So nobody ever needed the ld default to change to make anything work, except for people playing around with assembly and using ld by hand to create dynamically-linked executables.
Instead of doing that, you can link using gcc -nostartfiles to omit the CRT startup code that defines _start, while still linking against the normal libraries: -lc, -lgcc for internal helper functions if needed, and so on.
See also Assembling 32-bit binaries on a 64-bit system (GNU toolchain) for more info on assembling with/without libc for asm that defines _start, or with libc + CRT for asm that defines main. (Leave out the -m32 from that answer for 64-bit; when using gcc to invoke as and ld for you, that's the only difference.)
ld -static -e my_entry_pt -lc ./callee.obj ./caller.obj -o ./prog.out
doesn't link because you put -lc before the object files that reference symbols in libc.
Order matters in linker command lines, for static libraries.
However, ld -static -e my_entry_pt ./callee.o ./caller.o -lc -o ./prog.out will link, but makes a program that segfaults when it calls glibc functions like write without having called glibc's init functions.
Dynamic linking takes care of that for you (glibc has .init functions that get called by the dynamic linker, the same mechanism that allows C++ static initializers to run in a C++ shared library). CRT startup code also calls those functions in the right order, but you left that out, too, and wrote your own entry point.
@Example's answer avoids that problem by defining its own write wrapper instead of linking with -lc, so it can be truly freestanding.
I thought glibc's write wrapper function would be simple enough not to crash, but that's not the case. It checks if the program is multi-threaded or something by loading from %fs:0x18. The kernel doesn't init FS base for thread-local storage; that's something user-space (glibc's internal init functions) would have to do.
glibc's write() faults on mov %fs:0x18,%eax if you haven't called glibc's init functions. (In a statically-linked executable where glibc couldn't get the dynamic linker to run them for you.)
Dump of assembler code for function write:
=> 0x0000000000401040 <+0>: endbr64 # for CET, or NOP on CPUs without CET
0x0000000000401044 <+4>: mov %fs:0x18,%eax ### this faults with no TLS setup
0x000000000040104c <+12>: test %eax,%eax
0x000000000040104e <+14>: jne 0x401060 <write+32>
0x0000000000401050 <+16>: mov $0x1,%eax # simple case: EAX = __NR_write
0x0000000000401055 <+21>: syscall
0x0000000000401057 <+23>: cmp $0xfffffffffffff000,%rax
0x000000000040105d <+29>: ja 0x4010b0 <write+112> # update errno on error
0x000000000040105f <+31>: retq # else return
0x0000000000401060 <+32>: sub $0x28,%rsp # the non-simple case:
0x0000000000401064 <+36>: mov %rdx,0x18(%rsp) # write is an async cancellation point or something
0x0000000000401069 <+41>: mov %rsi,0x10(%rsp)
0x000000000040106e <+46>: mov %edi,0x8(%rsp)
0x0000000000401072 <+50>: callq 0x4010e0 <__libc_enable_asynccancel>
0x0000000000401077 <+55>: mov 0x18(%rsp),%rdx
0x000000000040107c <+60>: mov 0x10(%rsp),%rsi
0x0000000000401081 <+65>: mov %eax,%r8d
0x0000000000401084 <+68>: mov 0x8(%rsp),%edi
0x0000000000401088 <+72>: mov $0x1,%eax
0x000000000040108d <+77>: syscall
0x000000000040108f <+79>: cmp $0xfffffffffffff000,%rax
0x0000000000401095 <+85>: ja 0x4010c4 <write+132>
0x0000000000401097 <+87>: mov %r8d,%edi
0x000000000040109a <+90>: mov %rax,0x8(%rsp)
0x000000000040109f <+95>: callq 0x401140 <__libc_disable_asynccancel>
0x00000000004010a4 <+100>: mov 0x8(%rsp),%rax
0x00000000004010a9 <+105>: add $0x28,%rsp
0x00000000004010ad <+109>: retq
0x00000000004010ae <+110>: xchg %ax,%ax
0x00000000004010b0 <+112>: mov $0xfffffffffffffffc,%rdx # errno update for the simple case
0x00000000004010b7 <+119>: neg %eax
0x00000000004010b9 <+121>: mov %eax,%fs:(%rdx) # thread-local errno?
0x00000000004010bc <+124>: mov $0xffffffffffffffff,%rax
0x00000000004010c3 <+131>: retq
0x00000000004010c4 <+132>: mov $0xfffffffffffffffc,%rdx # same for the async case
0x00000000004010cb <+139>: neg %eax
0x00000000004010cd <+141>: mov %eax,%fs:(%rdx)
0x00000000004010d0 <+144>: mov $0xffffffffffffffff,%rax
0x00000000004010d7 <+151>: jmp 0x401097 <write+87>
I don't fully understand what exactly write is checking for or doing. It may have something to do with async I/O, and/or POSIX thread cancellation points.

Why do I get overflow in printf? [duplicate]

This question already has answers here:
Can't call C standard library function on 64-bit Linux from assembly (yasm) code
How to print a number in assembly NASM?
Hey I have to call a function of glibc in assembly for an exercise. So I found this code to call printf.
section .rodata
format: db 'Hello %s', 10
name: db 'Conrad'
section .text
global main
extern printf
main:
; printf(format, name)
mov rdi, format
mov rsi, name
call printf
; return 0
mov rax, 0
ret
But i get the error:
Symbol `printf' causes overflow in R_X86_64_PC32 relocation
Compiled it with:
nasm -f elf64 -o test.o test.asm
gcc -o test test.o
The error occurs after doing
./test
Change call printf to a call through the PLT: in NASM syntax, call printf wrt ..plt (the equivalent of printf@PLT in GAS). A plain call printf only works if the actual definition of printf is within ±2GB of the call instruction, which can't be known if the definition is in a shared library (it would work if you statically linked, though). "Overflow" is telling you that the required displacement doesn't fit in the 32-bit offset of the call instruction.
By calling through the PLT, you instead get a relative address that resolves statically at link time to a thunk in the PLT, which loads and jumps to the address of the function definition, resolved at dynamic-linking time.
As Maxime B. noted, the loads of the addresses of format and name are also not correct for position-independent code. They should use RIP-relative addressing (in NASM: lea rdi, [rel format]). You could, as Maxime B. suggested, build with -no-pie instead, but it's better to fix your code so it doesn't depend on being linked at a particular fixed address.
You should compile your program with -no-pie.
This error is explained here. Quoting the original post:
Debian switched to PIC/PIE binaries in 64-bit mode, and GCC in your case is trying to link your object as PIC, but it encounters an absolute address in mov $str, %rdi.

What are the differences comparing PIE, PIC code and executable on 64-bit x86 platform?

The test is on Ubuntu 12.04 64-bit. x86 architecture.
I am confused about the concept Position Independent Executable (PIE) and Position Independent code (PIC), and I guess they are not orthogonal.
Here is my quick experiment.
gcc -fPIC -pie quickSort.c -o a_pie.out
gcc -fPIC quickSort.c -o a_pic.out
gcc quickSort.c -o a.out
objdump -Dr -j .text a.out > a1.temp
objdump -Dr -j .text a_pic.out > a2.temp
objdump -Dr -j .text a_pie.out > a3.temp
And I have the following findings.
A. a.out contains some PIC code, but it only appears in the libc prologue and epilogue functions, as shown below:
4004d0: 48 83 3d 70 09 20 00 cmpq $0x0,0x200970(%rip) # 600e48 <__JCR_END__>
In the assembly instructions of my simple quicksort program, I didn't find any PIC instructions.
B. a_pic.out contains PIC code, and I didn't find any non-PIC instructions... In the instructions of my quicksort program, all the global data are accessed by PIC instructions like this:
40053b: 48 8d 05 ea 02 00 00 lea 0x2ea(%rip),%rax # 40082c <_IO_stdin_used+0x4>
C. a_pie.out contains syntax-identical instructions compared with a_pic.out. However, the memory addresses of a_pie.out's .text section range from 0x630 to 0xa57, while the same section of a_pic.out ranges from 0x400410 to 0x400817.
Could anyone give me some explanations of these phenomena? Especially finding C. Again, I am really confused about PIE vs. PIC, and have no idea how to explain finding C.
I am confused about the concept Position Independent Executable (PIE) and Position Independent code (PIC), and I guess they are not orthogonal.
The only real difference between PIE and PIC is that you are allowed to interpose symbols in PIC, but not in PIE. Except for that, they are pretty much equivalent.
You can read about symbol interposition here.
C. a_pie.out contains syntax-identical instructions compared with a_pic.out. However, the memory addresses of a_pie.out's .text section range from 0x630 to 0xa57, while the same section of a_pic.out ranges from 0x400410 to 0x400817.
It's hard to understand what you find surprising about this.
The PIE binary is linked just as a shared library, and so its default load address (the .p_vaddr of the first LOAD segment) is zero. The expectation is that something will relocate this binary away from zero page, and load it at some random address.
On the other hand, a non-PIE executable is always loaded at its linked-at address. On Linux, the default address for x86_64 binaries is 0x400000, and so the .text ends up not far from there.

Resources