Why do I get oveflow in printf? [duplicate]

Why do I get oveflow in printf? [duplicate] - c

This question already has answers here:
Can't call C standard library function on 64-bit Linux from assembly (yasm) code
(2 answers)
How to print a number in assembly NASM?
(6 answers)
Closed 3 years ago.
Hey I have to call a function of glibc in assembly for an exercise. So I found this code to call printf.
section .rodata
format: db 'Hello %s', 10
name: db 'Conrad'
section .text
global main
extern printf
main:
; printf(format, name)
mov rdi, format
mov rsi, name
call printf
; return 0
mov rax, 0
ret
But i get the error:
Symbol `printf' causes overflow in R_X86_64_PC32 relocation
Compiled it with:
nasm -f elf64 -o test.o test.asm
gcc -o test test.o
The error occurs after doing
./test

Change call printf to call printf#PLT. The former only works if the actual definition of printf is withing ±2GB of the call instruction, which can't be known if the definition is in a shared library (it would work if you static link, though). "Overflow" is telling you that the relative address, which would need to be up to 64-bit, overflows in a 32-bit call instruction offset.
By using printf#PLT, you'll instead get a relative address that resolves statically at link time to a thunk in the PLT, which loads and jumps to the address of the function definition, resolved at dynamic-linking time.
As Maxime B. noted, the loads of the addresses of format and name are also not correct for position-independent code. They should be loaded with "rip-relative" form, but it looks like you're using the weird "Intel syntax" for asm and I'm not sure how to write it in that syntax. You could, as Maxime B. suggested, build with -fno-pie, but better would be figuring out the way to fix your code so it doesn't depend on being linked for a particular fixed address.

You should compile your with -no-pie
This error is explained here. Quoting the original post:
Debian switched to PIC/PIE binaries in 64-bits mode & GCC in your case is trying to link your object as PIC, but it will encounter absolute address in mov $str, %rdi.

Related

Static executable segfaults if location counter is initialized as too small or too large in linker script

I'm trying to generate a static executable for this program (with musl):
main.S:
.section .text
.global main
main:
mov $msg, %rdi
mov $0, %rax
call printf
mov %rax, %rdi
mov $60, %rax
syscall
msg:
.ascii "hello world from printf\n\0"
Compilation command:
clang -g -c main.S -o main.o
Linking command (musl libc is placed in musl directory (version 1.2.1)):
ld main.o musl/crt1.o -o sm -Tstatic.ld -static -lc -lm -Lmusl
Linker script (static.ld):
ENTRY(_start)
SECTIONS
{
. = 0x100e8;
}
This config results in a working executable, but if I change the location counter offset to 0x10000 or 0x20000, the resulting executable crashes during startup with a segfault. On debugging I found that musl initialization code tries to read the program headers (location received in aux vector), and for some reason the memory address of program header as given by aux vector is unmapped in our address space.
What is the cause of this behavior? What exactly is the counter offset in a linker script? How does it affect the linker output other than altering the load address?
Note: The segfault occurs when the the musl initialization code tries to access program headers

There are a few issues here.
Your main.S has a stack alignment bug: on x86_64, you must realign the stack to 16-byte boundary before calling any other function (you can assume 8-byte alignment on entry).
Without this, I get a crash inside printf due to movaps %xmm0,0x40(%rsp) with misaligned $rsp.
Your link order is wrong: crt1.o should be linked before main.o
When you don't leave SIZEOF_HEADERS == 0xe8 space before starting your .text section, you are leaving it up to the linker to put program headers elsewhere, and it does. The trouble is: musl (and a lot of other code) assumes that the file header and program headers are mapped in (but the ELF format doesn't require this). So they crash.
The right way to specify start address:
ENTRY(_start)
SECTIONS
{
. = 0x10000 + SIZEOF_HEADERS;
}
Update:
Why does the order matter?
Linkers (in general) will assemble initializers (constructors) left to right. When you call standard C library routines from main(), you expect the standard library to have initialized itself before main() was called. Code in crt1.o is responsible for performing such initialization.
If you link in the wrong order: crt1.o after main.o, construction may not happen correctly. Whether you'll be able to observe this depends on implementation details of the standard library, and exactly what parts of it you are using. So your binary may appear to work correctly. But it is still better to link objects in the correct order.
I'm leaving 0x10000 space, isn't it enough for headers?
You are interfering with the built-in default linker script, and instead giving it incomplete specification of how to lay out your program in memory. When you do so, you need to know how the linker will react. Different linkers will react differently.
The binutils ld reacts by not emitting a LOAD segment covering program headers. The ld.lld reacts differently -- it actually moves .text past program headers.
The resulting binaries still crash though, because the binary layout is not what the kernel expects, and the kernel-supplied AT_PHDR address in the aux vector is wrong.
It looks like the kernel expects the first LOAD segment to be the one which contains program headers. Arguably that is a bug in the kernel -- nothing in the ELF spec requires this. But all normal binaries do have program headers in the first LOAD segment, so you'll just have to do the same (or convince kernel developers to add code to handle your weird binary layout).

How do you call C functions from Assembly and how do you link it Statically?

I am playing around and trying to understand the low-level operation of computers and programs. To that end, I am experimenting with linking Assembly and C.
I have 2 program files:
Some C code here in "callee.c":
#include <unistd.h>
void my_c_func() {
write(1, "Hello, World!\n", 14);
return;
}
I also have some GAS x86_64 Assembly here in "caller.asm":
.text
.globl my_entry_pt
my_entry_pt:
# call my c function
call my_c_func # this function has no parameters and no return data
# make the 'exit' system call
mov $60, %rax # set the syscall to the index of 'exit' (60)
mov $0, %rdi # set the single parameter, the exit code to 0 for normal exit
syscall
I can build and execute the program like this:
$ as ./caller.asm -o ./caller.obj
$ gcc -c ./callee.c -o ./callee.obj
$ ld -e my_entry_pt -lc ./callee.obj ./caller.obj -o ./prog.out -dynamic-linker /lib64/ld-linux-x86-64.so.2
$ ldd ./prog.out
linux-vdso.so.1 (0x00007fffdb8fe000)
libc.so.6 => /lib64/libc.so.6 (0x00007f46c7756000)
/lib64/ld-linux-x86-64.so.2 (0x00007f46c7942000)
$ ./prog.out
Hello, World!
Along the way, I had some problems. If I don't set the -dynamic-linker option, it defaults to this:
$ ld -e my_entry_pt -lc ./callee.obj ./caller.obj -o ./prog.out
$ ldd ./prog.out
linux-vdso.so.1 (0x00007ffc771c5000)
libc.so.6 => /lib64/libc.so.6 (0x00007f8f2abe2000)
/lib/ld64.so.1 => /lib64/ld-linux-x86-64.so.2 (0x00007f8f2adce000)
$ ./prog.out
bash: ./prog.out: No such file or directory
Why is this? Is there a problem with the linker defaults on my system? How can/should I fix it?
Also, static linking doesn't work.
$ ld -static -e my_entry_pt -lc ./callee.obj ./caller.obj -o ./prog.out
ld: ./callee.obj: in function `my_c_func':
callee.c:(.text+0x16): undefined reference to `write'
Why is this? Shouldn't write() just be a c library wrapper for the syscall 'write'? How can I fix it?
Where can I find the documentation on the C function calling convention so I can read up on how parameters are passed back and forth, etc...?
Lastly, while this seems to work for this simple example, am I doing something wrong in my initialization of the C stack? I mean, right now, I'm doing nothing. Should I be allocing memory from the kernel for the stack, setting bounds, and setting %rsp and %rbp before I start trying to call functions. Or is the kernel loader taking care of all this for me? If so, will all architectures under a Linux kernel take care of it for me?

While the Linux kernel provides a syscall named write, it does not mean that you automatically get a wrapper function of the same name you can call from C as write(). In fact, you need inline assembly to call any syscalls from C, if you're not using libc, because libc defines those wrapper functions.
Instead of explicitly linking your binaries with ld, let gcc do it for you. It can even assemble assembly files (internally executing a suitable version of as), if the source ends with a .s suffix. It looks like your linking problems are simply a disagreement between what GCC assumes and how you do it via LD yourself.
No, it's not a bug; the ld default path for ld.so isn't the one used on modern x86-64 GNU/Linux systems. (/lib/ld64.so.1 might have been used on early x86-64 GNU/Linux ports before the dust settled on where multi-arch systems would put everything to support both i386 and x86-64 versions of libraries installed at the same time. Modern systems use /lib64/ld-linux-x86-64.so.2)
Linux uses the System V ABI. The AMD64 Architecture Processor Supplement (PDF) describes the initial execution environment (when _start gets invoked), and the calling convention. Essentially, you have an initialized stack, with environment and command-line arguments stored in it.
Let's construct a fully working example, containing both C and assembly (AT&T syntax) sources, and a final static and dynamic binaries.
First, we need a Makefile to save typing long commands:
# SPDX-License-Identifier: CC0-1.0
CC := gcc
CFLAGS := -Wall -Wextra -O2 -march=x86-64 -mtune=generic -m64 \
-ffreestanding -nostdlib -nostartfiles
LDFLAGS :=
all: static-prog dynamic-prog
clean:
rm -f static-prog dynamic-prog *.o
%.o: %.c
$(CC) $(CFLAGS) $^ -c -o $#
%.o: %.s
$(CC) $(CFLAGS) $^ -c -o $#
dynamic-prog: main.o asm.o
$(CC) $(CFLAGS) $^ $(LDFLAGS) -o $#
static-prog: main.o asm.o
$(CC) -static $(CFLAGS) $^ $(LDFLAGS) -o $#
Makefiles are particular about their indentation, but SO converts tabs to spaces. So, after pasting the above, run sed -e 's|^ *|\t|' -i Makefile to fix the indentation back to tabs.
The SPDX License Identifier in the above Makefile and all following files tell you that these files are licensed under Creative Commons Zero license: that is, these are all dedicated to public domain.
Compilation flags used:
-Wall -Wextra: Enable all warnings. It is a good practice.
-O2: Optimize the code. This is a commonly used optimization level, usually considered sufficient and not too extreme.
-march=x86-64 -mtune=generic -m64: Compile to 64-bit x86-64 AKA AMD64 architecture. These are the defaults; you can use -march=native to optimize for your own system.
-ffreestanding: Compilation targets the freestanding C environment. Tells the compiler it can't assume that strlen or memcpy or other library functions are available, so don't optimize a loop, struct copy, or array initialization into calls to strlen, memcpy, or memset, for example. If you do provide asm implementations of any functions gcc might want to invent calls to, you can leave this out. (Especially if you're writing a program that will run under an OS)
-nostdlib -nostartfiles: Do not link in the standard C library or its startup files. (Actually, -nostdlib already "includes" -nostartfiles, so -nostdlib alone would suffice.)
Next, let's create a header file, nolib.h, that implements nolib_exit() and nolib_write() wrappers around the group_exit and write syscalls:
// SPDX-License-Identifier: CC0-1.0
/* Require Linux on x86-64 */
#if !defined(__linux__) || !defined(__x86_64__)
#error "This only works on Linux on x86-64."
#endif
/* Known syscall numbers, without depending on glibc or kernel headers */
#define SYS_write 1
#define SYS_exit_group 231
// Normally you'd use
// #include <asm/unistd.h> for __NR_write and __NR_exit_group
// or even #include <sys/syscall.h> for SYS_write
/* Inline assembly macro for a single-parameter no-return syscall */
#define SYSCALL1_NORET(nr, arg1) \
__asm__ volatile ( "syscall\n\t" : : "a" (nr), "D" (arg1) : "rcx", "r11", "memory")
/* Inline assembly macro for a three-parameter syscall */
#define SYSCALL3(retval, nr, arg1, arg2, arg3) \
__asm__ volatile ( "syscall\n\t" : "=a" (retval) : "a" (nr), "D" (arg1), "S" (arg2), "d" (arg3) : "rcx", "r11", "memory" )
/* exit() function */
static inline void nolib_exit(int retval)
{
SYSCALL1_NORET(SYS_exit_group, retval);
}
/* Some errno values */
#define EINTR 4 /* Interrupted system call */
#define EBADF 9 /* Bad file descriptor */
#define EINVAL 22 /* Invalid argument */
// or #include <asm/errno.h> to define these
/* write() syscall wrapper - returns negative errno if an error occurs */
static inline long nolib_write(int fd, const void *data, long len)
{
long retval;
if (fd == -1)
return -EBADF;
if (!data || len < 0)
return -EINVAL;
SYSCALL3(retval, SYS_write, fd, data, len);
return retval;
}
The reason the nolib_exit() uses the exit_group syscall instead of the exit syscall is that exit_group ends the entire process. If you run a program under strace, you'll see it too calls exit_group syscall at the very end. (Syscall implementation of exit())
Next, we need some C code. main.c:
// SPDX-License-Identifier: CC0-1.0
#include "nolib.h"
const char *c_function(void)
{
return "C function";
}
static inline long nolib_put(const char *msg)
{
if (!msg) {
return nolib_write(1, "(null)", 6);
} else {
const char *end = msg;
while (*end)
end++; // strlen
if (end > msg)
return nolib_write(1, msg, (unsigned long)(end - msg));
else
return 0;
}
}
extern const char *asm_function(int);
void _start(void)
{
nolib_put("asm_function(0) returns '");
nolib_put(asm_function(0));
nolib_put("', and asm_function(1) returns '");
nolib_put(asm_function(1));
nolib_put("'.\n");
nolib_exit(0);
}
nolib_put() is just a wrapper around nolib_write(), that finds the end of the string to be written, and calculates the number of characters to be written based on that. If the parameter is a NULL pointer, it prints (null).
Because this is a freestanding environment, and the default name for the entry point is _start, this defines _start as a C function that never returns. (It must not ever return, because the ABI does not provide any return address; it would just crash the process. Instead, an exit-type syscall must be called at end.)
The C source declares and calls a function asm_function, that takes an integer parameter, and returns a pointer to a string. Obviously, we'll implement this in assembly.
The C source also declares a function c_function, that we can call from assembly.
Here's the assembly part, asm.s:
# SPDX-License-Identifier: CC0-1.0
.text
.section .rodata
.one:
.string "One" # includes zero terminator
.text
.p2align 4,,15
.globl asm_function #### visible to the linker
.type asm_function, #function
asm_function:
cmpl $1, %edi
jne .else
leaq .one(%rip), %rax
ret
.else:
subq $8, %rsp # 16B stack alignment for a call to C
call c_function
addq $8, %rsp
ret
.size asm_function, .-asm_function
We don't need to declare c_function as an extern because GNU as treats all unknown symbols as external symbols anyway. We could add Call Frame Information directives, at least .cfi_startproc and .cfi_endproc, but I left them out so it wouldn't be so obvious I just wrote the original code in C and let GCC compile it to assembly, and then prettified it just a bit. (Did I write that out aloud? Oops! But seriously, compiler output is often a good starting point for a hand-written asm implementation of something, unless it does a very bad job of optimizing.)
The subq $8, %rsp adjusts the stack so that it will be a multiple of 16 for the c_function. (On x86-64, stacks grow down, so to reserve 8 bytes of stack, you subtract 8 from the stack pointer.) After the call returns, addq $8, %rsp reverts the stack back to original.
With these four files, we're ready. To build the example binaries, run e.g.
reset ; make clean all
Running either ./static-prog or ./dynamic-prog will output
asm_function(0) returns 'C function', and asm_function(1) returns 'One'.
The two binaries are just 2 kB (static) and 6 kB (dynamic) in size or so, although you can make them even smaller by stripping unneeded stuff,
strip --strip-unneeded static-prog dynamic-prog
which removes about 0.5 kB to 1 kB of unneeded stuff from them – the exact amount varies depending on the version of GCC and Binutils you use.
On some other architectures, we'd need to also link against libgcc (via -lgcc), because some C features rely on internal GCC functions. 64-bit integer division (named udivdi or similar) on various architectures is a typical example.
As mentioned in the comments, the first version of the above examples had a few issues that need to be addressed. They do not stop the example from executing or working as intended, and were overlooked because the examples were written from scratch for this answer (in the hopes that others finding this question later on via web searches might find this useful), and I'm not perfect. :)
memory clobber argument to the inline assembly, in the syscall preprocessor macros
Adding "memory" in the clobbered list tells the compiler that the inline assembly may access (read and/or write) memory other than those specified in the parameter lists. It is obviously needed for the write syscall, but it is actually important for all syscalls, because the kernel can deliver e.g. signals in the same thread before returning from the syscall, and signal delivery can/will access memory.
As the GCC documentation mentions, this clobber also behaves like a read/write memory barrier for the compiler (but NOT for the processor!). In other words, with the memory clobber, the compiler knows that it must write any changes in variables etc. in memory before the inline assembly, and that unrelated variables and other memory content (not explicitly listed in the inline assembly inputs, outputs, or clobbers) may also change, and will generate the code we actually want, without making incorrect assumptions.
-fPIC -pie: Omitted for simplicity
Position independent code is usually only relevant for shared libraries. In real projects' Makefiles, you will need to use a different set of compilation flags for objects that will be compiled as a dynamic library, static library, dynamically linked executable, or a static executable, as the desired properties (and therefore compiler/linker flags) vary.
In an example such as this one, it is better to try and avoid such extraneous things, as it is a reasonable question to ask on its own ("Which compiler options to use to achieve X, when needing Y ?"), and the answers depend on the required features and context.
In most modern distros, PIE is the default and you might want -fno-pie -no-pie to simplify debugging / disassembling. 32-bit absolute addresses no longer allowed in x86-64 Linux?
-nostdlib does imply (or "include") -nostartfiles
There are quite a few overall options and link options we can use to control how the code is compiled and linked.
Many of the options GCC supports are grouped. For example, -O2 is actually shorthand for a collection of optimization features that you can explicitly specify.
Here, the reason for keeping both is to remind human programmers of the expectations for the code: no standard library, and no start files/objects.
-march=x86-64 -mtune=generic -m64 is the default on x86-64
Again, this is kept more as a reminder of what the code expects. Without a specific architecture definition, one might get the wrong impression that the code should be compilable in general, because C typically is not architecture specific!
The nolib.h header file does contain preprocessor checks (using pre-defined compiler macros to detect the operating system and hardware architecture), halting the compilation with an error for other OSes and hardware architectures.
Most Linux distributions provide the syscall numbers in <asm/unistd.h>, as __NR_name.
These are derived from the actual kernel sources. However, for any given architecture, these are the stable userspace ABI, and will not change. New ones may be added. Only in some extraordinary circumstances (unfixable security holes, perhaps?) can a syscall be deprecated and stop functioning.
It is always better to use the syscall numbers from the kernel, preferably via the aforementioned header, but it's possible to build this program with only GCC, no glibc or Linux kernel headers installed. For someone writing their own standard C library, they should include the file (from Linux kernel sources).
I do know that Debian derivatives (Ubuntu, Mint, et cetera) all do provide the <asm/unistd.h> file, but there are many, many other Linux distributions, and I just am not sure about all of them. I opted to only define the two (exit_group and write), to minimize the risk of problems.
(Editor's note: the file might be in a different place in the filesystem, but the <asm/unistd.h> include path should always work if the right header package is installed. It's part of the kernel's user-space C/asm API.)
Compilation flag -g adds debug symbols, which adds greatly when debugging – for example, when running and examining the binary in gdb.
I omitted this and all related flags, because I did not want to expand the topic any further, and because this example is easily debugged at the asm level and examined even without. See GDB asm tips like layout reg at the bottom of the x86 tag wiki
The System V ABI requires that before a call to a function, the stack is aligned to 16 bytes. So at the top of the function, RSP+-8 is 16-byte aligned, and if there are any stack args, they'll be aligned.
The call instruction pushes the current instruction pointer to the stack, and because this is a 64-bit architecture, that too is 64 bits = 8 bytes. So, to conform to the ABI, we really need to adjust the stack pointer by 8 before calling the function, to ensure it too gets a properly aligned stack pointer. These were initially omitted, but are now included in the assembly (asm.s file).
This matters, because on x86-64, SSE/AVX SIMD vectors have different instructions for aligned-to-16-bytes and unaligned accesses, with the aligned accesses being significantly faster or certain processors. (Why does System V / AMD64 ABI mandate a 16 byte stack alignment?). Using aligned SIMD instructions like movaps with unaligned addresses will cause the process to crash. (e.g. glibc scanf Segmentation faults when called from a function that doesn't align RSP is a real-life example of what happens when you get this wrong.)
However, when we do such stack manipulations, we really should add CFI (Call Frame Information) directives to ensure debugging and stack unwinding etc. works correctly. In this case, for general CFI, we prepend .cfi_startproc before the first instruction in an assembly function, and .cfi_endproc after the last instruction in an assembly function. For the Canonical Frame Address, CFA, we add .cfi_def_cfa_offset N after any instruction that modifies the stack pointer. Essentially, N is 8 at the beginning of the function, and increases as much as %rsp is decremented, and vice versa. See this article for more.
Internally, these directives produce information (metadata) stored in the .eh_frame and .eh_frame_hdr sections in the ELF object files and binaries, depending on other compilation flags.
So, in this case, the subq $8, %rsp should be followed by .cfi_def_cfa_offset 16, and the addq $8, %rsp by .cfi_def_cfa_offset 8, plus .cfi_startproc at the beginning of asm_function and .cfi_endproc after the final ret.
Note that you can often see rep ret instead of just rep in assembly sources. This is nothing but a workaround to certain processors having branch-prediction performance issues when jumping to or falling through a JCC to a ret instruction. The rep prefix does nothing, except it does fix the issues those processors might otherwise have with such a jump. Recent GCC versions stopped doing this by default as the affected AMD CPUs are very old and not as relevant these days. What does `rep ret` mean?
The "key" option, -ffreestanding, is one that chooses a C "dialect"
The C programming language is actually separated into two different environments: hosted, and freestanding.
The hosted environment is one where the standard C library is available, and is used when you write programs, applications, or daemons in C.
The freestanding environment is one where the standard C library is not available. It is used when you write kernels, firmware for microcontrollers or embedded systems, implement (parts of) your own standard C library, or a "standard library" for some other C-derived language.
As an example, the Arduino programming environment is based on a subset of freestanding C++. The standard C++ library is not available, and many features of C++ like exceptions are not supported. In fact, it is very close to freestanding C with classes. The environment also uses a special pre-preprocessor, which for example automatically prepends declarations of functions without the user having to write them.
Probably the most well known example of freestanding C is the Linux kernel. Not only is the standard C library not available, but the kernel code must actually avoid floating-point operations as well, because of certain hardware considerations.
For a better understanding of what exactly does the freestanding C environment look like to a programmer, I think the best thing is to go look at the language standard itself. As of now (June 2020), the most recent standard is ISO C18. While the standard itself is not free, the final draft is; for C18, it is draft N2176(PDF).

The ld default path for ld.so (the ELF interpreter) isn't the one used on modern x86-64 GNU/Linux systems.
/lib/ld64.so.1 might have been used on early x86-64 GNU/Linux ports before the dust settled on where multi-arch systems would put everything to support both i386 and x86-64 versions of libraries installed at the same time. Modern systems use /lib64/ld-linux-x86-64.so.2.
There was never a good time to update the default in GNU binutils ld; when some systems were using the default, changing it would have broken them. Multi-arch systems had to configure their GCC to pass -dynamic-linker /some/path to ld, so they simply did that instead of asking and waiting for the ld default to change. So nobody ever needed the ld default to change to make anything work, except for people playing around with assembly and using ld by hand to create dynamically-linked executables.
Instead of doing that, you can link using gcc -nostartfiles to omit CRT start code which defines a _start, but still link with the normal libraries including -lc, -lgcc internal helper functions if needed, etc.
See also Assembling 32-bit binaries on a 64-bit system (GNU toolchain) for more info on assembling with/without libc for asm that defines _start, or with libc + CRT for asm that defines main. (Leave out the -m32 from that answer for 64-bit; when using gcc to invoke as and ld for you, that's the only difference.)
ld -static -e my_entry_pt -lc ./callee.obj ./caller.obj -o ./prog.out
doesn't link because you put -lc before the object files that reference symbols in libc.
Order matters in linker command lines, for static libraries.
However, ld -static -e my_entry_pt ./callee.o ./caller.o -lc -o ./prog.out will link, but makes a program that segfaults when it calls glibc functions like write without having called glibc's init functions.
Dynamic linking takes care of that for you (glibc has .init functions that get called by the dynamic linker, the same mechanism that allows C++ static initializers to run in a C++ shared library). CRT startup code also calls those functions in the right order, but you left that out, too, and wrote your own entry point.
#Example's answer avoids that problem by defining its own write wrapper instead of linking with -lc, so it can be truly freestanding.
I thought glibc's write wrapper function would be simple enough not to crash, but that's not the case. It checks if the program is multi-threaded or something by loading from %fs:0x18. The kernel doesn't init FS base for thread-local storage; that's something user-space (glibc's internal init functions) would have to do.
glibc's write() faults on mov %fs:0x18,%eax if you haven't called glibc's init functions. (In a statically-linked executable where glibc couldn't get the dynamic linker to run them for you.)
Dump of assembler code for function write:
=> 0x0000000000401040 <+0>: endbr64 # for CET, or NOP on CPUs without CET
0x0000000000401044 <+4>: mov %fs:0x18,%eax ### this faults with no TLS setup
0x000000000040104c <+12>: test %eax,%eax
0x000000000040104e <+14>: jne 0x401060 <write+32>
0x0000000000401050 <+16>: mov $0x1,%eax # simple case: EAX = __NR_write
0x0000000000401055 <+21>: syscall
0x0000000000401057 <+23>: cmp $0xfffffffffffff000,%rax
0x000000000040105d <+29>: ja 0x4010b0 <write+112> # update errno on error
0x000000000040105f <+31>: retq # else return
0x0000000000401060 <+32>: sub $0x28,%rsp # the non-simple case:
0x0000000000401064 <+36>: mov %rdx,0x18(%rsp) # write is an async cancellation point or something
0x0000000000401069 <+41>: mov %rsi,0x10(%rsp)
0x000000000040106e <+46>: mov %edi,0x8(%rsp)
0x0000000000401072 <+50>: callq 0x4010e0 <__libc_enable_asynccancel>
0x0000000000401077 <+55>: mov 0x18(%rsp),%rdx
0x000000000040107c <+60>: mov 0x10(%rsp),%rsi
0x0000000000401081 <+65>: mov %eax,%r8d
0x0000000000401084 <+68>: mov 0x8(%rsp),%edi
0x0000000000401088 <+72>: mov $0x1,%eax
0x000000000040108d <+77>: syscall
0x000000000040108f <+79>: cmp $0xfffffffffffff000,%rax
0x0000000000401095 <+85>: ja 0x4010c4 <write+132>
0x0000000000401097 <+87>: mov %r8d,%edi
0x000000000040109a <+90>: mov %rax,0x8(%rsp)
0x000000000040109f <+95>: callq 0x401140 <__libc_disable_asynccancel>
0x00000000004010a4 <+100>: mov 0x8(%rsp),%rax
0x00000000004010a9 <+105>: add $0x28,%rsp
0x00000000004010ad <+109>: retq
0x00000000004010ae <+110>: xchg %ax,%ax
0x00000000004010b0 <+112>: mov $0xfffffffffffffffc,%rdx # errno update for the simple case
0x00000000004010b7 <+119>: neg %eax
0x00000000004010b9 <+121>: mov %eax,%fs:(%rdx) # thread-local errno?
0x00000000004010bc <+124>: mov $0xffffffffffffffff,%rax
0x00000000004010c3 <+131>: retq
0x00000000004010c4 <+132>: mov $0xfffffffffffffffc,%rdx # same for the async case
0x00000000004010cb <+139>: neg %eax
0x00000000004010cd <+141>: mov %eax,%fs:(%rdx)
0x00000000004010d0 <+144>: mov $0xffffffffffffffff,%rax
0x00000000004010d7 <+151>: jmp 0x401097 <write+87>
I don't fully understand what exactly write is checking for or doing. It may have something to do with async I/O, and/or POSIX thread cancellation points.

Is it possible in practice to compile millions of small functions into a static binary?

I've created a static library with about 2 million small functions, but I'm having trouble linking it to my main function, using GCC (tested 4.8.5 or 7.3.0) under Linux x86_64.
The linker complains about relocation truncations, very much like those in this question.
I've already tried using -mcmodel=large, but as the answer to that same question says, I would
"need a crt1.o that can handle full 64-bit addresses". I've then tried compiling one, following this answer, but recent glibc won't compile under -mcmodel=large, even if libgcc does, which accomplishes nothing.
I've also tried adding the flags -fPIC and/or -fPIE to no avail. The best I get is this sole error:
ld: failed to convert GOTPCREL relocation; relink with --no-relax
and adding that flag also doesn't help.
I've searched around the Internet for hours, but most posts are very old and I can't find a way to do this.
I'm aware this is not a common thing to try, but I think it should be possible to do this. I'm working in an HPC environment, so memory or time constraints are not the issue here.
Has anyone been successful in accomplishing something similar with a recent compiler and toolchain?

Either don't use the standard library or patch it. As for the 2.34 version, Glibc doesn't support the large code model. (See also Glibc mailing list and Redhat Bugzilla)
Explanation
Let's examine the Glibc source code to understand why recompiling with -mcmodel=large accomplished nothing. It replaced the relocations originating from C files. But Glibc contained hardcoded 32-bit relocations in raw Assembly files, such as in start.S (sysdeps/x86_64/start.S).
call *__libc_start_main#GOTPCREL(%rip)
start.S emitted R_X86_64_GOTPCREL for __libc_start_main, which used relative addressing. x86_64 CALL instruction didn't support relative jumps by more than 32-bit displacement, see AMD64 Manual 3. So, ld couldn't offset the relocation R_X86_64_GOTPCREL because the code size surpassed 2GB.
Adding -fPIC didn't help due to the same ISA constraints. For position-independent code, the compiler still generated relative jumps.
Patching
In short, you have to replace 32-bit relocations in the Assembly code. See System V Application Binary Interface AMD64 Architecture Process Supplement for more info about implementing 64-bit relocations. See also this for a more in-depth explanation of code models.
Why don't 32-bit relocations suffice for the large code model? Because we can't rely on other symbols being in a range of 2GB. All calls must become absolute. Contrast with the small PIC code model, where the compiler generates relative jumps whenever possible.
Let's look closely at the R_X86_64_GOTPCREL relocation. It contains the 32-bit difference between RIP and the symbol's GOT entry address. It has a 64-bit substitute — R_X86_64_GOTPCREL64, but I couldn't find a way to use it in Assembly.
So, to replace the GOTPCREL, we have to compute the symbol entry GOT base offset and the GOT address itself. We can calculate the GOT location once in the function prologue because it doesn't change.
First, let's get the GOT base (code lifted wholesale from the ABI Supplement). The GLOBAL_OFFSET_TABLE relocation specifies the offset relative to the current position:
leaq 1f(%rip), %r11
1: movabs $_GLOBAL_OFFSET_TABLE_, %r15
leaq (%r11, %r15), %r15
With the GOT base residing on the %r15 register, now we have to find the symbol's GOT entry offset. The R_X86_64_GOT64 relocation specifies exactly this. With this, we can rewrite the call to __libc_start_main as:
movabs $__libc_start_main#GOT, %r11
call *(%r11, %r15)
We replaced R_X86_64_GOTPCREL with GLOBAL_OFFSET_TABLE and R_X86_64_GOT64. Replace others in the same vein.
N.B.: Replace R_X86_64_GOT64 with R_X86_64_PLTOFF64 for functions from dynamically linked executables.
Testing
Verify the patch correctness using the following test that requires the large code model. It doesn't contain a million small functions, having one huge function and one small function instead.
Your compiler must support the large code model. If you use GCC, you'll need to build it from the source with the flag -mcmodel=large. Startup files shouldn't contain 32-bit relocations.
The foo function takes more than 2GB, rendering 32-bit relocations unusable. Thus, the test will fail with the overflow error if compiled without -mcmodel=large. Also, add flags -O0 -fPIC -static, link with gold.
extern int foo();
extern int bar();
int foo(){
bar();
// Call sys_exit
asm( "mov $0x3c, %%rax \n"
"xor %%rdi, %%rdi \n"
"syscall \n"
".zero 1 << 32 \n"
: : : "rax", "rdx");
return 0;
}
int bar(){
return 0;
}
int __libc_start_main(){
foo();
return 0;
}
int main(){
return 0;
}
N.B. I used patched Glibc startup files without the standard library itself, so I had to define both _libc_start_main and main.

Why doesn't gcc reference the PLT for function calls?

I'm trying to learn assembly by compiling simple functions and looking at the output.
I'm looking at calling functions in other libraries. Here's a toy C function that calls a function defined elsewhere:
void give_me_a_ptr(void*);
void foo() {
give_me_a_ptr("foo");
}
Here's the assembly produced by gcc:
$ gcc -Wall -Wextra -g -O0 -c call_func.c
$ objdump -d call_func.o
call_func.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <foo>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: bf 00 00 00 00 mov $0x0,%edi
9: e8 00 00 00 00 callq e <foo+0xe>
e: 90 nop
f: 5d pop %rbp
10: c3 retq
I was expecting something like call <give_me_a_ptr#plt>. Why is this jumping to a relative position before it even knows where give_me_a_ptr is defined?
I'm also puzzled by mov $0, %edi. This looks like it's passing a null pointer -- surely mov $address_of_string, %rdi would be correct here?

You're not building with symbol-interposition enabled (a side-effect of -fPIC), so the call destination address can potentially be resolved at link time to an address in another object file that is being statically linked into the same executable. (e.g. gcc foo.o bar.o).
However, if the symbol is only found in a library that you're dynamically linking to (gcc foo.o -lbar), the call has to be indirected through the PLT to support.
Now this is the tricky part: without -fPIC or -fPIE, gcc still emits asm that calls the function directly:
int puts(const char*); // puts exists in libc, so we can link this example
void call_puts(void) { puts("foo"); }
# gcc 5.3 -O3 (without -fPIC)
movl $.LC0, %edi # absolute 32bit addressing: slightly smaller code, because static data is known to be in the low 2GB, in the default "small" code model
jmp puts # tail-call optimization. Same as call puts/ret, except for stack alignment
But if you look at the linked binary:
(on this Godbolt compiler explorer link, click the "binary" button to toggle between gcc -S asm output and objdump -dr disassembly)
# disassembled linker output
mov $0x400654,%edi
jmpq 400490 <puts#plt>
During linking, the call to puts was "magically" replaced with indirection through puts#plt, and a puts#plt definition is present in the linked executable.
I don't know the details of how this works, but it's done at link time when linking to a shared library. Crucially, it doesn't require anything in the header files to mark the function prototype as being in a shared library. You get the same results from including <stdio.h> as you do from declaring puts yourself. (This is highly not recommended; it's probably legal for a C implementation to only work properly with the declarations in headers. It happens to work on Linux, though.)
When compiling a position-independent executable (with -fPIE), the linked binary jumps to puts through the PLT, identically to without -fPIC. However, the compiler asm output is different (try it yourself on the godbolt link above):
call_puts: # compiled with -fPIE
leaq .LC0(%rip), %rdi # RIP-relative addressing for static data
jmp puts#PLT
The compiler forces indirection through the PLT for any calls to functions it can't see the definition for. I don't understand why. In PIE mode, we're compiling code for an executable, not a shared library. The linker should be able to link multiple object files into a position-independent executable with direct calls between functions defined in the executable. I'm testing on Linux (my desktop and godbolt), not OS X, where I assume gcc -fPIE is the default. It might be configured differently, IDK.
With -fPIC instead of -fPIE, things are even worse: even calls to global functions defined within the same compilation unit have to go through the PLT, to support symbol interposition. (e.g. LD_PRELOAD=intercept_some_functions.so ./a.out)
The differences between -fPIC and -fPIE are mainly that PIE can assume no symbol interposition for functions in the same compilation unit, but PIC can't. OS X requires position-independent executables, as well as shared libraries, but there is a difference in what the compiler can do when making code for a library vs. making code for an executable.
This Godbolt example has some more functions that demonstrate stuff about PIC and PIE mode, e.g. that call_puts() can't inline into another function in PIC mode, only PIE.
See also: Shared object in Linux without symbol interposition, -fno-semantic-interposition error.
puzzled by mov $0, %edi
You're looking at disassembly output from the .o, where addresses are just placeholder 0s that will be replaced by the linker at link time, based on the relocation information in the ELF object file. That's why #Leandros suggested objdump -r.
Similarly, the relative displacement in the call machine code is all-zeros, because the linker hasn't filled it in yet.

I'm still studying this linking process myself, but wanted to restate something in my own words. The PLT-related user function calls might not all be stuffed with the proper code by the time execution starts. Doing so could take a lot of time at the start of execution; and not all the function calls instrumented by the PLT might even be used. So under a 'lazy binding' method, the very first time a 'user' function is called through the PLT code, it always jumps to the PLT 'binding function' first. The binding function goes out and finds the right address for the 'user' function (I think from the GOT) and then replaces the PLT entry (that points to the binding function) with the code pointing to the 'user' function. So thereafter every time the user function is called, the 'lazy' binding function is not called; the 'user' function is called instead. This might be why the PLT entry looks odd at first blush; it's pointing to the binding function and not to the 'user' function.

What does this GCC error "... relocation truncated to fit..." mean?

I am programming the host side of a host-accelerator system. The host runs on the PC under Ubuntu Linux and communicates with the embedded hardware via a USB connection. The communication is performed by copying memory chunks to and from the embedded hardware's memory.
On the board's memory there is a memory region which I use as a mailbox where I write and read the data. The mailbox is defined as a structure and I use the same definition to allocate a mirror mailbox in my host space.
I used this technique successfully in the past so now I copied the host Eclipse project to my current project's workspace, and made the appropriate name changes. The strange thing is that when building the host project I now get the following message:
Building target: fft2d_host
Invoking: GCC C Linker
gcc -L/opt/adapteva/esdk/tools/host/x86_64/lib -o "fft2d_host" ./src/fft2d_host.o -le_host -lrt
./src/fft2d_host.o: In function `main':
fft2d_host.c:(.text+0x280): relocation truncated to fit: R_X86_64_PC32 against symbol `Mailbox' defined in COMMON section in ./src/fft2d_host.o
What does this error mean and why it won't build on the current project, while it is OK with the older project?

You are attempting to link your project in such a way that the target of a relative addressing scheme is further away than can be supported with the 32-bit displacement of the chosen relative addressing mode. This could be because the current project is larger, because it is linking object files in a different order, or because there's an unnecessarily expansive mapping scheme in play.
This question is a perfect example of why it's often productive to do a web search on the generic portion of an error message - you find things like this:
http://www.technovelty.org/code/c/relocation-truncated.html
Which offers some curative suggestions.

Minimal example that generates the error
main.S moves an address into %eax (32-bit).
main.S
_start:
mov $_start, %eax
linker.ld
SECTIONS
{
/* This says where `.text` will go in the executable. */
. = 0x100000000;
.text :
{
*(*)
}
}
Compile on x86-64:
as -o main.o main.S
ld -o main.out -T linker.ld main.o
Outcome of ld:
(.text+0x1): relocation truncated to fit: R_X86_64_32 against `.text'
Keep in mind that:
as puts everything on the .text if no other section is specified
ld uses the .text as the default entry point if ENTRY. Thus _start is the very first byte of .text.
How to fix it: use this linker.ld instead, and subtract 1 from the start:
SECTIONS
{
. = 0xFFFFFFFF;
.text :
{
*(*)
}
}
Notes:
we cannot make _start global in this example with .global _start, otherwise it still fails. I think this happens because global symbols have alignment constraints (0xFFFFFFF0 works). TODO where is that documented in the ELF standard?
the .text segment also has an alignment constraint of p_align == 2M. But our linker is smart enough to place the segment at 0xFFE00000, fill with zeros until 0xFFFFFFFF and set e_entry == 0xFFFFFFFF. This works, but generates an oversized executable.
Tested on Ubuntu 14.04 AMD64, Binutils 2.24.
Explanation
First you must understand what relocation is with a minimal example: https://stackoverflow.com/a/30507725/895245
Next, take a look at objdump -Sr main.o:
0000000000000000 <_start>:
0: b8 00 00 00 00 mov $0x0,%eax
1: R_X86_64_32 .text
If we look into how instructions are encoded in the Intel manual, we see that:
b8 says that this is a mov to %eax
0 is an immediate value to be moved to %eax. Relocation will then modify it to contain the address of _start.
When moving to 32-bit registers, the immediate must also be 32-bit.
But here, the relocation has to modify those 32-bit to put the address of _start into them after linking happens.
0x100000000 does not fit into 32-bit, but 0xFFFFFFFF does. Thus the error.
This error can only happen on relocations that generate truncation, e.g. R_X86_64_32 (8 bytes to 4 bytes), but never on R_X86_64_64.
And there are some types of relocation that require sign extension instead of zero extension as shown here, e.g. R_X86_64_32S. See also: https://stackoverflow.com/a/33289761/895245
R_AARCH64_PREL32
Asked at: How to prevent "main.o:(.eh_frame+0x1c): relocation truncated to fit: R_AARCH64_PREL32 against `.text'" when creating an aarch64 baremetal program?

I ran into this problem while building a program that requires a huge amount of stack space (over 2 GiB). The solution was to add the flag -mcmodel=medium, which is supported by both GCC and Intel compilers.

On Cygwin -mcmodel=medium is already default and doesn't help. To me adding -Wl,--image-base -Wl,0x10000000 to GCC linker did fixed the error.

Often, this error means your program is too large, and often it's too large because it contains one or more very large data objects. For example,
char large_array[1ul << 31];
int other_global;
int main(void) { return other_global; }
will produce a "relocation truncated to fit" error on x86-64/Linux, if compiled in the default mode and without optimization. (If you turn on optimization, it could, at least theoretically, figure out that large_array is unused and/or that other_global is never written, and thus generate code that doesn't trigger the problem.)
What's going on is that, by default, GCC uses its "small code model" on this architecture, in which all of the program's code and statically allocated data must fit into the lowest 2GB of the address space. (The precise upper limit is something like 2GB - 2MB, because the very lowest 2MB of any program's address space is permanently unusable. If you are compiling a shared library or position-independent executable, all of the code and data must still fit into two gigabytes, but they're not nailed to the bottom of the address space anymore.) large_array consumes all of that space by itself, so other_global is assigned an address above the limit, and the code generated for main cannot reach it. You get a cryptic error from the linker, rather than a helpful "large_array is too large" error from the compiler, because in more complex cases the compiler can't know that other_global will be out of reach, so it doesn't even try for the simple cases.
Most of the time, the correct response to getting this error is to refactor your program so that it doesn't need gigantic static arrays and/or gigabytes of machine code. However, if you really have to have them for some reason, you can use the "medium" or "large" code models to lift the limits, at the price of somewhat less efficient code generation. These code models are x86-64-specific; something similar exists for most other architectures, but the exact set of "models" and the associated limits will vary. (On a 32-bit architecture, for instance, you might have a "small" model in which the total amount of code and data was limited to something like 224 bytes.)

Remember to tackle error messages in order. In my case, the error above this one was "undefined reference", and I visually skipped over it to the more interesting "relocation truncated" error. In fact, my problem was an old library that was causing the "undefined reference" message. Once I fixed that, the "relocation truncated" went away also.

I may be wrong, but in my experience there's another possible reason for the error, the root cause being a compiler (or platform) limitation which is easy to reproduce and work around. Next the simplest example
define an array of 1GB with:
char a[1024 x 1024 x 1024];
Result: it works, no warnings. Can use 1073741824 instead of the triple product naturally
Double the previous array:
char a[2 x 1024 x 1024 x 1024];
Result in GCC: "error: size of array 'a' is negative" => That's a hint that the array argument accepted/expected is of type signed int
Based on the previous, cast the argument:
char a[(unsigned)2 x 1024 x 1024 x 1024];
Result: error relocation truncated to fit appears, along with this warning: "integer overflow in expression of type 'int'"
Workaround: use dynamic memory. Function malloc() takes an argument of type size_t which is a typedef of unsigned long long int thus avoiding the limitation
This has been my experience using GCC on Windows. Just my 2 cents.

I encountered the "relocation truncated" error on a MIPS machine. The -mcmodel=medium flag is not available on mips, instead -mxgot may help there.

I ran into the exact same issue. After compiling without the -fexceptions build flag, the file compiled with no issue

I ran into this error on 64 bit Windows when linking a c++ program which called a nasm function. I used nasm for assembly and g++ to compile the c++ and for linking.
In my case this error meant I needed DEFAULT REL at the top of my nasm assembler code.
It's written up in the NASM documentation:
Chapter 11: Writing 64-bit Code (Unix, Win64)
Obvious in retrospect, but it took me days to arrive there, so I decided to post this.
This is a minimal version of the C++ program:
> extern "C" { void matmul(void); }
int main(void) {
matmul();
return 0;
}
This is a minimal version of the nasm program:
; "DEFAULT REL" means we can access data in .bss, .data etc
; because we generate position-independent code in 64-bit "flat" memory model.
; see NASM docs
; Chapter 11: Writing 64-bit Code (Unix, Win64)
;DEFAULT REL
global matmul
section .bss
align 32 ; because we want to move 256 bit packed aligned floats to and from it
saveregs resb 32
section .text
matmul:
push rbp ; prologue
mov rbp,rsp ; aligns the stack pointer
; preserve ymm6 in local variable 'saveregs'
vmovaps [saveregs], ymm6
; restore ymm6 from local variable 'saveregs'
vmovaps ymm6, [saveregs]
mov rsp,rbp ; epilogue
pop rbp ; re-aligns the stack pointer
ret
With DEFAULT REL commented out, I got the error message above:
g++ -std=c++11 -c SO.cpp -o SOcpp.o
\bin\nasm -f win64 SO.asm -o SOnasm.obj
g++ SOcpp.o SOnasm.obj -o SO.exe
SOnasm.obj:SO.asm:(.text+0x9): relocation truncated to fit: IMAGE_REL_AMD64_ADDR32 against `.bss'
SOnasm.obj:SO.asm:(.text+0x12): relocation truncated to fit: IMAGE_REL_AMD64_ADDR32 against `.bss'
collect2.exe: error: ld returned 1 exit status

With GCC, there's a -Wl,--default-image-base-low option that sometimes helps to deal with such errors, e.g. in some MSYS2 / MinGW configurations.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight