Linker: Enforce symbol ordering in resulting binary (C)

I am building a library which roughly boils down to this:
// foo.c
extern void func();
int main() {
    // ...
}
I compile with gcc -o foo func.o foo.c.
This results in a binary where the symbol func is placed before main (i.e. has a lower address).
However, if I add optimization, e.g. -O3, the linker decides to place func after main.
Is there a way to enforce this order?

Some linkers seem to store their symbols in a key-value table, with the hashed symbol name as the key. When it comes to address allocation, they may then follow the order of that hash table, which is not necessarily the order in which the symbols were encountered.
I have never found a way to control this behavior. It happened to me with global variables.
You might get some control if you use a specific linker script and assign the functions to dedicated segments/sections.
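As a sketch of that approach with GNU ld: assuming the objects are compiled with -ffunction-sections, so that each function lands in its own input section (.text.func, .text.main), a custom linker script can list those input sections in the desired order. This is a hypothetical fragment, meant to be merged into the toolchain's default script (or supplied as a complete script via gcc -T):
/* hypothetical linker-script fragment; requires -ffunction-sections */
SECTIONS
{
  .text :
  {
    *(.text.func)    /* func placed first, i.e. at the lower address */
    *(.text.main)    /* main placed after func */
    *(.text .text.*) /* any remaining code */
  }
}
The same effect can be achieved without -ffunction-sections by tagging each function with __attribute__((section("..."))) and listing those custom section names instead.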

Related

Making a function that defaults to aliasing an externally defined symbol in gcc/ld

I have a header-only library that's currently calling malloc and free.
This header is included in a lot of different static libraries, which are used to build differently configured programs.
I would like to be able to replace those calls with calls into another allocator, at link time -- based on whether that allocator library is included in the link step, without affecting other calls to malloc and free.
My idea is to have the library call customizable_malloc and customizable_free and have those symbols resolve to malloc and free "by default" -- then the allocator library can provide alternate definitions for customizable_malloc and customizable_free.
However, I messed around with weak/alias/weakref attributes and I can't seem to get anything to work. Is there a way to do this?
Note: I know I can create an extra layer of indirection: customizable_malloc could be a weak alias to a function that calls malloc. But that adds a level of indirection that seems unnecessary.
Ideally, here's the steps I want the linker to take when it comes across a call to customizable_malloc:
Check if a definition for customizable_malloc exists
If it does, call it
If it does not, behave as if the call was to regular malloc.
Clarifying note: In a single-target scenario, this could be done with #define. The library could create macros customizable_malloc and customizable_free that default to malloc and free. However, this doesn't work in this case since things are being built into static libraries without knowledge of whether there's an override.
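Roughly, that single-target macro default would look like this (a sketch only; the #ifndef guards are just one way to allow an override to be defined before the header is included):
#include <stdlib.h>
/* single-target sketch: map the customizable names to malloc/free
   unless the build has already defined them to something else */
#ifndef customizable_malloc
#define customizable_malloc malloc
#endif
#ifndef customizable_free
#define customizable_free free
#endif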
The extra level of indirection is the only way to do it. ELF (and other real-world binary formats') symbol definition syntax (including for weak symbols) does not offer any way to define a symbol as a reference to an external definition from somewhere else.
Just do the wrapper approach you're considering. It's simple, clean, and relative to the cost of malloc/free it's not going to make any big difference in performance.
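A minimal sketch of that wrapper, assuming GCC/Clang's weak attribute on ELF (the names follow the question; an allocator library overrides these defaults simply by providing ordinary strong definitions):
#include <stdlib.h>

/* default implementations: weak, so a strong definition elsewhere wins */
__attribute__((weak)) void *customizable_malloc(size_t sz)
{
    return malloc(sz);   /* falls through to the regular allocator */
}

__attribute__((weak)) void customizable_free(void *p)
{
    free(p);
}
Because each body is just a tail call, an optimizing compiler typically reduces the wrapper to the single unconditional jump mentioned further down.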
You can achieve the desired outcome using the GNU ld --defsym option.
Example:
#include <malloc.h>
#include <stdio.h>

void *custom_malloc(size_t sz);

int main()
{
    void *p = custom_malloc(1);
    void *q = malloc(42); // important: malloc needs to be referenced somewhere
    printf("p = %p, q = %p\n", p, q);
    return 0;
}
Compiling this with gcc -c t.c and then linking will (naturally) fail with an undefined reference to custom_malloc (if no library providing custom_malloc is linked in):
$ gcc t.o
/usr/bin/ld: t.o: in function `main':
t.c:(.text+0xe): undefined reference to `custom_malloc'
collect2: error: ld returned 1 exit status
Adding --defsym=custom_malloc=malloc solves this:
$ gcc t.o -Wl,--defsym=custom_malloc=malloc && ./a.out
p = 0x558ca4dc22a0, q = 0x558ca4dc22c0
P.S. If malloc is not linked into the program (i.e. if I comment out the // important line), then --defsym fails:
$ gcc t.c -Wl,--defsym=custom_malloc=malloc && ./a.out
/usr/bin/ld:--defsym:1: unresolvable symbol `malloc' referenced in expression
...
But that is (I believe) not very relevant to your scenario.
P.P.S. As R correctly stated, the "extra level of indirection" could be a single unconditional JMP malloc instruction, and the overhead of such indirection is unlikely to be measurable.

Communicating symbols via linker sections

In C (x86 linux ELF, gcc/clang), is it possible to communicate symbol information through linker use/abuse? For example, say I have the following setup:
// foo.c
void a_foo_function() {...}
void b_foo_function() {...}
// bar.c
void a_bar_function() {...}
void b_bar_function() {...}
void c_bar_function() {...}
// master.c
void *array_of_function_pointers;
int main() {
    // do things with function_pointers
}
I would like array_of_function_pointers to be an array containing pointers to a_foo_function and a_bar_function. In this way, master.c could interact with functions defined in foo.c and bar.c without having to explicitly know about them. I recall seeing this done before by using custom sections (a la __attribute__((section("name"))), but I can't remember exactly what tricks were played.
From what I remember, the setup allowed master.c to stay unmodified, and any child could register some or all of its functions via linker black magic, without having to write much, if any, boilerplate. Any gurus have some insight?
One way to achieve this is to place the individual function pointers into an orphan section (that is, a section which is not placed by the linker script); see the Orphan Sections chapter of the ld manual, and the final paragraph of its Input Section Example.
For an orphan section whose name is a valid C identifier, the linker automatically defines __start_NAME and __stop_NAME symbols at its boundaries. You can declare these symbols as pointers to pointers (or as arrays), and use them to iterate over the section contents (the pointers stored there).
This approach is used in glibc for various purposes. For example, this commit adds libio vtable verification to glibc. The special section is called __libc_IO_vtables, and the start and stop symbols are __start___libc_IO_vtables and __stop___libc_IO_vtables.
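A rough sketch of the pattern applied to the question's setup, assuming GCC/Clang on ELF; the section name fn_ptrs and the variable names are made up, and the name must be a valid C identifier so that GNU ld emits the start/stop symbols:
typedef void (*fn_t)(void);

/* in foo.c: register a_foo_function without master.c knowing about it */
void a_foo_function(void) { /* ... */ }
static fn_t foo_entry __attribute__((section("fn_ptrs"), used)) = a_foo_function;

/* in bar.c: same idea for a_bar_function */
void a_bar_function(void) { /* ... */ }
static fn_t bar_entry __attribute__((section("fn_ptrs"), used)) = a_bar_function;

/* in master.c: iterate over whatever ended up in the section */
extern fn_t __start_fn_ptrs[], __stop_fn_ptrs[];

int main(void)
{
    for (fn_t *p = __start_fn_ptrs; p < __stop_fn_ptrs; ++p)
        (*p)();          /* call every registered function */
    return 0;
}
The used attribute keeps the pointer variables from being discarded even though nothing references them by name.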

C - Linkage process misunderstanding

Assume I have a header file with a function declaration:
test.h:
int func(int a);
main.c:
#include "test.h"
int main() {
    return func(5);
}
test.c (without including test.h):
int func(int x) {
    return x * x;
}
I understand why both files compile, but I thought that since test.c doesn't include the header, the linker wouldn't be able to recognize that this is the implementation. But it did.
So, why did it?
Are there any "rules" for when I should include header files?
By the time linkage takes place, header files are long gone. The linker works on so-called object files. An object file is compiled from each translation unit, i.e. in our case the C files. Symbols that are not defined within a given object file will be resolved by the linker, which looks at all the other object files and tries to resolve the symbol.
In our case, test.c is compiled into test.o and defines a single symbol: func. main.c is compiled into main.o, which defines the main symbol and refers to an external symbol func. Then test.o and main.o are fed into the linker, which (starting from main) will resolve func from test.o.
Header files are only a pre-processor thing. The term you're looking for is translation unit, which is a fully pre-processed source file with all headers included. It's this translation unit that the compiler sees and uses as input (actually it's a little more complicated than this, but let's keep it simple) to create the object files for the linker to use.
The linker knows nothing of "header files". Instead it examines the object files, which are in a special format that contains all the information needed, like tables of exported symbols and other tables of undefined but referenced symbols. The linker then uses this information from all object files and all libraries to construct the final executable program.
So in the object file generated from the main.c source file, there is a table saying "the symbol func is used but not defined here", and in the object file for the test.c source file there is a table saying "the symbol func is defined here". When the linker looks through the object files, it can match the usage of func in one object file to the definition in the other object file.
If you want to understand what the linker is doing, look at its inputs.
Compile main.o and test.o first. Then examine them with nm or objdump: you'll see that main.o has an undefined symbol func, and test.o has a defined symbol of the same name.
The linker never sees your code, your headers, or anything except those intermediate object files. Everything it needs is in there, and the only thing it matches is the symbol name and type.
Note that in C, there isn't even any information in the symbol about the function's number of arguments, or their type, or the return value. If you change test.c to declare func taking two arguments, the program will still link and run - at least, it will start running but may crash. If it survives, one of the arguments will be uninitialized. This mismatch (between func's declaration and definition) is why it's recommended to include the header in test.c, so the compiler can catch your mistake before the linker does something stupid.
Header files tell a C source file about shared information. These can be:
type definitions, such as structures;
prototypes of functions. The prototype tells what the return type is and what the parameters and their type are. This helps the compiler to check you are using return and parameters of the function correctly. Without a prototype, the compiler will assume the return type is int and the function can have any number and type of parameters;
symbolic constants, created by #defines and macros;
the names and types of global variables.
Including the header file in your compilation units (C source files) helps you to share this information with your compilation units.
The compiler will compile the unit, whether it uses include files or not, and is left with a number of symbols (variables and functions) that are not in the current compilation unit. It notes these in the object file. Now, when the linker collects all object files to create the executable, it will search these object files and libraries for the unresolved symbols. If any of these symbols cannot be found, then no executable is created.
So no, the compiler doesn't need header files.
The linker only looks at the name, func. main.o requires it. test.o provides it. So it works!
But...
Implementation files should always include their own header!
I always include the header file in the .c file to ensure that the declaration and definition have matching signatures.
Let's say you change the definition and forget to change the header. If you haven't included test.h from test.c, the compiler won't be able to see that the declaration and definition differ. But when you run your program you will get undefined behaviour.
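For example (a sketch using the question's files): with the header included, the mismatch is caught at compile time rather than at run time:
/* test.c, with the header included */
#include "test.h"         /* declares: int func(int a); */

int func(int x, int y)    /* definition changed, header forgotten */
{
    return x * y;         /* compile error: conflicting types for 'func' */
}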
After so many answers there is not much to add but, even if it may seem a little off topic, you may want to know about decorations.
The decorations, or mangling, applied to symbol names are used as a simple way to transfer checks from the source level to the object level.
In C modules, for __stdcall functions, MS simply adds an @ sign after the regular name, followed by the total number of parameter bytes:
int func(char *)  ->  _func@4
int func(double)  ->  _func@8
For C++ the mechanism is even stronger: characters representing the return type, the class, the namespace, and the parameter types are added to the symbol name.
The result is a 'fully qualified' function name. In this way it is very difficult for a wrong symbol to be associated with a function.
The linker doesn't really care about headers...
You only need headers to tell the compiler that main.c can use func(), because it WILL be there.
The linker just takes all the symbols and puts them into one executable file.
NOTE: this is a simplistic view

What is the visibility/scope of a global variable in C?

I have two .c files: A1.c and A2.c.
A1.c as follows:
int i=0;
void main()
{}
A2.c as follows:
int i=0;
void func()
{}
It compiles well but when I try to link these two .o files, there is a "multiple definition of i" error.
I understand i is a global variable here, but doesn't it need an extern keyword to be used in other files? And in my project I'm not using extern. So how come I get an error?
At compile time, the compiler exports each global symbol to the assembler as either strong or weak, and the assembler encodes this information implicitly in the symbol table of the relocatable object file. Functions and initialized global variables get strong symbols. Uninitialized global variables get weak symbols.
Given this notion of strong and weak symbols, Unix linkers use the following rules for dealing with multiply defined symbols:
Rule 1: Multiple strong symbols are not allowed.
Rule 2: Given a strong symbol and multiple weak symbols, choose the strong symbol.
Rule 3: Given multiple weak symbols, choose any of the weak symbols.
Your code,
A1.c as follows:
int i=0; // Strong Symbol
void main() {}
A2.c as follows:
int i=0; // Strong symbol
void func() {}
As per Rule 1 this is not allowed.
For more detailed information: http://www.geeksforgeeks.org/how-linkers-resolve-multiply-defined-global-symbols/
Long story short, a statement like
extern int i;
is a declaration, while the statement
int i=0;
is a definition.
In C you can declare a variable many times in a program, but you can define it only once. The first statement tells A2 that the definition of the variable i is in another file. Incidentally, I can't understand why you are so apprehensive about using extern.
In C, a global variable can be accessed from another compilation unit as long as that other compilation unit is told that it exists, by declaring it extern. The linker then does the job of matching the extern declaration with the definition in the other .c file.
If you want it to be visible only to the .c file that you are compiling, you must specify it as static:
static int i = 0;
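A minimal sketch of the two options (the names mirror the question's files):
/* A1.c: the one and only definition */
int i = 0;

/* A2.c, option 1: share A1.c's variable */
extern int i;            /* declaration only, no storage allocated here */
void func(void) { i++; } /* modifies the same i that main() sees */

/* A2.c, option 2: keep a separate, file-local variable instead */
/* static int i = 0; */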
Of course it fails at link time: the linker is asked to combine two object files that each define an object named i, at two different memory locations.
In such cases, the real definition of your variable must be UNIQUE across all your source code, and all other references to this variable must be made through the extern keyword (as you said).
The compiler doesn't complain because it doesn't know about the relationship between your two files; only the linker has to figure that out.

Compiler optimizations not compiling constant?

I have the following string declared as a constant in my code. The purpose is to provide a crude and simple way of storing simple metadata in the compiled output.
const char myString1[] = "abc123\0";
const char myString2[] = {'a','b','c','1','2','3','\0'};
When I inspect the output with a hex editor, I see other string constants but "abc123" does not appear. This leads me to believe that the optimizations that are enabled are causing the lines not to be compiled, as they are never referenced in the program.
Is there a way in code to force this to compile, or another way (in code) of getting this metadata into the binary? I don't want to do any manipulation of the binary post-compile, the goal is to keep it as simple as possible.
Compiler flags:
-O2 -g -Wall -c -fmessage-length=0 -fno-builtin -ffunction-sections -mcpu=cortex-m3 -mthumb
I think you are looking for the used attribute:
`used'
This attribute, attached to a variable, means that the variable
must be emitted even if it appears that the variable is not
referenced.
When applied to a static data member of a C++ class template, the
attribute also means that the member will be instantiated if the
class itself is instantiated.
Apply it like
__attribute__((used))
const char myString1[] = "abc123\0";
__attribute__((used))
const char myString2[] = {'a','b','c','1','2','3','\0'};
Given the compiler flags you posted, it is almost certainly the linker. The -ffunction-sections flag puts each function into its own section in the object files, which (typically together with the linker's --gc-sections or a linker script that only keeps referenced sections) allows the linker to determine that a function or data item is not referenced and omit it from the final binary.
Use the binutils strings command to see whether these strings are present in your binary.
If they have been optimized out, you can try adding the volatile qualifier when you declare them. Note that even with the volatile qualifier, some compilers may still optimize them out if the variables are never used.
I've come up with a solution that uses attributes and involves modifying the link script.
First I define a custom section called ".metadata".
__attribute__ ((section(".metadata")))
Then, in the SECTIONS block of the .ld script, I added KEEP(*(.metadata)), which forces the linker to include .metadata even if it is not referenced:
.text :
{
    KEEP(*(.isr_vector))
    KEEP(*(.metadata))
    *(.text*)
    *(.rodata*)
} > MFlash32
NOTE
I found that the __attribute__ keyword had to be on the same line as the variable or else it didn't actually show up in the binary, though the .metadata section did show up in the memory map.
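Putting it together, the declarations end up looking like this (a sketch, keeping the attribute on the same line as the variable, per the note above):
__attribute__((section(".metadata"))) const char myString1[] = "abc123\0";
__attribute__((section(".metadata"))) const char myString2[] = {'a','b','c','1','2','3','\0'};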
If you have these variables at file scope, the compiler must emit the strings, since it cannot know whether they will be used from a different compilation unit. So any of your .o files where you place these variables must contain the strings.
Now, a clever linker could decide that these constants are not needed in the final binary. (I have never observed that, though.) If this is the case on your platform, you should use the variables on a "hypothetical" path that in reality will never be taken by the program, something like:
int main(int argc, char *argv[]) {
    switch (argv[0][0]) {
    case 1: return myString1[argv[0][1]];
    case 2: return myString2[argv[0][1]];
    }
    // ...
}

Resources