I've been programming in C and C++ for quite a long time now, so I'm familiar with the linking process as a user: the preprocessor expands all prototypes and macros in each .c file which is then compiled separately into its own object file, and all object files together with static libraries are linked into an executable.
However I'd like to know more about this process: how does the linker link the object files (what do they contain anyway?)? Matching declared but undefined functions with their definitions in other files (how?)? Translating into the exact content of the program memory (context: microcontrollers)?
Application example
Ideally, I'm looking for a detailed step-by-step description of what the process is doing, based on the following simplistic example. Since it doesn't appear to be said anywhere, fame and glory to whoever answers in this way.
main.c
#include "otherfile.h"
int main(void) {
otherfile_print("Foo");
return 0;
}
otherfile.h
void otherfile_print(char const *);
otherfile.c
#include "otherfile.h"
#include <stdio.h>
void otherfile_print(char const *str) {
printf(str);
}
printf is insanely complicated, very bad for a microcontroller hello world example, blinking leds are better but that gets specific to the microcontroller. this will suffice for linking.
two.c
unsigned int glob;
unsigned int two ( unsigned int a, unsigned int b )
{
glob=5;
return(a+b+7);
}
one.c
extern unsigned int glob;
unsigned int two ( unsigned int, unsigned int );
unsigned int one ( void )
{
return(two(5,6)+glob);
}
start.s
.globl _start
_start:
bl one
b .
build everything.
% arm-none-eabi-gcc -O2 -c one.c -o one.o
% arm-none-eabi-gcc -O2 -c two.c -o two.o
% touch start.s
% arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -c one.c -o one.o
% arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -c two.c -o two.o
% arm-none-eabi-as start.s -o start.o
% arm-none-eabi-ld -Ttext=0x10000000 start.o one.o two.o -o onetwo.elf
now lets look...
arm-none-eabi-objdump -D start.o
...
00000000 <_start>:
0: ebfffffe bl 0 <one>
4: eafffffe b 4 <_start+0x4>
it not is the compiler/assemblers job to deal with external references so the branch link to one is left incomplete, they chose to make it a bl of 0 but they could have simply left it totally unencoded, it is up to the authors of the toolchain as to how to communicate between the compiler, assembler, and linker via object files.
Same here
00000000 <one>:
0: e92d4008 push {r3, lr}
4: e3a00005 mov r0, #5
8: e3a01006 mov r1, #6
c: ebfffffe bl 0 <two>
10: e59f300c ldr r3, [pc, #12] ; 24 <one+0x24>
14: e5933000 ldr r3, [r3]
18: e0800003 add r0, r0, r3
1c: e8bd4008 pop {r3, lr}
20: e12fff1e bx lr
24: 00000000 andeq r0, r0, r0
both the function two and the address for the global variable glob are unknown. Note that for the unknown variable the compiler generates code that requires the explicit address of the global so that the linker simply needs to fill in the address, also glob is .data not .text.
00000000 <two>:
0: e59f3010 ldr r3, [pc, #16] ; 18 <two+0x18>
4: e2811007 add r1, r1, #7
8: e3a02005 mov r2, #5
c: e0810000 add r0, r1, r0
10: e5832000 str r2, [r3]
14: e12fff1e bx lr
18: 00000000 andeq r0, r0, r0
here too the global is in .data not here, so the linker will have to place .data and the things in it and then fill in the addresses.
so here we have linked it all together, the gnu linker requires an entry point label defined _start (main is an extern address required by the standard bootstrap, which I am not using so we dont get a main not found error). Because I am not using a linker script the gnu linker places items in the binary in the order they were defined on the command line, as desired i need start first for a microcontroller since I am controlling the boot. I used a non-zero here for demonstration purposes as well...
10000000 <_start>:
10000000: eb000000 bl 10000008 <one>
10000004: eafffffe b 10000004 <_start+0x4>
10000008 <one>:
10000008: e92d4008 push {r3, lr}
1000000c: e3a00005 mov r0, #5
10000010: e3a01006 mov r1, #6
10000014: eb000005 bl 10000030 <two>
10000018: e59f300c ldr r3, [pc, #12] ; 1000002c <one+0x24>
1000001c: e5933000 ldr r3, [r3]
10000020: e0800003 add r0, r0, r3
10000024: e8bd4008 pop {r3, lr}
10000028: e12fff1e bx lr
1000002c: 1000804c andne r8, r0, ip, asr #32
10000030 <two>:
10000030: e59f3010 ldr r3, [pc, #16] ; 10000048 <two+0x18>
10000034: e2811007 add r1, r1, #7
10000038: e3a02005 mov r2, #5
1000003c: e0810000 add r0, r1, r0
10000040: e5832000 str r2, [r3]
10000044: e12fff1e bx lr
10000048: 1000804c andne r8, r0, ip, asr #32
Disassembly of section .bss:
1000804c <__bss_start>:
1000804c: 00000000 andeq r0, r0, r0
so the linker starts to place the first item start.o, it roughly figures out how big that needs to be by just putting what was there. those two instructions. they take 8 bytes so in theory the second item one.o goes next at 0x10000008. That means the encoding for the bl one in start.s can be completed to use the correct relative address (_start + 8 which is the value of the pc when executing so the offset is zero, pc+0 is the encoding)
the linker has roughly placed one.o into the binary it is building and it has to resolve the address to two and the global so it has to place two.o and then figure out where the end of that is to place in this case .bss not .data since I didnt pre-init the variable.
the label for two is at 0x10000030 so it encodes the bl two in one() for that relative offset, it has also placed glob at 1000804c for some reason (I didnt complete define where ram was so the gnu linker will do things like this). Despite the reason, that is where the linker defined the home for that global variable and where the address to glob is needed is filled in by the linker, both one() and two() needed those filled in.
So the compiler (assembler) and linker have to in the end result in a usable binary, the compiler (assembler) tend to worry about making position independent machine code and leave enough information for the linker so that it has the machine code and a list of unresolved externs that it has to fill in. compilers have improved over time, a simple model would be to have an address location like they did above for the global variables address, where the linker computes the absolute address and just fills it in, clearly above they did not encode the function call in a way that it can use an absolute address to one and two. instead it uses pc relative addressing. This means that the linker has to know the machine code encoding of the bl instruction. the current generation of gnu linker knows quite a bit more and can do some cool things resolving arm to thumb and back, stuff it didnt used to know (you dont need to compile for thumb interwork anymore the linker takes care of it).
So the linker takes binary blobs including data and...links them together into one binary. It first needs to know the actual addresses for the various things in the binary. How you tell the linker this is linker specific and not a global thing for all C/C++ toolchains. Gnu linker scripts are a programming language in and of themselves. These are not necessarily physical nor virtual addresses it is simply the address space of the code in whatever mode it is in (virtual or physical). Once the linker knows the addresses it, based on linker rules (again linker specific) it starts placing these various binary blobs into those address spaces. then it goes through and resolves the external/global addresses. It was not above but can be an iterative process. If for example the function two() was at an address in memory that cannot be accessed with a single pc relative instruction (say we put one near zero and two near 0xF0000000) then those that wrote the linker have two choices, the simple choice is to simply state that it cannot encode/implement that far of a branch and bail out and gnu linker did or still does do that. Or the other solution is the linker fixes the problem. the linker could add a few words of data within the range of the pc relative branch link and those few words of data are a trampoline for example an absolute address that is loaded into a register then a register based branch or perhaps of clever a pc relative branch if the trampoline is within range (in the case of 0x10000000 to 0xF0000000 that wouldnt work). If the linker has to add these few words then that may mean that some of the binary blobs have to move to make room for those few words and now all of the addresses in those binary blobs now have to move as well. So you have to make another pass across all the binary blobs, resolving all of the new addresses filling in the answers and for pc relative determining if you can still reach everything. Adding those few words might have made something that was reachable with a pc-relative now unreachable and now that requires a solution (error or patch).
The assembler itself for a single source file has to go through even more of these gyrations esp for a variable length instruction set like x86 where the addressing is a big vague. I recommend trying for yourself to make a simple assembler that only supports a few instructions but some of those branches. and parse and encode the instructions and compare that to an existing debugged assembler like gnu assembler.
test.s
ldr r1,locdat
nop
nop
nop
nop
nop
b over
locdat: .word 0x12345678
top:
nop
nop
nop
nop
nop
nop
over:
b top
the right answer is
00000000 <locdat-0x1c>:
0: e59f1014 ldr r1, [pc, #20] ; 1c <locdat>
4: e1a00000 nop ; (mov r0, r0)
8: e1a00000 nop ; (mov r0, r0)
c: e1a00000 nop ; (mov r0, r0)
10: e1a00000 nop ; (mov r0, r0)
14: e1a00000 nop ; (mov r0, r0)
18: ea000006 b 38 <over>
0000001c <locdat>:
1c: 12345678 eorsne r5, r4, #120, 12 ; 0x7800000
00000020 <top>:
20: e1a00000 nop ; (mov r0, r0)
24: e1a00000 nop ; (mov r0, r0)
28: e1a00000 nop ; (mov r0, r0)
2c: e1a00000 nop ; (mov r0, r0)
30: e1a00000 nop ; (mov r0, r0)
34: e1a00000 nop ; (mov r0, r0)
00000038 <over>:
38: eafffff8 b 20 <top>
there are parallels to that activity and the job of a linker. also you could fashion a simple linker based on the above files or something similar, extract the binary blobs and other info and start placing them in whatever address space you want.
Either one are fairly simple programming tasks, yet fairly educational. Having an existing toolchain that can produce the answer you can figure out where you are going wrong or how to get at the right answer.
Related
Context: STM32F469 Cortex-M4 (ARMv7-M Thumb-2), Win 10, GCC, STM32CubeIDE; Learning/Trying out inline assembly & reading disassembly, stack managements etc., writing to core registers, observing contents of registers, examining RAM around stack pointer to understand how things work.
I've noticed that at some point, when I call a function, in the beginning of a called function, which received an argument, the instructions generated for the C function do "store R3 at RAM address X" followed immediately "Read RAM address X and store in RAM". So it's writing and reading the same value back, R3 is not changed. If it only had wanted to save the value of R3 onto the stack, why load it back then?
C code, caller function (main), my code:
asm volatile(" LDR R0,=#0x00000000\n"
" LDR R1,=#0x11111111\n"
" LDR R2,=#0x22222222\n"
" LDR R3,=#0x33333333\n"
" LDR R4,=#0x44444444\n"
" LDR R5,=#0x55555555\n"
" LDR R6,=#0x66666666\n"
" MOV R7,R7\n" //Stack pointer value is here, used for stack data access
" LDR R8,=#0x88888888\n"
" LDR R9,=#0x99999999\n"
" LDR R10,=#0xAAAAAAAA\n"
" LDR R11,=#0xBBBBBBBB\n"
" LDR R12,=#0xCCCCCCCC\n"
);
testInt = addFifteen(testInt); //testInt=0x03; returns uint8_t, argument uint8_t
Function call generates instructions to load function argument into R3, then move it to R0, then branch with link to addFifteen. So by the time I enter addFifteen, R0 and R3 have value 0x03 (testInt). So far so good. Here is what function call looks like:
testInt = addFifteen(testInt);
08000272: ldrb r3, [r7, #11]
08000274: mov r0, r3
08000276: bl 0x80001f0 <addFifteen>
So I go into addFifteen, my C code for addFifteen:
uint8_t addFifteen(uint8_t input){
return (input + 15U);
}
and its disassembly:
addFifteen:
080001f0: push {r7}
080001f2: sub sp, #12
080001f4: add r7, sp, #0
080001f6: mov r3, r0
080001f8: strb r3, [r7, #7]
080001fa: ldrb r3, [r7, #7]
080001fc: adds r3, #15
080001fe: uxtb r3, r3
08000200: mov r0, r3
08000202: adds r7, #12
08000204: mov sp, r7
08000206: ldr.w r7, [sp], #4
0800020a: bx lr
My primary interest is in 1f8 and 1fa lines. It stored R3 on stack and then loads freshly written value back into the register that still holds the value anyway.
Questions are:
What is the purpose of this "store register A into RAM X, next read value from RAM X into register A"? Read instruction doesn't seem to serve any purpose. Make sure RAM write is complete?
Push{r7} instruction makes stack 4-byte aligned instead of 8-byte aligned. But immediately after that instruction we have SP decremented by 12 (bytes), so it becomes 8-byte aligned again. Therefore, this behavior is ok. Is this statement correct? What if an interrupt happens between these two instructions? Will alignment be fixed during ISR stacking for the duration of ISR?
From what I read about caller/callee saved registers (very hard to find any sort of well-organized information on that, if you have good material, please, share a link), at least R0-R3 must be placed on stack when I call a function. However, it's easy to notice in this case that NONE of the registers were pushed on stack, and I verified it by checking memory around stack pointer, it would have been easy to notice 0x11111111 and 0x22222222, but they aren't there, and nothing is pushing them there. The values in R0 and R3 that I had before I called the function are simply gone forever. Why weren't any registers pushed on stack before function call? I would expect to have R3 0x33333333 when addFifteen returns because that's how it was before function call, but that value is casually overwritten even before branch to addFifteen. Why didn't GCC generate instructions to push R0-R3 onto the stack and only after that branch with link to addFifteen?
If you need some compiler settings, please, let me know where to find them in Eclipse (STM32CubeIDE) and what exactly you need there, I will happily provide them and add them to the question here.
uint8_t addFifteen(uint8_t input){
return (input + 15U);
}
What you are looking at here is unoptimized and at least with gnu the input and local variables get a memory location on the stack.
00000000 <addFifteen>:
0: b480 push {r7}
2: b083 sub sp, #12
4: af00 add r7, sp, #0
6: 4603 mov r3, r0
8: 71fb strb r3, [r7, #7]
a: 79fb ldrb r3, [r7, #7]
c: 330f adds r3, #15
e: b2db uxtb r3, r3
10: 4618 mov r0, r3
12: 370c adds r7, #12
14: 46bd mov sp, r7
16: bc80 pop {r7}
18: 4770 bx lr
What you see with r3 is that the input variable, input, comes in r0. For some reason, code is not optimized, it goes into r3, then it is saved in its memory location on the stack.
Setup the stack
00000000 <addFifteen>:
0: b480 push {r7}
2: b083 sub sp, #12
4: af00 add r7, sp, #0
save input to the stack
6: 4603 mov r3, r0
8: 71fb strb r3, [r7, #7]
so now we can start implementing the code in the function which wants to do math on the input function, so do that math
a: 79fb ldrb r3, [r7, #7]
c: 330f adds r3, #15
Convert the result to an unsigned char.
e: b2db uxtb r3, r3
Now prepare the return value
10: 4618 mov r0, r3
and clean up and return
12: 370c adds r7, #12
14: 46bd mov sp, r7
16: bc80 pop {r7}
18: 4770 bx lr
Now if I tell it not to use a frame pointer (just a waste of a register).
00000000 <addFifteen>:
0: b082 sub sp, #8
2: 4603 mov r3, r0
4: f88d 3007 strb.w r3, [sp, #7]
8: f89d 3007 ldrb.w r3, [sp, #7]
c: 330f adds r3, #15
e: b2db uxtb r3, r3
10: 4618 mov r0, r3
12: b002 add sp, #8
14: 4770 bx lr
And you can still see each of the fundamental steps in implementing the function. Unoptimized.
Now if you optimize
00000000 <addFifteen>:
0: 300f adds r0, #15
2: b2c0 uxtb r0, r0
4: 4770 bx lr
It removes all the excess.
number two.
Yes I agree this looks wrong, but gnu certainly does not keep the stack on an alignment at all times, so this looks wrong. But I have not read the details on the arm calling convention. Nor have I read to see what gcc's interpretation is. Granted they may claim a spec, but at the end of the day the compiler authors choose the calling convention for their compiler, they are under no obligation to arm or intel or others to conform to any spec. Their choice, and like the C language itself, there are lots of places where it is implementation defined and gnu implements the C language one way and others another way. Perhaps this is the same. Same goes for this saving of the incoming variable to the stack. We will see that llvm/clang does not.
number three.
r0-r3 and another register or two may be called caller saved, but the better way to think of them is volatile. The callee is free to modify them without saving them. It is not so much a case of saving the r0 register, but instead r0 represents a variable and you are managing that variable in functionally implementing the high level code.
For example
unsigned int fun1 ( void );
unsigned int fun0 ( unsigned int x )
{
return(fun1()+x);
}
00000000 <fun0>:
0: b510 push {r4, lr}
2: 4604 mov r4, r0
4: f7ff fffe bl 0 <fun1>
8: 4420 add r0, r4
a: bd10 pop {r4, pc}
x comes in in r0, and we need to preserve that value until after fun1() is called. r0 can be destroyed/modified by fun1(). So in this case they save r4, not r0, and keep x in r4.
clang does this as well
00000000 <fun0>:
0: b5d0 push {r4, r6, r7, lr}
2: af02 add r7, sp, #8
4: 4604 mov r4, r0
6: f7ff fffe bl 0 <fun1>
a: 1900 adds r0, r0, r4
c: bdd0 pop {r4, r6, r7, pc}
Back to your function.
clang, unoptimized also keeps the input variable in memory (stack).
00000000 <addFifteen>:
0: b081 sub sp, #4
2: f88d 0003 strb.w r0, [sp, #3]
6: f89d 0003 ldrb.w r0, [sp, #3]
a: 300f adds r0, #15
c: b2c0 uxtb r0, r0
e: b001 add sp, #4
10: 4770 bx lr
and you can see the same steps, prep the stack, store the input variable. Take the input variable do the math. Prepare the return value. Clean up, return.
Clang/llvm optimized:
00000000 <addFifteen>:
0: 300f adds r0, #15
2: b2c0 uxtb r0, r0
4: 4770 bx lr
Happens to be the same as gnu. Not expected that any two different compilers generate the same code, nor any expectation that any two versions of the same compiler generate the same code.
unoptimized, the input and local variables (none in this case) get a home on the stack. So what you are seeing is the input variable being put in its home on the stack as part of the setup of the function. Then the function itself wants to operate on that variable so, unoptimized, it needs to fetch that value from memory to create an intermediate variable (that in this case did not get a home on the stack) and so on. You see this with volatile variables as well. They will get written to memory then read back then modified then written to memory and read back, etc...
yes I agree, but I have not read the specs. End of the day it is gcc's calling convention or interpretation of some spec they choose to use. They have been doing this (not being aligned 100% of the time) for a long time and it does not fail. For all called functions they are aligned when the functions are called. Interrupts in arm code generated by gcc is not aligned all the time. Been this way since they adopted that spec.
by definition r0-r3, etc are volatile. The callee can modify them at will. The callee only needs to save/preserve them if IT needs them. In both the unoptimized and optimized cases only r0 matters for your function it is the input variable and it is used for the return value. You saw in the function I created that the input variable was preserved for later, even when optimized. But, by definition, the caller assumes these registers are destroyed by called functions, and called functions can destroy the contents of these registers and no need to save them.
As far as inline assembly goes, which is a different assembly language than "real" assembly language. I think you have a ways to go before being ready for that, but maybe not. After decades of constant bare metal work I have found zero real use cases for inline assembly, the cases I see are laziness avoiding allowing real assembly into the make system or ways to avoid writing real assembly language. I see it as a ghee whiz feature that folks use like unions and bitfields.
Within gnu, for arm, you have at least four incompatible assembly languages for arm. The not unified syntax real assembly, the unified syntax real assembly. The assembly language that you see when you use gcc to assemble instead of as and then inline assembly for gcc. Despite claims of compatibility clang arm assembly language is not 100% compatible with gnu assembly language and llvm/clang does not have a separate assembler you feed it to the compiler. Arms various toolchains over the years have completely incompatible assembly language to gnu for arm. This is all expected and normal. Assembly language is specific to the tool not the target.
Before you can get into inline assembly language learn some of the real assembly language. And to be fair perhaps you do, and perhaps quite well, and this question is about the discover of how compilers generate code, and how strange it looks as you find out that it is not some one to one thing (all tools in all cases generate the same output from the same input).
For inline asm, while you can specify registers, depending on what you are doing, you generally want to let the compiler choose the register, most of the work for inline assembly is not the assembly but the language that specific compiler uses to interface it...which is compiler specific, move to another compiler and the expectation is a whole new language to learn. While moving between assemblers is also a whole new language at least the syntax of the instructions themselves tend to be the same and the language differences are in everything else, labels and directives and such. And if lucky and it is a toolchain not just an assembler, you can look at the output of the compiler to start to understand the language and compare it to any documentation you can find. Gnus documentation is pretty bad in this case, so a lot of reverse engineering is needed. At the same time you are more likely to be successful with gnu tools over any other, not because they are better, in many cases they are not, but because of the sheer user base and the common features across targets and over decades of history.
I would get really good at interfacing asm with C by creating mock C functions to see which registers are used, etc. And/or even better, implement it in C, compile it, then hand modify/improve/whatever the output of the compiler (you do not need to be a guru to beat the compiler, to be as consistent, perhaps, but fairly often you can easily see improvements that can be made on the output of gcc, and gcc has been getting worse over the last several versions it is not getting better, as you can see from time to time on this site). Get strong in the asm for this toolchain and target and how the compiler works, and then perhaps learn the gnu inline assembly language.
I'm not sure there is a specific purpose to do it. it is just one solution that the compiler has found to do it.
For example the code:
unsigned int f(unsigned int a)
{
return sqrt(a + 1);
}
compiles with ARM GCC 9 NONE with optimisation level -O0 to:
push {r7, lr}
sub sp, sp, #8
add r7, sp, #0
str r0, [r7, #4]
ldr r3, [r7, #4]
adds r3, r3, #1
mov r0, r3
bl __aeabi_ui2d
mov r2, r0
mov r3, r1
mov r0, r2
mov r1, r3
bl sqrt
...
and in level -O1 to:
push {r3, lr}
adds r0, r0, #1
bl __aeabi_ui2d
bl sqrt
...
As you can see the asm is much easier to understand in -O1: store parameter in R0, add 1, call functions.
The hardware supports non aligned stack during exception. See here
The "caller saved" registers do not necessarily need to be stored on the stack, it's up to the caller to know whether it needs to store them or not.
Here you are mixing (if I understood correctly) C and assembly: so you have to do the compiler job before switching back to C: either you store values in callee saved registers (and then you know by convention that the compiler will store them during function call) or you store them yourself on the stack.
I just read https://www.keil.com/support/man/docs/armlink/armlink_pge1406301797482.htm. but can't understand what a veneer is that arm linker inserts between function calls.
In "Procedure Call Standard for the ARM Architecture" document, it says,
5.3.1.1 Use of IP by the linker Both the ARM- and Thumb-state BL instructions are unable to address the full 32-bit address space, so
it may be necessary for the linker to insert a veneer between the
calling routine and the called subroutine. Veneers may also be needed
to support ARM-Thumb inter-working or dynamic linking. Any veneer
inserted must preserve the contents of all registers except IP (r12)
and the condition code flags; a conforming program must assume that a
veneer that alters IP may be inserted at any branch instruction that
is exposed to a relocation that supports inter-working or long
branches. Note R_ARM_CALL, R_ARM_JUMP24, R_ARM_PC24, R_ARM_THM_CALL,
R_ARM_THM_JUMP24 and R_ARM_THM_JUMP19 are examples of the ELF
relocation types with this property. See [AAELF] for full details
Here is what I guess, is it something like this ? : when function A calls function B, and when those two functions are too far apart for the bl command to express, the linker inserts function C between function A and B in such a way function C is close to function B. Now function A uses b instruction to go to function C(copying all the registers between the function call), and function C uses bl instruction(copying all the registers too). Of course the r12 register is used to keep the remaining long jump address bits. Is this what veneer means? (I don't know why arm doesn't explain what veneer is but only what veneer provides..)
It is just a trampoline. Interworking is the easier one to demonstrate, using gnu here, but the implication is that Kiel has a solution as well.
.globl even_more
.type eve_more,%function
even_more:
bx lr
.thumb
.globl more_fun
.thumb_func
more_fun:
bx lr
extern unsigned int more_fun ( unsigned int x );
extern unsigned int even_more ( unsigned int x );
unsigned int fun ( unsigned int a )
{
return(more_fun(a)+even_more(a));
}
Unlinked object:
Disassembly of section .text:
00000000 <fun>:
0: e92d4070 push {r4, r5, r6, lr}
4: e1a05000 mov r5, r0
8: ebfffffe bl 0 <more_fun>
c: e1a04000 mov r4, r0
10: e1a00005 mov r0, r5
14: ebfffffe bl 0 <even_more>
18: e0840000 add r0, r4, r0
1c: e8bd4070 pop {r4, r5, r6, lr}
20: e12fff1e bx lr
Linked binary (yes completely unusable, but demonstrates what the tool does)
Disassembly of section .text:
00001000 <fun>:
1000: e92d4070 push {r4, r5, r6, lr}
1004: e1a05000 mov r5, r0
1008: eb000008 bl 1030 <__more_fun_from_arm>
100c: e1a04000 mov r4, r0
1010: e1a00005 mov r0, r5
1014: eb000002 bl 1024 <even_more>
1018: e0840000 add r0, r4, r0
101c: e8bd4070 pop {r4, r5, r6, lr}
1020: e12fff1e bx lr
00001024 <even_more>:
1024: e12fff1e bx lr
00001028 <more_fun>:
1028: 4770 bx lr
102a: 46c0 nop ; (mov r8, r8)
102c: 0000 movs r0, r0
...
00001030 <__more_fun_from_arm>:
1030: e59fc000 ldr r12, [pc] ; 1038 <__more_fun_from_arm+0x8>
1034: e12fff1c bx r12
1038: 00001029 .word 0x00001029
103c: 00000000 .word 0x00000000
You cannot use bl to switch modes between arm and thumb so the linker has added a trampoline as I call it or have heard it called that you hop on and off to get to the destination. In this case essentially converting the branch part of bl into a bx, the link part they take advantage of just using the bl. You can see this done for thumb to arm or arm to thumb.
The even_more function is in the same mode (ARM) so no need for the trampoline/veneer.
For the distance limit of bl lemme see. Wow, that was easy, and gnu called it a veneer as well:
.globl more_fun
.type more_fun,%function
more_fun:
bx lr
extern unsigned int more_fun ( unsigned int x );
unsigned int fun ( unsigned int a )
{
return(more_fun(a)+1);
}
MEMORY
{
bob : ORIGIN = 0x00000000, LENGTH = 0x1000
ted : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.some : { so.o(.text*) } > bob
.more : { more.o(.text*) } > ted
}
Disassembly of section .some:
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: eb000003 bl 18 <__more_fun_veneer>
8: e8bd4010 pop {r4, lr}
c: e2800001 add r0, r0, #1
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
00000018 <__more_fun_veneer>:
18: e51ff004 ldr pc, [pc, #-4] ; 1c <__more_fun_veneer+0x4>
1c: 20000000 .word 0x20000000
Disassembly of section .more:
20000000 <more_fun>:
20000000: e12fff1e bx lr
Staying in the same mode it did not need the bx.
The alternative is that you replace every bl instruction at compile time with a more complicated solution just in case you need to do a far call. Or since the bl offset/immediate is computed at link time you can, at link time, put the trampoline/veneer in to change modes or cover the distance.
You should be able to repeat this yourself with Kiel tools, all you needed to do was either switch modes on an external function call or exceed the reach of the bl instruction.
Edit
Understand that toolchains vary and even within a toolchain, gcc 3.x.x was the first to support thumb and I do not know that I saw this back then. Note the linker is part of binutils which is as separate development from gcc. You mention "arm linker", well arm has its own toolchain, then they bought Kiel and perhaps replaced Kiel's with their own or not. Then there is gnu and clang/llvm and others. So it is not a case of "arm linker" doing this or that, it is a case of the toolchains linker doing this or that and each toolchain is first free to use whatever calling convention they want there is no mandate that they have to use ARM's recommendations, second they can choose to implement this or not or simply give you a warning and you have to deal with it (likely in assembly language or through function pointers).
ARM does not need to explain it, or let us say, it is clearly explained in the Architectural Reference Manual (look at the bl instruction, the bx instruction look for the words interworking, etc. All quite clearly explained) for a particular architecture. So there is no reason to explain it again. Especially for a generic statement where the reach of bl varies and each architecture has different interworking features, it would be a long set of paragraphs or a short chapter to explain something that is already clearly documented.
Anyone implementing a compiler and linker would be well versed in the instruction set before hand and understand the bl and conditional branch and other limitations of the instruction set. Some instruction sets offer near and far jumps and some of those the assembly language for the near and far may be the same mnemonic so the assembler will often decide if it does not see the label in the same file to implement a far jump/call rather than a near one so that the objects can be linked.
In any case before linking you have to compile and assembly and the toolchain folks will have fully understood the rules of the architecture. ARM is not special here.
This is Raymond Chen's comment :
The veneer has to be close to A because B is too far away. A does a bl
to the veneer, and the veneer sets r12 to the final destination(B) and
does a bx r12. bx can reach the entire address space.
This answers to my question enough, but he doesn't want to write a full answer (maybe for lack of time..) I put it here as an answer and select it. If someone posts a better, more detailed answer, I'll switch to it.
I was looking at a arm assembly code generated by gcc, and I noticed that the GCC compiled a function with the following code:
0x00010504 <+0>: push {r7, lr}
0x00010506 <+2>: sub sp, #24
0x00010508 <+4>: add r7, sp, #0
0x0001050a <+6>: str r0, [r7, #4]
=> 0x0001050c <+8>: mov r3, lr
0x0001050e <+10>: mov r1, r3
0x00010510 <+12>: movw r0, #1664 ; 0x680
0x00010514 <+16>: movt r0, #1
0x00010518 <+20>: blx 0x10378 <printf#plt>
0x0001051c <+24>: add.w r3, r7, #12
0x00010520 <+28>: mov r0, r3
0x00010522 <+30>: blx 0x10384 <gets#plt>
0x00010526 <+34>: mov r3, lr
0x00010528 <+36>: mov r1, r3
0x0001052a <+38>: movw r0, #1728 ; 0x6c0
0x0001052e <+42>: movt r0, #1
0x00010532 <+46>: blx 0x10378 <printf#plt>
0x00010536 <+50>: adds r7, #24
0x00010538 <+52>: mov sp, r7
0x0001053a <+54>: pop {r7, pc}
The thing which was interesting for me was that, I see the GCC uses R7 to pop the values to PC instead of LR. I saw similar thing with R11. The compiler push the r11 and LR to the stack and then pop the R11 to the PC. should not LR act as return address instead of R7 or R11. Why does the R7 (which is a frame pointer in Thumb Mode) being used here?
If you look at apple ios calling convention it is even different. It uses other registers (e.g. r4 to r7) to PC to return the control. Should not it use LR?
Or I am missing something here?
Another question is that, it looks like that the LR, R11 or R7 values are never an immediate value to the return address. But a pointer to the stack which contain the return address. Is that right?
Another weird thing is that compiler does not do the same thing for function epoilogue. For example it might instead of using pop to PC use bx LR, but Why?
Well first off they likely want to keep the stack aligned on a 64 bit boundary.
R7 is better than anything greater for a frame pointer as registers r8 to r15 are not supported in most instructions. I would have to look I would assume there are special pc and sp offset load/store instructions so why would r7 be burned at all?
Not sure all you are asking, in thumb you can push lr but pop pc and I think that is equivalent to bx lr, but you have to look it up for each architecture as for some you cannot switch modes with pop. In this case it appears to assume that and not burn the extra instruction with a pop r3 bx r3 kind of thing. And actually to have done that would have likely needed to be two extra instructions pop r7, pop r3, bx r3.
So it may be a case that one compiler is told what architecture is being used and can assume pop pc is safe where another is not so sure. Again have to read the arm architecture docs for various architectures to know the variations on what instructions can be used to change modes and what cant. Perhaps if you walk through various architecture types with gnu it may change the way it returns.
EDIT
unsigned int morefun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int x, unsigned int y )
{
x+=1;
return(morefun(x,y+2)+3);
}
arm-none-eabi-gcc -O2 -mthumb -c so.c -o so.o
arm-none-eabi-objdump -D so.o
00000000 <fun>:
0: b510 push {r4, lr}
2: 3102 adds r1, #2
4: 3001 adds r0, #1
6: f7ff fffe bl 0 <morefun>
a: 3003 adds r0, #3
c: bc10 pop {r4}
e: bc02 pop {r1}
10: 4708 bx r1
12: 46c0 nop ; (mov r8, r8)
arm-none-eabi-gcc -O2 -mthumb -mcpu=cortex-m3 -march=armv7-m -c so.c -o so.o
arm-none-eabi-objdump -D so.o
00000000 <fun>:
0: b508 push {r3, lr}
2: 3102 adds r1, #2
4: 3001 adds r0, #1
6: f7ff fffe bl 0 <morefun>
a: 3003 adds r0, #3
c: bd08 pop {r3, pc}
e: bf00 nop
just using that march without the mcpu gives the same result (doesnt pop the lr to r1 to bx).
march=armv5t changes it up slightly
00000000 <fun>:
0: b510 push {r4, lr}
2: 3102 adds r1, #2
4: 3001 adds r0, #1
6: f7ff fffe bl 0 <morefun>
a: 3003 adds r0, #3
c: bd10 pop {r4, pc}
e: 46c0 nop ; (mov r8, r8)
armv4t as expected does the pop and bx thing.
armv6-m gives what armv5t gave.
gcc version 6.1.0 built using --target=arm-none-eabi without any other arm specifier.
So likely as the OP is asking if I understand right they are probably seeing the three instruction pop pop bx rather than a single pop {rx,pc}. Or at least one compiler varies compared to another. Apple IOS was mentioned so it likely defaults to a heavier duty core than a works everywhere type of thing. And their gcc like mine defaults to the work everywhere (including the original ARMv4T) rather than work everywhere but the original. I assume if you add some command line options you will see the gcc compiler behave differently as I have demonstrated.
Note in these examples r3 and r4 are not used, why are they preserving them then? It is likely the first thing I mentioned keeping a 64 bit alignment on the stack. If for the all thumb variants solution if you get an interrupt between the pops then the interrupt handler is dealing with an unaligned stack. Since r4 was throwaway anyway they could have popped r1 and r2 or r2 and r3 and then bx r2 or bx r3 respectively and not had that moment where it was unaligned and saved an instruction. Oh well...
This question already has answers here:
ARM: Why do I need to push/pop two registers at function calls?
(3 answers)
Closed 12 months ago.
I tried to write a simple test code like this(main.c):
main.c
void test(){
}
void main(){
test();
}
Then I used arm-non-eabi-gcc to compile and objdump to get the assembly code:
arm-none-eabi-gcc -g -fno-defer-pop -fomit-frame-pointer -c main.c
arm-none-eabi-objdump -S main.o > output
The assembly code will push r3 and lr registers, even the function did nothing.
main.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <test>:
void test(){
}
0: e12fff1e bx lr
00000004 <main>:
void main(){
4: e92d4008 push {r3, lr}
test();
8: ebfffffe bl 0 <test>
}
c: e8bd4008 pop {r3, lr}
10: e12fff1e bx lr
My question is why arm gcc choose to push r3 into stack, even test() function never use it? Does gcc just random choose 1 register to push?
If it's for the stack aligned(8 bytes for ARM) requirement, why not just subtract the sp? Thanks.
==================Update==========================
#KemyLand For your answer, I have another example:
The source code is:
void test1(){
}
void test(int i){
test1();
}
void main(){
test(1);
}
I use the same compile command above, then get the following assembly:
main.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <test1>:
void test1(){
}
0: e12fff1e bx lr
00000004 <test>:
void test(int i){
4: e52de004 push {lr} ; (str lr, [sp, #-4]!)
8: e24dd00c sub sp, sp, #12
c: e58d0004 str r0, [sp, #4]
test1();
10: ebfffffe bl 0 <test1>
}
14: e28dd00c add sp, sp, #12
18: e49de004 pop {lr} ; (ldr lr, [sp], #4)
1c: e12fff1e bx lr
00000020 <main>:
void main(){
20: e92d4008 push {r3, lr}
test(1);
24: e3a00001 mov r0, #1
28: ebfffffe bl 4 <test>
}
2c: e8bd4008 pop {r3, lr}
30: e12fff1e bx lr
If push {r3, lr} in first example is for use less instructions, why in this function test(), the compiler didn't just using one instruction?
push {r0, lr}
It use 3 instructions instead of 1.
push {lr}
sub sp, sp #12
str r0, [sp, #4]
By the way, why it sub sp with 12, the stack is 8-bytes aligned, it can just sub it with 4 right?
According to the Standard ARM Embedded ABI, r0 through r3 are used to pass the arguments to a function, and the return value thereof, meanwhile lr (a.k.a: r14) is the link register, whose purpose is to hold the return address for a function.
It's obvious that lr must be saved, as otherwise main() would have no way to return to its caller.
It's now notorious to mention that every single ARM instruction takes 32 bits, and as you mentioned, ARM has a call stack alignment requirement of 8 bytes. And, as a bonus, we're using the Embedded ARM ABI, so code size shall be optimized. Thus, it's more efficient to have a single 32-bit instruction both saving lr and aligning the stack by pushing an unused register (r3 is not needed, because test() does not take arguments nor it returns anything), and then pop in a single 32-bit instruction, rather than adding more instructions (and thus, wasting precious memory!) to manipulate the stack pointer.
After all, it's pretty logical to conclude this is just an optimization from GCC.
I was writing a loadable kernel module and when trying to compile it, the linker fails with the following message:
*** Warning: "__floatsidf" [/testing/Something.ko] undefined!
I am NOT using floating point variables, so that's not it. What is the cause of such errors?
Note: I'm using Linux ubuntu kernel v. 3.5.0-23-generic
__floatsidf is a runtime routine to convert a 32-bit signed integer into a double precision floating point number. Somewhere in your project is a line that looks like:
double foo = bar;
Or something similar, where bar is a 32-bit integer. It could also be that you're calling one of the libm functions (or any other function, really) that expects a double with an integer parameter:
foo = pow(bar, baz);
where either bar or baz (or both) is an integer.
Without showing some code, there's not much more we can do to help.
To narrow it down, check the object files your compiler generates (before you link them) and look in the disassembly for a reference to that symbol - that should tell you what function it's happening in.
Here's an example of what I mean. First up - source code:
#include <math.h>
int function(int x, int y)
{
return pow(x, y);
}
Pretty straightforward, right? Now, I'm going to compile it for ARM and disassemble:
$ clang -arch arm -O2 -c -o example.o example.c
$ otool -tV example.o
example.o:
(__TEXT,__text) section
_function:
00000000 e92d40f0 push {r4, r5, r6, r7, lr}
00000004 e28d700c add r7, sp, #12
00000008 e1a04001 mov r4, r1
0000000c ebfffffb bl ___floatsidf
00000010 e1a05000 mov r5, r0
00000014 e1a00004 mov r0, r4
00000018 e1a06001 mov r6, r1
0000001c ebfffff7 bl ___floatsidf
00000020 e1a02000 mov r2, r0
00000024 e1a03001 mov r3, r1
00000028 e1a00005 mov r0, r5
0000002c e1a01006 mov r1, r6
00000030 ebfffff2 bl _pow
00000034 ebfffff1 bl ___fixdfsi
00000038 e8bd40f0 pop {r4, r5, r6, r7, lr}
0000003c e12fff1e bx lr
Look at that - two calls to __floatsidf and one to __fixdfsi, matching the two conversions of x and y to double and then the conversion of the return type back to int.
You are using floating point somewhere - your module includes a conversion from int to double. It might be as simple as calling a function that takes a double parameter and passing an int.
You could try searching your code for "double".
You could try compiling your module to assembly code and looking at that to find which function uses __floatsidf.
Remember that the use of double might be in a header file, possibly one written by someone else.