ARM link register (LR) usage between multiple function calls

I understand that the link register is used to store the return address, so that execution can resume in the caller once a subroutine completes.
This avoids the need to store the return address on the stack: the return address can be copied directly from LR to PC, which saves the time of a memory access.
But how does this work with nested function calls, say F1() calls F2(), F2() calls F3(), and F3() calls F4()? In this scenario we still need to store the previous LR value on the stack and read it back later.
So is the LR mainly significant for leaf functions?

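In AArch64, x30 is LR and x29 is FP. SP is a separate register rather than a general-purpose x31: register number 31 encodes either SP or the zero register, depending on the instruction.
When a function needs to call other (non-inlined) functions, it will typically save both the caller's frame pointer and its own return address (the link register) on the stack:
stp x29, x30, [sp, #-16]! // store the caller's FP and our LR on the stack
mov x29, sp               // point FP at this function's frame record (for unwinding and debugging)
...
bl foo                    // overwrites x30 with the address of the next instruction
ldp x29, x30, [sp], #16   // restore the old FP and LR
ret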
As you suspected, a leaf function saves the trouble of storing the frame pointer and LR.
(With Clang, the code can also be compiled with -fomit-frame-pointer, which saves one or two instructions, but a non-leaf function still has to store and restore the link register.)
str x30, [sp, #-16]! // 8-byte Folded Spill
bl foo
ldr x30, [sp], #16 // 8-byte Folded Reload
ret
The leaf function can simply return:
ret
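The F1() to F4() chain from the question can be modelled in plain C to count how often LR actually has to go to memory. This is only a sketch: the Machine struct, the made-up return addresses and the function names are illustrative assumptions, not anything a compiler emits.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_WORDS 16

typedef struct {
    unsigned long lr;                 /* models x30 */
    unsigned long stack[STACK_WORDS]; /* models stack memory */
    size_t depth;                     /* words currently on the model stack */
    unsigned lr_spills;               /* how many times lr was stored to memory */
} Machine;

/* Models `bl`: the return address lands in lr, no memory access involved. */
static void bl(Machine *m, unsigned long return_addr)
{
    m->lr = return_addr;
}

/* Models one function in a call chain; `level` counts from 1 and the
 * function at level `last` is the leaf. Returns the address it returns to. */
static unsigned long run_function(Machine *m, unsigned level, unsigned last)
{
    if (level >= last) {
        return m->lr;                       /* leaf: plain `ret`, lr untouched */
    }
    assert(m->depth < STACK_WORDS);
    m->stack[m->depth++] = m->lr;           /* prologue: spill lr */
    m->lr_spills++;

    unsigned long resume_here = 0x1000ul * level; /* made-up return address */
    bl(m, resume_here);                     /* overwrites this function's lr */
    unsigned long returned_to = run_function(m, level + 1, last);
    assert(returned_to == resume_here);     /* the callee came back to us */

    m->lr = m->stack[--m->depth];           /* epilogue: reload lr */
    return m->lr;                           /* `ret` */
}

/* Number of lr spills for a chain F1 -> F2 -> ... -> F<chain_length>. */
unsigned lr_spills_for_chain(unsigned chain_length)
{
    Machine m = {0};
    bl(&m, 0xCA11E4ul);                     /* made-up address in F1's caller */
    unsigned long returned_to = run_function(&m, 1, chain_length);
    assert(returned_to == 0xCA11E4ul);
    return m.lr_spills;
}
```

In this model, lr_spills_for_chain(4) counts three spills (F1, F2 and F3), while the leaf F4 never touches memory for its return address. On a machine that always pushes the return address, all four calls would write to the stack.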

Related

ARM Link register - non-leaf subroutine

I am wondering where the link register is used in an ARM CPU. As I understand it, it stores the return address of a function. But does every return address go into this register after a function call, or is it only relevant to leaf subroutines? How is this handled in functions that have to use the stack (for storing data or additional return addresses): is LR still used there in any way?
BL instruction
Operation
if ConditionPassed(cond) then
    LR = address of the instruction after the branch instruction
    PC = PC + (SignExtend(signed_immed_24) << 2)
Usage
The BL instruction is used to perform a subroutine call. The return
from subroutine is achieved by copying the LR to the PC. Typically,
this is done by one of the following methods:
- Executing a BX R14 instruction.
- Executing a MOV PC,R14 instruction.
Newer ARMs also let you return by popping the saved return address directly into the pc (pop {..., pc}), among other methods.
Seems quite clear to me what the usage of LR is.
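As a rough illustration of the quoted BL pseudocode, the two register updates can be written out in C. This sketch assumes classic 32-bit ARM state, where reading PC yields the address of the current instruction plus 8; the BranchResult struct and the arm_bl name are made up for the example.

```c
#include <stdint.h>

typedef struct {
    uint32_t lr;  /* return address left in the link register */
    uint32_t pc;  /* branch target */
} BranchResult;

/* insn_addr: address of the BL instruction itself.
 * imm24:     the signed_immed_24 field (low 24 bits of the encoding). */
BranchResult arm_bl(uint32_t insn_addr, uint32_t imm24)
{
    /* Sign-extend the 24-bit field to 32 bits. */
    int32_t offset = (int32_t)(imm24 & 0x00FFFFFFu);
    if (offset & 0x00800000) {
        offset -= 0x01000000;
    }

    BranchResult r;
    r.lr = insn_addr + 4u;  /* address of the instruction after the BL */
    /* In ARM state PC reads as insn_addr + 8; unsigned arithmetic wraps,
     * so negative offsets work without undefined behaviour. */
    r.pc = insn_addr + 8u + ((uint32_t)offset << 2);
    return r;
}
```

For a BL at 0x1000 with an immediate of 2, this gives lr = 0x1004 and pc = 0x1010. Nothing is written to memory, which is the point of the link register.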
You can easily try it yourself as well:
unsigned int more_fun ( unsigned int );
unsigned int fun0 ( unsigned int x )
{
    return(x+1);
}
unsigned int fun1 ( unsigned int x )
{
    return(more_fun(x)+1);
}
unsigned int fun2 ( unsigned int x )
{
    return(more_fun(x));
}
unsigned int fun3 ( unsigned int x )
{
    return(3);
}
00000000 <fun0>:
0: e2800001 add r0, r0, #1
4: e12fff1e bx lr
00000008 <fun1>:
8: e92d4010 push {r4, lr}
c: ebfffffe bl 0 <more_fun>
10: e8bd4010 pop {r4, lr}
14: e2800001 add r0, r0, #1
18: e12fff1e bx lr
0000001c <fun2>:
1c: e92d4010 push {r4, lr}
20: ebfffffe bl 0 <more_fun>
24: e8bd4010 pop {r4, lr}
28: e12fff1e bx lr
0000002c <fun3>:
2c: e3a00003 mov r0, #3
30: e12fff1e bx lr
fun1 and fun2 push lr because, as documented, bl modifies the link register. To return from a non-leaf function you need to preserve the link register value you were called with, the return address, across that call, so you push it on the stack. The convention for this compiler wants the stack 64-bit aligned, so r4 is added simply to maintain that alignment; r4 is otherwise not involved here.
You can see that the leaf functions do not use the stack because they have no reason to: the link register is not modified during the function, and these functions are too simple to need the stack for other reasons. A leaf function that does need the stack still has no need to save lr there, although if the compiler needs one more register to keep the stack aligned it is free to pick r14 as well as any of the other registers.
Now if we force something on the stack (non-leaf)
unsigned int new_fun ( unsigned int, unsigned int );
unsigned int fun4 ( unsigned int x, unsigned int y)
{
    return(new_fun(x,y)+y);
}
00000034 <fun4>:
34: e92d4010 push {r4, lr}
38: e1a04001 mov r4, r1
3c: ebfffffe bl 0 <new_fun>
40: e0800004 add r0, r0, r4
44: e8bd4010 pop {r4, lr}
48: e12fff1e bx lr
lr has to be saved on the stack because a bl is used to call the next function. In this case, per the convention, the compiler chose r4 to hold the y variable (which arrives in r1) so that it can be used after the nested call returns. Since only two registers need to be preserved, which also satisfies the stack alignment rule, r4 and lr are saved, and here both are actually used (r4 is not there just to align the stack).
Not sure what you mean by additional return addresses. Perhaps you are thinking that, as each function makes a call, there is a return address on the stack for it, and that is true, but you really only need to look at one function at a time; that is the beauty of calling conventions. On this architecture function calls are ideally made with bl (as pointed out in another answer they don't have to be, but it would be silly not to), which means lr is modified by every call to a subroutine. The calling function thereby loses the return address to its own caller, so it needs to preserve it locally somehow. As we saw with fun4, a register that survives the call would work too, so technically the compiler could, for example, do this:
fun2:
push {r4, r5}
mov r5,lr
bl more_fun
mov r1,r5
pop {r4, r5}
bx r1
and not actually save lr on the stack. On ARMs newer than the one I am building for, you will see this:
00000008 <fun1>:
8: e92d4010 push {r4, lr}
c: ebfffffe bl 0 <more_fun>
10: e2800001 add r0, r0, #1
14: e8bd8010 pop {r4, pc}
00000018 <fun2>:
18: eafffffe b 0 <more_fun>
The contents of lr are on the stack (lr itself is of course a register, so it can't be "on the stack"), but after armv4t you can pop to the pc and change modes between arm and thumb, where before only bx could be used for thumb interworking.
Also note the tail-call optimization for fun2: it simply branches to more_fun, which returns directly to fun2's caller, so fun2 does not even push the return address on the stack.
Seems pretty obvious if you look at the arm docs how lr is used. And then think about how a compiler would implement a standard function, and then what optimizations they might do. And of course you can then just try it and see what certain compilers actually generate.
The return address is a parameter, which is hidden from C and other high level languages, but visible in assembly & machine code.
On ARM, called functions (callees) rely on the return address being passed in the lr register; this by specification of the calling convention — so callers using the standard ARM calling convention must put the return address there, in lr, to satisfy this parameter passing requirement and expectation.
Standard calling conventions are designed so that a caller can properly invoke a callee — knowing nothing more about the callee than its function signature.  Thus, a caller is abstracted from knowing the implementation details of the callee, beyond the parameters and return value(s).  This means that a function can evolve (for bug fixes, or other) without having to revisit (or recompile) callers, as long as the signature is unmodified.
Whether a function is a leaf function (or not) is an aspect of its internal implementation and not visible in function signatures.  So, a caller does not know (and should not have to know) if the callee is leaf or not, or if a function changes from leaf to non-leaf during some bug fix versioning.
A function's use of the stack is also an internal implementation detail not captured in function signature, so will not affect how functions are called, and where return address value is passed and expected.
So, there really is only one way to pass the return address.
Callees that need to use the lr (because they are calling other functions or maybe just want to use that register) will need to preserve the return address value provided to them as a parameter by their callers, for use later to return to them (assuming they want to return).
Function implementations that use the lr (and so preserve the value therein for their later use) don't have to restore that preserved return address value back into the lr register (the calling convention does not require the return address to be passed back to callers) so sometimes the lr register is restored then used, but other times on ARM, the return address is popped directly off the stack into the program counter, bypassing lr, i.e. without restoring the lr register.
You could create your own calling convention that passed the return address in a different location, i.e. in a different register, or, by pushing it onto the stack!
Some languages do diverge from the standard calling conventions (in minor ways) and then still support the standard calling conventions for their interoperability with C-style functions.
The hardware is designed to support collecting the return address into the lr register while making the call that transfers control to the callee, all in one instruction, so it would be silly to avoid that.  The hardware also offers no other particularly efficient way to capture the return address, and there is no real reason for it to.

Understanding asm instructions in basic C program in GDB

In my attempt to understand memory layout in a process and learn assembly I've written a basic C program on Pi3 (ARM) and disassembled it with GDB but as I'm new to this I need help understanding it.
Essentially I'm trying to understand and spot in assembly where variables are stored (BSS, DATA, TEXT memory segments) and also understand and follow the stack frames.
I've only displayed the main function - there were other segments on the debug screen so let me know if they would help too!
I understand what the individual instructions are doing for the most part, but what I'd like to know is:
1. The first 3 lines are concerned with the stack pointer. Is this setting up the stack frame for the main function?
2. At 0x10414 it's using the value for age. Is this where the local variable is placed on the stack as part of the frame for the main function?
3. At 0x1041c, is that the return value? I assumed that was pushed onto the stack too as part of the frame.
4. Where is the stack flushed at the end of the function?
int main () {
    int age = 30;
    int salary;
    return 0;
}
0x10408 <main>     push {r11} ; (str r11, [sp, #-4]!)
0x1040c <main+4>   add r11, sp, #0
0x10410 <main+8>   sub sp, sp, #12
0x10414 <main+12>  mov r3, #30
0x10418 <main+16>  str r3, [r11, #-8]
0x1041c <main+20>  mov r3, #0
0x10420 <main+24>  mov r0, r3
0x10424 <main+28>  add sp, r11, #0
0x10428 <main+32>  pop {r11} ; (ldr r11, [sp], #4)
0x1042c <main+36>  bx lr
1. Yes, you are right. Register r11 is used as the frame pointer. The frame pointer serves as a reference to where your local variables are stored on the stack. Note that the caller's frame pointer must be preserved, so it is saved here and restored later.
2. Almost. It happens one line later: the str at 0x10418 stores the value on the stack at [r11, #-8]. Remember that r11 is the frame pointer; everything is addressed relative to it.
3. It is not pushed on the stack. It is simply returned in register r0. On a lot of platforms a general-purpose register is used for this, so the stack does not need to be involved for simple return values (like your integer). I guess this is for performance reasons, as registers are faster than memory accesses.
4. I don't know what you mean by flushed. The function sets things up the way it likes and afterwards reverts those changes. The stack memory might still contain the values the function used; only the pointers are reset to their original locations. At the beginning of the function the original frame pointer (r11) is pushed on the stack, and the value of the stack pointer then becomes the new frame pointer. At the end of the function the stack pointer is returned to where it was (by overwriting it with r11), and finally r11 itself is restored by popping it off the stack.
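To make the set-up and tear-down concrete, the ten instructions of main can be replayed against a small software model. This is a sketch only: the Cpu struct, the word-addressed mem array and the starting values used below are assumptions for illustration, not what GDB shows.

```c
#include <stdint.h>

#define MEM_WORDS 64

typedef struct {
    uint32_t r0, r3, r11, sp;
    uint32_t mem[MEM_WORDS];   /* model of stack memory, one entry per word */
} Cpu;

static void store(Cpu *c, uint32_t addr, uint32_t value) { c->mem[addr / 4] = value; }
static uint32_t load(const Cpu *c, uint32_t addr) { return c->mem[addr / 4]; }

/* Replays the disassembly of main() and returns the address where `age`
 * was stored. The caller must set sp to a multiple of 4 within the model. */
uint32_t run_main(Cpu *c)
{
    c->sp -= 4; store(c, c->sp, c->r11);  /* push {r11}          */
    c->r11 = c->sp;                       /* add r11, sp, #0     */
    c->sp -= 12;                          /* sub sp, sp, #12     */
    c->r3 = 30;                           /* mov r3, #30         */
    store(c, c->r11 - 8, c->r3);          /* str r3, [r11, #-8]  */
    uint32_t age_addr = c->r11 - 8;
    c->r3 = 0;                            /* mov r3, #0          */
    c->r0 = c->r3;                        /* mov r0, r3          */
    c->sp = c->r11;                       /* add sp, r11, #0     */
    c->r11 = load(c, c->sp); c->sp += 4;  /* pop {r11}           */
    return age_addr;                      /* bx lr is left to the caller */
}
```

Starting from sp = 128 and an arbitrary r11, the model ends with sp and r11 back at their original values and r0 = 0, while the word at address 116 still holds 30. The frame is released by moving the pointers, not by clearing memory.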

Process sections: does a declaration add also something to .text? If yes, what does it add?

I have a C code like this one, that will be possibly compiled in an ELF file for ARM:
int a;
int b=1;
int foo(int x) {
    int c=2;
    static float d=1.5;
    // .......
}
I know that all the executable code goes into the .text section, while .data , .bss and .rodata will contain the various variables/constants.
My question is: does a line like int b=1; here add also something to the .text section, or does it only tell the compiler to place a new variable initialized to 1 in .data (then probably mapped in RAM memory when deployed on the final hardware)?
Moreover, trying to decompile a similar code, I noticed that a line such as int c=2;, inside the function foo(), was adding something to the stack, but also some lines of .text where the value '2' was actually memorized there.
So, in general, does a declaration always imply also something added to .text at an assembly level? If yes, does it depend on the context (i.e. if the variable is inside a function, if it is a local global variable, ...) and what is actually added?
Thanks a lot in advance.
does a line like int b=1; here add also something to the .text section, or does it only tell the compiler to place a new variable initialized to 1 in .data (then probably mapped in RAM memory when deployed on the final hardware)?
You understand that this is likely to be implementation specific, but the likelihood is that you will just get initialised data in the data section. Were it a constant, it might instead go into the text section.
Moreover, trying to decompile a similar code, I noticed that a line such as int c=2;, inside the function foo(), was adding something to the stack, but also some lines of .text where the value '2' was actually memorized there.
Automatic variables that are initialised, have to be initialised each time the function's scope is entered. The space for c is reserved on the stack (or in a register, depending on the ABI) but the program has to remember the constant from which it is initialised and this is best placed somewhere in the text segment, either as a constant value or as a "move immediate" instruction.
So, in general, does a declaration always imply also something added to .text at an assembly level?
No. If a static variable is initialised to zero or null or not initialised at all, it is often just enough to reserve space in bss. If a static non constant variable is initialised to a non zero value, it will just be put in the data segment.
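The difference between an initialised automatic variable and an initialised static one is visible from plain C behaviour, without looking at sections at all. A minimal sketch, with function names made up for the example:

```c
/* Automatic variable: the "store 2 into c" code in .text runs on every call. */
int next_from_automatic(void)
{
    int c = 2;
    return c++;   /* always returns 2; the incremented copy dies with the frame */
}

/* Static local: initialised once before the program runs (typically from
 * .data), so no per-call initialisation code is needed and the value persists. */
int next_from_static(void)
{
    static int d = 1;
    return d++;   /* returns 1, then 2, then 3, ... */
}
```

Calling next_from_automatic() twice yields 2 both times, while next_from_static() yields 1 and then 2. That is why the first needs initialisation code in the function body and the second only needs an initial value stored with the data.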
As @goodvibration correctly stated, only global or static variables go into these segments. This is because their lifetime is the whole execution time of the program.
Local variables have a different lifetime. They exist only during the execution of the block (e.g. function) they are defined in. When a function is called, all parameters that do not fit into registers are pushed onto the stack and the return address is written to the link register.* The function may then save the link register and other registers on the stack, and it reserves some stack space for its local variables (this is the code you have observed). At the end of the function, the saved registers are popped and the stack pointer is readjusted. In this way the storage of local variables is reclaimed automatically.
*: Please note that this is true for (some calling conventions of) ARM only. It is different for Intel processors, for example.
this is one of those just try it things.
int a;
int b=1;
int foo(int x) {
    int c=2;
    static float d=1.5;
    int e;
    e=x+2;
    return(e);
}
first thing without optimization.
arm-none-eabi-gcc -c so.c -o so.o
arm-none-eabi-objdump -D so.o
arm-none-eabi-ld -Ttext=0x1000 -Tdata=0x2000 so.o -o so.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000001000
arm-none-eabi-objdump -D so.elf > so.list
don't worry about the warning; I needed to link to see that everything found a home
Disassembly of section .text:
00001000 <foo>:
1000: e52db004 push {r11} ; (str r11, [sp, #-4]!)
1004: e28db000 add r11, sp, #0
1008: e24dd014 sub sp, sp, #20
100c: e50b0010 str r0, [r11, #-16]
1010: e3a03002 mov r3, #2
1014: e50b3008 str r3, [r11, #-8]
1018: e51b3010 ldr r3, [r11, #-16]
101c: e2833002 add r3, r3, #2
1020: e50b300c str r3, [r11, #-12]
1024: e51b300c ldr r3, [r11, #-12]
1028: e1a00003 mov r0, r3
102c: e28bd000 add sp, r11, #0
1030: e49db004 pop {r11} ; (ldr r11, [sp], #4)
1034: e12fff1e bx lr
Disassembly of section .data:
00002000 <b>:
2000: 00000001 andeq r0, r0, r1
00002004 <d.4102>:
2004: 3fc00000 svccc 0x00c00000
Disassembly of section .bss:
00002008 <a>:
2008: 00000000 andeq r0, r0, r0
as a disassembly it tries to disassemble data so ignore that (the andeq next to 0x2008 for example).
The a variable is global and uninitialized so it lands in .bss (typically...a compiler can choose to do whatever it wants so long as it implements the language correctly, doesnt have to have something called .bss for example, but gnu and many others do).
b is global and initialized so it lands in .data, had it been declared as const it might land in .rodata depending on the compiler and what it offers.
c is a local non-static variable that is initialized, because C offers recursion this needs to be on the stack (or managed with registers or other volatile resources), and initialized each run. We needed to compile without optimization to see this
1010: e3a03002 mov r3, #2
1014: e50b3008 str r3, [r11, #-8]
d is what I call a local global, it is a static local so it lives outside the function, not on the stack, alongside the globals but with local access only.
I added e to your example, this is a local not initialized, but then used. Had I not used it and not optimized there probably would have been space allocated for it but no initialization.
save x on the stack (per this calling convention x enters in r0)
100c: e50b0010 str r0, [r11, #-16]
then load x from the stack, add two, save as e on the stack. read e from
the stack and place in the return location for this calling convention which is r0.
1018: e51b3010 ldr r3, [r11, #-16]
101c: e2833002 add r3, r3, #2
1020: e50b300c str r3, [r11, #-12]
1024: e51b300c ldr r3, [r11, #-12]
1028: e1a00003 mov r0, r3
For all architectures, unoptimized this is somewhat typical, always read variables from the stack and put them back quickly. Other architectures have different calling conventions with respect to where the incoming parameters and outgoing return value live.
If I optimize (-O2 on the gcc line)
Disassembly of section .text:
00001000 <foo>:
1000: e2800002 add r0, r0, #2
1004: e12fff1e bx lr
Disassembly of section .data:
00002000 <b>:
2000: 00000001 andeq r0, r0, r1
Disassembly of section .bss:
00002004 <a>:
2004: 00000000 andeq r0, r0, r0
b is a global, so at the object level a global space has to be reserved for it, it is .data, optimization doesnt change that.
a is also global and still .bss, because at the object level it was declared such so allocated in case another object needs it. The linker doesnt remove these.
Now c and d are dead code; they don't do anything and need no storage, so c is no longer allocated space on the stack, nor is d allocated any .data space.
We have plenty of registers for this architecture, this calling convention, and this code, so e does not need any memory allocated on the stack: x comes in in r0, the math can be done in r0, and the result is returned in r0.
I know I didnt tell the linker where to put .bss; by telling it where to put .data, it put .bss in the same space without complaint. I could have used -Tbss=0x3000, for example, to give it its own space, or just written a linker script. Linker scripts can play havoc with the typical results, so beware.
Typical, but there might be a compiler with exceptions:
non-constant globals go in .data or .bss depending on whether they are initialized during the declaration or not.
If const then perhaps .rodata or .text depending (or .data or .bss would technically work)
non-static locals go in general purpose registers or on the stack as needed (if not completely optimized away).
static locals (if not optimized away) live with globals but are not globally accessible they just get allocated space in .data or .bss like the globals do.
parameters are governed completely by the calling convention used by that compiler for that target. Just because arm or mips or other may have written down a convention doesnt mean a compiler has to use it, only if they claim to support some convention or standard should they then attempt to comply. For a compiler to be useful it needs a convention and stick to it whatever it is, so that both caller and callee of a function know where to get parameters and to return a value. Architectures with enough registers will often have a convention where some few number of registers are used for the first so many parameters (not necessarily one to one) and then the stack is used for all other parameters. likewise a register may be used if possible for a return value. Some architectures due to lack of gprs or other, use the stack in both directions. or the stack in one and a register in the other. You are welcome to seek out the conventions and try to read them, but at the end of the day the compiler you are using, if not broken follows a convention and by setting up experiments like the one above you can see the convention in action.
Plus in this case optimizations.
void more_fun ( unsigned long long );
unsigned fun ( unsigned int x, unsigned long long y )
{
    more_fun(y);
    return(x+1);
}
If I told you that arm conventions typically use r0-r3 for the first few parameters, you might assume that x is in r0, that r1 and r2 are used for y, and that we could have another small parameter before needing the stack. Well, perhaps on older arm, but now it wants the 64 bit variable to use an even-numbered register and then the following odd one.
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e1a04000 mov r4, r0
8: e1a01003 mov r1, r3
c: e1a00002 mov r0, r2
10: ebfffffe bl 0 <more_fun>
14: e2840001 add r0, r4, #1
18: e8bd4010 pop {r4, lr}
1c: e12fff1e bx lr
so r0 contains x, r2/r3 contain y and r1 was passed over.
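The "even then odd" behaviour seen here can be written as a small rule. The sketch below is loosely modelled on the AAPCS register-allocation rule for 32-bit ARM and is deliberately simplified: it only handles 4-byte and 8-byte integer arguments and ignores arguments split between registers and the stack. The function name is made up.

```c
/* next_free:  number of the lowest argument register (0-3) not yet used.
 * size_bytes: size of the next argument, 4 or 8 in this simplified model.
 * Returns the register the argument starts in, or -1 if it goes on the stack. */
int first_register_for(unsigned next_free, unsigned size_bytes)
{
    if (size_bytes == 8) {
        next_free = (next_free + 1u) & ~1u;   /* round up to an even register */
    }
    unsigned regs_needed = (size_bytes + 3u) / 4u;
    if (next_free + regs_needed > 4u) {
        return -1;                            /* r0-r3 exhausted */
    }
    return (int)next_free;
}
```

For fun(x, y), x takes r0, which leaves r1 as the next free register. first_register_for(1, 8) then returns 2, so y lands in r2/r3 and r1 is skipped, matching the disassembly.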
the test was crafted to not have y as dead code and to pass it to another function we can see where y was stored on the way into fun and way out to more_fun. r2/r3 on the way in, needs to be in r0/r1 to call more fun.
we need to preserve x for the return from fun. one might expect that x would land on the stack, which unoptimized it would, but instead save a register that the convention has stated will be preserved by functions (r4) and use r4 throughout the function or at least in this function to store x. A performance optimization, if x needed to be touched more than once memory cycles going to the stack cost more than register accesses.
then it computes the return and cleans up the stack, registers.
IMO it is important to see this: the calling convention comes into play for some variables, and others can vary based on optimization. With no optimization they are where most folks are going to state offhand: .bss, .data (.text/.rodata). With optimization it depends on whether the variable survives at all.

Stack Pointer reading incorrect value from register

Why is Stack-pointer register not reading correct value from another register? When I move a value from register (r0) to stack pointer (r13), the SP reads incorrect value.
This is what I mean:
MOV R0, #10
MOV R13, R0
In this case, 0xA should move to R13, but instead it gets 8.
Similarly,
MOV R0, #9
MOV R13, R0
In this case R13 stores 8 instead of 9.
Here's a simple program that demonstrates the problem:
void Init()
{
    __asm(
        "LDR R0, =0x3FFFFDA7\n"
        "MOV R13, R0\n"
    );
}
int main(void)
{
    Init();
    return (1);
}
void SystemInit(void)
{
}
Nothing much is going on here, just a simple function call. Inside the function I moved the address to R0, then moved it to R13 (SP), but instead of the actual address, i.e. 0x3FFFFDA7, SP received 0x3FFFFDA4.
The image shows the disassembly.
So what is going on here? Why is the stack pointer register reading incorrect values?
I am using ARM inline Assembly with C. The IDE is KEIL.
Thanks in Advance.
For those who might find this helpful.
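The stack pointer for armv7 must be 4-byte aligned. You can write 0, 4, 8, 12, 16, etc. to it, but not 9, 10, 0xF, etc.; as the examples above show, a value that is not a multiple of 4 ends up with its two low bits cleared (10 and 9 become 8, and 0x3FFFFDA7 becomes 0x3FFFFDA4).
So if you want to move any value to the stack pointer, make sure it is 4-byte aligned.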
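The values reported in the question are consistent with the two low bits of the written value being dropped. A minimal sketch of that arithmetic follows; the helper names are made up, and align_down assumes the alignment is a power of two.

```c
#include <stdint.h>

/* Models what the question observed: bits [1:0] of the value are discarded. */
uint32_t sp_value_after_write(uint32_t written)
{
    return written & ~UINT32_C(3);
}

/* General form: round `value` down to a multiple of `alignment`
 * (alignment must be a power of two). */
uint32_t align_down(uint32_t value, uint32_t alignment)
{
    return value & ~(alignment - 1u);
}
```

sp_value_after_write(0x3FFFFDA7) gives 0x3FFFFDA4, and both 10 and 9 give 8, matching the observed behaviour. When choosing an initial stack address by hand it may be worth using align_down(addr, 8), since the ARM procedure call standard expects 8-byte stack alignment at public function boundaries.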

Passing arguments from asm to C on ARM

I have read a lot of topics on this forum and found a lot of answers on this subject. I managed to pass 5 arguments to a C function from my assembly code, using the instructions below:
mov r0, #0
mov r1, #1
mov r2, #2
mov r3, #3
mov r4, #4
STR r4, [sp, #-4]!
BL displayRegistersValue
But today I'm trying to pass all the registers to a C function to save them in a C structure. I tried this instruction:
STMDB sp!, {registers that i want to save}
My C function:
void displayRegistersValue(int registers[number_of_registers])
{
    char printable = registers[0] + (int)'0'; // Convert to a printable character
    print_uart0(&printable);
}
But the output is not right. So how can I access the registers in C code?
Pretty sure the ARM standard only uses R0-R3 to pass arguments in registers, so 4 max. If you need more values, then push them onto the stack and access them that way - like the compiler does. Or make a struct and pass its address.
OK, double-checked and I was right; here is a link to the ARM calling conventions - down the page a bit.
To do what you want, pass the address of some memory location (an array) into your assembly routine. Once you have that address, probably within r0, you can stmdb! into that location all your register values and that memory will be viewable at the C level.
Beware, this probably isn't going to do what you think it will. Those values are allowed to change quite a bit as per the calling convention link above. If this is for debugging, you are better off using a debugger and watching the registers that way.
OK, you are still not understanding. Here:
{
    int registerValues[14];
    myAsmRoutine(registerValues);
    print_uart0(& registerValues);
}
myAsmRoutine:
    stmia r0!, {r1-r14}
    bx lr
I skipped R0 and PC, but you get the idea. Also, you will need to do something a bit more complex to change the values into a printable format - sprintf or itoa or something like that.
displayRegistersValue(int registers[number_of_registers])
this is an array not a structure and is passed as a pointer to something not as a long list of items. same goes for structures btw.
It is usually easiest to construct a C function that does what you want in asm then see what the compiler produces, then go from there (use the ABI document to confirm, etc).
#define NUMREGS 13
void displayRegistersValue(unsigned int registers[NUMREGS]);
void outer ( void )
{
    unsigned int regs[NUMREGS];
    displayRegistersValue(regs);
}
> arm-none-linux-gnueabi-gcc -O2 -c fun.c -o fun.o
> arm-none-linux-gnueabi-objdump -D fun.o
fun.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <outer>:
0: e52de004 push {lr} ; (str lr, [sp, #-4]!)
4: e24dd03c sub sp, sp, #60 ; 0x3c
8: e28d0004 add r0, sp, #4
c: ebfffffe bl 0 <displayRegistersValue>
10: e28dd03c add sp, sp, #60 ; 0x3c
14: e49df004 pop {pc} ; (ldr pc, [sp], #4)
You will need to do something similar: make room on the stack by subtracting from the stack pointer, save the lr so you don't trash it with the branch link, copy your registers to that memory (the stack), point r0 to the beginning of the memory/array you want to pass, then call the function (r0 being the first and only parameter you are passing to the function).
push {lr}
stmdb sp!,{r0-r12}         @ makes room and stores r0-r12, with r0 at the lowest address
mov r0,sp                  @ r0 = address of the first saved register
bl displayRegistersValue
add sp,sp,#52              @ 13 registers * 4 bytes
pop {lr}
An array is passed as a pointer in a single register. If you want 5 registers then you need to have 5 parameters (int i1, int i2 etc.).
To quote from the ARM APCS document:
"The first four registers r0-r3 (a1-a4) are used to pass argument values into a subroutine and to return a result value from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls)."
So if you want to pass more than 4 values to a C function, you need to pass the rest of the values on the stack. A better idea would be to put the register values in a memory region that has been statically allocated and pass the address of the memory (pointer) to the C function. The pointer can be de-referenced by the function to get to the register values.
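On the C side, the function then receives a single pointer and indexes into the saved registers. The sketch below formats one saved register as text, which also avoids handing the address of a lone char to a routine that (assuming print_uart0 takes a NUL-terminated string, which the question does not confirm) expects a terminator. format_register is a hypothetical helper; NUMREGS matches the 13 registers r0-r12 used above.

```c
#include <stddef.h>
#include <stdio.h>

#define NUMREGS 13

/* Writes "r<index>=0x<8 hex digits>" into out as a NUL-terminated string.
 * Returns 0 on success, -1 on a bad argument or a too-small buffer. */
int format_register(const unsigned int regs[NUMREGS], size_t index,
                    char *out, size_t out_size)
{
    if (regs == NULL || out == NULL || index >= NUMREGS) {
        return -1;
    }
    /* regs is just a pointer here: the array decays when it is passed. */
    int n = snprintf(out, out_size, "r%zu=0x%08x", index, regs[index]);
    return (n > 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Inside displayRegistersValue one could then loop over the indices, call format_register(registers, i, buf, sizeof buf) and pass buf to the UART routine. Whether snprintf is available depends on the C library linked into the bare-metal image, so treat it as an assumption.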

Resources