ARM: link register and frame pointer - c

I'm trying to understand how the link register and the frame pointer work in ARM. I've been to a couple of sites, and I wanted to confirm my understanding.
Suppose I had the following code:
int foo(void)
{
// ..
bar();
// (A)
// ..
}
int bar(void)
{
// (B)
int b1;
// ..
// (C)
baz();
// (D)
}
int baz(void)
{
// (E)
int a;
int b;
// (F)
}
and I call foo(). Would the link register contain the address for the code at point (A) and the frame pointer contain the address at the code at point (B)? And the stack pointer would could be any where inside bar(), after all the locals have been declared?

Some register calling conventions are dependent on the ABI (Application Binary Interface). The FP is required in the APCS standard and not in the newer AAPCS (2003). For the AAPCS (GCC 5.0+) the FP does not have to be used but certainly can be; debug info is annotated with stack and frame pointer use for stack tracing and unwinding code with the AAPCS. If a function is static, a compiler really doesn't have to adhere to any conventions.
Generally all ARM registers are general purpose. The lr (link register, also R14) and pc (program counter also R15) are special and enshrine in the instruction set. You are correct that the lr would point to A. The pc and lr are related. One is "where you are" and the other is "where you were". They are the code aspect of a function.
Typically, we have the sp (stack pointer, R13) and the fp (frame pointer, R11). These two are also related. This
Microsoft layout does a good job describing things. The stack is used to store temporary data or locals in your function. Any variables in foo() and bar(), are stored here, on the stack or in available registers. The fp keeps track of the variables from function to function. It is a frame or picture window on the stack for that function. The ABI defines a layout of this frame. Typically the lr and other registers are saved here behind the scenes by the compiler as well as the previous value of fp. This makes a linked list of stack frames and if you want you can trace it all the way back to main(). The root is fp, which points to one stack frame (like a struct) with one variable in the struct being the previous fp. You can go along the list until the final fp which is normally NULL.
So the sp is where the stack is and the fp is where the stack was, a lot like the pc and lr. Each old lr (link register) is stored in the old fp (frame pointer). The sp and fp are a data aspect of functions.
Your point B is the active pc and sp. Point A is actually the fp and lr; unless you call yet another function and then the compiler might get ready to setup the fp to point to the data in B.
Following is some ARM assembler that might demonstrate how this all works. This will be different depending on how the compiler optimizes, but it should give an idea,
; Prologue - setup
mov ip, sp ; get a copy of sp.
stmdb sp!, {fp, ip, lr, pc} ; Save the frame on the stack. See Addendum
sub fp, ip, #4 ; Set the new frame pointer.
...
; Maybe other functions called here.
; Older caller return lr stored in stack frame.
bl baz
...
; Epilogue - return
ldm sp, {fp, sp, lr} ; restore stack, frame pointer and old link.
... ; maybe more stuff here.
bx lr ; return.
This is what foo() would look like. If you don't call bar(), then the compiler does a leaf optimization and doesn't need to save the frame; only the bx lr is needed. Most likely this maybe why you are confused by web examples. It is not always the same.
The take-away should be,
pc and lr are related code registers. One is "Where you are", the other is "Where you were".
sp and fp are related local data registers.One is "Where local data is", the other is "Where the last local data is".
The work together along with parameter passing to create function machinery.
It is hard to describe a general case because we want compilers to be as fast as possible, so they use every trick they can.
These concepts are generic to all CPUs and compiled languages, although the details can vary. The use of the link register, frame pointer are part of the function prologue and epilogue, and if you understood everything, you know how a stack overflow works on an ARM.
See also: ARM calling convention.
MSDN ARM stack article
University of Cambridge APCS overview
ARM stack trace blog
Apple ABI link
The basic frame layout is,
fp[-0] saved pc, where we stored this frame.
fp[-1] saved lr, the return address for this function.
fp[-2] previous sp, before this function eats stack.
fp[-3] previous fp, the last stack frame.
many optional registers...
An ABI may use other values, but the above are typical for most setups. The indexes above are for 32 bit values as all ARM registers are 32 bits. If you are byte-centric, multiply by four. The frame is also aligned to at least four bytes.
Addendum: This is not an error in the assembler; it is normal. An explanation is in the ARM generated prologs question.

Disclaimer: I think this is roughly right; please correct as needed.
As indicated elsewhere in this Q&A, be aware that the compiler may not be required to generate (ABI) code that uses frame pointers. Frames on the call stack can often require useless information to be put there.
If the compiler options call for 'no frames' (a pseudo option flag), then the compiler can generate smaller code that keeps call stack data smaller. The calling function is compiled to only store the needed calling info on the stack, and the called function is compiled to only pop the needed calling information from the stack.
This saves execution time and stack space - but it makes tracing backwards in the calling code extremely hard (I gave up trying to...)
Info about the size and shape of the calling information on the stack is only known by the compiler and that info was thrown away after compile time.

Related

Determining return address of function on ARM Cortex-M series

I want to determine the return address of a function in Keil. I opened diassembly section at debugging mode in Keil uvision. What is shown is some assembly code like this:
My intention is to inject a simple binary code to microcontroller via using buffer overflow at microcontroller.see: Buffer overflow
I want to determine the return address of "test" function . Is it a must to know how to read assembly code or are there any trick to find the return address?
I am newbie to assembly.
R14 or in other name LR hold the return address. On the left you can see it in the picture. It is 0x08000287.
When a function is called, R14 will be overwritten with the address following the call ("BL" or "BLX") instruction. If that function doesn't call any other functions, R14 will often be left holding the return address for its duration. Further, if the function tail-calls another function, the tail call may be replaced with a branch ("B" or "BX"), with R14 holding the return address of the original caller. If a function makes a non-tail call to another function, it will be necessary to save R14 "somewhere" (typically the stack, but possibly to another previously-used caller-saved register) at some time before that, and retrieve that value from the stack at some later time, but if optimizations are enabled the location where R14 is saved will generally be unpredictable.
Some compilers may have a mode that would stack things consistently enough to be usable, but code will be very compiler-dependent. The technique most likely to be successful may be to do something like:
extern int getStackAddress(uint8_t **addr); // Always returns zero
void myFunction(...whavever...)
{
uint8_t *returnAddress;
if (getStackAddress(&returnAddress)) return; // Put this first.
}
where the getStackAddress would be a machine-code function that stores R14 to the address in R0, loads R0 with zero, and then branches to R14. There are relatively few code sequences that would be likely to follow that, and if a code examines instructions at the address stored in returnAddress and recognizes one of these code sequences, it would know that the return address for myFunction is stored in a spot appropriate for the sequence in question. For example, if it sees:
test r0,r0
be ...
pop {r0,pc}
It would know that the caller's address is second on the stack. Likewise if it sees:
cmp r0,#0
bne somewhere:
somewhere: ; Compute address based on lower byte of bne
pop {r0,r1,r2,r4,r5,pc}
then it would know that the caller's address is sixth.
There are a few instructons compilers could use to test a register against zero, and some compilers might use be while others use bne, but for the code above compilers would be likely to use the above pattern, and so counting how many bits are set in the pop instruction would reveal the whereabouts of the return address on the stack. One wouldn't know until runtime whether this test would actually work, but in cases where it claims to identify the return address it should actually be correct.
You can find all the answers in the Cortex-M documentation
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0337h/Chdedegj.html

Stacktrace on ARM cortex-M4

When I run into a fault handler on my ARM cortex-M4 (Thumb) I get a snapshot of the CPU register just before the fault occured. With this information I can find the stack pointer where it was. Now, what I want is to backtrace through all functions it passed. The only problem I see here is that I don't have a frame pointer, so I cannot really see where a certain subroutine has saved the LR, ad infinitum.
How would one tackle this problem if the frame pointer is not available in r7?
This blog post discusses this issue with reference to the MIPS architecture - the principles can be readily adapted to ARM architectures.
In short, it describes three possibilities for locating the stack frame for a given SP and PC:
Using compiler-generated debug information (not included in the executable image) to calculate it.
Using compiler-generated stack-unwinding (exception handling) information (included in the executable image) to calculate it.
Scanning the call site to locate the prologue or epilogue code that adjusts the stack pointer, and deducing the stack frame address from that.
Obviously it's very compiler- and compiler-option dependent, and not guaranteed to work in all cases.
R7 is not the frame pointer on the M4, it's R11. R7 is the FP for Cortex-M0+/M1 where only the lower registers are generally available. In anycase, when Cortex-M makes a call to a function using BL and variants, it saves the return address into LR (link register). At function entry, the LR is saved onto the stack. So in theory, to get a call trace, you would "chase" the chain of the LRs.
Unfortunately, the saved location of LR on the stack is not defined by the calling convention, and its location must be deduced from the debug info for that function entry in the DWARF records (in the .elf file). I do not know if there is an utility that would extract the LR locations from an ELF file, but it should not be too difficult.
Richard at ImageCraft is right.
More information can be found here
This works fine with C code. I had a harder applying it to C++ but it's not impossible.

ARM GCC generated functions prolog

I mentioned that ARM toolchains could generate different function prologs. Actually, i saw two obj files (vmlinux) with completely different function prologs:
The first case looks like:
push {some registers maybe, fp, lr} (lr ommited in leaf function)
The second case looks like:
push {some registers maybe, fp, sp, lr, pc} (i can confuse the order)
So as i see the second one pushes additionally pc and sp. Also i saw some comments in crash utility (kdump project) where was stated, that kernel stackframe should have format {..., fp, sp, lr, pc} what confuse me more, because i see that in some cases it is not true.
1.) Am i right about that some gcc extra flags are needed for pushing additionally pc and sp in function prolog? If yes what are they?.
2.) What is this used for? Basically, as i understand i can unwind stack with FP and LR only, why do i need this additional values?
3.) If this things dealth nothing with compilation flags - how can i force generation of this extended function prolog and again what is the purpose?
Thank you.
1.) Am i right about that some gcc extra flags are needed for pushing additionally pc and sp in function prolog? If yes what are they?.
There are many gcc options that will affect stack frames (-march, -mtune, etc may affect the instructions used for instance). In your case, it was -mapcs-frame. Also, -fomit-frame-pointer will remove frames from leaf functions. Several static functions maybe merged together into a single generated function further reducing the number of frames. The APCS can cause slightly slower code but is needed for stack traces.
2.) What is this used for? Basically, as i understand i can unwind stack with FP and LR only, why do i need this additional values?
All registers that are not parameters (r0-r3) need to be saved as they need to be restored when returning to the caller. The compiler will allocate additional locals on the stack so sp will almost always change when fp changes. For why the pc is stored, see below.
3.) If this things dealth nothing with compilation flags - how can i force generation of this extended function prolog and again what is the purpose?
It is compiler flags as you had guessed.
; Prologue - setup
mov ip, sp ; get a copy of sp.
stm sp!, {fp, ip, lr, pc} ; Save the frame on the stack. See Addendum
sub fp, ip, #4 ; Set the new frame pointer.
...
; Epilogue - return
ldm sp, {fp, sp, lr} ; restore stack, frame pointer and old link.
... ; maybe more stuff here.
bx lr ; return.
A typical save is stm sp!, {fp, ip, lr, pc} and a restore of ldm sp, {fp, sp, lr}. This is correct if you examine the ABI/APCS documents. Note, there is no '!' to try and fix the stack. It is loaded explicitly from the stored ip value.
Also, the saved pc is not used in the epilogue. It is just discarded data on the stack. So why do this? Exception handlers (interrupts, signals or C++ exceptions) and other stack trace mechanisms want to know who saved a frame. The ARM always only have one function prologue (one point of entry). However, there are multiple exits. In some cases, a return like return function(); may actually turn into a b function in the maybe more stuff here. This is known as a tail call. Also when a leaf function is called in the middle of a routine and an exception occurs, it will see a PC range of leaf, but the leaf may have no call frame. By saving the pc, the call frame can be examined when an exception occurs in leaf to know who really saved the stack. Tables of pc versus destructor, etc. maybe stored to allow objects to be freed or to figure out how to call a signal handler. The extra pc is just plain nice when tracing a stack and the operation is almost free due to pipe lining.
See also: ARM Link and frame register question for how the compiler uses these registers.

GCC C and ARM Assembly Stack Cleanup

If I call an ARM assembly function from C, sometimes I need to pass in many arguments. If they do not fit in registers r0, r1, r2, r3 it is generally expected that 5-th, 6-th ... x-th arguments are pushed onto stack so that ARM assembly can read them from it.
So in the ARM function I receive some arguments that are on the stack. After finishing the assembly function I can either remove these arguments from stack or leave them there and expect that the C program will deal with them later.
If we are talking about GCC C and ARM assembly who is usually responsible for cleaning up the stack?
The function that made the call (A)
Or the function that was called (B)
I understand that when developing we could agree on either convention. But what is generally used as the default in this particular case (ARM assembly and GCC C)?
And how would generally a low level piece of code describe which behavior it implements? It seems that there should be some kind of standard description for this. If there isn't one it seems that you pretty much just have to try them both and look at which one does not crash.
If someone is interested in how the code could look like:
arm_function:
stmfd sp, {r4-r12, lr} # Save registers that are not the first three registers, SP->PASSED ARGUMENTS
ldmfd sp, {r4-r6} # Load 3 arguments that were passed through the stack, SP->PASSED ARGUMENTS
sub sp, sp, #40 # Adjust the stack pointer so it points to saved registers, STACK POINTER->SAVED REGISTERS->PASSED ARGUMENTS
#The main function body.
ldmfd sp!, {r4-r12, lr}, # Load saved registers STACK POINTER->PASSED ARGUMENTS
add sp, sp, #12 # Increment stack pointer to remove passed arguments, SP->NOTHING
# If the last code line would not be there, the caller would need to remove the arguments from stack.
UPDATE:
It seems that for C/C++ choice A. is pretty standard. Compilers usually use calling conventions like cdecl that work pretty similar to code in the answers below. More information can be found in this link about calling conventions. Changing C/C++ calling convention for a function does not seem to be so common/easy. With older C standard I could not manage to change it, so it looks like using A should be a decent default choice.
The current ARM procedure call standard is AAPCS.
The language-specific ABI can be found here. Relevant will be the document about C, but others should be similar (why reinvent the wheel?).
A good start for reading might be page 14 in the AAPCS.
It basically requires the caller to clean up the stack, as this is the most simple way: push additional arguments onto the stack, call the function and after return simply adjust the stack pointer by adding an offset (the number of bytes pushed on the stack; this is always a multiple of 4 (the "natural 32bit ARM word size).
But if you use gcc, you can just avoid handling the stack yourself by using inline assembler. This provides features to pass C variables (etc.) to the assembler code. This will also automatically load a parameter into a register if required. Just have a look at the gcc documentation. It is a bit hard to figure out in detail, but I prefer this to having raw assember stubs somewhere.
Ok, i added this as there might be problems understanding the principle:
caller:
...
push r5 // argument which does not fit into r0..r3 anymore
bl callee
add sp,4 // adjust SP
callee:
push r5-r7,lr // temp, variables, return address
sub sp,8 // local variables
// processing
add sp, 8 // restore previous stack frame
pop r5-r7,pc // restore temp. variables and return (replaces bx)
You can verify this by just disassmbling some sample C functions. Note that the pre- and postamble may vary if no temp registers are used or the function does not call another function (no need to stack lr for this).
Also, the caller might have to stack r0..r3 before the call. But that is a matter of compiler optimizations.
Disassembly can be done with gdb and objdump for example.
I use -mabi=aapcs for gcc invocation; not sure if gcc would otherwise use a different standard. Note that all object files have to use the same standard.
Edit:
Just had a peek in the AAPCS and that states that the SP need only 4 byte alignment. I might have confused this with the Cortex-M interrupt handling system which (for whatever reason, possibly for M7 which has 64 bit busses) aligns the SP to 8 bytes by default (software-config option).
However, SP must be 8 byte aligned at a public interface. Ok, the standard actually is more complicated than I remembered. That's why I prefer gcc caring about this stuff.
If some spaces allocated on the stack by caller function (argument passing), stack clearance done within the caller function. And how it happens you may ask. In ARM #Olaf has completely cleared, and in x86 it is usually like this:
sub esp, 8 ; make some room
... ; move arguments on stack
call func
add esp, 8 ; clean the stack
or
push eax ; push the arguments
push ebx ; or pusha, then after call, popa
call func
add esp, 8 ; assuming registers are 4 bytes each
Also how the interaction between caller and callee in a system takes places is explained in ABI (Application Binary Interface) You may find it useful.

How to get address of base stack pointer

I am in the process of porting an application from x86 to x64. I am using Visual Studio 2009; most of the code is C++ and some portions are plain C. The __asm keyword is not supported when compiling towards x64 and our application contains a few portions of inline assembler. I did not write this code so I don't know exactly what et is supposed to do:
int CallStackSize() {
DWORD Frame;
PDWORD pFrame;
__asm
{
mov EAX, EBP
mov Frame, EAX
}
pFrame = (PDWORD)Frame;
/*... do stuff with pFrame here*/
}
EBP is the base pointer to the stack of the current function. Is there some way to obtain the stack pointer without using inline asm? I have been looking at the intrinsics that Microsoft offers as a substitute for inline asm but I could not find anything that gave me something usefull. Any ideas?
Andreas asked what stuff is done with pFrame. Here is the complete function:
int CallStackSize(DWORD frameEBP = 0)
{
DWORD pc;
int tmpint = 0;
DWORD Frame;
PDWORD pFrame, pPrevFrame;
if(!frameEBP) // No frame supplied. Use current.
{
__asm
{
mov EAX, EBP
mov Frame, EAX
}
}
else Frame = frameEBP;
pFrame = (PDWORD)Frame;
do
{
pc = pFrame[1];
pPrevFrame = pFrame;
pFrame = (PDWORD)pFrame[0]; // precede to next higher frame on stack
if ((DWORD)pFrame & 3) // Frame pointer must be aligned on a DWORD boundary. Bail if not so.
break;
if (pFrame <= pPrevFrame)
break;
// Can two DWORDs be read from the supposed frame address?
if(IsBadWritePtr(pFrame, sizeof(PVOID)*2))
break;
tmpint++;
} while (true);
return tmpint;
}
The variable pc is not used. It looks like this function walks down the stack until it fails. It assumes that it can't read outside the applications stack so when it fails it has measured the depth of the call stack. This code does not need to compile on _EVERY_SINGLE compiler out there. Just VS2009. The application does not need to run on EVERY_SINGLE computer out there. We have complete control of deployment since we install/configure it ourselves and deliver the whole thing to our customers.
The really right thing to do would be to rewrite whatever this function does so that it does not require access to the actual frame pointer. That is definitely bad behavior.
But, to do what you are looking for you should be able to do:
int CallStackSize() {
__int64 Frame = 0; /* MUST be the very first thing in the function */
PDWORD pFrame;
Frame++; /* make sure that Frame doesn't get optimized out */
pFrame = (PDWORD)(&Frame);
/*... do stuff with pFrame here*/
}
The reason this works is that in C usually the first thing a function does is save off the location of the base pointer (ebp) before allocating local variables. By creating a local variable (Frame) and then getting the address of if, we're really getting the address of the start of this function's stack frame.
Note: Some optimizations could cause the "Frame" variable to be removed. Probably not, but be careful.
Second Note: Your original code and also this code manipulates the data pointed to by "pFrame" when "pFrame" itself is on the stack. It is possible to overwrite pFrame here by accident and then you would have a bad pointer, and could get some weird behavior. Be especially mindful of this when moving from x86 to x64, because pFrame is now 8 bytes instead of 4, so if your old "do stuff with pFrame" code was accounting for the size of Frame and pFrame before messing with memory, you'll need to account for the new, larger size.
You can use the _AddressOfReturnAddress() intrinsic to determine a location in the current frame pointer, assuming it hasn't been completely optimized away. I'm assuming that the compiler will prevent that function from optimizing away the frame pointer if you explicitly refer to it. Or, if you only use a single thread, you can use the IMAGE_NT_HEADER.OptionalHeader.SizeOfStackReserve and IMAGE_NT_HEADER.OptionalHeader.SizeOfStackCommit to determine the main thread's stack size. See this for how to access the IMAGE_NT_HEADER for the current image.
I would also recommend against using IsBadWritePtr to determine the end of the stack. At the very least you will probably cause the stack to grow until you hit the reserve, as you'll trip a guard page. If you really want to find the current size of the stack, use VirtualQuery with the address you are checking.
And if the original use is to walk the stack, you can use StackWalk64 for that.
There is no guarantee that RBP (the x64's equivalent of EBP) is actually a pointer to the current frame in the callstack. I guess Microsoft decided that despite several new general purpose registers, that they needed another one freed up, so RBP is only used as framepointer in functions that call alloca(), and in certain other cases. So even if inline assembly were supported, it would not be the way to go.
If you just want to backtrace, you need to use StackWalk64 in dbghelp.dll. It's in the dbghelp.dll that's shipped with XP, and pre-XP there was no 64-bit support, so you shouldn't need to ship the dll with your application.
For your 32-bit version, just use your current method. Your own methods will likely be smaller than the import library for dbghelp, much less the actual dll in memory, so it is a definite optimization (personal experience: I've implemented a Glibc-style backtrace and backtrace_symbols for x86 in less than one-tenth the size of the dbghelp import library).
Also, if you're using this for in-process debugging or post-release crash report generation, I would highly recommend just working with the CONTEXT structure supplied to the exception handler.
Maybe some day I'll decide to target the x64 seriously, and figure out a cheap way around using StackWalk64 that I can share, but since I'm still targeting x86 for all my projects I haven't bothered.
Microsoft provides a library (DbgHelp) which takes care of the stack walking, and you should use instead of relying on assembly tricks. For example, if the PDB files are present, it can walk optimized stack frames too (those that don't use EBP).
CodeProject has an article which explains how to use it:
http://www.codeproject.com/KB/threads/StackWalker.aspx
If you need the precise "base pointer" then inline assembly is the only way to go.
It is, surprisingly, possible to write code that munges the stack with relatively little platform-specific code, but it's hard to avoid assembly altogether (depending on what you're doing).
If all you're trying to do is avoid overflowing the stack, you can just take the address of any local variable.
.code
PUBLIC getStackFrameADDR _getStackFrameADDR
getStackFrameADDR:
mov RAX, RBP
ret 0
END
Something like that could work for you.
Compile it with ml64 or jwasm and call it using this in your code
extern "C" void getstackFrameADDR(void);

Resources