Are ARM Cortex-M0 Stacking Registers Saved On $psp or $msp During Hardfault?


I have an issue where my Cortex-M0 is hard faulting, so I am trying to debug it. I am trying to print the contents of the ARM core registers that were pushed to the stack when the hard fault occurred.
Here is my basic assembly code:
__attribute__((naked)) void HardFaultVector(void) {
    asm volatile(
        // check LR to see whether the process stack or the main stack was being used at time of exception.
        "mov r2, lr\n"
        "mov r3, #0x4\n"
        "tst r2, r3\n"
        "beq _MSP\n"
        // process stack was being used.
        "_PSP:\n"
        "mrs r0, psp\n"
        "b _END\n"
        // main stack was being used.
        "_MSP:\n"
        "mrs r0, msp\n"
        "b _END\n"
        "_END:\n"
        "b fault_handler\n"
    );
}
The function fault_handler will print the contents of the stack frame that was pushed to either the process stack or the main stack. Here's my question though:
When I print the contents of the stack frame that supposedly has the saved registers, here is what I see:
Stack frame at 0x20000120:
pc = 0xfffffffd; saved pc 0x55555554
called by frame at 0x20000120, caller of frame at 0x20000100
Arglist at unknown address.
Locals at unknown address, Previous frame's sp is 0x20000120
Saved registers:
r0 at 0x20000100, r1 at 0x20000104, r2 at 0x20000108, r3 at 0x2000010c, r12 at 0x20000110, lr at 0x20000114, pc at 0x20000118, xPSR at 0x2000011c
You can see the saved registers, these are the registers that are pushed by the ARM core when a hard fault occurs. You can also see the line pc = 0xfffffffd; which indicates that this is the LR's EXC_RETURN value. The value 0xfffffffd indicates to me that the process stack was being used at the time of the hard fault.
If I print the $psp value, I get the following:
gdb $ p/x $psp
$91 = 0x20000328
If I print the $msp value, I get the following:
gdb $ p/x $msp
$92 = 0x20000100
You can clearly see that the $msp is pointing to the top of the stack where supposedly the saved registers are located. Doesn't this mean that the main stack has the saved registers that the ARM core pushed to the stack?
If I print the memory contents, starting at the $msp address, I get the following:
gdb $ x/8xw 0x20000100
0x20000100 <__process_stack_base__>: 0x55555555 0x55555555 0x55555555 0x55555555
0x20000110 <__process_stack_base__+16>: 0x55555555 0x55555555 0x55555555 0x55555555
It's empty...
Now, if I print the memory contents, starting at the $psp address, I get the following:
gdb $ x/8xw 0x20000328
0x20000328 <__process_stack_base__+552>: 0x20000860 0x00000054 0x00000054 0x20000408
0x20000338 <__process_stack_base__+568>: 0x20000828 0x08001615 0x1ad10800 0x20000000
This looks more accurate. But I thought the saved registers are supposed to indicate where in flash memory they are located? So how does this make sense?

The comments by old_timer under your question are all correct. The registers will be pushed to the active stack on exception entry, whether this is PSP or MSP at the time. By default, all code uses the main stack (MSP), but if you're using anything other than complete bare metal it's likely that whatever kernel you're using has switched Thread mode to using the process stack (PSP).
Most of your investigations suggest that the PSP was in use, with your memory peek around the PSP and MSP being pretty much indisputable. The only bit of evidence you have for it having been the MSP is the results of the fault_handler function, for which you have not posted the source; so my first guess would be that this function is broken in some way.
Do also remember that one common reason for entering the HardFault handler is that another exception handler has caused an exception. This can easily happen in cases of memory corruption. In these cases (assuming Thread mode uses the PSP) the CPU will first enter Handler mode in response to the original exception, pushing r0-r3,r12,lr,pc,psr to the process stack. It will start executing the original exception handler, then fault again, pushing r0-r3,r12,lr,pc,psr to the main stack while entering the HardFault handler. There's often some unravelling to do.
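The stack selection described above can also be written down in C. This is only a sketch: the names stacked_frame_t, frame_is_on_psp and select_frame are made up for illustration, and it assumes the basic eight-word frame (r0-r3, r12, lr, pc, xPSR) with the MSP and PSP values already sampled at handler entry, for example by a small assembly stub.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The eight words the core pushes on exception entry, lowest address first.
 * The type name is hypothetical; the order matches the register list above. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} stacked_frame_t;

/* EXC_RETURN bit 2: 0 means the frame went to the main stack (MSP),
 * 1 means it went to the process stack (PSP). */
static int frame_is_on_psp(uint32_t exc_return)
{
    return (exc_return & 0x4u) != 0;
}

/* Pick the frame address from the two stack pointer values that were
 * sampled on entry to the handler. */
static const stacked_frame_t *select_frame(uint32_t exc_return,
                                           const void *msp, const void *psp)
{
    assert(msp != NULL && psp != NULL);
    return (const stacked_frame_t *)(frame_is_on_psp(exc_return) ? psp : msp);
}
```

With the EXC_RETURN value from the question (0xfffffffd), frame_is_on_psp() returns 1, so a handler built this way would read the frame from the PSP. For 0xfffffff9 or 0xfffffff1 it would read from the MSP instead.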
old_timer also mentions using real assembly language, and I agree here too. Even though the ((naked)) attribute should be removing the prologue and epilogue (between them most of the possible 'compilerisms'), your code would simply be far more readable if it was written in bare assembly language. Inline assembly language has its uses, for example if you want to do something very low-level that you can't do from C but you want to avoid a call-return overhead. But when your entire function is written in assembly language, there's no reason to use it.

Related

Where do the values of uninitialized variables come from, in practice on real CPUs?

I want to know the way variables are initialized :
#include <stdio.h>
int main(void)
{
    int ghosts[3];
    for (int i = 0; i < 3; i++)
        printf("%d\n", ghosts[i]);
    return 0;
}
this gets me random values like -12 2631 131 .. where did they come from?
For example with GCC on x86-64 Linux: https://godbolt.org/z/MooEE3ncc
I have a guess to answer my question, it could be wrong anyways:
The registers of the memory after they are 'emptied' get random voltages between 0 and 1, these values get 'rounded' to 0 or 1, and these random values depend on something?! Maybe the way registers are made? Maybe the capacity of the memory comes into play somehow? And maybe even the temperature?!!

Your computer doesn't reboot or power cycle every time you run a new program. Every bit of storage in memory or registers your program can use has a value left there by some previous instruction, either in this program or in the OS before it started this program.
If it did power-cycle first (as a microcontroller does, for example), then yes, each bit of storage might settle into a 0 or 1 state during the voltage fluctuations of powering on, except in storage engineered to power up in a certain state. (DRAM is more likely to be 0 on power-up, because its capacitors will have discharged.) But you'd also expect there to be internal CPU logic that does some zeroing or setting of things to guaranteed state before fetching and executing the first instruction of code from the reset vector (a memory address); system designers normally arrange for there to be ROM at that physical address, not RAM, so they can put non-random bytes of machine-code there. Code that executes at that address should probably assume random values for all registers.
But you're writing a simple user-space program that runs under an OS, not the firmware for a microcontroller, embedded system, or mainstream motherboard, so power-up randomness is long in the past by the time anything loads your program.
Modern OSes zero registers on process startup, and zero memory pages allocated to user-space (including your stack space), to avoid information leaks of kernel data and data from other processes. So the values must come from something that happened earlier inside your process, probably from dynamic linker code that ran before main and used some stack space.
Reading the value of a local variable that's never been initialized or assigned is not actually undefined behaviour here. (In this case that's because it couldn't have been declared register int ghosts[3]; that's an error (Godbolt) because ghosts[i] effectively uses the address.) See (Why) is using an uninitialized variable undefined behavior? In this case, all the C standard has to say is that the value is indeterminate. So it does come down to implementation details, as you expected.
When you compile without optimization, compilers don't even notice the UB because they don't track usage across C statements. (This means everything is treated somewhat like volatile, only loading values into registers as needed for a statement, then storing again.)
In the example Godbolt link I added to your question, notice that -Wall doesn't produce any warnings at -O0, and just reads from the stack memory it chose for the array without ever writing it. So your code is observing whatever stale value was in memory when the function started. (But as I said, that must have been written earlier inside this program, by C startup code or dynamic linking.)
With gcc -O2 -Wall, we get the warning we'd expect: warning: 'ghosts' is used uninitialized [-Wuninitialized], but it does still read from stack space without writing it.
Sometimes GCC will invent a 0 instead of reading uninitialized stack space, but that doesn't happen in this case. There's zero guarantee about how it compiles: the compiler sees the use-uninitialized "bug" and can invent any value it wants, e.g. reading some register it never wrote instead of that memory. For example, since you're calling printf, GCC could have just left ESI uninitialized between printf calls, since that's where ghosts[i] is passed as the 2nd arg in the x86-64 System V calling convention.
Most modern CPUs including x86 don't have any "trap representations" that would make an add instruction fault, and even if they did, the C standard doesn't guarantee that the indeterminate value isn't a trap representation. But IA-64 did have a Not A Thing register result from bad speculative loads, which would trap if you tried to read it. See comments on the trap representation Q&A - Raymond Chen's article: Uninitialized garbage on ia64 can be deadly.
The ISO C rule about it being UB to read uninitialized variables that were candidates for register might be aimed at this, but with optimization enabled you could plausibly still run into this anyway if the taking of the address happens later, unless the compiler takes steps to avoid it. But ISO C defect report N1208 proposes saying that an indeterminate value can be "a value that behaves as if it were a trap representation" even for types that have no trap representations. So it seems that part of the standard doesn't fully cover ISAs like IA-64, the way real compilers can work.
Another case that's not exactly a "trap representation": note that only some object-representations (bit patterns) are valid for _Bool in mainstream ABIs, and violating that can crash your program: Does the C++ standard allow for an uninitialized bool to crash a program?
That's a C++ question, but I verified that GCC will return garbage without booleanizing it to 0/1 if you write _Bool b[2] ; return b[0]; https://godbolt.org/z/jMr98547o. I think ISO C only requires that an uninitialized object has some object-representation (bit-pattern), not that it's a valid one for this object (otherwise that would be a compiler bug). For most integer types, every bit-pattern is valid and represents an integer value. Besides reading uninitialized memory, you can cause the same problem using (unsigned char*) or memcpy to write a bad byte into a _Bool.
An uninitialized local doesn't have "a value"
As shown in the following Q&As, when compiling with optimization, multiple reads of the same uninitialized variable can produce different results:
Is uninitialized local variable the fastest random number generator?
What happens to a declared, uninitialized variable in C? Does it have a value?
The other parts of this answer are primarily about where a value comes from in un-optimized code, when the compiler doesn't really "notice" the UB.

The registers of the memory after they are 'emptied' get random voltages between 0 and 1,
Nothing so mysterious. You are just seeing what was written to those memory locations last time they were used.
When memory is released it is not cleared or emptied. The system just knows that it's free, and the next time somebody needs memory it just gets handed over; the old contents are still there. It's like buying an old car and looking in the glove compartment: the contents are not mysterious, it's just a surprise to find a cigarette lighter and one sock.
Sometimes in a debugging environment freed memory is cleared to some identifiable value so that it's easy to recognize that you are dealing with uninitialized memory. For example 0xccccccccccc or maybe 0xdeadbeefDeadBeef
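To make the "identifiable value" idea concrete, here is a minimal sketch in C. The helper names poison and still_poisoned and the choice of 0xCC as the marker byte are assumptions for illustration, not any particular toolchain's mechanism.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POISON_BYTE 0xCC  /* hypothetical marker value, chosen for illustration */

/* Overwrite a buffer with the marker, as a debug allocator might do on free. */
static void poison(void *buf, size_t len)
{
    assert(buf != NULL);
    memset(buf, POISON_BYTE, len);
}

/* Returns 1 if every byte still holds the marker, i.e. nothing has
 * written to the buffer since it was poisoned. */
static int still_poisoned(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != POISON_BYTE)
            return 0;
    }
    return 1;
}
```

A 32-bit word read from such a buffer shows up as 0xCCCCCCCC, which is easy to spot in a debugger or a hex dump.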
Maybe a better analogy: you are eating in a self-serve restaurant that never cleans its plates. When a customer has finished, they put the plates back on the 'free' pile. When you go to serve yourself, you pick up the top plate from the free pile. You should clean the plate, otherwise you get what was left there by the previous customer.

I am going to use a platform that is easy to see what is going on. The compilers and platforms work the same way independent of architecture, operating system, etc. There are exceptions of course...
In main I am going to call this function:
test();
Which is:
extern void hexstring ( unsigned int );
void test ( void )
{
    unsigned int x[3];
    hexstring(x[0]);
    hexstring(x[1]);
    hexstring(x[2]);
}
hexstring is just a printf("%08X\n",x).
Build it (not using x86, using something that is overall easier to read for this demonstration)
test.c: In function ‘test’:
test.c:7:2: warning: ‘x[0]’ is used uninitialized in this function [-Wuninitialized]
7 | hexstring(x[0]);
| ^~~~~~~~~~~~~~~
test.c:8:2: warning: ‘x[1]’ is used uninitialized in this function [-Wuninitialized]
8 | hexstring(x[1]);
| ^~~~~~~~~~~~~~~
test.c:9:2: warning: ‘x[2]’ is used uninitialized in this function [-Wuninitialized]
9 | hexstring(x[2]);
| ^~~~~~~~~~~~~~~
The disassembly of the compiler output shows
00010134 <test>:
10134: e52de004 push {lr} ; (str lr, [sp, #-4]!)
10138: e24dd014 sub sp, sp, #20
1013c: e59d0004 ldr r0, [sp, #4]
10140: ebffffdc bl 100b8 <hexstring>
10144: e59d0008 ldr r0, [sp, #8]
10148: ebffffda bl 100b8 <hexstring>
1014c: e59d000c ldr r0, [sp, #12]
10150: e28dd014 add sp, sp, #20
10154: e49de004 pop {lr} ; (ldr lr, [sp], #4)
10158: eaffffd6 b 100b8 <hexstring>
We can see that the stack area is allocated:
10138: e24dd014 sub sp, sp, #20
But then we go right into reading and printing:
1013c: e59d0004 ldr r0, [sp, #4]
10140: ebffffdc bl 100b8 <hexstring>
So it prints whatever was on the stack. The stack is just memory with a special hardware pointer.
And we can see the other two items in the array are also read (load) and printed.
So whatever was in that memory at this time is what gets printed. Now the environment I am in likely zeroed the memory (including stack) before we got there:
00000000
00000000
00000000
Now I am optimizing this code to make it easier to read, which adds a few challenges.
So what if we did this:
test2();
test();
In main and:
void test2 ( void )
{
    unsigned int y[3];
    y[0]=1;
    y[1]=2;
    y[2]=3;
}
test2.c: In function ‘test2’:
test2.c:5:15: warning: variable ‘y’ set but not used [-Wunused-but-set-variable]
5 | unsigned int y[3];
|
and we get:
00000000
00000000
00000000
but we can see why:
00010124 <test>:
10124: e52de004 push {lr} ; (str lr, [sp, #-4]!)
10128: e24dd014 sub sp, sp, #20
1012c: e59d0004 ldr r0, [sp, #4]
10130: ebffffe0 bl 100b8 <hexstring>
10134: e59d0008 ldr r0, [sp, #8]
10138: ebffffde bl 100b8 <hexstring>
1013c: e59d000c ldr r0, [sp, #12]
10140: e28dd014 add sp, sp, #20
10144: e49de004 pop {lr} ; (ldr lr, [sp], #4)
10148: eaffffda b 100b8 <hexstring>
0001014c <test2>:
1014c: e12fff1e bx lr
test didn't change but test2 is dead code as one would expect when optimized, so it did not actually touch the stack. But what if we:
test2.c
void test3 ( unsigned int * );
void test2 ( void )
{
    unsigned int y[3];
    y[0]=1;
    y[1]=2;
    y[2]=3;
    test3(y);
}
test3.c
void test3 ( unsigned int *x )
{
}
Now
0001014c <test2>:
1014c: e3a01001 mov r1, #1
10150: e3a02002 mov r2, #2
10154: e3a03003 mov r3, #3
10158: e52de004 push {lr} ; (str lr, [sp, #-4]!)
1015c: e24dd014 sub sp, sp, #20
10160: e28d0004 add r0, sp, #4
10164: e98d000e stmib sp, {r1, r2, r3}
10168: eb000001 bl 10174 <test3>
1016c: e28dd014 add sp, sp, #20
10170: e49df004 pop {pc} ; (ldr pc, [sp], #4)
00010174 <test3>:
10174: e12fff1e bx lr
test2 is actually putting stuff on the stack. Now the calling conventions generally require that the stack pointer is back where it started when you were called, which means function a might move the pointer and read/write some data in that space, call function b, which moves the pointer and reads/writes some data in that space, and so on. When each function returns, it usually does not make sense to clean up; you just move the pointer back and return, and whatever data you wrote to that memory remains.
So if test2 writes a few things to the stack memory space and then returns, and then another function is called at the same level as test2, the stack pointer is at the same address when test() is called as when test2() was called, in this example. So what happens?
00000001
00000002
00000003
We have managed to control what test() is printing out. Not magic.
Now rewind back to the 1960s and then work forward to the present, particularly 1980s and later.
Memory was not always cleaned up before your program ran. As some folks here are implying, if you were doing banking on a spreadsheet and then closed that program and opened this program, back in the day you would almost expect to see some data from that spreadsheet program: maybe the binary, maybe the data, maybe something else. Due to the nature of the operating system's use of memory, it may be a fragment of the last program you ran, a fragment of the one before that, a fragment of a program still running that just did a free(), and so on.
Naturally, once we started to get connected to each other and hackers wanted to take over and send themselves your info or do other bad things, you can see how trivial it would be to write a program to look for passwords or bank accounts or whatever.
So not only do we have protections today to prevent one program sniffing around in another program's space, we generally assume that, today, before our program gets some memory that was used by some other program, it is wiped.
But if you disassemble even a simple hello world printf program you will see that there is a fair amount of bootstrap code that runs before main() is called. As far as the operating system is concerned, all of that code is part of our one program. So even if (let's assume) memory were zeroed or cleaned before the OS loads and launches our program, before main, within our program, we are using the stack memory to do stuff, leaving behind values that a function like test() will see.
You may find that each time you run the same binary, one compile many runs, that the "random" data is the same. Now you may find that if you add some other shared library call or something to the overall program, then maybe, maybe, that shared library stuff causes extra code pre-main to happen to try to be able to call the shared code, or maybe as the program runs it takes different paths now because of a side effect of a change to the overall binary and now the random values are different but consistent.
There are explanations why the values could be different each time from the same binary as well.
There is no ghost in the machine though. Stack is just memory. It is not uncommon for a computer to wipe that memory once when it boots, if for no other reason than to set the ECC bits. After that, the memory gets reused again and again. What happens to be in memory where the stack pointer is pointing when your program runs depends on the overall architecture of the operating system, how the compiler builds your application and shared libraries, and other factors. If you read before you write (as a rule, never read before you write; it is good that compilers are now throwing warnings), what you see is not necessarily random: the specific sequence of events that led to that point was controlled rather than random, but the values are not ones that you as the programmer could have predicted. That is particularly true if you do this at the main() level as you have. But be it main or seventeen levels of nested function calls, it is still just some memory that may or may not contain some stuff from before you got there. Even if the bootloader zeros memory, that is still a written zero that was left behind from some other program that came before you.
There are no doubt compilers that have features that relate to the stack that may do more work like zero at the end of the call or zero up front or whatever for security or some other reason someone thought of.
I would assume today that when an operating system like Windows or Linux or macOS runs your program it is not giving you access to some stale memory values from some other program that came before (spreadsheet with my banking information, email, passwords, etc). But you can trivially write a program to try (just malloc() and print or do the same thing you did but bigger to look at the stack). I also assume that program A does not have a way to get into program B's memory that is running concurrently. At least not at the application level. Without hacking (malloc() and print is not hacking in my use of the term).

The array ghosts is uninitialized, and because it was declared inside of a function and is not static (formally, it has automatic storage duration), its values are indeterminate.
This means that you could read any value, and there's no guarantee of any particular value.

Stacktrace on ARM cortex-M4

When I run into a fault handler on my ARM Cortex-M4 (Thumb) I get a snapshot of the CPU registers just before the fault occurred. With this information I can find where the stack pointer was. Now, what I want is to backtrace through all the functions it passed through. The only problem I see here is that I don't have a frame pointer, so I cannot really see where a certain subroutine has saved the LR, ad infinitum.
How would one tackle this problem if the frame pointer is not available in r7?

This blog post discusses this issue with reference to the MIPS architecture - the principles can be readily adapted to ARM architectures.
In short, it describes three possibilities for locating the stack frame for a given SP and PC:
Using compiler-generated debug information (not included in the executable image) to calculate it.
Using compiler-generated stack-unwinding (exception handling) information (included in the executable image) to calculate it.
Scanning the call site to locate the prologue or epilogue code that adjusts the stack pointer, and deducing the stack frame address from that.
Obviously it's very compiler- and compiler-option dependent, and not guaranteed to work in all cases.

R7 is not the frame pointer on the M4, it's R11. R7 is the FP for Cortex-M0+/M1 where only the lower registers are generally available. In any case, when Cortex-M makes a call to a function using BL and variants, it saves the return address into LR (link register). At function entry, the LR is saved onto the stack. So in theory, to get a call trace, you would "chase" the chain of the LRs.
Unfortunately, the saved location of LR on the stack is not defined by the calling convention, and its location must be deduced from the debug info for that function entry in the DWARF records (in the .elf file). I do not know if there is a utility that would extract the LR locations from an ELF file, but it should not be too difficult.
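When the DWARF information is not at hand, a much cruder fallback is sometimes used: scan the stack for words that merely look like saved return addresses. To be clear, this is a different technique from the debug-info approach above and it can produce false positives. The sketch below assumes a Thumb target, so a saved LR has bit 0 set, and a code region whose bounds the caller supplies; the function names and the address range used in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Heuristic only: a saved LR on a Thumb target has bit 0 set and points
 * into the code region [code_start, code_end). */
static int looks_like_return_address(uint32_t word,
                                     uint32_t code_start, uint32_t code_end)
{
    return (word & 1u) != 0 && word >= code_start && word < code_end;
}

/* Walk the stack words upwards from the faulting SP and collect the
 * candidates. Returns the number of candidates written to out. */
static size_t scan_stack(const uint32_t *stack, size_t nwords,
                         uint32_t code_start, uint32_t code_end,
                         uint32_t *out, size_t max_out)
{
    assert(stack != NULL && out != NULL);
    size_t found = 0;
    for (size_t i = 0; i < nwords && found < max_out; i++) {
        if (looks_like_return_address(stack[i], code_start, code_end))
            out[found++] = stack[i] & ~1u;  /* clear the Thumb bit */
    }
    return found;
}
```

For example, with a code region of 0x08000000 to 0x08100000 (an assumed flash range), the word 0x08001615 would be reported as candidate 0x08001614, while data pointers such as 0x20000860 would be skipped. Each candidate still has to be checked against the disassembly to see whether it really follows a BL instruction.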

Richard at ImageCraft is right.
More information can be found here
This works fine with C code. I had a harder time applying it to C++ but it's not impossible.

Find which instruction caused a trap on Cortex M3

I am currently debugging a hard fault trap which turned out to be a precise data bus error on a STM32F205 Cortex-M3 processor, using Keil uVision. Due to a lengthy debugging and googling process I found the assembly instruction that caused the trap. Now I am looking for a way to avoid this lengthy process next time a trap occurs.
In the application note 209 by Keil it says:
PRECISEERR: Precise data bus error:
0 = no precise data bus error
1 = a data bus error has occurred, and the PC value stacked for the exception return points to the instruction that caused the fault. When the processor sets this bit it writes the faulting address to SCB->BFAR
and also this:
An exception saves the state of registers R0-R3, R12, PC & LR either the Main Stack or the Process Stack (depends on the stack in use when the exception occurred).
I interpret the last quote to mean that these 7 registers are saved on the respective stack. When I look up my SP address in memory, I see the address of the instruction that caused the error 10 words above the stack pointer address.
My questions are:
Is the address of the instruction that caused the trap always saved 10 words higher than the current stack pointer? And could you please point out a document where I can read up on how and why this is?
Is there another register that would contain this address as well?

As you said, exceptions (or interrupts) on ARM Cortex-M3 will automatically stack some registers, namely :
Program Counter (PC)
Processor Status Register (xPSR)
r0-r3
r12
Link Register (LR).
For a total of 8 registers (reference : Cortex™-M3 Technical Reference Manual, Chapter 5.5.1).
Now, if you write your exception handler in a language other than assembly, the compiler may stack additional registers, and you can't be sure how many.
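One simple solution is to add a small piece of code before the real handler, to pass the address of the auto-stacked registers :
void myRealHandler( void * stack );
/* naked: no compiler prologue, so SP still points at the auto-stacked
   registers (this assumes the main stack was in use when the fault hit) */
__attribute__((naked)) void handler(void) {
    asm volatile(
        "mov r0, sp\n"
        "b myRealHandler\n"
    );
}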
The register BFAR is specific to bus faults. It will contain the faulty address, not the address of the faulty instruction. For example, if there was an error reading an integer at address 0x30005632, BFAR will be set to 0x30005632.
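As a sketch of how the status bits and BFAR fit together: on Cortex-M3 the bus fault status bits live in the second byte of the CFSR register (0xE000ED28), which puts PRECISERR at bit 9 and BFARVALID at bit 15 of the 32-bit value. The function name below is made up, and the register values are passed in as plain integers so the logic can be tried off-target; on the real part you would read them from SCB->CFSR and SCB->BFAR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFSR_PRECISERR  (1u << 9)   /* bus fault status bit 1 */
#define CFSR_BFARVALID  (1u << 15)  /* bus fault status bit 7 */

/* Writes the faulting data address to *out and returns 1 only when the
 * fault was precise and BFAR is flagged as valid; returns 0 otherwise. */
static int faulting_data_address(uint32_t cfsr, uint32_t bfar, uint32_t *out)
{
    assert(out != NULL);
    if ((cfsr & CFSR_PRECISERR) != 0 && (cfsr & CFSR_BFARVALID) != 0) {
        *out = bfar;
        return 1;
    }
    return 0;
}
```

The address of the faulting instruction itself is not in BFAR; for a precise fault it is the PC value in the auto-stacked registers.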

The precise stack location of the return address depends on how much stack the interrupt handler requires. If you look at the disassembly of your HardFault_Handler, you should be able to see how much data is stored on the stack / how many registers are pushed in addition to the registers pushed by the hardware interrupt machinery (R0-R3, R12, PC, LR & PSR).
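The arithmetic can be sketched like this. In the hardware frame the stacked PC is the seventh word (index 6, after r0, r1, r2, r3, r12 and lr), so its distance from the handler's SP is 6 plus however many words the handler's own prologue pushed. The function name and the extra_words parameter are hypothetical, and the sketch ignores any alignment padding word the core may insert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { HW_FRAME_PC_INDEX = 6 };  /* r0, r1, r2, r3, r12, lr come first */

/* handler_sp: SP as seen inside the handler, after its prologue.
 * extra_words: words the prologue pushed on top of the hardware frame
 * (read this from the handler's disassembly). */
static uint32_t stacked_pc(const uint32_t *handler_sp, size_t extra_words)
{
    assert(handler_sp != NULL);
    return handler_sp[extra_words + HW_FRAME_PC_INDEX];
}
```

If the prologue had pushed four words (an assumed figure, not something taken from the question), the stacked PC would sit 4 + 6 = 10 words above SP, which would be consistent with the offset observed in the question. With a different prologue the offset changes, so it is not a constant.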
I found this to be a pretty good idea on how to debug Hard Faults, though it requires a bit of inline assembly.

GCC C and ARM Assembly Stack Cleanup

If I call an ARM assembly function from C, sometimes I need to pass in many arguments. If they do not fit in registers r0, r1, r2, r3 it is generally expected that 5-th, 6-th ... x-th arguments are pushed onto stack so that ARM assembly can read them from it.
So in the ARM function I receive some arguments that are on the stack. After finishing the assembly function I can either remove these arguments from stack or leave them there and expect that the C program will deal with them later.
If we are talking about GCC C and ARM assembly who is usually responsible for cleaning up the stack?
The function that made the call (A)
Or the function that was called (B)
I understand that when developing we could agree on either convention. But what is generally used as the default in this particular case (ARM assembly and GCC C)?
And how would generally a low level piece of code describe which behavior it implements? It seems that there should be some kind of standard description for this. If there isn't one it seems that you pretty much just have to try them both and look at which one does not crash.
If someone is interested in how the code could look like:
arm_function:
stmfd sp, {r4-r12, lr} # Save registers that are not the first three registers, SP->PASSED ARGUMENTS
ldmfd sp, {r4-r6} # Load 3 arguments that were passed through the stack, SP->PASSED ARGUMENTS
sub sp, sp, #40 # Adjust the stack pointer so it points to saved registers, STACK POINTER->SAVED REGISTERS->PASSED ARGUMENTS
#The main function body.
ldmfd sp!, {r4-r12, lr}, # Load saved registers STACK POINTER->PASSED ARGUMENTS
add sp, sp, #12 # Increment stack pointer to remove passed arguments, SP->NOTHING
# If the last code line would not be there, the caller would need to remove the arguments from stack.
UPDATE:
It seems that for C/C++ choice A. is pretty standard. Compilers usually use calling conventions like cdecl that work pretty similar to code in the answers below. More information can be found in this link about calling conventions. Changing C/C++ calling convention for a function does not seem to be so common/easy. With older C standard I could not manage to change it, so it looks like using A should be a decent default choice.

The current ARM procedure call standard is AAPCS.
The language-specific ABI can be found here. Relevant will be the document about C, but others should be similar (why reinvent the wheel?).
A good start for reading might be page 14 in the AAPCS.
It basically requires the caller to clean up the stack, as this is the simplest way: push additional arguments onto the stack, call the function and after return simply adjust the stack pointer by adding an offset (the number of bytes pushed on the stack; this is always a multiple of 4, the "natural" 32-bit ARM word size).
But if you use gcc, you can just avoid handling the stack yourself by using inline assembler. This provides features to pass C variables (etc.) to the assembler code. This will also automatically load a parameter into a register if required. Just have a look at the gcc documentation. It is a bit hard to figure out in detail, but I prefer this to having raw assembler stubs somewhere.
OK, I added this as there might be problems understanding the principle:
caller:
    ...
    push {r5}         // argument which does not fit into r0..r3 anymore
    bl callee
    add sp, #4        // adjust SP
callee:
    push {r5-r7, lr}  // temp. variables, return address
    sub sp, #8        // local variables
    // processing
    add sp, #8        // restore previous stack frame
    pop {r5-r7, pc}   // restore temp. variables and return (replaces bx)
You can verify this by just disassembling some sample C functions. Note that the pre- and postamble may vary if no temp registers are used or the function does not call another function (no need to stack lr for this).
Also, the caller might have to stack r0..r3 before the call. But that is a matter of compiler optimizations.
Disassembly can be done with gdb and objdump for example.
I use -mabi=aapcs for gcc invocation; not sure if gcc would otherwise use a different standard. Note that all object files have to use the same standard.
Edit:
Just had a peek in the AAPCS and that states that the SP need only 4 byte alignment. I might have confused this with the Cortex-M interrupt handling system which (for whatever reason, possibly for M7 which has 64 bit busses) aligns the SP to 8 bytes by default (software-config option).
However, SP must be 8 byte aligned at a public interface. Ok, the standard actually is more complicated than I remembered. That's why I prefer gcc caring about this stuff.
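The bookkeeping on the caller's side can be sketched in C. This is a simplification that assumes every argument is a single 32-bit word (no doubles, no structs passed by value), and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { CORE_ARG_REGS = 4, WORD_BYTES = 4, PUBLIC_SP_ALIGN = 8 };

/* Bytes of stack needed for the arguments that do not fit in r0-r3. */
static size_t stacked_arg_bytes(size_t nargs)
{
    size_t spilled = nargs > CORE_ARG_REGS ? nargs - CORE_ARG_REGS : 0;
    return spilled * WORD_BYTES;
}

/* Amount the caller moves SP by, rounded up so that SP is still
 * 8-byte aligned at the call if it was aligned beforehand. */
static size_t caller_sp_adjust(size_t nargs)
{
    size_t bytes = stacked_arg_bytes(nargs);
    assert(bytes % WORD_BYTES == 0);  /* always a multiple of the word size */
    return (bytes + PUBLIC_SP_ALIGN - 1) / PUBLIC_SP_ALIGN * PUBLIC_SP_ALIGN;
}
```

For the seven-argument function in the question, stacked_arg_bytes(7) gives 12, matching the three stacked arguments there, and a caller that wants to keep SP 8-byte aligned would reserve 16. Whatever amount the caller subtracts before the call, the same amount is added back after the callee returns.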

If the caller allocates space on the stack (for argument passing), the caller is also the one that cleans it up. How that happens on ARM has been fully covered by @Olaf; on x86 it usually looks like this:
sub esp, 8 ; make some room
... ; move arguments on stack
call func
add esp, 8 ; clean the stack
or
push eax ; push the arguments
push ebx ; or pusha, then after call, popa
call func
add esp, 8 ; assuming registers are 4 bytes each
Also, how the interaction between caller and callee takes place in a system is explained in the ABI (Application Binary Interface). You may find it useful.

ARM: link register and frame pointer

I'm trying to understand how the link register and the frame pointer work in ARM. I've been to a couple of sites, and I wanted to confirm my understanding.
Suppose I had the following code:
int foo(void)
{
    // ..
    bar();
    // (A)
    // ..
}
int bar(void)
{
    // (B)
    int b1;
    // ..
    // (C)
    baz();
    // (D)
}
int baz(void)
{
    // (E)
    int a;
    int b;
    // (F)
}
and I call foo(). Would the link register contain the address for the code at point (A) and the frame pointer contain the address of the code at point (B)? And could the stack pointer be anywhere inside bar(), after all the locals have been declared?

Some register calling conventions are dependent on the ABI (Application Binary Interface). The FP is required in the APCS standard and not in the newer AAPCS (2003). For the AAPCS (GCC 5.0+) the FP does not have to be used but certainly can be; debug info is annotated with stack and frame pointer use for stack tracing and unwinding code with the AAPCS. If a function is static, a compiler really doesn't have to adhere to any conventions.
Generally all ARM registers are general purpose. The lr (link register, also R14) and pc (program counter, also R15) are special and are enshrined in the instruction set. You are correct that the lr would point to A. The pc and lr are related. One is "where you are" and the other is "where you were". They are the code aspect of a function.
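One way to observe the "where you were" value from C is the GCC/Clang builtin `__builtin_return_address(0)`, which yields the return address of the current function (on ARM, the value lr held on entry). This is only a minimal sketch assuming a GCC-compatible compiler; the variable `last_return_address` and the function bodies are invented for the illustration.

```c
#include <stddef.h>

/* Hypothetical global used only to record what bar() observed. */
static void *last_return_address = NULL;

/* noinline keeps bar() a real call, so a return address exists. */
__attribute__((noinline)) static int bar(void)
{
    /* GCC/Clang builtin: the address bar() returns to, i.e. point (A). */
    last_return_address = __builtin_return_address(0);
    return 0;
}

__attribute__((noinline)) static int foo(void)
{
    int r = bar();
    /* (A) execution resumes here after bar() returns */
    return r + 1;
}
```

After calling `foo()`, `last_return_address` holds an address inside `foo()`, just after the call to `bar()`. The exact value depends on the compiler and the link layout.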
Typically, we have the sp (stack pointer, R13) and the fp (frame pointer, R11). These two are also related. This Microsoft layout does a good job describing things. The stack is used to store temporary data or locals in your function. Any variables in foo() and bar() are stored here, on the stack or in available registers. The fp keeps track of the variables from function to function. It is a frame or picture window on the stack for that function. The ABI defines a layout of this frame. Typically the lr and other registers are saved here behind the scenes by the compiler, as well as the previous value of fp. This makes a linked list of stack frames, and if you want you can trace it all the way back to main(). The root is fp, which points to one stack frame (like a struct) with one variable in the struct being the previous fp. You can go along the list until the final fp, which is normally NULL.
So the sp is where the stack is and the fp is where the stack was, a lot like the pc and lr. Each old lr (link register) is stored in the old fp (frame pointer). The sp and fp are a data aspect of functions.
Your point B is the active pc and sp. Point A is actually the fp and lr, unless you call yet another function, in which case the compiler might get ready to set up the fp to point to the data in B.
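The linked-list view of stack frames can be sketched in portable C by modelling each frame as a struct. This is a simulation, not real stack unwinding: `struct frame`, its field names and `backtrace_depth` are assumptions made for the example, and a real ABI fixes its own layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified model of one stack frame record (layout is illustrative). */
struct frame {
    const struct frame *prev_fp; /* caller's frame, NULL at the root */
    uint32_t saved_lr;           /* return address into the caller */
};

/* Follow prev_fp links until the NULL terminator, storing each saved lr. */
static size_t backtrace_depth(const struct frame *fp, uint32_t *out, size_t max)
{
    size_t n = 0;
    while (fp != NULL && n < max) {
        out[n++] = fp->saved_lr;
        fp = fp->prev_fp;
    }
    return n;
}
```

Starting from the innermost frame, the loop visits one record per active function and stops at the root, which is what a debugger does when it prints a backtrace.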
Following is some ARM assembler that might demonstrate how this all works. This will be different depending on how the compiler optimizes, but it should give an idea,
; Prologue - setup
mov ip, sp ; get a copy of sp.
stmdb sp!, {fp, ip, lr, pc} ; Save the frame on the stack. See Addendum
sub fp, ip, #4 ; Set the new frame pointer.
...
; Maybe other functions called here.
; Older caller return lr stored in stack frame.
bl baz
...
; Epilogue - return
ldm sp, {fp, sp, lr} ; restore stack, frame pointer and old link.
... ; maybe more stuff here.
bx lr ; return.
This is roughly what bar() would look like (it is the function that calls baz()); foo() would have the same shape with bl bar instead. If a function does not call anything else, then the compiler does a leaf optimization and doesn't need to save the frame; only the bx lr is needed. This is most likely why you are confused by web examples. It is not always the same.
The take-away should be,
pc and lr are related code registers. One is "Where you are", the other is "Where you were".
sp and fp are related local data registers. One is "Where local data is", the other is "Where the last local data is".
They work together, along with parameter passing, to create the function machinery.
It is hard to describe a general case because we want compilers to be as fast as possible, so they use every trick they can.
These concepts are generic to all CPUs and compiled languages, although the details can vary. The link register and frame pointer are used in the function prologue and epilogue, and if you understood everything, you know how a stack overflow works on an ARM.
See also: ARM calling convention.
MSDN ARM stack article
University of Cambridge APCS overview
ARM stack trace blog
Apple ABI link
The basic frame layout is,
fp[-0] saved pc, where we stored this frame.
fp[-1] saved lr, the return address for this function.
fp[-2] previous sp, before this function eats stack.
fp[-3] previous fp, the last stack frame.
many optional registers...
An ABI may use other values, but the above are typical for most setups. The indexes above are for 32 bit values as all ARM registers are 32 bits. If you are byte-centric, multiply by four. The frame is also aligned to at least four bytes.
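To make the indexing concrete, here is a small simulation of the prologue shown earlier (`mov ip, sp`, `stmdb sp!, {fp, ip, lr, pc}`, `sub fp, ip, #4`) on an array of 32-bit words. The `struct cpu` type, `push_frame`, `frame_word` and all register values are invented for the illustration, and `sp` and `fp` are word indexes into the array rather than byte addresses.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16u

/* Toy register file; sp and fp are word indexes into stack[]. */
struct cpu {
    uint32_t stack[STACK_WORDS];
    uint32_t sp, fp, lr, pc;
};

/* Mimics: mov ip, sp / stmdb sp!, {fp, ip, lr, pc} / sub fp, ip, #4 */
static void push_frame(struct cpu *c)
{
    uint32_t ip = c->sp;          /* copy of sp before the push */
    assert(c->sp >= 4 && c->sp <= STACK_WORDS);
    c->stack[ip - 1] = c->pc;     /* highest address: saved pc   -> fp[-0] */
    c->stack[ip - 2] = c->lr;     /* saved lr                    -> fp[-1] */
    c->stack[ip - 3] = ip;        /* previous sp                 -> fp[-2] */
    c->stack[ip - 4] = c->fp;     /* lowest address: previous fp -> fp[-3] */
    c->sp = ip - 4;               /* write-back from stmdb sp! */
    c->fp = ip - 1;               /* one word (4 bytes) below the old sp */
}

/* Read fp[-index] in the notation used above. */
static uint32_t frame_word(const struct cpu *c, uint32_t index)
{
    return c->stack[c->fp - index];
}
```

Multiplying an index by four gives the byte offset, so fp[-3] sits 12 bytes below the address held in fp.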
Addendum: This is not an error in the assembler; it is normal. An explanation is in the ARM generated prologs question.

Disclaimer: I think this is roughly right; please correct as needed.
As indicated elsewhere in this Q&A, be aware that the compiler may not be required to generate (ABI) code that uses frame pointers. Frames on the call stack can often require useless information to be put there.
If the compiler options call for 'no frames' (a pseudo option flag), then the compiler can generate smaller code that keeps call stack data smaller. The calling function is compiled to only store the needed calling info on the stack, and the called function is compiled to only pop the needed calling information from the stack.
This saves execution time and stack space - but it makes tracing backwards in the calling code extremely hard (I gave up trying to...)
Info about the size and shape of the calling information on the stack is only known by the compiler and that info was thrown away after compile time.
