Very Basic ARM Assembly Questions (add, compare) - arm

TLDR: What exactly does bx lr do?
I have trouble understanding these two following examples:
Add Example:
I understand that the code "add r0, r0, r1" adds r1 to r0 and stores the result in register 0. What I do not understand is how the code "bx lr" knows
to return r0 without explicitly stating r0.
Compare Example:
Same here: I understand that the code "BGT r0_Gt" checks whether r0 > r1 and, if this is true, the code will skip to r0_gt. However, how does bx lr know how to return the correct value?

It is defined by the ABI in use; for ARM this is the EABI, whose procedure call standard (AAPCS) states in section 5.4, "Result Return":
A Fundamental Data Type that is smaller than 4 bytes is zero- or sign-extended to a word and returned in r0.
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042f/IHI0042F_aapcs.pdf
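As a rough illustration of the quoted rule, the C sketch below mimics the widening that happens before a sub-word value lands in r0. The helper names r0_from_i8 and r0_from_u8 are made up for this example; they are not part of any ABI header.

```c
#include <stdint.h>

/* Hypothetical helpers: the 32-bit pattern r0 would hold for a
   narrow return value under the rule quoted above. */

/* signed char: sign-extended to a full word */
uint32_t r0_from_i8(int8_t v)
{
    return (uint32_t)(int32_t)v;
}

/* unsigned char: zero-extended to a full word */
uint32_t r0_from_u8(uint8_t v)
{
    return (uint32_t)v;
}
```

So a signed char of -1 fills the whole register with ones, while an unsigned char of 0xFF leaves the upper 24 bits clear.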

bx lr doesn't return any register at all; it just passes control back to the caller (at the address held in the lr register) without modifying any register other than pc.
The caller then knows, based on the calling convention, that on return, the return value will be in the r0 register (depending on the exact type of the return value and the platform's calling convention).
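To make the convention concrete, here is a small C sketch of the two functions being discussed. The assembly in the comments is what a compiler would typically emit for 32-bit ARM; that is an assumption, since the exact output depends on the compiler and flags, and max_of is just a name picked for this example.

```c
/* a arrives in r0, b arrives in r1 (first two integer arguments).
   Typical output (assumed, compiler dependent):
       add r0, r0, r1
       bx  lr                                                      */
int add(int a, int b)
{
    return a + b;   /* the result is simply left in r0 */
}

/* Whichever branch is taken, the chosen value ends up in r0
   before the final bx lr. */
int max_of(int a, int b)
{
    return (a > b) ? a : b;
}
```

The caller reads r0 after the call returns; nothing in bx lr itself refers to the return value.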

BX simply means branch and exchange: it does a branch and can switch between ARM and Thumb state, if supported on that architecture. LR is just an alias for register 14, it's that simple: bx lr branches to the address in r14.
If you look at the bl instruction, you will see that it sets r14 to the address of the instruction after the bl, which is the return address for a function call.
The pair bl something and, later, bx lr (mov pc, lr also works if you are in ARM mode and don't need to change modes) is how you make function calls on ARM.
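The bl / bx lr pairing can be sketched as a toy model in C. This is only an illustration of the register effects described above, not an emulator; the struct and function names are invented for the example, and bl_arm assumes the call is made from ARM state (4-byte instructions).

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the few pieces of CPU state involved (illustrative only). */
struct cpu {
    uint32_t r[16];   /* r[14] is lr, r[15] is pc */
    bool     thumb;   /* current instruction set state */
};

/* bl target: remember where to come back to, then jump.
   insn_addr is the address of the bl itself. */
void bl_arm(struct cpu *c, uint32_t insn_addr, uint32_t target)
{
    c->r[14] = insn_addr + 4;   /* return address = next instruction */
    c->r[15] = target;
}

/* bx rm: branch to the address in rm. Bit 0 selects Thumb state
   and is not kept in pc. No other register is touched. */
void bx(struct cpu *c, unsigned rm)
{
    uint32_t addr = c->r[rm];
    c->thumb = (addr & 1u) != 0;
    c->r[15] = addr & ~1u;
}
```

Calling bl_arm and then bx with rm set to 14 brings pc back to the instruction after the call, which is all a "return" is at this level.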

The processor has very little concept of context (in an abstract sense). It does not know where it came from, what the registers are for, or if it is in a function call/subroutine. The higher level languages and compiler do know this, and use some common standards to make things easier.
A very small number of operations do have a special, well-defined purpose. A BL instruction not only updates the 'next instruction to execute' (otherwise known as PC or R15), it also magically updates R14 (the link register).
Exceptions (in V7-A) swap a few of the banked core registers, including the register that is usually used to access the stack and the link register. This means that exceptions can happen without losing track of everything else that was going on. Cortex-M does things differently: it uses the stack to help with the banking, and sets R14 to a 'magic value' to indicate whether the most recent call was an exception or not.
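As a sketch of what that 'magic value' looks like, the C function below classifies an lr value the way a Cortex-M3 would on bx lr. The three constants are the EXC_RETURN values for cores without an FPU (an assumption for this example; cores with an FPU define additional values), and the enum and function names are hypothetical.

```c
#include <stdint.h>

enum lr_kind {
    LR_NORMAL_RETURN,     /* ordinary return address from bl */
    LR_TO_HANDLER_MSP,    /* return to Handler mode, main stack */
    LR_TO_THREAD_MSP,     /* return to Thread mode, main stack */
    LR_TO_THREAD_PSP      /* return to Thread mode, process stack */
};

/* Classify the value found in lr at the time of "bx lr". */
enum lr_kind classify_lr(uint32_t lr)
{
    switch (lr) {
    case 0xFFFFFFF1u: return LR_TO_HANDLER_MSP;
    case 0xFFFFFFF9u: return LR_TO_THREAD_MSP;
    case 0xFFFFFFFDu: return LR_TO_THREAD_PSP;
    default:          return LR_NORMAL_RETURN;
    }
}
```

This is why a Cortex-M interrupt handler can be a plain C function ending in bx lr: the hardware recognises the special value and performs the exception return instead of a normal branch.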
Unless an instruction interacts with specific registers, the CPSR in particular, it probably doesn't care about the context. Some operations (related to security) are restricted so that they can only happen in privileged states. This is ultimately used to protect an operating system from user applications, but usually these restrictions apply to accessing very specific control registers.

Related

LDMFD affects R13 oddly

We are using arm9 with ucos. The last instruction of the common porting function OS_CPU_ARM_ExceptHndlr_BrkTask behaves strangely in our system.
Instruction: LDMFD SP!,{R0-R12,LR,PC}^
Let's suppose the SP is 0x10002000, and the following 15 DWORDs (which will be copied to R0-R12, LR, PC) have values from 1 to 15. We find the PC (R15) is changed and jumps to 15, but the SP (R13) is changed to a strange value (an address far outside the stack memory space). I expected it would become 0x1000203C (0x10002000+4*15).
Why is R13 changed this way?
This instruction loads r14 from the stack, just like the other registers. The write to PC causes the jump. This is not a branch and link, which would set the link register to the return address.
Additionally, this instruction is actually an exception return (because of the ^). So, depending on the modes involved, r14 might be banked, and after the exception return you might see a different r14 than the one that was loaded from memory. The same holds for r13: SP is banked per mode, so what you see afterwards is the SP of the mode you returned to, not the written-back SP of the exception mode.

Jump between Thumb and ARM

I am interested in the ARM and Thumb2 commands LDR PC, =ADDR and LDR.W PC, =ADDR for absolute jumps to a certain address.
For example, when I jump from ARM code to ARM code, the command LDR PC, =ADDR is used.
But what happens in the other scenarios?
from ARM to Thumb2
from Thumb2 to Thumb2
from Thumb2 to ARM
When does +1 need to be added to the address, and why?
The rule is actually quite simple:
If bit 0 of the address is 0, the CPU will execute the code as ARM code after the next branch
If bit 0 of the address is 1, the CPU will execute the code as Thumb after the next branch
Of course, if there is a mismatch the CPU will almost certainly fault (after executing some garbage instructions), because it has no way to check whether the code is ARM or Thumb.
This is what explains the +1.
Note that depending on the compiler, and depending on the label used, bit 0 of the address may be automatically set by the compiler.
You just need to read the documentation:
The following instructions write a value to the PC, treating that value as an interworking address to branch to, with low-order bits that determine the new instruction set state:
- BLX (register), BX, and BXJ
- LDR instructions with <Rt> equal to the PC
- POP and all forms of LDM except LDM (exception return), when the register list includes the PC
- in ARM state only, ADC, ADD, ADR, AND, ASR (immediate), BIC, EOR, LSL (immediate), LSR (immediate), MOV, MVN, ORR, ROR (immediate), RRX, RSB, RSC, SBC, and SUB instructions with <Rd> equal to the PC and without flag-setting specified.
Since you mentioned Thumb2, that means ARMv6T2 or newer (or did you say Thumb2 and generically mean Thumb?), and I believe the docs are telling us that the above applies to ARMv6 and ARMv7.
Note that the bit is consumed by the instruction: the pc doesn't carry around a set lsbit in Thumb mode; the bit is only used by the instruction to indicate a mode change.
Also note that you should think in terms of OR 1, not PLUS 1. If you write your code correctly, the toolchain will supply you with the correct address with the correct lsbit, and if you add one to that address you will break the code. If you are paranoid, or have not done it right, you can OR a one into the address: if the bit is already set there is no harm, and if it isn't, this fixes the problem that prevented it from being there. I would never use a plus one for switching to Thumb mode, though.
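The OR-versus-PLUS point is easy to see with plain integer arithmetic. A minimal C sketch, with function names made up for the example:

```c
#include <stdint.h>

/* Safe: sets bit 0 whether or not it was already set. */
uint32_t thumb_target_or(uint32_t addr)
{
    return addr | 1u;
}

/* Fragile: only correct if bit 0 was clear to begin with. */
uint32_t thumb_target_add(uint32_t addr)
{
    return addr + 1u;
}
```

If the toolchain already handed you 0x8001 for a Thumb function at 0x8000, the OR version leaves it at 0x8001, while the add version yields 0x8002: bit 0 is now clear, so the branch would switch to ARM state and land two bytes past the intended entry point.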

Why is SP (apparently) stored on exception entry on Cortex-M3?

I am using a TI LM3S811 (an older Cortex-M3) with the SysTick interrupt set to trigger at 10Hz. This is the body of the ISR:
void SysTick_Handler(void)
{
__asm__ volatile("sub r4, r4, #32\r\n");
}
This produces the following assembly with -O0 and -fomit-frame-pointer with gcc-4.9.3. The STKALIGN bit is 0, so stacks are 4-byte aligned.
00000138 <SysTick_Handler>:
138: 4668 mov r0, sp
13a: f020 0107 bic.w r1, r0, #7
13e: 468d mov sp, r1
140: b401 push {r0}
142: f1ad 0420 sub.w r4, r4, #32
146: f85d 0b04 ldr.w r0, [sp], #4
14a: 4685 mov sp, r0
14c: 4770 bx lr
14e: bf00 nop
I don't understand what's going on with r0 in the listing above. Specifically:
1) It seems like we're clearing the lower 3 bits of SP and storing it on the stack. Is that to maintain 8-byte alignment? Or is it something else?
2) The exception exit procedure is equally confusing. From my limited understanding of ARM assembly, it does something like this:
SP = SP + 4; R0 = SP;
Followed by storing it back to SP. Which seems to undo the manipulations until this stage.
3) Why is there a nop instruction after the unconditional branch (at 0x14E)?
The ARM Procedure Call Standard and the C ABI expect an 8-byte (64-bit) alignment of the stack. As an interrupt might occur right after pushing/popping a single word, it is not guaranteed that the stack is correctly aligned on interrupt entry.
The STKALIGN bit, if set (the default), makes the hardware align the stack automatically by conditionally pushing an extra (dummy) word onto the stack.
The interrupt attribute on a function, on the other hand, tells gcc that the stack might be misaligned, so gcc adds this pre-/postamble, which enforces the alignment.
So both actually do the same thing, one in hardware and one in software. If you can live with a stack that is only word-aligned, you should remove the interrupt attribute from the function declarations and clear the STKALIGN bit.
Make sure such a "misaligned" stack is no problem (I would not expect any, as this is a pure 32-bit CPU). On the other hand, you should leave it as-is unless you really need to save that extra conditional(!) clock and word (very unlikely).
Warning: According to the ARM Architecture Reference Manual, setting STKALIGN == 0 is deprecated. Briefly: do not set this bit to 0!
Since you're using -O0, you should expect lots of redundant and useless code. The general way in which a compiler works is to generate code with the full generality of everything that might be used anywhere in the program, and then rely on the optimizer to get rid of things that are unneeded.
Yes, this is doing 8-byte alignment. It's also allocating a stack frame to hold local variables, even though you have none.
The exit is the reverse, deallocating the stack frame.
The nop at the end is to maintain 4-byte alignment in the code, as you might want to link with non-Thumb code at some point.
If you enable optimization, it will eliminate the stack frame (as it's unneeded) and the code will become much simpler.
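The alignment step in the listing (bic.w r1, r0, #7) is just "clear the low three bits". A minimal C sketch of the same arithmetic, with a helper name chosen for this example:

```c
#include <stdint.h>

/* Round a stack pointer down to the next 8-byte boundary,
   mirroring "bic r1, r0, #7". */
uint32_t align_down_8(uint32_t sp)
{
    return sp & ~UINT32_C(7);
}
```

A stack pointer that is only word-aligned, such as 0x20000FFC, becomes 0x20000FF8, while one that is already 8-byte aligned is unchanged. Because the amount subtracted varies (0 or 4), the prologue has to save the original sp and the epilogue has to reload it rather than add a constant back.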

ARM GCC generated functions prolog

I noticed that ARM toolchains can generate different function prologs. Actually, I saw two object files (vmlinux) with completely different function prologs:
The first case looks like:
push {some registers maybe, fp, lr} (lr omitted in leaf functions)
The second case looks like:
push {some registers maybe, fp, sp, lr, pc} (I may have the order wrong)
So, as I see it, the second one additionally pushes pc and sp. I also saw some comments in the crash utility (kdump project) stating that a kernel stack frame should have the format {..., fp, sp, lr, pc}, which confuses me more, because I see that in some cases this is not true.
1.) Am I right that some extra gcc flags are needed to additionally push pc and sp in the function prolog? If yes, what are they?
2.) What is this used for? Basically, as I understand it, I can unwind the stack with FP and LR only, so why do I need these additional values?
3.) If this has nothing to do with compilation flags, how can I force generation of this extended function prolog and, again, what is its purpose?
Thank you.
1.) Am I right that some extra gcc flags are needed to additionally push pc and sp in the function prolog? If yes, what are they?
There are many gcc options that affect stack frames (-march, -mtune, etc. may affect the instructions used, for instance). In your case, it was -mapcs-frame. Also, -fomit-frame-pointer will remove frames from leaf functions. Several static functions may be merged together into a single generated function, further reducing the number of frames. The APCS can cause slightly slower code but is needed for stack traces.
2.) What is this used for? Basically, as I understand it, I can unwind the stack with FP and LR only, so why do I need these additional values?
All registers that are not parameters (r0-r3) need to be saved, as they need to be restored when returning to the caller. The compiler will allocate additional locals on the stack, so sp will almost always change when fp changes. For why the pc is stored, see below.
3.) If this has nothing to do with compilation flags, how can I force generation of this extended function prolog and, again, what is its purpose?
It is compiler flags, as you had guessed.
; Prologue - setup
mov   ip, sp                 ; get a copy of sp.
stmdb sp!, {fp, ip, lr, pc}  ; Save the frame on the stack. See Addendum
sub   fp, ip, #4             ; Set the new frame pointer.
...
; Epilogue - return
ldm   sp, {fp, sp, lr}       ; restore stack, frame pointer and old link.
...                          ; maybe more stuff here.
bx    lr                     ; return.
A typical save is stmdb sp!, {fp, ip, lr, pc} (decrement-before, since the stack grows downwards) and a typical restore is ldm sp, {fp, sp, lr}. This is correct if you examine the ABI/APCS documents. Note that there is no '!' to try and fix up the stack; sp is loaded explicitly from the stored ip value.
Also, the saved pc is not used in the epilogue. It is just discarded data on the stack. So why do this? Exception handlers (interrupts, signals or C++ exceptions) and other stack trace mechanisms want to know who saved a frame. An ARM function always has only one prologue (one point of entry). However, there are multiple exits. In some cases, a return like return function(); may actually turn into a b function in the "maybe more stuff here" part. This is known as a tail call. Also, when a leaf function is called in the middle of a routine and an exception occurs, the handler will see a PC in the range of leaf, but the leaf may have no call frame. By saving the pc, the call frame can be examined when an exception occurs in leaf to know who really saved the stack. Tables of pc versus destructor, etc. may be stored to allow objects to be freed or to figure out how to call a signal handler. The extra pc is just plain nice when tracing a stack, and the operation is almost free due to pipelining.
See also: ARM Link and frame register question for how the compiler uses these registers.
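To show how a stack tracer can use these frames, here is a hedged C sketch that walks APCS frame records in a simulated memory array. It assumes the layout produced by the prologue above (fp points at the saved pc, with lr, the saved sp and the caller's fp in the three words below it) and that the outermost frame stores 0 as its caller's fp; both the helper names and the termination convention are assumptions for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Simulated memory: word i lives at byte address 4*i (illustrative only). */
static uint32_t word_at(const uint32_t *mem, uint32_t addr)
{
    return mem[addr / 4];
}

/* Frame record as seen from fp:
     [fp]       saved pc (identifies the function that built the frame)
     [fp - 4]   saved lr (return address into the caller)
     [fp - 8]   saved sp (copied via ip)
     [fp - 12]  caller's fp (next record in the chain)
   Stores up to max return addresses in out and returns how many
   frames were visited. Stops when the chain reaches fp == 0. */
size_t walk_frames(const uint32_t *mem, uint32_t fp,
                   uint32_t *out, size_t max)
{
    size_t n = 0;
    while (fp != 0 && n < max) {
        out[n++] = word_at(mem, fp - 4);   /* saved lr */
        fp = word_at(mem, fp - 12);        /* follow the chain */
    }
    return n;
}
```

The walk itself only needs fp and lr, which matches the observation in the question; the saved pc at [fp] is the extra piece that tells a tracer which function created each record.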

ARM Program Counter distinguishing feature

How does the R15 of ARM differ from the general PC of a CPU?
Both of them are program counters only. What is the difference?
ARM's PC is much closer to a regular register (with some restrictions) than x86's IP is.
Taking an Intel x86 based CPU as the "general" case: on x86 you can't manipulate the PC (instruction pointer) directly; it is updated implicitly by the control flow instructions provided.
In ARM's case, the Program Counter (PC) has historically been mapped as the register at index 15 (the 16th register) and can be manipulated directly with arithmetic instructions. For example, you can add 16 to PC, which alters the flow of the instruction stream much like a forward jump instruction. (In ARM state the value read from PC is already 8 bytes past the current instruction, so the 16 is added to that.)
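One detail worth spelling out when doing arithmetic on PC: in ARM state, an instruction that reads PC sees the address of the current instruction plus 8. The C sketch below works out where add pc, pc, #imm actually lands; the function names are invented for this illustration, and Thumb state (where the offset is 4) is not modelled.

```c
#include <stdint.h>

/* Value an ARM-state instruction at insn_addr sees when it reads pc. */
uint32_t arm_pc_read(uint32_t insn_addr)
{
    return insn_addr + 8u;
}

/* Destination of "add pc, pc, #imm" located at insn_addr (ARM state). */
uint32_t arm_add_pc_target(uint32_t insn_addr, uint32_t imm)
{
    return arm_pc_read(insn_addr) + imm;
}
```

So an add pc, pc, #16 placed at 0x8000 continues execution at 0x8018, not 0x8010.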
The ARM PC may be more of a general register than on most CPUs, but it is still very special. The traditional simple arithmetic instructions can use the PC as an input argument in many cases. Here it functions as a pointer or array base. It can also be used as the output of these instructions for control transfer. As a read-only value, it is useful for calculating return addresses in a position-independent way. It is also useful for constant table look-ups in nearby code. For these cases, the PC is very much like a regular register. This is probably more common on RISC CPUs than on a CISC ISA.
However, when the PC is used as a destination (an lvalue, or updated and written), the behavior is often non-standard. Some examples of special cases (for some ARM architecture versions) for R15/PC are:
adcs - copies SPSR to CPSR
adds - copies SPSR to CPSR
ands - copies SPSR to CPSR
bics - copies SPSR to CPSR
bx r15 - strongly discouraged or not supported.
clz r15 - not supported.
mcr pXX, xx, r15,... - unpredictable
etc.
In most cases, using the PC as the destination of an instruction has some special case. In particular, the S suffix (which normally sets the condition codes) can be used to return from an exception. This might be used as some sort of veneer when returning from an exception, or just as a direct return. In some cases, the meaning of the instruction might change completely. For instance, ldm sp, {r0-r15}^ and ldm sp, {r0-r14}^ use different register banks: the first will load the registers according to the mode in the SPSR, whereas the second will load the user mode registers.
For load/store, atomics, mode manipulation, co-processor and complex arithmetic (64 bit multiplies, etc) instructions, the PC is often unsupported or has a different meaning; the different meaning is often a mechanism for handling exceptions for system level code.
