Cannot switch to another exception level on an ARMv8 core

I'm trying to write a bare-metal application for ARMv8.
When the core powers on it is in EL3, Secure world, and I want to switch it to EL2, Normal (Non-secure) world.
I studied some examples and tried to reproduce them, but it doesn't work for me. I seem to be doing fairly simple things, but the core does not end up in the state I expect, and I don't understand what I'm missing.
mrs x0, CurrentEL /* CurrentEL Register. bit 2, 3. Others reserved */
and x0, x0, #12 /* Clear reserved bits */
/* Running at EL3? */
cmp x0, #12 /* EL3 value is 0b1100 */
bne cpu_not_in_el3
/* S-EL3 -> NS-EL2 */
// init stack
LDR x1, =__stack_top
MOV sp, x1
MSR SP_EL2, x1
MSR SP_EL1, x1
bl el3_init
bl bare_boot_code
bl el3_bare_boot_code
// Disable trapping of CPTR_EL3 accesses or use of Adv.SIMD/FPU
// -------------------------------------------------------------
MSR CPTR_EL3, xzr
//MSR TCR_EL3, xzr
MSR TCR_EL2, xzr
MSR TCR_EL1, xzr
// Configure SCR_EL3
// ------------------
MOV x0, #1 // NS=1
ORR x0, x0, #(1 << 1) // IRQ=1 IRQs routed to EL3
ORR x0, x0, #(1 << 2) // FIQ=1 FIQs routed to EL3
ORR x0, x0, #(1 << 3) // EA=1 SError routed to EL3
ORR x0, x0, #(1 << 8) // HCE=1 HVC instructions are enabled
ORR x0, x0, #(1 << 10) // RW=1 Next EL down uses AArch64
ORR x0, x0, #(1 << 11) // ST=1 Secure EL1 can access CNTPS_TVAL_EL1, CNTPS_CTL_EL1 & CNTPS_CVAL_EL1
// SIF=0 Secure state instruction fetches from Non-secure memory are permitted
// SMD=0 SMC instructions are enabled
// TWI=0 EL2, EL1 and EL0 execution of WFI instructions is not trapped to EL3
// TWE=0 EL2, EL1 and EL0 execution of WFE instructions is not trapped to EL3
MSR SCR_EL3, x0
// Install dummy vector table
// ---------------------------
LDR x0, =vector_table
MSR VBAR_EL3, x0
MSR VBAR_EL2, x0
MSR VBAR_EL1, x0
isb
// Set SCTLRs for EL1/2 to safe values
// ------------------------------------
MSR HCR_EL2, xzr
LDR x1, =0x30C50838
MSR SCTLR_EL2, x1
MSR SCTLR_EL1, x1
// Enter EL2
// ----------
ADR x0, cpu_in_el2
MSR ELR_EL3, x0
LDR x0, =0b1001 /* 0b1001 = EL2h*/
MSR spsr_el3, x0
//bl before_eret
isb
ERET
After running this reset code, I expect the core to switch to EL2 in the Normal world, but it actually takes an exception instead. I dumped several registers and see the following:
SPSR_EL3 9
ESR_EL3 82000010
EN_El3 a8
SPSR_EL2 72013885
ESR_EL2 5006008c
ELR_EL2 0
SPSR_EL1 e0091202
ESR_EL1 29e4a2f0
ELR_EL1 0
The description of the error code in ESR_EL3 says:
Used for MMU faults generated by instruction accesses and synchronous External
aborts, including synchronous parity or ECC errors. Not used for debug-related
exceptions.
But I don't turn on the MMU, so I don't understand why the error occurs or how to fix it.
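As a rough aid for reading the register dump above, the sketch below splits an ESR_ELx value into the EC, IL and ISS fields defined by the Arm architecture. The function and struct names are invented for illustration. Applied to the ESR_EL3 value 0x82000010 it yields EC = 0x20 (Instruction Abort taken from a lower Exception level) and a fault status code of 0x10 (synchronous External abort, not on a translation table walk). An External abort of this kind does not require the MMU to be enabled. One possible reading, which the question does not confirm, is that the ERET succeeded and the first instruction fetch at EL2 in Non-secure state was rejected by the memory system.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an AArch64 ESR_ELx value. */
struct esr_fields {
    unsigned ec;   /* bits [31:26]: exception class */
    unsigned il;   /* bit  [25]:    instruction length flag */
    uint32_t iss;  /* bits [24:0]:  instruction specific syndrome */
};

/* Hypothetical helper: split a raw ESR value into its three fields. */
struct esr_fields esr_decode(uint32_t esr)
{
    struct esr_fields f;
    f.ec  = (esr >> 26) & 0x3Fu;
    f.il  = (esr >> 25) & 0x1u;
    f.iss = esr & 0x01FFFFFFu;
    return f;
}

/* For instruction and data aborts, ISS[5:0] is the fault status code. */
unsigned esr_fault_status(uint32_t esr)
{
    return esr & 0x3Fu;
}
```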

Related

Why is the IRQ latency in my ARM interrupt handler always the same, regardless of the instruction that is being interrupted?

I am trying to apply a type of side-channel attack I read about in this paper, which tries to infer execution state from differences in IRQ latency on an MCU with a Cortex-M4 processor. The attack carefully interrupts instructions that occur right after a branch and measures the interrupt latency. When different branches have instructions of different lengths, you can look at the interrupt latency to determine in which of these branches the interrupt occurred and leak some of the program state.
I wrote a simple function that I want to attack in the way described above. I am using the SysTick timer to generate the interrupt at the correct point in time. To get an initial good value for the interrupt timer I used GDB to stop the program at the target line to see the SysTick value at that time.
I implemented a very simple interrupt handler that:
loads the SysTick timer value from memory,
subtracts this value from the reload value to get the time elapsed since the interrupt (i.e. the IRQ latency), and
clears the interrupt.
void __attribute__((interrupt("IRQ"))) SysTick_Handler(void)
{
/* USER CODE BEGIN SysTick_IRQn 0 */
SysTick->CTRL &= 0xfffffffe; // disable SysTick (~SysTick_CTRL_ENABLE_Msk)
*timer_value = SysTick->VAL; // capture counter value (as quickly as possible)
*timer_value = SysTick->LOAD - *timer_value; // subtract it from reload value to get IRQ latency
SysTick->VAL = 0; // reset initial value
}
However I find that I always get the same IRQ latency, regardless of the instruction that was interrupted. I expect the interrupt latency to be longer when a longer instruction is interrupted.
This is the function I wrote to test the attack
extern uint32_t *timer_value;
int sample_function(int *a, int *b){
/*
* function description -- store the smallest of the two value in a, if MEASURE_CYCLESS defined return the number
* of clock cycles that have been elapsed since the timer has been started
* r0 contains pointer to a
* r1 contains pointer to b
*/
__asm volatile(
/* push working registers */
"PUSH {r4-r8} \n"
/* move counter into r8 */
"MOV r8, #10 \n"
/* begin loop */
"begin_loop: \n"
/* decrement counter variable*/
"SUB r8, r8, #1 \n"
/* if counter variable not equal to 0, jump back to start of loop */
"CMP r8, #0 \n"
/* if r8 not equal to 0, jump back to begin of loop*/
"BNE begin_loop \n"
/* load a into r2 */
"LDR r2, [r0] \n"
/* load b into r3 */
"LDR r3, [r1] \n"
/* store a-b in r4, setting status flags -- if result is 0 Z flag is set */
"SUBS r4, r2, r3 \n"
/* if a-b positive, a is larger otherwise, b is larger (assuming a not equal to b) */
"BPL a_larger \n"
#ifdef SPY
/* load address of (*timer_value) into r4 -- use of LDR pseudo-instruction places constant in a literal pool*/
"LDR r4, =timer_value \n"
/* Load (*timer_value) into r4 */
"LDR r4, [r4] \n"
/* load address of Systick VAL into r5 */
"LDR r5, =0xe000e018 \n"
/* Load value at address stored in R5 (= Systick Val) */
"LDR r5, [r5] \n"
/* Store SysTick VAL at the address held in r4 (the pointer value read from timer_value) */
"STR r5, [r4] \n"
#endif
"NOP \n"
/*instruction that gets interrupted -- swap value*/
"STR r2, [r1] \n"
/* store b (r3) at the address of a, completing the swap */
"STR r3, [r0] \n"
"B end \n"
"a_larger: \n"
"MOV r0, #0 \n" // instruction that gets interrupted
"end: POP {r4-r8}"
); // pop working registers
}
Note that the section of code in the #ifdef SPY block is used to automatically determine a good timer reload value (instead of using GDB), but I'm currently not using the value I obtained this way.
I also have an empty loop in there to delay the instruction that is meant to be interrupted a bit.
The instruction that gets interrupted is the one right after the #ifdef block. When I remove the NOP instruction I still get the same interrupt latency. When I increase or decrease the timer value (to interrupt some cycles earlier or later) I also still get the same IRQ latency.
Am I missing something here? Is there some behavior I do not know about?
Also, is it important to use the attribute __attribute__((interrupt("IRQ"))) for an interrupt handler?

This is what I was thinking and commenting on.
bootstrap
.thumb_func
reset:
bl notmain
ldr r4,=0xE000E018
ldr r0,=0xE000E010
mov r1,#7
str r1,[r0]
b hang
.thumb_func
hang:
nop
nop
nop
nop
nop
nop
nop
b hang
setup uart and systick
void notmain ( void )
{
uart_init();
hexstring(0x12345678);
PUT32(STK_CSR,4);
PUT32(STK_RVR,0xF40000);
PUT32(STK_CVR,0x00000000);
//PUT32(STK_CSR,7);
}
event handler
.thumb_func
.globl systick_handler
systick_handler:
ldr r0,[r4]
ldr r5,[sp,#0x18]
push {r0,lr}
bl hexstrings
mov r0,r5
bl hexstring
pop {r0,pc}
grab the timer and address of interrupted instruction and print them out.
00F3FFF4 08000054
00F3FFF4 08000056
00F3FFF4 08000058
00F3FFF4 0800005A
00F3FFF4 0800005C
00F3FFF4 0800005E
00F3FFF4 08000054
00F3FFF4 08000056
00F3FFF4 08000058
00F3FFF4 0800005A
00F3FFF4 08000050
08000050 <hang>:
8000050: bf00 nop
8000052: bf00 nop
8000054: bf00 nop
8000056: bf00 nop
8000058: bf00 nop
800005a: bf00 nop
800005c: bf00 nop
800005e: e7f7 b.n 8000050 <hang>
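The handler above reads the interrupted instruction's address with ldr r5,[sp,#0x18]. That offset comes from the frame a Cortex-M core pushes on exception entry. A minimal C sketch of the basic frame (no floating-point context), using hypothetical names, might look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Basic frame pushed by the hardware on exception entry, lowest address first. */
struct cm_exception_frame {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};

/* Return the stacked PC given the stack pointer value at handler entry,
 * i.e. before the handler itself has pushed anything. */
uint32_t stacked_pc(const uint32_t *sp_at_entry)
{
    assert(sp_at_entry != NULL);
    return sp_at_entry[offsetof(struct cm_exception_frame, pc) / sizeof(uint32_t)];
}
```

The pc member sits at byte offset 0x18, which is why the handler has to read it before its own push {r0,lr} moves the stack pointer.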
From ARM's documentation.
Interrupt Latency
There is a maximum of a twelve cycle latency from asserting the interrupt to execution of the first instruction of the ISR when the memory being accessed has no wait states being applied. When the FPU option is implemented and a floating point context is active and the lazy stacking is not enabled, this maximum latency is increased to twenty nine cycles. The first instructions to be executed are fetched in parallel to the stack push.
That last sentence is perhaps what we see happening here. You can try various instructions, but this architecture can interrupt and later restart long-duration instructions (loads, push/pop, multiply, and such). I think that to see much of a latency difference you may need to create bus or shared-resource contention rather than vary the instructions.
Also systick is an exception not an interrupt, so there may be some differences with respect to latency.
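The counter value 00F3FFF4 printed above can be turned into a cycle count by subtracting it from the reload value 0xF40000 written to STK_RVR. A small sketch of that arithmetic follows; the helper name is hypothetical, and it ignores any off-by-one from the reload cycle itself as well as the time the load instruction takes:

```c
#include <assert.h>
#include <stdint.h>

/* SysTick counts down from the reload value, so the ticks elapsed since the
 * last reload are reload - current (valid while current <= reload). */
uint32_t systick_elapsed(uint32_t reload, uint32_t current)
{
    assert(current <= reload);
    return reload - current;
}
```

With those two values the result is 12 for every sample in the listing, whichever nop or branch was interrupted, which lines up with the twelve-cycle figure quoted from the documentation.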

ARM Cortex A9 Startup Code and Interrupt Setup

I am trying to program a Cortex-A9 in a bare-metal fashion. I use the 'hello world' code from:
https://github.com/tukl-msd/gem5.bare-metal which works. However, I'm not able to get interrupts working. When I raise an interrupt, e.g. #47, my software doesn't jump to the ISR function. What am I missing? Do I have to do some more initialization?
Startup Code:
.section INTERRUPT_VECTOR, "x"
.global _Reset
_Reset:
B Reset_Handler /* Reset */
B . /* Undefined */
B . /* SWI */
B . /* Prefetch Abort */
B . /* Data Abort */
B . /* reserved */
B irq_handler /* IRQ */
B irq_handler /* FIQ */
// Some Definitions for GIC:
.equ GIC_DIST, 0x10041000
.equ GIC_CPU , 0x10040000
// GIC Definitions for CPU interface
.equ ICCICR , 0x00
.equ ICCPMR , 0x04
.equ ICCEOIR , 0x10
.equ ICCIAR , 0x0C
// GIC Definitions for Distributor interface
.equ ICDDCR , 0x00
.equ ICDISER , 0x100
.equ ICDIPTR , 0x800
// Other Definitions
.equ USR_MODE , 0x10
GIC_dist_base : .word 0 // address of GIC distributor
GIC_cpu_base : .word 0 // address of GIC CPU interface
Reset_Handler:
LDR sp, =stack_top
// Enable Interrupts on CPU Side:
MRS r1, cpsr // get the cpsr.
BIC r1, r1, #0x80 // enable IRQ (ORR to disable).
MSR cpsr_c, r1 // copy it back, control field bit update.
// Configure GIC:
BL IC_init
// Branch to C code
BL main
B .
// Initialize GIC
.global GIC_init
IC_init:
stmfd sp!,{lr}
// Read GIC base from Configuration Base Address Register
// And use it to initialize GIC_dist_base and GIC_cpu_base
//mrc p15, 4, r0, c15, c0, 0
//add r2, r0, #GIC_DIST // Calculate address
ldr r2, =GIC_DIST
ldr r1, =GIC_dist_base
str r2,[r1] // Store address of GIC distributor
//add r2, r0, #GIC_CPU // Calculate address
ldr r2, =GIC_CPU
ldr r1, =GIC_cpu_base
str r2,[r1] // Store address of GIC CPU interface
// Set the priority mask register (ICCPMR) to enable interrupts of all priorities
ldr r1,=0xFFFF
ldr r2,=GIC_dist_base
str r1,[r2,#ICCPMR]
// Set the enable bit in the CPU interface control register
// ICCICR, allowing CPU(s) to receive interrupts
mov r1,#1
str r1,[r2,#ICCICR]
// Set the enable bit in the distributor control register
// ICDDCR, allowing interrupts to be generated
ldr r2,=GIC_dist_base
ldr r2,[r2] // Base address of distributor
mov r1, #1
str r1,[r2,#ICDDCR]
ldmfd sp!,{pc}
//config_interrupt (int ID , int CPU);
.global config_interrupt
config_interrupt:
stmfd sp!,{r4-r5, lr}
// Configure the distributor interrupt set-enable registers (ICDISERn)
// enable the interrupt
// reg_offset = (N/32)*4 (shift and clear some bits)
// value = 1 << (N mod 32);
ldr r2,=GIC_dist_base
ldr r2,[r2] // Read GIC distributor base address
add r2,r2,#ICDISER // r2 <- base address of ICDISER regs
lsr r4,r0,#3 // calculate reg_offset
bic r4,r4,#3 // r4 <- reg_offset
add r4,r2,r4 // r4 <- address of ICDISERn
// Create a bit mask
and r2,r0,#0x1F // r2 <- N mod 32
mov r5,#1 // need to set one bit
lsl r2,r5,r2 // r2 <- value
// Using address in r4 and value in r2 to set the correct bit in the GIC register
ldr r3,[r4] // read ICDISERn
orr r3, r3, r2 // set the enable bit
str r3,[r4] // store the new register value
// Configure the distributor interrupt processor targets register (ICDIPTRn)
// select target CPU(s)
// reg_offset = (N/4)*4 (clear 2 bottom bits)
// index = N mod 4;
ldr r2,=GIC_dist_base
ldr r2,[r2] // Read GIC distributor base address
add r2,r2, #ICDIPTR // base address of ICDIPTR regs
bic r4,r0,#3 // r4 <- reg_offset
add r4,r2,r4 // r4 <- address of ICDIPTRn
// Get the address of the byte within ICDIPTRn
and r2,r0,#0x3 // r2 <- index
add r4,r2,r4 // r4 <- byte address to be set
strb r1,[r4]
ldmfd sp!, {r4-r5, lr}
// int get_interrupt_number();
// Get the interrupt ID for the current interrupt. This should be called at the
// beginning of ISR. It also changes the state of the interrupt from pending to
// active, which helps to prevent other CPUs from trying to handle it.
.global get_interrupt_number
get_interrupt_number:
// Read the ICCIAR from the CPU Interface
ldr r0,=GIC_cpu_base
ldr r0,[r0]
ldr r0,[r0,#ICCIAR]
mov pc,lr
// void end_of_interrupt (int ID);
// Notify the GIC that the interrupt has been processed. The state goes from
// active to inactive, or it goes from active and pending to pending.
.global end_of_interrupt
end_of_interrupt:
ldr r1,=GIC_cpu_base
ldr r1,[r1]
str r0,[r1,#ICCEOIR]
mov pc, lr
// IRQ Handler that calls the ISR function in C
.global irq_handler
irq_handler:
stmfd sp!,{r0-r7, lr}
// Call Interrupt Service Routine in C:
bl ISR
ldmfd sp!, {r0-r7, lr}
// Must subtract 4 from lr
subs pc, lr, #4
Linker Script:
ENTRY(_Reset)
SECTIONS
{
. = 0x0;
.text : {
boot.o (INTERRUPT_VECTOR)
*(.text)
}
.data : { *(.data) }
.bss : { *(.bss COMMON) }
. = ALIGN(8);
. = . + 0x1000; /* 4kB of stack memory */
stack_top = .;
PROVIDE (end = .) ;
}
Main C Program:
#include <stdio.h>
extern "C" void config_interrupt(int, int);
volatile unsigned int * const SHADOW = (unsigned int *)0x1000a000;
void sendShadow(unsigned int s)
{
*SHADOW = s;
}
int main(void)
{
config_interrupt(47,0);
unsigned int r = 1337;
while (1)
{
printf("Hello World! %d\n", r);
sendShadow(1);
}
}
void ISR(void)
{
printf("ISR");
}
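To make the register arithmetic in config_interrupt easier to check, here is a small C sketch of the same offset calculations. The function names are hypothetical; the offsets 0x100 and 0x800 are the ICDISER and ICDIPTR values from the startup code. The last helper reflects the GIC architecture's definition of each ICDIPTR byte as a bitmask of CPU interfaces, so CPU 0 corresponds to the value 1. Whether that is what prevents the interrupt from being delivered here is an assumption, not something the question confirms.

```c
#include <assert.h>
#include <stdint.h>

#define ICDISER 0x100u  /* set-enable registers, one bit per interrupt ID    */
#define ICDIPTR 0x800u  /* processor targets registers, one byte per ID      */

/* Offset of the ICDISERn word that holds interrupt `id`. */
uint32_t icdiser_offset(uint32_t id)
{
    assert(id < 1020u);  /* IDs 1020-1023 are reserved for special purposes */
    return ICDISER + (id / 32u) * 4u;
}

/* Bit mask of interrupt `id` within that word. */
uint32_t icdiser_mask(uint32_t id)
{
    return UINT32_C(1) << (id % 32u);
}

/* Offset of the ICDIPTR byte for interrupt `id`. */
uint32_t icdiptr_byte_offset(uint32_t id)
{
    assert(id < 1020u);
    return ICDIPTR + id;
}

/* Target byte value that selects a single CPU interface (bit n = CPU n). */
uint8_t icdiptr_target(unsigned cpu)
{
    assert(cpu < 8u);
    return (uint8_t)(1u << cpu);
}
```

For interrupt 47 this gives the word at distributor offset 0x104, bit mask 0x8000, and the target byte at offset 0x82F.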

PLL register configuration generates an interrupt (ARM)

I am working with an ARM device produced by Infineon. I have run into a problem when configuring the PLL that I can't find a solution to. When the register holding the N, P and K values for normal PLL mode is configured, the code raises an interrupt and does not resume afterwards. Here is the code as shown in the Disassembler (Eclipse):
1333 SCU_PLL->PLLCON1 = (uint32_t)((SCU_PLL->PLLCON1 & ~(SCU_PLL_PLLCON1_NDIV_Msk | SCU_PLL_PLLCON1_K2DIV_Msk |
08000cc8: ldr r1, [pc, #252] ; (0x8000dc8 <XMC_SCU_CLOCK_StartSystemPll+400>)
08000cca: ldr r3, [pc, #252] ; (0x8000dc8 <XMC_SCU_CLOCK_StartSystemPll+400>)
08000ccc: ldr r2, [r3, #8]
08000cce: ldr r3, [pc, #252] ; (0x8000dcc <XMC_SCU_CLOCK_StartSystemPll+404>)
08000cd0: ands r3, r2
1334 SCU_PLL_PLLCON1_PDIV_Msk)) | ((ndiv - 1UL) << SCU_PLL_PLLCON1_NDIV_Pos) |
08000cd2: ldr r2, [r7, #4]
08000cd4: subs r2, #1
08000cd6: lsls r2, r2, #8
08000cd8: orrs r2, r3
1335 ((kdiv_temp - 1UL) << SCU_PLL_PLLCON1_K2DIV_Pos) |
08000cda: ldr r3, [r7, #16]
08000cdc: subs r3, #1
08000cde: lsls r3, r3, #16
1334 SCU_PLL_PLLCON1_PDIV_Msk)) | ((ndiv - 1UL) << SCU_PLL_PLLCON1_NDIV_Pos) |
08000ce0: orrs r2, r3
1336 ((pdiv - 1UL)<< SCU_PLL_PLLCON1_PDIV_Pos));
It seems like the code "breaks" on the following instruction:
08000cce: ldr r3, [pc, #252] ; (0x8000dcc <XMC_SCU_CLOCK_StartSystemPll+404>)
In other words, if I use the 'step into' function, it jumps to the following interrupt right before moving onto the 'ldr' instruction shown above. The following are the configurations of N, P and K values that I have used.
.syspll_config.n_div = 80U,
.syspll_config.p_div = 2U,
.syspll_config.k_div = 4U,
I've been told that the name of the handler doesn't mean much, but here is what Disassembler settles on after the program fails to execute line 08000cce.
08000298: b.n 0x8000298 <VADC0_G3_3_IRQHandler>
Also, here is what is shown in the console.
Starting target CPU...
Debugger requested to halt target...
...Target halted (PC = 0x08000298)
/.../
WARNING: Failed to read memory # address 0xFFFFFFE8
WARNING: Failed to read memory # address 0xFFFFFFE8
EDIT: Perhaps for the sake of completeness I would include a code snippet from system.c file that initializes PLL module with its default values, which works fine. It is very similar to the code shown in the first code pane of this question, perhaps with the exception of resetting the affected register values before writing new P, N and K values. I have divided the initialization code into two parts - resetting and setting the values; it appears that the code "breaks" during the reset phase.
SCU_PLL->PLLCON1 = ((PLL_NDIV << SCU_PLL_PLLCON1_NDIV_Pos) |
(PLL_K2DIV_24MHZ << SCU_PLL_PLLCON1_K2DIV_Pos) |
(PLL_PDIV << SCU_PLL_PLLCON1_PDIV_Pos));
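The read-modify-write of PLLCON1 in the question can be sketched in portable C to check the value it should produce. The macros below are hypothetical stand-ins for the SCU_PLL_PLLCON1_* definitions: the NDIV and K2DIV positions follow the shifts visible in the disassembly (lsls #8 and lsls #16), while the PDIV position and all field widths are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the vendor macros (see the assumptions above). */
#define NDIV_POS   8u
#define K2DIV_POS 16u
#define PDIV_POS  24u                    /* assumed, not shown in the listing */
#define NDIV_MSK  (0x7Fu << NDIV_POS)    /* assumed 7-bit field */
#define K2DIV_MSK (0x7Fu << K2DIV_POS)   /* assumed 7-bit field */
#define PDIV_MSK  (0x0Fu << PDIV_POS)    /* assumed 4-bit field */

/* Clear the three divider fields of `old`, then insert (value - 1) in each. */
uint32_t pllcon1_update(uint32_t old, uint32_t ndiv, uint32_t k2div, uint32_t pdiv)
{
    assert(ndiv >= 1u && ndiv <= 128u);
    assert(k2div >= 1u && k2div <= 128u);
    assert(pdiv >= 1u && pdiv <= 16u);
    uint32_t v = old & ~(NDIV_MSK | K2DIV_MSK | PDIV_MSK);
    v |= (ndiv - 1u) << NDIV_POS;
    v |= (k2div - 1u) << K2DIV_POS;
    v |= (pdiv - 1u) << PDIV_POS;
    return v;
}
```

Under those assumptions, n_div = 80, k_div = 4 and p_div = 2 applied to a cleared register give 0x01034F00, and bits outside the three fields are left unchanged.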

The problem ended up being caused by a trap request (promoted to NMI) that was generated upon disconnecting the VCO (voltage-controlled oscillator) from the external oscillator OSC. Disconnecting the two hardware components is important in configuring the PLL registers, however, if the trap request upon loss-of-lock is not cleared and disabled, the following command will generate an interrupt:
/* disconnect Oscillator from PLL */
SCU_PLL->PLLCON0 |= (uint32_t)SCU_PLL_PLLCON0_FINDIS_Msk;
The command precedes the following line, which is what I thought was originally causing the problem, thus posting it in the question:
SCU_PLL->PLLCON1 = (uint32_t)((SCU_PLL->PLLCON1 & ~(SCU_PLL_PLLCON1_NDIV_Msk | SCU_PLL_PLLCON1_K2DIV_Msk | SCU_PLL_PLLCON1_PDIV_Msk))
Note that trap requests can help troubleshoot problems with the PLL module, so they should be enabled again afterwards. However, a trap request is still generated regardless of whether the uC will act on it (as decided by the enable/disable bit). So, to restore trap functionality, one needs to clear the pending trap and then re-enable it as follows:
SCU_TRAP->TRAPCLR |= SCU_TRAP_TRAPCLR_SVCOLCKT_Msk;
SCU_TRAP->TRAPDIS &= ~SCU_TRAP_TRAPDIS_SVCOLCKT_Msk;
Along the way I have discovered this interesting article that may help anyone working on ARM uCs and facing unexpected interrupts: Debugging and Diagnosing Hard Fault & Other Exceptions.

Why does my SWI instruction hang? (BeagleBone Black, ARM Cortex-A8 cpu)

I'm starting to write a toy OS for the BeagleBone Black, which uses an ARM Cortex-A8-based TI Sitara AM3359 SoC and the U-Boot bootloader. I've got a simple standalone hello world app writing to UART0 that I can load through U-Boot so far, and now I'm trying to move on to interrupt handlers, but I can't get SWI to do anything but hang the device.
According to the AM335x TRM (starting on page 4099, if you're interested), the interrupt vector table is mapped in ROM at 0x20000. The ROM SWI handler branches to RAM at 0x4030ce08, which branches to the address stored at 0x4030ce28. (Initially, this is a unique dead loop at 0x20084.)
My code sets up all the ARM processor modes' SP to their own areas at the top of RAM, and enables interrupts in the CPSR, then executes an SWI instruction, which always hangs. (Perhaps jumping to some dead-loop instruction?) I've looked at a bunch of samples, and read whatever documentation I could find, and I don't see what I'm missing.
Currently my only interaction with the board is over serial connection on UART0 with my linux box. U-Boot initializes UART0, and allows loading of the binary over the serial connection.
Here's the relevant assembly:
.arm
.section ".text.boot"
.equ usr_mode, 0x10
.equ fiq_mode, 0x11
.equ irq_mode, 0x12
.equ svc_mode, 0x13
.equ abt_mode, 0x17
.equ und_mode, 0x1b
.equ sys_mode, 0x1f
.equ swi_vector, 0x4030ce28
.equ rom_swi_b_addr, 0x20008
.equ rom_swi_addr, 0x20028
.equ ram_swi_b_addr, 0x4030CE08
.equ ram_swi_addr, 0x4030CE28
.macro setup_mode mode, stackpointer
mrs r0, cpsr
mov r1, r0
and r1, r1, #0x1f
bic r0, r0, #0x1f
orr r0, r0, #\mode
msr cpsr_csfx, r0
ldr sp, =\stackpointer
bic r0, r0, #0x1f
orr r0, r0, r1
msr cpsr_csfx, r0
.endm
.macro disable_interrupts
mrs r0, cpsr
orr r0, r0, #0x80
msr cpsr_c, r0
.endm
.macro enable_interrupts
mrs r0, cpsr
bic r0, r0, #0x80
msr cpsr_c, r0
.endm
.global _start
_start:
// Initial SP
ldr r3, =_C_STACK_TOP
mov sp, r3
// Set up all the modes' stacks
setup_mode fiq_mode, _FIQ_STACK_TOP
setup_mode irq_mode, _IRQ_STACK_TOP
setup_mode svc_mode, _SVC_STACK_TOP
setup_mode abt_mode, _ABT_STACK_TOP
setup_mode und_mode, _UND_STACK_TOP
setup_mode sys_mode, _C_STACK_TOP
// Clear out BSS
ldr r0, =_bss_start
ldr r1, =_bss_end
mov r5, #0
mov r6, #0
mov r7, #0
mov r8, #0
b _clear_bss_check$
_clear_bss$:
stmia r0!, {r5-r8}
_clear_bss_check$:
cmp r0, r1
blo _clear_bss$
// Load our SWI handler's address into
// the vector table
ldr r0, =_swi_handler
ldr r1, =swi_vector
str r0, [r1]
// Debug-print out these SWI addresses
ldr r0, =rom_swi_b_addr
bl print_mem
ldr r0, =rom_swi_addr
bl print_mem
ldr r0, =ram_swi_b_addr
bl print_mem
ldr r0, =ram_swi_addr
bl print_mem
enable_interrupts
swi_call$:
swi #0xCC
bl kernel_main
b _reset
.global _swi_handler
_swi_handler:
// Get the SWI parameter into r0
ldr r0, [lr, #-4]
bic r0, r0, #0xff000000
// Save lr onto the stack
stmfd sp!, {lr}
bl print_uint32
ldmfd sp!, {pc}
Those debugging prints produce the expected values:
00020008: e59ff018
00020028: 4030ce08
4030ce08: e59ff018
4030ce28: 80200164
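(According to objdump, 0x80200164 is indeed _swi_handler. 0xe59ff018 is the instruction "ldr pc, [pc, #0x18]"; because the PC reads as the instruction's address plus 8, it loads from 0x20 bytes past the instruction.)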
What am I missing? It seems like this should work.
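The chain of indirections in the debug output can be checked by decoding the vector instruction by hand. The sketch below (the helper name is hypothetical) computes which address an A32 ldr pc, [pc, #imm] reads, using the fact that the PC reads as the instruction's address plus 8 in ARM state. It only handles the positive-offset encoding 0xE59FFxxx seen here.

```c
#include <assert.h>
#include <stdint.h>

/* Address read by an A32 "ldr pc, [pc, #imm12]" located at instr_addr.
 * Only the add-offset, unconditional form 0xE59FFxxx is accepted. */
uint32_t ldr_pc_literal_address(uint32_t instr_addr, uint32_t instr)
{
    assert((instr & 0xFFFFF000u) == 0xE59FF000u);
    return instr_addr + 8u + (instr & 0xFFFu);  /* PC reads as address + 8 */
}
```

For the instruction at 0x20008 this gives 0x20028, and for the one at 0x4030ce08 it gives 0x4030ce28, matching the addresses printed above.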

The firmware on the board changes the ARM execution mode and the locations of the vector tables associated with the various modes. In my own case (a bare-metal code snippet executed at Privilege Level 1 and launched by the BBB's U-Boot) the active vector table is at address 0x9f74b000.
In general, you might use something like the following function to locate the active vector table:
static inline unsigned int *get_vectors_address(void)
{
unsigned int v;
/* read SCTLR */
__asm__ __volatile__("mrc p15, 0, %0, c1, c0, 0\n"
: "=r" (v) : : );
if (v & (1<<13))
return (unsigned int *) 0xffff0000;
/* read VBAR */
__asm__ __volatile__("mrc p15, 0, %0, c12, c0, 0\n"
: "=r" (v) : : );
return (unsigned int *) v;
}

change
ldr r0, [lr, #-4]
bic r0, r0, #0xff000000
stmfd sp!, {lr}
bl print_uint32
ldmfd sp!, {pc}
to
stmfd sp!, {r0-r3, r12, lr}
ldr r0, [lr, #-4]
bic r0, r0, #0xff000000
bl print_uint32
ldmfd sp!, {r0-r3, r12, pc}^
PS: The original handler doesn't restore SPSR into the CPSR of the interrupted task on return, and it also clobbers registers that are not banked across the CPU mode switch.
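The two instructions that fetch the SWI parameter (ldr r0, [lr, #-4] followed by bic r0, r0, #0xff000000) load the SWI instruction word itself and keep its low 24 bits. The same step in C, as a sketch with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Extract the 24-bit immediate from an A32 SWI/SVC instruction word. */
uint32_t swi_number(uint32_t instr)
{
    assert((instr & 0x0F000000u) == 0x0F000000u);  /* bits [27:24] = 1111 */
    return instr & 0x00FFFFFFu;
}
```

The unconditional swi #0xCC from the question assembles to 0xEF0000CC, so the handler should see 0xCC.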

ARM Assembly for development board

I'm currently messing around with an LPC2378 which has an application board attached. On the board there are several components, two of which are a fan and a heater.
If bits 6 and 7 of port 4 are connected to the fan (motor controller), the following code will turn on the fan:
FanOn
STMFD r13!,{r0,r5,r14} ; Push r0, r5 and LR
LDR R5, =FIO4PIN ; Address of FIO4PIN
LDR r0, [r5] ; Read current Port4
ORR r0, r0, #0x80
STR r0, [r5] ; Output
LDMFD r13!,{r0,r5,r14} ; Pop r0, r5 and LR
mov pc, r14 ; Put link register back into PC
How can I rewrite this block of code to turn on a heater connected to bit 5 of port 4? (Setting the bit to 1 turns it on; setting it to 0 turns it off.)
Edit after answered question:
;==============================================================================
; Turn Heater On
;==============================================================================
heaterOn
STMFD r13!,{r0,r5,r14} ; Push r0, r5 and LR
LDR R5, =FIO4PIN ; Address of FIO4PIN
LDR r0, [r5] ; Read current Port4
ORR r0, r0, #0x20
STR r0, [r5] ; Output
LDMFD r13!,{r0,r5,r14} ; Pop r0, r5 and LR
mov pc, r14 ; Put link register back into PC
;==============================================================================
; Turn The Heater Off
;==============================================================================
heaterOff
STMFD r13!,{r0,r5,r14} ; Push r0, r5 and LR
LDR R5, =FIO4PIN ; Address of FIO4PIN
LDR r0, [r5] ; Read current Port4
BIC r0, r0, #0x20 ; Clear bit 5 only (AND with #0xDF would also clear bits 8-31)
STR r0, [r5] ; Output
LDMFD r13!,{r0,r5,r14} ; Pop r0, r5 and LR
mov pc, r14 ; Put link register back into PC

As best as I understand the code, the fan is connected only to bit 7 (if bits are numbered from 0).
The following line is responsible for turning the fan-bit on:
ORR r0, r0, #0x80
This sets to 1 every bit that is 1 in the mask. Since the mask is 0x80, which is 1000 0000 in binary, it only turns on bit 7.
Now, if you need to turn on the heater instead of the fan, and you have to set bit 5 instead of 7 (on the same port), you only need to change the mask in that line. The new mask should be 0010 0000 in binary, which is 0x20 in hex, so the new code should be:
ORR r0, r0, #0x20
Also, if you want to turn the heater off at some point later, you do it by clearing only bit 5, which means ANDing with a mask that has 1s everywhere except bit 5 (0xFFFFFFDF for a 32-bit register). ARM's BIC (bit clear) instruction does this directly: it ANDs with the complement of its operand, so you pass the bit you want cleared:
BIC r0, r0, #0x20
Now, I have not done any assembly in months, but if I am not very mistaken, the code snippet you gave is actually a subroutine. You would call it from your main code with something like a call (BL) to the FanOn address. And it seems to me the subroutine is nice in that it preserves all the registers it uses: it pushes them to the stack in the first line and restores them at the end.
So, to re-use the code, you could just write a new subroutine for turning the heater on, and one for turning each thing off if you want, and only change the label/subroutine name for each one, e.g. FanOff, HeaterOn...
Since all of them preserve all the registers, you can use them sequentially without worries.

The ORR instruction turns ON a bit, and the #0x80 constant determines the bit(s) (in this case, only bit 7 is turned on). To turn OFF the bit, AND with a mask that has that bit cleared and every other bit set (for a 32-bit register, 0xFFFFFF7F for bit 7; ANDing with just #0x7F would also clear bits 8 to 31), or use BIC with the same constant you would ORR. For bit 5, the constants are #0x20 to set it, and 0xFFFFFFDF with AND (or #0x20 with BIC) to clear it.
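The masking described in these answers can be written out in C to see the effect on a full 32-bit port value. This is only an illustrative sketch with hypothetical helper names; it does not touch FIO4PIN or any real hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Set one bit of a 32-bit port value (the effect of ORR with 1 << bit). */
uint32_t set_bit(uint32_t value, unsigned bit)
{
    assert(bit < 32u);
    return value | (UINT32_C(1) << bit);
}

/* Clear one bit and leave the other 31 untouched (the effect of BIC). */
uint32_t clear_bit(uint32_t value, unsigned bit)
{
    assert(bit < 32u);
    return value & ~(UINT32_C(1) << bit);
}
```

Here set_bit(v, 5) corresponds to the heater-on mask 0x20, and clear_bit(v, 5) keeps bits 8 to 31 intact, which a plain AND with 0xDF would not.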
