calculating the address of global offset table in arm literal pool

I am trying to understand the ARM assembly that GCC generates for the literal pool and the Global Offset Table (GOT).
I compile the following C code with GNU ARM GCC:
extern int i;
int foo(int j)
{
int t = i;
i = j;
return t;
}
GCC generates the following code:
foo:
ldr r3, .L2
ldr r2, .L2+4
.LPIC0:
add r3, pc
ldr r3, [r3, r2]
# sp needed for prologue
ldr r2, [r3]
str r0, [r3]
mov r0, r2
bx lr
.L3:
.align 2
.L2:
.word _GLOBAL_OFFSET_TABLE_-(.LPIC0+4)
.word i(GOT)
I want to handle the global offset table manually in ARM assembly, but I am having difficulty understanding the code above.
Can anyone please explain how the literal pool entries in the following lines are calculated?
.L2:
.word _GLOBAL_OFFSET_TABLE_-(.LPIC0+4)
.word i(GOT)

When compiled as PIC (position-independent code), references to global variables have to be relocated:
foo:
ldr r3, .L2
ldr r2, .L2+4
.LPIC0:
add r3, pc
ldr r3, [r3, r2]
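Notice add r3, pc. In Thumb state the pc read by this instruction is the address of the instruction plus 4, that is .LPIC0+4. The word at .L2 holds _GLOBAL_OFFSET_TABLE_-(.LPIC0+4), so the result of the add is _GLOBAL_OFFSET_TABLE_, the base address of the GOT.
The word at .L2+4 is i(GOT), the offset of the entry for variable i within the GOT.
The objdump output makes this more intuitive: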
00000450 <foo>:
450: 4b03 ldr r3, [pc, #12] ; (460 <foo+0x10>)
452: 4a04 ldr r2, [pc, #16] ; (464 <foo+0x14>)
454: 447b add r3, pc
456: 589b ldr r3, [r3, r2]
458: 681a ldr r2, [r3, #0]
45a: 6018 str r0, [r3, #0]
45c: 4610 mov r0, r2
45e: 4770 bx lr
460: 00008ba8 andeq r8, r0, r8, lsr #23
464: 0000001c andeq r0, r0, ip, lsl r0
468: f3af 8000 nop.w
46c: f3af 8000 nop.w
In the disassembly, .L2 and .L2+4 are replaced with concrete pc-relative offsets. The result of add r3, pc is 0x8ba8 + 0x458 = 0x9000. Then ldr r3, [r3, r2] loads from address 0x9000 + 0x1c = 0x901c. Look up these addresses in the section headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
...
[17] .got PROGBITS 00009000 001000 000024 04 WA 0 0 4
...
The address 0x9000 is the start of the global offset table, and 0x901c is also inside this section. The symbol that belongs to 0x901c can be found in the .rel.dyn section:
Relocation section '.rel.dyn' at offset 0x348 contains 7 entries:
Offset Info Type Sym.Value Sym. Name
...
00009018 00000415 R_ARM_GLOB_DAT 00000000 _Jv_RegisterClasses
0000901c 00000515 R_ARM_GLOB_DAT 00000000 i
00009020 00000615 R_ARM_GLOB_DAT 00000000 __cxa_finalize
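The address arithmetic above can be replayed in a few lines of C. This is only a sketch of the calculation, not anything the toolchain runs; the function names are made up for illustration, and it assumes Thumb state, where pc reads as the instruction address plus 4 and literal loads use the word-aligned pc.

```c
#include <stdint.h>

/* Value of pc as seen by a Thumb instruction located at insn_addr. */
uint32_t thumb_pc(uint32_t insn_addr)
{
    return insn_addr + 4u;
}

/* Address read by a Thumb "ldr rX, [pc, #imm]": word-aligned pc plus imm. */
uint32_t thumb_literal_addr(uint32_t insn_addr, uint32_t imm)
{
    return (thumb_pc(insn_addr) & ~3u) + imm;
}

/* GOT base as computed by "add r3, pc": first literal plus the pc value. */
uint32_t got_base(uint32_t literal0, uint32_t add_insn_addr)
{
    return literal0 + thumb_pc(add_insn_addr);
}

/* Address of a GOT slot: GOT base plus the i(GOT) offset from the second literal. */
uint32_t got_slot(uint32_t base, uint32_t got_offset)
{
    return base + got_offset;
}
```

With the values from the listing, thumb_literal_addr(0x450, 12) is 0x460, got_base(0x8ba8, 0x454) is 0x9000, and got_slot(0x9000, 0x1c) is 0x901c, the slot that .rel.dyn assigns to i.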

Related

LDR pseudoinstruction

When I create ARM assembly code from C code with gcc -S, I get a variant of the LDR instruction that I don't know. Specifically, I get the "ldr r3, .L5" instruction, where ".L5" is a label defined by the compiler. It is not clear to me why I don't get the pseudoinstruction "ldr r3, =.L5", which should be the only way to load an arbitrary number in a register.
In more detail:
I start from this C code (file name: sum_squares_C.c):
int n = 1;
int sum;
int main(){
sum = 0;
for(int i=1; i<=n; i++){
sum = sum + i*i;
}
}
Then on a Raspberry Pi, I compile with "gcc -O0 -S sum_squares_C.c", with compiler version gcc (Raspbian 8.3.0-6+rpi1) 8.3.0.
The output is this ARM code (the instruction "ldr r3, .L5" is in the 7th line after label "main"):
.arch armv6
.eabi_attribute 28, 1
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 2
.eabi_attribute 30, 6
.eabi_attribute 34, 1
.eabi_attribute 18, 4
.file "sum_squares_C.c"
.text
.global n
.data
.align 2
.type n, %object
.size n, 4
n:
.word 1
.comm sum,4,4
.text
.align 2
.global main
.arch armv6
.syntax unified
.arm
.fpu vfp
.type main, %function
main:
# args = 0, pretend = 0, frame = 8
# frame_needed = 1, uses_anonymous_args = 0
# link register save eliminated.
str fp, [sp, #-4]!
add fp, sp, #0
sub sp, sp, #12
ldr r3, .L5
mov r2, #0
str r2, [r3]
mov r3, #1
str r3, [fp, #-8]
b .L2
.L3:
ldr r3, [fp, #-8]
ldr r2, [fp, #-8]
mul r2, r2, r3
ldr r3, .L5
ldr r3, [r3]
add r3, r2, r3
ldr r2, .L5
str r3, [r2]
ldr r3, [fp, #-8]
add r3, r3, #1
str r3, [fp, #-8]
.L2:
ldr r3, .L5+4
ldr r3, [r3]
ldr r2, [fp, #-8]
cmp r2, r3
ble .L3
mov r3, #0
mov r0, r3
add sp, fp, #0
# sp needed
ldr fp, [sp], #4
bx lr
.L6:
.align 2
.L5:
.word sum
.word n
.size main, .-main
.ident "GCC: (Raspbian 8.3.0-6+rpi1) 8.3.0"
.section .note.GNU-stack,"",%progbits
It seems to me that gcc uses the instruction "ldr r3, .L5" as equivalent to "ldr r3, =.L5". Is it correct? Where can I find the definition of this instruction syntax? Is it possible to force gcc to not use this instruction, but use "ldr r3, =.L5" (I need this for teaching reasons)?
Thanks!
Francesco

ldr r3, .L5 loads a word from the address .L5 into r3. At the label .L5 there is the address of the variable sum. So this loads the address of sum into r3.
ldr r3, =.L5 loads the address of .L5 into r3. Then the program would need to dereference it again in order to get the address of sum. There is no reason to do this.
When you use ldr r3, =.L5 the assembler stores the address of .L5 somewhere, and then loads from that address. So this:
ldr r3, =.L5
...
.L5:
.word sum
is the same as this:
ldr r3, .address_of_L5
...
.L5:
.word sum
...
.address_of_L5:
.word .L5
As you can see, the compiler has already done this for sum. Instead of writing this assembly:
ldr r3, =sum
the compiler has written:
ldr r3, .L5
...
.L5:
.word sum
which is exactly what the assembler would have done anyway. I don't know why the compiler wants to do this instead of the assembler.
It is not clear to me why I don't get the pseudoinstruction "ldr r3, =.L5", which should be the only way to load an arbitrary number in a register.
Notice this is not the only way to load an arbitrary number into a register. It's not even a real way to load an arbitrary number into a register. It's a pseudoinstruction (as you know): it's not something the CPU can actually do, it's something that the assembler can "compile" for your convenience.

To save typing, and accepting some risk, a person might use:
ldr r3,=sum
ldr r3,[r3]
As pointed out in the other answer, the assembler will create in machine code the equivalent of what the human could have typed without the =address trick (note there is no = on the first ldr):
ldr r3,address_of_sum
ldr r3,[r3]
...
address_of_sum: .word sum
And that first ldr (not a pseudo-instruction, as it translates one to one into a known instruction) is a pc-relative load (assuming the literal is within reach).
Both of these, though, are assembler specific, as assembly language is defined by the assembler, not the target.
The =address shortcut is not supported by all ARM assemblers and should be used with care; for certain values it does not turn into a word in the pool with a pc-relative load.
For questions like this, first examine the disassembly; most of the time that will answer your question. Compiler-generated assembly is not as easy to read and follow as a disassembly, especially of a linked binary, so look at the disassembly first and only then, if still in question, at the assembly. It is also easier to learn from optimized code than from unoptimized code, as so much of the unoptimized code is this stack (or in this case global) variable traffic.
ldr r3,=0x1000
ldr r3,=0x1234
b .
00000000 <.text>:
0: e3a03a01 mov r3, #4096 ; 0x1000
4: e51f3000 ldr r3, [pc, #-0] ; c <.text+0xc>
8: eafffffe b 8 <.text+0x8>
c: 00001234 andeq r1, r0, r4, lsr r2
In the one case where it can, it generates a mov; where it can't, it allocates a word in the pool, places the value there and does a pc-relative load. Yes, when reading the output this way you need to understand and then ignore the andeq disassembly: on that line we are looking at the value 0x00001234, and the disassembler is just showing the instruction that value would decode to.
You should not assume the =address trick will always work if you try various tools. It works for gnu today if it can find a pool; if it can't, then you either need to do the typing yourself or add a .pool (or .ltorg, which does the same thing) to help the assembler find a place for the value.
I would expect an assembler to always place this (=address) in the pool for an external reference, but it is technically possible for a toolchain to put a placeholder there and let the linker fill it in, either with a mov or by adding a nearby item and placing the value there, like binutils does with a bl to an external reference.
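The mov-versus-pool decision in the 0x1000 and 0x1234 example comes down to whether the constant fits the ARM-state immediate form: an 8-bit value rotated right by an even amount. The check below is a rough sketch of that rule only; the function name is hypothetical, and a real assembler may have more options (for example the inverted mvn form), so treat it as an illustration rather than what gas actually does internally.

```c
#include <stdint.h>

/* Returns 1 if v can be written as an 8-bit value rotated right by an
   even amount (0, 2, ..., 30), which is what an ARM-state mov can encode. */
int arm_mov_immediate_ok(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Rotating left by rot undoes a rotate right by rot. */
        uint32_t undone = (rot == 0) ? v : ((v << rot) | (v >> (32 - rot)));
        if (undone <= 0xFFu)
            return 1;
    }
    return 0;
}
```

Under this rule 0x1000 passes (it is 0x01 rotated right by 20, matching the e3a03a01 encoding), while 0x1234 has set bits spread over more than 8 positions and fails, so it ends up in the pool.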
gas:
ldr r3,=sum
b .
00000000 <.text>:
0: e51f3000 ldr r3, [pc, #-0] ; 8 <.text+0x8>
4: eafffffe b 4 <.text+0x4>
8: 00000000 andeq r0, r0, r0
The linker will fill in the address later as with your compiler output. Now the -0 disassembly is very interesting, almost amusing.
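The "#-0" can be decoded by hand. In the ARM-state encoding of a pc-relative ldr, bit 23 (the U bit) selects add or subtract and bits 11:0 hold the offset magnitude; e51f3000 has U clear and an offset of zero, hence "minus zero". A small sketch of the decode, assuming ARM state (pc reads as instruction address plus 8) and using a made-up function name:

```c
#include <stdint.h>

/* Target address of an ARM-state "ldr Rt, [pc, #+/-imm12]" located at insn_addr. */
uint32_t arm_ldr_literal_target(uint32_t insn_addr, uint32_t insn)
{
    uint32_t pc = insn_addr + 8u;      /* pc reads as instruction address + 8 */
    uint32_t imm12 = insn & 0xFFFu;    /* bits 11:0: offset magnitude */
    int add = (insn >> 23) & 1u;       /* U bit: 1 = add, 0 = subtract */
    return add ? pc + imm12 : pc - imm12;
}
```

For the word e51f3000 at address 0 this gives 0x8, and at address 4 it gives 0xc, matching the targets shown in the two disassemblies.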

Does arm-none-eabi-gcc produce slower code than Keil uVision

I have a simple blinking led program running on STM32f103C8 (without initialization boilerplate):
void soft_delay(void) {
for (volatile uint32_t i=0; i<2000000; ++i) { }
}
uint32_t iters = 0;
while (1)
{
LL_GPIO_TogglePin(LED_GPIO_Port, LED_Pin);
soft_delay();
++iters;
}
It was compiled with both Keil uVision v.5 (default compiler) and CLion using arm-none-eabi-gcc compiler.
The surprise is that the arm-none-eabi-gcc build runs 50% slower in Release mode (-O2 -flto) and 100% slower in Debug mode.
I suspect 3 reasons:
Keil over-optimization (unlikely, because the code is very simple)
arm-none-eabi-gcc under-optimization due to wrong compiler flags (I use the CLion Embedded plugin's CMakeLists.txt)
A bug in the initialization so that chip has lower clock frequency with arm-none-eabi-gcc (to be investigated)
I have not yet dived into the jungles of optimization and disassembling,
I hope that there are many experienced embedded developers who already encountered this issue and have the answer.
UPDATE 1
Playing around with different optimization levels of Keil ArmCC, I see how they affect the generated code. And they affect it drastically, especially the execution time. Here are the benchmarks and disassembly of the soft_delay() function for each optimization level (RAM and Flash amounts include initialization code).
-O0: RAM: 1032, Flash: 1444, Execution Time (20 iterations): 18.7 sec
soft_delay PROC
PUSH {r3,lr}
MOVS r0,#0
STR r0,[sp,#0]
B |L6.14|
|L6.8|
LDR r0,[sp,#0]
ADDS r0,r0,#1
STR r0,[sp,#0]
|L6.14|
LDR r1,|L6.24|
LDR r0,[sp,#0]
CMP r0,r1
BCC |L6.8|
POP {r3,pc}
ENDP
-O1: RAM: 1032, Flash: 1216, Execution Time (20 iterations): 13.3 sec
soft_delay PROC
PUSH {r3,lr}
MOVS r0,#0
STR r0,[sp,#0]
LDR r0,|L6.24|
B |L6.16|
|L6.10|
LDR r1,[sp,#0]
ADDS r1,r1,#1
STR r1,[sp,#0]
|L6.16|
LDR r1,[sp,#0]
CMP r1,r0
BCC |L6.10|
POP {r3,pc}
ENDP
-O2 -Otime: RAM: 1032, Flash: 1136, Execution Time (20 iterations): 9.8 sec
soft_delay PROC
SUB sp,sp,#4
MOVS r0,#0
STR r0,[sp,#0]
LDR r0,|L4.24|
|L4.8|
LDR r1,[sp,#0]
ADDS r1,r1,#1
STR r1,[sp,#0]
CMP r1,r0
BCC |L4.8|
ADD sp,sp,#4
BX lr
ENDP
-O3: RAM: 1032, Flash: 1176, Execution Time (20 iterations): 9.9 sec
soft_delay PROC
PUSH {r3,lr}
MOVS r0,#0
STR r0,[sp,#0]
LDR r0,|L5.20|
|L5.8|
LDR r1,[sp,#0]
ADDS r1,r1,#1
STR r1,[sp,#0]
CMP r1,r0
BCC |L5.8|
POP {r3,pc}
ENDP
TODO: benchmarking and disassembly for arm-none-eabi-gcc.

This second answer demonstrates the kinds of things that can affect the performance results the OP may be seeing, with examples of how to test for them on an STM32F103C8 blue pill.
Complete source code:
flash.ld
MEMORY
{
rom : ORIGIN = 0x08000000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom
.rodata : { *(.rodata*) } > rom
.bss : { *(.bss*) } > ram
}
flash.s
.cpu cortex-m0
.thumb
.thumb_func
.global _start
_start:
stacktop: .word 0x20001000
.word reset
.word hang
.word hang
.thumb_func
reset:
bl notmain
b hang
.thumb_func
hang: b .
.align
.thumb_func
.globl PUT32
PUT32:
str r1,[r0]
bx lr
.thumb_func
.globl GET32
GET32:
ldr r0,[r0]
bx lr
.thumb_func
.globl dummy
dummy:
bx lr
test.s
.cpu cortex-m0
.thumb
.word 0,0,0
.word 0,0,0,0
.thumb_func
.globl TEST
TEST:
bx lr
notmain.c
//PA9 TX
//PA10 RX
void PUT32 ( unsigned int, unsigned int );
unsigned int GET32 ( unsigned int );
void dummy ( unsigned int );
#define USART1_BASE 0x40013800
#define USART1_SR (USART1_BASE+0x00)
#define USART1_DR (USART1_BASE+0x04)
#define USART1_BRR (USART1_BASE+0x08)
#define USART1_CR1 (USART1_BASE+0x0C)
#define USART1_CR2 (USART1_BASE+0x10)
#define USART1_CR3 (USART1_BASE+0x14)
//#define USART1_GTPR (USART1_BASE+0x18)
#define GPIOA_BASE 0x40010800
#define GPIOA_CRH (GPIOA_BASE+0x04)
#define RCC_BASE 0x40021000
#define RCC_APB2ENR (RCC_BASE+0x18)
#define STK_CSR 0xE000E010
#define STK_RVR 0xE000E014
#define STK_CVR 0xE000E018
#define STK_MASK 0x00FFFFFF
static void uart_init ( void )
{
//assuming 8MHz clock, 115200 8N1
unsigned int ra;
ra=GET32(RCC_APB2ENR);
ra|=1<<2; //GPIOA
ra|=1<<14; //USART1
PUT32(RCC_APB2ENR,ra);
//pa9 TX alternate function output push-pull
//pa10 RX configure as input floating
ra=GET32(GPIOA_CRH);
ra&=~(0xFF0);
ra|=0x490;
PUT32(GPIOA_CRH,ra);
PUT32(USART1_CR1,0x2000);
PUT32(USART1_CR2,0x0000);
PUT32(USART1_CR3,0x0000);
//8000000/16 = 500000
//500000/115200 = 4.34
//4 and 5/16 = 4.3125
//4.3125 * 16 * 115200 = 7948800
PUT32(USART1_BRR,0x0045);
PUT32(USART1_CR1,0x200C);
}
static void uart_putc ( unsigned int c )
{
while(1)
{
if(GET32(USART1_SR)&0x80) break;
}
PUT32(USART1_DR,c);
}
static void hexstrings ( unsigned int d )
{
//unsigned int ra;
unsigned int rb;
unsigned int rc;
rb=32;
while(1)
{
rb-=4;
rc=(d>>rb)&0xF;
if(rc>9) rc+=0x37; else rc+=0x30;
uart_putc(rc);
if(rb==0) break;
}
uart_putc(0x20);
}
static void hexstring ( unsigned int d )
{
hexstrings(d);
uart_putc(0x0D);
uart_putc(0x0A);
}
void soft_delay(void) {
for (volatile unsigned int i=0; i<2000000; ++i) { }
}
int notmain ( void )
{
PUT32(STK_CSR,4);
PUT32(STK_RVR,0x00FFFFFF);
PUT32(STK_CVR,0x00000000);
PUT32(STK_CSR,5);
uart_init();
hexstring(0x12345678);
hexstring(GET32(0xE000E018));
hexstring(GET32(0xE000E018));
return(0);
}
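The baud rate comments in uart_init can be checked with a couple of helper functions. This is a sketch under the assumption that the USART uses 16x oversampling, so the BRR value (12-bit mantissa plus 4-bit fraction in sixteenths) is simply the clock divided by the baud rate; the function names are hypothetical and not part of the program above.

```c
#include <stdint.h>

/* BRR value for 16x oversampling: clock / baud, rounded to nearest.
   The low 4 bits are the fraction in 1/16ths, the rest is the mantissa. */
uint32_t usart_brr(uint32_t clock_hz, uint32_t baud)
{
    return (clock_hz + baud / 2u) / baud;
}

/* Baud rate actually produced by a given BRR value (integer division). */
uint32_t usart_actual_baud(uint32_t clock_hz, uint32_t brr)
{
    return clock_hz / brr;
}
```

For an 8 MHz clock and 115200 baud, usart_brr returns 0x45 (mantissa 4, fraction 5/16), the value written to USART1_BRR above, and the resulting rate is about 115942 baud, within 1% of the target.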
build
arm-none-eabi-as --warn --fatal-warnings -mcpu=cortex-m3 flash.s -o flash.o
arm-none-eabi-as --warn --fatal-warnings -mcpu=cortex-m3 test.s -o test.o
arm-none-eabi-gcc -Wall -Werror -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -mcpu=cortex-m0 -march=armv6-m -c notmain.c -o notmain.thumb.o
arm-none-eabi-ld -o notmain.thumb.elf -T flash.ld flash.o test.o notmain.thumb.o
arm-none-eabi-objdump -D notmain.thumb.elf > notmain.thumb.list
arm-none-eabi-objcopy notmain.thumb.elf notmain.thumb.bin -O binary
arm-none-eabi-gcc -Wall -Werror -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -mcpu=cortex-m3 -march=armv7-m -c notmain.c -o notmain.thumb2.o
arm-none-eabi-ld -o notmain.thumb2.elf -T flash.ld flash.o test.o notmain.thumb2.o
arm-none-eabi-objdump -D notmain.thumb2.elf > notmain.thumb2.list
arm-none-eabi-objcopy notmain.thumb2.elf notmain.thumb2.bin -O binary
uart output as shown
12345678
00FFE445
00FFC698
If I take your code and make the loop shorter (don't have all day):
void soft_delay(void) {
for (volatile unsigned int i=0; i<0x2000; ++i) { }
}
arm-none-eabi-gcc -c -O0 -mthumb -mcpu=cortex-m0 hello.c -o hello.o
yes I know this is an m3
arm-none-eabi-gcc --version
arm-none-eabi-gcc (GCC) 5.4.0
gives
00000000 <soft_delay>:
0: b580 push {r7, lr}
2: b082 sub sp, #8
4: af00 add r7, sp, #0
6: 2300 movs r3, #0
8: 607b str r3, [r7, #4]
a: e002 b.n 12 <soft_delay+0x12>
c: 687b ldr r3, [r7, #4]
e: 3301 adds r3, #1
10: 607b str r3, [r7, #4]
12: 687b ldr r3, [r7, #4]
14: 4a03 ldr r2, [pc, #12] ; (24 <soft_delay+0x24>)
16: 4293 cmp r3, r2
18: d9f8 bls.n c <soft_delay+0xc>
1a: 46c0 nop ; (mov r8, r8)
1c: 46bd mov sp, r7
1e: b002 add sp, #8
20: bd80 pop {r7, pc}
22: 46c0 nop ; (mov r8, r8)
24: 00001fff
first check the test infrastructure
.cpu cortex-m0
.thumb
.align 8
.word 0,0
.thumb_func
.globl TEST
TEST:
push {r4,r5,r6,lr}
mov r4,r0
mov r5,r1
ldr r6,[r4]
inner:
bl soft_delay
sub r5,#1
bne inner
ldr r3,[r4]
sub r0,r6,r3
pop {r4,r5,r6,pc}
.align 8
soft_delay:
bx lr
in the openocd telnet window
reset halt
flash write_image erase notmain.thumb.elf
reset
gives
12345678
00001B59
7001 clocks; assuming the systick runs at the cpu clock, that's 7001 arm clocks for 1000 passes through a loop of 4 instructions (bl, bx lr, subs, bne), so about 7 clocks per pass.
Stepping back, note that I aligned some things:
08000108 <TEST>:
8000108: b570 push {r4, r5, r6, lr}
800010a: 1c04 adds r4, r0, #0
800010c: 1c0d adds r5, r1, #0
800010e: 6826 ldr r6, [r4, #0]
08000110 <inner>:
8000110: f000 f876 bl 8000200 <soft_delay>
8000114: 3d01 subs r5, #1
8000116: d1fb bne.n 8000110 <inner>
8000118: 6823 ldr r3, [r4, #0]
800011a: 1af0 subs r0, r6, r3
800011c: bd70 pop {r4, r5, r6, pc}
08000200 <soft_delay>:
8000200: 4770 bx lr
both loops are nicely aligned.
Now if I do this:
0800010a <TEST>:
800010a: b570 push {r4, r5, r6, lr}
800010c: 1c04 adds r4, r0, #0
800010e: 1c0d adds r5, r1, #0
8000110: 6826 ldr r6, [r4, #0]
08000112 <inner>:
8000112: f000 f875 bl 8000200 <soft_delay>
8000116: 3d01 subs r5, #1
8000118: d1fb bne.n 8000112 <inner>
800011a: 6823 ldr r3, [r4, #0]
800011c: 1af0 subs r0, r6, r3
800011e: bd70 pop {r4, r5, r6, pc}
Simply by changing the alignment of the code that is supposed to be testing the code under test, I now get:
00001F40
8000 ticks to do that loop 1000 times, with the function under test still aligned:
08000200 <soft_delay>:
8000200: 4770 bx lr
About the .align 8: in general, don't use .align with a number in gnu assembler, since its behavior does not translate across targets; .balign is better. Anyway, I used it. The two words are there because the .align made TEST aligned, but inner is what I wanted aligned, so I added two words to get it aligned.
.align 8
.word 0,0
nop
.thumb_func
.globl TEST
TEST:
push {r4,r5,r6,lr}
mov r4,r0
mov r5,r1
ldr r6,[r4]
inner:
bl soft_delay
sub r5,#1
bne inner
ldr r3,[r4]
sub r0,r6,r3
pop {r4,r5,r6,pc}
A little code review to make sure I didn't make a mistake here.
r0 is the address of the systick current value register.
r1 is the number of times I want to run the code under test.
The calling convention allows r0-r3 to be clobbered, so I need to move r0 and r1 to non-volatile registers (per the calling convention).
I want to sample the time at the instruction before the loop and at the instruction after.
So I need two registers for r0 and r1 and a register to store the begin time, so r4, r5, r6, and that fits in nicely to give an even number of registers pushed on the stack. We have to preserve lr so we can return.
We can now safely call soft_delay in the loop, subtract from the count, and branch back to inner if it is not zero; once the count is done, read the timer into r3. From the output above this is a down counter, so subtract end from beginning. Technically, since this is a 24 bit counter, I should AND with 0x00FFFFFF to do that subtraction correctly, but because the counter isn't going to roll over here I can leave that operation out. The result/return value goes in r0, then pop everything, which includes popping the pc to return to the C calling function, which prints out r0's value.
I think the test code is good.
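For reference, the "technically correct" version of that subtraction is tiny. The sketch below assumes a 24-bit down counter that wraps at most once between the two samples; the function and macro names are only for illustration.

```c
#include <stdint.h>

#define SYSTICK_MASK 0x00FFFFFFu  /* the counter is 24 bits wide */

/* Ticks elapsed between two samples of a 24-bit down counter.
   beg is the earlier sample, end the later one. */
uint32_t systick_elapsed(uint32_t beg, uint32_t end)
{
    return (beg - end) & SYSTICK_MASK;
}
```

With the two readings printed earlier (00FFE445 and 00FFC698) this gives 0x1DAD, and it still gives the right answer if the counter rolled over from 0 to 0xFFFFFF between the samples, which the unmasked subtraction in TEST would not.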
reading the CPUID register
411FC231
So that means r1p1, while the TRM I am using is written for r2p1. You have to be very careful to use the right document, but also sometimes look at the current document, or all the ones in between if available, to see what changed.
ICode memory interface
Instruction fetches from Code memory space 0x00000000 to 0x1FFFFFFF
are performed over the 32-bit AHB-Lite bus. The Debugger cannot access
this interface. All fetches are word-wide. The number of instructions
fetched per word depends on the code running and the alignment of the
code in memory.
Sometimes in ARM TRMs you see the fetch info up top near the processor features; this tells me what I wanted to know.
08000112 <inner>:
8000112: f000 f875 bl 8000200 <soft_delay>
8000116: 3d01 subs r5, #1
8000118: d1fb bne.n 8000112 <inner>
This requires a fetch at 110, 114 and 118.
08000110 <inner>:
8000110: f000 f876 bl 8000200 <soft_delay>
8000114: 3d01 subs r5, #1
8000116: d1fb bne.n 8000110 <inner>
This requires a fetch at 110 and 114, but not one at 118, so that extra fetch could be our added clock. The m3 was the first publicly available cortex-m, and it has a lot of features in the core that went away, with similar ones coming back later. Some of the smaller cores fetch differently and you don't see this alignment issue. Bigger, full-sized cores sometimes fetch 4 or 8 instructions at a time, and you have to change your alignment even more to hit the boundary; but you can hit the boundary, and since the extra fetch costs 2 or 4 clocks plus bus overhead, you can see those.
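The fetch counting can be written down as a one-liner. This is a deliberately simplified model, assuming only what the TRM excerpt says (all fetches are word-wide) and ignoring prefetch buffering and branch effects; the function name is hypothetical.

```c
#include <stdint.h>

/* Number of 32-bit words touched by `size` bytes of code starting at `start`. */
uint32_t word_fetches(uint32_t start, uint32_t size)
{
    if (size == 0)
        return 0;
    uint32_t first = start & ~3u;               /* word holding the first byte */
    uint32_t last = (start + size - 1u) & ~3u;  /* word holding the last byte */
    return (last - first) / 4u + 1u;
}
```

The inner loop (bl, subs, bne) is 8 bytes: starting at 0x08000110 it touches 2 words, starting at 0x08000112 it touches 3, which lines up with the slower result for the misaligned case.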
If I put two nops
nop
nop
.thumb_func
.globl TEST
TEST:
gives
08000114 <inner>:
8000114: f000 f874 bl 8000200 <soft_delay>
8000118: 3d01 subs r5, #1
800011a: d1fb bne.n 8000114 <inner>
800011c: 6823 ldr r3, [r4, #0]
800011e: 1af0 subs r0, r6, r3
8000120: bd70 pop {r4, r5, r6, pc}
gives
00001B59
So that's good, we are back to that number. Could try a few more to confirm, but it appears that our outer test loop is sensitive to alignment, which is bad, but we can manage that: don't change it and it won't affect the test. If I didn't care about alignment and had something like this:
void soft_delay(void) {
for (volatile unsigned int i=0; i<0x2000; ++i) { }
}
int notmain ( void )
{
unsigned int ra;
unsigned int beg;
unsigned int end;
PUT32(STK_CSR,4);
PUT32(STK_RVR,0x00FFFFFF);
PUT32(STK_CVR,0x00000000);
PUT32(STK_CSR,5);
uart_init();
hexstring(0x12345678);
beg=GET32(STK_CVR);
for(ra=0;ra<1000;ra++)
{
soft_delay();
}
end=GET32(STK_CVR);
hexstring((beg-end)&0x00FFFFFF);
return(0);
}
Then as I played with optimization options and with different compilers, any change in the program/binary in front of the test loop could move the test loop and change its performance. In my simple example it was a 14% performance difference, which is massive if you are doing performance tests. If we let the compiler take care of all this without us being in control, everything in front of the function under test can mess with the function under test. As written above, the compiler might also opt to inline the function rather than call it, making for an even more interesting situation: the test loop is probably not as clean as mine (certainly not if not optimized), and now the code under test itself changes as options or alignments change.
I'm very happy you happened to be using this core/chip...
If I re-align inner and now mess with this
.align 8
nop
soft_delay:
bx lr
08000202 <soft_delay>:
8000202: 4770 bx lr
It's a single instruction, which, from what we have read and seem to be able to tell, is fetched as part of the word at 0x200. I wouldn't expect this to change anything, and it didn't:
00001B59
But now that we know what we know, we can use our experience to mess with this trivial, not at all interesting example.
.align 8
nop
soft_delay:
nop
bx lr
gives
00001F41
as expected. and we can have even more fun:
.align 8
.word 0,0
nop
.thumb_func
.globl TEST
TEST:
combined gives
08000112 <inner>:
8000112: f000 f876 bl 8000202 <soft_delay>
8000116: 3d01 subs r5, #1
8000118: d1fb bne.n 8000112 <inner>
08000202 <soft_delay>:
8000202: 46c0 nop ; (mov r8, r8)
8000204: 4770 bx lr
no surprise if you know what you are doing:
00002328
9000 clocks, a 29% performance difference. We are literally talking about 5 (technically 6) instructions, the same exact machine code, and by simply changing alignment the performance can be 29% different. The compiler and options have nothing to do with it; we have not even gotten there yet.
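The percentages quoted here come straight from the hex tick counts printed over the uart. A small helper, with a made-up name, shows the arithmetic:

```c
#include <stdint.h>

/* Extra time of `ticks` relative to `baseline`, in percent. */
double percent_slower(uint32_t baseline, uint32_t ticks)
{
    return 100.0 * ((double)ticks - (double)baseline) / (double)baseline;
}
```

With the aligned result 0x1B59 (7001) as the baseline, 0x1F40 (8000) comes out at roughly 14% slower and 0x2328 (9000) at roughly 29% slower.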
How can we expect to do any kind of performance evaluation of a program using the method of timing the code a bunch of times in a loop? We can't, unless we know what we are doing, have an understanding of the architecture, etc.
Now, as should be obvious from reading the documentation, I am using the internal 8 MHz clock, and everything is derived from that, so the systick times are not going to vary from run to run as you might see with dram for example. The LATENCY bits in the FLASH_ACR register should have defaulted to zero wait states, for 0 < SYSCLK <= 24 MHz. If I were to bump the clock up above 24 MHz, the processor would run faster but the flash would now be slower relative to the processor.
Without messing with the clocks, simply adding a wait state by changing the FLASH_ACR register to 0x31 gives:
000032C6
12998, up from 9000. I didn't expect it to double necessarily, and it didn't.
Hmm, for fun make a PUT16 using strh, and
.thumb_func
.globl HOP
HOP:
bx r2
and
PUT16(0x2000010a,0xb570); // 800010a: b570 push {r4, r5, r6, lr}
PUT16(0x2000010c,0x1c04); // 800010c: 1c04 adds r4, r0, #0
PUT16(0x2000010e,0x1c0d); // 800010e: 1c0d adds r5, r1, #0
PUT16(0x20000110,0x6826); // 8000110: 6826 ldr r6, [r4, #0]
PUT16(0x20000112,0xf000); // 8000112: f000 f876 bl 8000202 <soft_delay>
PUT16(0x20000114,0xf876); // 8000112: f000 f876 bl 8000202 <soft_delay>
PUT16(0x20000116,0x3d01); // 8000116: 3d01 subs r5, #1
PUT16(0x20000118,0xd1fb); // 8000118: d1fb bne.n 8000112 <inner>
PUT16(0x2000011a,0x6823); // 800011a: 6823 ldr r3, [r4, #0]
PUT16(0x2000011c,0x1af0); // 800011c: 1af0 subs r0, r6, r3
PUT16(0x2000011e,0xbd70); // 800011e: bd70 pop {r4, r5, r6, pc}
PUT16(0x20000202,0x46c0); // 8000202: 46c0 nop ; (mov r8, r8)
PUT16(0x20000204,0x4770); // 8000204: 4770 bx lr
hexstring(HOP(STK_CVR,1000,0x2000010B));
gives
0000464B
and that was not at all expected, but it is basically 18,000 (17995).
Putting ram to bed after this one:
PUT16(0x20000108,0xb570); // 800010a: b570 push {r4, r5, r6, lr}
PUT16(0x2000010a,0x1c04); // 800010c: 1c04 adds r4, r0, #0
PUT16(0x2000010c,0x1c0d); // 800010e: 1c0d adds r5, r1, #0
PUT16(0x2000010e,0x6826); // 8000110: 6826 ldr r6, [r4, #0]
PUT16(0x20000110,0xf000); // 8000112: f000 f876 bl 8000202 <soft_delay>
PUT16(0x20000112,0xf876); // 8000112: f000 f876 bl 8000202 <soft_delay>
PUT16(0x20000114,0x3d01); // 8000116: 3d01 subs r5, #1
PUT16(0x20000116,0xd1fb); // 8000118: d1fb bne.n 8000112 <inner>
PUT16(0x20000118,0x6823); // 800011a: 6823 ldr r3, [r4, #0]
PUT16(0x2000011a,0x1af0); // 800011c: 1af0 subs r0, r6, r3
PUT16(0x2000011c,0xbd70); // 800011e: bd70 pop {r4, r5, r6, pc}
PUT16(0x20000200,0x46c0); // 8000202: 46c0 nop ; (mov r8, r8)
PUT16(0x20000202,0x4770); // 8000204: 4770 bx lr
hexstring(HOP(STK_CVR,1000,0x20000109));
00002EDE
The machine code did not change, because I moved both back by 2, so the relative address between them was the same. Note that bl is two separate 16 bit instructions, not one 32 bit one. You can't see this in the newer docs; you need to go back to the original/early ARM ARM, where it is explained. And it is easy to do experiments where you split the two instructions and put other stuff in between, and they work just fine, because they are two separate instructions.
At this point the reader should be able to make a 2 instruction test loop, time it, and dramatically change the performance of those two instructions on this platform using the same exact machine code.
So let's try the volatile loop that you wrote.
.align 8
soft_delay:
push {r7, lr}
sub sp, #8
add r7, sp, #0
mov r3, #0
str r3, [r7, #4]
b L12
Lc:
ldr r3, [r7, #4]
add r3, #1
str r3, [r7, #4]
L12:
ldr r3, [r7, #4]
ldr r2, L24
cmp r3, r2
bls Lc
nop
mov sp, r7
add sp, #8
pop {r7, pc}
nop
.align
L24: .word 0x1FFF
This is, I believe, the unoptimized -O0 version. Starting off with one pass through the test loop:
hexstring(TEST(STK_CVR,1));
(Start with one because, from experience, times this long can overflow our 24 bit counter, and then the results will be very strange or lead to false conclusions.)
0001801F
98,000, quick check for safety:
.align
L24: .word 0x1F
0000019F
Not bad, that is on par with 256 times faster.
So we have some wiggle room in our test loop, but not much; try 10:
hexstring(TEST(STK_CVR,10));
000F012D
98334 ticks per loop.
changing the alignment
08000202 <soft_delay>:
8000202: b580 push {r7, lr}
8000204: b082 sub sp, #8
gave the same result
000F012D
Not unheard of; you can examine the differences if you want: count through each instruction, check fetch cycles, etc.
had I made the test:
soft_delay:
nop
nop
bx lr
it's two fetch cycles no matter what the alignment, just as the bare bx lr with no nops was insensitive to alignment, as we saw. So by simply having an odd number of instructions in the test, alignment won't affect the results on fetches alone. But note, from what we know now, that had some other code in the program moved the outer timing/test loop, that may have changed performance, and the results may show a difference between two tests that was purely the timing code and not the code under test (read Michael Abrash).
The cortex-m3 is based on the armv7-m architecture. If I change the compiler option from -mcpu=cortex-m0 (compatible with all cortex-m cores so far) to -mcpu=cortex-m3 (not compatible with all cortex-m cores; it will break on half of them) it produces a little bit less code.
.align 8
soft_delay:
push {r7}
sub sp, #12
add r7, sp, #0
movs r3, #0
str r3, [r7, #4]
b L12
Lc:
ldr r3, [r7, #4]
add r3, #1
str r3, [r7, #4]
L12:
ldr r3, [r7, #4]
/*14: f5b3 5f00 cmp.w r3, #8192 ; 0x2000*/
//cmp.w r3, #8192
.word 0x5f00f5b3
bcc Lc
nop
add r7, #12
mov sp, r7
pop {r7}
bx lr
000C80FB 81945 ticks for the code under test.
I hate unified syntax, that was a massive mistake, so I fumble along in legacy mode; thus the .word thing there in the middle.
As part of writing this I kinda messed up my system in order to demonstrate something. I was building a gcc 5.4.0 but overwrote my 9.2.0, so I had to re-build both.
2.95 was the version I started using with arm, and it didn't support thumb; gcc 3.x.x was the first to. And either gcc 4.x.x or gcc 5.x.x produced "slower" code for some of my projects. At work we are currently moving from ubuntu 16.04 to 18.04 for our build systems, which, if you use the apt-get cross compiler for arm, moves you from 5.x.x to 7.x.x, and it is making larger binaries for the same source code. Where we are tight on memory it is pushing us beyond what's available, so we have to either remove some code (easiest is to make the printed messages shorter, cut text out) or stick to the older compiler by building our own or apt-getting the older one. 19.10 no longer offers the 5.x.x version.
So both are now built.
18: d3f8 bcc.n c <soft_delay+0xc>
1a: bf00 nop
1c: bf00 nop
1e: 370c adds r7, #12
these nops after bcc are baffling to me...
18: d3f8 bcc.n c <soft_delay+0xc>
1a: bf00 nop
1c: 370c adds r7, #12
gcc 5.4.0 is putting one nop there, gcc 9.2.0 is putting two. ARM doesn't have the branch shadow thing of MIPS (MIPS doesn't currently either).
000C80FB gcc 5.4.0
000C8105 gcc 9.2.0
I call the function 10 times; the nop is outside the loop in the code under test, so it has a lesser effect.
Optimized for all cortex-m variants (to date) using gcc 9.2.0:
soft_delay:
mov r3, #0
mov r2, #128
sub sp, #8
str r3, [sp, #4]
ldr r3, [sp, #4]
lsl r2, r2, #6
cmp r3, r2
bcs L1c
L10:
ldr r3, [sp, #4]
add r3, #1
str r3, [sp, #4]
ldr r3, [sp, #4]
cmp r3, r2
bcc L10
L1c:
add sp, #8
bx lr
(also understand that not all gcc 9.2.0 builds, say, produce the same code: when you build the compiler you have options, and those options can affect the output, so different builds of 9.2.0 can produce different results)
000C80B5
gcc 9.2.0 built for cortex-m3:
soft_delay:
mov r3, #0
sub sp, #8
str r3, [sp, #4]
ldr r3, [sp, #4]
/*8: f5b3 5f00 cmp.w r3, #8192 ; 0x2000*/
.word 0x5F00F5B3
bcs L1c
Le:
ldr r3, [sp, #4]
add r3, #1
str r3, [sp, #4]
ldr r3, [sp, #4]
/*16: f5b3 5f00 cmp.w r3, #8192 ; 0x2000*/
.word 0x5F00F5B3
bcc Le
L1c:
add sp, #8
bx lr
000C80A1
That's in the noise, despite the differences in the generated code. They simply didn't gain anything from doing the compare with 0x2000 in fewer instructions. And note that if you change that 0x2000 to some other number, that does not simply make the loop take that much longer; it can change the generated code for architectures like this.
How I like to make these counted delay loops is to use a function outside the compile domain:
extern void dummy ( unsigned int );
void soft_delay(void) {
for (unsigned int i=0; i<0x2000; ++i) { dummy(i); }
}
soft_delay:
push {r4, r5, r6, lr}
mov r5, #128
mov r4, #0
lsl r5, r5, #6
L8:
mov r0, r4
add r4, #1
bl dummy
cmp r4, r5
bne L8
pop {r4, r5, r6, pc}
The feature there is that you don't need the overhead of what volatile does. You do have a call, and clearly there is overhead due to the call as well, but not as much:
000B40C9
or even better:
soft_delay:
sub r0,#1
bne soft_delay
bx lr
I would have to change the code wrapped around the code under test to make that function work.
Another note specific to these targets but also something you deal with
unsigned int more_fun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int a, unsigned int b )
{
return(more_fun(a,b)+a+(b<<2));
}
00000000 <fun>:
0: b570 push {r4, r5, r6, lr}
2: 000c movs r4, r1
4: 0005 movs r5, r0
6: f7ff fffe bl 0 <more_fun>
a: 00a4 lsls r4, r4, #2
c: 1964 adds r4, r4, r5
e: 1820 adds r0, r4, r0
10: bd70 pop {r4, r5, r6, pc}
12: 46c0 nop ; (mov r8, r8)
This is a question repeated here at SO on a periodic basis: why is it pushing r6 when it isn't using r6?
The compiler operates using what I call, and what used to be called, a calling convention; now they use terms like ABI or EABI. Either way it is the same thing: a set of rules the compiler follows for a particular target. Arm added a rule to keep the stack aligned on a 64 bit boundary instead of 32, and that causes the extra item to be pushed to keep the stack aligned; which register is used there can vary. If you use an older gcc vs a newer one, this can/will affect the performance of your code all by itself.
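The arithmetic behind the extra register can be sketched like this. It assumes sp is already 8-byte aligned on entry and that every pushed register is 4 bytes; the function names are made up for the example.

```c
/* Bytes pushed once the register count is padded so that sp stays on an
 * 8-byte boundary (each register is 4 bytes). */
unsigned padded_push_bytes(unsigned needed_regs)
{
    unsigned bytes = needed_regs * 4u;
    return (bytes + 7u) & ~7u;   /* round up to a multiple of 8 */
}

/* Number of extra "filler" registers that have to be added to the push. */
unsigned filler_regs(unsigned needed_regs)
{
    return (padded_push_bytes(needed_regs) - needed_regs * 4u) / 4u;
}
```

fun above only needs r4, r5 and lr, which is 12 bytes, so one filler register (r6 in that build) brings the push to 16 bytes.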

There are many factors at play here. Certainly if you have an optimizing compiler and you compare optimized vs not, DEPENDING ON THE CODE you can see a large difference in execution speed. Using the volatile in the tiny loop here actually masks some of that; in both cases it should be read/written to memory every loop.
But in the calling code, the loop variable unoptimized would touch ram two or three times per loop, while optimized it would ideally stay in a register the whole time, making for a dramatic difference in execution performance even with zero wait state ram.
The toggle pin code is relatively large (talking to the peripheral directly would be less code), depending on whether that library was compiled separately with different options or at the same time with the same options makes a big difference with respect to performance.
Add that this is an mcu running off flash, and with the age of this part the flash might at best run at half the clock rate of the cpu and at worst have a number of wait states, and I don't remember offhand if ST had the caching in front of it at that time. So every instruction you add can add a clock, and just the loop variable alone can dramatically change the timing.
This being a high performance pipelined core, alignment can (not always) play a role, as I have demonstrated here and elsewhere. If the exact same machine code links to address 0x100 in one case and 0x102 in another, it is possible that it takes extra or fewer clocks to execute based on the nature of the pre-fetcher in the design, the flash implementation, the cache implementation if any, the branch predictor, etc.
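A very simplified way to picture the alignment effect is to count how many aligned fetches it takes to read the same bytes of code from two start addresses. The fetch width here is an assumed parameter, and real prefetchers, flash accelerators and caches are more complicated than this model.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model: how many aligned fetches of `fetch_bytes` are needed to
 * read `code_bytes` of machine code starting at `addr`.
 * fetch_bytes must be a power of two. */
unsigned fetches_needed(uint32_t addr, uint32_t code_bytes, uint32_t fetch_bytes)
{
    assert(code_bytes > 0);
    assert(fetch_bytes != 0 && (fetch_bytes & (fetch_bytes - 1u)) == 0);

    uint32_t first = addr & ~(fetch_bytes - 1u);
    uint32_t last  = (addr + code_bytes - 1u) & ~(fetch_bytes - 1u);
    return (last - first) / fetch_bytes + 1u;
}
```

With an assumed 8-byte fetch, an 8-byte loop at 0x100 fits in one fetch, but the same loop at 0x102 straddles two, and that extra fetch is paid every time around the loop.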
And then the biggest problem is how you timed this. It is not uncommon for there to be error from not using a clock correctly, such that the clock/timing code itself varies and creates some of the difference. Plus, are there background things going on, interrupts/multitasking?
Michael Abrash wrote a wonderful book called Zen of Assembly Language; you can get it for free in ePub or perhaps pdf form on GitHub. The 8088 was obsolete when the book was released, but if you focus on that then you have missed the point. I bought it when it came out and have used what I learned on nearly a daily basis.
gcc is not a high performance compiler; it is more of a general purpose compiler built Unix style, where you can have different language front ends and different target backends. When I was in the position you are in now, trying to first understand these things, I sampled many compilers for the same arm target and the same C code, and the results varied widely. Later I wrote an instruction set simulator so I could count instructions and memory accesses to compare gnu vs llvm. The latter has more optimization opportunities than gnu, but in execution tests gcc was sometimes faster and never slower. That ended up being more of a long weekend having fun than something I used to analyze the differences.
It is easier to start with small-ish code like this and disassemble the two. Understand that fewer instructions doesn't mean faster. One distant memory access on a dram based system can take hundreds of clock cycles; that might be replaced with another solution that takes a handful/dozen linearly fetched instructions to end up with the same result (do some math vs look up something in a rarely sampled table), and depending on the situation the dozen instructions execute much faster. At the same time the table solution can be much faster. It depends.
Examination of the disassembly often leads to incorrect conclusions (read Abrash, not just that book, everything), as first off folks think fewer instructions means faster code. Rearranging instructions in a pipelined processor can improve performance if you move an instruction into a time period that would otherwise have been wasted clocks. For example, incrementing a register not related to a memory access in front of the memory access instead of after it, in a non-superscalar processor.
Ahh, back to a comment. This was years ago, when competing compilers were more of a thing; now most folks just wrap their gui around gnu, and the ide/gui is the product, not the compiler. But there was the arm compiler itself, before the rvct tools, ads and I forget the other; those were "better" than gcc. I forget the names of the others, but there was one that produced significantly faster code. Granted this was Dhrystone, so you will also find that they may tune optimizers for Dhrystone just to play benchmark games. Now that I can see how easy it is to manipulate benchmarks, I consider them to in general be bu33zzit, can't be trusted. Keil used to be a multi-target tool for mcus and similar, but then was purchased by arm, and I thought at the time they were dropping all other targets, but I have not checked in a while. I might have tried them once to get access to a free/demo version of rvct. When I was working at one job we had a budget to buy multi-thousand dollar tools, but that didn't include rvct (although I was on phone calls with the formerly Allant folks who were part of a purchase that became the rvct tools), which I was eager to try once they had finished the development/integration of that product. By then I didn't have a budget for that, and later I couldn't afford or wasn't interested in buying even Keil's tools, much less arm's. Their early demos of rvct created an encrypted/obfuscated binary that was not arm machine code; it only ran on their simulator, so you couldn't use it to eval performance or to compare it to others. I don't think they were willing to give us an un-obfuscated version, and we weren't willing to reverse engineer it. Now it is easier just to use gcc or clang and hand optimize where NEEDED. Likewise, with experience examining compiler output you can write C code that optimizes better.
You have to know the hardware, particularly in this case where you take processor IP and most of the chip is not related to the processor IP and most of the performance is not related to the processor IP (pretty much true for a lot of platforms today in particular your server/desktop/laptop). The Gameboy Advance for example used a lot of 16 bit buses instead of 32, thumb tools were barely being integrated, but thumb mode while counting instructions/or bytes was like 10% more code at the time, executed significantly faster on that chip. On other implementations both arm architecture and chip design thumb may have performance penalties, or not.
ST in general with the cortex-m products tends to put a cache in front of the flash. Sometimes they document it and provide enable/disable control, sometimes not, so it can be difficult at best to get a real performance value, as the typical thing is to run the code under test many times in a loop so you can get a better time measurement. Other vendors don't necessarily do this, and it is much easier to see the flash wait states and get a real, worst case timing value that you can use to validate your design. Caches in general, as well as pipelines, make it difficult at best to get good, repeatable, reliable numbers to validate your design. So for example you sometimes cannot do the alignment trick to mess with the performance of the same machine code on an ST, but on say a TI with the same core you can. ST might not give you the icache in a cortex-m7 where another vendor might, since ST has already covered that. Even within one brand name, though, don't expect the results of one chip/family to translate to another chip/family even if they use the same core. Also look at subtle comments in the arm documentation as to whether some cores offer a fetch size as an advertised option, single or multi-cycle multiply, etc. And I'll tell you that there are other compile time options for the core that are not shown in the technical reference manual that can affect performance, so don't assume that all cortex-m3s are the same even if they are the same revision from arm. The chip vendor has the source so they can go even further and modify it; for example, for a register bank to be implemented by the consumer, they might change it from no protection to parity to ecc, which might affect performance while retaining all of arm's original code as is. When you look at an avr or pic, though, or even an msp430, while I can't prove it, those designs appear more static: not tiny vs Xmega vs regular old avr, as there are definite differences there, but one tiny compared to another.
Your assumptions are a good start. There really isn't such a thing as over-optimization, it is more a matter of missed optimizations, but there may be other factors you are not seeing in your assumptions that may or may not be seen in a disassembly. There are obvious things that we would expect, like one of the loop variables being register based vs memory based. Alignment. I wouldn't expect the clock settings to change if you used the same code, but with a different set of experiments using timers or a scope you can measure the clock settings to see if they were configured the same. Background tasks, interrupts, and dumb luck as to how/when they hit the test. But bottom line, sometimes it is as simple as a missed optimization or subtle differences in how one compiler generates code compared to another; it is as often not those things and more of a system issue: memory speed, peripheral speed, caches or their architecture, how the various busses in the design operate, etc. For some of these cortex-ms (and many other processors) you can exploit their bus behavior to show a performance difference in something that the average person wouldn't expect to see.

Keil over-optimization (unlikely, because the code is very simple)
You can't over-optimize; you can under-optimize or miss an optimization. So if anything gcc missed something that Keil didn't, not the other way around.
arm-none-eabi-gcc under-optimization due to wrong compiler flags (I use CLion Embedded plugins` CMakeLists.txt)
We will see below, but this is highly likely, especially debug vs release. I never build for debug (never use a debugger): you have to test everything twice, and if you don't test as you go it makes it much harder to debug, so the release version, if it has issues, takes a lot more work to figure out.
A bug in the initialization so that chip has lower clock frequency with arm-none-eabi-gcc (to be investigated)
My guess is it isn't this; this would imply you made a really big mistake and didn't compile the same code on each tool, so it wasn't a fair comparison.
Let's run it.
Using the systick timer, 24 bit (current value register address passed in r0)
.align 8
.thumb_func
.globl TEST
TEST:
push {r4,r5,r6,lr}
mov r4,r0
ldr r5,[r4]
bl soft_delay
ldr r3,[r4]
sub r0,r5,r3
pop {r4,r5,r6,pc}
To avoid overflowing the 24 bit timer, the loop count is limited to 200000 rather than 2000000. I assume the code you left out is 2000000 - 1. If not, this still shows the relevant differences.
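As a rough sketch of the timer arithmetic: SysTick is a 24-bit down counter, so the elapsed time is start minus end, and the total has to stay under 2^24 ticks. The helper names are mine, and the wrap handling assumes the reload value is the full 0x00FFFFFF.

```c
#include <stdbool.h>
#include <stdint.h>

#define SYSTICK_MASK 0x00FFFFFFu   /* SysTick current value is 24 bits wide */

/* SysTick counts down, so elapsed = start - end, wrapped to 24 bits. */
uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & SYSTICK_MASK;
}

/* True if loops * ticks_per_loop can be measured without the counter
 * wrapping all the way around. */
bool fits_in_systick(uint64_t loops, uint64_t ticks_per_loop)
{
    return loops * ticks_per_loop <= SYSTICK_MASK;
}
```

At around 12 ticks per loop, 2000000 iterations would need 24000000 ticks, more than the 16777215 a 24-bit counter can hold, while a few hundred thousand iterations fit easily.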
-O0 code
.align 8
soft_delay:
PUSH {r3,lr}
MOV r0,#0
STR r0,[sp,#0]
B L6.14
L6.8:
LDR r0,[sp,#0]
ADD r0,r0,#1
STR r0,[sp,#0]
L6.14:
LDR r1,L6.24
LDR r0,[sp,#0]
CMP r0,r1
BCC L6.8
POP {r3,pc}
.align
L6.24: .word 100000 - 1
08000200 <soft_delay>:
8000200: b508 push {r3, lr}
8000202: 2000 movs r0, #0
8000204: 9000 str r0, [sp, #0]
8000206: e002 b.n 800020e <L6.14>
08000208 <L6.8>:
8000208: 9800 ldr r0, [sp, #0]
800020a: 3001 adds r0, #1
800020c: 9000 str r0, [sp, #0]
0800020e <L6.14>:
800020e: 4902 ldr r1, [pc, #8] ; (8000218 <L6.24>)
8000210: 9800 ldr r0, [sp, #0]
8000212: 4288 cmp r0, r1
8000214: d3f8 bcc.n 8000208 <L6.8>
8000216: bd08 pop {r3, pc}
08000218 <L6.24>:
8000218: 0001869f
00124F8B systick timer ticks
-O1 code
soft_delay:
PUSH {r3,lr}
MOV r0,#0
STR r0,[sp,#0]
LDR r0,L6.24
B L6.16
L6.10:
LDR r1,[sp,#0]
ADD r1,r1,#1
STR r1,[sp,#0]
L6.16:
LDR r1,[sp,#0]
CMP r1,r0
BCC L6.10
POP {r3,pc}
.align
L6.24: .word 100000 - 1
08000200 <soft_delay>:
8000200: b508 push {r3, lr}
8000202: 2000 movs r0, #0
8000204: 9000 str r0, [sp, #0]
8000206: 4804 ldr r0, [pc, #16] ; (8000218 <L6.24>)
8000208: e002 b.n 8000210 <L6.16>
0800020a <L6.10>:
800020a: 9900 ldr r1, [sp, #0]
800020c: 3101 adds r1, #1
800020e: 9100 str r1, [sp, #0]
08000210 <L6.16>:
8000210: 9900 ldr r1, [sp, #0]
8000212: 4281 cmp r1, r0
8000214: d3f9 bcc.n 800020a <L6.10>
8000216: bd08 pop {r3, pc}
08000218 <L6.24>:
8000218: 0001869f
000F424E systicks
-O2 code
soft_delay:
SUB sp,sp,#4
MOVS r0,#0
STR r0,[sp,#0]
LDR r0,L4.24
L4.8:
LDR r1,[sp,#0]
ADDS r1,r1,#1
STR r1,[sp,#0]
CMP r1,r0
BCC L4.8
ADD sp,sp,#4
BX lr
.align
L4.24: .word 100000 - 1
08000200 <soft_delay>:
8000200: b081 sub sp, #4
8000202: 2000 movs r0, #0
8000204: 9000 str r0, [sp, #0]
8000206: 4804 ldr r0, [pc, #16] ; (8000218 <L4.24>)
08000208 <L4.8>:
8000208: 9900 ldr r1, [sp, #0]
800020a: 3101 adds r1, #1
800020c: 9100 str r1, [sp, #0]
800020e: 4281 cmp r1, r0
8000210: d3fa bcc.n 8000208 <L4.8>
8000212: b001 add sp, #4
8000214: 4770 bx lr
8000216: 46c0 nop ; (mov r8, r8)
08000218 <L4.24>:
8000218: 0001869f
000AAE65 systicks
-O3
soft_delay:
PUSH {r3,lr}
MOV r0,#0
STR r0,[sp,#0]
LDR r0,L5.20
L5.8:
LDR r1,[sp,#0]
ADD r1,r1,#1
STR r1,[sp,#0]
CMP r1,r0
BCC L5.8
POP {r3,pc}
.align
L5.20: .word 100000 - 1
08000200 <soft_delay>:
8000200: b508 push {r3, lr}
8000202: 2000 movs r0, #0
8000204: 9000 str r0, [sp, #0]
8000206: 4803 ldr r0, [pc, #12] ; (8000214 <L5.20>)
08000208 <L5.8>:
8000208: 9900 ldr r1, [sp, #0]
800020a: 3101 adds r1, #1
800020c: 9100 str r1, [sp, #0]
800020e: 4281 cmp r1, r0
8000210: d3fa bcc.n 8000208 <L5.8>
8000212: bd08 pop {r3, pc}
08000214 <L5.20>:
8000214: 0001869f
000AAE6A systicks
Interestingly alignment doesn't affect any of these results.
Comparing your results relative to each other and the above in a spreadsheet
level   yours   relative   systicks (hex)   systicks (dec)   relative
-O0     18.7    1.000      00124F8B         1200011          1.000
-O1     13.3    0.711      000F424E         1000014          0.833
-O2     9.8     0.524      000AAE65         700005           0.583
-O3     9.9     0.529      000AAE6A         700010           0.583
It shows that my measurements improve from stage to stage the same way yours do, and that -O3 is slightly slower than -O2.
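If you would rather not use a spreadsheet, the relative column for my numbers can be reproduced with a few lines of C. This is only a sketch; the array holds the four systick counts measured above, and the function name is made up.

```c
#include <stdint.h>

/* The four systick counts measured above, -O0 through -O3. */
static const uint32_t ticks[4] = {
    0x00124F8B, 0x000F424E, 0x000AAE65, 0x000AAE6A
};

/* Ratio of build `n` to the -O0 build, in thousandths, rounded to nearest. */
uint32_t relative_permille(unsigned n)
{
    uint64_t scaled = (uint64_t)ticks[n] * 1000u;
    return (uint32_t)((scaled + ticks[0] / 2u) / ticks[0]);
}
```

Indexes 0 to 3 give 1000, 833, 583 and 583, matching the last column when read as thousandths.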
Analyze what happened.
void soft_delay(void) {
for (volatile uint32_t i=0; i<2000000; ++i) { }
}
Because this counts up AND is volatile, the compiler cannot do the usual trick of counting down to save an instruction (subs then bne rather than add, cmp, bcc).
-O0 code
soft_delay:
PUSH {r3,lr}      ; allocate space for i
MOV r0,#0         ; i = 0
STR r0,[sp,#0]    ; i = 0
B L6.14
L6.8:
LDR r0,[sp,#0]    ; read i from memory
ADD r0,r0,#1      ; increment i
STR r0,[sp,#0]    ; save i to memory
L6.14:
LDR r1,L6.24      ; read max value
LDR r0,[sp,#0]    ; read i from memory
CMP r0,r1         ; compare i and max value
BCC L6.8          ; branch if unsigned lower
POP {r3,pc}       ; return
I should have examined the code first: L6.24 should have been 2000000, not 2000000 - 1. You left this out of your question.
No optimization generally means just bang out the code in order as in the high level language.
r3 doesn't need to be preserved and neither does LR, but the variable is volatile so it needs space on the stack. The compiler chose to do it this way for this optimization level; pushing lr allows it to pop pc at the end.
push is a pseudo instruction for stm (stmdb), so 8 is subtracted from the stack pointer and then the registers are saved in order. If the sp was at 0x1008 then it changes to 0x1000 and writes r3 to 0x1000 and lr to 0x1004, so for the rest of this function sp+0 is 0x1000 in this example. The r3 in the push is used this way to allocate a location for the variable i.
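The address arithmetic of that push can be modelled in a few lines. This is just an illustration of the decrement-before store-multiple behaviour for this one instruction; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Result of modelling "push {r3, lr}" as a store-multiple, decrement-before. */
struct push2 {
    uint32_t new_sp;    /* sp after the push */
    uint32_t r3_addr;   /* where r3 (lowest-numbered register) is written */
    uint32_t lr_addr;   /* where lr is written */
};

struct push2 model_push_r3_lr(uint32_t sp)
{
    assert((sp & 3u) == 0);          /* stack pointer is word aligned */

    struct push2 p;
    p.new_sp  = sp - 8u;             /* two registers, 4 bytes each */
    p.r3_addr = p.new_sp;            /* lowest register at lowest address */
    p.lr_addr = p.new_sp + 4u;
    return p;
}
```

Starting from sp = 0x1008 this gives new_sp = 0x1000, r3 at 0x1000 and lr at 0x1004, so [sp,#0] is the slot that r3 reserved for i.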
-O1 version
soft_delay:
PUSH {r3,lr}      ; allocate space
MOV r0,#0         ; i = 0
STR r0,[sp,#0]    ; i = 0
LDR r0,L6.24      ; read max/test value
B L6.16
L6.10:
LDR r1,[sp,#0]    ; load i from memory
ADD r1,r1,#1      ; increment i
STR r1,[sp,#0]    ; save i to memory
L6.16:
LDR r1,[sp,#0]    ; read i from memory
CMP r1,r0         ; compare i with test value
BCC L6.10         ; branch if unsigned lower
POP {r3,pc}       ; return
The primary difference between -O0 and -O1 in this case is the -O0 version reads the max value every time through the loop. The -O1 version reads it outside the loop one time.
-O0
08000208 <L6.8>:
8000208: 9800 ldr r0, [sp, #0]
800020a: 3001 adds r0, #1
800020c: 9000 str r0, [sp, #0]
800020e: 4902 ldr r1, [pc, #8] ; (8000218 <L6.24>)
8000210: 9800 ldr r0, [sp, #0]
8000212: 4288 cmp r0, r1
8000214: d3f8 bcc.n 8000208 <L6.8>
1200011 / 100000 = 12
The bulk of the time is in the above loop: 7 instructions, three loads, one store. That is 11 things in 12 clocks, so perhaps it's roughly one clock per instruction plus one more for each memory access.
-O1 code
0800020a <L6.10>:
800020a: 9900 ldr r1, [sp, #0]
800020c: 3101 adds r1, #1
800020e: 9100 str r1, [sp, #0]
08000210 <L6.16>:
8000210: 9900 ldr r1, [sp, #0]
8000212: 4281 cmp r1, r0
8000214: d3f9 bcc.n 800020a <L6.10>
1000014 / 100000 = 10
0800020a <L6.10>:
800020a: 9900 ldr r1, [sp, #0]
800020c: 3101 adds r1, #1
800020e: 9100 str r1, [sp, #0]
8000210: 9900 ldr r1, [sp, #0]
8000212: 4281 cmp r1, r0
8000214: d3f9 bcc.n 800020a <L6.10>
6 instructions, two loads, one store: 9 things, 10 clocks. The difference here from -O0 is that the compare value is read before/outside the loop, so that saves that instruction and that memory cycle.
-O2 code
08000208 <L4.8>:
8000208: 9900 ldr r1, [sp, #0]
800020a: 3101 adds r1, #1
800020c: 9100 str r1, [sp, #0]
800020e: 4281 cmp r1, r0
8000210: d3fa bcc.n 8000208 <L4.8>
700005 / 100000 = 7 ticks per loop
So by some folks definition, this isn't honoring the volatile, or is it? The compare value is outside the loop and the way this is written it should be 2000000 + 1, yes? It reads i from memory one time per loop rather than twice but does store it every time through the loop with the new value. Basically it removed the second load and that saved some time waiting on that read to finish.
-O3 code
08000208 <L5.8>:
8000208: 9900 ldr r1, [sp, #0]
800020a: 3101 adds r1, #1
800020c: 9100 str r1, [sp, #0]
800020e: 4281 cmp r1, r0
8000210: d3fa bcc.n 8000208 <L5.8>
The inner loop is the same as -O2.
-O2 does this
08000200 <soft_delay>:
8000200: b081 sub sp, #4
8000202: 2000 movs r0, #0
8000204: 9000 str r0, [sp, #0]
8000206: 4804 ldr r0, [pc, #16] ; (8000218 <L4.24>)
...
8000212: b001 add sp, #4
8000214: 4770 bx lr
-O3 does this
08000200 <soft_delay>:
8000200: b508 push {r3, lr}
8000202: 2000 movs r0, #0
8000204: 9000 str r0, [sp, #0]
8000206: 4803 ldr r0, [pc, #12] ; (8000214 <L5.20>)
8000212: bd08 pop {r3, pc}
Now that is fewer instructions, yes, but the push and pop take longer because they have memory cycle overhead; the subtract and add of the stack pointer are faster than those memory cycles even though there are more instructions. So the subtle difference in time is the push/pop outside the loop.
Now for GCC (9.2.0)
For starters I don't know if Keil was targeted at thumb in general (all variants), the cortex-ms, or the cortex-m3 specifically.
First -O0 code:
-O0
soft_delay:
push {r7, lr}
sub sp, sp, #8
add r7, sp, #0
movs r3, #0
str r3, [r7, #4]
b .L2
.L3:
ldr r3, [r7, #4]
adds r3, r3, #1
str r3, [r7, #4]
.L2:
ldr r3, [r7, #4]
ldr r2, .L4
cmp r3, r2
bls .L3
nop
nop
mov sp, r7
add sp, sp, #8
# sp needed
pop {r7}
pop {r0}
bx r0
.L5:
.align 2
.L4:
.word 199999
08000200 <soft_delay>:
8000200: b580 push {r7, lr}
8000202: b082 sub sp, #8
8000204: af00 add r7, sp, #0
8000206: 2300 movs r3, #0
8000208: 607b str r3, [r7, #4]
800020a: e002 b.n 8000212 <soft_delay+0x12>
800020c: 687b ldr r3, [r7, #4]
800020e: 3301 adds r3, #1
8000210: 607b str r3, [r7, #4]
8000212: 687b ldr r3, [r7, #4]
8000214: 4a04 ldr r2, [pc, #16] ; (8000228 <soft_delay+0x28>)
8000216: 4293 cmp r3, r2
8000218: d9f8 bls.n 800020c <soft_delay+0xc>
800021a: 46c0 nop ; (mov r8, r8)
800021c: 46c0 nop ; (mov r8, r8)
800021e: 46bd mov sp, r7
8000220: b002 add sp, #8
8000222: bc80 pop {r7}
8000224: bc01 pop {r0}
8000226: 4700 bx r0
8000228: 00030d3f andeq r0, r3, pc, lsr sp
00124F9F
Immediately we see two things: first the stack frame, which Keil was not building, and second these mystery nops after the compare, which have gotta be some chip errata or something; need to look that up. From my other answer, which may be deleted by now, gcc 5.4.0 put one nop, gcc 9.2.0 put two. So this loop has
1200031 / 100000 = 12 ticks per loop
800020c: 687b ldr r3, [r7, #4]
800020e: 3301 adds r3, #1
8000210: 607b str r3, [r7, #4]
8000212: 687b ldr r3, [r7, #4]
8000214: 4a04 ldr r2, [pc, #16] ; (8000228 <soft_delay+0x28>)
8000216: 4293 cmp r3, r2
8000218: d9f8 bls.n 800020c <soft_delay+0xc>
The main loop where this code spends its time is also 12 ticks, like Keil; it's the same code, just different registers, which doesn't matter. The subtle overall time difference is that the stack frame and the extra nops make the gcc version slightly longer.
arm-none-eabi-gcc -O0 -fomit-frame-pointer -c -mthumb -mcpu=cortex-m0 hello.c -o hello.o
arm-none-eabi-objdump -D hello.o > hello.list
arm-none-eabi-gcc -O0 -fomit-frame-pointer -S -mthumb -mcpu=cortex-m0 hello.c
If I build without a frame pointer then gcc -O0 becomes
soft_delay:
sub sp, sp, #8
movs r3, #0
str r3, [sp, #4]
b .L2
.L3:
ldr r3, [sp, #4]
adds r3, r3, #1
str r3, [sp, #4]
.L2:
ldr r3, [sp, #4]
ldr r2, .L4
cmp r3, r2
bls .L3
nop
nop
add sp, sp, #8
bx lr
.L5:
.align 2
.L4:
.word 99999
08000200 <soft_delay>:
8000200: b082 sub sp, #8
8000202: 2300 movs r3, #0
8000204: 9301 str r3, [sp, #4]
8000206: e002 b.n 800020e <soft_delay+0xe>
8000208: 9b01 ldr r3, [sp, #4]
800020a: 3301 adds r3, #1
800020c: 9301 str r3, [sp, #4]
800020e: 9b01 ldr r3, [sp, #4]
8000210: 4a03 ldr r2, [pc, #12] ; (8000220 <soft_delay+0x20>)
8000212: 4293 cmp r3, r2
8000214: d9f8 bls.n 8000208 <soft_delay+0x8>
8000216: 46c0 nop ; (mov r8, r8)
8000218: 46c0 nop ; (mov r8, r8)
800021a: b002 add sp, #8
800021c: 4770 bx lr
800021e: 46c0 nop ; (mov r8, r8)
8000220: 0001869f
00124F94
and saves 11 clocks over the other gcc version. Unlike Keil, gcc is not doing the push/pop thing, so it saves some clocks over Keil, but the nops don't help.
Update: I had the wrong number of loops for Keil because it used unsigned lower instead of unsigned lower or same as with gcc. To even the playing field, remove the nops and fix the loop count: gcc is 00124F92 and Keil 00124F97, 5 clocks slower due to the push/pop vs sp math. gcc 5.4.0 also does the sp math thing; with the nop it is 00124F93. Being outside the loop, these differences, while measurable, are in the noise when comparing these two (three) compilers.
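The off-by-one between the two branch conditions is easy to check in C. A small sketch, with made-up function names, of what "unsigned lower" (bcc) and "unsigned lower or same" (bls) do to the iteration count for the same literal:

```c
#include <stdint.h>

/* Iterations executed by "for (i = 0; i < limit; ++i)"  (cmp + bcc). */
uint32_t iterations_lower(uint32_t limit)
{
    uint32_t n = 0;
    for (uint32_t i = 0; i < limit; ++i)
        ++n;
    return n;
}

/* Iterations executed by "for (i = 0; i <= limit; ++i)"  (cmp + bls).
 * Do not pass UINT32_MAX here, the loop would never end. */
uint32_t iterations_lower_or_same(uint32_t limit)
{
    uint32_t n = 0;
    for (uint32_t i = 0; i <= limit; ++i)
        ++n;
    return n;
}
```

With the literal 99999, the bcc form runs 99999 times and the bls form runs 100000 times, so the bcc builds need 100000 in the literal pool to do the same amount of work.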
gcc -O1
soft_delay:
sub sp, sp, #8
mov r3, #0
str r3, [sp, #4]
ldr r2, [sp, #4]
ldr r3, .L5
cmp r2, r3
bhi .L1
mov r2, r3
.L3:
ldr r3, [sp, #4]
add r3, r3, #1
str r3, [sp, #4]
ldr r3, [sp, #4]
cmp r3, r2
bls .L3
.L1:
add sp, sp, #8
bx lr
.L6:
.align 2
.L5:
.word 99999
08000200 <soft_delay>:
8000200: b082 sub sp, #8
8000202: 2300 movs r3, #0
8000204: 9301 str r3, [sp, #4]
8000206: 9a01 ldr r2, [sp, #4]
8000208: 4b05 ldr r3, [pc, #20] ; (8000220 <soft_delay+0x20>)
800020a: 429a cmp r2, r3
800020c: d806 bhi.n 800021c <soft_delay+0x1c>
800020e: 1c1a adds r2, r3, #0
8000210: 9b01 ldr r3, [sp, #4]
8000212: 3301 adds r3, #1
8000214: 9301 str r3, [sp, #4]
8000216: 9b01 ldr r3, [sp, #4]
8000218: 4293 cmp r3, r2
800021a: d9f9 bls.n 8000210 <soft_delay+0x10>
800021c: b002 add sp, #8
800021e: 4770 bx lr
8000220: 0001869f muleq r1, pc, r6 ; <UNPREDICTABLE>
000F4251
10 ticks per loop
8000210: 9b01 ldr r3, [sp, #4]
8000212: 3301 adds r3, #1
8000214: 9301 str r3, [sp, #4]
8000216: 9b01 ldr r3, [sp, #4]
8000218: 4293 cmp r3, r2
800021a: d9f9 bls.n 8000210 <soft_delay+0x10>
Same as Keil, the load of the compare value is now outside the loop, saving a little per loop. It was architected a little differently. And I believe the nops after the bls are something else. I just saw someone asking why gcc did something that another compiler didn't, what seemed to be an extra instruction. I would use the term missed optimization rather than bug, but either way this one doesn't have the nops...
gcc -O2 code
soft_delay:
mov r3, #0
sub sp, sp, #8
str r3, [sp, #4]
ldr r3, [sp, #4]
ldr r2, .L7
cmp r3, r2
bhi .L1
.L3:
ldr r3, [sp, #4]
add r3, r3, #1
str r3, [sp, #4]
ldr r3, [sp, #4]
cmp r3, r2
bls .L3
.L1:
add sp, sp, #8
bx lr
.L8:
.align 2
.L7:
.word 99999
08000200 <soft_delay>:
8000200: 2300 movs r3, #0
8000202: b082 sub sp, #8
8000204: 9301 str r3, [sp, #4]
8000206: 9b01 ldr r3, [sp, #4]
8000208: 4a05 ldr r2, [pc, #20] ; (8000220 <soft_delay+0x20>)
800020a: 4293 cmp r3, r2
800020c: d805 bhi.n 800021a <soft_delay+0x1a>
800020e: 9b01 ldr r3, [sp, #4]
8000210: 3301 adds r3, #1
8000212: 9301 str r3, [sp, #4]
8000214: 9b01 ldr r3, [sp, #4]
8000216: 4293 cmp r3, r2
8000218: d9f9 bls.n 800020e <soft_delay+0xe>
800021a: b002 add sp, #8
800021c: 4770 bx lr
800021e: 46c0 nop ; (mov r8, r8)
8000220: 0001869f
000F4251
No difference from -O1
800020e: 9b01 ldr r3, [sp, #4]
8000210: 3301 adds r3, #1
8000212: 9301 str r3, [sp, #4]
8000214: 9b01 ldr r3, [sp, #4]
8000216: 4293 cmp r3, r2
8000218: d9f9 bls.n 800020e <soft_delay+0xe>
gcc is not willing to take that second load out of the loop.
At the -O2 level Keil is 700005 ticks and gcc 1000017, about 43 percent more/slower.
gcc -O3 produced the same code as -O2.
So the key difference here is perhaps an interpretation of what volatile does, and there are some folks at SO that get upset about its use anyway, but let's just assume that it means everything you do with the variable needs to go to/from memory.
From what I normally see that means this
.L3:
ldr r3, [sp, #4]
add r3, r3, #1
str r3, [sp, #4]
ldr r3, [sp, #4]
cmp r3, r2
bls .L3
not this
.L3:
ldr r3, [sp, #4]
add r3, r3, #1
str r3, [sp, #4]
cmp r3, r2
bls .L3
Is that a Keil bug? Do you want to use your over-optimization term here?
There are two operations an increment
ldr r3, [sp, #4]
add r3, r3, #1
str r3, [sp, #4]
and a compare
ldr r3, [sp, #4]
cmp r3, r2
bls .L3
arguably each should access the variable from memory not from a register. (in a pure debug version sense you should see code like this too btw, although the tool defines what it means by debug version)
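Under that reading, where every use of i in the source goes to memory, the traffic can be counted by walking the loop by hand. This is only a model of the C source, not of either compiler, and the names are made up for the example.

```c
#include <stdint.h>

/* Counters standing in for real memory traffic. */
struct access_count {
    uint32_t loads;
    uint32_t stores;
};

/* Walk "for (volatile uint32_t i = 0; i < limit; ++i) { }" by hand,
 * counting one load for every read of i and one store for every write. */
struct access_count count_volatile_accesses(uint32_t limit)
{
    struct access_count c = { 0, 0 };
    uint32_t i_mem;              /* models the memory cell holding i */

    i_mem = 0;  c.stores++;      /* i = 0 */
    for (;;) {
        uint32_t t = i_mem;  c.loads++;      /* read i for the compare */
        if (!(t < limit))
            break;
        t = i_mem;           c.loads++;      /* read i for the increment */
        i_mem = t + 1u;      c.stores++;     /* write i back */
    }
    return c;
}
```

For a limit of n this gives 2n + 1 loads and n + 1 stores, which is the two-loads-per-loop shape gcc produces; a loop that reuses the incremented register for the compare does one load per iteration.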
When you figure out which gcc you have and how it was used, there may turn out to be even more code on the gcc side, which would account for it being 100% slower rather than 40%.
I don't know that you could make this any tighter, I don't think re-arranging instructions will improve performance either.
Also, this was a missed optimization in gcc:
cmp r3, r2
bhi .L1
gcc knew that it was starting from zero and knew it was going to a bigger number so r3 would never be larger than r2 here.
We wish for the tool to make this:
soft_delay:
mov r3, #0
ldr r2, .L7
.L3:
add r3, r3, #1
cmp r3, r2
bls .L3
.L1:
bx lr
.L8:
.align 2
.L7:
.word 99999
00061A88
at 4 ticks per loop on average
but without the volatile it is dead code so the optimizer would simply remove it rather than make this code. A down count loop would be slightly smaller
soft_delay:
ldr r2, .L7
.L3:
sub r2, r2, #1
bne .L3
.L1:
bx lr
.L8:
.align 2
.L7:
.word 100000
000493E7
3 ticks per loop, removing the extra instruction helped.
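In C, a down-counting delay that the optimizer will not throw away could be sketched as below, using a call in place of volatile. The names are hypothetical, and dummy is defined in the same block only so the sketch is self-contained; in a real build it would live in a separately compiled file so the compiler cannot see that the call does nothing useful. Whether the compiler then emits a subs/bne pair is up to the compiler.

```c
#include <stdint.h>

/* Stand-in for a function that would normally live in a separately
 * compiled file; here it just counts how often it was called. */
static uint32_t dummy_calls;
void dummy(uint32_t value)
{
    (void)value;
    ++dummy_calls;
}

/* Count down to zero: the loop test is just "did the count reach zero". */
void soft_delay_down(uint32_t count)
{
    while (count != 0) {
        --count;
        dummy(count);
    }
}
```

Calling soft_delay_down(100000) makes exactly 100000 calls, and the call overhead becomes part of each loop's cost.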
Keil over-optimization (unlikely, because the code is very simple)
You might actually be right here, not because it is simple, but because of what volatile really means and whether it is subject to interpretation by the compilers (I would have to find a spec). Is this a Keil bug, did it over-optimize?
There still isn't such a thing as over-optimization; there is a name for that, a compiler bug. So either Keil interpreted this wrong, or Keil and gcc disagree on the interpretation of volatile.
arm-none-eabi-gcc under-optimization due to wrong compiler flags (I use CLion Embedded plugins` CMakeLists.txt)
This could be it as well, for the same reason. Is this simply an "implementation defined" difference between compilers and both are right based on their definition?
Now gcc did miss an optimization here (or two), but it accounts for a small amount as it is outside the loop.

GCC ||| KEIL
|||
soft_delay: |||
mov r3, #0 |||
sub sp, sp, #8 |||
str r3, [sp, #4] |||
ldr r3, [sp, #4] |||
ldr r2, .L7 |||
cmp r3, r2 |||
bhi .L1 ||| soft_delay PROC
.L3: ||| PUSH {r3,lr}
ldr r3, [sp, #4] ||| MOVS r0,#0
add r3, r3, #1 ||| STR r0,[sp,#0]
str r3, [sp, #4] ||| LDR r0,|L5.20|
ldr r3, [sp, #4] ||| |L5.8|
cmp r3, r2 ||| LDR r1,[sp,#0]
bls .L3 ||| ADDS r1,r1,#1
.L1: ||| STR r1,[sp,#0]
add sp, sp, #8 ||| CMP r1,r0
bx lr ||| BCC |L5.8|
.L7: ||| POP {r3,pc}
.word 1999999 ||| ENDP
There is an obvious bug in Keil. volatile means that the value has to be loaded before every use and saved whenever it changes. Keil is missing one load.
The variable is used twice per iteration: once when it is incremented and once when it is compared. Two loads are needed.

What am I doing wrong when compiling C code to bare metal (raspberry pi)?

I have spent multiple days trying to figure this out and I just can't. I have some C code. I have made the assembly code for this C program, copy pasted the assembly into someone else's project (that only contains a single assembly file) and assembled that. In this case things work. But if I try to compile from C directly to generate the binaries, it doesn't work, even though everything else should be identical. This is my C code:
#include <stdint.h>
#define REGISTERS_BASE 0x3F000000
#define MAIL_BASE 0xB880 // Base address for the mailbox registers
// This bit is set in the status register if there is no space to write into the mailbox
#define MAIL_FULL 0x80000000
// This bit is set in the status register if there is nothing to read from the mailbox
#define MAIL_EMPTY 0x40000000
struct Message
{
uint32_t messageSize;
uint32_t requestCode;
uint32_t tagID;
uint32_t bufferSize;
uint32_t requestSize;
uint32_t pinNum;
uint32_t on_off_switch;
uint32_t end;
};
struct Message m =
{
.messageSize = sizeof(struct Message),
.requestCode =0,
.tagID = 0x00038041,
.bufferSize = 8,
.requestSize =0,
.pinNum = 130,
.on_off_switch = 1,
.end = 0,
};
/** Main function - we'll never return from here */
int _start(void)
{
uint32_t mailbox = MAIL_BASE + REGISTERS_BASE + 0x18;
volatile uint32_t status;
do
{
status = *(volatile uint32_t *)(mailbox);
}
while((status & 0x80000000));
*(volatile uint32_t *)(MAIL_BASE + REGISTERS_BASE + 0x20) = ((uint32_t)(&m) & 0xfffffff0) | (uint32_t)(8);
while(1);
}
This is a linker file I copied from the successful method:
/*
* Very simple linker script, combining the text and data sections
* and putting them starting at address 0x8000.
*/
SECTIONS {
/* Put the code at 0x8000, leaving room for ARM and
* the stack. It also conforms to the standard expectations.
*/
.init 0x8000 : {
*(.init)
}
.text : {
*(.text)
}
/* Put the data after the code */
.data : {
*(.data)
}
}
And this is how I am compiling and linking everything:
arm-none-eabi-gcc -O0 -march=armv8-a PiTest.c -nostartfiles -o kernel.o
arm-none-eabi-ld kernel.o -o kernel.elf -T kernel.ld
arm-none-eabi-objcopy kernel.elf -O binary kernel.img
My target architecture is armv8 since that's what the pi model 3 uses.
I have no idea why the generated assembly works but the C code compiled directly does not. Please help, I am on the verge of madness.
EDIT: The expected behaviour is for the pi's light to turn on, which it does with the first method I described. With the second method the light remains off.
EDIT4: Made some changes to files, deleted previous edits with outdated info to reduce post size
kernel.elf: file format elf32-littlearm
Disassembly of section .init:
00008000 <_start>:
8000: e3a0dd7d mov sp, #8000 ; 0x1f40
8004: eaffffff b 8008 <kernel_main>
Disassembly of section .text:
00008008 <kernel_main>:
8008: e52db004 push {fp} ; (str fp, [sp, #-4]!)
800c: e28db000 add fp, sp, #0
8010: e24dd00c sub sp, sp, #12
8014: e30b3898 movw r3, #47256 ; 0xb898
8018: e3433f00 movt r3, #16128 ; 0x3f00
801c: e50b3008 str r3, [fp, #-8]
8020: e51b3008 ldr r3, [fp, #-8]
8024: e5933000 ldr r3, [r3]
8028: e50b300c str r3, [fp, #-12]
802c: e51b300c ldr r3, [fp, #-12]
8030: e3530000 cmp r3, #0
8034: bafffff9 blt 8020 <kernel_main+0x18>
8038: e30b38a0 movw r3, #47264 ; 0xb8a0
803c: e3433f00 movt r3, #16128 ; 0x3f00
8040: e3082050 movw r2, #32848 ; 0x8050
8044: e3402001 movt r2, #1
8048: e3c2200f bic r2, r2, #15
804c: e3822008 orr r2, r2, #8
8050: e5832000 str r2, [r3]
8054: eafffffe b 8054 <kernel_main+0x4c>
Disassembly of section .data:
00008058 <__data_start>:
8058: 00000020 andeq r0, r0, r0, lsr #32
805c: 00000000 andeq r0, r0, r0
8060: 00038041 andeq r8, r3, r1, asr #32
8064: 00000008 andeq r0, r0, r8
8068: 00000000 andeq r0, r0, r0
806c: 00000082 andeq r0, r0, r2, lsl #1
8070: 00000001 andeq r0, r0, r1
8074: 00000000 andeq r0, r0, r0
Disassembly of section .ARM.attributes:
00000000 <_stack-0x80021>:
0: 00002e41 andeq r2, r0, r1, asr #28
4: 61656100 cmnvs r5, r0, lsl #2
8: 01006962 tsteq r0, r2, ror #18
c: 00000024 andeq r0, r0, r4, lsr #32
10: 412d3805 ; <UNDEFINED> instruction: 0x412d3805
14: 070e0600 streq r0, [lr, -r0, lsl #12]
18: 09010841 stmdbeq r1, {r0, r6, fp}
1c: 14041202 strne r1, [r4], #-514 ; 0xfffffdfe
20: 17011501 strne r1, [r1, -r1, lsl #10]
24: 1a011803 bne 46038 <__bss_end__+0x3dfc0>
28: 2a012201 bcs 48834 <__bss_end__+0x407bc>
2c: Address 0x000000000000002c is out of bounds.
Disassembly of section .comment:
00000000 <.comment>:
0: 3a434347 bcc 10d0d24 <_stack+0x1050d03>
4: 35312820 ldrcc r2, [r1, #-2080]! ; 0xfffff7e0
8: 392e343a stmdbcc lr!, {r1, r3, r4, r5, sl, ip, sp}
c: 732b332e ; <UNDEFINED> instruction: 0x732b332e
10: 33326e76 teqcc r2, #1888 ; 0x760
14: 37373131 ; <UNDEFINED> instruction: 0x37373131
18: 2029312d eorcs r3, r9, sp, lsr #2
1c: 2e392e34 mrccs 14, 1, r2, cr9, cr4, {1}
20: 30322033 eorscc r2, r2, r3, lsr r0
24: 35303531 ldrcc r3, [r0, #-1329]! ; 0xfffffacf
28: 28203932 stmdacs r0!, {r1, r4, r5, r8, fp, ip, sp}
2c: 72657270 rsbvc r7, r5, #112, 4
30: 61656c65 cmnvs r5, r5, ror #24
34: 00296573 eoreq r6, r9, r3, ror r5

kernel8.img
12345678
00000800
00080264
00000000
12345678
kernel8-32.img
12345678
00008320
00008224
200001DA
12345678
kernel7.img
12345678
00000700
00008224
200001DA
12345678
kernel.img
12345678
00000000
00008224
200001DA
12345678
When I wrote and posted this code, this is what I got. So if you name your file kernel.img, then 0x8000 is your entry point. The answer I gave in your other SO question is a complete Raspberry Pi starting point. You can simply add your mailbox stuff, although if you are struggling with this, I think the mailbox and video are not where you should start, IMO.
If you name the file kernel8.img, then the entry point is 0x80000; change the linker script to match.
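The file-name rule above can be written down as a small lookup. This is only a sketch: pi_load_address is a hypothetical helper name, and it covers just the two cases stated in this answer, returning 0 for anything else.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: entry point the linker script should use,
 * based only on the two image names discussed in the answer. */
uint32_t pi_load_address(const char *image_name)
{
    if (strcmp(image_name, "kernel8.img") == 0)
        return 0x80000u;   /* kernel8.img is entered at 0x80000 */
    if (strcmp(image_name, "kernel.img") == 0)
        return 0x8000u;    /* kernel.img is entered at 0x8000 */
    return 0u;             /* not covered by the answer */
}
```

Whatever value applies to your file name is the address that the first output section of the linker script (or -Ttext) has to use.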
I have a serial port based bootloader you can use to save on the SD card dance. You can get a long way with that, then simply write the binary version of what you are creating to the flash once your application is working.
EDIT
Okay, this is incredibly disgusting, and by posting it here maybe that means you can't use it in your classwork... you should really do this right and not use inline assembly for your bootstrap...
so.c
asm(
".globl _start\n"
"_start:\n"
"mov sp,#0x8000\n"
"bl centry\n"
"b .\n"
);
unsigned int centry ( void )
{
return(5);
}
build
arm-none-eabi-gcc -O2 -c so.c -o so.o
arm-none-eabi-ld -Ttext=0x8000 so.o -o so.elf
arm-none-eabi-objdump -D so.elf > so.list
arm-none-eabi-objcopy so.elf -O binary kernel.img
examine
Disassembly of section .text:
00008000 <_start>:
8000: e3a0d902 mov sp, #32768 ; 0x8000
8004: eb000000 bl 800c <centry>
8008: eafffffe b 8008 <_start+0x8>
0000800c <centry>:
800c: e3a00005 mov r0, #5
8010: e12fff1e bx lr
A complete Raspberry Pi C with bootstrap example that will work on any of the flavors of Pi (so far as I know; they might have changed the GPU bootloader in the last few months, but assume they didn't).

There are a couple of things I see wrong here. The most obvious ones are:
You aren't leaving anything at address 0, so the CPU is left executing blank memory at startup. You need to put something (like a branch instruction!) at 0x0.
On ARM Cortex-A, the stack pointer is not initialized at startup. You have to initialize it yourself in _start -- which means you will need to write that function in assembly.

First, kudos to old timer for his patience helping me.
The mistakes were:
Wrong entry point for the program. Fixed by creating an assembly file with the label _start that sets the stack pointer, and using the linker to put the init section at address 0x8000.
The compilation line itself was also wrong: it was missing a -c argument.
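Separately from the two fixes above, the value written to the mailbox in the question packs the buffer address and the channel number into one word: the low 4 bits carry the channel (8 here), so the address only survives intact if the buffer is 16-byte aligned. A minimal sketch of that packing, with hypothetical helper names that are not part of the original code:

```c
#include <stdint.h>

/* Hypothetical helper: combine a buffer address and a channel number
 * the same way the question's code does. */
uint32_t mailbox_word(uint32_t buffer_addr, uint32_t channel)
{
    return (buffer_addr & 0xfffffff0u) | (channel & 0xfu);
}

/* Hypothetical helper: non-zero if the low 4 address bits are clear,
 * so masking them off does not change the address. */
int mailbox_buffer_is_aligned(uint32_t buffer_addr)
{
    return (buffer_addr & 0xfu) == 0;
}
```

In the .data listing from the question the struct starts at 0x8058, for which mailbox_buffer_is_aligned returns 0. If that applies to your build, it may be worth declaring the struct with GCC's __attribute__((aligned(16))) so that the masked address still points at the start of the message.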

How to make bare metal ARM programs and run them on QEMU?

I am trying to get this tutorial to work as intended without success (Something fails after the bl main instruction).
According to the tutorial the command
(qemu) xp /1dw 0xa0000018
should print 33 (but I get 0x00 instead):
a0000018: 33
This is the content of the registers after the main call (see startup.s)
(qemu) info registers
R00=a000001c R01=a000001c R02=00000006 R03=00000000
R04=00000000 R05=00000005 R06=00000006 R07=00000007
R08=00000008 R09=00000009 R10=00000000 R11=a3fffffc
R12=00000000 R13=00000000 R14=0000003c R15=00000004
PSR=800001db N--- A und32
FPSCR: 00000000
I have the following files
main.c
startup.s
lscript.ld
Makefile
And I am using the following toolchain
arm-2013.11-24-arm-none-eabi-i686-pc-linux-gnu
Makefile:
SRCS := main.c startup.s
LINKER_NAME := lscript.ld
ELF_NAME := program.elf
BIN_NAME := program.bin
FLASH_NAME := flash.bin
CC := arm-none-eabi
CFLAGS := -nostdlib
OBJFLAGS ?= -DS
QEMUFLAGS := -M connex -pflash $(FLASH_NAME) -nographic -serial /dev/null
# Allocate 16MB to use as a virtual flash for QEMU
# bs = blocksize -> 4KB
# count = number of blocks -> 4096
# totalsize = 16MB
setup:
dd if=/dev/zero of=$(FLASH_NAME) bs=4096 count=4096
# Compile srcs and write to virtual flash
all: clean setup
$(CC)-gcc $(CFLAGS) -o $(ELF_NAME) -T $(LINKER_NAME) $(SRCS)
$(CC)-objcopy -O binary $(ELF_NAME) $(BIN_NAME)
dd if=$(BIN_NAME) of=$(FLASH_NAME) bs=4096 conv=notrunc
objdump:
$(CC)-objdump $(OBJFLAGS) $(ELF_NAME)
mem-placement:
$(CC)-nm -n $(ELF_NAME)
qemu:
qemu-system-arm $(QEMUFLAGS)
clean:
rm -rf *.bin
rm -rf *.elf
main.c:
static int arr[] = { 1, 10, 4, 5, 6, 7 };
static int sum;
static const int n = sizeof(arr) / sizeof(arr[0]);
int main()
{
int i;
for (i = 0; i < n; i++){
sum += arr[i];
}
return 0;
}
startup.s:
.section "vectors"
reset: b _start
undef: b undef
swi: b swi
pabt: b pabt
dabt: b dabt
nop
irq: b irq
fiq: b fiq
.text
_start:
init:
## Copy data to RAM.
ldr r0, =flash_sdata
ldr r1, =ram_sdata
ldr r2, =data_size
## Handle data_size == 0
cmp r2, #0
beq init_bss
copy:
ldrb r4, [r0], #1
strb r4, [r1], #1
subs r2, r2, #1
bne copy
init_bss:
## Initialize .bss
ldr r0, =sbss
ldr r1, =ebss
ldr r2, =bss_size
## Handle bss_size == 0
cmp r2, #0
beq init_stack
mov r4, #0
zero:
strb r4, [r0], #1
subs r2, r2, #1
bne zero
init_stack:
## Initialize the stack pointer
ldr sp, =0xA4000000
## **This call doesn't work as expected (r13/sp contains 0xA4000000)**
bl main
## Doesn't return from main
## r0 should now contain 33
stop:
b stop
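For readers less used to ARM assembly, the copy and zero loops in startup.s above do roughly what the following C does. This is a sketch for illustration only: the function names copy_data and zero_bss are made up here, and the parameters stand in for the linker-script symbols of the same name.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-wise copy of the initialised data from its load address in flash
 * to its run address in RAM (the "copy" loop). */
void copy_data(const uint8_t *flash_sdata, uint8_t *ram_sdata, size_t data_size)
{
    while (data_size != 0) {          /* cmp r2, #0 / beq init_bss */
        *ram_sdata++ = *flash_sdata++; /* ldrb r4, [r0], #1 / strb r4, [r1], #1 */
        data_size--;                   /* subs r2, r2, #1 / bne copy */
    }
}

/* Byte-wise clearing of .bss (the "zero" loop). */
void zero_bss(uint8_t *sbss, size_t bss_size)
{
    while (bss_size != 0) {  /* cmp r2, #0 / beq init_stack */
        *sbss++ = 0;         /* strb r4, [r0], #1 with r4 == 0 */
        bss_size--;          /* subs r2, r2, #1 / bne zero */
    }
}
```

Both loops check the size before the first store, which is what the "Handle ... == 0" comments in the assembly are about.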
lscript.ld:
/*
* Linker for testing purposes
* (using 16 MB virtual flash = 0x0100_0000)
*/
MEMORY {
rom (rx) : ORIGIN = 0x00000000, LENGTH = 0x01000000
ram (rwx) : ORIGIN = 0xA0000000, LENGTH = 0x04000000
}
SECTIONS {
.text : {
* (vectors);
* (.text);
} > rom
.rodata : {
* (.rodata);
} > rom
flash_sdata = .;
ram_sdata = ORIGIN(ram);
.data : AT (flash_sdata) {
* (.data);
} > ram
ram_edata = .;
data_size = ram_edata - ram_sdata;
sbss = .;
.bss : {
* (.bss);
} > ram
ebss = .;
bss_size = ebss - sbss;
/DISCARD/ : {
*(.note*)
*(.comment)
*(.ARM*)
/*
*(.debug*)
*/
}
}
Disassembly of the executable (objdump):
program.elf: file format elf32-littlearm
Disassembly of section .text:
00000000 <reset>:
0: ea000023 b 94 <_start>
00000004 <undef>:
4: eafffffe b 4 <undef>
00000008 <swi>:
8: eafffffe b 8 <swi>
0000000c <pabt>:
c: eafffffe b c <pabt>
00000010 <dabt>:
10: eafffffe b 10 <dabt>
14: e320f000 nop {0}
00000018 <irq>:
18: eafffffe b 18 <irq>
0000001c <fiq>:
1c: eafffffe b 1c <fiq>
00000020 <main>:
20: e52db004 push {fp} ; (str fp, [sp, #-4]!)
24: e28db000 add fp, sp, #0
28: e24dd00c sub sp, sp, #12
2c: e3a03000 mov r3, #0
30: e50b3008 str r3, [fp, #-8]
34: ea00000d b 70 <main+0x50>
38: e3003000 movw r3, #0
3c: e34a3000 movt r3, #40960 ; 0xa000
40: e51b2008 ldr r2, [fp, #-8]
44: e7932102 ldr r2, [r3, r2, lsl #2]
48: e3003018 movw r3, #24
4c: e34a3000 movt r3, #40960 ; 0xa000
50: e5933000 ldr r3, [r3]
54: e0822003 add r2, r2, r3
58: e3003018 movw r3, #24
5c: e34a3000 movt r3, #40960 ; 0xa000
60: e5832000 str r2, [r3]
64: e51b3008 ldr r3, [fp, #-8]
68: e2833001 add r3, r3, #1
6c: e50b3008 str r3, [fp, #-8]
70: e3a02006 mov r2, #6
74: e51b3008 ldr r3, [fp, #-8]
78: e1530002 cmp r3, r2
7c: baffffed blt 38 <main+0x18>
80: e3a03000 mov r3, #0
84: e1a00003 mov r0, r3
88: e24bd000 sub sp, fp, #0
8c: e49db004 pop {fp} ; (ldr fp, [sp], #4)
90: e12fff1e bx lr
00000094 <_start>:
94: e59f004c ldr r0, [pc, #76] ; e8 <stop+0x4>
98: e59f104c ldr r1, [pc, #76] ; ec <stop+0x8>
9c: e59f204c ldr r2, [pc, #76] ; f0 <stop+0xc>
a0: e3520000 cmp r2, #0
a4: 0a000003 beq b8 <init_bss>
000000a8 <copy>:
a8: e4d04001 ldrb r4, [r0], #1
ac: e4c14001 strb r4, [r1], #1
b0: e2522001 subs r2, r2, #1
b4: 1afffffb bne a8 <copy>
000000b8 <init_bss>:
b8: e59f0034 ldr r0, [pc, #52] ; f4 <stop+0x10>
bc: e59f1034 ldr r1, [pc, #52] ; f8 <stop+0x14>
c0: e59f2034 ldr r2, [pc, #52] ; fc <stop+0x18>
c4: e3520000 cmp r2, #0
c8: 0a000003 beq dc <init_stack>
cc: e3a04000 mov r4, #0
000000d0 <zero>:
d0: e4c04001 strb r4, [r0], #1
d4: e2522001 subs r2, r2, #1
d8: 1afffffc bne d0 <zero>
000000dc <init_stack>:
dc: e3a0d329 mov sp, #-1543503872 ; 0xa4000000
e0: ebffffce bl 20 <main>
000000e4 <stop>:
e4: eafffffe b e4 <stop>
e8: 00000104 andeq r0, r0, r4, lsl #2
ec: a0000000 andge r0, r0, r0
f0: 00000018 andeq r0, r0, r8, lsl r0
f4: a0000018 andge r0, r0, r8, lsl r0
f8: a000001c andge r0, r0, ip, lsl r0
fc: 00000004 andeq r0, r0, r4
Disassembly of section .rodata:
00000100 <n>:
100: 00000006 andeq r0, r0, r6
Disassembly of section .data:
a0000000 <arr>:
a0000000: 00000001 andeq r0, r0, r1
a0000004: 0000000a andeq r0, r0, sl
a0000008: 00000004 andeq r0, r0, r4
a000000c: 00000005 andeq r0, r0, r5
a0000010: 00000006 andeq r0, r0, r6
a0000014: 00000007 andeq r0, r0, r7
Disassembly of section .bss:
a0000018 <sum>:
a0000018: 00000000 andeq r0, r0, r0
Can someone point me in the right direction to why this isn't working according to my expectations?
Thanks Henrik

Minimal examples that just work
https://github.com/cirosantilli/linux-kernel-module-cheat/tree/54e15e04338c0fecc0be139a0da2d0d972c21419#baremetal-setup-getting-started
The prompt.c example takes input from your host terminal and gives back output all through the simulated UART:
enter a character
got: a
new alloc of 1 bytes at address 0x0x4000a1c0
enter a character
got: b
new alloc of 2 bytes at address 0x0x4000a1c0
enter a character
It uses Newlib to expose a subset of the C standard library. This allows you to run existing programs written in C if they only use that restricted subset of the C standard library.
More details about Newlib at: https://electronics.stackexchange.com/questions/223929/c-standard-libraries-on-bare-metal/400077#400077
https://github.com/freedomtan/aarch64-bare-metal-qemu/tree/2ae937a2b106b43bfca49eec49359b3e30eac1b1 for -M virt, just the hello world on the repo. Compile with:
sudo apt-get install gcc-aarch64-linux-gnu
make CROSS_PREFIX=aarch64-linux-gnu-
Here is the example minimized to printing a single character from assembly: How to run a bare metal ELF file on QEMU?
https://github.com/bztsrc/raspi3-tutorial for -M raspi3. Quick getting started at: https://raspberrypi.stackexchange.com/questions/34733/how-to-do-qemu-emulation-for-bare-metal-raspberry-pi-images/85135#85135 Several other examples on the repo going to more advanced subjects.
Also does display output on 09_framebuffer.
Both write a hello world to the UART.
Tested in Ubuntu 18.04, gcc-aarch64-linux-gnu version 4:7.3.0-3ubuntu2.

Debugging!
First, look at the PC and PSR: You're in Undef mode, in the undefined instruction handler.
OK, in an exception mode, the LR tells you where you took the exception. There are some slightly complicated rules between the PC offset and the preferred return address determining exactly what it points at, but just eyeballing it it's clearly in the vicinity of the movw/movt pair.
The movw instruction effectively only exists in the ARMv7 ISA onwards. A brief investigation tells me the machine you're emulating is some old PXA255 thing, whose CPU only implements the ARMv5 ISA. Thus it's not surprising it faults on an instruction that it predates by many years.
Your compiler is apparently configured to target ARMv7 by default (which is not uncommon), so you need to add at least -march=armv5te to your CFLAGS to target the appropriate architecture version. The 'advanced' challenge would be to switch to a different, newer, machine, but that's going to involve adapting the linker script to a new memory map and rewriting any hardware-touching code for new peripherals, so I'd save that idea for the longer term, once you're comfortable with the basics of bare-metal code and slogging through hardware reference manuals.
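The reasoning above can be reproduced mechanically. The sketch below uses made-up helper names and assumes ARM (not Thumb) state, where an undefined-instruction exception leaves LR_und pointing 4 bytes past the faulting instruction. It decodes the mode field of the PSR, derives the faulting address from LR, and checks whether an instruction word is a MOVW or MOVT.

```c
#include <stdint.h>

/* Name of the processor mode encoded in PSR bits M[4:0]. */
const char *arm_mode_name(uint32_t psr)
{
    switch (psr & 0x1fu) {
    case 0x10: return "usr";
    case 0x11: return "fiq";
    case 0x12: return "irq";
    case 0x13: return "svc";
    case 0x17: return "abt";
    case 0x1b: return "und";
    case 0x1f: return "sys";
    default:   return "unknown";
    }
}

/* Address of the faulting instruction for an undefined-instruction
 * exception taken from ARM state (assumption: not Thumb). */
uint32_t undef_fault_addr_arm(uint32_t lr_und)
{
    return lr_und - 4u;
}

/* Non-zero if the 32-bit ARM instruction word is MOVW or MOVT.
 * Bits 27:20 are 0x30 for MOVW and 0x34 for MOVT; the unconditional
 * (0xF) condition space is excluded. */
int is_movw_or_movt(uint32_t insn)
{
    uint32_t op = insn & 0x0ff00000u;
    if ((insn >> 28) == 0xfu)
        return 0;
    return op == 0x03000000u || op == 0x03400000u;
}
```

With the register dump from the question (PSR=800001db, R14=0000003c), arm_mode_name returns "und" and undef_fault_addr_arm returns 0x38. The word at 0x38 in the disassembly is e3003000 (movw r3, #0), and is_movw_or_movt reports it as such.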

For the same code on my Ubuntu machine I got:
arm-none-eabi-gcc -nostdlib -o sum.elf sum.lds startup.s -w
/usr/lib/gcc/arm-none-eabi/4.9.3/../../../arm-none-eabi/bin/ld: warning: cannot find entry symbol _start; defaulting to 00000000
/tmp/ccBthV7t.o: In function `init_stack':
(.text+0x4c): undefined reference to `main'
collect2: error: ld returned 1 exit status

Issues with ARMv7-A bare metal call stack [duplicate]

This question already has an answer here:
Rustc/LLVM generates faulty code for aarch64 with opt-level=0
(1 answer)
Closed 7 years ago.
I'm trying to get a small ARM kernel up and running on QEMU (Versatile Express for Cortex-A15). Currently it simply sets sp to the top of a small stack and sends a single character to UART0.
_start.arm:
.set stack_size, 0x10000
.comm stack, stack_size
.global _start
_start:
ldr sp, =stack+stack_size
bl start
1:
b 1b
.size _start, . - _start
start.c:
/* UART_0 is a struct overlaid on 0x1c090000 */
void printChar(char c)
{
while (UART_0->flags & TRANSMIT_FULL);
UART_0->data = c;
}
void start()
{
while (UART_0->flags & TRANSMIT_FULL);
UART_0->data = 'A';
printChar('a');
}
From GDB, I know that execution progresses through _start into start and successfully sends 'A' to UART_0. printChar gets called and completes, but doesn't seem to print anything to the serial port. When running without GDB, the kernel repeatedly prints 'A', though I'm not sure if this is the processor resetting or jumping incorrectly.
From objdump:
Disassembly of section .stub:
00010000 <_start>:
10000: e59fd004 ldr sp, [pc, #4] ; 1000c <__STACK_SIZE+0xc>
10004: eb000016 bl 10064 <start>
10008: eafffffe b 10008 <_start+0x8>
1000c: 000200d0 .word 0x000200d0
Disassembly of section .text:
00010010 <printChar>:
10010: e52db004 push {fp} ; (str fp, [sp, #-4]!)
10014: e28db000 add fp, sp, #0
10018: e24dd00c sub sp, sp, #12
1001c: e1a03000 mov r3, r0
10020: e54b3005 strb r3, [fp, #-5]
10024: e1a00000 nop ; (mov r0, r0)
10028: e3a03000 mov r3, #0
1002c: e3413c09 movt r3, #7177 ; 0x1c09
10030: e1d331ba ldrh r3, [r3, #26]
10034: e6ff3073 uxth r3, r3
10038: e2033020 and r3, r3, #32
1003c: e3530000 cmp r3, #0
10040: 1afffff8 bne 10028 <printChar+0x18>
10044: e3a03000 mov r3, #0
10048: e3413c09 movt r3, #7177 ; 0x1c09
1004c: e55b2005 ldrb r2, [fp, #-5]
10050: e6ff2072 uxth r2, r2
10054: e1c320b2 strh r2, [r3, #2]
10058: e24bd000 sub sp, fp, #0
1005c: e49db004 pop {fp} ; (ldr fp, [sp], #4)
10060: e12fff1e bx lr
00010064 <start>:
10064: e52db008 str fp, [sp, #-8]!
10068: e58de004 str lr, [sp, #4]
1006c: e28db004 add fp, sp, #4
10070: e1a00000 nop ; (mov r0, r0)
10074: e3a03000 mov r3, #0
10078: e3413c09 movt r3, #7177 ; 0x1c09
1007c: e1d331ba ldrh r3, [r3, #26]
10080: e6ff3073 uxth r3, r3
10084: e2033020 and r3, r3, #32
10088: e3530000 cmp r3, #0
1008c: 1afffff8 bne 10074 <start+0x10>
10090: e3a03000 mov r3, #0
10094: e3413c09 movt r3, #7177 ; 0x1c09
10098: e5d32002 ldrb r2, [r3, #2]
1009c: e3a02000 mov r2, #0
100a0: e3822041 orr r2, r2, #65 ; 0x41
100a4: e5c32002 strb r2, [r3, #2]
100a8: e5d32003 ldrb r2, [r3, #3]
100ac: e3a02000 mov r2, #0
100b0: e5c32003 strb r2, [r3, #3]
100b4: e3a00061 mov r0, #97 ; 0x61
100b8: ebffffd4 bl 10010 <printChar>
100bc: e24bd004 sub sp, fp, #4
100c0: e59db000 ldr fp, [sp]
100c4: e28dd004 add sp, sp, #4
100c8: e49df004 pop {pc} ; (ldr pc, [sp], #4)
000100cc <UART_0>:
100cc: 1c090000 ....

I may have missed something, but I am not seeing where you have enabled interrupts, or where you poll to see if you can send the next character. If you have enabled the interrupts and set up the UART hardware correctly, your driver may have a bug. If you have not set up the UART hardware correctly, it may not be generating interrupts, or it may not be handling the FIFO correctly, or any number of other problems.
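For reference, polling without interrupts usually looks like the sketch below. It assumes a PL011-style register layout (data register at offset 0x00, flag register at offset 0x18, transmit-FIFO-full in bit 5); the struct and function names are hypothetical and not taken from the question's code.

```c
#include <stdint.h>

/* Assumed PL011-style layout: only the two registers used here are named. */
struct uart_regs {
    volatile uint32_t data;        /* 0x00: write a character here */
    volatile uint32_t reserved[5]; /* 0x04..0x17: not used in this sketch */
    volatile uint32_t flags;       /* 0x18: status flags */
};

#define UART_FLAG_TX_FULL (1u << 5) /* assumed: transmit FIFO full */

/* Try to send one character; returns 1 on success, 0 if the FIFO is full. */
int uart_try_putc(struct uart_regs *uart, char c)
{
    if (uart->flags & UART_FLAG_TX_FULL)
        return 0;
    uart->data = (uint32_t)(unsigned char)c;
    return 1;
}

/* Blocking send: poll the flag register until there is room. */
void uart_putc(struct uart_regs *uart, char c)
{
    while (!uart_try_putc(uart, c)) {
        /* busy-wait */
    }
}
```

Taking the register block as a pointer argument keeps the routine testable on a host against an ordinary struct; on the target the pointer would be the UART base address.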
