Related
I am new to MCU and trying to figure out how arm (Cortex M3-M4) based MCU boots. Because booting is specific to any SOC, I took an example hardware board of STM for case study.
Board: STMicroelectronics – STM32L476 32-bit.
In this board when booting mode is (x0)"Boot from User Flash", board maps 0x0000000 address to flash memory address. On flash memory I have pasted my binary with first 4 bytes pointing to vector table first entry, which is esp. Now if I press reset button ARM documentation says PC value will be set to 0x00000000.
CPU generally executes stream of instructions based on PC -> PC + 1 loop. In this case if I see PC value points to esp, which is not instruction. How does Arm CPU does the logic of not use this instruction address, but do a jump to value store at address 0x00000004?
Or this is the case:
Reset produces a special hardware interrupt and cause PC value to be value at 0x00000004, if this is the case why Arm documentation says it sets PC value to 0x00000000?
Ref: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka3761.html
What values are in ARM registers after a power-on reset? Applies to:
ARM1020/22E, ARM1026EJ-S, ARM1136, ARM720T, ARM7EJ-S, ARM7TDMI,
ARM7TDMI-S, ARM920/922T, ARM926EJ-S, ARM940T, ARM946E-S, ARM966E-S,
ARM9TDMI
Answer Registers R0 - R14 (including banked registers) and SPSR (in
all modes) are undefined after reset.
The Program Counter (PC/R15) will be set to 0x000000, or 0xFFFF0000 if
the core has a VINITHI or CFGHIVECS input which is set high as the
core leaves reset. This input should be set to reflect where the base
of the vector table in your system is located.
The Current Program Status Register (CPSR) will indicate that the ARM
core has started in ARM state, Supervisor mode with both FIQ and IRQ
mask bits set. The condition code flags will be undefined. Please see
the ARM Architecture Manual for a detailed description of the CPSR.
The cortex-m's do not boot the same way the traditional and full sized cores boot. Those at least for the reset as you pointed out fetch from address 0x00000000 (or the alternate if asserted) the first instructions, not really fair to call it the PC value as at this point the PC is somewhat bugus, there are multiple program counters being produced a fake one in r15, one leading the fetching, one doing prefetch, none are really the program counter. anyway, doesnt matter.
The cortex-m as documented in the armv7-m documentation (for the m3 and m4, for the m0 and m0+ see the armv6-m although they so far all boot the same way). These use a vector TABLE not instructions. The CORE reads address 0x00000000 (or an alternate if a strap is asserted) and that 32 bit value gets loaded into the stack pointer register. it reads address 0x00000004 it checks the lsbit (maybe not all cores do) if set then this is a valid thumb address, strips the lsbit off (makes it a zero) and begins to fetch the first instructions for the reset handler at that address so if your flash starts with
0x00000000 : 0x20001000
0x00000004 : 0x00000101
the cortex-m will put 0x20001000 in the stack pointer and fetch the first instructions from address 0x100. Being thumb instructions are 16 bits with thumb2 extensions being two 16 bit portions, its not an x86 the program counter is aligned for the full sized processors with 32 bit instructions it fetches on aligned addresses 0x0000, 0x0004, 0x0008 it doesnt increment pc <= pc + 1; For thumb mode or thumb processors it is pc = pc + 2. But also the fetches are not necessarily single instruction transactions, for the full sized they may fetch 4 or 8 words per transaction, the cortex-ms as documented in the technical reference manuals some are able to be compiled or strapped to 16 bits at a time or 32 bits at a time. So no need to talk about or think about execution loops fetching pc = pc + 1, that doesnt make sense even in an x86 these days.
to be fair arms documentation is generally good, on the better side compared to a number of others, not the best. Unlike the full sized arm exception table, the vector table in the cortex-m documentation was not done as well as it could have been, could have/should have just done something like the full sized but shown they were vectors not instructions. It is in there though in the architectural reference manual for the armv6-m and armv7-m (and I would assume armv8-m as well but have not looked, got some parts last week but boards are not here yet, will know very soon). Cant look for words like reset have to look for interrupt or undefined or hardfault, etc in that manual.
EDIT
unwrap your mind on this notion of how the processor starts fetching, it can be any arbitrary address they add into the design, and then the execution of the instructions determines the next address and next address, etc.
Also understand unlike say x86 or microchip pic or the avrs, etc, the core and the chips are two different companies. Even in those same company designs, but certainly where there is a clear division between the IP with a known bus, the ARM CORE will read address 0x00000004 on the AMBA/AXI/AHB bus, the chip vendor can mirror that address in as many different places as they want, in this case with the stm32 there probably isnt actually anything at 0x00000000 as their documentation implies based on the boot pins they map it either to an internal bootloader, or they map it to the user application at 0x08000000 (or in most stm32's if there is an exception thats fine I have not yet seen it) so when strapped that way and the logic has those addresses mirrored you will see the same 32 bit values at 0x00000000 and 0x08000000, 0x00000004 and 0x08000004 and so on for some limited amount of address space. This is why even though linking for 0x00000000 will work to some extent (till you hit that limit which is probably smaller than the application flash size), you will see most folks link for 0x08000000 and the hardware takes care of the rest, so your table really wants to look like
0x08000000 : 0x20001000
0x08000004 : 0x08000101
for an stm32, at least the dozens I have seen so far.
The processor reads 0x00000000 which is mirrored to the first item in the application flash, finds 0x20001000, it then reads 0x00000004 which is mirroed to the second word in the application flash and gets 0x08000101 which causes a fetch from 0x08000100 and now we are executing from the proper fully mapped application flash address space. so long as you dont change the mirroring, which I dont know if you can on an stm32 (nxp chips you can and I dont know about ti or other brands off hand). Some of the cortex-m cores the VTOR register is there and changable (others it is fixed at 0x00000000 and you cant change it), you do not need to change it to 0x08000000 for an stm32, at least all the ones I know about. its only if you are actively changing the mirroring of the zero address space yourself if possible or if you say have your own bootloader and maybe YOUR application space is 0x08004000 and that application wants a vector table of its own. then you either use VTOR or you build the bootloaders vector table such that it runs code that reads the vectors at 0x08004000 and branches to those. The NXP and others in the past certainly with the ARMV7TDMI cores, would let you change the mirroring of address zero because those older cores didnt have a programmable vector table offset register, helping you solve that problem in their chip designs. Newer ARM cores with a VTOR eliminate that need and over time the chip vendors might not bother anymore if they do at all...
EDIT
I dont know if you have the discovery board or the nucleo, I assume the latter as the former is not available (wish I knew about that one would like to have one. And/or I already have one and its buried in a drawer and I never got to it).
so here is a somewhat minimal program you can try on your stm32
.cpu cortex-m0
.thumb
.globl _start
_start:
.word 0x20000400
.word reset
.word loop
.word loop
.thumb_func
loop: b loop
.thumb_func
reset:
ldr r0,=0x20000000
mov r2,sp
str r2,[r0]
add r0,r0,#4
mov r2,pc
str r2,[r0]
add r0,r0,#4
mov r1,#0
top:
str r1,[r0]
add r1,r1,#1
b top
build
arm-none-eabi-as so.s -o so.o
arm-none-eabi-ld -Ttext=0x08000000 so.o -o so.elf
arm-none-eabi-objdump -D so.elf > so.list
arm-none-eabi-objcopy so.elf -O binary so.bin
this should build with arm-linux-whatever- or other arm-whatever-whatever tools from a binutils from the last 10 years.
The disassembly is important to examine before using the binary, dont want to brick your chip (with an stm32 there is a way to get unbricked)
08000000 <_start>:
8000000: 20000400 andcs r0, r0, r0, lsl #8
8000004: 08000013 stmdaeq r0, {r0, r1, r4}
8000008: 08000011 stmdaeq r0, {r0, r4}
800000c: 08000011 stmdaeq r0, {r0, r4}
08000010 <loop>:
8000010: e7fe b.n 8000010 <loop>
08000012 <reset>:
8000012: 4805 ldr r0, [pc, #20] ; (8000028 <top+0x6>)
8000014: 466a mov r2, sp
8000016: 6002 str r2, [r0, #0]
8000018: 3004 adds r0, #4
800001a: 467a mov r2, pc
800001c: 6002 str r2, [r0, #0]
800001e: 3004 adds r0, #4
8000020: 2100 movs r1, #0
08000022 <top>:
8000022: 6001 str r1, [r0, #0]
8000024: 3101 adds r1, #1
8000026: e7fc b.n 8000022 <top>
8000028: 20000000 andcs r0, r0, r0
the disassembler doesnt know that the vector table is not instructions so you can ignore those.
08000000 <_start>:
8000000: 20000400
8000004: 08000013
8000008: 08000011
800000c: 08000011
08000010 <loop>:
8000010: e7fe b.n 8000010 <loop>
08000012 <reset>:
Does it start the vector table at 0x08000000, check. Our stack pointer init value is at 0x00000000, yes, the reset vector we had the tools place for us. thumb_func tells them the following label is an address for some code/function/procedure/whatever_not_data so they orr the one on there for us. our reset handler is at address 0x08000012 so we want to see 0x08000013 in the vector table, check. I tossed in a couple more for demonstration purposes, sent them to an infinite loop at address 0x08000010 so the vector table should have 0x08000011, check.
So assuming you have a nucleo board not the discovery then you can copy the so.bin file to the thumb drive that shows up when you plug it in.
If you use openocd to connect through the stlink interface into the board now you can see that it was running (details left to the reader to figure out)
Open On-Chip Debugger
> halt
stm32f0x.cpu: target state: halted
target halted due to debug-request, current mode: Thread
xPSR: 0x01000000 pc: 0x08000022 msp: 0x20000400
> mdw 0x20000000 20
0x20000000: 20000400 0800001e 0048cd01 200002e7 200002e9 200002eb 200002ed 00000000
0x20000020: 00000000 00000000 00000000 200002f1 200002ef 00000000 200002f3 200002f5
0x20000040: 200002f7 200002f9 200002fb 200002fd
> resume
> halt
stm32f0x.cpu: target state: halted
target halted due to debug-request, current mode: Thread
xPSR: 0x01000000 pc: 0x08000022 msp: 0x20000400
> mdw 0x20000000 20
0x20000000: 20000400 0800001e 005e168c 200002e7 200002e9 200002eb 200002ed 00000000
0x20000020: 00000000 00000000 00000000 200002f1 200002ef 00000000 200002f3 200002f5
0x20000040: 200002f7 200002f9 200002fb 200002fd
so we can see that the stack pointer had 0x20000400 as expected
0x20000000: 20000400 0800001e 0048cd01
the program counter which is not some magical thing, they have to somewhat fake it to make the instruction set work.
800001a: 467a mov r2, pc
as defined in the instruction set the pc value used in this instruction is two instructions ahead of the address of this instruction, so 0x0800001A + 4 = 0x0800001E which is what we see in the memory dump.
And the third item is a counter showing we are running, the resume and halt shows that that count kept going
0x20000000: 20000400 0800001e 005e168
So this demonstrates, the vector table, initializing the stack pointer, the reset vector, where code execution starts, what the value of the pc is at some point in the program, and seeing the program run.
the .cpu cortex-m0 makes it build the most compatible program for the cortex-m family and the mov r0,=0x20000000 was cheating, you posted the same feature in your comment it says I want to load the address of blah into the register a label is just an address and they let you put just an address =_estack is the address of a label =0x20000000 is just a number treated as an address (addresses are just numbers as well, nothing magical about them). I could have done a smaller immediate with a shift or explicitly have done the pc relative load. force of habit in this case.
EDIT2
In attempt for a programmer to understand that the chip is logic, only some percentage of it is software/instruction driven, even within that it is just logic that does more things than the software instruction itself indicates. You want to read from memory your instruction asks the processor to do it but in a real chip there are a number of steps involved to actually perform that, microcoded or not (ARMs are not microcoded) there are state machines that walk through the various steps to perform each of these tasks. grab the values from registers, compute the address, do the memory transaction which is a handful of separate steps, take the return value and place it in the register file.
.thumb
.globl _start
_start:
.word 0x20001000
.word reset
.word loop
.word loop
.thumb_func
loop: b loop
.thumb_func
reset:
ldr r0,loop_counts
loop_top:
sub r0,r0,#1
bne loop_top
b reset
.align
loop_counts: .word 0x1234
00000000 <_start>:
0: 20001000 andcs r1, r0, r0
4: 00000013 andeq r0, r0, r3, lsl r0
8: 00000011 andeq r0, r0, r1, lsl r0
c: 00000011 andeq r0, r0, r1, lsl r0
00000010 <loop>:
10: e7fe b.n 10 <loop>
00000012 <reset>:
12: 4802 ldr r0, [pc, #8] ; (1c <loop_counts>)
00000014 <loop_top>:
14: 3801 subs r0, #1
16: d1fd bne.n 14 <loop_top>
18: e7fb b.n 12 <reset>
1a: 46c0 nop ; (mov r8, r8)
0000001c <loop_counts>:
1c: 00001234 andeq r1, r0, r4, lsr r2
Just barely enough of an instruction set simulator to run that program.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define ROMMASK 0xFFFF
#define RAMMASK 0xFFF
unsigned short rom[ROMMASK+1];
unsigned short ram[RAMMASK+1];
unsigned int reg[16];
unsigned int pc;
unsigned int cpsr;
unsigned int inst;
int main ( void )
{
unsigned int ra;
unsigned int rb;
unsigned int rc;
unsigned int rx;
//just putting something there, a real chip might have an MBIST, might not.
memset(reg,0xBA,sizeof(reg));
memset(ram,0xCA,sizeof(ram));
memset(rom,0xFF,sizeof(rom));
//in a real chip the rom/flash would contain the program and not
//need to do anything to it, this sim needs to have the program
//various ways to have done this...
//00000000 <_start>:
rom[0x00>>1]=0x1000; // 0: 20001000 andcs r1, r0, r0
rom[0x02>>1]=0x2000;
rom[0x04>>1]=0x0013; // 4: 00000013 andeq r0, r0, r3, lsl r0
rom[0x06>>1]=0x0000;
rom[0x08>>1]=0x0011; // 8: 00000011 andeq r0, r0, r1, lsl r0
rom[0x0A>>1]=0x0000;
rom[0x0C>>1]=0x0011; // c: 00000011 andeq r0, r0, r1, lsl r0
rom[0x0E>>1]=0x0000;
//
//00000010 <loop>:
rom[0x10>>1]=0xe7fe; // 10: e7fe b.n 10 <loop>
//
//00000012 <reset>:
rom[0x12>>1]=0x4802; // 12: 4802 ldr r0, [pc, #8] ; (1c <loop_counts>)
//
//00000014 <loop_top>:
rom[0x14>>1]=0x3801; // 14: 3801 subs r0, #1
rom[0x16>>1]=0xd1fd; // 16: d1fd bne.n 14 <loop_top>
rom[0x18>>1]=0xe7fb; // 18: e7fb b.n 12 <reset>
rom[0x1A>>1]=0x46c0; // 1a: 46c0 nop ; (mov r8, r8)
//
//0000001c <loop_counts>:
rom[0x1C>>1]=0x0004; // 1c: 00001234 andeq r1, r0, r4, lsr r2
rom[0x1E>>1]=0x0000;
//reset
//THIS IS NOT SOFTWARE DRIVEN LOGIC, IT IS JUST LOGIC
ra=rom[0x00>>1];
rb=rom[0x02>>1];
reg[14]=(rb<<16)|ra;
ra=rom[0x04>>1];
rb=rom[0x06>>1];
rc=(rb<<16)|ra;
if((rc&1)==0) return(1); //normally run a fault handler here
pc=rc&0xFFFFFFFE;
reg[15]=pc+2;
cpsr=0x000000E0;
//run
//THIS PART BELOW IS SOFTWARE DRIVEN LOGIC
//still you can see that each instruction requires some amount of
//non-software driven logic.
//while(1)
for(rx=0;rx<20;rx++)
{
inst=rom[(pc>>1)&ROMMASK];
printf("0x%08X : 0x%04X\n",pc,inst);
reg[15]=pc+4;
pc+=2;
if((inst&0xF800)==0x4800)
{
//LDR
printf("LDR r%02u,[PC+0x%08X]",(inst>>8)&0x7,(inst&0xFF)<<2);
ra=(inst>>0)&0xFF;
rb=reg[15]&0xFFFFFFFC;
ra=rb+(ra<<2);
printf(" {0x%08X}",ra);
rb=rom[((ra>>1)+0)&ROMMASK];
rc=rom[((ra>>1)+1)&ROMMASK];
ra=(inst>>8)&0x07;
reg[ra]=(rc<<16)|rb;
printf(" {0x%08X}\n",reg[ra]);
continue;
}
if((inst&0xF800)==0x3800)
{
//SUB
ra=(inst>>8)&0x07;
rb=(inst>>0)&0xFF;
printf("SUBS r%u,%u ",ra,rb);
rc=reg[ra];
rc-=rb;
reg[ra]=rc;
printf("{0x%08X}\n",rc);
//do flags
if(rc==0) cpsr|=0x80000000; else cpsr&=(~0x80000000); //N flag
//dont need other flags for this example
continue;
}
if((inst&0xF000)==0xD000) //B conditional
{
if(((inst>>8)&0xF)==0x1) //NE
{
ra=(inst>>0)&0xFF;
if(ra&0x80) ra|=0xFFFFFF00;
rb=reg[15]+(ra<<1);
printf("BNE 0x%08X\n",rb);
if((cpsr&0x80000000)==0)
{
pc=rb;
}
continue;
}
}
if((inst&0xF000)==0xE000) //B
{
ra=(inst>>0)&0x7FF;
if(ra&0x400) ra|=0xFFFFF800;
rb=reg[15]+(ra<<1);
printf("B 0x%08X\n",rb);
pc=rb;
continue;
}
printf("UNDEFINED INSTRUCTION 0x%08X: 0x%04X\n",pc-2,inst);
break;
}
return(0);
}
You are welcome to hate my coding style, this is a brute force thrown together for this question thing. No I dont work for ARM, this can all be pulled from public documents/information. I shortened the loop to 4 counts to see it hit the outer loop
0x00000012 : 0x4802
LDR r00,[PC+0x00000008] {0x0000001C} {0x00000004}
0x00000014 : 0x3801
SUBS r0,1 {0x00000003}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000002}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000001}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000000}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000018 : 0xE7FB
B 0x00000012
0x00000012 : 0x4802
LDR r00,[PC+0x00000008] {0x0000001C} {0x00000004}
0x00000014 : 0x3801
SUBS r0,1 {0x00000003}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000002}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000001}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000014 : 0x3801
SUBS r0,1 {0x00000000}
0x00000016 : 0xD1FD
BNE 0x00000014
0x00000018 : 0xE7FB
B 0x00000012
Perhaps this helps perhaps this makes it worse. Most of the logic is not driven by instructions, each instruction, requires some amount of logic not counting the common logic like instruction fetching and things like that.
If you add more code this simulator will break it ONLY supports these handful of instructions and this loop.
The most important thing to check when you're confused about some behaviour of an Arm processor is probably to check the version of the architecture which applies. You will find a huge amount of very old legacy documentation which relates to ARM7 and ARM9 designs. Whilst not all of this is wrong today, it can be very misleading.
ARM v4, ARM v5, ARM v6: These are legacy designs, rarely even used in derivative products now.
ARM v7A: These are the first of the Cortex series. Cortex-A5 is the entry-level for a linux class device in 2018.
ARM v7M, ARM v6M: These are the common microcontroller devices like your STM32, and already these have over 10 years of history
ARM v8A: These introduce the 64 bit instruction set (T32/A32/A64 in one device), already entry level in the R-pi 3 for example.
ARM v8M: The latest iteration of an microcontroller architecture with more advanced security features, just starting to become available 2018Q2
Specifically, ARMv6M/ARMv7M/ARMv8M provide a very different exception model compared with all of the other ARM architectures (remaining similar within the family), whilst many of the other differences are more incremental or focused on specialised area.
I have a C code like this one, that will be possibly compiled in an ELF file for ARM:
int a;
int b=1;
int foo(int x) {
int c=2;
static float d=1.5;
// .......
}
I know that all the executable code goes into the .text section, while .data , .bss and .rodata will contain the various variables/constants.
My question is: does a line like int b=1; here add also something to the .text section, or does it only tell the compiler to place a new variable initialized to 1 in .data (then probably mapped in RAM memory when deployed on the final hardware)?
Moreover, trying to decompile a similar code, I noticed that a line such as int c=2;, inside the function foo(), was adding something to the stack, but also some lines of .text where the value '2' was actually memorized there.
So, in general, does a declaration always imply also something added to .text at an assembly level? If yes, does it depends on the context (i.e. if the variable is inside a function, if it is a local global variable, ...) and what is actually added?
Thanks a lot in advance.
does a line like int b=1; here add also something to the .text section, or does it only tell the compiler to place a new variable initialized to 1 in .data (then probably mapped in RAM memory when deployed on the final hardware)?
You understand that this is likely to be implementation specific, but the likelihood is that that you will just get initialised data in the data section. Were it a constant, it might, instead go into the text section.
Moreover, trying to decompile a similar code, I noticed that a line such as int c=2;, inside the function foo(), was adding something to the stack, but also some lines of .text where the value '2' was actually memorized there.
Automatic variables that are initialised, have to be initialised each time the function's scope is entered. The space for c is reserved on the stack (or in a register, depending on the ABI) but the program has to remember the constant from which it is initialised and this is best placed somewhere in the text segment, either as a constant value or as a "move immediate" instruction.
So, in general, does a declaration always imply also something added to .text at an assembly level?
No. If a static variable is initialised to zero or null or not initialised at all, it is often just enough to reserve space in bss. If a static non constant variable is initialised to a non zero value, it will just be put in the data segment.
As #goodvibration correctly stated, only global or static variables go to the segments. This is because their lifetime is the whole execution time of the program.
Local variables have a different lifetime. They exist only during the execution of the block (e.g. function) they are defined within. If a function is called, all parameters that does not fit into registers a pushed to the stack and the return address is written to the link register.* The function saves possibly the link register and other registers at the stack and adds some space at the stack for local variables (this is the code you have observed). At the end of the function, the saved registers are poped and the the stackpointer is readjusted. In this way, you get an automatic garbage collection for local variables.
*: Please note, that this is true for (some calling conventions of) ARM only. It's different e.g. for Intel processors.
this is one of those just try it things.
int a;
int b=1;
int foo(int x) {
int c=2;
static float d=1.5;
int e;
e=x+2;
return(e);
}
first thing without optimization.
arm-none-eabi-gcc -c so.c -o so.o
arm-none-eabi-objdump -D so.o
arm-none-eabi-ld -Ttext=0x1000 -Tdata=0x2000 so.o -o so.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000001000
arm-none-eabi-objdump -D so.elf > so.list
do worry about the warning, needed to link to see that everything found a home
Disassembly of section .text:
00001000 <foo>:
1000: e52db004 push {r11} ; (str r11, [sp, #-4]!)
1004: e28db000 add r11, sp, #0
1008: e24dd014 sub sp, sp, #20
100c: e50b0010 str r0, [r11, #-16]
1010: e3a03002 mov r3, #2
1014: e50b3008 str r3, [r11, #-8]
1018: e51b3010 ldr r3, [r11, #-16]
101c: e2833002 add r3, r3, #2
1020: e50b300c str r3, [r11, #-12]
1024: e51b300c ldr r3, [r11, #-12]
1028: e1a00003 mov r0, r3
102c: e28bd000 add sp, r11, #0
1030: e49db004 pop {r11} ; (ldr r11, [sp], #4)
1034: e12fff1e bx lr
Disassembly of section .data:
00002000 <b>:
2000: 00000001 andeq r0, r0, r1
00002004 <d.4102>:
2004: 3fc00000 svccc 0x00c00000
Disassembly of section .bss:
00002008 <a>:
2008: 00000000 andeq r0, r0, r0
as a disassembly it tries to disassemble data so ignore that (the andeq next to 0x2008 for example).
The a variable is global and uninitialized so it lands in .bss (typically...a compiler can choose to do whatever it wants so long as it implements the language correctly, doesnt have to have something called .bss for example, but gnu and many others do).
b is global and initialized so it lands in .data, had it been declared as const it might land in .rodata depending on the compiler and what it offers.
c is a local non-static variable that is initialized, because C offers recursion this needs to be on the stack (or managed with registers or other volatile resources), and initialized each run. We needed to compile without optimization to see this
1010: e3a03002 mov r3, #2
1014: e50b3008 str r3, [r11, #-8]
d is what I call a local global, it is a static local so it lives outside the function, not on the stack, alongside the globals but with local access only.
I added e to your example, this is a local not initialized, but then used. Had I not used it and not optimized there probably would have been space allocated for it but no initialization.
save x on the stack (per this calling convention x enters in r0)
100c: e50b0010 str r0, [r11, #-16]
then load x from the stack, add two, save as e on the stack. read e from
the stack and place in the return location for this calling convention which is r0.
1018: e51b3010 ldr r3, [r11, #-16]
101c: e2833002 add r3, r3, #2
1020: e50b300c str r3, [r11, #-12]
1024: e51b300c ldr r3, [r11, #-12]
1028: e1a00003 mov r0, r3
For all architectures, unoptimized this is somewhat typical, always read variables from the stack and put them back quickly. Other architectures have different calling conventions with respect to where the incoming parameters and outgoing return value live.
If I optmize (-O2 on the gcc line)
Disassembly of section .text:
00001000 <foo>:
1000: e2800002 add r0, r0, #2
1004: e12fff1e bx lr
Disassembly of section .data:
00002000 <b>:
2000: 00000001 andeq r0, r0, r1
Disassembly of section .bss:
00002004 <a>:
2004: 00000000 andeq r0, r0, r0
b is a global, so at the object level a global space has to be reserved for it, it is .data, optimization doesnt change that.
a is also global and still .bss, because at the object level it was declared such so allocated in case another object needs it. The linker doesnt remove these.
Now c and d are dead code they dont do anything they need no storage so
c is no longer allocated space on the stack nor is d allocated any .data
space.
We have plenty of registers for this architecture for this calling convention for this code, so e does not need any memory allocated on the
stack, it comes in in r0 the math can be done with r0 and then it is returned in r0.
I know I didnt tell the linker where to put .bss by telling it .data it put .bss in the same space without complaint. I could have put -Tbss=0x3000 for example to give it its own space or just done a linker script. Linker scripts can play havoc with the typical results, so beware.
Typical, but there might be a compiler with exceptions:
non-constant globals go in .data or .bss depending on whether they are initialized during the declaration or not.
If const then perhaps .rodata or .text depending (or .data or .bss would technically work)
non-static locals go in general purpose registers or on the stack as needed (if not completely optimized away).
static locals (if not optimized away) live with globals but are not globally accessible they just get allocated space in .data or .bss like the globals do.
parameters are governed completely by the calling convention used by that compiler for that target. Just because arm or mips or other may have written down a convention doesnt mean a compiler has to use it, only if they claim to support some convention or standard should they then attempt to comply. For a compiler to be useful it needs a convention and stick to it whatever it is, so that both caller and callee of a function know where to get parameters and to return a value. Architectures with enough registers will often have a convention where some few number of registers are used for the first so many parameters (not necessarily one to one) and then the stack is used for all other parameters. likewise a register may be used if possible for a return value. Some architectures due to lack of gprs or other, use the stack in both directions. or the stack in one and a register in the other. You are welcome to seek out the conventions and try to read them, but at the end of the day the compiler you are using, if not broken follows a convention and by setting up experiments like the one above you can see the convention in action.
Plus in this case optimizations.
void more_fun ( unsigned long long );
unsigned fun ( unsigned int x, unsigned long long y )
{
more_fun(y);
return(x+1);
}
If I told you that arm conventions typically use r0-r3 for the first few parameters you might assume that x is in r0 and r1 and r2 are used for y and we could have another small parameter before needing the stack, well
perhaps older arm, but now it wants the 64 bit variable to use an even then an odd.
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e1a04000 mov r4, r0
8: e1a01003 mov r1, r3
c: e1a00002 mov r0, r2
10: ebfffffe bl 0 <more_fun>
14: e2840001 add r0, r4, #1
18: e8bd4010 pop {r4, lr}
1c: e12fff1e bx lr
so r0 contains x, r2/r3 contain y and r1 was passed over.
the test was crafted to not have y as dead code and to pass it to another function we can see where y was stored on the way into fun and way out to more_fun. r2/r3 on the way in, needs to be in r0/r1 to call more fun.
we need to preserve x for the return from fun. one might expect that x would land on the stack, which unoptimized it would, but instead save a register that the convention has stated will be preserved by functions (r4) and use r4 throughout the function or at least in this function to store x. A performance optimization, if x needed to be touched more than once memory cycles going to the stack cost more than register accesses.
then it computes the return and cleans up the stack, registers.
IMO it is important to see this, the calling convention comes into play for some variables and others can vary based on optimization, no optimization they are what most folks are going to state off hand, .bss, .data (.text/.rodata), with optimization then it depends if if the variable survives at all.
I have a question about dynamic linking on Linux. Consider the following disassembly of an ARM binary.
8300 <printf#plt-0x40>:
....
8320: e28fc600 add ip, pc, #0, 12
8324: e28cca08 add ip, ip, #8, 20 ; 0x8000
8328: e5bcf344 ldr pc, [ip, #836]! ; 0x344
....
83fc <main>:
...
8424:ebffffbd bl 8320 <_init+0x2c>
Main function calls printf at 8424: bl 8320. 8320 is an address in the .plt shown above. Now the code in .plt makes call to dynamic linker to invoke printf routine. My question is how the dynamic linker will be able to say that it is a call to printf?
TLDR; The PLT calls the dynamic linker by passing:
the address of the GOT entry in IP (&PLTGOT[n+3]);
&PLTGOT[2] is in LR;
Moreover PLTGOT[1] identifies the shared-object/executable.
The dynamic linker use this to find the relocation entry (plt_relocation_table[n]) and thus the symbol (printf).
Explanation of the PLT entry code
This is explained (somehow) in section A.3 of ELF for ARM:
8320: e28fc600 add ip, pc, #0, 12
8324: e28cca08 add ip, ip, #8, 20 ; 0x8000
8328: e5bcf344 ldr pc, [ip, #836]! ; 0x344
Which are explained by:
ADD ip, pc, #-8:PC_OFFSET_27_20:__PLTGOT(X)
; R_ARM_ALU_PC_G0_NC(__PLTGOT(X))
ADD ip, ip, #-4:PC_OFFSET_19_12: __PLTGOT(X)
;R_ARM_ALU_PC_G1_NC(__PLTGOT(X))
LDR pc, [ip, #0:PC_OFFSET_11_0:__PLTGOT(X)]!
; R_ARM_LDR_PC_G2(__PLTGOT(X))
Those instructions do two things:
they compute the address of the GOT entry as an offset from PC and store it in the IP register;
they jump to this GOT entry.
The spec notes that:
The write-back on the final LDR ensures that ip contains
the address of the PLTGOT entry. This is critical to
incremental dynamic linking.
The "write-back" is the use of "!" in the last instruction: this is used to update IP register with the final offset (#836). This way IP contains the addess of the GOT entry at the end of the PLT entry.
The dynamic linker has the address of the GOT entry in IP:
it can find the shared-object or executable;
it can find the correct relocation entry.
This relocation entry references the symbol of target function (printf in your case):
Offset Info Type Sym. Value Sym. Name
0001066c 00000116 R_ARM_JUMP_SLOT 00000000 printf
The Base Platform ABI for the ARM architecture notes that:
When the platform supports lazy function binding (as ARM Linux does)
this ABI requires ip to address the corresponding
PLTGOT entry at the point where the PLT calls through it.
(The PLT is requir ed to behave as if it ended with LDR pc, [ip]).
Finding the relocation entry from the GOT
Now the way the relocation entry is found from the GOT address is not clear. Binary search could be used but is would not be convenient. The GNU ld.so does it like this (glibc/sysdeps/arm/dl-trampoline.S):
dl_runtime_resolve:
cfi_adjust_cfa_offset (4)
cfi_rel_offset (lr, 0)
# we get called with
# stack[0] contains the return address from this call
# ip contains &GOT[n+3] (pointer to function)
# lr points to &GOT[2]
# Save arguments. We save r4 to realign the stack.
push {r0-r4}
cfi_adjust_cfa_offset (20)
cfi_rel_offset (r0, 0)
cfi_rel_offset (r1, 4)
cfi_rel_offset (r2, 8)
cfi_rel_offset (r3, 12)
# get pointer to linker struct
ldr r0, [lr, #-4]
# prepare to call _dl_fixup()
# change &GOT[n+3] into 8*n NOTE: reloc are 8 bytes each
sub r1, ip, lr
sub r1, r1, #4
add r1, r1, r1
[...]
The address of the second GOT entry is in LR. I guess this is donebyt .PLT0:
00015b84 :
15b84: e52de004 push {lr} ; (str lr, [sp, #-4]!)
15b88: e59fe004 ldr lr, [pc, #4] ; 15b94
15b8c: e08fe00e add lr, pc, lr
15b90: e5bef008 ldr pc, [lr, #8]!
15b94: 0012f46c andseq pc, r2, ip, ror #8
From those two GOT addresses, the dynamic linker can find the GOT offset and the offset in the PLT relocation table.
From &GOT[2], the dynamic linker can find the second entry of the PLTGOT (GOT[1]) which contains the address of the linker struct (a reference used by the dynamic linker to recosgnise this shared-object/executable).
I don't where this is specified: it does not seem to be part of the base ARM ABI spec.
.rela.plt contains the address of printf to inform the dynamic linker from where to locate the printf
check this link for details very soft to digest https://www.technovelty.org/linux/plt-and-got-the-key-to-code-sharing-and-dynamic-libraries.html. This article also clarify about process of variables to be accessed through Shared libraries first and then functions.
The process of dynamic linking is described in great detail here.
TL;DR: at static link time, ld creates a set of tables in special sections such as .rel.dyn, .rel.plt, etc., which tell the runtime loader what to do at runtime.
You can examine these tables with nm -D, readelf -Wr, objdump -R, etc.
I am understanding U-boot(v2014.07).
In the start.S(at arch/arm/cpu/armv7/) file it is loading vector base address using following instructions.
ldr r0, =_start
mcr p15, 0, r0, c12, c0, 0 #Set VBAR
Can you please guide to understand where "_start" is defined. I checked in start.S and lowlevel_init.S, but I couldn't find.
Can you please guide to understand where "_start" is defined
For the ARM architecture, _start is defined as a global in arch/arm/lib/vectors.S
When disassembly the start.o file, the "ldr r0, =_start" instruction is updated as "ldr r0, [pc, #104] ; 9c " .
That should correspond to the first entry in the 32-byte ARM Exception vector, i.e.
ldr pc, _reset
I am building a project which has relocatable code on bare metal. It is a Cortex M3 embedded application. I do not have a dynamic linker and have implemented all the relocations in my startup code.
Mostly it is working but my local static variables appear to be incorrectly located. Their address is offset by the amount that my executable is offset in memory - ie I compile my code as if it is loaded at memory location 0 but I actually load it in memory located at 0x8000. The static local variable have their memory address offset by 0x8000 which is not good.
My global variables are located properly by the GOT but the static local variables are not in the GOT at all (at least they don't appear when I run readelf -r). I am compiling my code with -fpic and the linker has -fpic and -pie specified. I think I must be missing a compile and/or link option to either instruct gcc to use the GOT for the static local variables or to instruct it to use absolute addressing for them.
It seems that currently the code adds the PC to the location of the static local variables.
I think I have repeated what you are seeing:
statloc.c
unsigned int glob;
unsigned int fun ( unsigned int a )
{
static unsigned int loc;
if(a==0) loc=7;
return(a+glob+loc);
}
arm-none-linux-gnueabi-gcc -mcpu=cortex-m3 -Wall -Werror -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -fpic -pie -S statloc.c
Which gives:
.cpu cortex-m3
.fpu softvfp
.thumb
.text
.align 2
.global fun
.thumb
.thumb_func
fun:
ldr r3, .L6
.LPIC2:
add r3, pc
cbnz r0, .L5
ldr r1, .L6+4
movs r2, #7
.LPIC1:
add r1, pc
ldr ip, .L6+8
str r2, [r1, #0]
ldr r1, [r3, ip]
ldr r3, [r1, #0]
adds r0, r0, r3
adds r0, r0, r2
bx lr
.L5:
ldr ip, .L6+8
ldr r2, .L6+12
ldr r1, [r3, ip]
.LPIC0:
add r2, pc
ldr r2, [r2]
ldr r3, [r1, #0]
adds r0, r0, r3
adds r0, r0, r2
bx lr
.L7:
.align 2
.L6:
.word _GLOBAL_OFFSET_TABLE_-(.LPIC2+4)
.word .LANCHOR0-(.LPIC1+4)
.word glob(GOT)
.word .LANCHOR0-(.LPIC0+4)
.size fun, .-fun
.comm glob,4,4
.bss
.align 2
.LANCHOR0 = . + 0
.type loc.823, %object
.size loc.823, 4
loc.823:
.space 4
I also added startup code and compiled a binary and disassembled to further understand/verify what is going on.
this is the offset from the pc to the .got
ldr r3, .L6
add pc so r3 holds a position independent offset to the .got
add r3, pc
offset in the got for the address of the glob
ldr ip, .L6+8
read the absolute address for the global variable from the got
ldr r1, [r3, ip]
finally read the global variable into r3
ldr r3, [r1, #0]
this is the offset from the pc to the static local in .bss
ldr r2, .L6+12
add pc so that r2 holds a position independent offset to the static local
in .bss
add r2, pc
read the static local in .bss
ldr r2, [r2]
So if you were to change where .text is loaded and were to change where both .got and .bss are loaded relative to .text and that is it then the contents of .got would be wrong and the global variable would be loaded from the wrong place.
if you were to change where .text is loaded, leave .bss where the linker put it and move .got relative to .text. then the global would be pulled from the right place and the local would not
if you were to change where .text is loaded, change where both .got and .bss are loaded relative to .text and modify .got contents to reflect where .text is loaded, then both the local and global variable would be accessed from the right place.
So the loader and gcc/ld need to all be in sync. My immediate recommendation is to not use a static local and just use a global. That or dont worry about position independent code, it is a cortex-m3 after all and somewhat resource limited, just define the memory map up front. I assume the question is how do I make gcc use the .got for the local global, and that one I dont know the answer to, but taking a simple example like the one above you can work through the many command line options until you find one that changes the output.
This problem is also discussed here: https://answers.launchpad.net/gcc-arm-embedded/+question/236744
Starting with gcc 4.8 (ARM), there's a command line switch called -mpic-data-is-text-relative which causes static variables being addressed through the GOT as well.