AT91 Bootstrap + Bare Metal Application - c

I am currently trying to understand how AT91 and a bare metal application can work together. I'll try to describe what I have:
IAR as development environment
A simple application which I can download via debugger to SRAM and which toggles some LEDs (working!)
Using SAM-BA I can write this application to SRAM and it will start correctly (LEDs are toggling)
My hardware platform is the ATSAMA5D3x-EK
Now I would like this application to first run the AT91 bootstrap to initialize all the low level hardware (like DDR-RAM), then jump to my application and run it. I have not been able to do that yet successfully. I am able to start the pre-built uboot binary though so I assume it's not the copying or jump that are failing but my application is setup incorrectly.
As far as I understand, if I jump to an application (I assume this is some sort of "LDR pc, appstart_address") the operation at address appstart_address gets executed.
Now, in ARM the first 7 bytes or so are reserved for abort/interrupt vectors, whereas the first instruction is usually some sort of "LDR pc, =main". Are these required if my application is copied to RAM and executed from there? I somehow have the feeling that after copying my application to RAM, the address pointers do not match anymore (although they should be relative - is that correct at all?)
So my questions basically boil down to:
What happens after AT91 has initialized the hardware and jumps to my application
Do I need to setup my application in some specific way? Do I need to tell the linker or any other component that it will be relocated to some other memory location (at91 bootstrap copies it to 0x2600 0000 whereas 0x2000 0000 is the start address of DDR).
Does anyone know of a good tutorial which explains exactely this step (the jump from at91 bootstrap to my application)?
One more question which I probably can answer myself:
Is it safe to assume that I will not need to execute the instructions in board_startup.s at the beginning of my application which enable The floating point unit, setup the sys stack pointer and so on. I would say that the hardware itself has already been setup by AT91 Bootstrap and therefore there is no need for such setup.
After thinking about a few things it comes down to this:
Does it make sense to tell the linker that it should link main to address 0x0 (because this is where bootstrap will jump to) - how would I do that?

Now, in ARM the first 7 bytes or so are reserved for abort/interrupt
vectors, whereas the first instruction is usually some sort of "LDR
pc, =main". Are these required if my application is copied to RAM and
executed from there? I somehow have the feeling that after copying my
application to RAM, the address pointers do not match anymore
(although they should be relative - is that correct at all?)
The first 8 WORDS are exception entry points yes. Of which one is undefined so 7 real ones...
The reset vector does not want to go straight to main implying C code, you have not setup the stack or anything that you need to do to call C code. Also the reset vector is often close enough to use a branch b instead of a ldr pc, but since you only have one word/instruction to get out of the exception table then it either needs to be a branch or a ldr pc,something.
if your binary is position dependent then you build it for that position, you can then place it in non-volatile storage, copy and run if you like there is no problem with that. if you build it for its non-volatile address but you run it in a different address space and it is not position independent then you are right it simply wont work.
What happens after AT91 has initialized the hardware and jumps to my
application
your application runs
Do I need to setup my application in some specific way? Do I need to
tell the linker or any other component that it will be relocated to
some other memory location (at91 bootstrap copies it to 0x2600 0000
whereas 0x2000 0000 is the start address of DDR).
either build it position independent or link it for the address where it will run.
Does anyone know of a good tutorial which explains exactely this step
(the jump from at91 bootstrap to my application)?
I assume when you say at91 bootstrap (need to use a more correct term) you mean some part specific (at91 is a long lived family of devices) you really mean either some atmel part specific code or IAR part specific code. And the answer to your question is in their examples or documentation. You need to demonstrate what you found, examples, etc before posting a question like that.
Is it safe to assume that I will not need to execute the instructions
in board_startup.s at the beginning of my application which enable The
floating point unit, setup the sys stack pointer and so on. I would
say that the hardware itself has already been setup by AT91 Bootstrap
and therefore there is no need for such setup.
if you are relying on someone elses code to for example setup ddr, then it is probably a safe bet that they setup the stack. fpu, thats another story. But if that file name is specific to their project and is something they call/use then well, they called it or used it. Again this is specific to this magic AT91 Bootstrap thing which you have not demonstrated that you looked at or through or read about. Please, do some more research on the topic, show what you tried, etc. For example it should be quite trivial after this bootstrap code to read the registers that enable the fpu and or just use it and see what you see. that is an easy way to tell if it had been run. alternatively insert an infinite loop in that code and re-build if the code hangs at the infinite loop. they they are running it. (careful not to brick your board with such a move, in theory SAM-BA will let you re-load).
Does it make sense to tell the linker that it should link main to
address 0x0 (because this is where bootstrap will jump to) - how would
I do that?
The exception table for this processor is at a well known location (possibly one of two depending on strapping). the exception handlers need to be in the right place for the processor to boot properly. Generally it is the linker that does the final arranging of code and it is linker specific as to how you tell the linker where to put things so the answer is in the documentation for the linker and also either somewhere in the project it specifies this information (linker script, makefile, etc) or a default is used either global default or some variable or command line option tells one of the tools where to look for this information. so how you do it is read the docs and do what the docs say.

Related

Custom Bootloader for Kinetis MKE06Z microcontrollers on IAR EWARM issue

First I'd like to introduce myself, as I'm new to the site. I'm an Electronic Engineer, specialized in embedded systems design and development. I've been gathering info from the site for a long time, and I think that there's a lot of people with great deal of knowledge. I'm hoping some other of you may have stumbled upon this or a similar issue.
I've been having some trouble in the implementation of a custom bootloader for a Kinetis MKE06Z microcontroller, not in the bootloader itself but in the relocation of the application code and the behavior after jumping to it. The application is completely coded in C.
The bootloader executes everything as expected, determines if it should run or jump to user application. This is the sequence that implements the jump:
__disable_interrupt();
SCB->VTOR = RELOCATION_VECTOR_ADDR & 0x3FFFFE00;
JumpToUserApplication(RELOCATION_VECTOR_ADDR);
where:
void JumpToUserApplication(uint32_t userStartup)
{
/* set up stack pointer */
asm("LDR r1, [r0]");
asm("MOV r13, r1");
/* jump to application reset vector */
asm("ADDS r0,r0,#0x04 ");
asm("LDR r0, [r0]");
asm("BX r0");
}
as implemented in Frescale's AN4767.
So far, so good. Once the jump is executed, I trace the application behavior (on the Disassembly Window) and find out after some instructions, it gets stuck at some specific address with a jump instruction, which ends up being an infinite loop. I then run it step by step to determine which was the instruction that causes this malfunction. It's very strange, as it is running OK and suddenly jumps to a RAM address. A couple of cycles and then jumps to the infinite loop. I took note of the addresses with the instruction causing this strange jump and the one with the infinite loop. I look at the core registers and find out there is an exception, and notice it's the number 0x03 (Hard Fault). Then switch to debugging the user application.
Once in the user application, I start debugging. The user application works fine running like this (no jump from the bootloader). Then I look for the relevant addresses and discover that the routine causing the hard fault when jumping from bootloader is from IAR: __iar_data_init3. The thing is, it's part of a precompiled library and I'm not sure if it's safe to remove it (by removing the __iar_program_start and replacing it directly with the call to main on the startup file.
The real question is: why does the application behave like that after the jump from the booloader but not if there is no such jump? Why does this routine jumps to a RAM address (when it shouldn't)?
Of course, it may be a little to specific, but hopefully there's someone that can help me.
It seems that something IAR does with the linker configuration is not very clear to me, but has something to do with this problem. The thing is I relocated .text segment:
define symbol __ICFEDIT_intvec_start__ = 0x00001800;
define symbol __ICFEDIT_region_ROM_start__ = 0x00002000;
define symbol __ICFEDIT_region_ROM_end__ = 0x0000FFFF;
define region APP_ROM = mem:[from (__ICFEDIT_region_ROM_start__) to (__ICFEDIT_region_ROM_end__)];
place at address mem:__ICFEDIT_intvec_start__ { readonly section .intvec };
place at start of APP_ROM { readonly section .text };
It seems that the linker doesn't appreciate this and something make the app misbehave when jumping from other app. Instead of this, keeping the original .icf file and editing within the GUI only the .intvec_start solved the problem, but code starts right next to the vector table. Not an issue, but I wanted to relocate code a little farther.
Thanks.

How does OS execute binary files in virtual memory?

For example in my program I called a function foo(). The compiler and assembler would eventually write jmp someaddr in the binary. I know the concept of virtual memory. The program would think that it has the whole memory at disposal, and the start position is 0x000. In this way the assembler can calculate the position of foo().
But in fact this is not decided until runtime right? I have to run the program to know where I loaded the program into, hence the address of the jmp. But when the program actually runs, how does the OS come in and change the address of the jmp? These are direct CPU instructions right?
This question can't be answered in general because it's totally hardware and OS dependent. However a typical answer is that the initially loaded program can be compiled as you say: Because the VM hardware gives each program its own address space, all addresses can be fixed when the program is linked. No recalculation of addresses at load time is needed.
Things get much more interesting with dynamically loaded libraries because two used by the same initially loaded program might be compiled with the same base address, so their address spaces overlap.
One approach to this problem is to require Position Independent Code in DLLs. In such code all addresses are relative to the code itself. Jumps are usually relative to the PC (though a code segment register can also be used). Data are also relative to some data segment or base register. To choose the runtime location, the PIC code itself needs no change. Only the segment or base register(s) need(s) be set whenever in the prelude of every DLL routine.
PIC tends to be a bit slower than position dependent code because there's additional address arithmetic and the PC and/or base registers can bottleneck the processor's instruction pipeline.
So the other approach is for the loader to rebase the DLL code when necessary to eliminate address space overlaps. For this the DLL must include a table of all the absolute addresses in the code. The loader computes an offset between the assumed code and data base addresses and actual, then traverses the table, adding the offset to each absolute address as the program is copied into VM.
DLLs also have a table of entry points so that the calling program knows where the library procedures start. These must be adjusted as well.
Rebasing is not great for performance either. It slows down loading. Moreover, it defeats sharing of DLL code. You need at least one copy per rebase offset.
For these reasons, DLLs that are part of Windows are deliberately compiled with non-overlapping VM address spaces. This speeds loading and allows sharing. If you ever notice that a 3rd party DLL crunches the disk and loads slowly, while MS DLLs like the C runtime library load quickly, you are seeing the effects of rebasing in Windows.
You can infer more about this topic by reading about object file formats. Here is one example.
Position-independent code is code that you can run from any address. If you have a jmp instruction in position-independent code, it will often be a relative jump, which jumps to an offset from the current location. When you copy the code, it won't change the offsets between parts of the code so it will still work.
Relocatable code is code that you can run from any address, but you might have to modify the code first (maybe you can't just copy it). The code will contain a relocation table which tells how it needs to be modified.
Non-relocatable code is code that must be loaded at a certain address or it will not work.
Each program is different, it depends on how the program was written, or the compiler settings, or other various factors.
Shared libraries are usually compiled as position-independent code, which allows the same library to be loaded at different locations in different processes, without having to load multiple copies into memory. The same copy can be shared between processes, even though it is at a different address in each process.
Executables are often non-relocatable, but they can be position-independent. Virtual memory allows each program to have the entire address space (minus some overhead) to itself, so each executable can choose the address at which it's loaded without worrying about collisions with other executables. Some executables are position-independent, which can be used to increase security (ASLR).
Object files and static libraries are usually relocatable code. The linker will relocate them when combining them to create a shared library, executable, or other image.
Boot loaders and operating system kernels are almost always non-relocatable.
Yes, it is at runtime. The operating system, the part managing starting and switching tasks is ideally at a different protection level, it has more power. It knows what memory is in use and allocates some for the new task. It configures the mmu so that the new task has a virtual address space starting at zero or whatever the rule is for that operating system and processor. How you get into user mode at that starting address, is very processor specific.
One method for example is the hardware might save some state not just address but mode or virtual id or something when an interrupt occurs, lets say on the stack. And the return from interrupt instruction as defined by that processor takes the address, and state/mode, off of the stack and switches there (causing lets assume the mmu to react to its next fetch based on the new mode not the old). For a processor that works like that then you may be able to fake an interrupt return by placing the right items on the stack such that when you kick the interrupt return instruction it basically does a jump with additional features of mode switching, etc.
The ARM family for example (not cortex-m) has a processor state register for what you are running now (in the case of an interrupt or service call) and a second state register for where you came from, the state that was interrupted, when you do the proper return you give it the address and it switches back to that mode using the other register. You can directly access that register from the non-users modes so you can manipulate the state of the return. There is no return instruction in arm, just flavors of jump (modifications to the program counter), so it is a special kind of jump.
The short answer is that it is very specific to the processor as to what your choices are for jumping to the first time or returning to after a task switch to a running task in an application mode in a virtual address space. Either directly or indirectly the processor documentation will describe these modes and how you change them. If not explicitly described then you have to figure out on your own from the instructions and the mmu protections and such how to switch tasks.

C code that checksums itself *in ram*

I'm trying to get a ram-resident image to checksum itself, which is proving easier said than done.
The code is first compiled on a cross development platform, generating an .elf output. A utility is used to strip out the binary image, and that image is burned to flash on the target platform, along with the image size. When the target is started, it copies the binary to the correct region of ram, and jumps to it. The utility also computes a checksum of all the words in the elf that are destined for ram, and that too is burned into the flash. So my image theoretically could checksum its own ram resident image using the a-priori start address and the size saved in flash, and compare to the sum saved in flash.
That's the theory anyway. The problem is that once the image begins executing, there is change in the .data section as variables are modified. By the time the sum is done, the image that has been summed is no longer the image for which the utility calculated the sum.
I've eliminated change due to variables defined by my application, by moving the checksum routine ahead of all other initializations in the app (which makes sense b/c why run any of it if an integrity check fails, right?), but the killer is the C run time itself. It appears that there are some items relating to malloc and pointer casting and other things that are altered before main() is even entered.
Is the entire idea of self-checksumming C code lame? If there was a way to force app and CRT .data into different sections, I could avoid the CRT thrash, but one might argue that if the goal is to integrity check the image before executing (most of) it, that initialized CRT data should be part of that. Is there a way to make code checksum itself in RAM like this at all?
FWIW, I seem stuck with a requirement for this. Personally I'd have thought that the way to go is to checksum the binary in the flash, before transfer to ram, and trust the loader and the ram. Paranoia has to end somewhere right?
Misc details: tool chain is GNU, image contains .text, .rodata and .data as one contiguously loaded chunk. There is no OS, this is bare metal embedded. Primary loader essentially memcpy's my binary into ram, at a predetermined address. No relocations occur. VM is not used. Checksum only needs testing once at init only.
updated
Found that by doing this..
__attribute__((constructor)) void sumItUp(void) {
// sum it up
// leave result where it can be found
}
.. that I can get the sum done before almost everything except the initialization of the malloc/sbrk vars by the CRT init, and some vars owned by "impure.o" and "locale.o". Now, the malloc/sbrk value is something I know from the project linker script. If impure.o and locale.o could be mitigated, might be in business.
update
Since I can control the entry point (by what's stated in flash for the primary loader), it seems the best angle of attack now is to use a piece of custom assembler code to set up stack and sdata pointers, call the checksum routine, and then branch into the "normal" _start code.
If the checksum is done EARLY enough, you could use ONLY stack variables, and not write to any data-section variables - that is, make EVERYTHING you need to perform the checksumming [and all preceding steps to get to that point] ONLY use local variables for storing things in [you can read global data of course].
I'm fairly convinced that the right way is to trust the flash & loader to load what is in the flash. If you want to checksum the code, sure, go and do that [assuming it's not being modified by the loader of course - for example runtime loading of shared libraries or relocation of the executable itself, such as random virtual address spaces and such]. But the data loaded from flash can't be relied upon once execution starts properly.
If there is a requirement from someone else that you should do this, then please explain to them that this is not feasible to implement, and that "the requirement, as it stands" is "broken".
I'd suggest approaching this like an executable packer, like upx.
There are several things in the other answers and in your question that, for lack of a better term, make me nervous.
I would not trust the loader or anything in flash that wasn't forced on me.
There is source code floating around on the net that was used to secure one of, I think, HTCs recent phones. Look around on forum.xda-developers.com and see if you can find it and use it for an example.
I would push back on this requirement. Cellphone manufacturers spend a lot of time on keeping their images locked down and, eventually, all of them are beaten. This seems like a vicious circle.
Can you use the linker script to place impure.o and locale.o before or after everything else, allowing you to checksum everything but those and the malloc/sbrk stuff? I'm guessing malloc and sbrk are called in the bootloader that loads your application, so the thrash caused by those cannot be eliminated?
It's not an answer to just tell you to fight this requirement, but I agree that this seems to be over-thought. I'm sure you can't go into any amount of detail, but I'm assuming the spec authors are concerned about malicious users/hackers, rather than regular memory corruption due to cosmic rays, etc. In this case, if a malicious user/hacker can change what's loaded into RAM, they can just change your checksumming routine (which is itself running from RAM, correct?) to always return a happy status, no matter how well the checksum routine they aren't running anymore is designed.
Even if they are concerned about regular memory corruption, this checksum routine would only catch that if the error occurred during the original copy to memory, which is actually the least likely time such an error would occur, simply because the system hasn't been running long enough to have a high probability of a corruption event.
In general, what you want to do is impossible, since on many (most?) platforms the program loader may "relocate" some program address constants.
Can you update the loader to perform the checksum test on the flash resident binary image, before it is copied to ram?

What is the structure of a MPW tool's main symbol?

This question is about Mac OS Classic, which has been obsolete for several years now. I hope someone still knows something about it!
I've been building a PEF executable parser for the past few weeks and I've plugged a PowerPC interpreter to it. With a good dose of wizardry, I would expect to be able to run (to some extent) some Mac OS 9 programs under Mac OS X. In fact, I'm now ready to begin testing with small applications.
To help me with that, I have installed an old version of Mac OS inside SheepShaver and downloaded the (now free) MPW Tools1, and I built a "hello world" MPW tool (just your classic puts("Hello World!") C program, except compiled for Mac OS 9).
When built, this generates a program with a code section and a data section. I expected that I would be able to just jump to the main symbol of the executable (as specified in the header of the loader section), but I hit a big surprise: the compiler placed the main symbol inside the data section.
Obviously, there's no executable code in the data section.
Going back to the Mac OS Runtime Architectures document (published in 1997, surprisingly still up on Apple's website), I found out that this is totally legal:
Using the Main Symbol as a Data Structure
As mentioned before, the
main symbol does not have to point to a routine, but can point to a
block of data instead. You can use this fact to good effect with
plug-ins, where the block of data referenced by the main symbol can
contain essential information about the plug-in. Using the main symbol
in this fashion has several advantages:
The Code Fragment Manager
returns the address of the main symbol when you programmatically
prepare a fragment, so you do not need to call FindSymbol.
You do
not have to reserve and document the specific name of an export for
your plug-in.
However, not having a specific symbol name means that
the plug-in’s purpose is not quite as obvious. A plug-in can store its
name, icon, or information about its symbols in the main symbol data
structure. Storing symbolic information in this fashion eliminates the
need for multiple FindSymbol calls.
My conclusion, therefore, is that MPW tools run as plugins inside the MPW shell, and that the executable's main symbol points to some data structure that should tell it how to start.
But that still doesn't help me figure out what's in that data structure, and just looking at its hex dump has not been very instructive (I have an idea where the compiler put the __start address for this particular program, but that's definitely not enough to make a generic MPW shell "replacement"). And obviously, most valuable information sources on this topic seem to have disappeared with Mac OS 9 in 2004.
So, what is the format of the data structure pointed by the main symbol of a MPW tool?
1. Apparently, Apple very recently pulled the plug of the FTP server that I got the MPW Tools from, so it probably is not available anymore; though a google search for "MPW_GM.img.bin" does find some alternatives).
As it turns out, it's not too complicated. That "data structure" is simply a transition vector.
I didn't realize it right away because of bugs in my implementation of the relocation virtual machine that made these two pointers look like garbage.
Transition vectors are structures that contain (in this order) an entry point (4 bytes) and a "table of contents" offset (4 bytes). This offset should be loaded into register r2 before executing the code pointed to by the entry point.
(The Mac OS Classic runtime only uses the first 8 bytes of a transition vector, but they can technically be of any size. The address of the transition vector is always passed in r12 so the callee may access any additional information it would need.)

What is the deal with position-independent code (PIC)?

Could somebody explain why I should be interested in compiling position-independent code, and also why should I avoid it?
Making code position-independent adds a layer of abstraction, which requires an additional lookup step at runtime for certain operations (usually pertaining to accessing variables with static storage).
So if you don't need it, don't use it!
There are specific situations where you must produce PIC (namely when creating run-time loadable code, such as a plug-in module or library), but the added flexibility comes at a price.
The gory details depend on how your loader works on on whether you are building a executable or a library, but there is a sense in which this is all a problem for the build system and the compiler, not for you.
If you really want to understand you need to consider where the code gets put in the address space before execution starts and what set of branching instructions your chip provides. Are branches relative or absolute? Is access to the data segment relative or absolute?
If branches are absolute, then the code must be loaded to a reliable address or it won't work. That's position dependent code. Many simple (or older) operating systems accommodate this by always loading a program to the same place.
Relative branches mean that the can be placed at any location in memory. That is position independent code.
Again, all you need to know is the recipe for invoking your compiler and linker on your platform.
PIC code usually has to be slightly larger because the compiler can't use instructions that encode relative address offsets. Without PIC, many addresses can be encoded with 16 or 8 bits relative to current PC. Sometimes in embedded systems, PIC is useful. For example if you want to have patch code that can run at various physical addresses.

Resources