How does the ARM linker know where the exception table stops? - linker

I have come across something very similar to the following when looking for basic, bare-metal Cortex-M3 programming information (referred to in a great answer right here, that I'll attempt to locate later).
/* vectors.s */
.cpu cortex-m3
.thumb
.word 0x20002000 /* stack top address */
.word _start /* 1 Reset */
.word hang /* 2 NMI */
.word hang /* 3 HardFault */
.word hang /* 4 MemManage */
.word hang /* 5 BusFault */
.word hang /* 6 UsageFault */
.word hang /* 7 RESERVED */
.word hang /* 8 RESERVED */
.word hang /* 9 RESERVED*/
.word hang /* 10 RESERVED */
.word hang /* 11 SVCall */
.word hang /* 12 Debug Monitor */
.word hang /* 13 RESERVED */
.word hang /* 14 PendSV */
.word hang /* 15 SysTick */
.word hang /* 16 External Interrupt(0) */
.word hang /* 17 External Interrupt(1) */
.word hang /* 18 External Interrupt(2) */
.word hang /* 19 ... */
.thumb_func
.global _start
_start:
bl notmain
b hang
.thumb_func
hang: b .
This makes sense to me, I understand what it does, but what does not make sense to me is how the linker (or CPU, I'm not sure which...) knows where the exception table ends and actual code begins. It seems to me from the Cortex-M3 documentation that there can be an arbitrary number of external interrupts in the table.
How does this work, and what should I have read to learn?

That is my code, and it is the programmer, not the linker, that needs to know what is going on. The cortex-m's have different vector tables and they can be quite lengthy, with many individual interrupt vectors. The above was just a quick one from one flavor one day. If you will never use anything more than the reset vector, do you need to burn more memory locations creating an entire table? Probably not. You might want to make one deep enough to cover undefined exceptions and aborts and such, so you can trap them with a controlled hang. If you only need reset and one interrupt, but that interrupt is pretty deep in the table, you can make a huge table to cover both or, if brave enough, put some code between them to recover the lost space.
To learn/read more you need to go to the TRM, technical reference manual, for the specific cortex-m core (cortex-m0, cortex-m3, cortex-m4), as well as possibly the ARM ARM for the architecture it implements (armv7-m for the cortex-m3 and cortex-m4; the cortex-m0 is armv6-m). ARM ARM means ARM Architecture Reference Manual, which talks generically about the whole family but doesn't cover core specific details to any great length. You usually need both an ARM ARM and the right TRM when programming for an ARM (if you need to get into anything low level like asm or specific registers). All can be found at the arm website infocenter.arm.com; along the left look for Architecture or look for Cortex-M series processors.
These cores are new enough that they probably don't have more than the original rev number. For some older cores it is best to get the TRM for the specific revision of the core. Take the ARM11 MPCore for example: if the vendor used rev 1.0 (r1p0) of the core, then despite the obsolete markings on the manual you want to at least try to use the rev 1.0 manual over the rev 2.0 manual if there are differences. The vendors don't always tell you which rev they bought/used, so this can be a problem when sorting out how to program the thing or what errata apply to you and what don't. (Linux is FULL of mistakes related to ARM because of not understanding this: errata applied wrong in general or applied to the wrong core, instructions used on the wrong core or at the wrong time, etc.)
I probably wrote that handler for a cortex-m3 when i first got one and then cut and pasted here and there not bothering to fix/change it.
The core (CPU, logic) definitely knows how many interrupts are supported, and knows the complete makeup of the vector table. There might be control signals on the edge of the core that change these kinds of things, and the vendor may have tied those one way or made them programmable, etc. No matter what, the logic for that core definitely knows what the vector table looks like; it is up to the programmer to make the code match the hardware as needed. This will all be described in the TRM for that core.
EDIT
Hardcoded in the logic will be an address or an offset for each of these events or interrupts. So when the interrupt or event happens it performs a memory read from that address.
This is just a bank of flash memory: you feed it an address, flick a read strobe, and some data comes out. Those bits don't mean anything until you put them in a context. You can cause a memory cycle at address 0x10 if you create some code that does a load instruction at that address, or you can branch to that address and a fetch cycle reads that location hoping to find an instruction, and if an event has that hardcoded address and a read happens at that address it is hoping to find an address for a handler. But it is just a memory. Just like a file on a file system, it's just bytes on the disk. If it is a jpeg then a particular byte might be a pixel; if the file on the disk is a program then a byte at an offset might be an instruction. It's just a byte.
These addresses are generated directly from the logic. These days logic is written using programming languages (usually verilog or vhdl or some higher level language that produces verilog or vhdl as an output). It is no different than if you were to write a program in your favorite language, where you might choose to literally hardcode some address:
x = (unsigned int *)0x1234;
or you might choose to use a structure or an array or some sort of other programming style, but once compiled it still ends up producing some fixed address or offset:
unsigned int vector_table[256];
...
handler_address = vector_table[interrupt_base+interrupt_number];
...
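To make the fixed-offset idea concrete, here is a minimal C sketch of the arithmetic, assuming the Cortex-M layout in which entry 0 holds the initial stack pointer, every entry is one 32-bit word, and external interrupt n uses entry 16 + n. The function names and the table_base parameter are made up for illustration; on a real part the base and the number of entries are whatever the TRM for that core says.

```c
#include <stdint.h>

/* Hypothetical helper: byte address of the vector table entry the core
   reads for a given exception number (1 = Reset, 2 = NMI, 15 = SysTick). */
uint32_t vector_entry_address(uint32_t table_base, uint32_t exception_number)
{
    return table_base + 4u * exception_number;   /* one 32-bit word per entry */
}

/* Hypothetical helper: external interrupt n sits after the 16 system entries. */
uint32_t irq_entry_address(uint32_t table_base, uint32_t irq_number)
{
    return vector_entry_address(table_base, 16u + irq_number);
}
```

With the table at address 0, vector_entry_address(0, 2) is 0x08, the NMI slot, and irq_entry_address(0, 0) is 0x40, the first external interrupt.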
So as a programmer, at this low level, you have to know if and when the hardware is going to read one of these addresses and why. If you never use interrupts because you never enable any interrupts, then those memory locations that might normally hold interrupt handler addresses are now memory locations you can use for anything you want. As you see in many, almost all, of my examples, I only ever need the reset vector; the rest I don't use. I might hit an undefined instruction handler, or a data abort if I accidentally perform an unaligned access, but I don't worry about it enough to place a handler there, usually. I am going to crash/hang anyway, so I will sort that problem out when I get there, cross that bridge when I get to it. So I am usually content with the bare minimum for a cortex-m, the stack address and reset address:
.cpu cortex-m3
.thumb
.word 0x20002000 /* stack top address */
.word _start /* 1 Reset */
.thumb_func
.global _start
_start:
bl notmain
b hang
.thumb_func
hang: b .
And yes, absolutely I have placed the 32-bit instruction bl notmain (two 16-bit halfwords) at the memory location for an NMI. If an NMI were to occur, then the cpu would read that location, assume it was an address for a handler, and try to fetch the instruction at that address, which will have who knows what. It might know before even fetching that it is an invalid address and cause a data abort, which in the above case would be yet another address to somewhere, and it might turn into an infinite loop of data aborts. Or if by chance one of those instructions happens to look like an address in our program, then basically it will jump right into the middle of the program, which quite often will end up in some sort of crash. What is a crash, really? The cpu doing its job, reading bytes from memory and interpreting them as instructions; if those instructions tell it to do unaligned accesses, or those bytes are not valid instructions, or they cause you to read an invalid memory address, you go back into the vector table. Or you might be so lucky that the messed up addresses being written or read by jumping into the middle of the code space cause a flash bank to be erased, or a byte to spit out a uart, or a gpio input port to be changed into an output port (if really lucky you make it try to drive against a ground or something and you melt down the chip).
If I were to start seeing weird things (and hopefully not hear or smell or see the chip melting down) I might toss in a few dozen entries in the vector table that point to a hang, or point to a handler that spits something out a port or turns on a gpio/led. If my weirdness now becomes this led coming on or the uart output in the handler, then I have to sort out what event is happening, or if I think about the last few changes to my application I may realize I had an unaligned access or a branch into the weeds, etc.
It goes back to that saying, "The computer doesn't do what you want it to do, it does what you told it to do". You put the bytes there and it interpreted them for what they were. If you didn't put the right bytes in the right place (vector addresses to the right place, instructions that do the right thing, and the right data in the right place) it can/will crash.
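One cheap way to catch wrong bytes in a vector slot is to sanity-check the table words in a host-side test before flashing. The sketch below is only an illustration: the function name and the flash_base/flash_size parameters are invented here, and it relies on two properties of Cortex-M handler vectors, namely that bit 0 must be set to indicate Thumb state and that the target should lie inside the code region. It does not apply to entry 0, which is a stack address rather than a handler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check for one handler entry (not the initial stack pointer). */
bool vector_looks_valid(uint32_t word, uint32_t flash_base, uint32_t flash_size)
{
    if ((word & 1u) == 0u) {
        return false;                      /* Thumb bit clear: not a usable vector */
    }
    uint32_t target = word & ~1u;          /* actual instruction address */
    return target >= flash_base && (target - flash_base) < flash_size;
}
```

A word that is really an instruction encoding, or an address outside the part's flash, will usually fail one of the two tests.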

The answer is it doesn't. It's the responsibility of the programmer to provide enough vectors to trap all of the exceptions your CPU can generate. If you get it wrong, the CPU could load something that sits after the vector table (like an instruction from your _start code), treat it as a vector and try to jump to it (which would probably cause a data abort).

Related

How to determine if an instruction is long or short at the event of an exception? (Variable Length Instructions)

My question is about Chapter 5 in this link.
I have an Error Correction Code handler which simply increments the program counter (PC) by 2 or 4 bytes according to the length of the instruction at the time of the exception. The core is e200z4.
As far as I know, the e200z4 can support fixed-length instructions of 4 bytes, too.
The thing I don't understand is this: to determine whether Variable Length Instructions (VLE) are enabled, we need to check the VLEMI bit in the ESR (Exception Syndrome Register). However, this register always contains 0x00000000. The only interrupt that we end up with is the Machine Check Interrupt (IVOR1) (during power on and off tests with increasing on and fixed off intervals).
So, why does the CPU not provide the information about the length of the instruction if VLE is used at the moment of the interrupt, for instance via the VLEMI bit inside ESR? How could I determine whether the instruction at the time of the interrupt is 2 bytes or 4 bytes long, i.e. whether it is fixed length or variable length?
Note1: isOpCode32Bit below is decoding opCode to determine instruction length, but isOpCode32Bit is relevant only if isFixedLength is 0, i.e. when (syndrome & VLEMI_MASK) is nonzero. So, we need to have the VLEMI value in syndrome somehow, but ESR seems to be always 0x00 (why?).
Note2: As mentioned before, we always end up in IVOR1 and the instruction address right before the interrupt is reachable (provided in a register).
// IVOR1 (Machine Check Interrupt Assembly part):
(ASSEMBLY)(mfmcsr r7) // copy MCSR into register 7 (MCSR in Chapter 5 in the link)
(ASSEMBLY)(store r7 &syndrome)
// IVOR2:
(ASSEMBLY)(mfesr r7) // copy ESR into register 7 (ESR in Chapter 5 in the link)
(ASSEMBLY)(store r7 &syndrome)
------------------------------------------------------
#define VLEMI_MASK 0x00000020uL
isFixedLength = ((syndrome & VLEMI_MASK) == 0);
if (isFixedLength || isOpCode32Bit)
{
PC += 4; // instruction is 32-bit, increase PC by 4
}
else
{
PC += 2; // instruction is 16-bit, increase PC by 2
}
When it comes to how these exception handlers work in real systems:
Sometimes handling the exception only requires servicing a page fault (e.g. via copy on write or disc reload).  In such cases, we don't even need to know the length of the instruction, just the effective memory address the instruction is accessing, and the CPUs generally offer that value.  If the page fault can be serviced, then re-running that faulting instruction (without advancing the PC) is appropriate (and if not, then halting the program, also without advancing the PC, is appropriate.)
In other cases, such as software emulation for instructions not present in this hardware, presumably hardware designers consider that such a software handler needs to decode the faulting instruction in order to emulate it, and so will figure out the instruction length anyway.
Thus, hardware turns the job of understanding the faulting instruction over to software.  As such system software needs to have deep knowledge of the instruction set architecture, while also likely requiring customization for each different hardware instantiation of the instruction set.
So, why does the CPU not provide information about the length of the instruction at the moment of interrupt inside ESR?
No CPU that I know of tells us the length of an instruction that caused an exception. If one did, that would be convenient, but only for toy exception handlers. For real systems, ultimately, this isn't a true burden.
How to determine if an instruction is long or short at the event of an exception? (Variable Length Instructions)
Decode the instruction (while considering any instruction modes the CPU was in at the time of exception)!
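As a sketch of what the handler's PC adjustment reduces to once the instruction has been decoded: this assumes the handler already knows, from somewhere other than ESR, whether the faulting code was running in VLE mode and, if so, whether the decoded opcode is a 32-bit form. The enum and function names are hypothetical, and the opcode decoding itself is deliberately left out because it has to follow the instruction set documentation for the core.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of the instruction mode at the time of the exception. */
typedef enum { MODE_FIXED_LENGTH, MODE_VLE } insn_mode;

/* Address of the instruction following the one at `pc`.
   `opcode_is_32bit` is only consulted in VLE mode. */
uint32_t next_instruction_address(uint32_t pc, insn_mode mode, bool opcode_is_32bit)
{
    if (mode == MODE_FIXED_LENGTH || opcode_is_32bit) {
        return pc + 4u;   /* 32-bit instruction */
    }
    return pc + 2u;       /* 16-bit VLE instruction */
}
```

Keeping the mode and the decoded width as explicit inputs makes it obvious that both must come from a reliable source before the PC is touched.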

How can I trace the cause of an invalid PC fault on Cortex M3?

I have an STM32 Cortex M3 that is experiencing an intermittent invalid PC (INVPC) fault. Unfortunately it takes a day or more to manifest and I don't know the cause.
I have the device paused in the debugger after the fault happened. The INVPC flag is set. The stacked registers are as follows:
0x08003555 xPSR
0x08006824 PC
0x08006824 LR
0x00000000 R12
0x08003341 R3
0x08006824 R2
0xFFFFFFFD R1
0x0000FFFF R0
Unfortunately the return address 0x08006824 is just past the end of the firmware image. The decompilation of that region is as follows:
Region$$Table$$Base
0x08006804: 08006824 $h.. DCD 134244388
0x08006808: 20000000 ... DCD 536870912
0x0800680c: 000000bc .... DCD 188
0x08006810: 08005b30 0[.. DCD 134241072
0x08006814: 080068e0 .h.. DCD 134244576
0x08006818: 200000bc ... DCD 536871100
0x0800681c: 00001a34 4... DCD 6708
0x08006820: 08005b40 #[.. DCD 134241088
Region$$Table$$Limit
** Section #2 'RW_IRAM1' (SHT_PROGBITS) [SHF_ALLOC + SHF_WRITE]
Size : 188 bytes (alignment 4)
Address: 0x20000000
I'm not sure this address is valid. The disassembly of that address in the debugger looks like nonsense, maybe data interpreted as code or something.
Is there any way I can trace this back to see where the exception happened? If necessary I can add some additional code to capture more information.
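One piece of additional code that fits here is a fault handler that saves the stacked registers somewhere that survives until the debugger attaches. Below is a minimal sketch of the decoding part only, assuming the basic eight-word Cortex-M3 exception frame (R0, R1, R2, R3, R12, LR, PC, xPSR at ascending addresses). The struct and function names are invented, and obtaining the right stack pointer (MSP or PSP) inside the handler is not shown.

```c
#include <stdint.h>

/* Hypothetical view of the registers the core pushes on exception entry. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} stacked_frame;

/* `sp` points at the lowest address of the frame, i.e. the stacked R0. */
stacked_frame read_stacked_frame(const uint32_t *sp)
{
    stacked_frame f = { sp[0], sp[1], sp[2], sp[3], sp[4], sp[5], sp[6], sp[7] };
    return f;
}
```

Copying the frame into a global (or a RAM area that is not cleared on reset) gives the same eight values as the dump above without needing the debugger at the moment of the fault.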
I'm not sure how it works on Cortex M3, but on some other ARMs the PSR register holds processor mode bits that could help you find out when the fault happens (in user mode, IRQ, FIQ etc). Each mode generally has its own stack.
For user mode, if you use an RTOS with multi-tasking, you probably have a separate stack for each task, but you could try to find out which task is the current one (the one that was running before the crash).
When you find the crashed task (or IRQ) you could look at its stack for the addresses of all routines and find out what was called before the accident. This only works, of course, if the stack was not unrecoverably corrupted.
This is where I'd start the investigation. If you find the crashed task or even the function but still have no idea what happens, you could add something like a small circular history buffer into which you write a code at every step of your program, so you can find out what it did last even if the stack was destroyed.
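The circular history buffer could look something like the following. This is a minimal sketch, not tuned for any particular RTOS: the names, the depth of 16 and the 16-bit code width are arbitrary choices for illustration, and there is no locking, so calls from both tasks and interrupts would need extra care.

```c
#include <stdint.h>

#define TRACE_DEPTH 16u   /* power of two so the index can be masked */

typedef struct {
    uint16_t codes[TRACE_DEPTH];
    uint32_t next;        /* total number of codes written so far */
} trace_buffer;

/* Record one checkpoint code, overwriting the oldest entry when full. */
void trace_put(trace_buffer *t, uint16_t code)
{
    t->codes[t->next & (TRACE_DEPTH - 1u)] = code;
    t->next++;
}

/* Return the code written `age` steps ago (0 = most recent).
   Caller must ensure age < TRACE_DEPTH and age < t->next. */
uint16_t trace_get_recent(const trace_buffer *t, uint32_t age)
{
    return t->codes[(t->next - 1u - age) & (TRACE_DEPTH - 1u)];
}
```

Sprinkling trace_put(&trace, SOME_CODE) at function entries and inspecting the buffer in the debugger after the fault shows the last few checkpoints reached, independent of the stack contents.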

How do you startup the additional cores on an Allwinner H5?

I am trying to figure out how to start cores other than core0 on a quad core Allwinner H5. The C_RST_CTRL register (a.k.a. CPU2 Reset Control Register) has four bits at the bottom that appear to be four reset controls. The lsbit is one and the other three are zeros, implying that setting those releases reset on the other cores, but I don't see that happening (nothing is running the code I have left at address zero). At the same time, zeroing that lsbit does stop core0, implying that it is a reset control. So I assume there are clock gates somewhere, but I cannot find them.
The prcm registers, which are not documented in the H5 docs but are on a sunxi wiki page for older Allwinners, do show what seem to be real PLL settings, but the cpu enable registers are marked as A31 only and the cpu0 register(s) are not set up, so that would imply that this is not how you enable any cpu, including cpu0, on this chip.
What am I missing?
For a pure bare metal solution look at sunxi_cpu_ops.c from the plat/sun50iw1p1 directory of https://github.com/apritzel/arm-trusted-firmware.git
You need to deactivate various power clamps as well as clock gates.
Alternatively, include the Arm Trusted Firmware code and enable a core by an SMC call:
ldr x2,=entry_point
mov x1,#corenumber
mov x0,#0x03
movk x0,#0x8400,lsl #16
smc #0
I've now confirmed this works on an H5.
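For readers wondering where the constant in x0 comes from: the mov/movk pair builds 0x84000003, which is the 32-bit function ID of PSCI CPU_ON. The C sketch below just mirrors that composition; the function name is invented for illustration.

```c
#include <stdint.h>

/* Mirror of `mov x0,#lo16` followed by `movk x0,#hi16,lsl #16`:
   movk replaces bits 31:16 and keeps the low halfword. */
uint32_t mov_movk(uint16_t lo16, uint16_t hi16)
{
    return ((uint32_t)hi16 << 16) | lo16;
}
```

mov_movk(0x03, 0x8400) yields 0x84000003, the value passed in x0 before the smc #0.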
Does C_CPU_STATUS STANDBY_WFI=0x0E suggest that the secondary cores are sitting in WFI?
Not an answer, I don't have enough rep to comment but I'm just starting the same exercise myself.
As an aside, how did you put code at address 0? Isn't that BROM? I was going to play with the RVBARADDR registers.

Why NOP/few extra lines of code/optimization of pointer aliasing helps? [Fujitsu MB90F543 MCU C code]

I am trying to fix a bug found in a mature program for the Fujitsu MB90F543. The program has been working for nearly 10 years so far, but it was discovered that under some special circumstances it fails to do two things at its very beginning. One of them is crucial.
After low and high level initialization (ports, pins, peripherals, IRQ handlers), configuration data is read over SPI from an EEPROM and the status LEDs are turned on for a moment (to turn them on, data is sent over SPI to a LED driver).
When those special circumstances occur, the first (and only the first) function that performs a few EEPROM reads fails, and additionally a few of the LEDs that should turn on do not.
The program is written in C and compiled using Softune v30L32.
Surprisingly, it is sufficient to add a single __asm(" NOP ") in the low level hardware init to make the program work as expected under the mentioned circumstances. It is also sufficient to turn off 'Control optimization of pointer aliasing' in the Optimization settings. Adding just a few lines of code in various places helps too.
I have compared (DIFFed) the ASM listings of the compiled program for versions with and without __asm(" NOP ") and with both aforementioned optimizer settings, and they all look just fine.
The only warning Softune compiler has been printing for years during compilation is as follows:
*** W1372L: The section is placed outside the RAM area or the I/O area (IOXTND)
I do realize it's a rather general question, but maybe someone who has a bigger picture will be able to point out a possible cause.
Do you have any idea what may cause such weird behaviour? How can I locate the bug and fix it?
During the initialization a few long (about 20ms) delay loops are used. They don't help, although they were increased from about 2ms, yet a single NOP on any line of the hardware initialization function, and even before or after the function, helps.
Both wait loops work. I have checked this using an oscilloscope (I added a LED turn on before and a turn off after).
I have checked the timing hypothesis by slowing down the SPI clock from 1MHz to 500kHz. It does not change anything. Slowing down to 250kHz causes watchdog resets, as some parts of the code take too long to execute (>25ms).
One more thing. I have observed that adding local variables in any source file sometimes makes the problem disappear or reappear. The same goes for initializing uninitialized local variables. Adding a few extra lines of code in any of the files hides or reveals the problem.
void main(void)
{
watchdog_init();
// waiting for power supply to stabilize
wait; // about 45ms
hardware_init();
clear_watchdog();
application_init();
clear_watchdog();
wait; // about 20ms
test_LED();
{...}
}
void hardware_init (void)
{
__asm("NOP"); // why does this help? - it may be on any line of the function
io_init(); // ports initialization
clk_init();
timer_init();
adc_init();
spi_init();
LED_init();
spi_start();
key_driver_init();
can_init();
irq_init(); // set IRQ priorities and global IRQ enable
}
Could be one of many things but two spring to mind.
Timing.
Maybe the wait is not long enough for power to stabilize and not everything is synced to the clock. The NOP gets everything back in sync.
Alignment.
Perhaps the NOP gets your instructions aligned on a 32 or 64 bit boundary expected by the hardware. (We used to do this a lot in mainframe assemblers, as IO operations often expected things to be on double word boundaries.)
The problem was solved. It was caused by a trivial bug.
The EEPROM's nHOLD and nCS signals were not initialized immediately after the MCU's reset, but only just before the first use of the EEPROM. As a result they were 0, so active.
This means the EEPROM was selected, but waiting on hold. In the meantime another transfer using SPI started. After 6 out of 8 CLK pulses the EEPROM's nHOLD I/O pin was initialized and brought high. The EEPROM was no longer on hold, so it clocked in the last two bits of data meant for another peripheral. Every subsequent operation on the EEPROM found it out of sync with CLK and MOSI.
When I added a NOP or anything else, the moment of the nHOLD 0->1 edge was shifted to happen after the last CLK pulse. Now CLK and MOSI were in sync.
All I had to do was to initialize all the EEPROM's SPI lines, in particular nHOLD and nCS, right after the MCU reset.
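The failure mode is easy to reproduce in a toy model. The sketch below is not the real EEPROM's behaviour or the original firmware, just a host-side illustration with invented names: a device counts clock pulses only while it is selected (nCS low) and not on hold (nHOLD high), and nHOLD is released after a given clock pulse of an 8-bit transfer meant for another peripheral.

```c
/* Toy model: number of bits of a foreign 8-bit transfer that the EEPROM
   clocks in. `ncs_at_start` is the nCS level during the transfer
   (0 = selected), `hold_released_after_clk` is the clock pulse (1..8)
   after which nHOLD goes from 0 to 1. */
unsigned stray_bits_clocked(int ncs_at_start, int hold_released_after_clk)
{
    int nhold = 0;            /* uninitialized pin reads as 0: on hold */
    unsigned bits_seen = 0;

    for (int clk = 1; clk <= 8; clk++) {
        if (ncs_at_start == 0 && nhold == 1) {
            bits_seen++;      /* selected and not on hold: bit is shifted in */
        }
        if (clk == hold_released_after_clk) {
            nhold = 1;        /* late initialization of the nHOLD pin */
        }
    }
    return bits_seen;
}
```

Releasing hold after pulse 6 lets two stray bits in, releasing it after pulse 8 lets none in (the effect of the extra NOP), and driving nCS high from the start lets none in regardless of when nHOLD changes (the actual fix).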

ARM Bootloader: Disable MMU and Caches

According to some tutorials, we should disable the MMU and I/D-caches at the beginning of a bootloader. If I understand correctly, the aim is to use physical addresses directly in the program; please correct me if I'm wrong. Thank you!
Secondly, we do this to disable MMU and Caches:
mrc P15, 0, R0, C1, C0, 0
bic R0, R0, #0x00002300 # clear bits 13, 9:8
bic R0, R0, #0x00000087 # clear bits 7, 2:0
orr R0, R0, #0x00000002 # set bit 2 (A) Align
orr R0, R0, #0x00001000 # set bit 12 (I) I-Cache
mcr P15, 0, R0, C1, C0, 0
D-Cache, MMU and Data Address Alignment Fault Checking have been disabled by clearing bits 2:0, but why do we enable bit 2 immediately in the following instruction? To make sure this manipulation is valid?
Last question: why is the D-cache disabled but the I-cache enabled? To speed up instruction processing?
Last question: why is the D-cache disabled but the I-cache enabled? To speed up instruction processing?
The MMU has settings that determine which memory regions are cacheable. If you do not have the MMU on but you have the data cache on (if that is even possible), then you cannot safely talk to peripherals. If you read the uart status register, for example, that read goes through the cache just like any other data operation, and whatever that status is stays in the cache for subsequent reads until that cache line is evicted and you get one more shot at the actual register. Let's say you have some code that polls the uart status register waiting for a character in the rx buffer. If that first read shows there is no character, that status goes in the cache and you will remain in the loop forever, since you never get to talk to the status register again; you simply get the cached copy. If there was a character in there, then that status also gets cached. You read the rx register and perhaps do something; when you come back, if the status has not been evicted from the data cache, you get the stale status which shows there is a character. Your rx buffer read may or may not also be cached, so you may get the stale value from the cache, or whatever the peripheral returns when you read and there is no new value, or you might get a new value. What you don't get in these situations is proper access to the peripheral. When the MMU is on, you use it to mark the address space used by that peripheral as non-(data)-cacheable, and you don't have this problem. With the MMU off you need the data cache off for arm systems.
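The stale-status problem can be shown with a toy model on a host machine. This is only an illustration with invented names: a "register" whose value the hardware can change, fronted by a one-entry cache that is filled on the first read and never evicted.

```c
#include <stdint.h>

/* Toy model of a peripheral status register behind a data cache. */
typedef struct {
    uint32_t hw_status;    /* what the peripheral currently reports */
    int      cache_valid;  /* 1 once the line has been filled */
    uint32_t cached;       /* copy held in the cache */
} cached_reg;

/* Cached read: the first access fills the line, later ones never see hardware. */
uint32_t read_cached(cached_reg *r)
{
    if (!r->cache_valid) {
        r->cached = r->hw_status;
        r->cache_valid = 1;
    }
    return r->cached;
}

/* Uncached read: always goes to the peripheral. */
uint32_t read_uncached(const cached_reg *r)
{
    return r->hw_status;
}
```

If hw_status changes from 0 to 1 after the first read_cached call, read_cached keeps returning 0 while read_uncached returns 1, which is exactly the polling loop that never exits.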
Leaving the I-cache on is okay because instruction fetches only read instructions... Well, for a bare metal application that is okay; it helps, for example, if you are using a flash that has a potential for read disturb (spi or i2c flashes). The problem is that this application is a bootloader, so you must take some extra care. For example, some code at address 0x8000 has been run at least once, and then the bootloader, which might live at say address 0x10000000, loads a new program at 0x8000. This load uses data accesses, so it does not go through the instruction cache. So there is a potential that the instruction cache holds some or all of the code from the last time you were in the 0x8000 area, and when you branch to the bootloaded code at 0x8000 you will get either the old program from cache or a nasty mixture of old program and new program for the parts that are cached and not cached. So if your bootloader allows the i-cache to be on, you need to invalidate the cache before branching to bootloaded code.
Lastly, if you or anyone using this bootloader wants to use jtag, then you have that same problem but worse. Data cycles that do not go through the i-cache are used to write the new program to ram, so when you tell the jtag debugger to run the new program you will get one of 1) only the new program, 2) a mixture of the new program and old program fragments from cache, 3) the old program from cache.
So d-cache is bad without an mmu because of things that are not in ram, peripherals, etc. The i-cache is a use at your own risk kind of thing which you can mitigate except for the times that jtag is used for debugging.
If you have concerns about, or have confirmed, read-disturb in your (external) flash, then I recommend that you turn on the i-cache, use a tight loop to copy your application to ram, branch to the ram copy and run there, turn off the i-cache (or use at your own risk), and don't touch the flash again, certainly not with heavy read accesses to small areas. A tight uart polling loop, like you might have for a command line parser, is a really good place to get hit with read-disturb.
You did not specify which ARM you are working on. Capabilities may vary from one ARM to another (there is a huge gap between an ARM9 and an ARM Cortex A15).
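In the given code, bit 2 is cleared and then set, but it does not matter, as those changes are made in R0 only. (Counting from zero, the mask 0x00000002 is actually bit 1, the A bit; the code's comment numbers it as bit 2.) There is no change in the ARM's behavior until the write to the CP15 register (done by the instruction mcr P15, 0, R0, C1, C0, 0).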
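This can be checked on paper by replaying the four mask operations on a plain variable. The sketch below does that in C. The function name is made up, and the bit names in the comments (M = bit 0, A = bit 1, C = bit 2, I = bit 12, counting from zero) follow the usual ARM control register layout, which should be confirmed against the TRM for the core in use.

```c
#include <stdint.h>

/* Replay of the bic/bic/orr/orr sequence on the value read by mrc.
   Only the final result is written back by mcr. */
uint32_t control_value_to_write(uint32_t r0)
{
    r0 &= ~0x00002300u;   /* bic: clear bits 13, 9 and 8 */
    r0 &= ~0x00000087u;   /* bic: clear bits 7, 2 (C), 1 (A) and 0 (M) */
    r0 |=  0x00000002u;   /* orr: set bit 1 (A, alignment checking) again */
    r0 |=  0x00001000u;   /* orr: set bit 12 (I, instruction cache) */
    return r0;
}
```

Whatever the starting value, the result has M and C clear and A and I set, so the intermediate state in which A was briefly clear never reaches the hardware.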
Concerning d-cache/i-cache enabling, it is only a matter of choice; there is no requirement. On the products I work on, the bootloader enables the L1 I-cache, D-cache, L2 cache, and MMU (and it disables all of them before jumping to Linux). Be sure to follow the ARM documentation about cache invalidation and memory barriers (according to your actual ARM core) if you use the caches and MMU in your bootloader.

Resources