How to correctly use a startup IPI to start an application processor?

My goal is to let my own kernel start an application CPU. It uses the same mechanism as the Linux kernel:
Send an asserting, level-triggered INIT IPI
Wait...
Send a de-asserting, level-triggered INIT IPI
Wait...
Send up to two startup IPIs with vector number (0x40000 >> 12) (the entry code for the application processor lies there)
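For concreteness, a minimal sketch of that sequence in C, assuming an xAPIC with the usual memory-mapped register file at 0xFEE00000 and a hypothetical delay_us() timer helper; the ICR bit encodings are the ones from the Intel SDM:

#include <stdint.h>

#define LAPIC_BASE  0xFEE00000u                 /* default xAPIC MMIO base */
#define ICR_LOW     (LAPIC_BASE + 0x300u)
#define ICR_HIGH    (LAPIC_BASE + 0x310u)

static inline void lapic_write(uint32_t reg, uint32_t val)
{
    *(volatile uint32_t *)(uintptr_t)reg = val;
}

extern void delay_us(unsigned us);              /* hypothetical helper */

static void start_ap(uint8_t apic_id)
{
    /* destination APIC ID lives in ICR_HIGH[31:24] */
    lapic_write(ICR_HIGH, (uint32_t)apic_id << 24);
    lapic_write(ICR_LOW, 0x0000C500u);          /* INIT, level-triggered, assert */
    delay_us(10000);

    lapic_write(ICR_HIGH, (uint32_t)apic_id << 24);
    lapic_write(ICR_LOW, 0x00008500u);          /* INIT, level-triggered, de-assert */
    delay_us(10000);

    for (int i = 0; i < 2; i++) {               /* up to two startup IPIs */
        lapic_write(ICR_HIGH, (uint32_t)apic_id << 24);
        lapic_write(ICR_LOW, 0x00004600u | (0x40000u >> 12)); /* SIPI, vector 0x40 */
        delay_us(200);
    }
}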
Currently I'm just interested in making it work with QEMU. Unfortunately, instead of jumping to 0x40000, the application CPU jumps to 0x0 with the CS register set to 0x4000. (I checked with gdb.)
The Intel MultiProcessor Specification (B.4.2) explains that the behavior I noticed is valid if the target processor is halted immediately after RESET or INIT. But shouldn't this also apply to the code of the Linux kernel? It sends the startup IPI after the INIT IPI. Or do I misunderstand the specification?
What can I do to have the application processor jump to 0x000VV000 and not to 0x0 with the CS register set to 0xVV00? I really can't see where Linux does something that changes the behavior.

It seems that I really misunderstood the specification: since the application CPU is started in real mode, 0x000VV000 is equivalent to 0xVV00:0x0000. The address cannot be represented in the 16-bit IP register alone, so a segment base in the code segment register is required; CS = 0xVV00 with IP = 0x0000 is exactly the expected behavior.
Additionally, debugging real-mode code with gdb is comparatively complicated because gdb does not respect the segment base. To see the disassembled code of the trampoline at the current position, it is necessary to calculate the physical location:
x/20i $eip+0xVV000
This makes gdb print the next 20 instructions at 0xVV00:$eip.


How to determine if an instruction is long or short at the event of an exception? (Variable Length Instructions)

My question is about Chapter 5 in this link.
I have an error-correction code which simply increments the program counter (PC) by 2 or 4 bytes according to the length of the instruction at the time of the exception. The core is an e200z4.
As far as I know, the e200z4 can also support fixed-length instructions of 4 bytes.
The thing I don't understand is this: to determine whether Variable Length Instructions (VLE) are enabled, we need to check the VLEMI bit in the ESR (Exception Syndrome Register). However, this register always contains 0x00000000. The only interrupt that we end up with is the Machine Check Interrupt (IVOR1) (during power-on and power-off tests with increasing on and fixed off intervals).
So, why does the CPU not provide the information about the length of the instruction, if VLE is used at the moment of the interrupt, for instance via the VLEMI bit inside the ESR? And how can I determine whether the instruction at the time of the interrupt is 2 or 4 bytes long, i.e. fixed length or variable length?
Note 1: isOpCode32Bit below decodes the opcode to determine the instruction length, but isOpCode32Bit is relevant only if isFixedLength is 0, i.e. when (syndrome & VLEMI_MASK) is nonzero. So we need to have the VLEMI value in syndrome somehow, but the ESR seems to always read 0x00000000 (why?).
Note 2: As mentioned before, we always end up in IVOR1, and the address of the instruction right before the interrupt is reachable (provided in a register).
// IVOR1 (Machine Check Interrupt, assembly part):
mfspr   r7, 572              // copy MCSR into r7 (MCSR in Chapter 5 in the link)
lis     r8, syndrome@ha
stw     r7, syndrome@l(r8)   // syndrome = MCSR
// IVOR2:
mfspr   r7, 62               // copy ESR into r7 (ESR in Chapter 5 in the link)
lis     r8, syndrome@ha
stw     r7, syndrome@l(r8)   // syndrome = ESR
------------------------------------------------------
#define VLEMI_MASK 0x00000020uL

isFixedLength = ((syndrome & VLEMI_MASK) == 0);
if (isFixedLength || isOpCode32Bit)
{
    PC += 4; // instruction is 32-bit, increase PC by 4
}
else
{
    PC += 2; // instruction is 16-bit, increase PC by 2
}
When it comes to how these exception handlers work in real systems:
Sometimes handling the exception only requires servicing a page fault (e.g. via copy-on-write or reload from disk).  In such cases we don't even need to know the length of the instruction, just the effective memory address the instruction is accessing, and CPUs generally offer that value.  If the page fault can be serviced, then re-running the faulting instruction (without advancing the PC) is appropriate (and if not, then halting the program, also without advancing the PC, is appropriate).
In other cases, such as software emulation for instructions not present in this hardware, presumably hardware designers consider that such a software handler needs to decode the faulting instruction in order to emulate it, and so will figure out the instruction length anyway.
Thus, hardware turns the job of understanding the faulting instruction over to software.  Such system software needs deep knowledge of the instruction set architecture, and it likely requires customization for each different hardware instantiation of the instruction set.
So, why does the CPU not provide information about the length of the instruction at the moment of interrupt inside ESR?
No CPU that I know of reports the length of the instruction that caused an exception.  If one did, that would be convenient, but only for toy exception handlers.  For real systems, ultimately, this isn't a true burden.
How to determine if an instruction is long or short at the event of an exception? (Variable Length Instructions)
Decode the instruction (while considering any instruction modes the CPU was in at the time of exception)!
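A sketch of what "decode it yourself" can look like here: with VLE disabled, the e200z4's BookE instructions are always 4 bytes; with VLE enabled, the length has to come from the opcode. The vle_is_32bit() predicate below is an assumption standing in for the question's isOpCode32Bit; its opcode map must be taken from the VLE programming environments manual and is not reproduced here.

#include <stdint.h>

/* Hypothetical predicate: returns nonzero if the VLE instruction whose
 * first halfword is 'hw' is 32 bits long (opcode map from the VLE PEM). */
extern int vle_is_32bit(uint16_t hw);

uint32_t next_pc(uint32_t faulting_pc, int vle_enabled)
{
    if (!vle_enabled)
        return faulting_pc + 4;     /* BookE: fixed 32-bit instructions */

    /* read the first halfword of the faulting instruction */
    uint16_t hw = *(const uint16_t *)(uintptr_t)faulting_pc;
    return faulting_pc + (vle_is_32bit(hw) ? 4u : 2u);
}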

CodeViser JTAG debugger: how does it know U-Boot's relocated offset in RAM?

I am currently using a CodeViser JTAG debugger connected to an FPGA to debug firmware for an ARMv8 processor. From within the CodeViser GUI client (called CVD64 on Windows) I can set breakpoints at absolute addresses.
One thing I noticed is that when I set a breakpoint at some function in U-Boot, I use the address shown in u-boot.map, which reflects the initial load address of U-Boot. In my case, the U-Boot text section is initially loaded at 0x20408000.
When the breakpoint I set hits, the actual address where the PC stops is not the one I specified in the GUI but another address, namely the relocated one (i.e., after U-Boot relocation).
For example, I set a breakpoint from the GUI at 0x20413040, and CVD64 stops at 0x207ae040; the offset is 0x39b000. This offset is exactly the same as what U-Boot itself printed to the serial port:
Relocation Offset is: 0039b000
Relocating to 207a3000, new gd at 2075edf0, sp at 2075ede0
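The arithmetic checks out against the map file; a small worked example using the numbers from this question:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t map_addr  = 0x20413040u;  /* breakpoint address from u-boot.map */
    uint32_t reloc_off = 0x0039b000u;  /* "Relocation Offset is: 0039b000"   */
    /* prints 0x207ae040, the address where CVD64 actually stopped */
    printf("runtime address: 0x%08x\n", map_addr + reloc_off);
    return 0;
}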
This is nice. I am just wondering how CVD64 knows the relocation offset and automatically places the HLT instruction at the relocated address?
Thanks!

Cannot write TMC/ETF registers on STM32H753

This question is a follow-up to this one.
I am trying to configure the ETF in circular mode in order to be able to read the execution trace from the embedded software in case of a critical error, on an STM32H753.
I am following the algorithm described in ARM's Trace Memory Controller reference manual (section 2.2.2).
But I cannot write the ETF registers: I unlocked the ETF macrocell by writing the magic number 0xC5ACCE55 to the ETF_LAR register, but when I read the registers they are all 0 (through the debugger or printf), and when I write them they remain 0.
Any advice on how to write the ETF registers?
There is actually a mistake in the STM32H7 documentation: the value given in one place is contradicted a few pages later. My understanding is that the second one is wrong.
Also, I had to use the "component base address (system bus)" and not the "component base address (debugger)". This notion of a double address is pretty confusing since, for example, the ETM has only one address.
Anyway, using 0x5C014000 as the base address, I can read/write the registers.
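For reference, a minimal sketch of the unlock-then-configure sequence using that system-bus base address; the register offsets are the standard CoreSight TMC ones (LAR at 0xFB0, MODE at 0x028, CTL at 0x020) and should be checked against the TMC TRM:

#include <stdint.h>

#define ETF_BASE    0x5C014000u   /* component base address (system bus) */
#define ETF_REG(o)  (*(volatile uint32_t *)(ETF_BASE + (o)))

#define ETF_CTL     0x020u        /* bit 0 = TraceCaptEn */
#define ETF_MODE    0x028u        /* 0 = circular buffer mode */
#define ETF_LAR     0xFB0u        /* lock access register */
#define ETF_LSR     0xFB4u        /* lock status register */

static void etf_enable_circular(void)
{
    ETF_REG(ETF_LAR) = 0xC5ACCE55u;   /* unlock the macrocell */
    (void)ETF_REG(ETF_LSR);           /* optionally check that SLK is now 0 */

    ETF_REG(ETF_MODE) = 0u;           /* circular mode */
    ETF_REG(ETF_CTL)  = 1u;           /* start trace capture */
}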
When accesses are made using the DAP (SWD or JTAG), the debug subsystem should be powered up as part of the handshake with the tools. In self-hosted debug mode, device-specific control of clocks/power is usually necessary.
For STM32H7 parts:
The ETM is in the same clock/power domain as the core, so it will normally be visible.
CK_DBG_D1 clocks the trace components in the D1 power domain: system ROM table 2, CoreSight trace funnel, ETF, system CTI and TPIU. It is a gated version of the D1 domain system clock (CK_HCLK_D1).
Self-hosted debug clock and power control uses a dedicated set of registers.
All the debug clocks (except DAPCLK) can be enabled and disabled by register bits in the DBGMCU.
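So, in the self-hosted case, the debug clocks have to be switched on before the ETF responds. A hedged sketch of what that looks like; the DBGMCU base address and the bit names/positions are my reading of the STM32H7 reference manual and must be verified for your exact part:

#include <stdint.h>

#define DBGMCU_CR  (*(volatile uint32_t *)(0x5C001000u + 0x004u))

#define DBGMCU_CR_TRACECKEN  (1u << 20)   /* trace clock enable (assumed position) */
#define DBGMCU_CR_D1DBGCKEN  (1u << 21)   /* D1 domain debug clock enable (assumed) */

static void enable_d1_debug_clocks(void)
{
    /* gate on CK_DBG_D1 and the trace clock so the ETF becomes accessible */
    DBGMCU_CR |= DBGMCU_CR_TRACECKEN | DBGMCU_CR_D1DBGCKEN;
}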

Using PMU to count ticks on ARM

I'm trying to use the PMU (specifically PMCCNTR, to determine ticks per usec) from userspace on ARM. I have an arm64 kernel running a 32-bit ARM userspace app.
I created an LKM to force the PMUSERENR.EN bit on, which works. I then ran a test program from userspace:
uint32_t pmuserenr;
asm volatile ("mrc p15, 0, %[en], c9, c14, 0"   // read PMUSERENR
              : [en] "=r" (pmuserenr));
printf(" -- %08x\n", pmuserenr);
The first few times I ran this, the bit was correctly marked as turned on. But every third or fourth time I run it, the bit shows up as off. After puzzling over this, I came across this link, where the user enables this bit for each CPU. Does each core have its own instance of the PMU, or is there only one per SoC? If there are multiple per SoC, are the PMCCNTR registers synced up? (That is, if I read PMCCNTR, then context-switch to another core and read PMCCNTR again, can I safely compare the two to see how many ticks have elapsed?)
Even if I modify my LKM to initialize on every online CPU (and register a notifier to enable it on all new CPUs coming up), I'm still seeing the same behavior, where every nth read from userspace shows the bit not set.
My other issue is that PMCCNTR is not incrementing. I set the PMCR.E bit but it's still not incrementing. I found one website that says I need to set the PMCR.C bit, but that has the side effect of clearing the counter, which I don't want to do for fear of creating a race condition (where someone else is using the counter as I clear it). Any help or insights would be very welcome.
Each CPU has its own PMU instance; they are entirely independent (except for a small number of cluster-wide events, where each PMU keeps its own count).
See, for example, the first hit today for "debug memory map pmu", remembering that the v7 and v8 debug memory map standards are different. The actual answer is too 'obvious' to expect it to be explicitly documented.
0x010000 - 0x010FFF CPU 0 Debug
0x030000 - 0x030FFF CPU 0 PMU
0x110000 - 0x110FFF CPU 1 Debug
0x130000 - 0x130FFF CPU 1 PMU
0x210000 - 0x210FFF CPU 2 Debug
0x230000 - 0x230FFF CPU 2 PMU
0x310000 - 0x310FFF CPU 3 Debug
0x330000 - 0x330FFF CPU 3 PMU
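On the non-incrementing PMCCNTR: in ARMv7 the cycle counter has its own enable bit in PMCNTENSET (bit 31) in addition to the global PMCR.E, and setting PMCNTENSET does not clear the count the way PMCR.C does. A sketch of the per-core, kernel-mode enable (CP15 encodings from the ARMv7-A ARM):

#include <stdint.h>

static inline void pmu_enable_ccnt(void)
{
    uint32_t pmcr;
    asm volatile ("mrc p15, 0, %0, c9, c12, 0" : "=r" (pmcr));    /* PMCR */
    pmcr |= 1u;                                                   /* E: global enable */
    asm volatile ("mcr p15, 0, %0, c9, c12, 0" :: "r" (pmcr));
    /* enable the cycle counter itself, without resetting it */
    asm volatile ("mcr p15, 0, %0, c9, c12, 1" :: "r" (1u << 31)); /* PMCNTENSET */
}

static inline uint32_t pmu_read_ccnt(void)
{
    uint32_t ccnt;
    asm volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r" (ccnt));    /* PMCCNTR */
    return ccnt;
}

Because each core's PMU (and thus its PMCCNTR) is independent, readings taken on different cores are not comparable; pin the thread to one CPU before differencing two reads.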

how to single-step code on-target with no jtag, breakpoints, simulator, emulator

Let's say you have a pointer to a function whose source you do not have and which is "untrusted" because it might read/write a disallowed memory region.
Before it executes each assembly instruction, you want to verify that it doesn't access disallowed memory regions.
The OS is (almost) bare metal, i.e. a custom RTOS (so no Linux or QNX).
This is for functionality that needs to be enabled not only during development but also during normal runtime.
Ideally, it'd run something like this:
void (*fptr)(int);
fptr = &someFunction; // untrusted, don't have source

// enable interrupts for each assembly instruction
_EN_INT();

// call the function
fptr();

// every time the PC increments, some other code runs which verifies that
// if any loads/stores are executed, they don't access a disallowed memory range

// disable interrupts for each assembly instruction
_DIS_INT();
QUESTION
Is it possible to call that function and pause execution after every assembly instruction?
The OS is (almost) bare metal, i.e. a custom RTOS (so no Linux or QNX).
My answer assumes that you can modify the "OS" the way you need it...
Cortex MK20DX256VLH7
This seems to be a Cortex-M4 CPU.
how to single-step code on-target with no jtag, breakpoints
From the doc, it doesn't say whether you NEED an external debugger to resume execution.
If the CPU is really stopped, you'll definitely need an external signal (e.g. from a debugger).
However, most CPUs support software debugging. This means that an interrupt service routine is executed whenever a breakpoint is hit. To continue execution you simply return from the interrupt service routine.
I don't know about the Cortex-M4, but for the Cortex-M3 you'll have to set some special registers to enable that feature. Whenever a BKPT instruction is hit, exception #12 (*) is executed.
For code in RAM you simply write a BKPT instruction (0xBExx, e.g. 0xBEBE) to the address where you want to set your breakpoint. (Before writing, you read out the old value so you can restore it later on.)
For code in flash, the M3 has a Flash Patch and Breakpoint unit (FPB) which allows you to specify a handful of addresses (the exact number of comparators is readable from FP_CTRL) which shall be read out as 0xBExx even if other data is stored there. This allows you to set a few breakpoints in flash.
Interesting for you: the register controlling the debug features in the M3 (named DEMCR) also has a bit named MON_STEP:
If you set this bit in exception handler #12, then exactly one instruction is executed after returning from the handler, and exception #12 is triggered again. The use case for this feature is, of course, single-stepping code!
To stop single-stepping you'll have to clear the MON_STEP bit again...
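A sketch of what that machinery looks like on an M3/M4 (DEMCR address and bit positions from the ARMv7-M ARM; the handler name is the CMSIS default):

#include <stdint.h>

#define DEMCR  (*(volatile uint32_t *)0xE000EDFCu)

#define DEMCR_MON_EN    (1u << 16)   /* route debug events to the DebugMonitor exception */
#define DEMCR_MON_STEP  (1u << 18)   /* single-step on exception return */

void stepping_enable(void)
{
    DEMCR |= DEMCR_MON_EN | DEMCR_MON_STEP;
}

void stepping_disable(void)
{
    DEMCR &= ~DEMCR_MON_STEP;
}

/* With MON_STEP set, this runs after every instruction; inspect the
 * stacked PC here to vet loads/stores before returning. */
void DebugMon_Handler(void)
{
    /* ... verify the next instruction, call stepping_disable() when done ... */
}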
Important 1:
I don't know if the MK20DX256VLH7 really has all these features. However, because it is a Cortex-M4 chip and the M4 should have nearly all features of the M3, these features should be present...
Important 2:
Implementing single-stepping and debugging is not done quickly. Assembly language knowledge will be very helpful and you'll need a lot of time...
From the doc, ...
You will not only need the documentation for the MK20DX256VLH7 from NXP, but you'll also need the Cortex-M4 documentation from ARM.
(*) Offset 4*12 in the vector table is meant here (which is named "IRQ(-4)" in some ARM documents); not IRQ12.
Yes, the ARM emulator/interpreter sounds exactly like what I want. Is there a free one?
qemu is open-source, most of it is GPLv2. https://wiki.qemu.org/License. You'd probably need to modify it a lot, because it's designed for use as a stand-alone wrapper for a whole Unix process (qemu-user) or whole machine (qemu-system).
I googled, and there's also http://www.unicorn-engine.org/ which is designed to be used as part of a larger program (written in C with bindings for calling from various languages). It's also GPLv2 (not LGPL), so you can use it if the rest of your code is also Free software.
It's actually based on the CPU-emulation code from QEMU; they stripped out all the device / BIOS emulation stuff to make a flexible library for just emulating CPUs.
Presumably you could configure some memory protections for it and set up the starting machine state, and let it run your function (with a return address that leads to some code that hands control back to your main code?)
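A sketch of that Unicorn route, with a per-instruction hook standing in for the "pause after every assembly instruction" requirement; the addresses and sizes here are invented for illustration, and the code region is mapped without write permission so stray stores fault:

#include <unicorn/unicorn.h>

/* called before each instruction; vet the instruction at 'addr' here,
 * or call uc_emu_stop(uc) to abort the run */
static void on_insn(uc_engine *uc, uint64_t addr, uint32_t size, void *user)
{
    (void)uc; (void)addr; (void)size; (void)user;
}

int run_sandboxed(const uint8_t *code, size_t len)   /* assumes len <= 0x1000 */
{
    uc_engine *uc;
    uc_hook hook;

    if (uc_open(UC_ARCH_ARM, UC_MODE_THUMB, &uc) != UC_ERR_OK)
        return -1;

    uc_mem_map(uc, 0x10000, 0x1000, UC_PROT_READ | UC_PROT_EXEC);
    uc_mem_write(uc, 0x10000, code, len);
    uc_hook_add(uc, &hook, UC_HOOK_CODE, on_insn, NULL, 1, 0);  /* begin>end: hook all code */

    uc_err err = uc_emu_start(uc, 0x10000 | 1, 0x10000 + len, 0, 0); /* |1 = Thumb */
    uc_close(uc);
    return err == UC_ERR_OK ? 0 : -1;
}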
