PIC Assembly Language - decfsz loop - loops

I am working with a PIC 18F microcontroller from Microchip to continuously generate a rectangular signal. The code for the signal itself is at label5.
I need to generate 255*20 pulses of this signal, so I need to repeat the four instructions at label5 255*20 times. Because an 8-bit register cannot hold a value above 255 (2^8 - 1), I wrote the count as a product of two 8-bit values.
label5 BSF portd,5
call timer1
BCF portd,5
call timer2
In the code below I tried to achieve this behavior. I load variable1 with 255 and decrement it until it reaches zero, at which point execution goes back to label2 and the program restarts. On every pass through that loop I call label4. Something similar happens at label4: a second variable, variable2, is decremented until it reaches zero, with the signal-generation code repeated on each decrement, and then the routine returns.
Can someone please tell me if I am on the right track?
label2 movlw .255
movwf variable1
label3 call label4
decfsz variable1,1
goto label3
goto label2
; """"""""""""""
label4 movlw .20
movwf variable2
label5 BSF portd,5
call timer1
BCF portd,5
call timer2
decfsz variable2,1
goto label5
return
end

The general recommendation is to use timers to burn time, and some would argue for interrupts so that the chip can sit in a lower-power mode while it waits. But on processors like the PIC18, where you can count instructions and determine execution time very accurately from that, simple loops are a reasonable way to burn time.
There are two ways to make a loop take longer. I am very rusty on my PIC coding, so consider this pseudo-code:
variable2 = 0
label:
decfsz variable2,1
goto label
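Starting from 0, the first decrement wraps to 255, so that is essentially 256 loops, yes? And you can count instruction cycles exactly, including the extra cycles decfsz takes on the final pass, when the result is zero and it skips the goto.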
variable2 = 0
label:
nop
nop
decfsz variable2,1
goto label
Adding nops burns more time per pass (I may still not understand whether it is time you want to burn or simply more loops).
Or, if you want more loops and only have 8 bits to count with, nest the loops:
variable1 = 20
variable2 = 0
outer:
inner:
; other stuff goes here?
decfsz variable2,1
goto inner
decfsz variable1,1
goto outer
The inner loop runs 256 times per pass and the outer loop runs 20 times, so you get 20*256 = 5120 total loops.
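The loop count can be checked on a host with a small C model of decfsz. This is only a sketch of the pseudo-code above, not PIC code; the function names are made up for illustration, and the inner counter is deliberately not reloaded between outer passes, as in the pseudo-code.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the PIC18 `decfsz f,1` instruction: decrement the 8-bit file
 * register in place and report whether the result is zero (in which case
 * the hardware skips the following `goto`). */
static int decfsz(uint8_t *f)
{
    assert(f != 0);
    *f = (uint8_t)(*f - 1);   /* 0 wraps to 255, as on the PIC */
    return *f == 0;
}

/* Count how many times the body of the nested loop runs for the given
 * initial values of the outer (variable1) and inner (variable2) counters. */
unsigned long nested_loop_iterations(uint8_t outer_init, uint8_t inner_init)
{
    uint8_t outer = outer_init;
    uint8_t inner = inner_init;
    unsigned long body_runs = 0;

    do {
        do {
            body_runs++;            /* "other stuff goes here" */
        } while (!decfsz(&inner));  /* goto inner unless result is zero */
    } while (!decfsz(&outer));      /* goto outer unless result is zero */

    return body_runs;
}
```

Calling nested_loop_iterations(20, 0) models variable1 = 20 and variable2 = 0. Because the inner counter is left at zero after each pass, every pass after the first also runs 256 times.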
I have used this type of approach to make very accurate signals that could not be made with a timer on this processor; doing the same thing with a timer, if possible at all, would have needed a faster processor with a much more efficient instruction set. The alternative is to buy a part whose timer peripheral does what you are trying to do, or a portion of it. For an infrared remote, for example, some ST products take two timer outputs and have the AND gate in the chip, so you get a hardware-generated carrier and a hardware-generated gate, with only the gate duration set by software. With the PIC I just had some small loops to do the same thing, and it was all timed by counting instructions.
I would not use this approach on a Cortex-M. Maybe on an MSP430, maybe on an AVR, but not on something pipelined, and not on a core that the chip vendor purchased as IP. ARM doesn't make chips; ST, NXP and others do, and they purchase IP from ARM. Most of the rest of the chip is not ARM IP, and each vendor can tweak the IP when they get it, so the same core (Cortex-M0+ rev x.y, for example) does not necessarily behave the same in different chips.

Another way would be to use a 16-bit loop counter that has a value of 255*20.
Something like this:
;
;
;
TIMER1_CODE code
timer1:
return
;
;
;
TIMER2_CODE code
timer2:
return
;
; main application
;
MAIN_CODE code
main:
bcf TRISD,5 ; make RD5 an output
ProcessLoop:
movlw D'255' ; Compute loop count
movwf PRODL
movlw D'20'
mulwf PRODL ; PRODH:PRODL = 255*20 = 5100
OutBitLoop:
movlw 0xFF ; Decrement loop count
addwf PRODL,F
addwfc PRODH,F
bnc Stop ; Stop when done enough loops
bsf LATD,5 ; Set output bit high
call timer1
BCF LATD,5 ; Set output bit low
call timer2
bra OutBitLoop
bra ProcessLoop
Stop:
bra Stop
end
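Note that the code you posted uses the PORTD register to set or clear an output bit with an opcode that does a read-modify-write. This is a bad choice: the read samples the actual pin levels, so a pin that has not yet reached its driven level can be read back wrong and then written back into the latch.
On the PIC18F, always use the output latch register (LATD) when changing the state of output bits.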
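The 16-bit decrement in OutBitLoop (add 0xFF to PRODL, then add 0xFF plus carry to PRODH) can be modelled in C to confirm how many pulses it produces. This is a host-side sketch only; dec16_carry and pulses_generated are illustrative names, not part of any PIC toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Model of:
 *   movlw  0xFF
 *   addwf  PRODL,F     ; PRODL += 0xFF, carry out goes to C
 *   addwfc PRODH,F     ; PRODH += 0xFF + C, carry out goes to C
 * Returns the final carry flag, which is clear only when the 16-bit
 * value was 0 before the decrement. */
static int dec16_carry(uint8_t *prodh, uint8_t *prodl)
{
    assert(prodh != 0 && prodl != 0);
    unsigned sum = (unsigned)*prodl + 0xFFu;
    int carry = sum > 0xFFu;
    *prodl = (uint8_t)sum;

    sum = (unsigned)*prodh + 0xFFu + (unsigned)carry;
    carry = sum > 0xFFu;
    *prodh = (uint8_t)sum;
    return carry;
}

/* Number of times the bsf/bcf pulse body runs when the loop count is
 * the product of two 8-bit values (the mulwf step). */
unsigned long pulses_generated(uint8_t a, uint8_t b)
{
    unsigned product = (unsigned)a * (unsigned)b;   /* PRODH:PRODL = a*b */
    uint8_t prodh = (uint8_t)(product >> 8);
    uint8_t prodl = (uint8_t)(product & 0xFFu);
    unsigned long pulses = 0;

    while (dec16_carry(&prodh, &prodl))  /* `bnc Stop` leaves the loop */
        pulses++;                        /* one high/low pulse on RD5 */

    return pulses;
}
```

With a = 255 and b = 20 the model runs the pulse body 5100 times, because the carry stays set for every decrement except the one that takes the counter below zero.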

Related

SPARC LEON error: IU exception (tt = 0x2B, data store error)

Good morning, I need some help because I'm stuck and cannot find a solution in the manuals.
I want to use EDAC on a LEON3. I'm programming in C using the BCC compiler, on a GR-UT699 board, and I use GRMON to load my ELF file into RAM. My program is a short test in which I want to use the EDAC. To enable the EDAC I simply set the register bits this way (I checked the registers and they are written correctly):
#define MCFG2_RMW_bit_set 0x00000040 //enable read-modify-write cycles on sub-word writes to 16 and 32bit areas with common write strobe
#define MCFG2_DE_bit_set 0x00004000 //SDRAM controller (1 en, 0 dis)
#define MCFG3_R_bit_set 0x00000200 //enable EDAC checking of the SDRAM or SRAM (1 en, 0 dis)
#define MCFG1_IE_bit_set 0x00080000 //enable access to mapped I/O memory.
...
edac->MCFG1 = edac->MCFG1 | MCFG1_IE_bit_set;
edac->MCFG2 = edac->MCFG2 | MCFG2_RMW_bit_set | MCFG2_DE_bit_set;
edac->MCFG3 = edac->MCFG3 | MCFG3_R_bit_set;
...
return 0;
}
These instructions are executed inside an init function which returns 0. I only set the bits shown in the defines above.
When the function returns, I just want to call printf() to show a message. The printf output is never shown, so the program crashes after the registers are set and before the printf. I think it crashes during the return from the init function.
This is the GRMON console output:
grmon2> run
IU exception (tt = 0x2B, data store error)
0x40009acc: 81c3e008 retl <memmove+484>
grmon2> inst
TIME ADDRESS INSTRUCTION RESULT SYMBOL
2608062 40009978 andcc %g1, %g3, %g0 [00000000] memmove+0x90
2608065 4000997C be 0x40009AB0 [00000000] memmove+0x94
2608066 40009980 or %g2, %o1, %g1 [40013FA0] memmove+0x98
2608067 40009AB0 mov 0, %g1 [00000000] memmove+0x1c8
2608068 40009AB4 ldub [%o1 + %g1], %g3 [0000002E] memmove+0x1cc
2608070 40009AB8 stb %g3, [%g2 + %g1] [40012EA0 2E2E2E2E] memmove+0x1d0
2608072 40009ABC add %g1, 1, %g1 [00000001] memmove+0x1d4
2608073 40009AC0 cmp %g1, %o2 [00000000] memmove+0x1d8
2608076 40009AC4 bne,a 0x40009AB8 [00000000] memmove+0x1dc
2608078 40009ACC retl [ TRAP ] memmove+0x1e4
I saw that I needed to set the IE bit in the MCFG1 reg, and so I did. But the program still crashes. What is wrong here?
thanks in advance for your patience.
-Lorenzo
I found at least one solution that does not crash the program.
If you want to use EDAC, you first have to initialize the memory controller registers (from GRMON with "mcfgx 0xvalue etc", or by passing the -edac option when starting GRMON).
Then the RAM must be washed (with the wash command in GRMON).
It is important to run the wash command (or, more generally, to wash the memory from firmware) after the EDAC has been enabled. Only a wash performed with EDAC enabled generates the checkbits; otherwise it is just a plain memory clear.
Then you can finally load a program into RAM (from GRMON with "load").
Note that the IU/FPU registers must also be cleared at reset; this can be done from MKPROM if necessary.
This solution works for programs that are loaded into RAM through GRMON.
If the program has to be stored in flash ROM, a similar procedure has to be carried out by means of MKPROM. I have not done this yet, but I expect it to be very similar.
Lorenzo.
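For the "wash the memory from firmware" case, a minimal sketch might look like the following. It is an assumption-laden illustration, not GRMON's implementation: edac_wash is a hypothetical name, the caller is assumed to have enabled EDAC already, and the region must not overlap the code or stack that is currently in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write a full 32-bit word to every location in [base, base + words).
 * Only whole-word stores are issued and nothing is read back; the intent
 * is that no access relies on checkbits that have not been generated yet. */
void edac_wash(volatile uint32_t *base, size_t words, uint32_t fill)
{
    assert(base != NULL);
    for (size_t i = 0; i < words; i++)
        base[i] = fill;   /* with EDAC enabled, the store also writes fresh checkbits */
}
```

On a host the function simply fills a buffer, which is enough to check the loop bounds; the EDAC side effect only exists on the real memory controller.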

Best way to add delay/do nothing for n cpu cycles

I need to add a delay into my code of n CPU cycles (~30).
My current solution is the one below, which works but isn't very elegant.
Also, the delay has to be known at compile time. I can work with this, but it would be ideal if I could change the delay at runtime.
(It is OK if there is some overhead, but I need the 1 cycle resolution.)
I do not have any peripheral timers left, that I could use, so it needs to be a software solution.
do_something();
#define NUMBER_OF_NOPS (SOME_DELAY + 3)
#include "nops.h"
#undef NUMBER_OF_NOPS
do_the_next_thing();
nops.h:
#if NUMBER_OF_NOPS > 0
__ASM volatile ("nop");
#endif
#if NUMBER_OF_NOPS > 1
__ASM volatile ("nop");
#endif
#if NUMBER_OF_NOPS > 2
__ASM volatile ("nop");
#endif
...
On Cortex devices NOP literally means nothing. There is no guarantee that a NOP will consume any time; NOPs are intended for padding only. If you have several consecutive NOPs, they may simply be dropped from the pipeline.
For more information refer to the Cortex-M0 documentation. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/CHDJJGFB.html
Software delays are quite tricky on Cortex devices, and you should use other instructions, possibly together with barrier instructions, instead.
For example, use ISB instructions: each takes 4 clocks plus the flash access time, which depends on the speed the core is running at. For very precise delays, place this part of the code in SRAM.
Edit: There is a better answer from another SO Q&A here. However it is in assembly, AFAIK using a counter like SysTick is the only way to guarantee any semblance of cycle accuracy.
Edit 2: To avoid a counter overflow, which would result in a very, very long delay, clear the SysTick counter before use, ie. SysTick->VAL = 0;
Original:
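Cortex-Ms have a built-in timer called SysTick which can be used for cycle-accurate timing purposes. It is a 24-bit counter that counts down from the value in LOAD to zero and then reloads.
First set the reload value and enable the timer:
SysTick->LOAD = 0x00FFFFFF; /* maximum 24-bit reload value */
SysTick->VAL = 0; /* any write clears the current count */
SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk |
SysTick_CTRL_ENABLE_Msk;
Then you can read the current count using the VAL register. Because the counter counts down, the elapsed time is the start value minus the current value, masked to 24 bits so that a wrap through zero is handled. You can then implement a cycle-accurate delay this way:
uint32_t start = SysTick->VAL;
while (((start - SysTick->VAL) & 0x00FFFFFF) < 30);
Note that this will introduce some overhead because of the load, compare and branch in the loop, so the final cycle count will be a little off, by no more than a few ticks in my estimation.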
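SysTick is a 24-bit down-counter, so the elapsed tick count is the start value minus the current value, reduced modulo 2^24. A host-side sketch of that arithmetic follows, assuming the reload value is 0xFFFFFF so that the counter period is exactly 2^24; systick_elapsed is an illustrative helper, not a CMSIS function.

```c
#include <assert.h>
#include <stdint.h>

#define SYSTICK_MASK 0x00FFFFFFu   /* SysTick VAL is 24 bits wide */

/* Ticks elapsed between two reads of a 24-bit down-counter that reloads
 * with 0xFFFFFF. Valid as long as fewer than 2^24 ticks have passed. */
uint32_t systick_elapsed(uint32_t start, uint32_t now)
{
    assert((start & ~SYSTICK_MASK) == 0u && (now & ~SYSTICK_MASK) == 0u);
    return (start - now) & SYSTICK_MASK;
}
```

For example, a start value of 100 and a current value of 70 gives 30 ticks, and the mask keeps the result correct when the counter has passed through zero between the two reads.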
You can use a free-running up-counter as follows:
uint32_t t = <periph>.count;
while ((<periph>.count - t) < delay);
As long as delay is less than half the period of the counter, this is unaffected by wrapping of the counter value - the unsigned arithmetic produces the correct time delta.
Note that since you don't need to control the counter's value in any way, you can use any such counter in the system - even if it's being used for another purpose (as long, of course, as it really is running continuously and freely, and at a rate that gives you the timing resolution that you require).
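The wrap-around behaviour of the unsigned subtraction can be seen in a small host-side sketch. It assumes a 32-bit free-running up-counter; delay_elapsed is an illustrative name, and the two counter readings are passed in as plain values rather than read from a peripheral.

```c
#include <assert.h>
#include <stdint.h>

/* Has at least `delay` ticks passed since `start` on a free-running
 * 32-bit up-counter? The unsigned subtraction wraps modulo 2^32, so the
 * result is still the true delta after the counter overflows. */
int delay_elapsed(uint32_t start, uint32_t now, uint32_t delay)
{
    assert(delay <= UINT32_MAX / 2u);   /* "less than half the period" */
    return (uint32_t)(now - start) >= delay;
}
```

With start = 0xFFFFFFF0 and now = 0x0000000E the counter has wrapped, yet now - start evaluates to 30, so a 30-tick delay is reported as complete.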

MSP430 TI while loop duration

I am programming a simple program on the TI MSP430.
I have a counter set up in C:
while (P1IN & BIT1)
{
counter++;
}
So while the pin is high, it counts up by one on each pass. I am wondering how long one pass takes?
I need to do some calculations with counter and need the duration of one pass through the while loop. In other words, if counter = 1234 at the end, how can I convert that to seconds?
How can I get this? Should I export the ASM code and see how long each instruction takes? This seems tedious.
You can try:
1. Toggle any free port pin at the start and end of the loop and measure the duration on a CRO (oscilloscope), if you have the necessary equipment.
OR
2. Look at the disassembly listing (ASM code), read the instruction set manual, and calculate the loop time from the cycle counts and the CPU clock.
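Once the cycles per pass are known by either method, converting the counter to seconds is a single multiplication and division. A minimal sketch, where cycles_per_iteration and mclk_hz are assumptions you have to supply from your own disassembly and clock configuration:

```c
#include <assert.h>
#include <stdint.h>

/* Time spent in the loop, in seconds.
 *   counter              - final value of the loop counter
 *   cycles_per_iteration - CPU cycles for one pass (from the listing or a scope)
 *   mclk_hz              - CPU clock frequency in Hz */
double loop_time_seconds(uint32_t counter,
                         uint32_t cycles_per_iteration,
                         uint32_t mclk_hz)
{
    assert(mclk_hz > 0u);
    return (double)counter * (double)cycles_per_iteration / (double)mclk_hz;
}
```

The result is only as good as the cycle count: if interrupts fire during the loop, or the compiler is run with different optimization settings, the cycles per pass change and have to be determined again.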

Precise delays on Arduino using nop assembly?

I'm looking to make a very short pulse after a rising edge signal input.
The hard part here is that I would like to control (to high resolution) the timing of the delay before my pulse, and the duration of my pulse. I can easily control this by just stringing together nops by myself, hard coding delays, but I'm not sure how to do it for some arbitrary delay, with the same level of accuracy.
After a lot of headaches chasing down timers, and then eventually realizing I am ultimately limited by the interrupt routine entry/exit time, I am now settling at trying to control my delay via nops.
I had assumed this C switch statement would be what I wanted (after compiling, hoping it would become efficient and just change the program counter to the right spot), but it produces some very odd behavior...
switch(delayTime){
case 10:
__asm__ __volatile__("nop");
case 9:
__asm__ __volatile__("nop");
case 8:
__asm__ __volatile__("nop");
case 7:
__asm__ __volatile__("nop");
case 6:
__asm__ __volatile__("nop");
case 5:
__asm__ __volatile__("nop");
case 4:
__asm__ __volatile__("nop");
case 3:
__asm__ __volatile__("nop");
case 2:
__asm__ __volatile__("nop");
case 1:
__asm__ __volatile__("nop");
}
PORTD = 0x10;
...
Ideally, I would like to essentially run through some code that would compile into this: (it's some weird pseudocode of C and assembly, still not sure how to do some of it in assembly)
0x005 Reg1 = 0xFF-val1 %(where somehow 0xFF is known? / found out?)
0x006 Reg2 =0x1FF-val2
0x007 IJMP Reg1
0x008 NOP
0x009 NOP
0x00A NOP
...
0x0FF MOV 0x40, PORTD % assign the value 0x40 to the static variable "PORTD"
0x100 IJMP Reg2
0x101 NOP
0x102 NOP
0x103 NOP
0x104 NOP
...
0x1FF MOV 0x00, PORTD % assign the value 0x00 to the static variable "PORTD"
I'm just overall not sure how to find the memory location for the code after/during run time so that the "0xFF" and "0x1FF" aspects of this program are not really so bad (it seems like it's super dangerous to just, get the assembly of the code, and then hard code that in... I'd rather not do that). Also, while it's easy to just flood it with the 200+ nops, how to get the IJMP cmd to behave the way I want it to? (I honestly don't even know if that's the command I want)..
I guess in general I'm looking for some assembly command (that I can't seem to find) that allows me to "add N to Program Counter" and I can just make sure that that command is run in assembly with at least N+1 commands of assembly ahead of it, hardcoded in.
As a side note, all of this executes inside an interrupt routine, so I don't feel so bad about playing around with the PC. Also, I know it's kind of bad to block for up to 500 operations, but for the task at hand, timing is more important than how badly the routine blocks.
I'm not familiar with the AVR instruction set, but the general idea is to use the CALL instruction to put the program counter (PC) on the stack. Then use POP to move the PC to the Z register. Then you can ADD some number to the Z register, and use IJMP to jump to the resulting address.
So something along these lines
delay: call delay1 ; push the PC onto the stack
delay1: pop r31 ; pop the PC into the Z registers (high byte comes off first)
pop r30
add r30,r0 ; add some amount to the PC value
adc r31,r1 ; add with carry (the AVR mnemonic is adc)
ijmp ; use IJMP to jump to the resulting address
nop
nop
nop
...
Random thoughts:
On devices with more than 128KB of flash (the "8MB" machines with a 22-bit PC), call pushes three bytes, so you need a third pop to remove the third byte of the PC from the stack.
Z is only sixteen bits, therefore this code must be in the first 128KB of program memory.
The return address is pushed low byte first, so the high byte comes off the stack first: pop r31 before r30.
The value added to Z must be relative to delay1, since call pushes the address of delay1 onto the stack. In other words, the minimum amount that needs to be added is 5, since that is the number of instruction words from delay1 to the first nop (two pops, the add, the add with carry, and the ijmp).
The minimum delay is determined by the six instructions up to and including the ijmp. You should increase r1/r0 (reduce the number of nops) accordingly.
Like I said, I'm no expert on the AVR instruction set, so you should take this as a general suggestion, and be prepared to spend some time working out the particulars. Good luck!
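The relationship between the value added to the PC and the number of nops that actually run can be modelled on a host. This is a sketch of the idea only, not AVR code: SLED_LENGTH is a hypothetical number of nops placed after the ijmp, and both function names are made up for illustration.

```c
#include <assert.h>

#define SLED_LENGTH 10u   /* hypothetical number of NOPs after the ijmp */

/* Index of the NOP to jump to so that exactly `nops_wanted` NOPs
 * execute before control falls out of the end of the sled. A larger
 * offset means a later entry point and therefore a shorter delay. */
unsigned sled_entry_index(unsigned nops_wanted)
{
    assert(nops_wanted <= SLED_LENGTH);
    return SLED_LENGTH - nops_wanted;
}

/* Walk the sled from the given entry index and count the NOPs executed. */
unsigned nops_executed_from(unsigned entry_index)
{
    unsigned executed = 0;
    for (unsigned pc = entry_index; pc < SLED_LENGTH; pc++)
        executed++;   /* one NOP per program-counter step */
    return executed;
}
```

On the real device the entry index would be added on top of the fixed distance from delay1 to the first nop before the ijmp.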

ARM: Start/Wakeup/Bringup the other CPU cores/APs and pass execution start address?

I've been banging my head with this for the last 3-4 days and I can't find a DECENT explanatory documentation (from ARM or unofficial) to help me.
I've got an ODROID-XU board (big.LITTLE, 4 x Cortex-A15 + 4 x Cortex-A7) and I'm trying to understand a bit more about the ARM architecture. In my "experimenting" code I've now arrived at the stage where I want to WAKE UP THE OTHER CORES FROM THEIR WFI (wait-for-interrupt) state.
The missing information I'm still trying to find is:
1. When getting the base address of the memory-mapped GIC I understand that I need to read CBAR; But no piece of documentation explains how the bits in CBAR (the 2 PERIPHBASE values) should be arranged to get to the final GIC base address
2. When sending an SGI through the GICD_SGIR register, what interrupt ID between 0 and 15 should I choose? Does it matter?
3. When sending an SGI through the GICD_SGIR register, how can I tell the other cores WHERE TO START EXECUTION FROM?
4. How does the fact that my code is loaded by the U-BOOT bootloader affect this context?
The Cortex-A Series Programmer's Guide v3.0 (found here: link) states the following in section 22.5.2 (SMP boot in Linux, page 271):
While the primary core is booting, the secondary cores will be held in a standby state, using the
WFI instruction. It (the primary core) will provide a startup address to the secondary cores and wake them using an
Inter-Processor Interrupt(IPI), meaning an SGI signalled through the GIC
How does Linux do that? The documentation-S don't give any other details regarding "It will provide a startup address to the secondary cores".
My frustration is growing and I'd be very grateful for answers.
Thank you very much in advance!
EXTRA DETAILS
Documentation I use:
ARMv7-A&R Architecture Reference Manual
Cortex-A15 TRM (Technical Reference Manual)
Cortex-A15 MPCore TRM
Cortex-A Series Programmer's Guide v3.0
GICv2 Architecture Specification
What I've done by now:
UBOOT loads me at 0x40008000; I've set-up Translation Tables (TTBs), written TTBR0 and TTBCR accordingly and mapped 0x40008000 to 0x8000_0000 (2GB), so I also enabled the MMU
Set-up exception handlers of my own
I've got Printf functionality over the serial (UART2 on ODROID-XU)
All the above seems to work properly.
What I'm trying to do now:
Get the GIC base address => at the moment I read CBAR and I simply AND (&) its value with 0xFFFF8000 and use this as the GIC base address, although I'm almost sure this ain't right
Enable the GIC distributor (at offset 0x1000 from GIC base address?), by writting GICD_CTLR with the value 0x1
Construct an SGI with the following params: Group = 0, ID = 0, TargetListFilter = "All CPUs Except Me" and send it (write it) through the GICD_SGIR GIC register
Since I haven't passed any execution start address for the other cores, nothing happens after all this
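As a reference point for the CBAR and GICD_SGIR steps in the list above, here is a minimal C sketch of the bit manipulation. It assumes the Cortex-A15 CBAR layout (bits [31:15] hold PERIPHBASE[31:15], bits [14:8] are reserved, bits [7:0] hold PERIPHBASE[39:32], which is ignored here), the Cortex-A15 GIC memory map (distributor at offset 0x1000, CPU interface at 0x2000), and the GICv2 GICD_SGIR layout (TargetListFilter in bits [25:24], CPUTargetList in bits [23:16], SGI ID in bits [3:0]). All function and macro names are illustrative; check the values against the TRM and GIC specification for your part.

```c
#include <assert.h>
#include <stdint.h>

#define GICD_OFFSET      0x1000u   /* distributor, relative to PERIPHBASE */
#define GICC_OFFSET      0x2000u   /* CPU interface, relative to PERIPHBASE */
#define GICD_SGIR_OFFSET 0x0F00u   /* GICD_SGIR, relative to the distributor */

/* Low 32 bits of PERIPHBASE: keep CBAR[31:15], drop the reserved and
 * upper-address bits. */
uint32_t periphbase_from_cbar(uint32_t cbar)
{
    return cbar & 0xFFFF8000u;
}

uint32_t gicd_base(uint32_t cbar)
{
    return periphbase_from_cbar(cbar) + GICD_OFFSET;
}

enum sgi_filter {
    SGI_TARGET_LIST = 0,   /* use the CPU target list */
    SGI_ALL_OTHERS  = 1,   /* all CPUs except the requester */
    SGI_SELF        = 2    /* only the requesting CPU */
};

/* Value to write to GICD_SGIR for a Group 0 SGI. */
uint32_t gicd_sgir_value(enum sgi_filter filter,
                         uint8_t cpu_target_list,
                         unsigned sgi_id)
{
    assert(sgi_id <= 15u);
    return ((uint32_t)filter << 24)
         | ((uint32_t)cpu_target_list << 16)
         | (uint32_t)sgi_id;
}
```

Under these assumptions, masking CBAR with 0xFFFF8000 does give the low 32 bits of the GIC base, and an "all CPUs except me" SGI with ID 0 is the value 0x01000000.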
....UPDATE....
I've started looking at the Linux kernel and QEMU source codes in search for an answer. Here's what I found out (please correct me if I'm wrong):
When powering up the board ALL THE CORES start executing from the reset vector
A software (firmware) component executes WFI on the secondary cores and some other code that will act as a protocol between these secondary cores and the primary core, when the latter wants to wake them up again
For example, the protocol used on the EnergyCore ECX-1000 (Highbank) board is as follows:
**(1)** the secondary cores enter WFI and when
**(2)** the primary core sends an SGI to wake them up
**(3)** they check if the value at address (0x40 + 0x10 * coreid) is non-null;
**(4)** if it is non-null, they use it as an address to jump to (execute a BX)
**(5)** otherwise, they re-enter standby state, by re-executing WFI
**(6)** So, if I had an EnergyCore ECX-1000 board, I should write (0x40 + 0x10 * coreid) with the address I want each of the cores to jump to and send an SGI
Questions:
1. What is the software component that does this? Is it the BL1 binary I've written on the SD Card, or is it U-BOOT?
2. From what I understand, this software protocol differs from board to board. Is it so, or does it only depend on the underlying processor?
3. Where can I find information about this protocol for a pick-one ARM board? - can I find it on the official ARM website or on the board webpage?
Ok, I'm back baby. Here are the conclusions:
The software component that puts the CPUs to sleep is the bootloader (in my case U-Boot)
Linux somehow knows how the bootloader does this (hardcoded in the Linux kernel for each board) and knows how to wake them up again
For my ODROID-XU board the sources describing this process are UBOOT ODROID-v2012.07 and the linux kernel found here: LINUX ODROIDXU-3.4.y (it would have been better if I looked into kernel version from the branch odroid-3.12.y since the former doesn't start all of the 8 processors, just 4 of them but the latter does).
Anyway, here's the source code I've come up with, I'll post the relevant source files from the above source code trees that helped me writing this code afterwards:
typedef unsigned int DWORD;
typedef unsigned char BOOLEAN;
#define FAILURE (0)
#define SUCCESS (1)
#define NR_EXTRA_CPUS (3) // actually 7, but this kernel version can't wake them up all -> check kernel version 3.12 if you need this
// Hardcoded in the kernel and in U-Boot; here I've put the physical addresses for ease
// In my code (and in the linux kernel) these addresses are actually virtual
// (thus the 'VA' part in S5P_VA_...); note: mapped with memory type DEVICE
#define S5P_VA_CHIPID (0x10000000)
#define S5P_VA_SYSRAM_NS (0x02073000)
#define S5P_VA_PMU (0x10040000)
#define EXYNOS_SWRESET ((DWORD) S5P_VA_PMU + 0x0400)
// Other hardcoded values
#define EXYNOS5410_REV_1_0 (0x10)
#define EXYNOS_CORE_LOCAL_PWR_EN (0x3)
BOOLEAN BootAllSecondaryCPUs(void* CPUExecutionAddress){
// 1. Get bootBase (the address where we need to write the address where the woken CPUs will jump to)
// and powerBase (we also need to power up the cpus before waking them up (?))
DWORD bootBase, powerBase, powerOffset, clusterID;
asm volatile ("mrc p15, 0, %0, c0, c0, 5" : "=r" (clusterID));
clusterID = (clusterID >> 8);
powerOffset = 0;
if( (*(DWORD*)S5P_VA_CHIPID & 0xFF) < EXYNOS5410_REV_1_0 )
{
if( (clusterID & 0x1) == 0 ) powerOffset = 4;
}
else if( (clusterID & 0x1) != 0 ) powerOffset = 4;
bootBase = S5P_VA_SYSRAM_NS + 0x1C;
powerBase = (S5P_VA_PMU + 0x2000) + (powerOffset * 0x80);
// 2. Power up each CPU, write bootBase and send a SEV (they are in WFE [wait-for-event] standby state)
for (DWORD i = 1; i <= NR_EXTRA_CPUS; i++) // i is the index of the CPU being woken
{
// 2.1 Power up this CPU
powerBase += 0x80;
DWORD powerStatus = *(DWORD*)( (DWORD) powerBase + 0x4);
if ((powerStatus & EXYNOS_CORE_LOCAL_PWR_EN) == 0)
{
*(DWORD*) powerBase = EXYNOS_CORE_LOCAL_PWR_EN;
for (DWORD retry = 0; retry < 10; retry++) // 10 millis timeout; separate counter so the CPU index i is preserved
{
powerStatus = *(DWORD*)((DWORD) powerBase + 0x4);
if ((powerStatus & EXYNOS_CORE_LOCAL_PWR_EN) == EXYNOS_CORE_LOCAL_PWR_EN)
break;
DelayMilliseconds(1); // not implemented here, if you need this, post a comment request
}
if ((powerStatus & EXYNOS_CORE_LOCAL_PWR_EN) != EXYNOS_CORE_LOCAL_PWR_EN)
return FAILURE;
}
if ( (clusterID & 0x0F) != 0 )
{
if ( *(DWORD*)(S5P_VA_PMU + 0x0908) == 0 )
do { DelayMicroseconds(10); } // not implemented here, if you need this, post a comment request
while (*(DWORD*)(S5P_VA_PMU + 0x0908) == 0);
*(DWORD*) EXYNOS_SWRESET = (DWORD)(((1 << 20) | (1 << 8)) << i);
}
// 2.2 Write bootBase and execute a SEV to finally wake up the CPUs
asm volatile ("dmb" : : : "memory");
*(DWORD*) bootBase = (DWORD) CPUExecutionAddress;
asm volatile ("isb");
asm volatile ("\n dsb\n sev\n nop\n");
}
return SUCCESS;
}
This successfully wakes 3 of 7 of the secondary CPUs.
And now for that short list of relevant source files in u-boot and the linux kernel:
UBOOT: lowlevel_init.S - notice lines 363-369, how the secondary CPUs wait in a WFE for the value at _hotplug_addr to be non-zeroed and to jump to it; _hotplug_addr is actually bootBase in the above code; also lines 282-285 tell us that _hotplug_addr is to be relocated at CONFIG_PHY_IRAM_NS_BASE + _hotplug_addr - nscode_base (_hotplug_addr - nscode_base is 0x1C and CONFIG_PHY_IRAM_NS_BASE is 0x02073000, thus the above hardcodings in the linux kernel)
LINUX KERNEL: generic - smp.c (look at function __cpu_up), platform specific (odroid-xu): platsmp.c (function boot_secondary, called by generic __cpu_up; also look at platform_smp_prepare_cpus [at the bottom] => that's the function that actually sets the boot base and power base values)
For clarity and future reference, there's a subtle piece of information missing here thanks to the lack of proper documentation of the Exynos boot protocol (n.b. this question should really be marked "Exynos 5" rather than "Cortex-A15" - it's a SoC-specific thing and what ARM says is only a general recommendation). From cold boot, the secondary cores aren't in WFI, they're still powered off.
The simpler minimal solution (based on what Linux's hotplug does), which I worked out in the process of writing a boot shim to get a hypervisor running on the XU, takes two steps:
First write the entry point address to the Exynos holding pen (0x02073000 + 0x1c)
Then poke the power controller to switch on the relevant core(s): That way, they drop out of the secure boot path into the holding pen to find the entry point waiting for them, skipping the WFI loop and obviating the need to even touch the GIC at all.
Unless you're planning a full-on CPU hotplug implementation you can skip checking the cluster ID - if we're booting, we're on cluster 0 and nowhere else (the check for pre-production chips with backwards cluster registers should be unnecessary on the Odroid too - certainly was for me).
From my investigation, firing up the A7s is a little more involved. Judging from the Exynos big.LITTLE switcher driver, it seems you need to poke a separate set of power controller registers to enable cluster 1 first (and you may need to mess around with the CCI too, especially to have the MMUs and caches on) - I didn't get further since by that point it was more "having fun" than "doing real work"...
As an aside, Samsung's mainline patch for CPU hotplug on the 5410 makes the core power control stuff rather clearer than the mess in their downstream code, IMO.
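The two-step sequence can be sketched in C as a list of register writes. The addresses are taken from the code and text above (holding pen at 0x02073000 + 0x1C, PMU at 0x10040000, per-core power registers at PMU + 0x2000 + 0x80 * n, power-enable value 0x3) and assume cluster 0 on a rev 1.0 or later chip; plan_secondary_boot and struct reg_write are hypothetical names. The function only produces the ordered writes so the sequence can be inspected on a host; on the board each entry would be stored through a volatile pointer, with a barrier between the two steps.

```c
#include <assert.h>
#include <stdint.h>

#define EXYNOS_HOLDING_PEN       (0x02073000u + 0x1Cu)
#define EXYNOS_PMU_BASE          0x10040000u
#define EXYNOS_CPU_CONFIG(n)     (EXYNOS_PMU_BASE + 0x2000u + 0x80u * (n))
#define EXYNOS_CORE_LOCAL_PWR_EN 0x3u

struct reg_write {
    uint32_t addr;    /* physical register address */
    uint32_t value;   /* value to store there */
};

/* Fill `plan` with the two writes, in order, that bring up one
 * secondary core of cluster 0. */
void plan_secondary_boot(unsigned cpu, uint32_t entry_point,
                         struct reg_write plan[2])
{
    assert(cpu >= 1u && cpu <= 3u);

    /* Step 1: publish the entry point in the holding pen. */
    plan[0].addr  = EXYNOS_HOLDING_PEN;
    plan[0].value = entry_point;

    /* Step 2: switch the core on through the power controller. */
    plan[1].addr  = EXYNOS_CPU_CONFIG(cpu);
    plan[1].value = EXYNOS_CORE_LOCAL_PWR_EN;
}
```

The order matters: the entry point has to be visible before the core powers up and reaches the holding pen.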
QEMU uses PSCI
The ARM Power State Coordination Interface (PSCI) is documented at: https://developer.arm.com/docs/den0022/latest/arm-power-state-coordination-interface-platform-design-document and controls things such as powering on and off of cores.
TL;DR this is the aarch64 snippet to wake up CPU 1 on QEMU v3.0.0 ARMv8 aarch64:
/* PSCI function identifier: CPU_ON. */
ldr w0, =0xc4000003
/* Argument 1: target_cpu */
mov x1, 1
/* Argument 2: entry_point_address */
ldr x2, =cpu1_entry_address
/* Argument 3: context_id */
mov x3, 0
/* Unused hvc args: the Linux kernel zeroes them,
* but I don't think it is required.
*/
hvc 0
and for ARMv7:
ldr r0, =0x84000003
mov r1, #1
ldr r2, =cpu1_entry_address
mov r3, #0
hvc 0
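The two CPU_ON identifiers differ only in one bit. Under the SMC Calling Convention, bit 31 marks a fast call, bit 30 selects the 64-bit convention, bits [29:24] give the owning service (4 for standard secure services such as PSCI), and the low 16 bits are the function number. A small C sketch of that composition, with illustrative macro and function names:

```c
#include <assert.h>
#include <stdint.h>

#define SMCCC_FAST_CALL      (1u << 31)
#define SMCCC_64BIT          (1u << 30)
#define SMCCC_OWNER_STANDARD (4u << 24)   /* standard secure services (PSCI) */

#define PSCI_FN_CPU_OFF 2u
#define PSCI_FN_CPU_ON  3u

/* Build a PSCI function identifier for the 32-bit or 64-bit convention. */
uint32_t psci_function_id(unsigned fn, int use_64bit)
{
    assert(fn <= 0xFFFFu);
    uint32_t id = SMCCC_FAST_CALL | SMCCC_OWNER_STANDARD | (uint32_t)fn;
    if (use_64bit)
        id |= SMCCC_64BIT;
    return id;
}
```

psci_function_id(PSCI_FN_CPU_ON, 1) yields the 0xc4000003 used in the aarch64 snippet, and psci_function_id(PSCI_FN_CPU_ON, 0) yields the 0x84000003 used for ARMv7.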
A full runnable example with a spinlock is available on the ARM section of this answer: What does multicore assembly language look like?
The hvc instruction then gets handled by an EL2 handler, see also: the ARM section of: What are Ring 0 and Ring 3 in the context of operating systems?
Linux kernel
In Linux v4.19, those function identifiers are passed to the kernel through the device tree. QEMU, for example, auto-generates an entry of the form:
psci {
method = "hvc";
compatible = "arm,psci-0.2", "arm,psci";
cpu_on = <0xc4000003>;
migrate = <0xc4000005>;
cpu_suspend = <0xc4000001>;
cpu_off = <0x84000002>;
};
The hvc instruction is called from: https://github.com/torvalds/linux/blob/v4.19/drivers/firmware/psci.c#L178
static int psci_cpu_on(unsigned long cpuid, unsigned long entry_point)
which ends up going to: https://github.com/torvalds/linux/blob/v4.19/arch/arm64/kernel/smccc-call.S#L51
Go to www.arm.com and download the evaluation copy of the DS-5 development suite. Once it is installed, the examples include a startup_Cortex-A15MPCore directory. Look at startup.s.

Resources