ARM CPU Mode SVC Instruction - arm

This is the starting code of the bootloader for ARM, and it configures the CPU into SVC mode:
1) mrs r0, cpsr
2) bic r0, r0, #0x1F
3) orr r0, r0, #0xD3
4) msr cpsr, r0
My question is: why must we use the first instruction "mrs r0, cpsr"? I mean, can't we just use 2) and 3) to obtain 0xD3 and write it to cpsr directly? What exactly does 1) serve?

CPSR contains more state than just the CPU mode.
For example, it contains the State bit telling whether the CPU is executing in ARM or Thumb mode.
Writing to CPSR without preserving the other state would most likely put the CPU into an undefined state. Because of this you always do a read-modify-write.
In most of the documents from ARM the importance of keeping the state of the reserved bits for future compatibility is also stated:
"To maintain compatibility with future ARM processors, and as good practice, you are strongly advised to use a read-modify-write strategy when you change the CPSR."

Well, in fact instructions 2 and 3 manipulate bits 7, 6 and 4, 3, 2, 1, 0:
I is set (masking IRQs)
F is set (masking FIQs)
MODE is set to 0b10011
The remaining bits are unchanged, thanks to the read-modify-write sequence (which, by the way, answers your question about the usefulness of instruction 1).
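The following host-side C model is an added illustration of the two answers above; it is not code from the question or from ARM. It applies the exact bic/orr sequence of the question to a constructed CPSR value and compares the result with what a direct write of the constant 0xD3 would do. The field masks follow the architectural CPSR layout; the function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Constructed model of the ARM CPSR bit layout (ARMv5/ARMv6). Field names follow the
 * architecture; the helper functions are hypothetical and only run on a host. */
#define CPSR_MODE_MASK 0x1Fu
#define CPSR_T         (1u << 5)    /* Thumb state bit */
#define CPSR_F         (1u << 6)    /* FIQ disabled */
#define CPSR_I         (1u << 7)    /* IRQ disabled */
#define CPSR_NZCV      0xF0000000u  /* condition flags */
#define MODE_SVC       0x13u

/* Instructions 1-4 from the question: read, clear the mode field, set I|F|SVC, write back. */
uint32_t cpsr_to_svc_rmw(uint32_t cpsr)
{
    uint32_t r0 = cpsr;                    /* mrs r0, cpsr       */
    r0 &= ~CPSR_MODE_MASK;                 /* bic r0, r0, #0x1F  */
    r0 |= (CPSR_I | CPSR_F | MODE_SVC);    /* orr r0, r0, #0xD3  */
    return r0;                             /* msr cpsr, r0       */
}

/* What skipping instruction 1 and writing the constant would do. */
uint32_t cpsr_to_svc_direct(uint32_t cpsr)
{
    (void)cpsr;                            /* the old value is never consulted */
    return CPSR_I | CPSR_F | MODE_SVC;     /* 0xD3 */
}

/* Bits that changed although the sequence only meant to touch I, F and the mode. */
uint32_t collateral_changes(uint32_t before, uint32_t after)
{
    const uint32_t intended = CPSR_MODE_MASK | CPSR_I | CPSR_F;
    return (before ^ after) & ~intended;
}

int main(void)
{
    /* Constructed starting state: N and C flags set, Thumb state, User mode. */
    const uint32_t start  = 0xA0000030u;
    uint32_t rmw    = cpsr_to_svc_rmw(start);
    uint32_t direct = cpsr_to_svc_direct(start);

    printf("read-modify-write: %08x -> %08x, collateral changes %08x\n",
           (unsigned)start, (unsigned)rmw, (unsigned)collateral_changes(start, rmw));
    printf("direct write:      %08x -> %08x, collateral changes %08x\n",
           (unsigned)start, (unsigned)direct, (unsigned)collateral_changes(start, direct));
    if (collateral_changes(start, direct) & CPSR_T)
        printf("the direct write also cleared T: changing state through a CPSR write is not supported\n");
    if (collateral_changes(start, direct) & CPSR_NZCV)
        printf("the direct write also dropped the condition flags\n");
    return 0;
}
```

With the constructed value, the read-modify-write leaves every bit outside I, F and the mode untouched, while the direct write silently clears the Thumb bit and the condition flags, which is the "undefined state" the answer warns about.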

Related

Jump between Thumb and ARM

I am interested in the ARM and Thumb2 commands LDR and LDR.W (PC, =ADDR) for absolute jumping to a certain address.
For example, when I jump from ARM code to ARM code, the command LDR PC, =ADDR is performed.
But what happens in the other scenarios?
from ARM to Thumb2
from Thumb2 to Thumb2
from Thumb2 to ARM
When does +1 need to be added to the address? And why?
The rule is actually quite simple:
If bit 0 of the address is 0, the CPU will execute the code as ARM code after the next branch.
If bit 0 of the address is 1, the CPU will execute the code as Thumb after the next branch.
Of course, if there is a mismatch, the CPU will certainly get a fault (after executing random code), because it has no way to check whether the code is ARM or Thumb.
This is what explains the +1.
Note that depending on the compiler, and depending on the label used, bit 0 of the address may be set automatically by the compiler.
You just need to read the documentation.
The following instructions write a value to the PC, treating that value as an interworking address to branch
to, with low-order bits that determine the new instruction set state:
— BLX (register), BX , and BXJ
— LDR instructions with <Rt> equal to the PC
— POP and all forms of LDM except LDM (exception return), when the register list includes the PC
— in ARM state only, ADC , ADD , ADR , AND , ASR (immediate), BIC , EOR , LSL (immediate), LSR (immediate), MOV ,
MVN , ORR , ROR (immediate), RRX , RSB , RSC , SBC , and SUB instructions with <Rd> equal to the PC and without
flag-setting specified.
Since you mentioned Thumb2, that means ARMv6 or newer (did you say Thumb2 and generically mean Thumb?), and I believe the docs are telling us the above applies for ARMv6 and ARMv7.
Note that the bit is consumed by the instruction; the PC doesn't carry around a set LSB in Thumb mode, it is just used by the instruction to indicate a mode change.
Also note you should think in terms of OR 1, not PLUS 1. If you write your code correctly the toolchain will supply you with the correct address with the correct LSB. If you add a one to that address you will break the code; if you are paranoid or have not done it right you can OR a one into the address: if it is there already, no harm, and if it isn't, then it fixes the problem that prevented it from being there. I would never use a plus one, though, with respect to switching to Thumb mode.
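The bit-0 rule and the "OR 1, not PLUS 1" point from the two answers above, as an added host-side C example (not code from the question or from any toolchain). The symbol addresses are constructed; the helper names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    uint32_t target;   /* address the CPU actually fetches from */
    bool     thumb;    /* instruction set state after the branch */
} branch_dest;

/* What BX / LDR pc, =addr do with the value: bit 0 selects the state and is consumed,
 * so the PC itself never carries the bit. */
branch_dest decode_interwork(uint32_t addr)
{
    branch_dest d;
    d.thumb  = (addr & 1u) != 0;
    d.target = addr & ~1u;
    return d;
}

/* An ARM-state target must be word aligned; a Thumb target is halfword aligned by
 * construction once bit 0 has been consumed. */
bool valid_interwork(uint32_t addr)
{
    branch_dest d = decode_interwork(addr);
    return d.thumb ? true : ((d.target & 3u) == 0);
}

/* OR the bit in: idempotent, so it is safe whether or not the toolchain already set it. */
uint32_t make_interwork(uint32_t code_addr, bool thumb)
{
    return thumb ? (code_addr | 1u) : (code_addr & ~1u);
}

/* The "+1" habit: not idempotent. Applied to an address that already carries the Thumb
 * bit it yields an even value, i.e. an ARM-state branch into Thumb code. */
uint32_t plus_one_naive(uint32_t addr)
{
    return addr + 1u;
}

int main(void)
{
    const uint32_t thumb_func = 0x08001001u;  /* constructed: a Thumb symbol as the linker emits it */
    const uint32_t arm_func   = 0x08002000u;  /* constructed: an ARM symbol */
    uint32_t ored  = make_interwork(thumb_func, true);
    uint32_t added = plus_one_naive(thumb_func);

    printf("OR 1   -> %08x thumb=%d valid=%d\n", (unsigned)ored,
           decode_interwork(ored).thumb, valid_interwork(ored));
    printf("PLUS 1 -> %08x thumb=%d valid=%d\n", (unsigned)added,
           decode_interwork(added).thumb, valid_interwork(added));
    printf("ARM    -> %08x thumb=%d valid=%d\n", (unsigned)arm_func,
           decode_interwork(arm_func).thumb, valid_interwork(arm_func));
    return 0;
}
```

Running it shows the failure mode the answer describes: adding one to a symbol the toolchain already marked as Thumb produces 0x08001002, which decodes as an ARM-state branch to a non-word-aligned address, while ORing the bit leaves the correct value in place.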

Why is SP (apparently) stored on exception entry on Cortex-M3?

I am using a TI LM3S811 (an older Cortex-M3) with the SysTick interrupt set to trigger at 10Hz. This is the body of the ISR:
    void SysTick_Handler(void)
    {
        __asm__ volatile("sub r4, r4, #32\r\n");
    }
This produces the following assembly with -O0 and -fomit-frame-pointer with gcc-4.9.3. The STKALIGN bit is 0, so stacks are 4-byte aligned.
00000138 <SysTick_Handler>:
138: 4668 mov r0, sp
13a: f020 0107 bic.w r1, r0, #7
13e: 468d mov sp, r1
140: b401 push {r0}
142: f1ad 0420 sub.w r4, r4, #32
146: f85d 0b04 ldr.w r0, [sp], #4
14a: 4685 mov sp, r0
14c: 4770 bx lr
14e: bf00 nop
I don't understand what's going on with r0 in the listing above. Specifically:
1) It seems like we're clearing the lower 3 bits of SP and storing it on the stack. Is that to maintain 8-byte alignment? Or is it something else?
2) The exception exit procedure is equally confusing. From my limited understanding of ARM assembly, it does something like this:
SP = SP + 4; R0 = SP;
followed by storing it back to SP, which seems to undo the manipulations up to this stage.
3) Why is there a nop instruction after the unconditional branch (at 0x14E)?
The ARM Procedure Call Standard and C ABI expect an 8 byte (64 bit) alignment of the stack. As an interrupt might occur after pushing/popping a single word, it is not guaranteed that the stack is correctly aligned on interrupt entry.
The STKALIGN bit, if set (the default), makes the hardware align the stack automatically by conditionally pushing an extra (dummy) word onto the stack.
The interrupt attribute on a function, OTOH, tells gcc the stack might be misaligned, so it adds this pre-/postamble which enforces the alignment.
So both actually do the same; one in hardware, one in software. If you can live with a word-aligned stack only, you should remove the interrupt attribute from the function declarations and clear the STKALIGN bit.
Make sure such a "misaligned" stack is no problem (I would not expect any, as this is a pure 32 bit CPU). OTOH, you should leave it as-is, unless you really need to save that extra conditional(!) clock and word (very unlikely).
Warning: According to the ARM Architecture Reference Manual, setting STKALIGN == 0 is deprecated. Briefly: do not set this bit to 0!
Since you're using -O0, you should expect lots of redundant and useless code. The general way in which a compiler works is to generate code with the full generality of everything that might be used anywhere in the program, and then rely on the optimizer to get rid of things that are unneeded.
Yes, this is doing 8-byte alignment. It's also allocating a stack frame to hold local variables even though you have none.
The exit is the reverse, deallocating the stack frame.
The nop at the end is to maintain 4-byte alignment in the code, as you might want to link with non-Thumb code at some point.
If you enable optimization, it will eliminate the stack frame (as it's unneeded) and the code will become much simpler.

Cycles per instruction in delay loop on arm

I'm trying to understand some assembler generated for the stm32f103 chipset by arm-none-eabi-gcc, which seems to be running at exactly half the speed I expect. I'm not that familiar with assembler, but since everyone always says to read the asm if you want to understand what your compiler is doing, I am seeing how far I get. It's a simple function:
    void delay(volatile uint32_t num) {
        volatile uint32_t index = 0;
        for(index = (6000 * num); index != 0; index--) {}
    }
The clock speed is 72MHz and the above function gives me a 1ms delay, but I expect 0.5ms (since (6000*6)/72000000 = 0.0005).
The assembler is this:
delay:
    @ args = 0, pretend = 0, frame = 16
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    sub sp, sp, #16        @ stack pointer = stack pointer - 16
    movs r3, #0            @ move 0 into r3 and update condition flags
    str r0, [sp, #4]       @ store r0 at location stack pointer+4
    str r3, [sp, #12]      @ store r3 at location stack pointer+12
    ldr r3, [sp, #4]       @ load r3 with data at location stack pointer+4
    movw r2, #6000         @ move 6000 into r2 (make r2 6000)
    mul r3, r2, r3         @ r3 = r2 * r3
    str r3, [sp, #12]      @ store r3 at stack pointer+12
    ldr r3, [sp, #12]      @ load r3 with data at stack pointer+12
    cbz r3, .L1            @ compare and branch on zero
.L4:
    ldr r3, [sp, #12]      @ [2 cycles] load r3 with data at location stack pointer+12
    subs r3, r3, #1        @ [1 cycle]  subtract 1 from r3 with 'set APSR flag' if any conditions met
    str r3, [sp, #12]      @ [2 cycles] store r3 at location sp+12
    ldr r3, [sp, #12]      @ [2 cycles] load r3 with data at location sp+12
    cmp r3, #0             @ [1 cycle]  status = 0 - r3 (if r3 is 0, set status flag)
    bne .L4                @ [1 cycle]  branch to .L4 if not equal
.L1:
    add sp, sp, #16        @ add 16 back to the stack pointer
    @ sp needed
    bx lr
    .size delay, .-delay
    .align 2
    .global blink
    .thumb
    .thumb_func
    .type blink, %function
I've commented what I believe each instruction means from looking it up. So I believe the .L4 section is the loop of the delay function, which is 6 instructions long. I do realise that clock cycles are not always the same as instructions, but since there's such a large difference, and since this is a loop which I imagine is predicted and pipelined efficiently, I am wondering if there's a solid reason that I am seeing 2 clock cycles per instruction.
Background:
In the project I am working on I need to use 5 output pins to control a linear CCD, and the timing requirements are said to be fairly tight. Absolute frequency will not be maxed out (I will clock the pins slower than the CPU is capable of), but pin timings relative to each other are important. So rather than use interrupts, which are at the limit of my ability and might complicate relative timings, I am thinking of using loops to provide the short delays (around 100 ns) between pin voltage change events, or even coding the whole section in unrolled assembler since I have plenty of program storage space. There is a period when the pins are not changing during which I can run the ADC to sample the signal.
Although the odd behaviour I am asking about is not a show stopper, I would rather understand it before proceeding.
Edit: From the comment, the ARM tech ref gives instruction timings. I have added them to the assembly. But it's still only a total of 9 cycles rather than the 12 I expect. Is the jump a cycle itself?
TIA, Pete
I think I have to give this one to ElderBug, although Dwelch raised some points which might also be very relevant, so thanks to all. Going from this I will try using unrolled assembly to toggle the pins which are 20ns apart in their changes, then return to C for the longer waits and ADC conversion, then back to assembly to repeat the process, keeping an eye on the assembly output from gcc to get a rough idea of whether my timings look OK. BTW Elder, the modified wait_cycles function does work as expected, as you said. Thanks again.
First, doing a spin-wait loop in C is a bad idea. Here I can see that you compiled with -O0 (no optimizations), and your wait will be much shorter if you enable optimizations (EDIT: Actually, maybe the unoptimized code you posted just results from the volatile, but it doesn't really matter). C wait loops are not reliable. I maintained a program that relied on a function like that, and each time we had to change a compiler flag the timings were messed up (fortunately, there was a buzzer that went out of tune as a result, reminding us to change the wait loop).
About why you don't see 1 instruction per cycle: it is because some instructions don't take 1 cycle. For example, bne can take additional cycles if the branch is taken. The problem is that you can have less deterministic factors, like bus usage. Accessing the RAM means using the bus, which can be busy fetching data from ROM or in use by a DMA. This means instructions like STR and LDR may be delayed. In your example, you have a STR followed by a LDR on the same location (typical of -O0); if the MCU doesn't have store-to-load forwarding, you can have a delay.
What I do for timings is use a hardware timer for delays above 1µs, and a hard-coded assembly loop for the really short delays.
For the hardware timer, you just have to set up a timer at a fixed frequency (with period < 1µs if you want the delay accurate to 1µs), and use some simple code like this:
    /* GET_TIMER() reads a free-running counter ticking at TIMER_FREQ Hz (platform-specific) */
    void wait_us( uint32_t us ) {
        uint32_t mark = GET_TIMER();
        us *= TIMER_FREQ/1000000;
        while( us > GET_TIMER() - mark );   /* unsigned difference: correct across counter wrap */
    }
You can even use mark as a parameter, to set it before some task and use the function to wait for the remaining time after. Example:
    uint32_t mark = GET_TIMER();
    some_task();
    wait_us( mark, 200 );   /* two-argument variant of wait_us() that takes the start mark */
For the assembly wait, I use this one for ARM Cortex-M4 (close to yours):
    #define CYCLES_PER_LOOP 3
    inline void wait_cycles( uint32_t n ) {
        uint32_t l = n/CYCLES_PER_LOOP;
        /* SUBS writes the flags, so "cc" is declared as clobbered */
        asm volatile( "0: SUBS %[count], #1\n\t" "BNE 0b\n" : [count]"+r"(l) : : "cc" );
    }
This is very short, precise, and won't be affected by compiler flags or bus load.
You may have to tune CYCLES_PER_LOOP, but I think it will be the same value for your MCU (here it is 1+2 for SUBS+BNE).
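An added host-side C example (not ElderBug's code, and not tied to any MCU) of the timer-based wait from the answer above, with GET_TIMER replaced by a constructed free-running counter so the loop can be exercised without hardware. It shows the one edge case that matters for a 32-bit counter: the answer's "GET_TIMER() - mark" form stays correct when the counter wraps, whereas a comparison against an absolute deadline does not. TIMER_FREQ, the counter and the helper names are all constructed.

```c
#include <stdint.h>
#include <stdio.h>

/* Constructed stand-in for the hardware counter: a free-running 32-bit counter at
 * TIMER_FREQ that advances one tick per read. Nothing here touches real hardware. */
#define TIMER_FREQ 1000000u
static uint32_t fake_counter;
static uint32_t GET_TIMER(void) { return fake_counter++; }

/* The answer's loop with an explicit start mark. The subtraction is unsigned, so the
 * elapsed count stays right when the counter wraps past 0xFFFFFFFF. Returns how many
 * polls the loop made, which on this fake timer equals the ticks waited. */
uint32_t wait_us_from(uint32_t mark, uint32_t us)
{
    uint32_t ticks = us * (TIMER_FREQ / 1000000u);
    uint32_t polls = 0;
    while (ticks > GET_TIMER() - mark)
        polls++;
    return polls;
}

/* A tempting rewrite that compares against an absolute deadline. Near the top of the
 * counter, mark + ticks wraps to a small number and the loop exits immediately. */
uint32_t wait_us_deadline(uint32_t mark, uint32_t us, uint32_t poll_limit)
{
    uint32_t deadline = mark + us * (TIMER_FREQ / 1000000u);
    uint32_t polls = 0;
    while (GET_TIMER() < deadline) {
        if (++polls >= poll_limit)   /* guard so a host run cannot spin forever */
            break;
    }
    return polls;
}

/* Helpers that position the fake counter at a chosen start value before waiting. */
uint32_t polls_relative(uint32_t start, uint32_t us)
{
    fake_counter = start;
    return wait_us_from(start, us);
}

uint32_t polls_deadline(uint32_t start, uint32_t us)
{
    fake_counter = start;
    return wait_us_deadline(start, us, 1000000u);
}

int main(void)
{
    printf("relative, mid-range: %u polls\n", (unsigned)polls_relative(1000u, 100u));
    printf("relative, at wrap:   %u polls\n", (unsigned)polls_relative(0xFFFFFFF0u, 100u));
    printf("deadline, mid-range: %u polls\n", (unsigned)polls_deadline(1000u, 100u));
    printf("deadline, at wrap:   %u polls (exits early)\n", (unsigned)polls_deadline(0xFFFFFFF0u, 100u));
    return 0;
}
```

The relative form waits 100 ticks whether the counter starts at 1000 or 16 ticks below the wrap point; the deadline form returns after zero polls in the second case, which on real hardware would be a silently missing delay.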
This is a Cortex-M3, so you are likely running out of flash? Did you try running from RAM and/or adjusting the flash speed, or adjusting the clocks vs. flash speed (slowing the main clock) so you can get the flash to as close to a single cycle per access as you can?
You are also doing a memory access for half of those instructions, which is a cycle or more for the fetch (one if you are on SRAM running on the same clock) and another clock for the RAM access (due to using volatile). So that could account for some percentage of the difference between one clock per instruction and two clocks per instruction. The branch might cost more than one clock as well; on an M3 I'm not sure if you can turn that on or off (branch prediction), and branch prediction is a bit funny in the way it works anyway: if it is too close to the beginning of a fetch block then it won't work, so where the branch is in RAM can affect the performance, and where any of this is in RAM can affect the performance. You can do experiments by adding nops anywhere in front of the code to change the alignment of the loop; this affects caches (which you likely don't have here) and can also affect other things based on how big the instructions are and where they lie in a fetch (some ARMs fetch 8 instructions at a time, for example).
Not only do you need to know assembly to understand what you are trying to do, but also how to manipulate that assembly and other things like alignment, re-arranging the instruction mix, sometimes more instructions being faster than fewer, and so on. Pipelines and caches are difficult at best to predict, if at all, and can easily throw off assumptions and experiments with hand-optimized code.
Even if you overcome the slow flash, lack of a cache (although you cannot rely on its performance) and other things, the logic between the core and the I/O and the speed of the I/O for bit banging might be another performance hit; there is no reason to expect the I/O to be a small number of cycles per access, it might even be a double digit number of clocks. Very early in this research you need to start GPIO read-only loops, write-only loops, and read/write loops. If you are relying on the GPIO logic to only touch one bit in a port rather than the whole port, that might have a cycle cost, so you need to performance tune that as well.
You might want to look into using a CPLD if you are even close to the margin on timing and have to be hard real time, as one extra line of code or a new rev of the compiler can completely throw off the timing of the project.

Is shift operation running in separated instruction in thumb ISA?

ARM instructions may utilize the barrel shifter in their second source operand (see the assembly listed below), which is part of the data processing instruction and so saves one instruction for the shift. I am wondering: can a Thumb instruction utilize the barrel shifter in DP instructions? Or should it separate the shift operation into an independent instruction? I am asking this since Thumb may not have sufficient space in the instruction to encode the barrel shifter.
    mov r0, r1, LSL #1
That example's not great, since it's an alternate form of the canonical lsl r0, r1, #1, which does have a 16-bit Thumb encoding (albeit with flag-setting restrictions).
An alternative ARM instruction such as add r0, r0, r1, lsl #1 would indeed have to be done as two Thumb instructions, because, as you say, there just isn't room to squeeze both operations into 16 bits (hence also why you're limited to r0-r7, so registers can be encoded in 3 bits rather than 4).
Thumb-2, on the other hand, generally does have 32-bit encodings for immediate shifts of operands, so on ARMv6T2 and later architectures you can encode add r0, r0, r1, lsl #1 as a single instruction.
The register-shifted register form, however (e.g. add r0, r0, r1, lsl r2), isn't available even in Thumb-2; you'd have to do that in 2 Thumb instructions like so:
    lsl r1, r2
    add r0, r1
Note that unlike the ARM instruction this sequence changes the value in r1. If you wanted to preserve that as well you'd need an intermediate register and an extra mov instruction (or a Thumb-2 3-register lsl); failing that, the last resort would be to bx to ARM code.

Explicitly accessing banked registers on ARM

According to the ARM manual, it should be possible to access the banked registers for a specific CPU mode as, for instance, "r13_svc". When I try to do this, gcc yells at me with the following error:
    immediate expression requires a # prefix -- `mov r2,sp_svc'
What's wrong?
Update: The following text from the ARM Architecture Reference Manual for ARMv5 and ARMv6 (section A2.4.2) led me to believe that it is possible:
"Registers R13 and R14 have six banked physical registers each. One is used in User and System modes, and each of the remaining five is used in one of the five exception modes. Where it is necessary to be specific about which version is being referred to, you use names of the form R13_mode, R14_mode, where mode is the appropriate one of usr, svc (for Supervisor mode), abt, und, irq and fiq."
The correct syntax for this is mrs r2, sp_svc or mrs r3, sp_usr. This is a new ARMv7 extension. The code can be seen in the ARM Linux KVM source file interrupt_head.S. The gas binutils patch for this instruction support is by Matthew Gretton-Dann. It requires the virtualization extensions, as far as I understand.
According to what I understand, the LPAE (large physical address extension) implies the virtualization extensions. So Cortex-A7, Cortex-A12, Cortex-A15, and Cortex-A17 may be able to use this extension. However, the Cortex-A5, Cortex-A8, and Cortex-A9 can not.
Documentation on the instruction can be found in the ARMv7-A TRM rev C, under section B9.3.9 MRS (Banked register).
For other Cortex-A (and ARMv6) CPUs you can use the cps instruction to switch modes, transfer the banked register to an un-banked register (R0-R7), and then switch back. The obvious difficulty is with user mode. The correct way to handle this is with ldm rN, {sp,lr}^; user mode has no simple way back to the privileged modes.
For all older CPUs, the information given by old_timer will work. Mainly, use mrs/msr to change modes. mrs/msr works over the full class of ARM CPUs but requires multiple instructions and hence may have race issues, which require interrupt and exception masking depending on context.
This is an important instruction (sequence) for context switching (which VMs do a lot of).
I don't think that's possible with the mov instruction; at least according to the ARM Architecture Reference Manual I'm reading. What document do you have? There is a variant of ldm that can load user mode registers from a privileged mode (using ^). Your only other option is to switch to SVC mode, do mov r2, sp, and then switch back to whatever other mode you were using.
The error you're getting is because it doesn't understand sp_svc, so it thinks you're trying to do an immediate mov, which would look like:
mov r2, #0x14
So that's why it says "requires a # prefix".
You use mrs and msr to change modes by changing bits in the cpsr, then use r13 normally.
From the ARM ARM:
    MRS R0,CPSR
    BIC R0,R0,#0x1F
    ORR R0,R0,#0x13
    MSR CPSR_c,R0
then
    mov sp,#0x10000000
or, if you need more bits in the immediate,
    ldr sp,=0x12345600
or, if you don't want the assembler placing your data, you can place it yourself:
    ldr sp,svc_stack
    b 1f
    svc_stack: .word 0x12345600
    1:
You will see typical ARM startup code, where the application is going to support interrupts, aborts and other exceptions, set all of the stack pointers that you are going to need: change mode, set sp, change mode, set sp, change mode ...
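An added host-side C model of the banking described in the quoted manual text and of the startup pattern in old_timer's answer (not code from the answers, from ARM or from any kernel). It keeps one physical SP per bank, applies the MRS/BIC/ORR/MSR mode change, and shows the pre-ARMv7 "switch, copy, switch back" way of reading another mode's SP, including why it cannot be done for User mode. The mode encodings are architectural; 0x10000000 and 0x12345600 come from the answer; the other stack values and all names are constructed.

```c
#include <stdint.h>
#include <stdio.h>

/* Host-side model of R13 banking: one physical SP shared by User/System, and one for
 * each of the five exception modes. Mode encodings are the architectural ones. */
enum { MODE_USR = 0x10, MODE_FIQ = 0x11, MODE_IRQ = 0x12, MODE_SVC = 0x13,
       MODE_ABT = 0x17, MODE_UND = 0x1B, MODE_SYS = 0x1F };
enum { BANK_USR, BANK_FIQ, BANK_IRQ, BANK_SVC, BANK_ABT, BANK_UND, BANK_COUNT };

typedef struct {
    uint32_t cpsr;
    uint32_t r13[BANK_COUNT];   /* r13_usr, r13_fiq, r13_irq, r13_svc, r13_abt, r13_und */
} cpu_model;

static int bank_of(uint32_t cpsr)
{
    switch (cpsr & 0x1Fu) {
    case MODE_USR: case MODE_SYS: return BANK_USR;   /* User and System share one SP */
    case MODE_FIQ: return BANK_FIQ;
    case MODE_IRQ: return BANK_IRQ;
    case MODE_SVC: return BANK_SVC;
    case MODE_ABT: return BANK_ABT;
    case MODE_UND: return BANK_UND;
    default:       return -1;                        /* illegal mode encoding */
    }
}

/* The MRS / BIC #0x1F / ORR #mode / MSR CPSR_c sequence: only the mode field changes. */
int cpu_set_mode(cpu_model *c, uint32_t mode)
{
    uint32_t next = (c->cpsr & ~0x1Fu) | (mode & 0x1Fu);
    if (bank_of(next) < 0)
        return -1;
    c->cpsr = next;
    return 0;
}

/* "mov sp, #imm" / "ldr sp, =imm" executed in the current mode writes that mode's R13. */
void cpu_write_sp(cpu_model *c, uint32_t value) { c->r13[bank_of(c->cpsr)] = value; }
uint32_t cpu_read_sp(const cpu_model *c)          { return c->r13[bank_of(c->cpsr)]; }

/* The startup pattern from the answer: change mode, set sp, change mode, set sp, ... */
void startup_stacks(cpu_model *c)
{
    cpu_set_mode(c, MODE_IRQ); cpu_write_sp(c, 0x12345600u);  /* value from the answer */
    cpu_set_mode(c, MODE_FIQ); cpu_write_sp(c, 0x12345500u);  /* constructed */
    cpu_set_mode(c, MODE_ABT); cpu_write_sp(c, 0x12345400u);  /* constructed */
    cpu_set_mode(c, MODE_UND); cpu_write_sp(c, 0x12345300u);  /* constructed */
    cpu_set_mode(c, MODE_SVC); cpu_write_sp(c, 0x10000000u);  /* value from the answer */
}

/* Pre-ARMv7 way to read another mode's SP: switch, copy to an unbanked register,
 * switch back. Refused for User mode, which has no way back to a privileged mode. */
int read_sp_of_mode(cpu_model *c, uint32_t mode, uint32_t *out)
{
    uint32_t saved = c->cpsr;
    if (mode == MODE_USR || cpu_set_mode(c, mode) < 0)
        return -1;
    *out = cpu_read_sp(c);      /* mov r0, sp */
    c->cpsr = saved;            /* msr cpsr_c, saved */
    return 0;
}

/* Hooks: run the startup sequence on a fresh model (reset lands in SVC) and report. */
uint32_t demo_sp_of_mode(uint32_t mode)
{
    cpu_model c = { MODE_SVC, {0} };
    uint32_t v = 0;
    startup_stacks(&c);
    if (read_sp_of_mode(&c, mode, &v) < 0)
        return 0xFFFFFFFFu;     /* sentinel: not readable this way */
    return v;
}

uint32_t demo_mode_after_read(void)
{
    cpu_model c = { MODE_SVC, {0} };
    uint32_t v = 0;
    startup_stacks(&c);
    read_sp_of_mode(&c, MODE_IRQ, &v);
    return c.cpsr & 0x1Fu;      /* the caller's mode must be restored */
}

int main(void)
{
    printf("svc sp = %#x\n", (unsigned)demo_sp_of_mode(MODE_SVC));
    printf("irq sp = %#x\n", (unsigned)demo_sp_of_mode(MODE_IRQ));
    printf("sys sp = %#x (shares the untouched usr bank)\n", (unsigned)demo_sp_of_mode(MODE_SYS));
    printf("usr sp = %#x (sentinel: needs ldm ...^ or mrs sp_usr)\n", (unsigned)demo_sp_of_mode(MODE_USR));
    printf("mode after read = %#x\n", (unsigned)demo_mode_after_read());
    return 0;
}
```

Reading the System-mode SP returns the untouched User bank, which makes the "User and System share one physical R13" statement of the manual visible, and the User-mode case returns the sentinel because the switch-and-copy approach cannot get back, which is exactly the gap the ldm ...^ form and the ARMv7 mrs sp_usr extension fill.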
