armV8 alignment abort - arm

When running strh r1, [r2] on ARMv8, I receive an alignment abort with a DFSR of 0x801. This is what I expect, as the value of r2 is 0x10074d33, which is not halfword aligned.
But when I clear SCTLR.A (the alignment-checking bit), I still receive the alignment abort! Should I change some other bit somewhere else in order to disable alignment checking?

Unaligned transfers are not permitted if the memory target is defined as Strongly Ordered or Device; such accesses fault regardless of SCTLR.A.
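As a rough aid for reading the numbers in the question, the sketch below decodes the DFSR value and checks the address. It assumes the short-descriptor DFSR layout (bit 9 clear, as it is in 0x801), where FS[3:0] is bits 3:0, FS[4] is bit 10 and WnR is bit 11. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: short-descriptor DFSR.
 * FS[3:0] = bits 3:0, FS[4] = bit 10, WnR = bit 11. */
static inline uint32_t dfsr_fault_status(uint32_t dfsr)
{
    return (dfsr & 0xFu) | (((dfsr >> 10) & 1u) << 4);
}

/* True if the aborted access was a write (WnR set). */
static inline bool dfsr_is_write(uint32_t dfsr)
{
    return ((dfsr >> 11) & 1u) != 0;
}

/* True if addr is a multiple of size (size must be a power of two). */
static inline bool is_aligned(uint32_t addr, uint32_t size)
{
    return (addr & (size - 1u)) == 0;
}

/* Values taken from the question. */
void demo_alignment_abort(void)
{
    assert(dfsr_fault_status(0x801u) == 0x01u); /* 0b00001: alignment fault */
    assert(dfsr_is_write(0x801u));              /* caused by a store (STRH) */
    assert(!is_aligned(0x10074d33u, 2u));       /* odd address, not halfword aligned */
}
```

So 0x801 reads as "alignment fault on a write", which matches an STRH to an odd address.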

Related

Behavior of LDR on a uint8_t variable in ARM?

I cannot find a straight answer for this anywhere. The registers for ARM are 32-bit. I know that LDRB loads a byte-sized value into a register and zeros out the remaining 3 bytes; even if you feed it a value bigger than a byte, it will just take the first byte.
My program combines C with ARM Assembly. I have an extern variable in C that gets loaded into a register directly.
However if I call just LDR on this byte variable, is there a guarantee that it loads the byte and nothing else or will it load random things in the remaining 3 byte space from nearby things in memory to fill out the entire 32-bit register?
I'm only asking because I did LDR R0, =var and always got the correct value out of probably a hundred million executions (software ran for a long time and was tested thoroughly / recompiled many times before this issue was brought up on another setup).
However, someone else with a different setup (not so different; the compiler is the same version, I think) compiled the code successfully, but the value loaded into R0 was polluted with random bits from the memory surrounding the variable. They had to use LDRB to fix it.
Is this a compiler thing? Can it detect this and automatically switch it to LDRB? Or am I just that lucky that the surrounding memory of the variable was just zero due to some optimization?
As a side note the compiler is ARM GCC 9.2.1
"because I did LDR R0, =var"
Are you loading the value or the address of the variable?
Normally, the instruction LDR R0, =var will write the address of the variable var into the register R0 and not the value.
And the address of a variable is always a 32-bit value on a 32-bit ARM CPU - independent of the data type.
"However if I call just LDR on this byte variable, ..."
If you load the value of a variable (e.g. using LDR R1, [R0]), two things may happen:
The upper 24 bits of the register may contain a random value depending on the bytes that follow your variable in memory. If you are lucky, the bytes are always zero.
Depending on the exact CPU type, you may get problems due to alignment (for example an alignment exception or even completely undefined behavior)
LDR doesn't know anything about how you declared the variable or what's supposed to be in the 4 bytes it loads. That's why ISAs like ARM have byte loads like LDRB (and its sign-extending equivalent) in the first place.
And no, compilers don't waste 3 bytes (of zeros) after every uint8_t just so you can use word loads on it; that would be silly. That is, uint8_t is unsigned char, so sizeof(uint8_t) == 1, CHAR_BIT == 8, and alignof(uint8_t) == 1.
LDR loads an int32_t or uint32_t whole word.
But as Martin points out, LDR r0, =var puts the address of var into a register.
Then you use ldrb r1, [r0]
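The difference between the two loads can be modelled in plain C. This is only a sketch: it assumes a little-endian system and a made-up memory layout in which the uint8_t variable happens to be followed by unrelated bytes. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Models LDRB: load one byte and zero-extend it to 32 bits. */
uint32_t model_ldrb(const uint8_t *p)
{
    return p[0];
}

/* Models an aligned little-endian LDR: all four bytes end up in the register. */
uint32_t model_ldr(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

void demo_ldr_vs_ldrb(void)
{
    /* Hypothetical layout: the variable (0x2A) followed by unrelated data. */
    const uint8_t memory[4] = { 0x2A, 0xDE, 0xAD, 0xBE };

    assert(model_ldrb(memory) == 0x0000002Au); /* only the variable */
    assert(model_ldr(memory)  == 0xBEADDE2Au); /* variable plus its neighbours */
}
```

If the three neighbouring bytes happen to be zero, both loads give the same result, which is consistent with the code appearing to work on one setup and not on another.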
Fun fact: on early ARM CPUs (ARMv4 and earlier), an unaligned word load uses the low 2 bits of the address as a rotate count (after loading from the aligned word). https://medium.com/@iLevex/the-curious-case-of-unaligned-access-on-arm-5dd0ebe24965
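A small model of that rotate behaviour, as a sketch only: it assumes little-endian memory and a rotate right by 8 times the low two address bits, and the function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value right by n bits. */
static uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}

/* Models the legacy unaligned LDR: fetch the aligned word that contains
 * addr, then rotate it right by 8 * (addr & 3). mem is little-endian and
 * addr is a byte offset into it. */
uint32_t model_legacy_ldr(const uint8_t *mem, uint32_t addr)
{
    uint32_t base = addr & ~3u;
    uint32_t word = (uint32_t)mem[base] |
                    ((uint32_t)mem[base + 1] << 8) |
                    ((uint32_t)mem[base + 2] << 16) |
                    ((uint32_t)mem[base + 3] << 24);
    return ror32(word, 8u * (addr & 3u));
}

void demo_legacy_unaligned_ldr(void)
{
    const uint8_t mem[4] = { 0x11, 0x22, 0x33, 0x44 };

    assert(model_legacy_ldr(mem, 0) == 0x44332211u); /* aligned: no rotation */
    assert(model_legacy_ldr(mem, 1) == 0x11443322u); /* rotated right by 8 */
    assert(model_legacy_ldr(mem, 3) == 0x33221144u); /* rotated right by 24 */
}
```

In this model the addressed byte always lands in the low byte of the register, but the upper bytes come from the same aligned word rather than from the following addresses.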

Jump between Thumb and ARM

I am interested in the ARM and Thumb2 instructions LDR PC, =ADDR and LDR.W PC, =ADDR for jumping to an absolute address.
For example, when I jump from ARM code to ARM, the command LDR PC, =ADDR is performed.
But what happens in the other scenarios?
from ARM to Thumb2
from Thumb2 to Thumb2
from Thumb2 to ARM
When does +1 need to be added to the address, and why?
The rule is actually quite simple:
If bit 0 of the address is 0, the CPU will execute the code as ARM code after the next branch
If bit 0 of the address is 1, the CPU will execute the code as Thumb after the next branch
Of course, if there is a mismatch, the CPU will almost certainly fault (after executing random code), because it has no way to check whether the code is ARM or Thumb.
This is what explains the +1.
Note that depending on the compiler, and depending on the label used, bit 0 of the address may be automatically set by the compiler.
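The rule can be written down as a tiny model. This is a sketch with hypothetical type and function names: bit 0 of the branch target selects the instruction set state and is then dropped from the address that is actually fetched from.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
typedef enum { STATE_ARM, STATE_THUMB } isa_state;

typedef struct {
    isa_state state; /* instruction set state after the branch */
    uint32_t  pc;    /* address execution continues at */
} branch_result;

/* Models an interworking branch such as BX: bit 0 of the target picks the
 * state. ARM targets are expected to be word aligned; that case is not
 * checked here. */
branch_result interworking_branch(uint32_t target)
{
    branch_result r;
    r.state = (target & 1u) ? STATE_THUMB : STATE_ARM;
    r.pc    = target & ~1u;
    return r;
}

void demo_interworking_branch(void)
{
    /* Example addresses are made up. */
    branch_result t = interworking_branch(0x08000101u);
    branch_result a = interworking_branch(0x08000100u);

    assert(t.state == STATE_THUMB && t.pc == 0x08000100u);
    assert(a.state == STATE_ARM   && a.pc == 0x08000100u);
}
```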
You need to just read the documentation.
The following instructions write a value to the PC, treating that value as an interworking address to branch to, with low-order bits that determine the new instruction set state:
- BLX (register), BX, and BXJ
- LDR instructions with <Rt> equal to the PC
- POP and all forms of LDM except LDM (exception return), when the register list includes the PC
- in ARM state only, ADC, ADD, ADR, AND, ASR (immediate), BIC, EOR, LSL (immediate), LSR (immediate), MOV, MVN, ORR, ROR (immediate), RRX, RSB, RSC, SBC, and SUB instructions with <Rd> equal to the PC and without flag-setting specified.
Since you mentioned Thumb2, that means ARMv6 or newer (did you say Thumb2 and generically mean Thumb?), and I believe the docs are telling us the above applies to ARMv6 and ARMv7.
Note that the bit is consumed by the instruction. The PC doesn't carry around a set lsbit in Thumb mode; the bit is just used by the instruction to indicate a mode change.
Also note you should think in terms of OR 1, not PLUS 1. If you write your code correctly, the toolchain will supply you with the correct address with the correct lsbit; if you add a one to that address, you will break the code. If you are paranoid or have not done it right, you can OR a one into the address: if the bit is already there, no harm is done, and if it isn't, the OR fixes the problem that prevented it from being there. I would never use a plus one, though, with respect to switching to Thumb mode.
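The OR-versus-PLUS point is easy to check with a couple of lines of C. The addresses below are made up, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Setting the Thumb bit with OR is idempotent. */
uint32_t thumb_target_or(uint32_t addr)
{
    return addr | 1u;
}

/* Adding one only works if bit 0 was clear to begin with. */
uint32_t thumb_target_add(uint32_t addr)
{
    return addr + 1u;
}

void demo_or_vs_plus(void)
{
    /* The toolchain already set bit 0 for this (hypothetical) Thumb label. */
    uint32_t from_toolchain = 0x08000101u;

    assert(thumb_target_or(from_toolchain)  == 0x08000101u); /* unchanged */
    assert(thumb_target_add(from_toolchain) == 0x08000102u); /* bit 0 lost */

    /* If bit 0 was missing, both give the same result. */
    assert(thumb_target_or(0x08000100u)  == 0x08000101u);
    assert(thumb_target_add(0x08000100u) == 0x08000101u);
}
```

With the plus-one version, an address that already had bit 0 set ends up with bit 0 clear, so the branch would be treated as a switch to ARM state at a different address.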

Why is SP (apparently) stored on exception entry on Cortex-M3?

I am using a TI LM3S811 (an older Cortex-M3) with the SysTick interrupt set to trigger at 10 Hz. This is the body of the ISR:
void SysTick_Handler(void)
{
__asm__ volatile("sub r4, r4, #32\r\n");
}
This produces the following assembly with -O0 and -fomit-frame-pointer with gcc-4.9.3. The STKALIGN bit is 0, so stacks are 4-byte aligned.
00000138 <SysTick_Handler>:
138: 4668 mov r0, sp
13a: f020 0107 bic.w r1, r0, #7
13e: 468d mov sp, r1
140: b401 push {r0}
142: f1ad 0420 sub.w r4, r4, #32
146: f85d 0b04 ldr.w r0, [sp], #4
14a: 4685 mov sp, r0
14c: 4770 bx lr
14e: bf00 nop
I don't understand what's going on with r0 in the listing above. Specifically:
1) It seems like we're clearing the lower 3 bits of SP and storing it on the stack. Is that to maintain 8-byte alignment? Or is it something else?
2) The exception exit procedure is equally confusing. From my limited understanding of ARM assembly, it does something like this:
SP = SP + 4; R0 = SP;
Followed by storing it back to SP. Which seems to undo the manipulations until this stage.
3) Why is there a nop instruction after the unconditional branch (at 0x14E)?
The ARM Procedure Call Standard and C ABI expect an 8-byte (64-bit) alignment of the stack. As an interrupt might occur after pushing/popping a single word, it is not guaranteed that the stack is correctly aligned on interrupt entry.
The STKALIGN bit, if set (the default), makes the hardware align the stack automatically by conditionally pushing an extra (dummy) word onto the stack.
The interrupt attribute on a function, OTOH, tells gcc that the stack might be misaligned, so it adds this pre-/postamble, which enforces the alignment.
So both actually do the same thing: one in hardware, one in software. If you can live with a word-aligned stack only, you should remove the interrupt attribute from the function declarations and clear the STKALIGN bit.
Make sure such a "misaligned" stack is no problem (I would not expect any, as this is a pure 32-bit CPU). OTOH, you should leave it as-is unless you really need to save that extra conditional(!) clock and word (very unlikely).
Warning: According to the ARM Architecture Reference Manual, setting STKALIGN == 0 is deprecated. Briefly: do not set this bit to 0!
Since you're using -O0, you should expect lots of redundant and useless code. The general way in which a compiler works is to generate code with the full generality of everything that might be used anywhere in the program, and then rely on the optimizer to get rid of things that are unneeded.
Yes, this is doing 8-byte alignment. It's also allocating a stack frame to hold local variables even though you have none.
The exit is the reverse, deallocating the stack frame.
The nop at the end is to maintain 4-byte alignment in the code, as you might want to link with non-Thumb code at some point.
If you enable optimization, it will eliminate the stack frame (as it's unneeded) and the code will become much simpler.
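To make the effect on SP concrete, here is a rough C model of the instruction sequence in the listing from the question (mov/bic.w/mov/push on entry, ldr.w/mov on exit). It is a sketch only: the struct, the tiny stack array and the starting SP values are invented for the example, and SP is modelled as a byte offset into that array.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16u

/* Hypothetical model: sp is a byte offset into stack[], growing downwards. */
typedef struct {
    uint32_t sp;
    uint32_t stack[STACK_WORDS];
} cpu_model;

/* mov r0, sp ; bic.w r1, r0, #7 ; mov sp, r1 ; push {r0} */
void handler_prologue(cpu_model *c)
{
    assert(c->sp >= 8u && c->sp <= STACK_WORDS * 4u);
    uint32_t r0 = c->sp;          /* remember the incoming SP */
    c->sp = r0 & ~7u;             /* clear the low 3 bits */
    c->sp -= 4u;                  /* push {r0}: make room ... */
    c->stack[c->sp / 4u] = r0;    /* ... and store the incoming SP */
}

/* ldr.w r0, [sp], #4 ; mov sp, r0 */
void handler_epilogue(cpu_model *c)
{
    uint32_t r0 = c->stack[c->sp / 4u]; /* reload the saved SP */
    c->sp += 4u;                        /* post-increment from the ldr */
    c->sp = r0;                         /* restore the incoming SP */
}

void demo_stack_realign(void)
{
    cpu_model c = { .sp = 44u };  /* 4-byte aligned, not 8-byte aligned */

    handler_prologue(&c);
    assert(c.sp == 36u);          /* (44 & ~7) - 4 */
    handler_epilogue(&c);
    assert(c.sp == 44u);          /* back to the value on entry */
}
```

Whatever SP was on entry, the epilogue puts exactly that value back, because the original SP is what gets pushed and reloaded.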

ARM CPU Mode SVC Instruction

This is the startup code of a bootloader for ARM; it puts the CPU into SVC mode:
1) mrs r0, cpsr
2) bic r0, r0, #0x1F
3) orr r0, r0, #0xD3
4) msr cpsr, r0
My question is: why must we use the first instruction, "mrs r0, cpsr"? Can't we just use 2) and 3) to obtain 0xD3 and write it to cpsr directly? What exactly does 1) do?
CPSR contains more state than just the CPU mode.
For example, it contains a state bit telling whether the CPU is executing in ARM or Thumb mode.
Writing to CPSR without preserving the other state would most likely put the CPU into an undefined state. Because of this, you always do a read-modify-write.
Most ARM documents also stress the importance of preserving the reserved bits for future compatibility:
To maintain compatibility with future ARM processors, and as good
practice, you are strongly advised to use a read-modify-write strategy
when you change the CPSR.
Well, in fact, instructions 2 and 3 manipulate bits 7, 6 and 4, 3, 2, 1, 0:
I is set (masking IRQs)
F is set (masking FIQs)
MODE is set to 0b10011
The remaining bits are unchanged, thanks to the read-modify-write sequence (which, by the way, answers your question about the usefulness of instruction 1).
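The bit manipulation done by instructions 2 and 3 can be checked in C. This is just a model of the arithmetic; the starting CPSR value is made up, and the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CPSR_MODE_MASK 0x1Fu       /* bits 4:0 */
#define CPSR_MODE_SVC  0x13u       /* 0b10011 */
#define CPSR_F_BIT     (1u << 6)   /* FIQ mask */
#define CPSR_I_BIT     (1u << 7)   /* IRQ mask */

/* Models: bic r0, r0, #0x1F ; orr r0, r0, #0xD3 */
uint32_t cpsr_enter_svc(uint32_t cpsr)
{
    cpsr &= ~CPSR_MODE_MASK;  /* clear the old mode */
    cpsr |= 0xD3u;            /* I | F | SVC mode */
    return cpsr;
}

void demo_cpsr_rmw(void)
{
    uint32_t old_cpsr = 0x6000011Fu;  /* hypothetical starting value */
    uint32_t new_cpsr = cpsr_enter_svc(old_cpsr);

    assert(new_cpsr == 0x600001D3u);
    assert((new_cpsr & CPSR_MODE_MASK) == CPSR_MODE_SVC);
    assert((new_cpsr & CPSR_I_BIT) && (new_cpsr & CPSR_F_BIT));
    /* Everything outside bits 7, 6 and 4:0 is carried over unchanged. */
    assert((new_cpsr & ~0xDFu) == (old_cpsr & ~0xDFu));
}
```

Writing the constant 0xD3 directly would instead zero every bit above bit 7, which is the state the read-modify-write sequence is there to preserve.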

Does stm with non-adjacent registers do 32 bit writes?

I am looking at a piece of ARM code that will write a pair of 32bit registers, like this:
ldm r9!, {r0, r1}
sub r8, r8, #2
stm r10!, {r0, r1}
When the r10 output pointer is word aligned but not always dword aligned, does the above code write one 64bit value? My reading of the docs makes me think that a 64bit value would be written in this case, but I am concerned about the case where the 8 word cache line might already contain 7 words and then this code does a 64bit write and splits half of one of the dwords over the end of the cache line.
I was thinking that if the stm were to do 2 32bit word writes instead, that might avoid the issue. So, my question is would using two non-adjacent registers force the stm to write 2 words as opposed to a dword?
ldm r9!, {r0, r2}
sub r8, r8, #2
stm r10!, {r0, r2}
Would the above code be basically the same as:
ldm r9!, {r0, r1}
sub r8, r8, #2
str r0, [r10], #4
str r1, [r10], #4
The register numbers you are writing from or reading into have nothing to do with the AMBA/AXI bus transaction. The only connection is the quantity of data.
The question is a bit vague and I don't know enough about all the different implementations, but if you have a 64-bit AXI bus and your 64 bits of data are not being written to a 64-bit-aligned address (this is perfectly legal, writing 2 registers to address 0x1004 for example), then it takes two bus transactions: one for the first item at the unaligned address (0x1004) and one for the other (0x1008). If you are using an aligned address, it will perform a single 64-bit transaction independent of the register numbers, so long as there are two of them.
The cache is yet another, completely separate, topic. I believe you will get two separate transactions if the address is not dword aligned, and those transactions will be handled separately by the cache. Understand that the L1 cache, if you have one, is inside the core and not on the AXI bus; the L2 cache, if present, is outside the core, between the core and the vendor's AXI memory controller. So L1 behavior and L2 behavior can vary. I don't know what the core's interface to the L1 looks like, or if and how it breaks up these transactions. I suspect that, no matter what make or model of processor you are on, if something crosses a cache line boundary at some point in the memory system or in the cache logic, it has to break that transaction up and handle the two cache lines separately.
From what I have seen, the stm/ldm turns the single instruction into separate bus transactions where necessary. For example, a 4-register write to 0x1004 turns into 3 separate transactions: a 32-bit one at 0x1004, a 64-bit one at 0x1008, and a 32-bit one at 0x1010. Doing that yourself just wastes instruction fetch cycles; use the stm in this case.
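The splitting described above can be sketched as a small function. This is a simplified model of the behaviour the answer reports for a 64-bit bus, not a description of any particular core: two words go out together only when they start on an 8-byte boundary. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t addr;  /* start address of the transaction */
    unsigned bits;  /* 32 or 64 */
} bus_txn;

/* Splits a run of `words` 32-bit words starting at a word-aligned addr into
 * transactions for an assumed 64-bit bus. The caller provides room for
 * `words` entries in out. Returns the number of transactions. */
size_t split_stm(uint32_t addr, size_t words, bus_txn *out)
{
    size_t n = 0;

    assert((addr & 3u) == 0);
    while (words > 0) {
        if ((addr & 7u) == 0 && words >= 2) {
            out[n].addr = addr;
            out[n].bits = 64;
            addr += 8;
            words -= 2;
        } else {
            out[n].addr = addr;
            out[n].bits = 32;
            addr += 4;
            words -= 1;
        }
        n++;
    }
    return n;
}

void demo_split_stm(void)
{
    bus_txn t[4];

    /* Four registers written to 0x1004, as in the example above. */
    assert(split_stm(0x1004u, 4, t) == 3);
    assert(t[0].addr == 0x1004u && t[0].bits == 32);
    assert(t[1].addr == 0x1008u && t[1].bits == 64);
    assert(t[2].addr == 0x1010u && t[2].bits == 32);

    /* Two registers: one 64-bit transfer if aligned, two 32-bit ones if not. */
    assert(split_stm(0x1008u, 2, t) == 1 && t[0].bits == 64);
    assert(split_stm(0x1004u, 2, t) == 2 && t[0].bits == 32 && t[1].bits == 32);
}
```

Note that the register numbers never appear in the model: only the start address and the word count decide how the write is split.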
