usage of PLD in arm cortex a9 - arm

I am trying to use PLD instruction. The problem I am facing is as follows:
int32_t addr[10];
asm ("PLD [addr,#5]");
I am getting following error:
Error: ARM register expected -- `pld [addr,#5]'

The address used by the preload instruction needs to be in a register.
addr is a variable (memory location), not a register.

int32_t addr[10];
asm ("PLD [%[ADDR],#5] \n\t"
:
: [ADDR]"r"(addr)
);
Give the thing a name in the registers list, and then specify it like shown. 5 is a strange number to use for prefetch. Most PC will be working with sizes of 32 etc...
When using pld, the size of each line in memory is apparently 64 bytes for the arm chips in ipad and ipad2. So to pld most efficiently it would probably work best to do 1 pld for a 64 byte size range and then unroll the loop to cover just that range if that is the type of code being programmed.
For example you might move through sixteen 32 bit float sized entries for each pld.

Related

Understanding Cortex-M assembly LDR with pc offset

I'm looking at the disassembly code for this piece of C code:
#define GPIO_PORTF_DATA_R (*((volatile unsigned long *)0x400253FC))
int main(void){
// Initialization code
while(1) {
SW1 = GPIO_PORTF_DATA_R&0x10; // Read PF4 into SW1
// Other code
SW2 = GPIO_PORTF_DATA_R&0x01;
}
}
The assembly for that SW1= line is (sorry can't copy code):
https://imgur.com/dnPHZrd
Here are my questions:
At the first line, PC = 0x00000A56, and PC + 92 = 0x00000AB2, which is not equal to 0x00000AB4, the number shown. Why?
I did a bit of research on SO and found out that PC actually points to the Next Next instruction to be executed.
When pc is used for reading there is an 8-byte offset in ARM mode and 4-byte offset in Thumb mode.
However 0x00000AB4 - 0x00000A56 = 0x5E = 94, neither does it match 92+8 or 92+4. Where did I get wrong?
Reference:
Strange behaviour of ldr [pc, #value]
Why does the ARM PC register point to the instruction after the next one to be executed?
LDR Rd,-Label vs LDR Rd,[PC+Offset]
From ARM documentation:
Operation
address = (PC[31:2] << 2) + (immed_8 * 4)
Rd = Memory[address, 4]
The pc is 0xA56+4 because of two instructions ahead and this is thumb so 4 bytes.
(0xA5A>>2)<<2 + (0x17*4)
or
(0x00000A5A&0xFFFFFFFC) + (0x17<<2)
0xA58+92=0xA64
This is an LDR so it is a word-based address ideally. Because the thumb instruction can be on a non-word aligned address, you start off by adding two instructions of course (thumb2 complicates this but add four for thumb). Then zero the lower two bits (LDR) the offset is in words so need to convert that to bytes, times four. This makes the encoding make more sense if you think about each part of it. In arm mode the PC is already word aligned so that step is not required (and in arm mode you have more bits for the immediate so it is byte-based not word-based), making the offset encoding between arm and thumb possibly confusing.
The various documents will show the math in different ways but it is the same math nevertheless. The PC is the only confusing part, especially for thumb. For ARM you add 8, two ahead, for thumb it is basically 4 because the execution cannot tell if there is a thumb2 coming, and it would break a great many things if they had attempted that. So add 4 for the two ahead, for thumb. Since thumb is compressed they do not use a byte offset but instead a word offset giving 4 times the range. Likewise this and/or other instructions can only look forward not back so unsigned offset. This is why you will get alignment errors when assembling things in thumb that in arm would just be unaligned (and you get what you get there depending on architecture and settings). Thumb cannot encode any address for an instruction like this.
For understanding instruction encoding, in particular pc based addressing, it is best to go back to the early ARM ARM (before the armv5 one but if not then just get the armv5 one) as well as the armv6-m and armv7-m and full sized armv7-ar. And look at the pseudo-code for each. The older one generally has the best pseudo-code, but sometimes they leave out the masking of lower bits of the address. No document is perfect, they have bugs just like everything else. Naturally the architecture tied to the core you are using is the official document for the IP the chip vendor used (even down to the specific version of the TRM as these can vary in incompatible ways from one to the next). But if that document is not perfectly clear you can sometimes get an idea from others that, upon inspection, have compatible instructions, architectural features.
You missed a key part of the rules for Thumb mode, quoted in one of the question you linked (Why does the ARM PC register point to the instruction after the next one to be executed?):
For all other instructions that use labels, the value of the PC is the address of the current instruction plus 4 bytes, with bit[1] of the result cleared to 0 to make it word-aligned.
(0xA56 + 4) & -4 = 0xA58 is the location that PC-relative things are relative to during execution of that ldr r0, [PC, #92]
((0xA56 + 4) & -4) + 92 = 0xab4, the location the disassembler calculated.
It's equivalent to do 0xA56 & -4 = 0xa54 then +4 + 92, because +4 doesn't modify bit #1; you can think of clearing it before or after adding that +4. But you can't clear the bit after adding the PC-relative offset; that can be unaligned for other instructions like ldrb. (Thumb-mode ldr encodes an offset in words to make better use of the limited number of bits, so the scaled offset and thus the final load address always have bits[1:0] clear.)
(Thanks to Raymond Chen for spotting this; I had also missed it initially!)
Also note that your debugger shows you a PC value when stopped at a breakpoint, but that's the address of the instruction you're stopped at. (Because that's how ARM exceptions work, I assume, saving the actual instruction to return to, not some offset.) During execution of the instruction, PC-relative stuff follows different rules. And the debugger doesn't "cook" this value to show what PC will be during its execution.
The rule is not "relative to the end of this / start of next instruction". Answers and comments stating that rule happen to get the right answer in this case, but would get the wrong answer in other Thumb cases like in LDR Rd,-Label vs LDR Rd,[PC+Offset] where the PC-relative load instruction happens to start at a 4-byte aligned address so bit #1 of PC is already cleared.
Your LDR is at address 0xA56 where bit #1 is set, so the rounding down has an effect. And your ldr instruction used a 2-byte encoding, not a Thumb2 32-bit instruction like you might need for a larger offset. Both of these things means round-down + 4 happens to be the address of the next instruction, rather than 2 instruction later or the middle of this instruction.
Since the program counter points to the next instruction, when it executes the LDR at address 0x00000A56, the program counter will be holding the address of the next instruction, which is 0x00000A58.
0x0A58 + 0x5C (decimal 92) == 0x00000AB4

Behavior of LDR on a uint8_t variable in ARM?

I cannot find a straight answer for this anywhere. The registers for ARM are 32-bit, I know that LDRB loads a byte size value into a register and zeros out the remaining 3 bytes, even if you feed it a value bigger than a byte, it will just take the first byte value.
My program combines C with ARM Assembly. I have an extern variable in C that gets loaded into a register directly.
However if I call just LDR on this byte variable, is there a guarantee that it loads the byte and nothing else or will it load random things in the remaining 3 byte space from nearby things in memory to fill out the entire 32-bit register?
I'm only asking because I did LDR R0, =var and always got the correct value out of probably a hundred million executions (software ran for a long time and was tested thoroughly / recompiled many times before this issue was brought up on another setup).
However someone else with a different setup (Not so different, compiler is the same version I think) compiled the code successfully however the value loaded into R0 was polluted with random bits from the surrounding memory of the variable. They had to do LDRB to fix it.
Is this a compiler thing? Can it detect this and automatically switch it to LDRB? Or am I just that lucky that the surrounding memory of the variable was just zero due to some optimization?
As a side note the compiler is ARM GCC 9.2.1
because I did LDR R0, =var
Are you loading the value or the address of the variable?
Normally, the instruction LDR R0, =var will write the address of the variable var into the register R0 and not the value.
And the address of a variable is always a 32-bit value on a 32-bit ARM CPU - independent of the data type.
However if I call just LDR on this byte variable, ...
If you load the value of a variable (e.g. using LDR R1, [R0]), two things may happen:
The upper 24 bits of the register may contain a random value depending on the bytes that follow your variable in memory. If you are lucky, the bytes are always zero.
Depending on the exact CPU type, you may get problems due to alignment (for example an alignment exception or even completely undefined behavior)
LDR doesn't know anything about how you declared the variable or what's supposed to be in the 4 bytes it loads. That's why ISAs like ARM have byte loads like LDRB (and its sign-extending equivalent) in the first place.
And no, compilers don't waste 3 bytes (of zeros) after every uint8_t just so you can use word loads on it, that would be silly. i.e. sizeof(uint8_t) = 1 = unsigned char, CHAR_BIT = 8, and alignof(uint8_t) = 1
LDR loads an int32_t or uint32_t whole word.
But as Martin points out, LDR r0, =var puts the address of var into a register.
Then you use ldrb r1, [r0]
Fun fact: early ARM CPUs (ARMv4 and earlier) with an unaligned word load will use the low 2 bits of the address as a rotate count (after loading from an aligned word). https://medium.com/#iLevex/the-curious-case-of-unaligned-access-on-arm-5dd0ebe24965

32-bit fixed instruction length in 64-bit memory space

I am currently reading up on the AArch64 architecture by ARM. They are using a RISC-like instruction set with a fixed instruction length of 32-bit while operating on 64-bit addresses. I am still new to the topic of ISA so my question is: how can you operate with 64-bit long addresses when you only have 32-bit length in your instructions?
32-bit is the length of an instruction in bytes, not the operand-size or the address size.
ARM 32-bit (like all other 32-bit RISCs that use 32-bit fixed-width instruction words) can't fit a 32-bit address as an immediate into a single instruction either: there'd be no room for an opcode to say what instruction it is.
The width of an instruction limits the number of registers you can have. With 3 registers per instruction (dst, src1, src2), AArch64's increase from 16 to 32 registers means that each instruction needs 3 * log2(32) = 3* 5 = 15 bits to encode the registers. Or fewer for instructions with only 2 or 1 registers. (e.g. mov-immediate or add-immediate). The rest of the space goes to number of possible opcodes, and the size of immediates.
To get an address into a register, ARM compilers will typically load it from a nearby pool of constants (with a PC-relative addressing mode).
The other option is what most RISC CPUs do: use a 2-instruction sequence to put a 16-bit immediate in the upper half of a register, then OR a 16-bit immediate into the low half. (Or use the lower half of a static address as the displacement to a load/store instruction that uses an addressing mode like register + 16-bit offset.)
MIPS is a good example of a very simple RISC, see it's ISA with binary encoding. Its lui reg, imm16 puts imm16 <<16 into a register. (Load Upper Immediate). Then lw dst, imm16(base_reg) is a load like I was talking about in the last paragraph.
Even in 64-bit code, most numbers are still small, so there's not much need for wider immediate operands (except for addresses). e.g. x86 still uses a choice of 32-bit or 8-bit immediate operands for add r64, imm. x86 being a variable length ISA saves space when immediates are between -128 and +127 in a lot of cases.
You could have a 4-bit CPU that operates on 4096 bit values, it just wouldn't be terribly fast at doing it. Any programming language with a "bignum" type will work this way, though usually not to the same extreme.
The way a 32-bit CPU can operate on 64-bit values is because the hardware or software allows it, no other reason.
It's not like we couldn't manipulate 64-bit integer values before we had 64-bit CPUs. Even the old Intel 80386 could support 64-bit operations and it was released in 1985.

ARM Cortex A7: avoid memory veneers?

On ARMv7, which is Thumb capable, is it right that we can avoid all the veneers by using the BX instruction?
Since this instruction takes a 32 bit register, are we good?
If yes, when I see veneers in the generated code, I should specialize the output for my machine, right?
Thanks
Yes, since BX takes a 32-bit register, there's no need for veeners because you can cover the whole addressing space.
Of course you'd need to load a 32-bit value into the register, which usually means constant pooling, so if you are looking to squeeze every cycle out of it and your program is not too large you're better off with relative branches. As #Notlikethat notes, if you don't already have the address in a register there's no point in using BX when you can just LDR PC, ... (unless you need to support ARMv4T interworking).
Relative, non-conditional, 32-bit Thumb branches have a 24-bit addressing space, so you can reach +/- 16MB (for others see here). If you're doing ELF, be really careful with 16-bit relative Thumb branches. A 32-bit branch will generate a 24-bit relocation and the linker will insert a veener if the target can't be addressed with 24 bits. A 16-bit branch generates a 11-bit relocation and ELF for ARM specifies that the linker is not required to generate veeners for those, so you'd risk a link-time out-of-range branch.

Intel x86 to ARM assembly conversion

I am currently learning ARM assembly language;
To do so, I am trying to convert some x86 code (AT&T Syntax) to ARM assembly (Intel Syntax) code.
__asm__("movl $0x0804c000, %eax;");
__asm__("mov R0,#0x0804c000");
From this document, I learn that in x86 the Chunk 1 of the heap structure starts from 0x0804c000. But I when I try do the same in arm,
I get the following error:
/tmp/ccfNZp9F.s:174: Error: invalid constant (804c000) after fixup
I am assuming the problem is that ARM can only load 32bit instructions.
Question 1: Any idea what would be the first chunk in case of ARM processors?
Question 2:
From my previous question, I know how memory indirect addressing works.
Are the snippets written below doing the same job?
movl (%eax), %ebx
LDR R0,[R1]
I am using ARMv7 Processor rev 4 (v7l)
Trying to learn arm by looking at x86 is not a good idea one is CISC and quite ugly the other is RISC and much cleaner.. Just learn ARM by looking at the instruction set reference in the architectural reference manual. Look up the mov instruction the add instruction, etc.
ARM doesnt use intel syntax it uses ARM syntax.
Dont learn by using inline assembly, write real assembly. Use an instruction set simulator first not hardware.
ARM, Mips and others aim for fixed word length. So how would you for example fit an instruction that says move some immediate to a register, specify the register, and fit the 32 bit immediate all in 32 bits? not possible. So for fixed length instruction sets you cannot simply load any immediate you want into any register. You must read up on the rules for that instruction set. mips allows for 16 bit immediates, arm for 8 plus or minus depending on the flavor of arm instruction set and the instruction. mips limits where you can put those 16 bits either high or low, arm lets you put those 8 bits anywhere in the 32 bit register depending on the flavor of arm instruction set (arm, thumb, thumb2 extensions).
As with most assembly languages you can solve this problem by doing something like this
ldr r0,my_value
...
my_value: .word 0x12345678
With CISC that immediate is simply tacked onto the instruciton, so whether it 0 bytes a way or 20 bytes away it is still there with either approach.
ARM assemblers also generally allow you this shortcut:
ldr r0,=something
...
something:
which says load r0 with the ADDRESS of something, not the contents at that location but the address (like an lea)
But that lends itself to this immediate shortcut
ldr r0,=0x12345678
which if supported by the assembler will allocate a memory location to hold the value and generate a ldr r0,[pc,offset] instruction to read it. If the immediate is within the rules for a mov then the assembler might optimize it into a mov rd,#immediate.
Answer to Question 1
The MOV instruction on ARM only has 12 bits available for an immediate value, and those bits are used this way: 8 bits for value, and 4 bits to specify the number of rotations to the right (the number of rotations is multiplied by 2, to increase the range).
This means that only a limited number of values can be used with that instruction. They are:
0-255
256, 260, 264,..., 1020
1024, 1040, 1056, ..., 4080
etc
And so on. You are getting that error because your constant can't be created using the 8 bits + rotations. You can load that value onto the register following instruction:
LDR r0, =0x0804c000
Notice that this is a pseudo-instruction though. The assembler will basically put that constant somewhere in your code and load it as a memory location with some offset to the PC (program counter).
Answer to question 2
Yes those instructions are equivalent.

Resources