ARM Cortex A7: avoid memory veneers?

ARM Cortex A7: avoid memory veneers? - arm

On ARMv7, which is Thumb capable, is it right that we can avoid all the veneers by using the BX instruction?
Since this instruction takes a 32 bit register, are we good?
If yes, when I see veneers in the generated code, I should specialize the output for my machine, right?
Thanks

Yes, since BX takes a 32-bit register, there's no need for veeners because you can cover the whole addressing space.
Of course you'd need to load a 32-bit value into the register, which usually means constant pooling, so if you are looking to squeeze every cycle out of it and your program is not too large you're better off with relative branches. As #Notlikethat notes, if you don't already have the address in a register there's no point in using BX when you can just LDR PC, ... (unless you need to support ARMv4T interworking).
Relative, non-conditional, 32-bit Thumb branches have a 24-bit addressing space, so you can reach +/- 16MB (for others see here). If you're doing ELF, be really careful with 16-bit relative Thumb branches. A 32-bit branch will generate a 24-bit relocation and the linker will insert a veener if the target can't be addressed with 24 bits. A 16-bit branch generates a 11-bit relocation and ELF for ARM specifies that the linker is not required to generate veeners for those, so you'd risk a link-time out-of-range branch.

Related

Understanding Cortex-M assembly LDR with pc offset

I'm looking at the disassembly code for this piece of C code:
#define GPIO_PORTF_DATA_R (*((volatile unsigned long *)0x400253FC))
int main(void){
// Initialization code
while(1) {
SW1 = GPIO_PORTF_DATA_R&0x10; // Read PF4 into SW1
// Other code
SW2 = GPIO_PORTF_DATA_R&0x01;
}
}
The assembly for that SW1= line is (sorry can't copy code):
https://imgur.com/dnPHZrd
Here are my questions:
At the first line, PC = 0x00000A56, and PC + 92 = 0x00000AB2, which is not equal to 0x00000AB4, the number shown. Why?
I did a bit of research on SO and found out that PC actually points to the Next Next instruction to be executed.
When pc is used for reading there is an 8-byte offset in ARM mode and 4-byte offset in Thumb mode.
However 0x00000AB4 - 0x00000A56 = 0x5E = 94, neither does it match 92+8 or 92+4. Where did I get wrong?
Reference:
Strange behaviour of ldr [pc, #value]
Why does the ARM PC register point to the instruction after the next one to be executed?
LDR Rd,-Label vs LDR Rd,[PC+Offset]

From ARM documentation:
Operation
address = (PC[31:2] << 2) + (immed_8 * 4)
Rd = Memory[address, 4]
The pc is 0xA56+4 because of two instructions ahead and this is thumb so 4 bytes.
(0xA5A>>2)<<2 + (0x17*4)
or
(0x00000A5A&0xFFFFFFFC) + (0x17<<2)
0xA58+92=0xA64
This is an LDR so it is a word-based address ideally. Because the thumb instruction can be on a non-word aligned address, you start off by adding two instructions of course (thumb2 complicates this but add four for thumb). Then zero the lower two bits (LDR) the offset is in words so need to convert that to bytes, times four. This makes the encoding make more sense if you think about each part of it. In arm mode the PC is already word aligned so that step is not required (and in arm mode you have more bits for the immediate so it is byte-based not word-based), making the offset encoding between arm and thumb possibly confusing.
The various documents will show the math in different ways but it is the same math nevertheless. The PC is the only confusing part, especially for thumb. For ARM you add 8, two ahead, for thumb it is basically 4 because the execution cannot tell if there is a thumb2 coming, and it would break a great many things if they had attempted that. So add 4 for the two ahead, for thumb. Since thumb is compressed they do not use a byte offset but instead a word offset giving 4 times the range. Likewise this and/or other instructions can only look forward not back so unsigned offset. This is why you will get alignment errors when assembling things in thumb that in arm would just be unaligned (and you get what you get there depending on architecture and settings). Thumb cannot encode any address for an instruction like this.
For understanding instruction encoding, in particular pc based addressing, it is best to go back to the early ARM ARM (before the armv5 one but if not then just get the armv5 one) as well as the armv6-m and armv7-m and full sized armv7-ar. And look at the pseudo-code for each. The older one generally has the best pseudo-code, but sometimes they leave out the masking of lower bits of the address. No document is perfect, they have bugs just like everything else. Naturally the architecture tied to the core you are using is the official document for the IP the chip vendor used (even down to the specific version of the TRM as these can vary in incompatible ways from one to the next). But if that document is not perfectly clear you can sometimes get an idea from others that, upon inspection, have compatible instructions, architectural features.

You missed a key part of the rules for Thumb mode, quoted in one of the question you linked (Why does the ARM PC register point to the instruction after the next one to be executed?):
For all other instructions that use labels, the value of the PC is the address of the current instruction plus 4 bytes, with bit[1] of the result cleared to 0 to make it word-aligned.
(0xA56 + 4) & -4 = 0xA58 is the location that PC-relative things are relative to during execution of that ldr r0, [PC, #92]
((0xA56 + 4) & -4) + 92 = 0xab4, the location the disassembler calculated.
It's equivalent to do 0xA56 & -4 = 0xa54 then +4 + 92, because +4 doesn't modify bit #1; you can think of clearing it before or after adding that +4. But you can't clear the bit after adding the PC-relative offset; that can be unaligned for other instructions like ldrb. (Thumb-mode ldr encodes an offset in words to make better use of the limited number of bits, so the scaled offset and thus the final load address always have bits[1:0] clear.)
(Thanks to Raymond Chen for spotting this; I had also missed it initially!)
Also note that your debugger shows you a PC value when stopped at a breakpoint, but that's the address of the instruction you're stopped at. (Because that's how ARM exceptions work, I assume, saving the actual instruction to return to, not some offset.) During execution of the instruction, PC-relative stuff follows different rules. And the debugger doesn't "cook" this value to show what PC will be during its execution.
The rule is not "relative to the end of this / start of next instruction". Answers and comments stating that rule happen to get the right answer in this case, but would get the wrong answer in other Thumb cases like in LDR Rd,-Label vs LDR Rd,[PC+Offset] where the PC-relative load instruction happens to start at a 4-byte aligned address so bit #1 of PC is already cleared.
Your LDR is at address 0xA56 where bit #1 is set, so the rounding down has an effect. And your ldr instruction used a 2-byte encoding, not a Thumb2 32-bit instruction like you might need for a larger offset. Both of these things means round-down + 4 happens to be the address of the next instruction, rather than 2 instruction later or the middle of this instruction.

Since the program counter points to the next instruction, when it executes the LDR at address 0x00000A56, the program counter will be holding the address of the next instruction, which is 0x00000A58.
0x0A58 + 0x5C (decimal 92) == 0x00000AB4

32-bit fixed instruction length in 64-bit memory space

I am currently reading up on the AArch64 architecture by ARM. They are using a RISC-like instruction set with a fixed instruction length of 32-bit while operating on 64-bit addresses. I am still new to the topic of ISA so my question is: how can you operate with 64-bit long addresses when you only have 32-bit length in your instructions?

32-bit is the length of an instruction in bytes, not the operand-size or the address size.
ARM 32-bit (like all other 32-bit RISCs that use 32-bit fixed-width instruction words) can't fit a 32-bit address as an immediate into a single instruction either: there'd be no room for an opcode to say what instruction it is.
The width of an instruction limits the number of registers you can have. With 3 registers per instruction (dst, src1, src2), AArch64's increase from 16 to 32 registers means that each instruction needs 3 * log2(32) = 3* 5 = 15 bits to encode the registers. Or fewer for instructions with only 2 or 1 registers. (e.g. mov-immediate or add-immediate). The rest of the space goes to number of possible opcodes, and the size of immediates.
To get an address into a register, ARM compilers will typically load it from a nearby pool of constants (with a PC-relative addressing mode).
The other option is what most RISC CPUs do: use a 2-instruction sequence to put a 16-bit immediate in the upper half of a register, then OR a 16-bit immediate into the low half. (Or use the lower half of a static address as the displacement to a load/store instruction that uses an addressing mode like register + 16-bit offset.)
MIPS is a good example of a very simple RISC, see it's ISA with binary encoding. Its lui reg, imm16 puts imm16 <<16 into a register. (Load Upper Immediate). Then lw dst, imm16(base_reg) is a load like I was talking about in the last paragraph.
Even in 64-bit code, most numbers are still small, so there's not much need for wider immediate operands (except for addresses). e.g. x86 still uses a choice of 32-bit or 8-bit immediate operands for add r64, imm. x86 being a variable length ISA saves space when immediates are between -128 and +127 in a lot of cases.

You could have a 4-bit CPU that operates on 4096 bit values, it just wouldn't be terribly fast at doing it. Any programming language with a "bignum" type will work this way, though usually not to the same extreme.
The way a 32-bit CPU can operate on 64-bit values is because the hardware or software allows it, no other reason.
It's not like we couldn't manipulate 64-bit integer values before we had 64-bit CPUs. Even the old Intel 80386 could support 64-bit operations and it was released in 1985.

ARM Program Counter distinguishing feature

How does the R15 of ARM differ from the general PC of a CPU?
Both of them are program counters only. What is the difference?

ARM's PC is more similar to a regular register with some restrictions than x86's IP is similar to a regular register.
Considering general PC is an Intel x86 based CPU, in x86's case you can't manipulate PC (Instruction pointer) directly but it is updated implicitly by provided control flow instructions.
In ARM's case historically Program Counter (PC), mapped as register at index 15 (16th register) can be manipulated directly via arithmetic instructions. For example you can add 16 to PC which would alter flow of instruction stream similar to a 16-byte forward jump instruction.

The ARM PC maybe more of a general register than most CPUs, but it is still very special. The traditional simple arithmetic instructions can use the PC as an input argument in many cases. Here it functions as a pointer or array base. It can also be used as the output for control transfer with these instructions. As a read-only value, it is useful for calculating return values in a PC-independent way. It is also useful to use as a constant table look-up in near-by code. For these cases, the PC is very much like a regular register. This is probably more common on many RISC CPUs as opposed to a CISC ISA.
However, when the PC is used as a destination (lvalue or updated and written), the behavior is often non-standard. Some examples of special cases (for some ARM architechure versions) for R15/PC are,
adcs - copies SPSR to CPSR
adds - copies SPSR to CPSR
ands - copies SPSR to CPSR
bics - copies SPSR to CPSR
bx r15 - highly discourage or not supported.
clz r15 - not supported.
mcr pXX, xx, r15,... - unpredictable
etc.
In most cases, using the PC as a destination of an instruction will have some special case. Especially, the use of the S (normally to set conditions codes) can be used to return from an exception. This might be used as some sort of veneer when returning from an exception or just a direct return. In some cases, the meaning of the instruction might change completely. For instance, ldm sp, {r0-r15}^ and ldm sp, {r0-r14}^ use different register banks; the first will load the registers according to the mode in the SPSR; whereas the 2nd will load the register to user mode.
For load/store, atomics, mode manipulation, co-processor and complex arithmetic (64 bit multiplies, etc) instructions, the PC is often unsupported or has a different meaning; the different meaning is often a mechanism for handling exceptions for system level code.

Intel x86 to ARM assembly conversion

I am currently learning ARM assembly language;
To do so, I am trying to convert some x86 code (AT&T Syntax) to ARM assembly (Intel Syntax) code.
__asm__("movl $0x0804c000, %eax;");
__asm__("mov R0,#0x0804c000");
From this document, I learn that in x86 the Chunk 1 of the heap structure starts from 0x0804c000. But I when I try do the same in arm,
I get the following error:
/tmp/ccfNZp9F.s:174: Error: invalid constant (804c000) after fixup
I am assuming the problem is that ARM can only load 32bit instructions.
Question 1: Any idea what would be the first chunk in case of ARM processors?
Question 2:
From my previous question, I know how memory indirect addressing works.
Are the snippets written below doing the same job?
movl (%eax), %ebx
LDR R0,[R1]
I am using ARMv7 Processor rev 4 (v7l)

Trying to learn arm by looking at x86 is not a good idea one is CISC and quite ugly the other is RISC and much cleaner.. Just learn ARM by looking at the instruction set reference in the architectural reference manual. Look up the mov instruction the add instruction, etc.
ARM doesnt use intel syntax it uses ARM syntax.
Dont learn by using inline assembly, write real assembly. Use an instruction set simulator first not hardware.
ARM, Mips and others aim for fixed word length. So how would you for example fit an instruction that says move some immediate to a register, specify the register, and fit the 32 bit immediate all in 32 bits? not possible. So for fixed length instruction sets you cannot simply load any immediate you want into any register. You must read up on the rules for that instruction set. mips allows for 16 bit immediates, arm for 8 plus or minus depending on the flavor of arm instruction set and the instruction. mips limits where you can put those 16 bits either high or low, arm lets you put those 8 bits anywhere in the 32 bit register depending on the flavor of arm instruction set (arm, thumb, thumb2 extensions).
As with most assembly languages you can solve this problem by doing something like this
ldr r0,my_value
...
my_value: .word 0x12345678
With CISC that immediate is simply tacked onto the instruciton, so whether it 0 bytes a way or 20 bytes away it is still there with either approach.
ARM assemblers also generally allow you this shortcut:
ldr r0,=something
...
something:
which says load r0 with the ADDRESS of something, not the contents at that location but the address (like an lea)
But that lends itself to this immediate shortcut
ldr r0,=0x12345678
which if supported by the assembler will allocate a memory location to hold the value and generate a ldr r0,[pc,offset] instruction to read it. If the immediate is within the rules for a mov then the assembler might optimize it into a mov rd,#immediate.

Answer to Question 1
The MOV instruction on ARM only has 12 bits available for an immediate value, and those bits are used this way: 8 bits for value, and 4 bits to specify the number of rotations to the right (the number of rotations is multiplied by 2, to increase the range).
This means that only a limited number of values can be used with that instruction. They are:
0-255
256, 260, 264,..., 1020
1024, 1040, 1056, ..., 4080
etc
And so on. You are getting that error because your constant can't be created using the 8 bits + rotations. You can load that value onto the register following instruction:
LDR r0, =0x0804c000
Notice that this is a pseudo-instruction though. The assembler will basically put that constant somewhere in your code and load it as a memory location with some offset to the PC (program counter).
Answer to question 2
Yes those instructions are equivalent.

Is it atomic to access(load/store) 32 bit integer when using ARM Thumb instruction set?

Using ARM cortex with thumb instruction set and Keil realview compiler, is it safe to access to 32 bit integer? Since the thumb register set is 16 bits, does this mean, fetching a 32 bit int needs 2 machine instructions? If so, accessing 32 bit will not be atomic. If my worry is true, does it mean that int assignment should be protected by a critical region?

Thumb uses the same 32-bit registers as ARM, so there's no issue there. What's halved is the instruction size (and even that is not strictly true for Thumb-2).
Do not worry, you don't need to change your code if you're compiling to Thumb.

The instruction size is 16-Bit in thumb mode, not the register size.
This means that a constant assignment - as in i=1; - can be seen as atomic. Although more than one instruction is generated, only one will modify the memory location of i even if i is int32_t.
But you need a critical section once you to things like i=i+1. That is of course not atomic.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight