I'm doing binary instrumentation on Thumb2 executables and I noticed that in some of my benchmarks the instrumented binaries performed better than the original ones (very strange since my instrumentation adds more code to execute).
Looking into this anomaly, I was able to come to the conclusion that the performance differential was caused by the different way I handle returns from procedures. In the original executables most of the returns are handled with
pop {..., pc}
while after my instrumentation these are translated as
pop {..., lr}
...
bx lr
because I need to do some calculations with the return address.
I was able to further test this behavior with a simple program which invokes a subroutine for 1 million times and pop {lr}; bx lr performed 49% faster than pop {pc} (which is huge).
All tests were done on a Scaleway C1 Baremetal Server running a Marvell Armada XP quad-core ARMv7 SoC.
Does anybody have an explanation for this?
EDIT
The server runs Ubuntu 16.04. Here is the source code of my tests: popcptest, bxlrtest. For compilation I used GCC Linaro version 5.4.0 with default settings.
Related
Given the following tiny program:
#include <stdlib.h>
#define NEXT goto **ip++
#define guard(n) asm("#" #n)
int main() {
static void *prog[] = {&&next1,&&next2,&&next1,&&next3,&&next1,&&next4,&&next1,&&next5,&&next1,&&loop};
void ** ip=prog;
int count = 100000000;
NEXT;
next1: guard(1); NEXT;
next2: guard(2); NEXT;
next3: guard(3); NEXT;
next4: guard(4); NEXT;
next5: guard(5); NEXT;
loop:
if (count) {
count--;
ip=prog;
NEXT;
}
exit(0);
}
I noticed that each of the next# statements get compiled as TWO instructions.
ldr r2, [r3], #4
mov pc, r2 # indirect register jump
I would have expected this to only need one instruction:
ldr pc, [r3], #4
I found the discussion here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40887
"The problem is that the instruction "ldr pc, [r3, #0]" is not considered a function call by the Cortex-A8's branch predictor, as noted in DDI0344J section 5.2.1, Return stack predictions. Thus, the return from the called function is mispredicted resulting in a penalty of 13 cycles compared to a direct call."
However, the 'goto' is not a function call and there's no expectation for the return stack to be relevant here at all.
I'm wondering if this is some optimization that both GCC and CLANG have missed or is the latter a worse performer for a reason I'm un aware of?
This looks like a missed optimization unless there is some other microarchitectural reason to avoid it on some other CPUs. (That's plausible, but I wouldn't specifically expect it. Loading into a register multiple instructions earlier could give some time to hide load-use latency and reduce possible mispredict penalty, but loading in the previous instruction is unlikely to matter unless there's something special about ldr into PC.)
You're correct, bug #40887 is only about indirect calls with blx vs. manually setting up a return address and jumping. It's not relevant to indirect jumps inside a function, like for a switch or computed goto. (Except perhaps if GCC is avoiding loads into PC in general, so this missed optimization is collateral damage from fixing that bug.)
And you're not using volatile, another thing that often makes GCC do a load with a separate instruction instead of folding it into something else (like avoiding x86 add eax, [rdi]. Or in this case a memory-source jump like ARM load-into-PC, it might consider that special.)
Comparing -mcpu=cortex-a8 -marm vs. -mthumb, we see GCC does need extra instructions in thumb mode to set the low bit of the target address before mov pc,reg. https://godbolt.org/z/87EszvWP1
Or maybe that's a missed optimization, too: just ldr pc, [mem] would stay in the current mode, and we know we're jumping within a single function so there's no possibility of changing mode. And/or the jump table could just have been built with the low bits already set if using bx r2 is actually faster.
https://developer.arm.com/documentation/dui0473/m/arm-and-thumb-instructions/ldr--register-offset- says
For word loads, Rt can be the PC. A load to the PC causes a branch to the address loaded. In ARMv4, bits[1:0] of the address loaded must be 0b00. In ARMv5T and above, bits[1:0] must not be 0b10, and if bit[0] is 1, execution continues in Thumb state, otherwise execution continues in ARM state.
In Thumb mode, ldr into PC is only possible with a 32-bit instruction, but ldr into r0-7 and branching to it can each be 16-bit instructions. But I doubt that would be any better unless you can schedule the load earlier.
I am going through the assembly generated by GCC for an ARM Cortex M4, and noticed that atomic_compare_exchange_weak gets two DMB instructions inserted around the condition (compiled with GCC 4.9 using -std=gnu11 -O2):
// if (atomic_compare_exchange_weak(&address, &x, y))
dmb sy
ldrex r0, [r3]
cmp r0, r2
itt eq
strexeq lr, r1, [r3]
cmpeq.w lr, #0
dmb sy
bne.n ...
Since the programming guide to barrier instructions for ARM Cortex M4 states that:
Omitting the DMB or DSB instruction in the examples in Figure 41 and Figure 42 would not cause any error because the Cortex-M processors:
do not re-order memory transfers
do not permit two write transfers to be overlapped.
Is there any reason why these instructions couldn't be removed when targetting Cortex M?
I'm not aware of whether Cortex M4 can be used in a multi-cpu/multi-core configuration, but in general:
Memory barriers are never necessary (can always be omitted) in single-core systems.
Memory barriers are always necessary (can never be omitted) in multi-core systems where threads/processes operating on the same memory may be running on different cores.
Presence or lack of reordering memory writes at the hardware level is irrelevant.
Of course I would expect the DMB instruction to be essentially free on chips that don't support SMP, so I'm not sure why you'd want to try to hack it out.
Please note that, based on the question's referencing the code the compiler produces for atomic intrinsics, I'm assuming the context is for synchronization of atomics to make them match the high-level specification, not other uses like IO barriers for MMIO, and the above "never" should not be read as applying to this (unrelated) use (though I suspect, for the reasons you already cited, it doesn't apply to Cortex M4).
When I run into a fault handler on my ARM cortex-M4 (Thumb) I get a snapshot of the CPU register just before the fault occured. With this information I can find the stack pointer where it was. Now, what I want is to backtrace through all functions it passed. The only problem I see here is that I don't have a frame pointer, so I cannot really see where a certain subroutine has saved the LR, ad infinitum.
How would one tackle this problem if the frame pointer is not available in r7?
This blog post discusses this issue with reference to the MIPS architecture - the principles can be readily adapted to ARM architectures.
In short, it describes three possibilities for locating the stack frame for a given SP and PC:
Using compiler-generated debug information (not included in the executable image) to calculate it.
Using compiler-generated stack-unwinding (exception handling) information (included in the executable image) to calculate it.
Scanning the call site to locate the prologue or epilogue code that adjusts the stack pointer, and deducing the stack frame address from that.
Obviously it's very compiler- and compiler-option dependent, and not guaranteed to work in all cases.
R7 is not the frame pointer on the M4, it's R11. R7 is the FP for Cortex-M0+/M1 where only the lower registers are generally available. In anycase, when Cortex-M makes a call to a function using BL and variants, it saves the return address into LR (link register). At function entry, the LR is saved onto the stack. So in theory, to get a call trace, you would "chase" the chain of the LRs.
Unfortunately, the saved location of LR on the stack is not defined by the calling convention, and its location must be deduced from the debug info for that function entry in the DWARF records (in the .elf file). I do not know if there is an utility that would extract the LR locations from an ELF file, but it should not be too difficult.
Richard at ImageCraft is right.
More information can be found here
This works fine with C code. I had a harder applying it to C++ but it's not impossible.
I am interested in the ARM and Thumb2 commands: LDR and LDR.W, PC, =ADDR for absolute jumping to a certain address.
For example, when I jump from ARM code to ARM, the command LDR PC, =ADDR is performed.
But what happens in the other scenarios?
from ARM to Thumb2
from Thumb2 to Thumb2
from Thumb2 to ARM
when is +1 needed to be added to the address? and why?
The rule is actually quite simple:
If bit 0 of the address is 0, the CPU will execute the code as ARM code after the next branch
If bit 0 of the address is 1, the CPU will execute the code as Thumb after the next branch
Of course if there is a mismatch, the CPU will certainly get a fault (After executing random code) because it has no way to check if the code is ARM or Thumb.
This is what explains the +1.
Note that depending on the compiler, and depending on the label used, bit 0 of the address may be automatically set by the compiler.
You need to just read the documentation.
The following instructions write a value to the PC, treating that value as an interworking address to branch
to, with low-order bits that determine the new instruction set state:
— BLX (register), BX , and BXJ
— LDR instructions with <Rt> equal to the PC
— POP and all forms of LDM except LDM (exception return), when the register list includes the PC
— in ARM state only, ADC , ADD , ADR , AND , ASR (immediate), BIC , EOR , LSL (immediate), LSR (immediate), MOV ,
MVN , ORR , ROR (immediate), RRX , RSB , RSC , SBC , and SUB instructions with <Rd> equal to the PC and without
flag-setting specified.
Since you mentioned thumb2 that means armv6 or newer. (did you say thumb2 and generically mean thumb?) and I believe the docs are telling us the above applies for armv6 and armv7.
Note that bit is consumed by the instruction, the pc doesnt carry around a set lsbit in thumb mode, it is just used by the instruction to indicate a mode change.
Also note you should think in terms of OR 1 not PLUS 1. If you write your code correctly the toolchain will supply you with the correct address with the correct lsbit, if you add a one to that address you will break the code, if you are paranoid or have not done it right you can OR a one to the address and if it has it there already no harm, if it doesnt then it fixes the problem that prevented it from being there. I would never use a plus one though with respect to switching to thumb mode.
According to the ARM manual, it should be possible to access the banked registers for a specific CPU mode as, for instance, "r13_svc". When I try to do this gcc yells at me with the following error:
immediate expression requires a # prefix -- `mov r2,sp_svc'
What's wrong?
Update. The following text from the ARM Architecture Reference Manual for ARMv5 and ARMv6 led me to believe that it is possible, section A2.4.2:
Registers R13 and R14 have six banked
physical registers each. One is used
in User and System modes, and each of
the remaining five is used in one of
the five exception modes. Where it is
necessary to be specific about which
version is being referred to, you use
names of the form: R13_mode
R14_mode where mode is the
appropriate one of usr, svc (for
Supervisor mode), abt, und, irq and
fiq.
The correct syntax for this is mrs r2,sp_svc or mrs r3, sp_usr. This is a new armv7 extension. The code can be seen in the ARM Linux KVM source file interrupt_head.S. The gas binutils patch for this instruction support by Matthew Gretton-Dann. It requires the virtualization extensions are far as I understand.
According to what I understand, the LPAE (large physical address extension) implies the virtualization extensions. So Cortex-A7, Cortex-A12, Cortex-A15, and Cortex-A17 may be able to use this extension. However, the Cortex-A5, Cortex-A8, and Cortex-A9 can not.
Documentation on the instruction can be found in the ARMv7a TRM revC, under section B9.3.9 MRS (Banked register).
For other Cortex-A (and ARMv6) CPU's you can use the cps instruction to switch modes and transfer the banked register to an un-banked register (R0-R7) and then switch back. The obvious difficulty is with user mode. The correct way to handle this is with ldm rN, {sp,lr}^; user mode has no simple way back to the privileged modes.
For all older CPUs, the information given by old_timer will work. Mainly, use mrs/msr to change modes. mrs/msr works over the full class of ARM cpus but requires multiple instructions and hence may have race issues which require interrupt and exception masking depending on context.
This is an important instruction (sequences) for context switching (which VMs do a lot of).
I don't think that's possible with the mov instruction; at least according to the ARM Architecture Reference Manual I'm reading. What document do you have? There are is a variant of ldm that can load user mode registers from a privileged mode (using ^). Your only other option is to switch to SVC mode, do mov r2, sp, and then switch back to whatever other mode you were using.
The error you're getting is because it doesn't understand sp_svc, so it thinks you're trying to do an immediate mov, which would look like:
mov r2, #0x14
So that's why it says "requires a # prefix".
You use mrs and msr to change modes by changing bits in the cpsr then use r13 normally.
From the arm arm
MRS R0,CPSR
BIC R0,R0,#0x1F
ORR R0,R0,#0x13
MSR CPSR_c,R0
then
mov sp,#0x10000000
or if you need more bits in the immediate
ldr sp,=0x12345600
or if you dont want the assembler placing your data, you can place it yourself.
ldr sp,svc_stack
b 1f
svc_stack: .word 0x12345600
1:
You will see typical arm startup code, where the application is going to support interrupts, aborts and other exceptions, to set all of your stack pointers that you are going to need, change mode, set sp, change mode, set sp, change mode ...