ARM ITCM interface and Flash access

If access to the Flash memory starts from address 0x0200 0000, it is performed automatically via the ITCM bus. The ART accelerator™ should be enabled to get the equivalent of 0-wait-state access to the Flash memory via the ITCM bus. ART is enabled by setting bit 9 of the FLASH_ACR register, while the ART prefetch is enabled by setting bit 8 of the same register.
If I place my program code starting at 0x0200 0000, what would happen if the ART accelerator is not enabled? Would it be beneficial to use the AXIM bus for the startup code and then enable the ART accelerator and point execution to the program region at 0x0200 0000?
I am just a bit confused.
https://www.st.com/content/ccc/resource/technical/document/application_note/0e/53/06/68/ef/2f/4a/cd/DM00169764.pdf/files/DM00169764.pdf/jcr:content/translations/en.DM00169764.pdf
Page 12

So let's just try it. NUCLEO-F767ZI
Cortex-M7s in general:
1.2.3 Prefetch Unit
The Prefetch Unit (PFU) provides:
• 64-bit instruction fetch bandwidth.
• 4x64-bit pre-fetch queue to decouple instruction pre-fetch from DPU pipeline operation.
• A Branch Target Address Cache (BTAC) for the single-cycle turn-around of branch predictor state and target address.
• A static branch predictor when no BTAC is specified.
• Forwarding of flags for early resolution of direct branches in the decoder and first execution stages of the processor pipeline.
For this test the branch prediction gets in the way so turn that off:
Set ACTLR to 00003000 (hex, most numbers here are hex)
I don't see how to disable the PFU; I wouldn't expect to have control like that anyway.
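A rough sketch of that in C (the register and bit names come from CMSIS and the Cortex-M7 TRM, so treat them as assumptions rather than something verified against this exact part):

#include "stm32f7xx.h"   /* assumed device header; pulls in core_cm7.h */

static void disable_branch_prediction(void)
{
    /* 0x3000 = DISBTACREAD (bit 13) + DISBTACALLOC (bit 12) per the
       Cortex-M7 TRM, so the BTAC is neither read nor allocated. */
    SCnSCB->ACTLR |= (1u << 13) | (1u << 12);
    __DSB();
    __ISB();
}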
So we expect the prefetch to read 64 bits at a time, 4 instructions on an aligned
boundary.
From ST
The nDBANK option bit is set, indicating a single bank
Instruction prefetch
In case of single bank mode (nDBANK option bit is set), each flash read returns 256 bits, representing 8 instructions of 32 bits to 16 instructions of 16 bits according to the program launched. So, in the case of sequential code, at least 8 CPU cycles are needed to execute the previous instruction line read.
So ST is going to turn that into a 256-bit read, i.e. up to 16 (16-bit) instructions.
Using the SysTick timer. I am running at 16 MHz, so the flash is at zero wait states.
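Timer setup is along these lines (a sketch using CMSIS register names; the asm test that actually samples SysTick->VAL around the loop is shown near the end of this answer):

#include "stm32f7xx.h"   /* assumed device header */

static void systick_free_run(void)
{
    SysTick->LOAD = 0x00FFFFFF;                 /* max 24-bit reload */
    SysTick->VAL  = 0;                          /* clear the current count */
    SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk  /* count the core clock */
                  | SysTick_CTRL_ENABLE_Msk;    /* free-run, no interrupt */
}

SysTick counts down, so the elapsed count is start minus end.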
08000140 <inner>:
8000140: 46c0 nop ; (mov r8, r8)
8000142: 46c0 nop ; (mov r8, r8)
8000144: 46c0 nop ; (mov r8, r8)
8000146: 46c0 nop ; (mov r8, r8)
8000148: 46c0 nop ; (mov r8, r8)
800014a: 46c0 nop ; (mov r8, r8)
800014c: 3901 subs r1, #1
800014e: d1f7 bne.n 8000140 <inner>
00120002
So 0x12 clocks per loop. Two prefetches from the ARM; the first one becomes a single ST fetch. This should be zero wait states. Note the address: this is AXIM.
If I reduce the number of nops it stays at 0x1200xx until here:
08000140 <inner>:
8000140: 46c0 nop ; (mov r8, r8)
8000142: 46c0 nop ; (mov r8, r8)
8000144: 3901 subs r1, #1
8000146: d1fb bne.n 8000140 <inner>
00060003
One ARM fetch per loop instead of two, and the time per loop drops sharply (0x12 down to 6), so the prefetch is dominating our performance.
08000140 <inner>:
8000140: 46c0 nop ; (mov r8, r8)
8000142: 46c0 nop ; (mov r8, r8)
8000144: 46c0 nop ; (mov r8, r8)
8000146: 46c0 nop ; (mov r8, r8)
8000148: 3901 subs r1, #1
800014a: d1f9 bne.n 8000140 <inner>
000 (zero wait states)
00120002
001 (1 wait state)
00140002
002 (2 wait states)
00160002
202 (2 wait states enable ART)
0015FFF3
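Those three-hex-digit labels are the values written to FLASH_ACR: the low bits are the LATENCY (wait state) field and bit 9 is ARTEN. A sketch in C, assuming the usual STM32F7 header names:

#include "stm32f7xx.h"   /* assumed device header */

static void flash_set_acr(uint32_t wait_states, int enable_art)
{
    uint32_t acr = wait_states & FLASH_ACR_LATENCY;  /* wait states in bits [3:0] */
    if (enable_art)
        acr |= FLASH_ACR_ARTEN;                      /* ART enable, bit 9 */
    FLASH->ACR = acr;                                /* e.g. 0x002 or 0x202 above */
}

(Setting FLASH_ACR_PRFTEN, bit 8, would additionally enable the ART prefetch; it is not set in the 0x202 case above.)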
Why would that affect AXIM?
So each wait state adds 2 clocks per loop. There are two fetches per loop, so perhaps each ARM fetch causes ST to do one of its 256-bit reads; that seems broken, though.
Switch to ITCM
00200140 <inner>:
200140: 46c0 nop ; (mov r8, r8)
200142: 46c0 nop ; (mov r8, r8)
200144: 46c0 nop ; (mov r8, r8)
200146: 46c0 nop ; (mov r8, r8)
200148: 3901 subs r1, #1
20014a: d1f9 bne.n 200140 <inner>
000
00070004
001
00080003
002
00090003
202
00070004
ram
00070003
So ITCM alone, zero wait states, ART off, is 7 clocks per loop for a 6-instruction loop with a branch. Seems reasonable. For this tiny test, turning on ART with 2 wait states puts us back at 7 per loop.
Note that from RAM this code runs at 7 per loop as well. Let's try another couple:
00F (15 wait states)
00230007
20F (15 wait states, ART enabled)
00070004
I didn't look for branch predictors other than the BTAC.
First thing to note: you don't ever want to run an MCU faster than you have to. It burns power, on many parts you need to add flash wait states, and on many the CPU and peripherals have different maximum clock speeds, so there is a boundary where things become non-linear (a task takes X clock cycles at a slow clock rate with peripheral clock = CPU clock; up to a point a clock N times faster gets the job done N times sooner, but there are one or more boundaries beyond which the same job takes more clocks, so an N-times-faster CPU clock no longer gives an N-times speedup). This particular part has this non-linear issue. If you are using libraries from ST to set the clock, then you are possibly getting worst-case flash wait states, whereas if you set it up yourself and read the documentation you might be able to shave off one or a few.
The Cortex-M7 has optional L1 caches. I didn't mess with them this time around, but ST had this ART thing before those came out, and I believe they defeat/disable the I-cache at least; would it make it better or worse to have both? If the part has it, that would make the first pass slow and the remaining passes possibly faster, even in AXIM space. You are welcome to try it. I seem to remember they did something tricky with a strap on the processor core; it wasn't easy to see how it was defeated, and that may not have been this chip/core, but it was definitely ST. The M4 doesn't have a cache, so it would have to have been an M7 that I messed with (this one in particular).
So the short answer is that the performance isn't that horrible if you leave ART off and/or run out of AXIM. ST has implemented the flash such that the ITCM interface is faster than AXIM. We can see the effects of the ARM's own fetch behaviour, and if you turn branch prediction back on you can see its effects as well.
It shouldn't be difficult to create a benchmark that defeats these features, just like you can make one that makes the L1 caches (or any other cache) hurt performance. The ART, like any other cache, makes performance less predictable, and as you change your code (add a line, remove a line) the performance can jump anywhere from no change to a lot as a result.
Depending on the processor, fetch sizes, and alignment, your code's performance can vary just by adding or removing code above the performance-sensitive part(s) of the project, but that depends on factors we rarely have visibility into.
It's hard to tell, but it looks like they are claiming that ART reduces power. I would expect it to increase power, having those SRAMs on/clocked. I don't see an obvious figure for how much you save if you turn off the flash and run from RAM. The M7 parts are not really meant to be the low-power parts like some STM32L parts, where you can get down to ones/tens of microamps (micro, not milli; been there, done that).
The small number of extra clocks (0x70004 instead of 0x70000) has to do with some of the fetching overhead, be it ARM's or ST's or a combination of the two. To see memory/flash performance you need to disable as many features as you can: branch prediction, caches, and so on. Otherwise it's hard to measure performance and then make assumptions about what the flash/memory/bus is doing. I suspect there are still things I didn't turn off to make a clean measurement, and/or can't turn off. And simple nop loops (I tried other, non-nop instructions; it didn't change anything) won't tell you everything. Using the docs as a guide you can try to cache-thrash the ART or the other features and see what kind of hit that takes.
For performance-critical code you can run from RAM and avoid all of these issues. I didn't search for it, but I assume these parts' SRAM can run as fast as the CPU; the answer isn't jumping out at me, you can figure it out.
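One way to do that with GCC is a section attribute plus a matching output section in the linker script. A sketch only; the ".ramfunc" section name is an assumption, and your linker script/startup code has to place that section in RAM and copy it out of flash at boot (ST's templates often provide something along these lines):

#include <stdint.h>

__attribute__((section(".ramfunc"), noinline))
uint32_t hot_loop(uint32_t n)
{
    uint32_t acc = 0;
    while (n--)          /* stand-in for the performance-critical work */
        acc += n;
    return acc;
}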
Note my test actually looks like
ldr r2,[r0]     @ r0 = address of the SysTick count register, r2 = start sample
inner:
nop
nop
nop
nop
sub r1,#1
bne inner
ldr r3,[r0]     @ r3 = end sample
sub r0,r2,r3    @ elapsed ticks (start minus end)
bx lr
where the sampling of SysTick is just in front of and just behind the loop, i.e. after the branch into this code. To measure ART you would want to sample the time before the branch into a memory range that has not been read yet; it is not magically possible to read that faster, so the first read into the cache should be slower. If I move the time sampling further out, before the branch, I can see it go from 0x7000A to 0x70027 for 0 to 15 wait states with ART on. That is a noticeable performance hit for branches into code that has not been run/cached yet. Knowing the size of the ART fetches, it should be easy to make a test that hops around a lot so that the ART feature starts to not matter.
Short answer: the ITCM is a different bus interface on the ARM core, and ST has implemented their design such that there is a performance gain there. So even without ART enabled, using ITCM is faster than AXIM (likely an ARM bus thing, not an ST flash thing). If you are running clock rates fast enough to have to add wait states to the flash, then ART can mostly erase those.

I THINK that the question is much simpler than what the other answers assume.
If you are thinking about doing things like putting your program somewhere other than simply in flash: don't. As ST says, with ART the performance will be very close to "zero wait state". So don't worry about it. Anything else you try to do is not going to be faster than that.

Q. If I place my program code starting at 0x0200 0000, what would happen if the ART accelerator is not enabled?
A. Program execution (instruction fetch and constant access) will be painfully slow, with a crazy number of wait cycles (15?).
[ UPD. I have to correct that this applies more to configurations with high clock frequency, e.g. 15 wait states are needed for 216 MHz. With lower frequencies, the flash access penalty will be less significant, and minimal at 16 MHz. We do not know what frequency is used by O.P. ]
[ UPD2. At most 9 wait states are needed at 216 MHz, sorry. ]
Q. Which bus is preferable for flash code access, AXI or ITCM?
A. The voluminous document you referred to includes some performance measurements that also compare various code placement options. The results differ somewhat between processor models because cache sizes and bus widths are different. Your code will likely be affected differently again. My takeaway from this paper is that, unless your code is performance critical, both options work reasonably well. However, having two parallel buses with caches enables you to do creative things like partitioning your code into pieces and allocating them to separate buses, so that critical but rarely used code is not evicted from the cache. I mean, if you really need that.

Related

How to calculate the LR value when ARMv7-A architecture implementation takes an IRQ exception

I'm reading the Arm Architecture Reference Manual, ARMv7-A and ARMv7-R edition, these days. While reading the exception handling part of this manual, I ran into a confusion. The problem is how to decide the LR value when an ARMv7-A implementation takes an IRQ exception.
EXAMPLE: Suppose that the processor is executing an instruction at address 0x_0000_1000 and an IRQ is taken. First, we have to calculate some parameters to be used to calculate the LR.
Preferred return address, which is the address of the next instruction to execute in this case. So preferred return address = 0x_0000_1002 in the Thumb instruction set, or 0x_0000_1004 in the ARM instruction set (see "Preferred return address for the exception").
PC, which is the program counter and holds the current program address. In this case PC = 0x_0000_1004 in Thumb state, or PC = 0x_0000_1008 in ARM state (see "How to calculate PC").
Then, here are 2 methods mentioned in the document to decide the LR value when taking this IRQ exception.
By using the preferred return address: LR = preferred return address + an offset that depends on the instruction set state when the exception was taken. In this case LR = 0x_0000_1002 + 4 in Thumb state, or LR = 0x_0000_1004 + 4 in ARM state (see "Offsets applied to Link value for exceptions taken to PL1 modes").
By using the PC: LR = PC - 0 in Thumb state, or LR = PC - 4 in ARM state. In this case LR = 0x_0000_1004 - 0 in Thumb state, or LR = 0x_0000_1008 - 4 in ARM state (see "Pseudocode description of taking the IRQ exception").
Problem: the LR results calculated by the two methods are different in both Thumb and ARM state (with the first method we get LR = 0x_0000_1006 or LR = 0x_0000_1008, but with the second method we get LR = 0x_0000_1004 in both). Which one is correct, or is there something wrong with my understanding?
TL;DR: the IRQ LR will point at the next instruction that needs to complete work, i.e. the one that would normally run next without an interrupt. Otherwise, code would not execute the same in the presence of interrupts.
It is confusing as the ARM documents may refer to PC in many different contexts and they are not the same.
EXAMPLE: Suppose that the processor is executing an instruction at address 0x_0000_1000 and an IRQ is taken. First, we have to calculate some parameters to be used to calculate the LR.
Preferred return address, which is the address of the next instruction to execute in this case. So preferred return address = 0x_0000_1002 in the Thumb instruction set, or 0x_0000_1004 in the ARM instruction set (see "Preferred return address for the exception").
This is not correct. The ARM CPU has a pipeline, and the "next instruction" is whatever follows the last instruction the CPU has deemed complete. Take for example this sequence:
0: cmp r1, #42
1: bne 4f ; interrupt happens as this completes.
2: add r2, r2, #4
3: b 5f
4: sub r2, r2, #2
5: ; more code.
If the interrupt happens as the instruction at label '1:' completes, the next instruction will be either '2:' or '4:'. If you followed your rule, this would either increase interrupt latency by never allowing an interrupt in such cases, or interrupts would cause incorrect code. Specifically, your link says next instruction to execute.
PC, which is the program counter and holds the current program address. In this case PC = 0x_0000_1004 in Thumb state, or PC = 0x_0000_1008 in ARM state (see "How to calculate PC").
Here you are mixing concepts. One is when you use a value like ldr r0, [pc, #42]. When you calculate the offset, you must allow for the PC being two instructions ahead of the current ldr instruction. The actual PC is not necessarily this value. At some point (the original version), the ARM had a three-stage pipeline. In order to keep behaviour the same, subsequent ARM CPUs follow the rule of being two ahead when calculating ldr r0, [pc, #42] type addresses. However, the actual PC may be much different inside the CPU. The concept above describes the programmer-visible PC for use with addressing.
The CPU will make a decision, sometimes based on configuration, about what work to complete. For instance, ldm sp!, {r0-r12} may take some time to complete. The CPU may decide to abort this instruction to keep interrupt latency low. Alternatively, it may perform all of the memory reads, which could have wait states. The LR_irq will be set to the ldm instruction or to the next instruction, depending on whether it was aborted or not.

ARM Cortex-A8 L2 cache miss overhead

I am reading the ARM Cortex-A8 data sheet. In it, ARM states that a data load that misses in L2 takes at least 28 core cycles to complete. I cannot picture what happens during these 28 cycles: will the CPU stall and put bubbles in the pipeline, or execute other instructions until the load completes? What if we have a branch based on this load result? What if we have another load just after that instruction that also misses in L2?
Even under a cache miss, the pipeline will go on until the RAW (read after write) dependency bites.
ldr r12, [r0], #4
subs r12, r12, r1
beq end_loop
The subs instruction cannot be executed at the same time as ldr due to the RAW dependency.
The beq instruction cannot be executed at the same time as subs due to the CPSR RAW dependency.
All in all, the sequence above will take 6 cycles in the best case (three cycles of instruction execution plus 3 cycles of L1 hit latency), while it will be 3 + 28 = 31 cycles in the worst case (a complete cache miss).

DMB instructions in an interrupt-safe FIFO

Related to this thread, I have a FIFO which should work across different interrupts on a Cortex M4.
The head index must be
• atomically written (modified) by multiple interrupts (not threads)
• atomically read by a single (lowest level) interrupt
The function for moving the FIFO head looks similar to this (there are also checks to see if the head overflowed in the actual code but this is the main idea):
#include <stdatomic.h>
#include <stdint.h>
#define FIFO_LEN 1024
extern _Atomic int32_t _head;
int32_t acquire_head(void)
{
    while (1)
    {
        int32_t old_h = atomic_load(&_head);
        int32_t new_h = (old_h + 1) & (FIFO_LEN - 1);
        if (atomic_compare_exchange_strong(&_head, &old_h, new_h))
        {
            return old_h;
        }
    }
}
GCC will compile this to:
acquire_head:
ldr r2, .L8
.L2:
// int32_t old_h = atomic_load(&_head);
dmb ish
ldr r1, [r2]
dmb ish
// int32_t new_h = (old_h + 1) & (FIFO_LEN - 1);
adds r3, r1, #1
ubfx r3, r3, #0, #10
// if (atomic_compare_exchange_strong(&_head, &old_h, new_h))
dmb ish
.L5:
ldrex r0, [r2]
cmp r0, r1
bne .L6
strex ip, r3, [r2]
cmp ip, #0
bne .L5
.L6:
dmb ish
bne .L2
bx lr
.L8:
.word _head
This is a bare metal project without an OS/threads. This code is for a logging FIFO which is not time critical, but I don't want the acquiring of the head to make an impact on the latency of the rest of my program, so my question is:
do I need all these dmbs?
will there be a noticeable performance penalty with these instructions, or can I just ignore this?
if an interrupt happens during a dmb, how many additional cycles of latency does it create?
TL;DR: yes, LL/SC (STREX/LDREX) can be good for interrupt latency compared to disabling interrupts, by making an atomic RMW interruptible with a retry.
This may come at the cost of throughput, because apparently disabling / re-enabling interrupts on ARMv7 is very cheap (like maybe 1 or 2 cycles each for cpsid if / cpsie if), especially if you can unconditionally enable interrupts instead of saving the old state. (Temporarily disable interrupts on ARM).
The extra throughput costs are: if LDREX/STREX are any slower than LDR / STR on Cortex-M4, a cmp/bne (not-taken in the successful case), and any time the loop has to retry the whole loop body runs again. (Retry should be very rare; only if an interrupt actually comes in while in the middle of an LL/SC in another interrupt handler.)
C11 compilers like gcc don't have a special-case mode for uniprocessor systems or single-threaded code, unfortunately. So they don't know how to do code-gen that takes advantage of the fact that anything running on the same core will see all our operations in program order up to a certain point, even without any barriers.
(The cardinal rule of out-of-order execution and memory reordering is that it preserves the illusion of a single-thread or single core running instructions in program order.)
The back-to-back dmb instructions separated only by a couple ALU instructions are redundant even on a multi-core system for multi-threaded code. This is a gcc missed-optimization, because current compilers do basically no optimization on atomics. (Better to be safe and slowish than to risk ever being too weak. It's hard enough to reason about, test, and debug lockless code without worrying about possible compiler bugs.)
Atomics on a single-core CPU
You can vastly simplify it in this case by masking after an atomic_fetch_add, instead of simulating an atomic add with earlier rollover using CAS. (Then readers must mask as well, but that's very cheap.)
And you can use memory_order_relaxed. If you want reordering guarantees against an interrupt handler, use atomic_signal_fence to enforce compile-time ordering without asm barriers against runtime reordering. User-space POSIX signals are asynchronous within the same thread in exactly the same way that interrupts are asynchronous within the same core.
// readers must also mask _head & (FIFO_LEN - 1) before use
// Uniprocessor but with an atomic RMW:
int32_t acquire_head_atomicRMW_UP(void)
{
    atomic_signal_fence(memory_order_seq_cst); // zero asm instructions, just compile-time
    int32_t old_h = atomic_fetch_add_explicit(&_head, 1, memory_order_relaxed);
    atomic_signal_fence(memory_order_seq_cst);
    int32_t new_h = (old_h + 1) & (FIFO_LEN - 1);
    return new_h;
}
On the Godbolt compiler explorer
## gcc8.2 -O3 with your same options.
acquire_head_atomicRMW:
ldr r3, .L4 ## load the static address from a nearby literal pool
.L2:
ldrex r0, [r3]
adds r2, r0, #1
strex r1, r2, [r3]
cmp r1, #0
bne .L2 ## LL/SC retry loop, not load + inc + CAS-with-LL/SC
adds r0, r0, #1 ## add again: missed optimization to not reuse r2
ubfx r0, r0, #0, #10
bx lr
.L4:
.word _head
Unfortunately there's no way I know of in C11 or C++11 to express a LL/SC atomic RMW that contains an arbitrary set of operations, like add and mask, so we could get the ubfx inside the loop and part of what gets stored to _head. There are compiler-specific intrinsics for LDREX/STREX, though: Critical sections in ARM.
This is safe because _Atomic integer types are guaranteed to be 2's complement with well-defined overflow = wraparound behaviour. (int32_t is already guaranteed to be 2's complement because it's one of the fixed-width types, but the no-UB-wraparound is only for _Atomic). I'd have used uint32_t, but we get the same asm.
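For illustration, with the CMSIS-Core intrinsics (__LDREXW / __STREXW, where __STREXW returns 0 on success) the add-and-mask can be kept inside the LL/SC retry loop, so the value stored to _head is already wrapped. A sketch, not tested, with _head declared plain volatile for simplicity:

#include <stdint.h>
#include "cmsis_gcc.h"   /* assumed: whatever header your device package uses for the CMSIS intrinsics */

#define FIFO_LEN 1024
extern volatile uint32_t _head;

static inline uint32_t acquire_head_llsc(void)
{
    uint32_t old_h, new_h;
    do {
        old_h = __LDREXW(&_head);               /* LL: open the exclusive monitor */
        new_h = (old_h + 1) & (FIFO_LEN - 1);   /* wrap before storing */
    } while (__STREXW(new_h, &_head) != 0);     /* SC: retry if exclusivity was lost */
    return old_h;                               /* no barriers: single-core, per the discussion above */
}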
Safely using STREX/LDREX from inside an interrupt handler:
ARM® Synchronization Primitives (from 2009) has some details about the ISA rules that govern LDREX/STREX. Running an LDREX initializes the "exclusive monitor" to detect modification by other cores (or by other non-CPU things in the system? I don't know). Cortex-M4 is a single-core system.
You can have a global monitor for memory shared between multiple CPUs, and local monitors for memory that's marked non-shareable. That documentation says "If a region configured as Shareable is not associated with a global monitor, Store-Exclusive operations to that region always fail, returning 0 in the destination register." So if STREX seems to always fail (so you get stuck in a retry loop) when you test your code, that might be the problem.
An interrupt does not abort a transaction started by an LDREX. If you were context-switching to another context and resuming something that might have stopped right before a STREX, you could have a problem. ARMv6K introduced clrex for this, otherwise older ARM would use a dummy STREX to a dummy location.
See When is CLREX actually needed on ARM Cortex M7?, which makes the same point I'm about to, that CLREX is often not needed in an interrupt situation, when not context-switching between threads.
(Fun fact: a more recent answer on that linked question points out that Cortex M7 (or Cortex M in general?) automatically clears the monitor on interrupt, meaning clrex is never necessary in interrupt handlers. The reasoning below can still apply to older single-core ARM CPUs with a monitor that doesn't track addresses, unlike in multi-core CPUs.)
But for this problem, the thing you're switching to is always the start of an interrupt handler. You're not doing pre-emptive multi-tasking. So you can never switch from the middle of one LL/SC retry loop to the middle of another. As long as STREX fails the first time in the lower-priority interrupt when you return to it, that's fine.
That will be the case here because a higher-priority interrupt will only return after it does a successful STREX (or didn't do any atomic RMWs at all).
So I think you're ok even without using clrex from inline asm, or from an interrupt handler before dispatching to C functions. The manual says a Data Abort exception leaves the monitors architecturally undefined, so make sure you CLREX in that handler at least.
If an interrupt comes in while you're between an LDREX and STREX, the LL has loaded the old data in a register (and maybe computed a new value), but hasn't stored anything back to memory yet because STREX hadn't run.
The higher-priority code will LDREX, getting the same old_h value, then do a successful STREX of old_h + 1. (Unless it is also interrupted, but this reasoning works recursively). This might possibly fail the first time through the loop, but I don't think so. Even if so, I don't think there can be a correctness problem, based on the ARM doc I linked. The doc mentioned that the local monitor can be as simple as a state-machine that just tracks LDREX and STREX instructions, letting STREX succeed even if the previous instruction was an LDREX for a different address. Assuming Cortex-M4's implementation is simplistic, that's perfect for this.
Running another LDREX for the same address while the CPU is already monitoring from a previous LDREX looks like it should have no effect. Performing an exclusive load to a different address would reset the monitor to open state, but for this it's always going to be the same address (unless you have other atomics in other code?)
Then (after doing some other stuff), the interrupt handler will return, restoring registers and jumping back to the middle of the lower-priority interrupt's LL/SC loop.
Back in the lower-priority interrupt, STREX will fail because the STREX in the higher-priority interrupt reset the monitor state. That's good, we need it to fail because it would have stored the same value as the higher-priority interrupt that took its spot in the FIFO. The cmp / bne detects the failure and runs the whole loop again. This time it succeeds (unless interrupted again), reading the value stored by the higher-priority interrupt and storing & returning that + 1.
So I think we can get away without a CLREX anywhere, because interrupt handlers always run to completion before returning to the middle of something they interrupted. And they always begin at the beginning.
Single-writer version
Or, if nothing else can be modifying that variable, you don't need an atomic RMW at all, just a pure atomic load, then a pure atomic store of the new value. (_Atomic for the benefit of any readers.)
Or if no other thread or interrupt touches that variable at all, it doesn't need to be _Atomic.
// If we're the only writer, and other threads can only observe:
// again using uniprocessor memory order: relaxed + signal_fence
int32_t acquire_head_separate_RW_UP(void) {
    atomic_signal_fence(memory_order_seq_cst);
    int32_t old_h = atomic_load_explicit(&_head, memory_order_relaxed);
    int32_t new_h = (old_h + 1) & (FIFO_LEN - 1);
    atomic_store_explicit(&_head, new_h, memory_order_relaxed);
    atomic_signal_fence(memory_order_seq_cst);
    return new_h;
}
acquire_head_separate_RW_UP:
ldr r3, .L7
ldr r0, [r3] ## Plain atomic load
adds r0, r0, #1
ubfx r0, r0, #0, #10 ## zero-extend low 10 bits
str r0, [r3] ## Plain atomic store
bx lr
This is the same asm we'd get for non-atomic head.
Your code is written in a way that is not very "bare metal". Those "general" atomic functions do not know whether the value read or stored is in internal memory, or is maybe a hardware register located somewhere far from the core and connected via buses and sometimes write/read buffers.
That is why the generic atomic functions have to place so many DMB instructions. Because you read or write an internal memory location, they are not needed at all (the M4 does not have any internal cache, so this kind of strong precaution is not needed either).
IMO it is enough to just disable interrupts when you want to access the memory location atomically.
PS: stdatomic is very rarely used in bare-metal uC development.
The fastest way to guarantee exclusive access on an M4 uC is to disable and re-enable the interrupts.
__disable_irq();
x++;
__enable_irq();
__ASM volatile ("cpsid i" : : : "memory");
080053e8: cpsid i
x++;
080053ea: ldr r2, [pc, #160] ; (0x800548c <main+168>)
080053ec: ldrb r3, [r2, #0]
080053ee: adds r3, #1
080053f0: strb r3, [r2, #0]
__ASM volatile ("cpsie i" : : : "memory");
which will cost only 2 or 4 additional clocks for both instructions.
It guarantees atomicity and does not add unnecessary overhead.
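If this can also be called from a context where interrupts are already disabled, a save/restore variant of the same idea avoids unconditionally re-enabling them (a sketch using the CMSIS intrinsics __get_PRIMASK / __set_PRIMASK / __disable_irq):

#include <stdint.h>
#include "cmsis_gcc.h"   /* assumed: any CMSIS core header providing the intrinsics */

static inline uint32_t enter_critical(void)
{
    uint32_t primask = __get_PRIMASK();   /* remember whether IRQs were already masked */
    __disable_irq();                      /* cpsid i */
    return primask;
}

static inline void exit_critical(uint32_t primask)
{
    __set_PRIMASK(primask);               /* restore, instead of a blind cpsie i */
}

Usage: uint32_t m = enter_critical(); x++; exit_critical(m);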
dmb is required in situations like
p1:
str r5, [r1]
str r0, [r2]
and
p2:
wait([r2] == 0)
ldr r5, [r1]
(from http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf, section 6.2.1 "Weakly-Ordered Message Passing problem").
In-CPU optimizations can reorder the instructions on p1, so you have to insert a dmb between both stores.
In your example, there are too many dmbs, probably caused by expanding atomic_xxx(), which might emit a dmb both at the start and at the end.
It should be enough to have
acquire_head:
ldr r2, .L8
dmb ish
.L2:
// int32_t old_h = atomic_load(&_head);
ldr r1, [r2]
...
bne .L5
.L6:
bne .L2
dmb ish
bx lr
and no other dmb between.
Performance impact is difficult to estimate (you would have to benchmark code with and without dmb). dmb does not consume cpu cycles; it just stops pipelining within the cpu.

Can x86's MOV really be "free"? Why can't I reproduce this at all?

I keep seeing people claim that the MOV instruction can be free in x86, because of register renaming.
For the life of me, I can't verify this in a single test case. Every test case I try debunks it.
For example, here's the code I'm compiling with Visual C++:
#include <limits.h>
#include <stdio.h>
#include <time.h>
int main(void)
{
    unsigned int k, l, j;
    clock_t tstart = clock();
    for (k = 0, j = 0, l = 0; j < UINT_MAX; ++j)
    {
        ++k;
        k = j; // <-- comment out this line to remove the MOV instruction
        l += j;
    }
    fprintf(stderr, "%d ms\n", (int)((clock() - tstart) * 1000 / CLOCKS_PER_SEC));
    fflush(stderr);
    return (int)(k + j + l);
}
This produces the following assembly code for the loop (feel free to produce this however you want; you obviously don't need Visual C++):
LOOP:
add edi,esi
mov ebx,esi
inc esi
cmp esi,FFFFFFFFh
jc LOOP
Now I run this program several times, and I observe a pretty consistent 2% difference when the MOV instruction is removed:
Without MOV With MOV
1303 ms 1358 ms
1324 ms 1363 ms
1310 ms 1345 ms
1304 ms 1343 ms
1309 ms 1334 ms
1312 ms 1336 ms
1320 ms 1311 ms
1302 ms 1350 ms
1319 ms 1339 ms
1324 ms 1338 ms
So what gives? Why isn't the MOV "free"? Is this loop too complicated for x86?
Is there a single example out there that can demonstrate MOV being free like people claim?
If so, what is it? And if not, why does everyone keep claiming MOV is free?
Register-copy is never free for the front-end, only eliminated from actually executing in the back-end (with zero latency) by the issue/rename stage on the following CPUs:
AMD Bulldozer family for XMM vector registers, not integer.
AMD Zen family for integer and XMM vector registers. (And YMM in Zen2 and later)
(See Agner Fog's microarch guide for details on low/high halves of YMM in BD / Zen 1)
Intel Ivy Bridge and later for integer and vector registers (except MMX)
Intel Goldmont and later low-power CPUs: XMM and integer
(including Alder Lake E-cores integer/XMM but not YMM)
Not Intel Ice Lake: a microcode update disabled register-renaming as part of working around an erratum, for general-purpose integer regs. XMM/YMM/ZMM renaming still works. Tiger Lake is also affected, but not Rocket Lake or Alder Lake P-cores.
uops.info results for mov r32, r32 - note the latency=0 for CPUs where it works. And note they show latency=1 on Ice Lake and Tiger Lake, so they retested after microcode updates.
Your experiment
The throughput of the loop in the question doesn't depend on the latency of MOV, or (on Haswell) the benefit of not using an execution unit.
The loop is still only 4 uops for the front-end to issue into the out-of-order back-end. (mov still has to be tracked by the out-of-order back-end even if it doesn't need an execution unit, but cmp/jc macro-fuses into a single uop).
Intel CPUs since Core 2 have had an issue width of 4 uops per clock, so the mov doesn't stop it from executing at (close to) one iter per clock on Haswell. It would also run at one per clock on Ivybridge (with mov-elimination), but not on Sandybridge (no mov-elimination). On SnB, it would be about one iter per 1.333c cycles, bottlenecked on ALU throughput because the mov would always need one. (SnB/IvB have only three ALU ports, while Haswell has four).
Note that special handling in the rename stage has been a thing for x87 FXCHG (swap st0 with st1) for much longer than MOV. Agner Fog lists FXCHG as 0 latency on PPro/PII/PIII (first-gen P6 core).
The loop in the question has two interlocking dependency chains (the add edi,esi depends on EDI and on the loop counter ESI), which makes it more sensitive to imperfect scheduling. A 2% slowdown vs. theoretical prediction because of seemingly-unrelated instructions isn't unusual, and small variations in order of instructions can make this kind of difference. To run at exactly 1c per iter, every cycle needs to run an INC and an ADD. Since all the INCs and ADDs are dependent on the previous iteration, out-of-order execution can't catch up by running two in a single cycle. Even worse, the ADD depends on the INC in the previous cycle, which is what I meant by "interlocking", so losing a cycle in the INC dep chain also stalls the ADD dep chain.
Also, predicted-taken branches can only run on port6, so any cycle where port6 doesn't execute a cmp/jc is a cycle of lost throughput. This happens every time an INC or ADD steals a cycle on port6 instead of running on ports 0, 1, or 5. IDK if this is the culprit, or if losing cycles in the INC/ADD dep chains themselves is the problem, or maybe some of both.
Adding the extra MOV doesn't add any execution-port pressure, assuming it's eliminated 100%, but it does stop the front-end from running ahead of the back-end execution units. Only 3 of the 4 uops in the loop need an execution unit, and your Haswell CPU can run INC and ADD on any of its 4 ALU ports: 0, 1, 5, and 6. So the bottlenecks are:
the front-end max throughput of 4 uops per clock. (The loop without MOV is only 3 uops, so the front-end can run ahead).
taken-branch throughput of one per clock.
the dependency chain involving esi (INC latency of 1 per clock)
the dependency chain involving edi (ADD latency of 1 per clock, and also dependent on the INC from the previous iteration)
Without the MOV, the front-end can issue the loop's three uops at 4 per clock until the out-of-order back-end is full. (AFAICT, it "unrolls" tiny loops in the loop-buffer (Loop Stream Detector: LSD), so a loop with ABC uops can issue in an ABCA BCAB CABC ... pattern. The perf counter for lsd.cycles_4_uops confirms that it mostly issues in groups of 4 when it issues any uops.)
Intel CPUs assign uops to ports as they issue into the out-of-order back-end. The decision is based on counters that track how many uops for each port are already in the scheduler (aka Reservation Station, RS). When there are lots of uops in the RS waiting to execute, this works well and should usually avoid scheduling INC or ADD to port6. And I guess also avoids scheduling the INC and ADD such that time is lost from either of those dep chains. But if the RS is empty or nearly-empty, the counters won't stop an ADD or INC from stealing a cycle on port6.
I thought I was onto something here, but any sub-optimal scheduling should let the front-end catch up and keep the back-end full. I don't think we should expect the front-end to cause enough bubbles in the pipeline to explain a 2% drop below max throughput, since the tiny loop should run from the loop buffer at a very consistent 4 per clock throughput. Maybe there's something else going on.
A real example of the benefit of mov elimination.
I used lea to construct a loop that only has one mov per clock, creating a perfect demonstration where MOV-elimination succeeds 100%, or 0% of the time with mov same,same to demonstrate the latency bottleneck that produces.
Since the macro-fused dec/jnz is part of the dependency chain involving the loop counter, imperfect scheduling can't delay it. This is different from the case where cmp/jc "forks off" from the critical-path dependency chain every iteration.
_start:
mov ecx, 2000000000 ; each iteration decrements by 2, so this is 1G iters
align 16 ; really align 32 makes more sense in case the uop-cache comes into play, but alignment is actually irrelevant for loops that fit in the loop buffer.
.loop:
mov eax, ecx
lea ecx, [rax-1] ; we vary these two instructions
dec ecx ; dec/jnz macro-fuses into one uop in the decoders, on Intel
jnz .loop
.end:
xor edi,edi ; edi=0
mov eax,231 ; __NR_exit_group from /usr/include/asm/unistd_64.h
syscall ; sys_exit_group(0)
On Intel SnB-family, LEA with one or two components in the addressing mode runs with 1c latency (See http://agner.org/optimize/, and other links in the x86 tag wiki).
I built and ran this as a static binary on Linux, so user-space perf-counters for the whole process are measuring just the loop with negligible startup / shutdown overhead. (perf stat is really easy compared to putting perf-counter queries into the program itself)
$ yasm -felf64 -Worphan-labels -gdwarf2 mov-elimination.asm && ld -o mov-elimination mov-elimination.o &&
objdump -Mintel -drwC mov-elimination &&
taskset -c 1 ocperf.py stat -etask-clock,context-switches,page-faults,cycles,instructions,branches,uops_issued.any,uops_executed.thread -r2 ./mov-elimination
Disassembly of section .text:
00000000004000b0 <_start>:
4000b0: b9 00 94 35 77 mov ecx,0x77359400
4000b5: 66 66 2e 0f 1f 84 00 00 00 00 00 data16 nop WORD PTR cs:[rax+rax*1+0x0]
00000000004000c0 <_start.loop>:
4000c0: 89 c8 mov eax,ecx
4000c2: 8d 48 ff lea ecx,[rax-0x1]
4000c5: ff c9 dec ecx
4000c7: 75 f7 jne 4000c0 <_start.loop>
00000000004000c9 <_start.end>:
4000c9: 31 ff xor edi,edi
4000cb: b8 e7 00 00 00 mov eax,0xe7
4000d0: 0f 05 syscall
perf stat -etask-clock,context-switches,page-faults,cycles,instructions,branches,cpu/event=0xe,umask=0x1,name=uops_issued_any/,cpu/event=0xb1,umask=0x1,name=uops_executed_thread/ -r2 ./mov-elimination
Performance counter stats for './mov-elimination' (2 runs):
513.242841 task-clock:u (msec) # 1.000 CPUs utilized ( +- 0.05% )
0 context-switches:u # 0.000 K/sec
1 page-faults:u # 0.002 K/sec
2,000,111,934 cycles:u # 3.897 GHz ( +- 0.00% )
4,000,000,161 instructions:u # 2.00 insn per cycle ( +- 0.00% )
1,000,000,157 branches:u # 1948.396 M/sec ( +- 0.00% )
3,000,058,589 uops_issued_any:u # 5845.300 M/sec ( +- 0.00% )
2,000,037,900 uops_executed_thread:u # 3896.865 M/sec ( +- 0.00% )
0.513402352 seconds time elapsed ( +- 0.05% )
As expected, the loop runs 1G times (branches ~= 1 billion). The "extra" 111k cycles beyond 2G is overhead that's present in the other tests, too, including the one with no mov. It's not from occasional failure of mov-elimination, but it does scale with the iteration count so it's not just startup overhead. It's probably from timer interrupts, since IIRC Linux perf doesn't mess around with perf-counters while handling interrupts, and just lets them keep counting. (perf virtualizes the hardware performance counters so you can get per-process counts even when a thread migrates across CPUs.) Also, timer interrupts on the sibling logical core that shares the same physical core will perturb things a bit.
The bottleneck is the loop-carried dependency chain involving the loop counter. 2G cycles for 1G iters is 2 clocks per iteration, or 1 clock per decrement. This confirms that the length of the dep chain is 2 cycles. This is only possible if mov has zero latency. (I know it doesn't prove that there isn't some other bottleneck. It really only proves that the latency is at most 2 cycles, if you don't believe my assertion that latency is the only bottleneck. There's a resource_stalls.any perf counter, but it doesn't have many options for breaking down which microarchitectural resource was exhausted.)
The loop has 3 fused-domain uops: mov, lea, and macro-fused dec/jnz. The 3G uops_issued.any count confirms that: It counts in the fused domain, which is all of the pipeline from decoders to retirement, except for the scheduler (RS) and execution units. (macro-fused instruction-pairs stay as single uop everywhere. It's only for micro-fusion of stores or ALU+load that 1 fused-domain uop in the ROB tracks the progress of two unfused-domain uops.)
2G uops_executed.thread (unfused-domain) tells us that all the mov uops were eliminated (i.e. handled by the issue/rename stage, and placed in the ROB in an already-executed state). They still take up issue/retire bandwidth, and space in the uop cache, and code-size. They take up space in the ROB, limiting out-of-order window size. A mov instruction is never free. There are many possible microarchitectural bottlenecks besides latency and execution ports, the most important often being the 4-wide issue rate of the front-end.
On Intel CPUs, being zero latency is often a bigger deal than not needing an execution unit, especially in Haswell and later where there are 4 ALU ports. (But only 3 of them can handle vector uops, so non-eliminated vector moves would be a bottleneck more easily, especially in code without many loads or stores taking front-end bandwidth (4 fused-domain uops per clock) away from ALU uops. Also, scheduling uops to execution units isn't perfect (more like oldest-ready first), so uops that aren't on the critical path can steal cycles from the critical path.)
If we put a nop or an xor edx,edx into the loop, those would also issue but not execute on Intel SnB-family CPUs.
Zero-latency mov-elimination can be useful for zero-extending from 32 to 64 bits, and for 8 to 64. (movzx eax, bl is eliminated, movzx eax, bx isn't).
Without mov-elimination
All current CPUs that support mov-elimination don't support it for mov same,same, so pick different registers for zero-extending integers from 32 to 64-bit, or vmovdqa xmm,xmm to zero-extend to YMM in a rare case where that's necessary. (Unless you need the result in the register it's already in. Bouncing to a different reg and back is normally worse.) And on Intel, the same applies for movzx eax,al for example. (AMD Ryzen doesn't mov-eliminate movzx.) Agner Fog's instruction tables show mov as always being eliminated on Ryzen, but I guess he means that it can't fail between two different regs the way it can on Intel.
We can use this limitation to create a micro-benchmark that defeats it on purpose.
mov ecx, ecx # CPUs can't eliminate mov same,same
lea ecx, [rcx-1]
dec ecx
jnz .loop
3,000,320,972 cycles:u # 3.898 GHz ( +- 0.00% )
4,000,000,238 instructions:u # 1.33 insn per cycle ( +- 0.00% )
1,000,000,234 branches:u # 1299.225 M/sec ( +- 0.00% )
3,000,084,446 uops_issued_any:u # 3897.783 M/sec ( +- 0.00% )
3,000,058,661 uops_executed_thread:u # 3897.750 M/sec ( +- 0.00% )
This takes 3G cycles for 1G iterations, because the length of the dependency chain is now 3 cycles.
The fused-domain uop count didn't change, still 3G.
What did change is that now the unfused-domain uop count is the same as fused-domain. All the uops needed an execution unit; none of the mov instructions were eliminated, so they all added 1c latency to the loop-carried dep chain.
(When there are micro-fused uops, like add eax, [rsi], the uops_executed count can be higher than uops_issued. But we don't have that.)
Without the mov at all:
lea ecx, [rcx-1]
dec ecx
jnz .loop
2,000,131,323 cycles:u # 3.896 GHz ( +- 0.00% )
3,000,000,161 instructions:u # 1.50 insn per cycle
1,000,000,157 branches:u # 1947.876 M/sec
2,000,055,428 uops_issued_any:u # 3895.859 M/sec ( +- 0.00% )
2,000,039,061 uops_executed_thread:u # 3895.828 M/sec ( +- 0.00% )
Now we're back down to 2 cycle latency for the loop-carried dep chain.
Nothing is eliminated.
I tested on a 3.9GHz i7-6700k Skylake. I get identical results on a Haswell i5-4210U (to within 40k out of 1G counts) for all the perf events. That's about the same margin of error as re-running on the same system.
Note that if I ran perf as root [1], and counted cycles instead of cycles:u (user-space only), it measures the CPU frequency as exactly 3.900 GHz. (IDK why Linux only obeys the bios-settings for max turbo right after reboot, but then drops to 3.9GHz if I leave it idle for a couple minutes. Asus Z170 Pro Gaming mobo, Arch Linux with kernel 4.10.11-1-ARCH. Saw the same thing with Ubuntu. Writing balance_performance to each of /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference from /etc/rc.local fixes it, but writing balance_power makes it drop back to 3.9GHz again later.)
[1]: update: as a better alternative to running sudo perf, I set sysctl kernel.perf_event_paranoid = 0 in /etc/sysctl.d/99-local.conf
You should get the same results on AMD Ryzen, since it can eliminate integer mov. AMD Bulldozer-family can only eliminate xmm register copies. (According to Agner Fog, ymm register copies are an eliminated low-half and an ALU op for the high half.)
For example, AMD Bulldozer and Intel Ivybridge can sustain a throughput of 1 per clock for
movaps xmm0, xmm1
movaps xmm2, xmm3
movaps xmm4, xmm5
dec
jnz .loop
But Intel Sandybridge can't eliminate moves, so it would bottleneck on 4 ALU uops for 3 execution ports. If it was pxor xmm0,xmm0 instead of movaps, SnB could also sustain one iteration per clock. (But Bulldozer-family couldn't, because xor-zeroing still needs an execution unit on AMD, even though it is independent of the old value of the register. And Bulldozer-family only has 0.5c throughput for PXOR.)
Limitations of mov-elimination
Two dependent MOV instructions in a row exposes a difference between Haswell and Skylake.
.loop:
mov eax, ecx
mov ecx, eax
sub ecx, 2
jnz .loop
Haswell: minor run-to-run variability (1.746 to 1.749 c / iter), but this is typical:
1,749,102,925 cycles:u # 2.690 GHz
4,000,000,212 instructions:u # 2.29 insn per cycle
1,000,000,208 branches:u # 1538.062 M/sec
3,000,079,561 uops_issued_any:u # 4614.308 M/sec
1,746,698,502 uops_executed_core:u # 2686.531 M/sec
745,676,067 lsd_cycles_4_uops:u # 1146.896 M/sec
Not all the MOV instructions are eliminated: about 0.75 of the 2 per iteration used an execution port. Every MOV that executes instead of being eliminated adds 1c of latency to the loop-carried dep chain, so it's not a coincidence that uops_executed and cycles are very similar. All the uops are part of a single dependency chain, so there's no parallelism possible. cycles is always about 5M higher than uops_executed regardless of run-to-run variation, so I guess there are just 5M cycles being used up somewhere else.
Skylake: more stable than HSW results, and more mov-elimination: only 0.6666 MOVs of every 2 needed an execution unit.
1,666,716,605 cycles:u # 3.897 GHz
4,000,000,136 instructions:u # 2.40 insn per cycle
1,000,000,132 branches:u # 2338.050 M/sec
3,000,059,008 uops_issued_any:u # 7014.288 M/sec
1,666,548,206 uops_executed_thread:u # 3896.473 M/sec
666,683,358 lsd_cycles_4_uops:u # 1558.739 M/sec
On Haswell, lsd.cycles_4_uops accounted for all of the uops. (0.745 * 4 ~= 3). So in almost every cycle where any uops are issued, a full group of 4 is issued (from the loop-buffer. I probably should have looked at a different counter that doesn't care where they came from, like uops_issued.stall_cycles to count cycles where no uops issued).
But on SKL, 0.66666 * 4 = 2.66664 is less than 3, so in some cycles the front-end issued fewer than 4 uops. (Usually it stalls until there's room in the out-of-order back-end to issue a full group of 4, instead of issuing non-full groups).
It's weird, IDK what the exact microarchitectural limitation is. Since the loop is only 3 uops, each issue-group of 4 uops is more than a full iteration. So an issue group can contain up to 3 dependent MOVs. Perhaps Skylake is designed to break that up sometimes, to allow more mov-elimination?
update: actually this is normal for 3-uop loops on Skylake. uops_issued.stall_cycles shows that HSW and SKL issue a simple 3 uop loop with no mov-elimination the same way they issue this one. So better mov-elimination is a side-effect of splitting up issue groups for some other reason. (It's not a bottleneck because taken branches can't execute faster than 1 per clock regardless of how fast they issue). I still don't know why SKL is different, but I don't think it's anything to worry about.
In a less extreme case, SKL and HSW are the same, with both failing to eliminate 0.3333 of every 2 MOV instructions:
.loop:
mov eax, ecx
dec eax
mov ecx, eax
sub ecx, 1
jnz .loop
2,333,434,710 cycles:u # 3.897 GHz
5,000,000,185 instructions:u # 2.14 insn per cycle
1,000,000,181 branches:u # 1669.905 M/sec
4,000,061,152 uops_issued_any:u # 6679.720 M/sec
2,333,374,781 uops_executed_thread:u # 3896.513 M/sec
1,000,000,942 lsd_cycles_4_uops:u # 1669.906 M/sec
All of the uops issue in groups of 4. Any contiguous group of 4 uops will contain exactly two MOV uops that are candidates for elimination. Since it clearly succeeds at eliminating both in some cycles, IDK why it can't always do that.
Intel's optimization manual says that overwriting the result of mov-elimination as early as possible frees up the microarchitectural resources so it can succeed more often, at least for movzx. See Example 3-23. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions.
So maybe it's tracked internally with a limited-size table of ref-counts? Something has to stop the physical register file entry from being freed when it's no longer needed as the value of the original architectural register, if it's still needed as the value of the mov destination. Freeing PRF entries as soon as possible is key, because PRF size can limit the out-of-order window to smaller than ROB size.
I tried the examples on Haswell and Skylake, and found that mov-elimination did in fact work significantly more of the time when doing that, but that it was actually slightly slower in total cycles, instead of faster. The example was intended to show the benefit on IvyBridge, which probably bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the dep chains and don't seem to be bothered by needing an ALU port for more of the movzx instructions.
See also Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for more research + guesswork about how mov-elimination works, and whether it could work for xchg eax, ecx. (In practice xchg reg,reg is 3 ALU uops on Intel, but 2 eliminated uops on Ryzen. It's interesting to guess whether Intel could have implemented it more efficiently.)
BTW, as a workaround for an erratum on Haswell, Linux doesn't provide uops_executed.thread when hyperthreading is enabled, only uops_executed.core. The other core was definitely idle the whole time, not even timer interrupts, because I took it offline with echo 0 > /sys/devices/system/cpu/cpu3/online. Unfortunately this can't be done before the kernel's perf drivers (PAPI) decides that HT is enabled on boot, and my Dell laptop doesn't have a BIOS option to disable HT. So I can't get perf to use all 8 hardware PMU counters at once on that system, only 4. :/
Here are two small tests that I believe conclusively show evidence for mov-elimination:
__loop1:
add edx, 1
add edx, 1
add ecx, 1
jnc __loop1
versus
__loop2:
mov eax, edx
add eax, 1
mov edx, eax
add edx, 1
add ecx, 1
jnc __loop2
If mov added a cycle to a dependency chain, it would be expected that the second version takes about 4 cycles per iteration. On my Haswell, both take about 2 cycles per iteration, which cannot happen without mov-elimination.

why does u-boot invalidate TLBs, icache, BP array at the beginning

On the ARM platform, u-boot invalidates the TLBs, icache, and BP array at the beginning, but what's the reason? Is it necessary?
cpu_init_crit:
/*
 * Invalidate L1 I/D
 */
    mov r0, #0                    @ set up for MCR
    mcr p15, 0, r0, c8, c7, 0     @ invalidate TLBs
    mcr p15, 0, r0, c7, c5, 0     @ invalidate icache
    mcr p15, 0, r0, c7, c5, 6     @ invalidate BP array
    mcr p15, 0, r0, c7, c10, 4    @ DSB
    mcr p15, 0, r0, c7, c5, 4     @ ISB
A reset may be hard or soft. It is possible for something to jump to a reset vector. If a cache is enabled with stale entries, the code may crash, resulting in a system that does not boot. This can be more common than you think, as SDRAM may misbehave after a cold power-on and a mis-read can cause a crash. Often a watchdog or an unrecoverable fault will jump to the reset vector. Finally, there may be u-boot systems in XIP NOR flash. Some systems/code may simply jump to this code to perform a soft reset.
In all of these cases, the debug available is nil. You may have a working system in 99.9999% of cases. Having to troubleshoot a boot failure on particular hardware that only occurs in a hot bar or an ice cream freezer might make you appreciate this extra code. I.e., there are rare circumstances where it is needed.
Hardware failures are much more common as supply rails come on line. Datasheets describe properly behaving systems with power, temperature, etc. in range; u-boot may not be operating in this ideal world. If your concern is boot time, you are much better off writing your own loader (keep this code, though) or optimizing things like BSS clearing.
On the A15/A7/A12 you do not need to do cache/MMU/BTB invalidation after reset, so this is done just for "paranoia" reasons. However, there might be cores that do not auto-invalidate on reset, so this code just makes sure that the same behavior is preserved across different core types.
This is necessary in the cases where the invalidation is not done in hardware on a reset (cold or warm). Another, more practical, reason is that U-Boot is typically not the first piece of code that runs on the hardware. In most cases there is ROM code that runs before U-Boot, and to avoid making any assumptions about the state of the hardware when U-Boot gets control of the system, it is safer to always try to get the hardware into a known state and proceed from there.
