Why ARMv7-A crashes when flushing the stack pointer from the cache

Why ARMv7-A crashes when flushing the stack pointer from the cache - c

I am trying to evict the memory address in which the stack pointer is pointing to it in an ARM Cortex-A8 processor. I am trying to do that with the below code:
cpy r3, sp
mcr p15, 0x0, r3, cr7, cr6, 0x1
I have run the above code in a loadable kernel module. after running the above code in the kernel, OS crashes and needs a restart. but the above instructions work fine for flushing a variable from the cache.
Can anyone give me any advice to solve the problem?

Thanks to artless noise, actually ARM cortex-a8 has 3 types of command for cache manipulation based on Modified Virtual address:
Invalidate (C6, 1) (just invalidate the cache line)
Clean (C10, 1) (Update memory if the cache line is dirty)
Clean & Invalidate (C14, 1) (Update memory then invalidate cache line)
and as you can see in the question I used Invalidate instruction and it caused that memory to have invalid data for the stack. but after using Clean&Invalidate instruction the problem was solved. so the final code is as below:
cpy r3, sp
mcr p15, 0x0, r3, cr7, cr14, 0x1
DSB SY

Related

DMB instructions in an interrupt-safe FIFO

Related to this thread, I have a FIFO which should work across different interrupts on a Cortex M4.
The head index must be
atomically written (modified) by multiple interrupts (not threads)
atomically read by a single (lowest level) interrupt
The function for moving the FIFO head looks similar to this (there are also checks to see if the head overflowed in the actual code but this is the main idea):
#include <stdatomic.h>
#include <stdint.h>
#define FIFO_LEN 1024
extern _Atomic int32_t _head;
int32_t acquire_head(void)
{
while (1)
{
int32_t old_h = atomic_load(&_head);
int32_t new_h = (old_h + 1) & (FIFO_LEN - 1);
if (atomic_compare_exchange_strong(&_head, &old_h, new_h))
{
return old_h;
}
}
}
GCC will compile this to:
acquire_head:
ldr r2, .L8
.L2:
// int32_t old_h = atomic_load(&_head);
dmb ish
ldr r1, [r2]
dmb ish
// int32_t new_h = (old_h + 1) & (FIFO_LEN - 1);
adds r3, r1, #1
ubfx r3, r3, #0, #10
// if (atomic_compare_exchange_strong(&_head, &old_h, new_h))
dmb ish
.L5:
ldrex r0, [r2]
cmp r0, r1
bne .L6
strex ip, r3, [r2]
cmp ip, #0
bne .L5
.L6:
dmb ish
bne .L2
bx lr
.L8:
.word _head
This is a bare metal project without an OS/threads. This code is for a logging FIFO which is not time critical, but I don't want the acquiring of the head to make an impact on the latency of the rest of my program, so my question is:
do I need all these dmbs?
will there be a noticeable performance penalty with these instructions, or can I just ignore this?
if an interrupt happens during a dmb, how many additional cycles of latency does it create?

TL:DR yes, LL/SC (STREX/LDREX) can be good for interrupt latency compared to disabling interrupts, by making an atomic RMW interruptible with a retry.
This may come at the cost of throughput, because apparently disabling / re-enabling interrupts on ARMv7 is very cheap (like maybe 1 or 2 cycles each for cpsid if / cpsie if), especially if you can unconditionally enable interrupts instead of saving the old state. (Temporarily disable interrupts on ARM).
The extra throughput costs are: if LDREX/STREX are any slower than LDR / STR on Cortex-M4, a cmp/bne (not-taken in the successful case), and any time the loop has to retry the whole loop body runs again. (Retry should be very rare; only if an interrupt actually comes in while in the middle of an LL/SC in another interrupt handler.)
C11 compilers like gcc don't have a special-case mode for uniprocessor systems or single-threaded code, unfortunately. So they don't know how to do code-gen that takes advantage of the fact that anything running on the same core will see all our operations in program order up to a certain point, even without any barriers.
(The cardinal rule of out-of-order execution and memory reordering is that it preserves the illusion of a single-thread or single core running instructions in program order.)
The back-to-back dmb instructions separated only by a couple ALU instructions are redundant even on a multi-core system for multi-threaded code. This is a gcc missed-optimization, because current compilers do basically no optimization on atomics. (Better to be safe and slowish than to risk ever being too weak. It's hard enough to reason about, test, and debug lockless code without worrying about possible compiler bugs.)
Atomics on a single-core CPU
You can vastly simplify it in this case by masking after an atomic_fetch_add, instead of simulating an atomic add with earlier rollover using CAS. (Then readers must mask as well, but that's very cheap.)
And you can use memory_order_relaxed. If you want reordering guarantees against an interrupt handler, use atomic_signal_fence to enforce compile-time ordering without asm barriers against runtime reordering. User-space POSIX signals are asynchronous within the same thread in exactly the same way that interrupts are asynchronous within the same core.
// readers must also mask _head & (FIFO_LEN - 1) before use
// Uniprocessor but with an atomic RMW:
int32_t acquire_head_atomicRMW_UP(void)
{
atomic_signal_fence(memory_order_seq_cst); // zero asm instructions, just compile-time
int32_t old_h = atomic_fetch_add_explicit(&_head, 1, memory_order_relaxed);
atomic_signal_fence(memory_order_seq_cst);
int32_t new_h = (old_h + 1) & (FIFO_LEN - 1);
return new_h;
}
On the Godbolt compiler explorer
## gcc8.2 -O3 with your same options.
acquire_head_atomicRMW:
ldr r3, .L4 ## load the static address from a nearby literal pool
.L2:
ldrex r0, [r3]
adds r2, r0, #1
strex r1, r2, [r3]
cmp r1, #0
bne .L2 ## LL/SC retry loop, not load + inc + CAS-with-LL/SC
adds r0, r0, #1 ## add again: missed optimization to not reuse r2
ubfx r0, r0, #0, #10
bx lr
.L4:
.word _head
Unfortunately there's no way I know of in C11 or C++11 to express a LL/SC atomic RMW that contains an arbitrary set of operations, like add and mask, so we could get the ubfx inside the loop and part of what gets stored to _head. There are compiler-specific intrinsics for LDREX/STREX, though: Critical sections in ARM.
This is safe because _Atomic integer types are guaranteed to be 2's complement with well-defined overflow = wraparound behaviour. (int32_t is already guaranteed to be 2's complement because it's one of the fixed-width types, but the no-UB-wraparound is only for _Atomic). I'd have used uint32_t, but we get the same asm.
Safely using STREX/LDREX from inside an interrupt handler:
ARM® Synchronization Primitives (from 2009) has some details about the ISA rules that govern LDREX/STREX. Running an LDREX initializes the "exclusive monitor" to detect modification by other cores (or by other non-CPU things in the system? I don't know). Cortex-M4 is a single-core system.
You can have a global monitor for memory shared between multiple CPUs, and local monitors for memory that's marked non-shareable. That documentation says "If a region configured as Shareable is not associated with a global monitor, Store-Exclusive operations to that region always fail, returning 0 in the destination register." So if STREX seems to always fail (so you get stuck in a retry loop) when you test your code, that might be the problem.
An interrupt does not abort a transaction started by an LDREX. If you were context-switching to another context and resuming something that might have stopped right before a STREX, you could have a problem. ARMv6K introduced clrex for this, otherwise older ARM would use a dummy STREX to a dummy location.
See When is CLREX actually needed on ARM Cortex M7?, which makes the same point I'm about to, that CLREX is often not needed in an interrupt situation, when not context-switching between threads.
(Fun fact: a more recent answer on that linked question points out that Cortex M7 (or Cortex M in general?) automatically clears the monitor on interrupt, meaning clrex is never necessary in interrupt handlers. The reasoning below can still apply to older single-core ARM CPUs with a monitor that doesn't track addresses, unlike in multi-core CPUs.)
But for this problem, the thing you're switching to is always the start of an interrupt handler. You're not doing pre-emptive multi-tasking. So you can never switch from the middle of one LL/SC retry loop to the middle of another. As long as STREX fails the first time in the lower-priority interrupt when you return to it, that's fine.
That will be the case here because a higher-priority interrupt will only return after it does a successful STREX (or didn't do any atomic RMWs at all).
So I think you're ok even without using clrex from inline asm, or from an interrupt handler before dispatching to C functions. The manual says a Data Abort exception leaves the monitors architecturally undefined, so make sure you CLREX in that handler at least.
If an interrupt comes in while you're between an LDREX and STREX, the LL has loaded the old data in a register (and maybe computed a new value), but hasn't stored anything back to memory yet because STREX hadn't run.
The higher-priority code will LDREX, getting the same old_h value, then do a successful STREX of old_h + 1. (Unless it is also interrupted, but this reasoning works recursively). This might possibly fail the first time through the loop, but I don't think so. Even if so, I don't think there can be a correctness problem, based on the ARM doc I linked. The doc mentioned that the local monitor can be as simple as a state-machine that just tracks LDREX and STREX instructions, letting STREX succeed even if the previous instruction was an LDREX for a different address. Assuming Cortex-M4's implementation is simplistic, that's perfect for this.
Running another LDREX for the same address while the CPU is already monitoring from a previous LDREX looks like it should have no effect. Performing an exclusive load to a different address would reset the monitor to open state, but for this it's always going to be the same address (unless you have other atomics in other code?)
Then (after doing some other stuff), the interrupt handler will return, restoring registers and jumping back to the middle of the lower-priority interrupt's LL/SC loop.
Back in the lower-priority interrupt, STREX will fail because the STREX in the higher-priority interrupt reset the monitor state. That's good, we need it to fail because it would have stored the same value as the higher-priority interrupt that took its spot in the FIFO. The cmp / bne detects the failure and runs the whole loop again. This time it succeeds (unless interrupted again), reading the value stored by the higher-priority interrupt and storing & returning that + 1.
So I think we can get away without a CLREX anywhere, because interrupt handlers always run to completion before returning to the middle of something they interrupted. And they always begin at the beginning.
Single-writer version
Or, if nothing else can be modifying that variable, you don't need an atomic RMW at all, just a pure atomic load, then a pure atomic store of the new value. (_Atomic for the benefit or any readers).
Or if no other thread or interrupt touches that variable at all, it doesn't need to be _Atomic.
// If we're the only writer, and other threads can only observe:
// again using uniprocessor memory order: relaxed + signal_fence
int32_t acquire_head_separate_RW_UP(void) {
atomic_signal_fence(memory_order_seq_cst);
int32_t old_h = atomic_load_explicit(&_head, memory_order_relaxed);
int32_t new_h = (old_h + 1) & (FIFO_LEN - 1);
atomic_store_explicit(&_head, new_h, memory_order_relaxed);
atomic_signal_fence(memory_order_seq_cst);
return new_h;
}
acquire_head_separate_RW_UP:
ldr r3, .L7
ldr r0, [r3] ## Plain atomic load
adds r0, r0, #1
ubfx r0, r0, #0, #10 ## zero-extend low 10 bits
str r0, [r3] ## Plain atomic store
bx lr
This is the same asm we'd get for non-atomic head.

Your code is written in a very not "bare metal" way. Those "general" atomic functions do not know if the value read or stored is located in the internal memory or maybe it is a hardware register located somewhere far from the core and connected via buses and sometimes write/read buffers.
That is the reason why the generic atomic function has to place so many DMB instructions. Because you read or write the internal memory location they are not needed at all (M4 does not have any internal cache so this kind of strong precautions are not needed as well)
IMO it is just enough to disable the interrupts when you want to access the memory location the atomic way.
PS the stdatomic is in a very rare use in the bare metal uC development.
The fastest awy to guarantee the exclusive access on M4 uC is to disable and enable the interrupts.
__disable_irq();
x++;
__enable_irq();
71 __ASM volatile ("cpsid i" : : : "memory");
080053e8: cpsid i
79 x++;
080053ea: ldr r2, [pc, #160] ; (0x800548c <main+168>)
080053ec: ldrb r3, [r2, #0]
080053ee: adds r3, #1
080053f0: strb r3, [r2, #0]
60 __ASM volatile ("cpsie i" : : : "memory");
which will cost only 2 or 4 additional clocks for both instructions.
It guarantees the atomicity and does not provide unnecessary overhead

dmb is required in situations like
p1:
str r5, [r1]
str r0, [r2]
and
p2:
wait([r2] == 0)
ldr r5, [r1]
(from http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf, section 6.2.1 "Weakly-Ordered Message Passing problem").
In-CPUI optimizations can reorder the instructions on p1 so you have to insert an dmb betweeen both stores.
In your example, there are too much dmb which is probably caused by expanding atomic_xxx() which might have dmb both at start and end.
In should be enough to have
acquire_head:
ldr r2, .L8
dmb ish
.L2:
// int32_t old_h = atomic_load(&_head);
ldr r1, [r2]
...
bne .L5
.L6:
bne .L2
dmb ish
bx lr
and no other dmb between.
Performance impact is difficult to estimate (you would have to benchmark code with and without dmb). dmb does not consume cpu cycles; it just stops pipelining within the cpu.

How to cause data abort and prefectch abort in arm v7

I would like to create a data abort and pre-fetch abort to test whether the exception handlers for the same are getting called properly or not. As per my understanding that dereferencing a NULL pointer can cause data abort. But I am not getting how to create a pre-fetch abort for testing. I am working on armv7a. I am not using any OS, working on the boot code.

both of these are bus faults, jumping into unknown code is going to result in an undefined instruction not prefetch abort. I would focus on unaigned accesses do an ldr with bit 1 set and do a bx with bit 1 set (but bit 0 not set) would be where I start. If those dont work it might not be possible with a test fixture inside the chip.
There may be portions of the address space that dont respond, those should simply hang the processor, but you might get lucky and the memory controller returns a fault.
if you have parity or ecc in your system those would be the best way, assuming you have a way to inject an error into those memories to force a parity or ecc fault (also assuming that the memory controller, etc for that design (little to no logic that is relevant to your question is part of the ARM processor) returns a fault on a parity or ecc error).
cortex-m might fault on some address spaces as they to some extent dictate where you are supposed to go.
If one of the newer cores you can use the mmus protection and I dont know if that returns a data/prefetch fault or not, setup the mmu such that some space is at a different access level than the code you are going to hit it with and see what fault you get.
EDIT
Have to look for an armv7 but on an armv6 (pi 1 for example) if I enable alignment checking in the control register, and do an ldr of say address 0x1001 which is an alignment fault then it gives me a data abort.
save a line of code and use address 0x01
mrc p15, 0, r0, c1, c0, 0
orr r0,#2
mcr p15, 0, r0, c1, c0, 0
mov r0,#0x1
ldr r0,[r0]
jumping into an invalid instruction causes an undefined not a prefetch abort, the memory system has to assert the abort, so you can use the mmu for this most likely,
undefined exception
.globl TEST
TEST:
.word 0xFFFFFFFF
bx lr
easiest way to make a prefetch abort (as stated in the ARM ARM).
.globl TEST
TEST:
bkpt

For a data abort you can just attempt to read an unmapped or non-readable memory region. For example try reading/writing a NULL pointer.
For prefetch abort just try jumping to a random address in some unmapped, non-readable or non-executable region (do you have MMU enabled at this stage?):
mov r0, #0xFFFFFFFF ; Some address that is satisfying the above
push r0 ; push it to the stack
pop pc ; jump to that address
Note, that just jumping to a random address might result in an Undefined abort instead, as it might be executable but contain an unknown instruction.

Why do we require two memory barriers in a postbox data communication between two cores?

Here we have a code of postbox code for data communication between two ARM cores (directly referred from the ARM Cortex A Series Programming Guide).
Core A:
STR R0, [Msg] # write some new data into postbox
STR R1, [Flag] # new data is ready to read
Core B:
Poll_loop:
LDR R1, [Flag]
CMP R1,#0 # is the flag set yet?
BEQ Poll_loop
LDR R0, [Msg] # read new data.
In order to enforce dependency, the document says that we need to insert not one, but two memory barriers, DMB, into the code.
Core A:
STR R0, [Msg] # write some new data into postbox
DMB
STR R1, [Flag] # new data is ready to read
Core B:
Poll_loop:
LDR R1, [Flag]
CMP R1,#0 # is the flag set yet?
BEQ Poll_loop
DMB
LDR R0, [Msg] # read new data.
I understand the first DMB in the Core A: it prevents compile reordering and also the memory access to [Msg] variable be observed by the system. Below is the definition of the DMB from the same document.
Data Memory Barrier (DMB)
This instruction ensures that all memory
accesses in program order before the barrier are observed in the
system before any explicit memory accesses that appear in program
order after the barrier. It does not affect the ordering of any other
instructions executing on the core, or of instruction fetches.
However, I am not sure why the DMB in the Core B is used. In the document it says:
Core B requires a DMB before the LDR R0, [Msg] to be sure that the
message is not read until the flag is set.
If the DMB in the Core A makes the store to the [Msg] be observed to the system, then we should not need the DMB in the second core. My guess is, the compiler might do a reordering of reading [Flag] and [Msg] in the Core B (though I do not understand why it should do this since the read on [Msg] is dependent on [Flag]).
If this is the case, a compile barrier (asm volatile("" ::: "memory) instead of DMB should be enough. Do I miss something here?

Both barriers are necessary, and do need to be dmbs - this is still about the hardware memory model, and nothing to do with compiler reordering.
Let's look at the writer on core A first:
STR R0, [Msg] # write some new data into postbox
STR R1, [Flag] # new data is ready to read
Since these are two independent stores to different addresses with no dependency between them, there is nothing to force core A to actually issue the stores in program order. The store to Msg could, say, linger in a part-filled write buffer whilst the store to Flag overtakes it and goes straight out to the memory system. Thus any observer other than core A could see the new value of Flag, without yet seeing the new value of Msg.
STR R0, [Msg] # write some new data into postbox
DMB
STR R1, [Flag] # new data is ready to read
Now, with the barrier, the store to Flag is not permitted to be visible before the store to Msg, because that would necessitate one or other store appearing to cross the barrier. Thus any external observer may either see both old values, the new Msg but the old Flag, or both new values. The previous case of seeing the new Flag but the old Msg can no longer occur.
OK, so the first barrier handles things getting written in the correct order, but there's also the matter of how they are read. Over on core B...
Poll_loop:
LDR R1, [Flag]
CMP R1,#0 # is the flag set yet?
BEQ Poll_loop
LDR R0, [Msg] # read new data.
Note that the branch to Poll_loop does not form a control dependency between the two loads; if you consider program order, the load of Msg is unconditional, and the value of Flag does not affect whether it is executed or not, only whether execution ever progresses to that part of the program at all. Therefore the code could equivalently be written thus:
Poll_loop:
LDR R1, [Flag]
LDR R0, [Msg] # read data, just in case.
CMP R1,#0 # is the flag set yet?
BEQ Poll_loop # no? OK, throw away that data and read everything again.
... # do stuff with R0, because Flag was set so it must be good data, right?
Start to see the problem? Even with the original code, core B is free to speculatively load Msg as soon as it reaches Poll_loop, so even if the stores from core A become visible in program order, things could still play out like this:
core A | core B
-----------+-----------
| load Msg
store Msg |
store Flag |
| load Flag
| conclude that old Msg is valid
Thus you either need a barrier:
...
BEQ Poll_loop
DMB
LDR R0, [Msg] # read new data.
or perhaps a fake address dependency:
...
BEQ Poll_loop
EOR R1, R1, R1
LDR R0, [Msg, R1] # read new data.
To order the two loads against each other.

First, you're kind of mixing up compiler barriers and memory barriers. Compiler barriers prevent the compiler from moving instructions across that barrier in the final assembly. OTOH, memory barriers instruct the hardware to obey a certain ordering. Since you're already presenting assembly code, your question is really about hardware memory barriers and there is no compiler involved here.
The reason why you need a (read) memory barrier in Core B is that the core may reorder the message reading instruction wherever it wants, since
no, there is no data dependency between the reading of the flag and the reading of the message, at least not in the code above: the only information needed for reading Msg is its address and this is known at each point in time. You might want to argue that there is a control dependency. However, control dependencies do not impose any ordering constraints on memory reads.

why does uboot invalidate TLB s, icache, BP array at beginning

On arm platform, the u-boot will invalidate TLBs, icache and BP array at beginning, but what's the reason? Is it necessary?
cpu_init_crit:
/*
* Invalidate L1 I/D
*/
mov r0, #0 # set up for MCR
mcr p15, 0, r0, c8, c7, 0 # invalidate TLBs
mcr p15, 0, r0, c7, c5, 0 # invalidate icache
mcr p15, 0, r0, c7, c5, 6 # invalidate BP array
mcr p15, 0, r0, c7, c10, 4 # DSB
mcr p15, 0, r0, c7, c5, 4 # ISB

A reset may be hard or soft. It is possible for something to jump to a reset vector. If cache is enabled with stale entries, the code may crash resulting in a system not booting. This can be more common than you think as SDRAM may miss behave after a cold power on and mis-read causing a crash. Often a watchdog or unrecoverable faults will jump to the reset vector. Finally, there maybe u-boot systems in XIP NOR flash. Some systems/code may just jump to this code to perform a soft reset.
In all of these cases, the debug available is NIL. You may have a working system in 99.9999% of the cases. Having to trouble shoot a boot failure on particular hardware which only occurs in a hot bar or an ice cream freezer might make you enjoy this extra code. Ie, there are rare circumstances where it is needed.
Hardware failures are much more common as supply rails come on line. Datasheets describe properly behaving systems with power/temperature, etc in range. u-boot may not operate in this ideal world. If your concern is boot time, you are much better to write your own loader (keep this code though) or optimize things like BSS clearing, etc.

On A15/A7/A12 you do not need to do cache/mmu/btb invalidation on post-reset, so this is done just for "paranoia" reasons. However, there might be cores which do not perform auto-invalidate on reset so this code just makes sure that the same behavior is preserved across different core types.

This is necessary in the cases where the invalidation is not done in hardware on a reset (cold or warm). Another more practical reason is that U-Boot is typically not the first piece of code that runs on the hardware. In most cases there would be the ROM code that runs before U-Boot and to avoid making any assumptions regarding the state of the hardware when U-Boot got control of the system it's safer to always try and get the hardware to a known state and proceed from there.

ARM Bootloader: Disable MMU and Caches

According to some tutorials, we will disable MMU and I/D-Caches at the beginning of bootlaoder. If I understand correctly, it aims to use the physical address directly in the program, so please correct me if I'm wrong. Thank you!
Secondly, we do this to disable MMU and Caches:
mrc P15, 0, R0, C1, C0, 0
bic R0, R0, #0x00002300 # clear bits 13, 9:8
bic R0, R0, #0x00000087 # clear bits 7, 2:0
orr R0, R0, #0x00000002 # set bit 2 (A) Align
orr R0, R0, #0x00001000 # set bit 12 (I) I-Cache
mcr P15, 0, R0, C1, C0, 0
D-Cache, MMU and Data Address Alignment Fault Checking have been disabled by clear bits 2:0, but why we enable bit 2 immediately in the following instrument? To make sure this manipulation is valid?
Last question is why D-cache is disabled but I-caches is able? To speed up instrument process?

Last question is why D-cache is disabled but I-caches is able? To speed up instrument process?
The MMU has settings to determine which memory regions are cacheable or not. If you do not have the mmu on but you have the data cache on (if possible) then you cannot safely talk to peripherals. if you read the uart status register for example that goes through the cache just like any other data operation, whatever that status is stays in the cache for subsequent reads until such time as that cache line is evicted and you get one more shot at the actual register. Lets say for example you have some code that polls the uart status register waiting for a character in the rx buffer. If that first read shows there is no character, that status goes in the cache, you will remain in the loop forever since you will never get to talk to the status register again you will simply get the cached copy of the register. if there was a character in there then that status also gets cached, you read the rx register, and perhaps do something, if when you come back again if the status has not been evicted from the data cache then you get the stale status which shows there is a character, you rx buffer read may or may not also be cached so you may get the stale value in the cache, you may get a stale value or whatever the peripheral does when you read and there is no new value or you might get a new value, but what you dont get in these situations is proper access to the peripheral. When the mmu is on, you use the mmu to mark the address space used by that peripheral as non-(data)-cacheable, and you dont have this problem. With the mmu off you need the data cache off for arm systems.
Leaving the I-cache on is okay because instruction fetches only read instructions...Well for a bare metal application that is okay, it helps for example if you are using a flash that has a potential for read disturb (spi or i2c flashes). The problem is this application is a bootloader, so you must take some extra care. For example your bootloader has some code at address 0x8000 that it runs through at least once, then you choose to use it as a bootloader, the bootloader might be at say address 0x10000000 allowing you to load a new program at 0x8000, this load uses data accesses so it does not go through the instruction cache. So there is a potential that the instruction cache has some or all of the code from the last time you were in the 0x8000 area, and when you branch to the bootloaded code at 0x8000 you will get either the old program from cache or a nasty mixture of old program and new program for the parts that are cached and not cached. So if your bootloader allows for the i-cache to be on, you need to invalidate the cache before branching to bootloaded code.
Lastly, if you or anyone using this bootloader wants to use jtag, then you have that same problem but worse, data cycles that do not go through the i-cache are used to write the new program to ram, when you tell the jtag debugger to then run the new program you will get 1) only the new program, 2) a mixture of the new program and old program fragments from cache 3) the old program from cache.
So d-cache is bad without an mmu because of things that are not in ram, peripherals, etc. The i-cache is a use at your own risk kind of thing which you can mitigate except for the times that jtag is used for debugging.
If you have concerns or have confirmed read-disturb in your (external) flash, then I recommend turn on the i-cache, use a tight loop to copy your application to ram, branch to the ram copy and run there, turn off the i-cache (or use at your own risk) and dont touch the flash again, certainly not heavy read accesses to small areas. A tight uart polling loop like you might have for a command line parser, is a really good place to get hit with read-disturb.

You did not specified on which ARM you are working. Capabilities may vary from one ARM to an other (there is a huge gap between an ARM9 and an ARM Cortex A15).
In the given code, bit 2 is cleared and then set, but it does not matter, as those changes are done in R0. There is no change in the ARM behavior until the write in CP15 register (done by the instruction mcr P15, 0, R0, C1, C0, 0).
Concerning d-cache/i-cache enabling, it is only a matter of choice, there is no requirement. On the products I work on, the bootloader enables L1 I-cache, D-cache, L2 cache, and MMU (and it disables all that stuff before jumping on Linux). Be sure to follow ARM documentations about cache invalidation and memory barriers (according to your actual ARM Core) if you use cache and MMU in your bootloader.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight