TLB flushing within a given virtual address range - arm

I know that I can flush a TLB entry for a given virtual address as follows in ARMv7 (VMSA):
mcr p15, 4, $VA, c8, c7, 1 ; TLBIMVAH
I've failed to find a single instruction that can flush the TLB entries for a range of virtual addresses (e.g., from A to B). All I can do is loop over the range and issue the above instruction once per page.
My question is: is there an efficient method or a single instruction that flushes a given range of virtual addresses?
And, just out of curiosity, if there is no such instruction, could you tell me which constraints make it impossible?
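As a sketch of the loop approach, here it is in C with GCC-style inline assembly, assuming 4 KiB pages; it keeps the Hyp-mode TLBIMVAH from above (a non-Hyp kernel would use TLBIMVA, i.e. mcr p15, 0, ...), and the function name is just illustrative:

static inline void tlb_flush_range(unsigned long start, unsigned long end)
{
    unsigned long va;
    /* one TLBIMVAH per 4 KiB page in [start, end) */
    for (va = start & ~0xFFFUL; va < end; va += 0x1000)
        __asm__ volatile("mcr p15, 4, %0, c8, c7, 1" : : "r"(va) : "memory");
    /* make the invalidations visible before continuing */
    __asm__ volatile("dsb\n\tisb" : : : "memory");
}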

Related

Are memory mapped registers separate registers on the bus?

I will use the TM4C123 Arm Microcontroller/Board as an example.
Most of its I/O registers are memory mapped, so you can get/set their values using regular memory load/store instructions.
My question is: is there some type of register outside the CPU, somewhere on the bus, that is mapped to memory, so that reading/writing through the memory region keeps duplicate values (one in the register and one in memory)? Or is the memory itself the register?
There are many buses, even in an MCU. Bus after bus after bus, splitting off like branches in a tree (sometimes even merging, unlike a tree).
It may predate the Intel/Motorola battle, but certainly in that time frame you had segmented vs flat addressing, and you had I/O-mapped I/O vs memory-mapped I/O, since Motorola (and others) did not have a separate I/O bus (well, one extra... address... signal).
Look at the ARM architecture documents and the chip documentation (ARM makes IP, not chips). You have load and store instructions that operate on addresses. The documentation for the chip (and, to some extent, the rules ARM provides for the Cortex-M address space) gives a long list of addresses for things. As a programmer you simply line up the addresses you do loads and stores with, and the right instructions.
Someone's marketing may still care about terms like memory-mapped I/O, and because Intel x86 still exists (how????), some folks will continue to carry those terms. As a programmer, they are, number one, just bits that go into registers, and for single instructions here and there those bits are addresses. If you want to add adjectives to that, go for it.
If the address you are using, based on the chip and core documentation, points at an SRAM, then that is a read or write of memory. If it is a flash address, then that is the flash. The UART, the UART. Timer 5, then timer 5 control and status registers. Etc.
There are addresses in these MCUs that point at two or three things, but not at the same time (address 0x00000000 and some number of kbytes after that). And this overlap, at least in many of these Cortex-M MCUs, does not mix "memory" and peripherals (I/O); instead, these are special address spaces that let you boot the chip and run some code. With these Cortex-Ms I do not think you can even use the MPU (the sort-of MMU) to mix these spaces. But definitely, on full-sized ARMs and other architectures, you can use a full-blown MMU to mix up the address spaces and have a virtual address space that lands on a physical address space of a different type.
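To make the "just loads and stores" point concrete, here is a minimal C sketch of touching a memory-mapped register; the address is assumed from the TM4C123 datasheet (GPIO Port F data register) and should be checked against your part's memory map:

#include <stdint.h>

#define GPIO_PORTF_DATA_R (*((volatile uint32_t *)0x400253FC))

void toggle_pf1(void)
{
    /* This compiles to ordinary LDR/STR instructions; the bus fabric
       routes them to the GPIO block purely because of the address.
       There is no duplicate copy in "memory": the peripheral register
       is the storage that this address decodes to. */
    GPIO_PORTF_DATA_R ^= 0x02;
}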

AArch64: Load-Acquire Exclusive vs Load Exclusive

What is the difference between the LDAXR and LDXR instructions in the AArch64 instruction set?
From the reference manual they look totally the same (except for the word 'acquire'):
LDAXR - Load-Acquire Exclusive Register: loads word from memory addressed by base to Wt. Records the physical address as an exclusive access.
LDXR - Load Exclusive Register: loads a word from memory addressed by base to Wt. Records the physical address as an exclusive access.
Thanks
In the simplest terms, LDAXR behaves like LDXR plus a one-way barrier: it adds acquire semantics, so later memory accesses cannot be reordered before the load (roughly LDXR followed by a DMB, but cheaper, since a full DMB also orders earlier accesses).
This is the description which I find for LDAXR:
C6.2.104 LDAXR
Load-Acquire Exclusive Register derives an address from a base
register value, loads a 32-bit word or 64-bit doubleword from memory,
and writes it to a register. The memory access is atomic. The PE marks
the physical address being accessed as an exclusive access. This
exclusive access mark is checked by Store Exclusive instructions. See
Synchronization and semaphores on page B2-135. The instruction also
has memory ordering semantics as described in Load-Acquire,
Load-AcquirePC, and Store-Release on page B2-108. For information
about memory accesses see Load/Store addressing modes on page C1-157.
From section K11.3 of DDI 0487D.a:
The ARMv8 architecture adds the acquire and release semantics to
Load-Exclusive and Store-Exclusive instructions, which allows them to
gain ordering acquire and/or release semantics. The Load-Exclusive
instruction can be specified to have acquire semantics, and the
Store-Exclusive instruction can be specified to have release
semantics. These can be arbitrarily combined to allow the atomic
update created by a successful Load-Exclusive and Store-Exclusive pair
to have any of:
No Ordering semantics (using LDREX and STREX).
Acquire only semantics (using LDAEX and STREX).
Release only semantics (using LDREX and STLEX).
Sequentially consistent semantics (using LDAEX and STLEX).
Also (B2.3.5),
The basic principle of a Load-Acquire instruction is to introduce
order between the memory access generated by the Load-Acquire
instruction and the memory accesses appearing in program order after
the Load-Acquire instruction, such that the memory access generated by
the Load-Acquire instruction is Observed-by each PE, to the extent
that that PE is required to observe the access coherently, before any
of the memory accesses appearing in program order after the
Load-Acquire instruction are Observed-by that PE, to the extent that
the PE is required to observe the accesses coherently.
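To illustrate why the acquire form matters in practice, here is a minimal sketch of a spinlock acquire using LDAXR/STXR, assuming GCC/Clang inline assembly on AArch64; the acquire semantics already order the critical section after the lock load, so no separate DMB is needed:

#include <stdint.h>

static inline void spin_lock(volatile uint32_t *lock)
{
    uint32_t tmp, status;
    __asm__ volatile(
        "1: ldaxr %w0, [%2]      \n"  /* load-acquire exclusive        */
        "   cbnz  %w0, 1b        \n"  /* lock already held? spin       */
        "   mov   %w0, #1        \n"
        "   stxr  %w1, %w0, [%2] \n"  /* try to claim the lock         */
        "   cbnz  %w1, 1b        \n"  /* reservation lost? start over  */
        : "=&r"(tmp), "=&r"(status)
        : "r"(lock)
        : "memory");
}

The matching unlock would be a plain store-release (STLR of 0), so the acquire side orders against the release side, as in the combinations listed in the quote above.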

LDREX/STREX with Cortex M3 and M4

I was reading up on LDREX and STREX to implement mutexes, looking at the ARM reference manual:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100166_0001_00_en/ric1417175928887.html
It appears that the LDREX/STREX address-tracking granularity is the entire memory space, hence you can have at most one outstanding exclusive reservation, on a single 32-bit word.
Is this correct, or am I missing something? If so, it makes LDREX/STREX seem very limited. I mean, you could do a bit-mapped mutex and maybe get 32 mutexes.
Does anyone use the LDREX/STREX on a M3 or M4 and if so how do they use it?
So I contacted ARM and got some more information. For example, consider this sequence:
LDREX address1
LDREX address2
STREX address1
The STREX to address1 would pass even though the most recent LDREX was not for address1. This is correct, because the LDREX/STREX address resolution is the entire memory space.
So I was worried about the following: suppose there are two tasks, the first one gets interrupted after its LDREX, then the second task gets interrupted after its own LDREX to address2, and then the first task gets the processor back and tries its STREX; that would cause a problem. However, it appears that ARM clears the exclusive reservation (as by CLREX) on every exception/interrupt entry and exit. Therefore the STREX would fail, since the tasks can only be preempted by an interrupt. That is, if any interrupt occurs between LDREX and STREX, the STREX will fail. So you want to keep the code between LDREX and STREX as small as possible, to reduce the chance of an interrupt. Additionally, if the STREX fails, you most likely want to retry the LDREX/STREX sequence once or twice before giving up.
Again this is for a single core M3/M4/M7.
Note: the only place I found a reference to the exclusive reservation being cleared on exceptions was the ARMv7-M Architecture Reference Manual, section A3.4.4 Context switch support. That document is much better than anything I found online at describing how LDREX/STREX actually work.
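For a single-core M3/M4, a minimal try-lock sketch using the CMSIS __LDREXW/__STREXW intrinsics (names from CMSIS-Core; a device header is assumed to provide them) looks like this; note the short LDREX-to-STREX window and the retry loop:

#include <stdint.h>
#include "cmsis_gcc.h"   /* assumed: or whatever CMSIS header your device pulls in */

int mutex_lock(volatile uint32_t *mutex, int retries)
{
    while (retries-- > 0) {
        if (__LDREXW(mutex) != 0) {
            __CLREX();               /* already held: drop the reservation */
            return 0;
        }
        /* __STREXW returns 0 on success, 1 if the reservation was lost
           (e.g. an interrupt fired between the LDREX and the STREX) */
        if (__STREXW(1, mutex) == 0) {
            __DMB();                 /* order the locked accesses after the lock */
            return 1;
        }
    }
    return 0;
}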

Which data bus is used after physical remap to RAM in STM32F4?

STM32F4 controllers (with the ARM Cortex-M4 CPU) allow a so-called physical remap of the lowest addresses in the memory space (0x00000000 to 0x03FFFFFF) using the SYSCFG_MEMRMP register. What I do understand is that the register selects which memory (flash/RAM/etc.) is aliased to the lowest addresses, and therefore from which memory the reset vector and initial stack pointer are fetched after reset.
The documentation [1] also mentions that
In remap mode, the CPU can access the external memory via ICode bus
instead of System bus which boosts up the performance.
This means that after a remap, e.g. to RAM, an instruction fetch from within the alias address space (0x00000000 to 0x03FFFFFF) will use the ICode bus.
Now my question: After such a remap operation e.g. to RAM, will a fetch to the non-aliased location of the RAM use the system bus or the ICode bus?
The background of the question is that I want to write a linker script for an image executing from RAM only (under control of a debugger). To which memory area should the .text section go? The alias space or the physical space?
[1] ST DocID018909 Rev 7
Thanks to Sean I could find the answer in the ARM® Cortex®‑M4 Processor Technical Reference Manual section 2.3.1 Bus interfaces:
ICode memory interface: Instruction fetches from Code memory space,
0x00000000 to 0x1FFFFFFC, are performed over the [sic!: this] 32-bit AHB-Lite bus.
DCode memory interface: Data and debug accesses to Code memory space,
0x00000000 to 0x1FFFFFFF, are performed over the [sic!: this] 32-bit AHB-Lite bus.
System interface: Instruction fetches and data and debug accesses to
address ranges 0x20000000 to 0xDFFFFFFF and 0xE0100000 to 0xFFFFFFFF
are performed over the [sic!: this] 32-bit AHB-Lite bus.
This also makes clear that the flash memory of STM32F4 MCUs, located at 0x08000000, is always accessed (by the CPU core) over the ICode/DCode buses, regardless of whether it is remapped, because both the original location and the remapped location lie within the code memory space (0x00000000 to 0x1FFFFFFF).
However, if the code is located in SRAM at 0x20000000, then access to the remapped location at 0x00000000 uses the ICode/DCode buses, while access to the original location (outside the code memory space) uses the System bus.
The choice of bus interface on the core depends on the addresses accessed. If you access an instruction at 0x00000004, this is done on the ICode bus. An access to 0x20000004 is done using the System bus.
What the REMAP function does is change the physical memory system so that an access to 0x00000004 (ICode bus) will use the same RAM as you can also access on the system bus. Any access to 0x20000004 will be unaffected, and still be generated on the System bus by the core.
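For completeness, a minimal C sketch of triggering the SRAM remap from code; the SYSCFG base address and MEM_MODE encoding are assumptions taken from the STM32F4 reference manual (RM0090), and the SYSCFG clock must already be enabled in RCC:

#include <stdint.h>

#define SYSCFG_MEMRMP (*(volatile uint32_t *)0x40013800u)  /* assumed base */

void remap_sram_to_zero(void)
{
    /* MEM_MODE = 0b11: embedded SRAM aliased at 0x00000000. Afterwards,
       fetches from the alias region use the ICode bus, while fetches
       from 0x20000000 still go over the System bus. */
    SYSCFG_MEMRMP = (SYSCFG_MEMRMP & ~0x3u) | 0x3u;
}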

How to debug an aarch64 translation fault?

I am writing a simple kernel in armv8 (aarch64).
MMU config:
48 VA bits (T1SZ=64-48=16)
4K page size
All physical RAM flat mapped into kernel virtual memory (on TTBR1_EL1)
(The MMU is active with TTBR0_EL1=0, so I'm only using high addresses of the form 0xffff..., all flat-mapped to physical memory.)
I'm mapping a new address space (starting at 1<<40) to some free physical region. When I try to access address 1<<40, I get an exception (of type "EL1 using SP1, synchronous"):
ESR_EL1=0x96000044
FAR_EL1=0xffff010000000000
Inspecting other registers, I have:
TTBR1_EL1=0x82000000
TTBR1_EL1[2]=0x0000000082003003
So, based on ARM Architecture Reference Manual for ARMv8 (ARMv8-A profile):
ESR (Exception Syndrome Register) decodes as: Exception Class = 0b100101 (data abort without a change in exception level), pages D7-1933 ff.; WnR = 1 (the faulting instruction is a write); DFSC = 0b000100 (translation fault at level 0), page D7-1958.
FAR_EL1 is the faulting address; it indicates that TTBR1_EL1 is used (since the high bits are all 1). The top 9 bits of the 48-bit VA are 0b000000010, which means entry 2 of the level 0 table is used.
Entry 2 in that table points to a next-level table (low bits 0b11) at physical address 0x82003000.
So, translation fails at level 0, where it should not.
My question is: am I doing something wrong? Am I missing some info that could lead to the translation fault? And, more generally, how does one debug a translation fault?
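For reference, the decoding above can be reproduced in a few lines of C (field positions from the ARMv8 ARM; a sketch, not a full decoder):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t esr  = 0x96000044;          /* value reported above */
    unsigned ec   = (esr >> 26) & 0x3F;  /* Exception Class */
    unsigned wnr  = (esr >> 6)  & 0x1;   /* 1 = fault was on a write */
    unsigned dfsc = esr & 0x3F;          /* Data Fault Status Code */
    printf("EC=0x%02x WnR=%u DFSC=0x%02x\n", ec, wnr, dfsc);
    /* prints EC=0x25 WnR=1 DFSC=0x04: data abort without a change in
       exception level, on a write, translation fault at level 0 */
    return 0;
}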
Update:
Everything works when I write to the tables before enabling the MMU.
Whenever I write to the tables AFTER enabling the MMU (via the flat-mapped table region), the mapping never works. I wonder why this happens.
I also tried writing to the selected tables manually (to remove any side effect from my mapping function): same result (writes done before the MMU is on work; those done after fail).
I tried issuing tlbi and dsb sy instructions, followed by isb, without effect. Only one CPU is running at this time, so caching should not be a problem: write instructions and the MMU talk to the same caches (but I will test that next).
I had overlooked caching issues within a single core. The problem was that, after turning the MMU on, the CPU and the table walk unit didn't have the same view of memory. The ARMv8 Cortex-A Programming Guide states that the cache has to be cleaned/invalidated to the Point of Unification (the point at which a single core's instruction fetches, data accesses, and table walks see the same copy) after modifying translation tables.
Two possibilities could explain this behavior (I don't fully understand how the caches work yet):
First possibility: the MMU does not have the required address in its internal walk cache.
In this case, when updating regular data and making it available to another core's L1, the dsb instruction simply waits for all cores to reach a synchronized state (thanks to the coherency network): the other cores know the line has to be updated, and when they try to access it, it is updated in L2 or migrated from the previous core's L1 to theirs.
This does not happen with the MMU (it does not participate in coherency), so it still sees the old value in L2.
However, if this were the case, the same thing should happen before the MMU is turned on (because caching is activated well before that), unless all memory is treated as L1-non-cacheable before the MMU is activated (which is possible; I'll have to double-check).
A minimal fix might be to change the caching policy for the table pages, but the cache maintenance is still necessary to clear possible stale values from the MMU.
Second possibility: in all the cases tested, the MMU already had the faulting address in its internal walk cache, which is not coherent with the data L1 or L2.
In that case, only an explicit invalidate can evict the old line from the MMU's cache. Before the MMU is turned on, that cache contains nothing, so it never sees the old value (0), only the new one.
I still think this case is unlikely, because I tested many configurations, and sometimes the offset between previously mapped memory (for example, entry 0 in the level 1 table) and newly mapped memory (for example, entry 128 in the same table) was 1024 bytes, which is more than any cache line size.
So I'm still not sure what exactly causes the problem, but cleaning/invalidating all the updated addresses works.
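A minimal sketch of the per-entry maintenance sequence that matches this fix, assuming GCC-style inline assembly at EL1 and following the Programming Guide's advice to clean to the Point of Unification after each table update:

static inline void sync_table_entry(void *entry)
{
    __asm__ volatile(
        "dc cvau, %0   \n\t"  /* clean the entry's cache line to the PoU */
        "dsb ish       \n\t"  /* wait for the clean to complete          */
        "tlbi vmalle1  \n\t"  /* drop stale TLB and walk-cache entries   */
        "dsb ish       \n\t"
        "isb           \n\t"
        :
        : "r"(entry)
        : "memory");
}

Invalidating all EL1 translations with vmalle1 is a blunt instrument; a targeted tlbi by VA would also work once the table lines are clean.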
