ARM11 Translation Lookaside Buffer (TLB) usage? - arm

Is there a decent guide explaining how to use the TLB (Translation Lookaside Buffers) tables on an ARM1176JZF-S core?
Having looked over the technical documentation for the that ARM platform I still have no clue what a TLB is or what it looks like. As far as I understand, each TLB entry maps a virtual page to a physical page, allowing remapping and controlling memory permissions.
Apart from that, I have absolutely no clue on how to use them.
What structure does a TLB entry have? How do I create new entries?
How do I handle VM in context switches for user-space threads? How do I ensure that those threads can only access specific pages assigned to their parent processes (enforce memory protection)? Do I save the TLB state for each context?
Why are there two TLBs? What can I use the MicroTLB for if it can only have 10 entries? Surely, I need more than 10.
It says that one of the parts of the main TLB is "a fully-associative array of eight elements, that is lockable". What is that? Do I only get to have 8 entries for the Main TLB?
Thank you in advance. I'll be really glad if someone provides an explanation of what TLBs are. I'm currently working on a memory mapper for my kernel, and I've pretty much hit a dead end.

The technical reference manual for ARM1176JZF-S appears to be DDI 0301. That document contains all the specific details for that specific ARM core.
I still have no clue what a TLB is or what it looks like. As far as I understand, each TLB entry maps a virtual page to a physical page, allowing remapping and controlling memory permissions.
A TLB is a cache of the page table. Some processors allow direct access to the TLB, while knowing nothing about page tables (e.g: MIPS), while others know about page tables, and internally use TLBs that the programmer mostly doesn't see (e.g: x86). In this case, the TLB is managed by hardware, and the system programmer only has to care to make the TTB (Translation Table Base) registers point to the page table, and invalidate the TLB in apropriate places.
What structure does a TLB entry have? How do I create new entries?
Done by hardware. On a TLB miss, the MMU walks the page table and fills the TLB from there.
How do I handle VM in context switches for user-space threads?
Some platforms have TLBs that simply map virtual addresses to physical addresses (e.g: x86). On these platforms, you have to do a full TLB flush on each context switch. Other platforms (MIPS, this specific ARM core) map (ASID, virtual address) pairs to physical addresses. An ASID is an Application-Specific Identifier, i.e: an identifier for a process. The MMU uses a register to know which ASID to use (I think it's the Context ID register in this case). Since there may be more processes than ASIDs, occasionally you may need to recycle an ASID (assigning it to a different process) and do a TLB flush (that's what the Invalidate TLB by ASID operation is for).
Why are there two TLBs? What can I use the MicroTLB for if it can only have 10 entries? Surely, I need more than 10.
This is exactly for the same reason you have small separate level-1 caches for instructions and data. Since they are caches, you don't need more than 10 (though having more could improve performance).
It says that one of the parts of the main TLB is "a fully-associative array of eight elements, that is lockable". What is that? Do I only get to have 8 entries for the Main TLB?
Some memory pages (e.g: some portions of the kernel) are accessed very often. It makes sense to lock them, so they don't get thrown off of the TLB. Also, on realtime systems, a TLB miss or a cache miss may introduce some unwanted unpredictability. So, there is an option to lock a number of TLB entries. The main TLB has more entries, but only those 8 are lockable.

Related

How to debug an aarch64 translation fault?

I am writing a simple kernel in armv8 (aarch64).
MMU config:
48 VA bits (T1SZ=64-48=16)
4K page size
All physical RAM flat mapped into kernel virtual memory (on TTBR1_EL1)
(MMU is active with TTBR0_EL1=0, so I'm only using addresses in 0xffff< addr >, all flat-mapped into physical memory)
I'm mapping a new address space (starting at 1<<40) to some free physical region. When I try to access address 1<<40, I get an exception (of type "EL1 using SP1, synchronous"):
ESR_EL1=0x96000044
FAR_EL1=0xffff010000000000
Inspecting other registers, I have:
TTBR1_EL1=0x82000000
TTBR1_EL1[2]=0x0000000082003003
So, based on ARM Architecture Reference Manual for ARMv8 (ARMv8-A profile):
ESR (exception syndrome register) translates into: Exception Class=100101 (Data abort without a change in exception level) on pages D7-1933 sq. ; WnR=1 (faulting instruction is a write) ; DFSC=0b000100 (translation fault at level 0) on page D7-1958 ;
FAR_EL1 is the faulting address ; it indicates TTBR1_EL1 is used (since high bits are all 1). The VA top 9 bits are 0b000000010, which indicate that entry 2 is used in the table ;
Entry 2 in the table indicates a next-level table (low bits 0b11) at physical address 0x82003000.
So, translation fails at level 0, where it should not.
My question is: am I doing something wrong? Am I missing some info that could lead to the translation fault? And, more generally, how to debug a translation fault ?
Update:
Everthing works when I write to tables before enabling the MMU.
Whenever I write to tables AFTER enabling the MMU (via flat-mapped table region), mapping never works. I wonder why this happens.
I also tried manually writing to the selected tables (to remove any side effect from my mmapping function): same result (when writes are done before MMU is on, it works; after, it fails).
I tried doing tlbi and dsb sy instructions, followed by isb, without effect. Only one CPU is running at this time so caching should not be a problem - write instructions and MMU talk to the same caches (but I will test it next).
I overlooked caching issues within a single core. The problem was that, after turning the MMU on, the CPU and table walk unit didn't have the same view of memory. ARMv8 Cortex-A Programming Guide states that cache has to be cleaned/invalidated to point of unification (same view for a single core) after modifying tables.
Two possibilities can explain this behavior (I don't fully understand how caches work yet):
First possibility: the MMU does not have the required address in its internal walk cache.
In this case, when updating regular data and making it available to other core's L1, the dsb instruction simply waits for all cores to have a synchronized state (thanks to coherency network): other cores will know that the line has to be updated, and when they try to access it, it gets updated to L2 or migrated from the previous core's L1 to their L1.
This does not happen with the MMU (no coherency participation), so it still sees the old value in L2.
However, if this were the case, the same thing should happen before the MMU is turned on (because caching is activated way before), except if all memory is considered L1-non-cacheable before MMU is activated (which is possible, I'll have to double check that).
A minimal way of fixing the problem may be to change caching policies for table pages, but the cache maintenance is still necessary to clear possible old values from the MMU.
Second possibility: in all cases tested, the MMU already has the faulting address in its internal walk cache, which is not coherent with data L1 or L2.
In that case, only an explicit invalidate can eject the old line from the MMU cache. Before the MMU is turned on, the cache contains nothing and never gets the old value (0), only the new one.
I still think that case is unlikely because I tested many cases, and sometimes the offset between previsouly mapped memory (for example, entry 0 in the level 1 table) and newly mapped memory (for example, entry 128 in the same level 1 table) was greater than the cache line size (in this case, 1024 bytes, which is more than any cache line size).
So, I'm still not sure what exactly causes the problem, but cleaning/invalidating all the updated addresses works.

Arm cortex a9 memory access

I want to know the sequence an ARM core (Cortex-A series processor) accesses memory? Right from Virtual Address generated by core to memory and Instruction/Data transferred from the memory to the core. Consider core has generated a virtual address for some data/instruction and there is a miss from TLBs, then how does address reach to main memory(DRAM if I am not wrong) and how does data comes to core through L2 and L1 caches.
What if required data/instruction is already in L1 cache?
What if required data/instruction is already in L2 cache?
I am confused regarding cache and MMU communications.
tl;dr - Whatever you want. The ARM is highly flexible and the SOC vendor and/or the system programmer may make the memory sub-systems do a great many different things depending on the end device features and needs.
First, the MMU has fields that explicitly dictate how the cache is to be used. I recommend reading Chapter 9 Caches and Chapter 10 Memory Management Unit of the Cortex-A Series Programmers Guide.
Some terms are,
PoC - point of coherency.
PoU - point of unification.
Strongly ordered.
Device
Normal
Many MMU properties and caching can be affected by different CP15 and configuration registers. For instance, an 'exclusive configuration' for data in the L1 cache is never in the L2 can make it particularly difficult to cleanly write self modifying code and other dynamic updates. So, even for a particular Cortex-A model, the system configuration may change things (write-back/write-through, write-allocate/no write-allocate, bufferable, non-cacheable, etc).
A typical sequence for general DDR core memory is,
Resolve virt -> phys
Micro TLB present? Yes, have `phys`
TLB present? Yes, have `phys`
Table walk. Have `phys` or fault.
Access marked cacheable? Yes do 2.1. No step 4.
In L1 cache? Yes 2b.
If read return data. If write fill data and mark drity (write back).
In L2 cache? Yes 3.1
If read return data. If write fill data and mark drity (write back).
Run physical cycle on AXI bus (may route to sub-bus).
What if required data/instruction is already in L1 cache?
What if required data/instruction is already in L2 cache?
For normal cases these are just cache hits. If it is a 'write-through' and 'write' then the value is updated in cache and written to memory. It it is 'write-back' the value is updated in cache and marked dirty.Note1 If it is a read, then the cache memory is used (in both case).
The system maybe set up completely differently for device memory (Ie, memory mapped USB registers, world shareable memory, multi-core/cpu buffers, etc). Often the setup will depend on system cost, performance and power consumption. Ie, a write-through cache is easier to implement (lower power and less cost) but often lower performance.
I am confused regarding cache and MMU communications.
Mainly the MMU will provide information for the caches to resolve an address. The MMU may say to use/not use the cache. It may tell the cache it can 'gang' writes together (write-bufferable), but should not store them indefinitely, etc. So many of the MMU specifiers can selectively alter the behavior of the cache. As the Cortex-A cache parameters are not defined (it is up to each SOC manufacturer), it is often the case that particular MMU bits may have alternate behavior on different systems.
Note1: The 'dirty cache' may have additional 'broadcasts' of exclusion monitor information for strex and ldrex type accesses.

PoU and PoC in cache maintenance operations in arm

When reading ARM arch. ref. manual v7, I've found two concepts; point of coherency (PoC) and point of unification (PoU).
For PoC, it looks like the point that all agents (i.e., CPU cores) can see the same copy of memory.
For PoU, it looks like the point that all agents (in this case, CPU cores and MMU) can see the same copy of memory.
I have several follow up questions:
Is my understanding correct?
If so, If I issue DCCMVAC (Data cache clean MVA to PoC) with giving MVA to 0x40000000, (and let say PoC happen to be 0x70000000),
all cache entries between VA of 0x40000000 and 0x70000000 are cleaned?
Then, if I issue DCCMVAC with MVA 0x0, all data cache entries are cleaned?
PoU sounds like that MMU itself has its own data caches (not TLB) for page table walk inside main memory. Is this correct?
According to ARM training materials:
The PoU (Point of Unification) for a processor is the point (physical location within the hardware) where the instruction and data caches and the translation table walks of the processor are guaranteed to see the same copy of a memory location. For example, a unified level 2 cache would be the point of unification in a system with Harvard level 1 caches and a TLB (to cache page table entries). If no external cache is present, main memory would be the Point of unification.
The PoC (Point of [system] Coherency) is the point at which all blocks (for example, CPUs, DSPs, or DMA engines) which can access memory, are guaranteed for a particular address to see the same copy of a memory location. Typically, this will be the main external system memory.
it's one old case, however, adding some comments in case of someone's search.
in my opinion, PoU and PoC are coined by ARM to define one level for cache maintenance. the definition of PoC and PoU is in ARM ARM specification, while its ARMv8 programming guide (not ARM spec) gives some diagram for better understanding: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch11s04.html
one point is ,under ARM V8 processor's implementation, Iside can snoop Dside, for example, if there is one Icache miss, it will check Dcache, so you could treat PoU as the level of L1 cache. while other ARMv8 processor may not have this behaviour.
back to the original questions:
2) DCCMVAC 0x40000000, it will do cache clean to PoC about this address, mostly one cache line
PoC is defined by SoC implementation, not by address.
3) considering Q2, DCCMVAC 0x0, only applies to one cache line.
if you want to clean and invalidate the whole cache, you need use by set/way to walkthrough the whole cache.
4) PoU has nothing to do with MMU.
MMU hardware block owns some buffers to save TLB entries, it's one common practice, as for pagetable, which is built by software as in the memory, normally it's defined as normal memory type, so it could be in the cache during setup by CPU instruction, or walk by MMU hardware walk unit.

Does Virtual Memory area struct only comes into picture when there is a page fault?

Virtual Memory is a quite complex topic for me. I am trying to understand it. Here is my understanding for a 32-bit system. Example RAM is just 2GB. I have tried reading many links, and I am not confident at the moment. I would like you people to help me in clearing up my concepts. Please acknowledge my points, and also please answer for what you feel is wrong. I have also a confused section in my points. So, here starts the summary.
Every process thinks it is only running. It can access the 4GB of memory - virtual address space.
When a process access a virtual address it is translated to physical address via MMU.
This MMU is a part of a CPU - a hardware.
When the MMU cannot translate the address to a physical one, it raises a page fault.
On page fault, the kernel is notified. The kernel check the VM area struct. If it can find it - may be on disk. It will do some page-in /page-out. And get this memory on the RAM.
Now MMU will again try and will succeed this time.
In case the kernel cannot find the address, it will raise a signal. For example, invalid access will raise a SIGSEGV.
Confused points.
Does Page table is maintained in Kernel? This VM area struct has a page table ?
How MMU cannot find the address in physical RAM. Let's say it translates to some wrong address in RAM. Still the code will execute, but it will be a bad address. How MMU ensures that it is reading a right data? Does it consult Kernel VM area everytime?
Is the Mapping table - virtual to physical is inside a MMU. I have read it that is maintained by an individual process. If it is inside a process, why I can't see it.
Or if it is MMU, how MMU generates the address - is it that Segment + 12-bit shift -> Page frame number, and then the addition of offset (bits -1 to 10) -> gives a physical address.
Does it mean that for a 32-bit architecture, with this calculation in my mind. I can determine the physical address from a virtual address.
cat /proc/pid_value/maps. This shows me the current mapping of the vmarea. Basically, it reads the Vmarea struct and prints it. That means that this is important. I am not able to fit this piece in the complete picture. When the program is executed does the vmarea struct is generated. Is VMAREA comes only into the picture when the MMU cannnot translate the address i.e. Page fault? When I print the vmarea it displays the address range , permission and mapped to file descriptor, and offset. I am sure this file descriptor is the one in the hard-disk and the offset is for that file.
The high-mem concept is that kernel cannot directly access the Memory region greater than 1 GB(approx). Thus, it needs a page table to indirectly map it. Thus, it will temporarily load some page table to map the address. Does HIGH MEM will come into the picture everytime. Because Userspace can directly translate the address via MMU. On what scenario, does kernel really want to access the High MEM. I believe the kernel drivers will mostly be using kmalloc. This is a direct memory + offset address. In this case no mapping is really required. So, the question is on what scenario a kernel needs to access the High Mem.
Does the processor specifically comes with the MMU support. Those who doesn't have MMU support cannot run LInux?
Does Page table is maintained in Kernel? This VM area struct has a page table ?
Yes. Not exactly: each process has a mm_struct, which contains a list of vm_area_struct's (which represent abstract, processor-independent memory regions, aka mappings), and a field called pgd, which is a pointer to the processor-specific page table (which contains the current state of each page: valid, readable, writable, dirty, ...).
The page table doesn't need to be complete, the OS can generate each part of it from the VMAs.
How MMU cannot find the address in physical RAM. Let's say it translates to some wrong address in RAM. Still the code will execute, but it will be a bad address. How MMU ensures that it is reading a right data? Does it consult Kernel VM area everytime?
The translation fails, e.g. because the page was marked as invalid, or a write access was attempted against a readonly page.
Is the Mapping table - virtual to physical is inside a MMU. I have read it that is maintained by an individual process. If it is inside a process, why I can't see it.
Or if it is MMU, how MMU generates the address - is it that Segment + 12-bit shift -> Page frame number, and then the addition of offset (bits -1 to 10) -> gives a physical address.
Does it mean that for a 32-bit architecture, with this calculation in my mind. I can determine the physical address from a virtual address.
There are two kinds of MMUs in common use. One of them only has a TLB (Translation Lookaside Buffer), which is a cache of the page table. When the TLB doesn't have a translation for an attempted access, a TLB miss is generated, the OS does a page table walk, and puts the translation in the TLB.
The other kind of MMU does the page table walk in hardware.
In any case, the OS maintains a page table per process, this maps Virtual Page Numbers to Physical Frame Numbers. This mapping can change at any moment, when a page is paged-in, the physical frame it is mapped to depends on the availability of free memory.
cat /proc/pid_value/maps. This shows me the current mapping of the vmarea. Basically, it reads the Vmarea struct and prints it. That means that this is important. I am not able to fit this piece in the complete picture. When the program is executed does the vmarea struct is generated. Is VMAREA comes only into the picture when the MMU cannnot translate the address i.e. Page fault? When I print the vmarea it displays the address range , permission and mapped to file descriptor, and offset. I am sure this file descriptor is the one in the hard-disk and the offset is for that file.
To a first approximation, yes. Beyond that, there are many reasons why the kernel may decide to fiddle with a process' memory, e.g: if there is memory pressure it may decide to page out some rarely used pages from some random process. User space can also manipulate the mappings via mmap(), execve() and other system calls.
The high-mem concept is that kernel cannot directly access the Memory region greater than 1 GB(approx). Thus, it needs a page table to indirectly map it. Thus, it will temporarily load some page table to map the address. Does HIGH MEM will come into the picture everytime. Because Userspace can directly translate the address via MMU. On what scenario, does kernel really want to access the High MEM. I believe the kernel drivers will mostly be using kmalloc. This is a direct memory + offset address. In this case no mapping is really required. So, the question is on what scenario a kernel needs to access the High Mem.
Totally unrelated to the other questions. In summary, high memory is a hack to be able to access lots of memory in a limited address space computer.
Basically, the kernel has a limited address space reserved to it (on x86, a typical user/kernel split is 3Gb/1Gb [processes can run in user space or kernel space. A process runs in kernel space when a syscall is invoked. To avoid having to switch the page table on every context-switch, on x86 typically the address space is split between user-space and kernel-space]). So the kernel can directly access up to ~1Gb of memory. To access more physical memory, there is some indirection involved, which is what high memory is all about.
Does the processor specifically comes with the MMU support. Those who doesn't have MMU support cannot run Linux?
Laptop/desktop processors come with an MMU. x86 supports paging since the 386.
Linux, specially the variant called µCLinux, supports processors without MMUs (!MMU). Many embedded systems (ADSL routers, ...) use processors without an MMU. There are some important restrictions, among them:
Some syscalls don't work at all: e.g fork().
Some syscalls work with restrictions and non-POSIX conforming behavior: e.g mmap()
The executable file format is different: e.g bFLT or ELF-FDPIC instead of ELF.
The stack cannot grow, and its size has to be set at link-time.
When a program is loaded first the kernel will setup a kernel VM-Area for that process is it? This Kernel VM Area actually holds where the program sections are there in the memory/HDD. Then the entire story of updating CR3 register, and page walkthrough or TLB comes into the picture right? So, whenever there is a pagefault - Kernel will update the page table by looking at Kernel virtual memory area is it? But they say Kernel VM area keeps updating. How this is possible, since cat /proc/pid_value/map will keep updating.The map won't be constant from start to end. SO, the real information is available in the Kernel VM area struct is it? This is the acutal information where the section of program lies, it could be HDD or physical memory -- RAM? So, this is filled during process loading is it, the first job? Kernel does the page in page out on page fault, and will update the Kernel VM area is it? So, it should also know the entire program location on the HDD for page-in / page out right? Please correct me here. This is in continuation to my first question of the previous comment.
When the kernel loads a program, it will setup several VMAs (mappings), according to the segments in the executable file (which on ELF files you can see with readelf --segments), which will be text/code segment, data segment, etc... During the lifetime of the program, additional mappings may be created by the dynamic/runtime linkers, by the memory allocator (malloc(), which may also extend the data segment via brk()), or directly by the program via mmap(),shm_open(), etc..
The VMAs contain the necessary information to generate the page table, e.g. they tell whether that memory is backed by a file or by swap (anonymous memory). So, yes, the kernel will update the page table by looking at the VMAs. The kernel will page in memory in response to page faults, and will page out memory in response to memory pressure.
Using x86 no PAE as an example:
On x86 with no PAE, a linear address can be split into 3 parts: the top 10 bits point to an entry in the page directory, the middle 10 bits point to an entry in the page table pointed to by the aforementioned page directory entry. The page table entry may contain a valid physical frame number: the top 22 bits of a physical address. The bottom 12 bits of the virtual address is an offset into the page that goes untranslated into the physical address.
Each time the kernel schedules a different process, the CR3 register is written to with a pointer to the page directory for the current process. Then, each time a memory access is made, the MMU tries to look for a translation cached in the TLB, if it doesn't find one, it looks for one doing a page table walk starting from CR3. If it still doesn't find one, a GPF fault is raised, the CPU switches to Ring 0 (kernel mode), and the kernel tries to find one in the VMAs.
Also, I believe this reading from CR, page directory->page-table->Page frame number-memory address this all done by MMU. Am I correct?
On x86, yes, the MMU does the page table walk. On other systems (e.g: MIPS), the MMU is little more than the TLB, and on TLB miss exceptions the kernel does the page table walk by software.
Though this is not going to be the best answer, iw ould like to share my thoughts on confused points.
1. Does Page table is maintained...
Yes. kernel maintains the page tables. In fact it maintains nested page tables. And top of the page tables is stored in top_pmd. pmd i suppose it is page mapping directory. You can traverse through all the page tables using this structure.
2. How MMU cannot find the address in physical RAM.....
I am not sure i understood the question. But in case because of some problem, the instruction is faulted or out of its instruction area is being accessed, you generally get undefined instruction exception resulting in undefined exception abort. If you look at the crash dumps, you can see it in the kernel log.
3. Is the Mapping table - virtual to physical is inside a MMU...
Yes. MMU is SW+HW. HW is like TLB and all. The mapping tables are stored here. For instructions, that is for code section i always converted the physical-virtual address and always they matched. And almost all the times it matches for Data sections as well.
4. cat /proc/pid_value/maps. This shows me the current mapping of the vmarea....
This is more used for analyzing the virtual addresses of user space stacks. As you know virtually all the user space programs can have 4 GB of virtual address. So unlike kernel if i say 0xc0100234. You cannot directly go and point to the istruction. So you need this mapping and the virtual address to point the instruction based on the data you have.
5. The high-mem concept is that kernel cannot directly access the Memory...
High-mem corresponds to user space memory(some one correct me if i am wrong). When kernel wants to read some data from a address at user space you will be accessing the HIGHMEM.
6. Does the processor specifically comes with the MMU support. Those who doesn't have MMU support cannot run LInux?
MMU as i mentioned is HW + SW. So mostly it would be coming with the chipset. and the SW would be generally architecture dependent. You can disable MMU from kernel config and build. I have never tried it though. Mostly these days allthe chipsets have it. But small boards i think they disable MMU. I am not entirely sure though.
As all these are conceptual questions, i may be lacking some knowledge and be wrong at places. If so others please correct me.

Is it possible to assign parts of the shared L2 caches to different cores

Lets say, 4 threads are running on 4 separate cores of a Multicore x86 processor, and they do not share any data, is it possible to progammatically make the 4 cores use separate and predefined portions of the shared L2 cache.
Let's use two terms, exclusive and shared caches instead of L1, L2, L3, L4 caches. Different CPU families start to share cache on different levels. In the presented terms the original question is - is it possible split shared cache into the parts, each of which will be used exclusively by one of the CPU/cores? There is no clear answer. Furthermore there are two answers opposite to each other.
1) First and general answer: NO.
Cache is by design managed in hardware. There are only few control levers of cache accessible in software such as enable/disable cache for whole memory or defined memory region, apply specified policy for cache flushing (write through/ write back). NO basically due to the fact, that it was designed to be managed in hardware. So there are no useful interface that will allow manage it gracefully in software.
2) Second answer: Yes.
In fact, cache designed in such a way, that each line of the cache can save data from specified set of memory lines. Due to this if memory manager provides guaranty, that the same CPU one CPU/core own and use all memory lines assigned to the same cache line exclusively, then memory manager provides guaranty that that cache line will be used by that CPU exclusively. It is a very tricky workaround. And it have very limited benefits, and have serious drawbacks: memory layout is very fragmented, cache usage is unbalanced, complicated memory management, very hadrware-dependent (Details can be found in the paper provided by "MetallicPriest").
Resume: it is possible in theory and almost impossible on practice.

Resources