Arm cortex a9 memory access - arm

I want to know the sequence an ARM core (Cortex-A series processor) accesses memory? Right from Virtual Address generated by core to memory and Instruction/Data transferred from the memory to the core. Consider core has generated a virtual address for some data/instruction and there is a miss from TLBs, then how does address reach to main memory(DRAM if I am not wrong) and how does data comes to core through L2 and L1 caches.
What if required data/instruction is already in L1 cache?
What if required data/instruction is already in L2 cache?
I am confused regarding cache and MMU communications.

tl;dr - Whatever you want. The ARM is highly flexible and the SOC vendor and/or the system programmer may make the memory sub-systems do a great many different things depending on the end device features and needs.
First, the MMU has fields that explicitly dictate how the cache is to be used. I recommend reading Chapter 9 Caches and Chapter 10 Memory Management Unit of the Cortex-A Series Programmers Guide.
Some terms are,
PoC - point of coherency.
PoU - point of unification.
Strongly ordered.
Device
Normal
Many MMU properties and caching can be affected by different CP15 and configuration registers. For instance, an 'exclusive configuration' for data in the L1 cache is never in the L2 can make it particularly difficult to cleanly write self modifying code and other dynamic updates. So, even for a particular Cortex-A model, the system configuration may change things (write-back/write-through, write-allocate/no write-allocate, bufferable, non-cacheable, etc).
A typical sequence for general DDR core memory is,
Resolve virt -> phys
Micro TLB present? Yes, have `phys`
TLB present? Yes, have `phys`
Table walk. Have `phys` or fault.
Access marked cacheable? Yes do 2.1. No step 4.
In L1 cache? Yes 2b.
If read return data. If write fill data and mark drity (write back).
In L2 cache? Yes 3.1
If read return data. If write fill data and mark drity (write back).
Run physical cycle on AXI bus (may route to sub-bus).
What if required data/instruction is already in L1 cache?
What if required data/instruction is already in L2 cache?
For normal cases these are just cache hits. If it is a 'write-through' and 'write' then the value is updated in cache and written to memory. It it is 'write-back' the value is updated in cache and marked dirty.Note1 If it is a read, then the cache memory is used (in both case).
The system maybe set up completely differently for device memory (Ie, memory mapped USB registers, world shareable memory, multi-core/cpu buffers, etc). Often the setup will depend on system cost, performance and power consumption. Ie, a write-through cache is easier to implement (lower power and less cost) but often lower performance.
I am confused regarding cache and MMU communications.
Mainly the MMU will provide information for the caches to resolve an address. The MMU may say to use/not use the cache. It may tell the cache it can 'gang' writes together (write-bufferable), but should not store them indefinitely, etc. So many of the MMU specifiers can selectively alter the behavior of the cache. As the Cortex-A cache parameters are not defined (it is up to each SOC manufacturer), it is often the case that particular MMU bits may have alternate behavior on different systems.
Note1: The 'dirty cache' may have additional 'broadcasts' of exclusion monitor information for strex and ldrex type accesses.

Related

Is accessing mapped device memory slow (in terms of latency)?

I know the question is vague.. but here is what I hope to learn: the MCU directs some part of memory address to devices on the PCI bus, hence in theory user/kernel code can directly read/write device memory as if it were main memory. But data in and out of PCI Express devices are packaged/serialized/transmitted in lanes, which means each read/write incurs significant overhead, such as packaging (add headers) and un-packaging. So that means it is not ideal for user/kernel to read device memory a byte at a time, instead it should do some sort of bulk transfer. If so, what is the preferred mechanism and API?
BTW, I know there is DMA, but it seems to me that DMA does not require device memory to be directly mapped into main memory address space - DMA is about letting device access main memory, and my question is the other way, letting user/kernel access device memory. So I am guessing it is not related to the question above, is that correct?
Yes, accessing memory-mapped I/O (MMIO) is slow.
The primary reason that it is slow is that it is typically uncacheable,
so every access has to go all the way to the device.
In x86 systems, which I am most familiar with, cacheable memory is accessed in 64-byte chunks,
even though processor instructions typically access memory in 1, 2, 4, or 8 byte chunks.
If multiple processor instructions access adjacent cacheable memory locations, all but the first access are satisfied from the cache. For similar accesses to device memory, every access incurs the full latency to the device and back.
The second reason is that the path from the processors to memory are critical to performance and are highly optimized.
The path to devices has always been slow, so software is designed to compensate for that, and optimizing the performance of MMIO isn't a priority.
Another related reason is that PCI has ordering rules that require accesses to be buffered and processed in a strict order.
The memory system can handle ordering in a much more flexible way. For example, a dirty cache line may be written to memory at any convenient time. MMIO accesses must be performed precisely in the order that they are executed by the CPU.
The best way to do bulk transfer of data to a device is to have the device itself perform DMA, "pulling" the data from memory into the device, rather than "pushing" it from the CPU to the device. (This also reduces the load on the CPU, freeing it to do other useful work.)

What is the difference in cache memory and tightly coupled memory

Due to being embedded inside the CPU The TCM has a
Harvard-architecture, so there is an ITCM (instruction TCM)
and a DTCM (data TCM). The DTCM can not contain any
instructions, but the ITCM can actually contain data.
The size of DTCM or ITCM is minimum 4KiB so the typical
minimum configuration is 4KiB ITCM and 4KiB DTCM.
It looks like tcm have same purpose as cache memory.
No. They didn't used the word cache in explanation
A cache uses access patterns to populate data within the cache. It has extra hardware to track the backing address and may have communication with other system entities (SMP) to track when a cache line is dirty (someone else has written something to primary memory).
The 'TCM' (tightly coupled memory) is fast, probably SRAM multi-transistor memory, like the cache. Both have a fast dedicated connection to the CPU. However, the overhead to implement the TCM is far less than a cache. Typically TCM is found on lower-end (deeply embedded probably Cortex-M) ARM devices.
Most CPU caches have a lock down feature which enables them to behave like the TCM. However, the TCM does not have on the fly capabilities to buffer high use code and data. Because of this, the TCM (and locked cache) is probably more deterministic which may help hard real time applications.
This is what I found that I feel is more concise and to the point.
Cache memory is implemented with on-chip memory and control logic. Tightly coupled memory is implemented with on-chip memory and a dedicated connection.
Tightly coupled memory has a fixed span in the address map. Cache does not live in the address map (.... well it kinda does.... just don't think of it as a physical memory) but instead serves as an intermediate between the processor and the memory to (hopefully) provide more efficient memory accesses.
Tightly coupled memory has deterministic access time. Accesses through the cache are not deterministic since the data will either live in the cache (hit) or the data must be fetched from main memory (miss).
Another
While both are very fast accessed memories, cache stores dynamically data/code which has been lately used in order to improve access speed, compared to standard memory connected to the global Avalon matrix. Every time a memory access is required, the processor checks if the required data is already present in the cache or must be newly fetched from memory; in the meantime, old unused cache data is being continously replaced with new data.
Tightly coupled memory is also a fast access memory, since it exploits a dedicated port, but it has static content: you decide what you need there and you specify it in the linker script.
TCM has allocated address space so you can find it in the memory map. You can control the data that will be stored there already at link time. Just think of it as a normal system memory that has access times similar with cache. Usually the data from TCM is uncacheable.
If we ignore the dual-configurable cache-cum-TCM, cache memory should be connected to Bus Interface (BIU) to connect external memory, whereas TCM aren't. Reason is TCM has original data by itself. Whereas cache is temporary storage (for speed) of external memory content.

PoU and PoC in cache maintenance operations in arm

When reading ARM arch. ref. manual v7, I've found two concepts; point of coherency (PoC) and point of unification (PoU).
For PoC, it looks like the point that all agents (i.e., CPU cores) can see the same copy of memory.
For PoU, it looks like the point that all agents (in this case, CPU cores and MMU) can see the same copy of memory.
I have several follow up questions:
Is my understanding correct?
If so, If I issue DCCMVAC (Data cache clean MVA to PoC) with giving MVA to 0x40000000, (and let say PoC happen to be 0x70000000),
all cache entries between VA of 0x40000000 and 0x70000000 are cleaned?
Then, if I issue DCCMVAC with MVA 0x0, all data cache entries are cleaned?
PoU sounds like that MMU itself has its own data caches (not TLB) for page table walk inside main memory. Is this correct?
According to ARM training materials:
The PoU (Point of Unification) for a processor is the point (physical location within the hardware) where the instruction and data caches and the translation table walks of the processor are guaranteed to see the same copy of a memory location. For example, a unified level 2 cache would be the point of unification in a system with Harvard level 1 caches and a TLB (to cache page table entries). If no external cache is present, main memory would be the Point of unification.
The PoC (Point of [system] Coherency) is the point at which all blocks (for example, CPUs, DSPs, or DMA engines) which can access memory, are guaranteed for a particular address to see the same copy of a memory location. Typically, this will be the main external system memory.
it's one old case, however, adding some comments in case of someone's search.
in my opinion, PoU and PoC are coined by ARM to define one level for cache maintenance. the definition of PoC and PoU is in ARM ARM specification, while its ARMv8 programming guide (not ARM spec) gives some diagram for better understanding: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch11s04.html
one point is ,under ARM V8 processor's implementation, Iside can snoop Dside, for example, if there is one Icache miss, it will check Dcache, so you could treat PoU as the level of L1 cache. while other ARMv8 processor may not have this behaviour.
back to the original questions:
2) DCCMVAC 0x40000000, it will do cache clean to PoC about this address, mostly one cache line
PoC is defined by SoC implementation, not by address.
3) considering Q2, DCCMVAC 0x0, only applies to one cache line.
if you want to clean and invalidate the whole cache, you need use by set/way to walkthrough the whole cache.
4) PoU has nothing to do with MMU.
MMU hardware block owns some buffers to save TLB entries, it's one common practice, as for pagetable, which is built by software as in the memory, normally it's defined as normal memory type, so it could be in the cache during setup by CPU instruction, or walk by MMU hardware walk unit.

Is it possible to assign parts of the shared L2 caches to different cores

Lets say, 4 threads are running on 4 separate cores of a Multicore x86 processor, and they do not share any data, is it possible to progammatically make the 4 cores use separate and predefined portions of the shared L2 cache.
Let's use two terms, exclusive and shared caches instead of L1, L2, L3, L4 caches. Different CPU families start to share cache on different levels. In the presented terms the original question is - is it possible split shared cache into the parts, each of which will be used exclusively by one of the CPU/cores? There is no clear answer. Furthermore there are two answers opposite to each other.
1) First and general answer: NO.
Cache is by design managed in hardware. There are only few control levers of cache accessible in software such as enable/disable cache for whole memory or defined memory region, apply specified policy for cache flushing (write through/ write back). NO basically due to the fact, that it was designed to be managed in hardware. So there are no useful interface that will allow manage it gracefully in software.
2) Second answer: Yes.
In fact, cache designed in such a way, that each line of the cache can save data from specified set of memory lines. Due to this if memory manager provides guaranty, that the same CPU one CPU/core own and use all memory lines assigned to the same cache line exclusively, then memory manager provides guaranty that that cache line will be used by that CPU exclusively. It is a very tricky workaround. And it have very limited benefits, and have serious drawbacks: memory layout is very fragmented, cache usage is unbalanced, complicated memory management, very hadrware-dependent (Details can be found in the paper provided by "MetallicPriest").
Resume: it is possible in theory and almost impossible on practice.

ARM11 Translation Lookaside Buffer (TLB) usage?

Is there a decent guide explaining how to use the TLB (Translation Lookaside Buffers) tables on an ARM1176JZF-S core?
Having looked over the technical documentation for the that ARM platform I still have no clue what a TLB is or what it looks like. As far as I understand, each TLB entry maps a virtual page to a physical page, allowing remapping and controlling memory permissions.
Apart from that, I have absolutely no clue on how to use them.
What structure does a TLB entry have? How do I create new entries?
How do I handle VM in context switches for user-space threads? How do I ensure that those threads can only access specific pages assigned to their parent processes (enforce memory protection)? Do I save the TLB state for each context?
Why are there two TLBs? What can I use the MicroTLB for if it can only have 10 entries? Surely, I need more than 10.
It says that one of the parts of the main TLB is "a fully-associative array of eight elements, that is lockable". What is that? Do I only get to have 8 entries for the Main TLB?
Thank you in advance. I'll be really glad if someone provides an explanation of what TLBs are. I'm currently working on a memory mapper for my kernel, and I've pretty much hit a dead end.
The technical reference manual for ARM1176JZF-S appears to be DDI 0301. That document contains all the specific details for that specific ARM core.
I still have no clue what a TLB is or what it looks like. As far as I understand, each TLB entry maps a virtual page to a physical page, allowing remapping and controlling memory permissions.
A TLB is a cache of the page table. Some processors allow direct access to the TLB, while knowing nothing about page tables (e.g: MIPS), while others know about page tables, and internally use TLBs that the programmer mostly doesn't see (e.g: x86). In this case, the TLB is managed by hardware, and the system programmer only has to care to make the TTB (Translation Table Base) registers point to the page table, and invalidate the TLB in apropriate places.
What structure does a TLB entry have? How do I create new entries?
Done by hardware. On a TLB miss, the MMU walks the page table and fills the TLB from there.
How do I handle VM in context switches for user-space threads?
Some platforms have TLBs that simply map virtual addresses to physical addresses (e.g: x86). On these platforms, you have to do a full TLB flush on each context switch. Other platforms (MIPS, this specific ARM core) map (ASID, virtual address) pairs to physical addresses. An ASID is an Application-Specific Identifier, i.e: an identifier for a process. The MMU uses a register to know which ASID to use (I think it's the Context ID register in this case). Since there may be more processes than ASIDs, occasionally you may need to recycle an ASID (assigning it to a different process) and do a TLB flush (that's what the Invalidate TLB by ASID operation is for).
Why are there two TLBs? What can I use the MicroTLB for if it can only have 10 entries? Surely, I need more than 10.
This is exactly for the same reason you have small separate level-1 caches for instructions and data. Since they are caches, you don't need more than 10 (though having more could improve performance).
It says that one of the parts of the main TLB is "a fully-associative array of eight elements, that is lockable". What is that? Do I only get to have 8 entries for the Main TLB?
Some memory pages (e.g: some portions of the kernel) are accessed very often. It makes sense to lock them, so they don't get thrown off of the TLB. Also, on realtime systems, a TLB miss or a cache miss may introduce some unwanted unpredictability. So, there is an option to lock a number of TLB entries. The main TLB has more entries, but only those 8 are lockable.

Resources