PoU and PoC in cache maintenance operations in ARM

When reading the ARMv7 Architecture Reference Manual, I found two concepts: Point of Coherency (PoC) and Point of Unification (PoU).
PoC looks like the point at which all agents (i.e., CPU cores) are guaranteed to see the same copy of memory.
PoU looks like the point at which all agents (in this case, CPU cores and the MMU) are guaranteed to see the same copy of memory.
I have several follow-up questions:
1. Is my understanding correct?
2. If so, and I issue DCCMVAC (Data Cache Clean by MVA to PoC) with an MVA of 0x40000000 (and, say, the PoC happens to be at 0x70000000), are all cache entries between VA 0x40000000 and 0x70000000 cleaned?
3. Then, if I issue DCCMVAC with MVA 0x0, are all data cache entries cleaned?
4. PoU sounds as if the MMU itself has its own data caches (not the TLB) for page table walks in main memory. Is this correct?

According to ARM training materials:
The PoU (Point of Unification) for a processor is the point (physical location within the hardware) where the instruction and data caches and the translation table walks of the processor are guaranteed to see the same copy of a memory location. For example, a unified level 2 cache would be the point of unification in a system with Harvard level 1 caches and a TLB (to cache page table entries). If no external cache is present, main memory would be the Point of unification.
The PoC (Point of [system] Coherency) is the point at which all blocks (for example, CPUs, DSPs, or DMA engines) which can access memory, are guaranteed for a particular address to see the same copy of a memory location. Typically, this will be the main external system memory.

This is an old question, but I'm adding some comments in case someone searches for it.
In my opinion, PoU and PoC were coined by ARM to define the level down to which cache maintenance must reach. The definitions of PoC and PoU are in the ARM ARM specification, while the ARMv8 programming guide (not the ARM spec itself) gives some diagrams for better understanding: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch11s04.html
One point: in some ARMv8 processor implementations, the I-side can snoop the D-side. For example, on an I-cache miss the hardware will check the D-cache, so you could treat the PoU as being at the L1 cache level. Other ARMv8 processors may not have this behaviour.
back to the original questions:
2) DCCMVAC 0x40000000 cleans, to the PoC, only the cache line containing that address, mostly one cache line.
The PoC is defined by the SoC implementation, not by an address.
3) Considering Q2, DCCMVAC 0x0 likewise only applies to one cache line.
If you want to clean (or clean and invalidate) more than one line, you either walk the whole cache by set/way or loop over an address range by VA, one line at a time (see the sketch after this answer).
4) PoU has nothing to do with the MMU.
The MMU hardware block owns some buffers that hold TLB entries, which is common practice. The page tables themselves are built by software in memory and are normally mapped as Normal memory, so they can end up in the cache, both when the CPU writes them and when the MMU's hardware walk unit reads them.
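To make answers 2) and 3) concrete, here is a minimal sketch (assuming ARMv7-A, privileged code, GCC inline assembly, and a fixed 64-byte line size; real code should read the line size from CTR) of cleaning an arbitrary buffer to the PoC one line at a time with DCCMVAC:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumption: derive this from CTR in real code */

/* Clean a VA range to the Point of Coherency, one line per DCCMVAC. */
static void clean_dcache_range_to_poc(void *addr, size_t len)
{
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end   = (uintptr_t)addr + len;

    for (uintptr_t va = start; va < end; va += CACHE_LINE) {
        /* DCCMVAC: Data Cache Clean by MVA to PoC (CP15 c7, c10, 1) */
        __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" :: "r"(va) : "memory");
    }
    __asm__ volatile("dsb" ::: "memory");  /* wait for the cleans to complete */
}
```

A single DCCMVAC on 0x0 or 0x40000000 cleans only the one line containing that address; a loop like this covers a region by VA, and set/way maintenance covers the entire cache.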

Related

How to manage devices that cannot access d-cache in ARM

I'm using an SPI device with DMA enabled on an STM32H7 SoC. The DMA peripheral cannot access the d-cache, so in order to make it work I have disabled the d-cache entirely (for more info about this, see this explanation). However, I would like to avoid disabling the d-cache globally for a problem that only affects a small region of memory.
I have read this post about the meaning of clean and invalidate cache operations in the ARM domain. My understanding is that by cleaning a cache area you force it to be written to the actual memory, and by invalidating a cache area you force the cache to be refilled from the actual memory. Is this correct?
My intention with this is to follow these steps to transmit something over SPI (with DMA):
Write the value you want to the buffer that DMA will read from.
Clean d-cache for that area to force it to go to actual memory, so DMA can see it.
Launch the operation: DMA will read the value from the area above and write it to the SPI's Tx buffer.
SPI reads data at the same time it writes, so there will be data in the SPI's Rx buffer; DMA will read it and write it to the receive buffer provided by the user. An observer of that buffer may well go through the d-cache, which might not yet reflect the new value received over SPI, so invalidate the receive-buffer area to force the d-cache to be refilled from memory.
Does the above make sense?
EDIT
Adding some more sources/examples of the problem I'm facing:
Example from the ST github: https://github.com/STMicroelectronics/STM32CubeH7/issues/153
Post in ST forums answering and explaining the d-cache problem: https://community.st.com/s/question/0D53W00000m2fjHSAQ/confused-about-dma-and-cache-on-stm32-h7-devices
Here is the interconnection between memory and DMA:
As you can see, DMA1 can access SRAM1, SRAM2 and SRAM3. I'm using SRAM2.
Here are the cache attributes of SRAM2:
As you can see, it is write-back, write-allocate, but not write-through. I'm not familiar with these attributes, so I read the definitions from here. However, that article seems to talk about the CPU's physical caches (L1, L2, etc.), and I'm not sure whether the ARM i-cache and d-cache refer to those physical caches. In any case, I'm assuming the definitions of write-through and the other terms are valid for the d-cache as well.
I forget offhand how the data cache works on the Cortex-M7/ARMv7-M. As I recall it does not have an MMU, and caching is based on address. ARM and ST would be smart enough to provide both cached and non-cached access to SRAM from the processor core.
If you want to send or receive data using DMA, you do not go through the cache.
You linked a question from before, to which I had provided an answer.
Caches contain some amount of SRAM, which is what the spec of so many KBytes or MBytes refers to, but there are also tag RAMs and other infrastructure. How does the cache know whether there is a hit or a miss? Not from the data, but from other bits of information taken from the address of the transaction. Some number of bits of that address are taken and compared across however many "ways" you have; there may be 8 ways, for example, so there are 8 small memories, think of them as arrays of structures in C. In that structure is some information: is this cache line valid? If valid, what is the tag (the bits of address it is tied to)? Is it clean or dirty?...
Clean or dirty: the overall caching infrastructure is designed (kinda the whole point) to hold information in a faster SRAM (SRAM in MCUs is very fast already, so why a cache in the first place?). Write transactions, if they go through the cache (in some form they should), get written to the cache and then, based on design/policy, get written out to system memory, or at least to the memory side of the cache. While the cache contains information that has been written (due to a write) but is not yet in system memory, that data is dirty. When you clean the cache (clean is ARM's term; flush is another term), you go through all of the cache looking for lines that are valid and dirty and initiate writes to system memory to clean them. This is how you force things out of the cache into system memory for coherency reasons, if you have a need to do that.
Invalidating a cache simply means you go through the tag RAMs and change the valid bit to indicate invalid for that cache line. That "loses" all information about the cache line; it is now available for reuse. It will not produce any hits and it will not cause a write to the system the way a clean/flush does. The actual cache line in the cache memory does not have to be zeroed or put into any other state; technically just the valid/invalid bit or bits change.
Things generally get into a cache from reads. Depending on the design and settings, if a read is cacheable then the cache first looks to see whether it has a valid tag for that item; if so, it simply returns the information already in the cache. If there is a miss, that data has no copy in the cache, so the cache initiates one or more cache-line reads on the system side. A single byte read can/will cause a larger, sometimes much larger, read to happen on the system side; the transaction is held until that (much larger) read returns, then the line is put into the cache and the requested item is returned to the processor.
Depending on the architecture and settings, writes may or may not create an entry in the cache. If a (cacheable) write happens and there is no hit in the cache, it may go straight to the system side as a write of that size and shape, as if the cache were not there. If there is a cache hit, the data goes into the cache and that/those cache lines are marked dirty. Then, depending on the design, it may also be written to system memory as a side effect of the write: the processor is freed to continue execution while the cache and other logic (write buffer) continue to move the new data to the system side, essentially cleaning/flushing automatically. One normally does not expect this, as it takes away the performance the cache was there to provide in the first place.
In any case, if a transaction misses and is to be cached, the ways have already been examined (based on the tag) to determine there was no hit. One of the ways is chosen to hold the new cache line; how it is chosen depends on the design and, in some cases, programmable settings. Ideally, if any way is invalid, the line goes to one of those, but round robin, random, oldest first, etc. are schemes you may see. If there is dirty data in the chosen way, it has to be written out first to make room for the new information. So, absolutely, a single byte or single word read (they have the same performance in a system like this) can require a cache-line eviction, then a read from the system, before the result is returned: more clock cycles than if the cache were not there. Nature of the beast. Caches are not perfect; with the right information and experience you can easily write code that makes the cache degrade the performance of the application.
Clean means: if a cache line is valid and dirty, write it out to system memory and mark it as clean.
Invalidate means: if the cache line is valid, mark it as invalid. If it was valid and dirty, that information is lost.
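Purely as an illustration (not any particular Cortex-M7 implementation; all names and sizes here are made up), the per-line bookkeeping described above can be modelled like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of one entry of a tag RAM plus its data line. */
struct cache_line_state {
    bool     valid;    /* does this line hold anything at all?               */
    bool     dirty;    /* written via the cache but not yet in system memory */
    uint32_t tag;      /* the address bits identifying what is cached here   */
    uint8_t  data[32]; /* the cached copy itself (line size is arbitrary)    */
};

/* Clean:      if (valid && dirty) write data[] back to memory, clear dirty. */
/* Invalidate: clear valid; any dirty data is simply lost.                   */
```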
In your case you do not want to deal with the cache at all for these transactions. The cache in question is inside the ARM core, so nobody but the ARM core has access to it; nobody else is behind the cache, they are all on the system side.
Taking a quick look at the ARM ARM for ARMv7-M, it does use the address space to determine write-through and cached-or-not. One then needs to look at the Cortex-M7 TRM for further information and then, particularly in this case, since it is a chip thing and not an ARM thing anyway, at the whole system. The ARM processor is just a bit of IP that ST bought to glue into a chip with a bunch of other IP plus IP of their own. Like the engine in a car: the engine manufacturer can't answer questions about the rear differential or the transmission; that is the car company, not the engine company.
ARM knows what they are doing.
ST knows what they are doing.
If a chip company makes a chip with DMA, but the only path between the processor and the memory shared with the DMA engine goes through the processor's cache when the cache is enabled, and clean/flush and invalidate of address ranges are constantly required to use that DMA engine... then you need to immediately discard that chip, blacklist that company's products (if this product is that poorly designed, assume all of their products are), and find a better company to buy products from.
I can't imagine that is the case here, so:
Initialize the peripheral, choosing to use DMA, and configure the peripheral, the DMA engine, or both (for each direction).
Start the peripheral (this might be part of the previous step).
Write the Tx data to the address space configured for DMA.
Tell the peripheral to start the transfer.
Monitor for completion of the transfer.
Read the received data from the address space configured for DMA.
That is generic, but that is what you are looking for; caches are not involved. For a part/family like this there should be countless examples, including the (choose your own name for the quality) one or more library solutions and examples that come from the chip vendor. Look at how others are using the part, compare that to the documentation, determine your risk level for their solution, and use it, modify it, or at least learn from it.
I know that ST products do not have an instruction cache; they do their own thing, or at least that is what I remember (some trademarked name for a flash cache, which on most of them you cannot turn off). Does that mean they have not implemented a data cache on these products either? Possible. Just because the architecture for an IP product has a feature (FPU, caches, ...) does not automatically mean that the chip vendor has enabled/implemented it. Depending on the IP there are various ways to do that, as some IP does not have a compile-time option for the chip vendor to leave a feature out. If nothing else, the chip vendor could simply stub out the cache memory interfaces and write a few lines in the docs saying there is no cache, and you could write the control registers and appear to enable the feature, but it simply would not work. One expects that ARM provides compile-time options, not in the public documentation we can see, but available to the chip vendor in some form. Sometimes when you buy the IP you are given a menu, like ordering a custom burger at a fancy burger shop: a list of checkboxes; mayo, mustard, pickle ... FPU, cache, 16-bit fetch, 32-bit fetch, one-cycle multiply, x-cycle multiply, divide, etc. The chip vendor then produces your custom burger. Or, with some vendors, you get the whole burger and have to pick off the pickles and onions yourself.
So again, it is not our job to read the docs for you. First off, does this part even have a d-cache? Look through the ARM ARM, the Cortex-M7 TRM, and the documentation for the chip's address spaces (as well as the countless examples) and determine what address space or what settings, etc., are needed to access portions of SRAM in a non-cached way, if it has a data cache feature at all.
I have investigated a bit more:
With regards to the clean and invalidate question, the answer is yes: clean forces cached data to be written back to memory, and invalidate forces the cache to be refilled from memory on the next access.
With regards to the steps I proposed, again yes, they make sense (a code sketch follows at the end of this answer).
Here is a sequence of 4 videos that explains this exact situation (DMA and memory coherency). As can be seen, the 'software' solution (the one that doesn't involve the MPU) proposed by the videos (and by the other resources provided above) is exactly the sequence of steps I posted.
https://youtu.be/5xVKIGCPy2s
https://youtu.be/2q8IvCxSjaY
https://youtu.be/6IEtoG7m0jI
https://youtu.be/0DhYTqPCRiA
The other proposed solution is to configure the Cortex-M7 MPU to change the attributes of a particular memory region so as to maintain memory coherency.
This is all apart from the easiest solution, which is to globally disable the d-cache, although, naturally, that is not desirable.
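As a sketch of that software solution (assumptions: a Cortex-M7 STM32H7 part using CMSIS-Core and the ST HAL, a hypothetical hspi2 handle initialised elsewhere, a made-up .sram2 linker section placed in DMA-reachable SRAM2, and 32-byte cache-line alignment), the clean-before-Tx / invalidate-after-Rx sequence could look like this:

```c
#include <string.h>
#include "stm32h7xx_hal.h"   /* also pulls in the CMSIS SCB_* cache helpers */

#define XFER_LEN 64u

/* Buffers must live in SRAM the DMA can reach and be 32-byte aligned/sized
   so cache maintenance does not touch unrelated data sharing a cache line.
   The ".sram2" section name is an assumption about the linker script.      */
static uint8_t tx_buf[XFER_LEN] __attribute__((aligned(32), section(".sram2")));
static uint8_t rx_buf[XFER_LEN] __attribute__((aligned(32), section(".sram2")));

extern SPI_HandleTypeDef hspi2;  /* assumed to be configured for DMA elsewhere */

HAL_StatusTypeDef spi_xfer(const uint8_t *data, uint16_t len)
{
    memcpy(tx_buf, data, len);

    /* Step 2: clean the Tx area so the DMA sees what the CPU just wrote. */
    SCB_CleanDCache_by_Addr((uint32_t *)tx_buf, (int32_t)len);

    /* Step 3: launch the DMA transfer. */
    HAL_StatusTypeDef st = HAL_SPI_TransmitReceive_DMA(&hspi2, tx_buf, rx_buf, len);
    if (st != HAL_OK)
        return st;

    /* Wait (crudely) for the transfer to finish. */
    while (HAL_SPI_GetState(&hspi2) != HAL_SPI_STATE_READY) { }

    /* Step 4: invalidate the Rx area so CPU reads refetch what the DMA wrote. */
    SCB_InvalidateDCache_by_Addr((uint32_t *)rx_buf, (int32_t)len);
    return HAL_OK;
}
```

The alignment matters: if other variables shared a cache line with rx_buf, the invalidate would silently discard their cached values too.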

How to debug an aarch64 translation fault?

I am writing a simple kernel in armv8 (aarch64).
MMU config:
48 VA bits (T1SZ=64-48=16)
4K page size
All physical RAM flat mapped into kernel virtual memory (on TTBR1_EL1)
(The MMU is active with TTBR0_EL1=0, so I'm only using addresses of the form 0xffff_xxxx_xxxx_xxxx, all flat-mapped to physical memory.)
I'm mapping a new address space (starting at 1<<40) to some free physical region. When I try to access address 1<<40, I get an exception (of type "EL1 using SP1, synchronous"):
ESR_EL1=0x96000044
FAR_EL1=0xffff010000000000
Inspecting other registers, I have:
TTBR1_EL1=0x82000000
TTBR1_EL1[2]=0x0000000082003003
So, based on ARM Architecture Reference Manual for ARMv8 (ARMv8-A profile):
ESR_EL1 (Exception Syndrome Register) translates into: Exception Class = 0b100101 (data abort without a change in exception level), pages D7-1933 and following; WnR = 1 (the faulting instruction is a write); DFSC = 0b000100 (translation fault at level 0), page D7-1958.
FAR_EL1 is the faulting address; it indicates TTBR1_EL1 is used (since the high bits are all 1). The top 9 VA bits (VA[47:39]) are 0b000000010, which indicates that entry 2 of the level-0 table is used.
Entry 2 in that table indicates a next-level table (low bits 0b11) at physical address 0x82003000.
So, translation fails at level 0, where it should not.
My question is: am I doing something wrong? Am I missing some info that could lead to the translation fault? And, more generally, how do you debug a translation fault?
Update:
Everything works when I write to the tables before enabling the MMU.
Whenever I write to tables AFTER enabling the MMU (via flat-mapped table region), mapping never works. I wonder why this happens.
I also tried manually writing to the selected tables (to remove any side effect from my mmapping function): same result (when writes are done before MMU is on, it works; after, it fails).
I tried doing tlbi and dsb sy instructions, followed by isb, without effect. Only one CPU is running at this time so caching should not be a problem - write instructions and MMU talk to the same caches (but I will test it next).
I had overlooked caching issues within a single core. The problem was that, after turning the MMU on, the CPU and the table-walk unit didn't have the same view of memory. The ARMv8 Cortex-A Programming Guide states that the cache has to be cleaned/invalidated to the Point of Unification (the same view for a single core) after modifying translation tables.
Two possibilities can explain this behavior (I don't fully understand how caches work yet):
First possibility: the MMU does not have the required address in its internal walk cache.
In this case, when updating regular data and making it available to other cores' L1 caches, the dsb instruction simply waits for all cores to reach a synchronized state (thanks to the coherency network): the other cores will know the line has to be updated, and when they try to access it, it gets updated in L2 or migrated from the previous core's L1 to theirs.
This does not happen with the MMU (it does not participate in coherency), so it still sees the old value in L2.
However, if this were the case, the same thing should happen before the MMU is turned on (because caching is activated well before), unless all memory is treated as L1-non-cacheable before the MMU is activated (which is possible; I'll have to double-check that).
A minimal way of fixing the problem may be to change caching policies for table pages, but the cache maintenance is still necessary to clear possible old values from the MMU.
Second possibility: in all the cases tested, the MMU already had the faulting address in its internal walk cache, which is not coherent with the data L1 or L2.
In that case, only an explicit invalidate can eject the old line from the MMU's cache. Before the MMU is turned on, that cache contains nothing, so it never holds the old value (0), only the new one.
I still think this case is unlikely because I tested many configurations, and sometimes the offset between previously mapped memory (for example, entry 0 in the level-1 table) and newly mapped memory (for example, entry 128 in the same level-1 table) was 1024 bytes, which is larger than any cache line size.
So, I'm still not sure what exactly causes the problem, but cleaning/invalidating all the updated addresses works (see the sketch below).
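A minimal sketch of the maintenance that resolves this (assumptions: AArch64 at EL1, GCC inline assembly, and a helper name I made up that is handed the virtual address of the descriptor just written; whether DC CVAU is enough or DC CIVAC is needed depends on how the walker's cacheability is configured in TCR_EL1):

```c
#include <stdint.h>

/* After writing a translation-table descriptor through its flat mapping,
   push it to where the table walker will see it and drop any stale
   cached translations.  table_entry_va is the VA used to write the entry. */
static inline void sync_table_entry(uint64_t *table_entry_va)
{
    __asm__ volatile(
        "dc civac, %0   \n"   /* clean+invalidate the line holding the entry */
        "dsb ish        \n"   /* make the update visible to the walker       */
        "tlbi vmalle1   \n"   /* coarse: drop all cached EL1 translations    */
        "dsb ish        \n"
        "isb            \n"
        :
        : "r"(table_entry_va)
        : "memory");
}
```

A finer-grained TLBI by VA (e.g. tlbi vaae1is) on the address being remapped would also do; the coarse vmalle1 just keeps the sketch simple.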

Arm cortex a9 memory access

I want to know the sequence by which an ARM core (a Cortex-A series processor) accesses memory, right from the virtual address generated by the core to the instruction/data transferred from memory back to the core. Suppose the core has generated a virtual address for some data/instruction and there is a miss in the TLBs; how does the address reach main memory (DRAM, if I am not wrong) and how does the data come back to the core through the L2 and L1 caches?
What if required data/instruction is already in L1 cache?
What if required data/instruction is already in L2 cache?
I am confused regarding cache and MMU communications.
tl;dr - Whatever you want. The ARM is highly flexible and the SOC vendor and/or the system programmer may make the memory sub-systems do a great many different things depending on the end device features and needs.
First, the MMU has fields that explicitly dictate how the cache is to be used. I recommend reading Chapter 9 Caches and Chapter 10 Memory Management Unit of the Cortex-A Series Programmers Guide.
Some terms are:
PoC - Point of Coherency
PoU - Point of Unification
Strongly-ordered
Device
Normal
Many MMU properties and caching behaviours can be affected by different CP15 and configuration registers. For instance, an 'exclusive' configuration (data in the L1 cache is never also in the L2) can make it particularly difficult to cleanly write self-modifying code and other dynamic updates. So, even for a particular Cortex-A model, the system configuration may change things (write-back/write-through, write-allocate/no-write-allocate, bufferable, non-cacheable, etc.).
A typical sequence for general DDR core memory is:
1. Resolve virt -> phys
   1.1. Micro TLB present? Yes, have `phys`.
   1.2. TLB present? Yes, have `phys`.
   1.3. Table walk. Have `phys`, or fault.
2. Access marked cacheable? Yes, do 2.1. No, go to step 4.
   2.1. In L1 cache? Yes, do 2.2.
   2.2. If read, return data. If write, fill data and mark dirty (write-back).
3. In L2 cache? Yes, do 3.1.
   3.1. If read, return data. If write, fill data and mark dirty (write-back).
4. Run a physical cycle on the AXI bus (which may route to a sub-bus).
What if required data/instruction is already in L1 cache?
What if required data/instruction is already in L2 cache?
For normal cases these are just cache hits. If it is 'write-through' and a write, then the value is updated in the cache and written to memory. If it is 'write-back', the value is updated in the cache and marked dirty (Note 1). If it is a read, then the cached copy is used (in both cases).
The system may be set up completely differently for device memory (i.e., memory-mapped USB registers, world-shareable memory, multi-core/CPU buffers, etc.). Often the setup will depend on system cost, performance and power consumption. For example, a write-through cache is easier to implement (lower power and less cost) but often gives lower performance.
I am confused regarding cache and MMU communications.
Mainly, the MMU provides information that the caches use when resolving an address. The MMU may say to use or not use the cache. It may tell the cache it can 'gang' writes together (write-bufferable) but should not hold them indefinitely, etc. So many of the MMU specifiers can selectively alter the behaviour of the cache (a descriptor sketch follows the note below). As the Cortex-A cache parameters are not defined by the architecture (they are up to each SoC manufacturer), particular MMU bits often have different behaviour on different systems.
Note 1: the 'dirty cache' may have additional 'broadcasts' of exclusive-monitor information for strex and ldrex type accesses.
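As an example of how those MMU attributes are expressed, here is a hedged sketch (assuming the ARMv7-A short-descriptor format with 1 MB section entries; the macro and function names are made up, and access permissions are simplified):

```c
#include <stdint.h>

/* ARMv7-A short-descriptor, 1 MB section entries (illustrative only). */
#define SEC        (0x2u)        /* bits[1:0] = 0b10: section entry        */
#define SEC_B      (1u << 2)
#define SEC_C      (1u << 3)
#define SEC_TEX(x) ((uint32_t)(x) << 12)
#define SEC_AP_RW  (0x3u << 10)  /* AP[1:0] = 0b11: full read/write access */
#define SEC_S      (1u << 16)    /* shareable                              */

/* Normal memory, write-back, write-allocate: TEX=0b001, C=1, B=1 */
#define ATTR_NORMAL_WBWA (SEC | SEC_TEX(1) | SEC_C | SEC_B | SEC_AP_RW | SEC_S)
/* Shareable Device memory: TEX=0b000, C=0, B=1 */
#define ATTR_DEVICE      (SEC | SEC_B | SEC_AP_RW)

/* Map one 1 MB section; the walker hands these attribute bits to the caches,
   which is how "cacheable", "bufferable", etc. reach the cache logic.        */
static void map_section(uint32_t *l1_table, uint32_t va, uint32_t pa, uint32_t attr)
{
    l1_table[va >> 20] = (pa & 0xFFF00000u) | attr;
}
```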

Significance of Reset Vector in Modern Processors

I am trying to understand how computer boots up in very detail.
I came across two things which made me more curious,
1. RAM is placed at the bottom of ROM, to avoid Memory Holes as in Z80 processor.
2. A reset vector is used, which takes the processor to a memory location in ROM whose contents point to the actual location (again in ROM) from where the processor actually starts executing instructions (the POST code). Why so?
If you still can't follow me, this link explains it briefly:
http://lateblt.tripod.com/bit68.txt
The processor logic is generally rigid and fixed, thus the term hardware. Software is something that can be changed, molded, etc., thus the term software.
The hardware needs to start somehow. There are two basic methods:
1) an address, hardcoded in the logic, in the processors memory space is read and that value is an address to start executing code
2) an address, hardcoded in the logic, is where the processor starts executing code
When the processor is integrated with other hardware, anything can be mapped into any part of the address space. You can put RAM at address 0x1000 or 0x40000000 or both. You can map a peripheral at 0x1000 or 0x4000 or 0xF0000000 or all of the above. It is the choice of the system designers, or of a combination of teams of engineers, where things will go. One important factor is how the system will boot once reset is released. The booting of the processor is well known from its architecture. The designers usually choose one of two paths:
1) Put a ROM in the memory space that contains the reset vector or the entry point, depending on the boot method of the processor (no matter what the architecture, there is a first address or first block of addresses that is read, and its contents drive the booting of the processor). The software places code, a vector table, or both in this ROM so that the processor will boot and run (a small example follows this answer).
2) put ram in the memory space, in such a way that some host can download a program into that ram, then release reset on the processor. The processor then follows its hardcoded boot procedure and the software is executed.
The first one is most common, the second is found in some peripherals, mice and network cards and things like that (Some of the firmware in /usr/lib/firmware/ is used for this for example).
The bottom line, though, is that the processor is usually designed with one fixed boot method, so that all software written for that processor can conform to that one method and not have to keep changing. Also, the processor, when designed, doesn't know its target application, so it needs a generic solution. The target application usually defines the memory map, what is where in the processor's memory space, and one of the tasks in that assignment is deciding how the product will boot. From there the software is compiled and placed such that it conforms to both the processor's rules and the product's hardware rules.
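To make method 1 concrete, here is a minimal sketch (assuming an ARM Cortex-M style part where the hardware reads the first words of ROM for the initial SP and reset vector; the symbol and section names are made up and would come from the linker script):

```c
#include <stdint.h>

extern uint32_t _estack;       /* top of RAM, provided by the linker script */
void Reset_Handler(void);      /* the first code the core runs after reset  */

/* Method 1: fixed addresses at the start of ROM hold an initial stack pointer
   and the address of the entry point; the hardware reads them at reset.      */
__attribute__((section(".isr_vector"), used))
static const uint32_t vector_table[] = {
    (uint32_t)&_estack,        /* word 0: initial SP loaded by hardware         */
    (uint32_t)Reset_Handler,   /* word 1: reset vector, where execution begins  */
};

void Reset_Handler(void)
{
    for (;;) { }               /* real firmware would set up .data/.bss, then call main() */
}
```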
It completely varies by architecture. There are a few reasons why cores might want to do this, though. Embedded cores (think along the lines of ARM and MicroBlaze) tend to be used within system-on-chip machines with a single address space. Such architectures can have multiple memories all over the place and tend to dictate only that the bottom area of memory (i.e. 0x00) contains the interrupt vectors. This then allows the programmer to easily specify where to boot from. On MicroBlaze, you can attach memory wherever the hell you like in XPS.
In addition, it can be used to easily support bootloaders. These are typically used as a small program to do a bit of initialization, then fetch a larger program from a medium that can't be accessed simply (e.g. USB or Ethernet). In these cases, the bootloader typically copies itself to high memory, fetches below it and then jumps there. The reset vector simply allows the programmer to bypass the first step.

ARM11 Translation Lookaside Buffer (TLB) usage?

Is there a decent guide explaining how to use the TLB (Translation Lookaside Buffers) tables on an ARM1176JZF-S core?
Having looked over the technical documentation for that ARM platform, I still have no clue what a TLB is or what it looks like. As far as I understand, each TLB entry maps a virtual page to a physical page, allowing remapping and controlling memory permissions.
Apart from that, I have absolutely no clue on how to use them.
What structure does a TLB entry have? How do I create new entries?
How do I handle VM in context switches for user-space threads? How do I ensure that those threads can only access specific pages assigned to their parent processes (enforce memory protection)? Do I save the TLB state for each context?
Why are there two TLBs? What can I use the MicroTLB for if it can only have 10 entries? Surely, I need more than 10.
It says that one of the parts of the main TLB is "a fully-associative array of eight elements, that is lockable". What is that? Do I only get to have 8 entries for the Main TLB?
Thank you in advance. I'll be really glad if someone provides an explanation of what TLBs are. I'm currently working on a memory mapper for my kernel, and I've pretty much hit a dead end.
The technical reference manual for ARM1176JZF-S appears to be DDI 0301. That document contains all the specific details for that specific ARM core.
I still have no clue what a TLB is or what it looks like. As far as I understand, each TLB entry maps a virtual page to a physical page, allowing remapping and controlling memory permissions.
A TLB is a cache of the page table. Some processors allow direct access to the TLB while knowing nothing about page tables (e.g. MIPS), while others know about page tables and internally use TLBs that the programmer mostly doesn't see (e.g. x86). In this case, the TLB is managed by hardware, and the system programmer only has to make the TTB (Translation Table Base) registers point to the page table and invalidate the TLB in appropriate places.
What structure does a TLB entry have? How do I create new entries?
Done by hardware. On a TLB miss, the MMU walks the page table and fills the TLB from there.
How do I handle VM in context switches for user-space threads?
Some platforms have TLBs that simply map virtual addresses to physical addresses (e.g. x86). On these platforms, you have to do a full TLB flush on each context switch. Other platforms (MIPS, and this specific ARM core) map (ASID, virtual address) pairs to physical addresses. An ASID is an Address Space Identifier, i.e. an identifier for a process's address space. The MMU uses a register to know which ASID to use (the Context ID register in this case). Since there may be more processes than ASIDs, occasionally you may need to recycle an ASID (assigning it to a different process) and do a TLB flush (that's what the Invalidate TLB by ASID operation is for).
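A rough sketch of what this looks like on such a core (assumptions: an ARM1176-style CP15 layout, GCC inline assembly, and a made-up task structure; real code also needs the ASID-switch sequence and barriers the TRM prescribes):

```c
#include <stdint.h>

struct task {
    uint32_t ttbr0;   /* physical address of this task's level-1 page table */
    uint8_t  asid;    /* address space identifier assigned to this task     */
};

/* Switch address spaces without flushing the whole TLB: entries are tagged
   with the ASID, so entries belonging to other tasks simply stop matching. */
static void mmu_switch_task(const struct task *next)
{
    uint32_t ctxid = next->asid;  /* Context ID register, ASID in bits [7:0] */

    __asm__ volatile("mcr p15, 0, %0, c13, c0, 1" :: "r"(ctxid));        /* CONTEXTIDR */
    __asm__ volatile("mcr p15, 0, %0, c2, c0, 0"  :: "r"(next->ttbr0));  /* TTBR0      */
    __asm__ volatile("mcr p15, 0, %0, c7, c5, 4"  :: "r"(0) : "memory"); /* flush prefetch buffer (ISB) */
}

/* When an ASID has to be recycled for a new process, drop its stale entries. */
static void tlb_invalidate_asid(uint8_t asid)
{
    __asm__ volatile("mcr p15, 0, %0, c8, c7, 2" :: "r"((uint32_t)asid) : "memory"); /* invalidate unified TLB by ASID */
    __asm__ volatile("mcr p15, 0, %0, c7, c10, 4" :: "r"(0) : "memory");             /* data synchronization barrier   */
}
```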
Why are there two TLBs? What can I use the MicroTLB for if it can only have 10 entries? Surely, I need more than 10.
This is exactly for the same reason you have small separate level-1 caches for instructions and data. Since they are caches, you don't need more than 10 (though having more could improve performance).
It says that one of the parts of the main TLB is "a fully-associative array of eight elements, that is lockable". What is that? Do I only get to have 8 entries for the Main TLB?
Some memory pages (e.g: some portions of the kernel) are accessed very often. It makes sense to lock them, so they don't get thrown off of the TLB. Also, on realtime systems, a TLB miss or a cache miss may introduce some unwanted unpredictability. So, there is an option to lock a number of TLB entries. The main TLB has more entries, but only those 8 are lockable.

Resources