What is a PDE cache? - arm

I have the following specifications of an ARM based SoC:
L1 Data cache = 32 KB, 64 B/line, 2-WAY, LRU
L2 Cache = 1 MB, 64 B/line, 16-WAY
L1 Data TLB (for loads): 32 entries, fully associative
L2 Data TLB: 512 entries, 4-WAY
PDE Cache: 16 entries (one entry per 1 MB of virtual space)
I wonder what the PDE cache is. I guess it's something similar to a TLB, but I'm not sure.
Answer
It seems that the PDE (Page Directory Entry) cache is an intermediate table walk cache, which can indeed be implemented separately from the TLB.
The Cortex-A15 MPCore processor implements dedicated caches that store intermediate levels of translation table entries as part of a table walk.

A PDE ("Page Directory Entry") is x86 architecture terminology for a top level* page table entry - the equivalent of a "first-level descriptor" in ARM VMSA terms.
Assuming this is the source of the data in the question, it's presumably referring to Cortex-A15's "intermediate table walk cache", in which case the name is not entirely appropriate, since that cache can actually hold any level of translation.
* in IA-32 at least - 64-bit mode has levels above this

A TLB caches full translations; it doesn't reflect a coherent part of memory per se (and, not being coherent, it may cause a loss of coherency if the page map changes, so software must enforce coherency explicitly by flushing).
However, the pagemap itself does reside in memory, so every part of it may also be cached, whether in the general-purpose cache hierarchy or in special dedicated caches such as a PDE cache. This is implementation specific; different CPUs may make different choices here.
An access that hits in the TLB (at any of its levels) won't need that data, but a TLB miss will trigger a pagewalk, which issues memory reads from the pagemap. These reads may hit in the caches if they hold the pagemap data, instead of having to go all the way to memory.
Since a pagewalk is a long, serialized, critical chain of accesses (even more so with virtualization), you can imagine how important it is to optimize the latency of these accesses by caching them. Therefore, a dedicated cache for any of the pagemap levels, which helps those entries compete with normal data lines (which thrash the cache much more frequently), is often very useful for performance.
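To get a feel for that cost, here is a minimal sketch (assuming a POSIX system with clock_gettime and 4 KB pages; the buffer size and loop counts are arbitrary) that touches one byte per page across a large buffer, so most accesses miss the TLB and force a pagewalk, and compares this with touching a single resident page over and over. The difference also includes ordinary data-cache misses, so treat it only as a rough upper bound on the translation overhead.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define PAGE_SIZE 4096UL             /* assumed page size */
    #define NPAGES    (256UL * 1024UL)   /* 1 GB of virtual space: arbitrary */

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        volatile char *buf = malloc(NPAGES * PAGE_SIZE);
        if (!buf) return 1;

        /* Fault every page in first, so the timed loops measure translation
           cost rather than demand paging. */
        for (unsigned long i = 0; i < NPAGES; i++)
            buf[i * PAGE_SIZE] = 0;

        /* One byte per page: far more pages than TLB entries, so nearly
           every access misses the TLB and triggers a pagewalk. */
        double t0 = now_sec();
        for (unsigned long i = 0; i < NPAGES; i++)
            buf[i * PAGE_SIZE] += 1;
        double t1 = now_sec();

        /* Same number of accesses, all within one page: the translation
           stays in the TLB the whole time. */
        for (unsigned long i = 0; i < NPAGES; i++)
            buf[(i % 64) * 64] += 1;
        double t2 = now_sec();

        printf("one byte per page: %.3f s\n", t1 - t0);
        printf("one page reused  : %.3f s\n", t2 - t1);
        free((void *)buf);
        return 0;
    }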

That's interesting. ARM does not mention a PDE cache by that name in the Cortex-A15 and Cortex-A57 documentation or in the ARMv7 and ARMv8 programming guides.
PDE generally stands for Page Directory Entry, so this may be a dedicated cache used to store these entries, which would otherwise have to be re-read from the translation tables reached via the TTBR register when doing an address translation.
ARM has some "intermediate table walk caches" that are associated with an ASID field (address space identifier) and a VMID field (virtual machine identifier), so it seems the PDE cache and the intermediate table walk cache are related. In the documentation, the "intermediate table walk caches" store intermediate levels of translation table entries ... so these may well be the page directory entries.

Related

Cache locality of an array and amortized complexity

Accessing an array is usually faster than accessing a linked list.
This is primarily due to the cache locality of an array.
I have a few doubts:
On what factors does the amount of data that is brought into the cache depend? Is it always equal to the full cache size of the system? How can we find out how much memory this is?
The first access to an array is usually costlier, as the array has to be located in memory and brought into the cache. Subsequent operations are comparatively faster. How do we calculate the amortized complexity of the access operation?
What are cache misses? Do they mean (in the case of a linked list) that the required item (current pointer->next) has not been loaded into the cache, and hence memory has to be accessed again to fetch it?
In reality, it is a bit more complex than the simple model you present in the question.
First, you may have multiple caching layers (L1, L2, L3), each of them with different characteristics. In particular, the replacement policy for each cache may use a different algorithm as a tradeoff between efficiency and complexity (i.e. cost).
Then, all modern operating systems implement virtual memory mechanisms. It is not enough to cache the data and the instructions (which is what L1..L3 are used for); it is also required to cache the association between virtual and physical addresses (in the TLB, the translation lookaside buffer).
To understand the impact of locality, you need to consider all these mechanisms.
Question 1
The minimum unit exchanged between the memory and a cache is the cache line. Typically, it is 64 bytes (but it depends on the CPU model). Let's imagine the caches are empty.
If you iterate on an array, you will pay for a cache miss every 64 bytes. A smart CPU (and a smart program) could analyze the memory access pattern and decide to prefetch contiguous blocks of memory in the caches to increase the throughput.
If you iterate on a list, the access pattern will be random, and you will likely pay a cache miss for each item.
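To make the difference concrete, here is a minimal sketch (assuming a POSIX clock_gettime; the element count and names are arbitrary) that sums the same values stored once in a contiguous array and once in a linked list whose nodes are deliberately scattered in memory:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1u << 22)   /* ~4 million elements: arbitrary */

    struct node { int value; struct node *next; };

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        int *arr = malloc(N * sizeof *arr);
        struct node *pool = malloc(N * sizeof *pool);
        size_t *order = malloc(N * sizeof *order);
        if (!arr || !pool || !order) return 1;

        /* Shuffle the node order so the list jumps around in memory,
           defeating both cache-line reuse and hardware prefetching. */
        for (size_t i = 0; i < N; i++) order[i] = i;
        srand(1);
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (((size_t)rand() << 15) ^ (size_t)rand()) % (i + 1);
            size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        for (size_t i = 0; i < N; i++) {
            arr[i] = (int)i;
            pool[order[i]].value = (int)i;
            pool[order[i]].next = (i + 1 < N) ? &pool[order[i + 1]] : NULL;
        }

        long long sum = 0;
        double t0 = now_sec();
        for (size_t i = 0; i < N; i++)            /* sequential: ~1 miss per 64 B */
            sum += arr[i];
        double t1 = now_sec();
        for (struct node *p = &pool[order[0]]; p; p = p->next)
            sum += p->value;                      /* scattered: ~1 miss per node */
        double t2 = now_sec();

        printf("array: %.3f s, list: %.3f s (sum=%lld)\n", t1 - t0, t2 - t1, sum);
        free(arr); free(pool); free(order);
        return 0;
    }

On typical hardware the list traversal comes out several times slower, precisely because almost every p->next lands on a different cache line (and often a different page).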
Question 2
The whole array is not searched and brought into the cache on first access. Only the first cache line is.
However, there is also another factor to consider: the TLB. The page size is system dependent; a typical value is 4 KB. The first time the array is accessed, an address translation will occur (and its result will be stored in the TLB). If the array is smaller than 4 KB (the page size), no other address translation will have to be done. If it is bigger, then one translation per page will be done.
Compare this to a list. The probability that multiple items fit in the same page (4 KB) is much lower than for the array. The probability that they fit in the same cache line (64 bytes) is extremely low.
I think it is difficult to calculate a complexity because there are probably other factors to consider. But in this complexity, you have to take into account the cache line size (for cache misses) and the page size (for TLB misses).
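As a back-of-the-envelope sketch (assuming 4-byte ints, 64-byte cache lines and 4 KB pages, all of which vary by system, plus a cold cache and no prefetching), the amortized miss counts for a sequential array scan work out roughly like this:

    /* Rough estimate only: assumes 4-byte elements, 64-byte cache lines,
       4 KB pages, a cold cache/TLB and no prefetching. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long n = 1000000;                 /* number of int elements */
        unsigned long bytes = n * 4;

        unsigned long cache_misses = bytes / 64;   /* ~1 miss per cache line */
        unsigned long tlb_misses   = bytes / 4096; /* ~1 miss per page       */

        printf("per element: ~%g cache misses, ~%g TLB misses\n",
               (double)cache_misses / n, (double)tlb_misses / n);
        /* -> roughly 1/16 of a cache miss and 1/1024 of a TLB miss per
           element for the array, versus up to one of each per element
           for a badly scattered linked list. */
        return 0;
    }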
Question 3
A cache miss is when a given cache line is not in the cache. It can happen at the L1, L2, or L3 level. The higher the level, the more expensive the miss.
A TLB miss occurs when the virtual address is not in the TLB. In that case, a conversion to a physical address is done using the page tables (costly) and the result is stored in the TLB.
So yes, with a linked list, you will likely pay for a higher number of cache and TLB misses than for an array.
Useful links:
Wikipedia article on caches in CPUs: https://en.wikipedia.org/wiki/CPU_cache
Another good article on this topic: http://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
An oldish, but excellent article on the same topic: http://arstechnica.com/gadgets/2002/07/caching/
A gallery of various caching effects: http://igoro.com/archive/gallery-of-processor-cache-effects/

Programmatically find the number of cache levels

I am a newbie in C programming. I have an assignment to find the number of data cache levels in the CPU and also the hit time of each level. I am looking at "C Program to determine Levels & Size of Cache" but finding it difficult to interpret the results. How is the number of cache levels revealed?
Any pointers will be helpful.
Assuming you don't have a way to cheat (like some way of getting that information from the operating system or some CPU identification register):
The basic idea is that (by design), your L1 cache is faster than your L2 cache which is faster than your L3 cache... In any normal design, your L1 cache is also smaller than your L2 cache which is smaller than your L3 cache...
So you want to allocate a large-ish block of memory and then access (read and write) it sequentially[1] until you notice that the time taken to perform X accesses has risen sharply. Then keep going until you see the same thing again. You would need to allocate a memory block larger than the largest cache you are hoping to discover.
This requires access to some low-overhead timestamp counter for the actual measurement (as pointed out in the referred-to answer).
[1] Or, if you want to try to fool any clever prefetching that may skew the results, randomly within a sequentially progressing N-byte block.
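Here is a minimal sketch of that idea (assuming POSIX clock_gettime; the maximum size, stride and access count are arbitrary). It performs the same number of accesses over working sets of increasing size and prints the time per access; the sharp jumps mark the capacities of successive cache levels:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        const size_t max_size = 64u << 20;   /* 64 MB: larger than the caches we hope to find */
        const size_t accesses = 1u << 26;    /* same amount of work for every size */
        volatile char *buf = malloc(max_size);
        if (!buf) return 1;
        memset((void *)buf, 0, max_size);

        for (size_t size = 4096; size <= max_size; size *= 2) {
            size_t mask = size - 1;          /* size is a power of two */
            double t0 = now_sec();
            /* Stride by one cache line, wrapping within the working set.
               (To fool the prefetcher you could instead jump randomly
               within a sequentially progressing block, as in [1].) */
            for (size_t i = 0; i < accesses; i++)
                buf[(i * 64) & mask] += 1;
            double t1 = now_sec();
            printf("%8zu KB : %6.2f ns/access\n",
                   size / 1024, (t1 - t0) * 1e9 / (double)accesses);
        }
        /* Look for sharp jumps in ns/access as the size grows: each jump
           marks the point where one more cache level has been exceeded. */
        free((void *)buf);
        return 0;
    }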

Storing C/C++ variables in processor cache instead of system memory

On the Intel x86 platform running Linux, in C/C++, how can I tell the OS and the hardware to store a value (such as a uint32) in L1/L2 cache, and not in system memory? For example, let's say either for security or performance reasons, I don't want to store a 32-bit key (a 32-bit unsigned int) in DRAM, and instead I would like to store it only in the processor's cache. How can I do this? I'm using Fedora 16 (Linux 3.1 and gcc 4.6.2) on an Intel Xeon processor.
Many thanks in advance for your help!
I don't think you are able to force a variable to be stored in the processor's cache, but you can use the register keyword to suggest to the compiler that a given variable should be allocated in a CPU register, declaring it like:
register int i;
There are no CPU instructions on x86 (or indeed on any platform that I'm aware of) that will allow you to force the CPU to keep something in the L1/L2 cache, let alone exposing such an extremely low-level detail to higher-level languages like C/C++.
Saying you need to do this for "performance" is meaningless without more context about what sort of performance you're looking at. Why is your program so tightly dependent on having access to data in the cache alone? Saying you need this for security just seems like bad security design. In either case, you have to provide a lot more detail about what exactly you're trying to do here.
Short answer, you can't - that is not what those caches are for - they are fed from main memory, to speed up access, or allow for advanced techniques like branch prediction and pipelining.
There are ways to ensure the caches are used for certain data, but it will still reside in RAM, and in a pre-emptive multitasking operating system you cannot guarantee that your cache contents will not be blown away by a context switch between any two instructions, except by "stopping the world" or by using low-level atomic operations. Those are generally for very, very short sequences of instructions that simply cannot be interrupted, like fetch-and-increment for spinlocks, not for processing cryptographic algorithms in one go.
You can't use cache directly but you can use hardware registers for integers, and they are faster.
If you really want performance, a variable is better off in a CPU register.
If you cannot use a register, for example because you need to share the same value across different threads or cores (multicore is getting common now!), you need to store the variable into memory.
As already mentioned, you cannot force some memory into the cache using a call or keyword.
However, caches aren't entirely stupid: if your memory block is used often enough, you shouldn't have a problem keeping it in the cache.
Keep in mind that if you happen to write to this memory place a lot from different cores, you're going to strain the cache coherency blocks in the processor because they need to make sure that all the caches and the actual memory below are kept in sync.
Put simply, this would reduce overall performance of the CPU.
Note that the opposite (do not cache) does exist as a property you can assign to parts of your heap memory.

Process Page Tables

I'm interested in gaining a greater understanding of the virtual memory and page mechanism, specifically for Windows x86 systems. From what I have gathered from various online resources (including other questions posted on SO),
1) The individual page tables for each process are located within the kernel address space of that same process.
2) There is only a single page table per process, containing the mapping of virtual pages onto physical pages (or frames).
3) The physical address corresponding to a given virtual address is calculated by the memory management unit (MMU) essentially by using the first 20 bits of the provided virtual address as the index of the page table, using that index to retrieve the beginning address of the physical frame and then applying some offset to that address according to the remaining 12 bits of the virtual address.
Are these three statements correct? Or am I misinterpreting the information?
So, first lets clarify some things:
In the case of the x86 architecture, it is not the operating system that determines the paging policy, it is the CPU (more specifically, its MMU). How the operating system views the paging system is independent of the way it is implemented. As a commenter rightly pointed out, there is an OS-specific component to paging models, but it is subordinate to the hardware's way of doing things.
32 bit and 64 bit x86 processors have different paging schemes so you can't really talk about the x86 paging model without also specifying the word size of the processor.
What follows is a massively condensed version of the 32-bit x86 paging model, using the simplest version of it. There are many additional tweaks that are possible, and I know that various OSs make use of them. I'm not going into those because I'm not really familiar with the internals of most OSs and because you really shouldn't go into that until you have a grasp of the simpler stuff. If you want to know all of the wonderful quirks of the x86 paging model, you can go to the Intel docs: Intel System Programming Guide
In the simplest paging model, the memory space is divided into 4KB blocks called pages. A contiguous chunk of 1024 of these is mapped to a page table (which is also 4KB in size). For a further level of indirection, all 1024 page tables are mapped to a 4KB page directory, and the base of this directory sits in a special register, %cr3, in the processor. This two-level structure is in place because most memory spaces in the OS are sparse, which means that most of the space is unused. You don't want to keep a bunch of page tables around for memory that isn't touched.
When you get a memory address, the most significant 10 bits index into the page directory, which gives you the base of the page table. The next 10 bits index into that page table to give you the base of the physical page (also called the physical frame). Finally, the last 12 bits index into the frame. The MMU does all of this for you, assuming you've set %cr3 to the correct value.
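As a purely illustrative sketch of that bit slicing (the address is made up; the indices simply mirror the 10 + 10 + 12 split described above):

    #include <stdio.h>
    #include <stdint.h>

    /* Split a 32-bit virtual address into the indices used by the
       two-level scheme described above (10 + 10 + 12 bits). */
    int main(void)
    {
        uint32_t vaddr = 0xC0801234u;                  /* made-up example address */

        uint32_t dir_index = (vaddr >> 22) & 0x3FF;    /* top 10 bits -> page directory */
        uint32_t tab_index = (vaddr >> 12) & 0x3FF;    /* next 10 bits -> page table    */
        uint32_t offset    =  vaddr        & 0xFFF;    /* low 12 bits  -> within frame  */

        printf("vaddr 0x%08X -> directory %u, table %u, offset 0x%03X\n",
               (unsigned)vaddr, (unsigned)dir_index,
               (unsigned)tab_index, (unsigned)offset);
        /* The MMU walks: %cr3 -> directory[dir_index] -> table[tab_index]
           -> physical frame base + offset. */
        return 0;
    }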
64-bit systems have a 4-level paging scheme because their memory spaces are much more sparse. Also, it is possible to use page sizes other than 4KB.
To actually get to your questions:
All of this paging information (tables, directories, etc.) sits in kernel memory. Note that kernel memory is one big chunk, and there is no concept of having kernel memory for a single process.
There is only one page directory per process. This is because the page directory defines a memory space and each process has exactly one memory space.
The last paragraph above gives you the way an address is chopped up.
Overall that's pretty much correct.
If memory serves, a few details are a bit off though:
The paging for the kernel memory doesn't change per-process, so all the page tables are always visible to the kernel.
In theory, there's also a segment-based translation step. Most practical systems (e.g., *BSD, Linux, Windows, OS/X), however, use segments with their base set to 0 and limit set to the address space limit, so this step ends up as essentially a NOP.

Why does CPU access memory on a word boundary?

I have heard a lot that data should be properly aligned in memory for better access efficiency, and that the CPU accesses memory on word boundaries.
So in the following scenario, the CPU has to make 2 memory accesses to get a single word.
Supposing: 1 word = 4 bytes
("|" stands for word boundary. "o" stands for byte boundary)
|----o----o----o----|----o----o----o----| (The word boundary in CPU's eye)
----o----o----o---- (What I want to read from memory)
Why should this happen? What's the root cause of the CPU only being able to read at word boundaries?
If the CPU can only access on 4-byte word boundaries, the address lines should only need to be 30 bits wide, not 32, because the last 2 bits are always 0 in the CPU's eyes.
ADD 1
Moreover, if we accept that the CPU must read at word boundaries, why can't the boundary start wherever I want to read? It seems that the boundaries are fixed in the CPU's eyes.
ADD 2
According to AnT, it seems that the boundaries are hardwired, and they are hardwired by the memory access hardware. The CPU is innocent as far as this is concerned.
The meaning of "can" (in "...CPU can access...") in this case depends on the hardware platform.
On the x86 platform, CPU instructions can access data aligned on absolutely any boundary, not only on a "word boundary". The misaligned access might be less efficient than aligned access, but the reasons for that have absolutely nothing to do with the CPU. It has everything to do with how the underlying low-level memory access hardware works. It is quite possible that in this case the memory-related hardware will have to make two accesses to the actual memory, but that's something CPU instructions don't know about and don't need to know about. As far as the CPU is concerned, it can access any data on any boundary. The rest is implemented transparently to CPU instructions.
On hardware platforms like Sun SPARC, the CPU cannot access misaligned data (in simple words, your program will crash if you attempt to), which means that if for some reason you need to perform this kind of misaligned access, you'll have to implement it manually and explicitly: split it into two (or more) CPU instructions and thus explicitly perform two (or more) memory accesses.
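To illustrate what such an explicit split looks like, here is a hedged sketch in C (the helper name is made up; real code would normally just use memcpy and let the compiler emit the right sequence, and the aligned-word casts below sidestep strict-aliasing rules purely for illustration):

    #include <stdint.h>

    /* Read a 32-bit little-endian value from a possibly misaligned address
       using only aligned word accesses - roughly the sequence a SPARC-like
       target has to perform in software (or x86 hardware performs for you).
       Illustrative sketch only: it may read a few bytes past the requested
       range, and portable code would simply use memcpy instead. */
    static uint32_t load32_unaligned(const uint8_t *p)
    {
        uintptr_t addr    = (uintptr_t)p;
        uintptr_t aligned = addr & ~(uintptr_t)3;      /* round down to a word boundary */
        unsigned  shift   = (unsigned)(addr & 3) * 8;  /* bit offset of the first byte  */

        const uint32_t *w = (const uint32_t *)aligned;
        uint32_t lo = w[0];                            /* first aligned word  */
        if (shift == 0)
            return lo;                                 /* already aligned: one access */
        uint32_t hi = w[1];                            /* second aligned word */
        return (lo >> shift) | (hi << (32 - shift));   /* stitch the halves together */
    }

Two aligned reads plus some shifting reconstruct the value that spans the boundary; on x86 the memory subsystem hides this, while on stricter platforms the compiler or the programmer has to spell out something along these lines.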
As for why it is so... well, that's just how modern computer memory hardware works. The data has to be aligned. If it is not aligned, the access either is less efficient or does not work at all.
A very simplified model of modern memory would be a grid of cells (rows and columns), each cell storing a word of data. A programmable robotic arm can put a word into a specific cell and retrieve a word from a specific cell. One at a time. If your data is spread across several cells, you have no other choice but to make several consecutive trips with that robotic arm. On some hardware platforms the task of organizing these consecutive trips is hidden from CPU (meaning that the arm itself knows what to do to assemble the necessary data from several pieces), on other platforms it is visible to the CPU (meaning that it is the CPU who's responsible for organizing these consecutive trips of the arm).
It saves silicon in the addressing logic if you can make certain assumptions about the address (like "bottom n bits are zero). Some CPUs (x86 and their work-alikes) will put logic in place to turn misaligned data into multiple fetches, concealing some nasty performance hits from the programmer. Most CPUs outside of that world will instead raise a hardware error explaining in no uncertain terms that they don't like this.
All the arguments you're going to hear about "efficiency" are bollocks or, more precisely, are begging the question. The real reason is simply that it saves silicon in the processor core if the number of address bits can be reduced for operations. Any inefficiency that arises from misaligned access (as in the x86 world) is a result of the hardware design decisions, not intrinsic to addressing in general.
Now that being said, for most use cases the hardware design decision makes sense. If you're accessing data in two-byte words, most common use cases have you access offset, then offset+2, then offset+4 and so on. Being able to increment the address byte-wise while accessing two-byte words is typically (as in 99.44% certainly) not what you want to be doing. As such it doesn't hurt to require address offsets to align on word boundaries (it's a mild, one-time inconvenience when you design your data structures) but it sure does save on your silicon.
As a historical aside, I worked once on an Interdata Model 70 -- a 16-bit minicomputer. It required all memory access to be 16-bit aligned. It also had a very small amount of memory by the time I was working on it by the standards of the time. (It was a relic even back then.) The word-alignment was used to double the memory capacity since the wire-wrapped CPU could be easily hacked. New address decode logic was added that took a 1 in the low bit of the address (previously an alignment error in the making) and used it to switch to a second bank of memory. Try that without alignment logic! :)
Because it is more efficient.
In your example, the CPU would have to do two reads: it has to read the first half, then read the second half separately, and then reassemble them to do the computation. This is much more complicated and slower than doing the read in one go when the data is properly aligned.
Some processors, like x86, can tolerate misaligned data access (so you would still need all 32 bits) - others like Itanium absolutely cannot handle misaligned data accesses and will complain quite spectacularly.
Word alignment is not only a feature of CPUs.
At the hardware level, most RAM modules have a given word size with respect to the number of bits that can be accessed per read/write cycle.
On a module I had to interface with on an embedded device, addressing was implemented through three parameters: the module was organized in four banks, which could be selected prior to the read/write operation. Each of these banks was essentially a large table of 32-bit words, which could be addressed through a row and a column index.
In this design, access was only possible per cell, so every read operation returned 4 bytes, and every write operation expected 4 bytes.
A memory controller hooked up to this RAM chip could be designed in two ways: either allowing unrestricted access to the memory chip, using several cycles to split/merge unaligned data to/from several cells (with additional logic), or imposing some restrictions on how memory can be accessed, with the gain of reduced complexity.
As complexity can impede maintainability and performance, most designers chose the latter. [citation needed]
