Why does ARM have 64KB Large Pages?

The ARM720T user manual mentions small and large pages. Since the ARM720T requires the page table entry for a 64KB large page to be duplicated 16 times in the page table, why not place 16 small page (4KB) entries to mimic a 64KB large page instead of using a large page in the first place?

From the ARM720T TRM:
Large Pages consist of 64KB blocks of memory. Large Pages are supported to allow mapping of a large region of memory while using only a single entry in the TLB. Additional access control mechanisms are extended to 16KB Sub-Pages.
The main benefit is that a 64KB entry consumes only one entry in the TLB (the MMU's cache of page table entries). The TLB is 64 entries, so 64*4KB = 256KB versus 64*64KB = 4MB; a significant increase in the amount of memory that can be addressed without a page table lookup.
There are many downsides. For instance, a portable OS (and its API) might require the smaller pages. If all entries are 64KB, fragmentation can result. Section entries are even better, with each representing a 1MB chunk, so 64MB fits in the TLB. Generally a section will work better for a virtual==physical mapping.
If you know your system only has 4MB of usable memory, then 64KB page entries can result in more reliable performance. Even with larger memory sizes, the interrupt code and data can use 64KB entries with TLB lock-down (see the note below) to avoid page table walks. This can result in better IRQ latency. The TLB is a limited resource, so using 4KB entries for the interrupt handler may waste TLB entries. Using section entries may waste memory, as most interrupt code is well under 1MB.
Even without lock-down, it is more likely that a frequently used 64KB entry will remain in the TLB. An OS with per task/process memory may need to change the MMU tables, which can result in TLB and cache flushing and invalidation. In order to simplify the context switch, everything may be invalidated and flushed, so a table walk on an interrupt may be more common than you would suspect. This is a motivation to use the MMU 'PID' functionality and to flush/invalidate only smaller regions of memory, allowing kernel code/data to remain in the system caches. Additional code like the scheduler will also benefit from being mapped by a 64KB entry.
Note: The ARM720T may or may not have lock-down, but some ARM CPUs do, and the MMU entry formats are fairly similar between CPU families. This answer applies to many different families of ARM CPUs.
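To make the duplication concrete, here is a rough C sketch of a coarse (second-level) page table in the classic ARM short-descriptor format used by the ARM720T. The macro names and helper functions are invented for illustration and the bit assignments should be double-checked against the TRM; the point is simply that a large page fills 16 slots with one identical descriptor (so a hit costs one TLB entry covering 64KB), while the 4KB alternative fills the same 16 slots with 16 distinct translations.

```c
#include <stdint.h>

/* Coarse second-level page table: 256 word-sized entries, indexed by
 * VA[19:12]. Field layout follows the classic ARM small/large page
 * descriptors; verify against the ARM720T TRM before real use. */
#define DESC_LARGE_PAGE  0x1u          /* bits[1:0] = 01: 64KB large page */
#define DESC_SMALL_PAGE  0x2u          /* bits[1:0] = 10: 4KB small page  */
#define DESC_AP_RW_ALL   (0xFFu << 4)  /* AP3..AP0 = 0b11: full access    */
#define DESC_CACHEABLE   (1u << 3)     /* C bit */
#define DESC_BUFFERABLE  (1u << 2)     /* B bit */

/* One 64KB large page: the same descriptor is written into all 16 slots
 * that VA[15:12] can select (the duplication the question asks about),
 * but a hit loads only a single TLB entry covering the whole 64KB. */
static void map_large_page(uint32_t *coarse, uint32_t va, uint32_t pa)
{
    uint32_t desc  = (pa & 0xFFFF0000u) | DESC_AP_RW_ALL |
                     DESC_CACHEABLE | DESC_BUFFERABLE | DESC_LARGE_PAGE;
    uint32_t first = (va >> 12) & 0xF0u;   /* first of the 16 slots */

    for (unsigned i = 0; i < 16; i++)
        coarse[first + i] = desc;
}

/* The alternative: 16 independent 4KB small pages covering the same 64KB.
 * This also fills 16 slots, but each slot is a separate translation, so
 * walking the region can consume up to 16 TLB entries instead of one. */
static void map_as_small_pages(uint32_t *coarse, uint32_t va, uint32_t pa)
{
    uint32_t first = (va >> 12) & 0xF0u;

    for (unsigned i = 0; i < 16; i++)
        coarse[first + i] = ((pa + i * 0x1000u) & 0xFFFFF000u) |
                            DESC_AP_RW_ALL | DESC_CACHEABLE |
                            DESC_BUFFERABLE | DESC_SMALL_PAGE;
}
```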

Related

why mmap is faster than traditional file io

Possible Duplicate:
mmap() vs. reading blocks
I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If yes, then why is it faster?
1. mmap() is not reading sequentially.
2. mmap() has to fetch from the disk itself same as read() does.
3. The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?
I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If yes, then why is it faster?
It can be - there are pros and cons, listed below. When you really have reason to care, always benchmark both.
Quite apart from the actual IO efficiency, there are implications for the way the application code tracks when it needs to do the I/O, and does data processing/generation, that can sometimes impact performance quite dramatically.
1. mmap() is not reading sequentially.
2. mmap() has to fetch from the disk itself same as read() does.
3. The mapped area is not sequential - so no DMA (?).

So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?

1. is wrong... mmap() assigns a region of virtual address space corresponding to the file content... whenever a page in that address space is accessed, physical RAM is found to back the virtual addresses and the corresponding disk content is faulted into that RAM. So the order in which reads are done from the disk matches the order of access. It's a "lazy" I/O mechanism. If, for example, you needed to index into a huge hash table that was to be read from disk, then mmap()ing the file and starting to do accesses means the disk I/O is not done sequentially and may therefore result in a longer elapsed time until the entire file is read into memory, but while that's happening lookups are succeeding and dependent work can be undertaken, and if parts of the file are never actually needed they're not read (allow for the granularity of disk and memory pages, and note that even when using memory mapping many OSes let you specify performance-enhancing / memory-efficiency hints about your planned access patterns so they can proactively read ahead or release memory more aggressively, knowing you're unlikely to return to it).
2. is absolutely true.
3. "The mapped area is not sequential" is vague. Memory-mapped regions are "contiguous" (sequential) in virtual address space. We've discussed disk I/O being sequential above. Or are you thinking of something else? Anyway, while pages are being faulted in, they may indeed be transferred using DMA.
Further, there are other reasons why memory mapping may outperform usual I/O:
- there's less copying:
  - often OS & library level routines pass data through one or more buffers before it reaches an application-specified buffer; the application then dynamically allocates storage, then copies from the I/O buffer to that storage so the data is usable after the file read completes
  - memory mapping allows (but doesn't force) in-place usage (you can just record a pointer and possibly a length)
  - continuing to access data in-place risks increased cache misses and/or swapping later: the file/memory-map could be more verbose than the data structures into which it could be parsed, so access patterns on the data therein could have more delays while faulting in more memory pages
- memory mapping can simplify the application's parsing job by letting the application treat the entire file content as accessible, rather than worrying about when to read another buffer full
- the application defers more to the OS's wisdom regarding the number of pages that are in physical RAM at any single point in time, effectively sharing a direct-access disk cache with the application
- as well-wisher comments below, "using memory mapping you typically use less system calls"
- if multiple processes are accessing the same file, they should be able to share the physical backing pages
There are also reasons why mmap may be slower - do read Linus Torvalds' post here, which says of mmap:
...page table games along with the fault (and even just TLB miss)
overhead is easily more than the cost of copying a page in a nice
streaming manner...
And from another of his posts:
quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Linux does have "hugepages" (so one TLB entry per 2MB, instead of per 4KB) and even Transparent Huge Pages, where the OS attempts to use them even if the application code wasn't written to explicitly utilise them.
FWIW, the last time this arose for me at work, memory mapped input was 80% faster than fread et al for reading binary database records into a proprietary database, on 64 bit Linux with ~170GB files.
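As a hedged illustration of the copying and laziness points above (not a rigorous benchmark), the following C sketch reads the same file once with read() into a user buffer and once via mmap(). The file name "data.bin" and the buffer size are arbitrary placeholders; time both variants on your own data and access pattern, as recommended above.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static unsigned long sum_with_read(const char *path)
{
    unsigned long sum = 0;
    char buf[1 << 16];
    ssize_t n;
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }

    /* Data is copied from the kernel's page cache into buf on every call. */
    while ((n = read(fd, buf, sizeof buf)) > 0)
        for (ssize_t i = 0; i < n; i++)
            sum += (unsigned char)buf[i];
    close(fd);
    return sum;
}

static unsigned long sum_with_mmap(const char *path)
{
    unsigned long sum = 0;
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0) { perror("open/fstat"); exit(1); }

    /* File pages are mapped read-only and faulted in lazily as they are
     * touched; there is no copy into a separate user-space buffer. */
    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }

    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];

    munmap(p, st.st_size);
    close(fd);
    return sum;
}

int main(void)
{
    printf("read: %lu\n", sum_with_read("data.bin"));
    printf("mmap: %lu\n", sum_with_mmap("data.bin"));
    return 0;
}
```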
mmap() can share between processes.
DMA will be used whenever possible. DMA does not require contiguous memory -- many high-end cards support scatter-gather DMA.
The memory area may be shared with the kernel block cache if possible, so there is less copying.
Memory for mmap is allocated by the kernel; it is always page-aligned.
"Faster" in absolute terms doesn't exist. You'd have to specify constraints and circumstances.
mmap() is not reading sequentially.
What makes you think that? If you really access the mapped memory sequentially, the system will usually fetch the pages in that order.
mmap() has to fetch from the disk itself same as read() does
Sure, but the OS determines the timing and the buffer size.
The mapped area is not sequential - so no DMA (?).
see above
What mmap helps with is that there is no extra user-space buffer involved; the "read" takes place where the OS kernel sees fit and in chunks that can be optimized. This may be an advantage in speed, but first of all it is just an interface that is easier to use.
If you want to know about speed for a particular setup (hardware, OS, use pattern) you'd have to measure.
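To make the "OS determines the timing and the buffer size" point concrete, here is a small C sketch, assuming a POSIX system, that maps a file and passes an access-pattern hint with posix_madvise(). Whether the hint changes anything is up to the kernel, so, as said above, measure for your particular setup; the file name is a placeholder.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* "input.dat" is a placeholder file name. */
    int fd = open("input.dat", O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) { perror("open/fstat"); return 1; }

    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Tell the kernel we intend to walk the mapping front to back, so it
     * may read ahead more aggressively and drop pages behind us. */
    posix_madvise(p, st.st_size, POSIX_MADV_SEQUENTIAL);

    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];
    printf("sum = %lu\n", sum);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```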

Arm cortex a9 memory access

I want to know the sequence in which an ARM core (Cortex-A series processor) accesses memory, right from the virtual address generated by the core to the instruction/data transferred from memory back to the core. Suppose the core has generated a virtual address for some data/instruction and there is a miss in the TLBs: how does the address reach main memory (DRAM, if I am not wrong), and how does the data come back to the core through the L2 and L1 caches?
What if required data/instruction is already in L1 cache?
What if required data/instruction is already in L2 cache?
I am confused regarding cache and MMU communications.
tl;dr - Whatever you want. The ARM is highly flexible and the SOC vendor and/or the system programmer may make the memory sub-systems do a great many different things depending on the end device features and needs.
First, the MMU has fields that explicitly dictate how the cache is to be used. I recommend reading Chapter 9 Caches and Chapter 10 Memory Management Unit of the Cortex-A Series Programmers Guide.
Some terms are:
PoC - Point of Coherency
PoU - Point of Unification
Strongly-ordered
Device
Normal
Many MMU properties and caching behaviours can be affected by different CP15 and configuration registers. For instance, an 'exclusive' configuration (where data in the L1 cache is never also in the L2) can make it particularly difficult to cleanly write self-modifying code and other dynamic updates. So, even for a particular Cortex-A model, the system configuration may change things (write-back/write-through, write-allocate/no write-allocate, bufferable, non-cacheable, etc).
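For orientation, here is a minimal C sketch (assuming a GCC-style toolchain, a privileged mode, and the ARMv7-A bit positions) of reading one of those CP15 registers, the System Control Register, to see whether the MMU and caches are even enabled. It is illustrative only and not taken from the Programmer's Guide.

```c
#include <stdint.h>
#include <stdio.h>

/* Read the System Control Register (SCTLR, CP15 c1) and report the MMU and
 * cache enable bits. Must run in a privileged mode (kernel or bare metal);
 * the bit positions below are the ARMv7-A ones. */
static inline uint32_t read_sctlr(void)
{
    uint32_t val;
    asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(val));
    return val;
}

void print_mmu_cache_state(void)
{
    uint32_t sctlr = read_sctlr();
    printf("MMU     %s\n", (sctlr & (1u << 0))  ? "on" : "off"); /* M bit */
    printf("D-cache %s\n", (sctlr & (1u << 2))  ? "on" : "off"); /* C bit */
    printf("I-cache %s\n", (sctlr & (1u << 12)) ? "on" : "off"); /* I bit */
}
```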
A typical sequence for general DDR core memory is:
1. Resolve virt -> phys:
   a. Micro TLB hit? Yes: have phys.
   b. Main TLB hit? Yes: have phys.
   c. Table walk. Have phys, or fault.
2. Access marked cacheable? Yes: go to 2a. No: go to step 4.
   a. In L1 cache? Yes: go to 2b. No: go to step 3.
   b. If a read, return the data. If a write, fill the line and mark it dirty (write-back).
3. In L2 cache? Yes: go to 3a. No: go to step 4.
   a. If a read, return the data. If a write, fill the line and mark it dirty (write-back).
4. Run a physical cycle on the AXI bus (which may route to a sub-bus).
What if required data/instruction is already in L1 cache?
What if required data/instruction is already in L2 cache?
For normal cases these are just cache hits. If it is 'write-through' and a write, the value is updated in the cache and written to memory. If it is 'write-back', the value is updated in the cache and marked dirty (see Note 1). If it is a read, then the cache memory is used (in both cases).
The system may be set up completely differently for device memory (i.e., memory-mapped USB registers, world-shareable memory, multi-core/CPU buffers, etc). Often the setup will depend on system cost, performance and power consumption. For instance, a write-through cache is easier to implement (lower power and less cost) but often gives lower performance.
I am confused regarding cache and MMU communications.
Mainly the MMU will provide information for the caches to resolve an address. The MMU may say to use/not use the cache. It may tell the cache it can 'gang' writes together (write-bufferable), but should not store them indefinitely, etc. So many of the MMU specifiers can selectively alter the behavior of the cache. As the Cortex-A cache parameters are not defined (it is up to each SOC manufacturer), it is often the case that particular MMU bits may have alternate behavior on different systems.
Note 1: The 'dirty cache' may have additional 'broadcasts' of exclusive monitor information for strex and ldrex type accesses.
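To show how those MMU specifiers steer the caches, here is a hedged C sketch of building ARMv7-A short-descriptor section (1MB) entries with different attribute bits. The encodings shown (TEX/C/B for normal write-back memory and for device memory) are the common ones but should be checked against the ARM ARM, and the helper names are invented for illustration.

```c
#include <stdint.h>

/* Illustrative ARMv7-A short-descriptor section (1MB) entry builders.
 * Not taken from any particular kernel; verify encodings before real use. */
#define SEC_TYPE    0x2u           /* bits[1:0] = 10: section   */
#define SEC_B       (1u << 2)      /* bufferable                */
#define SEC_C       (1u << 3)      /* cacheable                 */
#define SEC_XN      (1u << 4)      /* execute never             */
#define SEC_AP_RW   (0x3u << 10)   /* AP[1:0] = 11: full access */
#define SEC_TEX(x)  ((uint32_t)(x) << 12)

/* Normal memory, write-back write-allocate (TEX=001, C=1, B=1). */
static uint32_t section_normal_wb(uint32_t pa)
{
    return (pa & 0xFFF00000u) | SEC_TEX(1) | SEC_C | SEC_B |
           SEC_AP_RW | SEC_TYPE;
}

/* Shareable Device memory for memory-mapped registers (TEX=000, C=0, B=1),
 * marked execute-never so nothing is ever fetched from register space. */
static uint32_t section_device(uint32_t pa)
{
    return (pa & 0xFFF00000u) | SEC_B | SEC_XN | SEC_AP_RW | SEC_TYPE;
}
```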

Segments in RAM memory

I am confused about the segments in RAM memory; please clarify the following doubts:
1. RAM has been divided into user space and kernel space; is this memory division done by the O/S or by the H/W (CPU)?
2. What are the contents of kernel space? As far as I have understood, there will be the kernel image only; please correct me if I am wrong.
3. Where do the code, data, stack and heap segments exist?
   a) Do user and kernel space have separate code, data, stack and heap segments?
   b) Are these segments created by the H/W or the O/S?
4. Can I find the amount of memory occupied by kernel space and user space?
   a) Is there any Linux command or system call to find this?
5. Why has the RAM been divided into user space and kernel space?
   a) I feel it is done to keep the kernel safe from application programs; is that so? Is this the only reason?
I am a beginner, so please suggest some good books, links and the way to approach these concepts.
I took up the challenge and tried with rather short answers:
1. Execution happens in user and kernel space. BIOS & CPU support the OS in detecting and separating resources/address ranges such as main memory and devices (-> related question) to establish protected mode. In protected mode, memory is separated via virtual address spaces, which are mapped page-wise (usually in blocks of 4096 bytes) to real addresses of physical memory via the MMU (Memory Management Unit).
From user space, one cannot access memory directly (in real mode); one has to access it via the MMU, which acts like a transparent proxy with access protection. Access errors are known as segmentation fault, access violation or segmentation violation (SIGSEGV), which are abstracted as NullPointerException (NPE) in high-level programming languages like Java.
Read about protected mode, real mode and 'rings'.
Note: Special CPUs, such as in embedded systems, don't necessarily have an MMU and could therefore be limited to special OSes like µClinux or FreeRTOS.
2. The kernel also allocates buffers; the same goes for drivers (e.g. I/O buffers for disks, network interfaces and GPUs).
3. Generally, resources exist per space and per process/thread.
a) The kernel puts its own, protected stack on top of the user space stack (per thread) and also has separate code (also 'text'), data and heap segments. Also, each process has its own resources.
b) CPU architectures have certain requirements (depending on the degree of support they offer), but in the end it is the software (the kernel & the user space libraries used for interfacing) which creates these structures.
4. Every reasonable OS provides at least one way to do that.
a) Try sudo cat /proc/slabinfo or simply sudo slabtop
5. Read answer 1.
a) Primarily, yes, just as user space processes are isolated from each other, except for special techniques such as CMA (Cross Memory Attach) for fast direct access in newer kernels.
Search the stack sites for recommended books
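As a tiny, hedged demonstration of the protection described in answer 1, the C program below writes through a NULL pointer. The process has no valid, writable mapping at address 0, so the MMU blocks the store and the kernel delivers SIGSEGV; it is meant only to be watched crashing.

```c
#include <stdio.h>

int main(void)
{
    /* An address this process has no right to use. volatile keeps the
     * compiler from optimising the faulting store away. */
    volatile int *p = (volatile int *)0;

    printf("about to write through a NULL pointer...\n");
    *p = 42;                      /* MMU blocks this; kernel raises SIGSEGV */
    printf("never reached\n");
    return 0;
}
```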
What can cause segmentation faults in C++?

Is it possible to assign parts of the shared L2 caches to different cores

Let's say 4 threads are running on 4 separate cores of a multicore x86 processor and they do not share any data. Is it possible to programmatically make the 4 cores use separate and predefined portions of the shared L2 cache?
Let's use two terms, exclusive and shared caches, instead of L1, L2, L3, L4 caches. Different CPU families start to share cache at different levels. In these terms the original question is: is it possible to split a shared cache into parts, each of which will be used exclusively by one CPU/core? There is no clear answer. Furthermore, there are two answers opposite to each other.
1) First and general answer: NO.
The cache is by design managed in hardware. There are only a few cache controls accessible to software, such as enabling/disabling the cache for the whole memory or for a defined memory region, or applying a specified policy for cache flushing (write-through/write-back). The answer is NO basically because the cache was designed to be managed in hardware, so there is no useful interface that allows managing it gracefully in software.
2) Second answer: Yes.
In fact, a cache is designed in such a way that each line of the cache can hold data from a specified set of memory lines. Because of this, if the memory manager guarantees that one CPU/core exclusively owns and uses all the memory lines assigned to a given cache line, then that cache line will be used by that CPU exclusively. This is a very tricky workaround, essentially what is known as cache (page) coloring. It has very limited benefits and serious drawbacks: the memory layout becomes very fragmented, cache usage is unbalanced, memory management is complicated, and it is very hardware-dependent (details can be found in the paper provided by "MetallicPriest").
To summarise: it is possible in theory and almost impossible in practice.
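Here is a hedged C sketch of the arithmetic behind that coloring trick. The cache geometry constants are example values only (substitute your CPU's real line size, set count and page size), and an actual implementation would also need an allocator that can hand out physical pages by color, which ordinary user-space code cannot do.

```c
#include <stdint.h>
#include <stdio.h>

/* Example geometry: 8MB shared cache, 16-way, 64-byte lines -> 8192 sets. */
#define LINE_SIZE  64u
#define NUM_SETS   8192u
#define PAGE_SIZE  4096u

#define SETS_PER_PAGE  (PAGE_SIZE / LINE_SIZE)   /* sets one page touches */
#define NUM_COLORS     (NUM_SETS / SETS_PER_PAGE)

/* Which set a physical address indexes in a physically indexed cache. */
static unsigned cache_set(uint64_t paddr)
{
    return (paddr / LINE_SIZE) % NUM_SETS;
}

/* A page's "color": pages of different colors occupy disjoint sets, so
 * giving each core only pages of its own colors partitions the cache. */
static unsigned page_color(uint64_t paddr)
{
    return (paddr / PAGE_SIZE) % NUM_COLORS;
}

int main(void)
{
    uint64_t a = 0x10000000, b = 0x10001000;   /* two adjacent pages */
    printf("page %#llx: color %u, first set %u\n",
           (unsigned long long)a, page_color(a), cache_set(a));
    printf("page %#llx: color %u, first set %u\n",
           (unsigned long long)b, page_color(b), cache_set(b));
    return 0;
}
```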

ARM11 Translation Lookaside Buffer (TLB) usage?

Is there a decent guide explaining how to use the TLB (Translation Lookaside Buffers) tables on an ARM1176JZF-S core?
Having looked over the technical documentation for that ARM platform, I still have no clue what a TLB is or what it looks like. As far as I understand, each TLB entry maps a virtual page to a physical page, allowing remapping and controlling memory permissions.
Apart from that, I have absolutely no clue on how to use them.
What structure does a TLB entry have? How do I create new entries?
How do I handle VM in context switches for user-space threads? How do I ensure that those threads can only access specific pages assigned to their parent processes (enforce memory protection)? Do I save the TLB state for each context?
Why are there two TLBs? What can I use the MicroTLB for if it can only have 10 entries? Surely, I need more than 10.
It says that one of the parts of the main TLB is "a fully-associative array of eight elements, that is lockable". What is that? Do I only get to have 8 entries for the Main TLB?
Thank you in advance. I'll be really glad if someone provides an explanation of what TLBs are. I'm currently working on a memory mapper for my kernel, and I've pretty much hit a dead end.
The technical reference manual for ARM1176JZF-S appears to be DDI 0301. That document contains all the specific details for that specific ARM core.
I still have no clue what a TLB is or what it looks like. As far as I understand, each TLB entry maps a virtual page to a physical page, allowing remapping and controlling memory permissions.
A TLB is a cache of the page table. Some processors allow direct access to the TLB while knowing nothing about page tables (e.g. MIPS), while others know about page tables and internally use TLBs that the programmer mostly doesn't see (e.g. x86). In this case, the TLB is managed by hardware, and the system programmer only has to make the TTB (Translation Table Base) registers point to the page table and invalidate the TLB in appropriate places.
What structure does a TLB entry have? How do I create new entries?
Done by hardware. On a TLB miss, the MMU walks the page table and fills the TLB from there.
How do I handle VM in context switches for user-space threads?
Some platforms have TLBs that simply map virtual addresses to physical addresses (e.g. x86). On these platforms, you have to do a full TLB flush on each context switch. Other platforms (MIPS, this specific ARM core) map (ASID, virtual address) pairs to physical addresses. An ASID is an Address Space Identifier, i.e. an identifier for a process. The MMU uses a register to know which ASID to use (I think it's the Context ID register in this case). Since there may be more processes than ASIDs, occasionally you may need to recycle an ASID (assigning it to a different process) and do a TLB flush (that's what the Invalidate TLB by ASID operation is for).
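A hedged sketch of how that can look in C for an ARMv6-style core such as the ARM1176 (privileged code only): the CP15 encodings below are the documented ones, but the surrounding policy (when to recycle an ASID, required barriers, TTBR attribute bits) is deliberately simplified, so treat this as an outline rather than a drop-in implementation.

```c
#include <stdint.h>

static inline void set_contextidr(uint32_t asid)
{
    /* CONTEXTIDR: the low 8 bits are the ASID used to tag new TLB entries. */
    asm volatile("mcr p15, 0, %0, c13, c0, 1" : : "r"(asid));
}

static inline void set_ttbr0(uint32_t table_pa)
{
    /* Point Translation Table Base 0 at the process's first-level table. */
    asm volatile("mcr p15, 0, %0, c2, c0, 0" : : "r"(table_pa));
}

static inline void invalidate_tlb_by_asid(uint32_t asid)
{
    /* Invalidate unified TLB entries tagged with this ASID only. */
    asm volatile("mcr p15, 0, %0, c8, c7, 2" : : "r"(asid & 0xFFu));
}

/* Switch to the address space of a process identified by (asid, page table).
 * If the ASID was recycled from another process, its stale entries are
 * dropped first; otherwise no TLB flush is needed at all. */
void switch_address_space(uint32_t asid, uint32_t ttb_pa, int asid_recycled)
{
    if (asid_recycled)
        invalidate_tlb_by_asid(asid);
    set_ttbr0(ttb_pa);
    set_contextidr(asid);
    /* A real kernel also needs barriers and care about the ordering of the
     * TTBR and CONTEXTIDR updates so speculative fetches aren't mis-tagged. */
}
```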
Why are there two TLBs? What can I use the MicroTLB for if it can only have 10 entries? Surely, I need more than 10.
This is exactly for the same reason you have small separate level-1 caches for instructions and data. Since they are caches, you don't need more than 10 (though having more could improve performance).
It says that one of the parts of the main TLB is "a fully-associative array of eight elements, that is lockable". What is that? Do I only get to have 8 entries for the Main TLB?
Some memory pages (e.g: some portions of the kernel) are accessed very often. It makes sense to lock them, so they don't get thrown off of the TLB. Also, on realtime systems, a TLB miss or a cache miss may introduce some unwanted unpredictability. So, there is an option to lock a number of TLB entries. The main TLB has more entries, but only those 8 are lockable.

Resources