Simultaneous existence of different-sized pages on AArch64 - arm

According to the architecture overview document, AArch64 supports 4K and 64K pages, and some CPUs also support 16K pages. Looking into the details of the address translation scheme, I came to the conclusion that such CPUs don't support the simultaneous existence of different-sized pages (unlike x86_64, which allows that). Am I right?

You're conflating two different, albeit related, things here - page size vs. granularity.
In AArch64, you have 3 possible translation granules to choose from, each of which results in a different set of page sizes:
4KB granule: 4KB, 2MB, and 1GB pages.
16KB granule: 16KB and 32MB pages.
64KB granule: 64KB and 512MB pages.
The translation granule defines various properties of the translation regime in general, so it applies to a whole set of tables. You are correct in the sense that you can't mix and match granules within a table, although it's perfectly fine to use different granules for different tables at the same time (e.g. at different exception levels).
Comparatively, x86 always has 4KB granularity, but the range of page sizes on offer varies depending on the mode:
32-bit: 4KB and 4MB pages.
PAE: 4KB and 2MB pages.
64-bit: 4KB, 2MB, and (if supported) 1GB pages.
In both cases, the page sizes larger than the basic granule represent block entries at intermediate table levels. In other words, using the common 4KB-granule, 3-level* example:
Each valid entry in the first-level table points to either a naturally-aligned 1GB region of memory, or a second-level table describing that 1GB of address space.
Each valid entry in a second-level table points to either a naturally-aligned 2MB region of memory, or a third-level table describing that 2MB of address space.
Each valid entry in a third-level table points to a naturally-aligned 4KB region of memory.
* Depending on the actual address space size, there may be a zeroth-level table above this, but neither architecture allows block entries at that level (they would be impractically huge anyway). For AArch64 the larger granules only support block/page entries at levels 2 and 3, and the 64KB granule never has a level 0 at all.
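To make the 4KB-granule, 3-level example concrete, here is a small C sketch (an illustration only, not tied to any real translation tables) that splits a 39-bit virtual address into the three table indices and the page offset implied by the block sizes above; the example address is arbitrary:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t va = 0x0000004A1B2C3D4EULL;   /* arbitrary 39-bit example address */

    unsigned l1  = (va >> 30) & 0x1FF;     /* index into the first-level table (1GB regions) */
    unsigned l2  = (va >> 21) & 0x1FF;     /* index into a second-level table (2MB regions)  */
    unsigned l3  = (va >> 12) & 0x1FF;     /* index into a third-level table (4KB pages)     */
    unsigned off =  va        & 0xFFF;     /* byte offset within the 4KB page                */

    printf("L1=%u L2=%u L3=%u offset=0x%03X\n", l1, l2, l3, off);
    return 0;
}

Each index is 9 bits wide because a 4KB table holds 512 eight-byte descriptors.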

It looks like it is possible on a Raspberry Pi 4B (aarch64) with a 64-bit OS (Ubuntu 22.04),
but the number of huge pages has to be reserved or allocated as root beforehand:
echo "256" |sudo tee /proc/sys/vm/nr_hugepages
or via sysctl (add vm.nr_hugepages=256 to /etc/sysctl.conf to make it persistent across reboots):
sudo sysctl -w vm.nr_hugepages=256
Replace the 256 with the required maximum number of huge pages, each with the size reported by
grep Hugepagesize: /proc/meminfo
I get 2048 kB (2 MB) for huge pages, and 4 kB for normal pages with getconf PAGE_SIZE.
Now an mmap() call with the MAP_HUGETLB flag is accepted from a C program; a small sketch follows below.
I found this at "Why mmap cannot allocate memory?" when looking for a solution myself.
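As a minimal sketch of that mmap() call (assuming Linux, the default 2 MB huge page size, and that huge pages were reserved as above; error handling kept short):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 2 * 1024 * 1024;          /* one 2 MB huge page */

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");       /* fails if no huge pages are reserved */
        return EXIT_FAILURE;
    }

    ((char *)p)[0] = 42;                   /* touch it so the huge page is really backed */
    munmap(p, len);
    return 0;
}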

Related

Are Arrays Contiguous? (Virtual vs Physical)

I read that arrays are contiguous in Virtual Memory but probably not in Physical memory, and I don't get that.
Let's suppose I have an array of size 4KB (one page = one frame size); in virtual memory that array is one page.
In virtual memory every page is translated into one frame, so our array is still contiguous...
(In the page table we translate pages into frames, not every byte into its own frame...)
Side question (when answering this please mention clearly that it's for the side note):
When allocating an array of size one page in virtual memory, does it have to be one page, or could it be split into two contiguous pages in virtual memory (for example the bottom half of the first one and the top half of the second)? In this case at worst the answer above is 2, am I wrong?
Unless the start of the array happens to be aligned to the beginning of a memory page, it can still occupy two pages; it can start near the end of one page and end on the next page. Arrays allocated on the stack will probably not be forced to occupy a single page, because stack frames are simply allocated sequentially in the stack memory, and the array will usually be at the same offset within each stack frame.
The heap memory allocator (malloc()) could try to ensure that arrays that are smaller than a page will be allocated entirely on the same page, but I'm not sure if this is actually how most allocators are implemented. Doing this might increase memory fragmentation.
I read that arrays are contiguous in Virtual Memory but probably not in Physical memory, and I don't get that.
This statement is missing something very important: the array size.
For small arrays the statement is wrong. For "large/huge" arrays the statement is correct.
In other words: The probability of an array being split over multiple non-contiguous physical pages is a function of the array size.
For small arrays the probability is close to zero, but the probability increases as the array size increases. When the array size increases above the system's page size, the probability gets closer and closer to 1. But an array requiring multiple pages may still be contiguous in physical memory.
For your side question:
With an array size equal to your systems page size, the array can at maximum span two physical pages.
Anything (array, structure, ...) that is larger than the page size must be split across multiple pages; and therefore may be "virtually contiguous, physically non-contiguous".
Without further knowledge or restriction; anything (array, structure, ...) that is between its minimum alignment (e.g. 4 bytes for an array of uint32_t) and the page size has a probability of being split across multiple pages; where the probability depends on its size and alignment. For example, if page size is 4096 bytes and an array has a minimum alignment of 4 bytes and a size of 4092 bytes, then there's 2 chances in 1024 that it will end up on a single page (and a 99.8% chance that it will be split across multiple pages).
Anything (variable, tiny array, tiny structure, ...) that has a size equal to its minimum alignment won't (shouldn't - see note 3) be split across multiple pages.
Note 1: For anything using memory allocated from the heap, the minimum alignment can be assumed to be the (implementation defined) minimum alignment provided by the heap and not the minimum alignment of the object itself. E.g. for an array of uint16_t the minimum alignment would be 2 bytes; but malloc() will return memory with much larger alignment (maybe 16 bytes).
Note 2: When things are nested (e.g. array inside a structure inside another structure) all of the above applies to the outer structure only. E.g. if you have an array of uint16_t inside a structure where the array happens to begin at offset 4094 within the structure; then it will be significantly more likely that the array will be split across pages.
Note 3: It's possible to explicitly break minimum alignment using pointers (e.g. use malloc() to allocate 1024 bytes, then create a pointer to an array that begins at any offset you want within the allocated area).
Note 4: If something (array, structure, ...) is split across multiple pages; then there's a chance that it will still be physically contiguous. For worst case this depends on the amount of physical memory (e.g. if the computer has 1 GiB of usable physical memory and 4096 byte pages, then there's approximately 1 chance in 262000 that 2 virtually contiguous pages will be "physically contiguous by accident"). If the OS implements page/cache coloring (see https://en.wikipedia.org/wiki/Cache_coloring ) it improves the probability of "physically contiguous by accident" by the number of page/cache "colors" (e.g. if the computer has 1 GiB of usable physical memory and 4096 byte pages, and the OS uses 256 page/cache colors, then there's approximately 1 chance in 1024 that 2 virtually contiguous pages will be "physically contiguous by accident").
Note 5: Most modern operating systems use multiple page sizes (e.g. 4 KiB pages and 2 MiB pages, and maybe also 1 GiB pages). This can either make it hard to guess what the page size actually is, or improve the probability of "physically contiguous by accident" if you assume the smallest page size is used.
Note 6: For some CPUs (e.g. recent AMD/Zen) the TLBs behave as if pages are larger (e.g. as if you're using 16 KiB pages and not 4 KiB pages) if and only if page table entries are compatible (e.g. if 4 page table entries describe four physically contiguous 4 KiB pages with the same permissions/attributes). If an OS is optimized for these CPUs the result is similar to having an extra page size (4 KiB, "16 KiB", 2 MiB and maybe 1 GiB).
When allocating array in virtual memory of size one page does it have to be one page or could be split into two contiguous pages in virtual memory (for example bottom half of first one and top half of the second)?
When allocating an array in heap memory of size one page; the minimum alignment would be the implementation defined minimum alignment provided by the heap manager/malloc() (e.g. maybe 16 bytes). However; most modern heap managers switch to using an alternative (e.g. mmap() or VirtualAlloc() or similar) when the amount of memory being allocated is "large enough"; so (depending on the implementation and their definition of "large enough") it might be page aligned.
When allocating an array in raw virtual memory (e.g. using mmap() or VirtualAlloc() or similar yourself, and NOT using the heap and not using something like malloc()); page alignment is guaranteed (mostly because the virtual memory manager doesn't deal with anything smaller).
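If you want to observe this directly, here is a minimal POSIX C sketch (the 4092-byte size is just the example value used above; only sysconf(_SC_PAGESIZE) is assumed from the platform) that reports how many virtual pages a heap allocation actually spans:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);                     /* e.g. 4096 */
    size_t size = 4092;                                    /* example object size */
    unsigned char *buf = malloc(size);
    if (!buf) return EXIT_FAILURE;

    uintptr_t first = (uintptr_t)buf / page;               /* virtual page number of the first byte */
    uintptr_t last  = ((uintptr_t)buf + size - 1) / page;  /* virtual page number of the last byte  */

    printf("object spans %lu virtual page(s)\n", (unsigned long)(last - first + 1));
    free(buf);
    return 0;
}

Run it a few times: whether you get 1 or 2 depends on where malloc() happens to place the buffer relative to a page boundary.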

Does using FreeDOS allow my program to access more than 64 K of memory?

I am interested in programming in C on FreeDOS while learning some basic ASM in the process. Will using FreeDOS allow my program to access more than the standard 640K of memory?
And secondly, about the ASM: I know that on modern processors it is hard to program in assembly due to the complexity of the CPU architecture, but does using FreeDOS limit me to the presumably simpler 16-bit instruction set?
MS-DOS and FreeDOS can use the "HIMEM" areas. These are:
Some memory areas above 0xA000:0x0000 that are reserved for extension cards but contain RAM instead of an extension card
The memory from 0xFFFF:0x0010 to 0xFFFF:0xFFFF, which is located above 1MB but can be accessed using 16-bit real-mode code (if the so-called A20 line is active).
The maximum memory size that can be achieved this way is about 800K.
Using XMS and EMS you can use up to 64M:
XMS will allocate memory blocks above the area that can be accessed via 16-bit real-mode code. There are special functions that can copy data from that memory to the low 640K of memory and vice versa.
EMS is similar; however, using EMS it is possible to "map" the high memory to a low address (a feature of 32-bit CPUs), which means that you can access some memory above the 1MB area as if it were located at an address below 1MB.
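As a small illustration of the real-mode arithmetic behind those areas, this C sketch simply computes linear = segment * 16 + offset for the addresses mentioned above (illustration only; it is not DOS code):

#include <stdint.h>
#include <stdio.h>

static unsigned linear(uint16_t seg, uint16_t off)
{
    return ((unsigned)seg << 4) + off;     /* segment * 16 + offset */
}

int main(void)
{
    printf("0xA000:0x0000 -> 0x%05X\n", linear(0xA000, 0x0000)); /* 0xA0000, start of the reserved area */
    printf("0xFFFF:0x0010 -> 0x%05X\n", linear(0xFFFF, 0x0010)); /* 0x100000, exactly 1MB                */
    printf("0xFFFF:0xFFFF -> 0x%05X\n", linear(0xFFFF, 0xFFFF)); /* 0x10FFEF, top of the area above 1MB  */
    return 0;
}

The last two lines only reach memory above 1MB when the A20 line is enabled; with A20 disabled they wrap around to low addresses.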
Without any extender a program can use at most 640KB of low memory in DOS. But each structure is limited to the size of a segment, i.e. 64KB. That means you can have 10 large arrays of 64KB each. Of course you can have multiple arrays in a segment, but their total size must not exceed the segment size. Some compilers also handle addresses spanning multiple segments automatically so you can use objects larger than 64KB seamlessly, or you can do the same yourself if you're writing in assembly.
To access more memory you need an extender like EMS or XMS. But note that the address space is still 20 bits wide. The extenders just map the high memory regions into some segments in the addressable space, so you can only see a small window of your data at a time.
Regarding assembly, you can use 32-bit registers in 16-bit mode; there are 66h and 67h prefixes to change the operand and address size. However, that doesn't mean that writing 16-bit code is easier. In fact it has lots of idiosyncrasies to remember, like the limited register usage in memory addressing. The 32-bit x86 instruction set is a lot cleaner, with saner addressing modes as well as a flat address space, which is a lot easier to use.

Virtual memory management in Fortran under Mac OS X

I'm writing a Fortran 90 program (compiled using gfortran) to run under Mac OS X. I have 13 data arrays, each comprising about 0.6 GB of data. My machine is maxed out at 8 GB of real memory, and if I try to hold all 13 arrays in memory at once, I'm basically trying to use all 8 GB, which I know isn't possible in view of other system demands. So I know that the arrays would be subject to swapping. What I DON'T know is how this is managed by the operating system. In particular,
Does the OS swap out entire data structures (e.g., arrays) when it needs to make room for other data structures, or does it rather do it on a page-by-page basis? That is, does it swap out partial arrays, based on which portions of the array have been least-recently accessed?
The answer may determine how I organize the arrays. If partial arrays can get swapped out, then I could store everything in one giant array (with indexing to select which of the 13 subarrays I need) and trust the OS to manage everything efficiently. Otherwise, I might preserve separate and distinct arrays, each one individually fitting comfortably within the available physical memory.
Operating systems are not typically made aware of structures (like arrays) in user memory. Most operating systems I'm aware of, including Mac OS X, swap out memory on a page-by-page basis.
Although the process is often wrongly called swapping, on x86 as well as on many modern architectures the OS performs paging to what is still called the swap device (mostly for historical reasons). The virtual memory space of each process is divided into pages, and a special table, called the process page table, holds the mapping between pages in virtual memory and frames in physical memory. Each page can be mapped or not mapped. Furthermore, mapped pages can be present or not present. Access to an unmapped page results in a segmentation fault. Access to a non-present page results in a page fault, which is further handled by the OS - it takes the page from the swap device and installs it into a frame in physical memory (if any is available). The standard page size is 4 KiB on x86 and almost any other widespread architecture nowadays. Also, modern MMUs (Memory Management Units, often an integral part of the CPU) support huge pages (e.g. 2 MiB) that can be used to reduce the number of entries in the page tables and thus leave more memory for user processes.
So paging is really fine-grained in comparison with your data structures, and one often has loose or no control whatsoever over how the OS does it. Still, most Unices allow you to give instructions and hints to the memory manager using the C API available in the <sys/mman.h> header file. There are functions that allow you to lock a certain portion of memory and prevent the OS from paging it out to disk. There are functions that allow you to hint to the OS that a certain memory access pattern is to be expected, so that it can optimise the way it moves pages in and out. You may combine these with carefully designed data structures in order to achieve some control over paging and to get the best performance out of a given OS.
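As a rough C sketch of those <sys/mman.h> facilities (the 0.6 GB size mirrors one of the arrays in the question; the particular hints are just examples, and mlock() is limited by privileges/RLIMIT_MEMLOCK):

#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 600UL * 1024 * 1024;      /* roughly one 0.6 GB array */

    double *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (a == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint that the region will be accessed sequentially, so the kernel can
     * read ahead and reclaim pages behind the access point more aggressively. */
    madvise(a, len, MADV_SEQUENTIAL);

    /* Or pin a portion you cannot afford to have paged out (needs sufficient
     * privileges / a large enough RLIMIT_MEMLOCK). */
    if (mlock(a, 4096) != 0)
        perror("mlock");

    munlock(a, 4096);
    munmap(a, len);
    return 0;
}

In Fortran you would reach these calls through a small C wrapper or the ISO_C_BINDING interface.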

How to define a page in terms of a block?

I learnt that when we manage a data structure such as a tree or other graph, its nodes are stored in the computer in something called a block; the nodes of the graph make up the block, and it is the block that is transferred between secondary and primary memory when a data structure gets moved. So I think it's pretty clear what a block is: it can have different sizes depending on the architecture, but is often 4K. Now I want to know how a block relates to memory pages. Do pages consist of blocks, or what is the relation of blocks to pages? Can we define what a page is in memory in terms of a block?
You typically try to define a block so it's either the same size as a memory page, or its size divides the size of a memory page evenly, so an integral number of blocks will fit in a page.
As you mentioned, 4K tends to work well -- typical memory page sizes are 4K and 8K. Most also support at least one larger page size (e.g., 1 megabyte) but you can typically more or less ignore them; they're used primarily for mapping single, large chunks of contiguous memory (e.g., the part of graphics memory that's directly visible to the CPU).
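A tiny sketch of that relationship (BLOCK_SIZE here is just a candidate value): it checks at runtime that a whole number of blocks fits in one memory page:

#include <assert.h>
#include <stdio.h>
#include <unistd.h>

#define BLOCK_SIZE 4096                    /* candidate block size */

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);     /* e.g. 4096 or 8192 */

    assert(page % BLOCK_SIZE == 0);        /* an integral number of blocks per page */
    printf("page=%ld bytes, block=%d bytes, %ld block(s) per page\n",
           page, BLOCK_SIZE, page / BLOCK_SIZE);
    return 0;
}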

Process Page Tables

I'm interested in gaining a greater understanding of the virtual memory and page mechanism, specifically for Windows x86 systems. From what I have gathered from various online resources (including other questions posted on SO),
1) The individual page tables for each process are located within the kernel address space of that same process.
2) There is only a single page table per process, containing the mapping of virtual pages onto physical pages (or frames).
3) The physical address corresponding to a given virtual address is calculated by the memory management unit (MMU) essentially by using the first 20 bits of the provided virtual address as the index of the page table, using that index to retrieve the beginning address of the physical frame and then applying some offset to that address according to the remaining 12 bits of the virtual address.
Are these three statements correct? Or am I misinterpreting the information?
So, first lets clarify some things:
In the case of the x86 architecture, it is not the operating system that determines the paging policy, it is the CPU (more specifically its MMU). How the operating system views the paging system is independent of the way it is implemented. As a commenter rightly pointed out, there is an OS-specific component to paging models. This is subordinate to the hardware's way of doing things.
32 bit and 64 bit x86 processors have different paging schemes so you can't really talk about the x86 paging model without also specifying the word size of the processor.
What follows is a massively condensed version of the 32-bit x86 paging model, using the simplest version of it. There are many additional tweaks that are possible and I know that various OS's make use of them. I'm not going into those because I'm not really familiar with the internals of most OS's and because you really shouldn't go into that until you have a grasp on the simpler stuff. If you want to know all of the wonderful quirks of the x86 paging model, you can go to the Intel docs: Intel System Programming Guide
In the simplest paging model, the memory space is divided into 4KB blocks called pages. A contiguous chunk of 1024 of these is mapped to a page table (which is also 4KB in size). For a further level of indirection, all 1024 page tables are mapped by a 4KB page directory, and the base of this directory sits in a special register, %cr3, in the processor. This two-level structure is in place because most memory spaces in the OS are sparse, which means that most of the space is unused. You don't want to keep a bunch of page tables around for memory that isn't touched.
When you get a memory address, the most significant 10 bits index into the page directory, which gives you the base of the page table. The next 10 bits index into that page table to give you the base of the physical page (also called the physical frame). Finally, the last 12 bits index into the frame. The MMU does all of this for you, assuming you've set %cr3 to the correct value.
64-bit systems have a 4-level paging system because their memory spaces are much more sparse. Also, it is possible to use page sizes that are not 4KB.
To actually get to your questions:
All of this paging information (tables, directories, etc.) sits in kernel memory. Note that kernel memory is one big chunk and there is no concept of having kernel memory for a single process.
There is only one page directory per process. This is because the page directory defines a memory space and each process has exactly one memory space.
The last paragraph above gives you the way an address is chopped up.
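For illustration, here is a tiny C sketch of that 10/10/12 split (pure arithmetic on an arbitrary example address; the real table walk is done by the MMU starting from the physical address in %cr3):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t va = 0xBFAD1234u;             /* arbitrary example virtual address */

    unsigned pde = (va >> 22) & 0x3FF;     /* index into the page directory */
    unsigned pte = (va >> 12) & 0x3FF;     /* index into the page table     */
    unsigned off =  va        & 0xFFF;     /* offset within the 4KB frame   */

    printf("PDE=%u PTE=%u offset=0x%03X\n", pde, pte, off);
    return 0;
}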
Overall that's pretty much correct.
If memory serves, a few details are a bit off though:
The paging for the kernel memory doesn't change per-process, so all the page tables are always visible to the kernel.
In theory, there's also a segment-based translation step. Most practical systems (e.g., *BSD, Linux, Windows, OS/X), however, use segments with their base set to 0 and limit set to the address space limit, so this step ends up as essentially a NOP.
