Process Page Tables

I'm interested in gaining a greater understanding of the virtual memory and page mechanism, specifically for Windows x86 systems. From what I have gathered from various online resources (including other questions posted on SO),
1) The individual page tables for each process are located within the kernel address space of that same process.
2) There is only a single page table per process, containing the mapping of virtual pages onto physical pages (or frames).
3) The physical address corresponding to a given virtual address is calculated by the memory management unit (MMU) essentially by using the first 20 bits of the provided virtual address as the index of the page table, using that index to retrieve the beginning address of the physical frame and then applying some offset to that address according to the remaining 12 bits of the virtual address.
Are these three statements correct? Or am I misinterpreting the information?

So, first let's clarify some things:
In the case of the x86 architecture, it is not the operating system that determines the paging policy, it is the CPU (more specifically, its MMU). How the operating system views the paging system is independent of the way it is implemented. As a commenter rightly pointed out, there is an OS-specific component to paging models, but it is subordinate to the hardware's way of doing things.
32 bit and 64 bit x86 processors have different paging schemes so you can't really talk about the x86 paging model without also specifying the word size of the processor.
What follows is a massively condensed version of the 32-bit x86 paging model, using the simplest version of it. There are many additional tweaks that are possible, and I know that various OSes make use of them. I'm not going into those because I'm not really familiar with the internals of most OSes and because you really shouldn't go into that until you have a grasp of the simpler stuff. If you want to know all of the wonderful quirks of the x86 paging model, you can go to the Intel docs: Intel System Programming Guide
In the simplest paging model, the memory space is divided into 4KB blocks called pages. A contiguous chunk of 1024 of these is mapped by a page table (which is also 4KB in size). For a further level of indirection, all 1024 page tables are mapped by a 4KB page directory, and the base of this directory sits in a special register, %cr3, in the processor. This two-level structure is in place because most memory spaces in the OS are sparse, which means that most of the space is unused. You don't want to keep a bunch of page tables around for memory that isn't touched.
When you get a memory address, the most significant 10 bits index into the page directory, which gives you the base of the page table. The next 10 bits index into that page table to give you the base of the physical page (also called the physical frame). Finally, the last 12 bits index into the frame. The MMU does all of this for you, assuming you've set %cr3 to the correct value.
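As an aside (not part of the original answer), here is a minimal C sketch of that 10/10/12 split. The MMU does this in hardware; this is just the index arithmetic, using an arbitrary example address:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t vaddr = 0x12345678;                 /* an arbitrary example virtual address */

    uint32_t pd_index = (vaddr >> 22) & 0x3FF;   /* top 10 bits: page directory index */
    uint32_t pt_index = (vaddr >> 12) & 0x3FF;   /* next 10 bits: page table index */
    uint32_t offset   =  vaddr        & 0xFFF;   /* last 12 bits: offset into the 4KB frame */

    printf("PD index %u, PT index %u, offset 0x%03x\n", pd_index, pt_index, offset);
    return 0;
}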
64-bit systems have a 4-level paging system because their memory spaces are much more sparse. Also, it is possible to use page sizes that are not 4KB.
To actually get to your questions:
All of this paging information (tables, directories, etc.) sits in kernel memory. Note that kernel memory is one big chunk and there is no concept of having kernel memory for a single process.
There is only one page directory per process. This is because the page directory defines a memory space and each process has exactly one memory space.
The paragraph above on the two-level lookup shows how an address is chopped up.

Overall that's pretty much correct.
If memory serves, a few details are a bit off though:
The paging for the kernel memory doesn't change per-process, so all the page tables are always visible to the kernel.
In theory, there's also a segment-based translation step. Most practical systems (e.g., *BSD, Linux, Windows, OS X), however, use segments with their base set to 0 and their limit set to the address space limit, so this step ends up as essentially a NOP.

Related

What's the mechanism behind P2V, V2P macro in Xv6

I know there is a mapping mechanism by which a virtual address is translated into a physical one.
A linear address contains three parts:
Page Directory index
Page Table index
Offset
Now, when I take a look at the source code of Xv6 in memorylayout.h
#define V2P(a) (((uint) (a)) - KERNBASE)
#define P2V(a) (((void *) (a)) + KERNBASE)
#define V2P_WO(x) ((x) - KERNBASE) // same as V2P, but without casts
#define P2V_WO(x) ((x) + KERNBASE) // same as P2V, but without casts
How can the V2P or P2V work correctly without doing the process of the address translation?
The V2P and P2V macros don't do more than you think they do. They just subtract and add the KERNBASE constant, which is defined as 2 GB.
It seems you understand the MMU's hardware mapping mechanism correctly.
The mapping rules are stored in per-process page tables. Those tables form the process's virtual address space.
Specifically, in XV6 the process's virtual address space is constructed (through the appropriate mappings) as shown in the following diagram: [diagram: xv6 virtual address space layout]
As the diagram above shows, XV6 specifically builds the process's virtual address space so that virtual addresses from 2GB to 4GB map to physical addresses 0 to PHYSTOP (respectively). As explained in the XV6 official commentary:
Xv6 includes all mappings needed for the kernel to run in every process’s page table;
these mappings all appear above KERNBASE. It maps virtual addresses KERNBASE:KERNBASE+PHYSTOP
to 0:PHYSTOP.
The motivation for this decision is also specified:
One reason for this mapping is so that the
kernel can use its own instructions and data. Another reason is that the kernel sometimes
needs to be able to write a given page of physical memory, for example when
creating page table pages; having every physical page appear at a predictable virtual
address makes this convenient.
In other words, because the kernel made all page tables map 2GB virtual to 0 physical (and up to PHYSTOP), we can easily find the physical address of a virtual address that is above 2GB.
For example, the physical address of the virtual address 0x80001000 is easily found: it is 0x00001000; we just subtract 2GB, because that's how we mapped it.
This "workaround" can help the kernel do the conversion easily, for example, for building and manipulating page tables (where physical address needs to be calculated and written to memory).
Of course, the above "workaround" is not free: it wastes valuable virtual address space (2GB of it), because now every physical address already has at least one virtual address, even if we are never going to use it. That leaves the process itself only the remaining 2GB of virtual address space (out of 4GB in total, because we use 32 bits to count addresses). This is also explained in the XV6 official commentary:
A defect of this arrangement is that xv6 cannot make
use of more than 2 GB of physical memory
I recommend reading more about this in the XV6 commentary, under the "Process address space" header: XV6 Commentary

Does a compiler have to consider the kernel memory space when laying out memory?

I'm trying to reconcile a few concepts.
I know that virtual memory is shared (mapped) between the kernel and all user processes, which I read here. I also know that when the compiler generates addresses for code + data, the kernel must load them at the correct virtual addresses for that process.
To constrain the scope of the question, I'll just mean gcc when I mention 'the compiler'.
So does the compiler need to be made compliant with each new release of an OS, so that it knows not to place code or data at the high memory addresses reserved for the kernel? As in, must someone writing that piece of the compiler know those details of how the kernel plans to load the program (lest the compiler put executable code in high memory)?
Or am I confusing different concepts? I got a bit confused when going through this tutorial, especially at the very bottom where it has OS code in low memory addresses, because I thought Linux uses high memory for the kernel.
The compiler doesn't determine the address ranges in memory at which things are placed. That's handled by the OS.
When the program is first executed, the loader places the various portions of the program and its libraries in memory. For memory that's allocated dynamically, large chunks are allocated from the OS and then sometimes divided into smaller chunks.
The OS loader knows where to load things, and the OS's virtual memory allocation logic knows how to find safe, empty spaces in the address space the process uses.
I'm not sure what you mean by the "high memory addresses reserved for the kernel". If you're talking about a 2G/2G or 3G/1G split on a 32-bit operating system, that is a fundamental design element of those OSes that use it. It doesn't change with versions.
If you're talking about high physical memory, then no. Compilers don't care about physical memory.
Linux gives each application its own memory space, distinct from the kernel. The page table contains the translations between this memory space and physical RAM, and the kernel sets up the page table so there's no interference.
That said, the compiler usually doesn't even care where the program is loaded in memory. Why would it?
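To make that concrete, here is a small illustrative program (a sketch of mine, not from the answer). The addresses it prints are virtual and are chosen by the loader and the OS memory manager at run time; on a system with ASLR they typically differ from run to run even though the compiler output never changed:
#include <stdio.h>
#include <stdlib.h>

int global_data = 1;

int main(void)
{
    int stack_var = 2;
    void *heap_ptr = malloc(16);

    /* Casting a function pointer to void * is a common GCC/POSIX extension. */
    printf("code  (&main):        %p\n", (void *)&main);
    printf("data  (&global_data): %p\n", (void *)&global_data);
    printf("stack (&stack_var):   %p\n", (void *)&stack_var);
    printf("heap  (heap_ptr):     %p\n", heap_ptr);

    free(heap_ptr);
    return 0;
}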

Can OS generate same logical Address for two different processes?

As far as I know, the CPU generates a logical address for each instruction at run time.
This logical address corresponds to the linear or virtual address of the instruction.
Now my questions are ,
1) Can the OS generate the same logical address for two different processes?
With reference to "In virtual memory, can two different processes have the same address?": if two different processes can have the same virtual address, then it is also quite possible that their logical addresses are the same.
2) Just to clarify my understanding: whenever we write a complex C program or a simple "hello world" program, is the virtual address generated at build time (compile -> assemble -> link), while the logical address is generated by the CPU at run time?
Please clarify my doubts above and also correct me if I am on the wrong track.
The logical address and the virtual address are the same thing. The CPU translates from logical/virtual addresses to physical addresses during execution.
As such, yes, it's not just possible but quite common for two processes to use the same virtual addresses. Under a 32-bit OS this happens quite routinely, simply because the address space is fairly constrained, and there's often more physical memory than address space. But to give one well-known example, the traditional load address for Windows executables is 0x400000 (I might have the wrong number of zeros on the end, but you get the idea). That means essentially every process running on Windows would typically be loaded at that same logical/virtual address.
More recently, Windows (like most other OSes) has started to randomize the layout of executable modules in memory. Since most of a 32-bit address space is often in use, this changes the relative placement of the modules (their order in memory) but means many of the same locations are used in different processes (just for different modules in each).
A 64-bit OS has a much larger address space available, so when it's placing modules at random addresses it has many more choices available. That larger number of choices means there's a much smaller chance of the same address happening to be used in more than one process. It's probably still possible, but certainly a lot less likely.
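Stepping outside the answer for a moment, a short POSIX sketch (mine) shows the basic point directly: after fork(), parent and child report the same virtual address for the same variable, even though once the child writes to it the two processes are backed by different physical pages (copy-on-write):
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int value = 42;

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {                 /* child */
        value = 99;                 /* write triggers copy-on-write: new physical page, same virtual address */
        printf("child:  &value = %p, value = %d\n", (void *)&value, value);
    } else {                        /* parent */
        wait(NULL);
        printf("parent: &value = %p, value = %d\n", (void *)&value, value);
    }
    return 0;
}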

How can a process with a 32-bit address space access large amounts of memory efficiently? [duplicate]

We have a process with a 32-bit address space that needs to access more memory than it can address directly. Most of the source code for this process cannot be changed. We can change the modules that are used to manage access to the data. The interfaces to those modules may include 64-bit pieces of data that identify the memory to be accessed.
We currently have an implementation in which the interface modules use interprocess communication with a 64-bit process to transfer data to and from the address space of that 64-bit process.
Is there a better way?
Very few platforms support mixing 32-bit and 64-bit code. If you need more than 2 or 3 GB of address space, your options are:
Recompile the whole application as 64-bit, or
Use memory-mapped files to page in and out large chunks of data.
Recompiling is easy. Accessing more than 2 or 3 GB of memory in a 32-bit program is hard.
Note that recompiling a 32-bit application as a 64-bit application requires no changes to your code or functionality, barring a few bugs that might turn up if your code has unportable constructs. Things like:
size_t round_to_16(size_t x)
{
    return x & ~15u; // BUG in 64-bit code: ~15u is a 32-bit mask that clears the upper 32 bits; should be ~(size_t) 15
}
As stated in various comments, the situation is:
There is a 32-bit process of which a small portion can be altered. The rest is essentially pre-compiled and cannot be changed.
The small portion currently communicates with a 64-bit process to transfer selected data between a 64-bit address space and the address space of the 32-bit process.
You seek alternatives, presumably with higher performance.
Interprocess communication is generally fast. There may not be a faster method unless:
your system has specialized hardware for accelerating memory transfers, or
your system has means of remapping memory (more below).
Unix has calls such as shmat and mmap that allow processes to attach to shared memory segments and to map portions of their address spaces to offsets within shared memory segments. It is possible that calls such as these can support mapping portions of a 32-bit address space into large shared memory segments that exist in a large physical address space.
For example, the call mmap takes a void * parameter for the address to map in the process’ address space and an off_t parameter for the offset into a shared memory segment. Conceivably, the off_t type may be a 64-bit type even though the void * is only a 32-bit pointer. I have not investigated this.
Remapping memory is conceivably faster than transferring memory with copy operations, since it can involve simply changing the virtual address map instead of physically moving data.
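As a hedged sketch of the mmap idea above (the file name and sizes are invented for illustration): a 32-bit process can map a window of a much larger file into its address space using a 64-bit file offset, provided off_t is 64 bits (e.g., by defining _FILE_OFFSET_BITS=64 on Linux):
#define _FILE_OFFSET_BITS 64        /* make off_t 64-bit even in 32-bit code (Linux) */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigdata.bin", O_RDWR);            /* "bigdata.bin" is a placeholder name */
    if (fd < 0) { perror("open"); return 1; }

    size_t window = 256u * 1024 * 1024;              /* a 256 MB window fits easily in a 32-bit address space */
    off_t  offset = (off_t)5 * 1024 * 1024 * 1024;   /* start 5 GB into the file: needs the 64-bit offset */

    void *p = mmap(NULL, window, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* ... work on the data through p, then move the window with munmap + another mmap ... */

    munmap(p, window);
    close(fd);
    return 0;
}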

Virtual memory management in Fortran under Mac OS X

I'm writing a Fortran 90 program (compiled using gfortran) to run under Mac OS X. I have 13 data arrays, each comprising about 0.6 GB of data. My machine is maxed out at 8 GB of real memory, and if I try to hold all 13 arrays in memory at once, I'm basically trying to use all 8 GB, which I know isn't possible in view of other system demands. So I know that the arrays would be subject to swapping. What I DON'T know is how this is managed by the operating system. In particular,
Does the OS swap out entire data structures (e.g., arrays) when it needs to make room for other data structures, or does it rather do it on a page-by-page basis? That is, does it swap out partial arrays, based on which portions of the array have been least-recently accessed?
The answer may determine how I organize the arrays. If partial arrays can get swapped out, then I could store everything in one giant array (with indexing to select which of the 13 subarrays I need) and trust the OS to manage everything efficiently. Otherwise, I might preserve separate and distinct arrays, each one individually fitting comfortably within the available physical memory.
Operating systems are not typically made aware of structures (like arrays) in user memory. Most operating systems I'm aware of, including Mac OS X, swap out memory on a page-by-page basis.
Although the process is often wrongly called swapping, on x86 as well as on many modern architectures the OS performs paging to what is still called the swap device (mostly for historical reasons). The virtual memory space of each process is divided into pages, and a special table, called the process page table, holds the mapping between pages in virtual memory and frames in physical memory. Each page can be mapped or not mapped. Furthermore, mapped pages can be present or not present. Access to an unmapped page results in a segmentation fault. Access to a non-present page results in a page fault, which is then handled by the OS - it takes the page from the swap device and installs it into a frame in physical memory (if any is available). The standard page size is 4 KiB on x86 and almost any other widespread architecture nowadays. Also, modern MMUs (Memory Management Units, often an integral part of the CPU) support huge pages (e.g. 2 MiB) that can be used to reduce the number of entries in the page tables and thus leave more memory for user processes.
So paging is really fine-grained in comparison with your data structures, and one often has loose or no control whatsoever over how the OS does it. Still, most Unices allow you to give instructions and hints to the memory manager through the C API available in the <sys/mman.h> header file. There are functions that allow you to lock a certain portion of memory and prevent the OS from paging it out to disk. There are functions that allow you to hint to the OS that a certain memory access pattern is to be expected, so that it can optimise the way it moves pages in and out. You may combine these with carefully designed data structures in order to achieve some control over paging and to get the best performance out of a given OS.
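For concreteness, here is a minimal sketch of the two kinds of calls mentioned, assuming a POSIX system such as Linux or Mac OS X: mlock() to pin a region in RAM, and madvise() to hint at the access pattern. The sizes are arbitrary and error handling is reduced to perror():
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64u * 1024 * 1024;   /* one 64 MB array, chosen arbitrarily */

    /* Allocate page-aligned anonymous memory so the advice calls apply cleanly. */
    double *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
    if (a == MAP_FAILED) { perror("mmap"); return 1; }

    /* Pin the region in RAM so the OS will not page it out (may fail due to RLIMIT_MEMLOCK). */
    if (mlock(a, len) != 0)
        perror("mlock");

    /* Hint that we will walk the region sequentially, so the OS can read ahead
       and drop pages behind us. */
    if (madvise(a, len, MADV_SEQUENTIAL) != 0)
        perror("madvise");

    /* ... compute on a ... */

    munlock(a, len);
    munmap(a, len);
    return 0;
}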
