Is a virtual address process-specific?

I've been studying memory management related topics. I'm wondering whether I've understood it correctly:
a pointer (virtual) address is process-specific
different processes can have pointers with the same addresses, but these pointers get translated to different physical addresses
Am I correct about these statements? If yes, do they apply to the x86, x86-64, ARMv7 and ARMv8 architectures?

Well, except for:
different processes can have pointers with the same addresses, but these pointers get translated to different physical addresses
While this is the general case, different processes can of course share mapped pages (look into shared memory), and then the pointers could point to the same data, provided the pages are mapped at the same locations in the virtual address space.
But yes, that's the correct understanding.
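To make the shared-memory exception concrete, here is a minimal sketch, assuming a POSIX system such as Linux (the mmap flags and variable names are my own choice, not from the question): after fork(), parent and child see the same MAP_SHARED mapping at the same virtual address, and it refers to the same physical pages, so a write in one process is visible in the other.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* one int's worth of anonymous shared memory (assumed POSIX/Linux flags) */
    int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) { perror("mmap"); return 1; }

    if (fork() == 0) {                   /* child */
        *shared = 42;                    /* write through the shared page */
        printf("child : %p\n", (void *)shared);
        _exit(0);
    }
    wait(NULL);                          /* parent waits, then reads */
    printf("parent: %p -> %d\n", (void *)shared, *shared);  /* same address, sees 42 */
    return 0;
}

Both lines print the same virtual address, and the 42 is visible in the parent because both mappings are backed by the same physical page.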

Related

Where is the address of a variable stored in memory?

Whenever we need to find the address of a variable, we use the syntax below in C and it prints an address for the variable. What I am trying to understand is whether the returned address is an actual physical memory location or just some random number the compiler produces. Either way, where does that number come from, and where does it have to be stored in memory? Does the address of a memory location actually take up space in memory?
int a = 10;
printf("ADDRESS: %p\n", (void *)&a);  /* %p is the correct conversion for printing a pointer */
ADDRESS: 2234xxxxxxxx
This location is from the virtual address space allocated to your program. In other words, it comes from virtual memory, which your OS maps to physical memory as and when needed.
It depends on what type of system you've got.
Low-end systems such as microcontroller applications often only support physical addresses.
Mid-range CPUs often come with an MMU (memory management unit), which allows so-called virtual memory to be placed on top of the physical memory. This means a certain piece of code could appear to run from address 0 to x, though in reality those virtual addresses are just aliases for physical ones.
High-end systems like PCs typically only allow virtual memory access and deny applications direct access to physical memory. They often also use address space layout randomization (ASLR) to produce random address layouts for certain kinds of memory, in order to defeat exploits that rely on hard-coded addresses.
In either case, the actual address itself does not take up space in memory.
Higher abstraction layer concepts such as file systems may, however, store addresses in look-up tables and the like, and those do take up memory.
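As a small illustration of the points above, assuming a typical desktop OS with ASLR enabled (the variable names are made up for this example), the following program prints the virtual addresses of a global, a stack variable and a heap block. The values usually change from run to run, and none of them are physical addresses.

#include <stdio.h>
#include <stdlib.h>

static int global;               /* lives in the data segment */

int main(void)
{
    int local;                   /* lives on the stack */
    void *heap = malloc(16);     /* lives on the heap  */

    printf("global: %p\n", (void *)&global);
    printf("stack : %p\n", (void *)&local);
    printf("heap  : %p\n", heap);

    free(heap);
    return 0;
}

Running it twice on a system with ASLR typically shows different addresses each time, which is exactly why the printed number cannot be a fixed physical location.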
… is the returned address an actual physical memory location or just some random number the compiler produces
In general-purpose operating systems, the addresses in your C program are virtual memory addresses.[1]
Either way, where does that number come from, and where does it have to be stored in memory?
The software that loads your program into memory makes the final decisions about what addresses are used[2], and it may inform your program about those addresses in various ways, including:
It may put the start addresses of certain parts of the program in designated processor registers. For example, the start address of the read-only data of your program might be put in R17, and then your program would use R17 as a base address for accessing that data.
It may “fix up” addresses built into your program’s instructions and data. The program’s executable file may contain information about places in your program’s instructions or data that need to be updated when the virtual addresses are decided. After the instructions and data are loaded into memory, the loader will use the information in the file to find those places and update them.
With position-independent code, the program counter itself (a register in the processor that contains the address of the instruction the processor is currently executing or about to execute) provides address information.
So, when your program wants to evaluate &x, it may take the offset of x from the start of the section it is in (an offset built into the program by the compiler and possibly updated by the linker) and add it to the base address of that section. The resulting sum is the address of x.
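A hedged illustration of this base-plus-offset idea (the variable names are invented, and subtracting pointers to distinct objects is not defined by the C standard, so this is only a demonstration): the distance between two objects placed in the same section is fixed at build time, while their absolute virtual addresses may change between runs when the load address is randomized.

#include <stdio.h>

static int first;    /* both objects typically end up in the same data section */
static int second;

int main(void)
{
    printf("&first  = %p\n", (void *)&first);
    printf("&second = %p\n", (void *)&second);
    /* the offset between them is decided by the compiler/linker and stays
       the same across runs, even when the absolute addresses change */
    printf("offset  = %td bytes\n", (char *)&second - (char *)&first);
    return 0;
}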
Does the address of a memory location actually take up space in memory?
The C standard does not require the program to use any memory for the address of x, &x. The result of &x is a value, like the result of 3*x. The only thing the compiler has to do with a value is ensure it gets used for whatever further expression it is used in. It is not required to store it in memory. However, if the program is dealing with many values in a piece of code, so there are not enough processor registers to hold them all, the compiler may choose to store values in memory temporarily.
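A tiny, purely illustrative example of that point: &x is produced as a value and consumed immediately; nothing in the source requires it to be stored in an object of its own (though, as noted, a compiler may spill it to memory if it runs out of registers).

#include <stdio.h>

static void show(const int *p)
{
    printf("%d at %p\n", *p, (const void *)p);
}

int main(void)
{
    int x = 10;
    show(&x);    /* &x is computed and passed along; no variable holds it */
    return 0;
}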
Footnotes
[1] Virtual memory is a conceptual or "imaginary" address space. Your program can execute with virtual addresses because the hardware automatically translates virtual addresses to physical addresses while it is executing the program. The operating system creates a map that tells the hardware how to translate virtual addresses to physical addresses. (The map may also tell the hardware certain virtual memory is not actually in physical memory at the moment. In this case, the hardware interrupts the program and starts an operating system routine which deals with the issue. That routine arranges for the needed data to be loaded into memory and then updates the virtual memory map to indicate that.)
[2] There is usually a general scheme for how parts of the program are laid out in memory, such as starting the instructions in one area and setting up space for stack in another area. In modern systems, some randomness is intentionally added to the addresses to foil malicious people trying to take advantage of bugs in programs.

Virtual/Logical Memory and Program relocation

Virtual memory along with logical memory helps to make sure programs do not corrupt each other's data.
Program relocation does an almost similar thing, making sure that multiple programs do not corrupt each other. Relocation modifies an object program so that it can be loaded at a new, alternate address.
How are virtual memory, logical memory and program relocation related? Are they similar?
If they are the same or similar, then why do we need program relocation?
Relocatable programs, or put another way position-independent code, are traditionally used in two circumstances:
systems without virtual memory (or with only very basic virtual memory, e.g. classic MacOS), for any code
dynamic libraries, even on systems with virtual memory, given that a dynamic library could find itself loaded at an address that is not its preferred one if other code already occupies that spot in the address space of the host program.
However, today even main executable programs on systems with virtual memory tend to be position-independent (e.g. the PIE* build flag on Mac OS X) so that they can be loaded at a randomized address to protect against exploits, e.g. those using ROP**.
* Position Independent Executable
** Return-Oriented Programming
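For what it's worth, on GCC/Clang toolchains position independence is usually requested with -fPIC for shared libraries and -fPIE/-pie for executables (many current distributions build executables as PIE by default). A small sketch of the observable effect, assuming such a build: the addresses of the program's own code and of library code both change from run to run.

#include <stdio.h>

int main(void)
{
    printf("main   at %p\n", (void *)main);     /* code of the executable  */
    printf("printf at %p\n", (void *)printf);   /* code in the shared libc */
    return 0;
}

On a non-PIE build only the library address would typically move between runs; with a PIE both do.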
Virtual memory does not prevent programs from interfering with each other. It is logical memory that does so. Unfortunately, it is common for the two concepts to be conflated under "virtual memory."
There are two types of relocation and it is not clear which you are referring to. However, they are connected. On the other hand, the concept is not really related to virtual memory.
The first is the concept of relocatable code. This is critical for shared libraries, which usually have to be mapped at different addresses.
Relocatable code uses offsets rather than absolute addresses. When a program contains an instruction sequence something like:
JMP SOMELABEL
. . .
SOMELABEL:
The compiler or assembler encodes this as
JUMP the-number-of-bytes-to-SOMELABEL
rather than
JUMP to-the-address-of-SOMELABEL.
By using offsets the code works the same way no matter where the JMP instruction is located.
The second type of relocation uses the first. In the past, relocation was mostly used for libraries. Now, some OSes will load program segments at different places in memory. That is intended for security: it is designed to defeat malicious exploits that depend upon the application being loaded at a specific address.
Both of these concepts work with or without virtual memory.
Note that generally the program is not modified to relocate it. I say generally, because an executable file will usually have some addresses that need to be fixed up at run time.
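To make that run-time fix-up idea concrete, here is a toy sketch, not modeled on any real object-file format (all names and values are invented): the file records which slots in the image hold absolute addresses, and the loader adds the actual load base to each of them after the image is placed in memory.

#include <stdint.h>
#include <stdio.h>

/* image as built: slot 0 holds an address computed as if the load base were 0 */
static uint64_t image[4] = { 0x10, 0, 0, 0 };
static const size_t fixups[] = { 0 };       /* which slots need patching */

static void apply_fixups(uint64_t *img, const size_t *where, size_t n,
                         uint64_t load_base)
{
    for (size_t i = 0; i < n; i++)
        img[where[i]] += load_base;          /* rebase each recorded address */
}

int main(void)
{
    uint64_t base = 0x400000;                /* wherever the image was loaded */
    apply_fixups(image, fixups, 1, base);
    printf("patched address: 0x%llx\n", (unsigned long long)image[0]);
    return 0;
}

Position-independent code avoids most of this work because, as described above, it encodes offsets instead of absolute addresses.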

Can the OS generate the same logical address for two different processes?

As far as I know, the CPU generates a logical address for each instruction at run time.
Now this logical address will point to the linear or virtual address of the instruction.
Now my questions are:
1) Can the OS generate the same logical address for two different processes?
With reference to "In virtual memory, can two different processes have the same address?", if two different processes can have the same virtual address, then it seems quite possible that logical addresses can also be the same.
2) Just to clarify my understanding: whenever we write complex C code or a simple "hello world" program, is the virtual address generated at build time (compile -> assemble -> link), while the logical address is generated by the CPU at run time?
Please clarify my doubts above and also correct me if I am on the wrong track.
The logical address and the virtual address are the same thing. The CPU translates from logical/virtual addresses to physical addresses during execution.
As such, yes, it's not just possible but quite common for two processes to use the same virtual addresses. Under a 32-bit OS this happens quite routinely, simply because the address space is fairly constrained, and there's often more physical memory than address space. But to give one well-known example, the traditional load address for Windows executables is 0x400000 (I might have the wrong number of zeros on the end, but you get the idea). That means essentially every process running on Windows would typically be loaded at that same logical/virtual address.
More recently, Windows (like most other OSes) has started to randomize the layout of executable modules in memory. Since most of a 32-bit address space is often in use, this changes the relative placement of the modules (their order in memory) but means many of the same locations are used in different processes (just for different modules in each).
A 64-bit OS has a much larger address space available, so when it's placing modules at random addresses it has many more choices available. That larger number of choices means there's a much smaller chance of the same address happening to be used in more than one process. It's probably still possible, but certainly a lot less likely.
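A small sketch of that answer, assuming a POSIX system: after fork(), the variable x has the same virtual address in parent and child, yet a write in the child is not seen by the parent, because copy-on-write gives the two processes distinct physical pages behind that one virtual address.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int x = 1;
    if (fork() == 0) {                                /* child */
        x = 2;                                        /* triggers copy-on-write */
        printf("child : &x=%p x=%d\n", (void *)&x, x);
        _exit(0);
    }
    wait(NULL);                                       /* parent */
    printf("parent: &x=%p x=%d\n", (void *)&x, x);    /* same address, x is still 1 */
    return 0;
}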

The architecture of on-disk data structures

The closest I've come to finally understanding the architecture of on-disk B-trees is this.
It's simple and very easy to read and understand. But I still feel confused. It doesn't seem like there is any in-memory data structure at all. Am I missing something? What makes this a B-tree? Is it just the array of longs that "point" to the keys of their child nodes? Is that efficient? Is that just how most databases and filesystems are designed?
Are there ways of implementing on-disk B-trees (or other data structures) in memory, where each node contains a file offset or something?
Node pointers are typically stored on disk as addresses (for example using long integers).
In general an implementation chooses to use either physical or logical addresses:
Physical addresses specify the actual offset (within a file or similar) where the node is stored.
In contrast, logical addresses require some kind of mechanism that resolves to a physical address each time a pointer is navigated/traversed.
Physical addressing is faster (because no resolution mechanism is needed). However, logical addressing allows nodes to be reorganized without having to rewrite pointers. The ability to reorganize nodes in this way can be used as the basis for implementing good clustering, space utilization and even low-level data distribution.
Some implementations use a combination of logical and physical addressing, such that each address is composed of a logical address that refers (dynamically) to a segment (blob) and a physical address within that segment.
It is important to note that node addresses are disk based, therefore they cannot be directly translated to in-memory pointers.
In some cases it is beneficial to convert disk-based pointers to memory pointers when data is loaded into memory (and then convert back to disk-based pointers when writing).
This conversion is sometimes called pointer swizzling, and it can be implemented in many ways. The fundamental idea is that the data addressed by a swizzled in-memory pointer need not be loaded into memory before the pointer is navigated/traversed.
The common approaches to this are either to use a logical in-memory addressing scheme or to use memory-mapped files. Memory-mapped files use virtual memory addressing, in which memory pages are not loaded into memory before they are accessed. Memory-mapped files are provided by the OS. This approach is sometimes called page-faulted addressing, because accessing data on a memory page that is not yet mapped into memory causes a page fault interrupt.
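As a rough sketch of "pointers stored as file offsets" (the struct layout, field names and node size are made up for illustration, not taken from any particular database or filesystem), an on-disk B-tree node might carry child offsets that are resolved by seeking in the file:

#include <stdint.h>
#include <stdio.h>

#define ORDER 8

struct node {
    uint16_t nkeys;                /* number of keys currently in use        */
    uint16_t is_leaf;
    int64_t  keys[ORDER - 1];
    int64_t  children[ORDER];      /* "pointers": byte offsets into the file */
};

/* following a child pointer = seek to that offset and read the node */
static int read_node(FILE *f, int64_t offset, struct node *out)
{
    if (fseek(f, (long)offset, SEEK_SET) != 0)   /* fseeko is safer for huge files */
        return -1;
    return fread(out, sizeof *out, 1, f) == 1 ? 0 : -1;
}

int main(void)
{
    struct node root;
    FILE *f = fopen("tree.db", "rb");            /* hypothetical B-tree file */
    if (f && read_node(f, 0, &root) == 0)
        printf("root has %u keys\n", (unsigned)root.nkeys);
    if (f) fclose(f);
    return 0;
}

Swizzling, in this picture, would mean replacing the int64_t offsets with real in-memory struct node * pointers once a node has been brought into RAM.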

Dereferencing a pointer at lower level in C

When malloc returns a pointer (a virtual address of a block of data),
char *p = malloc (10);
p has a virtual address (say x), and p holds a virtual address of a block of 10 addresses.
Say these virtual addresses are from y to y+10.
These 10 addresses belong to a page , and the virtual --> physical mapping is placed in the page table.
When the processor dereferences the pointer p, say printf("%c", *p);, how does the processor know that it has to access the address y?
Is the page table accessed twice in order to dereference a pointer, in other words to print what p points to? How exactly is it done, can anybody explain?
Also, for accessing stack variables, does the processor have to go through the page table?
Isn't the stack pointer register (SP) already pointing to the stack?
I think there's a muddling of different layers.
First, page tables: this is a data structure that uses some memory to provide pointers to more memory. Given a particular virtual address, it can be deconstructed into indices into the tables. Right now, this is happening under the covers in the kernel, but it's possible to implement the same idea in user space.
Now, the next step is processes. Each process gets its own view of memory and hence has its own set of page tables. How does the processor know where these different page tables reside? In a special control register, called cr3 on x86. Changing processes is sometimes called a context switch, and rightly so, because setting cr3 changes the process's view of virtual memory.
But the next question is, how does the processor even understand the concept of virtual memory? Well, in some older architectures (MIPS comes to mind), the hardware keeps a cache of recently used translations and relies on the OS for guidance on how to handle virtual memory accesses. On x86, not only the cache (more commonly called a translation lookaside buffer, or TLB) but also the page table walk is implemented in hardware. The processor stores these translations so it can handle the lookups automatically; if there's a cache miss, it will actually traverse the page table structure set up by the OS to look up what it should reference.
Of course, this means there must be at least two different modes for the processor: one that treats addresses as direct physical addresses and one that traverses the page tables. The first mode, real mode, is there at boot and is only around long enough to set up the tables before the bootloader turns on virtual addressing and jumps to the beginning of the rest of the code.
The short answer to my long explanation is that in all likelihood the page tables aren't accessed at all, because the processor already has the address translations cached in its TLB.
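To show what that hardware walk actually indexes, here is a sketch of the address-splitting arithmetic for a typical x86-64 4-level, 4 KiB-page layout (the example address is arbitrary; the real tables live in the kernel and are walked by the MMU/TLB as described above):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t va = 0x00007f1234567abcULL;    /* an arbitrary example address */

    unsigned pml4 = (va >> 39) & 0x1ff;     /* level-4 table index (9 bits) */
    unsigned pdpt = (va >> 30) & 0x1ff;     /* level-3 table index          */
    unsigned pd   = (va >> 21) & 0x1ff;     /* level-2 table index          */
    unsigned pt   = (va >> 12) & 0x1ff;     /* level-1 table index          */
    unsigned off  =  va        & 0xfff;     /* offset within the 4 KiB page */

    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n",
           pml4, pdpt, pd, pt, off);
    return 0;
}

Each dereference of *p conceptually goes through this split, but as the answer says, a TLB hit means the walk itself is skipped.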
And p holds a virtual address of a block of 10 addresses.
You're confused. p is a pointer holding the address of a 10-byte block; how these bytes are interpreted is up to the application.
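A small illustration of that answer (the string and values are arbitrary): malloc hands back 10 raw bytes, and the same bytes can be read as text or as numbers, depending entirely on the code that uses them.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *p = malloc(10);
    if (!p) return 1;

    strcpy(p, "hi");                        /* treat the bytes as a string     */
    printf("as text  : %s\n", p);

    printf("as number: %d\n", p[0]);        /* the same first byte, as a value */

    free(p);
    return 0;
}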

Resources