Revisiting ELF memory mapping with an understanding of virtual memory - C

With reference to the following questions:
1 How are the different segments like heap, stack, and text related to physical memory?
2 Memory allocation in data/bss/heap and stack
3 How does a program run in memory, and how is memory handled by the operating system?
4 How to convert an address in an ELF file to a physical address
Especially regarding question 1, what exactly is being asked still needs comment from experts.
At the same time, different answers/comments pointing in opposite directions make this more confusing. What we know is that the ELF is loaded into the virtual address space, and when the actual physical page is needed, the process is provided with that page by the MMU. I have serious doubts about this. If mappings are only made in the virtual address space, then how is execution started? One possibility is that segments from the ELF binary are loaded into the virtual address space, and an empty page table is also created, which is populated on page faults on each memory access. And in this population there is no role for the ELF binary stored on disk. Can I delete my ELF stored on disk without affecting execution?
If the above understanding is right, the confusion lies in our lack of knowledge about virtual memory. VM is considered a deception, but it seems it is more than that. Let's say int x = 12: when this line is loaded into virtual space, it means there is record keeping for the value of x (12) in VM. When some instruction mov x, register is executed for the first time, is a page created in RAM only then, to give actual space for this instruction to run?
Judging by my reputation score and that of others asking similar questions, even if this is a basic question, this concept is creating a lot of confusion, for me at least.

If the above understanding is right, the confusion lies in our lack of knowledge about virtual memory.
The understanding you state is not right, but your confusion certainly does seem to be largely about virtual memory. The point of a virtual memory system is to allow each process to access the whole address space (more or less) as if it were the only process running, and without regard to how much physical RAM is available. "Virtual" does not mean "fake" or any such thing -- with one or two caveats, if a program has virtual memory assigned to it, then it has bona fide storage. The "virtual" part has to do with the real location at any given time of the storage associated with any particular virtual address.
You also seem to be confusing yourself by associating ELF too closely with virtual memory. Certainly the OS will normally load an ELF executable into virtual memory (backed by some combination of physical RAM and disk that may vary over time), but it would do the same with a program in any other executable format it supports, and there are others.
I have serious doubts about this. If mappings are only made in the virtual address space, then how is execution started?
The operating system kernel is the exception -- it runs in real memory. Managing virtual memory for all running processes is one of its jobs, and the page table for each process belongs to it, not to the process. The kernel handles setting up and starting processes, including their page tables, and at all times it knows the actual location of all storage assigned to each one, and to what virtual addresses that storage is mapped.
And in this population there is no role for the ELF binary stored on disk. Can I delete my ELF stored on disk without affecting execution?
The ELF binary stored on disk contains executable code and data for the program that will be loaded into the process's (virtual) memory when the program is launched. This is the job of the program loader and, for dynamically linked programs, the ELF dynamic linker (/lib/ld-linux.so.2, for example). Certainly the ELF file is needed to launch the program. After the process has started, then sure, the on-disk ELF binary can be unlinked (via the rm command, for example) without affecting the execution of the program, since it has been loaded into memory (on Unix-like systems; Windows is different, but does not support ELF). That binary will then be unavailable to launch any new instance of the program.
Note, however, that unlinking is not necessarily the same thing as deleting. As long as any process has the unlinked file open, it will remain allocated on the file system. This may well be the case for a running ELF program, as one good loading strategy is to map parts of the binary file into memory (the .text and .rodata sections, for example). This will hold the file open as long as any process with such mappings is alive, even if the binary is unlinked from the file system.

Related

Why do we need address virtualization in an operating system?

I am currently taking a course in Operating Systems and I came across address virtualization. I will give a brief account of what I know and follow that with my question.
Basically, the CPU (in modern microprocessors) generates virtual addresses, and then an MMU (memory management unit) takes care of translating those virtual addresses to their corresponding physical addresses in RAM. The example given by the professor of why there is a need for virtualization: You compile a C program. You run it. And then you compile another C program. You try to run it but the resident running program in memory prevents loading a newer program even when space is available.
From my understanding, with no virtualization, if the compiler generates two physical addresses that are the same, the second program won't run because the system thinks there isn't enough space for it. When we virtualize this (that is, the CPU generates only virtual addresses), the MMU will deal with this "collision" and find a spot for the other program in RAM. (Our professor gave the example of the MMU being a mapping table that takes a virtual address and maps it to a physical address.) I thought that idea was very similar to resolving collisions in a hash table.
Could I please get some input on my understanding? Any further clarification is appreciated.
Could I please get some input on my understanding? Any further clarification is appreciated.
Your understanding is roughly correct.
Clarifications:
The data structures are nothing like a hash table.
If anything, the data structures are closer to a B-tree, but there are important differences with that as well. It is really closest to a (Java-style) N-dimensional array which has been sparsely allocated.
It is mapping pages rather than complete virtual/physical addresses. (A complete address is a page address + an offset within the page.)
There is no issue with collision. At any point in time, the virtual -> physical mappings for all users/processes give a one-to-one mapping from (process id + virtual page) to either a physical RAM page or a disk page (or both).
The reasons we use virtual memory are:
process isolation; i.e. one process can't see or interfere with another processes memory
simplifying application writing; i.e. each process thinks it has a contiguous set of memory addresses, and the same set each time. (To a first approximation ...)
simplifying compilation, linking, and loading; i.e. compilers, linkers, and loaders don't need to "relocate" code at compile time or load time to take account of other processes.
to allow the system to accommodate more processes than it has physical RAM for ... though this comes with potential risks and performance penalties.
I think you have a fundamental misconception about what goes on in an operating system in regard to memory.
(1) You are describing logical memory, not virtual memory. Virtual memory refers to the use of disk storage to simulate memory. Unmapped pages of logical memory get mapped to disk space.
Sadly, the terms logical memory and virtual memory get conflated, but they are distinct concepts and the distinction is becoming increasingly important.
(2) Programs run in a PROCESS. A process only runs one program at a time (in Unix, each process generally only runs one program in its lifetime, two if you count the cloned caller).
In modern systems each process gets a logical address space (sequential addresses) that can be mapped to physical locations or to no location at all. Generally, part of that logical address space is mapped to a kernel area that is shared by all processes. The logical address space is created with the process. No address space, no process.
In a 32-bit system, addresses 0-7FFFFFFF might be user addresses that are (generally) mapped to unique physical locations, while 80000000-FFFFFFFF might be mapped to a system address space that is the same for all processes.
(3) Logical memory management primarily serves as a means of security; not as a means for program loading (although it does help in that regard).
(4) This example makes no sense to me:
You compile a C program. You run it. And then you compile another C program. You try to run it but the resident running program in memory prevents loading a newer program even when space is available.
You are ignoring the concept of a PROCESS. A process can only have one program running at a time. In systems that do permit serial running of programs within the same process (e.g., VMS), the executing program prevents loading another program (or the loading of another program causes the termination of the running program). It is not a memory issue.
(5) This is not correct at all:
From my understanding, I think having no virtualization, if the compiler generates two physical addresses that are the same, the second won't run because it thinks there isn't enough space for it. When we virtualize this, as in the CPU generates only virtual addresses, the MMU will deal with this "collision" and find a spot for the other program in RAM.
The MMU does not deal with collisions. The operating system sets up tables that define the logical address space when the process starts. Logical memory has nothing to do with hash tables.
When a program accesses logical memory the rough sequence is:
Break the address down into a page and an offset within the page.
Does the page have a corresponding entry in the page table? If not, FAULT.
Is the entry in the page table valid? If not, FAULT.
Does the page table entry allow the type of access (read/write/execute) requested in the current operating mode (kernel/user/...)? If not, FAULT.
Does the entry map to a physical page? If not, PAGE FAULT (go load the page from disk, i.e. virtual memory, and try again).
Access the physical memory referenced by the page table entry.

Does Virtual Memory area struct only comes into picture when there is a page fault?

Virtual memory is quite a complex topic for me. I am trying to understand it. Here is my understanding for a 32-bit system where, say, RAM is just 2 GB. I have tried reading many links, and I am not confident at the moment. I would like you people to help me clear up my concepts. Please acknowledge my points, and also please answer where you feel I am wrong. I have also included a confused-points section. So, here starts the summary.
Every process thinks it is the only one running. It can access the 4 GB of memory of its virtual address space.
When a process accesses a virtual address, it is translated to a physical address via the MMU.
This MMU is part of the CPU; it is hardware.
When the MMU cannot translate the address to a physical one, it raises a page fault.
On a page fault, the kernel is notified. The kernel checks the VM area structs. If it can find the address there (the data may be on disk), it will do some page-in/page-out and get this memory into RAM.
Now the MMU will try again and will succeed this time.
In case the kernel cannot find the address, it will raise a signal. For example, an invalid access will raise SIGSEGV.
Confused points.
Is the page table maintained in the kernel? Does this VM area struct have a page table?
How can the MMU fail to find the address in physical RAM? Let's say it translates to some wrong address in RAM; the code will still execute, but against a bad address. How does the MMU ensure that it is reading the right data? Does it consult the kernel's VM areas every time?
Is the mapping table (virtual to physical) inside the MMU? I have read that it is maintained by each individual process. If it is inside the process, why can't I see it?
Or, if it is in the MMU, how does the MMU generate the address: is it segment + 12-bit shift -> page frame number, and then the addition of the offset (the low 12 bits) -> physical address?
Does that mean that, for a 32-bit architecture, with this calculation in mind, I can determine the physical address from a virtual address?
cat /proc/pid_value/maps shows me the current mapping of the VM areas. Basically, it reads the vm_area structs and prints them. That means this is important, but I am not able to fit this piece into the complete picture. When the program is executed, is the vm_area struct generated? Does the VM area only come into the picture when the MMU cannot translate the address, i.e. on a page fault? When I print the VM areas, they display the address range, permissions, mapped file descriptor, and offset. I am sure this file descriptor refers to the file on the hard disk and the offset is into that file.
The high-mem concept is that the kernel cannot directly access memory regions greater than about 1 GB, so it needs page table entries to map them indirectly: it will temporarily set up a mapping to reach such an address. Does HIGHMEM come into the picture every time? User space can directly translate addresses via the MMU, so in what scenario does the kernel really want to access high mem? I believe kernel drivers mostly use kmalloc, which returns a direct (physical + offset) address, so no extra mapping is required. So the question is: in what scenario does the kernel need to access high mem?
Do processors specifically come with MMU support? Can those without MMU support not run Linux?
Is the page table maintained in the kernel? Does this VM area struct have a page table?
Yes, and not exactly, respectively: each process has an mm_struct, which contains a list of vm_area_structs (which represent abstract, processor-independent memory regions, aka mappings), and a field called pgd, which is a pointer to the processor-specific page table (which contains the current state of each page: valid, readable, writable, dirty, ...).
The page table doesn't need to be complete, the OS can generate each part of it from the VMAs.
How can the MMU fail to find the address in physical RAM? Let's say it translates to some wrong address in RAM; the code will still execute, but against a bad address. How does the MMU ensure that it is reading the right data? Does it consult the kernel's VM areas every time?
The translation fails, e.g. because the page was marked as invalid, or a write access was attempted against a readonly page.
Is the mapping table (virtual to physical) inside the MMU? I have read that it is maintained by each individual process. If it is inside the process, why can't I see it?
Or, if it is in the MMU, how does the MMU generate the address: is it segment + 12-bit shift -> page frame number, and then the addition of the offset (the low 12 bits) -> physical address?
Does that mean that, for a 32-bit architecture, with this calculation in mind, I can determine the physical address from a virtual address?
There are two kinds of MMUs in common use. One of them only has a TLB (Translation Lookaside Buffer), which is a cache of the page table. When the TLB doesn't have a translation for an attempted access, a TLB miss is generated, the OS does a page table walk, and puts the translation in the TLB.
The other kind of MMU does the page table walk in hardware.
In any case, the OS maintains a page table per process, this maps Virtual Page Numbers to Physical Frame Numbers. This mapping can change at any moment, when a page is paged-in, the physical frame it is mapped to depends on the availability of free memory.
cat /proc/pid_value/maps shows me the current mapping of the VM areas. Basically, it reads the vm_area structs and prints them. That means this is important, but I am not able to fit this piece into the complete picture. When the program is executed, is the vm_area struct generated? Does the VM area only come into the picture when the MMU cannot translate the address, i.e. on a page fault? When I print the VM areas, they display the address range, permissions, mapped file descriptor, and offset. I am sure this file descriptor refers to the file on the hard disk and the offset is into that file.
To a first approximation, yes. Beyond that, there are many reasons why the kernel may decide to fiddle with a process' memory, e.g: if there is memory pressure it may decide to page out some rarely used pages from some random process. User space can also manipulate the mappings via mmap(), execve() and other system calls.
The high-mem concept is that the kernel cannot directly access memory regions greater than about 1 GB, so it needs page table entries to map them indirectly: it will temporarily set up a mapping to reach such an address. Does HIGHMEM come into the picture every time? User space can directly translate addresses via the MMU, so in what scenario does the kernel really want to access high mem? I believe kernel drivers mostly use kmalloc, which returns a direct (physical + offset) address, so no extra mapping is required. So the question is: in what scenario does the kernel need to access high mem?
Totally unrelated to the other questions. In summary, high memory is a hack to be able to access lots of memory in a limited address space computer.
Basically, the kernel has a limited address space reserved to it (on x86, a typical user/kernel split is 3 GB/1 GB; processes can run in user space or kernel space, and a process runs in kernel space when a syscall is invoked; to avoid having to switch the page table on every context switch, on x86 the address space is typically split between user space and kernel space). So the kernel can directly access up to ~1 GB of memory. To access more physical memory, there is some indirection involved, which is what high memory is all about.
Do processors specifically come with MMU support? Can those without MMU support not run Linux?
Laptop/desktop processors come with an MMU. x86 supports paging since the 386.
Linux, specifically the variant called uClinux, supports processors without MMUs (!MMU). Many embedded systems (ADSL routers, ...) use processors without an MMU. There are some important restrictions, among them:
Some syscalls don't work at all: e.g. fork().
Some syscalls work with restrictions and non-POSIX-conforming behavior: e.g. mmap().
The executable file format is different: e.g. bFLT or ELF-FDPIC instead of ELF.
The stack cannot grow, and its size has to be set at link time.
When a program is loaded, the kernel will first set up the VM areas for that process, right? These VM areas actually record where the program's sections are, in memory or on the HDD. Then the whole story of updating the CR3 register, the page table walk, and the TLB comes into the picture, right? So whenever there is a page fault, the kernel will update the page table by looking at the VM areas, right? But they say the VM areas keep updating; how is this possible, since cat /proc/pid_value/maps keeps changing? The map won't be constant from start to end. So the real information is available in the VM area structs, right? This is the actual record of where each section of the program lies, whether on the HDD or in physical RAM. Is this filled in during process loading, as the first job? The kernel does the page-in/page-out on page faults and updates the VM areas, right? So it should also know the program's location on the HDD for page-in/page-out, right? Please correct me here. This is in continuation of the first question in my previous comment.
When the kernel loads a program, it will set up several VMAs (mappings) according to the segments in the executable file (which, for ELF files, you can see with readelf --segments): the text/code segment, data segment, etc. During the lifetime of the program, additional mappings may be created by the dynamic/runtime linkers, by the memory allocator (malloc(), which may also extend the data segment via brk()), or directly by the program via mmap(), shm_open(), etc.
The VMAs contain the necessary information to generate the page table, e.g. they tell whether that memory is backed by a file or by swap (anonymous memory). So, yes, the kernel will update the page table by looking at the VMAs. The kernel will page in memory in response to page faults, and will page out memory in response to memory pressure.
Using x86 no PAE as an example:
On x86 with no PAE, a linear address can be split into 3 parts: the top 10 bits point to an entry in the page directory, the middle 10 bits point to an entry in the page table pointed to by the aforementioned page directory entry. The page table entry may contain a valid physical frame number: the top 22 bits of a physical address. The bottom 12 bits of the virtual address is an offset into the page that goes untranslated into the physical address.
Each time the kernel schedules a different process, the CR3 register is written with a pointer to the page directory of the current process. Then, on each memory access, the MMU tries to find a translation cached in the TLB; if it doesn't find one, it looks for one by doing a page table walk starting from CR3. If it still doesn't find one, a page fault is raised, the CPU switches to ring 0 (kernel mode), and the kernel tries to find one in the VMAs.
Also, I believe this reading from CR3, page directory -> page table -> page frame number -> memory address, is all done by the MMU. Am I correct?
On x86, yes, the MMU does the page table walk. On other systems (e.g: MIPS), the MMU is little more than the TLB, and on TLB miss exceptions the kernel does the page table walk by software.
Though this is not going to be the best answer, I would like to share my thoughts on the confused points.
1. Is the page table maintained in the kernel...
Yes, the kernel maintains the page tables. In fact it maintains nested page tables, and the top of the page tables is stored in top_pmd (pmd is the page middle directory). You can traverse all the page tables using this structure.
2. How can the MMU fail to find the address in physical RAM...
I am not sure I understood the question. But if, because of some problem, an instruction is faulted or something outside the instruction area is accessed, you generally get an undefined instruction exception resulting in an abort. If you look at the crash dumps, you can see it in the kernel log.
3. Is the mapping table (virtual to physical) inside the MMU...
Yes. The MMU is SW + HW; the HW part is the TLB and so on, and the mapping tables are stored there. For instructions, that is for the code section, I always converted between physical and virtual addresses and they always matched, and almost all the time they match for data sections as well.
4. cat /proc/pid_value/maps shows me the current mapping of the VM areas...
This is mostly used for analyzing the virtual addresses of user-space stacks. As you know, virtually every user-space program can have 4 GB of virtual address space, so unlike in the kernel, given an address like 0xc0100234 you cannot directly go and point to the instruction. You need this mapping and the virtual address to locate the instruction based on the data you have.
5. The high-mem concept is that the kernel cannot directly access memory...
High mem corresponds to user-space memory (someone correct me if I am wrong). When the kernel wants to read some data from an address in user space, it will be accessing HIGHMEM.
6. Do processors specifically come with MMU support? Can those without MMU support not run Linux?
The MMU, as I mentioned, is HW + SW, so mostly it comes with the chipset, and the SW part is generally architecture dependent. You can disable the MMU in the kernel config and build; I have never tried it, though. Mostly these days all chipsets have one, but on small boards I think they disable the MMU. I am not entirely sure, though.
As all these are conceptual questions, I may be lacking some knowledge and be wrong in places. If so, others please correct me.

Is the Kernel Virtual Memory struct first formed when the process is about to execute?

I have been bothering with similar questions indirectly on my other posts. Now, my understanding is better. Thus, my questions are better. So, I want to summarize the facts here. This example is based on X86-32-bit system.
Please say yes/no to my points. If no, then please explain.
The MMU will look into the CR3 register to find the process's page directory base address.
The CR3 register is set by the kernel.
Now the MMU, after reading the page directory base address, will offset to the page table index (calculated from the VA); from there it reads the page frame number, and then adds the offset (also from the VA) within that page frame. This gives the physical memory address. All this is done in the MMU, right? I don't know, when the MMU is disabled, who does all this circus? If software, then it will be slow, right?
I know a page fault occurs when the MMU cannot resolve the address. The kernel is informed. The kernel will update the page table based on what it reads from the kernel's virtual memory area structs. Am I correct?
Keeping point 4 in mind: does it mean that before executing any process, perhaps during loading, the kernel first fills in the virtual memory area structs? For example, where the sections of memory will be: BSS, code, data, etc. It could be that some sections are in RAM and some are on the storage device. When sections of the program are moved from storage to main memory, I assume the kernel updates the VM area structs. Am I correct here? So it is the kernel that keeps close track of the program's location, whether on the storage device or in RAM (inode number of the device and file offset).
Sequence-wise: during process loading (maybe by a loader program), the kernel populates the data in the virtual memory area structs and also sets the CR3 register. Now the process starts executing; it will initially get some frequent page faults. The VM area structs will be updated (if required) and then the page table, and the MMU will then succeed in translating the address. So when I say a process is accessing memory, it is the MMU that is accessing the memory on behalf of the process. This is all about user space. Kernel space is entirely different: kernel space doesn't need the MMU, it can map directly to physical addresses (low mem). For high mem (to access user space from kernel space), it will do a temporary page table update internally, in a separate, temporary page table for the kernel. Kernel space doesn't need the MMU. Am I correct?
Don't know when MMU is disabled, who will do all this circus?
Nobody. All this circus is intended to do two things: translate the virtual address you gave into a real address, and, if it can't do that, abort the instruction entirely and start executing a routine at an architecturally pre-defined address (the basic case being the page fault handler).
When the MMU is shut off, no translation is done and the address you gave it is fed directly down the CPU's address-processing pipe just as any address the MMU might have translated it to would have been.
So when I say a process is accessing memory, it is the MMU that is accessing the memory on behalf of the process.
You're on the right track here: the MMU is mediating the access, but it isn't doing the access. It does only what you described before, the translation. What's generally called the load/store unit gets the address next, and that is the unit that talks to whatever holds the closest good copy of the data at that address, i.e. "does the access".
Kernel space doesn't need the MMU; it can map directly to physical addresses.
That depends on how you define "need". It can certainly shut it off, but it almost never does. First, it has to talk to user space, and the MMU has to be running to translate what user space has to addresses the Load-Store unit can use. Second, the flexibility and protection provided by the MMU are very valuable, they're not discarded without a really compelling reason. I know at least one OS will (or would, it's been a while) run some bulk copies MMU-off, but that's about it.

Mapping of Virtual Address to Physical Address

I have a doubt: when each process has its own separate page table, why is a system-wide page table required? Also, if the page table maps virtual addresses to physical addresses, then I think two processes may map to the same physical address, because all processes have the same virtual address space. Any good link on system-wide page tables would also solve my problem.
Each process has its own independent virtual address space - two processes can have virtpage 1 map to different physpages. Processes can participate in shared memory, in which case they each have some virtpage mapping to the same physpage.
The virtual address space of a process can be used to map virtpages to physpages, to memory mapped files, devices, etc. Virtpages don't have to be wired to RAM. A process could memory-map an entire 1GB file - in which case, its physical memory usage might only be a couple megs, but its virtual address space usage would be 1GB or more. Many processes could do this, in which case the sum of virtual address space usage across all processes might be, say, 40 GB, while the total physical memory usage might be only, say, 100 megs; this is very easy to do on 32-bit systems.
Since lots of processes load the same libraries, the OS typically puts the libs in one set of read-only executable pages, and then loads mappings in the virtpage space for each process to point to that one set of pages, to save on physical memory.
Processes may have virtpage mappings that don't point to anything, for instance if part of the process's memory got written to the page file - the process will try to access that page, the CPU will trigger a page fault, the OS will see the page fault and handle it by suspending the process, reading the pages back into ram from the page file and then resuming the process.
There are typically 3 types of page faults. The first type is when the CPU does not have the virtual-to-physical mapping in the TLB: the processor invokes the page-fault software interrupt in the OS, the OS puts the mapping into the processor for that process, then the processor re-runs the offending instruction. These happen thousands of times a second.
The second type is when the OS has no mapping because, say, the memory for the process has been swapped to disk, as explained above. These happen infrequently on a lightly loaded machine, but happen more often as memory pressure is increased, up to 100s to 1000s of times per second, maybe even more.
The third type is when the OS has no mapping because the mapping does not exist - the process is trying to access memory that does not belong to it. This generates a segfault, and typically, the process is killed. These aren't supposed to happen often, and solely depend on how well written the software is on the machine, and does not have anything to do with scheduling or machine load.
Even if you already knew that, I figured I throw that in for the community.

When a binary file runs, does it copy its entire binary data into memory at once? Could I change that?

Does it copy the entire binary into memory before it executes? I am interested in this question and want to change it in some other way. I mean, if the binary were 100 MB (which seems unlikely), could I run it while it is still being copied into memory? Could that be possible?
Or could you tell me how to see the way it runs? Which tools do I need?
The theoretical model for an application-level programmer makes it appear that this is so. In point of fact, the normal startup process (at least in Linux 1.x; I believe 2.x and 3.x are optimized but similar) is:
The kernel creates a process context (more-or-less, virtual machine)
Into that process context, it defines a virtual memory mapping from virtual addresses to the contents of your executable file
Assuming that you're dynamically linked (the default/usual), the ld.so program (e.g. /lib/ld-linux.so.2) named in your program's headers sets up memory mappings for the shared libraries
The kernel does a jmp into the startup routine of your program (for a C program, that's _start from crt0/crt1.o, which eventually calls main). Since it has only set up the mapping, and not actually loaded any pages(*), this causes a Page Fault from the CPU's Memory Management Unit, which traps into the kernel.
The kernel's Page Fault handler loads some section of your program, including the part that caused the page fault, into RAM.
As your program runs, if it accesses a virtual address that doesn't have RAM backing it up right now, Page Faults will occur and cause the kernel to suspend the program briefly, load the page from disc, and then return control to the program. This all happens "between instructions" and is normally undetectable.
As you use malloc/new, the kernel creates read-write pages of RAM (without disc backing files) and adds them to your virtual address space.
If you throw a Page Fault by trying to access a memory location that isn't set up in the virtual memory mappings, you get a Segmentation Violation Signal (SIGSEGV), which is normally fatal.
As the system runs out of physical RAM, pages of RAM get evicted. If a page is a read-only copy of something already on disc (like an executable, or a shared object file), it just gets de-allocated and is reloaded from its source when needed again. If it is read-write (like memory you "created" using malloc), it gets written out to the ( page file = swap file = swap partition = on-disc virtual memory ). Accessing these "freed" pages causes another Page Fault, and they're re-loaded.
Generally, though, until your process is bigger than available RAM — and data is almost always significantly larger than the executable — you can safely pretend that you're alone in the world and none of this demand paging stuff is happening.
So: effectively, the kernel already is running your program while it's being loaded (and might never even load some pages, if you never jump into that code / refer to that data).
If your startup is particularly sluggish, you could look at the prelink system to optimize shared library loads. This reduces the amount of work that ld.so has to do at startup (between the exec of your program and main getting called, as well as when you first call library routines).
Sometimes, linking statically can improve performance of a program, but at a major expense of RAM — since your libraries aren't shared, you're duplicating "your libc" in addition to the shared libc that every other program is using, for example. That's generally only useful in embedded systems where your program is running more-or-less alone on the machine.
(*) In point of fact, the kernel is a bit smarter, and will generally preload some pages to reduce the number of page faults, but the theory is the same regardless of the optimizations
No, it only loads the necessary pages into memory. This is demand paging.
I don't know of a tool which can really show that in real time, but you can have a look at /proc/xxx/maps, where xxx is the PID of your process.
While you ask a valid question, I don't think it's something you need to worry about. First off, a binary of 100M is not impossible. Second, the system loader will map the pages it needs from the ELF (Executable and Linkable Format) file into memory, performing relocations etc. where necessary to make it work. It will also load all of its requisite shared library dependencies in the same way. However, this is not an especially time-consuming process, and not one that really needs optimizing. Arguably, any "optimization" would add significant overhead to ensure the program never uses something that hasn't been loaded yet, and would likely be less efficient.
If you're curious what gets mapped, as fge says, you can check /proc/pid/maps. If you'd like to see how a program loads, you can try running a program with strace, like:
strace ls
It's quite verbose, but it should give you some idea of the mmap() calls, etc.
