Putting CPU and Memory Management model all together - c

WARNING: This is long but I hope it can be useful for people like me in the future.
I think I know what the program counter is, how lazy memory allocation works, what the MMU does, how a virtual memory address is mapped to a physical address, and the purpose of the L1 and L2 caches. What I really have trouble with is how they all fit together at a high level when we run C code.
Suppose I have this C code:
#include <stdio.h>
#include <stdlib.h>
int main()
{
    int* ptr;
    int n = 1000000, i = 0;
    // Dynamically allocate memory using malloc()
    ptr = (int*)malloc(n * sizeof(int));
    ptr[0] = 99;
    i += 100;
    printf("%d\n", ptr[0]);
    free(ptr);
    return 0;
}
So here is my attempt to put everything together:
After execve() is called, part of the executable is loaded into memory, e.g. the text and data segments, but most of the code is not - it is loaded on demand (demand paging).
The address of the first instruction is in the process table's program counter (PC) field as well as physically in the PC register, ready to be used.
As the CPU executes instructions, PC is updated (usually +1, but jump can go to a different address).
Enter the main function: ptr, n, and i are in the stack.
Next, when we call malloc, the C library will ask the OS (I think via sbrk() sys call, or is it mmap()?) to allocate some memory on the heap.
malloc succeeds in this case, returning a virtual memory address (VMA), but the physical memory may not have been allocated yet. The page table doesn't contain the VMA, so when the CPU tries to access such a VMA, a page fault will be generated.
In our case, when we do ptr[0] = 99, CPU raises a page fault. I am not sure if the entire array is allocated or just the first page (4k size) though.
But now I don't know how to put cache access into the picture. How does i get put into the L1 cache? How does it relate to the VMA?
Sorry if this is confusing. I just hope someone could help walk me through the entire process...

Before the program runs, the operating system and the C runtime set up the necessary values in the CPU registers.
As you've already noted, the intended PC value is set by the operating system (e.g. by the loader) and then the CPU's PC (aka IP) register is set, probably with a "return from interrupt" instruction that both switches to user mode (activating the virtual memory map for that process) along with loading the CPU with the proper PC value (a virtual address).
In addition, the SP register is set somehow: in some systems this will be done similar to the PC during the "return from interrupt", but in other (older) systems the user code sets the SP to a prearranged location. In either case the SP also holds a virtual memory address.
Usually the first instruction that runs in the user process is in a routine traditionally called _start in a library called crt0 (C RunTime 0, aka startup). _start is usually written in assembly and handles the transition from the operating system to user mode. As needed, _start will establish anything else necessary for C code to be called, and then call main. If main returns to _start, it will do an exit syscall.
The CPU caches (and probably TLBs) will be cold when _start's first instruction gets control. All addresses in user mode are virtual memory addresses that designate memory within the (virtual) address space of the process. The processor is running in user mode. Probably the operating system has preloaded the page holding _start (or at least the start of _start). So when the processor performs an instruction fetch from _start, it will probably TLB miss, but not page fault, and then cache miss.
The TLB is a set of registers forming a cache in the CPU that support virtual to physical address translations/mappings. The TLB, when it misses, will be loaded from a structure in the virtual memory mapping for the process, such as the page tables. Since that first page is preloaded, the attempt to map will succeed, and the TLB will then be filled with the proper mappings from the virtual PC page to the physical page. However, the L1/L2, etc.. caches are also cold, so the access next causes a cache miss. The memory system will satisfy the cache miss by filling a cache line at each level. Finally an instruction word or group of words is provided to the processor, and it begins executing instructions.
If a virtual address for code (by way of the PC) or data (by some dereference) is not present in the TLB, then the processor will consult the page tables, and a miss there can cause a recoverable or non-recoverable page fault. Recoverable page faults are virtual to physical mappings that are not present in the page tables, because the data is on disc and operating system intervention is required; whereas non-recoverable faults are accesses to virtual memory that are in error, i.e. not allowed as they refer to virtual memory that has not been allocated/authorized by the operating system.
Variable i is known to main as a stack-relative location. So, when main wants to write to i it will write to memory at an offset from SP, e.g. SP+8 (i could also be a register variable, but I digress). Since the SP is a pointer holding a virtual memory address, i then has a virtual address. That virtual address goes through the steps described above: TLB mapping from virtual page to physical page, possible page faulting, and then possible cache miss. Subsequent accesses will yield TLB hits and cache hits, so as to run at full speed. (The operating system will probably also preload some but not all stack pages before running the process.)
A malloc operation will use some system calls that ultimately cause additional virtual memory to be added to the process. (Though as you also note, malloc gets more than enough for the current request so the system calls are not done on every malloc.) malloc will return a virtual memory address, i.e. a pointer in the user mode virtual address space. For memory just obtained by a system call, the TLB and caches are also probably cold, and it is possible that the page is not even loaded yet as well. In the latter case, a recoverable page fault will happen and the OS will allocate a physical page to use. If the OS is smart it will know that this is a new data page, and so can fill it with zeros instead of loading it from the paging file. Then it will set up the page table entries for the proper mapping, and resume the user process, which will probably then TLB miss, fill a TLB entry from the page tables, and then cache miss, and fill cache lines from the physical page.

Related

Confusion of virtual memory

Consider a sample below.
char* p = (char*)malloc(4096);
p[0] = 'a';
p[1] = 'b';
The 4KB memory is allocated by calling malloc(). OS handles the memory request by the user program in user-space. First, OS requests memory allocation to RAM, then RAM gives physical memory address to OS. Once OS receives physical address, OS maps the physical address to virtual address then OS returns the virtual address which is the address of p to user program.
I wrote some values (a and b) to a virtual address and they are really written into main memory (RAM). I'm confused that I wrote the values to a virtual address, not a physical address, yet they are really written to main memory (RAM) even though I didn't take care of that myself.
What happens behind the scenes? What does the OS do for me? I couldn't find relevant material in the books I have (OS, system programming).
Could you give some explanation? (Please omit the contents about cache for easier understanding)
A detailed answer to your question will be very long - and too long to fit here at StackOverflow.
Here is a very simplified answer to a little part of your question.
You write:
I'm confused that I wrote some value in virtual address, not physical address, but it is really written to main memory
Seems you have a very fundamental misunderstanding here.
There is no memory directly "behind" a virtual address. Whenever you access a virtual address in your program, it is automatically translated to a physical address and the physical address is then used for access in main memory.
The translation happens in HW, i.e. inside the processor in a block called "MMU - Memory management unit" (see https://en.wikipedia.org/wiki/Memory_management_unit).
The MMU holds a small but very fast look-up table that tells how a virtual address is to be translated into a physical address. The OS configures this table but after that, the translation happens without any SW being involved and - just to repeat - it happens whenever you access a virtual memory address.
The MMU also takes some kind of process ID as input in order to do the translation. This is needed because two different processes may use the same virtual address but they will need translation to two different physical addresses.
As mentioned above, the MMU look-up table (TLB) is small, so the MMU can't hold all translations for a complete system. When the MMU can't do a translation, it can raise an exception of some kind so that some OS software is triggered. The OS will then re-program the MMU so that the missing translation gets into the MMU and process execution can continue. Note: Some processors can do this in HW, i.e. without involving the OS.
You have to understand that virtual memory is virtual: its address space can be larger than physical RAM, so virtual addresses are mapped onto physical storage rather than being identical to it. The data itself still ends up in the same physical memory.
Your programs use virtual memory addresses, and it is your OS who decides to save in RAM. If it fills up, then it will use some space on the hard drive to continue working.
But the hard drive is slower than RAM, which is why your OS uses a page-replacement algorithm (for example, an approximation of least-recently-used) to exchange pages of memory between the hard drive and RAM, depending on the work being done, so that the data most likely to be used stays in fast memory. To swap pages back and forth, the OS does not need to modify virtual memory addresses.
Summary overlooking a lot of things
You want to understand how virtual memory works. There's lots of online resources about this, here's one I found that seems to do a fair job of trying to explain it without getting too crazy in technical details, but also doesn't gloss over important terms.
https://searchstorage.techtarget.com/definition/virtual-memory
For Linux on x86 platforms, the assembly equivalent of asking for memory is basically a call into the kernel using int 0x80 with some parameters for the call set into some registers. The interrupt is set at boot by the OS to be able to answer for the request. It is set in the IDT.
An IDT descriptor for 32 bits systems looks like:
struct IDTDescr {
    uint16_t offset_1;  // offset bits 0..15
    uint16_t selector;  // a code segment selector in GDT or LDT
    uint8_t  zero;      // unused, set to 0
    uint8_t  type_attr; // type and attributes, see below
    uint16_t offset_2;  // offset bits 16..31
};
The offset is the address of the entry point of the handler for that interrupt. So interrupt 0x80 has an entry in the IDT. This entry points to the address of the handler (also called an ISR). When you call malloc(), the compiler emits an ordinary call into the C library; the library makes a system call (such as brk or mmap) only when it needs more memory from the kernel. The system call returns the address of the allocated memory in some register. I'm pretty sure as well that this system call will actually use the sysenter x86 instruction to switch into kernel mode. This instruction is used alongside an MSR (Model Specific Register) to securely jump from user mode into kernel mode at the address specified in the MSR.
Once in kernel mode, all instructions can be executed and access to all hardware is unlocked. To satisfy the request, the OS doesn't "ask RAM for memory". RAM isn't aware of what memory the OS uses. RAM just blindly answers to asserted pins on its DIMM and stores information. The OS just checks at boot, using the ACPI tables that were built by the BIOS, to determine how much RAM there is and which devices are connected to the computer, so that it avoids writing to MMIO (memory-mapped I/O) regions. Once the OS knows how much RAM is available (and what parts are usable), it will use algorithms to determine what parts of available RAM every process should get.
When you compile C code, the compiler (and linker) will determine the virtual address of all code and static data at build time. When you launch that executable, the OS knows from the executable file what memory the process needs initially, so it sets up the page tables for that process accordingly. When you ask for memory dynamically using malloc(), the OS determines what part of physical memory your process should get and changes the page tables accordingly at runtime.
As to paging itself, you can always read some articles. The simplest case is 32-bit paging. In 32-bit paging you have a CR3 register for each CPU core. This register contains the physical address of the base of the Page Global Directory. The PGD contains the physical addresses of the bases of several Page Tables, which themselves contain the physical addresses of the bases of several physical pages (https://wiki.osdev.org/Paging). A virtual address is split into 3 parts. The 12 bits to the right (LSB) are the offset in the physical page. The 10 bits in the middle are the index into the page table and the 10 MSB are the index into the PGD.
So when you write
char* p = (char*)malloc(4096);
p[0] = 'a';
p[1] = 'b';
you create a pointer of type char* and call malloc to ask for 4096 bytes of memory; malloc may in turn make a system call. The OS puts the first address of that chunk of memory into a certain conventional register (which depends on the system and OS). You should not forget that the C language is just a convention. It is up to the platform to implement that convention with a compatible compiler and library. That means the toolchain knows which register and which interrupt number to use (for the system call) because it was specifically written for that OS. The compiled code will thus take the address returned in that register and store it into this pointer of type char* at runtime. On the second line you are telling the compiler that you want to take the char at the first address and make it an 'a'. On the third line you make the second char a 'b'. In the end, you could write the equivalent:
char* p = (char*)malloc(4096);
*p = 'a';
*(p + 1) = 'b';
The p is a variable containing an address. The + operation on a pointer advances this address by the size of the type it points to. In this case, the pointer points to a char, so the + operation advances the pointer by one char (one byte). If it were pointing to an int, it would advance by sizeof(int) bytes (4 bytes, i.e. 32 bits, on most current platforms). The size of the pointer itself depends on the system. On a 32-bit system the pointer is 32 bits wide (because it contains an address). On a 64-bit system the pointer is 64 bits wide. A static memory equivalent of what you did is
char p[4096];
p[0] = 'a';
p[1] = 'b';
Now the compiler will know at compile time what memory this array will get. It is static memory. Even then, in most expressions p behaves like a pointer to the first char of that array. It means you could write
char p[4096];
*p = 'a';
*(p + 1) = 'b';
It would have the same result.
First, OS requests memory allocation to RAM,…
The OS does not have to request memory. It has access to all of memory the moment it boots. It keeps its own database of which parts of that memory are in use for what purposes. When it wants to provide memory for a user process, it uses its own database to find some memory that is available (or does things to stop using memory for other purposes and then make it available). Once it chooses the memory to use, it updates its database to record that it is in use.
… then RAM gives physical memory address to OS.
RAM does not give addresses to the OS except that, when starting, the OS may have to interrogate the hardware to see what physical memory is available in the system.
Once OS receives physical address, OS maps the physical address to virtual address…
Virtual memory mapping is usually described as mapping virtual addresses to physical addresses. The OS has a database of the virtual memory addresses in the user process, and it has a database of physical memory. When it is fulfilling a request from the process to provide virtual memory and it decides to back that virtual memory with physical memory, the OS will inform the hardware of what mapping it chose. This depends on the hardware, but a typical method is that the OS updates some page table entries that describe what virtual addresses get translated to what physical addresses.
I wrote some value(a and b) in virtual address and they are really written into main memory(RAM).
When your process writes to virtual memory that is mapped to physical memory, the processor will take the virtual memory address, look up the mapping information in the page table entries or other database, and replace the virtual memory address with a physical memory address. Then it will write the data to that physical memory.

Does Virtual Memory area struct only comes into picture when there is a page fault?

Virtual Memory is a quite complex topic for me. I am trying to understand it. Here is my understanding for a 32-bit system. Example RAM is just 2GB. I have tried reading many links, and I am not confident at the moment. I would like you people to help me in clearing up my concepts. Please acknowledge my points, and also please answer for what you feel is wrong. I have also a confused section in my points. So, here starts the summary.
Every process thinks it is the only one running. It can access 4 GB of memory - its virtual address space.
When a process accesses a virtual address, it is translated to a physical address via the MMU.
This MMU is a part of a CPU - a hardware.
When the MMU cannot translate the address to a physical one, it raises a page fault.
On a page fault, the kernel is notified. The kernel checks the VM area struct. If it can find the address there, the page may be on disk. It will do some page-in/page-out and get this memory into RAM.
Now MMU will again try and will succeed this time.
In case the kernel cannot find the address, it will raise a signal. For example, invalid access will raise a SIGSEGV.
Confused points.
Is the page table maintained in the kernel? Does this VM area struct have a page table?
How MMU cannot find the address in physical RAM. Let's say it translates to some wrong address in RAM. Still the code will execute, but it will be a bad address. How MMU ensures that it is reading a right data? Does it consult Kernel VM area everytime?
Is the Mapping table - virtual to physical is inside a MMU. I have read it that is maintained by an individual process. If it is inside a process, why I can't see it.
Or if it is MMU, how MMU generates the address - is it that Segment + 12-bit shift -> Page frame number, and then the addition of offset (bits -1 to 10) -> gives a physical address.
Does it mean that for a 32-bit architecture, with this calculation in my mind. I can determine the physical address from a virtual address.
cat /proc/pid_value/maps. This shows me the current mapping of the vmarea. Basically, it reads the Vmarea struct and prints it. That means that this is important. I am not able to fit this piece in the complete picture. When the program is executed does the vmarea struct is generated. Is VMAREA comes only into the picture when the MMU cannnot translate the address i.e. Page fault? When I print the vmarea it displays the address range , permission and mapped to file descriptor, and offset. I am sure this file descriptor is the one in the hard-disk and the offset is for that file.
The high-mem concept is that kernel cannot directly access the Memory region greater than 1 GB(approx). Thus, it needs a page table to indirectly map it. Thus, it will temporarily load some page table to map the address. Does HIGH MEM will come into the picture everytime. Because Userspace can directly translate the address via MMU. On what scenario, does kernel really want to access the High MEM. I believe the kernel drivers will mostly be using kmalloc. This is a direct memory + offset address. In this case no mapping is really required. So, the question is on what scenario a kernel needs to access the High Mem.
Does the processor specifically come with MMU support? Can those without MMU support not run Linux?
Is the page table maintained in the kernel? Does this VM area struct have a page table?
Yes. Not exactly: each process has a mm_struct, which contains a list of vm_area_struct's (which represent abstract, processor-independent memory regions, aka mappings), and a field called pgd, which is a pointer to the processor-specific page table (which contains the current state of each page: valid, readable, writable, dirty, ...).
The page table doesn't need to be complete, the OS can generate each part of it from the VMAs.
How MMU cannot find the address in physical RAM. Let's say it translates to some wrong address in RAM. Still the code will execute, but it will be a bad address. How MMU ensures that it is reading a right data? Does it consult Kernel VM area everytime?
The translation fails, e.g. because the page was marked as invalid, or a write access was attempted against a readonly page.
Is the Mapping table - virtual to physical is inside a MMU. I have read it that is maintained by an individual process. If it is inside a process, why I can't see it.
Or if it is MMU, how MMU generates the address - is it that Segment + 12-bit shift -> Page frame number, and then the addition of offset (bits -1 to 10) -> gives a physical address.
Does it mean that for a 32-bit architecture, with this calculation in my mind. I can determine the physical address from a virtual address.
There are two kinds of MMUs in common use. One of them only has a TLB (Translation Lookaside Buffer), which is a cache of the page table. When the TLB doesn't have a translation for an attempted access, a TLB miss is generated, the OS does a page table walk, and puts the translation in the TLB.
The other kind of MMU does the page table walk in hardware.
In any case, the OS maintains a page table per process, which maps Virtual Page Numbers to Physical Frame Numbers. This mapping can change at any moment: when a page is paged in, the physical frame it is mapped to depends on the availability of free memory.
cat /proc/pid_value/maps. This shows me the current mapping of the vmarea. Basically, it reads the Vmarea struct and prints it. That means that this is important. I am not able to fit this piece in the complete picture. When the program is executed does the vmarea struct is generated. Is VMAREA comes only into the picture when the MMU cannnot translate the address i.e. Page fault? When I print the vmarea it displays the address range , permission and mapped to file descriptor, and offset. I am sure this file descriptor is the one in the hard-disk and the offset is for that file.
To a first approximation, yes. Beyond that, there are many reasons why the kernel may decide to fiddle with a process' memory, e.g: if there is memory pressure it may decide to page out some rarely used pages from some random process. User space can also manipulate the mappings via mmap(), execve() and other system calls.
The high-mem concept is that kernel cannot directly access the Memory region greater than 1 GB(approx). Thus, it needs a page table to indirectly map it. Thus, it will temporarily load some page table to map the address. Does HIGH MEM will come into the picture everytime. Because Userspace can directly translate the address via MMU. On what scenario, does kernel really want to access the High MEM. I believe the kernel drivers will mostly be using kmalloc. This is a direct memory + offset address. In this case no mapping is really required. So, the question is on what scenario a kernel needs to access the High Mem.
Totally unrelated to the other questions. In summary, high memory is a hack to be able to access lots of memory in a limited address space computer.
Basically, the kernel has a limited address space reserved to it (on x86, a typical user/kernel split is 3 GB/1 GB [processes can run in user space or kernel space. A process runs in kernel space when a syscall is invoked. To avoid having to switch the page table on every context switch, on x86 the address space is typically split between user space and kernel space]). So the kernel can directly access up to ~1 GB of memory. To access more physical memory, there is some indirection involved, which is what high memory is all about.
Does the processor specifically come with MMU support? Can those without MMU support not run Linux?
Laptop/desktop processors come with an MMU. x86 has supported paging since the 386.
Linux, especially the variant called µClinux, supports processors without MMUs (!MMU). Many embedded systems (ADSL routers, ...) use processors without an MMU. There are some important restrictions, among them:
Some syscalls don't work at all: e.g. fork().
Some syscalls work with restrictions and non-POSIX conforming behavior: e.g. mmap()
The executable file format is different: e.g. bFLT or ELF-FDPIC instead of ELF.
The stack cannot grow, and its size has to be set at link-time.
When a program is loaded, the kernel first sets up a kernel VM area for that process, is that right? This kernel VM area actually records where the program sections are in memory or on the HDD. Then the whole story of updating the CR3 register and the page walk or TLB comes into the picture, right? So whenever there is a page fault, the kernel will update the page table by looking at the kernel virtual memory area, is that right? But they say the kernel VM area keeps updating. How is this possible, given that the output of cat /proc/pid_value/maps keeps changing? The map won't be constant from start to end. So the real information is in the kernel VM area struct, is that right? This is the actual information about where each section of the program lies, which could be the HDD or physical memory (RAM)? So this is filled in during process loading, as the first job? The kernel does the page-in/page-out on page fault and will update the kernel VM area, is that right? So it should also know the entire program's location on the HDD for page-in/page-out, right? Please correct me here. This is a continuation of my first question in the previous comment.
When the kernel loads a program, it will setup several VMAs (mappings), according to the segments in the executable file (which on ELF files you can see with readelf --segments), which will be text/code segment, data segment, etc... During the lifetime of the program, additional mappings may be created by the dynamic/runtime linkers, by the memory allocator (malloc(), which may also extend the data segment via brk()), or directly by the program via mmap(),shm_open(), etc..
The VMAs contain the necessary information to generate the page table, e.g. they tell whether that memory is backed by a file or by swap (anonymous memory). So, yes, the kernel will update the page table by looking at the VMAs. The kernel will page in memory in response to page faults, and will page out memory in response to memory pressure.
Using x86 no PAE as an example:
On x86 with no PAE, a linear address can be split into 3 parts: the top 10 bits point to an entry in the page directory, and the middle 10 bits point to an entry in the page table pointed to by that page directory entry. The page table entry may contain a valid physical frame number: the top 20 bits of a physical address. The bottom 12 bits of the virtual address are an offset into the page that goes untranslated into the physical address.
Each time the kernel schedules a different process, the CR3 register is written with a pointer to the page directory for the new process. Then, each time a memory access is made, the MMU looks for a translation cached in the TLB; if it doesn't find one, it does a page table walk starting from CR3. If it still doesn't find one, a page fault (#PF) is raised, the CPU switches to Ring 0 (kernel mode), and the kernel tries to find one in the VMAs.
Also, I believe this reading from CR, page directory->page-table->Page frame number-memory address this all done by MMU. Am I correct?
On x86, yes, the MMU does the page table walk. On other systems (e.g: MIPS), the MMU is little more than the TLB, and on TLB miss exceptions the kernel does the page table walk by software.
Though this is not going to be the best answer, I would like to share my thoughts on the confused points.
1. Does Page table is maintained...
Yes, the kernel maintains the page tables. In fact it maintains nested page tables, and the top of the page tables is stored in top_pmd. I believe pmd stands for page middle directory. You can traverse all the page tables using this structure.
2. How MMU cannot find the address in physical RAM.....
I am not sure I understood the question. But if, because of some problem, an instruction faults or an address outside its instruction area is accessed, you generally get an undefined instruction exception resulting in an abort. If you look at the crash dumps, you can see it in the kernel log.
3. Is the Mapping table - virtual to physical is inside a MMU...
Yes. The MMU is SW+HW. The HW is the TLB and so on, and the mapping tables are stored here. For instructions, that is, for the code section, whenever I converted between physical and virtual addresses they always matched. And almost all the time they match for data sections as well.
4. cat /proc/pid_value/maps. This shows me the current mapping of the vmarea....
This is mostly used for analyzing the virtual addresses of user-space stacks. As you know, virtually every user-space program can have 4 GB of virtual address space. So, unlike in the kernel, if I say 0xc0100234 you cannot directly go and point to the instruction. You need this mapping and the virtual address to locate the instruction based on the data you have.
5. The high-mem concept is that kernel cannot directly access the Memory...
High-mem corresponds to user-space memory (someone correct me if I am wrong). When the kernel wants to read some data from an address in user space, it will be accessing HIGHMEM.
6. Does the processor specifically comes with the MMU support. Those who doesn't have MMU support cannot run Linux?
The MMU, as I mentioned, is HW + SW. So mostly it comes with the chipset, and the SW is generally architecture dependent. You can disable the MMU from the kernel config and build. I have never tried it though. These days almost all chipsets have it, but I think small boards disable the MMU. I am not entirely sure though.
As all these are conceptual questions, i may be lacking some knowledge and be wrong at places. If so others please correct me.

Is the Kernel Virtual Memory struct first formed when the process is about to execute?

I have been bothering with similar questions indirectly on my other posts. Now, my understanding is better. Thus, my questions are better. So, I want to summarize the facts here. This example is based on X86-32-bit system.
Please say yes/no to my points. If no, then please explain.
MMU will look into the CR3 register to find the Process - Page Directory base address.
The CR3 register is set by the kernel.
Now MMU after reading the Page directory base address, will offset to the Page Table index (calculated from VA), from here it will read the Page frame number, now it will find the offset on the page frame number based on the VA given. It gets the physical memory address. All this is done in MMU right? Don't know when MMU is disabled, who will do all this circus? If software then it will be slow right?
I know that a page fault occurs when the MMU cannot resolve the address. The kernel is informed. The kernel will update the page table based on what it reads from the kernel virtual memory area struct. Am I correct?
Keeping in mind, the point 4. Does it mean that before executing any process. Perhaps during loading process. Does Kernel first fills the kernel virtual memory area struct. For example, where the section of memory will be BSS, Code, DS,etc. It could be that some sections are in RAM, and some are in Storage device. When the sections of the program is moved from storage to main memory, I am assuming that kernel would be updating the Kernel virtual memory area struct. Am I correct here? So, it is the kernel who keeps a close track on the program location - whether in storage device or RAM - inode number of device and file offset.
Sequence wise -> During Process loading ( may be a loader program)-> Kernel will populate the data in the kernel virtual memory area struct. It will also set the CR3 register. Now Process starts executing, it will initially get some frequent page faults.Now the VM area struct will be updated (if required) and then the page table. Now, MMU will succeed in translating the address. So, when I say process accessing a memory, it is the MMU which is accessing the memory on behalf of the process. This is all about user-space. Kernel space is entirely different. The kernel space doesn't need the MMU, it can directly map to the physical address - low mem. For high mem ( to access user space from kernel space), it will do the temporary page table updation - internally. This is a separate page table for kernel, a temporary one. The kernel space doesn't need MMU. Am I correct?
Don't know when MMU is disabled, who will do all this circus?
Nobody. All this circus is intended to do two things: translate the virtual address you gave it into a real address and, if it can't do that, abort the instruction entirely and start executing a routine at an architecturally predefined address (the "page fault" handler is the basic one).
When the MMU is shut off, no translation is done and the address you gave it is fed directly down the CPU's address-processing pipe just as any address the MMU might have translated it to would have been.
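To make the translation step concrete, here is a minimal sketch in C of the classic 32-bit x86 two-level walk (10-bit directory index, 10-bit table index, 12-bit offset, no PAE). It is a software model only, not how any kernel is written: the type names, the `translate` function and the use of C pointers inside the directory are assumptions made for illustration (real entries hold physical addresses plus flag bits).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES 1024u   /* 2^10 entries per directory and per table */
#define PRESENT 0x1u    /* bit 0 of an entry: the mapping is present */

/* Hypothetical, simplified stand-ins for the in-memory structures. */
typedef struct { uint32_t entry[ENTRIES]; } page_table_t;
typedef struct { page_table_t *table[ENTRIES]; } page_directory_t;

/* Walks both levels. Returns true and stores the physical address on
   success; returns false where real hardware would raise a page fault. */
bool translate(const page_directory_t *dir, uint32_t va, uint32_t *pa)
{
    assert(dir != NULL && pa != NULL);

    uint32_t dir_index   = va >> 22;             /* top 10 bits */
    uint32_t table_index = (va >> 12) & 0x3FFu;  /* middle 10 bits */
    uint32_t offset      = va & 0xFFFu;          /* low 12 bits */

    const page_table_t *pt = dir->table[dir_index];
    if (pt == NULL)
        return false;                            /* no page table: fault */

    uint32_t pte = pt->entry[table_index];
    if ((pte & PRESENT) == 0)
        return false;                            /* page not present: fault */

    *pa = (pte & 0xFFFFF000u) | offset;          /* frame base + offset */
    return true;
}
```

With directory slot 32 pointing at a table whose entry 72 holds frame 0x00123000 marked present, `translate` maps virtual address 0x08048ABC to 0x00123ABC. Any address whose table or entry is missing returns false, which is the case where the hardware would abort the instruction and jump to the fault routine.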
So, when I say process accessing a memory, it is the MMU which is accessing the memory on behalf of the process.
You're on the right track here: the MMU is mediating the access, but it isn't doing the access. It's doing only what you described before, translating the address. What's generally called the Load/Store unit gets it next, and it's the one that handles talking to whatever holds the closest good copy of the data at that address; that is, it "does the access".
The kernel space doesn't need the MMU, it can directly map to the physical address
That depends on how you define "need". It can certainly shut it off, but it almost never does. First, it has to talk to user space, and the MMU has to be running to translate the addresses user space uses into addresses the Load/Store unit can use. Second, the flexibility and protection provided by the MMU are very valuable, and they're not discarded without a really compelling reason. I know at least one OS will (or would, it's been a while) run some bulk copies MMU-off, but that's about it.

example of address translation

I have a doubt with respect to the address space.
I had thought that RAM, if 4 GB, is split into two parts: kernel space (1 GB) and user space (3 GB).
1] Does RAM also maintain stack, heap, code and data sections, as the hard disk does?
2] Isn't the running process given a boundary within which the stack, data, code and heap have to grow in RAM?
3] My thought was that the stack, heap, code and data segments would all be in the consecutive address space given to the process at the time of process creation.
4] How does the CPU get the correct address of the process to execute, as the processes are not contiguous in physical memory?
No, only the virtual memory address space is split in two. Physical memory, the RAM in the machine, contains an entirely random collection of blocks that map to virtual memory addresses, from both operating system pages and user program pages. This is much like the image shows, although it is a bit misleading in showing the OS pages at the bottom.
That mapping constantly changes. A page fault is the essential mechanism to get a virtual memory page mapped to RAM; it is triggered when a program accesses a virtual memory page that isn't present in RAM yet. As needed, RAM pages may be unmapped to make room, and their content is either discarded or written to the pagefile. Code is usually discardable, since it can be read back from the executable file; data usually isn't.
Some pages in RAM are special: they contain code and data used by drivers, and they are page-locked. This is required when the driver handles device interrupts, because the code and data used by the interrupt handler must be present in RAM for the interrupt to be handled; it can't afford a page fault at such a critical time. That is the probable reason the image was drawn like that.

how is virtual address translated to its physical address on backing store?

We have an address translation table to translate a virtual address (VA) of a process to its corresponding physical address in RAM. If the table does not have an entry for a VA, the result is a page fault: the kernel goes to the backing store (often a hard drive), fetches the corresponding data, and updates the RAM and the address translation table. So my question is: how does the OS know which address in the backing store corresponds to a VA? Does it have a separate translation table for that?
A process starts by allocating virtual memory. That will eventually cause a page fault when the program starts actually addressing the virtual memory. The OS knows that the memory access is valid, since the memory was allocated explicitly.
So no harm done, the OS simply maps the VM address to a physical address.
If the page fault is for an address that was never requested as a valid VM address, there is no page table entry for it and the OS finds no allocation covering it either. Instead of mapping a page, it reports the error to your program: a GP fault, an AccessViolation or a segfault. Kaboom, program over.
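The allocate-then-touch sequence described above can be observed on a POSIX system. The sketch below is an assumption-laden illustration, not part of the original answer: it reserves anonymous memory with `mmap`, writes one byte to each page, and uses `getrusage` to see how many minor page faults the process took in between. The function name `faults_from_first_touch` is made up for this example.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Reserves `pages` pages of anonymous memory, writes to each page once,
   and returns how many minor page faults were counted while doing so.
   Returns -1 if the mapping could not be created. */
long faults_from_first_touch(size_t pages)
{
    assert(pages > 0);

    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    size_t length = pages * page_size;

    /* Only virtual address space is reserved here; no RAM is assigned yet. */
    volatile unsigned char *p = mmap(NULL, length, PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    for (size_t i = 0; i < pages; i++)
        p[i * page_size] = 1;          /* first write to each page */
    getrusage(RUSAGE_SELF, &after);

    munmap((void *)p, length);
    return after.ru_minflt - before.ru_minflt;
}
```

One would expect roughly one fault per touched page, but the exact count depends on the kernel and its settings (huge pages, for instance, can reduce it), so treat the number as indicative only.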
There is no direct correlation, at least not in the way that you suppose.
The operating system divides virtual and physical RAM, as well as swap space (backing store) and mapped files, into pages, most commonly of 4096 bytes.
When your code accesses a certain address, this is always a virtual address within a page that is either valid-in-core, valid-not-accessed, valid-out-of-core, or invalid. The OS may have other properties (such as "has been written to") in its books, but they're irrelevant for us here.
If the page is in-core, then it has a physical address, otherwise it does not. When swapped out and in again, the same identical page could in theory very well land in a different physical region of memory. Similarly, the page after some other page in memory (virtual or physical) could be before that page in the swap file or in a memory-mapped file. There's no guarantee for that.
Thus, there is no such thing as translating a virtual address to a physical address in backing store. There is only a translation from a virtual address to a page which may temporarily have a physical address. In the easiest case, "translating" means dividing by 4096, but of course other schemes are possible.
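As a small worked example of the "dividing by 4096" case, the following C sketch splits a virtual address into a page number and an offset within that page. The helper name `split_address` and the constant `PAGE_BYTES` are assumptions for illustration; a real system gets the page size from the hardware or the OS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u   /* the common page size named in the text */

/* Splits a virtual address into the page it falls in and the byte
   offset inside that page. */
void split_address(uint64_t va, uint64_t *page, uint64_t *offset)
{
    assert(page != NULL && offset != NULL);
    *page   = va / PAGE_BYTES;   /* same as va >> 12 for 4096-byte pages */
    *offset = va % PAGE_BYTES;   /* same as va & 0xFFF */
}
```

For the address 0x12345 this yields page 18 (0x12) and offset 0x345. Only the page number takes part in any lookup; the offset is carried over unchanged to wherever the page currently lives.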
Further, every time your code accesses a memory location, the virtual address must be translated to a physical one. There exists dedicated logic inside a CPU to do this translation fully automatically (for a very small subset of "hot" pages, often as few as 64), or in a hardware-assisted way, which usually involves a lookup in a more or less complicated hierarchical structure.
Falling back to that lookup is also a fault of sorts, but it's one that you don't see. The only faults that you get to see are the ones when the OS doesn't have a valid page (or can't supply it for some reason), and thus can't assign a physical address to the to-be-translated virtual one.
When your program asks for memory, the OS remembers that certain pages are valid, but they do not exist yet because you have never accessed them.
The first time you access a page, a fault happens and obviously its address is nowhere in the translation tables (how could it be, it doesn't exist!). Thus the OS looks into its books and (assuming the page is valid) it either loads the page from disk or assigns the address of a zero page otherwise.
Quite often the OS will cheat: all zero pages are the same write-protected zero page until you actually write to one, at which point a fault occurs and you are secretly redirected to a different physical memory area, one which you can write to.
Otherwise, that is, if you haven't reserved the memory, the OS sends a signal (or an equivalent; Windows calls it an "exception"), which will terminate your process unless handled.
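The decision the OS makes on a fault, as described above, can be sketched as a tiny state machine in C. This is a toy model under stated assumptions: the 16-page address space, the enum names and `handle_fault` are all invented for illustration, and a real kernel tracks far more per page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u
#define NUM_PAGES  16u     /* toy address space: 16 virtual pages */

typedef enum { INVALID, VALID_NOT_ACCESSED, VALID_OUT_OF_CORE, VALID_IN_CORE } page_state_t;
typedef enum { SEND_SIGNAL, MAP_ZERO_PAGE, LOAD_FROM_DISK, NOTHING_TO_DO } fault_action_t;

/* The OS's "books": one state per virtual page. */
typedef struct { page_state_t state[NUM_PAGES]; } address_space_t;

/* Decides what to do for a fault at `va` and updates the books. */
fault_action_t handle_fault(address_space_t *as, uint64_t va)
{
    assert(as != NULL);

    uint64_t page = va / PAGE_BYTES;
    if (page >= NUM_PAGES)
        return SEND_SIGNAL;                 /* outside anything reserved */

    switch (as->state[page]) {
    case VALID_NOT_ACCESSED:                /* reserved, never touched */
        as->state[page] = VALID_IN_CORE;
        return MAP_ZERO_PAGE;
    case VALID_OUT_OF_CORE:                 /* swapped out or file-backed */
        as->state[page] = VALID_IN_CORE;
        return LOAD_FROM_DISK;
    case VALID_IN_CORE:                     /* already resident */
        return NOTHING_TO_DO;
    default:                                /* INVALID: never reserved */
        return SEND_SIGNAL;
    }
}
```

The first fault on a reserved page yields `MAP_ZERO_PAGE` and marks it resident, a fault on a swapped-out page yields `LOAD_FROM_DISK`, and anything never reserved yields `SEND_SIGNAL`, which corresponds to the segfault or access violation case.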
For a multitude of reasons, the OS may later decide to remove one or several pages from your working set. This normally does not immediately remove them, but makes them candidates for being swapped (for non-mapped data) or discarded (for mapped data) in case more memory is needed. When you access an address in one of these pages again, it is either re-added to your working set (likely pushing another one out) or reloaded from disk.
In either case, all the OS needs to know is how to translate your virtual address to a page identifier of some sort (e.g. "page frame number"), and whether the page is resident (and at what address).
I think the answer to your question has to do with the interrupt table.
A page fault is a kind of interrupt (strictly speaking, a processor exception), and the operating system must have a handler for it. The handler code is already in the OS kernel, and its address is in the interrupt table. So when a page fault happens, execution jumps to that code, which brings the unmapped page into physical memory.
This is OS-specific, but many implementations share logic with memory-mapped file features (so that anonymous pages actually are memory-mapped views of the pagefile, flagged so that the content can be discarded at unmapping instead of flushed).
For Windows, much of this is documented on the CreateFileMapping page.

Resources