Continuing my endeavors in OS development research, I have constructed an almost complete picture in my head. One thing still eludes me.
Here is the basic boot process, from my understanding:
1) BIOS/Bootloader perform necessary checks, initialize everything.
2) The kernel is loaded into RAM.
3) Kernel performs its initializations and starts scheduling tasks.
4) When a task is loaded, it is given a virtual address space in which it resides. Including the .text, .data, .bss, the heap and stack. This task "maintains" its own stack pointer, pointing to its own "virtual" stack.
5) Context switches merely push the register file (all CPU registers), the stack pointer and program counter into some kernel data structure and load another set belonging to another process.
In this abstraction, the kernel is a "mother" process inside of which all other processes are hosted. I tried to convey my best understanding in the following diagram:
Question is, first is this simple model correct?
Second, how is the executable program made aware of its virtual stack? Is it the OS job to calculate the virtual stack pointer and place it in the relevant CPU register? Is the rest of the stack bookkeeping done by CPU pop and push commands?
Does the kernel itself have its own main stack and heap?
Thanks.
Question is, first is this simple model correct?
Your model is extremely simplified but essentially correct - note that the last two parts of your model aren't really considered to be part of the boot process, and the kernel isn't a process. It can be useful to visualize it as one, but it doesn't fit the definition of a process and it doesn't behave like one.
Second, how is the executable program made aware of its virtual stack?
Is it the OS job to calculate the virtual stack pointer and place it
in the relevant CPU register? Is the rest of the stack bookkeeping
done by CPU pop and push commands?
An executable C program doesn't have to be "aware of its virtual stack." When a C program is compiled into an executable, local variables are usually referenced in relative to the stack pointer - for example, [ebp - 4].
When Linux loads a new program for execution, it uses the start_thread macro (which is called from load_elf_binary) to initialize the CPU's registers. The macro contains the following line:
regs->esp = new_esp;
which will initialize the CPU's stack pointer register to the virtual address that the OS has assigned to the thread's stack.
As you said, once the stack pointer is loaded, assembly commands such as pop and push will change its value. The operating system is responsible for making sure that there are physical pages that correspond to the virtual stack addresses - in programs that use a lot of stack memory, the number of physical pages will grow as the program continues its execution. There is a limit for each process that you can find by using the ulimit -a command (on my machine the maximum stack size is 8MB, or 2KB pages).
Does the kernel itself have its own main stack and heap?
This is where visualizing the kernel as a process can become confusing. First of all, threads in Linux have a user stack and a kernel stack. They're essentially the same, differing only in protections and location (kernel stack is used when executing in Kernel Mode, and user stack when executing in User Mode).
The kernel itself does not have its own stack. Kernel code is always executed in the context of some thread, and each thread has its own fixed-size (usually 8KB) kernel stack. When a thread moves from User Mode to Kernel Mode, the CPU's stack pointer is updated accordingly. So when kernel code uses local variables, they are stored on the kernel stack of the thread in which they are executing.
During system startup, the start_kernel function initializes the kernel init thread, which will then create other kernel threads and begin initializing user programs. So after system startup the CPU's stack pointer will be initialized to point to init's kernel stack.
As far as the heap goes, you can dynamically allocate memory in the kernel using kmalloc, which will try to find a free page in memory - its internal implementation uses get_zeroed_page.
You forgot one important point: Virtual memory is enforced by hardware, typically known as the MMU (Memory Management Unit). It is the MMU that converts virtual addresses to physical addresses.
The kernel typically loads the address of the base of the page table for a specific process into a register in the MMU. This is what task-switches the virtual memory space from one process to another. On x86, this register is CR3.
Virtual memory protects processes' memory from each other. RAM for process A is simply not mapped into process B. (Except for e.g. shared libraries, where the same code memory is mapped into multiple processes, to save memory).
Virtual memory also protect kernel memory space from a user-mode process. Attributes on the pages covering kernel address space are set so that, when the processor is running in user-mode, it is not allowed to execute there.
Note that, while the kernel may have threads of its own, which run entirely in kernel space, the kernel shouldn't really be thought of a "a mother process" that runs independently of your user-mode programs. The kernel basically is "the other half" of your user-mode program! Whenever you issue a system call, the CPU automatically transitions into kernel mode, and starts executing at a pre-defined location, dictated by the kernel. The kernel system call handler then executes on your behalf, in the kernel-mode context of your process. Time spent in the kernel handling your request is accounted for, and "charged to" your process.
The helpful ways of thinking about kernel in context of relationships with processes and threads
Model provided by you is very simplified but correct in general.
In the same time the way of thinking about kernel as about "mother process" isn't best, but it still has some sense.
I would like to propose another two better models.
Try to think about kernel as about special kind of shared library.
Like a shared library kernel is shared between different processes.
System call is performed in a way which is conceptually similar to the routine call from shared library.
In both cases, after call, you execute of "foreign" code but in the context your native process.
And in both cases your code continues to perform computations based on stack.
Note also, that in both cases calls to "foreign" code lead to blocking of execution of your "native" code.
After return from the call, execution continues starting in the same point of code and with the same state of the stack from which call was performed.
But why we consider kernel as a "special" kind of shared library? Because:
a. Kernel is a "library" that is shared by every process in the system.
b. Kernel is a "library" that shares not only section of code, but also section of data.
c. Kernel is a specially protected "library". Your process can't access kernel code and data directly. Instead, it is forced to call kernel controlled manner via special "call gates".
d. In the case of system calls your application will execute on virtually continuous stack. But in reality this stack will be consist from two separated parts. One part is used in user mode and the second part will be logically attached to the top of your user mode stack during entering the kernel and deattached during exit.
Another useful way of thinking about organization of computations in your computer is consideration of it as a network of "virtual" computers which doesn't has support of virtual memory.
You can consider process as a virtual multiprocessor computer that executes only one program which has access to all memory.
In this model each "virtual" processor will be represented by thread of execution.
Like you can have a computer with multiple processors (or with multicore processor) you can have multiple oncurrently running threads in your process.
Like in your computer all processors have shared access to the pool of physical memory, all threads of your process share access to the same virtual address space.
And like separate computers are physically isolated from each other, your processes also isolated from each other but logically.
In this model kernel is represented by server having direct connections to each computer in the network with star topology.
Similarly to a networking servers, kernel has two main purposes:
a. Server assembles all computers in single network.
Similarly kernel provides a means of inter-process communication and synchronization. Kernel works as a man in the middle which mediates entire communication process (transfers data, routes messages and requests etc.).
b. Like server provides some set of services to each connected computer, kernel provides a set of services to the processes. For example, like a network file server allows computers read and write files located on shared storage, your kernel allows processes to do the same things but using local storage.
Note, that following the client-server communication paradigm, clients (processes) are the only active actors in the network. They issue request to the server and between each other. Server in its turn is a reactive part of the system and it never initiate communication. Instead it only replies to incoming requests.
This models reflect the resource sharing/isolation relationships between each part of the system and the client-server nature of communication between kernel and processes.
How stack management is performed, and what role plays kernel in that process
When the new process starts, kernel, using hints from executable image, decides where and how much of virtual address space will have reserved for the user mode stack of initial thread of the process.
Having this decision, kernel sets the initial values for the set of processor registers, which will be used by main thread of process just after start of the execution.
This setup includes setting of the initial value of stack pointer.
After actual start of process execution, process itself becomes responsible for stack pointer.
More interesting fact is that process is responsible for initialization of stack pointers of each new thread created by it.
But note that kernel kernel is responsible for allocation and management of kernel mode stack for each and every thread in the system.
Note also that kernel is resposible for physical memory allocation for the stack and usually perform this job lazily on demand using page faults as hints.
Stack pointer of running thread is managed by thread itself. In most cases stack pointer management is performed by compiler, when it builds executable image. Compiler usually tracks stack pointer value and maintain it's consistency by adding and tracking all instructions that relates to the stack.
Such instructions not limited only by "push" and "pop". There are many CPU instructions which affects the stack, for example "call" and "ret", "sub ESP" and "add ESP", etc.
So as you can see, actual policy of stack pointer management is mostly static and known before process execution.
Sometimes programs have a special part of the logic that performs special stack management.
For example implementations of coroutines or long jumps in C.
In fact, you are allowed to do whatever you want with the stack pointer in your program if you want.
Kernel stack architectures
I'm aware about three approaches to this issue:
Separate kernel stack per thread in the system. This is an approach adopted by most well-known OSes based on monolithic kernel including Windows, Linux, Unix, MacOS.
While this approach leads to the significant overhead in terms of memory and worsens cache utilization, but it improves preemption of the kernel, which is critical for the monolithic kernels with long-running system calls especially in the multi-processor environment.
Actually, long time ago Linux had only one shared kernel stack and entire kernel was covered by Big Kernel Lock that limits the number of threads, which can concurrently perform system call, by only one thread.
But linux kernel developers has quickly recognized that blocking execution of one process which wants to know for instance its PID, because another process already have started send of a big packet through very slow network is completely inefficient.
One shared kernel stack.
Tradeoff is very different for microkernels.
Small kernel with short system calls allows microkernel designers to stick to the design with single kernel stack.
In the presence of proof that all system calls are extremely short, they can benefit from improved cache utilization and smaller memory overhead, but still keep system responsiveness on the good level.
Kernel stack for each processor in the system.
One shared kernel stack even in microkernel OSes seriously affects scalability of the entire operating system in multiprocessor environment.
Due to this, designers frequently follow approach which is looks like compromise between two approaches described above, and keep one kernel stack per each processor (processor core) in the system.
In that case they benefit from good cache utilization and small memory overhead, which are much better than in the stack per thread approach and slightly worser than in single shared stack approach.
And in the same time they benefit from the good scalability and responsiveness of the system.
Thanks.
Related
when a program started the OS will create a virtual memory, which divided into stack, heap, data, text to run a process on it.I know that each segment is used for specification purpose such as text saves the binary code of program, data saves static and global variable. My question is why the OS need to create the virtual memory and divide it into the segments ? How about if OS just use the physical memory and the process run directly on the physical memory. I think maybe the answer is related with running many process at the same time, sharing memory between process but i am not sure. It is kind if you give me an example about the benefit of creating virtual memory and dividing it into the segments.
In an environment with memory protection via a memory mapping unit, all memory is virtual (mapped via the MMU). It's possible to simply map each virtual address linearly to physical addresses, while still using the protection capabilities of the MMU, but doing that makes no sense. There are many reasons to prefer that memory not be directly mappped, such as being able to share program instructions and shared library code between instances of the same program or different programs, being able to fork, etc.
Can a process access all of the RAM or does the CPU give the process a specific part which the kernel decides, and the process (running in user space) can't change? In other words - is a process sandboxed by hardware, or can it do anything, but is monitored by the OS?
EDIT
I'm told in the comments that this is too broad, so let's assume x86/x64. I'll also add that the question arose while reading what I understood to say that processes can access all RAM - which seems to conflict with what I've read about security in OSs.
If you count MS-DOS as an "operating system", then processes can do anything (and aren't monitored). Even Windows95 doesn't have real memory protection, and a buggy process can crash the machine by scribbling over the wrong memory.
If you only count modern OSes with privilege separation (Unix/Linux, Windows NT and derivates), then processes are sandboxed.
AFAIK, there aren't really systems where there's monitoring of any kind other than "fault if you try to do something". The kernel sets the boundaries, and the user-space process gets a fault if it tries to go outside them.
If you're imagining that maybe the kernel looks at what an unprivileged process does, and adapts accordingly, then no, that's not what happens.
See
https://en.wikipedia.org/wiki/Memory_protection: Usually achieved by giving each process its own virtual address space (virtual memory). This is hardware-supported: every address your code uses is translated to a physical address by a fast translation cache (TLB), which caches the translation tables set up by the OS (aka page tables).
A process can't directly modify its own page tables: it has to ask the kernel to map more physical memory into its address space (e.g. as part of malloc()). So the kernel has a chance to verify that the request is ok before doing it.
Also, a process can ask the kernel to copy data to/from files (or other things) into its memory space. (write/read system calls).
https://en.wikipedia.org/wiki/User_space: normal processes run in user-mode, which is a mode provided by the hardware where privileged instructions will trap to the kernel.
This is purely academical question not related to any OS
We have x86 CPU and operating memory, this memory resembles some memory pool, that consist of addressable memory units that can be read or written to, using their address by MOV instruction of CPU (we can move memory from / to this memory pool).
Given that our program is the kernel, we have a full access to whole this memory pool. However if our program is not running directly on hardware, the kernel creates some "virtual" memory pool which lies somewhere inside the physical memory pool, our process consider it just as the physical memory pool and can write to it, read from it, or change its size usually by calling something like sbrk or brk (on Linux).
My question is, how is this virtual pool implemented? I know I can read whole linux source code and maybe one year I find it, but I can also ask here :)
I suppose that one of these 3 potential solutions is being used:
Interpret the instructions of program (very ineffective and unlikely): the kernel would just read the byte code of program and interpret each instruction individually, eg. if it saw a request to access memory the process isn't allowed to access it wouldn't let it.
Create some OS level API that would need to be used in order to read / write to memory and disallow access to raw memory, which is probably just as ineffective.
Hardware feature (probably best, but have no idea how that works): the kernel would say "dear CPU, now I will send you instructions from some unprivileged process, please restrict your instructions to memory area 0x00ABC023 - 0xDEADBEEF" the CPU wouldn't let the user process do anything wrong with the memory, except for that range approved by kernel.
The reason why am I asking, is to understand if there is any overhead in running program unprivileged behind the kernel (let's not consider overhead caused by multithreading implemented by kernel itself) or while running program natively on CPU (with no OS), as well as overhead in memory access caused by computer virtualization which probably uses similar technique.
You're on the right track when you mention a hardware feature. This is a feature known as protected mode and was introduced to x86 by Intel on the 80286 model. That evolved and changed over time, and currently x86 has 4 modes.
Processors start running in real mode and later a privileged software (ring0, your kernel for example) can switch between these modes.
The virtual addressing is implemented and enforced using the paging mechanism (How does x86 paging work?) supported by the processor.
On a normal system, memory protection is enforced at the MMU, or memory management unit, which is a hardware block that configurably maps virtual to physical addresses. Only the kernel is allowed to directly configure it, and operations which are illegal or go to unmapped pages raise exceptions to the kernel, which can then discipline the offending process or fetch the missing page from disk as appropriate.
A virtual machine typically uses CPU hardware features to trap and emulate privileged operations or those which would too literally interact with hardware state, while allowing ordinary operations to run directly and thus with moderate overall speed penalty. If those are unavailable, the whole thing must be emulated, which is indeed slow.
How can I make my pthreads execute a function each time they are rescheduled by the kernel?
I need to identify on which physical CPU/socket (not logical core) my thread is being scheduled at and cannot afford to do this all the time.
Can the wakeup routine be hooked somehow to make the necessary updates to TLS only when the thread is actually being rescheduled?
As to why I need this: I have code which executes AMOs appx every 70ns per thread which is fine if the address is not cached on another socket, deploying the same code on two sockets gives a 15 times performance impact because of frequent cache invalidations. I intend to allocate memory especially for this which is only shared among threads running the same L3 cache. So I need to identify on which socket I am running and address the correct memory block. I could obviously call sched_getcpu and compare this to the physical CPU ID in /proc/cpuinfo, but this is a rather big overhead. I cannot afford to allocate thread-private memory for each thread though, too expensive.
From what I have read in Linux Kernel Development, Third Edition, there is no service nor interface, provided by the kernel, for what you want. Using pthread_setaffinity (as suggested above by #osgx, or, in more recent linux kernel implementations, pthread_setaffinity_np) or caching a TLS key per cpu socket in the beginning (as suggested above by #caf) are perhaps the best methods to use in that direction.
my question:
1)In book modern operating system, it says the threads and processes can be in kernel mode or user mode, but it does not say clearly what's the difference between them .
2)Why the switch for the kernel-mode threads and process costs more than the switch for user-mode threads and process?
3) now, I am learning Linux,I want to know how would I create threads and processes in Kernel mode and user mode respectively IN LINUX SYSTEM?
4)In book modern operating system, it says that it is possible that process would be in user- mode, but the threads which are created in the user-mode process can be in kernel mode. How would this be possible?
There are some terminology problems due more to historical accident than anything else here.
"Thread" usually refers to thread-of-control within a process, and may (does in this case) mean "a task with its own stack, but which shares access to everything not on that stack with other threads in the same protection domain".
"Process" tends to refer to a self-contained "protection domain" which may (and does in this case) have the ability to have multiple threads within it. Given two processes P1 and P2, the only way for P1 to affect P2 (or vice versa) is through some particular defined "communications channel" such as a file, pipe, or socket; via "inter-process" signals like Unix/Linux signals; and so on.
Since threads don't have this kind of barrier between each other, one thread can easily interfere with (corrupt the data used by) another thread.
All of this is independent of user vs kernel, with one exception: in "the kernel"—note that there is an implicit assumption here that there is just one kernel—you have access to the entire machine state at all times, and full privileges to do anything. Hence you can deliberately (or in some cases accidentally) disregard or turn off hardware protection and mess with data "belonging to" someone else.
That mostly covers several possibly-confused items in Q1. As for Q2, the answer to the question as asked is "it doesn't". In general, because threads do not involve (as much) protection, it's cheaper to switch from one thread to another: you do not have to tell the hardware (in whatever fashion) that it should no longer allow various kinds of access, since threads T1 and T2 have "the same" access. Switching between processes, however, as with P1 and P2, you "cross a protection barrier", which has some penalty (the actual penalty varies widely with hardware, and to some extent the skills of the OS writers).
It's also worth noting that crossing from user to kernel mode, and vice versa, is also crossing a protection domain, which again has some kind of cost.
In Linux, there are a number of ways for user processes to create what amount to threads, including both "POSIX threads" (pthreads) and the clone call (details for clone, which is extremely flexible, are beyond the scope of this answer). If you want to write portable code, you should probably stick with pthreads.
Within the Linux kernel, threads are done completely differently, and you will need Linux kernel documentation.
I can't properly answer Q4 since I don't have the book and am not sure what they are referring to here. My guess is that they mean that whenever any user process-or-thread makes a "system call" (requests some service from the OS), this crosses that user/kernel protection barrier, and it is then up to the kernel to verify that the user code has appropriate privileges for that operation, and then to do that operation. The part of the kernel that does this is running with kernel-level protections and thus needs to be more careful.
Some hardware (mostly obsolete these days) has (or had) more than just two levels of hardware-provided protection. On these systems, "user processes" had the least direct privilege, but above those you would find "executive mode", "system mode", and (most privileged) "kernel" or "nucleus" mode. These were intended to lower the cost of crossing the various protection barriers. Code running in "executive" did not have full access to everything in the machine, so it could, for instance, just assume that a user-provided address was valid, and try to use it. If that address was in fact invalid, the exception would rise to the next higher level. With only two levels—"user", unprivileged; and "kernel", completely-privileged—kernel code must be written very carefully. However, it's possible to provide "virtual machines" at low cost these days, which pretty much obsoletes the need for multiple hardware levels of protection. One simply writes a true kernel, then lets it run other things in what they "think" is "kernel mode". This is what VMware and other "hypervisor" systems do.
User-mode threads are scheduled in user mode by something in the process, and the process itself is the only thing handled by the kernel scheduler.
That means your process gets a certain amount of grunt from the CPU and you have to share it amongst all your user mode threads.
Simple case, you have two processes, one with a single thread and one with a hundred threads.
With a simplistic kernel scheduling policy, the thread in the single-thread process gets 50% of the CPU and each thread in the hundred-thread process gets 0.5% each.
With kernel mode threads, the kernel itself manages your threads and schedules them independently. Using the same simplistic scheduler, each thread would get just a touch under 1% of the CPU grunt (101 threads to share the 100% of CPU).
In terms of why kernel mode switching is more expensive, it probably has to do with the fact that you need to switch to kernel mode to do it. User mode threads do all their stuff in user mode (obviously) so there's no involving the kernel in a thread switch operation.
Under Linux, you create threads (and processes) with the clone call, similar to fork but with much finer control over things.
Your final point is a little obtuse. I can't be certain but it's probably talking about user and kernel mode in the sense that one could be executing user code and another could be doing some system call in the kernel (which requires switching to kernel or supervisor mode).
That's not the same as the distinction when talking about the threading support (user or kernel mode support for threading). Without having a copy of the book to hand, I couldn't say definitively, but that'd be my best guess.