How does kernel restrict processes to their own memory pool? - c

This is a purely academic question, not related to any particular OS.
We have an x86 CPU and operating memory. This memory resembles a memory pool that consists of addressable units that can be read or written by address using the CPU's MOV instruction (we can move data from and to this memory pool).
Given that our program is the kernel, we have full access to this whole memory pool. However, if our program is not running directly on the hardware, the kernel creates some "virtual" memory pool which lies somewhere inside the physical memory pool. Our process treats it just like the physical memory pool and can write to it, read from it, or change its size, usually by calling something like sbrk or brk (on Linux).
My question is, how is this virtual pool implemented? I know I could read the whole Linux source code and maybe after a year I would find it, but I can also ask here :)
I suppose that one of these 3 potential solutions is being used:
1) Interpret the instructions of the program (very inefficient and unlikely): the kernel would just read the byte code of the program and interpret each instruction individually, e.g. if it saw a request to access memory the process isn't allowed to access, it wouldn't let it.
2) Create some OS-level API that would have to be used in order to read / write memory, and disallow access to raw memory, which is probably just as inefficient.
3) Hardware feature (probably best, but I have no idea how that works): the kernel would say "dear CPU, now I will send you instructions from some unprivileged process, please restrict your instructions to memory area 0x00ABC023 - 0xDEADBEEF", and the CPU wouldn't let the user process touch any memory outside the range approved by the kernel.
The reason I am asking is to understand whether there is any overhead in running a program unprivileged behind the kernel (let's not consider the overhead caused by multithreading implemented by the kernel itself) compared with running it natively on the CPU (with no OS), as well as the overhead in memory access caused by computer virtualization, which probably uses a similar technique.

You're on the right track when you mention a hardware feature. It is known as protected mode, and Intel introduced it to x86 with the 80286. It has evolved and changed over time, and x86 currently has 4 modes.
Processors start running in real mode, and privileged software (ring 0, your kernel for example) can later switch between these modes.
The virtual addressing is implemented and enforced using the paging mechanism (How does x86 paging work?) supported by the processor.
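To make the paging idea concrete, here is a minimal software model of how classic 32-bit x86 paging (4 KiB pages, no PAE) splits a virtual address into a directory index, a table index and an offset. The bit layout is the architectural one; the helper names and the in-memory directory of C pointers are assumptions made for illustration (real directory entries hold physical addresses, and the walk is done by hardware).

```c
#include <stddef.h>
#include <stdint.h>

/* Classic 32-bit x86 paging without PAE: 10 + 10 + 12 address bits. */
#define PAGE_SHIFT   12u
#define INDEX_BITS   10u
#define INDEX_MASK   ((1u << INDEX_BITS) - 1u)   /* 0x3FF */
#define OFFSET_MASK  ((1u << PAGE_SHIFT) - 1u)   /* 0xFFF */
#define PTE_PRESENT  0x1u                        /* bit 0 of an entry */

/* Hypothetical helper names, not taken from any kernel. */
static uint32_t dir_index(uint32_t vaddr)   { return vaddr >> (PAGE_SHIFT + INDEX_BITS); }
static uint32_t table_index(uint32_t vaddr) { return (vaddr >> PAGE_SHIFT) & INDEX_MASK; }
static uint32_t page_offset(uint32_t vaddr) { return vaddr & OFFSET_MASK; }

/*
 * Software model of the two-level walk. `dir` stands in for the page
 * directory; each non-NULL slot points to a table of 1024 entries.
 * Returns 0 and stores the physical address on success, or -1 where
 * real hardware would raise a page fault.
 */
static int translate(const uint32_t *const *dir, uint32_t vaddr, uint32_t *paddr)
{
    const uint32_t *table = dir[dir_index(vaddr)];
    if (table == NULL)
        return -1;

    uint32_t pte = table[table_index(vaddr)];
    if (!(pte & PTE_PRESENT))
        return -1;

    /* The upper 20 bits of the entry are the frame address. */
    *paddr = (pte & ~OFFSET_MASK) | page_offset(vaddr);
    return 0;
}
```

For the address 0x08049ABC this gives directory index 0x20, table index 0x49 and offset 0xABC; only the offset survives translation unchanged.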

On a normal system, memory protection is enforced at the MMU, or memory management unit, which is a hardware block that configurably maps virtual to physical addresses. Only the kernel is allowed to directly configure it, and operations which are illegal or go to unmapped pages raise exceptions to the kernel, which can then discipline the offending process or fetch the missing page from disk as appropriate.
A virtual machine typically uses CPU hardware features to trap and emulate privileged operations or those which would too literally interact with hardware state, while allowing ordinary operations to run directly and thus with moderate overall speed penalty. If those are unavailable, the whole thing must be emulated, which is indeed slow.

Related

Are processes "sandboxed" by hardware?

Can a process access all of the RAM or does the CPU give the process a specific part which the kernel decides, and the process (running in user space) can't change? In other words - is a process sandboxed by hardware, or can it do anything, but is monitored by the OS?
EDIT
I'm told in the comments that this is too broad, so let's assume x86/x64. I'll also add that the question arose while reading what I understood to say that processes can access all RAM - which seems to conflict with what I've read about security in OSs.
If you count MS-DOS as an "operating system", then processes can do anything (and aren't monitored). Even Windows95 doesn't have real memory protection, and a buggy process can crash the machine by scribbling over the wrong memory.
If you only count modern OSes with privilege separation (Unix/Linux, Windows NT and derivatives), then processes are sandboxed.
AFAIK, there aren't really systems where there's monitoring of any kind other than "fault if you try to do something". The kernel sets the boundaries, and the user-space process gets a fault if it tries to go outside them.
If you're imagining that maybe the kernel looks at what an unprivileged process does, and adapts accordingly, then no, that's not what happens.
See
https://en.wikipedia.org/wiki/Memory_protection: Usually achieved by giving each process its own virtual address space (virtual memory). This is hardware-supported: every address your code uses is translated to a physical address by a fast translation cache (TLB), which caches the translation tables set up by the OS (aka page tables).
A process can't directly modify its own page tables: it has to ask the kernel to map more physical memory into its address space (e.g. as part of malloc()). So the kernel has a chance to verify that the request is ok before doing it.
Also, a process can ask the kernel to copy data to/from files (or other things) into its memory space. (write/read system calls).
https://en.wikipedia.org/wiki/User_space: normal processes run in user-mode, which is a mode provided by the hardware where privileged instructions will trap to the kernel.

Operating system kernel and processes in main memory

Continuing my endeavors in OS development research, I have constructed an almost complete picture in my head. One thing still eludes me.
Here is the basic boot process, from my understanding:
1) BIOS/Bootloader perform necessary checks, initialize everything.
2) The kernel is loaded into RAM.
3) Kernel performs its initializations and starts scheduling tasks.
4) When a task is loaded, it is given a virtual address space in which it resides, including the .text, .data and .bss sections, the heap and the stack. This task "maintains" its own stack pointer, pointing to its own "virtual" stack.
5) Context switches merely push the register file (all CPU registers), the stack pointer and program counter into some kernel data structure and load another set belonging to another process.
In this abstraction, the kernel is a "mother" process inside of which all other processes are hosted. I tried to convey my best understanding in the following diagram:
Question is, first is this simple model correct?
Second, how is the executable program made aware of its virtual stack? Is it the OS job to calculate the virtual stack pointer and place it in the relevant CPU register? Is the rest of the stack bookkeeping done by CPU pop and push commands?
Does the kernel itself have its own main stack and heap?
Thanks.
Question is, first is this simple model correct?
Your model is extremely simplified but essentially correct - note that the last two parts of your model aren't really considered to be part of the boot process, and the kernel isn't a process. It can be useful to visualize it as one, but it doesn't fit the definition of a process and it doesn't behave like one.
Second, how is the executable program made aware of its virtual stack?
Is it the OS job to calculate the virtual stack pointer and place it
in the relevant CPU register? Is the rest of the stack bookkeeping
done by CPU pop and push commands?
An executable C program doesn't have to be "aware of its virtual stack." When a C program is compiled into an executable, local variables are usually referenced relative to the stack pointer or frame pointer - for example, [ebp - 4], where ebp is the frame pointer.
When Linux loads a new program for execution, it uses the start_thread macro (which is called from load_elf_binary) to initialize the CPU's registers. The macro contains the following line:
regs->esp = new_esp;
which will initialize the CPU's stack pointer register to the virtual address that the OS has assigned to the thread's stack.
As you said, once the stack pointer is loaded, instructions such as pop and push will change its value. The operating system is responsible for making sure that there are physical pages that correspond to the virtual stack addresses - in programs that use a lot of stack memory, the number of physical pages will grow as the program continues its execution. There is a limit for each process that you can find by using the ulimit -a command (on my machine the maximum stack size is 8 MB, i.e. 2K pages of 4 KB each).
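The same limit can be read from inside a program. The sketch below uses the POSIX getrlimit call with RLIMIT_STACK; the two function names are made up for the example, and the page arithmetic simply rounds up.

```c
#include <sys/resource.h>

/* Soft stack limit in bytes, or -1 if it is unlimited or unavailable. */
long long stack_limit_bytes(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0 || rl.rlim_cur == RLIM_INFINITY)
        return -1;
    return (long long)rl.rlim_cur;
}

/* Number of pages needed to back `bytes` of stack, rounded up. */
long long pages_for(long long bytes, long long page_size)
{
    return (bytes + page_size - 1) / page_size;
}
```

With an 8 MB limit and 4 KB pages, pages_for(8LL * 1024 * 1024, 4096) is 2048. The real page size can be obtained with sysconf(_SC_PAGESIZE).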
Does the kernel itself have its own main stack and heap?
This is where visualizing the kernel as a process can become confusing. First of all, threads in Linux have a user stack and a kernel stack. They're essentially the same, differing only in protections and location (kernel stack is used when executing in Kernel Mode, and user stack when executing in User Mode).
The kernel itself does not have its own stack. Kernel code is always executed in the context of some thread, and each thread has its own fixed-size (usually 8KB) kernel stack. When a thread moves from User Mode to Kernel Mode, the CPU's stack pointer is updated accordingly. So when kernel code uses local variables, they are stored on the kernel stack of the thread in which they are executing.
During system startup, the start_kernel function initializes the kernel init thread, which will then create other kernel threads and begin initializing user programs. So after system startup the CPU's stack pointer will be initialized to point to init's kernel stack.
As far as the heap goes, you can dynamically allocate memory in the kernel using kmalloc, which will try to find a free page in memory - its internal implementation uses get_zeroed_page.
You forgot one important point: Virtual memory is enforced by hardware, typically known as the MMU (Memory Management Unit). It is the MMU that converts virtual addresses to physical addresses.
The kernel typically loads the address of the base of the page table for a specific process into a register in the MMU. This is what task-switches the virtual memory space from one process to another. On x86, this register is CR3.
Virtual memory protects processes' memory from each other. RAM for process A is simply not mapped into process B. (Except for e.g. shared libraries, where the same code memory is mapped into multiple processes, to save memory).
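A toy model may help here. In the sketch below a global pointer stands in for CR3 and a flat array stands in for the page table; both are simplifications invented for the example (a real x86 table is multi-level and is walked by hardware). The point is only that the same virtual address resolves differently depending on which table is loaded.

```c
#include <stdint.h>

#define NPAGES      16u
#define PAGE_SHIFT  12u
#define NOT_MAPPED  UINT32_MAX

/* Toy single-level table: virtual page number -> physical frame number. */
struct address_space {
    uint32_t frame[NPAGES];
};

/* Stand-in for CR3: the table the simulated MMU is currently using. */
static const struct address_space *current_cr3;

/* What the kernel does on a context switch, reduced to one assignment. */
static void switch_address_space(const struct address_space *next)
{
    current_cr3 = next;
}

/* Translate through whichever table is loaded; NOT_MAPPED models a fault. */
static uint32_t mmu_translate(uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    if (vpn >= NPAGES || current_cr3->frame[vpn] == NOT_MAPPED)
        return NOT_MAPPED;
    return (current_cr3->frame[vpn] << PAGE_SHIFT)
         | (vaddr & ((1u << PAGE_SHIFT) - 1u));
}
```

If process A maps page 2 to frame 7 and process B maps page 2 to frame 9, address 0x2010 means 0x7010 while A runs and 0x9010 while B runs. Frame 7 has no address at all in B unless the kernel chooses to map it there, which is exactly what happens for shared libraries.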
Virtual memory also protects the kernel's memory space from user-mode processes. Attributes on the pages covering the kernel address space are set so that, when the processor is running in user mode, it is not allowed to access them: any read, write, or instruction fetch there raises a fault.
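The attribute check can be sketched in a few lines. The three flag positions below are the real low bits of an x86 page table entry (present, read/write, user/supervisor); the function itself is a simplified model that assumes CR0.WP is set and ignores NX, SMEP and SMAP.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT  (1u << 0)   /* page is mapped */
#define PTE_WRITABLE (1u << 1)   /* writes allowed */
#define PTE_USER     (1u << 2)   /* user-mode access allowed */

/*
 * Simplified model of the check the MMU applies on every access.
 * Assumption: CR0.WP = 1, so read-only pages are read-only for the
 * kernel as well. Returning false models a page fault.
 */
static bool access_allowed(uint32_t pte, bool user_mode, bool is_write)
{
    if (!(pte & PTE_PRESENT))
        return false;
    if (user_mode && !(pte & PTE_USER))
        return false;               /* supervisor-only page */
    if (is_write && !(pte & PTE_WRITABLE))
        return false;
    return true;
}
```

Kernel pages are mapped without PTE_USER, so the same entry that works for the kernel fails the second test as soon as the CPU is in user mode.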
Note that, while the kernel may have threads of its own, which run entirely in kernel space, the kernel shouldn't really be thought of as a "mother process" that runs independently of your user-mode programs. The kernel basically is "the other half" of your user-mode program! Whenever you issue a system call, the CPU automatically transitions into kernel mode and starts executing at a predefined location dictated by the kernel. The kernel system call handler then executes on your behalf, in the kernel-mode context of your process. Time spent in the kernel handling your request is accounted for and "charged to" your process.
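For a concrete feel of that transition, here is a tiny sketch that enters the kernel explicitly. It assumes Linux with glibc, where syscall() and SYS_getpid are available; the wrapper name is invented for the example.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Enter the kernel through the generic syscall() wrapper. The kernel
 * handler runs in the context of the calling process, which is why it
 * can answer "what is my PID?" without being told who is asking.
 */
pid_t pid_via_raw_syscall(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

The result matches what the ordinary getpid() library call returns, since that call ends up making the same request.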
Helpful ways of thinking about the kernel in relation to processes and threads
The model you provided is very simplified but correct in general.
At the same time, thinking about the kernel as a "mother process" isn't the best approach, though it still makes some sense.
I would like to propose two better models.
Try to think about the kernel as a special kind of shared library.
Like a shared library, the kernel is shared between different processes.
A system call is performed in a way that is conceptually similar to calling a routine from a shared library.
In both cases, after the call, you execute "foreign" code, but in the context of your own process.
And in both cases your code continues to perform computations based on the stack.
Note also that in both cases calls to "foreign" code block the execution of your "native" code.
After returning from the call, execution continues from the same point in the code and with the same stack state from which the call was made.
But why do we consider the kernel a "special" kind of shared library? Because:
a. The kernel is a "library" that is shared by every process in the system.
b. The kernel is a "library" that shares not only its code section, but also its data section.
c. The kernel is a specially protected "library". Your process can't access kernel code and data directly. Instead, it is forced to call the kernel in a controlled manner via special "call gates".
d. In the case of system calls, your application executes on a virtually continuous stack. In reality, this stack consists of two separate parts. One part is used in user mode, and the second part is logically attached to the top of your user-mode stack on entering the kernel and detached on exit.
Another useful way of thinking about how computation is organized in your computer is to consider it a network of "virtual" computers that have no virtual memory support.
You can consider a process to be a virtual multiprocessor computer that executes only one program, which has access to all of its memory.
In this model each "virtual" processor is represented by a thread of execution.
Just as you can have a computer with multiple processors (or with a multicore processor), you can have multiple concurrently running threads in your process.
Just as all processors in your computer share access to the pool of physical memory, all threads of your process share access to the same virtual address space.
And just as separate computers are physically isolated from each other, your processes are also isolated from each other, but logically.
In this model the kernel is represented by a server with a direct connection to each computer in a network with a star topology.
Similarly to a network server, the kernel has two main purposes:
a. A server assembles all computers into a single network.
Similarly, the kernel provides a means of inter-process communication and synchronization. The kernel works as a man in the middle that mediates the entire communication process (transfers data, routes messages and requests, etc.).
b. Just as a server provides a set of services to each connected computer, the kernel provides a set of services to the processes. For example, just as a network file server allows computers to read and write files located on shared storage, your kernel allows processes to do the same things using local storage.
Note that, following the client-server communication paradigm, clients (processes) are the only active actors in the network. They issue requests to the server and to each other. The server, in turn, is the reactive part of the system and never initiates communication. Instead, it only replies to incoming requests.
These models reflect the resource sharing/isolation relationships between the parts of the system and the client-server nature of communication between the kernel and processes.
How stack management is performed, and what role the kernel plays in that process
When a new process starts, the kernel, using hints from the executable image, decides where and how much virtual address space will be reserved for the user-mode stack of the initial thread of the process.
Having made this decision, the kernel sets the initial values of the processor registers that the main thread of the process will use right after execution starts.
This setup includes setting the initial value of the stack pointer.
Once the process actually starts executing, the process itself becomes responsible for the stack pointer.
A more interesting fact is that the process is responsible for initializing the stack pointer of each new thread it creates.
But note that the kernel is responsible for allocating and managing the kernel-mode stack for each and every thread in the system.
Note also that the kernel is responsible for allocating physical memory for the stack, and it usually does this job lazily, on demand, using page faults as hints.
The stack pointer of a running thread is managed by the thread itself. In most cases stack pointer management is done by the compiler when it builds the executable image. The compiler usually tracks the stack pointer value and maintains its consistency by emitting and tracking all instructions that relate to the stack.
Such instructions are not limited to "push" and "pop". There are many CPU instructions that affect the stack, for example "call" and "ret", "sub ESP" and "add ESP", etc.
So as you can see, the actual policy of stack pointer management is mostly static and known before process execution.
Sometimes programs have a special part of the logic that performs special stack management.
For example implementations of coroutines or long jumps in C.
In fact, you are allowed to do whatever you want with the stack pointer in your program if you want.
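The long jump case is easy to show with the standard setjmp/longjmp pair. The sketch below is only an illustration (function names are invented): longjmp restores the saved stack pointer and registers in one step, discarding every frame in between, with no involvement from the kernel.

```c
#include <setjmp.h>

static jmp_buf recovery_point;

/* Recurse a few frames deep, then abandon all of them at once. */
static void deep(int depth)
{
    if (depth == 0)
        longjmp(recovery_point, 42);   /* resets the stack pointer */
    deep(depth - 1);
}

/* Returns the value passed to longjmp, or a negative number otherwise. */
int run_with_long_jump(void)
{
    switch (setjmp(recovery_point)) {
    case 0:
        deep(5);        /* never returns normally */
        return -1;
    case 42:
        return 42;      /* we came back through longjmp */
    default:
        return -2;
    }
}
```

Calling run_with_long_jump() yields 42: control re-emerges from setjmp a second time, with the stack as it was when the jump buffer was filled in.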
Kernel stack architectures
I'm aware of three approaches to this issue:
1) Separate kernel stack per thread in the system. This is the approach adopted by most well-known OSes based on a monolithic kernel, including Windows, Linux, Unix, and macOS.
While this approach leads to significant memory overhead and worsens cache utilization, it improves preemption of the kernel, which is critical for monolithic kernels with long-running system calls, especially in a multiprocessor environment.
Actually, a long time ago Linux had only one shared kernel stack, and the entire kernel was covered by the Big Kernel Lock, which limited the number of threads that could concurrently perform a system call to just one.
But the Linux kernel developers quickly recognized that blocking one process that wants to know, for instance, its PID, because another process has already started sending a big packet through a very slow network, is completely inefficient.
2) One shared kernel stack.
The tradeoff is very different for microkernels.
A small kernel with short system calls allows microkernel designers to stick to a design with a single kernel stack.
Given proof that all system calls are extremely short, they can benefit from improved cache utilization and smaller memory overhead while still keeping system responsiveness at a good level.
3) Kernel stack for each processor in the system.
One shared kernel stack, even in microkernel OSes, seriously affects the scalability of the entire operating system in a multiprocessor environment.
Because of this, designers frequently follow an approach that looks like a compromise between the two approaches described above, and keep one kernel stack per processor (processor core) in the system.
In that case they benefit from good cache utilization and small memory overhead, which are much better than in the stack-per-thread approach and slightly worse than in the single shared stack approach.
At the same time they benefit from good scalability and responsiveness of the system.
Thanks.

When running a program in Windows, what dictates the allowable memory for that program?

If I were to write a program in C and run it in Windows, is there something in the Win API that dictates whether or not a certain block of memory can be accessed by the program? If I want to be able to have the program access any block of memory that I want, is there something I have to disable? I realize that this is risky and can result in damaging the operating system.
In modern Windows (Windows with the NT kernel) the operating system controls the way memory is accessed. So the answer is no: there is nothing you can do about it. You won't be able to get your process to access ANY block of memory you want.
You could have done it in Win 3.0, Win 3.11, Win 95, Win 98, Win ME.
Yes, that's possible with VirtualAlloc(), the low level function that allocates virtual memory pages. The flProtect argument specifies how the memory can be accessed by the process, specifying PAGE_NOACCESS is possible, albeit that it is not exactly used very often.
If you are actually talking about RAM then no, a user mode program never has direct access to physical memory on a protected mode operating system like Windows. It can only ever address virtual memory, the mapping to RAM is performed by the OS kernel. Only code that runs in ring 0 has the capability. Denying access to certain physical addresses only makes sense for a memory-mapped I/O device. Which would already have a driver that reserves the address space.
You cannot, will not, and must not access kernel memory. Modern operating systems don't allow code to allocate from those memory regions unless it runs in kernel mode.

How shared memory would be accessed in manycore systems

In multicore systems, such as 2, 4, 8 cores, we typically use mutexes and semaphores to access shared memory. However, I can foresee that these methods would induce a high overhead for future systems with many cores. Are there any alternative methods that would be better for accessing shared memory on future many-core systems?
Transactional memory is one such method.
I'm not sure how far in the future you want to go. But in the long-long run, shared memory as we know it right now (single address space accessible by any core) is not scalable. So the programming model will have to change at some point and make the lives of programmers harder as it did when we went to multi-core.
But for now (perhaps for another 10 years) you can get away with transactional memory and other hardware/software tricks.
The reason I say shared-memory is not scalable in the long run is simply due to physics. (similar to how single-core/high-frequency hit a barrier)
In short, transistors can't shrink to less than the size of an atom (barring new technology), and signals can't propagate faster than the speed of light. Therefore, memory will get slower and slower (with respect to the processor) and at some point, it becomes infeasible to share memory.
We can already see this effect right now with NUMA on the multi-socket systems. Large-scale supercomputers are neither shared-memory nor cache-coherent.
1) Lock only the part of memory you are accessing, not the entire table! This is done with the help of a big hash table of locks. The bigger the table, the finer-grained the locking is.
2) If you can, only lock on writing, not on reading (this requires that there is no problem in reading the "previous value" while it is being updated, which is very often a valid case).
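Point 1 is often called lock striping. Here is a minimal sketch of it using POSIX mutexes; the table size, the number of stripes and all function names are assumptions chosen for the example, and the "hash" is just a modulo. Unlike point 2, the sketch also takes the lock for reads, because an unsynchronized read of a plain variable that another thread is writing is a data race in C; a pthread_rwlock_t per stripe would be one way to let readers proceed in parallel.

```c
#include <pthread.h>
#include <stddef.h>

#define NSTRIPES 64u     /* number of locks; more stripes = finer locking */
#define NSLOTS   1024u   /* size of the protected table */

static pthread_mutex_t stripes[NSTRIPES];
static long table[NSLOTS];

/* Must be called once before any other function here. */
void striped_init(void)
{
    for (size_t i = 0; i < NSTRIPES; i++)
        pthread_mutex_init(&stripes[i], NULL);
}

/* Hash a slot index to the lock that guards it. */
static pthread_mutex_t *stripe_for(size_t slot)
{
    return &stripes[slot % NSTRIPES];
}

/* Update one slot while holding only its stripe; slot must be < NSLOTS. */
void striped_add(size_t slot, long delta)
{
    pthread_mutex_t *m = stripe_for(slot);
    pthread_mutex_lock(m);
    table[slot] += delta;
    pthread_mutex_unlock(m);
}

/* Read one slot under the same stripe lock; slot must be < NSLOTS. */
long striped_read(size_t slot)
{
    pthread_mutex_t *m = stripe_for(slot);
    pthread_mutex_lock(m);
    long value = table[slot];
    pthread_mutex_unlock(m);
    return value;
}
```

Threads working on slots that hash to different stripes never contend with each other, while slots 3 and 67 (3 + 64) share a lock and are serialized.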
Access to shared memory at the lowest level in any multi-processor/core/threaded application synchronization depends on the bus lock. Such a lock may incur hundreds of (CPU) wait states as it also encompasses locking those I/O buses that have bus-mastering devices including DMA. Theoretically it is possible to envision a medium-level lock that can be invoked in situations when the programmer is certain that the memory area being locked won't be affected by any I/O bus. Such a lock would be much faster because it only needs to synchronize the CPU caches with main memory which is fast, at least in comparison to latency of the slowest I/O buses. Whether programmers in general would be competent to determine when to use which bus lock adds worrying implications to its mainstream feasibility. Such a lock could also require its own dedicated external pins for synchronization with other processors.
In multi-processor Opteron systems each processor has its own memory which becomes part of the entire memory that all installed processors can "see". A processor trying to access memory which turns out to be attached to another processor will transparently complete the access - albeit more slowly - through a high-speed interconnect bus (called HyperTransport) to the processor in charge of that memory (the NUMA concept). As long as a processor and its cores are working with the memory physically connected to it processing will be fast. In addition, many processors are equipped with several external memory buses to multiply their overall memory bandwidth.
A theoretical medium-level lock could, on Opteron systems, be implemented using the HyperTransport interconnections.
As for any foreseeable future, the classic approach of locking as seldom as possible and for as short a time as possible, by implementing efficient algorithms (and associated data structures) that are used while the locks are held, still holds true.

How to use more than 3 GB in a process on 32-bit PAE-enabled Linux app?

PAE (Physical Address Extension) was introduced in CPUs back in 1995, with the Pentium Pro. This allows a 32-bit processor to address 64 GB of physical memory instead of 4 GB. Linux kernels offer support for this starting with 2.3.23. Assume I am booting one of these kernels, and want to write an application in C that will access more than 3 GB of memory (why 3 GB? See this).
How would I go about accessing more than 3 GB of memory? Certainly, I could fork off multiple processes; each one would get access to 3 GB, and could communicate with each other. But that's not a realistic solution for most use cases. What other options are available?
Obviously, the best solution in most cases would be to simply boot in 64-bit mode, but my question is strictly about how to make use of physical memory above 4 GB in an application running on a PAE-enabled 32-bit kernel.
You don't, directly -- as long as you're running on 32-bit, each process will be subject to the VM split that the kernel was built with (2GB, 3GB, or if you have a patched kernel with the 4GB/4GB split, 4GB).
One of the simplest ways to have a process work with more data and still keep it in RAM is to create a shmfs and then put your data in files on that fs, accessing them with the ordinary seek/read/write primitives, or mapping them into memory one at a time with mmap (which is basically equivalent to doing your own paging). But whatever you do it's going to take more work than using the first 3GB.
Or you could fire up as many instances of memcached as needed until all physical memory is mapped. Each memcached instance could make 3GiB available on a 32 bit machine.
Then access memory in chunks via the APIs and language bindings for memcached. Depending on the application, it might be almost as fast as working on a 64-bit platform directly. For some applications you get the added benefit of creating a scalable program. Not many motherboards handle more than 64GiB RAM but with memcached you have easy access to as much RAM as you can pay for.
Edited to note, that this approach of course works in Windows too, or any platform which can run memcached.
PAE is an extension of the hardware's address bus, and some page table modifications to handle that. It doesn't change the fact that a pointer is still 32 bits, limiting you to 4G of address space in a single process. Honestly, in the modern world the proper way to write an application that needs more than 2G (windows) or 3G (linux) of address space is to simply target a 64 bit platform.
On Unix, one way to access that more-than-32-bit-addressable memory in user space is to use mmap/munmap whenever you want to access a subset of the memory that you aren't currently using. It's kind of like manual paging. Another (easier) way is to implicitly utilize the memory by using different subsets of it in multiple processes (if you have a multi-process architecture for your code).
The mmap method is essentially the same trick that Commodore 128 programmers used for bank switching. In these post-Commodore 64 days, with 64-bit support so readily available, there aren't many good reasons to even think about it ;)
I had fun deleting all the hideous PAE code from our product a number of years ago.
You can't have pointers pointing to > 4G of address space, so you'd have to do a lot of tricks.
It should be possible to switch a block of address space between different physical pages by using mmap to map bits of a large file; you can change the mapping at any time by another call to mmap to change the offset into the file (in multiples of the OS page size).
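For what it's worth, the windowing trick looks roughly like this. It is a sketch under a few assumptions: a POSIX system, a file descriptor for a file that is already large enough (for example one living on a shm/tmpfs mount), and function names invented for the example. On a 32-bit build you would also compile with -D_FILE_OFFSET_BITS=64 so that off_t can hold offsets beyond 2 GB.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `len` bytes of the file starting at `offset` (must be page-aligned). */
void *map_window(int fd, off_t offset, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    return p == MAP_FAILED ? NULL : p;
}

/* Store one byte at absolute position `pos` through a temporary window. */
int window_store(int fd, off_t pos, unsigned char value)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    off_t base = pos - (pos % (off_t)page);   /* round down to a page */
    unsigned char *w = map_window(fd, base, page);
    if (w == NULL)
        return -1;
    w[pos - base] = value;
    return munmap(w, page);
}

/* Load one byte from absolute position `pos`; returns -1 on failure. */
int window_load(int fd, off_t pos)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    off_t base = pos - (pos % (off_t)page);
    unsigned char *w = map_window(fd, base, page);
    if (w == NULL)
        return -1;
    int value = w[pos - base];
    munmap(w, page);
    return value;
}
```

A real program would keep a larger window mapped and move it only when an access falls outside it, since every remap is a system call plus page-table work. That bookkeeping is what makes the technique so unpleasant.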
However this is a really nasty technique and should be avoided. What are you planning on using the memory for? Surely there is an easier way?
Obviously, the best solution in most cases would be to simply boot in 64-bit mode, but my question is strictly about how to make use of physical memory above 4 GB in an application running on a PAE-enabled 32-bit kernel.
There's nothing special you need to do. Only the kernel needs to address physical memory, and with PAE, it knows how to address physical memory above 4 GB. The application will use memory above 4 GB automatically and with no issues.
