Relationship between system calls API, syscall instruction and exception mechanism (interrupts) - c

I'm trying to understand the relationship between the C system call API, the syscall assembler instruction, and the exception mechanism (interrupts) used to switch contexts between processes. There's a lot to study on my own, so please bear with me.
Is my understanding correct that C system calls are implemented by the compiler as syscall instructions with the corresponding code in assembly, which, in turn, are implemented by the OS through the exception (interrupt) mechanism?
So the call to the write function in the following C code:
#include <unistd.h>

int main(void)
{
    write(2, "There was an error writing to standard out\n", 44);
    return 0;
}
Is compiled to assembly as a syscall instruction:
mov eax,4 ; system call number (sys_write)
syscall
And the instruction, in turn, is implemented by the OS through the exception (interrupt) mechanism?

TL;DR
The syscall instruction itself acts like a glorified jump: it's a hardware-supported way to efficiently and safely jump from unprivileged user-space into the kernel.
The syscall instruction jumps to a kernel entry-point that dispatches the call.
Before x86_64 two other mechanisms were used: the int instruction and the sysenter instruction.
They have different entry-points (still present today in 32-bit kernels, and 64-bit kernels that can run 32-bit user-space programs).
The former uses the x86 interrupt machinery and can be confused with exception dispatching (which also uses the interrupt machinery).
However, exceptions are events the CPU raises in response to anomalous conditions, while int is used to deliberately generate a software interrupt: again, a glorified jump.
The C language doesn't concern itself with system calls, it relies on the C runtime to perform all the interactions with the environment of the future program.
The C runtime implements the above-mentioned interactions through an environment specific mechanism.
There could be various layers of software abstractions but in the end the OS APIs get called.
The term API denotes a contract; strictly speaking, using an API doesn't require invoking a piece of kernel code (the trend is to implement non-critical functions in userspace to limit the exploitable code). Here we are only interested in the subset of the API that requires a privilege switch.
Under Linux, the kernel exposes a set of services accessible from userspace; these entry-points are called system calls.
Under Windows, the kernel services (which are accessed with the same mechanism as their Linux analogues) are considered private, in the sense that they are not required to be stable across versions.
A set of DLL/EXE exported functions are used as entry-points instead (e.g. ntoskrnl.exe, hal.dll, kernel32.dll, user32.dll), which in turn use the kernel services through a (private) system call.
Note that under Linux, most system calls have a POSIX wrapper around them, so it's possible to use these wrappers, which are ordinary C functions, to invoke a system call.
The underlying ABI is different, and so is the error reporting; the wrapper translates between the two worlds.
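For example, a minimal sketch of what that translation means in practice (Linux, glibc): at the kernel ABI level a failing write comes back as a negative error number, and the ordinary C wrapper turns that into a -1 return value plus errno.

#include <unistd.h>
#include <errno.h>
#include <stdio.h>

int main(void)
{
    ssize_t n = write(-1, "x", 1);       /* deliberately invalid file descriptor */
    if (n == -1 && errno == EBADF)       /* the kernel returned -EBADF; the wrapper set errno */
        fprintf(stderr, "the wrapper reported EBADF via errno\n");
    return 0;
}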
The C runtime calls the OS APIs: in the case of Linux the system calls are used directly, because they are public (in the sense that they are stable across versions), while for Windows the usual DLLs, like kernel32.dll, are marked as dependencies and used.
We are now at the point where a user-mode program, be it part of the C runtime (Linux) or part of an API DLL (Windows), needs to invoke code in the kernel.
The x86 architecture historically offered different ways to do so, for example, a call gate.
Another way is through the int instruction; it has a few advantages:
It is what the BIOS and DOS did in their time.
In real mode, using an int instruction is convenient because a vector number (e.g. 21h) is easier to remember than a far address (e.g. 0f000h:0fff0h).
It saves the flags.
It is easy to set up (setting up an ISR is relatively easy).
With the modernization of the architecture this mechanism turned out to have a big disadvantage: it is slow.
Before the introduction of the sysenter (note, sysenter not syscall) instruction there was no faster alternative (a call gate would be equally slow).
With the advent of the Pentium Pro/II[1] a new pair of instructions sysenter and sysexit were introduced to make system calls faster.
Linux started using them in version 2.5, and I believe they are still used today on 32-bit systems.
I won't explain the whole mechanism of the sysenter instruction and the companion VDSO necessary to use it; suffice it to say that it was faster than the int mechanism (I can't find an article from Andy Glew where he says that sysenter turned out to be slow on the Pentium III; I don't know how it performs nowadays).
With the advent of x86-64, the AMD response to sysenter, i.e. the syscall/sysret pair, became the de facto way to switch from user-mode to kernel-mode.
This is because syscall is fast and very simple (it copies rip and rflags into rcx and r11 respectively, masks rflags and jumps to an address set in IA32_LSTAR).
64-bit versions of both Linux and Windows use syscall.
To recap, control can be given to the kernel through three mechanisms:
Software interrupts.
This was int 80h for 32-bit Linux (pre 2.5) and int 2eh for 32-bit Windows.
Via sysenter.
Used by 32-bit versions of Linux since 2.5.
Via syscall.
Used by 64-bit versions of Linux and Windows.
Here is a nice page that puts it in better shape.
The C runtime is usually a static library, thus pre-compiled, that uses one of the three methods above.
The syscall instruction transfers control to a kernel entry-point (see entry_64.s) directly.
It is an instruction that just does so, it is not implemented by the OS, it is used by the OS.
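To make the user-space side of that jump concrete, here is a minimal sketch (GCC extended inline assembly, x86-64 Linux) of a write performed with the raw syscall instruction; this is roughly what the pre-compiled wrapper boils down to. The name raw_write is made up for the example.

#include <sys/syscall.h>        /* SYS_write */

long raw_write(int fd, const void *buf, unsigned long len)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)                          /* result comes back in rax          */
                      : "a"((long)SYS_write),              /* rax selects the system call       */
                        "D"((long)fd), "S"(buf), "d"(len)  /* rdi, rsi, rdx carry the arguments */
                      : "rcx", "r11", "memory");           /* syscall clobbers rcx and r11      */
    return ret;                                            /* negative values are -errno        */
}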
The term exception is overloaded in CS: C++ has exceptions, and so do Java and C#.
The OS can have a language-agnostic exception trapping mechanism (under Windows it was once called SEH, and has since been rewritten).
The CPU also has exceptions.
I believe we are talking about the last meaning.
Exceptions are dispatched through the interrupt machinery; they are a kind of interrupt.
It goes without saying that while exceptions are synchronous (they happen at specific, replayable points), they are "unwanted": they are exceptional, in the sense that programmers tend to avoid them, and when they do happen it is due to a bug, an unhandled corner case, or a bad situation.
Thus, they are not used to transfer control to the kernel (though they could be).
Software interrupts (which are synchronous too) were used instead; the mechanism is almost exactly the same (exceptions can have a status code pushed on the kernel stack) but the semantics are different.
We never dereferenced a null pointer or accessed an unmapped page to invoke a system call; we used the int instruction instead.

Is my understanding correct that C system calls are implemented by the compiler as syscall instructions with the corresponding code in assembly […]?
No.
The C compiler handles system calls the same way that it handles calls to any other function:
# write(2, "There was an error writing to standard out\n", 44);
mov     $44, %edx
lea     .LC0(%rip), %rsi        # address of the string
mov     $2, %edi
call    write
The implementation of these functions in libc (your system's C library) will probably contain a syscall instruction, or whatever the equivalent is on your system's architecture.

EDIT
Yes: the C application calls a C library function; buried in the C library is a system-specific call or set of calls, which use an architecture-specific way to reach the operating system, which has an exception/interrupt handler set up to deal with these system calls. Actually, it doesn't have to be architecture-specific; it could simply jump/call to a well-known address, but with the modern desire for security and protection modes, a simple call won't have those added features, although it would still be functionally correct.
How the library is implemented is implementation defined. How the compiler connects your code to that library, at run time or link time, can happen in a number of ways; there is no one way it can or needs to happen, so it is implementation defined as well. So long as it is functionally correct and doesn't conflict with the C standard, it can work.
With operating systems like Windows and Linux and others on our phones and tablets, there is a strong desire to isolate the applications from the system so they cannot do damage in various ways, so protection is desired, and you need an architecture-specific way to make a function call into the operating system that is not a normal call, because it switches modes. If the architecture has more than one way to do this, then the operating system can choose one or more of those ways as part of its design.
A "software interrupt" is one common way. As with hardware interrupts, most solutions include a table of handler addresses; by extending that table, some of the vectors can be tied to a software-created "interrupt" (triggered by hitting a special instruction rather than by a signal changing state on an input), which then goes through the same stop, save some state, call the vector, etc.

Not a direct answer to the question but this might interest you (I don't have enough karma to comment) - it explains all the user space execution (including glibc and how it does syscalls) in detail:
http://www.maizure.org/projects/printf/index.html
You'll probably be interested in particular in 'Step 8 - Final string written to standard output':
And what does __libc_write look like...?
000000000040f9c0 <__libc_write>:
40f9c0: 83 3d c5 bb 2a 00 00 cmpl $0x0,0x2abbc5(%rip) # 6bb58c <__libc_multiple_threads>
40f9c7: 75 14 jne 40f9dd <__write_nocancel+0x14>
000000000040f9c9 <__write_nocancel>:
40f9c9: b8 01 00 00 00 mov $0x1,%eax
40f9ce: 0f 05 syscall
...cut...
Write simply checks the threading state and, assuming all is well,
moves the write syscall number (1) into EAX and enters the kernel.
Some notes:
x86-64 Linux write syscall is 1, old x86 was 4
rdi refers to stdout
rsi points to the string
rdx is the string size count
Note that this was for the author's x86-64 Linux system.
For x86, this provides some help:
http://www.tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html
Under Linux the execution of a system call is invoked by a maskable interrupt or exception class transfer, caused by the instruction int 0x80. We use vector 0x80 to transfer control to the kernel. This interrupt vector is initialized during system startup, along with other important vectors like the system clock vector.
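Not part of the quote, but for comparison with the x86-64 syscall sketch above, the legacy int 0x80 entry looks roughly like this (32-bit x86 Linux only, built with -m32; the function name is made up):

long write_int80(int fd, const void *buf, unsigned long len)
{
    long ret;
    __asm__ volatile ("int $0x80"
                      : "=a"(ret)
                      : "a"(4),                         /* 4 = __NR_write in the 32-bit table */
                        "b"(fd), "c"(buf), "d"(len)     /* ebx, ecx, edx carry the arguments  */
                      : "memory");
    return ret;                                         /* negative values are -errno         */
}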
But as a general answer for a Linux kernel:
Is my understanding correct that C system calls are implemented by the compiler as syscall instructions with the corresponding code in assembly, which, in turn, are implemented by the OS through the exception (interrupt) mechanism?
Yes

Related

Writing an OS in C language

I would like to know if an operating system can be written only in a language such as C. Can it be done using only C, or do I need to use inline assembly with C?
There are parts of a typical kernel where it's necessary to do things that C doesn't support, including:
accessing CPU's special registers (e.g. control registers and MSRs on 80x86)
accessing CPU's special features (e.g. CPUID, LGDT, LIDT instructions on 80x86)
managing virtual address spaces (e.g. TLB invalidation)
saving and loading state during task switches
accessing special address spaces (e.g. the IO ports on 80x86)
supporting ways for CPU to switch privilege levels
To write an OS in pure C you need to either avoid all of these things (which is likely to be severely limiting - e.g. single-tasking, single address space, no protection, no IRQs, ...) or cheat by adding "non-standard implementation defined extensions" to C.
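As a small illustration of such an extension, here is a sketch (GCC/Clang inline assembly, x86/x86-64) of the CPUID item from the list above; standard C has no way to express that instruction:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

static void cpuid(uint32_t leaf, uint32_t regs[4])
{
    __asm__ volatile ("cpuid"
                      : "=a"(regs[0]), "=b"(regs[1]), "=c"(regs[2]), "=d"(regs[3])
                      : "a"(leaf), "c"(0));
}

int main(void)
{
    uint32_t r[4];
    char vendor[13] = {0};
    cpuid(0, r);                          /* leaf 0: vendor string in ebx, edx, ecx */
    memcpy(vendor + 0, &r[1], 4);
    memcpy(vendor + 4, &r[3], 4);
    memcpy(vendor + 8, &r[2], 4);
    printf("CPU vendor: %s\n", vendor);
    return 0;
}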
Note that the amount of assembly that you need is very little - e.g. a kernel consisting of 5 million lines of C might only need 100 lines of inline assembly, which works out to "0.00% of the code (with a little rounding error)".
For boot code/boot loader; it depends on the boot environment. For example, if you booted from UEFI firmware then it's no problem (as its API is designed for high level languages), but if you booted from BIOS firmware then you can't use C alone (due to "unsupportable" calling conventions).

What software-visible processor state needs to go in a jmp_buf on an x86-64 processor?

As stated, what software-visible processor state needs to go in a jmp_buf on an x86-64 processor when setjmp(jmp_buf env) is called? What processor state does not?
I have been reading a lot about setjmp and longjmp but couldn't find a clear answer to my question. I know it is implementation dependent but I would like to know for the x86_64 architecture.
From the following implementation
it seems that on an x86-64 machine all the callee saved registers (%r12-%r15, %rbp, %rbx) need to be saved as well as the stack pointer, program counter and all the saved arguments of the current environment. However I'm not sure about that, hope someone could clarify that for me.
For example, which x86-64 registers need to be saved? What about condition flags? For example, I think the floating point registers do not need to be saved because they don't contribute to the state of the program.
That's because of the calling convention. setjmp is a function-call that can return multiple times (the first time when you actually call it, later times when a child function calls longjmp), but it's still a function call. Like any function call, the compiler assumes that all call-clobbered registers have been clobbered, so longjmp doesn't need to restore them.
So yes, they're not part of the "program state" on a function call boundary because the compiler-generated asm is definitely not keeping any values in them.
You're looking at glibc's implementation for the x86-64 System V ABI, where all vector / x87 registers are call-clobbered and thus don't have to be saved.
In the Windows x86-64 calling convention, xmm6-15 are call-preserved (just the low 128 bits, not the upper portions of y/zmm6-15), and would have to be part of the jmp_buf.
i.e. it's not the CPU architecture that's relevant here, it's the software calling convention.
Besides the call-preserved registers, one key thing is that it's only legal to longjmp to a jmp_buf saved by a parent function, not from any arbitrary function after the function that called setjmp has returned.
If setjmp had to support that, it would have to save the entire stack frame, or actually (for the function to be able to return, and that parent to be able to return, etc.) the whole stack all the way up to the top. This is obviously insane, and thus it's clear why longjmp has that restriction of only being able to jump to parent / (great) grandparent functions, so it just has to restore the stack pointer to point at the still-existing stack frame and restore whatever local variables in that function might have been modified since setjmp.
(On C / C++ implementations on architectures / calling conventions that use something other than a normal call-stack, a similar argument about the jump-target function being able to return still applies.)
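A minimal usage sketch of that restriction in plain ISO C: the jmp_buf is only jumped to while the frame that called setjmp is still live, and the local that is modified between setjmp and longjmp is declared volatile so its value is well defined afterwards.

#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

static void child(void)
{
    longjmp(env, 1);                 /* unwinds back into main's still-existing frame */
}

int main(void)
{
    volatile int attempts = 0;       /* modified between setjmp and longjmp */
    if (setjmp(env) == 0) {
        attempts++;
        child();                     /* never returns normally */
    }
    printf("returned via longjmp, attempts = %d\n", attempts);
    return 0;
}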
As the jmp_buf is the only place that can be used to restore processor state on a longjmp, it is generally everything that is needed to restore the full state of the machine as it was when setjmp is called.
This obviously depends very much on the processor and the compiler (what exactly does it use of the CPU's features to store program state):
On an ideal pure-stack machine that holds CPU state nowhere but the stack, that would be the stack pointer only. Other than in very old or purely academic implementations, such machines rarely exist. You could, however, write a compiler on a modern machine like an x86 that solely uses the stack to store such information. For such a hypothetical compiler, saving the stack pointer only would suffice to restore program state.
On a more common, practical machine, this might be the stack pointer and the full set of registers used to store program status.
On some CPUs that store program status information in other places, for example in a zero page, and with compilers that make use of such CPU features, the jmp_buf would also need to store a copy of this zero page (certain 65xx CPUs or Atmel AVR MCUs and their compilers might use this feature).

Forcing a function to restore all registers before making a function call

I am using an EK-LM4F120XL board, which contains a cortex-M4 processor. I also use GCC-ARM-none-eabi as toolchain.
I am building on a little hobby project, which slowly becomes an operating system. An important part of this is that I need to switch out registers to switch processes. This happens inside an interrupt and this specific processor makes sure that all the temporary registers (r0-r3, r12, lr) are pushed to the process stack. So in order to continue I need to write the content of r4-r11 and the SP to a place in memory, I need to load the r4-r11 of the new process, load its stackpointer and return. Additionally the lr value contains some information about the process that was interrupted, so I need information from that register too.
All of this works, because I wrote it in assembly. I linked the assembly function directly to the interrupt, so I have full control over what happens to the registers. The combination of C and inline assembly did not work because the compiler usually pushes some registers to the stack, and that is fatal. But the OS is growing and the context switch is growing along with it: there are now also some global variables that need changing, etc. All of this is doable in assembly, but it's becoming a pain: assembly is hard to read and to debug. So I want a C/assembly combo. Basically I am looking for something like this:
void contextSwitch(void)
{
    // Figure out what the next process will be
    // Change every variable that needs changing

    // Restore register state to the moment of the interrupt. The following function
    // will not return, in the sense that it will end the interrupt.
    swapRegisters(oldProc, newProc);
}
And then write only swapRegisters in assembly. Is there a way to achieve this? Is my solution even the best solution?
There is no portable method of directly accessing CPU registers in C; you will need assembler, in-line assembler, compiler intrinsics or a kernel library (that uses assembler code).
The details of how that is done for Cortex-M are well covered elsewhere and probably too complex to be repeated here: The specifics of doing this in Cortex-M4(F) are described at the ARM Info Center site here. The approach is broadly similar for the Cortex-M3 except for the FPU considerations, an M3 specific description of context switching is provided in this Embedded.com article.
As you can never have enough explanations because different authors make some things clearer than others or give better or more directly applicable examples, here's another - also M3 based, but will work on M4 if not using the FPU or for M4's without an FPU. And yet another example.
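For completeness, here is one possible shape of the C-plus-assembly split the question asks for, as a rough sketch only (GCC, Cortex-M, FPU registers ignored; PendSV_Handler is the usual CMSIS-style name, and scheduler_switch is a hypothetical ordinary C function that records the old process stack pointer, updates whatever globals need changing, and returns the next task's saved PSP):

/* hypothetical C-side scheduler: old PSP in, next task's PSP out */
extern void *scheduler_switch(void *old_psp);

__attribute__((naked)) void PendSV_Handler(void)
{
    __asm volatile (
        "mrs   r0, psp            \n"   /* old task's process stack pointer        */
        "stmdb r0!, {r4-r11}      \n"   /* save the registers the hardware didn't  */
        "push  {r4, lr}           \n"   /* keep EXC_RETURN, stay 8-byte aligned    */
        "bl    scheduler_switch   \n"   /* plain C: pick the next task             */
        "pop   {r4, lr}           \n"
        "ldmia r0!, {r4-r11}      \n"   /* restore the next task's r4-r11          */
        "msr   psp, r0            \n"
        "bx    lr                 \n"   /* EXC_RETURN in lr ends the interrupt     */
        );
}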

Why some part of an os has to be written in assembly? [duplicate]

This question already has answers here:
Is assembly strictly required to make the "lowest" part of an operating system?
(3 answers)
Closed 9 years ago.
The scheduler of my mini OS is written in assembly and I wonder why. I found out that the eret instruction can't be generated by the C compiler; is this something that can be generalized to platforms other than Nios, and also to the x86 and/or MIPS architectures? I believe that part of an OS is always written in assembly, and I'm searching for why a systems programmer must know assembly to write an operating system. Is it the case that there are built-in limitations of the C compiler, so that it can't generate certain assembly instructions like eret, which returns the program to what it was doing after an interrupt?
The generic answer is for one of three reasons:
Because that particular type of code can't be written in C. I think eret is a "return from exception" instruction, so there is no C equivalent to this (because hardware exceptions such as page faults, divide by zero or similar are not C/C++ style exceptions). Another example may be saving the registers onto the stack when task-switching, and saving the stack pointer into the task-control block. The C code can't do that, because there is no direct access to the stack pointer.
Because the compiler won't produce as good code as someone clever writing assembler. Some specialized operations can be hard to write in C - the compiler may not generate very good code, or the code gets very convoluted to achieve something that is simple in assembler.
The startup of C code needs to be written in assembler, because a C program needs certain things set up before actual C code can run: for example, configuring the stack pointer and some other registers (see the sketch after this list).
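As a rough illustration of the third point, on an x86-64 Linux user-space build (linked with -nostartfiles; on bare metal there would also be stack and hardware setup here), the whole start-up stub can be a handful of instructions written as top-level assembly inside a C file:

extern int main(void);

__asm__(
    ".globl _start            \n"
    "_start:                  \n"
    "    xor  %ebp, %ebp      \n"   /* mark the outermost stack frame           */
    "    and  $-16, %rsp      \n"   /* keep the alignment the ABI expects       */
    "    call main            \n"
    "    mov  %eax, %edi      \n"   /* main's return value becomes exit status  */
    "    mov  $60, %eax       \n"   /* 60 = SYS_exit on x86-64 Linux            */
    "    syscall              \n");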
Yep, that is it. There are instructions that you cannot generate using the C language, and there are usually one or more such instructions required for an OS, so some assembly is required. This is true for pretty much any instruction set: x86, ARM, MIPS, and so on. C compilers let you do inline assembly for jamming instructions in, but the language itself can't really handle the nuances of each instruction set and try to account for them. Some compilers add compiler-specific extensions, for example to return from a function using an interrupt flavor of return. It is so much easier to just write assembly where needed than to customize the language or the compilers, so there is really no demand there.
The C language expresses the things it is specified to express: fundamental arithmetic operations, assignment of values to variables, and branches and function calls. Objects may be allocated using static, automatic (local), or dynamic (malloc) storage duration. If you want something outside this conceptual scope, you need something other than pure C.
The C language can be extended arbitrarily, and many platforms define syntax for things like defining a function or variable at a particular address.
But the hardware of the CPU cares about a lot of details, such as the values of flag registers. The part of the scheduler which switches threads needs to be able to save all the registers to memory before doing anything, because overwriting any register would lose essential data in the interrupted thread.
The only way to be able to write such a thing in C, would be for the compiler to provide a C function which generates the finely-tuned assembly. And then you're essentially back at square 1, because the important details are still at the level of the assembly code.
Vendors with multiple product lines of microcontrollers sometimes go out of their way to allow C source compatibility even at the lowest levels, to allow their customers to port code (or conversely, to prevent them from going to another vendor when they need to switch platforms). But the distinction between C and assembly blurs at a certain point, when you're calling pseudo-functions that generate specific instructions (known as intrinsics).
Some things that cannot be done in C or that, if they can be done, are better done in assembly because they are more straightforward and/or maintainable that way include:
Execute return-from-exception and return-from-interrupt instructions.
Read from and write to special processor registers (which control processor state, memory mapping, cache configuration, exception management, and more).
Perform atomic reads and writes to special addresses that are connections to hardware devices rather than memory.
Perform load and store instructions of particular sizes or characteristics to special addresses as described above. (E.g., writing to a certain devices might require using only a store-16-bits instruction and not a regular store-32-bits instruction.)
Execute instructions for memory barriers or ordering, cache control, and flushing of memory maps.
Generally, C is mostly designed to do computations (read inputs, calculate things, write outputs) and not to control a machine (interact with all the controls and devices in the machine).
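A small sketch that ties this to the "intrinsics" mentioned in the previous answer, using the memory-barrier item from the list above (x86 with SSE2, GCC/Clang; device_ready is a made-up flag polled by other code): the call looks like a C function but compiles to a single MFENCE instruction.

#include <emmintrin.h>            /* _mm_mfence */

volatile int device_ready;

void publish_buffer(void)
{
    /* ... fill a shared buffer ... */
    _mm_mfence();                 /* all earlier stores become globally visible first */
    device_ready = 1;             /* only then raise the flag */
}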

Mechanism of the Boehm Weiser Garbage Collector

I was reading the paper "Garbage Collection in an Uncooperative Environment" and wondering how hard it would be to implement it. The paper describes a need to collect all addresses from the processor (in addition to the stack). The stack part seems intuitive. Is there any way to collect addresses from the registers other than enumerating each register explicitly in assembly? Let's assume x86_64 on a POSIX-like system such as Linux or Mac.
SetJmp
Since Boehm and Weiser actually implemented their GC, a basic source of information is the source code of that implementation (it is open source).
To collect the register values, you may want to subvert the setjmp() function, which saves a copy of the registers in a custom structure (at least those registers which are supposed to be preserved across function calls). But that structure is not standardized (its contents are nominally opaque) and setjmp() may be specially handled by the C compiler, making it a bit delicate to use for anything other than a longjmp() (which is already quite hard as it is). A piece of inline assembly seems much easier and safer.
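For what it's worth, a sketch of that setjmp() trick, with the caveats just mentioned (scan_range is a hypothetical routine of the collector): calling setjmp() forces the callee-saved registers into a stack-resident jmp_buf, so a conservative scan of the stack also sees pointers that were only held in registers.

#include <setjmp.h>

extern void scan_range(void *lo, void *hi);     /* hypothetical conservative scanner */

void scan_stack_and_registers(void *stack_top)
{
    jmp_buf regs;                                /* lives in this stack frame                   */
    if (setjmp(regs) == 0)                       /* spills the callee-saved registers into regs */
        scan_range((void *)&regs, stack_top);    /* scan from here up to the stack base         */
}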
The first hard part in the GC implementation seems to be able to reliably detect the start and end of stacks (note the plural: there may be threads, each with its own stack). This requires delving into ill-documented details of OS ABI. When my desktop system was an Alpha machine running FreeBSD, the Boehm-Weiser implementation could not run on it (although it supported Linux on the same processor).
The second hard part will be when trying to go generational, trapping write accesses by playing with page access rights. This again will require reading some documentation of questionable existence, and some inline assembly.
I think on x86_64 they use the flushrs assembly instruction to put the registers on the stack. I am sure someone on Stack Overflow will correct me if this is wrong.
It is not hard to implement a naive collector: it's just an algorithm after all. The hard bits are as stated, but I will add the worst ones: tracking exceptions is nasty, and stopping threads is even worse: that one can't be done at all on some platforms. There's also the problem of trapping all pointers that get handed over to the OS and lost from the program temporarily (happens a lot in Windows window message handlers).
My own multi-threaded GC is similar to the Boehm collector: more or less standard C++ with a few hacks (using jmp_buf is more or less certain to work) and a slightly less hostile environment (no exceptions). But it stops the world by cooperation, which is very bad: if you have a busy CPU, the idle ones wait for it. Boehm uses signals or other OS features to try to stop threads, but the support is very flaky.
And note also that the Intel IA-64 (Itanium) processor has two stacks per thread... a bit hard to account for this kind of thing generically.

Resources