Are system calls directly sent to the kernel? - c

I have a couple of assumptions, most likely some of them will be incorrect. Please correct me where they are wrong.
We could categorize the functions in a program written in C as follows:
Functions that are sent to dynamically loaded libraries:
These are sent to the library that translates them in to multiple standard C-functions
The library passes them on to libc where they are translated into multiple system calls.
Libc passes those on to the kernel where they are executed and the returns are sent back to libc.
Libc will collect the returns, group them by C function and use them to create 1 return for each C function. These returns will be sent back to the dynamically loaded library.
This library will collect all returns and use them to create 1 return that is sent back to the original program.
Functions that are either defined in the code or part of statically compiled libraries: Everything is the same as the category above but:
The program already does the translation into standard C functions where they are known, or into functions calling dynamically loaded libraries in the other case.
The standard C functions are sent to libc, the others to the dynamically loaded libraries (where they will be handled as above).
The creation of 1 final return based on the returns from both types of functions will be done by the program
Functions that are standard C functions: They will just be sent to libc which will communicate with the kernel in the same way as above and 1 return will be sent to the program
Functions that are system calls: They are NOT sent directly to the kernel but have to pass to libc although it doesn't do any extra work.
Security checks (permissions, writing to unallocated mem, ...) are always done by the kernel, although libc and libraries above might also check it first.
All POSIX-compliant systems follow these rules

It might not be the same on Linux and on some other POSIX system (like FreeBSD).
On Linux, the ABI defines how a system call is done. Read about Linux kernel interfaces. The system calls are listed in syscalls(2) (but see also /usr/include/asm*/unistd.h ...). Read also vdso(7). The assembler HowTo explains more details, but for 32 bits i686 only.
Most Linux libc are free software, you can study their source code. IMHO the source code of musl-libc is very readable.
To simplify a tiny bit, most system calls (e.g. write(2)) are small C functions in the libc which:
call the kernel using a dedicated machine instruction (SYSENTER or int $0x80 on 32-bit x86, SYSCALL on x86-64), taking care of passing the system call number and its arguments with the kernel convention, which is not the usual C ABI. What the kernel considers as a system call is only that machine instruction (and the conventions about it).
handle the failure case, by storing the error code in errno(3) and returning -1.
(On Linux, the kernel signals failure by returning a small negative value, the negated errno code, in the result register; a carry flag convention is used on some BSD systems instead.)
handle the success case, by returning a result.
You could invoke system calls without libc, with some assembler code. This is unusual, but has been done (e.g. in BusyBox or in Bones).
So the libc code for write is doing some tiny extra work (passing arguments, handling failure & errno and success cases).
A few system calls (e.g. gettimeofday & clock_gettime) avoid the overhead of the system call instruction (and the user-mode -> kernel-mode switch) thanks to the vDSO.
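To make the three steps above concrete, here is a minimal sketch of a hand-written write wrapper. The names raw_syscall3 and my_write are hypothetical, the inline assembly follows the x86-64 Linux convention only, and other architectures fall back to the libc syscall() function; it is an illustration, not how any particular libc is actually written.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue a 3-argument system call and return the raw
 * kernel result (a negated errno code on failure). */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__x86_64__)
    long ret;
    /* x86-64 Linux convention: number in rax, arguments in rdi, rsi, rdx.
     * The syscall instruction clobbers rcx and r11. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                      : "rcx", "r11", "memory");
    return ret;
#else
    /* Fallback (assumption): let libc enter the kernel, then undo its
     * errno translation so both branches return the same kind of value. */
    long ret = syscall(nr, a1, a2, a3);
    return ret == -1 ? -errno : ret;
#endif
}

/* Hypothetical wrapper doing the "tiny extra work": pass the arguments,
 * turn a kernel error into errno and -1, return the result on success. */
long my_write(int fd, const void *buf, size_t len)
{
    long ret = raw_syscall3(SYS_write, (long)fd, (long)buf, (long)len);
    if (ret < 0 && ret >= -4095) {
        errno = (int)-ret;
        return -1;
    }
    return ret;
}
```

Calling my_write(1, "ok\n", 3) should behave like write(2) on the same arguments, including setting errno to EBADF for an invalid descriptor.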

No, you can't categorize things like that. When you program in C (and it makes no difference in almost all other languages), there are only functions, and whatever the real status of these is, you call them in exactly the same way. This is defined by the ABI (how to pass parameters, get returned values, etc.) and enforced by the compiler/linker. Of course some functions are just stubs: for example, stubs for shared library functions (a stub may be needed to load the library, dynamically link to the real function, etc.) or for system calls (this is more technical and differs from kernel to kernel). But from the viewpoint of your program everything is the same (this is why the difference between fread and read is hard to understand at the beginning: you call them the same way and they do almost the same job, so what's the difference?).
POSIX doesn't say a single word about kernels... It just lists the C (and formerly Ada) API of a set of functions with minimal semantics (plus some commands, tools, etc.). Implementations are free to provide these however they like.
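As a small illustration of the fread/read point, the sketch below reads the same bytes once through the system call wrapper read(2) and once through the buffered stdio function fread(3). The helper names bytes_via_read and bytes_via_fread are made up for this example. From the caller's side both are plain function calls; the difference only shows up underneath (running the program under strace would typically show fread issuing a larger read to fill its buffer).

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Reads up to n bytes with the system call wrapper read(2). */
long bytes_via_read(const char *path, char *buf, size_t n)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t got = read(fd, buf, n);
    close(fd);
    return (long)got;
}

/* Reads up to n bytes with the buffered stdio function fread(3). */
long bytes_via_fread(const char *path, char *buf, size_t n)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t got = fread(buf, 1, n, f);
    fclose(f);
    return (long)got;
}
```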

Related

Are OS libraries written in assembly or in C

I ask this, because I am getting very conflicting definitions of System calls.
On one hand, I have seen the definition that they are an API the OS provides that a user program can call. Since this API is a high level interface, it has to be implemented in a high level language like C.
On the other hand, I have seen that the actual OS syscalls are machine instructions, for which you have to set certain registers to call (according to some compliance standard set by the OS). But this looks nothing like the UNIX APIs like open(), write() and read(), so what is going on here.
I have also read that these high level interfaces are implemented in the C libraries which do the actual assembly code syscalls. In that case, why do we say the OS provides this interface when it is actually provided by the C language. What if I want to perform a UNIX syscall directly to the OS without having to use C?
There are two open functions - one, the syscall open exposed by the operating system (e.g. Linux), and two, the C-library function open, exposed by the C standard library (e.g. glibc).
You can see two different man pages for these functions - run man 2 open to see the man page regarding the syscall, and man 3 open to see the man page regarding the C standard function.
Functions you mentioned like open, write, and read can be confusing - because they exist both as syscalls and as C standard functions. But they are separate entities entirely - in fact, glibc's open function doesn't even use the open syscall - it uses the openat syscall.
On Windows, where the syscall open doesn't even exist - the C standard library function open does still exist, and uses WinAPI's CreateFile behind the scenes.
What if I want to perform a UNIX syscall directly to the OS without having to use C?
This is possible - indeed, glibc has to do it to implement C standard library functions. But it's tricky, and involves implementing wrappers for the syscalls and sometimes even handcrafting assembly.
If you want to see things for yourself, you can look at how glibc implements open:
int
__libc_open (const char *file, int oflag, ...)
{
  int mode = 0;
  if (__OPEN_NEEDS_MODE (oflag))
    {
      va_list arg;
      va_start (arg, oflag);
      mode = va_arg (arg, int);
      va_end (arg);
    }
  return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag, mode);
}
...
weak_alias (__libc_open, open)
Notice that the function ends with a call to the macro SYSCALL_CANCEL, which will end up calling the OS-exposed openat syscall.
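For comparison, the same idea can be sketched outside glibc with the generic syscall() function. The name open_via_openat is hypothetical, and unlike the glibc code this version ignores thread cancellation and always passes a mode argument, so treat it as an illustration only.

```c
#define _GNU_SOURCE
#include <errno.h>      /* syscall() reports failures through errno */
#include <fcntl.h>      /* AT_FDCWD, O_* flags */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical open() replacement that goes straight to the openat
 * system call, relative to the current working directory. */
int open_via_openat(const char *path, int flags, mode_t mode)
{
    return (int)syscall(SYS_openat, (long)AT_FDCWD, path,
                        (long)flags, (long)mode);
}
```

A call such as open_via_openat("/dev/null", O_WRONLY, 0) should return a usable file descriptor, and a missing path should give -1 with errno set to ENOENT, just as open would.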
Are OS libraries written in assembly or in C
That is a question that can not really be answered as it depends. Technically there are no limitations on the implementation (i.e. it can be written in any language, though C is probably the most common followed by assembly).
The important part here is the ABI. This defines how OS calls can be made.
You can make system calls in assembly (if you know the ABI you can manually write all the code to comply), the C compiler knows the ABI and will automatically generate all the code required to make a call.
Most languages though allow you to make system calls, they will either know the ABI or have a wrapper API that translates the calls from a language call to the appropriate ABI for that OS.
I ask this, because I am getting very conflicting definitions of System calls.
The definitions will depend on the context. You will have to give examples of what the definitions are AND in what context they are being used.
On one hand, I have seen the definition that they are an API the OS provides that a user program can call.
Sure this is one way to look at it.
More strictly I would say the OS provides a set of interfaces that can be used to perform privileged tasks. Now those interfaces can be exposed via an API provided by a particular environment that makes them easier to use.
Since this API is a high level interface, it has to be implemented in a high level language like C.
Sort of true.
That an environment exposes an API does not mean that it needs a high level language (and C is not a high level language; it is one step above assembly and is considered a low level language). And just because an API is exposed by a language does not mean it is implemented in that language.
On the other hand, I have seen that the actual OS syscalls are machine instructions, for which you have to set certain registers to call (according to some compliance standard set by the OS).
OK. Here we have moved from System Calls to syscalls. We should be very careful on how we use these terms to make sure we are not conflating different terms.
I would (and this is a bit abstract still) think about the computer as several levels of abstraction:
         Hardware
------   --------------
         syscalls
OS       --------------
         System Calls (read/write etc..)
------   --------------
         Language Interface (read/write etc..)
You can poke the hardware directly if you want (if you know how), but it is better to make syscalls (if you know how), better still to use the OS System Calls, which use a well-defined ABI, and best to use the language interface (what you would call the API) to call the underlying System Calls.
But this looks nothing like the UNIX APIs like open(), write() and read(), so what is going on here.
Here the UNIX OS provides the open/close/read interface.
The C library provides a very thin API wrapper above the OS System Calls. The C compiler will then generate the correct instructions to call the System Calls using the correct ABI, which in turn will call the next layer down in the OS to use the syscalls.
I have also read that these high level interfaces are implemented in the C libraries which do the actual assembly code syscalls.
The high level interface can be written in any language. But the C one is so easy to use that most other languages don't bother doing it themselves but simply call via the C interface.
It's VERRRY rare to ever directly write something in assembly. By writing in C you can compile it for many different CPU architectures whereas by writing in assembly you are basically stuck with one specific architecture. Most operating systems are written in C. We say the OS provides the interface because you are interacting with the operating system which happens to be written in C.

Linux system calls vs C lib functions

I have a bit of confusion regarding these two, so here are my questions:
The Linux man-pages project lists all these functions:
https://www.kernel.org/doc/man-pages/
Looking at recvfrom as an example, this function exists both as a Linux system call as well as a C library function. Their documentation seems different but they are both reachable using #include <sys/socket.h>.
I don't understand their difference?
I also thought systems calls are defined using hex values which can be implemented in assembly directly, their list is here:
https://syscalls.kernelgrok.com/
However I cannot find recvfrom in the above link. I'm a bit confused between Linux system calls vs C lib functions at this point!
Edit: To add to the questions, a lot of functions are under (3) but not (2), i.e. clean. Does this mean these are done by the C runtime directly without relying on system calls and the underlying OS?
First, understand that C functions and system calls are two completely different things.
System calls not wrapped in the C library must be called through the syscall function. One example of such a call is gettid.
To create a gettid system call wrapper with syscall, do this:
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Wrapper around the raw gettid system call, invoked by number. */
pid_t gettid(void)
{
    pid_t tid = (pid_t)syscall(SYS_gettid);
    return tid;
}
Here's an excerpt from the NOTES section of the man-page, which explicitly states that this function is not defined within the C library:
NOTES
Glibc does not provide a wrapper for this system call; call it using syscall(2).
recvfrom is a C library wrapper around a system call.
Everything in section (2) is a system call. Everything in section (3) is not. Everything in section (3) (with a few notable exceptions, such as getumask) has a definition in the C library. About half of everything in section (2) does not have a definition (or wrapper) within the C library (with the exception of functions mandated by POSIX, and some other extensions, which all do), as with gettid.
When calling recvfrom in C, the C library calls the kernel to do the syscall.
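In the classic 32-bit x86 convention, the syscall function is the function that puts the system call number in the %eax register and uses int $0x80; on x86-64 the number goes in %rax and the syscall instruction is used instead.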
The reason you don't see recvfrom in https://syscalls.kernelgrok.com/ is because https://syscalls.kernelgrok.com/ is very, very incomplete.
The reason there are many functions in (3) that you don't see in (2) is because many functions on (3) don't have a system call. They may or may not rely on system calls, they just don't have a system call with that specific name that backs them.
exists both under the linux system call
The way userspace programs communicate with kernel is by using syscall function. All syscall() does is push some number on specific registers and then execute a special interrupt instruction. On the interrupt the execution is transferred to kernel, kernel then reads data from userspace using special registers.
Each syscall has a number and different arguments. User space programs are expected to "find out" arguments for each syscall by for example inspecting the documentation.
A Linux system call is identified by just a number, like __NR_recvfrom, which is equal to 45 on the x86-64 architecture.
C Lib function
A C library function is a function implemented by the C library implementation. So, for example glibc implements recvfrom as a simple wrapper around syscall(__NR_recvfrom, ...). This is C interface the library provides programmers to access kernel related functions. So C programmers wouldn't need to read the documentation for each syscall and have nice C interface to call the kernel.
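A rough sketch of that relationship, assuming a Linux system: the hypothetical raw_recvfrom below calls the kernel by number where the architecture defines SYS_recvfrom, and recvfrom_roundtrip (also a made-up name) sends five bytes through a local socket pair and reads them back with it. A real libc wrapper does more than this (cancellation handling, for example).

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Receives on a socket with the raw system call where one exists. */
long raw_recvfrom(int fd, void *buf, size_t len)
{
#ifdef SYS_recvfrom
    return syscall(SYS_recvfrom, (long)fd, buf, (long)len, 0L, NULL, NULL);
#else
    /* Some architectures multiplex socket calls through socketcall,
     * so fall back to the library function there. */
    return recvfrom(fd, buf, len, 0, NULL, NULL);
#endif
}

/* Sends 5 bytes over a local datagram socket pair and reads them back.
 * Returns the number of bytes received, or -1 on error. */
long recvfrom_roundtrip(void)
{
    int sv[2];
    char buf[16];
    long n = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0)
        return -1;
    if (send(sv[0], "hello", 5, 0) == 5)
        n = raw_recvfrom(sv[1], buf, sizeof buf);
    close(sv[0]);
    close(sv[1]);
    return n;
}
```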
However I cannot find recvfrom in the above link.
Don't use the link then. At best inspect kernel sources under uapi directory.
First of all, functions listed in section (2) are functions. They are different from functions in section (3) in that there is always a system call behind.
Those function will usually do additional work to make them behave like POSIX functions (converting returned value to -1 and errno), or to just make them usable (clone syscall requires libc integration to be useful). Sometimes arguments are passed to a system call differently than function prototype suggests, for example they can be packed into a structure and pointer to that structure can be passed through a register.
Sometimes a new syscall is added to fix some issues of the older syscall. In this case a function can be implemented using the new syscall transparently (see mmap vs mmap2, sys_select vs sys_old_select).
As for recvfrom, socket-related functions are implemented by either their respective syscalls or by a legacy sys_socketcall. For example musl still has this code:
#ifdef SYS_socket
#define __socketcall(nm,a,b,c,d,e,f) syscall(SYS_##nm, a, b, c, d, e, f)
#define __socketcall_cp(nm,a,b,c,d,e,f) syscall_cp(SYS_##nm, a, b, c, d, e, f)
#else
#define __socketcall(nm,a,b,c,d,e,f) syscall(SYS_socketcall, __SC_##nm, \
    ((long [6]){ (long)a, (long)b, (long)c, (long)d, (long)e, (long)f }))
#define __socketcall_cp(nm,a,b,c,d,e,f) syscall_cp(SYS_socketcall, __SC_##nm, \
    ((long [6]){ (long)a, (long)b, (long)c, (long)d, (long)e, (long)f }))
#endif
which tries to use appropriate syscall if available, backing off to socketcall otherwise.
Looking at recvfrom as an example, this function exists both as a Linux system call as well as a C library function.
I was able to find 2 pages for recvfrom:
recvfrom(2) in Linux Programmer's Manual
recvfrom(3p) in POSIX Programmer's Manual
Often, the Linux page also tells how Linux version of the function differs from POSIX one.
They are different from functions in section (3) in that there is always a system call behind [section 2].
Not necessarily. Section 2 is Linux-specific API for user-space applications. Linus Torvalds insists that user-space applications must never be broken because of Linux kernel API changes. glibc or another library normally implement the functions, to maintain the stable user-space API and delegate to the kernel.

What is not allowed in restricted C for ebpf?

From bpf man page:
eBPF programs can be written in a restricted C that is compiled
(using the clang compiler) into eBPF bytecode. Various features are
omitted from this restricted C, such as loops, global variables,
variadic functions, floating-point numbers, and passing structures as
function arguments.
AFAIK the man page it's not updated. I'd like to know what is exactly forbidden when using restricted C to write an eBPF program? Is what the man page says still true?
It is not really a matter of what is “allowed” in the ELF file itself. This sentence means that once compiled into eBPF instructions, your C code may produce code that would be rejected by the verifier. For example, loops have long been rejected in BPF programs, because there was no guarantee that they would terminate (the only workaround was to unroll them at compile time).
So you can basically use pretty much whatever you want in C and produce successfully an ELF object file. But then you want it to pass the verifier. What components will surely result in the verifier complaining? Let's have a look at the list from man page:
Loops: Linux version 5.3 introduces support for bounded loops, so loops now work to some extent. “Bounded loops” means loops for which the verifier has a way to tell they will eventually finish: typically, a for (i = 0; i < CONSTANT; i++) kind loop should work (assuming i is not modified in the block).
Global variables: There has been some work recently to support global variables, but they are processed in a specific way (as single-entry maps) and I have not really experimented with them, so I don't know how transparent this is and if you can simply have global variables defined in your program. Feel free to experiment :).
Variadic functions: Pretty sure this is not supported, I don't see how that would translate in eBPF at the moment.
Floating point numbers: Still not supported.
Passing structure as function arguments: Not supported, although passing pointers to structs should work I think.
If you are interested in this level of detail, you should really have a look at Cilium's documentation on BPF. It is not completely up-to-date (only the very new features are missing), but much more complete and accurate than the man page. In particular, the LLVM section has a list of items that should or should not work in C programs compiled to eBPF. In addition to the aforementioned items, they cite:
(All function needing to be inlined, no function calls) -> This one is outdated, BPF has function calls.
No shared library calls: This is true. You cannot call functions from standard libraries, or functions defined in other BPF programs. You can only call into functions defined in the same BPF programs, or BPF helpers implemented in the kernel, or perform “tail calls”.
Exception: LLVM built-in functions for memset()/memcpy()/memmove()/memcmp() are available (I think they're pretty much the only functions you can call, other than BPF helpers and your other BPF functions).
No const string or arrays allowed (because of how they are handled in the ELF file): I think this is still valid today?
BPF program stack is limited to 512 bytes, so your C program must not result in an executable that attempts to use more.
Additional allowed or good-to-know items are listed. I can only encourage you to dive into it!
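To give a feel for the style these restrictions push you toward, here is a sketch written as ordinary C so that it compiles with any compiler. It is not a complete BPF program (no section annotations, helpers or license), the names span, MAX_BYTES and sum_bytes are invented for the example, and whether a given kernel's verifier would accept the equivalent BPF code is an assumption. It shows a loop with a compile-time constant bound, a bounds check before every read, a struct passed by pointer rather than by value, no floating point and very little stack.

```c
#include <stdint.h>

#define MAX_BYTES 64            /* compile-time constant loop bound */

/* A byte range described by start and one-past-the-end pointers. */
struct span {
    const uint8_t *data;
    const uint8_t *end;
};

/* Sums at most MAX_BYTES bytes. The struct is passed by pointer,
 * not by value, and every read is preceded by a bounds check. */
uint32_t sum_bytes(const struct span *s)
{
    uint32_t sum = 0;

    for (int i = 0; i < MAX_BYTES; i++) {
        if (s->data + i >= s->end)
            break;
        sum += s->data[i];
    }
    return sum;
}
```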
Let's go over these:
Variadic functions, floating-point numbers, and passing structures as function arguments are still not possible. As far as I know, there is no ongoing work to support these.
Global variables should be supported in kernels >=5.2, thanks to a recent patchset by Daniel Borkmann.
Unbounded loops are still unsupported. There is limited support for bounded loops in kernels >=5.3. I'm using "limited" here because a small program may still be rejected if the loop is too large.
The best way to know what the last versions of Linux allow is to read the verifier's code and/or to follow the bpf mailing list (you might also want to follow the netdev mailing list as some patchsets may still land there). I've also found checking patchwork to be quite efficient as it more clearly outlines the state of each patchset.

Getting printf in assembly with only system calls?

I am looking to understand the printf() statement at the assembly level. However, most assembly programs do something like call an external print function whose dependency is met by some other object file that the linker adds on. I would like to know what is inside that print function in terms of system calls and very basic assembly code. I want a piece of assembly code where the only external calls are the system calls, for printf. I'm thinking of something like a disassembled object file. Where can I get something like that??
I would suggest instead to stay first at the C level, and study the source code of some existing C standard library free software implementation on Linux. Look into the source code of musl-libc or of GNU libc (a.k.a. glibc). You'll understand that several intermediate (usually internal) functions are useful between printf and the basic system calls (listed in syscalls(2) ...). Use also strace(1) on a sample C program doing printf (e.g. the usual hello-world example).
In particular, musl-libc has a very readable stdio/printf.c implementation, but you'll need to follow several other C functions there before reaching the write(2) syscall. Notice that some buffering is involved. See also setvbuf(3) & fflush(3). Several answers (e.g. this and that one) explain the chain between functions like printf and system calls (up to kernel code).
I want a piece of assembly code where the only external calls are the system calls, for printf
If you want exactly that, you might start from musl-libc's stdio/printf.c, add any additional source file from musl-libc till you have no more external undefined symbols, and compile all of them with gcc -flto -O2 and perhaps also -S, you probably will finish with a significant part of musl-libc in object (or assembly) form (because printf may call malloc and many other functions!)... I'm not sure it is worth the pain.
You could also statically link your libc (e.g. libc.a). Then the linker will link only the static library members needed by printf (and any other function you are calling).
To be picky, system calls are not actually external calls (your libc write function is actually a tiny wrapper around the raw system call). You could make them using SYSENTER machine instructions (but using vdso(7) is preferable: more portable, and perhaps quicker), and you don't even need a valid stack pointer (on x86_64) to make a system call.
You can write Linux user-level programs without even using the libc; the bones implementation of Scheme is such a program (and you'll find others).
The function printf() is in the standard C library, so it is linked into your program and not copied into it. Dynamically linked libraries save memory because you don't have the exact same code copied in resident memory for every program that uses it.
Think about what printf() does. Interpreting the formatted string and generating the correct output is fairly complex. The series of functions that printf() belongs to also buffers the output. You probably don't really want to re-implement all of this in assembly. The standard C library is omnipresent, and probably available for you.
Maybe you're looking for write(2), which is the system call for unbuffered writes of just bytes to a file descriptor. You'd have to generate the string to print beforehand and format it yourself. (See also open(2) for opening files.)
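As a rough sketch of "format it yourself, then write it", the functions below turn an unsigned integer into decimal digits and hand the bytes to write(2). The names format_uint and print_uint are hypothetical, and the buffer sizes assume a 32-bit unsigned int. Compiling something this small with gcc -S gives assembly where write is the only external call, which may be closer to what you are after than a disassembled printf.

```c
#include <stddef.h>
#include <unistd.h>

/* Writes the decimal digits of v into buf (no terminator) and returns
 * the number of characters produced. Assumes a 32-bit unsigned int,
 * i.e. at most 10 digits. */
size_t format_uint(char *buf, unsigned int v)
{
    char tmp[10];
    size_t n = 0;

    /* Produce digits from least to most significant. */
    do {
        tmp[n++] = (char)('0' + v % 10);
        v /= 10;
    } while (v != 0);

    /* Copy them out in the right order. */
    for (size_t i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];
    return n;
}

/* Prints v followed by a newline using only write(2).
 * Returns the number of bytes written, or -1 on error. */
long print_uint(int fd, unsigned int v)
{
    char line[11];
    size_t n = format_uint(line, v);

    line[n++] = '\n';
    return (long)write(fd, line, n);
}
```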
To disassemble a binary, you can use objdump:
objdump -d binary
where binary is some compiled binary. This gives opcodes and human readable instructions. You probably want to redirect to a file and read elsewhere.
You can disassemble the standard C binary on your system and try to interpret it if you want (strongly not recommended). The problem is that it will be far too complex to understand. Things like printf() were written in C, then compiled and assembled. You can't (within a reasonable number of decades) restore the high level structure from the assembly of a compiled (non-trivial) program. If you really want to try this, good luck.
An easier thing to do is to look at the C source code for printf() itself. The real work is actually done in vfprintf() which is in stdio-common/vfprintf.c of the GNU C library source code.

Implementation of system calls / traps within Linux kernel source

I'm currently learning about operating systems and the use of traps to facilitate system calls within the Linux kernel. I've located the table of the traps in traps.c and the implementation of many of the traps within entry.S.
However, I'm instructed to find an implementation of two system calls in the Linux kernel which utilize traps to implement a system call. Although I can find the definition of the traps themselves, I'm not sure what a "call" to one of these traps within the kernel would look like. Therefore, I'm struggling to find an example of this behavior.
Before anyone asks, yes, this is homework.
As a note, I'm using Github to browse the kernel source, since kernel.org is down:
https://github.com/torvalds/linux/
For the x86 architecture the SYSCALL_VECTOR (0x80) interrupt is used only for 32-bit kernels. You can see the interrupt vector layout in arch/x86/include/asm/irq_vectors.h. The trap_init() function from traps.c is the one that sets the trap handler defined in entry_32.S:
set_system_trap_gate(SYSCALL_VECTOR, &system_call);
For the 64-bit kernels, the new SYSENTER (Intel) or SYSCALL (AMD) instructions are used for performance reasons. The syscall_init() function from arch/x86/kernel/cpu/common.c sets up the "handler" defined in entry_64.S and bearing the same name (system_call).
For the user-space perspective you might want to take a look at this page (a bit outdated for the function/file names).
I'm instructed to find an implementation of two system calls in the Linux kernel which utilize traps to implement a system call
Every system call utilizes a trap (interrupt 0x80 if I recall correctly) so the "kernel" bit will be turned on in PSW, and privileged operations will be available to the processor.
As you mentioned the system calls are specified in entry.S under sys_call_table: and they all start with the "sys" prefix.
You can find the system call function headers in include/linux/syscalls.h, which you can browse here:
http://lxr.linux.no/#linux+v3.0.4/include/linux/syscalls.h
Use lxr (as the comment above has already mentioned) in general in order to browse the source code.
Anyhow, the functions are implemented using SYSCALL_DEFINE1 or other versions of the macro, see
http://lxr.linux.no/#linux+v3.0.4/kernel/sys.c
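If it helps to picture what a table like sys_call_table does, here is a toy user-space model; it is not kernel code, and every name in it (toy_sys_add, toy_sys_mul, toy_sys_call_table, toy_dispatch) is made up. The trap handler's job is modelled as: take the number the caller supplied, check it against the table size, and jump to the matching function, or fail with -ENOSYS.

```c
#include <errno.h>
#include <stddef.h>

/* Every toy "system call" takes three long arguments. */
typedef long (*toy_sys_fn)(long, long, long);

static long toy_sys_add(long a, long b, long c) { (void)c; return a + b; }
static long toy_sys_mul(long a, long b, long c) { (void)c; return a * b; }

/* The table maps a call number to its implementation. */
static const toy_sys_fn toy_sys_call_table[] = {
    toy_sys_add,    /* number 0 */
    toy_sys_mul,    /* number 1 */
};

/* Models what happens after the trap: validate the number, then
 * call through the table. Unknown numbers yield -ENOSYS. */
long toy_dispatch(long nr, long a, long b, long c)
{
    size_t count = sizeof toy_sys_call_table / sizeof toy_sys_call_table[0];

    if (nr < 0 || (size_t)nr >= count)
        return -ENOSYS;
    return toy_sys_call_table[nr](a, b, c);
}
```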
If you're looking for an actual system call, not an implementation of a system call, maybe you want to check some C libraries. Why would a kernel include a system call? (I'm not talking about a system call implementation, I'm talking about e.g. an actual chdir call for example. There is a chdir system call, which is a request for changing the directory and there is a chdir system call implementation which actually changes it and must be somewhere in the kernel). Ok, maybe some kernels do include some syscalls too but that's another story :)
Anyway, if I get your question right, you're not looking for an implementation but an actual call. GNU libc is too complicated for me, but you can try browsing the dietlibc sources. Some examples:
chdir.S
syscalls.h

Resources