Linux system calls vs C lib functions - c

I have a bit of confusion regarding these two, so here are my questions:
The Linux man-pages project lists all these functions:
https://www.kernel.org/doc/man-pages/
Looking at recvfrom as an example, this function exists both as a Linux system call as well as a C library function. Their documentation seems different but they are both reachable using #include <sys/socket.h>.
I don't understand their difference?
I also thought systems calls are defined using hex values which can be implemented in assembly directly, their list is here:
https://syscalls.kernelgrok.com/
However I cannot find recvfrom in the above link. I'm a bit confused between Linux system calls vs C lib functions at this point!
Edit: To add to the questions, a lot of functions are under (3) but not (2), e.g. clean. Does this mean these are done by the C runtime directly, without relying on system calls and the underlying OS?

First, understand that C functions and system calls are two completely different things.
System calls not wrapped in the C library must be called through the syscall function. One example of such a call is gettid (glibc only added a wrapper for it in version 2.30).
To create a gettid system call wrapper with syscall, do this:
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
pid_t gettid(void)
{
    pid_t tid = (pid_t)syscall(SYS_gettid);
    return tid;
}
Here's an excerpt from the NOTES section of the man-page, which explicitly states that this function is not defined within the C library:
NOTES
Glibc does not provide a wrapper for this system call; call it using syscall(2).
recvfrom is a C library wrapper around a system call.
Everything in section (2) is a system call. Everything in section (3) is not. Everything in section (3) (with a few notable exceptions, such as getumask) has a definition in the C library. About half of everything in section (2) does not have a definition (or wrapper) within the C library (with the exception of functions mandated by POSIX, and some other extensions, which all do), as with gettid.
When calling recvfrom in C, the C library calls the kernel to do the syscall.
On 32-bit x86, the syscall function puts the system call number in the %eax register and executes int $0x80; on x86-64 it puts the number in %rax and executes the syscall instruction instead.
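A minimal sketch of that register convention for x86-64, where the number goes in %rax and the instruction is syscall rather than int $0x80. The names raw_getpid and check_raw_getpid are made up for illustration, and on any other architecture the sketch simply falls back to the library's syscall() function.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issues getpid without going through libc's
 * getpid() wrapper. Only the x86-64 branch uses inline assembly. */
long raw_getpid(void)
{
#if defined(__x86_64__)
    long ret;
    /* The kernel reads the syscall number from rax and writes the result
     * back to rax; the syscall instruction clobbers rcx and r11. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"((long)SYS_getpid)
                      : "rcx", "r11", "memory");
    return ret;
#else
    /* Assumption: let libc pick the right instruction elsewhere. */
    return syscall(SYS_getpid);
#endif
}

/* The raw path and the libc wrapper should report the same process ID. */
void check_raw_getpid(void)
{
    assert(raw_getpid() == (long)getpid());
}
```

Calling check_raw_getpid() from a program aborts if the two paths ever disagree.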
The reason you don't see recvfrom in https://syscalls.kernelgrok.com/ is because https://syscalls.kernelgrok.com/ is very, very incomplete.
The reason there are many functions in (3) that you don't see in (2) is that many functions in (3) don't have a system call of their own. They may or may not rely on system calls; they just don't have a system call with that specific name backing them.

exists both under the linux system call
The way userspace programs communicate with the kernel is by making system calls, which the syscall function exposes. All syscall() does is place some values in specific registers and then execute a special trap instruction. On the trap, execution is transferred to the kernel, which then reads the arguments from those registers.
Each syscall has a number and different arguments. User space programs are expected to "find out" arguments for each syscall by for example inspecting the documentation.
A Linux system call is just a number, like __NR_recvfrom, which is equal to 45 on the x86-64 architecture.
C Lib function
A C library function is a function implemented by the C library implementation. So, for example, glibc implements recvfrom as a simple wrapper around syscall(__NR_recvfrom, ...). This is the C interface the library provides to programmers for accessing kernel functionality, so that C programmers don't need to read the documentation for each syscall and instead have a nice C interface for calling the kernel.
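To make the wrapper idea concrete, here is a rough sketch that receives a datagram by calling syscall(SYS_recvfrom, ...) directly instead of recvfrom(). The helper names recvfrom_direct and recvfrom_roundtrip are invented for this example, and the fallback branch assumes that a platform without SYS_recvfrom still has a working recvfrom() wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: receives one datagram, bypassing the recvfrom()
 * wrapper where the platform has a dedicated recvfrom system call. */
long recvfrom_direct(int fd, void *buf, size_t len)
{
#ifdef SYS_recvfrom
    return syscall(SYS_recvfrom, (long)fd, buf, len, 0L, NULL, NULL);
#else
    /* Assumption: no dedicated syscall number here, so use the wrapper. */
    return recvfrom(fd, buf, len, 0, NULL, NULL);
#endif
}

/* Sends "ping" over a local datagram socket pair and reads it back.
 * Returns the number of bytes received, or -1 on failure. */
long recvfrom_roundtrip(void)
{
    int sv[2];
    char buf[16] = {0};
    long n = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0)
        return -1;
    if (send(sv[0], "ping", 4, 0) == 4)
        n = recvfrom_direct(sv[1], buf, sizeof buf);
    close(sv[0]);
    close(sv[1]);
    if (n == 4)
        assert(memcmp(buf, "ping", 4) == 0);
    return n;
}
```

The behaviour matches what recvfrom() would give; the only difference is who loads the syscall number.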
However I cannot find recvfrom in the above link.
Don't use the link then. At best inspect kernel sources under uapi directory.

First of all, functions listed in section (2) are functions. They are different from functions in section (3) in that there is always a system call behind.
Those functions will usually do additional work to make them behave like POSIX functions (converting the returned value to -1 and setting errno), or just to make them usable (the clone syscall requires libc integration to be useful). Sometimes arguments are passed to a system call differently than the function prototype suggests; for example, they can be packed into a structure, and a pointer to that structure can be passed through a register.
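The return-value conversion can be sketched as follows, assuming the Linux convention that the kernel reports failure as a negated errno value between -4095 and -1. The function name raw_result_to_posix is hypothetical; real C libraries do this inside their own internal macros.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper: turns a raw Linux kernel return value into the
 * POSIX convention of "-1 and errno" on failure. */
long raw_result_to_posix(long raw)
{
    if (raw < 0 && raw >= -4095) {
        errno = (int)-raw;   /* the kernel returned -errno */
        return -1;
    }
    return raw;              /* success: pass the value through */
}

/* Usage: a raw -EBADF becomes -1 with errno == EBADF; 3 stays 3. */
void check_conversion(void)
{
    assert(raw_result_to_posix(3) == 3);
    assert(raw_result_to_posix(-EBADF) == -1 && errno == EBADF);
}
```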
Sometimes a new syscall is added to fix some issues of the older syscall. In this case a function can be implemented using the new syscall transparently (see mmap vs mmap2, sys_select vs sys_old_select).
As for recvfrom, socket-related functions are implemented by either their respective syscalls or by a legacy sys_socketcall. For example musl still has this code:
#ifdef SYS_socket
#define __socketcall(nm,a,b,c,d,e,f) syscall(SYS_##nm, a, b, c, d, e, f)
#define __socketcall_cp(nm,a,b,c,d,e,f) syscall_cp(SYS_##nm, a, b, c, d, e, f)
#else
#define __socketcall(nm,a,b,c,d,e,f) syscall(SYS_socketcall, __SC_##nm, \
((long [6]){ (long)a, (long)b, (long)c, (long)d, (long)e, (long)f }))
#define __socketcall_cp(nm,a,b,c,d,e,f) syscall_cp(SYS_socketcall, __SC_##nm, \
((long [6]){ (long)a, (long)b, (long)c, (long)d, (long)e, (long)f }))
#endif
which tries to use appropriate syscall if available, backing off to socketcall otherwise.

Looking at recvfrom as an example, this function exists both as a Linux system call as well as a C library function.
I was able to find 2 pages for recvfrom:
recvfrom(2) in Linux Programmer's Manual
recvfrom(3p) in POSIX Programmer's Manual
Often, the Linux page also tells how Linux version of the function differs from POSIX one.
They are different from functions in section (3) in that there is always a system call behind [section 2].
Not necessarily. Section 2 is Linux-specific API for user-space applications. Linus Torvalds insists that user-space applications must never be broken because of Linux kernel API changes. glibc or another library normally implement the functions, to maintain the stable user-space API and delegate to the kernel.

Related

Are OS libraries written in assembly or in C

I ask this, because I am getting very conflicting definitions of System calls.
On one hand, I have seen the definition that they are an API the OS provides that a user program can call. Since this API is a high level interface, it has to be implemented in a high level language like C.
On the other hand, I have seen that the actual OS syscalls are machine instructions, for which you have to set certain registers to call (according to some compliance standard set by the OS). But this looks nothing like the UNIX APIs like open(), write() and read(), so what is going on here.
I have also read that these high level interfaces are implemented in the C libraries which do the actual assembly code syscalls. In that case, why do we say the OS provides this interface when it is actually provided by the C language. What if I want to perform a UNIX syscall directly to the OS without having to use C?
There are two open functions - one, the syscall open exposed by the operating system (e.g. Linux), and two, the C-library function open, exposed by the C standard library (e.g. glibc).
You can see two different man pages for these functions - run man 2 open to see the man page regarding the syscall, and man 3 open to see the man page regarding the C standard function.
Functions you mentioned like open, write, and read can be confusing - because they exist both as syscalls and as C standard functions. But they are separate entities entirely - in fact, glibc's open function doesn't even use the open syscall - it uses the openat syscall.
On Windows, where the syscall open doesn't even exist - the C standard library function open does still exist, and uses WinAPI's CreateFile behind the scenes.
What if I want to perform a UNIX syscall directly to the OS without
having to use C?
This is possible - indeed, glibc has to do it to implement C standard library functions. But it's tricky, and involves implementing wrappers for the syscalls and sometimes even handcrafting assembly.
If you want to see things for yourself, you can look at how glibc implements open:
int
__libc_open (const char *file, int oflag, ...)
{
  int mode = 0;
  if (__OPEN_NEEDS_MODE (oflag))
    {
      va_list arg;
      va_start (arg, oflag);
      mode = va_arg (arg, int);
      va_end (arg);
    }
  return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag, mode);
}
...
weak_alias (__libc_open, open)
notice that the function ends with a call to the macro SYSCALL_CANCEL, which will end up calling the OS-exposed openat syscall.
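For comparison, a minimal sketch of doing the same thing by hand: opening a path through the openat system call with AT_FDCWD, via the generic syscall() function. The helper names are made up, and the sketch skips the thread-cancellation handling that SYSCALL_CANCEL provides in glibc.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: opens a path read-only using the openat system
 * call, resolved relative to the current working directory. */
int open_readonly_via_openat(const char *path)
{
    return (int)syscall(SYS_openat, (long)AT_FDCWD, path, (long)O_RDONLY, 0L);
}

/* Usage: the root directory can be opened this way just like with open(). */
void check_open_via_openat(void)
{
    int fd = open_readonly_via_openat("/");
    assert(fd >= 0);
    close(fd);
}
```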
Are OS libraries written in assembly or in C
That is a question that cannot really be answered, as it depends. Technically there are no limitations on the implementation (i.e. it can be written in any language, though C is probably the most common, followed by assembly).
The important part here is the ABI. This defines how OS calls can be made.
You can make system calls in assembly (if you know the ABI you can manually write all the code to comply), the C compiler knows the ABI and will automatically generate all the code required to make a call.
Most languages though allow you to make system calls, they will either know the ABI or have a wrapper API that translates the calls from a language call to the appropriate ABI for that OS.
I ask this, because I am getting very conflicting definitions of System calls.
The definitions will depend on the context. You will have to give examples of what the definitions are AND in what context they are being used.
On one hand, I have seen the definition that they are an API the OS provides that a user program can call.
Sure this is one way to look at it.
More strictly, I would say the OS provides a set of interfaces that can be used to perform privileged tasks. Now those interfaces can be exposed via an API provided by a particular environment that makes them easier to use.
Since this API is a high level interface, it has to be implemented in a high level language like C.
Sort of true.
That an environment exposes an API does not mean that it needs a high level language (and C is not a high level language; it is one step above assembly and is considered a low level language). And just because it is exposed by the language does not mean it is implemented in that language.
On the other hand, I have seen that the actual OS syscalls are machine instructions, for which you have to set certain registers to call (according to some compliance standard set by the OS).
OK. Here we have moved from System Calls to syscalls. We should be very careful on how we use these terms to make sure we are not conflating different terms.
I would (and this is a bit abstract still) think about the computer as several levels of abstraction:
Hardware
------ --------------
syscalls
OS --------------
System Calls (read/write etc..)
------ --------------
Language Interface (read/write etc..)
You can poke the hardware directly if you want (if you know how), but it is better if you can make syscalls (if you know how), better still to use the OS System Calls, which use a well defined ABI, and best to use the language interface (what you would call the API) to call the underlying System Calls.
But this looks nothing like the UNIX APIs like open(), write() and read(), so what is going on here.
Here the UNIX OS provides the open/close/read interface.
The C libraries provide a very thin API wrapper interface above the OS System Calls. The C compiler will then generate the correct instructions to call the System Calls using the correct ABI, which in turn will call the next layer down in the OS to use the syscalls.
I have also read that these high level interfaces are implemented in the C libraries which do the actual assembly code syscalls.
The high level interface can be written in any language. But the C one is so easy to use that most other languages don't bother doing it themselves but simply call via the C interface.
It's VERRRY rare to ever directly write something in assembly. By writing in C you can compile it for many different CPU architectures whereas by writing in assembly you are basically stuck with one specific architecture. Most operating systems are written in C. We say the OS provides the interface because you are interacting with the operating system which happens to be written in C.

Why was a readdir function added to POSIX library interface when there is a readdir kernel function?

I was surprised to discover the man pages having entries for two conflicting variants of readdir.
in READDIR(2), it specifically states you do not want to use it:
This is not the function you are interested in. Look at readdir(3) for the POSIX conforming C library interface. This page documents the bare kernel system call interface, which is superseded by getdents(2).
I understand a function may become deprecated when another function comes along and does its job better, but I am not familiar with other cases of a userspace function coming in and replacing a kernel function of the same name. Is there a known reason it was chosen to go this route rather than coming up with a new function name (as the man page mentions getdents did when superseding readdir).
The programming interface, POSIX, is stable. You don't just go replacing functions in it unnecessarily because you want to implement the backend more efficiently. The Linux syscall readdir never implemented the readdir function because it has the wrong signature; it was an old, inefficient backend for implementing the readdir function. When a better backend came along, it was obsolete.
You have it completely backwards: it's the library function readdir(3) which predates Linux and its readdir(2) system call, and not the reverse.
Naming the syscall that way was certainly a poor decision, and probably has a story behind it, but it's pretty much irrelevant now, as nobody is using it.
On Unix, directories used to be simple files formatted in a special way, and the system call interface through which they were read was just read(2) [1]. Later systems introduced system calls like getdirentries (44BSD) and getdents (SVR3), but they weren't willing or capable to standardize on an interface, so we're still stuck with the high level and broken [2] readdir(3) library function as the only standard interface for reading a directory.
[1] On some systems like BSD you can still cat a directory, at least when using the default filesystem (FFS).
[2] it's broken because it's not signal safe, and it returns NULL for both error and EOF, which means that the only way it could be safely used is by first setting errno to 0, and checking both its return value and errno afterwards. Yuck.

Are system calls sent directly to the kernel?

I have a couple of assumptions, most likely some of them will be incorrect. Please correct me where they are wrong.
We could categorize the functions in a program written in C as follows:
Functions that are sent to dynamically loaded libraries:
These are sent to the library that translates them in to multiple standard C-functions
The library passes them on to libc where they are translated into multiple system calls.
Libc passes those on to the kernel where they are executed and the returns are sent back to libc.
Libc will collect the returns, group them by C function and use them to create 1 return for each C function. These returns will be sent back to the dynamically loaded library.
This library will collect all returns and use them to create 1 return that is sent back to the original program.
Functions that are either defined in the code or part of statically compiled libraries: Everything is the same as the category above but:
The program already does the translation into standard C functions where they are known, or into functions calling dynamically loaded libraries in the other case.
The standard C functions are sent to libc, the others to the dynamically loaded libraries (where they will be handled as above).
The creation of 1 final return based on the returns from both types of functions will be done by the program
Functions that are standard C functions: They will just be sent to libc which will communicate with the kernel in the same way as above and 1 return will be sent to the program
Functions that are system calls: They are NOT sent directly to the kernel but have to pass to libc although it doesn't do any extra work.
Security checks (permissions, writing to unallocated mem, ...) are always done by the kernel, although libc and libraries above might also check it first.
All POSIX-compliant systems follow these rules
It might not be the same on Linux and on some other POSIX system (like FreeBSD).
On Linux, the ABI defines how a system call is done. Read about Linux kernel interfaces. The system calls are listed in syscalls(2) (but see also /usr/include/asm*/unistd.h ...). Read also vdso(7). The assembler HowTo explains more details, but for 32 bits i686 only.
Most Linux libc are free software, you can study their source code. IMHO the source code of musl-libc is very readable.
To simplify a tiny bit, most system calls (e.g. write(2)) are small C functions in the libc which:
call the kernel using a dedicated machine instruction (SYSCALL on x86-64; SYSENTER or int $0x80 on 32-bit x86), taking care of passing the system call number and its arguments with the kernel convention, which is not the usual C ABI. What the kernel considers as a system call is only that machine instruction (and conventions about it).
handle the failure case, by passing it to errno(3) and returning -1.
(On Linux, failure is indicated by a return value between -4095 and -1, which is the negated errno code; setting the carry flag on failure is the convention on the BSDs, not on Linux.)
handle the success case, by returning a result.
You could invoke system calls without libc, with some assembler code. This is unusual, but has been done (e.g. in BusyBox or in Bones).
So the libc code for write is doing some tiny extra work (passing arguments, handling failure & errno and success cases).
A few system calls (such as clock_gettime and gettimeofday) avoid the overhead of the system call instruction (and of the user-mode -> kernel-mode switch) thanks to the vDSO.
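A rough way to see both paths side by side: the sketch below reads CLOCK_MONOTONIC once through the libc wrapper (which may be served by the vDSO) and once through syscall(), which forces a real kernel entry. It assumes a 64-bit Linux target where struct timespec has the layout the kernel expects for SYS_clock_gettime; the function names are made up.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if both calls succeed and the second
 * reading is not earlier than the first, otherwise 0. */
int clock_paths_agree(void)
{
    struct timespec via_libc, via_syscall;

    /* libc wrapper: may be answered in user space by the vDSO. */
    if (clock_gettime(CLOCK_MONOTONIC, &via_libc) != 0)
        return 0;
    /* Explicit system call: always enters the kernel. */
    if (syscall(SYS_clock_gettime, (long)CLOCK_MONOTONIC, &via_syscall) != 0)
        return 0;

    if (via_syscall.tv_sec != via_libc.tv_sec)
        return via_syscall.tv_sec > via_libc.tv_sec;
    return via_syscall.tv_nsec >= via_libc.tv_nsec;
}

/* Usage: both paths read the same monotonic clock. */
void check_clock_paths(void)
{
    assert(clock_paths_agree() == 1);
}
```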
No, you can't categorize things like that. When you program in C (and it makes no difference in almost all other languages), there are only functions, and whatever the real status of these is, you call them in exactly the same way. This is defined by the ABI (how to pass parameters, get returned values, etc.) and enforced by the compiler/linker. Of course some functions are just stubs, for example stubs for shared library functions (stubs may be needed to load the library, dynamically link to the real function, etc.) or for system calls (this is more technical and differs from kernel to kernel). But from the viewpoint of your program everything is the same (this is why it is hard to understand the difference between fread and read at the beginning: you call them the same way, they do almost the same job, so what's the difference?).
POSIX doesn't say a single word about kernels... It just lists the C (and formerly Ada) API of a set of functions with minimal semantics (plus some commands, tools, etc.). Implementations are totally free in how they provide these.

Different ways to invoke system calls

In some code, I can see system calls invoked in a strange way. Take sched_yield as an example:
#define __NR_sys_sched_yield __NR_sched_yield
inline _syscall0(void, sys_sched_yield);
And then we can use sys_sched_yield().
I'm curious what's the difference between using sched_yield directly and this way.
In src/include/asm/unistd, _syscall0 is defined:
#define _syscall0(type,name) \
type name(void) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
: "=a" (__res) \
: "0" (__NR_##name)); \
__syscall_return(type,__res); \
}
Presumably that's for systems where sched_yield might not be available.
As for differences, sched_yield returns -1 on error and sets errno, while this implementation presumably returns the raw value from the kernel. Can't tell for sure since you haven't provided the definition of _syscall0, which must be a macro.
This is linux which uses glibc. BSD has a sched_yield but has its own libc.
This isn't strange. The _syscall0 macro issues an assembler int 0x80 instruction with the syscall number in the eax register [32-bit x86 architecture]. This is the standard syscall interface for linux. Under the hood, all linux glibc functions that are syscall wrappers will do this [or use the more modern sysenter/sysexit x86 instruction pairings, or the syscall instruction on x86-64].
glibc has a tendency to "usurp" syscalls in its wrapper functions and add stuff around them. For example, when you call fork, it [ultimately] calls __libc_fork which does a huge amount of extra stuff related to threads and file closure, etc.
Generally, glibc makes good choices. But, sometimes highly experienced linux application programmers want the raw syscall behavior, particularly if they're writing system utilities or programs/libraries that must have close interaction with the kernel, device drivers, or device hardware.
Actually, __libc_fork doesn't invoke the fork syscall, it invokes the clone syscall, which is a [harder to use] superset of fork. But, the plain old fork syscall still exists. So, if you want that, you need the macro stuff--and I'll bet there's a sys_fork definition somewhere.
On the other hand, glibc might implement sched_yield ala POSIX as a nop returning -1 and setting errno to ENOSYS. I just checked latest glibc source and I couldn't find the "real" implementation, except for mach. It probably does do the real thing, I just couldn't find it.
Sometimes, linux has a syscall, but glibc doesn't want to support it, or they consider it to be too dangerous for an application programmer, so they leave out the wrapper function. So, the macros are a way to "end around" glibc.
The probable reason for glibc implementing sched_yield as a nop, posix aside, is they consider it "bad" and probably tell you to use nanosleep instead. I've used both and they are not the same, depending on your use case and desired effect.
Sometimes, you need to do the raw, inline syscall. For example, the ELF loader [every system that supports ELF binaries must have one and linux's is ld-linux.so] is invoked by the kernel to load an ELF binary. It must operate before glibc.so is available, because it is what actually links in glibc.so. So the ELF loader must have some builtin syscalls for open and read.
Also, most systems have a syscall library function that takes a variable number of arguments. You could implement:
#define my_sched_yield() syscall(__NR_sched_yield)
#define my_read(_fd,_buf,_len) syscall(__NR_read,_fd,_buf,_len)
This function handles the kernel's syscall return value/error and sets errno. That's what the __syscall_return macro had to do.
The __NR_* prefix is what linux uses, but other systems have AUE_* or SYS_*
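As a small sketch of that approach on Linux, the function below yields the CPU through syscall() and compares it with the ordinary sched_yield() wrapper. It uses the SYS_* spelling from <sys/syscall.h>, which glibc defines alongside __NR_*; the function names are invented for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: yields the CPU through the generic syscall()
 * entry point. On failure syscall() returns -1 and sets errno. */
long yield_via_syscall(void)
{
    return syscall(SYS_sched_yield);
}

/* Usage: both the wrapper and the generic path report success. */
void check_yield_paths(void)
{
    assert(sched_yield() == 0);
    assert(yield_via_syscall() == 0);
}
```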

How does C compiler decide whether to call library function or system call

I know that read is a system call. But when I read man 2 and man 3 for read, they show different explanations, so I suspect that read exists as both a library function and a system call. In that case, if I use read in my C program, will the compiler treat read as a library function or as a system call? Please help me with this confusion.
It doesn't. System calls are present in libc (the C standard library) just like library functions are. The implementations of system calls in libc are just "stubs" which invoke system-specific methods of calling into the kernel.
I'm assuming you're on Linux. On that platform, the manpage read(2) describes the Linux system call, while read(3) describes the POSIX specification for read, if you have the POSIX manpages installed. The latter is in category 3 because POSIX doesn't specify a difference between system calls and library functions.
There's only one read in libc, which is (a thin wrapper around) the system call.

Resources