Different ways to invoke system calls - c

In some code, I can see system calls being invoked in a strange way. Take sched_yield as an example:
#define __NR_sys_sched_yield __NR_sched_yield
inline _syscall0(void, sys_sched_yield);
And then we can use sys_sched_yield().
I'm curious what's the difference between using sched_yield directly and this way.
In src/include/asm/unistd, _syscall0 is defined:
#define _syscall0(type,name) \
type name(void) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
: "=a" (__res) \
: "0" (__NR_##name)); \
__syscall_return(type,__res); \
}

Presumably that's for systems where sched_yield might not be available.
As for differences, sched_yield returns -1 on error and sets errno, while this implementation presumably returns the raw value from the kernel. Can't tell for sure since you haven't provided the definition of _syscall0, which must be a macro.
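To make the "raw value versus -1 and errno" distinction concrete, here is a minimal sketch of the conversion a __syscall_return-style macro has to perform. The helper name raw_to_libc_result is hypothetical. It assumes the Linux convention that the kernel reports failure as a negated error code in the range -4095 to -1.

```c
#include <errno.h>

/* Hypothetical helper: turn a raw kernel return value into the
 * libc-style result. Values in [-4095, -1] are treated as negated
 * errno codes (assumed Linux convention); anything else is success. */
static long raw_to_libc_result(long raw)
{
    if ((unsigned long)raw >= (unsigned long)-4095L) {
        errno = (int)-raw;   /* e.g. a raw -2 becomes errno == ENOENT */
        return -1;
    }
    return raw;              /* success: pass the value through unchanged */
}
```

A caller that invokes the kernel directly sees the raw negative value, while a caller going through a libc wrapper sees only -1 and has to inspect errno.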

This is linux which uses glibc. BSD has a sched_yield but has its own libc.
This isn't strange. The _syscall0 macro issues an assembler int 0x80 instruction with the syscall number in the eax register [32-bit x86 architecture]. This is the traditional syscall interface for linux on that architecture. Under the hood, all linux glibc functions that are syscall wrappers will do this [or use the more modern sysenter/sysexit x86 instruction pairing, or the syscall instruction on x86-64].
glibc has a tendency to "usurp" syscalls in its wrapper functions and add stuff around them. For example, when you call fork, it [ultimately] calls __libc_fork which does a huge amount of extra stuff related to threads and file closure, etc.
Generally, glibc makes good choices. But, sometimes highly experienced linux application programmers want the raw syscall behavior, particularly if they're writing system utilities or programs/libraries that must have close interaction with the kernel, device drivers, or device hardware.
Actually, __libc_fork doesn't invoke the fork syscall, it invokes the clone syscall, which is a [harder to use] superset of fork. But, the plain old fork syscall still exists. So, if you want that, you need the macro stuff--and I'll bet there's a sys_fork definition somewhere.
On the other hand, glibc might implement sched_yield ala POSIX as a nop returning -1 and setting errno to ENOSYS. I just checked latest glibc source and I couldn't find the "real" implementation, except for mach. It probably does do the real thing, I just couldn't find it.
Sometimes, linux has a syscall, but glibc doesn't want to support it, or they consider it to be too dangerous for an application programmer, so they leave out the wrapper function. So, the macros are a way to "end around" glibc.
The probable reason for glibc implementing sched_yield as a nop, posix aside, is they consider it "bad" and probably tell you to use nanosleep instead. I've used both and they are not the same, depending on your use case and desired effect.
Sometimes, you need to do the raw, inline syscall. For example, the ELF loader [every system that supports ELF binaries must have one and linux's is ld-linux.so] is invoked by the kernel to load an ELF binary. It must operate before glibc.so is available, because it is what actually links in glibc.so, so the ELF loader must have some builtin syscalls for open and read.
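As an illustration of what such a builtin, inline syscall can look like, here is a rough sketch for x86-64, where the syscall instruction is used instead of int 0x80. The function name raw_syscall3 is hypothetical. On other architectures the sketch simply falls back to the libc syscall() function and negates errno, so that both paths return the kernel-style raw value.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue a system call with up to three arguments
 * and return the raw kernel result (a negated errno code on failure). */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__x86_64__)
    long ret;
    /* x86-64 convention: number in rax, arguments in rdi, rsi, rdx;
     * the syscall instruction clobbers rcx and r11. */
    __asm__ volatile ("syscall"
                      : "=a" (ret)
                      : "0" (nr), "D" (a1), "S" (a2), "d" (a3)
                      : "rcx", "r11", "memory");
    return ret;
#else
    /* Fallback for other architectures: go through libc. */
    long ret = syscall(nr, a1, a2, a3);
    return ret == -1 ? -(long)errno : ret;
#endif
}
```

For example, raw_syscall3(SYS_getpid, 0, 0, 0) returns the process ID, and raw_syscall3(SYS_close, -1, 0, 0) returns -EBADF without touching errno on the inline path.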
Also, most systems have a syscall library function that takes a variable number of arguments. You could implement:
#define my_sched_yield() syscall(__NR_sched_yield)
#define my_read(_fd,_buf,_len) syscall(__NR_read,_fd,_buf,_len)
This function handles the kernel's syscall return value/error and sets errno. That's what the __syscall_return macro had to do.
The __NR_* prefix is what linux uses, but other systems have AUE_* or SYS_*
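The macros above can also be written as small functions. The following is only a sketch, assuming Linux with glibc (or another libc that provides syscall() and the SYS_* names); close_via_syscall is a hypothetical helper added to show the errno handling.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Function form of the my_sched_yield() macro: bypass the libc wrapper
 * but still let syscall() translate errors into -1 and errno. */
static int my_sched_yield(void)
{
    return (int)syscall(SYS_sched_yield);
}

/* Hypothetical helper: close a descriptor through syscall(). On failure
 * syscall() returns -1 and sets errno, just like the close() wrapper. */
static int close_via_syscall(int fd)
{
    return (int)syscall(SYS_close, fd);
}
```

Calling close_via_syscall(-1) fails with errno set to EBADF, which is the same behavior a caller would see from close(-1).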

Related

Linux system calls vs C lib functions

I have a bit of confusion regarding these two, so here are my questions:
The Linux man-pages project lists all these functions:
https://www.kernel.org/doc/man-pages/
Looking at recvfrom as an example, this function exists both as a Linux system call as well as a C library function. Their documentation seems different but they are both reachable using #include <sys/socket.h>.
I don't understand their difference?
I also thought systems calls are defined using hex values which can be implemented in assembly directly, their list is here:
https://syscalls.kernelgrok.com/
However I cannot find recvfrom in the above link. I'm a bit confused between Linux system calls vs C lib functions at this point!
Edit: To add to the questions, a lot of functions are under (3) but not (2), e.g. clean. Does this mean these are done by the C runtime directly, without relying on system calls and the underlying OS?
First, understand that C functions and system calls are two completely different things.
System calls not wrapped in the C library must be called through the syscall function. One example of such a call is gettid.
To create a gettid system call wrapper with syscall, do this:
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
pid_t gettid(void)
{
pid_t tid = (pid_t)syscall(SYS_gettid);
return tid;
}
Here's an excerpt from the NOTES section of the man-page, which explicitly states that this function is not defined within the C library:
NOTES
Glibc does not provide a wrapper for this system call; call it using syscall(2).
recvfrom is a C library wrapper around a system call.
Everything in section (2) is a system call. Everything in section (3) is not. Everything in section (3) (with a few notable exceptions, such as getumask) has a definition in the C library. About half of everything in section (2) does not have a definition (or wrapper) within the C library (with the exception of functions mandated by POSIX, and some other extensions, which all do), as with gettid.
When calling recvfrom in C, the C library calls the kernel to do the syscall.
The syscall function is the function that puts the system call number in the %eax register and uses int $0x80 (on 32-bit x86; other architectures use different registers and instructions).
The reason you don't see recvfrom in https://syscalls.kernelgrok.com/ is because https://syscalls.kernelgrok.com/ is very, very incomplete.
The reason there are many functions in (3) that you don't see in (2) is because many functions on (3) don't have a system call. They may or may not rely on system calls, they just don't have a system call with that specific name that backs them.
exists both under the linux system call
The way userspace programs communicate with kernel is by using syscall function. All syscall() does is push some number on specific registers and then execute a special interrupt instruction. On the interrupt the execution is transferred to kernel, kernel then reads data from userspace using special registers.
Each syscall has a number and different arguments. User space programs are expected to "find out" arguments for each syscall by for example inspecting the documentation.
A Linux system call is just a number, like __NR_recvfrom, which is equal to 45 on the x86-64 architecture.
C Lib function
A C library function is a function implemented by the C library implementation. So, for example, glibc implements recvfrom as a simple wrapper around syscall(__NR_recvfrom, ...). This is the C interface the library provides to programmers to access kernel-related functions, so C programmers don't need to read the documentation for each syscall and have a nice C interface for calling the kernel.
However I cannot find recvfrom in the above link.
Don't use the link then. At best inspect kernel sources under uapi directory.
First of all, functions listed in section (2) are functions. They are different from functions in section (3) in that there is always a system call behind.
Those function will usually do additional work to make them behave like POSIX functions (converting returned value to -1 and errno), or to just make them usable (clone syscall requires libc integration to be useful). Sometimes arguments are passed to a system call differently than function prototype suggests, for example they can be packed into a structure and pointer to that structure can be passed through a register.
Sometimes a new syscall is added to fix some issues of the older syscall. In this case a function can be implemented using the new syscall transparently (see mmap vs mmap2, sys_select vs sys_old_select).
As for recvfrom, socket-related functions are implemented by either their respective syscalls or by a legacy sys_socketcall. For example musl still has this code:
#ifdef SYS_socket
#define __socketcall(nm,a,b,c,d,e,f) syscall(SYS_##nm, a, b, c, d, e, f)
#define __socketcall_cp(nm,a,b,c,d,e,f) syscall_cp(SYS_##nm, a, b, c, d, e, f)
#else
#define __socketcall(nm,a,b,c,d,e,f) syscall(SYS_socketcall, __SC_##nm, \
((long [6]){ (long)a, (long)b, (long)c, (long)d, (long)e, (long)f }))
#define __socketcall_cp(nm,a,b,c,d,e,f) syscall_cp(SYS_socketcall, __SC_##nm, \
((long [6]){ (long)a, (long)b, (long)c, (long)d, (long)e, (long)f }))
#endif
which tries to use appropriate syscall if available, backing off to socketcall otherwise.
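The same "use the dedicated syscall if the headers define it" pattern can be sketched in application code. This is a simplified illustration, not musl's implementation: the helper name recv_raw is hypothetical, and the fallback here goes to the libc recv() function rather than to socketcall.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: receive data with the dedicated recvfrom system
 * call when SYS_recvfrom is defined for this architecture, otherwise
 * let the C library pick whatever mechanism it uses. */
static long recv_raw(int fd, void *buf, size_t len)
{
#ifdef SYS_recvfrom
    /* recvfrom(fd, buf, len, flags, src_addr, addrlen) */
    return syscall(SYS_recvfrom, fd, buf, len, 0, NULL, NULL);
#else
    return (long)recv(fd, buf, len, 0);
#endif
}
```

Either branch returns the number of bytes received, or -1 with errno set, so callers do not need to know which path was compiled in.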
Looking at recvfrom as an example, this function exists both as a Linux system call as well as a C library function.
I was able to find 2 pages for recvfrom:
recvfrom(2) in Linux Programmer's Manual
recvfrom(3p) in POSIX Programmer's Manual
Often, the Linux page also tells how Linux version of the function differs from POSIX one.
They are different from functions in section (3) in that there is always a system call behind [section 2].
Not necessarily. Section 2 is Linux-specific API for user-space applications. Linus Torvalds insists that user-space applications must never be broken because of Linux kernel API changes. glibc or another library normally implement the functions, to maintain the stable user-space API and delegate to the kernel.

Are system calls directly send to the kernel?

I have a couple of assumptions, most likely some of them will be incorrect. Please correct me where they are wrong.
We could categorize the functions in a program written in C as follows:
Functions that are sent to dynamically loaded libraries:
These are sent to the library that translates them in to multiple standard C-functions
The library passes them on to libc where they are translated into multiple system calls.
Libc passes those on to the kernel where they are executed and the returns are sent back to libc.
Libc will collect the returns, group them by C function and use them to create 1 return for each C function. These returns will be sent back to the dynamically loaded library.
This library will collect all returns and use them to create 1 return that is sent back to the original program.
Functions that are either defined in the code or part of statically compiled libraries: Everything is the same as the category above but:
The program already does the translation into standard C functions where they are known, or into functions calling dynamically loaded libraries in the other case.
The standard C functions are sent to libc, the others to the dynamically loaded libraries (where they will be handled as above).
The creation of 1 final return based on the returns from both types of functions will be done by the program
Functions that are standard C functions: They will just be sent to libc which will communicate with the kernel in the same way as above and 1 return will be sent to the program
Functions that are system calls: They are NOT sent directly to the kernel but have to pass to libc although it doesn't do any extra work.
Security checks (permissions, writing to unallocated mem, ...) are always done by the kernel, although libc and libraries above might also check it first.
All POSIX-compliant systems follow these rules
It might not be the same on Linux and on some other POSIX system (like FreeBSD).
On Linux, the ABI defines how a system call is done. Read about Linux kernel interfaces. The system calls are listed in syscalls(2) (but see also /usr/include/asm*/unistd.h ...). Read also vdso(7). The assembler HowTo explains more details, but for 32 bits i686 only.
Most Linux libc are free software, you can study their source code. IMHO the source code of musl-libc is very readable.
To simplify a tiny bit, most system calls (e.g. write(2)) are small C functions in the libc which:
call the kernel using SYSENTER machine instruction (and take care of passing the system call number and its arguments with the kernel convention, which is not the usual C ABI). What the kernel considers as a system call is only that machine instruction (and conventions about it).
handle the failure case, by passing it to errno(3) and returning -1.
(On Linux, the kernel signals failure by returning a value in the range -4095 to -1, which is the negated error code; the carry-flag convention is used by some other kernels, such as the BSDs.)
handle the success case, by returning a result.
You could invoke system calls without libc, with some assembler code. This is unusual, but has been done (e.g. in BusyBox or in Bones).
So the libc code for write is doing some tiny extra work (passing arguments, handling failure & errno and success cases).
Some few system calls (probably getpid & clock_gettime) avoid the overhead of the SYSENTER machine instruction (and user-mode -> kernel-mode switch) thanks to vDSO.
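A small sketch can show that the libc wrapper and an explicit system call reach the same kernel clock, whichever path the wrapper takes internally. This assumes a 64-bit Linux system where SYS_clock_gettime fills the same struct timespec that libc uses; the function name is hypothetical, and the code cannot tell whether the vDSO was actually used.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical check: read CLOCK_MONOTONIC through the libc wrapper
 * (which may be served by the vDSO), then through an explicit system
 * call, and report whether the second reading is not earlier. */
static int raw_clock_not_behind_libc(void)
{
    struct timespec via_libc, via_kernel;

    if (clock_gettime(CLOCK_MONOTONIC, &via_libc) != 0)
        return 0;
    if (syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &via_kernel) != 0)
        return 0;

    if (via_kernel.tv_sec != via_libc.tv_sec)
        return via_kernel.tv_sec > via_libc.tv_sec;
    return via_kernel.tv_nsec >= via_libc.tv_nsec;
}
```

Timing many iterations of each call is one way to observe the overhead difference, though the numbers depend heavily on the machine and kernel configuration.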
No, you can't categorize things like that. When you program in C (and it makes no difference in almost all other languages), there are only functions, and whatever the real status of these is, you call them exactly the same way. This is defined by the ABI (how to pass parameters, get returned values, etc.) and enforced by the compiler/linker. Of course some functions are just stubs: for example stubs for shared library functions (stubs may be needed to load the library, dynamically link to the real function, etc.) or for system calls (this is more technical and differs from kernel to kernel). But from the viewpoint of your program everything is the same (this is why it is hard at the beginning to understand the difference between fread and read: you call them the same way, they do almost the same job, so what's the difference?).
POSIX doesn't say a single word about kernels... It just lists the C (and formerly Ada) API of a set of functions with minimal semantics (plus some commands, tools, etc.). How these are implemented is left entirely open.

Where is the definition of the POSIX function "stat" on Linux?

On Windows, stat and pretty much all other C/POSIX functions Windows supplies are defined in msvcrt.dll, which is the C runtime library.
On Linux, I know a lot of POSIX C functions are system calls. I also know when linking a program, you can't have undefined references. I have searched all so files in /lib and /usr/lib for the symbol stat or "mangled/prefixed" form but have not found anything. This is the command I used:
objdump -T /lib/*.so* /usr/lib/*.so* | grep "stat"
It didn't turn up the stat I was looking for.
So my question becomes: where is it, and any other "system calls" defined?
On my Linux machine, I can find the stat (weak) symbol and __stat (non-weak) in /usr/lib/libc.a
You might make linux kernel system calls without even using the libc (but this is probably a bad practice). The Linux Assembly Howto explains (in its chapters 5 & 6) how to do that (on x86 Linux 32 bits at least).
But I think it is a bad idea. Going thru the libc is good practice, and might even be faster (because e.g. of VDSO), and is more portable.
First of all stat is ambiguous; there's a stat syscall and there is a function stat that can be called from user space which calls the syscall. That last function is (on my system at least) defined in /usr/include/sys/stat.h (that's right, it's in the header file). It actually has several definitions (all one liners that call a different function, like e.g. __fxstat) of which one is chosen depending on compiler and system and whatnots.
Anyhow, stat (and other syscalls) are just wrappers that call the kernel (usually with a lot of orchestration). That is why I was initially confused about what you meant. I hope though, I could help despite my unhelpful first comment.
You can call it with syscall(2)
#include <sys/syscall.h>
...
syscall(SYS_stat, path, buf);
see also Linux syscall reference: http://syscalls.kernelgrok.com/
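A slightly fuller sketch of the syscall(SYS_stat, ...) call is given below. It is restricted to x86-64, on the assumption that the kernel's struct stat layout matches the C library's there; on other architectures SYS_stat may not exist or may use a different structure, so the sketch falls back to the ordinary stat() wrapper. The helper name is_directory is hypothetical.

```c
#define _GNU_SOURCE
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if path is a directory, 0 if it is
 * something else, and -1 if the stat call fails (errno is set). */
static int is_directory(const char *path)
{
    struct stat st;

#if defined(__x86_64__) && !defined(__ILP32__) && defined(SYS_stat)
    /* Direct system call, bypassing the libc stat() wrapper. */
    if (syscall(SYS_stat, path, &st) != 0)
        return -1;
#else
    /* Portable fallback through the C library. */
    if (stat(path, &st) != 0)
        return -1;
#endif
    return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

For instance, is_directory("/") is expected to return 1, and a nonexistent path gives -1.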

How to force gcc use int for system calls, not sysenter?

Is it possible to force gcc to use the int instruction for all system calls, instead of sysenter? This question may sound strange, but I have to compile some projects like Python and Firefox this way.
Summary
Thanks to jbcreix, I've downloaded glibc 2.9 source code, and modified the lines in sysdeps/unix/sysv/linux/i386/sysdep.h, to disable use of sysenter by #undef I386_USE_SYSENTER, and it works.
Recompile your C library after replacing sysenter by int 0x80 in syscall.s and link again.
This is not compiler generated code which means you are lucky.
The ultimate origin of the actual syscall is here, as the OP says:
http://cvs.savannah.gnu.org/viewvc/libc/sysdeps/unix/sysv/linux/i386/sysdep.h?root=libc&view=markup
And as I suspected there really was a syscall.S it's just that the glibc sources are a labyrinth.
http://cvs.savannah.gnu.org/viewvc/libc/sysdeps/unix/sysv/linux/i386/syscall.S?root=libc&view=markup
So I think he got it right, asveikau.
You don't modify gcc; you modify libc (or more accurately, recompile it) and the kernel. gcc doesn't emit sysenter instructions; it generates calls to the generic syscall(2) interface, which presents a unified front end to system call entry and exit.
Or, you could use a Pentium; SYSENTER wasn't introduced until PII =]. Note the following KernelTrap link for the interesting methods used by Linux: http://kerneltrap.org/node/531

LINUX: Is it possible to write a working program that does not rely on the libc library?

I wonder if I could write a program in the C-programming language that is executable, albeit not using a single library call, e.g. not even exit()?
If so, it obviously wouldn't depend on libraries (libc, ld-linux) at all.
I suspect you could write such a thing, but it would need to have an endless loop at the end, because you can't ask the operating system to exit your process. And you couldn't do anything useful.
Well start with compiling an ELF program, look into the ELF spec and craft together the header, the program segments and the other parts you need for a program. The kernel would load your code and jump to some initial address. You could place an endless loop there. But without knowing some assembler, that's hopeless from the start on anyway.
The start.S file as used by glibc may be useful as a start point. Try to change it so that you can assemble a stand-alone executable out of it. That start.S file is the entry point of all ELF applications, and is the one that calls __libc_start_main which in turn calls main. You just change it so it fits your needs.
Ok, that was theoretical. But now, what practical use does that have?
Answer to the Updated Question
Well. There is a library called libgloss that provides a minimal interface for programs that are meant to run on embedded systems. The newlib C library uses that one as its system-call interface. The general idea is that libgloss is the layer between the C library and the operating system. As such, it also contains the startup files that the operating system jumps into. Both these libraries are part of the GNU binutils project. I've used them to do the interface for another OS and another processor, but there does not seem to be a libgloss port for Linux, so if you call system calls, you will have to do it on your own, as others already stated.
It is absolutely possible to write such programs in the C programming language. The linux kernel is a good example of such a program. But user programs are possible too. What is minimally required is a runtime library (if you want to do any serious stuff). It would contain really basic functions like memcpy, basic macros and so on. The C Standard has a special conformance mode called freestanding, which requires only a very limited set of functionality, suitable also for kernels. Actually, I have no clue about x86 assembler, but I've tried my luck with a very simple C program:
/* gcc -nostdlib start.c
 * (32-bit x86 code; on an x86-64 system try adding -m32)
 */
int main(int, char**, char**);

void _start(int args)
{
    /* we do not care about arguments for main. start.S in
     * glibc documents how the kernel passes them though.
     */
    int c = main(0,0,0);

    /* do the system-call for exit. */
    asm("movl %0,%%ebx\n"   /* first argument: exit status */
        "movl $1,%%eax\n"   /* syscall 1 (exit on 32-bit x86) */
        "int $0x80"         /* fire interrupt */
        : : "r"(c) :"%eax", "%ebx");
}

int main(int argc, char** argv, char** env) {
    /* yeah here we can do some stuff */
    return 42;
}
We're happy, it actually compiles and runs :)
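The "really basic functions" such a runtime library would contain can be sketched without any libc dependency. The names rt_memcpy and rt_strlen are hypothetical; only <stddef.h> is used, which is one of the headers available in a freestanding environment.

```c
#include <stddef.h>

/* Byte-by-byte copy; regions must not overlap (same contract as memcpy). */
static void *rt_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/* Length of a NUL-terminated string, not counting the terminator. */
static size_t rt_strlen(const char *s)
{
    size_t n = 0;

    while (s[n] != '\0')
        n++;
    return n;
}
```

In a real libc-free build these would usually be given the standard names, since gcc may emit calls to memcpy and memset on its own even when compiling freestanding code.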
Yes, it is possible, however you will have to make system calls and set up your entry point manually.
Example of a minimal program with entry point:
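.globl _start
.text
_start:
    xorl %eax,%eax      # eax = 0
    incl %eax           # eax = 1, the exit syscall number on 32-bit x86
    movb $42, %bl       # low byte of ebx = exit status 42
    int $0x80           # trap into the kernel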
Or in plain C (no exit):
void __attribute__((noreturn)) _start() {
while(1);
}
Compiled with:
gcc -nostdlib -o example example.s
gcc -nostdlib -o example example.c
In pure C? As others have said you still need a way to make syscalls, so you might need to drop down to inline asm for that. That said, if using gcc check out -ffreestanding.
You'd need a way to prevent the C compiler from generating code that depends on libc, which with gcc can be done with -fno-hosted. And you'd need one assembly language routine to implement syscall(2). They're not hard to write if you can get suitable OS doco. After that you'd be off to the races.
Well, you would need to use some system calls to load all its information into memory, so I doubt it.
And you would almost have to use exit(), just because of the way that Linux works.
Yes you can, but it's pretty tricky.
There is essentially absolutely no point.
You can statically link a program, but then the appropriate pieces of the C library are included in its binary (so it doesn't have any dependencies).
You can completely do without the C library, in which case you need to make system calls using the appropriate low-level interface, which is architecture dependent, and not necessarily int 0x80.
If your goal is making a very small self-contained binary, you might be better off static-linking against something like uclibc.

Resources