How does a C library call kernel system calls - c

I know that in Unix-like systems, C libraries such as glibc act as an intermediary between the kernel and userland. So, for example, when implementing malloc(), how does glibc invoke system calls of the Linux kernel? Does it use assembly?

On Linux x86, syscalls (system calls) are made by raising software interrupt 0x80. In assembly this is done with:
int $0x80
The syscall is selected by placing its number, along with any arguments, in CPU registers. malloc itself is not a syscall, but malloc implementations usually rely on the brk syscall (through sbrk) or the mmap syscall.
For more information on Linux x86 syscalls, you can read this document.
EDIT: as mentioned by Jester in the comments, Intel x86 processors (since the Pentium II) support the sysenter/sysexit instructions, which avoid the overhead of int. On processors that have them, Linux uses these instructions for 32-bit syscalls.

Example of calling the exit(0) syscall on the x86 architecture:
movl $1, %eax # $1 = number of the exit syscall
movl $0, %ebx # $0 = first argument of the exit syscall
int $0x80 # raise the software interrupt
Every syscall has a number, which can be found in /usr/include/asm/unistd.h. In the example above, the exit syscall has number 1.
You load the syscall number, then set the arguments the syscall requires, and finally raise the software interrupt with int $0x80.
So malloc internally uses brk or mmap: it sets the required arguments and then executes int $0x80.
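As a rough illustration of the last point, the sketch below asks the kernel for memory through the libc mmap wrapper, which is one of the calls an allocator may fall back on for large requests. The helper names page_alloc and page_free are hypothetical and are not part of glibc; real malloc implementations add bookkeeping, free lists, and thresholds on top of this.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS in strict ISO C modes */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: obtain `size` bytes of zero-filled memory directly
 * from the kernel. Returns NULL on failure. */
static void *page_alloc(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: hand the mapping back to the kernel.
 * Returns 0 on success, -1 on failure (as munmap does). */
static int page_free(void *p, size_t size)
{
    return munmap(p, size);
}
```

The libc mmap function is itself a thin wrapper: it loads the syscall number and arguments into registers and traps into the kernel, as described above.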

Related

vxworks system call trap mechanism

I'm new to VxWorks and working with an ELF binary for VxWorks. System calls appear to trap into the kernel by calling the address _func_syscallTrapHandle which is 0x1234. Since the program must transition into the kernel, am I correct in assuming that the goal of this is to segfault by accessing low memory to enter the kernel? If so does the segfault ISR check the contents of rax and, when it's 0x1234 perform systemcall logic? Why isn't the syscall instruction used instead?
You are describing the system call trap mechanism in vxsim. In that case VxWorks runs as a normal process inside Linux or Windows, so it cannot use the syscall instruction.
An ELF binary for real hardware behaves differently.

linux, systemcalls do_execv vs execv? [closed]

Quoting from my lecture:
Note the clear borderline between user space and kernel space. User
programs cannot include kernel headers in their code and cannot call
kernel functions directly. In other words, your program can’t simply
call the sys_read() service function to read a file from the disk.
Similarly, kernel code does not call user-space functions like
printf(), does not include user-space header like <stdio.h> or
, and does not link against user-space libraries like libc.
The only gate to kernel mode (and OS services) that’s the user can use
is the syscall instruction as described above.
"User programs cannot include kernel headers" So when I write in my C program getpid() is this user-space function?
What about when I type getpid in a terminal, is it the same (a user-space function)?
I can't access linux header files in my system /home/user/linux-4.15 , so how it's said user space can't access kernel space?
Given the following image:
I have opened some linux file (init/main.c) and saw:
static int run_init_process(const char *init_filename)
{
argv_init[0] = init_filename;
return do_execve(getname_kernel(init_filename),....
}
where is this do_execve declared? the image shows only execv and sys_execv... and what's the difference?
"User programs cannot include kernel headers" So when I write in my C program getpid() is this user-space function?
Yes. It is a thin wrapper in libc that calls the system call. But depending on the architecture and the libc implementation, there might be some bookkeeping in userspace (e.g. caching the result for future calls).
For many simple syscalls, glibc generates these wrappers with preprocessor macros. On my system, a userspace call to getpid goes to file sysdeps/unix/syscall-template.S:
0x00007ffff7ea6244 59 in ../sysdeps/unix/syscall-template.S
(gdb) disassemble
Dump of assembler code for function getpid:
0x00007ffff7ea6240 <+0>: endbr64
=> 0x00007ffff7ea6244 <+4>: mov $0x27,%eax
0x00007ffff7ea6249 <+9>: syscall
0x00007ffff7ea624b <+11>: retq
End of assembler dump.
which simply puts the syscall number in a register and executes a syscall instruction.
The reason we are using this wrapper is to avoid having to know the specifics of the syscall mechanism and number for different architectures and kernels. This makes our program more portable. The libc we link against knows that 0x27 is getpid on this system, and that it should be written into %eax etc.
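To make the wrapper idea concrete, here is a minimal sketch of what such a wrapper boils down to. The function name raw_getpid is hypothetical. On x86-64 it issues the syscall instruction directly with the number in the accumulator register, mirroring the disassembly above; on any other architecture it falls back to the generic syscall() function from libc, which is exactly the portability concern the wrapper hides.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: invoke getpid without going through libc's getpid(). */
static long raw_getpid(void)
{
#if defined(__x86_64__)
    long ret;
    /* The syscall number goes in rax; the kernel returns the result in rax.
     * The syscall instruction itself clobbers rcx and r11. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"((long)SYS_getpid)
                      : "rcx", "r11", "memory");
    return ret;
#else
    /* Other architectures use different instructions and registers,
     * so let libc's generic syscall() handle the details. */
    return syscall(SYS_getpid);
#endif
}
```

Calling raw_getpid() should return the same value as getpid(); SYS_getpid comes from the system headers, so the number does not have to be hard-coded.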
When the syscall instruction is executed, the processor switches into kernel mode and starts execution from arch/x86/entry/entry_64.S, where entry_SYSCALL_64 calls do_syscall_64 which is in arch/x86/entry/common.c:
regs->ax = sys_call_table[nr](regs);
You can see that it calls the function at index nr of the sys_call_table. This table is populated with a list of symbols (sys_something), each of which is defined by a macro: SYSCALL_DEFINEn, where n is the number of parameters. Since getpid does not take a parameter, it is defined as SYSCALL_DEFINE0(getpid) in kernel/sys.c:
/**
* sys_getpid - return the thread group id of the current process
*
* Note, despite the name, this returns the tgid not the pid. The tgid and
* the pid are identical unless CLONE_THREAD was specified on clone() in
* which case the tgid is the same in all threads of the same group.
*
* This is SMP safe as current->tgid does not change.
*/
SYSCALL_DEFINE0(getpid)
{
return task_tgid_vnr(current);
}
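The dispatch step can be modelled in ordinary userspace C. The following is only a toy sketch: every name in it (toy_regs, toy_call_table, toy_do_syscall, and the two handlers) is made up for illustration and none of it is kernel code. It shows the same idea of indexing a table of function pointers with the syscall number and storing the result back into the saved register set.

```c
#include <stddef.h>

/* Toy stand-in for the saved user registers (hypothetical layout). */
struct toy_regs {
    long nr;   /* syscall number requested by "userspace" */
    long arg0; /* first argument */
    long ax;   /* return value is written here */
};

typedef long (*toy_syscall_fn)(const struct toy_regs *);

/* Two made-up handlers standing in for sys_something functions. */
static long toy_getpid(const struct toy_regs *regs) { (void)regs; return 1234; }
static long toy_double(const struct toy_regs *regs) { return 2 * regs->arg0; }

static const toy_syscall_fn toy_call_table[] = {
    toy_getpid, /* nr 0 */
    toy_double, /* nr 1 */
};

#define TOY_NR_SYSCALLS (sizeof toy_call_table / sizeof toy_call_table[0])
#define TOY_ENOSYS 38 /* illustrative "no such syscall" error code */

/* Look up the handler by number and store its result, rejecting
 * numbers that fall outside the table. */
static void toy_do_syscall(struct toy_regs *regs)
{
    if (regs->nr >= 0 && (size_t)regs->nr < TOY_NR_SYSCALLS)
        regs->ax = toy_call_table[regs->nr](regs);
    else
        regs->ax = -TOY_ENOSYS;
}
```

The bounds check matters in this model for the same reason it would in any table dispatch: the index comes from the caller and cannot be trusted.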
What about when I type getpid in a terminal, is it the same (a user-space function)?
I don't know of a terminal command called getpid, but if there is one, it would be an executable binary (or script) that eventually makes either a syscall or a call to a libc wrapper of a syscall. That is because the kernel maintains task and process IDs, and userspace code cannot access kernel memory.
I can't access linux header files in my system /home/user/linux-4.15 , so how it's said user space can't access kernel space?
Did you mean you CAN access the header files? You can access the entire source code, of course. But even if you include those headers in your program, and compile, and somehow link them with your kernel code, that doesn't mean you can run them in kernel mode.
The exception is loadable kernel modules. In fact, you need the kernel header files to compile kernel modules. You can then ask the kernel to load and execute those modules in kernel mode, but doing so requires yet another syscall (init_module).
where is this do_execve declared? the image shows only execv and sys_execv... and what's the difference?
Here is the definition of the syscall execve:
SYSCALL_DEFINE3(execve,
const char __user *, filename,
const char __user *const __user *, argv,
const char __user *const __user *, envp)
{
return do_execve(getname(filename), argv, envp);
}
Similar to getpid, execve is defined with a SYSCALL_DEFINEn (this time three parameters) macro which generates the sys_execve symbol. Internally, the kernel calls do_execve. If you search the rest of the file, you'll see that do_execve itself is a wrapper around do_execveat_common. After some checks and initialization, bprm_execve is called, which calls exec_binprm, and so on.
Can you elaborate what's the difference between do_execve and sys_execve?
Not much of a difference. The sys_execve symbol is defined by the SYSCALL_DEFINE3 macro and is meant to be called by an architecture-specific syscall mechanism, whose calling convention can differ from that of regular C functions (e.g. asmlinkage). do_execve is a regular C function. In this instance it isn't called from any other C code, but it could be. Calling sys_execve directly from inside kernel code, however, would not be correct.

How are parameters passed to Linux system call ? Via register or stack?

I am trying to understand the internals of the Linux kernel by reading Robert Love's Linux Kernel Development.
On page 74 he says the easiest way to pass arguments to a syscall is via registers:
Somehow, user-space must relay the parameters to the kernel during the
trap.The easiest way to do this is via the same means that the syscall
number is passed: The parameters are stored in registers. On x86-32,
the registers ebx, ecx, edx, esi, and edi contain, in order, the first
five arguments.
Now this is bothering me for a number of reasons:
All syscalls are defined with the asmlinkage option, which implies that the arguments are always to be found on the stack and not in registers. So what is all this business with the registers?
It may be possible that before the syscall is performed the values are copied on to the kernel stack. I have no idea why that would be efficient but it might be a possibility.
(This answer is for 32-bit x86 Linux to match your question; things are slightly different for 64-bit x86 and other architectures.)
The parameters are passed from userspace in registers as Love says.
When userspace invokes a system call with int $0x80, the kernel syscall entry code gets control. This is written in assembly language and can be seen here, for instance. One of the things this code does is to take the parameters from the registers and push them onto the stack, and then call the appropriate kernel sys_XXX() function (which is written in C). So those functions do indeed expect their arguments on the stack.
It wouldn't work as well to try to pass parameters from userspace to the kernel on the stack. When the system call is made, the CPU switches to a separate kernel stack, so the parameters would have to be copied from the userspace stack to the kernel stack, and this is somewhat complicated. And it would have to be done even for very simple system calls that just take a few numeric arguments and wouldn't otherwise need to access userspace memory at all (think about close() for instance).
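From the C side, the register passing is hidden behind libc. As a minimal sketch, the hypothetical helper raw_write below uses the generic syscall() function, which takes the syscall number followed by plain argument values and places them wherever the architecture's syscall convention requires (on 32-bit x86, the registers listed in the quoted passage). Only the buffer argument refers to userspace memory; the file descriptor and length are just numbers.

```c
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue the write syscall without using libc's write().
 * Returns the number of bytes written, or -1 with errno set on failure. */
static long raw_write(int fd, const void *buf, size_t len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

For example, raw_write(1, "hi\n", 3) should print hi to standard output and return 3.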

Hooking sys_execve on Linux kernel 4.6 or higher

Kernels older than 4.6 use assembly stubs that make it harder to hook critical system calls such as fork, clone, and exec. For execve in particular, the following snippet from kernel 4.5 shows its entry stub:
ENTRY(stub_execve)
call sys_execve
return_from_execve:
...
END(stub_execve)
The system call table contains this stub's address, and the stub in turn calls the original execve. So, to hook execve in this environment, we need to patch the call sys_execve in the stub so that it calls our hooking routine, which does whatever we want and then calls the original execve. This can all be seen in action in execmon, a process execution monitoring utility for Linux. I tested execmon successfully on Ubuntu 16.04 with kernel 4.4.
Starting with kernel 4.6, the scheme above for protecting critical calls was changed. Now the stub looks like:
ENTRY(ptregs_\func)
leaq \func(%rip), %rax
jmp stub_ptregs_64
END(ptregs_\func)
where \func expands to sys_execve for execve calls. Again, the system call table contains this stub and the stub calls the original execve, but now in a more secure manner than a plain call sys_execve.
This newer stub stores the called function's address in the RAX register and jumps to another stub, shown below (comments removed):
ENTRY(stub_ptregs_64)
cmpq $.Lentry_SYSCALL_64_after_fastpath_call, (%rsp)
jne 1f
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
popq %rax
jmp entry_SYSCALL64_slow_path
1:
jmp *%rax /* called from C */
END(stub_ptregs_64)
Please have a look at this to see the comments and the other labels referenced in this stub.
I have tried hard to come up with some logic to get around this protection and patch the original calls with hooking functions, but with no success so far.
Would someone be willing to help me work this out?
I don't understand where you get the security angle from.
Neither the previous nor the current form of the func is "hardened".
You never stated why you want to hook execve, either.
The standard hooking mechanism is kprobes, and you can check systemtap for an example consumer.
I had a look at the aforementioned 'execmon' code, and I find it to be of poor quality and not fit for learning. For instance, https://github.com/kfiros/execmon/blob/master/kmod/syscalls.c#L65
accesses userspace memory directly (no get_user, copy_from_user etc.)
does so twice: first it computes the lengths (unbounded!) and then copies the data in. In particular, if someone makes the strings longer after the computation but before they are copied, this triggers a buffer overflow.
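The second problem is a classic double-fetch: the length is measured in one pass and the data is copied in another, so the source can change in between. A common remedy is to copy in a single pass, with the bound taken from the destination buffer rather than from a previously measured length. The sketch below shows that pattern in plain userspace C; the name bounded_copy is hypothetical, and this is not kernel code. Inside the kernel one would use a helper such as strncpy_from_user rather than dereferencing the user pointer at all.

```c
#include <stddef.h>

/* Hypothetical userspace analogue of a bounded, single-pass string copy.
 * Each source byte is read exactly once, and the limit is the size of the
 * destination, never a length computed earlier.
 * Returns the string length on success, or -1 if the source did not fit
 * (dst is then truncated and still NUL-terminated). */
static long bounded_copy(char *dst, size_t dst_size, const char *src)
{
    size_t i;

    if (dst_size == 0)
        return -1;

    for (i = 0; i < dst_size; i++) {
        char c = src[i]; /* single read of this byte */
        dst[i] = c;
        if (c == '\0')
            return (long)i;
    }

    dst[dst_size - 1] = '\0'; /* source too long: truncate safely */
    return -1;
}
```

Because the loop never trusts an earlier measurement, making the source longer while the copy is in progress can at worst cause truncation, not an overflow of dst.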

Why does the System V AMD64 ABI say to use syscall?

Originally my project had just x86 system call code in it (using 32 bit registers and int $0x80), but then I created a version called src3 that used 64 bit registers and syscall. That worked, until I created src4 which changed the argument to my _exit function so that it only handles a single byte as the desired process exit status value (in my tests anything that needed more than a byte to represent seemed to overflow in shell printouts, so I'm assuming that process exit status values are only 1 byte in size anyway). This broke my _exit function until I changed using syscall to int $0x80.
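The one-byte assumption about exit statuses can be checked directly. The sketch below uses a hypothetical helper, observed_exit_status, that forks a child, has it call _exit with the requested value, and reports what the parent sees through waitpid and WEXITSTATUS. On Linux, as on POSIX systems generally, only the low 8 bits of the value are made available to the parent this way.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run a child that exits with `requested` and return
 * the exit status the parent observes, or -1 on error. */
static int observed_exit_status(int requested)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(requested); /* child: terminate immediately */

    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For instance, a child that exits with 263 (256 + 7) is reported as 7, which matches the wrap-around seen in shell printouts.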

Resources