Quoting from my lecture:
Note the clear borderline between user space and kernel space. User
programs cannot include kernel headers in their code and cannot call
kernel functions directly. In other words, your program can’t simply
call the sys_read() service function to read a file from the disk.
Similarly, kernel code does not call user-space functions like
printf(), does not include user-space headers like <stdio.h>,
and does not link against user-space libraries like libc.
The only gate to kernel mode (and OS services) that the user can use
is the syscall instruction as described above.
"User programs cannot include kernel headers". So when I call getpid() in my C program, is that a user-space function?
What about when I type getpid in a terminal, is it the same (a user-space function)?
I can't access the Linux header files on my system under /home/user/linux-4.15, so how can it be said that user space can't access kernel space?
Given the following image:
I opened one of the Linux source files (init/main.c) and saw:
static int run_init_process(const char *init_filename)
{
    argv_init[0] = init_filename;
    return do_execve(getname_kernel(init_filename),....
}
Where is this do_execve declared? The image shows only execv and sys_execv... and what's the difference?
"User programs cannot include kernel headers". So when I call getpid() in my C program, is that a user-space function?
Yes. It is a thin wrapper in libc that calls the system call. But depending on the architecture and the libc implementation, there might be some bookkeeping in userspace (e.g. caching the result for future calls).
For many simple syscalls, glibc generates these wrappers with preprocessor macros. On my system, a userspace call to getpid goes to file sysdeps/unix/syscall-template.S:
0x00007ffff7ea6244 59 in ../sysdeps/unix/syscall-template.S
(gdb) disassemble
Dump of assembler code for function getpid:
0x00007ffff7ea6240 <+0>: endbr64
=> 0x00007ffff7ea6244 <+4>: mov $0x27,%eax
0x00007ffff7ea6249 <+9>: syscall
0x00007ffff7ea624b <+11>: retq
End of assembler dump.
which simply puts the syscall number in a register and executes a syscall instruction.
The reason we are using this wrapper is to avoid having to know the specifics of the syscall mechanism and number for different architectures and kernels. This makes our program more portable. The libc we link against knows that 0x27 is getpid on this system, and that it should be written into %eax etc.
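To see that the wrapper and the underlying system call agree, here is a minimal sketch that asks for the PID through libc's generic syscall() function instead of getpid(). The helper name raw_getpid is made up for this illustration, and the sketch assumes Linux with glibc, where SYS_getpid comes from <sys/syscall.h>.

```c
#include <sys/syscall.h>  /* SYS_getpid: the syscall number for this architecture */
#include <sys/types.h>
#include <unistd.h>       /* syscall(), getpid() */

/* Hypothetical helper: ask the kernel for the PID without going through
 * the getpid() wrapper. libc's syscall() loads the number and executes
 * the architecture's syscall instruction for us. */
static long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Calling raw_getpid() and getpid() from the same process should give the same value. The difference is that getpid() hides the syscall number, while raw_getpid() still relies on the header to supply it.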
When the syscall instruction is executed, the processor switches into kernel mode and starts execution from arch/x86/entry/entry_64.S, where entry_SYSCALL_64 calls do_syscall_64 which is in arch/x86/entry/common.c:
regs->ax = sys_call_table[nr](regs);
You can see that it calls the function at index nr of the sys_call_table. This table is populated by a list of symbols (sys_something), each of which is defined by a macro, SYSCALL_DEFINEn, where n is the number of parameters. Since getpid does not take a parameter, it is defined as SYSCALL_DEFINE0(getpid) in kernel/sys.c:
/**
 * sys_getpid - return the thread group id of the current process
 *
 * Note, despite the name, this returns the tgid not the pid. The tgid and
 * the pid are identical unless CLONE_THREAD was specified on clone() in
 * which case the tgid is the same in all threads of the same group.
 *
 * This is SMP safe as current->tgid does not change.
 */
SYSCALL_DEFINE0(getpid)
{
    return task_tgid_vnr(current);
}
What about when I type getpid in a terminal, is it the same (a user-space function)?
I don't know of a terminal command called getpid, but if there is one, it would be an executable binary (or script) that eventually calls either a syscall or a libc wrapper of a syscall. That is because the kernel maintains task and process IDs, and userspace code cannot access kernel memory.
I can't access the Linux header files on my system under /home/user/linux-4.15, so how can it be said that user space can't access kernel space?
Did you mean you CAN access the header files? You can access the entire source code, of course. But even if you include those headers in your program, compile it, and somehow link it with kernel code, that doesn't mean you can run that code in kernel mode.
The exception is loadable kernel modules. In fact, you need the kernel header files to compile kernel modules. You can then ask the kernel to load and execute those modules in kernel mode. But you need to call another syscall (init_module) to achieve that.
Where is this do_execve declared? The image shows only execv and sys_execv... and what's the difference?
Here is the definition of the syscall execve:
SYSCALL_DEFINE3(execve,
        const char __user *, filename,
        const char __user *const __user *, argv,
        const char __user *const __user *, envp)
{
    return do_execve(getname(filename), argv, envp);
}
Similar to getpid, execve is defined with a SYSCALL_DEFINEn (this time three parameters) macro which generates the sys_execve symbol. Internally, the kernel calls do_execve. If you search the rest of the file, you'll see that do_execve itself is a wrapper around do_execveat_common. After some checks and initialization, bprm_execve is called, which calls exec_binprm, and so on.
Can you elaborate on the difference between do_execve and sys_execve?
There is not much of a difference. The sys_execve symbol is defined by the SYSCALL_DEFINE3 macro and is meant to be called by an architecture-specific syscall mechanism, whose calling convention can differ from that of regular C functions (e.g. asmlinkage). do_execve is a regular C function. In this instance it isn't called from any other C code, but it could be. Calling sys_execve directly from inside kernel code, however, would not be correct.
This is a slightly strange question. I am trying to find a syscall on i386 that takes no parameters and allows code on the stack to be executed. I am doing a CTF and have found a way to invoke a syscall while controlling eax, and I have full control of the stack (with argv, so just pointers to my strings). Now I am jumping into the vDSO (that is all the code in the program, there are no DLLs or anything else) to run a syscall that will allow stack execution. But I have gone over the man pages again and again and haven't found anything I can use.
$ uname -r
4.4.179-0404179-generic
There's no zero-arg Linux system call equivalent to mprotect(stack_base, stack_size, PROT_WRITE|PROT_READ|PROT_EXEC).
Not that I know of, and I wouldn't expect there to be one. Probably the only use case would be to help attackers, which is the opposite of hardening; normally you can make the stack executable via linker options or any specific pages via mprotect with args. There's no need for a shortcut for that.
There's also not one that can set the READ_IMPLIES_EXEC personality for an already-running process, even if you do allow args. (See Using personality syscall to make the stack executable - at best it will have an effect after execve.)
You might be able to use some ROP techniques to get some args set up for mprotect, and then return to the code you injected.
I am trying to understand the internals of the Linux kernel by reading Robert Love's Linux Kernel Development.
On page 74 he says the easiest way to pass arguments to a syscall is via registers:
Somehow, user-space must relay the parameters to the kernel during the
trap. The easiest way to do this is via the same means that the syscall
number is passed: The parameters are stored in registers. On x86-32,
the registers ebx, ecx, edx, esi, and edi contain, in order, the first
five arguments.
Now this is bothering me for a number of reasons:
All syscalls are defined with the asmlinkage option, which implies that the arguments are always to be found on the stack and not in registers. So what is all this business with the registers?
It may be possible that the values are copied onto the kernel stack before the syscall is performed. I have no idea why that would be efficient, but it might be a possibility.
(This answer is for 32-bit x86 Linux to match your question; things are slightly different for 64-bit x86 and other architectures.)
The parameters are passed from userspace in registers as Love says.
When userspace invokes a system call with int $0x80, the kernel syscall entry code gets control. This is written in assembly language and can be seen here, for instance. One of the things this code does is to take the parameters from the registers and push them onto the stack, and then call the appropriate kernel sys_XXX() function (which is written in C). So those functions do indeed expect their arguments on the stack.
It wouldn't work as well to try to pass parameters from userspace to the kernel on the stack. When the system call is made, the CPU switches to a separate kernel stack, so the parameters would have to be copied from the userspace stack to the kernel stack, and this is somewhat complicated. And it would have to be done even for very simple system calls that just take a few numeric arguments and wouldn't otherwise need to access userspace memory at all (think about close() for instance).
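As an illustration of arguments travelling in registers, here is a minimal sketch for close(). Note that it uses the 64-bit x86 convention (the syscall instruction, number in rax, first argument in rdi), not the 32-bit int $0x80 path described above. The idea is the same, only the registers differ. The helper name raw_close is made up, and on other architectures the sketch falls back to libc's syscall().

```c
#include <errno.h>
#include <sys/syscall.h>  /* SYS_close */
#include <unistd.h>       /* syscall() */

/* Hypothetical helper: close(fd) with the argument passed in a register.
 * Returns 0 on success or a negative errno value such as -EBADF. */
static long raw_close(int fd)
{
#if defined(__x86_64__)
    long ret;
    /* rax holds the syscall number on entry and the result on return;
     * rdi holds the first argument. The syscall instruction itself
     * overwrites rcx and r11, so they are listed as clobbered. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "0"((long)SYS_close), "D"((long)fd)
                      : "rcx", "r11", "memory");
    return ret;
#else
    /* Fallback: libc's syscall() returns -1 and sets errno on failure,
     * so convert that to the kernel-style negative errno value. */
    return syscall(SYS_close, fd) == 0 ? 0 : -errno;
#endif
}
```

Nothing is pushed onto the user stack here: the file descriptor is a plain number, so it fits in a register and the kernel never has to read userspace memory to get it.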
I am trying to create a mechanism to read performance counters for processes. I want this mechanism to be executed from within the kernel (version 4.19.2) itself.
I am able to do it from user space using the sys_perf_event_open() system call as follows.
syscall (__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
I would like to invoke this call from kernel space. I got the basic idea from here: How do I use a Linux System call from a Linux Kernel Module
Here are the steps I took to achieve this:
To make sure that the virtual address of the kernel remains valid, I have used set_fs(), get_fs() and get_ds().
Since sys_perf_event_open() is declared in /include/linux/syscalls.h, I have included that in the code.
Eventually, the code for calling the system call looks something like this:
mm_segment_t fs;
fs = get_fs();
set_fs(get_ds());
long ret = sys_perf_event_open(&pe, pid, cpu, group_fd, flags);
set_fs(fs);
Even after these measures, I get an error claiming "implicit declaration of function ‘sys_perf_event_open’". Why is this popping up when the header file declaring it is already included? Does it have something to do with the way one should call system calls from within kernel code?
In general (not specific to Linux) the work done for system calls can be split into 3 categories:
switching from user context to kernel context (and back again on the return path). This includes things like changing the processor's privilege level, messing with gs, fiddling with stacks, and doing security mitigations (e.g. for Meltdown). These things are expensive, and if you're already in the kernel they're useless and/or dangerous.
using a "function number" parameter to find the right function to call, and calling it. This typically includes some sanity checks (does the function exist?) and a table lookup, plus code to mangle input and output parameters, which is needed because the calling convention used for system calls (in user space) is not the same as the calling convention that normal C functions use. These things are expensive, and if you're already in the kernel they're useless and/or dangerous.
the final normal C function that ends up being called. This is the function that you might have (see note) been able to call directly without using any of the expensive, useless and/or dangerous system call junk.
Note: If you aren't able to call the final normal C function directly without using (any part of) the system call junk (e.g. if the final normal C function isn't exposed to other kernel code); then you must determine why. For example, maybe it's not exposed because it alters user-space state, and calling it from kernel will corrupt user-space state, so it's not exposed/exported to other kernel code so that nobody accidentally breaks everything. For another example, maybe there's no reason why it's not exposed to other kernel code and you can just modify its source code so that it is exposed/exported.
Calling system calls from inside the kernel through the sys_* interface is discouraged for the reasons that others have already mentioned. On x86_64 in particular (which I guess is your architecture), starting with kernel v4.17 it is a hard requirement not to use that interface (with a few exceptions). Before that version it was possible to invoke system calls directly, which is why there are plenty of tutorials on the web using sys_*, but now you get the error you are seeing. The alternative proposed in the Linux documentation is to put the actual syscall code in a helper function that can be called within the kernel like any other function, and to have the syscall definition call that helper:
int perf_event_open_wrapper(...) {
    // actual perf_event_open() code
}
SYSCALL_DEFINE5(perf_event_open, ...) {
    return perf_event_open_wrapper(...);
}
source: https://www.kernel.org/doc/html/v4.19/process/adding-syscalls.html#do-not-call-system-calls-in-the-kernel
Which kernel version are we talking about?
Anyhow, you could get the address of sys_call_table either by looking at the System.map file or, if the symbol is exported, by looking it up (have a look at kallsyms.h). Once you have the address of the syscall table, you can treat it as an array of void pointers (void **) and index it to find the function you want. For example, sys_call_table[__NR_open] would be open's address, so you could store it in a void pointer and then call it.
Edit: What are you trying to do, and why can't you do it without calling syscalls? You must understand that syscalls are the kernel's API to the userland, and should not be really used from inside the kernel, thus such practice should be avoided.
calling system calls from kernel code
(I am mostly answering to that title; to summarize: it is forbidden to even think of that)
I don't understand your actual problem (I feel you need to explain it more in your question, which is unclear and lacks a lot of useful motivation and context). But a general piece of advice, following the Unix philosophy, is to minimize the size and vulnerability area of your kernel or kernel module code and, as soon as your kernel code requires some system calls, to move as much of that code as is convenient into user-land, in particular with the help of systemd. Your question is by itself a violation of most Unix and Linux cultural norms.
Have you considered using efficient kernel to user-land communication, in particular netlink(7) with socket(7)? Perhaps you also want some driver-specific kernel thread.
My intuition is that AF_NETLINK with socket(2), used from some user-land daemon that systemd starts early at boot time, is exactly what fits your (unexplained) needs. And eventfd(2) might also be relevant.
But just thinking of using system calls from inside the kernel triggers a huge flashing red light in my brain and I tend to believe it is a symptom of a major misunderstanding of operating system kernels in general. Please take time to read Operating Systems: Three Easy Pieces to understand OS philosophy.
If one tries to hook certain syscalls via sys_call_table hooking, e.g. sys_execve, this will fail, because they are called indirectly through a stub. For sys_execve this is stub_execve (compare the assembly code on LXR).
But what are these stubs good for? Why do only certain system calls like execve(2) and fork(2) require a stub and how is this connected to x86_64? Is there a workaround to hook stubbed syscalls (in a Loadable Kernel Module)?
From here, it says:
"Certain special system calls that need to save a complete full stack frame."
And I think execve is just one of these special system calls.
Looking at the code of stub_execve, if you want to hook it, you can at least try one of the following:
Work out what that assembly code does and reimplement it yourself, so that you can call your own function from your own assembly code.
The assembly code contains a call to sys_execve partway through; you can replace the address of sys_execve there with the address of your own hook function.
I am trying to print a backtrace when my C++ program terminates. The function that prints the backtrace looks like this:
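#include <execinfo.h>  /* backtrace(), backtrace_symbols() */
#include <stdlib.h>    /* free() */
#include <syslog.h>    /* syslog() */

void print_backtrace(void){
    void *tracePtrs[10];
    /* backtrace() returns the number of frames it stored, as an int */
    int count = backtrace(tracePtrs, 10);
    char** funcNames = backtrace_symbols(tracePtrs, count);
    if (funcNames == NULL)  /* backtrace_symbols() could not allocate memory */
        return;
    for (int i = 0; i < count; i++)
        syslog(LOG_INFO,"%s\n", funcNames[i]);
    free(funcNames);
}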
It gives output like this:
desktop program: Received SIGSEGV signal, last error is : Success
desktop program: ./program() [0x422225]
desktop program: ./program() [0x422371]
desktop program: /lib/libc.so.6(+0x33af0) [0x7f0710f75af0]
desktop program: /lib/libc.so.6(+0x12a08e) [0x7f071106c08e]
desktop program: ./program() [0x428895]
desktop program: /lib/libc.so.6(__libc_start_main+0xfd) [0x7f0710f60c4d]
desktop program: ./program() [0x4082c9]
Is there a way to get more detailed backtrace with function names and lines, like gdb outputs?
Yes: pass the -rdynamic flag to the linker. It causes the linker to put the names of all the non-static functions in your code into the link tables, not just the exported ones.
The price you pay is a very slightly longer startup time for your program. For small to medium programs you won't notice it. What you get is that backtrace() is able to give you the names of all the non-static functions in your backtrace.
However - BEWARE: there are several gotchas you need to be aware of:
backtrace_symbols allocates memory with malloc. If you got the SIGSEGV because of malloc arena corruption (quite common), you will double fault here and never see your backtrace.
Depending on the platform this runs on (e.g. x86), the address/function name of the exact function where you crashed will be replaced in place on the stack with the return address of the signal handler. You need to get the right EIP of the crashed function from the signal handler parameters for those platforms.
syslog is not an async-signal-safe function. It might take a lock internally, and if that lock is held when the crash occurs (because you crashed in the middle of another call to syslog) you have a deadlock.
If you want to learn all the gory details, check out this video of me giving a talk about it at OLS: http://free-electrons.com/pub/video/2008/ols/ols2008-gilad-ben-yossef-fault-handlers.ogg
Feed the addresses to addr2line and it will show you the file name, line number, and function name.
If you're fine with only getting proper backtraces when running through valgrind, then this might be an option for you:
VALGRIND_PRINTF_BACKTRACE(format, ...):
It will give you the backtrace for all functions, including static ones.
The better option I have found is libbacktrace by Ian Lance Taylor:
https://github.com/ianlancetaylor/libbacktrace
backtrace_symbols() prints only exported symbols and could not be less portable, as it requires the GNU libc.
addr2line is nice as it includes file names and line numbers. But it fails as soon as the loader performs relocations. Nowadays as ASLR is common, it will fail very often.
libunwind alone will not allow one to print file names and line numbers. To do this, one needs to parse DWARF debugging information inside the ELF binary file. This can be done using libdwarf, though. But why bother when libbacktrace gives you everything required for free?
Create a pipe
fork()
Make child process execute addr2line
In parent process, convert the addresses returned from backtrace() to hexadecimal
Write the hex addresses to the pipe
Read back the output from addr2line and print/log it
Since you're doing all this from a signal handler, make sure to not use functionality which is not async-signal-safe. You can see a list of async-signal-safe POSIX functions here.
If you don't want to take the "signal a different process that runs gdb on you" approach, which I think gby is advocating, you can also slightly alter your code to call open() on a crash log file and then backtrace_symbols_fd() with the fd returned by open(). Both functions are async-signal-safe according to the glibc manual. You'll still need -rdynamic, of course. Also, from what I've seen, you still sometimes need to run addr2line on some addresses that the backtrace*() functions won't be able to decode.
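A minimal sketch of that open() plus backtrace_symbols_fd() variant, assuming glibc's <execinfo.h>. The helper name dump_backtrace_to_file and the frame limit of 32 are choices made for this example only.

```c
#include <execinfo.h>  /* backtrace(), backtrace_symbols_fd() */
#include <fcntl.h>     /* open() */
#include <unistd.h>    /* close() */

/* Hypothetical helper: append the current backtrace to the file at `path`.
 * Returns the number of frames written, or -1 if the file can't be opened. */
static int dump_backtrace_to_file(const char *path)
{
    void *frames[32];

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;

    int count = backtrace(frames, 32);
    /* Writes one line per frame straight to the descriptor, so no
     * malloc()ed string array is involved, unlike backtrace_symbols(). */
    backtrace_symbols_fd(frames, count, fd);

    close(fd);
    return count;
}
```

One caveat from the backtrace(3) man page: the first call may load libgcc dynamically, which can itself allocate memory, so it may be worth calling backtrace() once during normal startup before relying on it in a crash handler.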
Also note fork() is not async signal safe: http://article.gmane.org/gmane.linux.man/1893/match=fork+async, at least not on Linux. Neither is syslog(), as somebody already pointed out.
If you want a very detailed backtrace, you can use ptrace(2) to trace the process you want a backtrace from.
You will be able to see all the functions your process used, but you need some basic asm knowledge.