Intercepting stat call with LD_PRELOAD - c

I'm trying to write a shared object that intercepts some filesystem API calls, such as open, close, read and write, that originate from an application. Interception is done using LD_PRELOAD. I've used strace methodically to find out which APIs the application calls, and I've implemented them in the shared library loaded by LD_PRELOAD. When it comes to stat, I found that __xstat and __xstat64 are called instead of stat, so I've overridden these two functions, and I'm able to trap these calls. However, in one particular environment, strace shows direct calls to stat() itself, like below:
25083 03:11:28.424859 close(13) = 0 <0.000045>
>> 25083 03:11:28.424966 stat("/somedir/somefile", 0x7ffe751d2430) = -1 ENOENT (No such file or directory) <0.000050>
25083 03:11:28.425067 clock_gettime(CLOCK_MONOTONIC, {786855, 130369007}) = 0 <0.000029>
The difference I note is that stat is called directly, which I don't see in other environments. It is possible that the application calls stat(); however, I see that stat internally calls __xstat or __xstat64. Another thing I noticed is that stat() isn't even implemented in the libc.so library. So this stat() appears to be a direct invocation of the stat() system call. How do I confirm this? And how would an application directly invoke the stat() system call?

So this stat() appears to be a direct invocation of the stat() system call. How do I confirm this?
Run the program inside of gdb with catch syscall stat. When the syscall happens, check the call stack with bt and take note of whether you're in libc.so.
And how would an application directly invoke stat() system call?
With inline assembly. Here's an example of it for x86-64:
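#include <stdio.h>
#include <sys/stat.h>
#include <sys/syscall.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    struct stat s;
    long rv;
    /* x86-64 syscall ABI: number in rax, arguments in rdi and rsi.
       The kernel clobbers rcx and r11 and writes into s (hence "memory"). */
    __asm__ volatile(
        "syscall"
        : "=a"(rv)
        : "a"(SYS_stat), "D"(argv[1]), "S"(&s)
        : "rcx", "r11", "memory"
    );
    if (rv) return 1;                          /* -errno on failure */
    printf("%lld\n", (long long)s.st_size);    /* st_size is off_t, not size_t */
    return 0;
}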
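Inline assembly is not the only route. glibc also exports the generic syscall(2) function, which issues a system call by number without going through stat() or __xstat(), so an LD_PRELOAD wrapper on those names never sees it either. A minimal sketch, assuming Linux with glibc: SYS_stat exists on x86-64, while the SYS_newfstatat fallback for architectures without a stat syscall (such as aarch64) assumes the C library's struct stat matches the kernel's layout there. raw_stat_size is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <fcntl.h>        /* AT_FDCWD, only used by the fallback branch */
#include <sys/stat.h>
#include <sys/syscall.h>  /* SYS_stat / SYS_newfstatat */
#include <unistd.h>       /* syscall() */

/* Returns st_size for path, or -1 (with errno set) on failure.
 * No stat()/__xstat() symbol is called, so an LD_PRELOAD interposer
 * on those names is bypassed. */
long long raw_stat_size(const char *path)
{
    struct stat st;
#if defined(SYS_stat)
    long rc = syscall(SYS_stat, path, &st);                     /* e.g. x86-64 */
#else
    long rc = syscall(SYS_newfstatat, AT_FDCWD, path, &st, 0);  /* assumption: e.g. aarch64 */
#endif
    if (rc != 0)
        return -1;
    return (long long)st.st_size;
}
```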

linux, systemcalls do_execv vs execv? [closed]

Quoting from my lecture:
Note the clear borderline between user space and kernel space. User
programs cannot include kernel headers in their code and cannot call
kernel functions directly. In other words, your program can’t simply
call the sys_read() service function to read a file from the disk.
Similarly, kernel code does not call user-space functions like
printf(), does not include user-space header like <stdio.h> or
, and does not link against user-space libraries like libc.
The only gate to kernel mode (and OS services) that the user can use
is the syscall instruction as described above.
"User programs cannot include kernel headers" So when I write in my C program getpid() is this user-space function?
What about when I type getpid in a terminal: is it the same (a user-space function)?
I can't access linux header files in my system /home/user/linux-4.15 , so how it's said user space can't access kernel space?
Given the following image:
I have opened some linux file (init/main.c) and saw:
static int run_init_process(const char *init_filename)
{
argv_init[0] = init_filename;
return do_execve(getname_kernel(init_filename),....
}
Where is this do_execve declared? The image shows only execv and sys_execv... and what's the difference?
"User programs cannot include kernel headers" So when I write in my C program getpid() is this user-space function?
Yes. It is a thin wrapper in libc that calls the system call. But depending on the architecture and the libc implementation, there might be some bookkeeping in userspace (e.g. caching the result for future calls).
For many simple syscalls, glibc generates these wrappers with preprocessor macros. On my system, a userspace call to getpid goes to file sysdeps/unix/syscall-template.S:
0x00007ffff7ea6244 59 in ../sysdeps/unix/syscall-template.S
(gdb) disassemble
Dump of assembler code for function getpid:
0x00007ffff7ea6240 <+0>: endbr64
=> 0x00007ffff7ea6244 <+4>: mov $0x27,%eax
0x00007ffff7ea6249 <+9>: syscall
0x00007ffff7ea624b <+11>: retq
End of assembler dump.
which simply puts the syscall number in a register and executes a syscall instruction.
The reason we are using this wrapper is to avoid having to know the specifics of the syscall mechanism and number for different architectures and kernels. This makes our program more portable. The libc we link against knows that 0x27 is getpid on this system, and that it should be written into %eax etc.
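A quick way to check that getpid() is only a thin wrapper is to compare it with the same system call issued through glibc's generic syscall(2) function, using the SYS_getpid constant from <sys/syscall.h> (39, i.e. 0x27, on x86-64, matching the mov $0x27,%eax above). A minimal sketch; getpid_matches_raw_syscall is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_getpid */
#include <sys/types.h>
#include <unistd.h>       /* getpid(), syscall() */

/* Returns 1 if the libc wrapper and the raw system call report the same PID. */
int getpid_matches_raw_syscall(void)
{
    pid_t via_libc = getpid();                       /* libc wrapper */
    pid_t via_syscall = (pid_t)syscall(SYS_getpid);  /* syscall number passed explicitly */
    return via_libc == via_syscall;
}
```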
When the syscall instruction is executed, the processor switches into kernel mode and starts execution from arch/x86/entry/entry_64.S, where entry_SYSCALL_64 calls do_syscall_64 which is in arch/x86/entry/common.c:
regs->ax = sys_call_table[nr](regs);
You can see that it calls the function at index nr of the sys_call_table. This table is populated by a list of symbols (sys_something), each of which is defined by a macro SYSCALL_DEFINEn, where n is the number of parameters. Since getpid takes no parameters, it is defined as SYSCALL_DEFINE0(getpid) in kernel/sys.c:
/**
* sys_getpid - return the thread group id of the current process
*
* Note, despite the name, this returns the tgid not the pid. The tgid and
* the pid are identical unless CLONE_THREAD was specified on clone() in
* which case the tgid is the same in all threads of the same group.
*
* This is SMP safe as current->tgid does not change.
*/
SYSCALL_DEFINE0(getpid)
{
return task_tgid_vnr(current);
}
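The dispatch step in do_syscall_64 can be mimicked in plain user-space C. This is a toy model, not kernel code: toy_regs, toy_sys_call_table and the two handlers are hypothetical stand-ins for struct pt_regs, sys_call_table and the SYSCALL_DEFINEn-generated functions. Out-of-range numbers yield -ENOSYS, as an invalid syscall number does on Linux.

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for struct pt_regs: the syscall number and three arguments. */
struct toy_regs {
    long ax;          /* syscall number on entry, return value on exit */
    long di, si, dx;  /* arguments 1..3 (x86-64 register names) */
};

typedef long (*toy_syscall_fn)(const struct toy_regs *regs);

/* Hypothetical handlers; in the kernel these come from SYSCALL_DEFINEn. */
static long toy_sys_getpid(const struct toy_regs *regs) { (void)regs; return 42; }
static long toy_sys_add(const struct toy_regs *regs) { return regs->di + regs->si; }

static const toy_syscall_fn toy_sys_call_table[] = {
    toy_sys_getpid,   /* nr 0 */
    toy_sys_add,      /* nr 1 */
};

/* Same shape as regs->ax = sys_call_table[nr](regs), with a bounds check. */
void toy_do_syscall(struct toy_regs *regs)
{
    unsigned long nr = (unsigned long)regs->ax;
    size_t count = sizeof toy_sys_call_table / sizeof toy_sys_call_table[0];
    if (nr < count)
        regs->ax = toy_sys_call_table[nr](regs);
    else
        regs->ax = -ENOSYS;
}
```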
What about when I type getpid in a terminal: is it the same (a user-space function)?
I don't know of a terminal command getpid, but if there is one, it would be an executable binary (or script) that eventually calls either a syscall or a libc wrapper of a syscall, because the kernel maintains task and process IDs and userspace code cannot access kernel memory.
I can't access linux header files in my system /home/user/linux-4.15 , so how it's said user space can't access kernel space?
Did you mean you CAN access the header files? You can access the entire source code, of course. But even if you include those headers in your program, and compile, and somehow link them with your kernel code, that doesn't mean you can run them in kernel mode.
Except, if you use loadable kernel modules. In fact, you need the kernel header files for compiling kernel modules. You can then request the kernel to load and execute those modules in kernel mode. But you need to call another syscall (init_module) to achieve that.
Where is this do_execve declared? The image shows only execv and sys_execv... and what's the difference?
Here is the definition of the syscall execve:
SYSCALL_DEFINE3(execve,
const char __user *, filename,
const char __user *const __user *, argv,
const char __user *const __user *, envp)
{
return do_execve(getname(filename), argv, envp);
}
Similar to getpid, execve is defined with a SYSCALL_DEFINEn (this time three parameters) macro which generates the sys_execve symbol. Internally, the kernel calls do_execve. If you search the rest of the file, you'll see that do_execve itself is a wrapper around do_execveat_common. After some checks and initialization, bprm_execve is called, which calls exec_binprm, and so on.
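On the user-space side of this chain, the usual way to reach sys_execve is through a libc wrapper such as execv(), which glibc implements on top of the execve system call. The sketch below forks, calls execv() in the child and reports the child's exit status. run_program is a hypothetical helper, and returning 127 when exec fails is only this sketch's convention, borrowed from the shell.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs path with argv {path, NULL} in a child process and returns its exit
 * status, or -1 if fork/wait failed or the child was killed by a signal. */
int run_program(const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        char *const argv[] = { (char *)path, NULL };
        execv(path, argv);   /* libc wrapper around the execve system call */
        _exit(127);          /* only reached if execv failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```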
Can you elaborate what's the difference between do_execve and sys_execve?
Not much of a difference, except that the sys_execve symbol is defined by the SYSCALL_DEFINE3 macro and is meant to be called by the architecture-specific syscall mechanism, whose calling convention can differ from that of regular C functions (e.g. asmlinkage). do_execve is a regular C function that other kernel code may call directly; run_init_process() in init/main.c, quoted in the question, does exactly that. Calling sys_execve directly from inside the kernel code, however, would not be correct.

undefined reference to `printk'

I want to use the printk function in my userspace code, but I don't want to write a kernel module. Is there any way to do that?
I tried using the linux/kernel.h and linux/module.h headers, but it doesn't work:
printk("<1>some text");
The simple answer is no.
You can't use printk in userspace code by any means; printk is designed for kernel programmers.
If your intention is to write to the system log, use syslog(); it comes in handy.
See the syslog(3) man page.
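Try this:
#include <stdio.h>
#include <unistd.h>
#include <syslog.h>

int main(void) {
    /* ident "slog", prefix each message with the PID,
       fall back to the console if the log can't be reached */
    openlog("slog", LOG_PID | LOG_CONS, LOG_USER);
    /* Many default syslog configurations broadcast LOG_EMERG messages to
       every logged-in user; LOG_INFO is gentler for ordinary messages. */
    syslog(LOG_EMERG, "Hello from my code");
    closelog();
    return 0;
}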
To Configure syslog for file redirection:
http://www.softpanorama.org/Logs/syslog.shtml
http://linux.die.net/man/5/syslog.conf
Using kernel headers in userspace makes the behavior of your program unpredictable.
One of the reasons is that the memory where the kernel is located is not directly accessible from userspace.
Here you can find some information about these cases:
lwn.net/Articles/113349/
kernelnewbies.org/KernelHeaders

Make a variable local to a thread [duplicate]

Short version of question: What parameter do I need to pass to the clone system call on x86_64 Linux system if I want to allocate a new TLS area for the thread that I am creating.
Long version:
I am working on a research project and for something I am experimenting with I want to create threads using the clone system call instead of using pthread_create. However, I also want to be able to use thread local storage. I don't plan on creating many threads right now, so it would be fine for me to create a new TLS area for each thread that I create with the clone system call.
I was looking at the man page for clone and it has the following information about the flag for the TLS parameter:
CLONE_SETTLS (since Linux 2.5.32)
The newtls argument is the new TLS (Thread Local Storage) descriptor.
(See set_thread_area(2).)
So I looked at the man page for set_thread_area and noticed the following which looked promising:
When set_thread_area() is passed an entry_number of -1, it uses a
free TLS entry. If set_thread_area() finds a free TLS entry, the value of
u_info->entry_number is set upon return to show which entry was changed.
However, after experimenting with this, it appears that set_thread_area is not implemented on my system (Ubuntu 10.04 on an x86_64 platform). When I run the following code I get the error set_thread_area() failed: Function not implemented
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/unistd.h>
#include <asm/ldt.h>

int main()
{
    struct user_desc u_info;
    memset(&u_info, 0, sizeof u_info);  /* don't hand the kernel uninitialized fields */
    u_info.entry_number = -1;           /* -1: ask the kernel for a free TLS entry */
    int rc = syscall(SYS_set_thread_area, &u_info);
    if (rc < 0) {
        perror("set_thread_area() failed");
        exit(EXIT_FAILURE);
    }
    printf("entry_number is %d\n", u_info.entry_number);
    return 0;
}
I also saw, when I used strace to see what happens when pthread_create is called, that there are no calls to set_thread_area. I have also been looking at the NPTL pthread source code to try to understand what it does when creating threads. But I don't completely understand it yet, and I think it is more complex than what I'm trying to do, since I don't need something as robust as the pthread implementation. I'm assuming that the set_thread_area system call is for x86 and that there is a different mechanism used for x86_64. But for the moment I have not been able to figure out what it is, so I'm hoping this question will help me get some ideas about what I need to look at.
I am working on a research project and for something I am experimenting with I want to create threads using the clone system call instead of using pthread_create
In the exceedingly unlikely scenario where your new thread never calls any libc functions (either directly, or by calling something else which calls libc; this also includes dynamic symbol resolution via PLT), you can pass whatever TLS storage you desire as the newtls parameter to clone.
You should ignore all references to set_thread_area -- they only apply to 32-bit/ix86 case.
If you are planning to use libc in your newly-created thread, you should abandon your approach: libc expects TLS to be set up a certain way, and there is no way for you to arrange for such setup when you call clone directly. Your new thread will intermittently crash when libc discovers that you didn't set up TLS properly. Debugging such crashes is exceedingly difficult, and the only reliable solution is ... to use pthread_create.
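For contrast, here is the route this answer recommends: threads created with pthread_create get a properly initialized TLS block, so a __thread variable (a GCC/Clang extension; C11 spells it _Thread_local) has an independent copy in each thread. A minimal sketch; tls_counter, bump_counter and run_tls_demo are hypothetical names, and older glibc versions may need -pthread when linking.

```c
#include <pthread.h>
#include <stddef.h>

/* Every thread, including the main one, gets its own copy of this counter. */
static __thread int tls_counter = 0;

static void *bump_counter(void *arg)
{
    for (int i = 0; i < 1000; i++)
        tls_counter++;              /* touches only this thread's copy */
    *(int *)arg = tls_counter;      /* report the thread's private value */
    return NULL;
}

/* Starts two threads and returns the calling thread's counter (still 0). */
int run_tls_demo(int *a, int *b)
{
    pthread_t t1, t2;
    if (pthread_create(&t1, NULL, bump_counter, a) != 0)
        return -1;
    if (pthread_create(&t2, NULL, bump_counter, b) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return tls_counter;
}
```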
The other answer is absolutely correct in that setting up a thread outside of libc's control is guaranteed to cause trouble at a certain point. You can do it, but you can no longer rely on libc's services, definitely not on any of the pthread_* functions or thread-local variables (defined as such using __thread or thread_local).
That being said, you can set one of the segment registers used for TLS (FS or GS) even on x86-64. The system call to look for is arch_prctl(ARCH_SET_FS, ...) (or ARCH_SET_GS); glibc itself uses FS for TLS on x86-64.
You can see an example comparing setting up TLS registers on i386 and x86-64 in this piece of code.
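On x86-64, arch_prctl(2) is the call that reads or sets the FS base that glibc uses as its thread pointer. The sketch below only reads it with ARCH_GET_FS, going through syscall(2) so it does not depend on whether the C library declares an arch_prctl() wrapper; calling ARCH_SET_FS in a thread that uses libc would break libc's TLS, as both answers warn. get_fs_base is a hypothetical helper, and it compiles to a stub returning -1 on other architectures.

```c
#include <unistd.h>           /* syscall() */
#if defined(__x86_64__)
#include <asm/prctl.h>        /* ARCH_GET_FS, ARCH_SET_FS, ... */
#include <sys/syscall.h>      /* SYS_arch_prctl */
#endif

/* Stores the current thread's FS base in *base.
 * Returns 0 on success, -1 on failure or on non-x86-64 builds. */
int get_fs_base(unsigned long *base)
{
#if defined(__x86_64__)
    /* The kernel writes the FS base address into *base. */
    return syscall(SYS_arch_prctl, ARCH_GET_FS, base) == 0 ? 0 : -1;
#else
    (void)base;
    return -1;
#endif
}
```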

Linux function to get mount points

Is there a function (or interface; ioctl, netlink etc) in the standard Linux libs that will return the current mounts directly from the kernel without parsing /proc? straceing the mount command, it looks like it parses files in /proc
Please see the clarification at the bottom of the answer for the reasoning being used in this answer.
Is there any reason that you would not use the getmntent libc library call? I do realize that it's not the same as an 'all in one' system call, but it should allow you to get the relevant information.
#include <stdio.h>
#include <stdlib.h>
#include <mntent.h>
int main(void)
{
struct mntent *ent;
FILE *aFile;
aFile = setmntent("/proc/mounts", "r");
if (aFile == NULL) {
perror("setmntent");
exit(1);
}
while (NULL != (ent = getmntent(aFile))) {
printf("%s %s\n", ent->mnt_fsname, ent->mnt_dir);
}
endmntent(aFile);
}
Clarification
Considering that the OP clarified about trying to do this without having /proc mounted, I'm going to clarify:
There is no facility outside of /proc for getting the fully qualified list of mounted file systems from the linux kernel. There is no system call, there is no ioctl. The /proc interface is the agreed upon interface.
With that said, if you don't have /proc mounted, you will have to parse the /etc/mtab file - pass in /etc/mtab instead of /proc/mounts to the initial setmntent call.
It is an agreed-upon protocol that the mount and umount commands will maintain a list of currently mounted filesystems in the file /etc/mtab. This is detailed in almost all linux/unix/bsd manual pages for these commands. So if you don't have /proc you can sort of rely on the contents of this file. It's not guaranteed to be a source of truth, but conventions are conventions for these things.
So, if you don't have /proc, you would use /etc/mtab in the getmntent libc library call above to get the list of file systems; otherwise you could use one of /proc/mounts or /proc/self/mountinfo (which is recommended nowadays over /proc/mounts).
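If you follow the recommendation to read /proc/self/mountinfo, note that setmntent/getmntent expect the fstab-like format of /proc/mounts, so mountinfo lines need their own parsing. Per proc(5), the mount point is the fifth space-separated field, and special characters in it (space, tab, newline, backslash) are written as \ooo octal escapes. A minimal sketch of extracting it from one line; mountinfo_mount_point is a hypothetical helper, and a full parser would also handle the optional fields before the "-" separator.

```c
#include <stddef.h>
#include <string.h>

/* Copies the mount point (5th space-separated field) of one mountinfo line
 * into out, decoding \ooo octal escapes. Returns 0 on success, -1 if the
 * line is malformed or out is too small. */
int mountinfo_mount_point(const char *line, char *out, size_t outlen)
{
    if (outlen == 0)
        return -1;
    const char *p = line;
    /* Skip fields 1-4: mount ID, parent ID, major:minor, root. */
    for (int field = 0; field < 4; field++) {
        p = strchr(p, ' ');
        if (p == NULL)
            return -1;
        p++;
    }
    size_t n = 0;
    while (*p != '\0' && *p != ' ' && *p != '\n') {
        char c = *p;
        if (c == '\\' && p[1] >= '0' && p[1] <= '7' && p[2] >= '0' && p[2] <= '7'
            && p[3] >= '0' && p[3] <= '7') {
            c = (char)(((p[1] - '0') << 6) | ((p[2] - '0') << 3) | (p[3] - '0'));
            p += 4;                     /* consumed "\ooo" */
        } else {
            p++;
        }
        if (n + 1 >= outlen)
            return -1;                  /* no room for the char plus '\0' */
        out[n++] = c;
    }
    out[n] = '\0';
    return n > 0 ? 0 : -1;
}
```

Reading the file line by line with fgets() and calling this on each line yields the same list of mount points that /proc/mounts would.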
There is no syscall to list this information; instead, you can find it in the file /etc/mtab

how could I intercept linux sys calls?

Besides the LD_PRELOAD trick and Linux kernel modules that replace a certain syscall with one provided by you, is there any way to intercept a syscall (open, for example) so that it first goes through your function before it reaches the actual open?
Why can't you / don't want to use the LD_PRELOAD trick?
Example code here:
/*
 * File: soft_atimes.c
 * Author: D.J. Capelis
 *
 * Compile:
 * gcc -fPIC -c -o soft_atimes.o soft_atimes.c
 * gcc -shared -o soft_atimes.so soft_atimes.o -ldl
 *
 * Use:
 * LD_PRELOAD="./soft_atimes.so" command
 *
 * Copyright 2007 Regents of the University of California
 */
#define _GNU_SOURCE            /* RTLD_NEXT and O_NOATIME */
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stddef.h>
#include <sys/types.h>

/* The next definitions in lookup order (normally libc's), resolved lazily.
   Racing threads all store the same address, so no locking is needed. */
static int (*real_open)(const char *pathname, int flags, ...) = NULL;
static int (*real_open64)(const char *pathname, int flags, ...) = NULL;

/* open() is variadic: mode is only passed when O_CREAT is set,
   so it must be read with va_arg, not declared as a fixed parameter. */
int open(const char *pathname, int flags, ...)
{
    mode_t mode = 0;
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }
    if (NULL == real_open)
        real_open = (int (*)(const char *, int, ...)) dlsym(RTLD_NEXT, "open");
    /* O_NOATIME: reads don't update atime. Note the kernel rejects it with
       EPERM unless the caller owns the file (or has CAP_FOWNER). */
    return real_open(pathname, flags | O_NOATIME, mode);
}

int open64(const char *pathname, int flags, ...)
{
    mode_t mode = 0;
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }
    if (NULL == real_open64)
        real_open64 = (int (*)(const char *, int, ...)) dlsym(RTLD_NEXT, "open64");
    return real_open64(pathname, flags | O_NOATIME, mode);
}
From what I understand... it is pretty much the LD_PRELOAD trick or a kernel module. There's not a whole lot of middle ground unless you want to run it under an emulator which can trap out to your function or do code re-writing on the actual binary to trap out to your function.
Assuming you can't modify the program and can't (or don't want to) modify the kernel, the LD_PRELOAD approach is the best one, assuming your application is fairly standard and isn't actually one that's maliciously trying to get past your interception. (In which case you will need one of the other techniques.)
Valgrind can be used to intercept any function call. If you need to intercept a system call in your finished product then this will be of no use. However, if you are trying to intercept during development then it can be very useful. I have frequently used this technique to intercept hashing functions so that I can control the returned hash for testing purposes.
In case you are not aware, Valgrind is mainly used for finding memory leaks and other memory related errors. But the underlying technology is basically an x86 emulator. It emulates your program and intercepts calls to malloc/free etc. The good thing is, you do not need to recompile to use it.
Valgrind has a feature that they term Function Wrapping, which is used to control the interception of functions. See section 3.2 of the Valgrind manual for details. You can setup function wrapping for any function you like. Once the call is intercepted the alternative function that you provide is then invoked.
First, let's eliminate some non-answers that other people have given:
Use LD_PRELOAD. Yeah you said "Besides LD_PRELOAD..." in the question but apparently that isn't enough for some people. This isn't a good option because it only works if the program uses libc which isn't necessarily the case.
Use Systemtap. Yeah you said "Besides ... Linux Kernel Modules" in the question but apparently that isn't enough for some people. This isn't a good option because you have to load a custom kernel module, which is a major pain in the arse and also requires root.
Valgrind. This does sort of work but it works by simulating the CPU, so it's really slow and really complicated. Fine if you're just doing this for one-off debugging. Not really an option if you're doing something production-worthy.
Various syscall auditing things. I don't think logging syscalls counts as "intercepting" them. We clearly want to modify the syscall parameters / return values or redirect the program through some other code.
However there are other possibilities not mentioned here yet. Note I'm new to all this stuff and haven't tried any of it yet so I may be wrong about some things.
Rewrite the code
In theory you could use some kind of custom loader that rewrites the syscall instructions to jump to a custom handler instead. But I think that would be an absolute nightmare to implement.
kprobes
kprobes are some kind of kernel instrumentation system. They only have read-only access to anything so you can't use them to intercept syscalls, only log them.
ptrace
ptrace is the API that debuggers like GDB use to do their debugging. There is a PTRACE_SYSCALL option which will pause execution just before/after syscalls. From there you can do pretty much whatever you like in the same way that GDB can. Here's an article about how to modify syscall parameters using ptrace. However, it apparently has high overhead.
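The PTRACE_SYSCALL loop at the heart of strace-like tools is short. A minimal sketch, assuming Linux with glibc; count_syscall_stops is a hypothetical helper that runs a program under ptrace and only counts syscall-stops. At each stop a real interceptor would read or rewrite the registers (for example with PTRACE_GETREGS/PTRACE_SETREGS), which is not shown. PTRACE_O_TRACESYSGOOD marks syscall-stops as SIGTRAP|0x80 so they can be told apart from the plain SIGTRAP that follows a successful execve.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs path under ptrace; returns the number of syscall-stops (entry and
 * exit each count as one), or -1 on error. */
long count_syscall_stops(const char *path)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(126);
        raise(SIGSTOP);                 /* let the parent set options first */
        execl(path, path, (char *)NULL);
        _exit(127);
    }
    int status;
    if (waitpid(child, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;
    ptrace(PTRACE_SETOPTIONS, child, NULL, (void *)(long)PTRACE_O_TRACESYSGOOD);
    long stops = 0;
    int sig = 0;                        /* signal to forward on resume */
    for (;;) {
        if (ptrace(PTRACE_SYSCALL, child, NULL, (void *)(long)sig) < 0)
            return -1;
        if (waitpid(child, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        sig = 0;
        if (WIFSTOPPED(status)) {
            if (WSTOPSIG(status) == (SIGTRAP | 0x80))
                stops++;                /* syscall entry or exit stop */
            else if (WSTOPSIG(status) != SIGTRAP)
                sig = WSTOPSIG(status); /* pass real signals through */
        }
    }
    return stops;
}
```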
Seccomp
Seccomp is a system that is designed to allow you to filter syscalls. You can't modify the arguments, but you can block them or return custom errors. Seccomp filters are BPF programs. If you're not familiar, they are basically arbitrary programs that users can run in a kernel-space VM. This avoids the user/kernel context switch which makes them faster than ptrace.
While you can't modify arguments directly from your BPF program you can return SECCOMP_RET_TRACE, which will stop the tracee and notify its ptracing parent. So it's basically the same as PTRACE_SYSCALL except you get to run a program in kernel space to decide whether you want to actually intercept a syscall based on its arguments. So it should be faster if you only want to intercept some syscalls (e.g. open() with specific paths).
I think this is probably the best option. Here's an article about it from the same author as the one above. Note they use classic BPF instead of eBPF but I guess you can use eBPF too.
Edit: Actually you can only use classic BPF, not eBPF. There's a LWN article about it.
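A minimal classic-BPF seccomp filter illustrates the "block them or return custom errors" case, using SECCOMP_RET_ERRNO rather than SECCOMP_RET_TRACE (which would additionally need a ptracing parent). Assumptions: x86-64 Linux, and getppid is picked only as a harmless example; install_getppid_filter and demo_blocked_getppid are hypothetical names. The filter is installed in a forked child because a seccomp filter cannot be removed once applied.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>      /* AUDIT_ARCH_X86_64 */
#include <linux/filter.h>     /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h>    /* SECCOMP_MODE_FILTER, SECCOMP_RET_* */
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fails getppid with EPERM and allows everything else. A real sandbox must
 * also handle x32 and foreign ABIs instead of allowing them. */
static int install_getppid_filter(void)
{
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),             /* other arch */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof filter / sizeof filter[0]),
        .filter = filter,
    };
    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Returns 0 if a filtered child saw getppid fail with EPERM. */
int demo_blocked_getppid(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_getppid_filter() != 0)
            _exit(2);
        long r = syscall(SYS_getppid);
        _exit(r == -1 && errno == EPERM ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```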
Here are some related questions. The first one is definitely worth reading.
Can eBPF modify the return value or parameters of a syscall?
Intercept only syscall with PTRACE_SINGLESTEP
Is this is a good way to intercept system calls?
Minimal overhead way of intercepting system calls without modifying the kernel
There's also a good article about manipulating syscalls via ptrace here.
Some applications can trick strace/ptrace into not running, so the only real option I've had is using SystemTap.
SystemTap can intercept a bunch of system calls if need be, thanks to its wildcard matching. SystemTap is not C, but a separate language. In its default mode, SystemTap should prevent you from doing stupid things, but it can also run in "guru mode" (stap -g), which allows a developer to use embedded C if that is required.
It does not require you to patch your kernel (Or at least shouldn't), and once a module has been compiled, you can copy it from a test/development box and insert it (via insmod) on a production system.
I have yet to find a linux application that has found a way to work around/avoid getting caught by systemtap.
I don't have the syntax to do this gracefully with an LKM offhand, but this article provides a good overview of what you'd need to do: http://www.linuxjournal.com/article/4378
You could also just patch the sys_open function. It starts on line 1084 of fs/open.c as of linux-2.6.26.
You might also see if you can't use inotify, systemtap or SELinux to do all this logging for you without you having to build a new system.
If you just want to watch what's opened, you want to look at the ptrace() function, or the source code of the commandline strace utility. If you actually want to intercept the call, to maybe make it do something else, I think the options you listed - LD_PRELOAD or a kernel module - are your only options.
If you just want to do it for debugging purposes, look into strace, which is built on top of the ptrace(2) system call, which allows you to hook up code when a system call is made. See the PTRACE_SYSCALL part of the man page.
if you really need a solution you might be interested in the DR rootkit that accomplishes just this, http://www.immunityinc.com/downloads/linux_rootkit_source.tbz2 the article about it is here http://www.theregister.co.uk/2008/09/04/linux_rootkit_released/
Sounds like you need auditd.
Auditd allows global tracking of all syscalls or accesses to files, with logging. You can set keys for specific events that you are interested in.
Using SystemTap may be an option.
For Ubuntu, install it as indicated in https://wiki.ubuntu.com/Kernel/Systemtap.
Then just execute the following and you will be listening on all openat syscalls:
# stap -e 'probe syscall.openat { printf("%s(%s)\n", name, argstr) }'
openat(AT_FDCWD, "/dev/fb0", O_RDWR)
openat(AT_FDCWD, "/sys/devices/virtual/tty/tty0/active", O_RDONLY)
openat(AT_FDCWD, "/sys/devices/virtual/tty/tty0/active", O_RDONLY)
openat(AT_FDCWD, "/dev/tty1", O_RDONLY)
