Make a variable local to a thread [duplicate] - c

Short version of question: What parameter do I need to pass to the clone system call on x86_64 Linux system if I want to allocate a new TLS area for the thread that I am creating.
Long version:
I am working on a research project and for something I am experimenting with I want to create threads using the clone system call instead of using pthread_create. However, I also want to be able to use thread local storage. I don't plan on creating many threads right now, so it would be fine for me to create a new TLS area for each thread that I create with the clone system call.
I was looking at the man page for clone and it has the following information about the flag for the TLS parameter:
CLONE_SETTLS (since Linux 2.5.32)
The newtls argument is the new TLS (Thread Local Storage) descriptor.
(See set_thread_area(2).)
So I looked at the man page for set_thread_area and noticed the following which looked promising:
When set_thread_area() is passed an entry_number of -1, it uses a
free TLS entry. If set_thread_area() finds a free TLS entry, the value of
u_info->entry_number is set upon return to show which entry was changed.
However, after experimenting with this some it appears that set_thread_area is not implemented in my system (Ubunut 10.04 on an x86_64 platform). When I run the following code I get an error that says: set_thread_area() failed: Function not implemented
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/unistd.h>
#include <asm/ldt.h>
int main()
{
struct user_desc u_info;
u_info.entry_number = -1;
int rc = syscall(SYS_set_thread_area,&u_info);
if(rc < 0) {
perror("set_thread_area() failed");
exit(-1);
}
printf("entry_number is %d",u_info.entry_number);
}
I also saw that when I use strace the see what happens when pthread_create is called that I don't see any calls to set_thread_area. I have also been looking at the nptl pthread source code to try to understand what they do when creating threads. But I don't completely understand it yet and I think it is more complex than what I'm trying to do since I don't need something that is as robust at the pthread implementation. I'm assuming that the set_thread_area system call is for x86 and that there is a different mechanism used for x86_64. But for the moment I have not been able to figure out what it is so I'm hoping this question will help me get some ideas about what I need to look at.

I am working on a research project and for something I am experimenting with I want to create threads using the clone system call instead of using pthread_create
In the exceedingly unlikely scenario where your new thread never calls any libc functions (either directly, or by calling something else which calls libc; this also includes dynamic symbol resolution via PLT), then you can pass whatever TLS storage you desire as the the new_tls parameter to clone.
You should ignore all references to set_thread_area -- they only apply to 32-bit/ix86 case.
If you are planning to use libc in your newly-created thread, you should abandon your approach: libc expects TLS to be set up a certain way, and there is no way for you to arrange for such setup when you call clone directly. Your new thread will intermittently crash when libc discovers that you didn't set up TLS properly. Debugging such crashes is exceedingly difficult, and the only reliable solution is ... to use pthread_create.

The other answer is absolutely correct in that setting up a thread outside of libc's control is guaranteed to cause trouble at a certain point. You can do it, but you can no longer rely on libc's services, definitely not on any of the pthread_* functions or thread-local variables (defined as such using __thread or thread_local).
That being said, you can set one of the segment registers used for TLS (GS and FS) even on x86-64. The system call to look for is prctl(ARCH_SET_GS, ...).
You can see an example comparing setting up TLS registers on i386 and x86-64 in this piece of code.

Related

Where is linux-vdso.so.1 present on the file system

I am learning about VDSO, wrote a simple application which calls gettimeofday()
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
struct timeval current_time;
if (gettimeofday(&current_time, NULL) == -1)
printf("gettimeofday");
getchar();
exit(EXIT_SUCCESS);
}
ldd on the binary shows 'linux-vdso'
$ ldd ./prog
linux-vdso.so.1 (0x00007ffce147a000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6ef9e8e000)
/lib64/ld-linux-x86-64.so.2 (0x00007f6efa481000)
I did a find for the libvdso library and there is no such library present in my file system.
sudo find / -name 'linux-vdso.so*'
Where is the library present?
It's a virtual shared object that doesn't have any physical file on the disk; it's a part of the kernel that's exported into every program's address space when it's loaded.
It's main purpose to make more efficient to call certain system calls (which would otherwise incur performance issues like this). The most prominent being gettimeofday(2).
You can read more about it here: http://man7.org/linux/man-pages/man7/vdso.7.html
find / -name '*vdso*.so*'
yields
/lib/modules/4.15.0-108-generic/vdso/vdso64.so
/lib/modules/4.15.0-108-generic/vdso/vdso32.so
/lib/modules/4.15.0-108-generic/vdso/vdsox32.so
linux-vdso.so is a virtual symbolic link to the bitness-compatible respective vdso*.so.
vDSO = virtual dynamic shared object
Note on vdsox32:
x32 is a Linux ABI which is kind of a mix between x86 and x64.
It uses 32-bit address size but runs in full 64-bit mode, including all 64-bit instructions and registers available.
Making system calls can be slow. In x86 32-bit systems, you can
trigger a software interrupt (int $0x80) to tell the kernel you
wish to make a system call. However, this instruction is
expensive: it goes through the full interrupt-handling paths in
the processor's microcode as well as in the kernel. Newer
processors have faster (but backward incompatible) instructions
to initiate system calls. Rather than require the C library to
figure out if this functionality is available at run time, the C
library can use functions provided by the kernel in the vDSO.
Note that the terminology can be confusing. On x86 systems, the
vDSO function used to determine the preferred method of making a
system call is named "__kernel_vsyscall", but on x86-64, the term
"vsyscall" also refers to an obsolete way to ask the kernel what
time it is or what CPU the caller is on.
One frequently used system call is gettimeofday(2). This system
call is called both directly by user-space applications as well
as indirectly by the C library. Think timestamps or timing loops
or polling—all of these frequently need to know what time it is
right now. This information is also not secret—any application
in any privilege mode (root or any unprivileged user) will get
the same answer. Thus the kernel arranges for the information
required to answer this question to be placed in memory the
process can access. Now a call to gettimeofday(2) changes from a
system call to a normal function call and a few memory accesses.
Also
You must not assume the vDSO is mapped at any particular location
in the user's memory map. The base address will usually be
randomized at run time every time a new process image is created
(at execve(2) time). This is done for security reasons, to prevent
"return-to-libc" attacks.
And
Since the vDSO is a fully formed ELF image, you can do symbol lookups
on it.
And also
If you are trying to call the vDSO in your own application rather than
using the C library, you're most likely doing it wrong.
as well as
Why does the vDSO exist at all? There are some system calls the
kernel provides that user-space code ends up using frequently, to
the point that such calls can dominate overall performance. This
is due both to the frequency of the call as well as the context-
switch overhead that results from exiting user space and entering
the kernel.

Calling system calls from the kernel code

I am trying to create a mechanism to read performance counters for processes. I want this mechanism to be executed from within the kernel (version 4.19.2) itself.
I am able to do it from the user space the sys_perf_event_open() system call as follows.
syscall (__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
I would like to invoke this call from the kernel space. I got some basic idea from here How do I use a Linux System call from a Linux Kernel Module
Here are the steps I took to achieve this:
To make sure that the virtual address of the kernel remains valid, I have used set_fs(), get_fs() and get_fd().
Since sys_perf_event_open() is defined in /include/linux/syscalls.h I have included that in the code.
Eventually, the code for calling the systems call looks something like this:
mm_segment_t fs;
fs = get_fs();
set_fs(get_ds());
long ret = sys_perf_event_open(&pe, pid, cpu, group_fd, flags);
set_fs(fs);
Even after these measures, I get an error claiming "implicit declaration of function ‘sys_perf_event_open’ ". Why is this popping up when the header file defining it is included already? Does it have to something with the way one should call system calls from within the kernel code?
In general (not specific to Linux) the work done for systems calls can be split into 3 categories:
switching from user context to kernel context (and back again on the return path). This includes things like changing the processor's privilege level, messing with gs, fiddling with stacks, and doing security mitigations (e.g. for Meltdown). These things are expensive, and if you're already in the kernel they're useless and/or dangerous.
using a "function number" parameter to find the right function to call, and calling it. This typically includes some sanity checks (does the function exist?) and a table lookup, plus code to mangle input and output parameters that's needed because the calling conventions used for system calls (in user space) is not the same as the calling convention that normal C functions use. These things are expensive, and if you're already in the kernel they're useless and/or dangerous.
the final normal C function that ends up being called. This is the function that you might have (see note) been able to call directly without using any of the expensive, useless and/or dangerous system call junk.
Note: If you aren't able to call the final normal C function directly without using (any part of) the system call junk (e.g. if the final normal C function isn't exposed to other kernel code); then you must determine why. For example, maybe it's not exposed because it alters user-space state, and calling it from kernel will corrupt user-space state, so it's not exposed/exported to other kernel code so that nobody accidentally breaks everything. For another example, maybe there's no reason why it's not exposed to other kernel code and you can just modify its source code so that it is exposed/exported.
Calling system calls from inside the kernel using the sys_* interface is discouraged for the reasons that others have already mentioned. In the particular case of x86_64 (which I guess it is your architecture) and starting from kernel versions v4.17 it is now a hard requirement not to use such interface (but for a few exceptions). It was possible to invoke system calls directly prior to this version but now the error you are seeing pops up (that's why there are plenty of tutorials on the web using sys_*). The proposed alternative in the Linux documentation is to define a wrapper between the syscall and the actual syscall's code that can be called within the kernel as any other function:
int perf_event_open_wrapper(...) {
// actual perf_event_open() code
}
SYSCALL_DEFINE5(perf_event_open, ...) {
return perf_event_open_wrapper(...);
}
source: https://www.kernel.org/doc/html/v4.19/process/adding-syscalls.html#do-not-call-system-calls-in-the-kernel
Which kernel version are we talking about?
Anyhow, you could either get the address of the sys_call_table by looking at the System map file, or if it is exported, you can look up the symbol (Have a look at kallsyms.h), once you have the address to the syscall table, you may treat it as a void pointer array (void **), and find your desired functions indexed. i.e sys_call_table[__NR_open] would be open's address, so you could store it in a void pointer and then call it.
Edit: What are you trying to do, and why can't you do it without calling syscalls? You must understand that syscalls are the kernel's API to the userland, and should not be really used from inside the kernel, thus such practice should be avoided.
calling system calls from kernel code
(I am mostly answering to that title; to summarize: it is forbidden to even think of that)
I don't understand your actual problem (I feel you need to explain it more in your question which is unclear and lacks a lot of useful motivation and context). But a general advice -following the Unix philosophy- is to minimize the size and vulnerability area of your kernel or kernel module code, and to deport, as much as convenient, such code in user-land, in particular with the help of systemd, as soon as your kernel code requires some system calls. Your question is by itself a violation of most Unix and Linux cultural norms.
Have you considered to use efficient kernel to user-land communication, in particular netlink(7) with socket(7). Perhaps you also
want some driver specific kernel thread.
My intuition would be that (in some user-land daemon started from systemd early at boot time) AF_NETLINK with socket(2) is exactly fit for your (unexplained) needs. And eventd(2) might also be relevant.
But just thinking of using system calls from inside the kernel triggers a huge flashing red light in my brain and I tend to believe it is a symptom of a major misunderstanding of operating system kernels in general. Please take time to read Operating Systems: Three Easy Pieces to understand OS philosophy.

Where can I find system call source code?

In Linux where can I find the source code for all system calls given that I have the source tree? Also if I were to want to look up the source code and assembly for a particular system call is there something that I can type in terminal like my_system_call?
You'll need the Linux kernel sources in order to see the actual source of the system calls. Manual pages, if installed on your local system, only contain the documentation of the calls and not their source itself.
Unfortunately for you, system calls aren't stored in just one particular location in the whole kernel tree. This is because various system calls can refer to different parts of the system (process management, filesystem management, etc.) and therefore it would be infeasible to store them apart from the part of the tree related to that particular part of the system.
The best thing you can do is look for the SYSCALL_DEFINE[0-6] macro. It is used (obviously) to define the given block of code as a system call. For example, fs/ioctl.c has the following code :
SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
{
/* do freaky ioctl stuff */
}
Such a definition means that the ioctl syscall is declared and takes three arguments. The number next to the SYSCALL_DEFINE means the number of arguments. For example, in the case of getpid(void), declared in kernel/timer.c, we have the following code :
SYSCALL_DEFINE0(getpid)
{
return task_tgid_vnr(current);
}
Hope that clears things up a little.
From an application's point of view, a system call is an elementary and atomic operation done by the kernel.
The Assembly Howto explains what is happening, in terms of machine instruction.
Of course, the kernel is doing a lot of things when handling a syscall.
Actually, you almost could believe that the entire kernel code is devoted to handle all system calls (this is not entirely true, but almost; from applications' point of view, the kernel is only visible thru system calls). The other answer by Daniel Kamil Kozar is explaining what kernel function is starting the handling of some system call (but very often, many other parts of the kernel indirectly participate to system calls; for example, the scheduler participates indirectly into implementing fork because it manages the child process created by a successful fork syscall).
I know it's old, but I was searching for the source for _system_call() too and found this tidbit
Actual code for system_call entry point can be found in /usr/src/linux/kernel/sys_call.S Actual code for many of the system calls can be found in /usr/src/linux/kernel/sys.c, and the rest are found elsewhere. find is your friend.
I assume this is dated, because I don't even have that file. However, grep found ENTRY(system_call) in arch/x86/kernel/entry_64.S and seems to be the thing that calls the individual system calls. I'm not up on my intel-syntax x86 asm right now, so you'll have to look and see if this is what you wanted.

how could I intercept linux sys calls?

Besides the LD_PRELOAD trick , and Linux Kernel Modules that replace a certain syscall with one provided by you , is there any possibility to intercept a syscall ( open for example ) , so that it first goes through your function , before it reaches the actual open ?
Why can't you / don't want to use the LD_PRELOAD trick?
Example code here:
/*
* File: soft_atimes.c
* Author: D.J. Capelis
*
* Compile:
* gcc -fPIC -c -o soft_atimes.o soft_atimes.c
* gcc -shared -o soft_atimes.so soft_atimes.o -ldl
*
* Use:
* LD_PRELOAD="./soft_atimes.so" command
*
* Copyright 2007 Regents of the University of California
*/
#define _GNU_SOURCE
#include <dlfcn.h>
#define _FCNTL_H
#include <sys/types.h>
#include <bits/fcntl.h>
#include <stddef.h>
extern int errorno;
int __thread (*_open)(const char * pathname, int flags, ...) = NULL;
int __thread (*_open64)(const char * pathname, int flags, ...) = NULL;
int open(const char * pathname, int flags, mode_t mode)
{
if (NULL == _open) {
_open = (int (*)(const char * pathname, int flags, ...)) dlsym(RTLD_NEXT, "open");
}
if(flags & O_CREAT)
return _open(pathname, flags | O_NOATIME, mode);
else
return _open(pathname, flags | O_NOATIME, 0);
}
int open64(const char * pathname, int flags, mode_t mode)
{
if (NULL == _open64) {
_open64 = (int (*)(const char * pathname, int flags, ...)) dlsym(RTLD_NEXT, "open64");
}
if(flags & O_CREAT)
return _open64(pathname, flags | O_NOATIME, mode);
else
return _open64(pathname, flags | O_NOATIME, 0);
}
From what I understand... it is pretty much the LD_PRELOAD trick or a kernel module. There's not a whole lot of middle ground unless you want to run it under an emulator which can trap out to your function or do code re-writing on the actual binary to trap out to your function.
Assuming you can't modify the program and can't (or don't want to) modify the kernel, the LD_PRELOAD approach is the best one, assuming your application is fairly standard and isn't actually one that's maliciously trying to get past your interception. (In which case you will need one of the other techniques.)
Valgrind can be used to intercept any function call. If you need to intercept a system call in your finished product then this will be no use. However, if you are try to intercept during development then it can be very useful. I have frequently used this technique to intercept hashing functions so that I can control the returned hash for testing purposes.
In case you are not aware, Valgrind is mainly used for finding memory leaks and other memory related errors. But the underlying technology is basically an x86 emulator. It emulates your program and intercepts calls to malloc/free etc. The good thing is, you do not need to recompile to use it.
Valgrind has a feature that they term Function Wrapping, which is used to control the interception of functions. See section 3.2 of the Valgrind manual for details. You can setup function wrapping for any function you like. Once the call is intercepted the alternative function that you provide is then invoked.
First lets eliminate some non-answers that other people have given:
Use LD_PRELOAD. Yeah you said "Besides LD_PRELOAD..." in the question but apparently that isn't enough for some people. This isn't a good option because it only works if the program uses libc which isn't necessarily the case.
Use Systemtap. Yeah you said "Besides ... Linux Kernel Modules" in the question but apparently that isn't enough for some people. This isn't a good option because you have to load a custom kernal module which is a major pain in the arse and also requires root.
Valgrind. This does sort of work but it works be simulating the CPU so it's really slow and really complicated. Fine if you're just doing this for one-off debugging. Not really an option if you're doing something production-worthy.
Various syscall auditing things. I don't think logging syscalls counts as "intercepting" them. We clearly want to modify the syscall parameters / return values or redirect the program through some other code.
However there are other possibilities not mentioned here yet. Note I'm new to all this stuff and haven't tried any of it yet so I may be wrong about some things.
Rewrite the code
In theory you could use some kind of custom loader that rewrites the syscall instructions to jump to a custom handler instead. But I think that would be an absolute nightmare to implement.
kprobes
kprobes are some kind of kernel instrumentation system. They only have read-only access to anything so you can't use them to intercept syscalls, only log them.
ptrace
ptrace is the API that debuggers like GDB use to do their debugging. There is a PTRACE_SYSCALL option which will pause execution just before/after syscalls. From there you can do pretty much whatever you like in the same way that GDB can. Here's an article about how to modify syscall paramters using ptrace. However it apparently has high overhead.
Seccomp
Seccomp is a system that is design to allow you to filter syscalls. You can't modify the arguments, but you can block them or return custom errors. Seccomp filters are BPF programs. If you're not familiar, they are basically arbitrary programs that users can run in a kernel-space VM. This avoids the user/kernel context switch which makes them faster than ptrace.
While you can't modify arguments directly from your BPF program you can return SECCOMP_RET_TRACE which will trigger a ptraceing parent to break. So it's basically the same as PTRACE_SYSCALL except you get to run a program in kernel space to decide whether you want to actually intercept a syscall based on its arguments. So it should be faster if you only want to intercept some syscalls (e.g. open() with specific paths).
I think this is probably the best option. Here's an article about it from the same author as the one above. Note they use classic BPF instead of eBPF but I guess you can use eBPF too.
Edit: Actually you can only use classic BPF, not eBPF. There's a LWN article about it.
Here are some related questions. The first one is definitely worth reading.
Can eBPF modify the return value or parameters of a syscall?
Intercept only syscall with PTRACE_SINGLESTEP
Is this is a good way to intercept system calls?
Minimal overhead way of intercepting system calls without modifying the kernel
There's also a good article about manipulating syscalls via ptrace here.
Some applications can trick strace/ptrace not to run, so the only real option I've had is using systemtap
Systemtap can intercept a bunch of system calls if need be due to its wild card matching. Systemtap is not C, but a separate language. In basic mode, the systemtap should prevent you from doing stupid things, but it also can run in "expert mode" that falls back to allowing a developer to use C if that is required.
It does not require you to patch your kernel (Or at least shouldn't), and once a module has been compiled, you can copy it from a test/development box and insert it (via insmod) on a production system.
I have yet to find a linux application that has found a way to work around/avoid getting caught by systemtap.
I don't have the syntax to do this gracefully with an LKM offhand, but this article provides a good overview of what you'd need to do: http://www.linuxjournal.com/article/4378
You could also just patch the sys_open function. It starts on line 1084 of file/open.c as of linux-2.6.26.
You might also see if you can't use inotify, systemtap or SELinux to do all this logging for you without you having to build a new system.
If you just want to watch what's opened, you want to look at the ptrace() function, or the source code of the commandline strace utility. If you actually want to intercept the call, to maybe make it do something else, I think the options you listed - LD_PRELOAD or a kernel module - are your only options.
If you just want to do it for debugging purposes look into strace, which is built in top of the ptrace(2) system call which allows you to hook up code when a system call is done. See the PTRACE_SYSCALL part of the man page.
if you really need a solution you might be interested in the DR rootkit that accomplishes just this, http://www.immunityinc.com/downloads/linux_rootkit_source.tbz2 the article about it is here http://www.theregister.co.uk/2008/09/04/linux_rootkit_released/
Sounds like you need auditd.
Auditd allows global tracking of all syscalls or accesses to files, with logging. You can set keys for specific events that you are interested in.
Using SystemTap may be an option.
For Ubuntu, install it as indicated in https://wiki.ubuntu.com/Kernel/Systemtap.
Then just execute the following and you will be listening on all openat syscalls:
# stap -e 'probe syscall.openat { printf("%s(%s)\n", name, argstr) }'
openat(AT_FDCWD, "/dev/fb0", O_RDWR)
openat(AT_FDCWD, "/sys/devices/virtual/tty/tty0/active", O_RDONLY)
openat(AT_FDCWD, "/sys/devices/virtual/tty/tty0/active", O_RDONLY)
openat(AT_FDCWD, "/dev/tty1", O_RDONLY)

example of thread specific data in C

Does anybody know of (or can post) an example of the use of thread-specific data? I'm looking for something that's clearly explained and easy to understand. I've got a global char * variable that I'm wanting to share between a couple threads, and I'm thinking this is what the thread specific data mechanism in C is for, am I right?
I'm a Linux user!
Actually, thread-specific data is for when you DON'T want to share data between threads -- with thread-specific data, each thread can use the same variable name, but that variable refers to distinct storage.
With gcc, you can declare a variable as thread-specific using the __thread attribute. If you are only trying to make a primitive type thread-specific, and you are only dealing with Linux and GCC, then this is a possible solution. If you actually want to be portable, though, between various unices (a desireable goal), or if you want to make complex data types thread-specific, than you need to use the UNIX routines for that...
The way it works in UNIX is that you use pthread_key_create before any thread is spawned, in order to create a unique variable name. You then use pthread_setspecific and pthread_getspecific to modify/access the data associated with the key. The semantics of the set/get specific functions is that the key behaves as an index into a map, where each thread has its own map, so executing these routines from different threads causes different data to be accessed/modified. If you can use a map, you can use thread-specific storage.
Obviously, when you are done, you need to call the appropriate routines to cleanup the data. You can use pthread_cleanup_push to schedule a cleanup routine to deallocate any datastructures you have associated with the thread-specific key, and you can use pthread_key_destroy when the key is no longer in use.
The errno variable from the original C runtime library is a good example. If a process has two threads making system calls, it would be extremely bad for that to be a shared variable.
thread 1:
int f = open (...);
if (f < 0)
printf ("error %d encountered\n", errno);
thread 2:
int s = socket (...);
if (s < 0)
printf ("error %d encountered\n", errno);
Imagine the confusion if open and socket are called at about the same time, both fail somehow, and both try to display the error number!
To solve this, multi-threaded runtime libraries make errno an item of thread-specific data.
The short answer to your question is: you don't have to do anything to share a variable between multiple thread. All the global variables are shared among all threads by default.
When a variable has to be different for each thread, if you are using a ISO-C99 compliant implementation (like GCC), you only need to add the __thread storage class keyword to your variable declaration, as in:
__thread char *variable;
This will instruct all the tiers in the building chain (cc, ld, ld.so, libc.so and libpthread.so) to manipulate this variable in a special thread-specific way.
The following compilers support this syntax (cf wikipedia):
Sun Studio C/C++
IBM XL C/C++
GNU C
Intel C/C++ (Linux systems)
Borland C++ Builder

Resources