Where is linux-vdso.so.1 present on the file system - c

I am learning about the vDSO and wrote a simple application that calls gettimeofday():
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
    struct timeval current_time;

    if (gettimeofday(&current_time, NULL) == -1)
        printf("gettimeofday");

    getchar();
    exit(EXIT_SUCCESS);
}
ldd on the binary shows 'linux-vdso'
$ ldd ./prog
linux-vdso.so.1 (0x00007ffce147a000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6ef9e8e000)
/lib64/ld-linux-x86-64.so.2 (0x00007f6efa481000)
I did a find for the vDSO library, but there is no such file present on my file system:
sudo find / -name 'linux-vdso.so*'
Where is the library present?

It's a virtual shared object that doesn't have any physical file on the disk; it's part of the kernel that gets mapped into every program's address space when the program is loaded.
Its main purpose is to make calling certain frequently used system calls more efficient (they would otherwise incur the full overhead of entering the kernel). The most prominent of these is gettimeofday(2).
You can read more about it here: http://man7.org/linux/man-pages/man7/vdso.7.html
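If you want to convince yourself that it is only a memory mapping and not a file, you can ask the auxiliary vector where the kernel placed it. A minimal sketch (getauxval() and the AT_SYSINFO_EHDR tag are standard glibc/Linux interfaces):
#include <elf.h>
#include <stdio.h>
#include <sys/auxv.h>

int main(void)
{
    /* AT_SYSINFO_EHDR holds the address of the vDSO's ELF header in this process */
    unsigned long vdso = getauxval(AT_SYSINFO_EHDR);
    printf("vDSO mapped at %#lx (no file on disk backs this mapping)\n", vdso);
    return 0;
}
The printed address corresponds to the [vdso] line you would see in that process's /proc/<pid>/maps.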

find / -name '*vdso*.so*'
yields
/lib/modules/4.15.0-108-generic/vdso/vdso64.so
/lib/modules/4.15.0-108-generic/vdso/vdso32.so
/lib/modules/4.15.0-108-generic/vdso/vdsox32.so
linux-vdso.so.1 is, in effect, a virtual alias for whichever of these vdso*.so images matches the bitness of your process.
vDSO = virtual dynamic shared object
Note on vdsox32:
x32 is a Linux ABI that is a kind of hybrid of x86 and x86-64:
it uses 32-bit pointers and address space but runs the CPU in full 64-bit mode, with all 64-bit instructions and registers available.
Making system calls can be slow. In x86 32-bit systems, you can
trigger a software interrupt (int $0x80) to tell the kernel you
wish to make a system call. However, this instruction is
expensive: it goes through the full interrupt-handling paths in
the processor's microcode as well as in the kernel. Newer
processors have faster (but backward incompatible) instructions
to initiate system calls. Rather than require the C library to
figure out if this functionality is available at run time, the C
library can use functions provided by the kernel in the vDSO.
Note that the terminology can be confusing. On x86 systems, the
vDSO function used to determine the preferred method of making a
system call is named "__kernel_vsyscall", but on x86-64, the term
"vsyscall" also refers to an obsolete way to ask the kernel what
time it is or what CPU the caller is on.
One frequently used system call is gettimeofday(2). This system
call is called both directly by user-space applications as well
as indirectly by the C library. Think timestamps or timing loops
or polling—all of these frequently need to know what time it is
right now. This information is also not secret—any application
in any privilege mode (root or any unprivileged user) will get
the same answer. Thus the kernel arranges for the information
required to answer this question to be placed in memory the
process can access. Now a call to gettimeofday(2) changes from a
system call to a normal function call and a few memory accesses.
Also
You must not assume the vDSO is mapped at any particular location
in the user's memory map. The base address will usually be
randomized at run time every time a new process image is created
(at execve(2) time). This is done for security reasons, to prevent
"return-to-libc" attacks.
And
Since the vDSO is a fully formed ELF image, you can do symbol lookups
on it.
And also
If you are trying to call the vDSO in your own application rather than
using the C library, you're most likely doing it wrong.
as well as
Why does the vDSO exist at all? There are some system calls the
kernel provides that user-space code ends up using frequently, to
the point that such calls can dominate overall performance. This
is due both to the frequency of the call as well as the context-
switch overhead that results from exiting user space and entering
the kernel.
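To see for yourself that the vDSO is a fully formed ELF image, one approach (a rough sketch, not from the man page; the output file name vdso.dump is made up and error handling is minimal) is to find the [vdso] line in /proc/self/maps and dump that memory range to a file, which you can then inspect with objdump -T to list symbols such as __vdso_gettimeofday:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    if (!maps) { perror("fopen maps"); return EXIT_FAILURE; }

    char line[512];
    unsigned long start = 0, end = 0;
    while (fgets(line, sizeof line, maps)) {
        if (strstr(line, "[vdso]")) {          /* e.g. "7fff...-7fff... r-xp ... [vdso]" */
            sscanf(line, "%lx-%lx", &start, &end);
            break;
        }
    }
    fclose(maps);
    if (!start) { fprintf(stderr, "no [vdso] mapping found\n"); return EXIT_FAILURE; }

    FILE *out = fopen("vdso.dump", "w");
    if (!out) { perror("fopen vdso.dump"); return EXIT_FAILURE; }
    fwrite((const void *)start, 1, end - start, out);   /* copy the mapping to disk */
    fclose(out);
    printf("dumped %lu bytes from %#lx to vdso.dump\n", end - start, start);
    return EXIT_SUCCESS;
}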

What are the exact programming instructions that are in user space?

I know that a process switches between user mode and kernel mode while running. What confuses me is whether every line of code possibly needs the kernel. Below is an example; could I get an explanation of the kernel's role in executing the following lines? Does the following actually require kernel mode?
if(a < 0)
a++
What confuses me is whether every line of code possibly needs the kernel.
Most code in user-space is executed without the kernel being involved. The kernel becomes involved (and the CPU switches from user-space to kernel) when:
a) The user-space code explicitly asks the kernel to do something (calls a system call).
b) There's an IRQ (from a device) that interrupts user-space code.
c) The kernel is providing some functionality that user-space code is unaware of. The most common reason is virtual memory management; but debugging and profiling are other reasons.
d) Asynchronous notifications (e.g. something causing a switch to kernel so that kernel can redirect the program to a suitable signal handler).
e) The user-space code does something illegal (crashes).
Does the following actually require kernel mode?
That code (if(a < 0) a++;) probably won't require kernel's assistance; but it possibly might. For example, if the variable a is in memory that was previously sent to swap space, then any attempt to access a is a request for the kernel to fetch that data from swap space. In a similar way, if the executable file was memory mapped but not loaded yet (a common optimization to improve program startup time), then attempting to execute any instruction (regardless of what the instruction is) could ask the kernel to fetch the code from the executable file on disk.
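To make the paging example above concrete, here is a small sketch of point (c): a plain memory read that nevertheless drags the kernel in through a page fault (the file name example.txt is made up; any non-empty readable file works):
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("example.txt", O_RDONLY);
    if (fd == -1) { perror("open"); return EXIT_FAILURE; }

    /* Map one page of the file; nothing is read from disk yet. */
    char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    /* This looks like an ordinary load, but the first access faults and the
       kernel transparently reads the page from the file before the load completes. */
    printf("first byte: %c\n", p[0]);

    munmap(p, 4096);
    close(fd);
    return EXIT_SUCCESS;
}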
Short answer:
It depends on what you are trying to do. The code above, depending on the environment and on how it is compiled, shouldn't need the kernel at all. The CPU executes machine code directly, only trapping to the kernel on instructions like syscall, or on events like a page fault or an interrupt.
The ISA is designed so that a kernel can set up the page tables in a way that stops user-space from taking over the machine, even though the CPU fetches and executes user-space machine code directly. This is how user-space code can run efficiently when it is just operating on its own data, doing pure computation rather than hardware access.
Long answer:
Comparing a value and incrementing it shouldn't require the kernel at all. On x86-64, the code above could be written like this (in NASM syntax):
; a is in RAX, perhaps a return value from some earlier function
cmp rax, 0          ; if (a < 0), implemented as
jnl no_increase     ; a jump over the inc if a is Not Less-than 0
inc rax
no_increase:
Actual compilers do it branchlessly, with various tricks as you can see on the Godbolt compiler explorer.
Clearly there are no system calls here, so this piece of code can run on any x86-64 device, although on its own it wouldn't do anything meaningful.
What requires the kernel are system calls. Now, system calls aren't strictly required to produce output: in theory you could produce output by writing to a memory location that, say, corresponds to video memory and manipulating pixels to put something on the screen, but for a userland process this isn't possible because of virtual memory.
A userspace application needs a kernel to exist; if a kernel did not exist, then userspace wouldn't exist either :) and please note that not every kernel provides a userspace.
So only doing something like:
write(STDOUT_FILENO, "windows sucks linux rocks", 25);
would obviously require a kernel.
Writing to or reading from an arbitrary memory location, for example 0xB8000 to manipulate VGA text-mode video memory, doesn't need a kernel.
TL;DR: the example code you provided needs a kernel only because it runs in userspace; it could also be run on a system where userspace and kernel don't exist at all and work perfectly fine (e.g. on microcontrollers).
In simpler words: it doesn't require the kernel to work, since it doesn't use any system calls, but for meaningful operation on a modern operating system it would at least need an exit syscall to terminate with a status code; otherwise you will see a segmentation fault, even though you did no dynamic allocation yourself. A sketch of such a minimal program follows.
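As a rough illustration of that last point (a sketch for x86-64 Linux only, built with something like gcc -nostdlib -static -o tiny tiny.c): a program that does the comparison and increment entirely in user space and then makes exactly one system call, exit, so that it terminates cleanly instead of crashing.
/* tiny.c - no libc; the only kernel interaction is the final exit syscall */
void _start(void)
{
    long a = -1;

    if (a < 0)          /* pure computation: no kernel involvement */
        a++;

    /* raw exit(a) via the syscall instruction; 60 is __NR_exit on x86-64 */
    __asm__ volatile ("syscall"
                      : /* no outputs */
                      : "a"(60), "D"(a)
                      : "rcx", "r11", "memory");
    __builtin_unreachable();
}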
Whatever code we write obviously runs in user mode. Kernel mode only comes into the picture when the code performs a system call,
and since the if() is not calling any system function, it is not going to run in kernel mode.

Why do these instruction counts of ls differ so much? (ptrace vs perf vs qemu)

I want to count the total number of instructions executed when running /bin/ls.
I used 3 methods whose results differ heavily, and I don't have a clue why.
1. Instruction counting with ptrace
I wrote a piece of code that invokes an instance of ls and singlesteps through it with ptrace:
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <sys/user.h>
#include <sys/reg.h>
#include <sys/syscall.h>

int main()
{
    pid_t child;

    child = fork(); //create child
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        char* child_argv[] = {"/bin/ls", NULL};
        execv("/bin/ls", child_argv);
    }
    else {
        int status;
        long long ins_count = 0;
        while (1)
        {
            //stop tracing if child terminated successfully
            wait(&status);
            if (WIFEXITED(status))
                break;
            ins_count++;
            ptrace(PTRACE_SINGLESTEP, child, NULL, NULL);
        }
        printf("\n%lld Instructions executed.\n", ins_count);
    }
    return 0;
}
Running this code gives me 516.678 Instructions executed.
2. QEMU singlestepping
I simulated ls using qemu in singlestep mode and logged all incoming instructions into a log file using the following command:
qemu-x86_64 -singlestep -D logfile -d in_asm /bin/ls
According to qemu ls executes 16.836 instructions.
3. perf
sudo perf stat ls
This command gave me 8.162.180 instructions executed.
I know that most of these instructions come from the dynamic linker and it is fine that they get counted. But why do these numbers differ so much? Shouldn't they all be the same?
Your method of counting instructions with qemu was wrong: the in_asm option only logs each translated instruction once per compiled translation block, so after QEMU chains translation blocks together it jumps directly to already-translated blocks, and the qemu count ends up lower than the other tools. A better way in practice is to use -d nochain,exec together with -singlestep (e.g. qemu-x86_64 -singlestep -d nochain,exec -D logfile /bin/ls).
Still, there is an instruction-count difference between these tools. I have tried running qemu from different directories to produce the logs, with a statically linked guest program, and the log files show different instruction counts; some glibc startup or init code that looks at the environment arguments is probably what causes this difference.
Why do these instruction counts differ so much? Because they really measure different things, and only the unit of measure is the same. It's as if you were weighing something you brought from the store, and one person weighed everything without packages nor even stickers on it, another was weighing it in packages and included the shopping bags too, and yet another also added the mud you brought into the house on your boots.
That's pretty much what is happening here: the instruction counts are not the instruction counts only of what's inside the ls binary, but can also include the libraries it uses, the services of the kernel loader needed to bring those libraries in, and finally the code executed in the process but in the kernel context. The methods you used all behave differently in that respect. So the question is: what do you need out of that measurement? If you need the "total effort", then certainly the largest number is what you want: this will include some of the overhead caused by the kernel. If you need the "I just want to know what happened in ls", then the smallest number is the one you want.
Your program using PTRACE_SINGLESTEP should count all user-space instructions executed in the process. A syscall instruction counts as one because you can't single-step into the kernel; that's opaque to ptrace.
That should be pretty similar to perf stat --all-user or perf stat -e instructions:u to count user-space instructions. (Probably counting the same within a few instructions out of however many millions.) That perf option or the :u event modifier tells it to program the HW performance counters to only count the event while the CPU is not at privilege level 0 (kernel mode); modern x86 CPUs have hardware support for this, so perf doesn't have to run instructions inside the kernel on every transition to stop and restart the counters.
Both of these include everything that happens in user-space, including ld-linux.so dynamic linker code that runs before execution reaches _start in a dynamic executable.
See also How do I determine the number of x86 machine instructions executed in a C program? which includes hand-written asm source for a static executable that only runs 2 instructions in user-space. perf stat --all-user counts 3 instructions for it on my Skylake. That Q&A also has a bunch of other discussion about what happens in a user-space process, and hopefully useful links.
Qemu counting is totally different because it does dynamic translation. See wen liang's answer and What instructions does qemu trace? which Peter Maydell linked in a comment on Kuba's answer here.
If you want to use a tool like this, you might want Intel's SDE, which uses Intel PIN dynamic instrumentation. It can histogram instruction types for you, as well as counting a total. See my answer on How do I determine the number of x86 machine instructions executed in a C program? for links.

Calling system calls from the kernel code

I am trying to create a mechanism to read performance counters for processes. I want this mechanism to be executed from within the kernel (version 4.19.2) itself.
I am able to do it from user space using the sys_perf_event_open() system call as follows.
syscall (__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
I would like to invoke this call from the kernel space. I got some basic idea from here How do I use a Linux System call from a Linux Kernel Module
Here are the steps I took to achieve this:
To make sure that kernel virtual addresses are treated as valid, I have used set_fs(), get_fs() and get_ds().
Since sys_perf_event_open() is declared in include/linux/syscalls.h, I have included that header in the code.
Eventually, the code for calling the systems call looks something like this:
mm_segment_t fs;
fs = get_fs();
set_fs(get_ds());
long ret = sys_perf_event_open(&pe, pid, cpu, group_fd, flags);
set_fs(fs);
Even after these measures, I get an error claiming "implicit declaration of function ‘sys_perf_event_open’". Why does this pop up when the header file declaring it is already included? Does it have something to do with the way one should call system calls from within kernel code?
In general (not specific to Linux) the work done for systems calls can be split into 3 categories:
switching from user context to kernel context (and back again on the return path). This includes things like changing the processor's privilege level, messing with gs, fiddling with stacks, and doing security mitigations (e.g. for Meltdown). These things are expensive, and if you're already in the kernel they're useless and/or dangerous.
using a "function number" parameter to find the right function to call, and calling it. This typically includes some sanity checks (does the function exist?) and a table lookup, plus code to mangle input and output parameters that's needed because the calling conventions used for system calls (in user space) is not the same as the calling convention that normal C functions use. These things are expensive, and if you're already in the kernel they're useless and/or dangerous.
the final normal C function that ends up being called. This is the function that you might have (see note) been able to call directly without using any of the expensive, useless and/or dangerous system call junk.
Note: If you aren't able to call the final normal C function directly without using (any part of) the system call junk (e.g. if the final normal C function isn't exposed to other kernel code); then you must determine why. For example, maybe it's not exposed because it alters user-space state, and calling it from kernel will corrupt user-space state, so it's not exposed/exported to other kernel code so that nobody accidentally breaks everything. For another example, maybe there's no reason why it's not exposed to other kernel code and you can just modify its source code so that it is exposed/exported.
Calling system calls from inside the kernel using the sys_* interface is discouraged for the reasons that others have already mentioned. In the particular case of x86_64 (which I guess is your architecture), starting from kernel version v4.17 it is a hard requirement not to use that interface (with a few exceptions). It was possible to invoke system calls directly prior to this version, but now the error you are seeing pops up (that's why there are plenty of tutorials on the web using sys_*). The alternative proposed in the Linux documentation is to define a wrapper between the syscall and the actual syscall's code, which can be called from within the kernel like any other function:
int perf_event_open_wrapper(...) {
    // actual perf_event_open() code
}

SYSCALL_DEFINE5(perf_event_open, ...) {
    return perf_event_open_wrapper(...);
}
source: https://www.kernel.org/doc/html/v4.19/process/adding-syscalls.html#do-not-call-system-calls-in-the-kernel
Which kernel version are we talking about?
Anyhow, you could either get the address of the sys_call_table by looking at the System.map file or, if it is exported, look the symbol up (have a look at kallsyms.h). Once you have the address of the syscall table, you may treat it as a void pointer array (void **) and index into it to find the function you want, i.e. sys_call_table[__NR_open] would be open's address, so you could store it in a void pointer and then call it.
Edit: What are you trying to do, and why can't you do it without calling syscalls? You must understand that syscalls are the kernel's API to userland and should not really be used from inside the kernel, so this practice should be avoided.
calling system calls from kernel code
(I am mostly answering that title; to summarize: it is forbidden to even think of that)
I don't understand your actual problem (I feel you need to explain it more in your question, which is unclear and lacks a lot of useful motivation and context). But a general piece of advice, following the Unix philosophy, is to minimize the size and attack surface of your kernel or kernel-module code, and to move such code, as much as is convenient, to user-land (in particular with the help of systemd) as soon as your kernel code requires some system calls. Your question is by itself a violation of most Unix and Linux cultural norms.
Have you considered using efficient kernel-to-user-land communication, in particular netlink(7) with socket(7)? Perhaps you also want some driver-specific kernel thread.
My intuition would be that AF_NETLINK with socket(2), used from some user-land daemon started by systemd early at boot time, is exactly the fit for your (unexplained) needs. And eventfd(2) might also be relevant.
But just thinking of using system calls from inside the kernel triggers a huge flashing red light in my brain and I tend to believe it is a symptom of a major misunderstanding of operating system kernels in general. Please take time to read Operating Systems: Three Easy Pieces to understand OS philosophy.

Make a variable local to a thread [duplicate]

Short version of question: What parameter do I need to pass to the clone system call on x86_64 Linux system if I want to allocate a new TLS area for the thread that I am creating.
Long version:
I am working on a research project and for something I am experimenting with I want to create threads using the clone system call instead of using pthread_create. However, I also want to be able to use thread local storage. I don't plan on creating many threads right now, so it would be fine for me to create a new TLS area for each thread that I create with the clone system call.
I was looking at the man page for clone and it has the following information about the flag for the TLS parameter:
CLONE_SETTLS (since Linux 2.5.32)
The newtls argument is the new TLS (Thread Local Storage) descriptor.
(See set_thread_area(2).)
So I looked at the man page for set_thread_area and noticed the following which looked promising:
When set_thread_area() is passed an entry_number of -1, it uses a
free TLS entry. If set_thread_area() finds a free TLS entry, the value of
u_info->entry_number is set upon return to show which entry was changed.
However, after experimenting with this some, it appears that set_thread_area is not implemented on my system (Ubuntu 10.04 on an x86_64 platform). When I run the following code I get an error that says: set_thread_area() failed: Function not implemented
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/unistd.h>
#include <asm/ldt.h>
int main()
{
    struct user_desc u_info;
    u_info.entry_number = -1;

    int rc = syscall(SYS_set_thread_area, &u_info);
    if (rc < 0) {
        perror("set_thread_area() failed");
        exit(-1);
    }
    printf("entry_number is %d", u_info.entry_number);
}
I also saw, when I used strace to see what happens when pthread_create is called, that I don't see any calls to set_thread_area. I have also been looking at the nptl pthread source code to try to understand what they do when creating threads. But I don't completely understand it yet, and I think it is more complex than what I'm trying to do, since I don't need something as robust as the pthread implementation. I'm assuming that the set_thread_area system call is for x86 and that a different mechanism is used on x86_64. But for the moment I have not been able to figure out what it is, so I'm hoping this question will help me get some ideas about what I need to look at.
I am working on a research project and for something I am experimenting with I want to create threads using the clone system call instead of using pthread_create
In the exceedingly unlikely scenario where your new thread never calls any libc functions (either directly, or by calling something else which calls libc; this also includes dynamic symbol resolution via the PLT), you can pass whatever TLS storage you desire as the new_tls parameter to clone.
You should ignore all references to set_thread_area -- they only apply to the 32-bit/ix86 case.
If you are planning to use libc in your newly-created thread, you should abandon your approach: libc expects TLS to be set up a certain way, and there is no way for you to arrange for such setup when you call clone directly. Your new thread will intermittently crash when libc discovers that you didn't set up TLS properly. Debugging such crashes is exceedingly difficult, and the only reliable solution is ... to use pthread_create.
The other answer is absolutely correct in that setting up a thread outside of libc's control is guaranteed to cause trouble at a certain point. You can do it, but you can no longer rely on libc's services, definitely not on any of the pthread_* functions or thread-local variables (defined as such using __thread or thread_local).
That being said, you can set either of the segment registers used for TLS (GS and FS) even on x86-64. The system call to look for is arch_prctl(ARCH_SET_GS, ...).
You can see an example comparing setting up TLS registers on i386 and x86-64 in this piece of code.
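For what it's worth, here is a minimal user-space sketch of that mechanism on x86-64 (the tls_block name is made up, and this only demonstrates the register being set; it is not a substitute for proper TLS setup): arch_prctl(ARCH_SET_GS, ...) points GS at a block of memory, and a %gs-relative load then reads from that block.
#define _GNU_SOURCE
#include <asm/prctl.h>      /* ARCH_SET_GS */
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

static long tls_block[4];   /* pretend per-thread storage */

int main(void)
{
    tls_block[0] = 42;

    /* Point the GS base register at our block (raw syscall, x86-64 only). */
    if (syscall(SYS_arch_prctl, ARCH_SET_GS, tls_block) != 0) {
        perror("arch_prctl(ARCH_SET_GS)");
        exit(EXIT_FAILURE);
    }

    long v;
    __asm__ volatile ("movq %%gs:0, %0" : "=r"(v));   /* loads tls_block[0] via GS */
    printf("read %ld through %%gs\n", v);
    return 0;
}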

Where can I find system call source code?

In Linux, where can I find the source code for all system calls, given that I have the source tree? Also, if I wanted to look up the source code and assembly for a particular system call, is there something I can type in the terminal, like my_system_call?
You'll need the Linux kernel sources in order to see the actual source of the system calls. Manual pages, if installed on your local system, only contain the documentation of the calls and not their source itself.
Unfortunately for you, system calls aren't stored in just one particular location in the whole kernel tree. This is because various system calls can refer to different parts of the system (process management, filesystem management, etc.) and therefore it would be infeasible to store them apart from the part of the tree related to that particular part of the system.
The best thing you can do is look for the SYSCALL_DEFINE[0-6] macro. It is used (obviously) to define the given block of code as a system call. For example, fs/ioctl.c has the following code:
SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
{
    /* do freaky ioctl stuff */
}
Such a definition means that the ioctl syscall is declared and takes three arguments. The number after SYSCALL_DEFINE is the number of arguments. For example, in the case of getpid(void), declared in kernel/timer.c, we have the following code:
SYSCALL_DEFINE0(getpid)
{
    return task_tgid_vnr(current);
}
Hope that clears things up a little.
From an application's point of view, a system call is an elementary and atomic operation done by the kernel.
The Assembly Howto explains what is happening, in terms of machine instruction.
Of course, the kernel is doing a lot of things when handling a syscall.
Actually, you could almost believe that the entire kernel code is devoted to handling all the system calls (this is not entirely true, but almost; from an application's point of view, the kernel is only visible through system calls). The other answer by Daniel Kamil Kozar explains which kernel function starts the handling of a given system call (but very often, many other parts of the kernel indirectly participate in system calls; for example, the scheduler participates indirectly in implementing fork because it manages the child process created by a successful fork syscall).
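As a small illustration of the "visible only through system calls" point, the same kernel code path can be reached either through the libc wrapper or through the generic syscall(2) interface (a trivial sketch):
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* Both lines end up in the kernel's getpid implementation. */
    printf("getpid()            -> %ld\n", (long)getpid());
    printf("syscall(SYS_getpid) -> %ld\n", syscall(SYS_getpid));
    return 0;
}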
I know it's old, but I was searching for the source for _system_call() too and found this tidbit
Actual code for the system_call entry point can be found in /usr/src/linux/kernel/sys_call.S. Actual code for many of the system calls can be found in /usr/src/linux/kernel/sys.c, and the rest are found elsewhere. find is your friend.
I assume this is dated, because I don't even have that file. However, grep found ENTRY(system_call) in arch/x86/kernel/entry_64.S, and it seems to be the thing that dispatches the individual system calls. I'm not up on my x86 asm right now, so you'll have to look and see if this is what you wanted.
