Kernel modules __init macro in C

I want to create a loadable kernel module for Linux.
This is the code
#include <linux/module.h>
#include <linux/init.h>

static int __init mymodule_init(void)
{
    printk(KERN_INFO "My module worked!\n");
    return 0;
}

static void __exit mymodule_exit(void)
{
    printk(KERN_INFO "Unloading my module.\n");
}

module_init(mymodule_init);
module_exit(mymodule_exit);
MODULE_LICENSE("GPL");
Now pay attention to the __init macro. As the doc says:
The __init macro indicates to the compiler that the associated function
is only used during initialization. The compiler places all code marked
with __init into a special memory section that is freed after
initialization.
I'm trying to understand why the initialization function can end up leaking memory. Is it due to the FIFO disposition of function calls in the stack?

In very broad strokes:
Executable code (what source code is compiled into) takes up memory. A modern CPU reads the section of memory where the instructions reside and executes them. For most user-space applications, the code segment of a process's memory is loaded once and is never changed during program execution. The code is always there, unless programmers play around with it.
This isn't a problem, since the OS manages the process's virtual memory, and cold code pages will eventually be evicted from physical memory (and re-read from the executable on demand). Physical memory is never "wasted" like that in user space.
For the kernel, where code runs in privileged mode, nothing will "unload" unused pages the way it happens for user-mode code. If a function is placed in the kernel's regular code segment, it will take up physical memory for as long as the kernel runs, which can be quite a long time. If a function is only called once, that's quite a waste of space.
Now, while loadable kernel modules can be loaded and unloaded in general, so their code may not take up space indefinitely, it's still somewhat wasteful to keep space for a function that is only going to be called once.
Since modern CPUs treat code as a form of executable data, it's possible to place that data into a memory section that is not retained indefinitely. The function is loaded, then called, and then the section can be reused for something else. This is what the __init macro instructs the compiler to do: emit the code into a separate section (.init.text) that the kernel frees once initialization is done.
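As a rough sketch (the module and variable names here are made up), __init pairs naturally with __initdata for data that is likewise only needed while the module is initializing; both end up in sections the kernel frees once init is done:

#include <linux/init.h>
#include <linux/module.h>

/* Only needed during initialization, so tag it __initdata. */
static int boot_params[4] __initdata = { 1, 2, 3, 4 };

static int __init demo_init(void)
{
    pr_info("first boot param: %d\n", boot_params[0]);
    return 0;    /* once this returns, the init sections can be freed */
}

static void __exit demo_exit(void)
{
    pr_info("demo unloaded\n");
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");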

Related

Can we write to the jiffies variable?

From http://www.makelinux.net/ldd3/chp-7-sect-1.shtml
Needless to say, both jiffies and jiffies_64 must be considered
read-only
I wrote a program to verify and it successfully updates the jiffies value.
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/jiffies.h>

static int __init test_hello_init(void)
{
    jiffies = 0;
    pr_info("jiffies:%lu\n", jiffies);
    return 0;
}

static void __exit test_hello_exit(void)
{
}

MODULE_LICENSE("GPL");
module_init(test_hello_init);
module_exit(test_hello_exit);
This module successfully sets the jiffies to zero. Am I missing something?
What you are reading is merely a warning. It is an unwritten contract between you (the kernel module developer) and the kernel. You shouldn't modify the value of jiffies: it is not up to you to do so, and it is updated by the kernel according to a set of complicated rules that you should not worry about. The jiffies value is used internally by the scheduler, so bad things can happen if you modify it. Chances are that the variable you see in your module is only a thread-local copy of the real one, so modifying it could have no effect. In any case, you shouldn't do it. It is only provided to you as additional information that your module might need in order to implement some logic.
Of course, since you are working in C, there is no concept of "permissions" for variables. Anything that is mapped in a readable and writable region of memory can be modified, you could even modify data in read-only memory by changing the permissions first. You can do all sorts of bad stuff if you want. There are a lot of things you're not supposed to alter, even if you have the ability to do so.
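For reference, a sketch of the intended read-only use of jiffies (module names here are invented); elapsed time is derived with helpers that cope with counter wraparound:

#include <linux/init.h>
#include <linux/jiffies.h>
#include <linux/module.h>

static unsigned long start;

static int __init jiffies_demo_init(void)
{
    start = jiffies;    /* snapshot, read-only */
    pr_info("loaded at jiffies=%lu\n", start);
    return 0;
}

static void __exit jiffies_demo_exit(void)
{
    /* jiffies_to_msecs()/time_after() handle wraparound for you */
    pr_info("loaded for ~%u ms\n", jiffies_to_msecs(jiffies - start));
}

module_init(jiffies_demo_init);
module_exit(jiffies_demo_exit);
MODULE_LICENSE("GPL");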

Where is linux-vdso.so.1 present on the file system

I am learning about the vDSO and wrote a simple application which calls gettimeofday():
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    struct timeval current_time;

    if (gettimeofday(&current_time, NULL) == -1)
        perror("gettimeofday");
    getchar();
    exit(EXIT_SUCCESS);
}
ldd on the binary shows 'linux-vdso'
$ ldd ./prog
linux-vdso.so.1 (0x00007ffce147a000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6ef9e8e000)
/lib64/ld-linux-x86-64.so.2 (0x00007f6efa481000)
I did a find for the vdso library, but no such file is present on my file system:
sudo find / -name 'linux-vdso.so*'
Where is the library present?
It's a virtual shared object that doesn't have any physical file on the disk; it's a part of the kernel that's exported into every program's address space when it's loaded.
Its main purpose is to make certain frequently used system calls cheaper to invoke (they would otherwise incur the overhead of a full kernel entry). The most prominent is gettimeofday(2).
You can read more about it here: http://man7.org/linux/man-pages/man7/vdso.7.html
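As a rough, non-rigorous illustration of that speedup, you can compare the ordinary libc call (usually served by the vDSO) with forcing a real system call through syscall(2):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

#define ITER 1000000

int main(void)
{
    struct timeval tv, start, end;

    gettimeofday(&start, NULL);
    for (int i = 0; i < ITER; i++)
        gettimeofday(&tv, NULL);               /* usually served by the vDSO */
    gettimeofday(&end, NULL);
    printf("libc/vDSO   : %ld us\n",
           (end.tv_sec - start.tv_sec) * 1000000L + (end.tv_usec - start.tv_usec));

    gettimeofday(&start, NULL);
    for (int i = 0; i < ITER; i++)
        syscall(SYS_gettimeofday, &tv, NULL);  /* always a real kernel entry */
    gettimeofday(&end, NULL);
    printf("real syscall: %ld us\n",
           (end.tv_sec - start.tv_sec) * 1000000L + (end.tv_usec - start.tv_usec));

    return 0;
}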
find / -name '*vdso*.so*'
yields
/lib/modules/4.15.0-108-generic/vdso/vdso64.so
/lib/modules/4.15.0-108-generic/vdso/vdso32.so
/lib/modules/4.15.0-108-generic/vdso/vdsox32.so
linux-vdso.so.1 is, in effect, a virtual alias for whichever of those vdso*.so images matches the bitness of the process.
vDSO = virtual dynamic shared object
Note on vdsox32:
x32 is a Linux ABI which is kind of a mix between x86 and x64.
It uses 32-bit address size but runs in full 64-bit mode, including all 64-bit instructions and registers available.
Making system calls can be slow. In x86 32-bit systems, you can
trigger a software interrupt (int $0x80) to tell the kernel you
wish to make a system call. However, this instruction is
expensive: it goes through the full interrupt-handling paths in
the processor's microcode as well as in the kernel. Newer
processors have faster (but backward incompatible) instructions
to initiate system calls. Rather than require the C library to
figure out if this functionality is available at run time, the C
library can use functions provided by the kernel in the vDSO.
Note that the terminology can be confusing. On x86 systems, the
vDSO function used to determine the preferred method of making a
system call is named "__kernel_vsyscall", but on x86-64, the term
"vsyscall" also refers to an obsolete way to ask the kernel what
time it is or what CPU the caller is on.
One frequently used system call is gettimeofday(2). This system
call is called both directly by user-space applications as well
as indirectly by the C library. Think timestamps or timing loops
or polling—all of these frequently need to know what time it is
right now. This information is also not secret—any application
in any privilege mode (root or any unprivileged user) will get
the same answer. Thus the kernel arranges for the information
required to answer this question to be placed in memory the
process can access. Now a call to gettimeofday(2) changes from a
system call to a normal function call and a few memory accesses.
Also
You must not assume the vDSO is mapped at any particular location
in the user's memory map. The base address will usually be
randomized at run time every time a new process image is created
(at execve(2) time). This is done for security reasons, to prevent
"return-to-libc" attacks.
And
Since the vDSO is a fully formed ELF image, you can do symbol lookups
on it.
And also
If you are trying to call the vDSO in your own application rather than
using the C library, you're most likely doing it wrong.
as well as
Why does the vDSO exist at all? There are some system calls the
kernel provides that user-space code ends up using frequently, to
the point that such calls can dominate overall performance. This
is due both to the frequency of the call as well as the context-
switch overhead that results from exiting user space and entering
the kernel.
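vdso(7) also notes that the kernel hands a program the vDSO's load address via the auxiliary vector (AT_SYSINFO_EHDR), so you can confirm it is an in-memory ELF image rather than a file; a minimal sketch:

#include <stdio.h>
#include <sys/auxv.h>    /* getauxval, AT_SYSINFO_EHDR */

int main(void)
{
    /* Address of the vDSO ELF image, passed in by the kernel at exec time. */
    unsigned long vdso = getauxval(AT_SYSINFO_EHDR);
    if (vdso == 0) {
        puts("no vDSO mapped");
        return 1;
    }
    printf("vDSO mapped at %#lx\n", vdso);
    /* Bytes 1-3 of any ELF image spell "ELF" (byte 0 is 0x7f). */
    printf("magic: %.3s\n", (const char *)vdso + 1);
    return 0;
}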

Kernel Module memory access

I'm new to kernel modules and currently experimenting with it.
I've read that they have the same level of access as the kernel itself.
Does this mean they have access to physical memory and can see/overwrite
values of other processes (including the kernel memory space)?
I have written this simple C code to overwrite every memory address, but it doesn't do anything (I expected the system to just crash; I'm not sure whether this is touching physical memory or still virtual memory).
I run it with sudo insmod ./test.ko; the command just hangs there (because of the infinite loop, of course), but the system works fine when I exit manually.
#include <linux/module.h>
#include <linux/kernel.h>

int init_module(void)
{
    unsigned char *p = 0x0;

    /* Walk upwards from address 0, zeroing one byte at a time. */
    while (true) {
        *p = 0;
        p++;
    }
    return 0;
}

void cleanup_module(void)
{
    //
}
Kernel modules run with kernel privileges (including access to kernel memory and all peripherals). The reason your code isn't working is that you don't specify the init and exit functions. So you can load the module, but the kernel doesn't call your methods.
Please take a look at this example of a minimal kernel module; there you will find some explanation of the needed macros.
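Since the linked example isn't reproduced here, this is a sketch of what the answer is suggesting: the same write loop, but registered through the module_init()/module_exit() macros:

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>

static int __init overwrite_init(void)
{
    unsigned char *p = 0x0;

    /* Same experiment: walk upwards from address 0, writing one byte at a time. */
    while (true) {
        *p = 0;
        p++;
    }
    return 0;
}

static void __exit overwrite_exit(void)
{
}

module_init(overwrite_init);
module_exit(overwrite_exit);
MODULE_LICENSE("GPL");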

Can static memory be lazily allocated?

Having a static array in a C program:
#define MAXN (1<<13)
void f() {
static int X[MAXN];
//...
}
Can the Linux kernel choose not to map the addresses to physical memory until each page is actually used? How can X be full of 0s then? Is the memory zeroed when each page is accessed? How does that not impact the performance of the program?
Can the Linux kernel choose not to map the addresses to physical memory until each page is actually used?
Yes, it does this for all memory (except special memory used by drivers and the kernel itself).
How can X be full of 0s then? Is the memory zeroed when each page is accessed?
You're supposed to ignore this detail. As long as the memory is full of zeroes when you access it, we say it's full of zeroes.
How does that not impact the performance of the program?
It does.
Can the Linux kernel choose not to map the addresses to physical memory until each page is actually used?
Yes, with userspace memory it is always done.
How can X be full of 0s then? Is the memory zeroed when each page is accessed?
The kernel maintains a page full of 0s, when the user asks for a new page of the static array (static thus full of 0s before first use), the kernel provides the zeroed page, without permissions for the program to write. Writing to the array causes the copy-on-write mechanism to trigger: a page fault occurs, the kernel then allocates a writable page, maps it and resumes the program from the last instruction (the one that couldn't complete because of the page fault). Note that prezeroing optimizations change the implementation details here, but the theory's the same.
How does that not impact the performance of the program?
The program doesn't have to zero a (potentially) large number of pages at startup, and the kernel doesn't actually have to have the memory (you can ask for more memory than the system has, as long as you don't use it). Page faults will be generated during the execution of the program, but they can be minimized; see mmap() and madvise() with MADV_SEQUENTIAL. Remember that the Translation Lookaside Buffer is not infinite; it can only hold so many entries.
Sources: A linux memory FAQ, Introduction to Memory Management in Linux by Alan Ott
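A small user-space experiment can make the laziness visible (a sketch; MAXN is bumped so the effect is easy to see, and exact numbers will vary by system): resident memory barely grows just because the static array exists, and jumps only once its pages are actually written:

#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

#define MAXN (1 << 24)           /* 16M ints = 64 MiB, large enough to notice */

static int X[MAXN];              /* zero-filled, but not yet backed by RAM */

static long rss_kb(void)
{
    struct rusage ru;

    getrusage(RUSAGE_SELF, &ru); /* ru_maxrss is in kilobytes on Linux */
    return ru.ru_maxrss;
}

int main(void)
{
    printf("before touching X: ~%ld kB resident\n", rss_kb());
    memset(X, 1, sizeof(X));     /* writing faults in every page of the array */
    printf("after touching X:  ~%ld kB resident\n", rss_kb());
    return 0;
}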

When a binary file runs, does it copy its entire binary data into memory at once? Could I change that?

Does it copy the entire binary into memory before it executes? I am interested in this question and want to change that behaviour. I mean, if the binary is 100 MB big (which seems impossible), could I run it while it is still being copied into memory? Would that be possible?
Or could you tell me how to see the way it runs? Which tools do I need?
The theoretical model for an application-level programmer makes it appear that this is so. In point of fact, the normal startup process (at least in Linux 1.x, I believe 2.x and 3.x are optimized but similar) is:
The kernel creates a process context (more-or-less, virtual machine)
Into that process context, it defines a virtual memory mapping that maps
from RAM addresses to the start of your executable file
Assuming that you're dynamically linked (the default/usual), the ld.so program
(e.g. /lib/ld-linux.so.2) defined in your program's headers sets up memory mapping for shared libraries
The kernel does a jmp into the startup routine of your program (for a C program, that's
something like crtprec80, which calls main). Since it has only set up the mapping, and not actually loaded any pages(*), this causes a Page Fault from the CPU's Memory Management Unit, which is an interrupt (exception, signal) to the kernel.
The kernel's Page Fault handler loads some section of your program, including the part
that caused the page fault, into RAM.
As your program runs, if it accesses a virtual address that doesn't have RAM backing
it up right now, Page Faults will occur and cause the kernel to suspend the program
briefly, load the page from disc, and then return control to the program. This all
happens "between instructions" and is normally undetectable.
As you use malloc/new, the kernel creates read-write pages of RAM (without disc backing files) and adds them to your virtual address space.
If you throw a Page Fault by trying to access a memory location that isn't set up in the virtual memory mappings, you get a Segmentation Violation Signal (SIGSEGV), which is normally fatal.
As the system runs out of physical RAM, pages of RAM get removed; if they are read-only copies of something already on disc (like an executable, or a shared object file), they just get de-allocated and are reloaded from their source; if they're read-write (like memory you "created" using malloc), they get written out to the ( page file = swap file = swap partition = on-disc virtual memory ). Accessing these "freed" pages causes another Page Fault, and they're re-loaded.
Generally, though, until your process is bigger than available RAM — and data is almost always significantly larger than the executable — you can safely pretend that you're alone in the world and none of this demand paging stuff is happening.
So: effectively, the kernel already is running your program while it's being loaded (and might never even load some pages, if you never jump into that code / refer to that data).
If your startup is particularly sluggish, you could look at the prelink system to optimize shared library loads. This reduces the amount of work that ld.so has to do at startup (between the exec of your program and main getting called, as well as when you first call library routines).
Sometimes, linking statically can improve performance of a program, but at a major expense of RAM — since your libraries aren't shared, you're duplicating "your libc" in addition to the shared libc that every other program is using, for example. That's generally only useful in embedded systems where your program is running more-or-less alone on the machine.
(*) In point of fact, the kernel is a bit smarter, and will generally preload some pages
to reduce the number of page faults, but the theory is the same, regardless of the
optimizations
No, it only loads the necessary pages into memory. This is demand paging.
I don't know of a tool which can really show that in real time, but you can have a look at /proc/xxx/maps, where xxx is the PID of your process.
While you ask a valid question, I don't think it's something you need to worry about. First off, a binary of 100M is not impossible. Second, the system loader will load the pages it needs from the ELF (Executable and Linkable Format) into memory, and perform various relocations, etc. that will make it work, if necessary. It will also load all of its requisite shared library dependencies in the same way. However, this is not an incredibly time-consuming process, and one that doesn't really need to be optimized. Arguably, any "optimization" would have a significant overhead to make sure it's not trying to use something that hasn't been loaded in its due course, and would possibly be less efficient.
If you're curious what gets mapped, as fge says, you can check /proc/pid/maps. If you'd like to see how a program loads, you can try running a program with strace, like:
strace ls
It's quite verbose, but it should give you some idea of the mmap() calls, etc.
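If you want to watch demand paging directly, mincore(2) reports which pages of a mapping are currently resident. A sketch (error handling mostly omitted) that maps a file and checks residency of its first page before and after touching it:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* Map the file; nothing is read from disk yet. */
    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    long pagesz = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + pagesz - 1) / pagesz;
    unsigned char vec[npages];

    /* May already be 1 if the file happens to be in the page cache. */
    mincore(p, st.st_size, vec);
    printf("first page resident before access: %d\n", vec[0] & 1);

    volatile unsigned char c = p[0];   /* touch it: page fault, data is read in */
    (void)c;

    mincore(p, st.st_size, vec);
    printf("first page resident after access:  %d\n", vec[0] & 1);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}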
