How does the kernel know the address of a process file table? - c

A file descriptor contains the index of an entry within the process file table. However, the index alone is not enough to locate a particular entry in the [process] file table. Knowledge about the address of the first entry within the table is also required. So, my question is this: How does the kernel, only provided with the file descriptor as an argument in system calls such as read and write, manage to determine the location of the intended entry within the process file table?
I tried to see what happens under the hood by converting the following C code into x86-64 assembly, but all I got was an additional assembly open instruction.
#include <stdio.h>
int main(int argc, char* argv[]) {
    FILE* fd = fopen("home/mhdi/miles","r");
    return 0;
}
.file "open.c"
.intel_syntax noprefix
.text
.section .rodata
.LC0:
.string "r"
.LC1:
.string "home/mhdi/miles"
.text
.globl main
.type main, @function
main:
.LFB6:
.cfi_startproc
endbr64
push rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
mov rbp, rsp
.cfi_def_cfa_register 6
sub rsp, 32
mov DWORD PTR -20[rbp], edi
mov QWORD PTR -32[rbp], rsi
lea rax, .LC0[rip]
mov rsi, rax
lea rax, .LC1[rip]
mov rdi, rax
call fopen@PLT
mov QWORD PTR -8[rbp], rax
mov eax, 0
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE6:
.size main, .-main
.ident "GCC: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0"
.section .note.GNU-stack,"",@progbits
.section .note.gnu.property,"a"
.align 8
.long 1f - 0f
.long 4f - 1f
.long 5
0:
.string "GNU"
1:
.align 8
.long 0xc0000002
.long 3f - 2f
2:
.long 0x3
3:
.align 8
4:

A file descriptor contains the index of an entry within the process file table. However, the index alone is not enough to locate a particular entry in the [process] file table. Knowledge about the address of the first entry within the table is also required. So, my question is this: How does the kernel, only provided with the file descriptor as an argument in system calls such as read and write, manage to determine the location of the intended entry within the process file table?
A file descriptor is an integer that the kernel hands to a process to identify one of that process's open files. Because the kernel and the user process do not share the same virtual address space, the process needs some way to tell the kernel which file an operation applies to, since it can have several files open at the same time. The user process has no way to access the per-process file table that the kernel maintains for it: the table is stored in the kernel's private data for that process and is not mapped into the process's user address space. The kernel does not need to be told where the table is, because it always knows which process issued the system call and reaches the table through its own record for that process; the descriptor is then just an index into that table.
Historically, the table was stored in a per-process private area called the u-area. Today the structures have changed a great deal, but the per-process data still includes things like the inode used as the root directory for path lookups (the root directory of the process), the current working directory inode for lookups relative to the current directory, parameters such as the resource limits of the process (in-core memory limit, maximum file size, maximum execution time, maximum memory to allocate), the process umask, the user and group IDs, the open file table array (whose index is the actual file descriptor), the process session ID, and the kernel stack the process uses when running in kernel mode. In a multithreaded operating system there is also a per-thread structure in the kernel that holds things like the saved contents of the user-mode CPU registers.
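The lookup described above can be sketched in plain C. This is a toy user-space model, not kernel code: the names `struct process`, `current_process`, `lookup_fd` and the limit `MAX_FDS` are all hypothetical. In Linux the corresponding real pieces are the `current` task pointer and its `files` member, but their layout is considerably more involved than this.

```c
#include <stddef.h>

#define MAX_FDS 8                 /* hypothetical limit, kept tiny for the sketch */

struct open_file {                /* stands in for the kernel's open-file object */
    const char *path;
    long offset;
};

struct process {                  /* stands in for the kernel's per-process record */
    int pid;
    struct open_file *fd_table[MAX_FDS];   /* array index == file descriptor */
};

/* The kernel always knows which process is on the CPU; modelled as a global. */
static struct process *current_process;

/* First step of a read(fd, ...) style call: validate fd, then index the
 * table that belongs to the calling process. No table address is passed in. */
static struct open_file *lookup_fd(int fd)
{
    if (fd < 0 || fd >= MAX_FDS)
        return NULL;              /* a real kernel would fail with EBADF */
    return current_process->fd_table[fd];
}
```

The point of the design is that the base address of the table never crosses the user/kernel boundary: the only thing user space supplies is the small integer.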
I tried to see what happens under the hood by converting the following C code into x86-64 assembly, but all I got was an additional assembly open instruction.
What you got was a call to the fopen(3) library routine, not a system call.
To get under the hood, you need to start in the kernel source code, because a listing of your program's assembly code will only take you as far as one special instruction. The interface to the kernel is normally a special instruction that forces a software trap; you will see it as a single instruction, but you cannot trace any further. On 32-bit Linux/x86 that instruction is INT 0x80; 64-bit code such as yours normally uses SYSCALL instead.
In this case, you have disassembled code that calls fopen(3), which is a standard library function, not a system call. That is not the special instruction mentioned above, but a normal subroutine call. Had you called open(2) (the actual system call that fopen(3) ends up calling), you would still see a similar call open instruction, because all system calls are wrapped in C functions. The wrapper does some housekeeping to make the parameters available to the kernel, executes the trap instruction (on 32-bit Intel processors an INT 0x80, which goes through a trap gate that raises the processor's privilege level to ring 0, among other things), and processes the result coming back from the kernel on return (this is also the point at which a pending signal handler may get called). But what happens in the kernel is hidden from you, because it is not accessible to the running process. To the process, a system call looks like the execution of a single machine instruction: just as you cannot observe the CPU state at each stage inside a single instruction, you cannot observe what happened between executing the trap instruction and the next instruction of your program.
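To make the wrapper layer visible from C, here is a minimal sketch, assuming Linux with glibc. It opens a file once through the ordinary open(2) wrapper and once through the generic syscall(2) entry point, which takes the system call number explicitly. The function names are hypothetical, and `SYS_openat` is used instead of `SYS_open` only because not every architecture defines the latter.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open read-only through the libc wrapper; returns a descriptor or -1. */
static int open_via_wrapper(const char *path)
{
    return open(path, O_RDONLY);
}

/* Same request, but naming the system call number ourselves. syscall(2)
 * loads the number and arguments into registers and executes the trap. */
static int open_via_raw_syscall(const char *path)
{
    return (int)syscall(SYS_openat, AT_FDCWD, path, O_RDONLY);
}
```

Both functions hand back a small integer and nothing else; stepping into either one with a debugger ends at the trap instruction, which is as far as user space can see.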

Related

Linux kernel rootkit: read data from memory and use the retrieved data as an offset

While developing my BSc thesis (a rootkit for the 5.4 Linux kernel), I found myself having to identify a function's address in memory (namely, the address of do_syscall_64()). I don't know it in advance because of KASLR.
What I'm doing is:
retrieve the system call handler via MSRs;
scan the memory location starting from the base address of entry_SYSCALL_64, which is the system call handler's code block, until I find the actual call to do_syscall_64();
isolate 4 bytes after the opcode (i.e., e8), that is the offset to which the execution flow will jump after the call:
e8 c4 bd f8 ff call 0xffffffff81b8be40 <do_syscall_64>
So, what should I do with the hex offset retrieved?
I found out that the address specified after this call instruction is an offset from the base code segment.
Do I need to convert the offset into decimal and add it to the base code segment address?
Thanks in advance.
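For reference, the four bytes after an e8 opcode are a signed 32-bit little-endian displacement, and the processor adds it to the address of the next instruction (the address of the call plus 5), not to a segment base. A minimal sketch of that arithmetic follows. The function name is hypothetical, and the instruction address used in the usage note is not from the question: it is simply the value that makes the quoted bytes land on the quoted target.

```c
#include <stdint.h>

/* Returns the absolute target of a 5-byte "call rel32" (opcode 0xE8),
 * or 0 if the bytes do not start with that opcode. */
static uint64_t call_rel32_target(const uint8_t insn[5], uint64_t insn_addr)
{
    if (insn[0] != 0xE8)
        return 0;

    /* Assemble the displacement byte by byte so host endianness is irrelevant. */
    uint32_t raw = (uint32_t)insn[1]
                 | (uint32_t)insn[2] << 8
                 | (uint32_t)insn[3] << 16
                 | (uint32_t)insn[4] << 24;

    int64_t rel = (int32_t)raw;           /* sign-extend to 64 bits */
    return insn_addr + 5 + (uint64_t)rel; /* relative to the NEXT instruction */
}
```

With the bytes e8 c4 bd f8 ff the displacement is 0xfff8bdc4, that is, -0x7423c. If the call sat at 0xffffffff81c00077 (an assumed address), the result would be 0xffffffff81b8be40. No decimal conversion is involved; it is ordinary two's-complement addition.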

How are parameters passed to Linux system call ? Via register or stack?

I am trying to understand the internals of the Linux kernel by reading Robert Love's Linux Kernel Development.
On page 74 he says the easiest way to pass arguments to a syscall is via registers:
Somehow, user-space must relay the parameters to the kernel during the
trap. The easiest way to do this is via the same means that the syscall
number is passed: The parameters are stored in registers. On x86-32,
the registers ebx, ecx, edx, esi, and edi contain, in order, the first
five arguments.
Now this is bothering me for a number of reasons:
All syscalls are defined with the asmlinkage option, which implies that the arguments are always to be found on the stack and not in registers. So what is all this business with the registers?
It may be possible that before the syscall is performed the values are copied onto the kernel stack. I have no idea why that would be efficient, but it might be a possibility.
(This answer is for 32-bit x86 Linux to match your question; things are slightly different for 64-bit x86 and other architectures.)
The parameters are passed from userspace in registers as Love says.
When userspace invokes a system call with int $0x80, the kernel syscall entry code gets control. This is written in assembly language and can be seen here, for instance. One of the things this code does is to take the parameters from the registers and push them onto the stack, and then call the appropriate kernel sys_XXX() function (which is written in C). So those functions do indeed expect their arguments on the stack.
It wouldn't work as well to try to pass parameters from userspace to the kernel on the stack. When the system call is made, the CPU switches to a separate kernel stack, so the parameters would have to be copied from the userspace stack to the kernel stack, and this is somewhat complicated. And it would have to be done even for very simple system calls that just take a few numeric arguments and wouldn't otherwise need to access userspace memory at all (think about close() for instance).
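The hand-off described in this answer can be modelled in ordinary C. The sketch below is a toy, not the kernel's entry code: `struct saved_regs`, `toy_table`, `toy_dispatch` and the two handlers are hypothetical names. It only shows the shape of the idea: user space fills registers, the entry code picks a handler by the number in eax and passes the other registers on as ordinary C arguments.

```c
#include <stddef.h>

/* Register values as user space left them at the trap (32-bit x86 naming). */
struct saved_regs {
    long eax;                       /* system call number */
    long ebx, ecx, edx, esi, edi;   /* first five arguments, in order */
};

typedef long (*syscall_fn)(long, long, long, long, long);

/* Two made-up handlers standing in for the kernel's sys_XXX() functions. */
static long toy_sum2(long a, long b, long c, long d, long e)
{
    (void)c; (void)d; (void)e;
    return a + b;
}

static long toy_negate(long a, long b, long c, long d, long e)
{
    (void)b; (void)c; (void)d; (void)e;
    return -a;
}

static const syscall_fn toy_table[] = { toy_sum2, toy_negate };

/* Plays the role of the assembly entry code: turn registers into C arguments. */
static long toy_dispatch(const struct saved_regs *r)
{
    size_t count = sizeof toy_table / sizeof toy_table[0];

    if (r->eax < 0 || (size_t)r->eax >= count)
        return -1;                  /* a real kernel returns -ENOSYS here */
    return toy_table[r->eax](r->ebx, r->ecx, r->edx, r->esi, r->edi);
}
```

In the real 32-bit kernel the "turn registers into arguments" step is the push sequence in the assembly entry code, which is why the C handlers can be declared asmlinkage and still receive values that arrived in registers.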

Hooking sys_execve on Linux kernel 4.6 or higher

Kernels older than 4.6 use assembly stubs that make critical system calls such as fork, clone, and exec harder to hook. For execve in particular, the following snippet from kernel 4.5 shows the entry stub:
ENTRY(stub_execve)
call sys_execve
return_from_execve:
...
END(stub_execve)
The system call table contains this stub's address, and the stub in turn calls the original execve. So, to hook execve in this environment, we need to patch the call sys_execve in the stub with our hooking routine and, after doing whatever we want, call the original execve. This can all be seen in action in execmon, a process execution monitoring utility for Linux. I tested execmon successfully on Ubuntu 16.04 with kernel 4.4.
Starting with kernel 4.6, the above protection scheme for critical calls was changed. Now the stub looks like:
ENTRY(ptregs_\func)
leaq \func(%rip), %rax
jmp stub_ptregs_64
END(ptregs_\func)
where \func expands to sys_execve for execve calls. Again, the system call table contains this stub and the stub calls the original execve, but now in a more secure manner than a plain call sys_execve.
This newer stub stores the called function's address in the RAX register and jumps to another stub, shown below (comments removed):
ENTRY(stub_ptregs_64)
cmpq $.Lentry_SYSCALL_64_after_fastpath_call, (%rsp)
jne 1f
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
popq %rax
jmp entry_SYSCALL64_slow_path
1:
jmp *%rax /* called from C */
END(stub_ptregs_64)
Please have a look at this to see the comments and other referenced labels in this stub.
I have tried hard to come up with some logic to overcome this protection and patch the original calls with hooking functions, but with no success yet.
Would someone like to join me and help work this out?
I don't understand at all where you get the security angle from.
Neither the previous nor the current form of the func is "hardened".
You also never stated why you want to hook execve.
The standard hooking mechanism is kprobes, and you can check systemtap for an example consumer.
I had a look at the aforementioned 'execmon' code and I find it to be of poor quality and not fit for learning. For instance, https://github.com/kfiros/execmon/blob/master/kmod/syscalls.c#L65
accesses userspace memory directly (no get_user, copy_from_user, etc.)
does it twice: first it computes the lengths (unbounded!) and then it copies the data in. In particular, if someone makes the strings longer after the computation but before they are copied, this triggers a buffer overflow.
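The second point is a double-fetch problem: the length is measured in one pass and the data copied in another, so the two can disagree. A user-space sketch of the usual remedy, a single bounded pass, is shown below. The function name is hypothetical and this is only an analogy; inside the kernel one would use a helper such as strncpy_from_user rather than touching user memory directly.

```c
#include <stddef.h>

/* Copies a string from src into dst (capacity cap) in ONE pass.
 * Each source byte is read exactly once, so a concurrent writer cannot make
 * the copy longer than the buffer. dst is always NUL-terminated.
 * Returns the string length, or -1 if it did not fit (dst holds a truncated
 * copy) or cap is 0. */
static long copy_string_bounded(char *dst, size_t cap, const volatile char *src)
{
    if (cap == 0)
        return -1;

    for (size_t i = 0; i < cap; i++) {
        char c = src[i];        /* single read of this byte */
        dst[i] = c;
        if (c == '\0')
            return (long)i;     /* terminator copied, done */
    }

    dst[cap - 1] = '\0';        /* source was too long: truncate safely */
    return -1;
}
```

Because the bound comes from the destination buffer rather than from an earlier measurement of the source, there is no window in which a changed source can overflow it.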

Execution Flow of Child and Parent

After reading on the web that I can't really determine which process runs first, the child or the parent, I planned to disable ASLR on my PC and run a debugger to see whether I could find a pattern of execution. My observations are below, along with a Gist of the full GDB disas and the source code.
#include<stdio.h>
#include<sys/types.h>
#include<unistd.h>
int main()
{
    fork();
    //fork();
    printf("LINUX\n");
    //printf("my pid is %d",(int) getpid());
    fork();
    printf("REDHAT\n");
    //printf("my pid is %d",(int) getpid());
    //fork();
    return 0;
}
This is the code I am talking about. When I disas it in gdb, it gave me:
gdb-peda$ disas main
Dump of assembler code for function main:
0x000000000000068a <+0>: push rbp
0x000000000000068b <+1>: mov rbp,rsp
0x000000000000068e <+4>: call 0x560 <fork@plt>
0x0000000000000693 <+9>: lea rdi,[rip+0xaa] # 0x744
0x000000000000069a <+16>: call 0x550 <puts@plt>
0x000000000000069f <+21>: call 0x560 <fork@plt>
0x00000000000006a4 <+26>: lea rdi,[rip+0x9f] # 0x74a
0x00000000000006ab <+33>: call 0x550 <puts@plt>
0x00000000000006b0 <+38>: mov eax,0x0
0x00000000000006b5 <+43>: pop rbp
0x00000000000006b6 <+44>: ret
End of assembler dump.
So basically it gives me a set pattern of execution, so I think that should mean the program always executes in a particular order. I tried disas main about 3 times to see if the order actually ever changes, and it does not. But when I finally run the generated binary, it gives me different outputs:
root@localhost:~/os/fork analysis# ./forkagain
LINUX
REDHAT
LINUX
REDHAT
REDHAT
REDHAT
root@localhost:~/os/fork analysis# ./forkagain
LINUX
LINUX
REDHAT
REDHAT
REDHAT
REDHAT
which is inconsistent with the observation I made in the disas. Can someone please fill in the gaps in my understanding?
Fork Analysis Full
I tried disas main about 3 times to see if the order actually ever changes
The order is fixed at compile time, so it can never change without you recompiling the program.
In addition, the order is fixed by your program source -- the compiler is not allowed to re-order your output.
What you observe then is the indeterminism introduced by the OS as a result of calling fork -- after the fork there are no guarantees which process will run first, or for how long. The parent may run to completion, then the child. Or the child may run to completion first. Or they may both run with time-slicing, say one line at a time.
In addition, most non-ancient Linux systems today are running on multi-processor machines, and the two independent processes can run simultaneously after the fork.
An additional complication is that your program is not well-defined, because of stdio buffering. While you see 6 lines of output, it might be hard for you to explain this result:
./forkagain | wc -l
8
./forkagain > junk.out; cat junk.out
LINUX
REDHAT
LINUX
REDHAT
LINUX
REDHAT
LINUX
REDHAT
You should add fflush(stdout); before fork to avoid this complication.
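A minimal sketch of that advice, assuming a POSIX system. `forks_with_flush` is the question's program with a flush before each fork (and after the last print); `count_piped_lines` is a hypothetical helper that reruns it with stdout redirected into a pipe, the situation in which the unflushed version produced 8 lines.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The question's program, but with stdio flushed before every fork so that
 * no buffered text is duplicated into the child. */
static void forks_with_flush(void)
{
    fflush(stdout);
    fork();
    printf("LINUX\n");
    fflush(stdout);
    fork();
    printf("REDHAT\n");
    fflush(stdout);
}

/* Runs forks_with_flush() in a child whose stdout is a pipe and returns the
 * number of lines that came out, or -1 on error. */
static int count_piped_lines(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    fflush(stdout);                 /* do not leak our own buffer either */
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        dup2(fds[1], STDOUT_FILENO);
        close(fds[0]);
        close(fds[1]);
        forks_with_flush();
        _exit(0);                   /* reached by all four descendants */
    }

    close(fds[1]);                  /* so read() sees EOF once they all exit */
    int lines = 0;
    char buf[256];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] == '\n')
                lines++;
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return lines;
}
```

With the flushes in place the piped run yields 6 lines, the same count as on a terminal: two processes print LINUX and four print REDHAT.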
P.S. You should also un-learn the bad habit of running as root -- sooner or later you'll make a stupid mistake (like typing rm -rf * in the wrong directory), and will be really sorry you did it as root.
Each process executes in a perfectly defined order. The trick here is that there is no guarantee that each process will finish within one tick (the slice of time for which the process occupies the execution unit), and there is no guarantee that two processes forked from the same process will get their ticks in the order of forking.
If we assume that the printing of A (LINUX) and B (REDHAT) is the benchmark then you can get any sequence of As and Bs given that:
the sequence begins with A
there are a total of two As and four Bs
there are two Bs after each A
AABBBB
ABABBB
ABBABB
are all possible outputs on a preemptive multitasking OS.
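The three sequences can be checked by brute force. The sketch below reads the rules above as: every B is printed by a process that has already printed (or inherited) an A, and each A accounts for two Bs, so no prefix of the output may contain more than twice as many Bs as As. That reading, the bit encoding, and the function names are assumptions of this sketch; it also assumes every line reaches the terminal as soon as it is printed.

```c
/* A 6-character output is encoded in the low 6 bits of pattern:
 * bit i set means 'A' at position i, clear means 'B'.
 * Returns 1 if the sequence is a possible output under the rule above. */
static int is_possible(unsigned pattern)
{
    int a = 0, b = 0;

    for (int i = 0; i < 6; i++) {
        if (pattern & (1u << i)) {
            a++;
        } else {
            b++;
            if (b > 2 * a)      /* a B appeared before the A it depends on */
                return 0;
        }
    }
    return a == 2 && b == 4;
}

/* Counts the possible outputs among all 64 six-character A/B strings. */
static int count_possible(void)
{
    int n = 0;

    for (unsigned p = 0; p < 64; p++)
        n += is_possible(p);
    return n;
}
```

`count_possible()` returns 3, matching the list: the only freedom is whether the second A appears after zero, one, or two Bs.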
P.S. And this answer is not complete without what Employed says.

How does a C library call kernel system calls

I know that in Unix-like systems C libraries such as glibc act as an intermediary between the kernel and userland. So, for example, when implementing malloc(), how does glibc invoke system calls of the Linux kernel? Does it use assembly?
On 32-bit Linux x86, syscalls (system calls) are made by raising interrupt 0x80. In assembly it is done with:
int $0x80
The syscall is selected by placing its number in a CPU register. malloc itself is not a syscall, but malloc implementations usually rely on sbrk or mmap (sbrk is itself built on the brk syscall).
For more information on Linux x86 syscalls, you can read this document.
EDIT: as mentioned by Jester in the comments, Intel x86 processors (Pentium II and later) support the sysenter/sysexit instructions, which don't have the overhead of int, and on these processors Linux uses them for syscalls.
Example of calling the exit(0) syscall on the x86 architecture:
movl $1, %eax   # 1 = number of the exit syscall
movl $0, %ebx   # 0 = first argument of the exit syscall
int $0x80       # raise the software interrupt
Every syscall has been given a number that can be found in /usr/include/asm/unistd.h. In the above example the exit syscall has number 1.
Then you set the arguments required for the syscall. After that you raise the software interrupt with int $0x80.
So malloc internally calls brk or mmap: it sets the required arguments and then executes int $0x80.
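As a rough illustration of the mmap path, the sketch below asks the kernel for anonymous memory directly, which is the kind of request a malloc implementation may make for large blocks. It assumes Linux or a similar POSIX system; the function names are hypothetical, and a real allocator adds bookkeeping, reuse of freed blocks, and a brk-based path on top of this.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Asks the kernel for len bytes of zero-filled, private, anonymous memory.
 * Returns NULL on failure. */
static void *grab_pages(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hands the memory back to the kernel; returns 0 on success. */
static int release_pages(void *p, size_t len)
{
    return munmap(p, len);
}
```

The `mmap` and `munmap` functions used here are themselves thin libc wrappers of the kind described above: they load the syscall number and arguments into registers and execute the trap instruction.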
