I am designing a custom scheduler for Linux 6.0.19. I am a beginner in Linux kernel development, and my attempt at implementing it has run into a difficulty I cannot figure out.
I want to know how the existing Linux scheduling classes deal with dead tasks. The issue I am facing right now is that when a process terminates, the kernel crashes with a null pointer dereference.
Consider the following situation: suppose a process is currently on my scheduler's runqueue substructure within the actual runqueue, and then this process terminates. What happens after this?
Does some code set p->__state to TASK_DEAD, after which the process is dequeued by a call to dequeue_task? Or, based on p->__state, is the necessary dequeue supposed to happen inside pick_next_task_my_sched or put_prev_task_my_sched? [If some other part of the kernel does the dequeue, then pick_next_task_my_sched need not worry about picking up a dead task; on_rq would already be set to 0, and put_prev_task_my_sched must not put the task back on the queue.]
In kernel/sched/core.c I found the following comment on the function finish_task_switch:
A task struct has one reference for the use as "current". If a task dies, then it sets TASK_DEAD in tsk->state and calls schedule one last time. The schedule call will never return, and the scheduled task must drop that reference.
I am unable to figure out the code flow that accomplishes this. Can anyone point me to the code that does this part? "Drop that reference": is the task dequeued? When I tried to trace the code flow, I only saw certain fields of the CFS-related data structures being reset; I did not find an explicit call to dequeue_task.
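For reference, the part that sets TASK_DEAD and schedules one last time appears to be do_task_dead() in kernel/exit.c, which (lightly paraphrased from 6.0, not verbatim) looks like:

void __noreturn do_task_dead(void)
{
        /* Causes the final put_task_struct() in finish_task_switch(): */
        set_special_state(TASK_DEAD);

        /* Tell the freezer to ignore us: */
        current->flags |= PF_NOFREEZE;

        __schedule(SM_NONE);    /* schedule one last time; never returns */
        BUG();
}

But even starting from here, I could not trace where the dequeue actually happens.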
I have recently become interested in fibers on Windows, but I am having a hard time using them. The documentation provides function definitions and some examples, but some things are still not clear to me. I see that CreateFiber is declared as:
LPVOID CreateFiber(
  SIZE_T                dwStackSize,
  LPFIBER_START_ROUTINE lpStartAddress,
  LPVOID                lpParameter
);
So, we specify the stack size, the function for the fiber and possibly a parameter for the function. Now, my questions are:
1) Once a fiber is created, I assume execution of the provided function doesn't start immediately, right? I believe one needs to call ConvertThreadToFiber first, but is there anything else that needs to be done? I mean, in the simplest case, what do defining, initiating, running and deleting a simple fiber look like?
2) Is it possible to check whether we are actually in a fiber, i.e., whether a fiber is executing inside some other part of the app? If yes, how?
3) Is it possible to get the memory location of a fiber's stack, and the actual contents of that stack, at any moment we wish? If yes, how?
(Disclaimer: I've only written a few test programs that use fibers in order to verify that they were working properly while running under a performance profiler that I was working on at the time.)
1) As you say, a fiber does not run by itself. It only runs when another thread explicitly switches to it by calling SwitchToFiber. Execution then continues on that fiber until it calls SwitchToFiber and switches back to the original thread or another fiber.
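In the simplest case, the whole lifecycle looks something like this (a minimal sketch, error handling omitted):

#include <windows.h>
#include <stdio.h>

LPVOID g_mainFiber;  /* the fiber the main thread gets converted into */

VOID CALLBACK FiberProc(LPVOID lpParameter)
{
    printf("running in the fiber: %s\n", (const char *)lpParameter);
    SwitchToFiber(g_mainFiber);  /* must switch back; if FiberProc returns, the thread exits */
}

int main(void)
{
    g_mainFiber = ConvertThreadToFiber(NULL);           /* the calling thread must become a fiber first */
    LPVOID fiber = CreateFiber(0, FiberProc, "hello");  /* 0 = default stack size */
    SwitchToFiber(fiber);                               /* runs FiberProc until it switches back */
    DeleteFiber(fiber);
    ConvertFiberToThread();                             /* undo ConvertThreadToFiber */
    return 0;
}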
2) It's unclear to me what you are asking here. If the fiber is the only one calling a particular function, it can set some variable or call a function and you'll know it was there. If multiple fibers are calling the same function, they could each record some identifying value and you'd be able to infer which fiber called the function. What's the use case here?
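That said, if the question is "which fiber is this code currently running on", the API can answer it directly: GetCurrentFiber() returns the address of the fiber the current thread is running, and IsThreadAFiber() (Vista and later) tells you whether the thread has been converted to a fiber at all. A tiny sketch, assuming fiber is a handle returned by CreateFiber as in the example above:

if (IsThreadAFiber() && GetCurrentFiber() == fiber)
    printf("this code is executing on 'fiber'\n");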
3) If the fiber is executing, it has access to its stack/registers in the normal way. I am not aware of a way to arbitrarily access the stack of a fiber that isn't currently scheduled to run on a thread, but I suppose you could record the address of the stack from within the fiber itself.
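One sketch along those lines: while a fiber is running, the thread information block describes that fiber's stack, so you can record the bounds from inside the fiber (this relies on the documented NT_TIB layout):

#include <windows.h>
#include <stdio.h>

/* Call this from inside the fiber; while a fiber is running, the TIB's
   StackBase/StackLimit fields describe that fiber's stack. */
static void record_stack_bounds(void)
{
    NT_TIB *tib = (NT_TIB *)NtCurrentTeb();
    printf("stack base=%p (high), limit=%p (low, committed)\n",
           tib->StackBase, tib->StackLimit);
}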
For what it's worth, I don't think the fiber support in the Windows API is used much.
Task
I have a small kernel module I wrote for my Raspberry Pi 2 which implements an additional system call for generating power consumption metrics. I would like to modify the system call so that it only gets invoked if a special user (such as "root" or user "pi") issues it. Otherwise, the call just skips the bulk of its body and returns success.
Background Work
I've read into the issue at length, and I've found a similar question on SO, but there are numerous problems with it, from my perspective (noted below).
Question
The linked question notes that struct task_struct contains a pointer element to struct cred, as defined in linux/sched.h and linux/cred.h. The latter of the two headers doesn't exist on my system(s), and the former doesn't show any declaration of a pointer to a struct cred element. Does this make sense?
Silly mistake: this is present in its entirety in the kernel headers (i.e., /usr/src/linux-headers-$(uname -r)/include/linux/cred.h); I was searching the gcc-build headers in /usr/include/linux.
Even if the above worked, it doesn't say whether I would be getting the real, effective, or saved UID for the process. Is it even possible to get each of these three values from within the system call?
cred.h already contains all of these.
Is there a safe way in the kernel module to quickly determine which groups the user belongs to without parsing /etc/group?
cred.h already contains all of these.
Update
So, the only valid question remaining is the following:
Note that iterating through processes and reading a process's credentials should be done under an RCU critical section.
... how do I ensure my check runs in this critical section? Are there any working examples of how to accomplish this? I've found some existing kernel documentation that instructs readers to wrap the relevant code with rcu_read_lock() and rcu_read_unlock(). Do I just need to wrap any read operations against the struct cred and/or struct task_struct data structures?
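In other words, is a sketch like the following (using __task_cred() from linux/cred.h to read another task's real UID) the right pattern?

#include <linux/cred.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>

/* Read another task's real UID inside an RCU read-side critical section. */
static kuid_t read_task_uid(struct task_struct *task)
{
        kuid_t uid;

        rcu_read_lock();
        uid = __task_cred(task)->uid;
        rcu_read_unlock();

        return uid;
}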
First, adding a new system call is rarely the right way to do things. It's best to do things via the existing mechanisms because you'll benefit from already-existing tools on both sides: existing utility functions in the kernel, existing libc and high-level language support in userland. Files are a central concept in Linux (like other Unix systems) and most data is exchanged via files, either device files or special filesystems such as proc and sysfs.
I would like to modify the system call so that it only gets invoked if a special user (such as "root" or user "pi") issues it.
You can't do this in the kernel. Not only is it wrong from a design point of view, but it isn't even possible. The kernel knows nothing about user names. The only knowledge about users in the kernel is that some privileged actions are reserved to user 0 in the root namespace (don't forget that last part! And if that's new to you, it's a sign that you shouldn't be doing advanced things like adding system calls). (Many actions actually check for a capability rather than for being root.)
What you want to use is sysfs. Read the kernel documentation and look for non-ancient online tutorials or existing kernel code (code that uses sysfs is typically pretty clean nowadays). With sysfs, you expose information through files under /sys. Access control is up to userland — have a sane default in the kernel and do things like calling chgrp, chmod or setfacl in the boot scripts. That's one of the many wheels that you don't need to reinvent on the user side when using the existing mechanisms.
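For instance, a minimal sysfs sketch (all names here are hypothetical, and the placeholder value stands in for a real metric) might look like:

#include <linux/init.h>
#include <linux/kobject.h>
#include <linux/module.h>
#include <linux/sysfs.h>

/* Expose a read-only file at /sys/kernel/power_metrics/metrics. */
static ssize_t metrics_show(struct kobject *kobj,
                            struct kobj_attribute *attr, char *buf)
{
        /* placeholder value; use scnprintf(buf, PAGE_SIZE, ...) on older kernels */
        return sysfs_emit(buf, "%d\n", 42);
}

static struct kobj_attribute metrics_attr = __ATTR_RO(metrics);
static struct kobject *metrics_kobj;

static int __init metrics_init(void)
{
        int ret;

        metrics_kobj = kobject_create_and_add("power_metrics", kernel_kobj);
        if (!metrics_kobj)
                return -ENOMEM;

        ret = sysfs_create_file(metrics_kobj, &metrics_attr.attr);
        if (ret)
                kobject_put(metrics_kobj);
        return ret;
}

static void __exit metrics_exit(void)
{
        kobject_put(metrics_kobj);
}

module_init(metrics_init);
module_exit(metrics_exit);
MODULE_LICENSE("GPL");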
The sysfs show method automatically takes a lock around the file, so only one kernel thread can be executing it at a time. That's one of the many wheels that you don't need to reinvent on the kernel side when using the existing mechanisms.
The linked question concerns a fundamentally different issue. To quote:
Please note that the uid that I want to get is NOT of the current process.
Clearly, a thread which is not the currently executing thread can in principle exit at any point or change credentials. Measures need to be taken to ensure the stability of whatever we are fiddling with. RCU is often the right answer. The answer provided there is somewhat misleading in that there are other ways as well.
Meanwhile, if you want to operate on the thread executing the very code, you know it won't exit (because it is executing your code, as opposed to an exit path). A question arises about the stability of its credentials -- good news: they are also guaranteed to be there and can be accessed with no preparation whatsoever. This is easily verified by reading the code that performs credential switching.
We are left with the question of which primitives can be used to do the access. For that, one can use make_kuid, uid_eq and similar primitives.
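For example (a sketch applying those primitives to the calling task's credentials; the uid 1000 is just an illustration):

#include <linux/cred.h>
#include <linux/uidgid.h>

/* Does the caller run as root, or as the example uid 1000? */
static bool caller_is_privileged(void)
{
        kuid_t pi_uid = make_kuid(current_user_ns(), 1000);

        return uid_eq(current_euid(), GLOBAL_ROOT_UID) ||
               uid_eq(current_euid(), pi_uid);
}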
The real question is why this is a syscall as opposed to just a /proc file.
See this blog post for a somewhat more elaborate description of credential handling: http://codingtragedy.blogspot.com/2015/04/weird-stuff-thread-credentials-in-linux.html
I want to alter the Linux kernel so that every time the current PID changes - i.e., a new process is switched in - some diagnostic code is executed (detailed explanation below, if curious). I did some digging around, and it seems that every time the scheduler chooses a new process, the function context_switch() is called, which makes sense (this is just from a cursory analysis of sched.c/schedule()).
The problem is, the Linux scheduler is basically black magic to me right now, so I'd like to know if that assumption is correct. Is it guaranteed that, every time a new process is selected to get some time on the CPU, the context_switch() function is called? Or are there other places in the kernel source where scheduling could be handled in other situations? (Or am I totally misunderstanding all this?)
To give some context, I'm working with the MARSS x86 simulator trying to do some instrumentation and measurement of certain programs. The problem is that my instrumentation needs to know which executing process certain code events correspond to, in order to avoid misinterpreting the data. The idea is to use some built-in message passing systems in MARSS to pass the PID of the new process on every context switch, so it always knows what PID is currently in execution. If anyone can think of a simpler way to accomplish that, that would also be greatly appreciated.
Yes, you are correct.
schedule() calls context_switch(), which is responsible for switching from one task to another once the new process has been selected by schedule().
context_switch() basically does two things. It calls switch_mm() and switch_to().
switch_mm() - switch to the virtual memory mapping for the new process
switch_to() - switch the processor state from the previous process to the new process (save/restore registers, stack info and other architecture specific things)
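If you just need the PID on every switch, a diagnostic hook (entirely hypothetical, not existing kernel code) called from context_switch() could be as simple as:

/* To be called from context_switch() in kernel/sched/core.c, just before
   switch_to(). trace_printk() is the usual low-overhead way to log from
   scheduler context, where printk() is too heavy. */
static inline void my_pid_probe(struct task_struct *prev,
                                struct task_struct *next)
{
        trace_printk("context switch: %d -> %d\n", prev->pid, next->pid);
}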
As for your approach, I guess it's fine. It's important to keep things nice and clean when working with the kernel, and to keep it relatively simple until you gain more knowledge.
I need some help. I have a project to build an alternative scheduler for FreeRTOS, with a different algorithm, and to try to replace the existing one in the OS.
My questions are:
Is it feasible in a reasonable amount of time (about a few months)?
How do I identify the scheduler code within the OS source as a whole?
Given that FreeRTOS is only a few thousand lines of code, it is certainly possible within a few months. If you know how to write a scheduler, of course.
However, FreeRTOS doesn't even have a real scheduler. It maintains a list of runnable tasks, and at every scheduling point (return from interrupt or explicit yield), it takes the highest priority task from that list.
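The selection happens in vTaskSwitchContext() via the taskSELECT_HIGHEST_PRIORITY_TASK() macro in tasks.c; conceptually (a paraphrase, not the verbatim source) it amounts to:

/* Scan the ready lists from the highest priority down; round-robin
   among tasks of equal priority via the list's "next entry" cursor. */
int prio;
for (prio = configMAX_PRIORITIES - 1; prio >= 0; prio--)
{
    if (!listLIST_IS_EMPTY(&pxReadyTasksLists[prio]))
    {
        listGET_OWNER_OF_NEXT_ENTRY(pxCurrentTCB, &pxReadyTasksLists[prio]);
        break;
    }
}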
To add more answers to question 2:
Task control is in tasks.c; portable/port.c contains the context switch code.
Have a look at the source organization doc; a given function name gives away which file it's defined in. There really aren't too many places it can be, either. Use grep :)
I am designing a file system in user space and need to test it. I do not want to use the available benchmarking tools, as my requirements are different. So, to test the file system, I wish to simulate file access operations. To do this, I first use the ftw() function to walk through one of my existing (experimental) file systems and list all the files and directories in a file.
Then I invoke a simulator to simulate file access by a number of processes. The simulator randomly starts a process, i.e., it spawns a thread which does what a real process would have done. The thread randomly selects a file operation (read, write, rename, etc.) and selects arguments for that operation from the list generated by ftw(). The thread performs a number of such file operations and then exits, marking the end of a process. The simulator continues to spawn threads; thread executions can overlap, just as real processes do. As operations are performed by the threads, files get created, deleted and renamed, and the list of files is updated accordingly.
I have not yet started coding. Does the plan seem sane? I am also not sure how to code the simulator: how will it spawn threads over a period of time? Should I be using some random delay to do this?
Thanks
Yep, that seems fairly reasonable to me. I would consider attempting to impose a statistical distribution over your file operations (and accesses to particular files) that is somehow matched to your expected workload. You might be able to find some statistics about typical filesystem workloads as a starting point.
That sounds about right for a decent test case, just to make sure it's working. You could use sleep() to wait between spawning threads, or just spawn them all at once and have each do an operation, wait a bit, do another operation, and so on. IMO, if you hit it hard with a lot of requests and it works, there's a good chance your filesystem will do just fine. Take an example from PostMark, which does little more than append like crazy to different files, and from other benchmarks that do random reads/writes at different locations to make sure the page has to be read from disk.
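As a concrete starting point, the spawn loop could look like this minimal sketch (POSIX threads; the names and numbers are made up, and the random delay models process arrival times):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NPROCS 100

/* Each thread stands in for one "process": it would pick random file
   operations and arguments from the ftw()-generated list. */
static void *simulated_process(void *arg)
{
    /* ... perform a random sequence of file operations ... */
    return NULL;
}

int main(void)
{
    pthread_t tids[NPROCS];
    srand(42);  /* fixed seed makes runs reproducible */

    for (int i = 0; i < NPROCS; i++) {
        pthread_create(&tids[i], NULL, simulated_process, NULL);
        usleep((rand() % 500) * 1000);  /* 0-499 ms between "process" arrivals */
    }
    for (int i = 0; i < NPROCS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}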