How is the time slice divided among the pthreads in a process?

Is the Linux kernel aware of pthreads in the user address space (which I don't think it is, but I did not find any info about that)? How does the instruction pointer change when a thread switch takes place?

The native NPTL (Native POSIX Thread Library) used in Linux maps pthreads to "processes that share resources and therefore look like threads" in the kernel. In this way, the kernel's scheduler directly controls the scheduling of pthreads.
A "pthread switch" is done by the exact same code (in the kernel) that handles process switches. Simplified, this would be something like "store previous process state; if the next process uses a different virtual address space then switch virtual address spaces; load next process state;" (where "process state" includes the instruction pointer for the process/thread).

Well, the Linux kernel doesn't know about user threads as such (pthreads are managed in userspace; the kernel doesn't really care about them beyond knowing what to schedule).
The instruction pointer is changed in the kernel during what's called a context switch. During this switch the kernel essentially asks the scheduler "what's next?"; the scheduler hands it a task_struct, which contains all the information about the thread, and the interrupt handler for the context switch sets the values on the CPU accordingly (page tables, instruction pointer, etc.). When that code is done, the CPU simply starts executing from there.

1) The kernel doesn't know about user-level threads. However, NPTL isn't user-level.
2) This is a really broad question. You should look at an OS textbook; it will go into depth on that issue and everything else involved in a context switch.

Related

Multi-threading on ARM cortex A9 dual core (Linux or VxWorks)

I am working on how a dual core (especially in embedded systems) could be beneficial. I would like to compare two targets: one with a dual-core ARM Cortex-A9 (925 MHz), and the other with a single-core ARM Cortex-A8.
I have some ideas (please see below), but I am not sure I will use the dual-core features.
My questions are:
1- How to execute several threads on different cores (without OpenMP, because it didn't work on my target and it isn't compatible with VxWorks)?
2- How does the kernel execute code on a dual core with shared memory: how does it allocate the stack, the heap, and memory for global and static variables?
3- Is it possible to add C flags to indicate the number of CPU cores, so that we can use the dual-core features?
4- How does the kernel handle program execution (with a lot of threads) on a dual core?
Some tests to compare the two architectures regarding the OS and dual core / single core:
Dual core vs. single core:
Create three threads that execute some routines and depend on each other's results (like a matrix multiplication). Afterwards, measure the time taken on the dual core and then on the single core (how is this possible without OpenMP?).
Ping pong:
One process sends a message to the other; the two processes repeatedly pass a message back and forth. We should investigate how the time taken varies with the size of the message on each architecture.
One to all:
A process with rank 0 sends the same message to all other processes in the program and then receives a message of the same length from all other processes. How does the time taken vary with the size of the messages and with the number of processes?
Short answers wrt. Linux only:
how to execute several threads on different cores
Use multiple processes, or pthreads within a single process.
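For example, a minimal pthread sketch (the kernel is then free to run the two threads on different cores):

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    printf("worker %ld running\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    /* Two runnable threads; on a dual core they can execute in parallel. */
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Compile with gcc -pthread.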
How does the kernel execute code on a dual core with shared memory: how does it allocate the stack, the heap, and memory for global and static variables?
In Linux, they all belong to the process. (You can declare thread-local variables, though, with for example the __thread keyword.)
In other words, if a thread allocates some memory, that memory is immediately visible to all other threads in the same process. Even if that thread exits, nothing happens to the memory. It is perfectly normal for some other thread to free the memory later on.
Each thread does get its own stack, which by default is quite large. (With pthreads, this is easy to control using a pthread_attr_t that specifies a smaller stack size.)
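For instance (a fragment extending the sketch above; the 64 KiB figure is an arbitrary illustration, and PTHREAD_STACK_MIN from <limits.h> is the lower bound):

pthread_attr_t attr;
pthread_t tid;

pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 64 * 1024);  /* 64 KiB instead of the default */
pthread_create(&tid, &attr, worker, NULL);    /* worker as in the sketch above */
pthread_attr_destroy(&attr);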
In general, there is no difference in memory handling or program execution, between a single-threaded and a multi-threaded process, on the kernel side. (Userspace code needs to use memory and threads properly, of course; the kernel does not try to stop stupid userspace code from shooting itself in the head, at all.)
Is it possible to add C flags to indicate the number of CPU cores, so that we can use the dual-core features?
Yes, for example by examining the /proc/cpuinfo pseudofile.
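For instance (a minimal sketch; sysconf(_SC_NPROCESSORS_ONLN) is the usual shortcut to parsing /proc/cpuinfo by hand):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Number of CPUs currently online; matches what /proc/cpuinfo lists. */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);

    printf("detected %ld online core(s)\n", cores);
    return 0;
}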
However, it is much more common to leave such details for the system administrator. Services like Apache etc. let the administrator configure the number, either in a configuration file, or in the command line. Even make supports a -j JOBS parameter, allowing multiple compilation tasks to run in parallel.
I very warmly recommend you forget about any detection magic, and instead let the user specify the number of threads used. (If you insist, you could use detection magic for the default, but then only if the user/admin has not given you any hints about the number of threads to be used.)
It is also possible to set the CPU mask for each thread, specifying the set of cores that thread may run on. Other than in benchmarks, where this is done more for repeatability than anything else, or in dedicated processes designed to hog all the resources on a machine, it is very rare.
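With pthreads on Linux, pinning looks roughly like this (a sketch; pthread_setaffinity_np() is the non-portable GNU extension, hence the _np suffix and the _GNU_SOURCE requirement):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to core 0; error handling omitted. */
static void pin_self_to_core0(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}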
How does the kernel handle program execution (with a lot of threads) on a dual core?
The exact same way as it handles a lot of simultaneous processes.
There are some controls (control group stuff, cpu masks) and maybe resource accounting that are handled differently between separate processes and threads within the same process, but in general terms, separate threads in a single process are executed the same way as separate processes are.

How to create a lightweight kernel thread?

When I create a kernel thread (kthread_run), it becomes a new process (I can see it using the top command). How can I create a lightweight kernel thread (like the ones we have in user space)?
If I am not wrong, kthread_create will eventually call fork(), which will call clone() with the appropriate configuration to create a new process / lightweight process. Is it possible to create a lightweight kernel thread using clone() or similar APIs? Thanks so much in advance.
Kernel threads are always listed in the process table, but this is merely a cosmetic issue. They share the same address space and page tables, so in this sense they are quite lightweight anyway (i.e. a context switch isn't very expensive).
If your 2×16 kernel threads mainly do the same thing, it might be worth evaluating whether the functionality can be moved into a separate kernel module that exposes an API to be used by all 16 kernel modules, doing the work in only one or two threads.
Lightweight threads in user space are just a group of processes (or tasks) that share the same address space and many other resources; a lightweight thread is also created faster than a normal process. Linux uses a 1:1 mapping model, that is, every thread in user space is implemented as a separate process in kernel space.
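To make the 1:1 model concrete, here is a sketch using the raw clone() wrapper with the classic LinuxThreads-style flag set (the stack size is an arbitrary illustration):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static int worker(void *arg)
{
    puts("child task sharing the parent's address space");
    return 0;
}

int main(void)
{
    size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);

    /* The CLONE_* flags make the new task share resources with its
       parent: that sharing is what makes it a "thread". */
    pid_t tid = clone(worker, stack + stack_size, /* stack grows downward */
                      CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                      NULL);

    waitpid(tid, NULL, 0);
    free(stack);
    return 0;
}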
In Linux, a kernel thread is a process that does not have a valid user space. Kernel threads are scheduled like normal processes, but never enter userland.
So the answer is: once you understand the meaning of "lightweight", you will see that there is no lighter-weight kernel thread at all; all kernel threads naturally share the same kernel address space.
Also, top is just a user program; whether something appears in top's output does not really reflect the nature of the underlying kernel implementation.

Set thread affinity in a Linux kernel module

As most C programmers know, libc provides a non-portable function for thread CPU affinity tuning (pthread_attr_setaffinity_np()). However, what I do not really know is how this can be done when implementing a kernel module. Any answer that mentions or points to some real examples would be rather helpful.
You should use kthreads, which stands for kernel threads. To create one on a specified CPU, invoke kthread_create_on_cpu(), defined in include/linux/kthread.h. The thread will be created in the sleeping state, so you should call wake_up_process() on it. That's all.
You can find one example of using kthreads in my answer to this question.
You can use the kthread_bind() function.
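Putting those two answers together, a minimal module-side sketch might look like this (the names are invented; error handling omitted):

#include <linux/kthread.h>
#include <linux/delay.h>

static struct task_struct *worker;

static int worker_fn(void *data)
{
        while (!kthread_should_stop()) {
                /* do the periodic work here */
                msleep(100);
        }
        return 0;
}

/* in the module init path: */
worker = kthread_create(worker_fn, NULL, "affine_worker");
kthread_bind(worker, 1);        /* pin to CPU 1 before the first wakeup */
wake_up_process(worker);

/* in the module exit path: */
kthread_stop(worker);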

Shared semaphore between user and kernel spaces

Short version
Is it possible to share a semaphore (or any other synchronization lock) between user space and kernel space? Named POSIX semaphores have kernel persistence, which is why I was wondering if it is possible to also create and/or access them from kernel context.
Searching the internet didn't help much due to the sea of information on normal usage of POSIX semaphores.
Long version
I am developing a unified interface to real-time systems in which I have some added bookkeeping to take care of, protected by a semaphore. This bookkeeping is done on resource allocation and deallocation, which happens in non-real-time context.
With RTAI, however, the thread waiting on and posting a semaphore needs to be in real-time context. This means that using RTAI's named semaphore means switching between real-time and non-real-time context on every wait/post in user space, and worse, creating a short real-time thread for every sem/wait in kernel space.
What I am looking for is a way to share a normal Linux or POSIX semaphore between kernel and user spaces so that I can safely wait/post it in non-real-time context.
Any information on this subject would be greatly appreciated. If this is not possible, do you have any other ideas how this task could be accomplished?1
1 One way would be to add a system call, have the semaphore in kernel space, and have user space processes invoke that system call and the semaphore would be all managed in kernel space. I would be happier if I didn't have to patch the kernel just because of this though.
Well, you were in the right direction, but not quite -
Linux named POSIX semaphores are based on futexes (fast user-space mutexes). As the name implies, while their implementation is assisted by the kernel, a big chunk of it is done by user code. Sharing such a semaphore between kernel and user space would require re-implementing this infrastructure in the kernel. Possible, but certainly not easy.
SysV semaphores, on the other hand, are implemented completely in the kernel and are only accessible to user space via standard system calls (e.g. semtimedop() and friends).
This means that every SysV-related operation (semaphore creation, taking, or releasing) is actually implemented in the kernel, and you can simply call the underlying kernel function from your code to take the same semaphore from the kernel when needed.
Thus, your user code will simply call semtimedop(). That's the easy part.
The kernel part is just a little bit more tricky: you have to find the code that implements semtimedop() and related calls in the kernel (they are all in the file ipc/sem.c) and create a replica of each of the functions that does what the original function does, minus the calls to copy_from_user(...), copy_to_user(...), and friends.
The reason for this is that those kernel functions expect to be called from a system call with a pointer to a user buffer, while you want to call them with parameters in kernel buffers.
Take for example semtimedop() - the relevant kernel function is sys_semtimedop() in ipc/sem.c (see here: http://lxr.free-electrons.com/source/ipc/sem.c#L1537). If you copy this function into your kernel code, remove the parts that do copy_from_user() and copy_to_user(), and simply use the passed pointers (since you'll be calling from kernel space), you'll get a kernel equivalent that can take a SysV semaphore from kernel space alongside user space - as long as you call it from process context in the kernel (if you don't know what that last sentence means, I highly recommend reading Linux Device Drivers, 3rd edition).
Best of luck.
One solution I can think of is to have a /proc (or /sys or whatever) file in a main kernel module, where writing 0/1 to it (or reading from / writing to it) would cause it to issue an up/down on a semaphore. Exporting that semaphore lets other kernel modules access it directly, while user applications go through the /proc file system.
I'd still wait to see if the original question has an answer.
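For what it's worth, a minimal sketch of that idea (assuming a 5.6+ kernel with struct proc_ops; the file and symbol names are made up for illustration):

#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/semaphore.h>
#include <linux/uaccess.h>

static struct semaphore shared_sem;     /* the semaphore other modules would use */
EXPORT_SYMBOL(shared_sem);

static ssize_t sem_proc_write(struct file *f, const char __user *buf,
                              size_t len, loff_t *off)
{
        char c;

        if (len < 1 || get_user(c, buf))
                return -EFAULT;
        if (c == '1')
                up(&shared_sem);                        /* post */
        else if (c == '0' && down_interruptible(&shared_sem))
                return -ERESTARTSYS;                    /* wait, interruptibly */
        return len;
}

static const struct proc_ops sem_proc_ops = {
        .proc_write = sem_proc_write,
};

static int __init sem_proc_init(void)
{
        sema_init(&shared_sem, 1);
        proc_create("shared_sem", 0220, NULL, &sem_proc_ops);
        return 0;
}

static void __exit sem_proc_exit(void)
{
        remove_proc_entry("shared_sem", NULL);
}

module_init(sem_proc_init);
module_exit(sem_proc_exit);
MODULE_LICENSE("GPL");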
I'm not really experienced in this by any means, but here's my take. If you look at glibc's implementation of sem_open() and sem_wait(), it really just creates a file in /dev/shm, mmaps a struct from it, and uses atomic operations on it. If you want to access the named semaphore from kernel space, you would probably have to patch the tmpfs subsystem. However, I think this would be difficult, as it wouldn't be straightforward to determine whether a given file is meant to be a named semaphore.
An easier way would probably be to just reuse the kernel's semaphore implementation and have the kernel manage the semaphore for userspace processes. To do this, you would write a kernel module associated with a device file, then define two ioctls for it: one for wait and one for post. Here is a good tutorial on writing kernel modules, including setting up a device file and adding I/O operations for it: http://www.freesoftwaremagazine.com/articles/drivers_linux. I don't know exactly how to implement an ioctl operation, but I think you can just assign a function to the ioctl member of the file_operations struct. I'm not sure what the function signature should be, but you could probably figure it out by digging around in the kernel source.
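A rough sketch of that ioctl approach, as a misc device (the ioctl numbers and names are invented; error handling omitted):

#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/semaphore.h>

#define SEMDEV_IOC_WAIT _IO('s', 0)     /* invented ioctl numbers */
#define SEMDEV_IOC_POST _IO('s', 1)

static struct semaphore semdev_sem;

static long semdev_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
{
        switch (cmd) {
        case SEMDEV_IOC_WAIT:
                return down_interruptible(&semdev_sem) ? -ERESTARTSYS : 0;
        case SEMDEV_IOC_POST:
                up(&semdev_sem);
                return 0;
        }
        return -ENOTTY;
}

static const struct file_operations semdev_fops = {
        .owner          = THIS_MODULE,
        .unlocked_ioctl = semdev_ioctl,  /* the modern spelling of "the ioctl member" */
};

static struct miscdevice semdev = {
        .minor = MISC_DYNAMIC_MINOR,
        .name  = "semdev",               /* appears as /dev/semdev */
        .fops  = &semdev_fops,
};

static int __init semdev_init(void)
{
        sema_init(&semdev_sem, 1);
        return misc_register(&semdev);
}

static void __exit semdev_exit(void)
{
        misc_deregister(&semdev);
}

module_init(semdev_init);
module_exit(semdev_exit);
MODULE_LICENSE("GPL");

User space would then open /dev/semdev and call ioctl(fd, SEMDEV_IOC_WAIT) or ioctl(fd, SEMDEV_IOC_POST).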
As I'm sure you know, even the best working solution to this would likely be very ugly. If I were in your place, I would simply concede the battle and use rendezvous points to sync the processes.
I have read your project's README and I have the following observations. Apologies in advance:
Firstly, there already is a universal interface to real-time systems. It is called POSIX; certainly VxWorks, Integrity, and QNX are POSIX-compliant, and in my experience there are very few problems with portability if you develop within the POSIX API. Whether POSIX is sane or not is another matter, but it's the one we all use.
[The reason most RTOSes are POSIX-compliant is that one of their big markets is defence equipment, and the US DoD won't let you use an OS for their non-IT equipment (e.g. radars) unless it is POSIX-compliant... This has pretty much made it commercially impossible to offer an RTOS without POSIX.]
Secondly, Linux itself can be made into a pretty good real-time OS by applying the PREEMPT_RT patch set. Of all the RTOSes out there, this is probably the best one at the moment from the point of view of making efficient use of all these multi-core CPUs. However, it's not quite as hard-realtime as the others, so it's a quid pro quo.
RTAI takes a different approach of, in effect, placing its own RTOS underneath Linux and making Linux nothing more than one task running in that OS. This approach is OK up to a point, but the big penalty of RTAI is that the real-time part is now (as far as I can tell) not POSIX-compliant (though the API looks like they've just stuck rt_ on the front of some POSIX function names), and interaction with everything else is now, as you're discovering, quite complicated.
PREEMPT_RT is a much more intrusive patch set than RTAI, but the payback is that everything else (like POSIX and valgrind) stays completely normal. Plus, nice things like ftrace are available. Bookkeeping is then a case of merely using existing tools, not having to write new ones. Also, it looks like PREEMPT_RT is gradually worming its way into the mainline Linux kernel anyway, which would render other patch sets like RTAI pretty much pointless.
So Linux + PREEMPT_RT gives us real-time POSIX plus a bunch of tools, just like all the other RTOSes out there; commonality across the board. Which kinda sounds like the goal of your project.
I apologise for not helping with the "how" of your project, and it is highly ungentlemanly of me to query the "why" of it too. But I feel it is important to know that there are established things out there that seem to overlap heavily with what you're trying to do. Unseating King POSIX is going to be difficult.
I would like to answer this differently: you don't want to do this. There are good reasons why there is no interface to do this kind of thing and there are good reasons why all other kernel subsystems are designed and implemented to never need a lock shared between user and kernel space. The complexity of lock ordering and implicit locking in unexpected places will quickly get out of hand if you start playing around with userland that can prevent the kernel from doing certain things.
Let me recall a very long debugging session I did around 15 years ago, to at least shed some light on what complex problems you can run into. I was involved in developing a file system where a large portion of the code was in userland, something like FUSE.
The kernel would do a filesystem operation, package it into a message and send it to the userland daemon and wait for a reply. The userland daemon reads the message, does stuff and writes a reply to the kernel which wakes up and continues with the operation. Simple concept.
One thing you need to understand about filesystems is locking. When you're looking up the name of a file, for example "foo/bar", the kernel somehow gets the node for the directory "foo", then locks it and asks it if it has the file "bar". The filesystem code somehow finds "bar", locks it, and then unlocks "foo". The locking protocol is quite straightforward (unless you're doing a rename): the parent always gets locked before the child, and the child is locked before the parent lock is released. The lookup message for the file is what would get sent to our userland daemon while the directory was still locked; when the daemon replied, the kernel would proceed to first lock "bar" and then unlock "foo".
I don't even remember the symptoms we were debugging, but I remember the issue was not trivially reproducible; it required hours and hours of filesystem torture programs until it manifested itself. But after a few weeks we figured out what was going on. Let's say that the full path to our file was "/a/b/c/foo/bar". We're in the process of doing a lookup on "bar", which means that we're holding the lock on "foo". The daemon is a normal userland process, so some operations it does can block, and it can be preempted too. It's actually talking over the network, so it can block for a long time. While we're waiting for the userland daemon, some other process wants to look up "foo" for some reason. To do this, it has the node for "c", locked of course, and asks it to look up "foo". It manages to find it and attempts to lock it (it has to be locked before we can release the lock on "c") and waits for the lock on "foo" to be released. Another process comes in and wants to look up "c"; it of course ends up waiting for that lock while holding the lock on "b". Another process waits for "b" and holds "a". Yet another process wants "a" and holds the lock on "/".
This is not a problem, not yet. This sometimes happens in normal filesystems too; locks can cascade all the way up to the root, you wait for a while for a slow disk, the disk responds, the congestion eases up, and everyone gets their locks and everything keeps running fine. In our case, though, the reason for holding the lock a long time was that the remote server for our distributed filesystem didn't respond. X seconds later the userland daemon times out, and just before responding to the kernel that the lookup operation on "bar" has failed, it logs a message to syslog with a timestamp. One of the things that the timestamp needs is the timezone information, so it needs to open "/etc/localtime"; of course, to do that, it needs to start looking up "/etc", and for that it needs to lock "/". "/" is already locked by someone else, so the userland daemon waits for that someone else to unlock "/", while that someone else waits, through a chain of 5 processes and locks, for the daemon to respond. The system ends up in a total deadlock.
Now, maybe your code will not have problems like this. You're talking about a real-time system, so there might be a level of control you have that normal kernels don't. But I'm not sure if adding an unexpected layer of locking complexity would even let you keep the real-time properties of the system, or really make sure that nothing you do in userland will ever create a deadlock cascade. If you don't page, if you never touch any file descriptor, if you never do memory operations, and a bunch of other things I can't really think of right now, you might get away with a lock shared between userland and kernel, but it will be hard and you'll probably find unexpected problems.
Multiple solutions exist in Linux/GLIBC, but none permits explicitly sharing a semaphore between user and kernel space.
The kernel provides ways to suspend threads/processes, and the most efficient is the futex. Here are some details on the state of the art of the current implementations for synchronizing user-space applications.
SYSV services
The Linux System V (SysV) semaphores are a legacy of the eponymous Unix OS. They are based on system calls to lock/unlock semaphores. The corresponding services are:
semget() to get an identifier
semop() to make operations on the semaphores (e.g. incrementation/decrementation)
semctl() to make some control operations on the semaphores (e.g. destruction)
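As a quick illustration of these three calls (a minimal user-space sketch; the ftok() key path is arbitrary):

#include <sys/ipc.h>
#include <sys/sem.h>

union semun { int val; struct semid_ds *buf; unsigned short *array; };

int main(void)
{
    key_t key = ftok("/tmp", 'S');                 /* arbitrary key */
    int id = semget(key, 1, IPC_CREAT | 0600);     /* set of one semaphore */
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };

    semctl(id, 0, SETVAL, (union semun){ .val = 1 });  /* counter = 1 */
    semop(id, &op, 1);        /* decrement: "lock" (may block) */
    /* ... critical section ... */
    op.sem_op = 1;
    semop(id, &op, 1);        /* increment: "unlock" */
    semctl(id, 0, IPC_RMID);  /* destroy */
    return 0;
}

Every one of those semop()/semctl() calls is a full system call, which is exactly the overhead discussed below.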
The GLIBC (e.g. version 2.31) does not provide any added value on top of those services; the library directly calls the eponymous system call. For example, semop() (in sysdeps/unix/sysv/linux/semtimedop.c) directly invokes the corresponding system call:
int
__semtimedop (int semid, struct sembuf *sops, size_t nsops,
              const struct timespec *timeout)
{
  /* semtimedop wire-up syscall is not exported for 32-bit ABIs (they have
     semtimedop_time64 instead with uses a 64-bit time_t).  */
#if defined __ASSUME_DIRECT_SYSVIPC_SYSCALLS && defined __NR_semtimedop
  return INLINE_SYSCALL_CALL (semtimedop, semid, sops, nsops, timeout);
#else
  return INLINE_SYSCALL_CALL (ipc, IPCOP_semtimedop, semid,
                              SEMTIMEDOP_IPC_ARGS (nsops, sops, timeout));
#endif
}
weak_alias (__semtimedop, semtimedop)
Nowadays, SysV semaphores (as well as the other SysV IPC mechanisms, like shared memory and message queues) are considered deprecated: since they need a system call for each operation, they slow down the calling processes with systematic context switches. New applications should use the POSIX-compliant services available through the GLIBC.
POSIX services
POSIX semaphores are based on the fast user-space mutex (futex). The principle is to increment/decrement the semaphore counter in user space with atomic operations as long as there is no contention. When there is contention (multiple threads/processes want to "lock" the semaphore at the same time), a futex() system call is made, either to wake up waiting threads/processes when the semaphore is "unlocked", or to suspend threads/processes waiting for the semaphore to be released. From a performance point of view, this makes a big difference compared to the above SysV services, which systematically require a system call for every operation. The POSIX services are implemented in the GLIBC for the user-space part of the operations (the atomic operations), with a switch into kernel space only when there is contention.
For example, in GLIBC 2.31, the service to lock a semaphore is located in nptl/sem_waitcommon.c. It checks the value of the semaphore to decrement it with an atomic operation (in __new_sem_wait_fast()) and invokes the futex() system call (in __new_sem_wait_slow()) to suspend the calling thread only if the semaphore was equal to 0 before the attempt to decrement it.
static int
__new_sem_wait_fast (struct new_sem *sem, int definitive_result)
{
  [...]
  uint64_t d = atomic_load_relaxed (&sem->data);
  do
    {
      if ((d & SEM_VALUE_MASK) == 0)
        break;
      if (atomic_compare_exchange_weak_acquire (&sem->data, &d, d - 1))
        return 0;
    }
  while (definitive_result);
  return -1;
  [...]
}

[...]

static int
__attribute__ ((noinline))
__new_sem_wait_slow (struct new_sem *sem, clockid_t clockid,
                     const struct timespec *abstime)
{
  int err = 0;
  [...]
  uint64_t d = atomic_fetch_add_relaxed (&sem->data,
                                         (uint64_t) 1 << SEM_NWAITERS_SHIFT);
  pthread_cleanup_push (__sem_wait_cleanup, sem);

  /* Wait for a token to be available.  Retry until we can grab one.  */
  for (;;)
    {
      /* If there is no token available, sleep until there is.  */
      if ((d & SEM_VALUE_MASK) == 0)
        {
          err = do_futex_wait (sem, clockid, abstime);
          [...]
The POSIX services based on the futex include, for example:
sem_init() to create a semaphore
sem_wait() to lock a semaphore
sem_post() to unlock a semaphore
sem_destroy() to destroy a semaphore
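For instance, a minimal sketch of an unnamed semaphore shared between the threads of one process:

#include <semaphore.h>
#include <pthread.h>
#include <stdio.h>

static sem_t sem;

static void *worker(void *arg)
{
    sem_wait(&sem);            /* futex() system call only on contention */
    puts("in critical section");
    sem_post(&sem);
    return NULL;
}

int main(void)
{
    pthread_t t;

    sem_init(&sem, 0, 1);      /* pshared=0: shared between threads only */
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    sem_destroy(&sem);
    return 0;
}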
To manage mutexes (i.e. binary semaphores), it is possible to use the pthread services, which are also based on the futex. For example:
pthread_mutex_init() to create/initialize a mutex
pthread_mutex_lock/unlock() to lock/unlock a mutex
pthread_mutex_destroy() to destroy a mutex
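These can also synchronize distinct processes via the process-shared attribute; a sketch with the mutex placed in anonymous shared memory (error handling omitted):

#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Put the mutex in shared memory so the forked child sees the same one. */
    pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(m, &attr);

    if (fork() == 0) {          /* child */
        pthread_mutex_lock(m);
        puts("child in critical section");
        pthread_mutex_unlock(m);
        _exit(0);
    }
    pthread_mutex_lock(m);
    puts("parent in critical section");
    pthread_mutex_unlock(m);
    wait(NULL);
    return 0;
}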
I was thinking about ways the kernel and userland can share things directly, i.e. without the syscall/copy-in/copy-out cost. One thing I remembered is the RDMA model, where the kernel writes/reads directly to/from user space, with synchronization of course. You may want to explore that model and see if it works for your purpose.

How does sched_setaffinity() work?

I am trying to understand how the Linux syscall sched_setaffinity() works. This is a follow-on from my question here.
I have this guide, which explains how to use the syscall and has a pretty neat (working!) example.
So I downloaded the Linux 2.6.27.19 kernel sources.
I did a 'grep' for lines containing that syscall, and I got 91 results. Not promising.
Ultimately, I'm trying to understand how the kernel is able to set the instruction pointer for a specific core (or processor.)
I am familiar with how single-core, single-threaded programs work. One might issue a 'jmp foo' instruction, and this basically sets the IP to the memory address of the 'foo' label. But when one has multiple cores, one has to say "fetch the next instruction at memory address foo, and set the instruction pointer for core number 2 to begin execution there".
Where, in the assembly code, are we specifying which core performs that operation?
Back to the kernel code: what is important here? The file kernel/sched.c has a function called sched_setaffinity(), but it returns type "long", which is inconsistent with its manual page. So what is important here? Which of these modules shows the assembly instructions issued? What module reads the task_struct, looks at its cpus_allowed member, and then translates that into an instruction? (I've also thumbed through the glibc source, but I think it just makes a call to the kernel code to accomplish this task.)
sched_setaffinity() simply tells the scheduler which CPUs that process/thread is allowed to run on, then calls for a re-schedule.
The scheduler actually runs on each one of the CPUs, so it gets a chance to decide what task to execute next on that particular CPU.
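From user space, the call itself is simple (a sketch; pid 0 means "the calling thread", and _GNU_SOURCE is needed for the CPU_* macros):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(1, &mask);          /* allow core 1 only */
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1)
        perror("sched_setaffinity");
    /* From here on, the scheduler will only run this thread on core 1. */
    return 0;
}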
If you're interested in how you can actually run some code on another CPU, take a look at smp_call_function_single(). When we want to call something on another CPU, it calls generic_exec_single(), which simply adds the function to the target CPU's call queue and forces a reschedule through some IPI machinery (if the queue was empty).
Bottom line: there is no actual SMP variant of the jmp instruction. Instead, code running on the other CPUs cooperates to accomplish the task.
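To illustrate that cooperation, a kernel-side fragment (to be called from a module, e.g. in its init path; the function name is invented):

#include <linux/smp.h>
#include <linux/printk.h>

static void say_hello(void *info)
{
        /* Runs on the target CPU, in interrupt context. */
        pr_info("hello from CPU %d\n", smp_processor_id());
}

/* Run say_hello() on CPU 1 and wait for it to complete. */
smp_call_function_single(1, say_hello, NULL, 1);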
I think the thing you are not understanding is that the kernel is running on all the CPU cores. At every timer interrupt (~1000 per second), the scheduler runs on each CPU and chooses a process to run. There is no single CPU that somehow tells the others to start running a process. sched_setaffinity() works by just setting flags on the process; the scheduler reads these flags and will not run that process on its CPU if the flags disallow it.
Where, in the assembly code, are we specifying which core performs that operation?
There is no assembly involved here. Every task (thread) is assigned to a single CPU (or core, in your terms) at a time. To stop running on a given CPU and resume on another, the task has to "migrate". When a task migrates from one CPU to another, the scheduler picks the CPU that is more idle among the CPUs allowed by sched_setaffinity().
There are no magic assembly instructions issued. The kernel has a lower-level view of the hardware: each CPU is a separate object, which looks very different from the user-space point of view (in user space, CPUs are almost invisible).
Check this out: B Operating System Programming Guidelines
