What is meant by interleaved "multi-threading" in C?

Can anybody explain what interleaved multi-threading means?
Real-world examples are also welcome.

This is what Intel describes as hyper-threading. The CPU has a single core with 2 register sets.
These can be used to increase the utilisation of the core.
This is transparent to code, in that it behaves like 2 cores. However, only one of them can run at a time.
If your code is multi-threaded, it still needs atomics, mutexes, etc.
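For instance (a minimal POSIX-threads sketch; the counter, thread count and iteration count are just for illustration), a shared counter still needs a mutex even though the two threads may end up sharing one physical core:

/* Minimal illustration: a shared counter still needs a mutex (or an
 * atomic) to avoid lost updates, hyper-threading or not. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                      /* protected read-modify-write */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter); /* always 2000000 with the mutex */
    return 0;
}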

The wiki says:
The purpose of interleaved multithreading is to remove all data
dependency stalls from the execution pipeline. Since one thread is
relatively independent from other threads, there's less chance of one
instruction in one pipe stage needing an output from an older
instruction in the pipeline. Conceptually, it is similar to preemptive
multitasking used in operating systems; an analogy would be that the
time slice given to each active thread is one CPU cycle.

This Wikipedia page explains it pretty well.
Basically it's about interleaving instructions from different OS-level threads on the CPU, to reduce the risk of costly dependencies between instructions.

Related

Modern System Architecture?

What could happen if we used Peterson's solution to the critical section problem on a modern computer? It is my understanding that systems with multiple CPUs can run into difficulty because of the ordering of memory reads and writes with respect to other reads and writes in memory, but is this the problem with most modern systems? Are there any advantages to using semaphores VS mutex locks?
Hey, interesting question! To answer it, let's first make sure the terms are clear. The critical section is just the part of a program that should not be executed concurrently by more than one of that program's processes or threads at a time. Multiple concurrent accesses are not allowed, which simply means that only one process may be inside the critical section at a time. Typically this "critical section" accesses a resource like a data structure or a network connection.
Mutual exclusion (mutex for short) just describes the requirement that only one concurrent process be in the critical section at a time, so concurrent access to shared data must ensure this "mutual exclusion".
This introduces the problem: how do we ensure that processes run completely independently of other processes, in other words, how do we ensure "atomic access" to the various critical sections by the threads?
There are a few solutions to the "critical-section problem" but the one you mention is Peterson's solution so we will discuss that.
Peterson's algorithm is designed for mutual exclusion and allows two tasks to share a single-use resource. The tasks communicate through shared memory.
In the algorithm, two tasks compete for the critical section; you'll have to look into mutual exclusion, bounded waiting and the other properties a bit more for a full understanding, but the gist of it is that in Peterson's method a process waits at most one turn to enter the critical section: if it gives priority to the other task or process, that process runs to completion in the critical section, thereby allowing the waiting process to enter.
That is the original solution proposed; a minimal C sketch is shown below.
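A minimal sketch of the two-task version (task IDs 0 and 1). C11 atomics with their default sequentially consistent ordering are used here, because plain loads and stores would be reordered by the compiler and CPU, which is exactly the problem discussed next:

/* A minimal sketch of Peterson's algorithm for two tasks (self = 0 or 1). */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool flag[2];   /* flag[i]: task i wants to enter          */
static atomic_int  turn;      /* which task yields if both want to enter */

void peterson_lock(int self)
{
    int other = 1 - self;
    atomic_store(&flag[self], true);   /* I want to enter                */
    atomic_store(&turn, other);        /* ...but you may go first        */
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
        ;                              /* busy-wait (spin)               */
}

void peterson_unlock(int self)
{
    atomic_store(&flag[self], false);  /* leave the critical section     */
}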
However, this has no guarantee of working on today's multiprocessing modern architectures, and it only works for two concurrent tasks. Modern computers are messy when it comes to reading and writing, because compilers and CPUs execute out of order: operations that look sequential in the source can become visible in a different order, so the algorithm can fail unless explicit memory barriers (or atomics) are used, and thus there are limitations. I suggest you also take a look at locks. Hope that helps :)
Can anyone else think of anything to add that I might have missed?
It is my understanding that systems with multiple CPUs can run into difficulty because of the ordering of memory reads and writes with respect to other reads and writes in memory, but is this the problem with most modern systems?
No. Any modern system with "less strict" memory ordering will have ways to make the memory ordering stricter where it matters (e.g. fences).
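For example, a sketch of how a C11 fence pair can enforce the ordering on a weakly ordered CPU (the variable names and the publish/consume scenario are only illustrative):

/* Sketch: a release fence on the writer side and an acquire fence on the
 * reader side guarantee that a reader which sees ready == 1 also sees the
 * earlier store to data. */
#include <stdatomic.h>

static int data;
static atomic_int ready;

void publish(int value)
{
    data = value;                                   /* plain store         */
    atomic_thread_fence(memory_order_release);      /* fence: order it...  */
    atomic_store_explicit(&ready, 1, memory_order_relaxed); /* ...before this */
}

int consume(void)
{
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                                           /* wait for the flag   */
    atomic_thread_fence(memory_order_acquire);      /* fence: order reads  */
    return data;                                    /* sees the value      */
}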
Are there any advantages to using semaphores VS mutex locks?
Mutexes are typically simpler and faster (in the same way that a boolean is simpler than a counter); but ignoring overhead a mutex is equivalent to a semaphore with "resource count = 1".
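To illustrate (a POSIX sketch with error handling omitted; the function names are made up), a semaphore initialised with a count of 1 is used much like a mutex, except that it may be posted by a different thread and could also be initialised with a larger count to admit N holders:

/* Sketch: a binary semaphore (initial count 1) used like a mutex. */
#include <semaphore.h>
#include <pthread.h>

static sem_t sem;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void with_semaphore(void)
{
    sem_wait(&sem);             /* acquire: count 1 -> 0 */
    /* ... critical section ... */
    sem_post(&sem);             /* release: count 0 -> 1 */
}

void with_mutex(void)
{
    pthread_mutex_lock(&mtx);
    /* ... critical section ... */
    pthread_mutex_unlock(&mtx);
}

int main(void)
{
    sem_init(&sem, 0, 1);       /* binary semaphore: initial count 1 */
    with_semaphore();
    with_mutex();
    return 0;
}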
What could happen if we used Peterson's solution to the critical section problem on a modern computer?
The big problem here is that most modern operating systems support some kind of multi-tasking (e.g. multiple processes, where each process can have multiple threads), there are usually 100 other processes (just for the OS alone), and modern hardware has power management (where you try to reduce power consumption by putting CPUs to sleep when they can't do useful work). This means that (unbounded) spinning/busy waiting is a horrible idea (e.g. you can have N CPUs being wasted spinning/trying to acquire a lock while the task that currently holds the lock isn't running on any CPU because the scheduler decided that 1234 other tasks should get 10 ms of CPU time each).
Instead; to avoid (excessive) spinning you want to ask the scheduler to block your task until/unless the lock actually can be acquired; and (especially for heavily contended locks) you probably want "fairness" (to avoid the risk of timing problems that lead to some tasks being repeatedly lucky while other tasks starve and make no progress).
This ends up being "no spinning", or "brief spinning" (to avoid scheduler overhead in cases where the task holding the lock actually can/does release it quickly); followed by the task being put on a FIFO queue and the scheduler giving the CPU to a different task or putting the CPU to sleep; where if the lock is released the scheduler wakes up the first task on the FIFO queue. Of course it's never that simple (e.g. for performance you want to do as much as you can in user-space; and you need special care and cooperation between user-space and kernel to avoid race conditions, e.g. the lock being released before a task is put on the wait queue).
Fortunately modern systems also provide simpler ways to implement locks (e.g. "atomic compare and swap"), so there's no need to resort to Peterson's algorithm (even if it's just for insertion/removal of tasks from the real lock's FIFO queue).
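For illustration, a tiny spinlock built on C11 compare-and-swap might look like the sketch below; a real lock would add the blocking, queueing and fairness described above rather than spinning forever:

/* Sketch: a minimal spinlock built from C11 atomic compare-and-swap. */
#include <stdatomic.h>

static atomic_int lock_word;              /* 0 = free, 1 = held */

void spin_lock(void)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak(&lock_word, &expected, 1))
        expected = 0;                     /* CAS failed: reset and retry */
}

void spin_unlock(void)
{
    atomic_store(&lock_word, 0);          /* release the lock */
}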

Parallel processing in dual core ARMv7 processor

I am working on Zedboard which contains dual-core ARM A9 processors and runs Linux. The board communicates with an external I/O device.
I have two functions written in 'C' language, which I have to run in parallel.
One function calls a while loop and continuously dumps data to the external device and receives the processed data back into a memory pointer.
The other function reads the data from the pointer location creates a copy of it and does computationally intensive processes (such as FFT, signal alignment etc which is slow).
The external device needs data at 15 million samples per second, which I am able to achieve if I only run the first function; it then takes about 70% of one ARM core. When I run both functions, both ARM cores reach their limit and I find that I am not able to provide the data to the external device at the required sample rate.
Is there a way in which I can restrict the two functions to independent cores (it doesn't matter if the second function is slow, but the performance of the first function can't be compromised) and still be able to share data between them?
I tried using OpenMP but it didn't achieve the required performance. I read about sched_setaffinity but had trouble understanding how to use it.
I have optimized each of my functions as much as I could using NEON constructs/libraries and the auto-vectorization feature of ARM processors.
You can set each separate thread to a different core with:
int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask);
From the man page:
Description
A process's CPU affinity mask determines the set of CPUs on which it is eligible to run. On a multiprocessor system, setting the CPU affinity mask can be used to obtain performance benefits. For example, by dedicating one CPU to a particular process (i.e., setting the affinity mask of that process to specify a single CPU, and setting the affinity mask of all other processes to exclude that CPU), it is possible to ensure maximum execution speed for that process. Restricting a process to run on a single CPU also avoids the performance cost caused by the cache invalidation that occurs when a process ceases to execute on one CPU and then recommences execution on a different CPU.
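For pinning individual threads rather than the whole process, a rough sketch using the pthread wrapper pthread_setaffinity_np could look like the following; the core numbers 0 and 1 and the two thread functions are assumptions for the dual-core A9, not working code for your board:

/* Sketch: pin the calling thread to a given core. Requires _GNU_SOURCE. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static int pin_self_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *io_thread(void *arg)        /* time-critical I/O loop */
{
    (void)arg;
    int rc = pin_self_to_core(0);
    if (rc != 0)
        fprintf(stderr, "affinity: %s\n", strerror(rc));
    /* ... dump data to the device, receive processed data ... */
    return NULL;
}

static void *dsp_thread(void *arg)       /* FFT / signal alignment */
{
    (void)arg;
    int rc = pin_self_to_core(1);
    if (rc != 0)
        fprintf(stderr, "affinity: %s\n", strerror(rc));
    /* ... copy the buffer and do the heavy processing ... */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, io_thread, NULL);
    pthread_create(&t2, NULL, dsp_thread, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}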
But if your code has hard data dependencies between the input and output threads, multithreading can be slower than single-core usage! This is largely down to memory/cache behaviour, and on ARM especially to all the bridges between core/memory/cache and the external bus systems. You should play around with priority, affinity and maybe other parameters as well.
BTW: "15 million samples per second" and FFT with I/O on a 1 GHz ARM with Linux in parallel. Wow! Hot stuff ;)

Is there a way to suspend OS scheduling for the duration of a program?

I have an assignment where I am analyzing the runtime of various sorting algorithms. I have written the code but I think it's an unfair comparison.
My code basically grabs the clock time before and after the sorting is finished to compute the elapsed time. However, what if the OS decides to interrupt more frequently during the runtime of a specific sorting algorithm, or decides that some other background application should be given more of the CPU time when its thread comes back up?
I am not a CS major so I may not be entirely correct here, but from what I've read previously I was concerned this might have an impact on the results.
I also realize that if OS scheduling is suspended and the program hangs then there might be a serious problem; I am just wondering if it possible.
Normally, there's no real reason for it. The scheduler will slightly increase the execution time, but if the code runs for a few seconds, the change will be tiny.
So unless you're running heavy applications on the same computer, the amount of noise this will add to your tests is negligible.
In Linux, you can use the isolcpus kernel parameter to mark CPUs that won't be used by the scheduler. You can find information here. I'm not sure what the minimal kernel version is.
If you use it, you'll need to use sched_setaffinity to put your thread on an isolated CPU, because the scheduler won't put it there on its own.
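A rough sketch of what that might look like, assuming the kernel was booted with isolcpus=3 (the CPU number is just an example):

/* Sketch: pin the current process to the isolated CPU before benchmarking. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);                                     /* the isolated CPU */
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {  /* 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }
    /* ... run the sorting benchmark here ... */
    return 0;
}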
It is not possible, not in user space code. Otherwise, any malicious process could steal the CPU from others.
If you want precise time counting for your process only, I suggest using the time command. You can read about it here: What do 'real', 'user' and 'sys' mean in the output of time(1)?
Quick answer: you are most likely interested in user time, assuming your code doesn't make a heavy use of syscalls (which would be rather strange for a sorting algorithm)
On an up-to-date POSIX system (basically Linux) you can use clock_gettime with CLOCK_PROCESS_CPUTIME_ID or CLOCK_THREAD_CPUTIME_ID if you make sure the process doesn't migrate between CPUs (you can set its affinity for example).
The difference between the times returned by clock_gettime with those arguments gives the exact time the process/thread spent executing. The only pitfall, as I mentioned, is process migration, as the man page says:
The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.
This means that you don't really need to suspend all other processes just to measure the execution time of your program.
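A small sketch of that measurement (my_sort and the array are placeholders for whatever algorithm is under test):

/* Sketch: measure per-process CPU time of a sort with CLOCK_PROCESS_CPUTIME_ID.
 * Older glibc may need -lrt when linking. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    /* my_sort(array, n);  <- placeholder for the algorithm under test */
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("CPU time: %.9f s\n", elapsed);
    return 0;
}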

pthread offer no performance increase when using virtual cores

I am playing around with pthreads for the first time and have noticed something strange when running on my machine.
I have an Intel i5 with 2 physical cores and 4 virtual cores.
When running my program with 2 threads, I get roughly double the performance, yet when running with 4 threads, I get the same performance as two threads. Why is this the case?
Results with 2 threads:
real 0m9.335s
user 0m18.233s
sys 0m0.132s
Results with 4 threads:
real 0m9.427s
user 0m34.130s
sys 0m0.180s
Edit: The code is fully parallelizable and the threads are running independently without any shared resources.
Because you only really have 2 cores. Hyper-threading will not magically create 2 more cores for you. Hyper-threading makes it possible to run 4 threads on the CPU but not simultaneously. It will still allocate the threads on the two physical cores and switch the threads back and forth in the execution pipeline.
The performance increase you may expect is at BEST 30%.
Keep in mind that hyperthreading is basically a way of reusing spare execution units on the CPU for a separate thread of execution. You're still working with the horsepower of two cores, it's just split four ways.
If your code is optimized such that it fully utilizes most of the available EUs, there's no spare resources left once it's running on both physical cores, so the hyperthreaded cores can't do any better.
This old article from when HyperThreading (HT) was first introduced provides a lot of details on how it works (though I'm sure many improvements have been made over the last 10 years). http://www.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf:
Each logical processor maintains a complete set of the architecture state. The architecture state consists of registers including the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors. The number of transistors to store the architecture state is an extremely small fraction of the total.
However, the following sentence shows where HT can bottleneck:
Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses.
If the threads execution are each keeping one or more of those shared resources (such as the execution unit or buses) 100% busy, then the hyperthreading will not improve throughput. Since benchmarks often exercise one aspect of a system (intentionally or not), it's not surprising that one of these shared processor resources would end up being a bottleneck and prevent HT from showing a benefit.
The performance gain from using multiple threads is very difficult to predict, and hyper-threading certainly gives "less than one extra core" of extra performance.
Besides that, you may run into memory throughput issues, or your threads may now be contending over locks or some such because you have more of them; even if your own code is lock-less, that doesn't mean that, for example, I/O or some functions you call are completely able to run in parallel - there are sometimes "hidden" shared resources.
But most likely, your processor just can't go any faster.

Measuring CPU clocks consumed by a process

I have written a program in C. It's a program created as a result of research. I want to compute the exact number of CPU cycles the program consumes.
Any idea how can I find that?
The valgrind tool cachegrind (valgrind --tool=cachegrind) will give you a detailed output including the number of instructions executed, cache misses and branch prediction misses. These can be accounted down to individual lines of assembler, so in principle (with knowledge of your exact architecture) you could derive precise cycle counts from this output.
Know that it will change from execution to execution, due to cache effects.
The documentation for the cachegrind tool is here.
No you can't. The concept of a 'CPU cycle' is not well defined. Modern chips can run at multiple clock rates, and different parts of them can be doing different things at different times.
The question of 'how many total pipeline steps' might in some cases be meaningful, but there is not likely to be a way to get it.
Try OProfile. It uses various hardware counters on the CPU to measure the number of instructions executed and how many cycles have passed. You can see an example of its use in the article, Memory part 7: Memory performance tools.
I am not entirely sure that I know exactly what you're trying to do, but what can be done on modern x86 processors is to read the time stamp counter (TSC) before and after the block of code you're interested in. On the assembly level, this is done using the RDTSC instruction, which gives you the value of the TSC in the edx:eax register pair.
Note however that there are certain caveats to this approach, e.g. if your process starts out on CPU0 and ends up on CPU1, the result you get from RDTSC will refer to the specific processor core that executed the instruction and hence may not be comparable. (There's also the lack of instruction serialisation with RDTSC, but in this context here, I don't think that's so much of an issue.)
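A sketch of how that read might look with the GCC/Clang __rdtsc() intrinsic (x86 only; work() is a placeholder, and the migration/serialisation caveats above still apply):

/* Sketch: read the time stamp counter before and after a block of code. */
#include <stdio.h>
#include <x86intrin.h>

static void work(void)
{
    /* ... the code being measured ... */
}

int main(void)
{
    unsigned long long start = __rdtsc();
    work();
    unsigned long long end = __rdtsc();
    printf("elapsed TSC ticks: %llu\n", end - start);
    return 0;
}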
Sorry, but no, at least not for most practical purposes -- it's simply not possible with most normal OSes. Just for example, quite a few OSes don't do a full context switch to handle an interrupt, so the time spent servicing an interrupt can and often will appear to be time spent in whatever process was executing when the interrupt occurred.
The "not for practical purposes" would indicate the possibility of running your program under a cycle accurate simulator. These are available, but mostly for CPUs used primarily in real-time embedded systems, NOT for anything like a full-blown PC. Worse, they (generally) aren't for running anything like a full-blown OS, but for code that runs on the "bare metal."
In theory, you might be able to do something with a virtual machine running something like Windows or Linux -- but I don't know of any existing virtual machine that attempts to, and it would be decidedly non-trivial and probably have pretty serious consequences in performance as well (to put it mildly).
