Parallel processing in dual core ARMv7 processor - c

I am working on Zedboard which contains dual-core ARM A9 processors and runs Linux. The board communicates with an external I/O device.
I have two functions written in 'C' language, which I have to run in parallel.
One function calls a while loop and continuously dumps data to the external device and receives the processed data back into a memory pointer.
The other function reads the data from the pointer location creates a copy of it and does computationally intensive processes (such as FFT, signal alignment etc which is slow).
The external device needs data at 15 million samples per second. which I am able to achieve if I only run the first function and it takes about 70% of one ARM core. When I run both the functions both of the ARM cores reaches its limit and I find that I am not able to provide the data to the external device in the required sample speed.
Is there a way in which I can restrict both the functions in independent cores (it doesn't matter of the second function is slow but the performance of the first function can't be compromised) and still be able to share data between them?
I tried using OpenMP but it didn't work in achieving the required performance. I read about SCHED_SETAFFINITY but had a problem in understanding its implementation.
I have optimized each of my functions as much as I could using NEON constructs/libraries and the auto-vectorization feature of ARM processors.

You can set each separate thread to a different core with:
int sched_setaffinity(pid_t pid,size_t cpusetsize,cpu_set_t *mask);
From the man page:
Description
A process's CPU affinity mask determines the set of CPUs on which it is eligible to run. On a multiprocessor system, setting the CPU affinity mask can be used to obtain performance benefits. For example, by dedicating one CPU to a particular process (i.e., setting the affinity mask of that process to specify a single CPU, and setting the affinity mask of all other processes to exclude that CPU), it is possible to ensure maximum execution speed for that process. Restricting a process to run on a single CPU also avoids the performance cost caused by the cache invalidation that occurs when a process ceases to execute on one CPU and then recommences execution on a different CPU.
But if your code has hard data relations between input and output thread, multithreading can be slower as single core usage! This is hardly related to the memory/cache and especially on arm on all the bridges between core/memory/cache and external bus systems. You should play around with the priority, affinity and maybe other parameters as well.
BTW: "15 million samples per second" and FFT with IO on 1 GHZ Arm with Linux in parallel. Wow! Hot stuff ;)

Related

Multi-threading on ARM cortex A9 dual core (Linux or VxWorks)

I am working on how dual core (especially in embedded systems) could be beneficial. I would like to compare two targets: one with ARM -cortex-A9 (925 MHz) dual core, and the other with ARM cortex-A8 single core.
I have some ideas (please see below), but I am not sure, I will use the dual core features:
My questions are:
1-how to execute several threads on different cores (without OpenMP, because it didn't work on my target, and it isn't compatible with VxWorks)
2-How the kernel execute code on dual core with shared memory: how it allocate stack, heap, memory for global variable and static ones?
3-Is it possible to add C-flags in order to indicate number of CPU cores so we will be able to use dual core features.
4-How the kernel handle program execution (with a lot of threads) on dual core.
Some tests to compare two architecture regarding OS and Dual core / single core
Dual core VS single core:
create three threads that execute some routines and depend on each results other (like matrix multiplication). Afterwards measure the time taken to on the dual core once and the on the single core (how is it possible without openMP)
Ping pong:
One process sends a message to the other.two processes repeatedly pass a message back and forth. We should investigate how the time taken varies with the size of the message for each architecture.
One to all:
A process with rank 0 sends the same message to all other processes in the program and then receives a message of the same length from all other processes. How does the time taken varies with the size of the messages and with the number of processes?
Short answers wrt. Linux only:
how to execute several threads on different cores
Use multiple processes, or pthreads within a single process.
How the kernel execute code on dual core with shared memory: how it allocate stack, heap, memory for global variable and static ones?
In Linux, they all belong to the process. (You can declare thread-local variables, though, with for example the __thread keyword.)
In other words, if a thread allocates some memory, that memory is immediately visible to all other threads in the same process. Even if that thread exits, nothing happens to the memory. It is perfectly normal for some other thread to free the memory later on.
Each thread does get their own stack, which by default is quite large. (With pthreads, this is easy to control using a pthread_attr_t that specifies a smaller stack size.)
In general, there is no difference in memory handling or program execution, between a single-threaded and a multi-threaded process, on the kernel side. (Userspace code needs to use memory and threads properly, of course; the kernel does not try to stop stupid userspace code from shooting itself in the head, at all.)
Is it possible to add C-flags in order to indicate number of CPU cores so we will be able to use dual core features.
Yes, for example by examining the /proc/cpuinfo pseudofile.
However, it is much more common to leave such details for the system administrator. Services like Apache etc. let the administrator configure the number, either in a configuration file, or in the command line. Even make supports a -j JOBS parameter, allowing multiple compilation tasks to run in parallel.
I very warmly recommend you forget about any detection magic, and instead let the user specify the number of threads used. (If you insist, you could use detection magic for the default, but then only if the user/admin has not given you any hints about the number of threads to be used.)
It is also possible to set the CPU mask for each thread, specifying the set of cores that thread may run on. Other than in benchmarks, where this is done more for repeatability than anything else, or in dedicated processes designed to hog all the resources on a machine, it is very rare.
How the kernel handle program execution (with a lot of threads) on dual core.
The exact same way as it handles a lot of simultaneous processes.
There are some controls (control group stuff, cpu masks) and maybe resource accounting that are handled differently between separate processes and threads within the same process, but in general terms, separate threads in a single process are executed the same way as separate processes are.

What is meant by interleaved "multi-threading" in c?

Can anybody explain what is an interleaved multi-threading means?
Real-time examples are also allowed.
This is described by Intel as hyper-threading. The CPU has a single core with 2 register sets.
These can be used to increase the utilisation of the core.
This is opaque to code, in that it behaves like 2 cores. However only one at a time can run.
If multi-threaded, your code still needs atomic, mutexes etc
The wiki says:
The purpose of interleaved multithreading is to remove all data
dependency stalls from the execution pipeline. Since one thread is
relatively independent from other threads, there's less chance of one
instruction in one pipe stage needing an output from an older
instruction in the pipeline. Conceptually, it is similar to preemptive
multitasking used in operating systems; an analogy would be that the
time slice given to each active thread is one CPU cycle.
This Wikipedia page explains it pretty well.
Basically it's about interleaving instructions from different OS-level threads on the CPU, to reduce the risk of costly dependencies between instructions.

CPU load for C process with respect to core(s)

I am about to find out how a specific process, in C, loads the CPU over a certain time frame.
The process may switch processor core during runtime, therefore
I need to handle that too. The CPU is an ARM processor.
I have looked at different ways to get the load, from standard top,
perf and also to calculate the load through the statistics given in the
/proc/[pid]/stat-file.
My thoughts is to have a program that read the /proc/[pid]/stat-file as suggested in the thread:
"How to calculate the CPU usage of a process by PID in Linux from C?" and calculate the load accordingly.
But how would I treat core switching? I need to notice it and adjust the load calculation.
How would you recommend me to achieve this?
Update: How can I see which core the process runs in and by that examine if it has switched core since the last chack assuming I poll the process figures/statistics at least twice?
The perf tool can tell you the amount of cpu-migrations your process has made. That is, how many times the process has switched cpu. It won't tell you which cpu cores though.

pthread offer no performance increase when using virtual cores

I am playing around with pthreads for the first time and have noticed something strange when running on my machine.
I have an Intel i5 with 2 physical cores and 4 virtual cores.
When running my program with 2 threads, I get roughly double the performance, yet when running with 4 threads, I get the same performance as two threads. Why is this the case?
Results with 2 threads:
real 0m9.335s
user 0m18.233s
sys 0m0.132s
Results with 4 threads:
real 0m9.427s
user 0m34.130s
sys 0m0.180s
Edit: The code is fully parallelizable and the threads are running independently without any shared resources.
Because you only really have 2 cores. Hyper-threading will not magically create 2 more cores for you. Hyper-threading makes it possible to run 4 threads on the CPU but not simultaneously. It will still allocate the threads on the two physical cores and switch the threads back and forth in the execution pipeline.
The performance increase you may expect is at BEST 30%.
Keep in mind that hyperthreading is basically a way of reusing spare execution units on the CPU for a separate thread of execution. You're still working with the horsepower of two cores, it's just split four ways.
If your code is optimized such that it fully utilizes most of the available EUs, there's no spare resources left once it's running on both physical cores, so the hyperthreaded cores can't do any better.
This old article from when HyperThreading (HT) was first introduced provides a lot of details on how it works (though I'm sure many improvements have been made over the last 10 years). http://www.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf:
Each logical processor maintains a complete set of the architecture state. The architecture state consists of registers including the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors. The number of transistors to store the architecture state is an extremely small fraction of the total.
However, the following sentence shows where HT can bottleneck:
Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses.
If the threads execution are each keeping one or more of those shared resources (such as the execution unit or buses) 100% busy, then the hyperthreading will not improve throughput. Since benchmarks often exercise one aspect of a system (intentionally or not), it's not surprising that one of these shared processor resources would end up being a bottleneck and prevent HT from showing a benefit.
The performance gain when using multiple threads is very difficult to determine. Hyperthreading is also "less than one extra core" in performance for sure.
Besides from that, you may run into memory throughput issues, or your code is contending over locks or some such now that you have more of them - even if your own code is lock-less doesn't mean that for example I/O or some functions you call are completely able to run in parallel - there are sometimes "hidden" shared resources.
But most likely, your processor just can't go any faster.

Multiple threads and CPU cache

I am implementing an image filtering operation in C using multiple threads and making it as optimized as possible. I have one question though: If a memory is accessed by thread-0, and concurrently if the same memory is accessed by thread-1, will it get it from the cache ? This question stems from the possibility that these two threads could be running into two different cores of the CPU. So another way of putting this is: do all the cores share the same common cache memory ?
Suppose i have a memory layout like the following
int output[100];
Assume there are 2 CPU cores and hence I spawn two threads to work concurrently. One scheme could be to divide the memory into two chunks, 0-49 and 50-99 and let each thread work on each chunk. Another way could be to let thread-0 work on even indices, like 0 2 4 and so on.. while the other thread work on odd indices like 1 3 5 .... This later technique is easier to implement (specially for 3D data) but I am not sure if I could use the cache efficiently this way.
The answer to this question strongly depends upon the architecture and the cache level, along with where the threads are actually running.
For example, recent Intel multi core CPUs have a L1 caches that are per-core, and an L2 cache that is shared among cores that are in the same CPU package; however different CPU packages will have their own L2 caches.
Even in the case when your threads are running on two cores within the one package though, if both threads access data within the same cacheline you will have that cacheline bouncing between the two L1 caches. This is very inefficient, and you should design your algorithm to avoid this situation.
A few comments have asked about how to go about avoiding this problem.
At heart, it's really not particularly complicated - you just want to avoid two threads from simultaneously trying to access data that is located on the same cache line, where at least one thread is writing to the data. (As long as all the threads are only reading the data, there's no problem - on most architectures, read-only data can be present in multiple caches).
To do this, you need to know the cache line size - this varies by architecture, but currently most x86 and x86-64 family chips use a 64 byte cache line (consult your architecture manual for other architectures). You will also need to know the size of your data structures.
If you ask your compiler to align the shared data structure of interest to a 64 byte boundary (for example, your array output), then you know that it will start at the start of a cache line, and you can also calculate where the subsequent cache line boundaries are. If your int is 4 bytes, then each cacheline will contain exactly 8 int values. As long as the array starts on a cacheline boundary, then output[0] through output[7] will be on one cache line, and output[8] through output[15] on the next. In this case, you would design your algorithm such that each thread works on a block of adjacent int values that is a multiple of 8.
If you are storing complicated struct types rather than plain int, the pahole utility will be of use. It will analyse the struct types in your compiled binary, and show you the layout (including padding) and total size. You can then adjust your structs using this output - for example, you may want to manually add some padding so that your struct is a multiple of the cache line size.
On POSIX systems, the posix_memalign() function is useful for allocating a block of memory with a specified alignment.
In general it is a bad idea to share overlapping memory regions like if one thread processes 0,2,4... and the other processes 1,3,5... Although some architectures may support this, most architectures will not, and you probably can not specify on which machines your code will run on. Also the OS is free to assign your code to any core it likes (a single one, two on the same physical processor, or two cores on separate processors). Also each CPU usually has a separate first level cache, even if its on the same processor.
In most situations 0,2,4.../1,3,5... will slow down performance extremely up to possibly being slower than a single CPU.
Herb Sutters "Eliminate False Sharing" demonstrates this very well.
Using the scheme [...n/2-1] and [n/2...n] will scale much better on most systems. It even may lead to super linear performance as the cache size of all CPUs in sum can be possibly used. The number of threads used should be always configurable and should default to the number of processor cores found.
I might be mistaking, but whether the core's cache is shared or not depends on the implementation of the CPU. You'd have to look up the technical sheets on the manufacturer's page to check whether each core in your CPU has their own cache or whether the cache was shared.
I was working on image manipulation as well for a security company and sometimes we got corrupted images after running batch operations on threads. After long investigations we came to the conclusion that the cache was shared between CPU Core's and that in rare cases the data was beeing overwritten or replaced with incorrect data.
Whether this is something to keep into account or is rather a rare event I cannot anwser.
Intel documentation
Intel publishes per-generation datasheets which may contain this kind of information.
For example, for the processor i5-3210M which I had on my older computer, I look up the 3rd generation - Datasheet Volume 1 3.3 "Intel Hyper-Threading Technology (Intel HT Technology)" says:
The processor supports Intel Hyper-Threading Technology (Intel HT Technology)
that allows an execution core to function as two logical processors. While some
execution resources such as caches, execution units, and buses are shared, each
logical processor has its own architectural state with its own set of general-purpose registers and control registers.
which confirms that caches are shared in a given hyperthread for that generation of CPUs.
See also:
similar question for cache sharing across cores: How are cache memories shared in multicore Intel CPUs?
further analysis of threads vs cores: https://superuser.com/questions/133082/what-is-the-difference-between-hyper-threading-and-multiple-cores/995858#995858
the architecture spec itself also has a section about the sharing of certain resources that must be valid across all implementations, although it does not mention caches: What does multicore assembly language look like?

Resources