OpenMP on Kirin 650 - arm

The Kirin 650 is the ARM SoC in Huawei's Honor 5C phone. I am trying to benchmark C++ code directly on a rooted phone I have. The code executes NEON instructions and uses OpenMP parallelization. When I call the OpenMP query functions on the phone, they return:
Num Procs = 8
Num Threads = 1
Max Threads = 8
Num Devices = 0
Default Device = 0
Is Initial Device = 1
These numbers seem coherent, since the Kirin 650 has 8 cores. However, the phone's technical specifications also state that the Kirin 650 has 4 power-saving cores clocked up to 1.7 GHz and 4 performance cores clocked up to 2 GHz.
How does OpenMP handle these clusters, given that, as I understand it, they run asynchronously at different clock rates? In my benchmarks I see speedups when I compare no OpenMP against OpenMP with 2 threads and with 4 threads, but my OpenMP 8-thread results are a disaster.
Do these cores share the same clock? (I'm timing everything with clock_gettime via the Hayai library.) Do you have any advice for running OpenMP code on this kind of platform?
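For reference, here is a minimal sketch of the kind of query program that produces the numbers above (the device-related calls assume an OpenMP 4.0 or newer runtime):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("Num Procs = %d\n", omp_get_num_procs());
    printf("Num Threads = %d\n", omp_get_num_threads());   /* 1 outside a parallel region */
    printf("Max Threads = %d\n", omp_get_max_threads());
    printf("Num Devices = %d\n", omp_get_num_devices());
    printf("Default Device = %d\n", omp_get_default_device());
    printf("Is Initial Device = %d\n", omp_is_initial_device());
    return 0;
}

One thing worth trying on a big.LITTLE part like this, though it is only a guess at the cause of the 8-thread slowdown, is to keep the whole thread team on a single cluster, e.g. by restricting affinity with OMP_PLACES/OMP_PROC_BIND (or libgomp's GOMP_CPU_AFFINITY), so threads are not split across cores running at different clock rates.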

Related

C OpenMP : libgomp: Thread creation failed: Resource temporarily unavailable

I was trying to do a pretty basic project and it seems like I have run out of threads. Do you know how I can fix the problem?
Here is the code:
#include <stdio.h>
#include <omp.h>

int main()
{
    omp_set_num_threads(2150);
    #pragma omp parallel
    {
        printf("%d\n", omp_get_thread_num());
    }
    return 0;
}
and here is the global compiler setting I have written on "other compiler options" on CodeBlocks:
-fopenmp
I am getting the error of:
libgomp: Thread creation failed: Resource temporarily unavailable
I have seen similar threads on the site, but I have not found an answer or solution yet.
Specs:
Intel i5 6400
2x8GB ram
Windows 10 64 bit
The problem is
omp_set_num_threads(2150);
The OS imposes a limit on the number of threads a process can create. The limit may be indirect, for example via the stack size each thread needs. Creating 2150 threads exceeds it.
You mention that you've got an Intel i5 6400, which is a quad-core chip. Try setting the number of threads to something more reasonable. In your case:
omp_set_num_threads(4);
For numerical processing, performance will likely suffer when using more than 4 threads on a 4-core system.
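As a sketch of the same idea without hard-coding the count, you can let the runtime report how many logical processors are available:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* one thread per logical processor instead of a hard-coded 2150 */
    omp_set_num_threads(omp_get_num_procs());
    #pragma omp parallel
    {
        printf("%d\n", omp_get_thread_num());
    }
    return 0;
}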

cpuinfo to decide processor affinity

I'm trying to decide a processor-affinity rule for my applications according to /proc/cpuinfo. My Red Hat Linux shows:
processor : 0 to 47, meaning the server has 48 processor units
physical id : 0 to 3, meaning the server has 4 CPU sockets
cpu cores : 6, meaning each socket has 6 cores
siblings : 12, meaning each socket exposes 12 logical processors, i.e. 2 hyperthreads per core
So in total this server has 4 * 6 * 2 = 48 processor units. Am I correct so far?
What I'd like to do is use the sched_setaffinity function. First I'd like to know whether hyperthreads of the same core show up like this, for example:
processor 0 : physical id: 0, core id: 0 ...
processor 24 : physical id: 0, core id: 0 ...
If, in my application, I use CPU_SET(0, &mask) in thread1 and CPU_SET(24, &mask) in thread2, can I then say that thread1 and thread2 will share the same L1 cache, and of course the same L2 cache too? Am I correct in this guess?
You can only guarantee fully shared caches if your threads are scheduled on the same core (i.e. on different hyperthreads), in which case your approach is correct.
But keep in mind that scheduling two tasks on the same core will not necessarily make them run faster than scheduling them on different cores: the L3 cache, which is commonly shared among all cores, is very fast.
You need to check how caches are shared among your processors. Most Intel processors share L2 among 2-4 cores and L3 among all cores, while most AMD models only share L3.
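A minimal sketch of the pinning itself, following the question's CPU_SET approach (CPUs 0 and 24 are just the asker's example; whether they really are siblings of the same core has to be confirmed from the physical id and core id fields in /proc/cpuinfo):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one logical CPU.
   With pid = 0, sched_setaffinity applies to the calling thread. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);
}

int main(void)
{
    /* e.g. call pin_to_cpu(0) in thread1 and pin_to_cpu(24) in thread2 */
    if (pin_to_cpu(0) != 0)
        perror("sched_setaffinity");
    return 0;
}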

cpu cores vs threads

My MacBook Pro, running Boot Camp, has an Intel i7-640M processor, which has 2 cores. Like the other i7 chips, each core is hyperthreaded, so you can have up to 4 threads. I'm using Visual Studio 2010 C/C++ to determine these:
coresAvailable = omp_get_num_procs();
threadsAvailable = omp_get_max_threads();
The "threadsAvailable" comes back with a value of 4, as expected. But "coresAvailable" also is reported as 4.
What am I missing?
omp_get_num_procs returns the number of CPUs the OS reports, and since a hyperthreaded core reports itself as 2 CPUs, a dual-core hyperthreaded chip will report itself as 4 processors.
omp_get_max_threads returns the most threads that will be used in a parallel region of code, so it makes sense that the most threads it will use will be the number of CPUs available.
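If you want the physical-core count rather than the logical-processor count, one option on Windows is GetLogicalProcessorInformation. A rough sketch (error handling omitted, and it assumes a single processor group):

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    /* the first call fails and reports the required buffer size in len */
    GetLogicalProcessorInformation(NULL, &len);
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info = malloc(len);
    GetLogicalProcessorInformation(info, &len);

    int cores = 0, logical = 0;
    for (DWORD i = 0; i < len / sizeof(*info); ++i) {
        if (info[i].Relationship == RelationProcessorCore) {
            ++cores;
            /* each set bit in ProcessorMask is one logical processor of this core */
            for (ULONG_PTR m = info[i].ProcessorMask; m != 0; m >>= 1)
                logical += (int)(m & 1);
        }
    }
    printf("physical cores: %d, logical processors: %d\n", cores, logical);
    free(info);
    return 0;
}

On the i7-640M above this should report 2 physical cores and 4 logical processors.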

Behavior of omp_get_max_threads in parallel regions

I compile this bit of code on Snow Leopard and Linux and I get different results. On Snow Leopard, the first call to omp_get_max_threads returns 2, which is my number of cores, while the second returns 1. On Linux, both calls return 4, which is my number of cores. I think Linux has the correct behavior; am I right? Or are both correct and I just have a misunderstanding of this function?
#include <stdio.h>
#include <omp.h>

int main() {
    printf(" In a serial region; max threads are : %d\n", omp_get_max_threads());
    #pragma omp parallel
    {
        #pragma omp master
        printf(" In a parallel region; max threads are : %d\n", omp_get_max_threads());
    }
}
Mac output:
In a serial region; max threads are : 2
In a parallel region; max threads are : 1
Linux output:
In a serial region; max threads are : 4
In a parallel region; max threads are : 4
This call is well specified in the OpenMP spec; Linux has the correct behavior here.
That said, you are in a master region, which is effectively serial and runs on the main thread, so the value returned there is explainable. If you aren't tied to pure C, I would encourage you to look at the C++ TBB library, and particularly its PPL subset; you will find more generality and composability, for example for nested parallelism.
With Apple-supplied gcc 4.2 [gcc version 4.2.1 (Apple Inc. build 5566)] on Leopard, I get the same results as you (except that my MacBook has fewer cores).
In a serial region; max threads are : 2
In a parallel region; max threads are : 1
Ditto for 4.3.4 from MacPorts.
However, with gcc 4.4.2 and 4.5.0 20091231 (experimental) from MacPorts, on the same computer I get:
In a serial region; max threads are : 2
In a parallel region; max threads are : 2
It looks like this isn't a Mac versus Linux issue, but due to the gcc version.
P.S. OpenMP can do nested parallelism.
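For example, a minimal sketch of nested parallelism (omp_set_nested is the classic way to enable it; newer OpenMP versions prefer omp_set_max_active_levels):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_nested(1);                  /* allow nested parallel regions */
    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(2)
        {
            printf("outer thread %d, inner thread %d\n",
                   outer, omp_get_thread_num());
        }
    }
    return 0;
}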
Just a reminder: there is a forum devoted to OpenMP, read by the developers of OpenMP as well as OpenMP experts worldwide, over at the official OpenMP website: http://openmp.org/forum
It's a great place to ask questions like this and to find a lot of other resources at openmp.org.
Weird. I always get the expected behaviour (using 4.2.1, build 5646.1) with OS X 10.6.2:
On my Mac Pro
In a serial region; max threads are : 8
In a parallel region; max threads are : 8
and on my iMac
In a serial region; max threads are : 2
In a parallel region; max threads are : 2
Must be something else going on here. Are you compiling with just the following?
gcc fname.c -fopenmp

Why would one CPU core run slower than the others?

I was benchmarking a large scientific application and found it would sometimes run 10% slower given the same inputs. After much searching, I found that the slowdown only occurred when it was running on core #2 of my quad-core CPU (specifically, an Intel Q6600 running at 2.4 GHz). The application is single-threaded and spends most of its time in CPU-intensive matrix math routines.
Now that I know one core is slower than the others, I can get accurate benchmark results by setting the processor affinity to the same core for all runs. However, I still want to know why one core is slower.
I tried several simple test cases to determine the slow part of the CPU, but the test cases ran with identical times, even on slow core #2. Only the complex application showed the slowdown. Here are the test cases that I tried:
Floating point multiplication and addition:
accumulator = accumulator*1.000001 + 0.0001;
Trigonometric functions:
accumulator = sin(accumulator);
accumulator = cos(accumulator);
Integer addition:
accumulator = accumulator + 1;
Memory copy while trying to make the L2 cache miss:
int stride = 4*1024*1024 + 37; // L2 cache size + small prime number
for (long iter = 0; iter < iterations; ++iter) {
    for (int offset = 0; offset < stride; ++offset) {
        for (int i = offset; i < array_size; i += stride) {
            array1[i] = array2[i];
        }
    }
}
The Question: Why would one CPU core be slower than the others, and what part of the CPU is causing that slowdown?
EDIT: More testing showed some Heisenbug behavior. When I explicitly set the processor affinity, then my application does not slow down on core #2. However, if it chooses to run on core #2 without an explicitly set processor affinity, then the application runs about 10% slower. That explains why my simple test cases did not show the same slowdown, as they all explicitly set the processor affinity. So, it looks like there is some process that likes to live on core #2, but it gets out of the way if the processor affinity is set.
Bottom Line: If you need to have an accurate benchmark of a single-threaded program on a multicore machine, then make sure to set the processor affinity.
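As an illustration of that bottom line (assuming a Linux host; on Windows, SetProcessAffinityMask plays a similar role), a benchmark harness can pin itself before timing anything:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* pin the whole single-threaded process to core 0 before benchmarking */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");

    printf("running on CPU %d\n", sched_getcpu());
    /* ... run the timed workload here ... */
    return 0;
}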
You may have other applications that have opted to be attached to the same processor (CPU affinity).
Operating systems often like to keep their own work on one processor so that its data stays cached in that core's L1. If you happen to run your process on the same core the OS is doing a lot of its work on, you can see a slowdown in your CPU's performance.
It sounds like some process wants to stick to that CPU; I doubt it's a hardware issue.
It doesn't necessarily have to be your OS doing the work; some other background daemon could be doing it.
Most modern CPUs throttle each core separately for overheating or power-saving reasons. You may try turning off power saving or improving cooling. Or maybe your CPU is bad. On my i7 I see about 2-3 degrees of difference between the core temperatures of the 8 reported cores in "sensors", and there is still variation at full load.
Another possibility is that the process is being migrated from one core to another while running. I'd suggest setting the CPU affinity to the 'slow' core and see if it's just as fast that way.
Years ago, before the days of multicore, I bought myself a dual-socket Athlon MP for 'web development'. Suddenly my Plone/Zope/Python web servers slowed to a crawl. A Google search turned up that the CPython interpreter has a global interpreter lock, but Python threads are backed by OS threads. The OS threads were evenly distributed among the CPUs, but only one at a time could acquire the lock, so all the other threads had to wait.
Setting Zope's CPU affinity to any single CPU fixed the problem.
I've observed something similar on my Haswell laptop. The system was quiet, no X running, just the terminal. Executing the same code with different numactl --physcpubind options gave exactly the same results on all cores except one. I changed the frequency of the cores to Turbo and to other values; nothing helped. All cores ran at the expected speed except one, which was always slower than the others. That effect survived a reboot.
I rebooted the computer and turned off HyperThreading in the BIOS. When it came back online it was fine again. I then turned HyperThreading back on, and it has been fine ever since.
Bizarre. No idea what that could be.
