My MacBook Pro, running Boot Camp, has an Intel i7-640M processor, which has 2 cores. Like the other i7 chips, each core is hyperthreaded, so you can have up to 4 threads. I am using Visual Studio 2010 C/C++ to determine these:
coresAvailable = omp_get_num_procs();
threadsAvailable = omp_get_max_threads();
The "threadsAvailable" comes back with a value of 4, as expected. But "coresAvailable" also is reported as 4.
What am I missing?
omp_get_num_procs returns the number of CPUs the OS reports, and since a hyperthreaded core reports itself as 2 CPUs, a dual-core hyperthreaded chip will report itself as 4 processors.
omp_get_max_threads returns the maximum number of threads that will be used in a parallel region of code, so it makes sense that this defaults to the number of logical CPUs available.
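To illustrate (a minimal sketch): omp_get_max_threads() simply reports the default team size for parallel regions, and omp_set_num_threads() lowers it if you want, say, one thread per physical core.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("omp_get_num_procs()   = %d\n", omp_get_num_procs());   /* logical CPUs seen by the OS */
    printf("omp_get_max_threads() = %d\n", omp_get_max_threads()); /* default team size */

    omp_set_num_threads(2);  /* e.g. one thread per physical core on a 2-core/4-thread chip */

    #pragma omp parallel
    {
        #pragma omp single
        printf("threads in this region = %d\n", omp_get_num_threads());
    }
    return 0;
}

Built with /openmp in Visual Studio (or -fopenmp with GCC), on the i7-640M this should print 4, 4, and then 2.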
Related
I have a PC with 24 cores. I have an application that needs to dedicate a thread to one of those cores, and the process itself to a few of those cores. The affinity and priorities are currently hard-coded; I would like to programmatically determine which set of cores my application should set its affinity to.
I have read that you should stay away from core 0, so I am currently using the last 8 cores of the first CPU for the process and the 12th core for the thread I want to run. Here is sample code, which may not be 100% accurate with the parameters.
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
SetProcessAffinityMask(GetCurrentProcess(), 0xFF0);                    // cores 4..11 (last 8 cores of the first CPU)
HANDLE myThread = CreateThread(NULL, 0, entryPoint, NULL, 0, NULL);    // default security/stack, no flags, entry point only
SetThreadPriority(myThread, THREAD_PRIORITY_TIME_CRITICAL);
SetThreadAffinityMask(myThread, 0x1 << 11);                            // pin the thread to core 11 (the 12th core)
I know that with elevated priorities (even with base priority 31) there is no way to dedicate a core to an application (please correct me if I am wrong here, since this is exactly what I want to do; non-programmatic solutions would be fine if I could do that). That being said, the OS itself runs "mostly" on a core or a couple of cores. Is that randomly determined at boot? Can I interrogate the available cores to programmatically determine which set of cores my process and TIME_CRITICAL thread should be running on?
Is there any way to prevent the kernel threads from stealing time slices from my TIME_CRITICAL thread?
I understand Windows is not real-time, but I'm doing the best with what I have. The solution needs to apply to Windows 7, but if it is also supported under XP that would be great.
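For reference, here is a rough sketch of how I imagine interrogating the topology could look (my assumption, not verified: GetLogicalProcessorInformation, available on Windows 7 and XP SP3, is the right API for this):

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

/* Count physical cores and logical processors by walking the array
   that GetLogicalProcessorInformation returns. */
int main(void)
{
    DWORD len = 0;
    GetLogicalProcessorInformation(NULL, &len);   /* first call only reports the required buffer size */

    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info =
        (SYSTEM_LOGICAL_PROCESSOR_INFORMATION *)malloc(len);
    if (!info || !GetLogicalProcessorInformation(info, &len))
        return 1;

    int cores = 0, logical = 0;
    for (DWORD i = 0; i < len / sizeof(*info); ++i) {
        if (info[i].Relationship == RelationProcessorCore) {
            ++cores;
            /* each set bit in ProcessorMask is one logical processor on this core */
            for (ULONG_PTR m = info[i].ProcessorMask; m; m >>= 1)
                logical += (int)(m & 1);
        }
    }
    printf("physical cores: %d, logical processors: %d\n", cores, logical);
    free(info);
    return 0;
}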
Basically, the Kirin 650 is the ARM chip in the Honor 5C phone from Huawei. I am trying to benchmark C++ code directly on the rooted phone I have. This code executes NEON instructions and has OpenMP parallelization. When I query the OpenMP runtime on the phone it returns:
Num Procs = 8
Num Threads = 1
Max Threads = 8
Num Devices = 0
Default Device = 0
Is Initial Device = 1
This information seems consistent, since the Kirin 650 has 8 cores. However, the phone's technical details also specify that the Kirin has 4 cores running at up to 1.7 GHz (power-saving cores) and 4 cores at up to 2 GHz (performance cores).
How does OpenMP handle these cores, given that, as I understand it, the two clusters run at different speeds? In my benchmarks I see speedups when I compare no OpenMP against OpenMP with 2 threads and with 4 threads, but my 8-thread results are a disaster.
Do these cores share the same clock (I'm using clock_gettime through the Hayai library to time everything)? Do you have any advice for running OpenMP code on these kinds of platforms?
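For what it's worth, one approach I am considering is to limit OpenMP to 4 threads and pin them to the performance cluster. A rough sketch, assuming CPUs 4-7 are the 2 GHz cores, which I have not confirmed on the device:

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(4);   /* one thread per (assumed) performance core */

    #pragma omp parallel
    {
        /* Pin each OpenMP thread to one of CPUs 4-7; whether those are the
           2 GHz cores on the Kirin 650 is an assumption to verify. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(4 + omp_get_thread_num(), &set);
        sched_setaffinity(0, sizeof(set), &set);   /* pid 0 = the calling thread */

        #pragma omp single
        printf("using %d pinned threads\n", omp_get_num_threads());
    }
    return 0;
}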
I am currently working on multi-threaded C code on a server with multiple hexa-core CPUs. I want to set the affinity of some of my threads to the respective cores of a single CPU. I have used pthread_setaffinity_np() and also sched_setaffinity(), but I suspect they set affinity on the CPUs, not the cores. Am I right?
pthread_setaffinity_np() et al operate in terms of logical CPUs (i.e. cores), not physical ones (i.e. CPU sockets).
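A minimal sketch of pinning the calling thread to one logical CPU (CPU number 2 here is just an example; which core and socket it belongs to can be read from /proc/cpuinfo):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);   /* logical CPU 2, as numbered in /proc/cpuinfo */

    /* pin the current thread; pthread_setaffinity_np returns 0 on success */
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
    else
        printf("pinned to logical CPU 2\n");
    return 0;
}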
I've been trying to decide on a processor-affinity policy for my applications based on /proc/cpuinfo. My Red Hat Linux shows:
processor : 0 to 47, meaning the server has 48 processor units
physical id : 0 to 3, meaning the server has 4 CPU sockets
cpu cores : 6, meaning each socket has 6 cores
siblings : 12, meaning each socket has 12 logical CPUs, i.e. each core has 2 hyperthreads
So in total this server has 4 * 6 * 2 = 48 processor units. Am I correct so far?
What I would like to do is use the sched_setaffinity function. First I would like to know whether two hyperthreads are in the same core, for example:
processor 0 : physical id: 0, core id: 0 ...
processor 24 : physical id: 0, core id: 0 ...
If in my application I use CPU_SET(0, &mask) in thread1 and CPU_SET(24, &mask) in thread2, can I then say that thread1 and thread2 will share the same L1 cache, and of course the same L2 cache, too? Am I correct in this guess?
You can only guarantee fully shared caches if your threads are being scheduled on the same core (i.e. different hyperthreads) in which case your approach is correct.
But keep in mind that scheduling two tasks on the same core will not necessarily make them run faster than scheduling them on different cores. The L3 cache, which is commonly shared among all cores, is very fast.
You need to check how caches are shared among your processors. Most Intel processors share L2 among 2-4 cores and L3 among all cores while most AMD models only share L3.
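On Linux you can check this from sysfs; a minimal sketch (assuming the usual sysfs layout, where index2 is typically L2 and index3 is L3, though this can vary by CPU):

#include <stdio.h>

/* Print which logical CPUs share a core and a cache with CPU 0. */
static void dump(const char *path)
{
    char buf[128];
    FILE *f = fopen(path, "r");
    if (f && fgets(buf, sizeof buf, f))
        printf("%s: %s", path, buf);
    if (f)
        fclose(f);
}

int main(void)
{
    dump("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list");
    dump("/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list");
    dump("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list");
    return 0;
}

If processor 24 appears in cpu0's thread_siblings_list, then CPU_SET(0) and CPU_SET(24) do indeed land on the two hyperthreads of the same core.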
I was benchmarking a large scientific application and found it would sometimes run 10% slower given the same inputs. After much searching, I found that the slowdown only occurred when it was running on core #2 of my quad-core CPU (specifically, an Intel Q6600 running at 2.4 GHz). The application is single-threaded and spends most of its time in CPU-intensive matrix math routines.
Now that I know one core is slower than the others, I can get accurate benchmark results by setting the processor affinity to the same core for all runs. However, I still want to know why one core is slower.
I tried several simple test cases to determine the slow part of the CPU, but the test cases ran with identical times, even on slow core #2. Only the complex application showed the slowdown. Here are the test cases that I tried:
Floating point multiplication and addition:
accumulator = accumulator*1.000001 + 0.0001;
Trigonometric functions:
accumulator = sin(accumulator);
accumulator = cos(accumulator);
Integer addition:
accumulator = accumulator + 1;
Memory copy while trying to make the L2 cache miss:
int stride = 4*1024*1024 + 37; // L2 cache size + small prime number
for (long iter = 0; iter < iterations; ++iter) {
    for (int offset = 0; offset < stride; ++offset) {
        for (int i = offset; i < array_size; i += stride) {
            array1[i] = array2[i];
        }
    }
}
The Question: Why would one CPU core be slower than the others, and what part of the CPU is causing that slowdown?
EDIT: More testing showed some Heisenbug behavior. When I explicitly set the processor affinity, then my application does not slow down on core #2. However, if it chooses to run on core #2 without an explicitly set processor affinity, then the application runs about 10% slower. That explains why my simple test cases did not show the same slowdown, as they all explicitly set the processor affinity. So, it looks like there is some process that likes to live on core #2, but it gets out of the way if the processor affinity is set.
Bottom Line: If you need to have an accurate benchmark of a single-threaded program on a multicore machine, then make sure to set the processor affinity.
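For reference, a minimal sketch of pinning a whole process to one core before benchmarking (Windows shown here; on Linux, sched_setaffinity or the taskset command does the same job):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Pin this process to logical CPU 0 only (mask bit 0); pick any core
       except the slow one so every benchmark run lands in the same place. */
    if (!SetProcessAffinityMask(GetCurrentProcess(), 0x1))
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());

    /* ... run the benchmark ... */
    return 0;
}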
You may have applications that have opted to be attached to the same processor (CPU affinity).
Operating systems often like to run on the same processor, since they can keep all their data cached in the same L1 cache. If you happen to run your process on the same core the OS is doing a lot of its work on, you could see your CPU performance slow down.
It sounds like some process wants to stick to the same CPU. I doubt it's a hardware issue.
It doesn't necessarily have to be your OS doing the work, some other background daemon could be doing it.
Most modern CPUs throttle each core separately due to overheating or power-saving features. You can try turning off power saving or improving cooling. Or maybe your CPU is bad. On my i7, the core temperatures of the 8 reported cores in "sensors" differ by about 2-3 degrees, and at full load there is still variation.
Another possibility is that the process is being migrated from one core to another while running. I'd suggest setting the CPU affinity to the 'slow' core and seeing if it's just as fast that way.
Years ago, before the days of multicore, I bought myself a dual-socket Athlon MP for 'web development'. Suddenly my Plone/Zope/Python web servers slowed to a crawl. A Google search turned up that the CPython interpreter has a global interpreter lock, but Python threads are backed by OS threads. The OS threads were evenly distributed among the CPUs, but only one thread can hold the lock at a time, so all the others had to wait.
Setting Zope's CPU affinity to a single CPU fixed the problem.
I've observed something similar on my Haswell laptop. The system was quiet, no X running, just the terminal. Executing the same code with different numactl --physcpubind options gave exactly the same results on all cores except one. I changed the frequency of the cores to Turbo and to other values; nothing helped. All cores ran at the expected speed except one, which was always slower than the others. That effect survived a reboot.
I rebooted the computer and turned off Hyper-Threading in the BIOS. When it came back online, it was fine again. I then turned Hyper-Threading back on and it has been fine ever since.
Bizarre. No idea what that could be.