Why would one CPU core run slower than the others? - benchmarking

I was benchmarking a large scientific application, and found it would sometimes run 10% slower given the same inputs. After much searching, I found the the slowdown only occurred when it was running on core #2 of my quad core CPU (specifically, an Intel Q6600 running at 2.4 GHz). The application is a single-threaded and spends most of its time in CPU-intensive matrix math routines.
Now that I know one core is slower than the others, I can get accurate benchmark results by setting the processor affinity to the same core for all runs. However, I still want to know why one core is slower.
I tried several simple test cases to determine the slow part of the CPU, but the test cases ran with identical times, even on slow core #2. Only the complex application showed the slowdown. Here are the test cases that I tried:
Floating point multiplication and addition:
accumulator = accumulator*1.000001 + 0.0001;
Trigonometric functions:
accumulator = sin(accumulator);
accumulator = cos(accumulator);
Integer addition:
accumulator = accumulator + 1;
Memory copy while trying to make the L2 cache miss:
int stride = 4*1024*1024 + 37; // L2 cache size + small prime number
for(long iter=0; iter<iterations; ++iter) {
for(int offset=0; offset<stride; ++offset) {
for(i=offset; i<array_size; i += stride) {
array1[i] = array2[i];
}
}
}
The Question: Why would one CPU core be slower than the others, and what part of the CPU is causing that slowdown?
EDIT: More testing showed some Heisenbug behavior. When I explicitly set the processor affinity, then my application does not slow down on core #2. However, if it chooses to run on core #2 without an explicitly set processor affinity, then the application runs about 10% slower. That explains why my simple test cases did not show the same slowdown, as they all explicitly set the processor affinity. So, it looks like there is some process that likes to live on core #2, but it gets out of the way if the processor affinity is set.
Bottom Line: If you need to have an accurate benchmark of a single-threaded program on a multicore machine, then make sure to set the processor affinity.

You may have applications that have opted to be attached to the same processor(CPU Affinity).
Operating systems would often like to run on the same processor as they can have all their data cached on the same L1 cache. If you happen to run your process on the same core that your OS is doing a lot of its work on, you could experience the effect of a slowdown in your cpu performance.
It sounds like some process is wanting to stick to the same cpu. I doubt it's a hardware issue.
It doesn't necessarily have to be your OS doing the work, some other background daemon could be doing it.

Most modern cpu's have separate throttling of each cpu core due to overheating or power saving features. You may try to turn off power-saving or improve cooling. Or maybe your cpu is bad. On my i7 I get about 2-3 degrees different core temperatures of the 8 reported cores in "sensors". At full load there is still variation.

Another possibility is that the process is being migrated from one core to another while running. I'd suggest setting the CPU affinity to the 'slow' core and see if it's just as fast that way.
Years ago, before the days of multicore, I bought myself a dual-socket Athlon MP for 'web development'. Suddenly my Plone/Zope/Python web servers slowed to a crawl. A google search turned up that the CPython interpreter has a global interpreter lock, but Python threads are backed by OS threads. OS Threads were evenly distributed among the CPUs, but only one CPU can acquire the lock at a time, thus all the other processes had to wait.
Setting Zope's CPU affinity to any CPU fixed the problem.

I've observed something similar on my Haswel laptop. The system was quiet, no X running, just the terminal. Executing the same code with different numactl --physcpubin option gave exactly the same results on all cores, except one. I changed the frequency of the cores to Turbo, to other values, nothing helped. All cores were running with expected speed, except one which was always running slower than the others. That effect survived the reboot.
I rebooted the computer and turned off HyperThreading in the BIOS. When it came back online it was fine again. I then turned on HyperThreading and it is fine till now.
Bizzare. No idea what that could be.

Related

Scale Down the CPU frequency

Is there any way to tell the kernel that I don't need the full CPU power?
Basically, I want to do some calculation while waiting for another process. But I don't need the full CPU power for that. As the CPU load during the computation is still 100%, the frequency is high. I want to tell the kernel that I am satisfied with a lower CPU frequency in order to save energy.
Instead of calculating using the full frequency and then suspend to wait for the other process, I want to try calculating with lower frequency so that the CPU is not in a lower C state when the other process has finished and the frequency can scale back again.
This doesn't make any sense on a multi-process system, particularly not in Linux. The CPU frequency is a very basic parameter, which affects everything running on the computer - including other processes and the OS itself.
If your program would fiddle with the CPU frequency, it would not only dictate the priority of itself, but also the priority of everything in the computer, including the OS. This isn't possible to do on a desktop system, simply because it doesn't make any sense to have a single application process dictate things that not even the OS dares to meddle with.
If saving power is a priority, you should probably look for completely different alternatives than some desktop Linux solution. PC computers only care about 1) speed, 2) speed and also 3) speed.
The kind of things you ask for are common in real-time embedded systems, where CPUs have a "sleep mode", from which it can wake up to execute something, then go back to sleep. It would usually also be possible for such systems to fiddle with the internal PLL to adjust their own frequency, but such solutions are rare. The industry standard way of doing things is to perform all calculations at max speed, then revert to power-saving sleep mode.
In case of multiple cores - there is a way to specify a certain cpu core to an interrupt. In this way you can save a certain CPU to a certain process:
to find the irq number of the task use:
cat /proc/interrupts
look for your irq number.
lets say that the irp number is 99, so in order to set core #2 to handle this irq, do:
echo 2 > /proc/irq/99/smp_affinity
in this way you can save a certain core to handle your special process.
You can actually use nice() to tell the kernel that your process can live with a lower-than-normal scheduling priority. This effectively reduces the amount of time slices your process will get to use the CPU (typically in favor of other processes running at the same time).
On some more modern systems, if this will reduce the overall CPU load significantly, the CPU might eventually even decide to run on lower frequency. But you typically don't have direct influence on that decision.
Note: Depending on the system, you might be having problems to restore the original nice value (i.e. to scale up on priority again) without running with appropriate permissions.
In case your application is I/O bound and is not doing stupid things wasting CPU cycles like busy-waiting, it shouldn't be necessary to revert to reducing your nice value - Modern CPUs and operating systems should be able to detect themselves when the system is mainly idling around and step down autonomously.
modify scalling_goverance accordingly.
The "scaling_governor" feature enables setting a static frequency to the CPU.
Frequency value must be between scaling_min_freq and scaling_max_freq.
When CPU frequency governor is set to "powersave" mode, CPU is set to the lowest static frequency (within the borders of scaling_min_freq and scaling_max_freq).
Check in below path on the target
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_governors
and select the required scalling governanceby writing to
echo "powersaving"/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
To tune the performance, few of the files can be updated which will make changes in CPU and the frequency and also the scheduler policies.
Based on the performance analysis, and the load balancing.
Modifications can be adopted.
Check /sys/devices/system/cpu/
Example:-
root:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_available_governo
rs
interactive performance
root#:/sys/devices/system/cpu/cpu0/cpufreq#
root#:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_governor
interactive
root#:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_available_frequen
cies
400000 800000 998400 1094400 1190400 1248000 1305600
##:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_available_frequen
cies
400000 800000 998400 1094400 1190400 1248000 1305600

Regarding CPU utilization

Considering the below piece of C code, I expected the CPU utilization to go up to 100% as the processor would try to complete the job (endless in this case) given to it. On running the executable for 5 mins, I found the CPU to go up to a max. of 48%. I am running Mac OS X 10.5.8; processor: Intel Core 2 Duo; Compiler: GCC 4.1.
int i = 10;
while(1) {
i = i * 5;
}
Could someone please explain why the CPU usage does not go up to 100%? Does the OS limit the CPU from reaching 100%?
Please note that if I added a "printf()" inside the loop the CPU hits 88%. I understand that in this case, the processor also has to write to the standard output stream hence the sharp rise in usage.
Has this got something to do with the amount of job assigned to the processor per unit time?
Regards,
Ven.
You have a multicore processor and you are in a single thread scenario, so you will use only one core full throttle ... Why do you expect the overall processor use go to 100% in a similar context ?
Run two copies of your program at the same time. These will use both cores of your "Core 2 Duo" CPU and overall CPU usage will go to 100%
Edit
if I added a "printf()" inside the loop the CPU hits 88%.
The printf send some characters to the terminal/screen. Sending information, Display and Update is handeled by code outside your exe, this is likely to be executed on another thread. But displaying a few characters does not need 100% of such a thread. That is why you see 100% for Core 1 and 76% for Core 2 which results in the overal CPU usage of 88% what you see.

How can I tell if every core on my machine uses the same timer?

I'm trying to write some code to determine if clock_gettime used with CLOCK_MONOTONIC_RAW will give me results coming from the same hardware on different cores.
From what I understand it is possible for each core to produce independent results but not always. I was given the task of obtaining timings on all cores with a precision of 40 nanoseconds.
The reason I'm not using CLOCK_REALTIME is that my program absolutely must not be affected by NTP adjustments.
Edit:
I have found the unsynchronized_tsc function which tries to test whether the TSC is the same on all cores. I am now attempting to find if CLOCK_MONOTONIC_RAW is based on the TSC.
Final edit:
It turns out that CLOCK_MONOTONIC_RAW is always usable on multi-core systems and does not rely on the TSC even on Intel machines.
To do measurements this precisely; you'd need:
code that's executed on all CPUs, that reads the CPU's time stamp counter and stores it as soon as "an event" occurs
some way to create "an event" that is noticed at the same time by all CPUs
some way to prevent timing problems caused by IRQs, task switches, etc.
Various possibilities for the event include:
polling a memory location in a loop, where one CPU writes a new value and other CPUs stop polling when they see the new value
using the local APIC to broadcast an IPI (inter-processor interrupt) to all CPUs
For both of these methods there are delays between the CPUs (especially for larger NUMA systems) - a write to memory (cache) may be visible on the CPU that made the write immediately, and be visible by a CPU on a different physical chip (in a different NUMA domain) later. To avoid this you may need to find the average of initiating the event on all CPUs. E.g. (for 2 CPUs) one CPU initiates and both measure, then the other CPU initiates and both measure, then results are combined to cancel out any "event propagation latency".
To fix other timing problems (IRQs, task switches, etc) I'd want to be doing these tests during boot where nothing else can mess things up. Otherwise you either need to prevent the problems (ensure all CPUs are running at the same speed, disable IRQs, disable thread switches, stop any PCI device bus mastering, etc) or cope with problems (e.g. run the same test many times and see if you get similar results most of the time).
Also note that all of the above can only ensure that the time stamp counters were in sync at the time the test was done, and don't guarantee that they won't become out of sync after the test is done. To ensure the CPUs remain in sync you'd need to rely on the CPU's "monotonic clock" guarantees (but older CPUs don't make that guarantee).
Finally; if you're attempting to do this in user-space (and not in kernel code); then my advice is to design code in a way that isn't so fragile to begin with. Even if the TSCs on different CPUs are guaranteed to be perfectly in sync at all times, you can't prevent an IRQ from interrupting immediately before or immediately after reading the TSC (and there's no way to atomically do something and read TSC at the same time); and therefore if your code requires such precisely synchronised timing then your code's design is probably flawed.

Time difference for same code of multithreading on different processors?

Hypothetical Question.
I wrote 1 multithreading code, which used to form 8 threads and process the data on different threads and complete the process. I am also using semaphore in the code. But it is giving me different execution time on different machines. Which is OBVIOUS!!
Execution time for same code:
On Intel(R) Core(TM) i3 CPU Machine: 36 sec
On AMD FX(tm)-8350 Eight-Core Processor Machine : 32 sec
On Intel(R) Core(TM) i5-2400 CPU Machine : 16.5 sec
So, my question is,
Is there any kind of setting/variable/command/switch i am missing which could be enabled in higher machine but not enabled in lower machine, which is making higher machine execution time faster? Or, is it the processor only, because of which the time difference is.
Any kind of help/suggestions/comments will be helpful.
Operating System: Linux (Centos5)
Multi-threading benchmarks should be performed with significant statistical sampling (ex: around 50 experiments per machines). Furthermore, the "environement" in which the program runs is important too (ex: was firefox running at the same time or not).
Also, depending on resources consumptions, runtimes can vary. In other words, without a more complete portrait of your experimental conditions, it's impossible to answer your question.
Some observations I have made from my personnal experiment:
Huge memory consumption can alter the results depending on the swapping settings on the machine.
Two "identical" machines with the same OS installed under the same conditions can show different results.
When total throughput is small compared to 5 mins, results appear pretty random.
etc.
I used to have a problem about time measure.My problem is the time in multithread is larger than that in single thread. Finally I found the problem is that not to measure the time in each thread and sum them but to measure out of the all thread. For example:
Wrong measure:
int main(void)
{
//create_thread();
//join_thread();
//sum the time
}
void thread(void *)
{
//measure time in thread
}
Right measure:
int main(void)
{
//record start time
//create_thread();
//join_thread();
//record end time
//calculate the diff
}
void thread(void *)
{
//measure time in thread
}

CPU clock frequency and thus QueryPerformanceCounter wrong?

I am using QueryPerformanceCounter to time some code. I was shocked when the code starting reporting times that were clearly wrong. To convert the results of QPC into "real" time you need to divide by the frequency returned from QueryPerformanceFrequency, so the elapsed time is:
Time = (QPC.end - QPC.start)/QPF
After a reboot, the QPF frequency changed from 2.7 GHz to 4.1 GHz. I do not think that the actual hardware frequency changed as the wall clock time of the running program did not change although the time reported using QPC did change (it dropped by 2.7/4.1).
MyComputer->Properties shows:
Intel(R)
Pentium(R)
4 CPU 2.80 GHz; 4.11 GHz;
1.99 GB of RAM; Physical Address Extension
Other than this, the system seems to be working fine.
I will try a reboot to see if the problem clears, but I am concerned that these critical performance counters could become invalid without warning.
Update:
While I appreciate the answers and especially the links, I do not have one of the affected chipsets nor to I have a CPU clock that varies itself. From what I have read, QPC and QPF are based on a timer in the PCI bus and not affected by changes in the CPU clock. The strange thing in my situation is that the FREQUENCY reported by QPF changed to an incorrect value and this changed frequency was also reported in MyComputer -> Properties which I certainly did not write.
A reboot fixed my problem (QPF now reports the correct frequency) but I assume that if you are planning on using QPC/QPF you should validate it against another timer before trusting it.
Apparently there is a known issue with QPC on some chipsets, so you may want to make sure you do not have those chipset. Additionally some dual core AMDs may also cause a problem. See the second post by sebbbi, where he states:
QueryPerformanceCounter() and
QueryPerformanceFrequency() offer a
bit better resolution, but have
different issues. For example in
Windows XP, all AMD Athlon X2 dual
core CPUs return the PC of either of
the cores "randomly" (the PC sometimes
jumps a bit backwards), unless you
specially install AMD dual core driver
package to fix the issue. We haven't
noticed any other dual+ core CPUs
having similar issues (p4 dual, p4 ht,
core2 dual, core2 quad, phenom quad).
From this answer.
You should always expect the core frequency to change on any CPU that supports technology such as SpeedStep or Cool'n'Quiet. Wall time is not affected, it uses a RTC. You should probably stop using the performance counters, unless you can tolerate a few (5-50) millisecond's worth of occasional phase adjustments, and are willing to perform some math in order to perform the said phase adjustment by continuously or periodically re-normalizing your performance counter values based on the reported performance counter frequency and on RTC low-resolution time (you can do this on-demand, or asynchronously from a high-resolution timer, depending on your application's ultimate needs.)
You can try to use the Stopwatch class from .NET, it could help with your problem since it abstracts from all this low-lever stuff.
Use the IsHighResolution property to see whether the timer is based on a high-resolution performance counter.
Note: On a multiprocessor computer, it
does not matter which processor the
thread runs on. However, because of
bugs in the BIOS or the Hardware
Abstraction Layer (HAL), you can get
different timing results on different
processors. To specify processor
affinity for a thread, use the
ProcessThread..::.ProcessorAffinity
method.
Just a shot in the dark.
On my home PC I used to have "AI NOS" or something like that enabled in the BIOS. I suspect this screwed up the QueryPerformanceCounter/QueryPerformanceFrequency APIs because although the system clock ran at the normal rate, and normal apps ran perfectly, all full screen 3D games ran about 10-15% too fast, causing, for example, adjacent lines of dialog in a game to trip on each other.
I'm afraid you can't say "I shouldn't have this problem" when you're using QueryPerformance* - while the documentation states that the value returned by QueryPerformanceFrequency is constant, practical experimentation shows that it really isn't.
However you also don't want to be calling QPF every time you call QPC either. In practice we found that periodically (in our case once a second) calling QPF to get a fresh value kept the timers synchronised well enough for reliable profiling.
As has been pointed out as well, you need to keep all of your QPC calls on a single processor for consistent results. While this might not matter for profiling purposes (because you can just use ProcessorAffinity to lock the thread onto a single CPU), you don't want to do this for timing which is running as part of a proper multi-threaded application (because then you run the risk of locking a hard working thread to a CPU which is busy).
Especially don't arbitrarily lock to CPU 0, because you can guarantee that some other badly coded application has done that too, and then both applications will fight over CPU time on CPU 0 while CPU 1 (or 2 or 3) sit idle. Randomly choose from the set of available CPUs and you have at least a fighting chance that you're not locked to an overloaded CPU.

Resources