Scale Down the CPU frequency - c

Is there any way to tell the kernel that I don't need the full CPU power?
Basically, I want to do some calculation while waiting for another process. But I don't need the full CPU power for that. As the CPU load during the computation is still 100%, the frequency is high. I want to tell the kernel that I am satisfied with a lower CPU frequency in order to save energy.
Instead of calculating using the full frequency and then suspend to wait for the other process, I want to try calculating with lower frequency so that the CPU is not in a lower C state when the other process has finished and the frequency can scale back again.

This doesn't make any sense on a multi-process system, particularly not in Linux. The CPU frequency is a very basic parameter, which affects everything running on the computer - including other processes and the OS itself.
If your program would fiddle with the CPU frequency, it would not only dictate the priority of itself, but also the priority of everything in the computer, including the OS. This isn't possible to do on a desktop system, simply because it doesn't make any sense to have a single application process dictate things that not even the OS dares to meddle with.
If saving power is a priority, you should probably look for completely different alternatives than some desktop Linux solution. PC computers only care about 1) speed, 2) speed and also 3) speed.
The kind of things you ask for are common in real-time embedded systems, where CPUs have a "sleep mode", from which it can wake up to execute something, then go back to sleep. It would usually also be possible for such systems to fiddle with the internal PLL to adjust their own frequency, but such solutions are rare. The industry standard way of doing things is to perform all calculations at max speed, then revert to power-saving sleep mode.

In case of multiple cores - there is a way to specify a certain cpu core to an interrupt. In this way you can save a certain CPU to a certain process:
to find the irq number of the task use:
cat /proc/interrupts
look for your irq number.
lets say that the irp number is 99, so in order to set core #2 to handle this irq, do:
echo 2 > /proc/irq/99/smp_affinity
in this way you can save a certain core to handle your special process.

You can actually use nice() to tell the kernel that your process can live with a lower-than-normal scheduling priority. This effectively reduces the amount of time slices your process will get to use the CPU (typically in favor of other processes running at the same time).
On some more modern systems, if this will reduce the overall CPU load significantly, the CPU might eventually even decide to run on lower frequency. But you typically don't have direct influence on that decision.
Note: Depending on the system, you might be having problems to restore the original nice value (i.e. to scale up on priority again) without running with appropriate permissions.
In case your application is I/O bound and is not doing stupid things wasting CPU cycles like busy-waiting, it shouldn't be necessary to revert to reducing your nice value - Modern CPUs and operating systems should be able to detect themselves when the system is mainly idling around and step down autonomously.

modify scalling_goverance accordingly.
The "scaling_governor" feature enables setting a static frequency to the CPU.
Frequency value must be between scaling_min_freq and scaling_max_freq.
When CPU frequency governor is set to "powersave" mode, CPU is set to the lowest static frequency (within the borders of scaling_min_freq and scaling_max_freq).
Check in below path on the target
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_governors
and select the required scalling governanceby writing to
echo "powersaving"/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

To tune the performance, few of the files can be updated which will make changes in CPU and the frequency and also the scheduler policies.
Based on the performance analysis, and the load balancing.
Modifications can be adopted.
Check /sys/devices/system/cpu/
Example:-
root:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_available_governo
rs
interactive performance
root#:/sys/devices/system/cpu/cpu0/cpufreq#
root#:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_governor
interactive
root#:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_available_frequen
cies
400000 800000 998400 1094400 1190400 1248000 1305600
##:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_available_frequen
cies
400000 800000 998400 1094400 1190400 1248000 1305600

Related

How to use perf or other utilities to measure the elapsed time of a loop/function

I am working on a project requiring profiling the target applications at first.
What I want to know is the exact time consumed by a loop body/function. The platform is BeagleBone Black board with Debian OS and installed perf_4.9.
gettimeofday() can provide a micro-second resolution but I still want more accurate results. It looks perf can give cycles statistics and thus be a good fit for purposes. However, perf can only analyze the whole application instead of individual loop/functions.
After trying the instructions posted in this Using perf probe to monitor performance stats during a particular function, it does not work well.
I am just wondering if there is any example application in C I can test and use on this board for my purpose. Thank you!
Caveat: This is more of comment than an answer but it's a bit too long for just a comment.
Thanks a lot for advising a new function. I tried that but get a little unsure about its accuracy. Yes, it can offer nanosecond resolution but there is inconsistency.
There will be inconsistency if you use two different clock sources.
What I do is first use clock_gettime() to measure a loop body, the approximate elasped time would be around 1.4us in this way. Then I put GPIO instructions, pull high and pull down, at beginning and end of the loop body, respectively and measure the signal frequency on this GPIO with an oscilloscope.
A scope is useful if you're trying to debug the hardware. It can also show what's on the pins. But, in 40+ years of doing performance measurement/improvement/tuning, I've never used it to tune software.
In fact, I would trust the CPU clock more than I would trust the scope for software performance numbers
For a production product, you may have to measure performance on a system deployed at a customer site [because the issue only shows up on that one customer's machine]. You may have to debug this remotely and can't hook up a scope there. So, you need something that can work without external probe/test rigs.
To my surprise, the frequency is around 1.8MHz, i.e., ~500ns. This inconsistency makes me a little confused... – GeekTao
The difference could be just round off error based on different time bases and latency in getting in/out of the device (GPIO pins). I presume you're just using GPIO in this way to facilitate benchmark/timing. So, in a way, you're not measuring the "real" system, but the system with the GPIO overhead.
In tuning, one is less concerned with absolute values than relative. That is, clock_gettime is ultimately based on number of highres clock ticks (at 1ns/tick or better from the system's free running TSC (time stamp counter)). What the clock frequency actually is doesn't matter as much. If you measure a loop/function and get X duration. Then, you change some code and get X+n, this tells you whether the code got faster or slower.
500ns isn't that large an amount. Almost any system wide action (e.g. timeslicing, syscall, task switch, etc.) could account for that. Unless you've mapped the GPIO registers into app memory, the syscall overhead could dwarf that.
In fact, just the overhead of calling clock_gettime could account for that.
Although the clock_gettime is technically a syscall, linux will map the code directly into the app's code via the VDSO mechanism so there is no syscall overhead. But, even the userspace code has some calculations to do.
For example, I have two x86 PCs. On one system the overhead of the call is 26 ns. On another system, the overhead is 1800 ns. Both these systems are at least 2GHz+
For your beaglebone/arm system, the base clock rate may be less, so overhead of 500 ns may be ballpark.
I usually benchmark the overhead and subtract it out from the calculations.
And, on the x86, the actual code just gets the CPU's TSC value (via the rdtsc instruction) and does some adjustment. For arm, it has a similar H/W register but requires special care to map userspace access to it (a coprocessor instruction, IIRC).
Speaking of arm, I was doing a commercial arm product (an nVidia Jetson to be exact). We were very concerned about latency of incoming video frames.
The H/W engineer didn't trust TSC [or software in general ;-)] and was trying to use a scope, an LED [controlled by a GPIO pin] and when the LED flash/pulse showed up inside the video frame (e.g. the coordinates of the white dot in the video frame were [effectively] a time measurement).
It took a while to convince the engineer, but, eventually I was able to prove that the clock_gettime/TSC approach was more accurate/reliable.
And, certainly, easier to use. We had multiple test/development/SDK boards but could only hook up the scope/LED rig on one at a time.

What could be delaying my select() call?

I have a small program running on Linux (on an embedded PC, dual-core Intel Atom 1.6GHz with Debian 6 running Linux 2.6.32-5) which communicates with external hardware via an FTDI USB-to-serial converter (using the ftdi_sio kernel module and a /dev/ttyUSB* device). Essentially, in my main loop I run
clock_gettime() using CLOCK_MONOTONIC
select() with a timeout of 8 ms
clock_gettime() as before
Output the time difference of the two clock_gettime() calls
To have some level of "soft" real-time guarantees, this thread runs as SCHED_FIFO with maximum priority (showing up as "RT" in top). It is the only thread in the system running at this priority, no other process has such priorities. My process has one other SCHED_FIFO thread with a lower priority, while everything else is at SCHED_OTHER. The two "real-time" threads are not CPU bound and do very little apart from waiting for I/O and passing on data.
The kernel I am using has no RT_PREEMPT patches (I might switch to that patch in the future). I know that if I want "proper" realtime, I need to switch to RT_PREEMPT or, better, Xenomai or the like. But nevertheless I would like to know what is behind the following timing anomalies on a "vanilla" kernel:
Roughly 0.03% of all select() calls are timed at over 10 ms (remember, the timeout was 8 ms).
The three worst cases (out of over 12 million calls) were 31.7 ms, 46.8 ms and 64.4 ms.
All of the above happened within 20 seconds of each other, and I think some cron job may have been interfering (although the system logs are low on information apart from the fact that cron.daily was being executed at the time).
So, my question is: What factors can be involved in such extreme cases? Is this just something that can happen inside the Linux kernel itself, i.e. would I have to switch to RT_PREEMPT, or even a non-USB interface and Xenomai, to get more reliable guarantees? Could /proc/sys/kernel/sched_rt_runtime_us be biting me? Are there any other factors I may have missed?
Another way to put this question is, what else can I do to reduce these latency anomalies without switching to a "harder" realtime environment?
Update: I have observed a new, "worse worst case" of about 118.4 ms (once over a total of around 25 million select() calls). Even when I am not using a kernel with any sort of realtime extension, I am somewhat worried by the fact that a deadline can apparently be missed by over a tenth of a second.
Without more information it is difficult to point to something specific, so I am just guessing here:
Interrupts and code that is triggered by interrupts take so much time in the kernel that your real time thread is significantly delayed. This depends on the frequency of interrupts, which interrupt handlers are involved, etc.
A thread with lower priority will not be interrupted inside the kernel until it yields the cpu or leaves the kernel.
As pointed out in this SO answer, CPU System Management Interrupts and Thermal Management can also cause significant time delays (up to 300ms were observed by the poster).
118ms seems quite a lot for a 1.6GHz CPU. But one driver that accidently locks the cpu for some time would be enough. If you can, try to disable some drivers or use different driver/hardware combinations.
sched_rt_period_us and sched_rt_period_us should not be a problem if they are set to reasonable values and your code behaves as you expect. Still, I would remove the limit for RT threads and see what happens.
What else can you do? Write a device driver! It's not that difficult and interrupt handlers get a higher priority than realtime threads. It may be easier to switch to a real time kernel but YMMV.

How can I tell if every core on my machine uses the same timer?

I'm trying to write some code to determine if clock_gettime used with CLOCK_MONOTONIC_RAW will give me results coming from the same hardware on different cores.
From what I understand it is possible for each core to produce independent results but not always. I was given the task of obtaining timings on all cores with a precision of 40 nanoseconds.
The reason I'm not using CLOCK_REALTIME is that my program absolutely must not be affected by NTP adjustments.
Edit:
I have found the unsynchronized_tsc function which tries to test whether the TSC is the same on all cores. I am now attempting to find if CLOCK_MONOTONIC_RAW is based on the TSC.
Final edit:
It turns out that CLOCK_MONOTONIC_RAW is always usable on multi-core systems and does not rely on the TSC even on Intel machines.
To do measurements this precisely; you'd need:
code that's executed on all CPUs, that reads the CPU's time stamp counter and stores it as soon as "an event" occurs
some way to create "an event" that is noticed at the same time by all CPUs
some way to prevent timing problems caused by IRQs, task switches, etc.
Various possibilities for the event include:
polling a memory location in a loop, where one CPU writes a new value and other CPUs stop polling when they see the new value
using the local APIC to broadcast an IPI (inter-processor interrupt) to all CPUs
For both of these methods there are delays between the CPUs (especially for larger NUMA systems) - a write to memory (cache) may be visible on the CPU that made the write immediately, and be visible by a CPU on a different physical chip (in a different NUMA domain) later. To avoid this you may need to find the average of initiating the event on all CPUs. E.g. (for 2 CPUs) one CPU initiates and both measure, then the other CPU initiates and both measure, then results are combined to cancel out any "event propagation latency".
To fix other timing problems (IRQs, task switches, etc) I'd want to be doing these tests during boot where nothing else can mess things up. Otherwise you either need to prevent the problems (ensure all CPUs are running at the same speed, disable IRQs, disable thread switches, stop any PCI device bus mastering, etc) or cope with problems (e.g. run the same test many times and see if you get similar results most of the time).
Also note that all of the above can only ensure that the time stamp counters were in sync at the time the test was done, and don't guarantee that they won't become out of sync after the test is done. To ensure the CPUs remain in sync you'd need to rely on the CPU's "monotonic clock" guarantees (but older CPUs don't make that guarantee).
Finally; if you're attempting to do this in user-space (and not in kernel code); then my advice is to design code in a way that isn't so fragile to begin with. Even if the TSCs on different CPUs are guaranteed to be perfectly in sync at all times, you can't prevent an IRQ from interrupting immediately before or immediately after reading the TSC (and there's no way to atomically do something and read TSC at the same time); and therefore if your code requires such precisely synchronised timing then your code's design is probably flawed.

ARM926EJ-S cycle-counter

Im using an ARM926EJ-S and am trying to figure out whether the ARM can give (e.g. a readable register) the CPU's cycle-counter. I guess a # that will represent the number of cycles since the CPU has been powered.
In my system i have only Low-Res external RTC/Timers. I would like to be able to achieve a Hi-Res timer.
Many thanks in advance!
You probably have only two choices:
Use an instruction-cycle accurate simulator; the problem here is that effectively simulating peripherals and external stimulus can be complex or impossible.
Use a peripheral hardware timer. In most cases you will not be able to run such a timer at the typical core clock rate of an ARM9, and there will be an over head in servicing the timer either side of the period being timed, but it can be used to give execution time over larger or longer running sections of code, which may be of more practical use than cycle count.
While cycle count may be somewhat scalable to different clock rates, it remains constrained by memory and I/O wait states, so is perhaps not as useful as it may seem as a performance metric, except at the micro-level of analysis, and larger performance gains are typically to be had by taking a wider view.
The arm-9 is not equipped with an PMU (Performance Monitoring Unit) as included in the Cortex-family. The PMU is described here. The linux kernel comes equipped with support for using the PMU for benchmarking performance. See here for documentation of the perf tool-set.
Bit unsure about the arm-9, need to dig a bit more...

CPU clock frequency and thus QueryPerformanceCounter wrong?

I am using QueryPerformanceCounter to time some code. I was shocked when the code starting reporting times that were clearly wrong. To convert the results of QPC into "real" time you need to divide by the frequency returned from QueryPerformanceFrequency, so the elapsed time is:
Time = (QPC.end - QPC.start)/QPF
After a reboot, the QPF frequency changed from 2.7 GHz to 4.1 GHz. I do not think that the actual hardware frequency changed as the wall clock time of the running program did not change although the time reported using QPC did change (it dropped by 2.7/4.1).
MyComputer->Properties shows:
Intel(R)
Pentium(R)
4 CPU 2.80 GHz; 4.11 GHz;
1.99 GB of RAM; Physical Address Extension
Other than this, the system seems to be working fine.
I will try a reboot to see if the problem clears, but I am concerned that these critical performance counters could become invalid without warning.
Update:
While I appreciate the answers and especially the links, I do not have one of the affected chipsets nor to I have a CPU clock that varies itself. From what I have read, QPC and QPF are based on a timer in the PCI bus and not affected by changes in the CPU clock. The strange thing in my situation is that the FREQUENCY reported by QPF changed to an incorrect value and this changed frequency was also reported in MyComputer -> Properties which I certainly did not write.
A reboot fixed my problem (QPF now reports the correct frequency) but I assume that if you are planning on using QPC/QPF you should validate it against another timer before trusting it.
Apparently there is a known issue with QPC on some chipsets, so you may want to make sure you do not have those chipset. Additionally some dual core AMDs may also cause a problem. See the second post by sebbbi, where he states:
QueryPerformanceCounter() and
QueryPerformanceFrequency() offer a
bit better resolution, but have
different issues. For example in
Windows XP, all AMD Athlon X2 dual
core CPUs return the PC of either of
the cores "randomly" (the PC sometimes
jumps a bit backwards), unless you
specially install AMD dual core driver
package to fix the issue. We haven't
noticed any other dual+ core CPUs
having similar issues (p4 dual, p4 ht,
core2 dual, core2 quad, phenom quad).
From this answer.
You should always expect the core frequency to change on any CPU that supports technology such as SpeedStep or Cool'n'Quiet. Wall time is not affected, it uses a RTC. You should probably stop using the performance counters, unless you can tolerate a few (5-50) millisecond's worth of occasional phase adjustments, and are willing to perform some math in order to perform the said phase adjustment by continuously or periodically re-normalizing your performance counter values based on the reported performance counter frequency and on RTC low-resolution time (you can do this on-demand, or asynchronously from a high-resolution timer, depending on your application's ultimate needs.)
You can try to use the Stopwatch class from .NET, it could help with your problem since it abstracts from all this low-lever stuff.
Use the IsHighResolution property to see whether the timer is based on a high-resolution performance counter.
Note: On a multiprocessor computer, it
does not matter which processor the
thread runs on. However, because of
bugs in the BIOS or the Hardware
Abstraction Layer (HAL), you can get
different timing results on different
processors. To specify processor
affinity for a thread, use the
ProcessThread..::.ProcessorAffinity
method.
Just a shot in the dark.
On my home PC I used to have "AI NOS" or something like that enabled in the BIOS. I suspect this screwed up the QueryPerformanceCounter/QueryPerformanceFrequency APIs because although the system clock ran at the normal rate, and normal apps ran perfectly, all full screen 3D games ran about 10-15% too fast, causing, for example, adjacent lines of dialog in a game to trip on each other.
I'm afraid you can't say "I shouldn't have this problem" when you're using QueryPerformance* - while the documentation states that the value returned by QueryPerformanceFrequency is constant, practical experimentation shows that it really isn't.
However you also don't want to be calling QPF every time you call QPC either. In practice we found that periodically (in our case once a second) calling QPF to get a fresh value kept the timers synchronised well enough for reliable profiling.
As has been pointed out as well, you need to keep all of your QPC calls on a single processor for consistent results. While this might not matter for profiling purposes (because you can just use ProcessorAffinity to lock the thread onto a single CPU), you don't want to do this for timing which is running as part of a proper multi-threaded application (because then you run the risk of locking a hard working thread to a CPU which is busy).
Especially don't arbitrarily lock to CPU 0, because you can guarantee that some other badly coded application has done that too, and then both applications will fight over CPU time on CPU 0 while CPU 1 (or 2 or 3) sit idle. Randomly choose from the set of available CPUs and you have at least a fighting chance that you're not locked to an overloaded CPU.

Resources