I have a small program running on Linux (on an embedded PC, dual-core Intel Atom 1.6GHz with Debian 6 running Linux 2.6.32-5) which communicates with external hardware via an FTDI USB-to-serial converter (using the ftdi_sio kernel module and a /dev/ttyUSB* device). Essentially, in my main loop I run
clock_gettime() using CLOCK_MONOTONIC
select() with a timeout of 8 ms
clock_gettime() as before
Output the time difference of the two clock_gettime() calls
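For reference, the loop described above can be sketched roughly as follows. This is only an illustration in Python, not the original program: the function name and iteration count are made up, and select() is called without any file descriptors so that only the timeout path is exercised.

```python
import select
import time

TIMEOUT_S = 0.008  # the 8 ms select() timeout from the question


def measure_select(iterations):
    """Return how long each select() call took, in seconds."""
    durations = []
    for _ in range(iterations):
        start = time.clock_gettime(time.CLOCK_MONOTONIC)
        # No descriptors are watched, so select() can only return on timeout.
        select.select([], [], [], TIMEOUT_S)
        end = time.clock_gettime(time.CLOCK_MONOTONIC)
        durations.append(end - start)
    return durations
```

Calling measure_select(5) and printing each value in milliseconds gives the per-call figures that the rest of the question talks about.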
To have some level of "soft" real-time guarantees, this thread runs as SCHED_FIFO with maximum priority (showing up as "RT" in top). It is the only thread in the system running at this priority, no other process has such priorities. My process has one other SCHED_FIFO thread with a lower priority, while everything else is at SCHED_OTHER. The two "real-time" threads are not CPU bound and do very little apart from waiting for I/O and passing on data.
The kernel I am using has no RT_PREEMPT patches (I might switch to that patch in the future). I know that if I want "proper" realtime, I need to switch to RT_PREEMPT or, better, Xenomai or the like. But nevertheless I would like to know what is behind the following timing anomalies on a "vanilla" kernel:
Roughly 0.03% of all select() calls are timed at over 10 ms (remember, the timeout was 8 ms).
The three worst cases (out of over 12 million calls) were 31.7 ms, 46.8 ms and 64.4 ms.
All of the above happened within 20 seconds of each other, and I think some cron job may have been interfering (although the system logs are low on information apart from the fact that cron.daily was being executed at the time).
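Figures like the ones above can be derived from the recorded durations with something along these lines. It is a sketch only: the function name, the 10 ms limit as a parameter, and the idea of keeping all samples in a list are assumptions, not a description of the original program.

```python
def summarize(durations_ms, limit_ms=10.0, worst=3):
    """Return (percentage of samples above limit_ms, the worst samples)."""
    late = sum(1 for d in durations_ms if d > limit_ms)
    percent_late = 100.0 * late / len(durations_ms)
    worst_cases = sorted(durations_ms, reverse=True)[:worst]
    return percent_late, worst_cases
```

For 12 million samples it would be cheaper to keep only a running counter and the current top three instead of the whole list.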
So, my question is: What factors can be involved in such extreme cases? Is this just something that can happen inside the Linux kernel itself, i.e. would I have to switch to RT_PREEMPT, or even a non-USB interface and Xenomai, to get more reliable guarantees? Could /proc/sys/kernel/sched_rt_runtime_us be biting me? Are there any other factors I may have missed?
Another way to put this question is, what else can I do to reduce these latency anomalies without switching to a "harder" realtime environment?
Update: I have observed a new, "worse worst case" of about 118.4 ms (once over a total of around 25 million select() calls). Even when I am not using a kernel with any sort of realtime extension, I am somewhat worried by the fact that a deadline can apparently be missed by over a tenth of a second.
Without more information it is difficult to point to something specific, so I am just guessing here:
Interrupts and code that is triggered by interrupts take so much time in the kernel that your real time thread is significantly delayed. This depends on the frequency of interrupts, which interrupt handlers are involved, etc.
A thread with lower priority will not be interrupted inside the kernel until it yields the cpu or leaves the kernel.
As pointed out in this SO answer, CPU System Management Interrupts and Thermal Management can also cause significant time delays (up to 300ms were observed by the poster).
118 ms seems quite a lot for a 1.6 GHz CPU. But one driver that accidentally locks the CPU for some time would be enough. If you can, try to disable some drivers or use different driver/hardware combinations.
sched_rt_runtime_us and sched_rt_period_us should not be a problem if they are set to reasonable values and your code behaves as you expect. Still, I would remove the limit for RT threads and see what happens.
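To see what the RT limit amounts to, the two values under /proc/sys/kernel/ can be turned into a CPU share. The sketch below assumes the usual defaults of 950000 and 1000000 microseconds and the convention that a runtime of -1 disables the limit; the function names are made up for illustration.

```python
def rt_budget(runtime_us, period_us):
    """Fraction of each period that real-time tasks are allowed to run."""
    if runtime_us < 0:
        # A runtime of -1 means RT throttling is switched off.
        return 1.0
    return runtime_us / period_us


def read_rt_budget(proc_dir="/proc/sys/kernel"):
    """Read the current settings from procfs and return the RT share."""
    with open(f"{proc_dir}/sched_rt_runtime_us") as f:
        runtime_us = int(f.read())
    with open(f"{proc_dir}/sched_rt_period_us") as f:
        period_us = int(f.read())
    return rt_budget(runtime_us, period_us)
```

With the defaults this gives 0.95, i.e. RT tasks are only throttled once they have used 950 ms of CPU time within one second, which threads that mostly wait for I/O should never reach.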
What else can you do? Write a device driver! It's not that difficult and interrupt handlers get a higher priority than realtime threads. It may be easier to switch to a real time kernel but YMMV.
I am tracing Linux 0.11
https://mirrors.edge.kernel.org/pub/linux/kernel/Historic/old-versions/
I see there are many schedule() calls in different places, not just the one inside do_timer().
Few questions here:
Is do_timer() (in sched.c) called every time the timer expires? Is this timer based on an x86 interrupt?
Since there are many schedule() calls outside of do_timer(), can I say that is kind of preempting? or what's the purpose?
Any operation that blocks calls schedule() to yield control.
When a task's state has changed, schedule() is called so that the change takes effect.
When a task is running and still has a lot of work left, schedule() is called to balance CPU time between tasks.
Since there are many schedule() calls outside of do_timer(), can I say that is kind of preempting? or what's the purpose?
For a real OS; most task switches occur because a task blocks waiting for something (user input, network packet, disk IO, ..) or a task unblocks because something it was waiting for happened (and the unblocked task has higher priority and preempts the currently running lower priority task).
The whole "task switch caused by timer IRQ" thing is mostly just a fallback to guard against malicious CPU hogs (denial of service attacks); and for normal software under normal conditions you could disable it (delete the schedule() from the timer IRQ handler) and nobody would notice or care. Note: Some people will say it's also for "non-malicious" CPU bound tasks, but CPU bound tasks are relatively rare, and (ignoring the fact that the Linux scheduler has never been good for task priorities) for CPU bound tasks it's better to rely on an effective system of task priorities (e.g. give the CPU bound tasks a low priority so that almost everything will preempt them).
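To make the distinction concrete, here is a toy model (not Linux 0.11 code, and every name in it is made up). Each task is a list of CPU bursts measured in ticks; a task that finishes a burst within its time slice blocks and gives up the CPU itself, and only a burst longer than the slice is cut short by the timer.

```python
from collections import deque


def simulate(bursts_by_task, slice_ticks):
    """Count task switches caused by blocking vs. by the timer tick."""
    ready = deque(
        (name, deque(bursts)) for name, bursts in bursts_by_task.items() if bursts
    )
    switches = {"blocked": 0, "timer": 0}
    while ready:
        name, bursts = ready.popleft()
        remaining = bursts.popleft()
        if remaining > slice_ticks:
            # Slice used up in the middle of a burst: the timer forces the switch.
            bursts.appendleft(remaining - slice_ticks)
            switches["timer"] += 1
        else:
            # Burst finished: the task blocks (or exits) and yields by itself.
            switches["blocked"] += 1
        if bursts:
            ready.append((name, bursts))
    return switches
```

With short bursts such as {"shell": [2, 3, 1], "editor": [4, 2]} and a slice of 10 ticks, every switch is of the "blocked" kind; the timer only shows up for something like {"hog": [25]}.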
Also note that various courses on OS theory start with "so simple it never actually happens in practice" concepts, which is almost always a pure round-robin scheduler with tasks that never block (often with "Hey, we can accurately predict the future and know exactly how long each task will run for" nonsense), which is mostly fine as a first step (in a "learn to walk before you run" way) but sucks big salty dog balls if it's not followed by more realistic and more complex concepts (better scheduling algorithms, task priorities, multiple simultaneous scheduling algorithms/"scheduler policies", multi-CPU, interactive/latency sensitive tasks, ..) because it leaves the student/victim with little more than misinformation (e.g. the ever re-occurring "all tasks switches are caused by timer IRQ" misconception).
Is do_timer() (in sched.c) called every time the timer expires? Is this timer based on an x86 interrupt?
I'm guessing that the timer was the raw PIT chip's IRQ (given that Linux version 0.11 was "absolute beginner developer with no intention of making it portable" historical memorabilia from before thousands of volunteers fixed half of the worst parts).
Also don't forget that the scheduler uses time for two different things - the "current task has used too much CPU time" thing that almost never matters, and figuring out when tasks that are blocked/sleeping (e.g. because they called sleep()) should unblock/wake up. The do_timer() might be for either of these things and might be for both (I don't know without looking at it).
Is there any way to tell the kernel that I don't need the full CPU power?
Basically, I want to do some calculation while waiting for another process. But I don't need the full CPU power for that. As the CPU load during the computation is still 100%, the frequency is high. I want to tell the kernel that I am satisfied with a lower CPU frequency in order to save energy.
Instead of calculating using the full frequency and then suspend to wait for the other process, I want to try calculating with lower frequency so that the CPU is not in a lower C state when the other process has finished and the frequency can scale back again.
This doesn't make any sense on a multi-process system, particularly not in Linux. The CPU frequency is a very basic parameter, which affects everything running on the computer - including other processes and the OS itself.
If your program would fiddle with the CPU frequency, it would not only dictate the priority of itself, but also the priority of everything in the computer, including the OS. This isn't possible to do on a desktop system, simply because it doesn't make any sense to have a single application process dictate things that not even the OS dares to meddle with.
If saving power is a priority, you should probably look for completely different alternatives than some desktop Linux solution. PC computers only care about 1) speed, 2) speed and also 3) speed.
The kind of things you ask for are common in real-time embedded systems, where CPUs have a "sleep mode", from which it can wake up to execute something, then go back to sleep. It would usually also be possible for such systems to fiddle with the internal PLL to adjust their own frequency, but such solutions are rare. The industry standard way of doing things is to perform all calculations at max speed, then revert to power-saving sleep mode.
With multiple cores, there is a way to bind an interrupt to a specific CPU core. In this way you can reserve a core for a particular task:
To find the IRQ number of the task, use:
cat /proc/interrupts
Look for your IRQ number.
Let's say the IRQ number is 99. The value written to smp_affinity is a hexadecimal bitmask of the allowed CPUs, so to make the second core (CPU1, mask 2) handle this IRQ, do:
echo 2 > /proc/irq/99/smp_affinity
In this way you can dedicate a core to handling the interrupt for your special process.
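Since the value written to smp_affinity is a hexadecimal bitmask (bit 0 is CPU0, bit 1 is CPU1, and so on), the number is easy to get wrong. A small helper along these lines computes it; the function name is made up and the sketch assumes a machine with at most 32 CPUs.

```python
def smp_affinity_mask(cpus):
    """Return the hex string that allows exactly the given CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu  # set the bit that corresponds to this CPU
    return format(mask, "x")
```

So smp_affinity_mask([1]) gives "2" (the second core), while the third core, CPU2, would need "4".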
You can actually use nice() to tell the kernel that your process can live with a lower-than-normal scheduling priority. This effectively reduces the amount of time slices your process will get to use the CPU (typically in favor of other processes running at the same time).
On some more modern systems, if this will reduce the overall CPU load significantly, the CPU might eventually even decide to run on lower frequency. But you typically don't have direct influence on that decision.
Note: Depending on the system, you might have problems restoring the original nice value (i.e. scaling your priority back up) without running with appropriate permissions.
In case your application is I/O bound and is not doing stupid things that waste CPU cycles, like busy-waiting, it shouldn't be necessary to resort to lowering your priority with nice(). Modern CPUs and operating systems should be able to detect by themselves when the system is mainly idle and step down autonomously.
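As a rough illustration of the nice() approach, here it is in Python, whose os.nice() wraps the same system call. The helper name and the default increment are arbitrary choices for this sketch.

```python
import os


def lower_priority(increment=5):
    """Raise this process's nice value by increment and return the new value."""
    # os.nice() adds the increment to the current nice value and returns
    # the result; os.nice(0) just reports the current value.
    return os.nice(increment)
```

As mentioned in the note above, going back down again (a negative increment) normally fails for an unprivileged process.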
Modify scaling_governor accordingly.
The "scaling_governor" setting selects the policy that chooses the CPU frequency; some governors hold the CPU at a static frequency.
The frequency must be between scaling_min_freq and scaling_max_freq.
When the CPU frequency governor is set to "powersave", the CPU is held at the lowest frequency (within the bounds of scaling_min_freq and scaling_max_freq).
Check the available governors on the target:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_governors
and select the required scaling governor by writing its name (as root), for example:
echo powersave | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
To tune performance, a few of these files can be updated to change the CPU frequency and the scheduler policies.
Which modifications to adopt depends on your performance analysis and load balancing.
Check /sys/devices/system/cpu/
Example:
root:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_available_governors
interactive performance
root#:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_governor
interactive
root#:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_available_frequencies
400000 800000 998400 1094400 1190400 1248000 1305600
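The same files can also be read from a program. The following is a minimal sketch, assuming the cpufreq directory layout shown in the example; the function name and the returned dictionary keys are made up. The frequencies in these files are given in kHz.

```python
from pathlib import Path


def read_cpufreq(cpufreq_dir="/sys/devices/system/cpu/cpu0/cpufreq"):
    """Return the current governor and the available governors/frequencies."""
    base = Path(cpufreq_dir)

    def read(name):
        # Each file holds one line of whitespace-separated values.
        return (base / name).read_text().split()

    return {
        "governor": read("scaling_governor")[0],
        "available_governors": read("scaling_available_governors"),
        "available_frequencies_khz": [
            int(f) for f in read("scaling_available_frequencies")
        ],
    }
```

On the target from the example this would report "interactive" as the governor and 1305600 kHz as the highest available frequency.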
I'm trying to write some code to determine if clock_gettime used with CLOCK_MONOTONIC_RAW will give me results coming from the same hardware on different cores.
From what I understand it is possible for each core to produce independent results but not always. I was given the task of obtaining timings on all cores with a precision of 40 nanoseconds.
The reason I'm not using CLOCK_REALTIME is that my program absolutely must not be affected by NTP adjustments.
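As a first sanity check, the clock's reported resolution and its monotonicity can be probed from user space. The sketch below uses Python's wrappers around clock_getres() and clock_gettime(); the function names are made up, and it does not pin the thread to any core, so it is only a rough indicator. A reported resolution at or below 40 ns also says nothing about the actual accuracy across cores.

```python
import time

REQUIRED_NS = 40  # precision target mentioned in the question


def raw_clock_resolution_ns():
    """Resolution the kernel reports for CLOCK_MONOTONIC_RAW, in nanoseconds."""
    return time.clock_getres(time.CLOCK_MONOTONIC_RAW) * 1e9


def raw_clock_is_monotonic(samples=1000):
    """Return False if any reading is smaller than the previous one."""
    previous = time.clock_gettime_ns(time.CLOCK_MONOTONIC_RAW)
    for _ in range(samples):
        current = time.clock_gettime_ns(time.CLOCK_MONOTONIC_RAW)
        if current < previous:
            return False
        previous = current
    return True
```

If the scheduler happens to migrate the thread between cores during the loop, a backwards step would hint at unsynchronised clock sources.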
Edit:
I have found the unsynchronized_tsc function which tries to test whether the TSC is the same on all cores. I am now attempting to find if CLOCK_MONOTONIC_RAW is based on the TSC.
Final edit:
It turns out that CLOCK_MONOTONIC_RAW is always usable on multi-core systems and does not rely on the TSC even on Intel machines.
To do measurements this precisely; you'd need:
code that's executed on all CPUs, that reads the CPU's time stamp counter and stores it as soon as "an event" occurs
some way to create "an event" that is noticed at the same time by all CPUs
some way to prevent timing problems caused by IRQs, task switches, etc.
Various possibilities for the event include:
polling a memory location in a loop, where one CPU writes a new value and other CPUs stop polling when they see the new value
using the local APIC to broadcast an IPI (inter-processor interrupt) to all CPUs
For both of these methods there are delays between the CPUs (especially for larger NUMA systems) - a write to memory (cache) may be visible on the CPU that made the write immediately, and be visible by a CPU on a different physical chip (in a different NUMA domain) later. To avoid this you may need to find the average of initiating the event on all CPUs. E.g. (for 2 CPUs) one CPU initiates and both measure, then the other CPU initiates and both measure, then results are combined to cancel out any "event propagation latency".
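The cancelling-out step for two CPUs can be written down as follows. This is just the arithmetic, under the assumption that the propagation latency is the same in both directions; the function and parameter names are made up. Each input is "timestamp taken on B minus timestamp taken on A" for one event.

```python
def estimate_offset(diff_a_initiates, diff_b_initiates):
    """Return (counter offset of B relative to A, one-way event latency).

    When A initiates, B sees the event late: diff = offset + latency.
    When B initiates, A sees the event late: diff = offset - latency.
    """
    offset = (diff_a_initiates + diff_b_initiates) / 2
    latency = (diff_a_initiates - diff_b_initiates) / 2
    return offset, latency
```

For example, measured differences of 130 and 70 ticks would mean B's counter is 100 ticks ahead of A's and the event takes 30 ticks to propagate.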
To fix other timing problems (IRQs, task switches, etc) I'd want to be doing these tests during boot where nothing else can mess things up. Otherwise you either need to prevent the problems (ensure all CPUs are running at the same speed, disable IRQs, disable thread switches, stop any PCI device bus mastering, etc) or cope with problems (e.g. run the same test many times and see if you get similar results most of the time).
Also note that all of the above can only ensure that the time stamp counters were in sync at the time the test was done, and don't guarantee that they won't become out of sync after the test is done. To ensure the CPUs remain in sync you'd need to rely on the CPU's "monotonic clock" guarantees (but older CPUs don't make that guarantee).
Finally; if you're attempting to do this in user-space (and not in kernel code); then my advice is to design code in a way that isn't so fragile to begin with. Even if the TSCs on different CPUs are guaranteed to be perfectly in sync at all times, you can't prevent an IRQ from interrupting immediately before or immediately after reading the TSC (and there's no way to atomically do something and read TSC at the same time); and therefore if your code requires such precisely synchronised timing then your code's design is probably flawed.
While doing SMP porting of some of our drivers (on
powerpc target) we observed some behavior on which I need you guys to
shed some light:
On doing a local_irq_disable() on a UP system the jiffies tend to
freeze i.e. the count stops incrementing. Is this expected? I thought
that the decrementer interrupt is 'internal' and should not get
affected by the local_irq_disable() kind of call since I expected it to
disable local IRQ interrupt processing (external interrupt). The
system of course freezes then also upon doing a local_irq_enable() the
jiffies count jumps and it seems to be compensating for the 'time
lapse' between the local_irq_disable() and enable() call.
Doing the same on an SMP system (P2020 with 2 e500 cores) the
results are surprising. Firstly the module that is being inserted to
do this testing always executes on core 1. Further it sometimes does
not see a freeze of 'jiffies' counter and sometimes we see that it
indeed freezes. Again in case of a freeze of count it tends to jump
after doing a local_irq_enable(). I have no idea why this may be
happening.
Do we know in case of an SMP do both cores run a schedule timer, so
that in some cases we do not see a freeze of jiffies counts or is it
just on core 0 ?
Also since the kernel timers rely on 'jiffies' -- this would mean that
none of our kernel timers will fire if local_irq_disable() has been
done? What would be the case this is done on one of the cores in an
SMP system?
There are many other questions, but I guess these will be enough to
begin on a general discussion about the same :)
TIA
NS
Some more comments from the experimentation done.
My understanding at this point is that since kernel timers depend on 'jiffies' to fire, they won't actually fire on a UP system when I issue a local_irq_save(). In fact, some of our code is based on the assumption that issuing local_irq_save() guarantees protection against interrupts on the local processor and against kernel timers as well.
However, carrying out the same experiment on an SMP system, even with both cores executing local_irq_save(), the jiffies do NOT stop incrementing and the system doesn't freeze. How is this possible? Is Linux using some other mechanism to trigger timer interrupts on the SMP system, possibly IPIs? This also breaks our assumption that local_irq_disable() will protect the system against kernel timers running on the same core at least.
How do we go about writing code that is safe against async events, i.e. interrupts and kernel timers, and is valid for both UP and SMP?
local_irq_disable only disables interrupts on the current core, so, when you're single core, everything is disabled (including timer interrupts) and that is why jiffies are not updated.
When running on SMP, sometimes you happen to disable the interrupts on the core that's updating the jiffies, sometimes not.
This usually is not a problem, because interrupts are supposed to be disabled only for very short periods, and all scheduled timers will fire after interrupts are enabled again.
How do you know that your module always runs on core 1? On current versions of the kernel, it may even be running on more than one core at the same time (unless you explicitly prevented that).
There are several facets to this problem. Let's take them one by one.
1. a) On x86, local_irq_save() simply clears the IF flag of the eflags register. IRQ handlers can still run concurrently on the other cores.
global_irq_save() is not available because it would require interprocessor communication to implement, and it is not really needed anyway since local IRQ disabling is intended for very short periods of time only.
b) Modern APICs allow dynamic IRQ distribution among the available cores and, apart from rare exceptions, the kernel essentially programs the necessary registers to obtain a round-robin distribution of the IRQs.
The consequence is that if IRQs are disabled locally for long enough, and the APIC delivers an IRQ to the core that has them disabled, the system will globally stop receiving that particular IRQ until IRQs are finally re-enabled on the core that received the last IRQ of that type.
2. As for the differing results for jiffies updates with IRQs disabled, it depends on the selected clocksource.
You can find out which one is chosen by consulting:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
If you have tsc as the clocksource, then all cores have it locally. However, if your clocksource is something else, e.g. HPET, an external device, then jiffies will freeze for the reasons described in point 1.
I am using QueryPerformanceCounter to time some code. I was shocked when the code started reporting times that were clearly wrong. To convert the results of QPC into "real" time you need to divide by the frequency returned from QueryPerformanceFrequency, so the elapsed time is:
Time = (QPC.end - QPC.start)/QPF
After a reboot, the QPF frequency changed from 2.7 GHz to 4.1 GHz. I do not think that the actual hardware frequency changed as the wall clock time of the running program did not change although the time reported using QPC did change (it dropped by 2.7/4.1).
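The effect is easy to reproduce with the formula alone. In the sketch below the function name and the tick counts are illustrative; only the two frequencies come from the observation above.

```python
def elapsed_seconds(start_ticks, end_ticks, frequency_hz):
    """Time = (QPC.end - QPC.start) / QPF."""
    return (end_ticks - start_ticks) / frequency_hz


# A counter that really ticks at 2.7 GHz advances this much in 10 seconds.
ticks = 27_000_000_000
correct = elapsed_seconds(0, ticks, 2_700_000_000)  # 10.0 s
wrong = elapsed_seconds(0, ticks, 4_100_000_000)    # about 6.59 s
```

Dividing the same tick count by the inflated frequency shrinks the reported time by exactly the factor 2.7/4.1, which matches what was observed.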
MyComputer->Properties shows:
Intel(R) Pentium(R) 4 CPU 2.80 GHz
4.11 GHz; 1.99 GB of RAM
Physical Address Extension
Other than this, the system seems to be working fine.
I will try a reboot to see if the problem clears, but I am concerned that these critical performance counters could become invalid without warning.
Update:
While I appreciate the answers and especially the links, I do not have one of the affected chipsets, nor do I have a CPU clock that varies itself. From what I have read, QPC and QPF are based on a timer in the PCI bus and not affected by changes in the CPU clock. The strange thing in my situation is that the FREQUENCY reported by QPF changed to an incorrect value and this changed frequency was also reported in MyComputer -> Properties which I certainly did not write.
A reboot fixed my problem (QPF now reports the correct frequency) but I assume that if you are planning on using QPC/QPF you should validate it against another timer before trusting it.
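Such a validation could look roughly like this. It is a sketch, not a tested Windows recipe: the function name, the interval and the 5% tolerance are arbitrary, and it relies on Python's time.perf_counter(), which as far as I know is backed by QueryPerformanceCounter on Windows.

```python
import time


def clocks_agree(counter, reference, interval_s=0.1, tolerance=0.05):
    """Compare elapsed time from two clock functions over a short interval."""
    c0, r0 = counter(), reference()
    time.sleep(interval_s)
    c1, r1 = counter(), reference()
    ratio = (c1 - c0) / (r1 - r0)
    # A ratio far from 1.0 means one of the clocks is scaled wrongly.
    return abs(ratio - 1.0) <= tolerance
```

For example, clocks_agree(time.perf_counter, time.monotonic) should normally return True, while a counter scaled by 2.7/4.1 as in the problem above would fail the check. A coarse reference clock needs a longer interval to give a meaningful ratio.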
Apparently there is a known issue with QPC on some chipsets, so you may want to make sure you do not have one of those chipsets. Additionally, some dual-core AMDs may also cause a problem. See the second post by sebbbi, where he states:
QueryPerformanceCounter() and QueryPerformanceFrequency() offer a bit better resolution, but have different issues. For example in Windows XP, all AMD Athlon X2 dual core CPUs return the PC of either of the cores "randomly" (the PC sometimes jumps a bit backwards), unless you specially install AMD dual core driver package to fix the issue. We haven't noticed any other dual+ core CPUs having similar issues (p4 dual, p4 ht, core2 dual, core2 quad, phenom quad).
From this answer.
You should always expect the core frequency to change on any CPU that supports technology such as SpeedStep or Cool'n'Quiet. Wall time is not affected, it uses a RTC. You should probably stop using the performance counters, unless you can tolerate a few (5-50) millisecond's worth of occasional phase adjustments, and are willing to perform some math in order to perform the said phase adjustment by continuously or periodically re-normalizing your performance counter values based on the reported performance counter frequency and on RTC low-resolution time (you can do this on-demand, or asynchronously from a high-resolution timer, depending on your application's ultimate needs.)
You can try to use the Stopwatch class from .NET; it could help with your problem since it abstracts away all this low-level stuff.
Use the IsHighResolution property to see whether the timer is based on a high-resolution performance counter.
Note: On a multiprocessor computer, it does not matter which processor the thread runs on. However, because of bugs in the BIOS or the Hardware Abstraction Layer (HAL), you can get different timing results on different processors. To specify processor affinity for a thread, use the ProcessThread.ProcessorAffinity method.
Just a shot in the dark.
On my home PC I used to have "AI NOS" or something like that enabled in the BIOS. I suspect this screwed up the QueryPerformanceCounter/QueryPerformanceFrequency APIs because although the system clock ran at the normal rate, and normal apps ran perfectly, all full screen 3D games ran about 10-15% too fast, causing, for example, adjacent lines of dialog in a game to trip on each other.
I'm afraid you can't say "I shouldn't have this problem" when you're using QueryPerformance* - while the documentation states that the value returned by QueryPerformanceFrequency is constant, practical experimentation shows that it really isn't.
However you also don't want to be calling QPF every time you call QPC either. In practice we found that periodically (in our case once a second) calling QPF to get a fresh value kept the timers synchronised well enough for reliable profiling.
As has been pointed out as well, you need to keep all of your QPC calls on a single processor for consistent results. While this might not matter for profiling purposes (because you can just use ProcessorAffinity to lock the thread onto a single CPU), you don't want to do this for timing which is running as part of a proper multi-threaded application (because then you run the risk of locking a hard working thread to a CPU which is busy).
Especially don't arbitrarily lock to CPU 0, because you can guarantee that some other badly coded application has done that too, and then both applications will fight over CPU time on CPU 0 while CPU 1 (or 2 or 3) sit idle. Randomly choose from the set of available CPUs and you have at least a fighting chance that you're not locked to an overloaded CPU.
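The "pick a random CPU" idea can be sketched as follows. The example uses Python, and the pinning part uses the Linux-only os.sched_getaffinity()/os.sched_setaffinity() calls purely as a stand-in for whatever affinity API your platform offers; both function names are made up.

```python
import os
import random


def pick_random_cpu(allowed_cpus, rng=random):
    """Choose one CPU at random from the set this process may run on."""
    return rng.choice(sorted(allowed_cpus))


def pin_to_random_cpu():
    """Pin the calling process to a randomly chosen allowed CPU (Linux only)."""
    cpu = pick_random_cpu(os.sched_getaffinity(0))
    os.sched_setaffinity(0, {cpu})
    return cpu
```

Choosing only from the set the process is already allowed to run on keeps any restriction that an administrator or parent process has put in place.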