While doing SMP porting of some of our drivers (on
powerpc target) we observed some behavior on which I need you guys to
shed some light:
On doing a local_irq_disable() on a UP system, the jiffies tend to
freeze, i.e. the count stops incrementing. Is this expected? I thought
that the decrementer interrupt is 'internal' and should not be
affected by a local_irq_disable() kind of call, since I expected it to
disable only local IRQ interrupt processing (external interrupts). The
system of course freezes then; also, upon doing a local_irq_enable(), the
jiffies count jumps, and it seems to be compensating for the 'time
lapse' between the local_irq_disable() and enable() calls.
Doing the same on an SMP system (a P2020 with two e500 cores), the
results are surprising. Firstly, the module being inserted to
do this testing always executes on core 1. Further, it sometimes does
not see a freeze of the 'jiffies' counter, and sometimes we see that it
indeed freezes. Again, when the count does freeze, it tends to jump
after doing a local_irq_enable(). I have no idea why this may be
happening.
Do we know whether, in the SMP case, both cores run a scheduler timer (so
that in some cases we do not see a freeze of the jiffies count), or is it
just core 0?
Also, since the kernel timers rely on 'jiffies', would this mean that
none of our kernel timers will fire if local_irq_disable() has been
done? What would happen if this is done on one of the cores of an
SMP system?
There are many other questions, but I guess these will be enough to
begin a general discussion about the same :)
TIA
NS
Some more comments from the experimentation done.
My understanding at this point in time is that since kernel timers depend on 'jiffies' to fire, they won't actually fire on a UP system when I issue a local_irq_save(). In fact, some of our code is based on the assumption that issuing a local_irq_save() guarantees protection against interrupts on the local processor, and against kernel timers as well.
However, carrying out the same experiment on an SMP system, even with both cores executing a local_irq_save(), the jiffies do NOT stop incrementing and the system doesn't freeze. How is this possible? Is Linux using some other mechanism to trigger timer interrupts on an SMP system, possibly IPIs? This also breaks our assumption that local_irq_disable() will protect against kernel timers running on the same core, at least.
How do we go about writing code that is safe against async events, i.e. interrupts and kernel timers, and is valid for both UP and SMP?
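For reference, the experiment described above boils down to something like the following hypothetical test module (the names and the 100 ms delay are made up for illustration; mdelay() is a calibrated busy loop, so it still runs with local IRQs masked):

    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/jiffies.h>
    #include <linux/delay.h>
    #include <linux/irqflags.h>

    static int __init jiffies_freeze_test_init(void)
    {
        unsigned long flags, before, during, after;

        before = jiffies;
        local_irq_save(flags);
        mdelay(100);                    /* ~100 ms busy wait with local IRQs masked */
        during = jiffies;               /* on UP this has not advanced */
        local_irq_restore(flags);
        after = jiffies;                /* may have jumped to catch up */

        pr_info("jiffies: before=%lu during=%lu after=%lu\n",
                before, during, after);
        return 0;
    }

    static void __exit jiffies_freeze_test_exit(void) { }

    module_init(jiffies_freeze_test_init);
    module_exit(jiffies_freeze_test_exit);
    MODULE_LICENSE("GPL");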
local_irq_disable() only disables interrupts on the current core, so when you're on a single core, everything is disabled (including the timer interrupt), and that is why jiffies are not updated.
When running on SMP, sometimes you happen to disable the interrupts on the core that's updating the jiffies, sometimes not.
This usually is not a problem, because interrupts are supposed to be disabled only for very short periods, and all scheduled timers will fire after interrupts get enabled again.
How do you know that your module always runs on core 1? On current versions of the kernel, it may even be running on more than one core at the same time (that is, unless you forced it not to).
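As for the last question above (code that is safe against interrupts and kernel timers on both UP and SMP): the usual approach is not to rely on local_irq_disable() alone, but to protect the shared data with a spinlock taken via spin_lock_irqsave() in every context that touches it, including the timer callback. A minimal sketch with made-up driver-private names (timer setup omitted, since that part of the API differs between kernel versions):

    #include <linux/spinlock.h>
    #include <linux/timer.h>
    #include <linux/jiffies.h>

    struct my_dev {
        spinlock_t lock;
        unsigned long events;
        struct timer_list timer;
    };

    static struct my_dev mydev;             /* hypothetical; initialisation omitted */

    /* Timer callback (signature/setup details vary by kernel version). */
    static void my_timer_fn(struct timer_list *t)
    {
        unsigned long flags;

        spin_lock_irqsave(&mydev.lock, flags);      /* safe on UP and SMP */
        mydev.events++;
        spin_unlock_irqrestore(&mydev.lock, flags);

        mod_timer(&mydev.timer, jiffies + HZ);      /* re-arm in one second */
    }

    /* Process-context code touching the same data takes the same lock. */
    static void my_reset_events(struct my_dev *dev)
    {
        unsigned long flags;

        spin_lock_irqsave(&dev->lock, flags);
        dev->events = 0;
        spin_unlock_irqrestore(&dev->lock, flags);
    }

On UP the irqsave variant degenerates to disabling local interrupts, and on SMP it additionally takes the lock, so the same code is correct in both configurations.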
There are several facets to this problem. Let's take them one by one.
1.
a)
local_irq_save() simply clears the IF flag of the EFLAGS register (on x86); IRQ handlers can still run concurrently on the other cores.
global_irq_save() is not available because it would require interprocessor communication to implement, and it is not really needed anyway, since local IRQ disabling is intended for very short periods of time only.
b)
Modern APICs allow dynamic IRQ distribution among the present cores, and apart from rare exceptions, the kernel essentially programs the necessary registers to obtain a round-robin distribution of the IRQs.
The consequence is that if IRQs are disabled locally for long enough and the APIC delivers an IRQ to the core that has them disabled, the system will globally stop receiving that particular IRQ, up to the point where IRQs are finally re-enabled locally on the core that received the last IRQ of that type.
2.
Concerning the different results regarding jiffies updates and IRQ disabling: it depends on the selected clocksource.
You can figure out which one is chosen by consulting:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
If you have the TSC as clocksource, then all cores have it locally. However, if your clocksource is something else, e.g. the HPET (an external device), then jiffies will become frozen for the reasons described in point #1.
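If you want to check the clocksource from code rather than from the shell, a small user-space sketch reading the same sysfs file could look like this:

    #include <stdio.h>

    int main(void)
    {
        char name[64];
        FILE *f = fopen("/sys/devices/system/clocksource/"
                        "clocksource0/current_clocksource", "r");

        if (f && fgets(name, sizeof(name), f))
            printf("current clocksource: %s", name);
        if (f)
            fclose(f);
        return 0;
    }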
Related
When I am executing a system call, to do a write or something else, the ISR corresponding to the exception executes in interrupt mode (on Cortex-M3 the IPSR register holds a non-zero value, 0xb). And what I have learned is that when we execute code in interrupt mode we cannot sleep and cannot use functions that might block...
My question is: is there any kind of mechanism by which the ISR could still execute in interrupt mode and at the same time use functions that might block, or is there some kind of trick implemented?
Caveat: This is more of a comment than an answer but is too big to fit in a comment or series of comments.
TL;DR: Needing to sleep or execute a blocking operation from an ISR is a fundamental misdesign. This seems like an XY problem: https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
Doing sleep [even in application code] is generally a code smell. Why do you feel you need to sleep [vs. some event/completion driven mechanism]?
Please edit your question and add clarification [i.e. don't just add comments]
When I am executing a system call to do write or something else
What is your application doing? A write to what device? What "something else"?
What is the architecture/board, kernel/distro? (e.g. Raspberry Pi running Raspian? nvidia Jetson? Beaglebone? Xilinx FPGA with petalinux?)
What is the H/W device and where did the device driver come from? Did you write the device driver yourself or is it a standard one that comes with the kernel/distro? If you wrote it, please post it in your question.
Is the device configured properly? (e.g.) Are the DTB entries correct?
Is the device a block device, such as a disk controller? Or, is it a character device, such as a UART? Does the device transfer data via DMA? Or, does it transfer data by reading/writing to/from an IO port?
What do you mean by "exception"? Generally, exception is an abnormal condition (e.g. segfault, bus error, etc.). Please describe the exact context/scenario for which this occurs.
Generally, an ISR does little things. (e.g.) Grab [and save] status from the device. Clear/rearm the interrupt in the interrupt controller. Start the next queued transfer request. Wake up the sleeping base level task (usually the task that executed the syscall [waiting on a completion event in kernel mode]).
More elaborate actions are generally deferred and handled in the interrupt's "bottom half" handler and/or tasklet. Or, the base level is woken up and it handles the remaining processing.
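To make the "keep the ISR small and defer the rest" pattern concrete, here is a hedged sketch (the device, its register offsets, and the structure names are all made up): the handler grabs status, acks the device, defers the heavy lifting to a workqueue, and wakes a sleeping syscall via a completion.

    #include <linux/kernel.h>
    #include <linux/interrupt.h>
    #include <linux/workqueue.h>
    #include <linux/completion.h>
    #include <linux/io.h>

    struct my_dev {
        void __iomem *regs;
        u32 last_status;
        struct work_struct work;
        struct completion done;
    };

    static irqreturn_t my_isr(int irq, void *cookie)
    {
        struct my_dev *dev = cookie;
        u32 status = readl(dev->regs + 0x04);   /* grab status (made-up offset) */

        if (!status)
            return IRQ_NONE;                    /* not ours (shared line) */

        writel(status, dev->regs + 0x08);       /* clear/rearm in the device */
        dev->last_status = status;

        schedule_work(&dev->work);              /* defer the heavy processing */
        complete(&dev->done);                   /* wake the sleeping base level */
        return IRQ_HANDLED;
    }

    static void my_work_fn(struct work_struct *w)
    {
        struct my_dev *dev = container_of(w, struct my_dev, work);

        /* Long-running processing happens here, in process context,
         * where sleeping and blocking calls are allowed. */
        (void)dev;
    }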
What kernel subsystems are involved? Are you using platform drivers? Are you interfacing from within the DMA device driver framework? Are message buses involved (e.g. I2C, SPI, etc.)?
Interrupt and device handling in the linux kernel is somewhat different than what one might do in a "bare metal" system or RTOS (e.g. FreeRTOS). So, if you're coming from those environments, you'll need to think about restructuring your driver code [and/or application code].
What are your requirements for device throughput and latency?
You may wish to consult a good book on linux device driver design. And, you may wish to consult the kernel's Documentation subdirectory.
If you're able to provide more information, I may be able to help you further.
UPDATE:
A system call is not really in the same class as a hardware interrupt as far as the kernel is concerned, even if the CPU hardware uses the same sort of exception vector mechanisms for handling both hardware and software interrupts. The kernel treats the system call as a transition from user mode to kernel mode. – Ian Abbott
This is a succinct/great explanation. The "mode" or "context" has little to do with how we got/get there from a H/W mechanism.
The CPU doesn't really "understand" interrupt mode [as defined by the kernel]. It understands "supervisor" vs "user" privilege level [sometimes called "mode"].
When executing at user privilege level, an interrupt/exception will force a transition from "user" level to "supervisor" level. The CPU may have a special register that specifies the address of the [initial] supervisor stack pointer. Atomically, it swaps in that value, pushing the user SP onto the new kernel stack.
If the interrupt is interrupting a CPU that is already at supervisor level, the existing [supervisor] SP will be used unchanged.
Note that x86 has privilege "ring" levels. User mode is ring 3 and the highest [most privileged] level is ring 0. For arm, some arches can have a "hypervisor" privilege level [which is higher privilege than "supervisor" privilege].
The setup of the mode/context is handled in arch/arm/kernel/entry-*.S code.
An svc is a synchronous interrupt [generated by a special CPU instruction]. The resulting context is the context of the currently executing thread. It is analogous to "call function in kernel mode". The resulting context is "kernel thread mode". At that point, it's not terribly useful to think of it as an "interrupt" anymore.
In fact, on some arches, the syscall instruction/mechanism doesn't use the interrupt vector table. It may have a fixed address or use a "call gate" mechanism (e.g. x86).
Each thread has its own stack which is different than the initial/interrupt stack.
Thus, once the entry code has established the context/mode, it is not executing in "interrupt mode". So, the full range of kernel functions is available to it.
An interrupt from a H/W device is asynchronous [may occur at any time the CPU's internal interrupt enable flag is set]. It may interrupt a userspace application [executing in application mode] OR kernel thread mode OR an existing ISR executing in interrupt mode [from another interrupt]. The resulting ISR is executing in "interrupt mode" or "ISR mode".
Because the ISR can interrupt a kernel thread, it may not do certain things. For example, if the CPU were in [kernel] thread mode, and it was in the middle of a kmalloc call [GFP_KERNEL], the ISR would see partial state and any action that tried to adjust the kernel's heap would result in corruption of the heap.
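As a small illustration of that constraint, allocations made from interrupt context must use GFP_ATOMIC (which never sleeps), whereas process/kernel-thread context can use GFP_KERNEL:

    #include <linux/slab.h>

    /* Process context (kernel thread mode): the allocator is allowed to sleep. */
    static void *alloc_in_process_context(void)
    {
        return kmalloc(256, GFP_KERNEL);
    }

    /* Interrupt context: must not sleep, so ask for an atomic allocation. */
    static void *alloc_in_irq_context(void)
    {
        return kmalloc(256, GFP_ATOMIC);
    }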
This is a design choice by linux for speed.
Kernel ISRs may be classified as "fast interrupts". The ISR executes with the CPU interrupt enable [IE] flag cleared. No normal H/W interrupt may interrupt the ISR.
So, if another H/W device asserts its interrupt request line [in the external interrupt controller], that request will be "pending". That is, the request line has been asserted but the CPU has not acknowledged it [and the CPU has not jumped via the interrupt table].
The request will remain pending until the CPU allows further interrupts by asserting IE. Or, the CPU may clear the pending interrupt without taking action by clearing the pending interrupt in the interrupt controller.
For a "slow" interrupt ISR, the interrupt entry code will clear the interrupt in the external interrupt controller. It will then rearm interrupts by setting IE and call the ISR. This ISR can be interrupted by other [higher priority] interrupts. The result is a "stacked" interrupt.
I have been searching all over the place, and I have come to the conclusion that interrupts have a higher priority than some exceptions in the Linux kernel.
An exception [synchronous interrupt] can be interrupted if the IE flag is enabled. An svc is treated differently but after the entry code is executed, the IE flag is set, so the actual syscall code [executing in kernel thread mode] can be interrupted by a H/W interrupt.
Or, in limited circumstances, the kernel code can generate an exception (e.g. a page fault caused by a kernel action [which is usually deemed fatal]).
but I am still looking into how exactly the context switch happens when executing an exception and letting the processor execute in thread mode while the SVCall exception is pending (was preempted and has not returned yet)... I think when I understand that, it will be clearer to me.
I think you have to be very careful with the terminology. In particular, when combining terms from disparate sources. Although user mode, kernel thread mode, or interrupt mode can be considered a context [in the dictionary sense of the word], context switching usually means that the current thread is suspended, the scheduler selects a new thread to run and resumes it. That is separate from the user-to-kernel transition.
And if there are any recommended resources about that for the ARM Cortex-M3/4, that would be nice.
Here is something: https://interrupt.memfault.com/blog/arm-cortex-m-exceptions-and-nvic But, be very careful in applying the terminology therein. What it considers "pending" only exists in the kernel during the entry code. What is more relevant is what the kernel does to set up mode/context and the terms are not equivalent.
So, from the kernel's standpoint, it's probably better to not consider an svc as "pending".
Not sure if there are similar questions. I tried to look back through but couldn't find any, so here it is.
In my bare-metal application that uses an ARM Cortex-A9 (dual core with GIC), some of the interrupt sources are 4 FPGA interrupts (let's say IRQ IDs 58, 59, 60, 61) that have the same priority, and the idea is that all of them trigger simultaneously and continuously at run-time. I would say the interrupt handlers qualify as long, but not very long.
All interrupts fire and are detected by the GIC, and all are flagged as PENDING. The problem is that only the two lower-numbered interrupts (58, 59) get handled by the CPU, starving the other two. Once 58 or 59 is done, its source triggers again and grabs the CPU over and over again. The other two interrupts are starved indefinitely.
I played around with priorities, assigning higher priority to 60 and 61. Sure enough, 60 and 61 triggered and got handled by the CPU, but then 58 and 59 were starved. So it's really an issue of starvation.
Is there any way out of here, such that the other two will still be processed given their triggering rate?
Assuming the GIC implementation is one of ARM's designs, then the arbitration scheme for multiple interrupts at the same priority is fixed at "dispatch the lowest-numbered one", so if you were hoping it could be changed to some kind of round-robin scheme you're probably out of luck.
That said, if these interrupts are more or less permanently asserted and you're taking them back-to-back then that's a sign that you probably don't need to use interrupts, or at least that the design of your code is inappropriate. Depending on the exact nature of the task, here are some ideas I'd consider:
Just run a continuous polling loop cycling through each device in turn. If there are periods when each device might not need servicing and it's not straightforward to tell, retain a trivial interrupt handler that just atomically sets a flag/sequence number/etc. to inform the loop who's ready (see the sketch after this list).
Handle all the interrupts on one core, and the actual processing on the other. The handler just grabs the necessary data, stuffs it into a queue, and returns as quickly as possible, while the other guy just steadily chews through the queue.
If catching every single interrupt is less important than just getting "enough" of each of them on average, leave each one disabled for a suitable timeout after handling it. Alternatively, hack up your own round-robin scheduling by having only one enabled at a time and letting the handler re-enable the next interrupt instead of the one just taken.
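To illustrate the first idea, here is a rough bare-metal sketch; all names are made up, handler registration is assumed to be done elsewhere, and the flag updates may need to be made properly atomic (e.g. with LDREX/STREX or by briefly masking IRQs):

    #include <stdint.h>

    #define NUM_FPGA_IRQS 4

    static volatile uint32_t ready_mask;      /* bit n set => device n needs service */

    /* Installed for IRQ IDs 58..61; just flag the source and return. */
    void fpga_irq_handler(uint32_t irq_id)
    {
        ready_mask |= 1u << (irq_id - 58);
    }

    /* Placeholder for the real per-device work. */
    extern void service_fpga_device(unsigned int n);

    void main_loop(void)
    {
        unsigned int n = 0;

        for (;;) {
            if (ready_mask & (1u << n)) {
                ready_mask &= ~(1u << n);     /* should be atomic w.r.t. the ISR */
                service_fpga_device(n);
            }
            n = (n + 1) % NUM_FPGA_IRQS;      /* strict round-robin over devices */
        }
    }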
In my bare-metal application that uses ARM Cortex-A9 (dual core with GIC)...
Is there any way out of here, such that the other two will still be processed given their triggering rate?
Of course there are many ways.
You have a dual CPU, so you can route a set to each CPU: 58/59 to CPU0 and 60/61 to CPU1 (see the sketch after this list). It is not clear how you have handled things with the distributor or the per-CPU interfaces.
A second way is to just read the status of 60/61 in the 58/59 handlers and do the work there. I.e., you can always read the status of another interrupt source from an IRQ handler.
You can also service each and every pending interrupt recorded at the start of the IRQ before acknowledging the original source. A variant of '2' implemented at the IRQ controller layer.
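For the first suggestion, on an ARM GICv1/v2-style distributor the per-interrupt CPU targets live in the byte-wide GICD_ITARGETSR registers at offset 0x800 from the distributor base. A rough sketch (the base address below is made up and entirely board-specific; on a Cortex-A9 MPCore the distributor sits at PERIPHBASE + 0x1000):

    #include <stdint.h>

    #define GICD_BASE        0xF8F01000u               /* board-specific, made up */
    #define GICD_ITARGETSR   (GICD_BASE + 0x800u)      /* one byte per interrupt ID */

    static inline void gic_set_target(unsigned int irq_id, uint8_t cpu_mask)
    {
        volatile uint8_t *reg = (volatile uint8_t *)(GICD_ITARGETSR + irq_id);

        *reg = cpu_mask;                     /* bit 0 = CPU0, bit 1 = CPU1 */
    }

    void route_fpga_irqs(void)
    {
        gic_set_target(58, 0x1);             /* 58/59 -> CPU0 */
        gic_set_target(59, 0x1);
        gic_set_target(60, 0x2);             /* 60/61 -> CPU1 */
        gic_set_target(61, 0x2);
    }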
I believe that most of these solutions avoid needless context save/restores and should also be more efficient.
Of course if you are asking the CPU to do more work than it can handle, priorities don't matter. The issue may be your code is not efficient; either the bare metal interrupt infrastructure or your FPGA IRQ handler. It is also quite likely the FPGA to CPU interface is not designed well. You may need to add FIFOs in the FPGA to buffer the data so the CPU can handle more data at a time. I have worked with several FPGA designers. They have a lot of flexibility and usually if you ask for something that will make the IRQ handler more efficient, they can implement it.
I have a small program running on Linux (on an embedded PC, dual-core Intel Atom 1.6GHz with Debian 6 running Linux 2.6.32-5) which communicates with external hardware via an FTDI USB-to-serial converter (using the ftdi_sio kernel module and a /dev/ttyUSB* device). Essentially, in my main loop I run
clock_gettime() using CLOCK_MONOTONIC
select() with a timeout of 8 ms
clock_gettime() as before
Output the time difference of the two clock_gettime() calls
To have some level of "soft" real-time guarantees, this thread runs as SCHED_FIFO with maximum priority (showing up as "RT" in top). It is the only thread in the system running at this priority, no other process has such priorities. My process has one other SCHED_FIFO thread with a lower priority, while everything else is at SCHED_OTHER. The two "real-time" threads are not CPU bound and do very little apart from waiting for I/O and passing on data.
The kernel I am using has no RT_PREEMPT patches (I might switch to that patch in the future). I know that if I want "proper" realtime, I need to switch to RT_PREEMPT or, better, Xenomai or the like. But nevertheless I would like to know what is behind the following timing anomalies on a "vanilla" kernel:
Roughly 0.03% of all select() calls are timed at over 10 ms (remember, the timeout was 8 ms).
The three worst cases (out of over 12 million calls) were 31.7 ms, 46.8 ms and 64.4 ms.
All of the above happened within 20 seconds of each other, and I think some cron job may have been interfering (although the system logs are low on information apart from the fact that cron.daily was being executed at the time).
So, my question is: What factors can be involved in such extreme cases? Is this just something that can happen inside the Linux kernel itself, i.e. would I have to switch to RT_PREEMPT, or even a non-USB interface and Xenomai, to get more reliable guarantees? Could /proc/sys/kernel/sched_rt_runtime_us be biting me? Are there any other factors I may have missed?
Another way to put this question is, what else can I do to reduce these latency anomalies without switching to a "harder" realtime environment?
Update: I have observed a new, "worse worst case" of about 118.4 ms (once over a total of around 25 million select() calls). Even when I am not using a kernel with any sort of realtime extension, I am somewhat worried by the fact that a deadline can apparently be missed by over a tenth of a second.
Without more information it is difficult to point to something specific, so I am just guessing here:
Interrupts and code that is triggered by interrupts take so much time in the kernel that your real time thread is significantly delayed. This depends on the frequency of interrupts, which interrupt handlers are involved, etc.
A thread with lower priority will not be interrupted inside the kernel until it yields the cpu or leaves the kernel.
As pointed out in this SO answer, CPU System Management Interrupts and Thermal Management can also cause significant time delays (up to 300ms were observed by the poster).
118 ms seems quite a lot for a 1.6 GHz CPU. But one driver that accidentally locks the CPU for some time would be enough. If you can, try to disable some drivers or use different driver/hardware combinations.
sched_rt_period_us and sched_rt_runtime_us should not be a problem if they are set to reasonable values and your code behaves as you expect. Still, I would remove the limit for RT threads and see what happens.
What else can you do? Write a device driver! It's not that difficult and interrupt handlers get a higher priority than realtime threads. It may be easier to switch to a real time kernel but YMMV.
I'm trying to write some code to determine if clock_gettime used with CLOCK_MONOTONIC_RAW will give me results coming from the same hardware on different cores.
From what I understand it is possible for each core to produce independent results but not always. I was given the task of obtaining timings on all cores with a precision of 40 nanoseconds.
The reason I'm not using CLOCK_REALTIME is that my program absolutely must not be affected by NTP adjustments.
Edit:
I have found the unsynchronized_tsc function which tries to test whether the TSC is the same on all cores. I am now attempting to find if CLOCK_MONOTONIC_RAW is based on the TSC.
Final edit:
It turns out that CLOCK_MONOTONIC_RAW is always usable on multi-core systems and does not rely on the TSC even on Intel machines.
To do measurements this precise, you'd need:
code that's executed on all CPUs, that reads the CPU's time stamp counter and stores it as soon as "an event" occurs
some way to create "an event" that is noticed at the same time by all CPUs
some way to prevent timing problems caused by IRQs, task switches, etc.
Various possibilities for the event include:
polling a memory location in a loop, where one CPU writes a new value and other CPUs stop polling when they see the new value
using the local APIC to broadcast an IPI (inter-processor interrupt) to all CPUs
For both of these methods there are delays between the CPUs (especially for larger NUMA systems) - a write to memory (cache) may be visible on the CPU that made the write immediately, and be visible by a CPU on a different physical chip (in a different NUMA domain) later. To avoid this you may need to find the average of initiating the event on all CPUs. E.g. (for 2 CPUs) one CPU initiates and both measure, then the other CPU initiates and both measure, then results are combined to cancel out any "event propagation latency".
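A user-space approximation of the memory-polling variant might look like the sketch below (x86-specific; the two threads are assumed to be pinned to different CPUs, and the role-swapping/averaging described above is left out; all the caveats in the following paragraphs still apply):

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>                     /* __rdtsc() */

    static volatile int go;                    /* the "event" both CPUs watch */
    static volatile uint64_t tsc_initiator, tsc_observer;

    static void *observer(void *arg)
    {
        (void)arg;
        while (!go)
            ;                                  /* spin until the event is visible */
        tsc_observer = __rdtsc();
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        /* NOTE: each thread should be pinned to its own CPU, e.g. with
         * pthread_setaffinity_np(); omitted here for brevity. */
        pthread_create(&t, NULL, observer, NULL);

        go = 1;                                /* initiate the event ... */
        tsc_initiator = __rdtsc();             /* ... and timestamp it locally */

        pthread_join(t, NULL);
        printf("observer - initiator = %lld cycles\n",
               (long long)(tsc_observer - tsc_initiator));
        return 0;
    }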
To fix other timing problems (IRQs, task switches, etc) I'd want to be doing these tests during boot where nothing else can mess things up. Otherwise you either need to prevent the problems (ensure all CPUs are running at the same speed, disable IRQs, disable thread switches, stop any PCI device bus mastering, etc) or cope with problems (e.g. run the same test many times and see if you get similar results most of the time).
Also note that all of the above can only ensure that the time stamp counters were in sync at the time the test was done, and don't guarantee that they won't become out of sync after the test is done. To ensure the CPUs remain in sync you'd need to rely on the CPU's "monotonic clock" guarantees (but older CPUs don't make that guarantee).
Finally; if you're attempting to do this in user-space (and not in kernel code); then my advice is to design code in a way that isn't so fragile to begin with. Even if the TSCs on different CPUs are guaranteed to be perfectly in sync at all times, you can't prevent an IRQ from interrupting immediately before or immediately after reading the TSC (and there's no way to atomically do something and read TSC at the same time); and therefore if your code requires such precisely synchronised timing then your code's design is probably flawed.
I am using QueryPerformanceCounter to time some code. I was shocked when the code starting reporting times that were clearly wrong. To convert the results of QPC into "real" time you need to divide by the frequency returned from QueryPerformanceFrequency, so the elapsed time is:
Time = (QPC.end - QPC.start)/QPF
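In code, that conversion corresponds to something like the following (error checking omitted):

    #include <stdio.h>
    #include <windows.h>

    int main(void)
    {
        LARGE_INTEGER start, end, freq;

        QueryPerformanceFrequency(&freq);      /* counts per second (QPF) */
        QueryPerformanceCounter(&start);

        Sleep(100);                            /* the code being timed */

        QueryPerformanceCounter(&end);
        printf("elapsed: %.3f ms\n",
               (end.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart);
        return 0;
    }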
After a reboot, the QPF frequency changed from 2.7 GHz to 4.1 GHz. I do not think that the actual hardware frequency changed as the wall clock time of the running program did not change although the time reported using QPC did change (it dropped by 2.7/4.1).
MyComputer->Properties shows:
Intel(R) Pentium(R) 4 CPU 2.80 GHz; 4.11 GHz; 1.99 GB of RAM; Physical Address Extension
Other than this, the system seems to be working fine.
I will try a reboot to see if the problem clears, but I am concerned that these critical performance counters could become invalid without warning.
Update:
While I appreciate the answers and especially the links, I do not have one of the affected chipsets, nor do I have a CPU clock that varies itself. From what I have read, QPC and QPF are based on a timer in the PCI bus and not affected by changes in the CPU clock. The strange thing in my situation is that the FREQUENCY reported by QPF changed to an incorrect value, and this changed frequency was also reported in MyComputer -> Properties, which I certainly did not write.
A reboot fixed my problem (QPF now reports the correct frequency) but I assume that if you are planning on using QPC/QPF you should validate it against another timer before trusting it.
Apparently there is a known issue with QPC on some chipsets, so you may want to make sure you do not have one of those chipsets. Additionally, some dual-core AMDs may also cause a problem. See the second post by sebbbi, where he states:
QueryPerformanceCounter() and QueryPerformanceFrequency() offer a bit better resolution, but have different issues. For example in Windows XP, all AMD Athlon X2 dual core CPUs return the PC of either of the cores "randomly" (the PC sometimes jumps a bit backwards), unless you specially install AMD dual core driver package to fix the issue. We haven't noticed any other dual+ core CPUs having similar issues (p4 dual, p4 ht, core2 dual, core2 quad, phenom quad).
From this answer.
You should always expect the core frequency to change on any CPU that supports technology such as SpeedStep or Cool'n'Quiet. Wall time is not affected; it uses an RTC. You should probably stop using the performance counters, unless you can tolerate a few (5-50) milliseconds' worth of occasional phase adjustments and are willing to do some math to perform said phase adjustment, by continuously or periodically re-normalizing your performance counter values based on the reported performance counter frequency and on RTC low-resolution time (you can do this on demand, or asynchronously from a high-resolution timer, depending on your application's ultimate needs).
You can try to use the Stopwatch class from .NET; it could help with your problem since it abstracts away all this low-level stuff.
Use the IsHighResolution property to see whether the timer is based on a high-resolution performance counter.
Note: On a multiprocessor computer, it does not matter which processor the thread runs on. However, because of bugs in the BIOS or the Hardware Abstraction Layer (HAL), you can get different timing results on different processors. To specify processor affinity for a thread, use the ProcessThread.ProcessorAffinity method.
Just a shot in the dark.
On my home PC I used to have "AI NOS" or something like that enabled in the BIOS. I suspect this screwed up the QueryPerformanceCounter/QueryPerformanceFrequency APIs because although the system clock ran at the normal rate, and normal apps ran perfectly, all full screen 3D games ran about 10-15% too fast, causing, for example, adjacent lines of dialog in a game to trip on each other.
I'm afraid you can't say "I shouldn't have this problem" when you're using QueryPerformance* - while the documentation states that the value returned by QueryPerformanceFrequency is constant, practical experimentation shows that it really isn't.
However you also don't want to be calling QPF every time you call QPC either. In practice we found that periodically (in our case once a second) calling QPF to get a fresh value kept the timers synchronised well enough for reliable profiling.
As has been pointed out as well, you need to keep all of your QPC calls on a single processor for consistent results. While this might not matter for profiling purposes (because you can just use ProcessorAffinity to lock the thread onto a single CPU), you don't want to do this for timing which is running as part of a proper multi-threaded application (because then you run the risk of locking a hard working thread to a CPU which is busy).
Especially don't arbitrarily lock to CPU 0, because you can guarantee that some other badly coded application has done that too, and then both applications will fight over CPU time on CPU 0 while CPU 1 (or 2 or 3) sits idle. Randomly choose from the set of available CPUs and you have at least a fighting chance of not being locked to an overloaded CPU.
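One way to do that on Windows is to pick a random set bit from the process affinity mask rather than hard-coding CPU 0; a rough sketch (error handling omitted):

    #include <stdlib.h>
    #include <time.h>
    #include <windows.h>

    /* Pin the current thread to one randomly chosen CPU from the set the
     * process is allowed to run on, instead of always picking CPU 0. */
    static void pin_to_random_cpu(void)
    {
        DWORD_PTR process_mask = 0, system_mask = 0;
        DWORD_PTR candidates[64];
        int n = 0;

        GetProcessAffinityMask(GetCurrentProcess(), &process_mask, &system_mask);

        for (int bit = 0; bit < (int)(8 * sizeof(DWORD_PTR)); bit++)
            if (process_mask & ((DWORD_PTR)1 << bit))
                candidates[n++] = (DWORD_PTR)1 << bit;

        srand((unsigned)time(NULL));
        if (n > 0)
            SetThreadAffinityMask(GetCurrentThread(), candidates[rand() % n]);
    }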