Low interrupt latency via dedicated architectures and operating systems - ARM

This question may seem slightly vague, but I am researching how interrupt systems work and their latency times. I am trying to understand how architectural facilities such as FIQ on ARM help decrease latency times. How does this differ from using an operating system that does not have, or cannot provide, access to these facilities? For example, Windows RT is built for ARM, and that operating system cannot be ported to other architectures.
Simply put - how is interrupt latency different in dedicated architectures that have dedicated operating systems as compared to operating systems that can be ported across many different architectures (Linux for example)?
Sorry for the rant - I'm pretty confused as you can probably tell.

I'll start with your Windows RT example: Windows RT is a port of Windows to the ARM architecture. It is not a 'dedicated operating system'. There are (probably) many OSes that run on only one architecture, but that is more a matter of nobody being bothered to port them, for one reason or another.
What does 'port' really mean though?
Windows has a kernel (we'll call it NT here, it doesn't matter) and that NT kernel has a bunch of concepts that need to be implemented. These concepts are things like timers, memory virtualisation, exceptions etc...
These concepts are implemented differently between architectures, so the port of the kernel and drivers (I will ignore the rest of the OS here, as that is often just a recompile) is a matter of using the available pieces of silicon to implement the required concepts. This implementation is called a 'port'.
Let's zoom in on interrupts (AKA exceptions) on an ARM that has FIQ and IRQ.
In general an interrupt can occur asynchronously, by that I mean at any time. The CPU is generally busy doing something when an IRQ is asserted so that context (we'll call it UserContext1) needs to be stored before the CPU can use any resources in use by UserContext1. Generally this means storing registers on the stack before using them.
On ARM, when an IRQ occurs the CPU switches to IRQ mode. Registers r13 and r14 have their own copies for IRQ mode; the rest will need to be saved if they are used, so that is what happens. Those stores to memory take some time. The IRQ is handled, UserContext1 is popped back off the stack, then IRQ mode is exited.
So the latency in this case might be the time from IRQ assertion to the time the IRQ vector starts executing. That is going to be some set number of clock cycles based upon what the CPU was doing when the IRQ happened.
The latency before the IRQ handling can occur is the time from the IRQ assert to the time the CPU has finished storing the context.
The latency before user mode code can execute depends on too much stuff in the OS/Kernel to explain here, but the minimum boils down to the time from the IRQ assertion to the return after restoring UserContext1 + the time for the OS context switch.
FIQ - If you are a hard-as-nails programmer you might only need 7 registers to completely handle your interrupt servicing. I mentioned that IRQ mode has its own copy of 2 registers; well, FIQ mode has its own copy of 7 registers. Yup, that's 28 bytes of context that doesn't need to be pushed onto the stack (actually one of them is the link register, so it's really 6 you have to work with). That can remove the need to store and then restore UserContext1, so the latency can be reduced by up to the length of time needed for that save/restore.
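As a rough illustration, here is a minimal sketch using GCC's ARM interrupt attributes; the "peripheral" addresses below are made up for the example, not a real device map:

```c
#include <stdint.h>

/* Hypothetical interrupt-acknowledge registers, for illustration only. */
#define IRQ_ACK  (*(volatile uint32_t *)0x10000000u)
#define FIQ_ACK  (*(volatile uint32_t *)0x10000004u)

volatile uint32_t irq_events, fiq_events;

/* IRQ mode banks only r13/r14, so the compiler must push any other
 * registers it needs (the UserContext1 save described above). */
void __attribute__((interrupt("IRQ"))) irq_handler(void)
{
    IRQ_ACK = 1u;            /* acknowledge the interrupt source */
    irq_events++;
}

/* FIQ mode banks r8-r14, so a small handler can avoid touching the
 * stack at all - that is where the latency saving comes from. */
void __attribute__((interrupt("FIQ"))) fiq_handler(void)
{
    FIQ_ACK = 1u;
    fiq_events++;
}
```

Comparing the compiler's output for the two handlers (e.g. with objdump) makes the difference in saved context visible.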
None of this has much to do with the OS. The OS can choose to use or not use these features. The OS can choose to make guarantees about how long it will take to execute the OS's concept of an interrupt handler, or it may not. This is one of the basic concepts of an RTOS: the contract about how long it will be before the handler runs.
The OS is designed for some purpose (and that purpose may be 'general') - that target design goal will have a lot more effect on latency than how many targets the OS has been ported to.
Go have a read about something like FreeRTOS, then buy some hardware and try it. Annotate the code to figure out the latencies you really want to look at. It will likely be the best way to get your head around it.
(*Multi-CPU systems do it the same way, but with some synchronization and barrier functions and a sprinkling of complexity.)

Related

How to prevent system hang before watchdog timer task kicks in

We are using an ARM AM1808 based embedded system with an RTOS and a file system. We are using the C language. We have a watchdog timer implemented inside the application code, so whenever something goes wrong in the application code, the watchdog timer takes care of the system.
However, we are experiencing an issue where the system hangs before the watchdog timer task starts. The system hangs because the file system code is badly written, with a great number of while loops. Sometimes, due to a bad NAND (or at least the file system code thinks it is bad), the code hangs in a while loop and never gets out of it. What we get is a dead board.
So, the point of giving all this information is to ask whether there is any mechanism that could be implemented in code that runs before the application code. Is there any hardware watchdog? What steps can be taken to make sure we don't get a dead board caused by some while loop?
Professional embedded systems are designed like this:
Pick an MCU with a power-on-reset interrupt and an on-chip watchdog. This is standard on all modern MCUs.
Implement the steps below from inside the reset interrupt vector.
If the MCU memory is simple to set up, such as just setting the stack pointer, then do so as the very first thing out of reset. This enables C programming. You can usually write the reset ISR in C as long as you don't declare any variables - disassemble it to make sure that it doesn't touch any RAM addresses until those are available.
If the memory setup is complex - there is an MMU to set up or similar - C code will have to wait and you'll have to stick to assembler to prevent accidental stacking caused by C code.
Set up the most fundamental registers, such as mode/peripheral routing registers, the watchdog and the system clock.
Set up the low-voltage detect hardware, if applicable. Hopefully the out-of-reset state for LVD on the MCU is a sound one.
Application-specific, critical registers such as GPIO direction and internal pull resistor registers should be set from here. Many MCUs have their pins configured as inputs by default, making them vulnerable. If they are not meant to be inputs in the application, the time they are kept as such out of reset should be minimized, to avoid problems with noise, transients and ESD.
Set up the MMU, if applicable.
Everything else "CRT", such as initialization of .data and .bss.
Call main(). (A minimal sketch of this whole sequence follows.)
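Here is that sketch for a generic MCU. Every register name, address and linker symbol below is made up for illustration (substitute your MCU's actual map and your linker script's symbols), and it assumes the stack pointer is already loaded before reset_handler runs, as on a Cortex-M where the hardware fetches it from the vector table:

```c
#include <stdint.h>

/* Hypothetical registers - substitute your MCU's actual map. */
#define WDT_CTRL   (*(volatile uint32_t *)0x40000000u)
#define CLK_CTRL   (*(volatile uint32_t *)0x40000004u)
#define GPIO_DIR   (*(volatile uint32_t *)0x40000008u)

/* Hypothetical linker-script symbols delimiting .data and .bss. */
extern uint32_t __data_load_start, __data_start, __data_end;
extern uint32_t __bss_start, __bss_end;

extern int main(void);

void reset_handler(void)
{
    WDT_CTRL = 1u;               /* watchdog first, before anything can hang   */
    CLK_CTRL = 5u;               /* leave the slow default RC oscillator early */
    GPIO_DIR = 0xFFFFFFFFu;      /* don't leave unused pins floating as inputs */

    /* The "CRT" part: copy .data from flash, zero .bss - now at full clock. */
    const uint32_t *src = &__data_load_start;
    uint32_t *dst = &__data_start;
    while (dst < &__data_end)
        *dst++ = *src++;
    for (dst = &__bss_start; dst < &__bss_end; ++dst)
        *dst = 0u;

    (void)main();
    for (;;) { }                 /* main() should never return here */
}
```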
Please note that pre-made startup code for your MCU is not necessarily written by professionals! It is fairly common for an amateur-level "CRT" to be delivered with your toolchain, one which fails to set up the watchdog and clock early on. This is of course unacceptable, since:
This makes any program running on that platform a notable safety hazard and a quality problem, should the "CRT" crash or hang for whatever reason.
This makes the initialization of .data and .bss needlessly, painfully slow, as it is then typically executed with the clock running on the default on-chip RC oscillator or similar.
Please note that even industry de facto startup code such as ARM CMSIS fails to do some of the MCU-specific hardware setups mentioned above. This may or may not be a problem.
There is a hardware watchdog that can be run before the application runs. The ARM AM1808 does have a timer that can be configured as a watchdog, as per the documentation: www.ti.com/lit/ds/symlink/am1808.pdf. So you may wish to set it up that way, at least during the part of the program that runs through the long, critical section. You may wish to have a piece of boot code that first sets up this watchdog and, after correct initialization, jumps to the application. In fact, this is a very common approach.
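The structure of that boot code could look roughly like this. The register addresses and the service value below are placeholders, not the actual AM1808/Timer64P register map, so check the TI documentation for the real enable and kick sequence:

```c
#include <stdint.h>

/* Placeholder watchdog registers - NOT the real AM1808 addresses. */
#define WDT_ENABLE   (*(volatile uint32_t *)0x01C20000u)
#define WDT_SERVICE  (*(volatile uint32_t *)0x01C20004u)

extern void application_main(void);   /* RTOS start, file system init, ... */

/* Call this from the long-running loops (e.g. NAND retries) that can hang,
 * but only while they are still making forward progress. */
void watchdog_kick(void)
{
    WDT_SERVICE = 0xA5C6u;             /* placeholder service/key value */
}

void boot(void)
{
    WDT_ENABLE = 1u;                   /* arm the watchdog before the risky code */
    application_main();                /* if init hangs, the watchdog resets us  */
}
```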

Can the kernel set the interval of the "hardware timer" of the CPU, and does the CPU have a dedicated hardware timer for scheduling?

Based on my understanding, the CPU has a "hardware timer" that fires an interrupt when its interval expires.
The kernel uses this hardware timer to implement the scheduling mechanism for processes, so if the hardware timer fires an interrupt with, say, interrupt number 123, the kernel maps this interrupt number to an interrupt handler that executes the scheduler code (which decides which process to execute next).
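In code terms, my mental model looks roughly like this (illustrative names only, not actual kernel code):

```c
#include <stdint.h>

static volatile uint64_t ticks;     /* number of timer interrupts so far */

static void schedule(void)
{
    /* pick the next runnable process and context-switch to it (omitted) */
}

/* The kernel registers this for the timer's interrupt number (e.g. 123);
 * it runs every time the hardware timer's interval expires. */
void timer_interrupt_handler(void)
{
    ticks++;
    schedule();
}
```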
I have two questions:
Can the kernel set the interval of the hardware timer, or is the interval a fixed number that can't be changed programmatically?
Does the CPU have a dedicated hardware timer for scheduling, or are there many hardware timers from which the kernel can choose whichever it wants to use for scheduling?
Edit: The hardware architecture I am most interested in is a PC, but I would like to know whether other architectures (for example a mobile phone, a Raspberry Pi, etc.) work in a similar way.
Details are hardware specific (they might differ with various motherboards, chipsets and processors; read about the southbridge). Read about the High Precision Event Timer (HPET) and the APIC.
See also the OSDev wiki, notably its Programmable Interval Timer page.
(so the answer is usually yes to both questions)
From early on, IBM-compatible PCs had PITs (Programmable Interval Timers): IBM PC and IBM PC XT had the Intel 8253, the IBM PC AT introduced the Intel 8254.
From the IBM PC Technical Reference from April 1984, page 1-11:
System Timers
Three programmable timer/counters are used by the system as follows: Channel 0 is a general-purpose timer providing a constant time base for implementing a time-of-day clock, Channel 1 times and requests refresh cycles from the Direct Memory Access (DMA) channel, and Channel 2 supports the tone generation for the speaker. [...]
Channel 0 is exactly the "constant time base," the "interval" you are asking for. And, to answer your 1st question, it is changeable; it is the Programmable Interval Timer.
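Concretely, reprogramming channel 0 takes just a few port writes. A bare-metal sketch (the outb helper is assumed to be provided by the environment, and this must run with I/O-port privileges, i.e. in the kernel or on bare metal):

```c
#include <stdint.h>

#define PIT_INPUT_HZ 1193182u              /* the PIT's ~1.193182 MHz input clock */

/* Assumed port-output helper (e.g. a one-line inline-asm wrapper). */
extern void outb(uint16_t port, uint8_t value);

/* Make PIT channel 0 fire its interrupt 'hz' times per second. */
static void pit_set_frequency(uint32_t hz)
{
    uint16_t divisor = (uint16_t)(PIT_INPUT_HZ / hz);
    outb(0x43, 0x36);                      /* channel 0, lobyte/hibyte, mode 3 */
    outb(0x40, (uint8_t)(divisor & 0xFF)); /* reload value, low byte  */
    outb(0x40, (uint8_t)(divisor >> 8));   /* reload value, high byte */
}
```

Early Linux does essentially this at boot to set its HZ tick rate.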
However, the CPU built into the original IBM PC was the Intel 8088, basically an Intel 8086 with an 8-bit data bus. Real Mode was the state of the art back then; Protected Mode was introduced some years later with the Intel 80286, so effective multitasking, let alone preemptive multitasking or multithreading, was of no concern in those days when DOS reigned the market.
Fast-forwarding to the IBM PC AT, the world was blessed with a Protected Mode-capable CPU, the Intel 80286, and the Intel 8254 was introduced, a "[...] superset of the 8253." (from the 8254 PIT datasheet). If you really want an in-depth understanding of the PITs, read the 8253/8254 datasheets linked at the bottom. It might also be worth looking at Linux. Since the latest kernels are way too complicated to really understand the particular parts in a matter of twenty minutes, I suggest you look at Linux 0.01, the very first release. _timer_interrupt in kernel/system_calls.s might be interesting and from there you can go wherever you want.
Regarding your 2nd question: there are multiple timer sources, but only one is suitable for interval timing, that is, channel 0. IBM-compatibles still comply with the system timer layout shown above. They retain the same functionality, but might add more on top of that or change how the hardware works and how it's packaged. Nowadays, additional timers do exist like high-resolution timers, but using them for interrupt timing instead would break compatibility.
Intel 8253 Datasheet
Intel 8254 Datasheet
IBM PC Technical Reference
IBM PC AT Technical Reference
Can the kernel set the interval of the hardware timer, or is the interval a fixed number that can't be changed programmatically?
Your questions are ENTIRELY processor specific. Some processors have controllable timers. Others have timers that go off at fixed intervals. Most processors you are likely to encounter have adjustable timers, however.
Does the CPU have a dedicated hardware timer for scheduling or is there many hardware timers, and the kernel can choose whichever timer it wants to use for scheduling?
Some processors have only one timer. Most processors these days have multiple timers.

Full software timer: derive time?

I have been asked a question but I am not sure if I answered it correctly.
"Is it possible to rely only on software timer?"
My answer was "yes, in theory".
But then I added:
"Just relying on hardware timer at the kernel loading (rtc) and then
software only is a mess to manage since we must be able to know
how many cpu cycles each instruction took + eventual cache miss +
branching cost + memory speed and put a counter after each one or
group (good luck with out-of-order cpu).
And do the calculation to derivate the current cpu cycle. That is
insane.
Not talking about the overall performance drop.
The best we could have is a brittle approximation of the time which
become more wrong over time. Even possibly on short laps."
But even though it seems logical to me, did my thinking go wrong somewhere?
Thanks
On current processors and hardware (e.g. Intel, AMD or ARM in laptops, desktops or tablets) with common operating systems (Linux, Windows, FreeBSD, MacOSX, Android, iOS, ...), processes are scheduled at unpredictable times, so cache behavior is non-deterministic and instruction timing is not reproducible. You need some hardware time measurement.
A typical desktop or laptop gets hundreds, or thousands, of interrupts every second, most of them time related. Try running cat /proc/interrupts on a Linux machine twice, with a few seconds between the runs.
I guess that even with a single-tasking, MS-DOS-like operating system, you would still get non-deterministic behavior (e.g. induced by ACPI or SMM). On some laptops, the processor frequency can be throttled by its temperature, which depends upon the CPU load and the ambient temperature...
In practice you really want to use some timer provided by the operating system. For Linux, read time(7)
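For example, on Linux a program would typically measure elapsed time through the OS like this, rather than trying to count its own instructions (a minimal sketch using the POSIX monotonic clock):

```c
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    /* ... the work being measured ... */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (double)(end.tv_sec - start.tv_sec)
                   + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    printf("elapsed: %.9f s\n", elapsed);
    return 0;
}
```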
So you practically cannot rely on a purely software timer. However, the processor has internal timers, and even in principle you cannot avoid timers on current processors.
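One such internal timer on x86 is the time-stamp counter; reading it is a single instruction. A sketch for GCC/Clang (note that on modern CPUs the TSC counts at a fixed rate unrelated to how many instructions were actually executed):

```c
#include <stdio.h>
#include <x86intrin.h>          /* __rdtsc() on GCC/Clang */

int main(void)
{
    unsigned long long before = __rdtsc();
    /* ... the code being measured ... */
    unsigned long long after = __rdtsc();

    printf("elapsed: %llu TSC ticks\n", after - before);
    return 0;
}
```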
You might be able, if you can put your hardware in a very controlled (thermostatic) environment, to run a very limited piece of software (an OS-like freestanding thing) sitting entirely in the processor cache and perhaps then get some determinism, but in practice current laptop, desktop or tablet hardware is non-deterministic and you cannot predict the time needed for a given small machine routine.
Timers are extremely useful in interesting (non-trivial) software; see e.g. J. Pitrat's CAIA, a sleeping beauty blog entry for an interesting point. Also look at the many uses of watchdog timers in software (e.g. in the Parma Polyhedra Library).
Read also about Worst Case Execution Time (WCET).
So I would say that even in theory it is not possible to rely upon a purely software timer (unless, of course, that software uses the processor's timers, which are hardware circuits). In the previous century (up to the 1980s or 1990s) hardware was much more deterministic, and the number of clock cycles or microseconds needed for each machine instruction was documented (but some instructions, e.g. division, needed a variable amount of time, depending on the actual data!).

To what extent are interrupts supported in Win32?

To what extent are interrupts supported in Win32 beyond processor definitions? For example, x86 machines define at least 18 interrupts, including traps such as the breakpoint trap (INT 3). The other 19-255 interrupts are left open by Intel as software defined interrupts. Are any of these used by Windows/WinAPI or are they just open and free for applications to use as they please? If Windows uses them, where can I find the relevant documentation? I looked on MSDN and could not find anything.
(BTW I am doing compiler, debugger and other system-level programming, so please don't lecture me on your opinions about the advisability of using interrupts in the first place.)
In Win32 apps, there's probably just one interrupt used commonly, int 2Eh. It's used as the system call entry point. It's analogous to int 21h in DOS. The rest of the interrupts aren't used by apps.
Apps, however, can handle some CPU exceptions (and debug breaks) via Structured Exception Handling (SEH) / Vectored Exception Handling (VEH). Windows catches CPU exceptions originating in apps and reflects them back into the apps, if and however possible (Windows is not perfect at imitating the CPU exception model).
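For example, with MSVC a breakpoint trap raised in the app can be caught and execution resumed using SEH (a minimal sketch):

```c
#include <windows.h>
#include <intrin.h>     /* __debugbreak() */
#include <stdio.h>

int main(void)
{
    __try {
        __debugbreak();                      /* executes an int 3 breakpoint trap */
    }
    __except (GetExceptionCode() == EXCEPTION_BREAKPOINT
                  ? EXCEPTION_EXECUTE_HANDLER
                  : EXCEPTION_CONTINUE_SEARCH) {
        puts("EXCEPTION_BREAKPOINT reflected back to the app");
    }
    puts("still running");
    return 0;
}
```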
Windows uses device interrupts internally and does not let apps mess with them. The x86 CPU handles interrupts in the most privileged mode, where the kernel runs.
Nowadays many device interrupts aren't associated with fixed interrupt vectors; they are configurable, and you need to work with things like PCI to query or change the settings.
If you want to work with devices and interrupts directly, you need to write a kernel-mode driver for Windows. There's the Device Driver Kit (DDK) and books like Windows Internals that can get you started.
Still, if you're looking for the specifics of device XYZ and its interrupt programming, you aren't going to find much on MSDN or in the DDK, because you'll need hardware-specific information that is outside of Microsoft's control. The kernel provides the functionality necessary to do I/O and handle interrupts, but it's ultimately up to device drivers to use it one way or the other.

Any open-source ARM7 emulators suitable for linking with C?

I have an open-source Atari 2600 emulator (Z26), and I'd like to add support for cartridges containing an embedded ARM processor (NXP 21xx family). The idea would be to simulate the 6507 until it tries to read or write a byte of memory (which it will do every 841ns). If the 6507 performs a write, put the address and data on some of the ARM's I/O ports and let the ARM code run 20 cycles, confirm that the ARM is floating its data bus, and let the ARM run for another 38 cycles. If the 6507 performs a read, put the address on the ARM's I/O ports, let the ARM run 38 cycles, grab the data from the ARM's I/O port (hopefully the ARM software will have put it there), and let the ARM run another 20 cycles.
The ARM7 seems pretty straightforward to implement; I don't need to simulate a whole lot of hardware features. Any thoughts?
Edit
What I have in mind is a routine that would take as parameters a struct holding the machine state and pointers to memory access routines. When called, the routine would emulate the ARM's instruction engine, generating appropriate reads, writes, and code fetches. I could then write the memory access routines to treat appropriate areas as flash (with roughly-approximated wait states), RAM, I/O ports, and timer registers. Some other areas would be marked as don't-care, and accesses to any other areas would flag an error and stop the emulator.
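A sketch of the interface I have in mind (all names here are illustrative, not an existing emulator's API):

```c
#include <stdint.h>

/* Machine state passed in by the caller. */
typedef struct arm_state {
    uint32_t r[16];          /* r0-r15; r[15] is the program counter */
    uint32_t cpsr;
    /* banked registers, SPSRs, etc. omitted for brevity */
} arm_state;

/* Caller-supplied bus callbacks: the host decides whether an address is
 * flash (with approximate wait states), RAM, an I/O port, a timer register,
 * a don't-care area, or an error that should stop the emulator. */
typedef struct arm_bus {
    uint32_t (*read)(void *ctx, uint32_t addr, unsigned size);
    void     (*write)(void *ctx, uint32_t addr, uint32_t value, unsigned size);
    void     *ctx;
} arm_bus;

/* Run the instruction engine for at least `cycles` cycles, generating reads,
 * writes and code fetches through `bus`; returns cycles actually consumed. */
int arm_run(arm_state *state, const arm_bus *bus, int cycles);
```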
Perhaps QEMU uses such a thing internally. Since the ARM emulation would be integrated into an already-existing emulation engine (which I didn't write and don't fully understand--the only parts of Z26 I've patched have been the memory read/write logic) I would need something with a fairly small footprint.
Any idea how QEMU works inside? Any idea what the GPL licence would require if I just use 2% of the code in QEMU--whether I'd have to bundle the code for the whole thing, or just the part that I use, or what?
Try QEMU.
With some work, you can make my emulator do what you want. It was written for ARM920, and the Thumb instruction set isn't done yet. Neither is the MMU/cache interface. Also, it's slow because it is an interpreter. On the bright side, it's all written in C99.
http://code.google.com/p/gp2xemu/
I haven't worked on it for a while (the SVN trunk is 2 years old), but if you're going to use the code, I'll be glad to help you out with the missing features. It is licensed under MIT, so it's essentially the same as the broad BSD license.
