Full software timer: derive time?

I have been asked a question but I am not sure if I answered it correctly.
"Is it possible to rely only on software timer?"
My answer was "yes, in theory".
But then I added:
"Just relying on hardware timer at the kernel loading (rtc) and then
software only is a mess to manage since we must be able to know
how many cpu cycles each instruction took + eventual cache miss +
branching cost + memory speed and put a counter after each one or
group (good luck with out-of-order cpu).
And do the calculation to derivate the current cpu cycle. That is
insane.
Not talking about the overall performance drop.
The best we could have is a brittle approximation of the time which
become more wrong over time. Even possibly on short laps."
But even if it seems logical to me, did my thinking go wrong?
Thanks

On current processors and hardware (e.g. Intel, AMD, or ARM in laptops, desktops, or tablets) with common operating systems (Linux, Windows, FreeBSD, MacOSX, Android, iOS, ...), processes are scheduled at random times. So cache behavior is non-deterministic. Hence, instruction timing is not reproducible. You need some hardware time measurement.
A typical desktop or laptop gets hundreds, or thousands, of interrupts every second, most of them time-related. Try running cat /proc/interrupts on a Linux machine twice, with a few seconds between the runs.
I guess that even with a single-tasking, MS-DOS-like operating system, you'd still get random behavior (e.g. induced by ACPI or SMM). On some laptops, the processor frequency can be throttled by its temperature, which depends upon the CPU load and the external temperature...
In practice you really want to use some timer provided by the operating system. For Linux, read time(7).
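For instance, a minimal sketch of reading an OS-provided clock on Linux looks like this (the loop being timed is just a placeholder; CLOCK_MONOTONIC is ultimately backed by hardware timer circuits that the kernel programs for you):

    /* Minimal sketch: measure an interval with an OS-provided clock on Linux. */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);

        /* ... placeholder for the work being timed ... */
        for (volatile long i = 0; i < 10000000; i++)
            ;

        clock_gettime(CLOCK_MONOTONIC, &end);

        double elapsed = (end.tv_sec - start.tv_sec)
                       + (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("elapsed: %.6f s\n", elapsed);
        return 0;
    }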
So you practically cannot rely on a purely software timer. However, the processor has internal timers... even in principle, you cannot avoid hardware timers on current processors.
If you can put your hardware in a very controlled (thermostatically regulated) environment and run very limited software (an OS-like freestanding thing) sitting entirely in the processor cache, you might perhaps get some determinism; but in practice, current laptop, desktop, or tablet hardware is non-deterministic and you cannot predict the time needed for a given small machine routine.
Timers are extremely useful in interesting (non-trivial) software; see e.g. J. Pitrat's blog entry CAIA, a sleeping beauty for an interesting point. Also look at the many uses of watchdog timers in software (e.g. in the Parma Polyhedra Library).
Read also about Worst-Case Execution Time (WCET).
So I would say that even in theory it is not possible to rely upon a purely software timer (unless of course that software uses the processor's timers, which are hardware circuits). In the previous century (up to the 1980s or 1990s) hardware was much more deterministic, and the number of clock cycles or microseconds needed for each machine instruction was documented (but some instructions, e.g. division, needed a variable amount of time depending on the actual data!).


What is the reference for timing calculations in Linux?

I want to understand how timers behave in Linux.
I know that in microcontrollers the timers/counters use the execution timing of machine instructions as a reference, so we can make a loop last for however long we need for a sleep/timer/counter.
But in Linux, where and how does it get the reference that guarantees that if I use sleep(5), exactly 5 seconds elapse? If anyone knows, please clarify.
Every operating system kernel (that I know of) has a whole machine-independent framework for timers. This is pretty much one of the most central things a kernel must have, because we need timers for everything: process scheduling, dealing with hardware errors, select/poll timeouts, network protocols, etc. At any point in time your kernel has dozens, if not thousands, of timers waiting to be executed at some point in the future. Most of them will be canceled and never executed.
The simplest framework that pretty much everyone uses sets up one of the many clocks in a machine to generate an interrupt at a set interval. 100Hz is the most common, Windows (at least in the past) sets it to 64Hz (but it can be changed by any application), some systems experimented with 1024Hz. The timer interrupt fires and the interrupt handler checks if there's anything queued up to do at that time and if there is, it is executed. There has been some work for Linux to improve this so that we can get shorter or longer intervals than 10ms depending on the next scheduled timer, both to improve the precision of the timers and to save power, but in general it works as described above.
If I understand your question correctly, you think that there is something that measures how long a certain sequence of instructions takes and then loops until some amount of time passes. This is something that is almost never done, because it wastes power, it blocks anything else from running at the same time, and it is also quite unreliable. It is still done in modern kernels, but very rarely and only when high precision is required when talking to really, really stupid hardware. The last time I had to do it was 17 years ago, to talk to some Ethernet controller where you had to manually implement MII by bit-banging in software; it was terrible and hung the system for quite a long time every time you (un-)plugged an Ethernet cable. Nobody builds hardware that requires this anymore because it really ruins the performance of modern systems.
So in your question, sleep(5) will be implemented by registering a function in the timer framework to be called in 5 seconds from now and then putting the process to sleep. 5 seconds later the timer fires and the process gets awakened again.
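To make that concrete, here is a heavily simplified, hypothetical sketch of such a tick-driven framework. None of these names or structures correspond to real Linux kernel code; it only illustrates the idea of a tick handler draining a queue of expiry callbacks, with sleep() being one client of it:

    /* Hypothetical sketch of a tick-driven kernel timer framework.
     * All names are invented; this is not real Linux kernel code. */
    #define HZ 100                          /* tick rate: 100 interrupts/second */

    struct timer {
        unsigned long expires;              /* tick count at which it fires     */
        void (*callback)(void *arg);        /* what to do when it fires         */
        void *arg;
        struct timer *next;                 /* naive sorted singly linked list  */
    };

    static volatile unsigned long jiffies;  /* incremented on every tick        */
    static struct timer *pending;           /* queued timers, soonest first     */

    /* Called from the periodic timer interrupt. */
    void timer_tick(void)
    {
        jiffies++;
        while (pending && pending->expires <= jiffies) {
            struct timer *t = pending;
            pending = t->next;
            t->callback(t->arg);            /* run the expired timer            */
        }
    }

    /* Queue a timer; insertion keeps the list sorted by expiry time. */
    void add_timer(struct timer *t, unsigned long ticks_from_now)
    {
        struct timer **p = &pending;
        t->expires = jiffies + ticks_from_now;
        while (*p && (*p)->expires <= t->expires)
            p = &(*p)->next;
        t->next = *p;
        *p = t;
    }

    /* sleep(5) then boils down to: queue a wake-up 5*HZ ticks from now,
     * mark the process as sleeping and let the scheduler run other work
     * until the callback makes it runnable again. */
    void wake_up_process(void *task);       /* provided by the scheduler        */

    void kernel_sleep_seconds(void *current_task, unsigned int seconds)
    {
        struct timer t = { .callback = wake_up_process, .arg = current_task };
        add_timer(&t, (unsigned long)seconds * HZ);
        /* schedule();  -- block here; timer_tick() wakes us up later */
    }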

Low interrupt latency via dedicated architectures and operating systems

This question may seem slightly vague, however I am researching how interrupt systems work and their latency times. I am trying to achieve an understanding of how architectural facilities such as FIQ in ARM help decrease latency times. How does this differ from using an operating system that does not have access, or cannot provide access, to these facilities? For example, Windows RT is made for ARM, and this operating system cannot be ported to other architectures.
Simply put - how is interrupt latency different in dedicated architectures that have dedicated operating systems as compared to operating systems that can be ported across many different architectures (Linux for example)?
Sorry for the rant - I'm pretty confused as you can probably tell.
I'll start with your Windows RT example: Windows RT is a port of Windows to the ARM architecture. It is not a 'dedicated operating system'. There are (probably) many OSes that run on only one architecture, but that is more a matter of nobody having bothered to port them for some reason.
What does 'port' really mean though?
Windows has a kernel (we'll call it NT here, it doesn't matter) and that NT kernel has a bunch of concepts that need to be implemented. These concepts are things like timers, memory virtualisation, exceptions etc...
These concepts are implemented differently between architectures, so the port of the kernel and drivers (I will ignore the rest of the OS here, often that is a recompile only) will be a matter of using the available pieces of silicon to implement the required concepts. This implementation is called a 'port'.
Let's zoom in on interrupts (AKA exceptions) on an ARM that has FIQ and IRQ.
In general an interrupt can occur asynchronously, by that I mean at any time. The CPU is generally busy doing something when an IRQ is asserted so that context (we'll call it UserContext1) needs to be stored before the CPU can use any resources in use by UserContext1. Generally this means storing registers on the stack before using them.
On ARM, when an IRQ occurs the CPU will switch to IRQ mode. Registers r13 and r14 have their own copies for IRQ mode; the rest will need to be saved if they are used - so that is what happens. Those stores to memory take some time. The IRQ is handled, UserContext1 is popped back off the stack, then IRQ mode is exited.
So the latency in this case might be the time from IRQ assertion to the time the IRQ vector starts executing. That is going to be some set number of clock cycles based upon what the CPU was doing when the IRQ happened.
The latency before the IRQ handling can occur is the time from the IRQ assert to the time the CPU has finished storing the context.
The latency before user mode code can execute depends on too much stuff in the OS/Kernel to explain here, but the minimum boils down to the time from the IRQ assertion to the return after restoring UserContext1 + the time for the OS context switch.
FIQ - If you are a hard-as-nails programmer you might only need to use 7 registers to completely handle your interrupt servicing. I mentioned that IRQ mode has its own copy of 2 registers; well, FIQ mode has its own copy of 7 registers. Yup, that's 28 bytes of context that doesn't need to be pushed out onto the stack (actually one of them is the link register, so it's really 6 you have). That can remove the need to store UserContext1 and then restore UserContext1. Thus the latency can be reduced by up to the length of time needed to do that save/restore.
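As an aside, on a bare-metal GCC toolchain for 32-bit ARM you can let the compiler generate that save/restore and inspect it in the disassembly; a minimal sketch (the device register address and the handler body are made up for illustration):

    /* Bare-metal sketch (GCC, 32-bit ARM): the interrupt attribute makes the
     * compiler emit the context save/restore described above around the body.
     * The acknowledge register address is hypothetical. */
    #include <stdint.h>

    #define TIMER_IRQ_ACK (*(volatile uint32_t *)0x40001000u)

    void irq_handler(void) __attribute__((interrupt("IRQ")));

    void irq_handler(void)
    {
        /* Any registers used here (beyond the banked r13/r14) are pushed to
         * the IRQ-mode stack in the compiler-generated prologue and popped in
         * the epilogue; that save/restore is part of the interrupt latency. */
        TIMER_IRQ_ACK = 1;   /* acknowledge the (hypothetical) interrupt source */
    }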
None of this has much to do with the OS. The OS can choose to use or not use these features. The OS can choose to make guarantees regarding how long it will take to execute the OSes concept of an interrupt handler, or it may not. This is one of the basic concepts of an RTOS, the contract about how long before the handler will run.
The OS is designed for some purpose (and that purpose may be 'general') - that target design goal will have a lot more effect on latency than how many targets the OS has been ported to.
Go have a read about something like FreeRTOS, then buy some hardware and try it. Annotate the code to figure out the latencies you really want to look at (a rough sketch of one way to do that follows below). It will likely be the best way to get your head around it.
(Multi-CPU systems do the same, but with some synchronization and barrier functions and a sprinkling of complexity.)
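For the annotation part, one common approach on a Cortex-M class part running FreeRTOS is to timestamp with the DWT cycle counter in the ISR and again in the task it wakes. A rough sketch, assuming CMSIS and FreeRTOS headers are available and with a made-up interrupt source:

    /* Sketch: measure IRQ-to-task latency on a Cortex-M running FreeRTOS,
     * using the DWT cycle counter as a free-running timestamp source.
     * Assumes the CMSIS device header and FreeRTOS are set up elsewhere;
     * the ISR name and the IRQ wiring are hypothetical. */
    #include "FreeRTOS.h"
    #include "task.h"
    #include "semphr.h"

    static SemaphoreHandle_t evt;
    static volatile uint32_t t_isr;           /* cycle count taken in the ISR   */

    void EXTI0_IRQHandler(void)               /* hypothetical interrupt source  */
    {
        BaseType_t woken = pdFALSE;
        t_isr = DWT->CYCCNT;                  /* timestamp as early as possible */
        xSemaphoreGiveFromISR(evt, &woken);
        portYIELD_FROM_ISR(woken);
    }

    static void latency_task(void *arg)
    {
        (void)arg;
        for (;;) {
            xSemaphoreTake(evt, portMAX_DELAY);
            uint32_t latency_cycles = DWT->CYCCNT - t_isr;
            /* record latency_cycles; divide by the core clock to get seconds */
            (void)latency_cycles;
        }
    }

    void latency_setup(void)
    {
        CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable DWT         */
        DWT->CYCCNT = 0;
        DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start the counter  */
        evt = xSemaphoreCreateBinary();
        xTaskCreate(latency_task, "lat", configMINIMAL_STACK_SIZE + 128,
                    NULL, configMAX_PRIORITIES - 1, NULL);
    }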

What's a good system test for keeping a deadline?

Reading about RTOS, the characteristic of a "hard" RTOS is that it can keep a deadline deterministically, but how do we test or prove that the system actually fulfils the requirements?
The MicroC/OS II RTOS is characterized as a hard RTOS, but how can I validate that claim? If I have some C code and an ISR for my FPGA that can run C programs and perform context switches between threads with semaphores, similar to what an RTOS does, how can I know whether the OS/RTOS is a "hard" or "soft" RTOS?
Does it depend on the application? Must it have a timer, so that using the built-in hardware timer (e.g. the Altera DE2 has a 50 MHz oscillator) with hardware interrupts is preferred, and then we just test whether threads and processes can be scheduled according to a deadline and check whether the deadline was met?
Or is there some general practice about what must be included to distinguish between an operating system, a real-time operating system, and a hard vs. soft RTOS?
Is there some "typical test" with a typical requirement for the label "hard RTOS"?
It is hard to answer this question, because your premise is wrong.
A system classified as hard realtime is distinguished from a soft realtime system only through the severity of a missed deadline. In hard RT, a missed deadline is classified as a system failure, which may or may not cause harm to hardware and people, while soft realtime usually means that a missed deadline only degrades system performance, but does not bring it to a grinding halt.
A typical example of a hard RT system would be a watchdog that shuts down a system on overheating - if it fails to meet its deadline, the system breaks. General safety-related systems in power plants or airplanes also fall into this category.
A soft RT example would be video streaming, where a missed deadline causes degraded visual quality or stuttering, but does not necessarily cause a failure of the system.
Long story short, hard and soft RT are characteristics of complete software systems, measured by their specifications and fault models. So typically, it is the application running on the operating system that fits the hard/soft RT criteria; the OS merely provides interfaces with predictable timing behaviour that allow the application to make timing assumptions.
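If you still want numbers, a common (if limited) exercise is to run a periodic task for a long time and record the worst observed lateness against its deadline. On Linux that could look roughly like the sketch below; bear in mind this only produces measured evidence, not the analytical worst-case (WCET-style) guarantee that a hard RT claim really rests on.

    /* Sketch: run a 1 ms periodic loop and record the worst observed lateness.
     * Measured jitter is evidence, not proof, of meeting deadlines. */
    #define _POSIX_C_SOURCE 200112L
    #include <stdio.h>
    #include <time.h>

    #define PERIOD_NS  1000000L            /* 1 ms period */
    #define ITERATIONS 100000

    int main(void)
    {
        struct timespec next, now;
        long worst_ns = 0;

        clock_gettime(CLOCK_MONOTONIC, &next);

        for (int i = 0; i < ITERATIONS; i++) {
            /* advance the absolute deadline by one period */
            next.tv_nsec += PERIOD_NS;
            if (next.tv_nsec >= 1000000000L) {
                next.tv_nsec -= 1000000000L;
                next.tv_sec  += 1;
            }

            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            clock_gettime(CLOCK_MONOTONIC, &now);

            long late_ns = (now.tv_sec - next.tv_sec) * 1000000000L
                         + (now.tv_nsec - next.tv_nsec);
            if (late_ns > worst_ns)
                worst_ns = late_ns;
        }

        printf("worst observed lateness: %ld ns over %d cycles\n",
               worst_ns, ITERATIONS);
        return 0;
    }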

Can you check performance of a program running with Qemu Simulator?

Say I am running an ARM simulator using QEMU: is it possible to find the execution time of a program as it would be on the real ARM processor? In other words, if I use functions such as gettimeofday in a program running on the simulator to check the elapsed time, will the elapsed time be given accurately through cycle-accurate simulation?
Investigation into this issue at our company concluded that QEMU (for ARM) is not cycle accurate. If I remember correctly, cycle accuracy is not a goal of QEMU; instead it aims at fast emulation. Beware also that exact timing depends on quite unpredictable things like cache hits and misses. It will also depend on the actual architecture chosen. Note that ARM is merely an instruction set IP and several different implementations exist. If in addition an operating system is emulated, things get even more unpredictable.
We use the simulator from ARM to evaluate performance, but even that one is not fully cycle accurate for the latest versions of the ARM architecture.
GEM5
I have seen a researcher use gem5 for this. This paper evaluates how accurate it is. And I've created an easy-to-get-started setup on GitHub.
As Bryan mentioned, QEMU is designed for speed: only valid x86 API behavior must be reached, not necessarily with the right number of cycles or in the same pipeline order. This is also called functional emulation.
Furthermore, DRAM memory accesses are assumed to be immediate, and therefore it makes no sense to emulate caches either. And as we know, current CPUs are basically memory latency hiding machines.
Cycle accurate emulators on the other hand, also emulate CPU internals, and are therefore way slower.
The root of the problem is of course the under-documented performance features of processors, which vendors don't release to prevent intellectual property leakage.
GEM5 appears to implement a generic version of common CPU internals, so it should be more cycle accurate than functional emulators, but true cycle accurate emulation is likely impossible without insider knowledge.
Third party emulation implementors must then reverse engineer CPU performance from experiments and existing documentation.
Some of the key "internals" are cache, pipeline and branch prediction.
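A quick way to see how much the cache alone matters: the two loops below perform the same number of loads over the same buffer, yet the strided one is typically several times slower on real hardware because it pulls in a whole cache line for every element it reads. This data- and layout-dependent behaviour is exactly what a functional emulator does not model.

    /* Sketch: same number of loads, very different wall-clock time, purely
     * because of cache-line behaviour. Timed with the OS clock, so run it a
     * few times on an otherwise idle machine. */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)                         /* 16M ints, ~64 MiB */

    static double seconds(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        int *buf = malloc((size_t)N * sizeof *buf);
        if (!buf) return 1;
        for (long i = 0; i < N; i++)
            buf[i] = (int)i;                    /* touch every page up front */

        struct timespec t0, t1;
        volatile long sum = 0;

        /* sequential: 16 ints read per cache line fetched */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            sum += buf[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("sequential: %.3f s\n", seconds(t0, t1));

        /* strided: one int read per cache line fetched, same number of loads */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long s = 0; s < 16; s++)
            for (long i = s; i < N; i += 16)
                sum += buf[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("strided:    %.3f s\n", seconds(t0, t1));

        free(buf);
        return 0;
    }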
Related:
Question that asks how cycle accurate emulators are possible at all: How can CAS simulators like PTLsim achieve cycle accurate simulation of x86 hardware?
ARM Cycle-Accurate Simulator

How does the Linux OS schedule threads when there are multiple sockets?

For example, in a dual-socket system with 2 quad-core processors, does the thread scheduler try to keep the threads from the same process on the same processor? Because interleaving threads of different processes across different processors would slow down performance in the case where threads in a process have a lot of shared memory accesses.
It depends.
On current Intel platforms the BIOS default seems to be that memory is interleaved between the sockets in the system, page by page. Allocate 1Mbyte and half will be on one socket, half on the other. That means that wherever your threads are they have equal access to the data.
This makes it very simple for OSes - anywhere will do.
This can work against you. The SMP hardware environment presented to the OS is synthesised by the CPUs cooperating over QPI. If there are a lot of threads all accessing the same data then those links can get really busy. If they're too busy then that limits the performance, and you're I/O bound. That's where I am; Z80 cores with Intel's memory subsystem design would be just as quick as the Nehalem cores I've actually got (OK, I might be exaggerating...).
At the end of the day the real problem is that memory just isn't quick enough. Intel and AMD have both done some impressive things with memory recently, but we're still hampered by its slowness. Ideally memory would be quick enough so that all cores had clock rate access times to it. The Cell processor sort of did this - each SPE has a bit of SRAM instead of a cache, and once you get your head round them you can make them really sing.
===EDIT===
There is more to it. As Basile Starynkevitch hints, the alternative approach is to embrace NUMA.
NUMA is what modern CPUs actually embody, the memory access being non-uniform because the memory on the other CPU sockets is not accessible directly by addressing a bus. Instead the CPUs make a request for data over the QPI link (or HyperTransport in AMD's case) to ask the other CPU to fetch data out of its memory and send it back. Because the CPU is doing all this for you in hardware it ends up looking like a conventional SMP environment. And QPI / HyperTransport are very fast, so most of the time it's plenty quick enough.
If you write your code so as to mirror the architecture of the hardware you can in theory make improvements. So this might involve (for example) having two copies of your data in the system, one on each socket. There are memory affinity routines in Linux to specifically allocate memory that way instead of interleaving it across all sockets. There are also CPU affinity routines that allow you to control which CPU core a thread is running on, the idea being you run it on a core that is close to the data buffer it will be processing.
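For illustration, the calls involved look roughly like this (libnuma for memory placement, sched_setaffinity for pinning; the core and node numbers are made up, on a real system you would query the topology first):

    /* Sketch: pin the calling thread to one core and put its working buffer
     * on the NUMA node local to that core. Needs libnuma (link with -lnuma).
     * CPU 0 / node 0 are illustrative choices. */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        /* Pin this thread to CPU 0 (assumed to sit on NUMA node 0). */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* Allocate 64 MiB on node 0 so the pinned thread gets local accesses. */
        size_t len = 64UL << 20;
        char *buf = numa_alloc_onnode(len, 0);
        if (!buf) {
            fprintf(stderr, "numa_alloc_onnode failed\n");
            return 1;
        }

        /* ... do the work on buf from the pinned thread ... */

        numa_free(buf, len);
        return 0;
    }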
OK, so that might mean a lot of investment in the source code to make that work for you (especially if that data duplication doesn't fit well with the program's flow), but if the QPI has become a problematic bottleneck it's the only thing you can do.
I've fiddled with this to some extent. In a way it's a right faff. The whole mindset of Intel and AMD (and thus the OSes and libraries too) is to give you an SMP environment which, most of the time, is pretty good. However they let you play with NUMA by having a load of library functions you have to call to get the deployment of threads and memory that you want.
However, for the edge cases where you want that little bit of extra speed it'd be easier if the architecture and OS were rigidly NUMA, no SMP at all. Just like the Cell processor, in fact. Easier, not because it'd be simple to write (in fact it would be harder), but because if you got it running at all you'd then know for sure that it was as quick as the hardware could ever possibly achieve. With the faked SMP that we have right now you can experiment with NUMA, but you're mostly left wondering if it's as fast as it possibly could be. It's not like the libraries tell you that you're accessing memory that is actually resident on another socket; they just let you do it with no hint that there's room for improvement.

Resources