Effects of debugger on embedded application - c

I'm currently using IAR to debug an STM32F0 micro, and I've noticed some interesting effects on my device's functionality when attempting to debug while the unit is running.
It seems to me that the debugger is slowing down or inhibiting the application in some way, meaning some of the more time-critical sections of the code are having trouble executing correctly.
What is the debugger doing to the code in order to allow me to look at the registers/variables/memory/position in source code, and how does this affect execution time?
Note: optimisations are already turned off, as they tend to stuff up IAR's ability to step through code correctly and cause it to sometimes miss breakpoints.

The ARM Cortex-M0 CoreSight on-chip debug unit used on the STM32F0xx is non-intrusive for normal execution. Hardware breakpoints match instruction-fetch addresses in real time. If your debugger supports updating memory content and variables while running (rather than only at a breakpoint), that may conceivably have an effect, but on the STM32F2xx I have not seen any issues even with very time-critical code with microsecond-scale deadlines (the F2 is however a Cortex-M3, not an M0).
Applying conditional breakpoints will slow execution considerably if the breakpoint location is executed frequently, since the processor must be stopped and the condition tested by the host.
A common problem encountered when debugging that may catch the unwary is that when the processor stops on a breakpoint, the on-chip peripherals and timers continue to run, often resulting in several interrupts pending when the processor is restarted, often with undesirable effects depending on your application's ability to handle such abnormal timing gracefully. The DBGMCU peripheral supports selectively stopping specific timers and peripherals while the core is halted, and supports low-power modes without disconnecting the debugger. You may need to use these features to improve your debug experience.
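A minimal sketch of using DBGMCU for this, assuming ST's stm32f0xx.h device header (the register and macro names vary between parts and header versions, so verify against yours; TIM2, for instance, is absent on some F0 devices):

    #include "stm32f0xx.h"   /* ST device header; names are assumptions */

    static void dbg_freeze_on_halt(void)
    {
        /* On the F0 the DBGMCU needs its clock enabled before its
           freeze registers can be written. */
        RCC->APB2ENR |= RCC_APB2ENR_DBGMCUEN;

        /* Keep the debug connection alive in low-power modes. */
        DBGMCU->CR |= DBGMCU_CR_DBG_STOP | DBGMCU_CR_DBG_STANDBY;

        /* Freeze these timers and the independent watchdog whenever the
           core is halted, so a breakpoint doesn't pile up timer
           interrupts or trip the watchdog. */
        DBGMCU->APB1FZ |= DBGMCU_APB1_FZ_DBG_TIM2_STOP
                        | DBGMCU_APB1_FZ_DBG_TIM3_STOP
                        | DBGMCU_APB1_FZ_DBG_IWDG_STOP;
    }

Consider guarding the call with a debug-build #ifdef, since keeping clocks running in the low-power modes costs power in release units too.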

What is the debugger doing to the code in order to allow me to look at the registers/variables/memory/position in source code, and how does this affect execution time?
It depends on your debugger. In general, debuggers slow the application down, whether it runs in an emulator or on the device itself, to a degree that depends on how intrusive the debug features in use are.
Breakpoints and single-stepping can have a huge impact on the application's execution timing because they are intrusive; how intrusive they are in turn depends on the vendor's implementation.
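For reference, a software breakpoint is nothing more than an instruction the debugger patches into the code stream; on a Cortex-M part you can even plant one yourself (GCC inline-assembly syntax shown; note the core takes a fault here if no debugger is attached):

    /* Halts the core at this point when a debugger is attached. */
    static inline void debug_break(void)
    {
        __asm volatile ("bkpt #0");
    }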
Normally a debugger reads the CPU registers and memory so they can be displayed to the user; this reading itself costs time and can therefore perturb execution.
Some debug techniques involve incorporating additional information or instrumentation into the application code, which increases the size of the application and can affect execution time. Such intrusive techniques can cause side effects purely through this additional code.

Related

Is it wrong to log the inner workings of an IoT device in case of failure?

I'm currently working on an IoT project and I want to log the execution of my software and hardware.
I want to log events and send them to a database so I can look at my device remotely if needed.
The work-in-progress IoT device has to be as minimal as possible, so having to write to a flash memory module very often seems questionable to me.
I know that it will run the Nucleus RTOS on a Cortex-M4 with some modules connected through SPI.
Can someone with more expertise enlighten me?
Thanks.
You will have to estimate your hourly/daily/whatever data volume that needs to go into the log and extrapolate to the expected lifetime of your product. Microcontroller flash usually isn't made for logging: it features neither enduring flash cells (usually some 10K-100K write cycles, compared to 1M or more for dedicated data-storage chips; look it up in the MCU's datasheet) nor wear leveling. Wear leveling is any method that prevents software from writing to the same physical cell too frequently (which would, e.g., be the directory area of a simple file system).
For your log you will have to devise a fairly clever or complex scheme to circumvent these flash-lifetime problems; a minimal sketch of the usual approach follows.
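The sketch below spreads erases round-robin over a reserved range of pages. The flash_erase_page()/flash_write() driver calls are hypothetical, and every address and size is an illustrative assumption to adapt to your part:

    #include <stdint.h>

    #define LOG_BASE     0x0800F000u  /* reserved, page-aligned (example) */
    #define PAGE_SIZE    1024u
    #define NUM_PAGES    4u
    #define RECORD_SIZE  16u          /* fixed-size, divides PAGE_SIZE    */

    extern void flash_erase_page(uint32_t addr);            /* hypothetical */
    extern void flash_write(uint32_t addr, const void *p, uint32_t len);

    static uint32_t next = LOG_BASE;

    void log_append(const uint8_t record[RECORD_SIZE])
    {
        /* Wrap around the reserved area so erases are spread evenly:
           each page takes only 1/NUM_PAGES of the total erase cycles. */
        if (next + RECORD_SIZE > LOG_BASE + NUM_PAGES * PAGE_SIZE)
            next = LOG_BASE;

        /* Erase a page only when re-entering it at its start. */
        if ((next % PAGE_SIZE) == 0u)
            flash_erase_page(next);

        flash_write(next, record, RECORD_SIZE);
        next += RECORD_SIZE;
    }

Real implementations also add sequence numbers or scan for the erased pattern at boot, so the newest record can be found again after a reset.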
But the problems don't stop there: usually the MCU isn't able to read from flash memory while writing to it, where "writing" means a prolonged sequence of instructions (several microseconds up to milliseconds, depending on the chip) controlling the internal flash state machine (programming voltage, saturation times, etc.) until the new values have reliably settled in the memory. And, maybe you guessed it, "reading" in this context also covers fetching instructions: you have to make sure that whatever code and interrupts may run during the flash write execute only from RAM, cache or other memories, and not from the normal instruction memory. It is doable, but the more complex the software system running above the hardware layer, the less likely it is to work reliably.
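A sketch of the usual workaround, assuming GCC attribute syntax and a ".ramfunc" section that your linker script actually places in RAM (the flash-controller pokes themselves are vendor-specific and only hinted at here):

    #include <stdint.h>

    __attribute__((section(".ramfunc"), noinline))
    void flash_program_word(volatile uint32_t *dest, uint32_t value)
    {
        __asm volatile ("cpsid i");  /* no interrupt may fetch from flash */

        /* vendor-specific: unlock the flash controller, set program mode */
        *dest = value;               /* the actual program operation      */
        /* vendor-specific: poll the controller's busy flag until settled */

        __asm volatile ("cpsie i");
    }

Any interrupt handlers that must stay live during the write need the same treatment (handler code, and anything it calls, in RAM).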

How to prevent system hang before watchdog timer task kicks in

We are using an ARM AM1808-based embedded system with an RTOS and a file system. We are using the C language. We have a watchdog timer implemented inside the application code, so whenever something goes wrong in the application code, the watchdog timer takes care of the system.
However, we are experiencing an issue where the system hangs before the watchdog timer task starts. The system hangs because the file-system code is badly written, with a great many while loops. Sometimes, due to a bad NAND (or at least the file-system code thinks it is bad), the code hangs in a while loop and never gets out of it. What we get is a dead board.
So, the point of giving all this information is to ask whether there is any mechanism that could be implemented in code that runs before the application code. Is there any hardware watchdog? What steps can be taken to make sure we don't get a dead board caused by some while loop?
Professional embedded systems are designed like this:
Pick a MCU with power-on-reset interrupt and on-chip watchdog. This is standard on all modern MCUs.
Implement the below steps from inside the reset interrupt vector.
If the MCU memory is simple to set up, such as just setting the stack pointer, then do so as the very first thing out of reset. This enables C programming. You can usually write the reset ISR in C as long as you don't declare any variables; disassemble to make sure that it doesn't touch any RAM addresses until those are available.
If the memory setup is complex - there is an MMU setup or similar - C code will have to wait and you'll have to stick to assembler to prevent accidental stack usage caused by C code.
Set up the most fundamental registers, such as mode/peripheral-routing registers, the watchdog and the system clock.
Set up the low-voltage-detect hardware, if applicable. Hopefully the out-of-reset state for LVD on the MCU is a sound one.
Application-specific, critical registers such as GPIO direction and internal pull-resistor registers should be set from here. Many MCUs have pins as inputs by default, making them vulnerable. If they are not meant to be inputs in the application, the time they are kept as such out of reset should be minimized, to avoid problems with noise, transients and ESD.
Set up the MMU, if applicable.
Everything else "CRT", such as initialization of .data and .bss.
Call main().
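A minimal sketch of such a reset handler for a Cortex-M part, where hardware loads the stack pointer from the vector table before the handler runs, so plain C is safe from the first line. The linker symbols and the two init routines are assumptions to match to your linker script and BSP; note they must not rely on .data/.bss, which aren't initialized yet:

    #include <stdint.h>

    extern uint32_t _sidata, _sdata, _edata, _sbss, _ebss; /* linker symbols */
    void Watchdog_Init(void);      /* assumed BSP routines; must not    */
    void SystemClock_Init(void);   /* depend on initialized statics     */
    int  main(void);

    void Reset_Handler(void)
    {
        Watchdog_Init();           /* watchdog first...                 */
        SystemClock_Init();        /* ...then full-speed clock, so the  */
                                   /* copy/zero loops below run fast    */

        /* the "CRT" part: copy .data from flash, zero .bss */
        const uint32_t *src = &_sidata;
        for (uint32_t *dst = &_sdata; dst < &_edata; )
            *dst++ = *src++;
        for (uint32_t *dst = &_sbss; dst < &_ebss; )
            *dst++ = 0u;

        main();
        for (;;) ;                 /* trap if main() ever returns */
    }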
Please note that pre-made startup code for your MCU is not necessarily made by professionals! It is fairly common for an amateur-level "CRT" to be delivered with your toolchain, one which fails to set up the watchdog and clock early on. This is of course unacceptable, since:
This makes any program running on that platform a notable safety/poor-quality hazard, should the "CRT" crash or hang for whatever reason.
This makes the initialization of .data and .bss needlessly, painfully slow, as it is then typically executed with the clock running on the default on-chip RC oscillator or similar.
Please note that even industry de facto startup code such as ARM CMSIS fails to do some of the MCU-specific hardware setups mentioned above. This may or may not be a problem.
There is a hardware watchdog that can run before the application does. The ARM AM1808 has a timer that can be configured as a watchdog, per the documentation: www.ti.com/lit/ds/symlink/am1808.pdf. You may wish to set it up that way at least during the part of the program that runs through the critical and long section. You may also wish to have a piece of boot code that first sets this watchdog and, after correct initialization, jumps to the application. In fact, this is a very common approach.
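A sketch of that boot flow, with hypothetical wdt_*() wrappers standing in for the AM1808 timer-in-watchdog-mode programming described in the TI documentation linked above:

    #include <stdbool.h>

    extern void wdt_enable(unsigned timeout_ms);  /* hypothetical wrappers */
    extern void wdt_kick(void);
    extern bool filesystem_init(void);            /* the hang-prone part   */
    extern void application_main(void);

    void boot_main(void)
    {
        wdt_enable(2000);    /* arm the watchdog before anything risky */

        /* If filesystem_init() spins forever on a bad NAND, the watchdog
           expires and resets the board instead of leaving it dead. */
        if (!filesystem_init()) {
            for (;;) ;       /* record the error if possible; await reset */
        }

        wdt_kick();
        application_main();  /* the application keeps kicking from here */
    }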

Can you check the performance of a program running in the QEMU simulator?

Say I am running an ARM simulation using QEMU: is it possible to find the execution time of a program as it would be on a real ARM processor? In other words, if I use functions such as gettimeofday in a program running on the simulator to check the elapsed time, will the elapsed time be reported accurately, as it would be by a cycle-accurate simulation?
Investigation into this issue at our company concluded that QEMU (for ARM) is not cycle accurate. If I remember correctly, cycle accuracy is not a goal of QEMU; instead it aims at fast emulation. Beware also that exact timing depends on quite unpredictable things like cache hits and misses. It will also depend on the actual architecture chosen; note that "ARM" is merely an instruction-set IP and several different implementations exist. If an operating system is emulated on top, things get even more unpredictable.
We use the simulator from ARM to evaluate performance, but even that one is not fully cycle accurate for the latest versions of the ARM architecture.
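To see the problem concretely: a plain POSIX timing harness like the one below, run under QEMU, reports how fast the host emulates the loop, not what the loop would cost on real ARM silicon (nothing QEMU-specific is assumed here):

    #include <stdio.h>
    #include <time.h>

    static volatile unsigned sink;   /* keeps the loop from being elided */

    int main(void)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        for (unsigned i = 0; i < 100000000u; ++i)
            sink += i;               /* workload under test */

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (double)(t1.tv_sec - t0.tv_sec)
                 + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("elapsed: %.3f s (host-dependent under QEMU)\n", s);
        return 0;
    }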
GEM5
I have seen a researcher use gem5 for this. This paper evaluates how accurate it is. And I've created an easy to get started setup on GitHub.
As Bryan mentioned, QEMU is designed for speed: only architecturally valid behavior must be reproduced, not necessarily with the right number of cycles or in the same pipeline order. This is also called functional emulation.
Furthermore, DRAM memory accesses are assumed to be immediate, and therefore it makes no sense to emulate caches either. And as we know, current CPUs are basically memory latency hiding machines.
Cycle accurate emulators on the other hand, also emulate CPU internals, and are therefore way slower.
The root of the problem is of course the under-documented performance features of processors, which vendors don't release in order to prevent intellectual-property leakage.
GEM5 appears to implement a generic version of common CPU internals, so it should be more cycle accurate than functional emulators, but true cycle accurate emulation is likely impossible without insider knowledge.
Third party emulation implementors must then reverse engineer CPU performance from experiments and existing documentation.
Some of the key "internals" are cache, pipeline and branch prediction.
Related:
Question that asks how cycle accurate emulators are possible at all: How can CAS simulators like PTLsim achieve cycle accurate simulation of x86 hardware?
ARM Cycle-Accurate Simulator

To what extent are interrupts supported in Win32?

To what extent are interrupts supported in Win32 beyond processor definitions? For example, x86 machines define at least 18 interrupts, including traps such as the breakpoint trap (INT 3). The remaining vectors up through 255 are left open by Intel as software-defined interrupts. Are any of these used by Windows/WinAPI, or are they just open and free for applications to use as they please? If Windows uses them, where can I find the relevant documentation? I looked on MSDN and could not find anything.
(BTW I am doing compiler, debugger and other system-level programming, so please don't lecture me on your opinions about the advisability of using interrupts in the first place.)
In Win32 apps there's probably just one interrupt in common use, int 2Eh. It's used as the system-call entry point (on older systems; modern Windows uses the sysenter/syscall instructions where available). It's analogous to int 21h in DOS. The rest of the interrupts aren't used by apps.
Apps, however, can handle some CPU exceptions (and debug breaks) via Structured Exception Handling (SEH)/Vectored Exception Handling (VEH). Windows catches CPU exceptions originating in apps and reflects them back into the apps, if and however possible (Windows is not perfect in imitating the CPU exception model).
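A minimal sketch of the SEH side of this (MSVC-specific __try/__except; build with cl.exe). The null write raises an access violation that Windows reflects back into the app:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        int *volatile p = NULL;  /* volatile so the faulting write stays */

        __try {
            *p = 42;             /* raises EXCEPTION_ACCESS_VIOLATION */
        }
        __except (GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION
                      ? EXCEPTION_EXECUTE_HANDLER
                      : EXCEPTION_CONTINUE_SEARCH) {
            printf("access violation caught in user mode\n");
        }
        return 0;
    }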
Windows uses device interrupts internally and does not let apps mess with them. The x86 CPU handles interrupts in the most privileged mode, where the kernel runs.
Nowadays many device interrupts aren't associated with fixed interrupt vectors; they are configurable, and you need to work with buses like PCI to query or change the settings.
If you want to work with devices and interrupts directly, you need to write a kernel-mode driver for Windows. There's the Driver Development Kit (DDK, nowadays the Windows Driver Kit) and books like Windows Internals that can get you started.
Still, if you're looking for specifics of device XYZ and its interrupt programming, you aren't going to find everything or much on MSDN or in the DDK because you'll need hardware-specific information, something that's outside of Microsoft's control. The kernel provides the functionality necessary to do I/O and handle interrupts, but it's ultimately up to device drivers to use them one way or the other.

Hardware Performance Counters using C [duplicate]

I'd like to use hardware performance counters, specifically those of x86 CPUs, to obtain cache-miss or branch-misprediction counts. Performance counters are heavily used in advanced profilers like Intel VTune. Please don't confuse these with the performance counters exposed by the Windows operating system.
In order to use these counters in C/C++ program, one may use PAPI: http://icl.cs.utk.edu/papi/
This allows you to use performance counters easily, but only on Linux. PAPI once supported Windows, but no longer does.
Is there anyone who recently tried PAPI or other APIs to use hardware performance counters on Windows?
You can use the RDPMC instruction or the __readpmc MSVC compiler intrinsic, which amounts to the same thing.
However, Windows prohibits user-mode applications from executing this instruction by setting CR4.PCE to 0. Presumably this is done because the meaning of each counter is determined by MSR registers, which are only accessible in kernel mode. In other words, unless you're in a kernel-mode module (e.g. a device driver), you are going to get a "privileged instruction" trap if you attempt to execute this instruction.
If you're writing a user-mode application, your only option (as @Christopher mentioned in the comments) is to write a kernel module which executes this instruction for you (you'll incur the user-to-kernel call penalty) and to enable test signing on your machine so your presumably self-signed "driver" can be loaded. This means you can't easily distribute the app, but it will work for in-house tuning.
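For completeness, the kernel-mode read itself is a one-liner with the intrinsic; everything around it (programming the event-select MSRs, being in ring 0 at all) is the hard part:

    #include <intrin.h>

    /* Driver-side sketch: counter 0 must already have been programmed
       via the PERFEVTSEL/MSR machinery; in user mode this traps because
       CR4.PCE is 0 on Windows. */
    unsigned __int64 read_pmc0(void)
    {
        return __readpmc(0);     /* executes RDPMC with ECX = 0 */
    }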
What about this HCP Reference? Does it not provide what you want?
